How much should you be earning as a Data Engineer?

Data engineers can increase their comp by relocating or learning new tools - we show which are most likely to help

I recently stumbled across this post on Reddit.

Reddit Discussion on Data Engineering Salaries

As I was scrolling down the comments, I noticed most included salaries in pretty much the same format, as suggested by the post.

Reddit Comment on Data Engineering Salaries

Such structured, and well labeled data! Not only that, there were threads every quarter for the last 5 quarters.

I couldn't resist a mini data analysis project: I wanted to find what influences how much data engineers earn:

Approach to the project

As I saw it, I would need to do the following:

  1. Extract the data from Reddit
  2. Get the data into a table to make analysis easier
  3. Clean the data pretty thoroughly
  4. Analyze and present the results in a easily digestible way

In this post I'll run through the steps of the project, and finish with the final results. The full code, and final dataset is on Github.

1. Extract the data

Reddit has an API. I didn't use it.

I haven't used it before and wasn't sure if they'd even be able to give me the data I wanted. But since the data is in the post, I knew it must be being delivered to the browser somewhere.

Firing up Chrome's trusty devtools and looking at the network tab, I hunted for a request that looked like it would have the data in. After a bit of a trawl I found it.


{
  "account": {...},
  "authorFlair": {...},
  "commentLists": {...},
  "comments": {
    "t1_iaut7cg": {
      ...
      "media": {
        "richtextContent": {
          "document": [
            {"c": [
                {"c": [{"c": [{"e": "text","t": "Data Engineer"}],"e": "par"}],"e": "li"},
                {"c": [{"c": [{"e": "text","t": "1 year. Coming from analyst roles though."}],"e": "par"}],"e": "li"},
                {"c": [{"c": [{"e": "text","t": "Utah.  Remote"}],"e": "par"}],"e": "li"},
                {"c": [{"c": [{"e": "text","t": "95k USD"}],"e": "par"}],"e": "li"},
                {"c": [{"c": [{"e": "text","t": "25k bonus,  options."}],"e": "par"}],"e": "li"},
                {"c": [{"c": [{"e": "text","t": "Health care"}],"e": "par"}],"e": "li"},
                {"c": [{"c": [{"e": "text","t": "Python,  AWS."}],"e": "par"}],"e": "li"}
              ],
              "e": "list",
              "o": true
           }]
        },
        "type": "rtjson",
        "rteMode": "richtext"
      }
    },
    {// LOTS MORE COMMENTS...
    }
  }
}

There was one of these files for each of the 5 threads - I saved them into 5 files.

2. Get the data into a table

I'm not particularly good at extracting data from json files. But for me I knew this was going to be a Python job. I booted up a Jupyter notebook.

I noticed a few things about the structure of the data:

  1. Generally only "top level" comments had salary data. Threads under comments were normally asking the commenter for more information. This made the task easier, as I could discard the other comments.
  2. Most commenters stuck to the numbered list format. However some added a line of text before they started their list. And some used a list, but did not add numbers. This meant I would need a couple of different approaches to getting the data.
  3. Comments with less than 4 lines rarely contained salary data. Because the post was asking for 4-7 data points, those with fewer lines were normally comments without salary data. I excluded these.

I also included the date the thread was created as a column.

After a bit of trial and error I got to this. Done in 40 lines of Python.


import json

# run it for each of the 5 Reddit threads
dates=['2021-06-01','2021-09-01','2021-12-01','2022-03-01','2022-06-01']

# loop through threads and extract data from comments
array = []
for date in dates:
    with open('posts_'+date+'.json', 'r') as f:
        data = json.load(f)
        comment_no = 0
        for key in data:
            if key == "comments":
                for comment in data[key]:
                    row=[]
                    row.append(date) # add the date of the Reddit thread
                    for i in range(0,7): # get the first 7 rows of the comment text
                        try: # if the data is in a list ie 1. 2. 3. 
                            value=data[key][comment]['media']["richtextContent"]["document"][0]['c'][i]['c'][0]['c'][0]['t']
                            row.append(value)
                        except: 
                            try: # if the data is in a list, but has a non-list sentence first. (Posters often add a preamble)
                                value=data[key][comment]['media']["richtextContent"]["document"][1]['c'][i]['c'][0]['c'][0]['t']
                                row.append(value)
                            except:
                                try: # this works if the data is not in a list
                                    value=data[key][comment]['media']["richtextContent"]["document"][i]['c'][0]['t']
                                    row.append(value)
                                except:
                                    pass
                    if len(row)>3: # remove results with less than 4 lines - normally  comments with no salary data
                        array.append(row)
                    comment_no += 1

# put it in a pandas DataFrame, export to csv
import pandas as pd
df=pd.DataFrame(array)
df.to_csv('salary_data.csv', index=False)

The final lines put it into a pandas DataFrame (pandas is the de-facto standard for manipulating tables in Python). It also makes it simple to export to csv.

After all this, I have the following data:

Date Title Experience Location Salary Benefits Industry Tech Stack
2022-06-011)Current title: Data Engineer2)Years of experience (YOE): 23)Location: DMV4)Base salary: $100,000 US5)Bonuses: Year End 10% Bonus6)Industry: Media7)Tech stack: AWS, Python, Airflow
2022-06-01Application Developer II3 YOEMidwest83K USDBonus 6% ~ 5KCan’t say industry without doxxing Tech stack: talend denodo Qlik replicateRole I’m interviewing for rn: Data engineer/ 110K/ Midwest/ working with cloud, spark, python, airflow, and many more/ logistics supply chain company.
2022-06-01BI Analyst2y Developer, 6 months Data AnalystBaltic States1800EUR/month5% Quater BonusHealthcareMySQL, Python
2022-06-01Data Analyst3 yearsAlberta$90k CADNegligible bonus, 2% match RRSP--
2022-06-01Data Engineer1 year. Coming from analyst roles though.Utah. Remote95k USD25k bonus, options.Health carePython, AWS.
2022-06-01Data Engineer3 yearsTexas$140kESOPFinanceSQL, Python, Azure (Databricks/Data Factory)
2022-06-01Data Engineer1 in role, 2 months in DS internship about 6 months before startingVirginia (hcol)102k USDPaltry bonus- 10% 401k contrib, full healthcare premium with no out of pocket, tuition coverage for continuing educationGovernmentPython, sql, unfortunately no cloud resources lol
2022-06-01Data Engineer8 Years as a BI Dev 2 Years Data EngineerLondon UK80,000 GBP before TaxUpto 15% BonusAsset ManagementSQL, Azure Data Factory, Python (just a little bit)
2022-06-01Data Engineer3 YOEManila, Philippines780k PHP or 15.6k USD65k PHP or 1.3k USDPython, SQL, RDS, DMS, KINESIS, LAMBDA , S3, GLUE, ATHENA, REDSHIFT,EC2, Airflow, Git for version control.-
2022-06-01Data Engineer10 yearsLondon£120k20% + equityCommunicationPresto, Spark, Airflow
No Results

335 rows of tagged salary data!

It's a great start, but not usable yet. There are some rows without useful data, the formats are a real mix, and the currency data is very inconsistent. Time for…

3. Data Cleaning

Perhaps unsurprisingly, this was the most time consuming part of the project. I opened a new Jupyter notebook to look at the data.

At this point I want to mention the tool that made this literally 5x faster: GitHub Copilot. It's an AI tool that helps you write code. It's difficult to describe how uncannily good it is: If you write code, try it.

GitHub Copilot
Wow. Copilot's suggestions saved me a lot of time.

Data cleaning steps

Overall this was a pretty gnarly exercise, but in essence:

  1. For erroneous rows, I removed them using a spreadsheet as it was easier to spot them than in a notebook.
  2. For all columns, I converted text to lowercase, removed punctuation and any contents in brackets (normally unhelpful for analysis).
  3. For categorical data columns, (title, industry, location) I aimed to reduce the number of distinct groups that had the same label. For example, if two rows contained the titles "Sr. data engineer ii" and "Senior DE 2" I would want these to be in the same group.
  4. For continuous data columns, (salary, experience) I tried to extract the numbers correctly. For salary specifically I also looked for currency symbols and codes, so that I could convert salaries to USD.

After this, it wasn't perfect but it had 297 rows of very usable data. You can view & download it below.

Day Title Yr Exp Country Salary ($) Industry Tech Stack
2022-06-01application developer ii3.00us$83ktechrole i’m interviewing for rn: data engineer/ 110k/ midwest/ working with cloud, spark, python, airflow, and many more/ logistics supply chain company.
2022-06-01bi analyst0.17baltic states$23khealthcaremysql, python
2022-06-01data analyst3.00canada$69k(blank)-
2022-06-01data engineer1.00remote$95khealthcarepython, aws.
2022-06-01data engineer3.00us$140kfinancesql, python, azure
2022-06-01data engineer2.00us$100kmediatech stack: aws, python, airflow
2022-06-01data engineer9.00us$190ktechpresto,spark, proprietary pipeline tool and proprietary dash boarding tool.
2022-06-01data engineer1.00us$28kinsuranceazure stack
2022-06-01data engineer0.08us$102kgovernmentpython, sql, unfortunately no cloud resources lol
2022-06-01data engineer8.00uk$98kasset managementsql, azure data factory, python
No Results

4. Presenting the data

I used Evidence's charting capabilities to present the data. Difficult not to use your own tool when you work at a data viz company. I stuck the above data into Snowflake to make it easier to query.

4.1 Who posted salary data?

Let's start by understanding the posters.

4.2 Average Data Engineer Salaries & Experience

For continuous fields, I used a histogram.

The $10k range with most salaries in is $60-70k, but the centre of the distribution appears to be around $90-100k.

The median data engineering salary is $100k. Dragged up by a few high values in the dataset, the mean data engineering salary is $105k.

The average data engineer in the dataset has 3.93 years of experience, with the majority of posts from those with less than 5 years of experience:

Having understood the data, I can answer my original questions.

4.3 Should you relocate for a salary increase?

Moving to the US, or landing a remote role could net a $40k+ salary boost, but there might be practical considerations (visas, work permits, etc).

4.4 How should you expect your salary to change with experience?

Plotting salary vs years of experience, it seems the two are correlated.

This is easier to see if we group up by years of experience, and take the median (the size of the bubble represents the number of responses).

There is typically a steep salary trajectory for those with less than 5 years of experience: Salaries rise from $73k in Y1 to $135k by Y5. After that it seems to flatten off (though the number of datapoints is also getting sparse).

4.5 Are some industries more lucrative?

This is more interesting. And perhaps unexpected. There's a lot of variation between industries. I expected "tech" to have the highest salaries, but in this sample telco roles have the highest median salary (with 8 responses).

We're by no means hitting the sample size for statistical significance here, but it seems like if you are working in e.g. healthcare, an industry switch could get you a higher salary.

4.6 Are there tools or technologies you should learn to land a higher salary job?

Not all of these are equivalent, so it's not as if you should invest time "learning" Snowflake.

But in our dataset the highest paid data engineers are using Apache Spark and Airflow, so perhaps if these aren't familiar to you yet, investing time to learn these could get you a higher salary.

Wrapping up

So there you have it. If you want a salary increase: learn Spark, and look for a remote job at a telco company. Or don't make decisions entirely based on what people post on Reddit. Your choice.

As a future build, I'd like to set this up to check for new posts and keep the data up to date - but I'll have to leave that for another day!

Other questions you want to answer with this data? Any different ways you'd have approached this?

Jump into the comments on Reddit, or send me an email at archie@evidence.dev

Powered by