How much should you be earning as a Data Engineer?

Data engineers can increase their comp by relocating or learning new tools - we show which are most likely to help

Archie Wood Archie Wood  


I recently stumbled across this post on Reddit.

As I was scrolling down the comments, I noticed most included salaries in pretty much the same format, as suggested by the post.

Such structured, and well labeled data! Not only that, there were threads every quarter for the last 5 quarters.

I couldn't resist a mini data analysis project: I wanted to find what influences how much data engineers earn:

  • Should you relocate for a salary bump?
  • How much does pay typically increase with experience?
  • Are some industries more lucrative than others?
  • Are there tools you should learn to land a higher salary job?

Approach to the project

As I saw it, I would need to do the following:

  1. Extract the data from Reddit
  2. Get the data into a table to make analysis easier
  3. Clean the data pretty thoroughly
  4. Analyze and present the results in a easily digestible way

In this post I'll run through the steps of the project, and finish with the final results. The full code, and final dataset is on Github.

1. Extract the data

Reddit has an API. I didn't use it.

I haven't used it before and wasn't sure if they'd even be able to give me the data I wanted. But since the data is in the post, I knew it must be being delivered to the browser somewhere.

Firing up Chrome's trusty devtools and looking at the network tab, I hunted for a request that looked like it would have the data in. After a bit of a trawl I found it.

Unfortunately websites typically don't load data in nice tabular formats like CSV, but as .json packets. The data looked like this:

(Hiding some non-useful data with "…" for readability)

{
  "account": {...},
  "authorFlair": {...},
  "commentLists": {...},
  "comments": {
    "t1_iaut7cg": {
      ...
      "media": {
        "richtextContent": {
          "document": [
            {"c": [
                {"c": [{"c": [{"e": "text","t": "Data Engineer"}],"e": "par"}],"e": "li"},
                {"c": [{"c": [{"e": "text","t": "1 year. Coming from analyst roles though."}],"e": "par"}],"e": "li"},
                {"c": [{"c": [{"e": "text","t": "Utah.  Remote"}],"e": "par"}],"e": "li"},
                {"c": [{"c": [{"e": "text","t": "95k USD"}],"e": "par"}],"e": "li"},
                {"c": [{"c": [{"e": "text","t": "25k bonus,  options."}],"e": "par"}],"e": "li"},
                {"c": [{"c": [{"e": "text","t": "Health care"}],"e": "par"}],"e": "li"},
                {"c": [{"c": [{"e": "text","t": "Python,  AWS."}],"e": "par"}],"e": "li"}
              ],
              "e": "list",
              "o": true
           }]
        },
        "type": "rtjson",
        "rteMode": "richtext"
      }
    },
    {// LOTS MORE COMMENTS...
    }
  }
}

There was one of these files for each of the 5 threads - I saved them into 5 files.

2. Get the data into a table

I'm not particularly good at extracting data from json files. But for me I knew this was going to be a Python job. I booted up a Jupyter notebook.

I noticed a few things about the structure of the data:

  1. Generally only "top level" comments had salary data. Threads under comments were normally asking the commenter for more information. This made the task easier, as I could discard the other comments.
  2. Most commenters stuck to the numbered list format. However some added a line of text before they started their list. And some used a list, but did not add numbers. This meant I would need a couple of different approaches to getting the data.
  3. Comments with less than 4 lines rarely contained salary data. Because the post was asking for 4-7 data points, those with fewer lines were normally comments without salary data. I excluded these.

I also included the date the thread was created as a column.

After a bit of trial and error I got to this. Done in 40 lines of Python.

import json

# run it for each of the 5 Reddit threads
dates=['2021-06-01','2021-09-01','2021-12-01','2022-03-01','2022-06-01']

# loop through threads and extract data from comments
array = []
for date in dates:
    with open('posts_'+date+'.json', 'r') as f:
        data = json.load(f)
        comment_no = 0
        for key in data:
            if key == "comments":
                for comment in data[key]:
                    row=[]
                    row.append(date) # add the date of the Reddit thread
                    for i in range(0,7): # get the first 7 rows of the comment text
                        try: # if the data is in a list ie 1. 2. 3. 
                            value=data[key][comment]['media']["richtextContent"]["document"][0]['c'][i]['c'][0]['c'][0]['t']
                            row.append(value)
                        except: 
                            try: # if the data is in a list, but has a non-list sentence first. (Posters often add a preamble)
                                value=data[key][comment]['media']["richtextContent"]["document"][1]['c'][i]['c'][0]['c'][0]['t']
                                row.append(value)
                            except:
                                try: # this works if the data is not in a list
                                    value=data[key][comment]['media']["richtextContent"]["document"][i]['c'][0]['t']
                                    row.append(value)
                                except:
                                    pass
                    if len(row)>3: # remove results with less than 4 lines - normally  comments with no salary data
                        array.append(row)
                    comment_no += 1

# put it in a pandas DataFrame, export to csv
import pandas as pd
df=pd.DataFrame(array)
df.to_csv('salary_data.csv', index=False)

The final lines put it into a pandas DataFrame (pandas is the de-facto standard for manipulating tables in Python). It also makes it simple to export to csv.

After all this, I have the following data:

date title experience location salary benefits industry tech_stack
1
2022-06-01
1)Current title: Data Engineer
2)Years of experience (YOE): 2
3)Location: DMV
4)Base salary: $100,000 US
5)Bonuses: Year End 10% Bonus
6)Industry: Media
7)Tech stack: AWS, Python, Airflow
2
2022-06-01
Application Developer II
3 YOE
Midwest
83K USD
Bonus 6% ~ 5K
Can’t say industry without doxxing Tech stack: talend denodo Qlik replicate
Role I’m interviewing for rn: Data engineer/ 110K/ Midwest/ working with cloud, spark, python, airflow, and many more/ logistics supply chain company.
3
2022-06-01
BI Analyst
2y Developer, 6 months Data Analyst
Baltic States
1800EUR/month
5% Quater Bonus
Healthcare
MySQL, Python
4
2022-06-01
Data Analyst
3 years
Alberta
$90k CAD
Negligible bonus, 2% match RRSP
Ø
Ø
5
2022-06-01
Data Engineer
1 year. Coming from analyst roles though.
Utah. Remote
95k USD
25k bonus, options.
Health care
Python, AWS.
6
2022-06-01
Data Engineer
3 years
Texas
$140k
ESOP
Finance
SQL, Python, Azure (Databricks/Data Factory)
7
2022-06-01
Data Engineer
1 in role, 2 months in DS internship about 6 months before starting
Virginia (hcol)
102k USD
Paltry bonus- 10% 401k contrib, full healthcare premium with no out of pocket, tuition coverage for continuing education
Government
Python, sql, unfortunately no cloud resources lol
8
2022-06-01
Data Engineer
8 Years as a BI Dev 2 Years Data Engineer
London UK
80,000 GBP before Tax
Upto 15% Bonus
Asset Management
SQL, Azure Data Factory, Python (just a little bit)
9
2022-06-01
Data Engineer
3 YOE
Manila, Philippines
780k PHP or 15.6k USD
65k PHP or 1.3k USD
Python, SQL, RDS, DMS, KINESIS, LAMBDA , S3, GLUE, ATHENA, REDSHIFT,EC2, Airflow, Git for version control.
Ø
10
2022-06-01
Data Engineer
10 years
London
£120k
20% + equity
Communication
Presto, Spark, Airflow
Download

335 rows of tagged salary data!

It's a great start, but not usable yet. There are some rows without useful data, the formats are a real mix, and the currency data is very inconsistent. Time for…

3. Data Cleaning

Perhaps unsurprisingly, this was the most time consuming part of the project. I opened a new Jupyter notebook to look at the data.

At this point I want to mention the tool that made this literally 5x faster: GitHub Copilot. It's an AI tool that helps you write code. It's difficult to describe how uncannily good it is: If you write code, try it.

GitHub Copilot
Wow. Copilot's suggestions saved me a lot of time.

Data cleaning steps

Overall this was a pretty gnarly exercise, but in essence:

  1. For erroneous rows, I removed them using a spreadsheet as it was easier to spot them than in a notebook.
  2. For all columns, I converted text to lowercase, removed punctuation and any contents in brackets (normally unhelpful for analysis).
  3. For categorical data columns, (title, industry, location) I aimed to reduce the number of distinct groups that had the same label. For example, if two rows contained the titles "Sr. data engineer ii" and "Senior DE 2" I would want these to be in the same group.
  4. For continuous data columns, (salary, experience) I tried to extract the numbers correctly. For salary specifically I also looked for currency symbols and codes, so that I could convert salaries to USD.

After this, it wasn't perfect but it had 297 rows of very usable data. You can view & download it below.

day title yr_exp country salary_usd industry tech_stack
1
2022-06-01
application developer ii
3
us
83,000
tech
role i’m interviewing for rn: data engineer/ 110k/ midwest/ working with cloud, spark, python, airflow, and many more/ logistics supply chain company.
2
2022-06-01
bi analyst
0
baltic states
22,725
healthcare
mysql, python
3
2022-06-01
data analyst
3
canada
69,318
(blank)
Ø
4
2022-06-01
data engineer
1
remote
95,000
healthcare
python, aws.
5
2022-06-01
data engineer
3
us
140,000
finance
sql, python, azure
6
2022-06-01
data engineer
2
us
100,000
media
tech stack: aws, python, airflow
7
2022-06-01
data engineer
9
us
190,000
tech
presto,spark, proprietary pipeline tool and proprietary dash boarding tool.
8
2022-06-01
data engineer
1
us
27,775
insurance
azure stack
9
2022-06-01
data engineer
0
us
102,000
government
python, sql, unfortunately no cloud resources lol
10
2022-06-01
data engineer
8
uk
98,001
asset management
sql, azure data factory, python
Download

4. Presenting the data

I used Evidence's charting capabilities to present the data. Difficult not to use your own tool when you work at a data viz company. I stuck the above data into Snowflake to make it easier to query.

4.1 Who posted salary data?

Let's start by understanding the posters.

Download
Download
Download

4.2 Average Data Engineer Salaries & Experience

For continuous fields, I used a histogram.

The $10k range with most salaries in is $60-70k, but the centre of the distribution appears to be around $90-100k.

The median data engineering salary is $100k. Dragged up by a few high values in the dataset, the mean data engineering salary is $105k.

Download

The average data engineer in the dataset has 3.93 years of experience, with the majority of posts from those with less than 5 years of experience:

Download

Having understood the data, I can answer my original questions.

4.3 Should you relocate for a salary increase?

Download

Moving to the US, or landing a remote role could net a $40k+ salary boost, but there might be practical considerations (visas, work permits, etc).

4.4 How should you expect your salary to change with experience?

Plotting salary vs years of experience, it seems the two are correlated.

Download

This is easier to see if we group up by years of experience, and take the median (the size of the bubble represents the number of responses).

Download

There is typically a steep salary trajectory for those with less than 5 years of experience: Salaries rise from $72,595 in Y1 to $135,000 by Y5. After that it seems to flatten off (though the number of datapoints is also getting sparse).

4.5 Are some industries more lucrative?

Download

This is more interesting. And perhaps unexpected. There's a lot of variation between industries. I expected "tech" to have the highest salaries, but in this sample telco roles have the highest median salary (with 8 responses).

We're by no means hitting the sample size for statistical significance here, but it seems like if you are working in e.g. healthcare, an industry switch could get you a higher salary.

4.6 Are there tools or technologies you should learn to land a higher salary job?

Download

Not all of these are equivalent, so it's not as if you should invest time "learning" Snowflake.

But in our dataset the highest paid data engineers are using Apache Spark and Airflow, so perhaps if these aren't familiar to you yet, investing time to learn these could get you a higher salary.

Wrapping up

So there you have it. If you want a salary increase: learn Spark, and look for a remote job at a telco company. Or don't make decisions entirely based on what people post on Reddit. Your choice.

As a future build, I'd like to set this up to check for new posts and keep the data up to date - but I'll have to leave that for another day!

Other questions you want to answer with this data? Any different ways you'd have approached this?

Jump into the comments on Reddit, or send me an email at archie@evidence.dev

Powered by