How much should you be earning as a Data Engineer?
Data engineers can increase their comp by relocating or learning new tools - we show which are most likely to help
I recently stumbled across this post on Reddit.
As I was scrolling down the comments, I noticed most included salaries in pretty much the same format, as suggested by the post.
Such structured, and well labeled data! Not only that, there were threads every quarter for the last 5 quarters.
I couldn't resist a mini data analysis project: I wanted to find what influences how much data engineers earn:
- Should you relocate for a salary bump?
- How much does pay typically increase with experience?
- Are some industries more lucrative than others?
- Are there tools you should learn to land a higher salary job?
Approach to the project
As I saw it, I would need to do the following:
- Extract the data from Reddit
- Get the data into a table to make analysis easier
- Clean the data pretty thoroughly
- Analyze and present the results in a easily digestible way
In this post I'll run through the steps of the project, and finish with the final results. The full code, and final dataset is on Github.
1. Extract the data
Reddit has an API. I didn't use it.
I haven't used it before and wasn't sure if they'd even be able to give me the data I wanted. But since the data is in the post, I knew it must be being delivered to the browser somewhere.
Firing up Chrome's trusty devtools and looking at the network tab, I hunted for a request that looked like it would have the data in. After a bit of a trawl I found it.
{
"account": {...},
"authorFlair": {...},
"commentLists": {...},
"comments": {
"t1_iaut7cg": {
...
"media": {
"richtextContent": {
"document": [
{"c": [
{"c": [{"c": [{"e": "text","t": "Data Engineer"}],"e": "par"}],"e": "li"},
{"c": [{"c": [{"e": "text","t": "1 year. Coming from analyst roles though."}],"e": "par"}],"e": "li"},
{"c": [{"c": [{"e": "text","t": "Utah. Remote"}],"e": "par"}],"e": "li"},
{"c": [{"c": [{"e": "text","t": "95k USD"}],"e": "par"}],"e": "li"},
{"c": [{"c": [{"e": "text","t": "25k bonus, options."}],"e": "par"}],"e": "li"},
{"c": [{"c": [{"e": "text","t": "Health care"}],"e": "par"}],"e": "li"},
{"c": [{"c": [{"e": "text","t": "Python, AWS."}],"e": "par"}],"e": "li"}
],
"e": "list",
"o": true
}]
},
"type": "rtjson",
"rteMode": "richtext"
}
},
{// LOTS MORE COMMENTS...
}
}
}
There was one of these files for each of the 5 threads - I saved them into 5 files.
2. Get the data into a table
I'm not particularly good at extracting data from json files. But for me I knew this was going to be a Python job. I booted up a Jupyter notebook.
I noticed a few things about the structure of the data:
- Generally only "top level" comments had salary data. Threads under comments were normally asking the commenter for more information. This made the task easier, as I could discard the other comments.
- Most commenters stuck to the numbered list format. However some added a line of text before they started their list. And some used a list, but did not add numbers. This meant I would need a couple of different approaches to getting the data.
- Comments with less than 4 lines rarely contained salary data. Because the post was asking for 4-7 data points, those with fewer lines were normally comments without salary data. I excluded these.
I also included the date the thread was created as a column.
After a bit of trial and error I got to this. Done in 40 lines of Python.
import json
# run it for each of the 5 Reddit threads
dates=['2021-06-01','2021-09-01','2021-12-01','2022-03-01','2022-06-01']
# loop through threads and extract data from comments
array = []
for date in dates:
with open('posts_'+date+'.json', 'r') as f:
data = json.load(f)
comment_no = 0
for key in data:
if key == "comments":
for comment in data[key]:
row=[]
row.append(date) # add the date of the Reddit thread
for i in range(0,7): # get the first 7 rows of the comment text
try: # if the data is in a list ie 1. 2. 3.
value=data[key][comment]['media']["richtextContent"]["document"][0]['c'][i]['c'][0]['c'][0]['t']
row.append(value)
except:
try: # if the data is in a list, but has a non-list sentence first. (Posters often add a preamble)
value=data[key][comment]['media']["richtextContent"]["document"][1]['c'][i]['c'][0]['c'][0]['t']
row.append(value)
except:
try: # this works if the data is not in a list
value=data[key][comment]['media']["richtextContent"]["document"][i]['c'][0]['t']
row.append(value)
except:
pass
if len(row)>3: # remove results with less than 4 lines - normally comments with no salary data
array.append(row)
comment_no += 1
# put it in a pandas DataFrame, export to csv
import pandas as pd
df=pd.DataFrame(array)
df.to_csv('salary_data.csv', index=False)
The final lines put it into a pandas DataFrame (pandas is the de-facto standard for manipulating tables in Python). It also makes it simple to export to csv.
After all this, I have the following data:
Date | Title | Experience | Location | Salary | Benefits | Industry | Tech Stack |
---|---|---|---|---|---|---|---|
2022-06-01 | 1)Current title: Data Engineer | 2)Years of experience (YOE): 2 | 3)Location: DMV | 4)Base salary: $100,000 US | 5)Bonuses: Year End 10% Bonus | 6)Industry: Media | 7)Tech stack: AWS, Python, Airflow |
2022-06-01 | Application Developer II | 3 YOE | Midwest | 83K USD | Bonus 6% ~ 5K | Can’t say industry without doxxing Tech stack: talend denodo Qlik replicate | Role I’m interviewing for rn: Data engineer/ 110K/ Midwest/ working with cloud, spark, python, airflow, and many more/ logistics supply chain company. |
2022-06-01 | BI Analyst | 2y Developer, 6 months Data Analyst | Baltic States | 1800EUR/month | 5% Quater Bonus | Healthcare | MySQL, Python |
2022-06-01 | Data Analyst | 3 years | Alberta | $90k CAD | Negligible bonus, 2% match RRSP | - | - |
2022-06-01 | Data Engineer | 1 year. Coming from analyst roles though. | Utah. Remote | 95k USD | 25k bonus, options. | Health care | Python, AWS. |
2022-06-01 | Data Engineer | 3 years | Texas | $140k | ESOP | Finance | SQL, Python, Azure (Databricks/Data Factory) |
2022-06-01 | Data Engineer | 1 in role, 2 months in DS internship about 6 months before starting | Virginia (hcol) | 102k USD | Paltry bonus- 10% 401k contrib, full healthcare premium with no out of pocket, tuition coverage for continuing education | Government | Python, sql, unfortunately no cloud resources lol |
2022-06-01 | Data Engineer | 8 Years as a BI Dev 2 Years Data Engineer | London UK | 80,000 GBP before Tax | Upto 15% Bonus | Asset Management | SQL, Azure Data Factory, Python (just a little bit) |
2022-06-01 | Data Engineer | 3 YOE | Manila, Philippines | 780k PHP or 15.6k USD | 65k PHP or 1.3k USD | Python, SQL, RDS, DMS, KINESIS, LAMBDA , S3, GLUE, ATHENA, REDSHIFT,EC2, Airflow, Git for version control. | - |
2022-06-01 | Data Engineer | 10 years | London | £120k | 20% + equity | Communication | Presto, Spark, Airflow |
335 rows of tagged salary data!
It's a great start, but not usable yet. There are some rows without useful data, the formats are a real mix, and the currency data is very inconsistent. Time for…
3. Data Cleaning
Perhaps unsurprisingly, this was the most time consuming part of the project. I opened a new Jupyter notebook to look at the data.
At this point I want to mention the tool that made this literally 5x faster: GitHub Copilot. It's an AI tool that helps you write code. It's difficult to describe how uncannily good it is: If you write code, try it.

Data cleaning steps
Overall this was a pretty gnarly exercise, but in essence:
- For erroneous rows, I removed them using a spreadsheet as it was easier to spot them than in a notebook.
- For all columns, I converted text to lowercase, removed punctuation and any contents in brackets (normally unhelpful for analysis).
- For categorical data columns, (title, industry, location) I aimed to reduce the number of distinct groups that had the same label. For example, if two rows contained the titles "Sr. data engineer ii" and "Senior DE 2" I would want these to be in the same group.
- For continuous data columns, (salary, experience) I tried to extract the numbers correctly. For salary specifically I also looked for currency symbols and codes, so that I could convert salaries to USD.
After this, it wasn't perfect but it had 297 rows of very usable data. You can view & download it below.
Day | Title | Yr Exp | Country | Salary ($) | Industry | Tech Stack |
---|---|---|---|---|---|---|
2022-06-01 | application developer ii | 3.00 | us | $83k | tech | role i’m interviewing for rn: data engineer/ 110k/ midwest/ working with cloud, spark, python, airflow, and many more/ logistics supply chain company. |
2022-06-01 | bi analyst | 0.17 | baltic states | $23k | healthcare | mysql, python |
2022-06-01 | data analyst | 3.00 | canada | $69k | (blank) | - |
2022-06-01 | data engineer | 1.00 | remote | $95k | healthcare | python, aws. |
2022-06-01 | data engineer | 3.00 | us | $140k | finance | sql, python, azure |
2022-06-01 | data engineer | 2.00 | us | $100k | media | tech stack: aws, python, airflow |
2022-06-01 | data engineer | 9.00 | us | $190k | tech | presto,spark, proprietary pipeline tool and proprietary dash boarding tool. |
2022-06-01 | data engineer | 1.00 | us | $28k | insurance | azure stack |
2022-06-01 | data engineer | 0.08 | us | $102k | government | python, sql, unfortunately no cloud resources lol |
2022-06-01 | data engineer | 8.00 | uk | $98k | asset management | sql, azure data factory, python |
4. Presenting the data
I used Evidence's charting capabilities to present the data. Difficult not to use your own tool when you work at a data viz company. I stuck the above data into Snowflake to make it easier to query.
4.1 Who posted salary data?
Let's start by understanding the posters.
4.2 Average Data Engineer Salaries & Experience
For continuous fields, I used a histogram.
The $10k range with most salaries in is $60-70k, but the centre of the distribution appears to be around $90-100k.
The median data engineering salary is $100k. Dragged up by a few high values in the dataset, the mean data engineering salary is $105k.
The average data engineer in the dataset has 3.93 years of experience, with the majority of posts from those with less than 5 years of experience:
Having understood the data, I can answer my original questions.
4.3 Should you relocate for a salary increase?
Moving to the US, or landing a remote role could net a $40k+ salary boost, but there might be practical considerations (visas, work permits, etc).
4.4 How should you expect your salary to change with experience?
Plotting salary vs years of experience, it seems the two are correlated.
This is easier to see if we group up by years of experience, and take the median (the size of the bubble represents the number of responses).
There is typically a steep salary trajectory for those with less than 5 years of experience: Salaries rise from $73k in Y1 to $135k by Y5. After that it seems to flatten off (though the number of datapoints is also getting sparse).
4.5 Are some industries more lucrative?
This is more interesting. And perhaps unexpected. There's a lot of variation between industries. I expected "tech" to have the highest salaries, but in this sample telco roles have the highest median salary (with 8 responses).
We're by no means hitting the sample size for statistical significance here, but it seems like if you are working in e.g. healthcare, an industry switch could get you a higher salary.
4.6 Are there tools or technologies you should learn to land a higher salary job?
Not all of these are equivalent, so it's not as if you should invest time "learning" Snowflake.
But in our dataset the highest paid data engineers are using Apache Spark and Airflow, so perhaps if these aren't familiar to you yet, investing time to learn these could get you a higher salary.
Wrapping up
So there you have it. If you want a salary increase: learn Spark, and look for a remote job at a telco company. Or don't make decisions entirely based on what people post on Reddit. Your choice.
As a future build, I'd like to set this up to check for new posts and keep the data up to date - but I'll have to leave that for another day!
Other questions you want to answer with this data? Any different ways you'd have approached this?
Jump into the comments on Reddit, or send me an email at archie@evidence.dev