Data Scientists: Are You Getting Paid Enough?

Data science-ing the way to higher pay with crowdsourced salary data

Fact: Theatrical makeup artists (at $124k) earn more than data scientists (at $109k) in the US (BLS). My take: A lot of data scientists are being underpaid.

I covered data engineering salaries before, but there is also a treasure trove of salary data on Reddit in r/DataScience, which I wanted to dig deeper into.

In 2019, 2020 and 2021 a post ran that looked like this:

Reddit data science salary thread

A typical comment looks like the one below, which follows a well-structured format:

Reddit data science salary comment

This meant I'd be able to scrape the comment data and use it to build a nice table.

At this point, I should say that in a community full of data scientists, I'm not the first person to have this idea, and there are at least three other posts analyzing the data. However, these scraped only a fraction of the total data, and I thought there was a lot more insight to be had if I got a bit creative.

Particularly I wanted to find out:

The rough process I followed was:

  1. Extract the data from the Reddit comments
  2. Parse the data into a table so that I could easily analyze it
  3. Clean the data and tag it
  4. Analyze the data and present results

1. Extracting the data with DevTools

I used Chrome's DevTools to find the requests that returned the comment data. It took a bit of searching, but eventually I found them.

Finding Requests in Browser
Bingo

Each request returned a data file in JSON format. For example:


{
    "account": null,
    "authorFlair": {...},
    "commentLists": {...},
    "comments": {
        "t1_ghe6iex": {
            ...
            "media": {
                "richtextContent": {
                    "document": [
                        {"c": [
                            {"c": [{"c": [{"e": "text","t": "Title: Data Scientist","f": [[1,0,6]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Tenure length: 3yrs","f": [[1,0,14]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Location: Houston","f": [[1,0,9]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Salary: $140,000","f": [[1,0,7]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Company/Industry: Oil and Gas","f": [[1,0,17]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Education: Masters in Applied Statistics","f": [[1,0,10]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Prior Experience: 2yrs of actuarial experience","f": [[1,0,17]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Relocation/Signing Bonus: $15,000 signing bonus","f": [[1,0,25]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Stock and/or recurring bonuses: 15-30% bonus(no bonus this year of course due to Covid)","f": [[1,0,31]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Total comp: $140,000","f": [[1,0,11]]}],"e": "par"}],"e": "li"}],"e": "list","o": false},
                            {"c": [{"e": "text","t": "I'm about to accept a new job that will be include a nice paycut (125K) just to get out of O&G.The industry is on a downturn and I think now is a good time move on.The premium pay is no longer worth the instability."}],"e": "par"}
                    ]
                },
                "type": "rtjson",
                "rteMode": "richtext"
            },
            "profileImage": "https://styles.redditmedia.com/t5_mb2hi/styles/profileIcon_snoo1ac41e44-c7ed-4194-9f09-48672b506ee0-headshot.png?width=256&height=256&crop=256:256,smart&s=0e6131dfcf0c758d2c28fb08b8dbae7ebf688161"
        },
        // & MANY MORE COMMENTS
    }
}

Reddit "lazy loads" - it doesn't show all the comments until you scroll down. So I scrolled until all the comments were loaded, and grabbed the data from all the requests. There were three requests with data per yearly thread: nine in all.

2. Parsing the data into a table with Python

Not every poster is kind enough to conform rigidly to the above format. Some didn't include all of the fields, or didn't break lines after each field:

This meant I needed a couple of different approaches to parse the data. So I opened a Jupyter notebook and wrote a few lines of Python to parse the JSON files.


import json
import pandas as pd

# run it for each post file (three saved responses per yearly thread)
dates = ['2021', '2020', '2019']
pages = ['1', '2', '3']
columns = ['date', 'page', 'title', 'tenure', 'location', 'salary', 'industry', 'education',
           'prior_experience', 'signing_bonus', 'stock_or_bonus', 'total_comp', 'extra_col']
# Optional fields, which otherwise disrupt the columns
skip_fields = ('Remote:', 'Internship:', 'Details:')
# Errors raised when a comment does not have the layout we are indexing into
lookup_errors = (KeyError, IndexError, TypeError)


def get_line(document, i):
    """Return line i of a comment, trying each known layout in turn."""
    try:
        # if the data is in a bulleted list, this works
        return document[0]['c'][i]['c'][0]['c'][0]['t']
    except lookup_errors:
        pass
    try:
        # if the data is in a list, but has a non-list sentence first (posters often add a preamble)
        return document[1]['c'][i]['c'][0]['c'][0]['t']
    except lookup_errors:
        pass
    try:
        # this works if the data is not in a list
        return document[i]['c'][0]['t']
    except lookup_errors:
        return None


array = []
for page in pages:
    for date in dates:
        with open(date + '_' + page + '_post.json', 'r') as f:
            data = json.load(f)
        for comment in data.get('comments', {}).values():
            try:
                document = comment['media']['richtextContent']['document']
            except lookup_errors:
                continue  # no rich-text body to parse
            row = [date, page]
            for i in range(0, 11):
                value = get_line(document, i)
                if value is not None and not any(field in value for field in skip_fields):
                    row.append(value)
            # keep rows with at least 6 entries (date, page and 4+ fields) - shorter ones
            # tend to be comments that do not contain salary data (which have 8-10 lines)
            if len(row) > 5:
                # pad short rows so that every row matches the column list
                array.append(row + [None] * (len(columns) - len(row)))

df = pd.DataFrame(array, columns=columns)
df.to_csv('salary_data.csv', index=False)
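
The indexing above depends on the exact nesting of each comment. A more layout-agnostic alternative, sketched here rather than taken from the original notebook, is to walk the rich-text document recursively and then split each line on its first colon. It assumes text nodes carry "e": "text" with their string under "t", and that children sit under "c", as in the example response; collect_text and split_fields are hypothetical helper names.

```python
def collect_text(node):
    """Recursively gather every text string ("t") under a rich-text node."""
    if isinstance(node, list):
        return [text for child in node for text in collect_text(child)]
    if isinstance(node, dict):
        if node.get("e") == "text":
            return [node.get("t", "")]
        return collect_text(node.get("c", []))
    return []  # plain strings, numbers, None: nothing to collect


def split_fields(lines):
    """Turn "Label: value" lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in lines:
        label, sep, value = line.partition(":")
        if sep:
            fields[label.strip().lower()] = value.strip()
    return fields


# Cut-down, hand-written stand-in for one comment's "document" list
document = [
    {"e": "par", "c": [{"e": "text", "t": "Sharing mine"}]},
    {"e": "list", "o": False, "c": [
        {"e": "li", "c": [{"e": "par", "c": [{"e": "text", "t": "Title: Data Scientist"}]}]},
        {"e": "li", "c": [{"e": "par", "c": [{"e": "text", "t": "Salary: $140,000"}]}]},
    ]},
]

lines = collect_text(document)
fields = split_fields(lines)
print(fields)
```

Keying on the label rather than the line position means missing or reordered fields no longer shift the columns. The trade-off is that any free-text sentence containing a colon would be picked up as a field, so the labels would still need filtering against a known list.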


After running this, I had a table of 311 rows of data. But it was a bit of a mess, with issues including:

On top of this, the columns mainly contain free-text. E.g. the salaries are in different formats, and different currencies.

Raw salary data

Date | Title | Tenure | Location | Salary | Industry | Education | Prior Experience | Signing Bonus | Stock Or Bonus | Total Comp
1905 | Title: Data Analyst | Tenure length: Accepting in a couple of days | Location: London, UK | Salary: 50k GBP | Company/Industry: FinTech | Education: BSc Maths with Stats | Prior Experience: 2 years Data Analyst | Bonus: Up to 15%, typically 10% apparently | - | -
1905 | Title: Senior Data Scientist | Tenure length: < 1 year at this position. Held 3 DS positions at 3 companies in 2 years. | Location: NYC / remote | Salary: 205k | Industry: Tech (FAANG-adjacent) | Education: BA Poli Sci | Prior Experience: 4 years Data Analyst, 2 years DS | Stock and/or recurring bonuses: $297k RSUs (publicly traded company) yearly | Total comp: $502k | -
1905 | Yeah sure. | I had been promoted from Data Analyst to entry-level (L3) DS, then again from L3 to L4, at company 1. That took place over a period of just over 2 years. I then went to company 2 (FAANG), which involved a promo to L5. I did not like this FAANG company, so I jumped to company 3, also at L5, but with significantly higher comp. | Company 1 to company 2 wasn't a super fast jump - 2 years - and it involved a level change and a jump in prestige, so that one didn't raise any eyebrows. | You will probably be able to guess what company 2 is from this, but let's just say it was a company with some prominent ethical issues playing out very, very publicly. Jumping from this company was an easy narrative to sell, as I was jumping because of those issues specifically. | I think the best way to summarize this history is in two points: | - | - | - | - | -
1905 | Title: Data Scientist, Analytics Intern | Location: New York City | Salary: $7700 per month | Company/Industry: FAANG | Education: Senior year in undergrad | Relocation/Signing Bonus: Free relocation, $300 to ship personal items, reimbursement for transportation and mental/physical health needs, health insurance, choice between corporate housing or stipend. | - | - | - | -
1905 | Title: Lead Data Scientist | Tenure length: 1.5 years | Location: São Paulo, Brazil | Salary: $55k USD (310k BRL) | Company/Industry: Tech/O&G/Mining/IoT/Other pre-IPO spinoff (we are an AI/MLE consultancy, most clients are in O&G or Mining). | Education: BS Geological Engineering, MS Mechanical Engineering | Prior Experience: 2.5 years as a DS in oil exploration between startups and a F500 O&G company. | Stock and/or recurring bonuses: No idea, I have equity but the company is less than a year old*. | - | -
1905 | Title: Analytics Engineering Manager | Tenure length: 1 year current role; 6 prior years along data analyst track, ending at Sr Data Analyst | Location: Pacific Northwest, USA (hybrid remote) | Salary: $150k | Company/Industry: SaaS | Education: BS Economics; BA Int’l Studies | Prior Experience: 4 years customer success | Relocation/Signing Bonus: None | Stock and/or recurring bonuses: 15% bonus; ~$70k annual RSUs. | Total comp: ~$240k
1905 | Title: Data Scientist | Tenure length: 4 years at company (1 has DS) | Location: Montreal | Salary: 95k$ (CAD) | Company/Industry: Oil and Gas | Education: Bachelor in mechanical engineering (almost done Msc in software) | Prior Experience: None | Relocation/Signing Bonus: N/A | Stock and/or recurring bonuses: 10% | -
1905 | Title: Data Scientist | Tenure length: 2 years | Location: SF/Bay Area | Salary: $187k + bonus | Company/Industry: Startup, tech. (I figure out and invent paths forward for new potentially impossible tech, so it's a bit different than standard business DS/DA type work.) | Education: None. I got in before the DS title was used in silicon valley. | Prior Experience: 11 years | $Coop: No. | Relocation/Signing Bonus: No, but they tend to do that here. | -
1905 | Title: Senior Data Scientist/Applied Scientist | Tenure length: Offer | Location: NYC | Salary: 175k | Company/Industry: E-commerce | Education: BS, MS in Math/Stats | Prior Experience: 3 YOE | Stock and/or recurring bonuses: 10% target bonus, 400k/4 years | Total comp: 292k | -
1905 | Title: VP of Data Science | Tenure length: 6 years: 1 @ VP, 2 @ director, 2 @ manager, 1 @ data scientist | Location: Boston Area. WFH optional. I go in 1-2 days/week. | Salary: $200k base, $40k bonus target | Company/Industry: Marketing agency, ~500 people | Education: PhD in STEM field. BA in Physics. | Prior Experience: Postdoc related to PhD, then Insight Data Science | Relocation/Signing Bonus: None | Stock: Equity bonus equivalent to about 10% of salary yearly | Total comp: ~$260k

3. Data cleaning

I re-used some of my code from cleaning data engineering salary posts, but for some of the columns I had to do some extra work. The aim of the cleaning was:

For most of the cleaning I used a rule-based approach.

E.g. if the salary contains "k", multiply by 1,000; if it contains "EUR", multiply by the EUR-USD FX rate; and so on.
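
As a rough illustration of what such rules could look like in code (a sketch, not the cleaning script used for this article): the function below takes the first number in the text, scales it if it is followed by "k", and converts using the first non-USD currency code it finds. The clean_salary name and the FX rates are placeholders chosen for the example.

```python
import re

# Placeholder FX rates to USD for illustration only; not the rates used in the analysis
FX_TO_USD = {"USD": 1.0, "EUR": 1.125, "GBP": 1.25, "CAD": 0.75}


def clean_salary(raw, fx=FX_TO_USD):
    """Parse a free-text salary into a USD amount, or None if it contains no number."""
    text = raw.upper().replace(",", "")
    match = re.search(r"(\d+(?:\.\d+)?)\s*(K?)", text)
    if match is None:
        return None
    amount = float(match.group(1))
    if match.group(2) == "K":
        amount *= 1000
    # The first non-USD currency code found wins; otherwise assume USD
    currency = next((code for code in fx if code != "USD" and code in text), "USD")
    return amount * fx[currency]


for raw in ["$140,000", "52.5k", "50k GBP", "95k$ (CAD)", "$55k USD (310k BRL)", "negotiable"]:
    print(raw, "->", clean_salary(raw))
```

Rules like these only read the first figure, so entries such as "$7700 per month" or "$200k base, $40k bonus target" would still need their own handling.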

However, there were two columns where categorizing was pretty hard: location, and industry. So I enlisted my friend AI.

3.1 Using OpenAI to clean location data

I began by using a rule-based approach to categorize countries, but it turns out almost all the data is from the US.

Instead I decided to compare the different regions in the US. But the raw data has a real mix of place hierarchies (cities, metro areas, regions, even cost-of-living labels), which makes a rule-based approach arduous.

Location
fully remote
lcol midwest city
southern, usa
lcol midwest
washington dc
atlanta
socal
west coast
karachi pakistan
washington, dc area

I'm not an ML engineer, so I wasn't going to write my own model. However, OpenAI has a classifier model (free account needed), which I used for this. It's pretty remarkable: you just pass it some text, and it autocompletes it for you.

I passed it the following:


The following is a list of places in the US

lcol midwest city
southern, usa
midwest
washington dc
atlanta
socal
karachi pakistan
west coast
....


The following is a list of regions they fit into:

Midwest
Northeast
Southeast
Southwest
West coast
Unspecified
Non-US


lcol midwest city - Midwest;

Input into OpenAI Classification model

You click the Submit button in the UI and voila, it autocompletes it for you based on the instructions you gave it:


lcol midwest city - Midwest;
southern, usa - Southeast;
midwest - Midwest;
washington dc - Northeast;
atlanta - Southeast;
socal - Southwest;
karachi pakistan - Non-US;
west coast - West coast;
...

Output from model

Pretty cool given how little we tell it about the data. Above you can see it correctly classifies Karachi, Pakistan as Non-US.

I then mapped the output back onto the original data.
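
A minimal sketch of that mapping step, assuming the completion text has been pasted into a string and the locations live in a column called location (both are assumptions for the example, as is the parse_regions helper name):

```python
import pandas as pd

# Completion text in the "place - Region;" format shown above
completion = """lcol midwest city - Midwest;
southern, usa - Southeast;
washington dc - Northeast;
karachi pakistan - Non-US;
..."""


def parse_regions(text):
    """Build a {place: region} dict from "place - Region;" lines."""
    mapping = {}
    for line in text.splitlines():
        place, sep, region = line.strip().rstrip(";").rpartition(" - ")
        if sep:  # lines without " - " (e.g. "...") are ignored
            mapping[place.strip().lower()] = region.strip()
    return mapping


region_by_place = parse_regions(completion)

df = pd.DataFrame({"location": ["Washington DC", "lcol midwest city", "the moon"]})
# Locations missing from the model output fall back to "Unspecified"
df["us_region"] = df["location"].str.lower().map(region_by_place).fillna("Unspecified")
print(df)
```

Lower-casing both sides keeps the join robust to capitalization, and the fallback makes unmapped locations easy to spot afterwards.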

After all the cleaning, it's not perfect, but it's pretty good:

Date | Title | Tenure | Tenure Clean | Location | US Region | Salary | Salary ($) | Industry | Industry Group | Education | Education Level
2019 | senior data scientist | 2.5 | 2.50 | denver metro | West coast | $151,000 | $151k | internet/web tech | Tech | ms | Master's
2019 | biostatistician data scientist | 6 months | 0.50 | miami/ft. lauderdale | Southeast | 52.5k | $53k | clinical research organization | Healthcare | b.s. statistics | Bachelor's
2019 | data scientist | 2yrs | 2.00 | houston | Unspecified | $136,000 | $136k | oil and gas | Oil, gas & mining | masters in applied statistics | Master's
2019 | data scientist | 1 year | 1.00 | midwest | Midwest | 83,000 | $83k | fortune 100 | Manufacturing | ms statistics | Master's
2019 | data scientist | 1.5 years | 1.50 | bay area | West coast | $155k | $155k | fb | Big Tech (FAANG) | phd in engineering | PhD
2019 | data scientist | 2.5 yrs | 2.50 | irving, tx | Southwest | 130k | $130k | entertainment | Other industry | msc psychology, msc data science | Master's
2019 | data scientist | < 1 yr | 1.00 | nyc | Northeast | 165k | $165k | healthtech, reinforcement learning | Healthcare | bachelor's | Bachelor's
2019 | senior data scientist | been here 1 month | - | la | West coast | 160k | $160k | tech | Tech | bs engineering/ ms in ds | Master's
2019 | data scientist | 3 months in current role, 2 years as data analyst | 0.25 | st. louis | Midwest | ~$95,000 | $95k | healthcare | Healthcare | masters | Master's
2019 | data scientist | starting spring 2020 | - | seattle | West coast | 118,000 | $118k | big tech company | Tech | bs in cs, data science. completing ms in cs | Master's

4. Analyze the data

This whole article is written using Evidence, including for the charts. It's a great alternative to BI tools for analyzing and presenting data when you want to add narrative (Disclosure: I work there).

4.1 Commenters are mostly highly educated and US-based

I started by exploring who our commenters were.

As may surprise no one, data scientists are pretty educated: Over 50% have either a Master's or a PhD.

Also, roughly 75% of the responses are from the US, with most from the West and Northeast.

4.2 Average data science salaries & experience

Histograms are generally a good fit for displaying continuous data.

The median data science salary is $115k. Dragged up by a few high values in the dataset, the mean salary is $120k.

The average data scientist in the dataset has 1.94 years of experience, with almost half of posts from those with 1 year of experience or less: This data set is a reasonably junior sample.

4.3 Data science salaries are increasing

I looked at the trend of median, 25th percentile and 75th percentile salaries over time.

The median, 25th percentile and 75th percentile salaries all increased between the 2019 and 2021 threads:

However, it is not a totally smooth trend (e.g. the 75th percentile in 2020 was lower than in 2019). Relatively small sample sizes might be causing noise here.
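
For readers who want to reproduce this kind of summary, here is a small pandas sketch. The salaries are made up purely for illustration (they are not the Reddit figures), and the column names date and salary_usd are assumptions. Including the group size next to the percentiles makes it easier to judge how much of a wobble could be noise.

```python
import pandas as pd

# Made-up salaries purely for illustration; these are not the Reddit figures
df = pd.DataFrame({
    "date": [2019] * 4 + [2020] * 4 + [2021] * 4,
    "salary_usd": [
        80_000, 100_000, 120_000, 140_000,
        90_000, 100_000, 110_000, 150_000,
        100_000, 120_000, 140_000, 160_000,
    ],
})

grouped = df.groupby("date")["salary_usd"]
# One row per year, one column per requested quantile
summary = grouped.quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["p25", "median", "p75"]
summary["n"] = grouped.size()  # sample size per year
print(summary)
```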

4.4 Gaining experience quickly boosts your comp

The most passive way to increase your salary would be to just keep working to gain experience. Let's look at how median salaries change with tenure:

In the first 5 years, salaries increase from $110k to $160k. After this, the sample size is much smaller, but it appears to flatten off.
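
A sketch of how tenure could be bucketed before taking medians, again with made-up numbers and assumed column names (tenure_years, salary_usd). The band edges are an arbitrary choice for the example, not the ones behind the chart.

```python
import pandas as pd

# Made-up values for illustration only
df = pd.DataFrame({
    "tenure_years": [0.5, 1.0, 1.5, 2.0, 3.0, 4.5, 6.0, 8.0],
    "salary_usd": [105_000, 110_000, 120_000, 125_000, 140_000, 160_000, 158_000, 162_000],
})

# Right-closed bands: (0, 1], (1, 2], (2, 3], (3, 5], (5, inf)
bins = [0, 1, 2, 3, 5, float("inf")]
labels = ["0-1", "1-2", "2-3", "3-5", "5+"]
df["tenure_band"] = pd.cut(df["tenure_years"], bins=bins, labels=labels)

# Median salary and sample size per band
by_band = df.groupby("tenure_band", observed=False)["salary_usd"].agg(["median", "count"])
print(by_band)
```

Reporting the count alongside the median shows which bands rest on only a handful of comments, which matters for the longer tenures here.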

4.5 Going back to school increases your salary

People often go back to college during an economic downturn, as there are fewer opportunities in the job market. But there's a debate about whether further degrees are really worth it. Is it really worth doing a Master's or PhD?

Note: While the "High School" salary is just below "Bachelor's", there were few (11) comments with below college education.

In data science at least, higher levels of education are correlated with higher salaries. Earning a Master's could net you +$15k salary on a Bachelor's, while upgrading a Master's to a PhD could be worth +$45k a year.

4.6 Relocating could get you a pay rise

Salaries across the US are different. Where is it most lucrative to work?

Salary by US region
Note: See Region definitions

Unsurprisingly, the West coast (which includes the Bay Area) is the region with the highest salaries.

However, even without moving to the West coast you can change your salary significantly by relocating. Those in the Southeast could get a $20k raise if they relocated to the Northeast, Southwest or Midwest.

4.7 Changing industry may get you a bigger paycheck

Another way to increase your salary is to change jobs. But what kind of company should you target?

Perhaps unsurprisingly, landing a job at FAANG is a good way to increase your salary. After that, O&G, Tech and Healthcare are all good bets for higher salaries.

If you are working in consulting, manufacturing, retail or logistics - you might be able to get a $20-30k boost by changing industry.

Wrapping up: Top tips for a higher salary

In summary, from the data that's been posted on Reddit:

I hope you found this useful! I certainly enjoyed exploring the data (cleaned version on GitHub). If there's anything else you'd like to see, let me know in the comments on Reddit!
