Data Scientists: Are You Getting Paid Enough?

Data science-ing the way to higher pay with crowdsourced salary data

Fact: Theatrical makeup artists (at $124k) earn more than data scientists (at $109k) in the US (BLS). My take: A lot of data scientists are being underpaid.

I covered data engineering salaries before, but there is also a treasure trove of salary data on Reddit in r/DataScience, which I wanted to dig deeper into.

In 2019, 2020 and 2021 a post ran that looked like this:

Reddit data science salary thread

And a typical comment looks like the below, which has a well structured data format:

Reddit data science salary comment

Which meant I’d be able to scrape the comment data and use it to build a nice table.

At this point, I should say that in a community full of data scientists, I’m not the first person to have this idea, and there are at least three other posts analyzing the data. However these scraped only a fraction of the total data, and I also thought there was a lot more insight to be had in the data if I got a bit creative.

Particularly I wanted to find out:

The rough process I followed was:

  1. Extract the data from the Reddit comments
  2. Parse the data into a table so that I could easily analyze it
  3. Clean the data and tag it
  4. Analyze the data and present results

1. Extracting the data with DevTools

I used Chrome’s DevTools to find the requests that sent back comment data. It took a bit of searching, but eventually I found it.

Finding Requests in Browser
Bingo

The requests sent back a data file in json format. For example:

{
    "account": null,
    "authorFlair": {...},
    "commentLists": {...},
    "comments": {
        "t1_ghe6iex": {
            ...
            "media": {
                "richtextContent": {
                    "document": [
                        {"c": [
                            {"c": [{"c": [{"e": "text","t": "Title: Data Scientist","f": [[1,0,6]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Tenure length: 3yrs","f": [[1,0,14]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Location: Houston","f": [[1,0,9]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Salary: $140,000","f": [[1,0,7]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Company/Industry: Oil and Gas","f": [[1,0,17]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Education: Masters in Applied Statistics","f": [[1,0,10]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Prior Experience: 2yrs of actuarial experience","f": [[1,0,17]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Relocation/Signing Bonus: $15,000 signing bonus","f": [[1,0,25]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Stock and/or recurring bonuses: 15-30% bonus(no bonus this year of course due to Covid)","f": [[1,0,31]]}],"e": "par"}],"e": "li"},
                            {"c": [{"c": [{"e": "text","t": "Total comp: $140,000","f": [[1,0,11]]}],"e": "par"}],"e": "li"}],"e": "list","o": false},
                            {"c": [{"e": "text","t": "I'm about to accept a new job that will be include a nice paycut (125K) just to get out of O&G.The industry is on a downturn and I think now is a good time move on.The premium pay is no longer worth the instability."}],"e": "par"}
                    ]
                },
                "type": "rtjson",
                "rteMode": "richtext"
            },
            "profileImage": "https://styles.redditmedia.com/t5_mb2hi/styles/profileIcon_snoo1ac41e44-c7ed-4194-9f09-48672b506ee0-headshot.png?width=256&height=256&crop=256:256,smart&s=0e6131dfcf0c758d2c28fb08b8dbae7ebf688161"
        },
        // & MANY MORE COMMENTS
    }
}

Reddit “lazy loads” - it doesn’t show all the comments until you scroll down. So I scrolled until all the comments were loaded, and grabbed the data from all the requests. There were three requests with data per yearly thread: nine in all.

2. Parsing the data into a table with Python

Not every poster is kind enough to conform rigidly to the above format. Some didn’t include all of the fields, or didn’t break lines after each field:

This meant I needed a couple of different approaches to parse the data. So I opened a Jupyter notebook, and wrote a few lines of python to parse the json files.

import json
import pandas as pd

# run it for each post file
dates=['2021','2020','2019']
pages=['1','2','3']
array = []

for page in pages:
    for date in dates:
        with open(date + '_' + page + '_post.json', 'r') as f:
            data = json.load(f)
            comment_no = 0
            for key in data:
                if key == "comments":
                    for comment in data[key]:
                        row=[]
                        row.append(date)
                        row.append(page)
                        for i in range(0,11):
                            # if the data is in a bulleted list, this works
                            try: 
                                value=data[key][comment]['media']["richtextContent"]["document"][0]['c'][i]['c'][0]['c'][0]['t']
                                # Strips out some optional fields, which otherwise disrupt the columns
                                if "Remote:" not in value:
                                    if "Internship:" not in value:
                                        if "Details:" not in value:
                                            row.append(value)
                            except:
                                # if the data is in a list, but has a non-list sentence first. (Posters often add a preamble)
                                try: 
                                    value=data[key][comment]['media']["richtextContent"]["document"][1]['c'][i]['c'][0]['c'][0]['t']
                                    if "Remote:" not in value:
                                        if "Internship:" not in value:
                                            if "Details:" not in value:
                                                row.append(value)

                                except:
                                    try: 
                                        # this works if the data is not in a list
                                        value=data[key][comment]['media']["richtextContent"]["document"][i]['c'][0]['t']
                                        if "Remote:" not in value:
                                            if "Internship:" not in value:
                                                if "Details:" not in value:
                                                    row.append(value)
                                    except:
                                        pass
                        # remove results with less than 6 lines - these tend to be comments that do not contain salary data (which have 8-10 lines)
                        if len(row)>5:
                            array.append(row)
                        comment_no += 1

df=pd.DataFrame(array)
df.columns=['date','page','title','tenure','location','salary','industry','education', 'prior_experience','signing_bonus','stock_or_bonus','total_comp', 'extra_col']
df.to_csv('salary_data.csv', index=False)
select
date as date_yyyy,
title,
tenure,
location,
salary,
industry,
education,
prior_experience,
signing_bonus,
stock_or_bonus,
total_comp
from ds_salary_data_raw
order by date desc

Having run this, I have a table of 311 rows of data. But it was a bit of a mess, with issues including:

On top of this, the columns mainly contain free-text. E.g. the salaries are in different formats, and different currencies.

Raw salary data

Raw salary data

3. Data cleaning

I re-used some of my code from cleaning data engineering salary posts, but for some of the columns I had to do some extra work. The aim of the cleaning was:

For most of the cleaning I used a pretty rule based approach.

E.g. if salary contains k, multiply by 1000, if salary contains EUR, multiply by the EUR-USD FX rate, etc.

However, there were two columns where categorizing was pretty hard: location, and industry. So I enlisted my friend AI.

3.1 Using OpenAI to clean location data

I began by using a rule-based approach to categorize countries, but it turns out almost all the data is from the US.

Instead I decided to compare the different regions in the US. But the raw data has a real mix of place hierarchies, which makes a rule based approach arduous.

select location
from ds_salary_data

Location data

I’m not a ML engineer, so I wasn’t going to write my own model. However, OpenAI has a classifier model(free account needed) I used for this. It’s pretty remarkable - you just pass it some text, and it autocompletes it for you.

I passed it the following:

The following is a list of places in the US

lcol midwest city
southern, usa
midwest
washington dc
atlanta
socal
karachi pakistan
west coast
....


The following is a list of regions they fit into:

Midwest
Northeast
Southeast
Southwest
West coast
Unspecified
Non-US


lcol midwest city - Midwest;
Input into OpenAI Classification model

You click the Submit button in the UI and voila, it autocompletes it for you based on the instructions you gave it:

lcol midwest city - Midwest;
southern, usa - Southeast;
midwest - Midwest;
washington dc - Northeast;
atlanta - Southeast;
socal - Southwest;
karachi pakistan - Non-US;
west coast - West coast;
...
Output from model

Pretty cool given how little we tell it about the data. Above you can see it correctly classifies Karachi, Pakistan as Non-US.

I then used the output to map into the original data.

After all the cleaning, it’s not perfect, but it’s pretty good:

select date, title, tenure, tenure_clean, location, us_region, salary, salary_usd, industry, industry_group, education, education_level
from ds_salary_data

order by date

Cleaned salary data

4. Analyze the data

This whole article is written using Evidence, including for the charts. It’s a great alternative to BI tools for analyzing and presenting data when you want to add narrative (Disclosure: I work there).

4.1 Commenters are mostly highly educated and US-based

I started by exploring who our commenters were.

with all_titles as(
select
title,
count(*) as responses,
ROW_NUMBER () OVER ( ORDER BY count(*) desc ) row_num
from ds_salary_data
group by title
order by responses desc)

select
'   Job Title' as x,
case
        when row_num <= 10 then title
        else 'all other' end as title_group,
sum(responses) as num_responses,
sum(row_num) as row_num
from all_titles
group by title_group
order by row_num
with all_ind as(
select
industry_group,
count(*) as responses,
median(salary_usd) as med_salary_usd0k,
ROW_NUMBER () OVER ( ORDER BY count(*) desc ) row_num
from ds_salary_data
group by industry_group
order by responses desc)

select
'   Industry' as x,
case
when row_num <= 10 then industry_group
else 'all other' end as industry_category,
round(sum(responses*med_salary_usd0k) /sum(responses)) as group_med_salary_usd0k,
sum(responses) as num_responses,
sum(row_num) as row_num
from all_ind
group by industry_category
order by row_num
select
'Education' as x,
case
when education_level is not null then education_level
else '(Not specified)' end as education_level,
count(*) as num_responses
from ds_salary_data
group by education_level
select
'US Region' as x,
case
when us_region ilike 'unspecified' then 'Unspecified US'
else us_region end as us_region,
count(*) as num_responses
from ds_salary_data
group by us_region

Demographics

As may surprise no one, data scientists are pretty educated: Over 50% have either a Master’s or a PhD.

Also, roughly 75% of the responses are from the US, with most from the West and Northeast.

4.2 Average data science salaries & experience

Histograms are generally a good fit for displaying continuous data.

SELECT
round(AVG(salary_usd)) as avg_usd0k,
round(median(salary_usd)) as med_usd0k
FROM ds_salary_data
select
  floor(salary_usd/10000.00)*10000 as bin_floor_usd0k,
  count(*) as responses
from ds_salary_data
group by bin_floor_usd0k
order by bin_floor_usd0k

Salary Distribution

The median data science salary is $115k. Dragged up by a few high values in the dataset, the mean salary is $120k.

select
avg(tenure_clean)
from ds_salary_data
select
--round up to nearest whole year
ceil(tenure_clean) as years_experience_rounded,
count(*)
from ds_salary_data


group by years_experience_rounded

The average data scientist in the dataset has 1.94 years of experience, with almost half of posts from those with 1 year of experience or less: This data set is a reasonably junior sample.

Experience Distribution

4.3 Data science salaries are increasing

I looked at the trend of median, 25th percentile and 75th percentile salaries over time.

select
date::string as year,
round(percentile_cont(0.25) within group (order by salary_usd)) as percentile_25_usd0k,
round(median(salary_usd)) as median_usd0k,
round(percentile_cont(0.75) within group (order by salary_usd)) as percentile_75_usd0k
from ds_salary_data
group by date
order by date

Salaries over time

The median, 25th and 75th percentiles salaries have all increased between the 2019 and 2021 threads:

However, it is not a totally smooth trend (e.g. the 75th percentile in 2020 was lower than in 2019). Relatively small sample sizes might be causing noise here.

4.4 Gaining experience quickly boosts your comp

The most passive way to increase your salary would be to just keep working to gain experience. Let’s look at how median salaries change with tenure:

select
median(salary_usd) as med_salary_usd0k,
ceil(tenure_clean) as years_experience_rounded,
count(*) as num_responses
from ds_salary_data
group by years_experience_rounded
order by years_experience_rounded

Salary by Experience

In the first 5 years, salaries increase from $110k to $160k. After this, the sample size is much smaller, but it appears to flatten off.

4.5 Going back to school increases your salary

People often go back to college during an economic downturn, as there are less opportunities in the job market. But there’s a debate about whether further degrees are really worth it. Is it really worth doing a Master’s or PhD?

select
median(salary_usd) as med_salary_usd0k,
education_level,
count(*) as num_responses
from ds_salary_data
where education_level is not null
group by education_level
order by med_salary_usd0k

Salary by Education

Note: While the "High School" salary is just below "Bachelor's", there were few (11) comments with below college education.

In data science at least, higher levels of education are correlated with higher salaries. Earning a Master’s could net you +$15k salary on a Bachelor’s, while upgrading a Master’s to a PhD could be worth +$45k a year.

4.6 Relocating could get you pay-rise

Salaries across the US are different. Where is it most lucrative to work?

select
median(salary_usd) as med_salary_usd0k,
us_region,
count(*) as num_responses
from ds_salary_data
group by us_region
order by med_salary_usd0k desc
Salary by US region
US Region Salaries
Note: See Region definitions

Unsurprisingly, the West (which includes the bay area) is the area with the highest salaries.

However, even without moving to the West coast you can change your salary significantly by relocating. Those in the Southeast could get a $20k raise if they relocated to the Northeast, Southwest or Midwest.

4.7 Changing industry may get you a bigger paycheck

Another way to increase your salary is to change jobs. But what kind of company should you target?

with all_ind as (
    select
    median(salary_usd) as med_salary_usd0k,
    industry_group,
    count(*) as responses,
    ROW_NUMBER () OVER ( ORDER BY  median(salary_usd) desc ) row_num
    from ds_salary_data
    group by industry_group)

select
    round(sum(responses*med_salary_usd0k) /sum(responses)) as med_salary_usd0k,
    case
        when industry_group ilike '%other%' then 'Other'
        when responses < 6 then 'Other'
        else industry_group
        end as industry,
    sum(responses) as num_responses,
    sum(row_num) as row_num
from all_ind
group by industry
order by row_num

Salary by Industry

Perhaps unsurprisingly, landing a job at FAANG is a good way to increase your salary. After that, O&G, Tech and Healthcare are all good bets for higher salaries.

If you are working in consulting, manufacturing, retail or logistics - you might be able to get a $20-30k boost by changing industry.

Wrapping up: Top tips for a higher salary

In summary, from the data that’s been posted on Reddit:

I hope you found this useful! I certainly enjoyed exploring the data (cleaned version on GitHub). If there’s anything else you’d like to see, let me know in the comments on Reddit!