Authors:
- Annie Wang
- Hans Baumberger
- Mason Lee
- Sileshi Hirpa
Predict the total compensation of a prospective employee from their professional background, the company and its location, and macroeconomic factors.
- Goals
  - Give prospective employees a reasonable expectation for compensation negotiation.
  - Give companies a benchmark for offering competitive compensation packages when recruiting.
The technical goal for our project is to maximize R^2 and minimize MSE.
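As a reminder of what these two metrics measure, here is a minimal sketch in plain Python. The helper names and the sample values are hypothetical and only illustrate the definitions; they are not taken from the project data.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals (lower is better)."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot (higher is better).

    The value is 1.0 for a perfect fit and becomes negative when the
    model does worse than always predicting the mean of y_true.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical compensation values, for illustration only.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE = {mse:.2f}, R^2 = {r2:.3f}")
```

Because MSE is expressed in squared target units while R^2 is unitless, reporting both makes it easier to compare models trained on the same target.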
Our project is based on datasets obtained from:
- Web scraping of Levels.fyi (a site that lets users compare career levels and compensation packages across companies), with permission from the site's administrators.
- Inflation rate from rateinflation.com
- Unemployment data from Data World.
The three datasets were merged after data cleaning and EDA.
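The merge step might look roughly like the following pandas sketch. The three small frames and their values are hypothetical stand-ins for the cleaned datasets, and the join keys are an assumption: inflation is joined on `year_month` alone, while employment is assumed to be available per `state` and `year_month`.

```python
import pandas as pd

# Hypothetical miniature versions of the three cleaned datasets.
compensation = pd.DataFrame({
    "company": ["A", "B", "C"],
    "state": ["CA", "WA", "CA"],
    "year_month": ["2020-01", "2020-01", "2020-02"],
    "totalyearlycompensation": [180, 210, 195],
})
inflation = pd.DataFrame({
    "year_month": ["2020-01", "2020-02"],
    "inflation_rate": [1.0, 1.1],
})
employment = pd.DataFrame({
    "state": ["CA", "WA", "CA"],
    "year_month": ["2020-01", "2020-01", "2020-02"],
    "employment_rate": [96.0, 95.5, 96.2],
})

# Assumption: inflation is nationwide, so it joins on the month alone.
merged = compensation.merge(inflation, on="year_month", how="left")
# Assumption: employment is per state, so it joins on state and month.
merged = merged.merge(employment, on=["state", "year_month"], how="left")

print(merged)
```

Left joins keep every compensation record even when a macroeconomic value is missing, which makes gaps visible as NaN instead of silently dropping rows.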
Term | Description |
---|---|
timestamp | timestamp of compensation record submission |
company | company names |
title | employee's job title |
totalyearlycompensation | total compensation that an employee gets annually |
location | cities where the companies are located |
yearsofexperience | years of experience in a career |
yearsatcompany | years the employee has worked at that company |
year | year of timestamp |
month | month of timestamp |
year_month | year and month of the timestamp |
inflation_rate | the percentage at which a currency is devalued during a period |
inflation_rate_3mos | inflation rate of 3 months prior to record timestamp |
state | states in the US |
employment_rate | the percentage of the labor force that is employed |
employment_rate_3mos | employment rate of 3 months prior to record timestamp |
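The two `_3mos` columns are lagged versions of the monthly rates. A minimal pandas sketch of how such a lag could be built is shown below; it assumes one row per month with no gaps, and the values are placeholders, not real inflation figures.

```python
import pandas as pd

# Hypothetical monthly series with one row per month and no gaps.
monthly = pd.DataFrame({
    "year_month": pd.period_range("2020-01", periods=6, freq="M"),
    "inflation_rate": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
})

# Sort chronologically, then shift by three rows so each month carries
# the value observed three months earlier.
monthly = monthly.sort_values("year_month").reset_index(drop=True)
monthly["inflation_rate_3mos"] = monthly["inflation_rate"].shift(3)

print(monthly)
```

The first three months have no earlier value and therefore get NaN. For a rate stored per state, the same idea would apply within each state, for example with `groupby("state")` before the shift.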
- Some of the EDA plots we produced include:
  - Top 10 total compensations by title
  - Workers' location (top 10)
  - Nationwide Inflation Rate
  - Nationwide Unemployment Rate
Hyperparameter tuning took longer than anticipated for most of our models, so we ran one of them (Random Forest Regression) on AWS. The following table summarizes the models we evaluated. The team selected the Gradient Boosting Regressor (with GridSearch) as the best model for compensation prediction.
Model | Training Score (R^2) | Testing Score (R^2) | MSE (Train) | MSE (Test) | Comment |
---|---|---|---|---|---|
Linear Regression (no penalty) | 0.5193 | -7.2931 × 10^28 | 8286.35 | 1.22 × 10^27 | |
Lasso Regularization (CV) | 0.5182 | 0.5143 | 8305.30 | 8157.45 | |
Ridge Regularization (CV) | 0.52 | 0.5097 | 8274.20 | 8234.28 | |
Elastic Net Regularization (CV) | 0.4483 | 0.4499 | 9511.19 | 9238.88 | |
Random Forest Regression (with GridSearch) | 0.466 | 0.410 | 9060 | 10319 | |
KNN Regressor (with GridSearch) | 0.9907 | 0.4762 | 158.35 | 9172.55 | |
Gradient Boosting Regressor (no GridSearch) | 0.5973 | 0.5318 | 6834.12 | 8198.40 | |
Gradient Boosting Regressor (with GridSearch) | 0.7131 | 0.5477 | 4867.52 | 7919.98 | Best Model |
Support Vector Regression (SVR) (no GridSearch) | 0.1368 | -0.1287 | | | |
Support Vector Regression (SVR) (with GridSearch) | 0.5029 | 0.4745 | | | |
AdaBoost (with GridSearch) | 0.1930 | 0.1276 | 13693 | 15276 | |
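For readers who want to reproduce the general approach behind the best model, the following is a minimal scikit-learn sketch of a Gradient Boosting Regressor tuned with `GridSearchCV`. The synthetic data, the parameter grid, and the split sizes are all assumptions chosen to keep the example fast; they are not the features or the grid used in the project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the merged feature matrix and compensation target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Hypothetical, deliberately small grid; a real search would be wider.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="r2",  # select the combination with the best cross-validated R^2
    cv=3,
)
search.fit(X_train, y_train)

# Report the same four quantities as the table above.
best_model = search.best_estimator_
train_pred = best_model.predict(X_train)
test_pred = best_model.predict(X_test)
train_r2 = r2_score(y_train, train_pred)
test_r2 = r2_score(y_test, test_pred)
train_mse = mean_squared_error(y_train, train_pred)
test_mse = mean_squared_error(y_test, test_pred)

print(search.best_params_)
print(f"R^2 train={train_r2:.4f} test={test_r2:.4f}")
print(f"MSE train={train_mse:.4f} test={test_mse:.4f}")
```

Comparing the training and testing scores side by side, as the table does, is a quick way to spot overfitting: a large gap such as the one in the KNN row suggests the model memorized the training data.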
- Incorporate more personal background features of the employee into the analysis (e.g. education)
- Incorporate more company and industry background information (e.g. stock price, company size, industry sector)
- Include current data (after Sep. 2020)
- More hyperparameter tuning (GridSearch, RandomizedSearch, BayesSearch)