Group-Project, Data Science Immersive at General Assembly

Project 5: Predicting employee total compensation

Authors:

Annie Wang
Hans Baumberger
Mason Lee
Sileshi Hirpa

Content of the Project:

Problem statement
Data cleaning and EDA
Data Preprocessing and Modeling
Model Selection
Next Steps

Problem statement:

Predict the total compensation for a prospective employee based on the professional background, companies and their locations, and macroeconomic factors.

Goals
1. Provide a reasonable expectation for compensation negotiation
2. Provide an important benchmark to the companies for competitive compensation package in recruitment.

The technical goal for our project is to maximize R^2 and minimize MSE.

Data cleaning and EDA

Our project is based on the datasets obtained from:

Web scraping the Levels.fyi (which lets users compare career levels and compensation packages across different companies) with permission from the administration.
Inflation rate from rateinflation.com
Unemployment data from Data World.

The three datasets were merged after data cleaning and EDA.

Data Dictionary

Term	Description
timestamp	timestamp of compensation record submission
company	company names
title	employee's job title
totalyearlycompensation	total compensation that an employee gets annually
location	cities where the companies are located
yearsofexperience	years of experience in a career
yearsatcompany	experience years of an employee at a particular company
year	year of timestamp
month	month of timestamp
year_month	year month of the data
inflation_rate	the percentage at which a currency is devalued during a period
inflation_rate_3mos	inflation rate of 3 months prior to record timestamp
state	states in the US
employment_rate	The percentage of the labor force that is employed
employment_rate_3mos	employment rate of 3 months prior to record timestamp

Some of the EDA we used include:

Features' correlation
Top 10 total compensations by title
Workers' location (top 10)
Nationwide Inflation Rate
Nationwide Unemployment Rate

Model Performance

Most of our models took longer than anticipated amount of time during the hyperparameter tuning process and we decided to run a model (RandomForestRegression) on the AWS platform. The following table summarizes the models we evaluated and the best model the team agreed upon for the compensation preidction: Gradient Boosting Regressor (with GridSearch).

Model	Training Score (R^2)	Testing Score(R^2)	MSE(Train)	MSE(Test)	Comment
Linear Regression(with no penality)	0.5193	-7.2931Xe^28	8286.35	1.22Xe^27
Lasso Regularization (CV)	0.5182	0.5143	8305.30	8157.45
Ridge Regularization (CV)	0.52	0.5097	8274.20	8234.28
Elastic Net Regularization (CV)	0.4483	0.4499	9511.19	9238.88
Random Forest Regression (with Gridsearch)	0.466	0.410	9060	10319
KNN Regressor (with Gridsearch)	0.9907	0.4762	158.35	9172.55
Gradient Boosting Regressor (no gridsearch)	0.5973	0.5318	6834.12	8198.40
Gradient Boosting Regressor (with gridsearch)	0.7131	0.5477	4867.52	7919.98	Best Model
Support Vector Regression (SVR) (without gridsearch)	0.1368	-0.1287
Support Vector Regression (SVR) (with Gridsearch)	0.5029	0.4745
AdaBoost (with Gridsearch)	0.1930	0.1276	13693	15276

Next Steps

Incorporate more personal background features of the employee into analysis (i.e. Education)
Incorporate more company and industry background information (i.e. Stock price, Company size, Industry sector)
Include current data (after Sep. 2020)
More hyperparameter tuning (GridSearch, RandomizedSearch, BayesSearch)

Name		Name	Last commit message	Last commit date
Latest commit History 127 Commits
Code		Code
Data		Data
Images		Images
Models		Models
Team Notes		Team Notes
.gitignore		.gitignore
Predict Total Compensation.pdf		Predict Total Compensation.pdf
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Group-Project, Data Science Immersive at General Assembly

Project 5: Predicting employee total compensation

Content of the Project:

Problem statement

Data cleaning and EDA

Data Preprocessing and Modeling

Model Selection

Next Steps

Problem statement:

Data cleaning and EDA

Data Dictionary

Model Performance

Next Steps

About

Releases

Packages

Contributors 4

Languages

mh0805/Group-Project

Folders and files

Latest commit

History

Repository files navigation

Group-Project, Data Science Immersive at General Assembly

Project 5: Predicting employee total compensation

Content of the Project:

Problem statement

Data cleaning and EDA

Data Preprocessing and Modeling

Model Selection

Next Steps

Problem statement:

Data cleaning and EDA

Data Dictionary

Model Performance

Next Steps

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages