---
title: "STAT 331 Final Project Report"
subtitle: Department of Statistics, Cal Poly San Luis Obispo
author: by Sarthak Khillon, Juan Moreno, Christina Walden
output:
  html_document:
    df_print: paged
---
# Introduction
This project analyzes H-1B visa applications in the United States and the possible associations among the fields included in an application. We built two models that take user-defined input and predict the likely salary and outcome of an application. The analysis draws on 3,002,458 H-1B applications from 2011 to 2016. An H-1B visa allows U.S. employers to employ foreign workers in specialty occupations. According to U.S. Citizenship and Immigration Services, it is a temporary visa that is granted for up to three years and can be extended to a maximum of six years. For a foreign national to apply for an H-1B visa, a U.S. employer must first offer them a job and submit a petition. This is the most common visa applied for and held by international students once they complete higher education.
This is an observational study, since the data were recorded without any attempt to influence the response. The dataset comes from Kaggle, a website hosting thousands of publicly accessible datasets. The Office of Foreign Labor Certification (OFLC) generates the H-1B data and releases it online every year. The raw data is available through the OFLC for further analysis; note, however, that the raw data is not ready for analysis. The original researcher applied multiple transformations to prepare it, and an article explaining those manipulations is provided along with an R Notebook file. We primarily used RStudio for the statistical analysis throughout this report.
One of the response variables in this study is `PREVAILING_WAGE`, defined as the most frequent wage for the corresponding role. The other response variable is `CASE_STATUS`, the status of an application, which takes one of four values: Withdrawn, Certified, Denied, and Certified-Withdrawn. The remaining variables in the dataset are possible explanatory variables. There are a few character variables: `EMPLOYER_NAME`, `SOC_NAME`, and `JOB_TITLE`. The dataset holds one logical variable, `FULL_TIME_POSITION`, with a yes or no response. Most visa applications are for full-time positions, so we do not expect this variable to carry much information. The `YEAR` variable records the year in which the application was processed. The employer's location is given by the character variable `WORKSITE`, which gives an exact address of the employer worksite, and by longitude and latitude values stored as `lon` and `lat`. To support exploratory data analysis of this dataset, we created a Shiny application through RStudio, described later in this report. Note that we randomly sampled 10,000 observations from the roughly 3 million original data points for performance reasons. **Even then, the plots in the application still take a few seconds to load**. A link to our Shiny app is below:
https://stat331christinasarthakjuan.shinyapps.io/finalproject/
# Statistical Concepts
We applied two major forms of statistical analysis.
The first model is the "K Nearest Neighbors" (KNN) model. KNN is a supervised learning classification algorithm: given `n` features, it treats each training observation as a point in `n`-dimensional space. To classify a new data point, the model places it in the same space, finds the `k` training points closest to it by Euclidean distance, and assigns the class held by the majority of those neighbors. We used this model to predict a new user's case status. Further information on this process is available in the references section.
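To make the neighbor vote concrete, here is a minimal base-R sketch of the idea. It is an illustration only, not the `caret` implementation used in this report; the function `knn_predict` and the toy data are hypothetical.

```r
# Minimal kNN classifier in base R (illustrative; not the caret implementation).
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest training points
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote among those neighbors
  names(which.max(table(nearest)))
}

# Toy data: two features and two made-up status labels
toy_x <- matrix(c(1, 1,
                  1, 2,
                  2, 1,
                  8, 8,
                  8, 9,
                  9, 8), ncol = 2, byrow = TRUE)
toy_y <- c("CERTIFIED", "CERTIFIED", "CERTIFIED", "DENIED", "DENIED", "DENIED")

knn_predict(toy_x, toy_y, new_x = c(2, 2), k = 3)
knn_predict(toy_x, toy_y, new_x = c(9, 9), k = 3)
```

Because the vote depends on raw distances, features measured on larger scales dominate the result, which is one reason to center and scale the predictors before fitting.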
The data transformations we applied are available below:
```{r, message = FALSE, results='hide'}
library(tidyverse) # Data wrangling; also attaches ggplot2, readr, and stringr
library(caret)     # kNN model
library(ggplot2)   # Plotting
library(readr)     # Reading csv files
library(stringr)   # String helpers
library(leaflet)   # Geographic exploration tool
# 1 ===== LOAD DATA =====
# The sampled source csv is read in, cleaned, and converted to a .RData file in generate_env.R
load("visa_info.RData")
# 2 ===== CASE STATUS CLASSIFIER =====
# KNN model to predict categorical Case Status given user input.
# Since KNNs do not continuously "learn" and dimensionality is low, it is a useful choice.
# Source: http://dataaspirant.com/2017/01/09/knn-implementation-r-using-caret-package/
# Functions
encode_numeric <- function(df) {
  df$emp_encoding <- as.numeric(as.factor(df$EMPLOYER_NAME))
  df$soc_encoding <- as.numeric(as.factor(df$SOC_NAME))
  df$job_title_encoding <- as.numeric(as.factor(df$JOB_TITLE))
  df$ft_encoding <- as.numeric(as.factor(df$FULL_TIME_POSITION))
  df$worksite_encoding <- as.numeric(as.factor(df$WORKSITE))
  return(df)
}
# Setup
num_cases <- length(unique(visas$CASE_STATUS))
set.seed(37)
visa_colnames <- colnames(visas)[3:11]
cluster_eval_cols <- setdiff(visa_colnames, "PREVAILING_WAGE")
visas <- encode_numeric(visas)
# Train-test split
in.train.rows <- createDataPartition(visas$CASE_STATUS, p = 0.7, list = FALSE)
in.train.cols <- union(2, 10:16)
v.train <- visas[in.train.rows, in.train.cols]
v.test <- visas[-in.train.rows, in.train.cols]
v.train$CASE_STATUS <- factor(v.train$CASE_STATUS)
# Build classifier
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
classifier <- train(CASE_STATUS ~ ., data = v.train, method = "knn",
                    trControl = train_control,
                    preProcess = c("center", "scale"),
                    tuneLength = 10)
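# External interface.
# Caveat: encode_numeric() derives factor codes from the data frame it is given,
# so codes computed on user_df alone need not match the codes used in training.
predict_case_status <- function(user_df) {
  user_numeric_df <- encode_numeric(user_df)
  predict(classifier, newdata = user_numeric_df)
}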
```
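One detail of the encoding step matters when predicting for new input: `as.numeric(as.factor(x))` assigns codes from whatever values are present in `x`, so a single new row is always encoded as 1. The sketch below, with made-up employer names and a hypothetical helper `encode_with_levels`, shows one possible way to reuse the levels learned from the training data. It is an assumption about how the encoding could be made consistent, not part of the code above.

```r
# Hypothetical helper: encode values using levels learned from the training data.
encode_with_levels <- function(x, train_levels) {
  as.numeric(factor(x, levels = train_levels))
}

train_employers <- c("ACME", "GLOBEX", "INITECH", "ACME")  # made-up names
train_levels <- levels(as.factor(train_employers))

# Encoding a new value on its own always yields 1, whatever the employer is
as.numeric(as.factor("INITECH"))

# Reusing the training levels reproduces the code seen during training
encode_with_levels("INITECH", train_levels)

# Values never seen in training become NA and need explicit handling
encode_with_levels("UMBRELLA", train_levels)
```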
Secondly, we used a multiple linear regression model to predict wage from the other variables available. Linear regression rests on four general assumptions: linearity of the relationship, independence of the errors, normality of the residuals, and equal variance of the residuals.
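As a rough illustration of how these assumptions are commonly checked, the sketch below fits a model to simulated data (not the visa data) and draws the two standard diagnostic plots. The variable names and the simulated relationship are assumptions made for the example.

```r
set.seed(1)
# Simulated data: y depends linearly on x plus normal noise
sim <- data.frame(x = runif(200, 0, 10))
sim$y <- 3 + 2 * sim$x + rnorm(200, sd = 1)

fit <- lm(y ~ x, data = sim)
res <- residuals(fit)

# Residuals vs fitted: a curve suggests non-linearity,
# a funnel shape suggests unequal variance
plot(fit, which = 1)

# Normal Q-Q plot: points far from the line suggest non-normal residuals
plot(fit, which = 2)
```

Independence cannot be read off these plots; it depends on how the observations were collected.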
The data contain outliers that may skew the results. However, there is no indication that these observations are invalid or incorrect; for example, a CEO makes significantly more in a year than people in most other professions. We therefore found no reason to remove these statistical outliers from the dataset before performing regression.
The code for our regressor model is below:
```{r, message = FALSE, results='hide'}
# 3 ===== PREDICTED WAGE REGRESSOR =====
wage_regressor <- lm(PREVAILING_WAGE ~ lon + lat + emp_encoding + soc_encoding +
                       job_title_encoding + ft_encoding + worksite_encoding,
                     data = visas)
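# External interface.
# Caveat: encode_numeric() derives factor codes from the data frame it is given,
# so codes computed on user_df alone need not match the codes used in fitting.
predict_wage <- function(user_df) {
  user_numeric_df <- encode_numeric(user_df)
  predict(wage_regressor, newdata = user_numeric_df)
}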
```
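The regressor treats the integer codes of the categorical fields as ordinary numbers, which implies an ordering and equal spacing between categories. The sketch below uses simulated data with three made-up job groups to contrast that with entering the same field as a factor. It is offered only as a possible direction for follow-up work, not as a result from the visa data.

```r
set.seed(2)
# Simulated data: three made-up job groups with different mean wages
toy <- data.frame(job = rep(c("analyst", "engineer", "manager"), each = 30))
group_means <- c(analyst = 60000, engineer = 90000, manager = 70000)
toy$wage <- unname(group_means[toy$job]) + rnorm(90, sd = 5000)

# Integer codes force a single slope across alphabetically ordered categories
fit_codes <- lm(wage ~ as.numeric(as.factor(job)), data = toy)

# A factor gets one indicator coefficient per non-reference level
fit_factor <- lm(wage ~ factor(job), data = toy)

summary(fit_codes)$r.squared
summary(fit_factor)$r.squared
```

For fields with thousands of distinct values, such as employer names, indicator coding produces a very wide model, so grouping rare levels first would likely be needed.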
Lastly, we performed three types of exploratory data analysis:
1) An interactive map to show the frequency of jobs by location.
2) A comparison of case status results using a pie chart.
3) A bar chart that compares prevailing wages based on different case status results.
# Results
### Application Usage
In addition to providing introductory information, the first tab in the application has an optional user input field. A user can input their data to trigger our classifier and regressor models, which will then show a predicted case status and wage on their respective tabs. This information is displayed at the top of a relevant tab; if a user chooses not to provide this information, then a message is shown inviting them to fill in that data. All data is temporarily held for their interactive session and released once they exit the application.
Each tab has a sidebar with which a user can filter the data; however, users must be cautious not to apply too many filters as there may be no observations that match those specifications. For example, if a user chooses to limit employers to "Apple, Inc." and job titles to "Construction worker", then the results will be blank. In that case, a user can simply press the "Reset" button to clear all filters.
Further information on data exploration is available throughout the rest of this section.
### kNN Model Results
The details of the classifier model are hidden from the user; only the output (expected case status) is shown.
```{r}
classifier
```
### Linear Regression Summary
As with the kNN classifier, the details of the regressor are hidden from the user; only the output (predicted wage) is shown.
```{r}
summary(wage_regressor)
```
### Tab 1 - Geographic Exploration
An interactive map shows the number of jobs that fit the filters added by the user. The more zoomed in the map becomes, the more specific the counts and locations are. Below is an example of an interactive map with the following filters:
- Application certified and not withdrawn
- Full-time jobs
- Year 2011, 2013, or 2016
- Prevailing wage above 100,000
- Is some sort of Executive or Physicist
```{r, echo = FALSE}
sample.map.filtered <- visas %>%
  filter(CASE_STATUS == "CERTIFIED" &
           FULL_TIME_POSITION == "Y" &
           (YEAR == 2011 | YEAR == 2013 | YEAR == 2016) &
           PREVAILING_WAGE > 100000 &
           grepl("executive|physicists", SOC_NAME, ignore.case = TRUE))
leaflet(sample.map.filtered) %>%
  addTiles() %>%
  addMarkers(~lon, ~lat, clusterOptions = markerClusterOptions())
```
### Tab 2 - Application Status Pie Chart
Below is a pie chart showing the proportion of different case status results. In this example, the data is filtered to only include the years 2011 and 2015.
```{r, echo = FALSE}
# Map only fill; geom_bar() counts the rows in each status to size the slices
visas %>%
  filter(YEAR == 2011 | YEAR == 2015) %>%
  ggplot(aes(x = "", fill = CASE_STATUS)) +
  geom_bar(width = 1) +
  coord_polar("y", start = 0) +
  xlab(NULL) +
  ylab(NULL) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank())
```
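The slice sizes in a chart like this are just the share of rows in each status. A minimal sketch of that calculation, using made-up status values in place of the filtered `CASE_STATUS` column:

```r
# Made-up status values standing in for a filtered CASE_STATUS column
status <- c("CERTIFIED", "CERTIFIED", "CERTIFIED", "DENIED",
            "WITHDRAWN", "CERTIFIED", "CERTIFIED-WITHDRAWN", "CERTIFIED")

counts <- table(status)       # rows per status
shares <- prop.table(counts)  # counts divided by the total
round(shares, 3)
```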
### Tab 3 - Wages Bar Chart
Below is a bar chart of mean prevailing wage by case status. The user may apply different filters to subset the data as desired. In this example, the data is subset by:
- Only full-time jobs
- Year 2011, 2013, or 2016
- Prevailing wage above 100,000
- Is some sort of Executive or Physicist
```{r, echo = FALSE}
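# Average the wage within each case status before plotting, so each bar
# shows a group mean rather than a stacked sum of the overall mean
visas %>%
  filter(FULL_TIME_POSITION == "Y" &
           (YEAR == 2011 | YEAR == 2013 | YEAR == 2016) &
           PREVAILING_WAGE > 100000 &
           grepl("executive|physicists", SOC_NAME, ignore.case = TRUE)) %>%
  group_by(CASE_STATUS) %>%
  summarise(mean_wage = mean(PREVAILING_WAGE)) %>%
  ggplot(aes(x = CASE_STATUS, y = mean_wage, fill = CASE_STATUS)) +
  geom_col() +
  labs(x = "Case Status", y = "Mean Prevailing Wage")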
```
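A bar of mean wage per status needs one mean per group rather than a single overall mean. The sketch below, with made-up numbers, contrasts the two quantities.

```r
# Made-up wages for three statuses
toy <- data.frame(
  CASE_STATUS = c("CERTIFIED", "CERTIFIED", "DENIED", "DENIED", "WITHDRAWN"),
  PREVAILING_WAGE = c(120000, 140000, 110000, 130000, 150000)
)

# One number for the whole table
overall_mean <- mean(toy$PREVAILING_WAGE)

# One number per case status, which is what each bar should show
group_means <- tapply(toy$PREVAILING_WAGE, toy$CASE_STATUS, mean)

overall_mean
group_means
```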
# Conclusion
The main limitation of our analysis is that the regression model did not yield any statistically significant variables. This is unfortunate, but it opens the door to further statistical analysis of this data set or of immigration as a whole. The main implication is that we found no evidence of a linear association between prevailing wage and the predictors as we encoded them, which raises the question of what actually does influence the salary of H-1B visa applicants. The geographic exploration shows that most applications are for work in the continental U.S., though some are for worksites outside it. In our pie chart, most applications are certified, which suggests that a new application has a fairly high chance of being certified as well. The bar chart shows that, in general, the wages for certified applications are higher than those for uncertified or withdrawn applications. Overall, this data holds great potential that could be explored further in other studies.
# References
- https://www.kaggle.com/gpreda/h-1b-visa-applications
- https://sharan-naribole.github.io/2017/02/24/h1b-eda-part-I.html
- https://www.saedsayad.com/k_nearest_neighbors.htm
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4916348/#__sec7title