The goal of this project is to explore the data and predict whether a student will get admitted into graduate school or not.
This was a rather small and clean dataset which I downloaded from Kaggle.
Almost all features are numeric; the exception is Research, which is boolean. University rating, letters of recommendation and statements of purpose are each rated from 1 to 5. The target variable is Chance of Admit, the probability of getting admitted into graduate school. This is what we will try to predict.
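As a quick orientation, here is a minimal sketch of how the data can be loaded and inspected; the file name Admission_Predict.csv and the snake_case renaming are assumptions chosen to match the column identifiers (e.g. chance_of_admit, serial_no.) used in the code below.
import pandas as pd
# assumed local copy of the Kaggle "Graduate Admissions" CSV
data = pd.read_csv('Admission_Predict.csv')
# normalize column names to the snake_case identifiers used throughout,
# e.g. 'Chance of Admit ' -> 'chance_of_admit', 'Serial No.' -> 'serial_no.'
data.columns = data.columns.str.strip().str.lower().str.replace(' ', '_')
print(data.dtypes)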
From the pairplot we can see a positive correlation between GRE scores and the chance of being admitted, and the same holds for TOEFL scores. GRE and TOEFL scores are also positively correlated with each other, and the chance of being admitted looks fairly evenly distributed.
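The pairplot itself is a one-liner; a minimal sketch (dropping the row id, which carries no signal):
import seaborn as sns
import matplotlib.pyplot as plt
# pairwise scatterplots plus per-feature distributions on the diagonal
sns.pairplot(data.drop('serial_no.', axis=1))
plt.show()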
Highest correlations found with the target, Chance of Admit:
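A quick way to compute these, as a sketch using the same column names:
# Pearson correlation of every feature with the target, highest first
print(data.corr()['chance_of_admit'].sort_values(ascending=False))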
Spoiler alert: this is most likely NOT how students are admitted into graduate school. With a 0.5 cutoff, the admittance rate in this dataset would be greater than 90%, so almost anyone would get in...
# label a student as admitted when the chance of admission is at least 0.5
data['admit'] = [1 if x >= .5 else 0 for x in data.chance_of_admit]
# count admitted vs. not admitted under this threshold
temp = data.admit.value_counts().reset_index()
temp['threshold'] = .5
fig, ax = plt.subplots()
# with older pandas, reset_index() names the label column 'index'
sns.barplot(ax=ax, data=temp, y='admit', x='index')
plt.show()
- Plotting the data using distplot
The first plot shows the distribution of the cumulative GPA feature; these scores look normally distributed. From the scatterplot we see that as university rating increases, so does the cumulative GPA. Some students attending universities rated 4 have a higher cumulative GPA than those attending the highest-rated schools, but in general the relationship looks linear as university rating increases.
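The scatterplot referred to here can be reproduced with a short snippet; a sketch using the assumed column names:
fig, ax = plt.subplots()
# cumulative GPA against the 1-5 university rating
sns.scatterplot(x='university_rating', y='cgpa', data=data, ax=ax)
ax.set_title('CGPA by University Rating')
plt.show()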
- Plotting the data using qqplot
This plot generates its own sample of the idealized distribution that we are comparing against, in this case the Gaussian distribution. The idealized samples are divided into groups called quantiles, and each data point in our sample is paired with the corresponding quantile from the idealized distribution. The X-axis shows the idealized samples and the Y-axis shows our data samples. Our samples closely fit the diagonal line expected for a sample drawn from a Gaussian distribution.
- Statistical Normality tests
Taking a closer look at this - our null hypothesis H0 states that our sample is normally distributed.
We set our alpha to 0.05 and:
- if p <= alpha: we reject H0 (the data does not look normal)
- if p > alpha: we fail to reject H0 (the data looks normal)
a) Shapiro-Wilk test
Evaluates a data sample and quantifies how likely it is to be drawn from a Gaussian distribution.
b) D'Agostino test (stats.normaltest)
This test calculates summary statistics on the data: skewness, a measure of asymmetry that quantifies how far left or right the data is pushed, and kurtosis, which quantifies how much of the data sits in the tails of the distribution.
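Only the Shapiro-Wilk code is shown below, so here is a minimal sketch of the D'Agostino test on the same column:
from scipy import stats
# D'Agostino's K^2 test combines skewness and kurtosis into a single statistic
statistic, p = stats.normaltest(data.cgpa)
print('Statistics=%.3f, p=%.3f' % (statistic, p))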
In both cases, given the evidence from this dataset, we cannot reject the null hypothesis that the data is normally distributed, because the p-values are greater than the chosen alpha. Our cumulative GPA scores look normally distributed.
# 1) general plot of data distribution
fig, ax = plt.subplots()
_ = sns.distplot(data.cgpa, kde=True, kde_kws={'color':'red', 'shade':True, 'lw':1}, ax=ax)
ax.set_xlabel('')
ax.set_title("Distribution of CGPA Scores")
plt.show()
# 2) qqplot
from statsmodels.graphics.gofplots import qqplot
fig, ax = plt.subplots()
_ = qqplot(data.cgpa, line = 's', ax=ax)
ax.set_title("QQplot of CGPA")
plt.show()
# Shapiro normality test
statistic, p = stats.shapiro(x=data.cgpa)
print('Statistics=%.3f, p=%.3f' % (statistic, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Our data sample looks Gaussian (fail to reject H0)')
else:
    print('Our data sample does not look Gaussian (reject H0)')
Statistics=0.993, p=0.072
Our data sample looks Gaussian (fail to reject H0)
Overall, GRE score percentiles are higher for students with research experience than for those without. From the boxplot we can see that the highest GRE scores belong to students with research experience, and those students also have the highest chance of getting admitted. We see a similar pattern with TOEFL scores.
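The research comparison described above can be drawn with a short snippet; a sketch assuming research is the 0/1 column from the dataset:
fig, axes = plt.subplots(figsize=(12, 5), ncols=2)
# score distributions for students with (1) and without (0) research experience
sns.boxplot(x='research', y='gre_score', data=data, ax=axes[0])
sns.boxplot(x='research', y='toefl_score', data=data, ax=axes[1])
plt.show()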
'''
Box plots of GRE & TOEFL Scores vs chance of admittance
'''
var = ['gre_score','toefl_score']
fig, axes = plt.subplots(figsize=(18, 8),nrows=2)
for i, v in enumerate(var):
    sns.boxplot(y='chance_of_admit', x=v, ax=axes[i], data=data[[v, 'chance_of_admit']], palette='Set1')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].set_title(v)
plt.show()
Using 360 rows of data for training and 40 for testing, we scale the data with StandardScaler (mean = 0 and variance = 1) and then fit a linear regression model to the data.
The first model gets an R^2 score of about 0.85 on the test set, which means that about 85% of the variability in Y can be explained using X.
Looking at RMSE, our predictions on the test set are off from the ground truth by about 5 percentage points of admission chance on average.
The features with the largest coefficients are Cumulative GPA and GRE Score, which indicates that they have the most influence on the chance of getting admitted to graduate school in this model.
import math
import pandas as pd
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
# shuffle data, dropping the row id and the target from the feature matrix
X, y = shuffle(data.drop(['serial_no.','chance_of_admit'], axis=1), data.chance_of_admit, random_state=23)
# split data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=1)
# use Linear regression to predict target
linreg = LinearRegression()
scaler = preprocessing.StandardScaler()
# fit scaler on training data
scaler = scaler.fit(X_train)
# transform training data using standard scaler
X_train_transformed = scaler.transform(X_train)
# transform test data using the fitted scaler
X_test_transformed = scaler.transform(X_test)
# fit model to training data
linreg = linreg.fit(X_train_transformed, y_train)
# take a look at R^2 score
linreg_score = linreg.score(X_test_transformed, y_test)
print("Linear Regression R^2 score on training set %.4f" %linreg.score(X_train_transformed,y_train))
print("Linear Regression R^2 score on test set %.4f" %linreg_score)
pred = linreg.predict(X_test_transformed)
linreg_mse = mean_squared_error(y_pred=pred, y_true=y_test)
print("Linear Regression MSE score on training %.4f" %linreg_mse)
print("Linear Regression RMSE %.4f" %math.sqrt(linreg_mse))
# look at coefficients - which variables are most important in model?
linreg_coefs = linreg.coef_
coef_df = pd.DataFrame(data = list(zip(X_train.columns,linreg_coefs)), columns=['feature','coefficient'])
coef_df.sort_values(by = 'coefficient', ascending=False)
Linear Regression R^2 score on training set 0.8542
Linear Regression R^2 score on test set 0.8503
Linear Regression MSE on test set 0.0028
Linear Regression RMSE 0.0530
Feature | Coefficient |
---|---|
Cumulative GPA | 0.053796 |
GRE | 0.017675 |
TOEFL | 0.014883 |
Research | 0.014161 |
Letter of Rec | 0.011608 |
Univ Rating | 0.010832 |
Stat.of purpose | 0.003996 |
In the second model I created a pipeline to scale the data, fit the model and perform cross validation.
The average training and test R^2 scores stayed about the same, at roughly 80% and 78%.
# build pipeline to combine StandardScaler and Linear Regression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate
scaler = preprocessing.StandardScaler()
linreg_pipe = Pipeline(steps=[('standardscaler', scaler), ('linear', LinearRegression())])
scores = cross_validate(estimator=linreg_pipe, X=X, y=y, cv=10, return_train_score=True)
print("Average score for train: %.4f" %scores['train_score'].mean())
print("Average score for test: %.4f" %scores['test_score'].mean())
For the Ridge regression model I performed cross validation in two ways:
1) 5-Fold cross validation by looping
I looped through an array of alpha values, assigned each alpha to the Ridge regression model, and performed 5-fold cross validation. I then plotted alpha against the average CV score, where we can see that the optimal alpha gives the best score of 0.785.
2) 5-Fold cross validation using RidgeCV
Here the average R^2 score was very close, at 0.788. Looking at the returned coefficients, cumulative GPA is seemingly the most informative feature, followed by TOEFL scores. Since GRE and TOEFL scores are highly correlated, I think the model would perform similarly with either GRE or TOEFL scores...
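To back up the correlation claim, a one-line check (a sketch):
# Pearson correlation between the two test scores
print(data[['gre_score','toefl_score']].corr())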
print("*"*30)
print("CROSS VALIDATION/K-FOLD FOR RIDGE REGRESSION")
print("*"*30)
alphas = np.logspace(-4, -.5, 30)
#alphas=[0.1, 1.0, 20.0]
scores = list()
scores_std = list()
for alpha in alphas:
ridge.alpha = alpha
this_scores = cross_val_score(estimator = ridge, X=X, y=y, n_jobs=1, cv=5)
scores.append(np.mean(this_scores))
scores_std.append(np.std(this_scores))
plt.figure(figsize=(6, 6))
plt.semilogx(alphas, scores)
plt.ylabel('CV score')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='red')
plt.show()
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=alphas, cv=5)
# fit Ridge Regression using 5fold cross validation
ridge_cv.fit(X_train_transformed, y_train)
ridge_cv_score = ridge_cv.score(X_test_transformed, y_test)
print("Ridge Regression R^2 score %.4f" %ridge_cv_score)
ridge_cv_pred = ridge_cv.predict(X_test_transformed)
ridge_cv_mse = mean_squared_error(y_pred=ridge_cv_pred, y_true=y_test)
print("Ridge Regression MSE score %.4f" %ridge_cv_mse)
print("Ridge Regression RMSE %.4f" %math.sqrt(ridge_cv_mse))
Ridge Regression R^2 score 0.7881
Ridge Regression MSE score 0.0040
Ridge Regression RMSE 0.0631
Lastly, I fit a Ridge regression model using Cumulative GPA, GRE & TOEFL scores, and then separate models using Cumulative GPA plus one of the two test scores, since these are highly correlated. Between GRE and TOEFL, the higher-scoring model used GRE, with an R^2 of 0.7978.
# fit scaler on training data - CGPA, GRE & TOEFL
scaler_2 = preprocessing.StandardScaler()
scaler_2.fit(X_train[['cgpa','gre_score','toefl_score']])
# transform training data using standard scaler
X_train_transformed_2 = scaler_2.transform(X_train[['cgpa','gre_score','toefl_score']])
X_test_transformed_2 = scaler_2.transform(X_test[['cgpa','gre_score','toefl_score']])
# fit Ridge Regression using 5-fold cross validation
ridge_2 = RidgeCV(alphas=alphas, cv=5)
ridge_2.fit(X_train_transformed_2, y_train)
ridge_2_score = ridge_2.score(X_test_transformed_2, y_test)
print("Ridge Regression R^2 score %.4f" % ridge_2_score)
ridge_2_pred = ridge_2.predict(X_test_transformed_2)
ridge_2_mse = mean_squared_error(y_pred=ridge_2_pred, y_true=y_test)
print("Ridge Regression MSE score %.4f" %ridge_cv_mse)
print("Ridge Regression RMSE %.4f" %math.sqrt(ridge_cv_mse))
Ridge Regression R^2 score 0.7894
Ridge Regression MSE score 0.0040
Ridge Regression RMSE 0.0631
features = ['gre_score','toefl_score']
new_scaler = preprocessing.StandardScaler()
# fit one model per feature pair: CGPA plus either GRE or TOEFL
for feat in features:
    feature_list = ['cgpa']
    feature_list.append(feat)
    new_scaler.fit(X_train[feature_list])
    X_train_scaled = new_scaler.transform(X_train[feature_list])
    X_test_scaled = new_scaler.transform(X_test[feature_list])
    # fit Ridge Regression using 5-fold cross validation
    ridge_new = RidgeCV(alphas=alphas, cv=5)
    ridge_new.fit(X_train_scaled, y_train)
    score_new = ridge_new.score(X_test_scaled, y_test)
    print("Ridge Regression model using %s is: R^2 score %.4f" %(feat,score_new))
Ridge Regression model using gre_score is: R^2 score 0.7978
Ridge Regression model using toefl_score is: R^2 score 0.7765
Finally, I predicted the chance of admittance on the test set using the model fit on CGPA and TOEFL scores (the last model from the loop above), calculated the squared difference for each row, and looked at the rows with the highest error.
# make predictions on test data
pred_ridge_new = ridge_new.predict(X_test_scaled)
X_test['predictions'] = pred_ridge_new
# merge predictions with the true values & look at rows with highest squared difference
predicted_test = X_test.merge(pd.DataFrame(y_test), right_index=True, left_index=True)
predicted_test['diff_squared'] = predicted_test.apply(lambda x: (x['predictions']-x['chance_of_admit'])**2, axis=1)
predicted_test.sort_values(by='diff_squared', ascending=False).head(5)
From the table below we can see that the model makes its largest error on a row where the TOEFL score is right around the average but the cumulative GPA is below the training-set average. The error for each row is the squared difference between our prediction and the actual chance of admission.
TOEFL | Chance of Admit | Predictions | Diff Squared |
---|---|---|---|
110 | 0.46 | 0.639608 | 0.032259 |
105 | 0.70 | 0.553767 | 0.021384 |
107 | 0.56 | 0.705240 | 0.021095 |
110 | 0.85 | 0.744311 | 0.011170 |
107 | 0.84 | 0.734688 | 0.011091 |