Contributors: Yitaek Hwang, James Schaefer, Steven Wang, Hannah White
Student persistence is a metric closely scrutinized by both government and academic communities interested in improving education. Major efforts to understand persistence, by the Obama Administration and by other scholars, have so far been incomplete: these reports do not account for important factors such as first-generation college status, financial aid packages, and circumstances outside the immediate classroom setting (e.g., having children). Even the studies from Princeton’s National Longitudinal Study of Freshmen use outdated logistic regression methods that yield low predictive power. This study combined the available data and used socioeconomic frameworks from previous research to identify the high-risk populations most prone to dropping out. Using a machine learning algorithm called random forest, the model correctly predicted 73.87% of the students who are likely to drop out, a significant improvement over the existing logistic regression methods, which yielded a poor 18.01% prediction accuracy. The report summarizes the 21 most important factors, including GPA, perception of prejudice on campus, and percentage of classes dropped, and ranks them to create a potential SPS. It also includes a proposed plan of action for BridgeEDU to use the random forest algorithm to identify students who qualify for the emergency gap fund.
- Bridge Edu Final Draft.pdf: the submitted report on student persistence
- Code: Python, Jupyter notebook files
- Model: Saved versions of the random forest model in pickled format
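The saved models can be loaded back into Python with the standard `pickle` module. The following is a minimal sketch, not the project's own loading code: the helper names `load_model` and `predict_dropout_risk` and the file name in the usage comment are hypothetical, and the sketch assumes the pickled object is a scikit-learn style estimator with a `predict()` method (this README does not name the library used). Unpickling typically requires the library that created the model to be installed, ideally at the same version.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Load a pickled model object from disk (hypothetical helper)."""
    # Only unpickle files you trust: pickle can run arbitrary code on load.
    with Path(path).open("rb") as f:
        return pickle.load(f)


def predict_dropout_risk(model, rows):
    """Return the model's predictions for a list of feature rows.

    Assumes a scikit-learn style estimator exposing predict(); the feature
    order must match whatever the model was trained on.
    """
    return list(model.predict(rows))


# Example usage (the file name below is an assumption, not a real path
# from this repository):
#   model = load_model("Model/random_forest.pkl")
#   predictions = predict_dropout_risk(model, feature_rows)
```

Keeping loading and prediction in separate functions makes it easy to swap in a different saved model version without touching the prediction code.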
The raw data we used can be downloaded from: princeton link