Alejandro Contreras (acontr10), Jesus Rodriguez (jrodi97), Hong Zheng (hzheng29), Qingyuan Lin (qlin20)
This project explores the integration of neural network-based feature extraction with traditional linear regression models to improve prediction accuracy in the social sciences. This approach aims to maintain the interpretability of linear models while leveraging the predictive power of neural networks to enhance feature representation. This method could be particularly useful in fields where the interaction effects of variables are significant but difficult to model linearly.
- Problem Type: The project covers both a regression problem and a binary classification problem, each aiming to improve the predictability of outcomes by constructing a better control vector through machine learning techniques. Specifically, the regression problem we plan to explore is voter turnout in a district, measured as a percentage, while the binary classification problem is self-identification as a Democrat or Republican.
While previous research has often focused on either machine learning or statistical methods in isolation, combining these approaches is less explored. The supplementary article from Cambridge, "Relaxing Assumptions in Linear Regression," provides a foundational understanding of integrating advanced computational methods with traditional models. The article highlights potential benefits and trade-offs, which guide our methodology.
- Regression: The dataset that we will be using is the 2020 California Primary Election Precinct Data, along with 2020 census estimate data. The 2020 California General Election Precinct Data is freely available online, and demographic data from the 2020 census is also publicly available. Both election and census data are commonly used by social scientists interested in politics.
- Binary Classification: The dataset that we will be using is the 2020 cross-sectional version of the American National Election Studies (ANES). ANES is a multi-decade time-series dataset that is popular among social scientists interested in politics, as it contains numerous questions regarding public opinion and voting behavior in U.S. presidential elections. Information on the dataset is available online. The data contains approximately 5,441 pre-election and 4,779 post-election interviews and will need to be preprocessed before use.
The architecture involves a simple neural network for feature extraction, trained on specific interaction variables. The extracted features, termed the control vector, will be used in a linear regression model. The primary challenge will be ensuring the neural network extracts meaningful features without overfitting, considering the limited size of typical social science datasets.
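The pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: the data is synthetic, and the one-hidden-layer tanh network, its size, the learning rate, the weight decay (a simple guard against overfitting), and the names `train_feature_net`, `ols_r2`, `d`, and `control_vector` are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical, not the election or ANES data).
n = 500
X = rng.normal(size=(n, 2))   # covariates whose interaction drives the outcome
d = rng.normal(size=n)        # variable of interest, kept linear for interpretability
y = 1.5 * d + X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=n)


def train_feature_net(X, target, hidden=8, lr=0.02, epochs=2000,
                      weight_decay=1e-3, seed=0):
    """Train a one-hidden-layer tanh network with full-batch gradient descent.

    Returns a function that maps covariates to hidden-layer activations.
    """
    init = np.random.default_rng(seed)
    W1 = init.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = init.normal(scale=0.5, size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                      # hidden activations
        err = H @ w2 + b2 - target                    # prediction error
        g_pred = 2.0 * err / len(target)              # gradient of the MSE
        g_Z = np.outer(g_pred, w2) * (1.0 - H ** 2)   # backprop through tanh
        g_w2 = H.T @ g_pred + weight_decay * w2
        W1 -= lr * (X.T @ g_Z + weight_decay * W1)
        b1 -= lr * g_Z.sum(axis=0)
        w2 -= lr * g_w2
        b2 -= lr * g_pred.sum()
    return lambda X_new: np.tanh(X_new @ W1 + b1)


def ols_r2(design, y):
    """Fit ordinary least squares; return coefficients and in-sample R-squared."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, r2


# Step 1: learn features from the interacting covariates.
extract = train_feature_net(X, y)
control_vector = extract(X)                           # shape (n, hidden)

# Step 2: linear regression with and without the learned control vector.
ones = np.ones((n, 1))
base_design = np.hstack([ones, d[:, None], X])
aug_design = np.hstack([base_design, control_vector])

beta_base, r2_base = ols_r2(base_design, y)
beta_aug, r2_aug = ols_r2(aug_design, y)
print(f"baseline R^2 = {r2_base:.3f}, with control vector R^2 = {r2_aug:.3f}")
print(f"coefficient on d: baseline {beta_base[1]:.3f}, augmented {beta_aug[1]:.3f}")
```

Because the augmented design contains every baseline column, its in-sample R-squared cannot be lower than the baseline's, so a gain here is not evidence of better generalization on its own. The coefficient on `d` stays directly interpretable in both fits, which is the point of keeping the final model linear.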
Success will be measured by the improvement in R-squared values of the linear regression model and the reduction in standard errors compared to traditional models. Experiments will include cross-validation tests to evaluate model robustness on unseen data.
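One way these metrics could be computed is sketched below, assuming classical (homoskedastic) OLS standard errors and plain k-fold cross-validation. The data is synthetic, and the product column `x0 * x1` is only a hypothetical stand-in for a learned feature; the helper names `ols_fit`, `r_squared`, and `cross_val_r2` are illustrative.

```python
import numpy as np


def ols_fit(X, y):
    """Return OLS coefficients and their classical standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]                 # residual degrees of freedom
    sigma2 = resid @ resid / dof                  # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))


def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def cross_val_r2(X, y, k=5, seed=0):
    """Out-of-sample R-squared on each of k held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        beta, _ = ols_fit(X[train_idx], y[train_idx])
        scores.append(r_squared(y[test_idx], X[test_idx] @ beta))
    return np.array(scores)


# Synthetic data (hypothetical): the outcome depends on an interaction term.
rng = np.random.default_rng(1)
n = 400
d = rng.normal(size=n)
x0, x1 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * d + x0 * x1 + 0.5 * rng.normal(size=n)

base = np.column_stack([np.ones(n), d, x0, x1])
aug = np.column_stack([base, x0 * x1])            # stand-in for a learned feature

_, se_base = ols_fit(base, y)
_, se_aug = ols_fit(aug, y)
cv_base = cross_val_r2(base, y)
cv_aug = cross_val_r2(aug, y)

print(f"SE on d: baseline {se_base[1]:.4f}, augmented {se_aug[1]:.4f}")
print(f"mean CV R^2: baseline {cv_base.mean():.3f}, augmented {cv_aug.mean():.3f}")
```

In a full pipeline, the feature network would need to be refit inside each training fold so that held-out data never influences the extracted features; the sketch sidesteps this by using a fixed feature.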
Base goal = 50% (randomly guessing the person's political self-identity)
Target goal = 75%
Stretch goal = 90%
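As a small illustration of how classification accuracy might be checked against these goals, the sketch below uses made-up labels and predictions; the `"D"`/`"R"` coding and the helper names are hypothetical. It also reports the majority-class rate, since a 50% random-guess baseline only holds when the two classes are roughly balanced.

```python
from collections import Counter

# Accuracy thresholds taken from the goals above.
GOALS = {"base": 0.50, "target": 0.75, "stretch": 0.90}


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def goals_met(acc):
    """Names of the goals whose threshold the accuracy reaches."""
    return [name for name, threshold in GOALS.items() if acc >= threshold]


# Made-up labels and predictions for illustration only.
y_true = ["D", "R", "D", "D", "R", "D", "R", "D"]
y_pred = ["D", "R", "D", "R", "R", "D", "D", "D"]

acc = accuracy(y_true, y_pred)
print(f"accuracy = {acc:.3f}")                              # 0.750
print(f"majority baseline = {majority_baseline(y_true):.3f}")  # 0.625
print(f"goals met: {goals_met(acc)}")                       # ['base', 'target']
```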
- Data Representation: The datasets used in social sciences may contain biases reflecting historical or societal inequalities. It’s critical to analyze whether the data accurately represents all subgroups or perpetuates any biases.
- Consequences of Errors: Errors in this model could lead to incorrect policy recommendations or misinterpretations of variable impacts. Stakeholders, including policymakers and academic communities, must be aware of these risks.
- Alejandro Contreras: Manage data preprocessing and integration with the regression model.
- Jesus Rodriguez: Lead neural network training and feature extraction.
- Hong Zheng: Oversee model evaluation, including statistical testing and validation.
- Qingyuan Lin: Coordinate the writing and presentation of findings, ensuring all documentation is clear and comprehensive.
Challenges: What has been the hardest part of the project you’ve encountered so far?
- Coming up with project ideas that are complex enough.
Insights: Are there any concrete results you can show at this point?
- Not too much
How is your model performing compared with expectations?
- Model meets 75 percent target but not the
Plan: Are you on track with your project?
- Still in ideation
What do you need to dedicate more time to?
- Model analysis
What are you thinking of changing, if anything?
- Nothing so far, solidifying architecture