Baseball Final Project Wiki
This is the wiki for my final project in BDA 696. I use data from the 2007-2012 MLB seasons in an attempt to predict if the home team will win a given game.
Generally, my project involves three steps:
See the pages above for each step to explore the project.
The entire project is wrapped into two Docker containers, one for SQL and one for Python. The Python container acts as the managing container and handles all data management and commands.
The Python container installs all required libraries in `requirements.txt` and then `curl`s the data. Once the SQL container is spun up, the Python container creates the baseball database, loads the data into it, and then passes the main SQL script, `boxscore_2.sql`, to the SQL container. This script cleans the data and creates all the features, putting them into one final table.
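The "once the SQL container is spun up" step implies some kind of readiness check before the SQL script is sent. The following is only a minimal sketch of one common way to do that, not the code used in this repo: `wait_for_port` is a hypothetical helper, and the host name `baseball_db` and port `3306` (the MySQL default) in the usage comment are assumptions.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; it is not part of the project.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, host not resolvable yet, etc.: retry shortly.
            time.sleep(interval)
    return False


# Assumed usage from inside the Python container (host and port are guesses):
# if wait_for_port("baseball_db", 3306, timeout=120):
#     ...create the database and run boxscore_2.sql...
```

An open port only shows that the server process is listening; the database may still need a few more seconds before it accepts queries, so a real check might also retry the first connection.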
The Python container then runs the main Python scripts that analyze the features and create the final model. To replicate the project, clone the repo, navigate to the `baseball` directory, and run `docker-compose up -d`. Runtime for me is about 50 minutes after the containers are built. And, as always:
All of this was built on a Linux box, so you may have version/dependency problems on macOS (especially if you're on a new MacBook, until M1 support is sorted out) or Windows.
In the `baseball` directory, three folders will be created:
- `db` houses the baseball database and is a persistent volume of `/var/lib/mysql` in the `baseball_db` container.
- `brute_force_plots` houses all of the plots referenced by tables created during feature examination.
- `results` houses tables, plots, and figures for the final analyses as well as some pre-analyses:
  - `pre-analysis` contains the main tables from the feature examination process.
  - `pca_models` contains the main results from the models using the features selected by principal component analysis.
  - `non-pca_models` contains the main results from models not using the PCA-selected variables.
  - `non-pca_models_time` contains the main results from models using the same features as in `non-pca_models` but with a different train/test split strategy.
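After a run finishes, a quick way to confirm that the folders described above were created is a small check like the one below. This is a sketch, not part of the project: `missing_output_dirs` and `EXPECTED_DIRS` are hypothetical names, and it assumes the four sub-folders sit directly inside `results`.

```python
from pathlib import Path

# Folder names as listed on this page, relative to the `baseball` directory.
# Nesting the last four under `results` is an assumption based on the list.
EXPECTED_DIRS = [
    "db",
    "brute_force_plots",
    "results",
    "results/pre-analysis",
    "results/pca_models",
    "results/non-pca_models",
    "results/non-pca_models_time",
]


def missing_output_dirs(root):
    """Return the expected folders that do not exist under `root`."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

Calling `missing_output_dirs(".")` from the `baseball` directory returns an empty list when every folder is present, and otherwise lists the ones that are still missing.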