week3

This week starts with a discussion of machine learning and then involves several assignments on reproducing the results of published research papers.

Day 1

Overfitting, generalization, and model complexity

Day 2

The long tail

Day 3

N-gram data and "Culturomics"

Day 4

Predicting daily Citibike trips (open-ended)

The point of this exercise is to gain experience with an open-ended prediction task: predicting the total number of Citibike trips taken on a given day. Create an R Markdown file named predict_citibike.Rmd and do all of your work in it.

Here are the rules of the game:

  1. Use the trips_per_day.tsv file that has one row for each day, the number of trips taken on that day, and the minimum temperature on that day.
  2. Split the data into randomly selected training, validation, and test sets, with 90% of the data for training and validating the model, and 10% for a final test set (to be used once and only once towards the end of this exercise). You can adapt the code from last week's complexity control notebook to do this. When comparing possible models, you can use a single validation fold or k-fold cross-validation if you'd like a more robust estimate.
  3. Start out with the model in that notebook, which uses only the minimum temperature on each day to predict the number of trips taken that day. Try different polynomial degrees in the minimum temperature and check that you get results similar to what's in that notebook, although they likely won't be identical because the shuffle changes which days end up in the training and validation splits. Quantify your performance using root mean squared error (RMSE).
  4. Now get creative and extend the model to improve it. You can use any features you like that are available prior to the day in question, ranging from the weather, to the time of year and day of week, to activity in previous days or weeks, but don't cheat and use features from the future (e.g., the next day's trips). You can even try adding holiday effects. You might want to look at feature distributions to get a sense of what transformations (e.g., log or manually created factors such as weekday vs. weekend) might improve model performance. You can also interact features with each other. This reference on formula syntax in R might be useful.
  5. Try a bunch of different models and ideas, documenting them in your R Markdown file. Inspect the models to figure out which features are highly predictive, and see if you can prune away any features that contribute negligibly. Report the model with the best performance on the validation data. Watch out for overfitting.
  6. Plot your final best fit model in two different ways. First with the date on the x-axis and the number of trips on the y-axis, showing the actual values as points and predicted values as a line. Second as a plot where the x-axis is the predicted value and the y-axis is the actual value, with each point representing one day.
  7. When you're convinced that you have your best model, clean up all your code so that it saves your best model in a .RData file using the save function.
  8. Commit all of your changes to git, using git add -f to add the model .RData file if needed, and push to your GitHub repository.
  9. Finally, use the model you just developed and pushed to GitHub to make predictions on the 10% of data you kept aside as a test set. Do this only once, and record the performance in your R Markdown file. Use this number to make a guess as to how your model will perform on future data (which we'll test it on!). Do you think it will do better, worse, or the same as it did on the 10% test set you used here? Write your answer in your R Markdown notebook. Render the notebook and push the final result to GitHub.
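
Steps 2, 3, 7, and 9 above can be sketched in base R roughly as follows. This is a minimal sketch under stated assumptions, not the code from the course notebook: the data frame is synthetic so that the block runs on its own, the column names `ymd`, `tmin`, and `num_trips` are guesses at the layout of trips_per_day.tsv, and the 80/20 train/validation ratio, the range of degrees, and the file name `model.RData` are arbitrary illustrative choices.

```r
# Synthetic stand-in for trips_per_day.tsv (an assumption, for illustration only).
# With the real file, something like read.delim("trips_per_day.tsv") would
# replace this block; the column names used here may differ from the real ones.
set.seed(42)
n <- 365
tmin <- runif(n, min = 10, max = 80)
trips_per_day <- data.frame(
  ymd = seq(as.Date("2014-01-01"), by = "day", length.out = n),
  tmin = tmin,
  num_trips = 5000 + 400 * tmin - 1.5 * tmin^2 + rnorm(n, sd = 3000)
)

# Step 2: shuffle the row indices, hold out 10% as the test set, and split the
# remaining 90% into training and validation sets (80/20 is an arbitrary choice).
idx <- sample(n)
n_test <- round(0.10 * n)
n_validate <- round(0.20 * (n - n_test))
test <- trips_per_day[idx[1:n_test], ]
validate <- trips_per_day[idx[(n_test + 1):(n_test + n_validate)], ]
train <- trips_per_day[idx[(n_test + n_validate + 1):n], ]

# Root mean squared error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Build the formula from a string so the degree is a literal number in the model.
poly_formula <- function(k) as.formula(sprintf("num_trips ~ poly(tmin, %d)", k))

# Step 3: fit polynomials of increasing degree in tmin and record the
# training and validation RMSE for each degree.
degrees <- 1:6
results <- do.call(rbind, lapply(degrees, function(k) {
  fit <- lm(poly_formula(k), data = train)
  data.frame(
    degree = k,
    train_rmse = rmse(train$num_trips, predict(fit, train)),
    validate_rmse = rmse(validate$num_trips, predict(fit, validate))
  )
}))
print(results)

# Pick the degree with the lowest validation RMSE and refit that model.
best_degree <- results$degree[which.min(results$validate_rmse)]
model <- lm(poly_formula(best_degree), data = train)

# Step 7: save the chosen model so it can be restored later with load().
model_file <- "model.RData"
save(model, file = model_file)

# Step 9: evaluate on the held-out test set, once, after the model is final.
test_rmse <- rmse(test$num_trips, predict(model, test))
print(test_rmse)
```

Training RMSE can only go down as the degree increases, so the choice of degree rests on the validation column. The formula is built from a string so that the degree is stored in the model as a number; a formula that refers to a variable such as `best_degree` would likely fail at prediction time after `load()` in a fresh session where that variable is not defined.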