Final Projects (25%)

Working in pairs, students will ask a simple policy question. To answer it, they will identify at least two disjoint data sources, merge them, perform a simple but correct statistical analysis, and create a simple (but possibly dynamic) dashboard to illustrate the results.
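
As a sketch of what "merge two disjoint sources" means in practice, here is a minimal pandas example. The file contents, column names, and values are invented for illustration; your own sources will dictate the real join key.

```python
import pandas as pd

# Two hypothetical disjoint sources: daily weather and daily crime counts.
weather = pd.DataFrame({"date": ["2017-11-01", "2017-11-02"],
                        "high_temp": [48, 55]})
crime = pd.DataFrame({"date": ["2017-11-01", "2017-11-02"],
                      "incidents": [210, 245]})

# Merge on the shared key; an inner join keeps only dates present in both.
merged = pd.merge(weather, crime, on="date", how="inner")
print(merged)
```

Taking the merge "seriously" usually means checking how many rows survive the join and why the others dropped out, not just that the call runs.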

Due Dates

By October 31, students should form groups and propose the analysis and project. The proposal should identify a question and the data sources (with links) that will help address it. A number of data sources are listed below for inspiration. The proposal should also specify 'baseline' functionality, describe several extensions, and include a link to the code repository.

By midnight on December 1st, all code must be definitively checked in. A README should detail the sources, describe problems encountered and their solutions, and provide clear instructions for launching the analysis and accessing the output.

During the finals period (December 4, 1:30-3:30), we will hold a two-hour palooza in which groups will describe their work and demonstrate any dynamic functionality for the rest of the class, the professor, and the TAs.

Baseline

You are aiming for something like our week 9 example on weather and crime. You can find a straightforward writeup of this work here:

https://harris-ippp.github.io/weather

  • Your work must* include data from at least two sources, specified in your proposal. I am much more interested in projects that start from some sort of "question" and that take the data merge seriously. This is not a statistics class, and I'm not looking for iron-clad causality. But neither am I interested in a scatter plot of two random variables.
    • It is not enough to plot dots representing one type of incident on a choropleth map from another source, or plot school quality against crime rates.
    • *I will make an exception to the two-source rule if one or more of the parameters requires very significant cleaning. For example, fertility histories in the NLSY are extremely messy. So too are some state election returns. But exceptions must be compellingly justified, and you must discuss them with me before submitting your proposal.
  • Produce plots and tables and develop a sensible model with statsmodels. Give that model a workout.
  • Describe what you have done in a clean and presentable notebook (or website), preferably with some degree of manipulability.
  • Document your technical work in the writeup or the README file for the repository.
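
The statsmodels bullet above might look something like this minimal sketch. The variables here are simulated stand-ins; your merged data would supply the real ones, and "giving the model a workout" means inspecting the summary, residuals, and alternative specifications, not just fitting once.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for merged data: does temperature predict incidents?
rng = np.random.default_rng(0)
df = pd.DataFrame({"high_temp": rng.uniform(20, 90, 100)})
df["incidents"] = 100 + 2 * df["high_temp"] + rng.normal(0, 10, 100)

# Fit a simple OLS model and inspect the full summary table.
model = smf.ols("incidents ~ high_temp", data=df).fit()
print(model.summary())
```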

Suggested extensions

  • I recognize that not all data sources are equally easy to use. I will reward efforts to use datasets that take a bit of work. For instance, lining up two 20-line Excel spreadsheets... not that impressive.
  • Build a SQLite database from the two sources, and load your data from SQL in your functions.
  • Mapping the data... more points for "very complex" maps.
  • Build a website, and put it online as you did in Week 6. You are welcome to use HTML as in HW6, or Jekyll themes on GitHub, as I do for the class websites. More points for presenting your own navigable HTML or Jekyll; fewer points for copying your Jupyter notebook.
  • Depending on what data you use (if this is meaningful), apply a machine-learning method from scikit-learn to predict an outcome from inputs. See the slides from last year's class. (This is playing with fire -- be careful with this stuff in the real world.)
  • Build interactive functions or a dashboard as in our Thanksgiving lecture.
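
The SQLite extension above could look roughly like this, using Python's built-in sqlite3 module with pandas. The table names, columns, and values are invented for illustration.

```python
import sqlite3
import pandas as pd

# Build a small database from two (hypothetical) sources.
con = sqlite3.connect(":memory:")  # use a file path like "project.db" in practice
pd.DataFrame({"date": ["2017-11-01"], "high_temp": [48]}
             ).to_sql("weather", con, index=False)
pd.DataFrame({"date": ["2017-11-01"], "incidents": [210]}
             ).to_sql("crime", con, index=False)

# Load merged data back out of SQL inside your analysis functions.
query = """SELECT w.date, w.high_temp, c.incidents
           FROM weather w JOIN crime c ON w.date = c.date"""
merged = pd.read_sql_query(query, con)
con.close()
```

Pushing the join into SQL keeps your analysis functions small and makes it easy to swap in the full dataset later.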

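The machine-learning extension, if your data supports it, might be sketched with scikit-learn as below. The features and outcome are simulated placeholders; holding out a test set is the minimum guard against the "playing with fire" problem of overstating predictive power.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Simulated placeholder inputs (e.g., weather features) and outcome.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 3 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, 200)

# Hold out data so predictive claims are checked on unseen observations.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)  # R^2 on the held-out set
```
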
Grading

The grading rubric will be modified to benefit ambitious projects:

  • Proposal and Interest (20%): does the proposal represent a serious effort to find data and ask or explore an interesting policy question with those data? It is understood that this is a proposal and not a finished product. But if you are unsure whether you can do a project with a dataset, please ask me or the TAs for input. You must look at the data to understand what it can do. The proposal must be on time.
  • Correctness (20%): is the baseline analysis delivered bug-free with thought for the statistics?
  • Scope (20%): how much do you try to do? You will get points for correct and meaningful: plots, tables, maps, manipulable data (drop-down, etc.), hosting the website online, etc. -- see "extensions."
  • Style (20%): are the front-end and code both navigable? Is the code well-commented? Does its division among files make sense? Do all plots have appropriate legends and labels?
  • Documentation (20%): is the presentation at the data palooza functional and engaging? Does the documentation (README) actually make it possible to understand how to find your data and run your site?

Datasets for Inspiration

Please go hunting for datasets that interest you. In the past, students have come up with really great datasets from their home countries. My suggestions below are super US-centric ones that I've used recently. But I particularly enjoy seeing foreign datasets.