Working in pairs, students will ask a simple policy question. To answer it, they will identify at least two disjoint data sources, merge them, perform a simple but correct statistical analysis, and create a simple (but possibly dynamic) dashboard to illustrate the results.
By October 31, students should form groups and propose the analysis and project. The proposal should identify a question and data sources that help them to address it (with links). A number of data sources are listed below, for inspiration. It should specify a 'baseline' functionality, describe several extensions, and include a link to the code repository.
By midnight December 1st, all code must be definitively checked in. A README should detail the sources, describe problems encountered and solutions, and provide clear instructions for launching the analysis and accessing the output.
During the finals period (December 4, 1:30-3:30), we will have a two-hour data palooza, in which groups will describe their work and demonstrate any dynamic functionality for the rest of the class, the professor, and the TAs.
You are aiming for something like our week 9 example on weather and crime. You can find a straightforward writeup of that work here:
https://harris-ippp.github.io/weather
- Your work must* include data from at least two sources, specified in your proposal.
I am much more interested in projects that start from some sort of a "question" and that take the data merge seriously.
This is not a statistics class, and I'm not looking for iron-clad causality. But neither am I interested in a scatter plot of two random variables.
- It is not enough to plot dots representing one type of incident on a choropleth map from another source, or plot school quality against crime rates.
- *I will make an exception to the two-source rule if one or more of the parameters require very significant cleaning. For example, fertility histories in the NLSY are extremely messy. So, too, are some state election returns. But exceptions need to be compellingly justified, and you must discuss this with me before submitting your proposal.
- Produce plots and tables, and develop a sensible model with statsmodels. Give that model a workout (a minimal statsmodels sketch appears after this list).
- Describe what you have done in a clean and presentable notebook (or website), preferably with some degree of manipulability.
- Document your technical work in the writeup or the README file for the repository.
- I recognize that not all data sources are equally easy to use. I will reward efforts to use datasets that take a bit of work. For instance, lining up two 20-line Excel spreadsheets... not that impressive.
- Build a sqlite database from the two sources, and load your data from SQL in your functions (see the SQLite sketch after this list).
- Mapping the data... more points for "very complex" maps.
- Build a website, and put it online as you did in Week 6. You are welcome to use HTML as in HW6, or Jekyll themes on GitHub, as I do for the class websites. More points for presenting your own navigable HTML or Jekyll; fewer points for copying your Jupyter notebook.
- Depending on what data you use (if this is meaningful), apply a machine-learning method from scikit-learn to predict an outcome from inputs (see the scikit-learn sketch after this list). See slides from last year's class. (This is playing with fire -- be careful with this stuff in the real world.)
- Build interactive functions or a dashboard as in our Thanksgiving lecture (a notebook-widget sketch follows this list).
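To make the statsmodels extension concrete, here is a minimal sketch. The data frame and the column names (incidents, temperature, month) are placeholders, not your data; swap in whatever your merge actually produces.

```python
import statsmodels.formula.api as smf

def fit_model(df):
    """OLS of a placeholder outcome on a continuous regressor plus month fixed effects."""
    model = smf.ols("incidents ~ temperature + C(month)", data=df).fit()
    print(model.summary())  # coefficients, standard errors, R-squared
    return model
```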
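For the SQLite extension, one simple pattern uses pandas plus the built-in sqlite3 module: write each raw source to its own table, then do the merge in SQL. The file names (weather.csv, crime.csv) and the join column (date) below are hypothetical stand-ins for your two sources.

```python
import sqlite3
import pandas as pd

def build_db(db_path="project.db"):
    """Write each raw source into its own table of one sqlite database."""
    con = sqlite3.connect(db_path)
    pd.read_csv("weather.csv").to_sql("weather", con, if_exists="replace", index=False)
    pd.read_csv("crime.csv").to_sql("crime", con, if_exists="replace", index=False)
    con.close()

def load_merged(db_path="project.db"):
    """Do the merge in SQL and return one analysis-ready data frame."""
    con = sqlite3.connect(db_path)
    df = pd.read_sql("""SELECT w.date, w.temperature, c.incidents
                        FROM weather AS w JOIN crime AS c ON w.date = c.date""", con)
    con.close()
    return df
```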
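If you attempt the scikit-learn extension, the essential discipline is the train/test split: report performance on data the model has not seen. Again, the feature and target names here are placeholders.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def predict_outcome(df, features=("temperature", "month"), target="incidents"):
    """Fit a simple predictive model and score it on held-out data."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[list(features)], df[target], test_size=0.25, random_state=0)
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))  # in-sample fit alone is misleading
```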
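For the interactive extension, I am not prescribing a library; one common option inside a Jupyter notebook is ipywidgets, as in this sketch. The plotting function and the "city" column are made up, and whatever tool the Thanksgiving lecture used, the pattern is the same: a control that re-runs a plotting function on the selected subset.

```python
import matplotlib.pyplot as plt
from ipywidgets import interact

def make_dashboard(df):
    """A drop-down that redraws a plot for whichever subset the viewer picks."""
    def plot_city(city):
        subset = df[df["city"] == city]
        subset.plot(x="date", y="incidents", title=city, legend=False)
        plt.show()
    interact(plot_city, city=sorted(df["city"].unique()))
```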
The grading rubric will be modified to benefit ambitious projects:
- Proposal and Interest (20%): does the proposal represent a serious effort to find data and ask or explore an interesting policy question with those data? It is understood that this is a proposal and not a finished product. But if you are unsure whether you can do a project with a dataset, please ask me or the TAs for input. You must look at the data to understand what it can do. The proposal must be on time.
- Correctness (20%): is the baseline analysis delivered bug-free, with some thought given to the statistics?
- Scope (20%): how much do you try to do? You will get points for correct and meaningful plots, tables, maps, manipulable data (drop-downs, etc.), hosting the website online, and so on -- see "extensions."
- Style (20%): are the front-end and code both navigable? Is the code well-commented? Does its division among files make sense? Do all plots have appropriate legends and labels?
- Documentation (20%): is the presentation in the data palooza functional and engaging? Does the documentation (README) actually make it possible to understand how to find your data and run your site? Is your code well-commented?
Please go hunting for datasets that interest you. In the past, students have come up with really great datasets from their home countries. My suggestions below are super US-centric, just ones that I've used recently. But I particularly enjoy seeing foreign datasets.
- US Census APIs and Geographies
- American Community Survey 5-year estimates. This is my go-to for demographic data at very good granularity
- Decennial Census is somewhat more precise for ethnic/racial breakdowns, but the data are less rich.
- Health Insurance Statistics
- City SDK -- from what I can tell, a slightly less-good version of the Census API.
- Migration flows
- Commuting patterns: LODES
- Also check out TIGER for mapping.
- National Longitudinal Survey of Youth -- one of the great datasets of all time, but it can be a lot to bite off.
- Integrated Public Use Microdata Series (IPUMS) -- standardizes data across many Censuses and the ACS. Great for time series or old data.
- Bureau of Labor Statistics: main page and Python example. Let me know if you use this; I have examples...
- American Time Use Survey -- we used it in class.
- City data portals (I looked at a bunch of crime data, whence the slant... but it's legitimately one of the more interesting datasets that cities release)
- FBI Uniform Crime Reports ... to complete the crime spree
- National Center for Education Statistics and its DataLab
- CollegeScorecard (not an API, but the superficial information is trivially extractable)
- Common Core Data
- High School Longitudinal Study
- Dropout rates
- National Assessment of Educational Progress
- International Data Explorer
- Also... use ed.gov
- Illinois Report Card doesn't have an API, but it's very scrapeable.
- Institute for Health Metrics and Evaluation: http://healthdata.org
- It's not an API, but with a little bit of work, you can pull all their data right out.
- World Bank Development Indicators
- WHO Global Health Observatory
- Voting -- State Election Returns (this is an area I know very well -- ask if you're interested)
- It's very easy to find good county-level election returns for national elections. Precinct-level is harder.
- Louisiana: the official site, but with a JSON backend, and maps available.
- North Carolina returns and maps
- Virginia returns and maps (Actually, I've fixed those maps...)
- Minnesota
- Wisconsin
- Ask if you want more -- these are a pain to work with. But you can get TN, MD, TX, FL, etc.
- Stocks: check out one of the packages like googlefinance.client. Scraping these yourself can get a bit involved.
- Health: CDC Wonder, Area Health Resource Files or Primary Care Service Areas
- Twitter has one of the great APIs.
- Google Maps -- they have APIs for elevations (biking!?), travel times (isolation?), distances, geolocation, etc.
- Weather Underground has a fantastic, free API for current and historical weather data.