This repo contains materials for a 75-minute class at the NICAR 2024 conference in Baltimore on using code to clean and process data.
The session is scheduled for Friday, March 8, from 5 to 6:15 p.m. in room Kent A on the fourth floor.
Open the Terminal application, then copy and paste this command and hit enter:

```sh
cd ~/Desktop/hands_on_classes/20240308-friday-data-wrangling-with-code && source env/bin/activate
```
- Where there's a will, there's a way! There are lots of different tools to accomplish your cleaning tasks: csvkit and other CLI tools, R, Python, and regular expressions used in various contexts and languages
- Find the patterns in the data
- Automating your cleaning steps with a notebook!
- Think about the spectrum of reproducibility: On one end, manually cleaning all values; on the other, writing code to automatically clean all of your data. Usually you land somewhere in the middle.
- Data surgery with csvkit: USACE dams database (sketch after this list)
- Selecting specific columns of data
- Assessing cleanliness of data values
- Subsetting into smaller files
- Dealing with garbage headers and Excel formatting: U.N. population data (sketch after this list)
- Skipping rows
- Dealing with grouping headers
- Parsing a spreadsheet formatted like a paper report: Annapolis rental violations (sketch after this list)
- Using start/stop flags to delineate and parse individual records
- Combining multiple spreadsheets that are (mostly) identically formatted: Fulton County taxes (sketch after this list)
- Creating rules for parsing specific files to standardize columns
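
Quick sketches of the four examples above, in Python. First, the csvkit-style surgery on the dams database, shown here with rough pandas equivalents (the file and column names are hypothetical; on the command line, csvkit's `csvcut`, `csvstat`, and `csvgrep` cover the same ground):

```python
import pandas as pd

# Hypothetical extract of the USACE dams database
df = pd.read_csv("dams.csv")

# Select specific columns (csvcut -c on the command line)
subset = df[["dam_name", "state", "hazard_potential"]]

# Assess cleanliness: how many distinct spellings does each value have?
print(subset["state"].value_counts(dropna=False))

# Subset into smaller files, one per state
for state, group in subset.groupby("state"):
    group.to_csv(f"dams_{state}.csv", index=False)
```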
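For the U.N. population workbook, a sketch of skipping junk rows and flattening grouping headers with pandas (the row count and sheet layout are assumptions about the file):

```python
import pandas as pd

# Skip the title and notes rows above the real header (the count is an assumption);
# header=[0, 1] reads the two stacked grouping-header rows as a MultiIndex
df = pd.read_excel("un_population.xlsx", skiprows=16, header=[0, 1])

# Flatten the MultiIndex into single, usable column names
df.columns = ["_".join(str(part).strip() for part in col) for col in df.columns]
```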
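For the Annapolis rental violations report, a sketch of the start/stop-flag approach: walk the rows in order, open a new record at a start flag, and attach everything that follows until a stop flag (the flag strings and file name here are hypothetical):

```python
import csv

records, current = [], None

with open("rental_violations.csv") as f:  # hypothetical export of the spreadsheet
    for row in csv.reader(f):
        line = " ".join(cell for cell in row if cell).strip()
        if line.startswith("Case Number"):      # start flag: a new record begins
            current = {"case": line, "details": []}
            records.append(current)
        elif line.startswith("End of Report"):  # stop flag: nothing useful follows
            break
        elif current is not None:
            current["details"].append(line)     # rows between flags belong to the
                                                # currently open record
```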
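And for the Fulton County tax spreadsheets, a sketch of per-file parsing rules that standardize columns before combining (the file names and header mappings are made up):

```python
import pandas as pd

# One rule per file: map each spreadsheet's quirky headers onto standard names
RULES = {
    "taxes_2021.xlsx": {"Parcel ID": "parcel_id", "Amt Due": "amount_due"},
    "taxes_2022.xlsx": {"PARCEL": "parcel_id", "Amount Due": "amount_due"},
}

frames = []
for path, column_map in RULES.items():
    df = pd.read_excel(path).rename(columns=column_map)
    df["source_file"] = path  # remember where each row came from
    frames.append(df[["parcel_id", "amount_due", "source_file"]])

combined = pd.concat(frames, ignore_index=True)
```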
- What are your strategies for cleaning data with code?
- csvkit
- csvmatch (fuzzy matching on the command line)
- xsv
- janitor R package
- tidyr R package
- Learn about built-in CLI tools on your (Mac or Linux) machine, such as `sed`
- Example (on OSX): Copy a list of non-unique values, sort them, grab a unique list and load it into your clipboard: `pbpaste | sort | uniq | pbcopy`
- fuzzy_pandas: A Python library for fuzzy matching in pandas (sketch after this list)
- If someone has solved your problem already, use their solution! e.g., address standardization with usaddress-scourgify
- Building an iterative dictionary of fixes for a dataset you plan to maintain over time (sketch after this list) -- example: scraping Texas death row media witnesses (fixes and where it's used in the script)
- Using OpenRefine to clean data (one powerful feature is GREL, a bespoke JavaScript-like expression language)
- pdfplumber: A Python library for parsing text-based PDFs (sketch after this list)
- Learn about regular expressions! (sketch after this list)
- Learn some basics about other data formats, such as JSON, and how your programming language of choice can process them (sketch after this list)
- Pythonic data cleaning with pandas and NumPy (sketch after this list)
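
Sketches for the items above. First, a fuzzy_pandas join, roughly as the library's README shows it (the data, method, and threshold here are illustrative; check the README for the exact options):

```python
import pandas as pd
import fuzzy_pandas as fpd

left = pd.DataFrame({"name": ["Jon Smith", "Ann Lee"]})
right = pd.DataFrame({"name": ["John Smith", "Anne Lee"], "age": [40, 35]})

# Fuzzy join on the name column instead of requiring exact matches
matches = fpd.fuzzy_merge(left, right,
                          left_on="name", right_on="name",
                          method="levenshtein", threshold=0.8)
```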
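A minimal sketch of an iterative fixes dictionary (the values are made up): each time a new bad value turns up in the data, you add one line and rerun the script:

```python
# Hand-built fixes, grown over time as new misspellings appear
FIXES = {
    "Hunstville": "Huntsville",
    "HOUSTON": "Houston",
}

def clean_city(value):
    value = value.strip()
    return FIXES.get(value, value)  # fall back to the original if no fix is known
```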
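A minimal pdfplumber sketch (the file name is hypothetical): open a text-based PDF and pull the text off each page:

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())
```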
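A small taste of regular expressions in Python, cleaning a messy phone number (the input string is invented):

```python
import re

messy = "Phone: (410) 555-0199 ext. 2"

# Strip everything that isn't a digit
digits = re.sub(r"\D", "", messy)  # "41055501992"

# Or pull out structured pieces with capture groups
match = re.search(r"\((\d{3})\)\s*(\d{3})-(\d{4})", messy)
if match:
    area_code, prefix, line = match.groups()
```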
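A sketch of processing JSON in Python (the structure is hypothetical); `pd.json_normalize` flattens nested records into a table:

```python
import json
import pandas as pd

raw = '{"results": [{"name": "Anne", "address": {"city": "Baltimore"}}]}'
data = json.loads(raw)

# Nested fields become dotted column names like address.city
df = pd.json_normalize(data["results"])
```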
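And a sketch of Pythonic cleaning with pandas and NumPy (the messy values are invented): coerce unparseable values to NaN instead of crashing, then flag what needs a human look:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": ["$1,200", " 350", "n/a"]})

# Strip symbols and whitespace, then coerce anything unparseable to NaN
df["amount"] = pd.to_numeric(
    df["amount"].str.replace(r"[$,]", "", regex=True).str.strip(),
    errors="coerce",
)

# Use NumPy to flag rows that survived cleaning
df["status"] = np.where(df["amount"].notna(), "ok", "check by hand")
```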