Materials for a session at NICAR 2024 on using code to clean data.

ireapps/nicar-2024-data-wrangling-with-code

NICAR 2024: Data wrangling with code

This repo contains materials for an hourlong class at the NICAR 2024 conference in Baltimore on using code to clean and process data.

The session is scheduled for Friday, March 8, from 5 to 6:15 p.m. in room Kent A on the fourth floor.

First step

Open the Terminal application. Copy and paste this command, then press Enter:

cd ~/Desktop/hands_on_classes/20240308-friday-data-wrangling-with-code && source env/bin/activate

Class outline

  • Where there's a will, there's a way! Many different tools can accomplish your cleaning tasks: csvkit and other command-line (CLI) tools, R, Python, and regular expressions, which can be used in various contexts and languages
  • Find the patterns in the data
  • Automating your cleaning steps with a notebook!
  • Think about the spectrum of reproducibility: On one end, manually cleaning all values; on the other, writing code to automatically clean all of your data. Usually you land somewhere in the middle.
  • Data surgery with csvkit: USACE dams database
    • Selecting specific columns of data
    • Assessing cleanliness of data values
    • Subsetting into smaller files
  • Dealing with garbage headers and Excel formatting: U.N. population data
    • Skipping rows
    • Dealing with grouping headers
  • Parsing a spreadsheet formatted like a paper report: Annapolis rental violations
    • Using start/stop flags to delineate and parse individual records
  • Combining multiple spreadsheets that are (mostly) identically formatted: Fulton County taxes
    • Creating rules for parsing specific files to standardize columns
  • What are your strategies for cleaning data with code?
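
As one illustration of the start/stop flag approach mentioned in the outline above, here is a minimal Python sketch. It assumes a report-style sheet has already been read into a list of rows, that each record begins with a row whose first non-empty cell starts with "Case Number", and that it ends with a row starting with "Total". The function name `parse_records`, the flag strings and the sample rows are all hypothetical; they are not taken from the Annapolis file or from the class notebooks.

```python
def parse_records(rows, start_flag="Case Number", stop_flag="Total"):
    """Group report-style rows into records bounded by start and stop flags.

    The flag strings are assumptions for this sketch; inspect your own
    file to find the text that reliably opens and closes each record.
    """
    records = []
    current = None  # None means we are between records
    for row in rows:
        # Strip whitespace and drop the empty cells that report layouts create
        cells = [str(c).strip() for c in row if str(c).strip()]
        if not cells:
            continue  # skip fully blank rows
        if cells[0].startswith(start_flag):
            # Start flag: open a new record
            current = {"header": cells, "lines": []}
        elif current is not None and cells[0].startswith(stop_flag):
            # Stop flag: close the record and save it
            current["footer"] = cells
            records.append(current)
            current = None
        elif current is not None:
            # Any row between the flags belongs to the open record
            current["lines"].append(cells)
        # Rows outside any record (page banners, etc.) are ignored
    return records


# Made-up rows standing in for a spreadsheet formatted like a paper report
sample_rows = [
    ["City of Example", "", ""],
    ["Case Number: 101", "", ""],
    ["", "Broken smoke detector", ""],
    ["", "Peeling paint", ""],
    ["Total violations: 2", "", ""],
    ["", "", ""],
    ["Case Number: 102", "", ""],
    ["", "No hot water", ""],
    ["Total violations: 1", "", ""],
]

records = parse_records(sample_rows)
for record in records:
    print(record["header"][0], len(record["lines"]))
```

Keeping the flags as parameters means the same loop can be reused if another report uses different marker text. In practice you would likely add a check for records that open but never reach a stop flag.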

Links/other resources
