Commit

updates

gboeing committed Jan 8, 2024
1 parent 634b9ab commit 77a8d6e
Showing 32 changed files with 1,312 additions and 2,051 deletions.
19 changes: 9 additions & 10 deletions .github/workflows/tests.yml
Original file line number Diff line number Diff line change
@@ -7,9 +7,7 @@ on:
branches: [main]

jobs:

build:

name: ${{ matrix.os }}
runs-on: ${{ matrix.os }}
strategy:
@@ -19,24 +17,21 @@ jobs:

defaults:
run:
shell: bash -l {0}
shell: bash -elo pipefail {0}

steps:

- name: Checkout repo
uses: actions/checkout@v3
with:
fetch-depth: 2

- name: Setup Conda environment with Micromamba
uses: mamba-org/provision-with-micromamba@v14
- name: Create environment with Micromamba
uses: mamba-org/setup-micromamba@v1
with:
cache-downloads: true
cache-env: true
channels: conda-forge
channel-priority: strict
cache-environment: true
environment-file: environment.yml
environment-name: ppde642
post-cleanup: none

- name: Test environment
run: |
@@ -45,3 +40,7 @@ jobs:
conda info --all
jupyter kernelspec list
ipython -c "import osmnx; print('OSMnx version', osmnx.__version__)"
- name: Lint
run: |
SKIP=no-commit-to-branch pre-commit run --all-files
1 change: 1 addition & 0 deletions .gitignore
@@ -2,6 +2,7 @@
data/*
modules/*/*.gal
modules/*/*.png
modules/*/cache/*
modules/*/keys.py
syllabus/pdf/*.pdf

50 changes: 50 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,50 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: "v4.5.0"
hooks:
- id: check-added-large-files
args: [--maxkb=50]
- id: check-ast
- id: check-builtin-literals
- id: check-case-conflict
- id: check-docstring-first
- id: check-json
- id: check-merge-conflict
args: [--assume-in-merge]
- id: check-yaml
- id: debug-statements
- id: detect-private-key
- id: end-of-file-fixer
- id: fix-byte-order-marker
- id: mixed-line-ending
- id: no-commit-to-branch
args: [--branch, main]
- id: trailing-whitespace

- repo: https://github.com/pre-commit/mirrors-prettier
rev: "v3.0.3"
hooks:
- id: prettier
types_or: [markdown, yaml]

- repo: https://github.com/nbQA-dev/nbQA
rev: "1.7.1"
hooks:
- id: nbqa-isort
additional_dependencies: [isort]
args: [--line-length=100, --sl]
- id: nbqa-black
additional_dependencies: [black]
args: [--line-length=100]
- id: nbqa-flake8
additional_dependencies: [flake8]
args: [--max-line-length=100]

- repo: local
hooks:
- id: nbconvert
name: clear notebook output
entry: jupyter nbconvert
language: system
types: [jupyter]
args: ["--clear-output", "--inplace"]
4 changes: 0 additions & 4 deletions README.md
@@ -1,14 +1,12 @@
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/gboeing/ppde642/main?urlpath=lab)
[![Build Status](https://github.com/gboeing/ppde642/workflows/tests/badge.svg?branch=main)](https://github.com/gboeing/ppde642/actions?query=workflow%3A%22tests%22)


# PPDE642: Advanced Urban Analytics

This is the second part of a two-course series on **urban data science** that I teach at the **University of Southern California**'s Department of Urban Planning and Spatial Analysis.

This course series takes a computational social science approach to working with urban data. It uses Python and Jupyter notebooks to introduce coding and statistical methods that students can reproduce and experiment with in the cloud. The series as a whole presumes no prior knowledge as it introduces coding, stats, spatial analysis, and applied machine learning from the ground up, but PPDE642 assumes you have completed [PPD534](https://github.com/gboeing/ppd534) or its equivalent.


## Urban Data Science course series

### PPD534: Data, Evidence, and Communication for the Public Good
@@ -17,14 +15,12 @@ The first course in the series, **PPD534**, starts with the basics of coding wit

**PPD534**'s lecture materials are available on [GitHub](https://github.com/gboeing/ppd534) and interactively on [Binder](https://mybinder.org/v2/gh/gboeing/ppd534/main).


### PPDE642: Advanced Urban Analytics

The second course, **PPDE642**, assumes you have completed PPD534 (or its equivalent) and builds on its topics. It introduces spatial analysis, network analysis, spatial models, and applied machine learning. It also digs deeper into the tools and workflows of urban data science in both research and practice.

**PPDE642**'s lecture materials are available in this repo and interactively on [Binder](https://mybinder.org/v2/gh/gboeing/ppde642/main).


## Not a USC student?

Did you discover this course on GitHub? Come study with us: [consider applying](https://geoffboeing.com/lab/) to the urban planning master's or PhD programs at USC.
6 changes: 3 additions & 3 deletions assignments/assignment2.md
@@ -6,9 +6,9 @@ You will clean, organize, describe, and visualize the data you downloaded in Ass

Create a new Jupyter notebook. The first cell of your notebook should be markdown explaining what your research question and hypotheses are, where you found your data set, and what it contains. Given your proposed project:

1. Load your data set and clean/process it as needed.
1. Identify at least two variables of interest and calculate relevant descriptive statistics.
1. Using the techniques we learned in class, visualize interesting aspects of your data set. Create at least 4 visualizations using at least 3 different visualization types (e.g., scatterplots, barplots, maps, etc).

Make sure your code is well-commented throughout for explanatory clarity. Your notebook should be well-organized into high-level sections using markdown headers representing the steps above, plus subheaders as needed. Each visualization should be followed by a markdown cell that explains what you are visualizing, why it is interesting, and why you made your specific graphical design decisions. What story does each visual tell? How does it enrich, confirm, or contradict the descriptive statistics you calculated earlier?

6 changes: 3 additions & 3 deletions assignments/assignment4.md
@@ -6,9 +6,9 @@ You will conduct a spatial analysis using a spatial dataset (ideally the same on

Create a new Jupyter notebook. The first cell of your notebook should be markdown explaining what your research question and hypotheses are, where you found your data set, and what it contains. Use geopandas to load your data set and clean/process it as needed. Make sure your code is well-commented throughout for explanatory clarity. Using the techniques we learned in class, do the following:

1. conduct a spatial analysis to look for hot/cold spots and assess spatial autocorrelation
1. compute spatial diagnostics to pick an appropriate spatial regression model
1. estimate and interpret a spatial regression model

Your notebook should be separated into high-level sections using markdown headers representing the steps above. Each section should conclude with a markdown cell that succinctly explains your analysis/visuals, why you set it up the way you did, and how you interpret its results. Your notebook should conclude with a markdown cell that explains 1) what evidence does this analysis provide for your research question and hypothesis, 2) what is the "big picture" story, and 3) how can planners or policymakers use this finding.

26 changes: 13 additions & 13 deletions assignments/final-project.md
@@ -8,27 +8,27 @@ The final project is a cumulative assignment that requires you to use the skills

Identify a conference of interest and familiarize yourself with their paper submission requirements. You might consider the following conferences, among others:

- Transportation Research Board (TRB)
- Association of Collegiate Schools of Planning (ACSP)
- American Planning Association's National Planning Conference (APA)
- American Association of Geographers (AAG)
- Urban Affairs Association (UAA)

Develop an urban research question that fits with the themes of your chosen conference. Develop a research design to answer this question, then collect data, clean and organize it, visualize it, and analyze it.

Write a conference paper organized into five sections:

1. introduction: provide a short (3 paragraph) summary of the study's importance, methods, and findings/implications (1 paragraph each)
2. background: explain the context of your study and provide a short lit review of relevant related work to establish what is known and what urgent open questions remain
3. methods: present your data and your analysis methods with sufficient detail that a reader could reproduce your study
4. results: present your findings and include supporting visuals
5. discussion: circle back to your research question, interpret your findings, and discuss their importance and how planners or policymakers could use them to improve some aspect of urban living

Format your paper according to the conference's guidelines. For the purposes of this course, your paper must be at least 3000 words in length (not including tables, figures, captions, or references). It must include the following, at a minimum:

- a table of descriptive statistics
- a table of spatial regression or machine learning model results
- 4 aesthetically pleasing figures containing data visualizations, including at least 1 map

You are strongly encouraged, but not required, to actually submit this paper to the conference.

22 changes: 11 additions & 11 deletions assignments/mini-lecture.md
@@ -8,21 +8,21 @@ This exercise is intended to be informal and an opportunity for self-discovery.

Instructions:

- Pick a method listed in the syllabus for those weeks or covered in the reading material.
- Learn how the method works by reading the week's reading material.
- Practice the method in your own notebook on your own data.
- Google for additional usage examples and further information.
- Prepare a mini-lecture notebook that would take 8-10 minutes to present that 1) briefly introduces why someone would use the method and how it works (~2 minutes), 2) demonstrates in code how to use the method for a simple data analysis (~5 minutes), 3) summarizes what the analysis revealed (~2 minutes).

8 minutes is not a lot of time, so keep your lecture notebook simple and brief. Have a clean dataset ready to go at the beginning of your lecture. Do not show us a lot of preparatory steps setting things up in your notebook. Jump right into the analysis that demonstrates your method.

You will be graded according to the following. In your notebook, did you:

- summarize why someone would use this method and how it works, at a high level
- demonstrate the method with a simple data analysis
- summarize what your analysis revealed
- keep it all succinct

Make sure your notebook runs from the top without any errors (i.e., restart the kernel and run all cells) and that all the output can be seen inline without me having to re-run your notebook. Via Blackboard, submit your notebook and data files, all zipped as a single file, named `LastName_FirstName_Lecture.zip`. If your submission file exceeds Blackboard's maximum upload size limit, you may provide a Google Drive link to your zipped data in the comment field when you submit.

Note that if you pick a supervised learning method, your assignment is due prior to class in module 11. If you pick an unsupervised learning method, your assignment is due prior to class in module 12. The "presentation" is pretend: you are just creating the lecture notebook you would have presented, and submitting it via Blackboard *before that module's class session* begins.
19 changes: 10 additions & 9 deletions environment.yml
@@ -5,32 +5,33 @@ channels:

dependencies:
- beautifulsoup4
- black
- cartopy
- cenpy
- conda
- contextily
- dill
- flake8
- folium
- gensim
- geopandas
- isort
- jupyterlab
- mapclassify
- osmnx=1.8.1
- nbqa
- nltk
- pandana
- pandas
- pre-commit
- pysal
- python=3.11.*
- rasterio
- rtree
- seaborn
- scikit-learn
- scipy
- statsmodels

# computer vision and NLP
- gensim
- nltk
- pillow
- pytorch
- torchvision

# others (unused)
# bokeh
# datashader
# holoviews
6 changes: 0 additions & 6 deletions format.sh

This file was deleted.

19 changes: 6 additions & 13 deletions modules/01-introduction/readme.md
@@ -2,29 +2,24 @@

In this module, we introduce the course, the syllabus, the semester's expectations and schedule, and set up the computing environment for coursework. Then we introduce the foundational tools underlying much of the modern data science world: package management, version control, and computational notebooks.


## Syllabus

The syllabus is in the [syllabus](../../syllabus) folder.


## Computing environment

Make sure that you have already completed the course's initial [software](../../software) setup before proceeding.


## Package management

A Python **module** is a file of Python code containing variables, classes, functions, etc. A Python **package** is a collection of modules, kind of like a folder of files and subfolders. A package can be thought of as a computer program.

**Package management** is the process of installing, uninstalling, configuring, and upgrading packages on a computer. A **package manager** is a software tool for package management, retrieving information and installing packages from a software repository. The most common Python package managers are `conda` and `pip`. These tools are typically used in the terminal.
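In terminal usage the two managers look similar (shown here with `osmnx`, one of this course's packages, purely as an illustrative example):

```shell
# conda installs prebuilt binary packages, here from the conda-forge channel
conda install -c conda-forge osmnx

# pip installs wheels (or source distributions) from PyPI
pip install osmnx
```

The difference is under the hood: conda resolves and installs precompiled, possibly non-Python dependencies, while pip may fall back to compiling from source.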


### pip

`pip` installs Python packages from [PyPI](https://pypi.org/) in the form of wheels or source code. The latter often requires that you have library dependencies and compatible compilers already installed on your system to install the Python package. This often requires some expertise when installing complicated toolkits, such as the Python geospatial data science ecosystem. For that reason, I recommend using `conda` unless you have to use `pip`.


### conda

`conda` installs packages from Anaconda's software repositories. These packages are binaries, so no compilation is required of the user, and they are multi-language: a package could include Python, C, C++, R, Julia, or other languages. Anaconda software repositories are organized by **channel**. Beyond the "default" channel, the [conda-forge](https://conda-forge.org/) channel includes thousands of community-led packages. `conda` is the recommended way to get started with the Python geospatial data science ecosystem.
@@ -46,25 +41,23 @@ conda env remove -n ox

Read the `conda` [documentation](https://conda.io/) for more details.
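Putting this together, a typical environment lifecycle for this course might look as follows (a sketch using the `ppde642` environment name from the course's CI workflow — follow the official [software](../../software) setup for the authoritative steps):

```shell
# Create an environment from the course's environment file
conda env create -n ppde642 -f environment.yml

# Activate it, then verify a key package imports
conda activate ppde642
python -c "import osmnx; print('OSMnx version', osmnx.__version__)"

# List all environments; remove this one when finished with it
conda env list
conda env remove -n ppde642
```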


## Urban data science in a computational notebook

During the course's initial software setup, you created a conda environment with all the required packages. The required packages are defined in the course's [environment file](../../environment.yml). These are the tools we will use all semester.

All of the lectures and coursework will utilize Jupyter notebooks. These notebooks provide an interactive environment for working with code and have become standard in the data science world. [Read more](https://doi.org/10.22224/gistbok/2021.1.2).


## Version control

Distributed version control is central to modern analytics work in both research and practice. It allows (multiple) people to collaboratively develop source code while tracking changes. Today, git is the standard tool for version control and source code management. Sites like GitHub provide hosting for git repositories.

GitHub Guides provides an excellent [introduction](https://guides.github.com/) to distributed version control with git, so I will not duplicate it here. Take some time to work through their lessons. You need to understand, at a minimum, how to:

- fork a repo
- clone a repo
- work with branches
- add/commit changes
- push and pull to/from a remote repo
- merge a feature branch into the main branch

Start with their guides on the Git Handbook, Understanding the GitHub flow, Forking Projects, Mastering Markdown, and then explore from there.
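The local parts of that workflow can be sketched end-to-end in a throwaway repo (forking and pushing/pulling require a remote such as GitHub, so those steps are only noted in comments; the file name and messages are illustrative):

```shell
# Create a throwaway repo to practice in
mkdir demo-repo && cd demo-repo
git init -q
git config user.email "student@example.com"
git config user.name "Student"

# Add and commit a change on the default branch, then name it main
echo "# demo" > readme.md
git add readme.md
git commit -q -m "initial commit"
git branch -M main

# Work on a feature branch
git checkout -b my-feature
echo "a change" >> readme.md
git add readme.md
git commit -q -m "describe the change"

# Merge the feature branch back into main
git checkout main
git merge my-feature
git log --oneline   # both commits are now on main

# With a remote, you would continue with:
#   git push -u origin my-feature   (then open a pull request)
#   git pull                        (to fetch and merge remote changes)
```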