todo.md

Narrative

Which is a better predictor of crime?

  • Socioeconomic features about people living somewhere
  • Infrastructure/geological features about the area

Note: we are not trying to establish a causal relationship, only to determine which set of features is the better predictor (in the spirit of data science)

Model 1:

| Items | Inputs/Features | Output/Target | Model Type |
| --- | --- | --- | --- |
| Subsections | Infrastructure features | Crime frequency | Ridge/Lasso linear regression |

Model 2:

| Items | Inputs/Features | Output/Target | Model Type |
| --- | --- | --- | --- |
| Subsections | Crime density, possibly at multiple zoom levels | Socioeconomic features | SVM or decision trees |

TODO

General

  • (optional, if time permits) Combine arrest datasets from different years into one dataset at the beginning of the notebook. Resolved: no time.
  • edit written sections of notebook
  • add inline comments where necessary
  • Use processes from previous HW
    • data scale transform

Data Cleaning/Collection

  • Use github link to load dependencies instead of requiring local file
  • function to ensure that any dataset used for features fully encompasses/spans all the chosen subsections. Resolved: we already span the entire region (verified visually) by using the sidewalks dataset as the reference area.
  • function to check geographical domain of datasets then create intersection $\rightarrow$ give to function that creates subsections
  • What is the best way to test for feature multicollinearity? Resolved: a correlation matrix with a threshold seems adequate.
    • e.g.,
      • use VIF (Variance Inflation Factor)
      • plot a correlation matrix, etc.
    • Resolution strategy: if the features' scales differ → drop the feature with the lower correlation to the target variable; if the scales match → use PCA to combine them into a single feature (see the sketch after this list)
  • Move the feature engineering code into data processing, especially because its results are needed earlier on (e.g., for data viz)
  • Determine why the business_license dataset is not working. Resolved: it has addresses but no geometry data, so it cannot be used.
  • Define what counts as an outlier → decide if and how to remove them
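
A minimal sketch of the correlation-threshold check with the optional VIF cross-check, assuming `X` is a hypothetical numeric DataFrame of the per-subsection features:

```python
# Sketch only: flag multicollinear feature pairs; `X` is a hypothetical
# numeric DataFrame of per-subsection features.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def correlated_pairs(X: pd.DataFrame, threshold: float = 0.8):
    """Feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, float(upper.loc[a, b]))
            for a in upper.index for b in upper.columns
            if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > threshold]

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF per feature; values above roughly 5-10 suggest multicollinearity."""
    arr = X.to_numpy(dtype=float)
    return pd.Series(
        [variance_inflation_factor(arr, i) for i in range(arr.shape[1])],
        index=X.columns,
    )
```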

Data Exploration

  • Decide how to use the speed limit data -- e.g., use the mean speed limit in each subsection to create a single value per data item (row) -- this would also require accounting for road length, so it's difficult (not important)
  • (most likely no longer relevant) Use a more efficient method of joining data by geographic distance, e.g., connecting arrest incidents to the nearest sidewalk; the current method could take hours on the 50k-row arrests dataset (see the sketch after this list)
  • In the datasets, go to the infrastructure folder, choose other infrastructure datasets (e.g., streetlights), and explore correlations the same way it was done for sidewalks in "Analyzing correlation between distance to sidewalks and arrest frequency"
  • ...Explore other trends in the data with other approaches. See the suggestions given by the LLM.
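
If this ever becomes relevant again, GeoPandas' spatially indexed nearest join avoids the per-row distance loop. A sketch assuming hypothetical GeoDataFrames `arrests` and `sidewalks`:

```python
# Sketch only: vectorized nearest-neighbor join of arrests to sidewalks.
# `arrests` and `sidewalks` are hypothetical GeoDataFrame names.
import geopandas as gpd

# Reproject to a metric CRS (NAD83 / UTM zone 12N covers Tucson) so the
# recorded distances are in meters rather than degrees.
arrests_m = arrests.to_crs(epsg=26912)
sidewalks_m = sidewalks.to_crs(epsg=26912)

# One nearest-sidewalk match per arrest; sjoin_nearest uses a spatial index,
# so ~50k rows take seconds instead of hours.
joined = gpd.sjoin_nearest(
    arrests_m, sidewalks_m, how="left", distance_col="dist_to_sidewalk_m"
)
```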

Feature Engineering

  • (Optional) Can test with completely random subsections, similar to how bagging and random forests work -- i.e., instead of trying to span the entire area of interest, randomly generate subsections within the bounds (with replacement); see the sketch after this list
    • Could also randomize the allowed area. Resolved: redundant, since the area is naturally random if x and y are randomly generated.
    • This approach can also be used to generate essentially unlimited test data for more extensive evaluation
  • Are the subsections actually too long (longitudinally)? Resolved: no; the long subsections occur because the outer bounds (Tucson city bounds) form a long rectangle, so the subsections naturally mirror that shape.
    • Solution: create $n \times m$ subsections (with $m < n$) rather than $n \times n$, calculating $m$ from the aspect ratio of the outer bounds
  • (optional) choose between geometry.within, geometry.intersects, and geometry.overlaps depending on the nature of the dataset (case by case)
  • Fix the create_subsections function not creating sections over the entire outer bounds
  • Determine the outer bounds with a better approach (for now: the sidewalks feature dataset, since the arrests dataset has a ton of geographically dispersed data/outliers far outside the bounds of the other datasets)
  • Set up feature processing for the socioeconomic features
  • Implement the separation of distance_to and density infrastructure features
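
A minimal sketch of the random-subsection idea, where `outer_bounds` is a hypothetical `(minx, miny, maxx, maxy)` tuple derived from the sidewalks dataset:

```python
# Sketch only: bagging-style random subsections sampled (with replacement,
# i.e., overlaps allowed) inside the outer bounds.
import random
from shapely.geometry import box

def random_subsections(outer_bounds, n, width, height, seed=0):
    minx, miny, maxx, maxy = outer_bounds
    rng = random.Random(seed)
    boxes = []
    for _ in range(n):
        x = rng.uniform(minx, maxx - width)   # lower-left corner stays in bounds
        y = rng.uniform(miny, maxy - height)
        boxes.append(box(x, y, x + width, y + height))
    return boxes
```

Because the count is unbounded, the same generator can also produce arbitrarily many held-out subsections for evaluation.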

Data Visualization

  • Refer to the EDA slides
  • Visualize grouped box plots of all the features, similar to HW7
  • Create an indicator of the outer bounds on the visualizations. Resolved: the visualization of the subsections already demonstrates this implicitly.
  • Change the visualize_objects_in_subsection function to be more efficient (we probably don't need to filter by objects in the subsection and can just plot all objects)
  • Combine the density-feature distribution plots into a single plot/figure
  • Use heatmaps instead of scatterplots for infrastructure on the real map (see the sketch after this list)
    • When making Folium maps (geographic maps with popup markers on them), use a plotting technique more appropriate to the data (refer to the lecture slides), e.g., a heat map, contour plot, or hexagonal scatter plot
    • Create a heatmap variant of the crime frequency visualization
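
A minimal Folium sketch of the heatmap variant, assuming a hypothetical `arrests` GeoDataFrame with point geometries in EPSG:4326:

```python
# Sketch only: heat map of arrest locations instead of one marker per point.
import folium
from folium.plugins import HeatMap

m = folium.Map(location=[32.2226, -110.9747], zoom_start=12)  # central Tucson
HeatMap([[pt.y, pt.x] for pt in arrests.geometry], radius=10).add_to(m)
m.save("arrest_heatmap.html")
```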

Hyperparameter Tuning

  • Take all arbitrary values (or numbers used in functions that can be considered arbitrary and parameterized)
    • → put them into the global config object
    • → treat them as hyperparameters
    • → tune them (see the sketch below)
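
A minimal sketch of what that could look like; every name and value below is a placeholder, and `run_pipeline` is a hypothetical end-to-end entry point:

```python
# Sketch only: sweep formerly hard-coded values like hyperparameters.
from itertools import product

CONFIG_GRID = {
    "n_subsections": [20, 30, 40],       # grid resolution
    "corr_threshold": [0.7, 0.8, 0.9],   # multicollinearity cutoff
    "outlier_zscore": [2.5, 3.0],        # outlier definition
}

keys = list(CONFIG_GRID)
results = {}
for values in product(*CONFIG_GRID.values()):
    config = dict(zip(keys, values))
    # results[tuple(values)] = run_pipeline(config)  # hypothetical pipeline call
```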

Evaluation

  • On top of using the test data from the initial split, we can also make more subsections by changing the params of the create_subsections function to use different sizes, different types of randomness, etc.
  • For both models, need more ways to evaluate
    • compare against a baseline model
    • compare against a real model from the scientific literature or a similar algorithm. Resolved: too hard to find.
  • (from rubric) For both models, we need more visualizations in the evaluation stage to demonstrate each model's performance and interpret how it works (or our best guess at how it works)

Discussion/Reflection

  • Can include in the discussion: the development process (todo list, GitHub history, the process of recognizing sparse features and switching to distance_to, etc.)
  • Optional ideas
    • Model chain: infrastructure → predicted density → predicted socioeconomic feature
    • Abstract to paths for interesting utility/inference. Resolved: not enough time.

Model 1—Subsection-Level Crime Frequency Prediction Using Infrastructure and Socioeconomic Features

Data

Items

  • subsections of Tucson

Inputs/Features

  • number of each of the following contained in the subsection:
    • sidewalks
    • bicycle boulevards
    • landfills
    • fire stations
    • bridges
    • crosswalks
    • streetcar routes
    • streetcar stops
    • scenic routes
    • streetlights
    • Sun Tran bus stops

Output/Target

  • number of crimes divided by the time span of the dataset (a.k.a. crime frequency)

Development Process

  1. Create subsections (see the sketch below)

    • function that takes:
      • number of subsections
        • type: int
      • subsection width/height
    • returns a list of:
      • bbox (bounding box)
        • type: tuple[float]
        • (lat_lower, lat_upper, long_lower, long_upper)
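
A minimal sketch of such a function, folding in the $n \times m$ aspect-ratio fix from the feature engineering notes; the exact signature is an assumption:

```python
# Sketch only: n x m grid over the outer bounds, with m derived from the
# aspect ratio so subsections stay roughly square. Returns bounding boxes as
# (lat_lower, lat_upper, long_lower, long_upper) tuples.
def create_subsections(outer_bounds, n):
    lat_lo, lat_hi, lon_lo, lon_hi = outer_bounds
    m = max(1, round(n * (lon_hi - lon_lo) / (lat_hi - lat_lo)))
    dlat = (lat_hi - lat_lo) / n
    dlon = (lon_hi - lon_lo) / m
    return [
        (lat_lo + i * dlat, lat_lo + (i + 1) * dlat,
         lon_lo + j * dlon, lon_lo + (j + 1) * dlon)
        for i in range(n) for j in range(m)
    ]
```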
  2. Collecting and organizing data into format:

    | num sidewalks | ... | characteristics | ... | total number of crimes |
    | --- | --- | --- | --- | --- |
    | $x_1$ | ... | $c_1$ | ... | $y_1$ |
    | $x_2$ | ... | $c_2$ | ... | $y_2$ |
    | ... | ... | ... | ... | ... |
  3. Data cleaning

    • z-score normalize (e.g., use sklearn.preprocessing.StandardScaler)
    • remove outliers
    • remove missing data
    • remove duplicates
    • validate that the geographical area of interest matches the one used by the subsection-creating function
  4. Split the data into training and testing sets (see the sketch below for steps 3-4)

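A minimal sketch of steps 3 and 4, assuming `df` is a hypothetical DataFrame with numeric feature columns plus a `crime_frequency` target:

```python
# Sketch only: clean, split, then scale (fit the scaler on train data only).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = df.drop_duplicates().dropna()            # remove duplicates and missing data
z = (df - df.mean()) / df.std()
df = df[(z.abs() < 3).all(axis=1)]            # simple |z| < 3 outlier rule

X = df.drop(columns=["crime_frequency"])
y = df["crime_frequency"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler().fit(X_train)        # avoid leakage from the test set
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```
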
  5. Determining the best regression type (see the sketch below):

    • For each regression type, do hyperparameter tuning (determine the optimal params)
      • Best subsets
        • particular subset of features
      • Lasso regression
        • best lambda/alpha (via LARS)
      • Ridge regression
        • best lambda/alpha
    • Choose the best regression type
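
A minimal sketch of the comparison, using scikit-learn's cross-validated Lasso-LARS and Ridge; `X_train`/`y_train` come from step 4:

```python
# Sketch only: pick the regression type by mean cross-validated R^2.
import numpy as np
from sklearn.linear_model import LassoLarsCV, RidgeCV
from sklearn.model_selection import cross_val_score

candidates = {
    "lasso_lars": LassoLarsCV(cv=5),                  # alpha chosen via LARS path
    "ridge": RidgeCV(alphas=np.logspace(-3, 3, 13)),  # alpha chosen by CV
}
scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)  # step 6 then fits candidates[best]
```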
  6. Use given regression type to fit model

  7. (possibly more hyperparameter tuning) afterwards

  8. Evaluate model, reflect, make changes (repeat at step 5)

    • Choose some ...
      • arbitrary metric
      • significance level
      • baseline model
      • real model developed in actual scientific literature that does the same thing
        • there's also a "Related Works" section of the report for this

Model 2—Sociodemographic Prediction Using Crime Density Features

Data

Items

  • subsections of Tucson

Inputs/Features

  • crime density (see the sketch below)
    • crime density at different zoom levels
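
A minimal sketch of how the zoom levels could be computed; interpreting "zoom levels" as progressively larger regions around each subsection is an assumption, and `crimes` and `subsection` are hypothetical names (both in a projected, metric CRS):

```python
# Sketch only: crime density for one subsection at several "zoom levels",
# approximated here by buffering the subsection polygon outward.
import geopandas as gpd

def crime_density(crimes: gpd.GeoDataFrame, region) -> float:
    """Crimes per square meter inside a polygon (metric CRS assumed)."""
    return crimes.within(region).sum() / region.area

densities = [crime_density(crimes, subsection.buffer(r)) for r in (0, 500, 2000)]
```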

Output/Target

  • characteristics of subsection
    • race
      • column for each main race (or arbitrary groupings)
      • proportion of total
    • mean income
    • mean education level
    • mean age
    • mean speed limit. Resolved: not worth the time to figure out how to preprocess this data.

Development Process