- Narrative
- TODO
- Model 1: Subsection-Level Crime Frequency Prediction Using Infrastructure and Socioeconomic Features
- Model 2: Sociodemographic Prediction Using Crime Density Features
Which is a better predictor of crime?
- Socioeconomic features about people living somewhere
- Infrastructure/geological features about the area
Note: we are not trying to determine a causal relationship, just which is the better predictor (in the spirit of data science)
Model 1:
Items | Inputs/Features | Output/Target | Model Type |
---|---|---|---|
Subsections | Infrastructure features | Crime Frequency | Ridge/Lasso Linear Regression |
Model 2:
Items | Inputs/Features | Output/Target | Model Type |
---|---|---|---|
Subsections | Crime density (possibly at different zoom levels) | Socioeconomic features | SVM or decision trees |
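A minimal sketch of both model setups with scikit-learn, run here on synthetic stand-in data (every variable name below is a placeholder, not the notebook's actual data, and Model 2 is framed as classifying a binned socioeconomic variable):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Model 1: infrastructure counts per subsection -> crime frequency (regression).
X_infra = rng.poisson(5, size=(100, 11)).astype(float)  # stand-in for 11 infrastructure count features
crime_freq = rng.poisson(20, size=100).astype(float)
model1 = Ridge(alpha=1.0)                               # or Lasso(alpha=0.1)
model1.fit(X_infra, crime_freq)

# Model 2: crime density per subsection -> a socioeconomic class (classification).
X_density = rng.random((100, 3))               # e.g., density at 3 zoom levels
income_bracket = rng.integers(0, 3, size=100)  # stand-in binned target
model2 = SVC(kernel="rbf")                     # or DecisionTreeClassifier(max_depth=5)
model2.fit(X_density, income_bracket)
```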
- ~~(optional, if we have time) Combine arrests datasets from different years into one dataset at the beginning of the notebook~~ no time
- Edit written sections of the notebook
- add inline comments where necessary
- Use processes from previous HW
- data scale transform
- Use a GitHub link to load dependencies instead of requiring a local file
- ~~Function to ensure that any dataset used for features fully encompasses/spans all the subsections chosen~~ We already span the entire region (verified visually) by just using the sidewalks dataset as the reference area
  - Function to check the geographical domain of each dataset, then create their intersection $\rightarrow$ give to the function that creates subsections (see the geopandas sketch below)
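A rough geopandas sketch of that check; `common_domain` is a hypothetical helper, and convex hulls are one simple way to approximate the area each dataset covers:

```python
from functools import reduce
import geopandas as gpd

def common_domain(datasets: list[gpd.GeoDataFrame]):
    """Intersect the area each dataset covers (approximated by its convex
    hull); any subsection inside this region is spanned by every dataset."""
    hulls = [gdf.unary_union.convex_hull for gdf in datasets]
    return reduce(lambda a, b: a.intersection(b), hulls)
```

The resulting polygon's `.bounds` could then be handed to the function that creates subsections.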
- Best way to test for feature multicollinearity?
  - ~~Use VIF (Variance Inflation Factor) to test for multicollinearity~~ A correlation matrix with a threshold seems adequate
  - e.g., plot the correlation matrix (see the sketch below)
  - Resolution strategy: if the collinear features have different scales → drop the feature with the lower correlation to the target variable; if the scales are the same → use PCA to combine them into a single feature
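A sketch of both the threshold check and the PCA-combine resolution (both helpers are hypothetical, and the 0.9 threshold is an arbitrary placeholder):

```python
import pandas as pd
from sklearn.decomposition import PCA

def collinear_pairs(X: pd.DataFrame, threshold: float = 0.9):
    """Return pairs of numeric features whose |Pearson r| exceeds the threshold."""
    corr = X.corr().abs()
    cols = corr.columns
    return [(cols[i], cols[j])
            for i in range(len(cols))
            for j in range(i + 1, len(cols))
            if corr.iloc[i, j] > threshold]

def combine_with_pca(X: pd.DataFrame, features: list[str], name: str) -> pd.DataFrame:
    """The 'same scale -> PCA' branch: replace collinear features with
    their first principal component."""
    pc1 = PCA(n_components=1).fit_transform(X[features])
    out = X.drop(columns=features)
    out[name] = pc1[:, 0]
    return out
```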
- Move feature engineering stuff into data processing, especially because we need the results of that code earlier on (e.g., for data viz)
- ~~Determine why the `business_license` dataset is not working~~ It has addresses but no geometry data, so it cannot be used
- Define what an outlier would be → if and how to remove it?
- ~~Decide how to use speed limit data, e.g., use the mean speed limit of each subsection to create a single value per data item (row); would require considering road length as well, so it's difficult~~ (not important)
- ~~Use a more efficient method of joining data by geographic distance, e.g., connecting arrest incidents with the nearest sidewalk; the current method could take hours with the 50k-row arrests dataset~~ (most likely no longer relevant; see the sketch below)
- In the datasets, go to the `infrastructure` folder, choose other infrastructure datasets (e.g., `streetlights`), and explore correlations the same way it's been done for sidewalks in "Analyzing correlation between distance to sidewalks and arrest frequency"
- Explore other trends in the data with other approaches. See these suggestions given by LLM
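If the distance join ever becomes relevant again, geopandas has a built-in nearest join that avoids a Python loop entirely; a sketch, assuming `arrests` and `sidewalks` are already loaded as GeoDataFrames:

```python
import geopandas as gpd

# sjoin_nearest uses a spatial index internally, so 50k rows is fast.
# Reproject to a metric CRS first so the distances are in meters
# (EPSG:26912, NAD83 / UTM zone 12N, covers Tucson).
arrests_m = arrests.to_crs(epsg=26912)
sidewalks_m = sidewalks.to_crs(epsg=26912)
joined = gpd.sjoin_nearest(arrests_m, sidewalks_m, distance_col="dist_to_sidewalk")
```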
- (Optional) Can test with completely random subsections, similar to how bagging and random forests work -- i.e., we don't attempt to span the entire area of interest, we just randomly generate subsections within bounds (with replacement); see the sketch below
  - ~~Can also randomize the allowed area~~ redundant: the area is naturally random if x and y are randomly generated
  - This approach can also be used to essentially create unlimited test data for more extensive evaluation
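A sketch of the random-subsection sampler; the bbox format follows the `(lat_lower, lat_upper, long_lower, long_upper)` convention used elsewhere in these notes:

```python
import random

def random_bboxes(n, bounds, width, height, seed=None):
    """Sample n subsection bounding boxes uniformly (with replacement)
    from within the outer bounds, bagging-style."""
    lat_min, lat_max, lon_min, lon_max = bounds
    rng = random.Random(seed)
    boxes = []
    for _ in range(n):
        lat = rng.uniform(lat_min, lat_max - height)
        lon = rng.uniform(lon_min, lon_max - width)
        boxes.append((lat, lat + height, lon, lon + width))
    return boxes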
- ~~Are subsections actually too long (longitudinally)?~~ NO: the long subsections occur because the outer bounds (Tucson city bounds) form a long rectangle, so the subsections naturally mirror that shape
  - Solution: do not create $n \times n$ subsections but rather $n \times m$ subsections where $m < n$; calculate $m$ from the aspect ratio of the outer bounds (folded into the `create_subsections` sketch further below)
- (Optional) Selecting between `geometry.within`, `geometry.intersects`, or `geometry.overlaps` depending on the nature of the dataset (choose case-by-case; see the sjoin sketch below)
- Fix the `create_subsections` function not creating sections over the entire outer bounds
- Determine outer bounds using ~~some better approach~~ (for now: the sidewalks feature dataset, since arrests has a ton of geographically dispersed data/outliers way outside the bounds of the other datasets)
- Set up feature processing for socioeconomic features
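The case-by-case predicate choice might look like this with `gpd.sjoin` (the GeoDataFrame names are placeholders): point data can demand full containment, while line geometries only need to touch a cell:

```python
import geopandas as gpd

# Hypothetical GeoDataFrames: streetlights (points), sidewalks (lines),
# subsections (polygons). The predicate defines what "in the cell" means.
lights_in_cells = gpd.sjoin(streetlights, subsections, predicate="within")
walks_in_cells = gpd.sjoin(sidewalks, subsections, predicate="intersects")
```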
- Implement the separation of `distance_to` and density infrastructure features
- Refer to the EDA slides
- Visualize grouped box plots of all the features similar to HW7
- ~~Create an indicator of what the outer bounds are on the visualizations~~ The visualization of the subsections already demonstrates this implicitly
- Change the `visualize_objects_in_subsection` function to be more efficient (probably don't need to filter by objects in subsection and can just plot all objects)
- Combine the density-feature distribution plots into a single plot/figure
- Heatmaps instead of scatterplots for infrastructure on the real map
- When making Folium maps (geographic maps with popup markers on them), use a plotting technique more appropriate to the data (refer to lecture slides). E.g., a heat map, contour plot, hexagon scatter plot.
- Create a heatmap variant of the crime frequency visualization
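For the Folium bullet, `folium.plugins.HeatMap` is probably the quickest replacement for popup markers; a sketch assuming `arrests` has `latitude`/`longitude` columns (the column names are a guess):

```python
import folium
from folium.plugins import HeatMap

m = folium.Map(location=[32.22, -110.97], zoom_start=12)  # centered on Tucson
HeatMap(arrests[["latitude", "longitude"]].values.tolist(), radius=12).add_to(m)
m.save("crime_heatmap.html")
```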
- Take all arbitrary values (numbers used in functions that can be thought of as arbitrary and parametrized):
  - → put them into the global config object
  - → treat them as hyperparameters
  - → tune them (a sketch of the config shape follows)
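One possible shape for that config object; every value here is an assumption pulled from elsewhere in these notes, not the notebook's actual settings:

```python
# Hypothetical global config: each entry is a former magic number,
# now named so it can be tuned like a hyperparameter.
CONFIG = {
    "n_subsections": 20,      # grid resolution for create_subsections
    "corr_threshold": 0.9,    # multicollinearity cutoff
    "outlier_z_cutoff": 3.0,  # z-score beyond which a row is dropped
    "test_size": 0.2,         # train/test split fraction
}
```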
- On top of using the test data from the initial split, we can also make more subsections by changing the params of the `create_subsections` function to use different sizes, different types of randomness, etc.
- For both models, need more ways to evaluate:
  - compare vs. a baseline model (see the sketch below)
  - ~~compare vs. a real model in scientific literature or a similar algorithm~~ too hard to find
- (from rubric) For both models, need more visualizations in the evaluation stage to demonstrate the model's performance and interpret how it works (or our best guess at how it works)
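A sketch of the baseline comparison plus one cheap evaluation visualization; `X_train`/`y_train` etc. are placeholders for the split produced in the plan below, and `model1` is the fitted Model 1:

```python
import matplotlib.pyplot as plt
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Baseline that always predicts the training mean; our model should beat it.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
print("baseline MAE:", mean_absolute_error(y_test, baseline.predict(X_test)))
print("model MAE:   ", mean_absolute_error(y_test, model1.predict(X_test)))

# One cheap evaluation visualization: predicted vs. actual crime frequency.
plt.scatter(y_test, model1.predict(X_test), s=10)
plt.axline((0, 0), slope=1, color="gray", linestyle="--")  # perfect-prediction line
plt.xlabel("actual"); plt.ylabel("predicted")
plt.show()
```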
- Can include in discussion: the development process (this todo list, GitHub history, the process of recognizing sparse features and changing to `distance_to`, etc.)
- Optional ideas
  - ~~Model chain: infra → predicted density → predicted socioeconomic feature~~
  - ~~Abstract to paths for interesting utility/inference~~ not enough time
Model 1:
- subsections of Tucson
- number of ... included in subsection:
  - sidewalk
  - bicycle boulevards
  - landfill
  - fire station
  - bridge
  - crosswalk
  - streetcar route
  - streetcar stop
  - scenic route
  - streetlight
  - Sun Tran bus stop
- number of crimes per time length of the dataset (a.k.a. crime frequency)
1. Create subsections
   - function that takes:
     - number of subsections (type: `int`)
     - width/height
   - returns a list of:
     - `bbox` (bounding box), type: `tuple[float]`: `(lat_lower, lat_upper, long_lower, long_upper)`
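A sketch of what this function might look like (not the notebook's actual implementation), with the $n \times m$ aspect-ratio fix from earlier folded in:

```python
def create_subsections(n: int, bounds: tuple[float, float, float, float]):
    """Tile the outer bounds into m x n bounding boxes, choosing m from the
    aspect ratio so cells come out roughly square instead of long rectangles.

    bounds and each returned bbox: (lat_lower, lat_upper, long_lower, long_upper)
    """
    lat_lo, lat_hi, lon_lo, lon_hi = bounds
    m = max(1, round(n * (lat_hi - lat_lo) / (lon_hi - lon_lo)))
    dlat = (lat_hi - lat_lo) / m
    dlon = (lon_hi - lon_lo) / n
    return [
        (lat_lo + i * dlat, lat_lo + (i + 1) * dlat,
         lon_lo + j * dlon, lon_lo + (j + 1) * dlon)
        for i in range(m)
        for j in range(n)
    ]
```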
2. Collecting and organizing data into this format:

   num sidewalks | ... | characteristics | ... | total number of crimes |
   ---|---|---|---|---|
   $x_1$ | ... | $c_1$ | ... | $y_1$ |
   $x_2$ | ... | $c_2$ | ... | $y_2$ |
   ... | ... | ... | ... | ... |
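A hypothetical assembly loop for that table; `subsections`, `infra_datasets`, `arrests`, and `dataset_years` are all placeholder names:

```python
import pandas as pd

def count_objects(gdf, bbox):
    """Hypothetical helper: rows of gdf whose geometry falls inside bbox."""
    lat_lo, lat_hi, lon_lo, lon_hi = bbox
    return len(gdf.cx[lon_lo:lon_hi, lat_lo:lat_hi])  # geopandas coordinate indexer

rows = []
for bbox in subsections:
    row = {f"num_{name}": count_objects(gdf, bbox) for name, gdf in infra_datasets.items()}
    row["crime_frequency"] = count_objects(arrests, bbox) / dataset_years
    rows.append(row)
data = pd.DataFrame(rows)
```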
3. Data cleaning
   - z-score normalize (e.g., use `sklearn.preprocessing.StandardScaler`)
   - remove outliers
   - remove missing data
   - remove duplicates
   - validate that the geographical area of interest matches the function that creates subsections
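A compact sketch of those cleaning steps on the assembled `data` frame; the 3-sigma outlier rule is a placeholder until outliers are properly defined (per the earlier TODO):

```python
import numpy as np

clean = data.drop_duplicates().dropna()
z = np.abs((clean - clean.mean()) / clean.std())  # per-column z-scores (numeric columns)
clean = clean[(z < 3).all(axis=1)]                # 3-sigma rule is a placeholder definition
# StandardScaler normalization is best fit on the training split only
# (see the next step) to avoid leaking test-set statistics.
```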
4. Split data into training and testing sets
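E.g., with scikit-learn, continuing from the cleaning sketch (the 0.2 test fraction is arbitrary):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = clean.drop(columns="crime_frequency")
y = clean["crime_frequency"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)  # fit on training data only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```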
5. Determining best regression type:
   - for each regression type, hyperparameter tuning (determine optimal params):
     - best subsets: particular subset of features
     - Lasso reg: best lambda/alpha (LARS)
     - Ridge reg: best lambda/alpha
   - choose the best regression type
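A sketch of the alpha searches with scikit-learn's built-in CV estimators; `LassoLarsCV` matches the LARS note, and best subsets could be approximated with `sklearn.feature_selection.SequentialFeatureSelector`:

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV, RidgeCV
from sklearn.model_selection import cross_val_score

# Each CV estimator searches its own alpha on the training data.
lasso = LassoLarsCV(cv=5).fit(X_train, y_train)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X_train, y_train)
print("lasso alpha:", lasso.alpha_, "| ridge alpha:", ridge.alpha_)

# Compare regression types by cross-validated R^2 on the training set,
# leaving the test set untouched for the final evaluation.
for name, model in [("lasso", lasso), ("ridge", ridge)]:
    print(name, cross_val_score(model, X_train, y_train, cv=5).mean())
```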
6. Use the chosen regression type to fit the model
7. (hyperparameter tuning?) afterwards
8. Evaluate model, reflect, make changes (repeat at step 5)
   - choose some...
     - arbitrary metric
     - significance level
     - baseline model
     - real model developed in actual scientific literature that does the same thing
       - there's also a "Related Works" section of the report for this
Model 2:
- subsections of Tucson
- crime density
  - crime density at different zoom levels
- characteristics of subsection
  - race
    - column for each main race (or arbitrary groupings)
    - proportion of total
  - mean income
  - mean education level
  - mean age
  - ~~mean speed limit~~ not worth taking time to figure out how to preprocess this data