---
title: "Crime Prediction in Manhattan, NYC"
author: "Stella Liao"
subtitle: Visualization and Classification for Larceny, Assault and Harassment
output: html_document
---
# Introduction
Crime is a social issue that, like a disease, tends to spread in spatial clusters. We are always looking for ways to minimize and prevent crime. If we could predict where crime is likely to occur, police could deploy resources to potentially dangerous areas more efficiently. Crime has often been treated as random, and researchers have traditionally studied it with behavioral and social methods. With advances in data analysis and technology, however, we can now analyze it more quantitatively.
One example is PredPol, a program developed by researchers at the University of California, Los Angeles (UCLA). With the help of the Los Angeles Police Department, they collected about 13 million cases spanning 80 years and used just two variables, when and where, to build models that predict where a crime could happen each day. It is a striking result that shows how strongly the environment influences human choices. Another paper, by Dr. Irina Matijosaitiene, revealed the effect of land use on crime type classification and prediction.
Classification models effectively estimate the probability that a given crime type occurs at a given time and place, so this project focuses on classification models. I also use visualization to give readers an intuitive feel for the relationship between crime occurrence and time and location.
# Materials and methods
I use crime data from 2015-2017 in Manhattan, New York City, to build classification models for the three most common crime types in the study area: larceny, assault, and harassment. The main features in the models are time and location. Specifically, time refers to the hour of day and the day of the week, and location refers to land use.
* Dataset Sources
    * <a href="https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i">NYPD Complaint Data</a>, a CSV file recording all reported crimes in New York City from 2006 to 2017
    * <a href="https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page">Primary Land Use Tax Lot Output (PLUTO)</a>, a shapefile containing land use information for New York City
* Relationship between crime types and time and location
    * Time Series Analysis
    * Effects of Land Uses on Crime Types
* Classification Models
    * Logistic Regression
    * Random Forest
    * Naïve Bayes Classification
## Load all required packages
```{r load_packages, message=FALSE, warning=FALSE}
library(dplyr)
library(stringr)
library(tidyr)
library(readr)
library(lubridate)
library(sp)
library(sf)
library(ggplot2)
library(mapview)
library(knitr)
library(naivebayes)
library(randomForest)
library(ggpubr)
knitr::opts_chunk$set(cache=TRUE,cache.lazy = FALSE) # cache the results for quick compiling
```
## Download and clean all required data
### Crime Dataset
This code chunk is used to download and clean the crime data.
```{r crime_data_cleaned, message=FALSE, warning=FALSE, results='hide'}
# read the raw data
# this may take a long time to run because the raw dataset is large
crime_file = "nypd.csv"
crime_url = "https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD"
if (!file.exists(crime_file)) download.file(crime_url, crime_file) # skip the download if the file already exists
nypd <- read.csv(crime_file, stringsAsFactors = FALSE)
# clean and tidy crime data
# classify the hour of day into time ranges
# ids are integers so they match the numeric hour returned by hour();
# a character id such as "05" would never match the number 5
time_interval <- data.frame(id = 0:24,
                            interval = c(sprintf("%02d-%02d", 0:23, 1:24), "00-01"), # hour 24 counts as midnight
                            stringsAsFactors = FALSE)
# extract the relevant information into the crime_MAN data frame
crime_MAN <- nypd %>%
  drop_na(Longitude, Latitude, CMPLNT_FR_DT, CMPLNT_FR_TM) %>% # remove NA values
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>% # add georeferenced information
  rename( # rename some columns to make them easier to work with
    CrimeID = CMPLNT_NUM,
    CrimeType = OFNS_DESC,
    Date = CMPLNT_FR_DT,
    Time = CMPLNT_FR_TM) %>%
  mutate(Date = mdy(Date), # convert the date column to Date type
         DayofWeek = wday(Date, label = TRUE, abbr = FALSE), # get the day of week
         Time = hour(hms(Time))) %>% # get the hour of day
  mutate(TimeInterval = time_interval$interval[match(Time, time_interval$id)]) %>% # add a column storing time ranges
  filter(BORO_NM == "MANHATTAN" & # limit the study area
           Date >= ymd(20150101) & # limit the study period
           Date <= ymd(20171231)) %>%
  select("CrimeType", "DayofWeek", "Time", "TimeInterval") # select the relevant columns
# combine sub-classes of crime types into broad classes
# "HARRASSMENT" is the spelling used in the NYPD data, so it is kept as-is
crime_type <- c("LARCENY","ASSAULT","HARRASSMENT","THEFT","ADMINISTRATIVE CODE","HOMICIDE","INTOXICATED","LOITERING","OTHER STATE LAW","OFFENSES")
for (type in crime_type) {
  crime_MAN$CrimeType[grep(type, crime_MAN$CrimeType, fixed = TRUE)] <- type
}
knitr::opts_chunk$set(cache = TRUE, warning = FALSE,
                      message = FALSE, cache.lazy = FALSE)
```
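The hour-to-interval lookup can also be expressed as a small function. The sketch below is only an illustration: `hour_to_interval` is a hypothetical helper that is not part of the original analysis, and it assumes hours are whole numbers from 0 to 24, with hour 24 treated as midnight.

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# map an hour of day (0-24) to a label such as "05-06".
hour_to_interval <- function(h) {
  h <- as.integer(h) %% 24L                # hour 24 wraps around to 0
  out <- sprintf("%02d-%02d", h, h + 1L)   # zero-padded start and end hour
  out[is.na(h)] <- NA_character_           # keep missing hours missing
  out
}

hour_to_interval(c(0, 5, 23, 24, NA))
# "00-01" "05-06" "23-24" "00-01" NA
```

Because the function works on the numeric hour directly, it avoids any mismatch between numeric hours and zero-padded text ids.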
### Landuse Dataset
This code chunk is used to download and clean the land use data.
```{r landuse_data_clean, message=FALSE, warning=FALSE, results='hide'}
# download and unzip the land use dataset if it is not already present
landuse_url = "https://www1.nyc.gov/assets/planning/download/zip/data-maps/open-data/nyc_mappluto_19v1_shp.zip"
landuse_file = "pluto.zip"
if (!file.exists(landuse_file)) download.file(landuse_url, destfile = landuse_file, mode = "wb") # "wb" keeps the zip intact on Windows
if (!dir.exists("pluto")) unzip(landuse_file, exdir = "pluto")
# read the raw data
mapluto <- st_read("pluto/MapPLUTO.shp")
# lookup table giving the name of each land use type
landuse_type <- data.frame(id = 1:12,
                           type = c("One & Two Family Buildings",
                                    "Multi-Family Walk-Up Buildings",
                                    "Multi-Family Elevator Buildings",
                                    "Mixed Residential & Commercial Buildings",
                                    "Commercial & Office Buildings",
                                    "Industrial & Manufacturing",
                                    "Transportation & Utility",
                                    "Public Facilities & Institutions",
                                    "Open Space and Outdoor Recreation",
                                    "Parking Facilities",
                                    "Vacant Land",
                                    "Unknown"),
                           stringsAsFactors = FALSE)
# extract the relevant information
landuse_MAN <- mapluto %>%
  filter(Borough == "MN") %>% # limit the study area first so fewer lots need transforming
  st_transform(st_crs(crime_MAN)) %>% # use the same coordinate system as the crime data
  select("Lot", "LandUse") %>%
  rename(LanduseID = LandUse) %>% # the raw dataset stores only the land use id in the "LandUse" column
  mutate(LanduseID = as.integer(as.character(LanduseID)), # works whether the ids were read as text or as a factor
         LanduseID = replace_na(LanduseID, 12L), # replace NA with 12 (Unknown)
         Landuse = landuse_type$type[match(LanduseID, landuse_type$id)]) # add a column storing land use names
knitr::opts_chunk$set(cache = TRUE, warning = FALSE,
                      message = FALSE, cache.lazy = FALSE)
```
## Which crime types occur most frequently?
This code chunk computes the answer, which is presented in the Results section.
```{r top_10_crime_types, message=FALSE, warning=FALSE, results='hide'}
top10_Crime_MAN <- crime_MAN %>%
  st_set_geometry(NULL) %>% # drop geometry first; otherwise summarize() unions the points of each group, which is slow
  group_by(CrimeType) %>%
  summarize(amount = n()) %>% # count the total number of cases of each crime type
  mutate(percent = amount / sum(amount) * 100) %>% # calculate the percentage of each crime type
  arrange(desc(amount)) # sort from the highest count to the lowest
```
## Time Series Analysis
This code chunk analyzes the time patterns of the three most committed crime types. The graphs are presented in the Results section.
```{r top_3_crime_types, message=FALSE, warning=FALSE, results='hide'}
top3 <- data.frame(id = 1:3, type = c("LARCENY","HARRASSMENT","ASSAULT"))
# count the cases of each crime type in each time range
time_top3 <- crime_MAN %>%
  st_set_geometry(NULL) %>% # geometry is not needed for counting
  filter(CrimeType %in% top3$type) %>%
  group_by(TimeInterval, CrimeType) %>%
  summarize(amount = n()) %>%
  ungroup()
# count the cases of each crime type on each day of the week
dw_top3 <- crime_MAN %>%
  st_set_geometry(NULL) %>%
  filter(CrimeType %in% top3$type) %>%
  drop_na(DayofWeek) %>%
  group_by(DayofWeek, CrimeType) %>%
  summarize(amount = n()) %>%
  ungroup()
```
## Effects of Land Uses on Crime Types
This section is still in progress.
<br>Please skip it for now. Suggestions are welcome, thank you!
```{r land_use_effect, message=FALSE, warning=FALSE, results='hide'}
#add landuse information into the crime dataset
top3_Crime_Landuse_MAN <- crime_MAN %>%
filter(CrimeType %in% top3$type)%>%
st_join(landuse_MAN,join = st_nearest_feature,left = FALSE)
# split the top 3 crime types into separate data frames
larceny <- top3_Crime_Landuse_MAN %>% filter(CrimeType == "LARCENY")
harrasment <- top3_Crime_Landuse_MAN %>% filter(CrimeType == "HARRASSMENT")
assault <- top3_Crime_Landuse_MAN %>% filter(CrimeType == "ASSAULT")
```
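To make the nearest-feature join easier to follow, here is a minimal sketch on toy data. Everything in it is made up for illustration (two square "lots" and three points with no coordinate reference system); it only shows how `st_join()` with `join = st_nearest_feature` attaches the attributes of the closest polygon to each point, and how a cross-tabulation could then summarize crime type by land use.

```r
library(sf)

# Toy unit-square lot with its lower-left corner at (x0, y0); illustrative only
make_lot <- function(x0, y0) {
  st_polygon(list(rbind(c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1),
                        c(x0, y0 + 1), c(x0, y0))))
}

# Two made-up lots: one spans x in [0, 1], the other x in [2, 3]
lots <- st_sf(Landuse = c("Commercial", "Open Space"),
              geometry = st_sfc(make_lot(0, 0), make_lot(2, 0)))

# Three made-up crime locations: inside lot 1, between the lots, inside lot 2
crimes <- st_sf(CrimeType = c("LARCENY", "ASSAULT", "HARRASSMENT"),
                geometry = st_sfc(st_point(c(0.5, 0.5)),
                                  st_point(c(1.2, 0.5)),
                                  st_point(c(2.9, 0.5))))

# Each point receives the attributes of its nearest lot
joined <- st_join(crimes, lots, join = st_nearest_feature)

# Cross-tabulate crime type against land use
table(joined$CrimeType, as.character(joined$Landuse))
```

The middle point lies 0.2 units from the first lot and 0.8 units from the second, so it is assigned to "Commercial" even though it falls inside neither polygon. That is the behavior to keep in mind when complaints are geocoded to street centerlines rather than to lots.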
## Classification Models
This section is still in progress.
<br>Please skip it for now. Suggestions are welcome, thank you!
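As one possible starting point, the sketch below fits the three model families named in the methods list on a synthetic table. It is not the project's model and the data are random, not real: the column names (`Hour`, `DayofWeek`, `Landuse`), the 70/30 split, and the use of `nnet::multinom` for multi-class logistic regression are all assumptions made for illustration.

```r
library(nnet)          # multinom(): multi-class logistic regression
library(randomForest)
library(naivebayes)

set.seed(42)
n <- 300
# Synthetic stand-in for the joined crime / land use table (NOT real data)
toy <- data.frame(
  CrimeType = factor(sample(c("LARCENY", "HARRASSMENT", "ASSAULT"), n, replace = TRUE)),
  Hour      = as.numeric(sample(0:23, n, replace = TRUE)),
  DayofWeek = factor(sample(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"), n, replace = TRUE)),
  Landuse   = factor(sample(c("Commercial", "Residential", "Open Space"), n, replace = TRUE))
)

# Hold out 30% of the rows for evaluation
train_id <- sample(n, 0.7 * n)
train <- toy[train_id, ]
test  <- toy[-train_id, ]
features <- c("Hour", "DayofWeek", "Landuse")

# Fit one model per family on the same formula
lr <- multinom(CrimeType ~ Hour + DayofWeek + Landuse, data = train, trace = FALSE)
rf <- randomForest(CrimeType ~ Hour + DayofWeek + Landuse, data = train, ntree = 200)
nb <- naive_bayes(CrimeType ~ Hour + DayofWeek + Landuse, data = train)

# Predicted class for each held-out row
lr_pred <- predict(lr, newdata = test)
rf_pred <- predict(rf, newdata = test)
nb_pred <- predict(nb, newdata = test[, features], type = "class")

# Share of held-out rows classified correctly
accuracy <- function(pred, truth) mean(as.character(pred) == as.character(truth))
acc <- c(logistic      = accuracy(lr_pred, test$CrimeType),
         random_forest = accuracy(rf_pred, test$CrimeType),
         naive_bayes   = accuracy(nb_pred, test$CrimeType))
acc
```

Because the labels here are random, accuracy should hover around chance (roughly one third). The value of the sketch is only in showing the shared fit, predict, and compare workflow that the real features could later be plugged into.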
# Results
## Top ten most committed crime types
```{r echo=FALSE}
kable(top10_Crime_MAN[1:10,])
```
## Time-of-Day Patterns of the Top Three Committed Crime Types
```{r echo=FALSE}
ggplot(time_top3,aes(x = TimeInterval, y= amount,group=1))+
geom_point(aes(color = CrimeType))+
geom_line(aes(color = CrimeType))+
facet_grid(~CrimeType)+
theme(legend.position = "none",axis.text.x = element_text(angle = 60, hjust = 1))
```
## Day-of-Week Patterns of the Top Three Committed Crime Types
```{r echo=FALSE}
ggplot(dw_top3,aes(x = DayofWeek, y= amount, group = 1))+
geom_point(aes(color = CrimeType))+
geom_line(aes(color = CrimeType))+
facet_wrap(~CrimeType)+
theme(legend.position = "none",axis.text.x = element_text(angle = 60, hjust = 1))
```
# Conclusions
What have you learned? Are there any broader implications?
# References