index.Rmd

---
title: "Crime Prediction in Manhattan, NYC"
author: "Stella Liao"
subtitle: Visualization and Classification for Larceny, Assault and Harassment
output: html_document
---

# Introduction
Crime is a social issue that, like a disease, tends to spread in spatial clusters. We are always looking for ways to minimize and prevent crime. If we could predict where crime is likely to occur, police could deploy officers to potentially dangerous areas more efficiently. Crime has traditionally been treated as a random event, and researchers have studied it with behavioral and social methods. With advances in data analysis and technology, however, we can now analyze it more quantitatively.

For example, PredPol is a program developed by researchers from the University of California, Los Angeles (UCLA). With the help of the Los Angeles Police Department, they collected about 13 million crime records spanning 80 years and used just two variables, when and where, to build models that predict where a crime could happen on each day. This result shows how strongly the environment influences human choices. In addition, a paper by Dr. Irina Matijosaitiene revealed the effect of land use on crime type classification and prediction.

Classification models actually estimate the probability that a given crime type occurs at a given time and place, so this project focuses on classification models. I also use visualization to give readers an intuitive feel for the relationship between crime occurrence and time and location.

# Materials and methods 
I use crime data from 2015-2017 in Manhattan, New York City, to build models that classify the three most frequent crime types in the study area: larceny, assault and harassment. The main features fed to the models are time and location. Specifically, time refers to the hour of day and the day of the week, and location refers to land use.

* Dataset Sources
  * <a href="https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i">NYPD Complaint Data</a>, a CSV file recording all crime occurrences in New York City from 2006-2017
  * <a href="https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page">Primary Land Use Tax Lot Output (PLUTO)</a>, a shapefile containing land use information for New York City
* Relationship between crime types with time and location
  * Time Series Analysis
  * Effects of Land Uses on Crime Types
* Classification Models
  * Logistic Regression
  * Random Forest
  * Naïve Bayes Classification
  
## Load all required packages
```{r load_packages, message=FALSE, warning=FALSE}
library(dplyr)
library(stringr)
library(tidyr)
library(readr)
library(lubridate)
library(sp)
library(sf)
library(ggplot2)
library(mapview)
library(knitr)
library(naivebayes)
library(randomForest)
library(ggpubr)
knitr::opts_chunk$set(cache=TRUE,cache.lazy = FALSE)  # cache the results for quick compiling
```

## Download and clean all required data

### Crime Dataset

This code chunk is used to download and clean the crime data.

```{r crime_data_cleaned, message=FALSE, warning=FALSE, results='hide'}
# Read the raw data.
# This may take a long time to run because the raw dataset is large.
crime_file = "nypd.csv"
crime_url = "https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD"
download.file(crime_url,crime_file)
nypd <-read.csv(crime_file,stringsAsFactors = FALSE)

# Clean and tidy the crime data.
# Lookup table that maps each hour to a one-hour time range.
# The ids are integers so that they match the numeric output of hour();
# an hour of 24 is folded into "00-01".
time_interval <- data.frame(id = 0:24,
                            interval = c(sprintf("%02d-%02d", 0:23, 1:24), "00-01"),
                            stringsAsFactors = FALSE)

# Extract the relevant information into the crime_MAN data frame
crime_MAN <- nypd %>%
  drop_na(Longitude, Latitude, CMPLNT_FR_DT, CMPLNT_FR_TM)%>% # remove rows with NA values
  st_as_sf(coords=c("Longitude","Latitude"),crs = 4326)%>% # add georeferenced information
  rename(# rename some columns so they are easier to work with
    CrimeID = CMPLNT_NUM,
    CrimeType = OFNS_DESC,
    Date = CMPLNT_FR_DT,
    Time = CMPLNT_FR_TM)%>%
  mutate(Date = mdy(Date), # convert the date column to Date type
         DayofWeek = wday(Date,label = TRUE,abbr = FALSE), # day of week
         Time = hour(hms(Time)))%>% # hour of day
  mutate(TimeInterval = time_interval$interval[match(Time, time_interval$id)])%>% # add a column storing time ranges
  filter(BORO_NM == "MANHATTAN"& # limit the study area
         Date >= ymd(20150101) & # limit the study period
         Date <= ymd(20171231))%>%
  select("CrimeType","DayofWeek","Time","TimeInterval") # select the relevant columns

# Combine sub-classes of crime types into broader classes
crime_type <- c("LARCENY","ASSAULT","HARRASSMENT","THEFT","ADMINISTRATIVE CODE","HOMICIDE","INTOXICATED","LOITERING","OTHERSTATE LAW","OFFENSES")
for (type in crime_type) {
  # every description containing the keyword is relabeled with the keyword itself
  crime_MAN$CrimeType[grep(type, crime_MAN$CrimeType, fixed = TRUE)] <- type
}
knitr::opts_chunk$set(cache = TRUE, warning = FALSE, 
                      message = FALSE, cache.lazy = FALSE)

```
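
The hour-to-interval lookup in the chunk above only works when the hour values and the lookup keys still compare equal after `match()` coerces them to a common type, which is easy to get wrong when one side is numeric and the other is a zero-padded string. As a minimal sketch on made-up times (not taken from the NYPD data), the interval label can also be built directly from the numeric hour with `sprintf()`, which avoids a lookup table altogether. The names `example_times`, `example_hours` and `example_intervals` are hypothetical.

```{r sketch_hour_to_interval, eval=FALSE}
library(lubridate)

# Made-up complaint times in "HH:MM:SS" format (an assumption about CMPLNT_FR_TM)
example_times <- c("00:30:00", "05:15:00", "13:45:00", "23:59:00")

# hour() returns a number (0, 5, 13, 23), not a zero-padded string
example_hours <- as.integer(hour(hms(example_times)))

# match() compares the values as strings here, so 5 does not match "05"
match(example_hours, c("00", "05", "13", "23"))

# Build the label directly from the numeric hour instead
example_intervals <- sprintf("%02d-%02d", example_hours, example_hours + 1L)
example_intervals
```

Printing the result of `match()` next to `example_intervals` makes it easy to see which hours would silently end up as `NA` with string keys.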

### Landuse Dataset

This code chunk is used to download and clean the land use data.

```{r landuse_data_clean, message=FALSE, warning=FALSE, results='hide'}

# Download and unzip the land use dataset
landuse_url = "https://www1.nyc.gov/assets/planning/download/zip/data-maps/open-data/nyc_mappluto_19v1_shp.zip"
landuse_file = "pluto.zip"
download.file(landuse_url,destfile = landuse_file)
unzip(landuse_file, exdir = "pluto")

#read the raw data
mapluto <-st_read("pluto/MapPLUTO.shp")

#to add the name of each landuse type
landuse_type <- data.frame(id=c(1:12),type = c("One & Two Family Buildings",
                                               "Multi-Family Walk-Up Buildings",
                                               "Multi-Family Elevator Buildings", 
                                               "Mixed Residential & Commercial Buildings",
                                               "Commercial & Office Buildings",
                                               "Industrial & Manufacturing",
                                               "Transportation & Utility",
                                               "Public Facilities & Institutions",
                                               "Open Space and Outdoor Recreation",
                                               "Parking Facilities",
                                               "Vacant Land",
                                               "Unknown"))
# Extract the relevant information
landuse_MAN <- mapluto %>%
  st_transform(st_crs(crime_MAN))%>% # use the same coordinate system as the crime data
  filter(Borough == "MN")%>% # limit the study area
  select("Lot","LandUse")%>%
  rename(LanduseID = LandUse)%>% # the raw dataset stores only the land use id in the "LandUse" column
  mutate(LanduseID = replace_na(as.integer(as.character(LanduseID)), 12L))%>% # convert to integer and replace NA with 12 (Unknown)
  mutate(Landuse = landuse_type$type[match(LanduseID, landuse_type$id)]) # add a new column storing land use names
knitr::opts_chunk$set(cache = TRUE, warning = FALSE, 
                      message = FALSE, cache.lazy = FALSE)
```
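
The recoding step above combines two operations: filling missing land use codes with the "Unknown" id, and translating ids into names with `match()`. The sketch below shows both on a made-up vector of codes. It assumes the raw codes are zero-padded strings, and it uses a shortened lookup table; all `example_*` names are hypothetical.

```{r sketch_landuse_lookup, eval=FALSE}
library(tidyr)

# Shortened, illustrative lookup table (the one used above has 12 rows)
example_lookup <- data.frame(id = c(1L, 5L, 12L),
                             type = c("One & Two Family Buildings",
                                      "Commercial & Office Buildings",
                                      "Unknown"),
                             stringsAsFactors = FALSE)

# Made-up raw codes, assumed to be zero-padded strings with one missing value
example_codes <- c("01", "05", NA, "01")

# On a vector, replace_na() takes the replacement value as its second argument
example_ids <- replace_na(as.integer(example_codes), 12L)

# Translate ids into names
example_names <- example_lookup$type[match(example_ids, example_lookup$id)]
example_names
```

Note that the data frame method of `replace_na()` expects a named list instead, for example `replace_na(df, list(LanduseID = 12L))`.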

## Which crime types occur most frequently?

This code chunk computes the answer, which is presented in the Results section.

```{r top_10_crime_types, message=FALSE, warning=FALSE, results='hide'}
top10_Crime_MAN <- crime_MAN %>% 
  st_set_geometry(NULL)%>% # drop the geometry first so summarize() does not have to union the points
  group_by(CrimeType)%>%
  summarize(amount = n())%>% # count the total number of cases of each crime type
  mutate(percent = amount/sum(amount)*100)%>% # percentage of each crime type
  arrange(desc(amount)) # sort from the most to the least frequent
```


## Time Series Analysis 

This code chunk analyzes the time patterns of the three most frequently committed crime types; the graphs are presented in the Results section.

```{r top_3_crime_types, message=FALSE, warning=FALSE, results='hide'}
top3 <- data.frame(id=c(1:3),type = c("LARCENY","HARRASSMENT","ASSAULT"))

# Number of cases of each crime type in each time range
time_top3 <- crime_MAN %>%
  st_set_geometry(NULL)%>% # drop the geometry first so summarize() does not have to union the points
  filter(CrimeType %in% top3$type)%>%
  group_by(TimeInterval,CrimeType)%>% 
  summarize(amount=n())%>%
  ungroup()

# Number of cases of each crime type on each day of the week
dw_top3 <- crime_MAN %>%
  st_set_geometry(NULL)%>%
  filter(CrimeType %in% top3$type)%>%
  drop_na(DayofWeek)%>%
  group_by(DayofWeek,CrimeType)%>% 
  summarize(amount=n())%>%
  ungroup()

```

## Effects of Land Uses on Crime Types

This section is still in progress.
<br>Please skip this part for now. Any suggestions are welcome. Thank you!

```{r land_use_effect, message=FALSE, warning=FALSE, results='hide'}
#add landuse information into the crime dataset
top3_Crime_Landuse_MAN <- crime_MAN %>%
  filter(CrimeType %in% top3$type)%>%
  st_join(landuse_MAN,join = st_nearest_feature,left = FALSE)

# split the top 3 crime types into separate data frames
larceny <- top3_Crime_Landuse_MAN %>% filter(CrimeType == "LARCENY")
harrasment <- top3_Crime_Landuse_MAN %>% filter(CrimeType == "HARRASSMENT")
assault <- top3_Crime_Landuse_MAN %>% filter(CrimeType == "ASSAULT")

```
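
One possible next step, sketched here on made-up geometries rather than on the real data, is to cross-tabulate crime type by the land use of the nearest lot. The sketch repeats the same `st_join()` with `st_nearest_feature` on two toy lots and four toy crime points, then converts the counts into row proportions. The helper `unit_square()` and all `example_*` names are hypothetical, and the coordinates carry no CRS.

```{r sketch_landuse_crosstab, eval=FALSE}
library(sf)

# Helper that builds a unit square with its lower-left corner at (x0, y0)
unit_square <- function(x0, y0) {
  st_polygon(list(rbind(c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1),
                        c(x0, y0 + 1), c(x0, y0))))
}

# Two made-up lots and four made-up crime locations (planar coordinates)
example_lots <- st_sf(Landuse = c("Residential", "Commercial"),
                      geometry = st_sfc(unit_square(0, 0), unit_square(3, 0)))
example_crimes <- st_sf(CrimeType = c("LARCENY", "ASSAULT", "LARCENY", "LARCENY"),
                        geometry = st_sfc(st_point(c(0.5, 0.5)), st_point(c(1.2, 0.5)),
                                          st_point(c(2.6, 0.5)), st_point(c(3.5, 0.5))))

# Attach to each crime the attributes of its nearest lot
example_joined <- st_join(example_crimes, example_lots, join = st_nearest_feature)

# Cross-tabulate crime type by land use and convert to row proportions
example_counts <- table(example_joined$CrimeType, example_joined$Landuse)
example_share <- prop.table(example_counts, margin = 1)
example_share
```

Each row of `example_share` sums to 1, so rows can be compared across crime types even when the types differ greatly in total count.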

## Classification Models

This section is still in progress.
<br>Please skip this part for now. Any suggestions are welcome. Thank you!
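
As a starting point, the sketch below fits the three model families listed in the methods on purely synthetic data, so the accuracies it prints say nothing about real crime. It is not the project's final modeling code. Several things are assumptions: `nnet::multinom()` stands in for multi-class logistic regression, the land use levels are shortened, the link between daytime hours and larceny is artificial, and all object names are hypothetical.

```{r sketch_classification, eval=FALSE}
library(naivebayes)
library(randomForest)
library(nnet)  # assumption: multinom() is used as the multi-class logistic regression

set.seed(42)
n <- 600
crime_levels <- c("LARCENY", "HARRASSMENT", "ASSAULT")
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
landuse_levels <- c("Residential", "Commercial", "Open Space")

# Synthetic features, drawn at random
example_data <- data.frame(
  Time = sample(0:23, n, replace = TRUE),
  DayofWeek = factor(sample(day_levels, n, replace = TRUE), levels = day_levels),
  Landuse = factor(sample(landuse_levels, n, replace = TRUE), levels = landuse_levels)
)

# Synthetic label with an artificial link to time of day
p_larceny <- ifelse(example_data$Time >= 9 & example_data$Time <= 18, 0.7, 0.2)
example_data$CrimeType <- factor(
  ifelse(runif(n) < p_larceny, "LARCENY",
         sample(crime_levels[2:3], n, replace = TRUE)),
  levels = crime_levels)

# Hold out 30% of the rows for testing
test_rows <- sample(n, size = round(0.3 * n))
train <- example_data[-test_rows, ]
test <- example_data[test_rows, ]
features <- c("Time", "DayofWeek", "Landuse")

# Fit the three model families
fit_logit <- multinom(CrimeType ~ Time + DayofWeek + Landuse, data = train, trace = FALSE)
fit_rf <- randomForest(CrimeType ~ Time + DayofWeek + Landuse, data = train, ntree = 200)
fit_nb <- naive_bayes(CrimeType ~ Time + DayofWeek + Landuse, data = train)

# Predicted classes on the held-out rows
pred <- list(
  logistic = predict(fit_logit, newdata = test),
  random_forest = predict(fit_rf, newdata = test),
  naive_bayes = predict(fit_nb, newdata = test[, features])
)

# Share of held-out rows classified correctly by each model
accuracy <- sapply(pred, function(p) mean(as.character(p) == as.character(test$CrimeType)))
accuracy

# Class probabilities from naive Bayes: one column per crime type
head(predict(fit_nb, newdata = test[, features], type = "prob"))
```

With the real data, `example_data` would be replaced by a table holding the hour, day of week, land use and crime type of each case, such as the non-spatial columns of `top3_Crime_Landuse_MAN`.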

# Results

## Top ten most committed crime types
```{r echo=FALSE}
kable(top10_Crime_MAN[1:10,]) 
```

## Time-of-Day Patterns of the Top Three Committed Crime Types
```{r echo=FALSE}
ggplot(time_top3,aes(x = TimeInterval, y= amount,group=1))+
  geom_point(aes(color = CrimeType))+
  geom_line(aes(color = CrimeType))+
  facet_grid(~CrimeType)+
  theme(legend.position = "none",axis.text.x = element_text(angle = 60, hjust = 1))
```

## Day-of-Week Patterns of the Top Three Committed Crime Types
```{r echo=FALSE}
ggplot(dw_top3,aes(x = DayofWeek, y= amount, group = 1))+
  geom_point(aes(color = CrimeType))+
  geom_line(aes(color = CrimeType))+
  facet_wrap(~CrimeType)+
  theme(legend.position = "none",axis.text.x = element_text(angle = 60, hjust = 1))
```

# Conclusions

What have you learned?  Are there any broader implications?

# References