-
Notifications
You must be signed in to change notification settings - Fork 1
/
crime_data_analysis.Rmd
279 lines (215 loc) · 12.8 KB
/
crime_data_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
---
title: "Crime Data Analysis"
author: "Stuart Barnum"
date: "1/28/2018"
output: html_document
---
The dataset for crime in Seattle for the summer of 2014---June, July, and August---contains 2050 instances in which the location coordinates are recorded as zero (e.g. latitude and longitude of zero). As this location would be in the Atlantic Ocean, I will treat these instances as having missing coordinate information. Because I will be using the coordinate data to plot recorded instances of crime on a Seattle map and thus obtain an idea as to the distributions of certain types of crimes across the city, careful examination of these missing data is warranted. It may also reflect differences in the manner in which crime is recorded. Does this vary with respect to police district or the particular type of crime?
```{r}
library(tidyverse)
library(plotly)
sanfrancisco <- read.csv("sanfrancisco_incidents_summer_2014.csv")
seattle <- read.csv("seattle_incidents_summer_2014.csv")
#To avoid repetition of code in the generation various conditional distributions, I
#defined a function that produces both the dataframes with the tallies and the interactive plots for
#examining the possible relations between missingness and other factors.
calculate_conditional_distribution <- function(column) {
column <- enquo(column)
zero_assess <- seattle %>%
mutate(Position_Data_Missing = Latitude == 0) %>%
group_by(Position_Data_Missing, !!column) %>%
summarize(m = n()) %>%
group_by(!!column) %>%
mutate(n = sum(m)) %>%
ungroup() %>%
mutate(sum_total = sum(m)) %>%
group_by(Position_Data_Missing) %>%
mutate(sum_zero_test = sum(m)) %>%
mutate(total_dist = n / sum_total) %>%
mutate(conditional_distribution = m / sum_zero_test) %>%
ungroup() %>%
group_by(!!column)
#initialize list of items to be returned (a d.f. and a ggplot)
return_list <- list()
return_list[["df"]] <- zero_assess
#create visualizations for assessment of independence
plot <- zero_assess %>%
filter(n > 10) %>%
mutate(Position_Data_Missing =
ifelse(Position_Data_Missing == TRUE,
"missing", "not missing")) %>%
ggplot() +
geom_col(aes_(x = column, y = ~conditional_distribution,
fill = ~Position_Data_Missing)) +
labs(x = str_c(quo_name(column),
"\n(move cursor over bars to see the categories)"),
y = "conditional distribution") +
ggtitle("Assessing the Effect of Missingness of Position Data") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_blank(),
legend.title = element_blank()) +
guides(fill = guide_legend(reverse=TRUE))
return_list[["ggplot"]] <- plot
interactive_plot <-
ggplotly(plot) %>%
layout(margin = list(b = 50, l = 60, r = 10, t = 80))
return_list[["plotly"]] <- interactive_plot
return(return_list)
}
```
We thus find the distributions of certain general categories of crime, as indicated by the variable Summarized.Offense.Description in the data, conditional on both missingness of and non-missingness. If missingness is simply random---in this case, not associated with the type of crime reflected in the "summarized offense description"---we would expect these two conditional distributions to look roughly the same.
```{r}
results <- calculate_conditional_distribution(Summarized.Offense.Description)
results[["plotly"]]
```
In the above, for the types of crimes with large numbers of instances, the length of the each of the red bars is roughly equal to the length of the green bar below it. We thus find little evidence of a systematic relationship between missingness of coordinate data and the particular type of reported crime. However, moving the cursor over the bars, we see that prostitution may appear to be a notable exception, with perhaps all of the cases recorded with coordinate data. Examination of the data frame generated with the above code shows that there were 202 cases of prostitution during the three months from June through August of 2014, and indeed that all of the recorded cases had coordinate data.
Are there other types of crime for which this is the case? Let's see.
```{r}
#A simple manipulation of the data derived in calculate_conditional_distribution suffices.
results <- calculate_conditional_distribution(Summarized.Offense.Description)
library(knitr)
kable(results[["df"]] %>% filter(m == n))
```
```{r}
```
The dataset includes, again, 2050 examples with missing coordinate information, out of a total of 32779 instances---roughly 6.3 percent of the dataset. Given such an overall random distribution of missingness, one would expect that, for some of the categories with very few instances, all of the instances of reported crimes within the category would have been recorded with position data. However, given that prostitution has as many as 202 recorded instances, all recorded with position information, there may be a systematic relationship between this value of Summarized.Offense.Description and missingness. For example, perhaps record-keeping in the case of prostitution was more carefully handled than it was in the cases of most other offenses. Although we can't infer that this is the case---a chi squared test may be somewhat useful here---the data suggest that it might be.
Before moving on to the next plot, note that, as one might expect, most of reported crimes were property crimes--for example, burglary and vehicle theft. One striking aspect, however, is the prevalence of the crime classified as "carprowl," which amounts to almost 20 percent of the reported criminal activity. As may be confirmed with a quick Internet search, this crime, as identified by this term, has been of particular concern in Seattle.
We now turn to the relationship between district sector and missingness of position data.
```{r}
results <- calculate_conditional_distribution(District.Sector)
results[["plotly"]]
```
The mostly-red bar at the left end corresponds to data for which no district sector is indicated in the data, for which it would not be surprising that the coordinate data is also missing. Other than, maybe, the mysterious District Sector 99--with a total of only 40 reported instances--the data suggests that the recording of position data was consistent across the sectors.
Similar considerations apply to the possible relationship between missingness and Zone Beat.
```{r}
results <- calculate_conditional_distribution(Zone.Beat)
results[["plotly"]]
```
In this case, minor differences in the conditional distributions are evident, but likely not significant. (I may conduct a Chi Squared test of this assumption.) We may thus reasonably suppose that a plot of the crime data on a map of the city would not distort the distribution of actual recorded crime across Seattle. Thus plotting all of the instances that have position data (latitude and longitude) in the table, we obtain the following.
```{r}
library(ggmap)
#obtain a suitable Seattle map
seattle_gg <- get_map("Seattle", maptype = "toner-lite",
zoom = 11)
#remove zero latitude and longitude elements, for mapping
seattle_nonzero <- seattle %>%
filter(Latitude != 0 & Longitude != 0)
#show the distribution of recorded reports of crime on the map, using a logarithmic scale to enhance
#the contrast
plot <- ggmap(seattle_gg, darken = c(.01, "black")) +
geom_bin2d(data = seattle_nonzero,
aes(x = Longitude, y = Latitude),
bins = 200) +
scale_fill_gradient(trans = "log10") +
ggtitle ("Total recorded crime in Seattle: Summer of 2014 (n = 30,729)") +
theme(plot.title = element_text(hjust = 0.5))
plot
```
Now focus on violent crimes, with these defined as crimes with "Summarized Offense Description" of Assault, Homicide, or Robbery.
```{r}
library(ggmap)
Violent_Crimes <- c("ASSAULT", "HOMICIDE", "ROBBERY")
data <- seattle_nonzero %>%
filter(is.element(Summarized.Offense.Description, Violent_Crimes))
ggmap(seattle_gg, darken = c(.01, "black")) +
geom_bin2d(data = data,
aes(x = Longitude, y = Latitude),
bins = 200) +
scale_fill_gradient(trans = "log10") +
ggtitle ("Violent crime in Seattle: Summer of 2014 (n = 2,601)") +
theme(plot.title = element_text(hjust = 0.5))
```
And drugs:
```{r}
data <- seattle_nonzero %>%
filter(Summarized.Offense.Description == "NARCOTICS")
ggmap(seattle_gg, darken = c(.01, "black")) +
geom_bin2d(data = data,
aes(x = Longitude, y = Latitude),
bins = 200) +
scale_fill_gradient(trans = "log10") +
ggtitle ("Drug crime in Seattle: Summer of 2014 (n = 367)") +
theme(plot.title = element_text(hjust = 0.5))
```
Given that the population of Seattle in 2014 was 668,000, the number of recorded cases of drug crimes seems strikingly low. (Including the cases for which coordinate data is missing, the number of recorded cases is 391.) With this in mind, compare the results with the equivalent results for San Francisco for the summer of 2014. None of instances of the San Francisco crime dataset for June of 2014 through July of 2015 have missing coordinate data. We thus simply proceed to a map.
```{r}
X_center <- (range(sanfrancisco$X)[2] + range(sanfrancisco$X)[1]) / 2
Y_center <- (range(sanfrancisco$Y)[2] + range(sanfrancisco$Y)[1]) / 2
sf_gg <- get_map(location = c(X_center, Y_center),
zoom = 12, maptype = "toner-lite")
data <- filter(sanfrancisco, Category == "DRUG/NARCOTIC")
ggmap(sf_gg, darken = c(.01, "black")) +
geom_bin2d(data = data, aes(x = X, y = Y), bins = 200) +
scale_fill_gradient(trans = "log10") +
ggtitle ("Drug crime in San Francisco: Summer of 2014 (n = 1,345)") +
theme(plot.title = element_text(hjust = 0.5))
```
This is roughly 3.4 times the number of instances in Seattle, for a population that is only roughly 1.3 times greater. It is doubtful that drug use would be as much as 2.6 percent greater in one of the cities than in the other. We may thus find a suggestion that enforcement of drug laws in San Francisco is somewhat stricter than in Seattle.
Let's do a general comparison between San Francisco and Seattle, for some categories of crime that would likely if roughly the same definition in the two locations.
```{r}
library(forcats)
san_fran_totals <-
sanfrancisco %>%
count(Category) %>%
mutate(Category = fct_recode(Category,
"DUI" = "DRIVING UNDER THE INFLUENCE",
"Prostitution" = "PROSTITUTION",
"Drugs" = "DRUG/NARCOTIC",
"Weapons" = "WEAPON LAWS",
"Liquor\nlaws" = "LIQUOR LAWS",
"Assault" = "ASSAULT",
"Homicide" = "HOMICIDE",
"Robbery" = "ROBBERY",
"Vehicle\ntheft" = "VEHICLE THEFT"))
seattle_totals <-
seattle %>%
rename(Category = Summarized.Offense.Description) %>%
count(Category) %>%
mutate(Category = fct_recode(Category,
"Prostitution" = "PROSTITUTION",
"Drugs" = "NARCOTICS",
"Weapons" = "WEAPON",
"Liquor\nlaws" = "LIQUOR VIOLATION",
"Assault" = "ASSAULT",
"Homicide" = "HOMICIDE",
"Robbery" = "ROBBERY",
"Vehicle\ntheft" = "VEHICLE THEFT"))
joined_totals <- seattle_totals %>%
left_join(san_fran_totals, by = "Category") %>%
rename(Seattle = n.x, `San Francisco` = n.y) %>%
gather(Seattle, `San Francisco`, key = "City", value = "Number of cases") %>%
filter(Category %in%
c("Prostitution",
"Drugs",
"Weapons",
"Liquor\nlaws",
"Assault",
"Homicide",
"Robbery",
"Vehicle\ntheft",
"DUI")) %>%
mutate(`Number of cases` = ifelse(is.na(`Number of cases`), 0, `Number of cases`))
#Homicide anomaly in the San francisco data
ggplot(joined_totals) +
geom_col(aes(x = Category, y = `Number of cases`, fill = City),
position = "dodge") +
ggtitle("Cases of Selected Categories of Crime, as per Official Data") +
theme(plot.title = element_text(hjust = 0.5))
```
Some of contrast in numbers between Seattle and San Francisco may reflect the difference in population between the two cities. We thus consider a plot with some adjusted numbers.
```{r}
adjusted_totals <-
joined_totals %>%
mutate(`Number of cases` =
ifelse(City == "San Francisco",
`Number of cases` * 667963 / 852537,
`Number of cases`))
ggplot(adjusted_totals) +
geom_col(aes(x = Category, y = `Number of cases`, fill = City),
position = "dodge") +
labs(title = "Adjusted Crime Data: Summer of 2014",
ylab = "Number of cases, adjusted to the population of Seattle") +
theme(plot.title = element_text(hjust = 0.5))
```
In the above, the San Francisco numbers have been adjusted down, in proportion of the ratio of the respective populations---with bar height proportional to the number of cases, within the category of crime, per person.