---
title: "Bird Data"
author: "Tim Dennis"
date: "7/21/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(tidyverse)
library(janitor)    # clean_names() renames columns into legal R names
library(lubridate)  # helpers for working with dates
library(ggmap)
```
## Extracted Bird Data
A researcher came to us with bird observation data for Los Angeles, California, covering 1950 to 2019. We tried to open it in a text editor, but the editor crashed because the file was too big (2.1 GB). So we switched to the command line to inspect the file. The first step is to see how many lines it has:
```{bash}
wc -l ebd_US-CA-037_195001_201902_relJan-2019.txt
```
Ok, about 5 million lines. On an older Mac or PC that is already running a lot of other software, a file this size will make the machine crawl. At this point you could close everything else to free up memory, but let's inspect the file a bit more first. How do we do that if we can't open it? Ah, UNIX is your friend.
Let's use head to look at the top of the file:
```{bash}
head ebd_US-CA-037_195001_201902_relJan-2019.txt
```
Ok, promising, we have bird data! The first row is the header, containing the column names (our variables). We can use `head` to pull out just that first row.
```{bash}
head -n 1 ebd_US-CA-037_195001_201902_relJan-2019.txt
```
Just from looking at the data this way, we can see that it's tab delimited. We can also see that we probably don't need all of these columns. Let's use the `cut` command to reduce the number of columns in the dataset. (For this demo I'm cutting some columns I probably wouldn't in a real project.) `cut` selects columns by position, and I hate counting columns by eye, so let's transpose the header row and print it with line numbers. I'm using a command called `datamash` (<https://www.gnu.org/software/datamash/examples/>).
```{bash}
head -1 ebd_US-CA-037_195001_201902_relJan-2019.txt | datamash transpose | cat -n
```
Now we can use `cut` to keep just the columns we want (the `-f` flag takes a list of field numbers):
```{bash}
head -10 ebd_US-CA-037_195001_201902_relJan-2019.txt | cut -f5,17,26,27,28
```
We can run this on the whole file now to see how big the reduced data would be (`wc -c` counts bytes):
```{bash}
cut -f5,17,26-27,28 ebd_US-CA-037_195001_201902_relJan-2019.txt | wc -c
```
That's about 356 megabytes. Pretty good. We've now inspected our data and selected a subset of columns. Let's save the result to a file so we can keep processing it. Optionally, we could pipe together more commands to filter the data in the same pass, but for now we'll just save the reduced columns (a sketch of the piping approach follows the next chunk).
```{bash}
cut -f5,17,26-27,28 ebd_US-CA-037_195001_201902_relJan-2019.txt > ebd_la_reduced.tsv
```
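As a sketch of that optional piping approach (the date pattern here is purely illustrative; the real filtering happens with `awk` below), we could chain the column selection and a row filter in one pass:
```{bash}
# Sketch only: select the columns and count rows with a 2000s-style date in one pipeline
cut -f5,17,26-27,28 ebd_US-CA-037_195001_201902_relJan-2019.txt | grep -c -E '200[0-9]-[0-9]{2}-[0-9]{2}'
```
Now let's confirm which columns made it into the reduced file: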
```{bash}
head -1 ebd_la_reduced.tsv | datamash transpose | cat -n
```
How about filtering rows? Our researcher was primarily interested in data from the 2000s. How would we do this? I'm going to use `awk`, a Unix tool for text extraction and processing. I'll work on only a few lines at first:
```{bash}
tail -100 ebd_la_reduced.tsv | awk '$7 ~ /^20[0-9]{2}/'
```
Ok, now let's winnow the whole data set down to just the observations from the 2000s. (We could also use the `shuf` command to randomly sample a number of lines; see this tutorial: <https://shapeshed.com/unix-shuf/#how-to-pick-a-random-line-from-a-file>. A sketch is shown after the next chunk.)
```{bash}
head -n 1 ebd_la_reduced.tsv > 2000s_bird_la.tsv
awk '$7 ~ /200[0-9]-[0-9]{2}-[0-9]{2}/' ebd_la_reduced.tsv >> 2000s_bird_la.tsv
```
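As an aside, the random-sampling approach with `shuf` mentioned above would look roughly like this (a sketch only; the output file name and the sample size of 1,000 lines are arbitrary, and `shuf` is part of GNU coreutils, so on macOS you may need to install it):
```{bash}
# Sketch only: keep the header row, then append 1,000 randomly sampled data lines
head -n 1 ebd_la_reduced.tsv > sample_bird_la.tsv
tail -n +2 ebd_la_reduced.tsv | shuf -n 1000 >> sample_bird_la.tsv
```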
We can take a look at the filtered file, `2000s_bird_la.tsv`:
```{bash}
head -20 2000s_bird_la.tsv
```
The `ls` command has an `-h` flag that, when used with the `-l` flag, gives us human-readable sizes (so instead of bytes, it shows kilobytes, megabytes, or gigabytes).
```{bash}
ls -lh
```
## Reading into R
Ok, we've reduced the size of the file. We can now read it into R with `read_tsv()` from the `readr` package:
```{r}
birds <- read_tsv("2000s_bird_la.tsv")
```
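To sanity-check the import, we can take a quick look at the structure with `glimpse()` from the tidyverse (the column names will still contain spaces until we clean them in the next step):
```{r}
glimpse(birds)
```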
In R this becomes a data frame. To clean up the variable names I'll use the `janitor` package; R doesn't like spaces in variable names, and `janitor` has a handy function named `clean_names()`. We run it against the `birds` data and overwrite the original.
```{r}
birds <- clean_names(birds)
```
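To confirm the cleanup worked, we can print the new column names:
```{r}
names(birds)
```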
Ok, now we can look at the data in R. Let's use the `count()` function to count the occurrences of each bird by common name in our data frame. We'll then sort the counts in descending order with `arrange()`. Finally, I only want to see the 10 most frequently observed birds.
```{r}
birds %>%
  count(common_name) %>%
  arrange(desc(n)) %>%
  head(n = 10)
```
Ok, that code gives us an ordered list. Next, let's make a bar chart. We'll use the same code we just wrote, but pipe it into the plotting function `ggplot()` (in R, the pipe operator is `%>%`). The result will be a bar chart of counts by common name.
```{r}
birds %>%
  count(common_name) %>%
  arrange(desc(n)) %>%
  head(n = 10) %>%
  ggplot(aes(x = common_name, y = n)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
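One optional refinement, sketched below rather than taken from the original walkthrough: order the bars by count with `fct_reorder()` from `forcats` (loaded as part of the tidyverse) so the bars are sorted rather than alphabetical.
```{r}
# Sketch: same summary, but with bars ordered by count and flipped horizontally
birds %>%
  count(common_name) %>%
  arrange(desc(n)) %>%
  head(n = 10) %>%
  ggplot(aes(x = fct_reorder(common_name, n), y = n)) +
  geom_col() +
  coord_flip() +  # horizontal bars keep long bird names readable
  labs(x = "Common name", y = "Observations")
```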
Next, let's put the observations of the top ten birds on a map. The first step is to filter our data set so it retains only the occurrences of those ten birds. Why do this? Mapping every sighting of every bird would likely overwhelm the map with overplotting, making it hard to read.
Below, we pipe `birds` into `filter()` and keep only the birds we want, assigning the result to a new data frame called `top10birds`. Note that `%in%`, rather than `==`, is the right operator for matching against a vector of names.
```{r}
top10birds <- birds %>%
  filter(common_name %in% c("American Crow", "Anna's Hummingbird", "Black Phoebe",
                            "California Scrub-Jay", "California Towhee", "House Finch",
                            "Lesser Goldfinch", "Mourning Dove", "Northern Mockingbird",
                            "Yellow-rumped Warbler"))
```
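Typing the ten names out by hand is error-prone. As an alternative sketch (the helper name `top10names` is ours, not from the original), we could derive the names from the same count we used for the chart and filter with those:
```{r}
# Sketch: compute the ten most-observed common names, then filter with them
top10names <- birds %>%
  count(common_name) %>%
  arrange(desc(n)) %>%
  head(n = 10) %>%
  pull(common_name)

top10birds <- birds %>%
  filter(common_name %in% top10names)
```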
Ok, for the mapping we use a package called `ggmap`. Inside the `ggmap()` call we use a function called `get_stamenmap()`, which wants a bounding box (`bbox`) defining the rectangular area of the world we want the map to cover. `get_stamenmap()` returns the tiles (images) for that bounding box. After this we tell `ggmap` what kind of layer to overlay on the map, here `geom_point()`. Inside of it we set which variables in our data set contain the latitude and longitude and how to color the dots (by `common_name`). We tell it which data frame to use and, finally, we set an `alpha` (transparency) on the points. This makes them slightly see-through so we can still pick out the underlying structure where points overlap each other. Ok, that was a lot. Let's run it and look at the result.
```{r}
ggmap(get_stamenmap(bbox = c(left = -118.9596, bottom = 33.6633,
                             right = -117.6825, top = 34.3588), zoom = 10)) +
  geom_point(aes(x = longitude, y = latitude, color = common_name),
             data = top10birds, alpha = 0.5)
```