-
Notifications
You must be signed in to change notification settings - Fork 40
/
06-databasics2.Rmd
1164 lines (852 loc) · 46.9 KB
/
06-databasics2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# (PART) Part III: Intermediate {-}
# Entering and cleaning data #2
The video lectures for this chapter are embedded at relevant places in the text,
with links to download a pdf of the associated slides for each video.
You can also access [a full playlist for the videos for this chapter](https://www.youtube.com/playlist?list=PLuGPtwgRXxqJK_Jb-q0g9RQHZ-PponJs4).
```{r echo = FALSE, message = FALSE, warning = FALSE}
library(tidyverse)
library(knitr)
```
## Tidy data
<iframe width="747" height="467" src="https://www.youtube.com/embed/EetWyFQqXbs?list=PLuGPtwgRXxqJK_Jb-q0g9RQHZ-PponJs4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week6_part_1.pdf)
a pdf of the lecture slides for this video.
All of the material in this section comes directly from Hadley Wickham's [paper on tidy data](http://vita.had.co.nz/papers/tidy-data.pdf). You will need to read this paper to prepare for the quiz on this section.
Getting your data into a "tidy" format makes it easier to model and plot. By taking the time to tidy your data at the start of an analysis, you will save yourself time, and make it easier to plan out later steps.
Characteristics of tidy data are:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Here are five common problems that Hadley Wickham has identified that keep data from being tidy:
1. Column headers are values, not variable names.
2. Multiple variables are stored in one column.
3. Variables are stored in both rows and columns.
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables.
Here are examples (again, from Hadley Wickham's [paper on tidy data](http://vita.had.co.nz/papers/tidy-data.pdf), which is required reading for this week of the course) of each of these problems.
> 1. Column headers are values, not variable names.
```{r echo = FALSE, out.width = "\\textwidth"}
include_graphics("figures/TidyDataProblem1.png")
```
Solution:
```{r echo = FALSE, out.width = "\\textwidth", fig.align = "center"}
include_graphics("figures/TidyDataSolution1.png")
```
> 2. Multiple variables are stored in one column.
```{r echo = FALSE, out.width = "\\textwidth", fig.align = "center"}
include_graphics("figures/TidyDataProblem2.png")
```
Solution:
```{r echo = FALSE, out.width = "\\textwidth", fig.align = "center"}
include_graphics("figures/TidyDataSolution2.png")
```
> 3. Variables are stored in both rows and columns.
```{r echo = FALSE, out.width = "\\textwidth"}
include_graphics("figures/TidyDataProblem3.png")
```
Solution:
```{r echo = FALSE, out.width = "\\textwidth"}
include_graphics("figures/TidyDataSolution3.png")
```
> 4. Multiple types of observational units are stored in the same table.
```{r echo = FALSE, out.width = "\\textwidth"}
include_graphics("figures/TidyDataProblem4.png")
```
Solution:
```{r echo = FALSE, out.width = "\\textwidth"}
include_graphics("figures/TidyDataSolution4.png")
```
> 5. A single observational unit is stored in multiple tables.
Example: exposure and outcome data stored in different files:
- File 1: Daily mortality counts
- File 2: Daily air pollution measurements
## Joining datasets
<iframe width="747" height="467" src="https://www.youtube.com/embed/UJ8UoK3bSP0?list=PLuGPtwgRXxqJK_Jb-q0g9RQHZ-PponJs4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week6_part_2.pdf)
a pdf of the lecture slides for this video.
So far, you have only worked with a single data source at a time. When you work on your own projects, however, you typically will need to merge together two or more datasets to create the a data frame to answer your research question. For example, for air pollution epidemiology, you will often have to join several datasets:
- Health outcome data (e.g., number of deaths per day)
- Air pollution concentrations
- Weather measurements (since weather can be a confounder)
- Demographic data
The `dplyr` package has a family of different functions to join two dataframes together, the `*_join` family of functions. All combine two dataframes, which I'll call `x` and `y` here. \medskip
The functions include:
- `inner_join(x, y)`: Keep only rows where there are observations in both `x` and `y`.
- `left_join(x, y)`: Keep all rows from `x`, whether they have a match in `y` or not.
- `right_join(x, y)`: Keep all rows from `y`, whether they have a match in `x` or not.
- `full_join(x, y)`: Keep all rows from both `x` and `y`, whether they have a match in the other dataset or not.
In the examples, I'll use two datasets, `x` and `y`. Both datasets include the column `course`. The other column in `x` is `grade`, while the other column in `y` is `day`. Observations exist for courses `x` and `y` in both datasets, but for `w` and `z` in only one dataset.
```{r}
x <- data.frame(course = c("x", "y", "z"),
grade = c(90, 82, 78))
y <- data.frame(course = c("w", "x", "y"),
day = c("Tues", "Mon / Fri", "Tue"))
```
Here is what these two example datasets look like:
```{r}
x
y
```
With `inner_join`, you'll only get the observations that show up in both datasets. That means you'll lose data on `z` (only in the first dataset) and `w` (only in the second dataset).
```{r warning = FALSE}
inner_join(x, y)
```
With `left_join`, you'll keep everything in `x` (the "left" dataset), but not keep things in `y` that don't match something in `x`. That means that, here, you'll lose `w`:
```{r, warning = FALSE}
left_join(x, y)
```
`right_join` is the opposite:
```{r, warning = FALSE}
right_join(x, y)
```
`full_join` keeps everything from both datasets:
```{r warning = FALSE}
full_join(x, y)
```
## Longer data
<iframe width="747" height="467" src="https://www.youtube.com/embed/abUDhRmF_bg?list=PLuGPtwgRXxqJK_Jb-q0g9RQHZ-PponJs4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week6_part_3.pdf)
a pdf of the lecture slides for this video.
There are two functions from the `tidyr` package (another member of the tidyverse) that you can use to change between wider and longr data: `pivot_longer` and `pivot_wider`. Here is a description of these two functions:
- `pivot_longer`: Takes several columns and pivots them down into two columns. One of the new columns contains the former column names and the other contains the former cell values.
- `pivot_wider`: Takes two columns and pivots them up into multiple columns. Column names for the new columns will come from one column and the cell values from the other.
The following examples are show the effects of making a dataset longer or wider.
Here is some simulated wide data:
```{r, include = FALSE}
wide_stocks <- tibble(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
```
```{r}
wide_stocks[1:3, ]
```
In the `wide_stocks` dataset, there are separate columns for three different stocks (`X`, `Y`, and `Z`). Each cell gives the value for a certain stock on a certain day. This data isn't "tidy", because the identify of the stock (`X`, `Y`, or `Z`) is a variable, and you'll probably want to include it as a variable in modeling.
```{r}
wide_stocks[1:3, ]
```
If you want to convert the dataframe to have all stock values in a single column, you can use `pivot_longer` to convert wide data to long data:
```{r}
long_stocks <- pivot_longer(data = wide_stocks,
cols = -time,
names_to = "stock",
values_to = "price")
long_stocks[1:5, ]
```
In this "longer" dataframe, there is now one column that gives the identify of the stock (`stock`) and another column that gives the price of that stock that day (`price`):
```{r}
long_stocks[1:5, ]
```
The format for a `pivots_longer` call is:
```{r eval = FALSE}
## Generic code
new_df <- pivot_longer(old_df,
cols = [name(s) of the columns you want to make longer],
names_to = [name of new column to store the old column names],
values_to = [name of new column to store the old values])
```
Three important notes:
- Everything is pivoted into one of two columns -- one column with the old column names, and one column with the old cell values
- With the `names_to` and `values_to` arguments, you are just providing column names for the two columns that everything's pivoted into.
- If there is a column you don't want to include in the pivot (`date` in the example), use `-` to exclude it in the `cols` argument.
Notice how easy it is, now that the data is longer, to use `stock` for aesthetics of faceting in a `ggplot2` call:
```{r fig.width = 7, fig.height = 2.5}
ggplot(long_stocks, aes(x = time, y = price)) +
geom_line() +
facet_grid(. ~ stock) +
theme_bw()
```
If you have data in a "longer" format and would like to make it "wider", you can use `pivot_wider` to do that:
```{r}
stocks <- pivot_wider(long_stocks,
names_from = "stock",
values_from = price)
stocks[1:5, ]
```
Notice that this reverses the action of `pivot_longer`.
The "wider" your data the less likely it is to be tidy, so won't use `pivot_wider` frequently when you are preparing data for analysis. However, `pivot_wider` can be very helpful in creating tables for final reports and presentations.
For example, if you wanted to create a table with means and standard deviations for each of the three stocks, you could use `pivot_wider` to rearrange the final summary to create an attractive table.
```{r}
stock_summary <- long_stocks %>%
group_by(stock) %>%
summarize(N = n(), mean = mean(price), sd = sd(price))
stock_summary
```
```{r}
stock_summary %>%
mutate("Mean (Std.dev.)" = paste0(round(mean, 2), " (",
round(sd, 2), ")")) %>%
dplyr::select(- mean, - sd) %>%
mutate(N = as.character(N)) %>% # might be able to deal with this in pivot_longer call
pivot_longer(cols = -stock, names_to = "Statistic", values_to = "Value") %>%
pivot_wider(names_from = "stock", values_from = "Value") %>%
knitr::kable()
```
<iframe width="747" height="467" src="https://www.youtube.com/embed/NeisIoTKBfo?list=PLuGPtwgRXxqJK_Jb-q0g9RQHZ-PponJs4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week6_part_4.pdf)
a pdf of the lecture slides for this video.
<iframe width="747" height="467" src="https://www.youtube.com/embed/5yThV_XJgCU?list=PLuGPtwgRXxqJK_Jb-q0g9RQHZ-PponJs4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week6_part_5.pdf)
a pdf of the lecture slides for this video.
## Working with factors
<iframe width="746" height="466" src="https://www.youtube.com/embed/iM9Js4gTqfI?list=PLuGPtwgRXxqJK_Jb-q0g9RQHZ-PponJs4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week6_part_6.pdf)
a pdf of the lecture slides for this video.
Hadley Wickham has developed a package called `forcats` that helps you work with categorical variables (factors). I'll show some examples of its functions using the `worldcup` dataset:
```{r b}
library(forcats)
library(faraway)
data(worldcup)
```
The `fct_recode` function can be used to change the labels of a function (along the lines of using `factor` with `levels` and `labels` to reset factor labels).
One big advantage is that `fct_recode` lets you change labels for some, but not all, levels. For example, here are the team names:
```{r c}
library(stringr)
worldcup %>%
filter(str_detect(Team, "^US")) %>%
slice(1:3) %>% select(Team, Position, Time)
```
If you just want to change "USA" to "United States," you can run:
```{r d}
worldcup <- worldcup %>%
mutate(Team = fct_recode(Team, `United States` = "USA"))
worldcup %>%
filter(str_detect(Team, "^Un")) %>%
slice(1:3) %>% select(Team, Position, Time)
```
You can use the `fct_lump` function to lump uncommon factors into an "Other" category. For example, to lump the two least common positions together, you can run (`n` specifies how many categories to keep outside of "Other"):
```{r e}
worldcup %>%
dplyr::mutate(Position = forcats::fct_lump(Position, n = 2)) %>%
dplyr::count(Position)
```
You can use the `fct_infreq` function to reorder the levels of a factor from most common to least common:
```{r f}
levels(worldcup$Position)
worldcup <- worldcup %>%
mutate(Position = fct_infreq(Position))
levels(worldcup$Position)
```
If you want to reorder one factor by another variable (ascending order), you can use `fct_reorder` (e.g., homework 3). For example, to relevel `Position` by the average shots on goals for each position, you can run:
```{r g}
levels(worldcup$Position)
worldcup <- worldcup %>%
group_by(Position) %>%
mutate(ave_shots = mean(Shots)) %>%
ungroup() %>%
mutate(Position = fct_reorder(Position, ave_shots))
levels(worldcup$Position)
```
## String operations and regular expressions
<iframe width="746" height="466" src="https://www.youtube.com/embed/bnAec9JOwRM?list=PLuGPtwgRXxqJK_Jb-q0g9RQHZ-PponJs4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week6_part_7.pdf)
a pdf of the lecture slides for this video.
<iframe width="746" height="466" src="https://www.youtube.com/embed/7LduecuZW00?list=PLuGPtwgRXxqJK_Jb-q0g9RQHZ-PponJs4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week6_part_9.pdf)
a pdf of the lecture slides for this video.
For these examples, we'll use some data on passengers of the Titanic. You can load this data using:
```{r}
# install.packages("titanic")
library(titanic)
data("titanic_train")
```
We will be using the `stringr` package:
```{r}
library(stringr)
```
This data includes a column called "Name" with passenger names. This column is somewhat messy and includes several elements that we might want to separate (last name, first name, title). Here are the first few values of "Name":
```{r}
titanic_train %>% select(Name) %>% slice(1:3)
```
We've already done some things to manipulate strings. For example, if we wanted to separate "Name" into last name and first name (including title), we could actually do that with the `separate` function:
```{r}
titanic_train %>%
select(Name) %>%
slice(1:3) %>%
separate(Name, c("last_name", "first_name"), sep = ", ")
```
Notice that `separate` is looking for a regular pattern (", ") and then doing something based on the location of that pattern in each string (splitting the string). \bigskip
There are a variety of functions in R that can perform manipulations based on finding regular patterns in character strings.
The `str_detect` function will look through each element of a character vector for a designated pattern. If the pattern is there, it will return `TRUE`, and otherwise `FALSE`. The convention is:
```
## Generic code
str_detect(string = [vector you want to check],
pattern = [pattern you want to check for])
```
For example, to create a logical vector specifying which of the Titanic passenger names include "Mrs.", you can call:
```{r}
mrs <- str_detect(titanic_train$Name, "Mrs.")
head(mrs)
```
The result is a logical vector, so `str_detect` can be used in `filter` to subset data to only rows where the passenger's name includes "Mrs.":
```{r}
titanic_train %>%
filter(str_detect(Name, "Mrs.")) %>%
select(Name) %>%
slice(1:3)
```
There is an older, base R function called `grepl` that does something very similar (although note that the order of the arguments is reversed).
```{r}
titanic_train %>%
filter(grepl("Mrs.", Name)) %>%
select(Name) %>%
slice(1:3)
```
The `str_extract` function can be used to extract a string (if it exists) from each value in a character vector. It follows similar conventions to `str_detect`:
```
## Generic code
str_extract(string = [vector you want to check],
pattern = [pattern you want to check for])
```
For example, you might want to extract "Mrs." if it exists in a passenger's name:
```{r}
titanic_train %>%
mutate(mrs = str_extract(Name, "Mrs.")) %>%
select(Name, mrs) %>%
slice(1:3)
```
Notice that now we're creating a new column (`mrs`) that either has "Mrs." (if there's a match) or is missing (`NA`) if there's not a match.
For this first example, we were looking for an exact string ("Mrs"). However, you can use patterns that match a particular pattern, but not an exact string. For example, we could expand the regular expression to find "Mr." or "Mrs.":
```{r}
titanic_train %>%
mutate(title = str_extract(Name, "Mr\\.|Mrs\\.")) %>%
select(Name, title) %>%
slice(1:3)
```
Note that this pattern uses a special operator (`|`) to find one pattern **or** another. Double backslashes (`\\`) **escape** the special character ".".
<iframe width="746" height="466" src="https://www.youtube.com/embed/JiMMJwwXPBM?list=PLuGPtwgRXxqJK_Jb-q0g9RQHZ-PponJs4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week6_part_10.pdf)
a pdf of the lecture slides for this video.
<iframe width="746" height="466" src="https://www.youtube.com/embed/T0rmoLhs8sI?list=PLuGPtwgRXxqJK_Jb-q0g9RQHZ-PponJs4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week6_part_11.pdf)
a pdf of the lecture slides for this video.
As a note, in regular expressions, all of the following characters are special characters that need to be escaped with backslashes if you want to use them literally:
```
. * + ^ ? $ \ | ( ) [ ] { }
```
Notice that "Mr." and "Mrs." both start with "Mr", end with ".", and may or may not have an "s" in between.
```{r}
titanic_train %>%
mutate(title = str_extract(Name, "Mr(s)*\\.")) %>%
select(Name, title) %>%
slice(1:3)
```
This pattern uses `(s)*` to match zero or more "s"s at this spot in the pattern.
In the previous code, we found "Mr." and "Mrs.", but missed "Miss.". We could tweak the pattern again to try to capture that, as well. For all three, we have the pattern that it starts with "M", has some lowercase letters, and then ends with ".".
```{r}
titanic_train %>%
mutate(title = str_extract(Name, "M[a-z]+\\.")) %>%
select(Name, title) %>%
slice(1:3)
```
The last pattern used `[a-z]+` to match one or more lowercase letters. The `[a-z]`is a **character class**. \bigskip
You can also match digits (`[0-9]`), uppercase letters (`[A-Z]`), just some letters (`[aeiou]`), etc. \bigskip
You can negate a character class by starting it with `^`. For example, `[^0-9]` will match anything that **isn't** a digit.
Sometimes, you want to match a pattern, but then only subset a part of it. For example, each passenger seems to have a title ("Mr.", "Mrs.", etc.) that comes after ", " and before ". ". We can use this pattern to find the title, but then we get some extra stuff with the match:
```{r}
titanic_train %>%
mutate(title = str_extract(Name, ",\\s[A-Za-z]*\\.\\s")) %>%
select(title) %>%
slice(1:3)
```
As a note, in this pattern, `\\s` is used to match a space.
We are getting things like ", Mr. ", when we really want "Mr". We can use the `str_match` function to do this. We group what we want to extract from the pattern in parentheses, and then the function returns a matrix. The first column is the full pattern match, and each following column gives just what matches within the groups.
```{r}
head(str_match(titanic_train$Name,
pattern = ",\\s([A-Za-z]*)\\.\\s"))
```
To get just the title, then, we can run:
```{r}
titanic_train %>%
mutate(title =
str_match(Name, ",\\s([A-Za-z]*)\\.\\s")[ , 2]) %>%
select(Name, title) %>%
slice(1:3)
```
The `[ , 2]` pulls out just the second column from the matrix returned by `str_match`.
Here are some of the most common titles:
```{r}
titanic_train %>%
mutate(title =
str_match(Name, ",\\s([A-Za-z]*)\\.\\s")[ , 2]) %>%
group_by(title) %>% summarize(n = n()) %>%
arrange(desc(n)) %>% slice(1:5)
```
Here are a few other examples of regular expressions in action with this dataset.
Get just names that start with ("^") the letter "A":
```{r}
titanic_train %>%
filter(str_detect(Name, "^A")) %>%
select(Name) %>%
slice(1:3)
```
Get names with "II" or "III" (`{2,}` says to match at least two times):
```{r}
titanic_train %>%
filter(str_detect(Name, "I{2,}")) %>%
select(Name) %>%
slice(1:3)
```
Get names with "Andersen" or "Anderson" (alternatives in square brackets):
```{r}
titanic_train %>%
filter(str_detect(Name, "Anders[eo]n")) %>%
select(Name)
```
Get names that start with ("^" outside of brackets) the letters "A" and "B":
```{r}
titanic_train %>%
filter(str_detect(Name, "^[AB]")) %>%
select(Name) %>%
slice(1:3)
```
Get names that end with ("$") the letter "b" (either lowercase or uppercase):
```{r}
titanic_train %>%
filter(str_detect(Name, "[bB]$")) %>%
select(Name)
```
Some useful regular expression operators include:
```{r echo = FALSE}
reg_exp <- data_frame("Operator" = c(".",
"*",
"*?",
"+",
"+?",
"^",
"$",
"[...]"),
"Meaning" = c("Any character",
"Match 0 or more times (greedy)",
"Match 0 or more times (non-greedy)",
"Match 1 or more times (greedy)",
"Match 1 or more times (non-greedy)",
"Starts with (in brackets, negates)",
"Ends with",
"Character classes"))
knitr::kable(reg_exp)
```
For more on these patterns, see:
- Help file for the `stringi-search-regex` function in the `stringi` package (which should install when you install `stringr`)
- Chapter 14 of R For Data Science
- http://gskinner.com/RegExr: Interactive tool for helping you build regular expression pattern strings
## Tidy select
<iframe width="746" height="466" src="https://www.youtube.com/embed/u_y69p0cL-U?list=PLuGPtwgRXxqJK_Jb-q0g9RQHZ-PponJs4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week6_part_12.pdf)
a pdf of the lecture slides for this video.
There are [`tidyverse`](https://dplyr.tidyverse.org/reference/select.html) functions to make selecting variables more straightforwards. You can call these functions as arguments of the `select` function to streamline variable selection. Examples include: `starts_with()`, `ends_with()`, and `contains()`.
Here we use `starts_with("t")` to select all variables that begin with `t`.
```{r}
titanic_train %>%
select(starts_with("t")) %>%
slice(1:3)
```
The are also tidyverse functions that allow us to easily operate on a selection of variables. These functions are called [`scoped varients`](https://dplyr.tidyverse.org/reference/scoped.html). You can identify these functions by these `_all`, `_at`, and `_if` suffixes.
Here we use `select_if` to select all the numeric variables in a dataframe and covert their names to lower case (a handy function to tidy the variable names).
```{r}
titanic_train %>%
select_if(is.numeric, tolower) %>%
slice(1:3)
```
The `select_if` function takes the following form.
```{r eval = FALSE}
## Generic code
new_df <- select_if(old_df,
.predicate [selects the variable to keep],
.funs = [the function to apply to the selected columns])
```
Here we use `select_at` to select all the variables that contain `ss` in their name and then covert their names to lower case (a handy function to tidy the variable names).
```{r}
titanic_train %>%
select_at(vars(contains("ss")), tolower) %>%
slice(1:3)
```
## In-course exercise Chapter 6
For today's exercise, we'll be using the following three datasets (click on the file name to access the correct file for today's class for each dataset):
File name | Description
-------------------- | -----------------------------------------------
[`country_timeseries.csv`](https://github.com/geanders/RProgrammingForResearch/raw/master/data/country_timeseries.csv) | Ebola cases by country for the 2014 outbreak
[`mexico_exposure.csv`](https://github.com/geanders/RProgrammingForResearch/raw/master/data/mexico_exposure.csv) and [`mexico_deaths.csv`](https://github.com/geanders/RProgrammingForResearch/raw/master/data/mexico_deaths.csv) | Daily death counts and environmental measurements for Mexico City, Mexico, for 2008
[`measles_data/`](https://github.com/geanders/RProgrammingForResearch/tree/master/data/measles_data) | Number of cases of measles in CA since end of Dec. 2014
Note that you likely have already downloaded all the files in the `measles_data` folder, since we used them in an earlier in-course exercise. If so, there is no need to re-download those files.
Here are the sources for this data:
- `country_timeseries.csv` : [Caitlin Rivers' Ebola repository](https://github.com/cmrivers/ebola) (Caitlin originally collected this data from the WHO and WHO Situation reports)
- `mexico_exposure.csv` and `mexico_deaths.csv` : [one of Hadley Wickham's GitHub repos](https://github.com/hadley/mexico-mortality/tree/master/disease) (Hadley got the data originally from the Secretaria de Salud of Mexico's website, although it appears the link is now broken. I separated the data into two dataframes so students could practice merging.)
- `measles_data/`: [one of scarpino's GitHub repos](https://github.com/scarpino/measles-CA-2015) (Data originally from pdfs from the [California Department of Public Health](https://www.cdph.ca.gov/HealthInfo/discond/Pages/MeaslesSurveillanceUpdates.aspx))
```{block type = "rmdwarning"}
If you want to use these data further, you should go back and pull them from their original sources. They are here only for use in R code examples for this course.
```
Here are some of the packages you will need for this exercise:
```{r, message=FALSE}
library(dplyr)
library(gridExtra)
library(ggthemes)
```
### Designing tidy data
1. Check out the [`country_timeseries.csv` file](https://github.com/geanders/RProgrammingForResearch/raw/master/data/country_timeseries.csv) on Ebola for this week's example data. Talk with your partner and decide what changes you would need to make to this dataset to turn it into a "tidy" dataset, in particular which of the five common "untidy" problems the data currently has and why.
2. Do the same for the data on daily mortality and daily weather in Mexico.
3. Do the same for the set of files with measles data.
### Easier data wrangling
- Use `read_csv` to read the Mexico data (exposure and mortality) directly from GitHub into your R session. Call the dataframes `mex_deaths` and `mex_exp`.
- Are there any values of the `day` column in `mex_deaths` that is not present in the `day` column of `mex_exp`? How about vice-versa? (Hint: There are a few ways you could check this. One is to try filtering down to just rows in one dataframe where the `day` values are not present in the `day` values from the other dataframe. The `%in%` logical vector may be useful.)
- Merge the two datasets together to create the dataframe `mexico`. Exclude all columns except the outcome (deaths), day, and mean temperature.
- Convert the day to a Date class.
- If you did not already, try combining all the steps in the previous task into one "chained" pipeline of code using the pipe operator, `%>%`.
- Use this new dataframe to plot deaths by date in Mexico using `ggplot2`. The final plot should look like this:
```{r echo = FALSE, fig.width = 4, fig.height = 2.5, message = FALSE, warning = FALSE}
deaths_url <- paste0("https://github.com/geanders/RProgrammingForResearch/",
"raw/master/data/mexico_deaths.csv")
mex_deaths <- read_csv(deaths_url)
exposure_url <- paste0("https://github.com/geanders/RProgrammingForResearch/",
"raw/master/data/mexico_exposure.csv")
mex_exp <- read_csv(exposure_url)
library(lubridate)
mexico <- full_join(mex_deaths, mex_exp, by = "day") %>%
select(day, deaths, temp_mean) %>%
mutate(day = mdy(day))
ggplot(data = mexico) +
geom_point(mapping = aes(x = day, y = deaths),
size = 1.5, alpha = 0.5) +
xlab("Date in 2008") + ylab("# of deaths") +
ggtitle("Deaths by date") +
theme_few()
```
#### Example R code
Use `read_csv` to read the mexico data (exposure and mortality) directly from GitHub into your R session. Call the dataframes `mex_deaths` and `mex_exp`:
```{r message = FALSE}
deaths_url <- paste0("https://github.com/geanders/RProgrammingForResearch/",
"raw/master/data/mexico_deaths.csv")
mex_deaths <- read_csv(deaths_url)
head(mex_deaths)
exposure_url <- paste0("https://github.com/geanders/RProgrammingForResearch/",
"raw/master/data/mexico_exposure.csv")
mex_exp <- read_csv(exposure_url)
head(mex_exp)
```
Check if there are any values of the `day` column in `mex_deaths` that are not present in the `day` column of `mex_exp` and vice-versa.
```{r}
mex_deaths %>%
filter(!(day %in% mex_exp$day))
mex_exp %>%
filter(!(day %in% mex_deaths$day))
```
One important note is that, when you're doing this check, you do *not* want to overwrite your original dataframe, so be sure that you do not reassign this output to `mex_deaths` or `mex_exp`.
An even quicker way to do check this is to create a logical vector that checks this and use `sum` to add up the values in the logical vector. If the sum is zero, that tells you that the logical check is never true, so there are no cases where there is a `day` value in one dataframe that is not also in the other dataframe.
```{r}
sum(!(mex_deaths$day %in% mex_exp$day))
sum(!(mex_exp$day %in% mex_deaths$day))
```
Merge the two datasets together to create the dataframe `mexico`. Exclude all columns except the outcome (deaths), date, and mean temperature.
```{r message = FALSE}
mexico <- full_join(mex_deaths, mex_exp, by = "day")
mexico <- select(mexico, day, deaths, temp_mean)
```
Convert the date to a date class.
```{r}
library(lubridate) ## For parsing dates
mexico <- mutate(mexico, day = mdy(day))
```
Try combining all the steps in the previous task into one "chained" command:
```{r}
mexico <- full_join(mex_deaths, mex_exp, by = "day") %>%
select(day, deaths, temp_mean) %>%
mutate(day = mdy(day))
head(mexico)
```
Note that, in this case, all the values of `day` in `mex_deaths` have one and only one matching value in `mex_exp`, and vice-versa. Because of this, we would have gotten the same `mexico` dataframe if we'd used `inner_join`, `left_join` or `right_join` instead of `full_join`. The differences between these `*_join` functions come into play when you have some values of your matching column that aren't in both of the dataframes you're joining.
Use this new dataframe to plot deaths by date using `ggplot`:
```{r fig.width = 4, fig.height = 2.5}
ggplot(data = mexico) +
geom_point(mapping = aes(x = day, y = deaths),
size = 1.5, alpha = 0.5) +
labs(x = "Date in 2008", y = "# of deaths") +
ggtitle("Deaths by date") +
theme_few()
```
### More extensive data wrangling
- Read the Ebola data directly from GitHub into your R session. Call the dataframe `ebola`.
- Use `dplyr` functions to create a tidy dataset. First, change it from "wide" data to "long" data. Name the new column with the key `variable` and the new column with the values `count`. The first few lines of the "long" version of the dataset should look like this:
```{r echo = FALSE, message = FALSE, warning = FALSE}
ebola_url <- paste0("https://github.com/geanders/RProgrammingForResearch/",
"raw/master/data/country_timeseries.csv")
ebola <- read_csv(ebola_url) %>%
pivot_longer(cols = c(-Date, -Day),
names_to = "variable",
values_to = "count")
ebola %>% slice(1:6)
```
- Convert the `Date` column to a Date class.
- Use the `separate` function to separate the `variable` column into two columns, `type` ("Cases" or "Deaths") and `country` ("Guinea", "Liberia", etc.). At this point, the data should look like this:
```{r echo = FALSE}
ebola <- ebola %>%
mutate(Date = lubridate::mdy(Date)) %>%
separate(variable, c("type", "country"), sep = "_")
ebola %>% slice(1:6)
```
- Use the `pivot_wider` function to convert the data so you have separate columns for the two variables of numbers of `Cases` and `Deaths`. At this point, the dataframe should look like this:
```{r echo = FALSE}
ebola_wider <- pivot_wider(ebola, names_from = type, values_from = count)
ebola_wider %>% slice(1:6)
```
- Remove any observations where counts of both cases and deaths are missing for that country on that date.
- Now that your data is tidy, create one plot showing Ebola cases by date, faceted by country, and one showing Ebola deaths by date, also faceted by country. Try using the option `scales = "free_y"` in the `facet_wrap` function and see how that changes these graphs. Discuss with your group the advantages and disadvantages of using this option when creating these small multiple plots. The plots should look something like this (if you're using the `scales = "free_y"` option):
```{r echo = FALSE, fig.width = 8, fig.height = 4, message = FALSE, warning = FALSE}
ebola_filtered <- filter(ebola_wider, !is.na(Cases) & !is.na(Deaths))
ggplot(ebola_filtered, aes(x = Date, y = Cases)) +
geom_line() +
facet_wrap(~ country, ncol = 4, scales = "free_y") +
theme_classic()
ggplot(ebola_filtered, aes(x = Date, y = Deaths)) +
geom_line() +
facet_wrap(~ country, ncol = 4, scales = "free_y") +
theme_classic()
```
- Based on these plots, what would your next questions be about this data before you used it for an analysis?
- Can you put all of the steps of this cleaning process into just a few "chained" code pipelines using `%>%`?
- If you have extra time (super-challenge!): There is a function called `fct_reorder` in the `forcats` package that can be used to reorder the levels of a factor in a dataframe based on another column in the same dataframe. This function can be very useful for using a meaningful order when plotting. We'll cover the `forcats` package in a later class, but today check out the help file for `fct_reorder` and see if you can figure out how to use it to reorder the small multiple plots in order of the maximum number of cases or deaths (for the two plots respectively) in each country. You'll be able to do this by changing the code in `facet_wrap` from `~ country` to `~ fct_reorder(country, ...)`, but with the `...` replaced with certain arguments. If you're getting stuck, try running the examples in the `fct_reorder` helpfile to get a feel for how this function can be used when plotting. The plots will look something like this:
```{r,echo = FALSE, message = FALSE, fig.width = 8, fig.height = 4}
ebola_url <- paste0("https://github.com/geanders/RProgrammingForResearch/",
"raw/master/data/country_timeseries.csv")
ebola <- read_csv(ebola_url) %>%
pivot_longer(cols = c(-Date, -Day), names_to = "variable", values_to = "count") %>%
mutate(Date = mdy(Date)) %>%
separate(variable, c("type", "country"), sep = "_") %>%
pivot_wider(names_from = type, values_from = count) %>%
filter(!is.na(Cases) & !is.na(Deaths))
library(forcats)
ggplot(ebola, aes(x = Date, y = Cases)) +
geom_line() +
facet_wrap(~ fct_reorder(country, Cases, .fun = max, .desc = TRUE),
ncol = 4) +
theme_classic() +
ggtitle("Deaths from Ebola")
ggplot(ebola, aes(x = Date, y = Deaths)) +
geom_line() +
facet_wrap(~ fct_reorder(country, Deaths, .fun = max, .desc = TRUE),
ncol = 4) +
theme_classic() +
ggtitle("Cases of Ebola")
```
#### Example R code
Read the data in using `read_csv`.
```{r message = FALSE}
ebola_url <- paste0("https://github.com/geanders/RProgrammingForResearch/",
"raw/master/data/country_timeseries.csv")
ebola <- read_csv(ebola_url)
head(ebola)
```
Change the data to long data using the `pivoter_longer` function from `tidyr`:
```{r}
ebola <- ebola %>%
pivot_longer(cols = c(-Date, -Day), names_to = "variable", values_to = "count")
head(ebola)
```
Convert `Date` to a date class:
```{r}
ebola <- ebola %>%
mutate(Date = mdy(Date))
head(ebola)
```
Split `variable` into `type` and `country`:
```{r}
ebola <- ebola %>%
separate(variable, c("type", "country"), sep = "_")
head(ebola)
```
Convert the data so you have separate columns for the two variables of numbers of `Cases` and `Deaths`:
```{r}
ebola <- pivot_wider(ebola, names_from = type, values_from = count)
head(ebola)
```
Remove any observations where counts of cases or deaths are missing for that country:
```{r}
ebola <- filter(ebola, !is.na(Cases) & !is.na(Deaths))
head(ebola)
```
Now that your data is tidy, create one plot showing ebola cases by date, faceted by country, and one showing ebola deaths by date, also faceted by country:
```{r fig.width = 8, fig.height = 4}
ggplot(ebola, aes(x = Date, y = Cases)) +
geom_line() +
facet_wrap(~ country, ncol = 4) +
theme_classic()
ggplot(ebola, aes(x = Date, y = Deaths)) +
geom_line() +
facet_wrap(~ country, ncol = 4) +
theme_classic()
```
Try using the option `scales = "free_y"` in the `facet_wrap()` function (in the `gridExtra` package) and see how that changes these graphs:
```{r fig.width = 8, fig.height = 4}
ggplot(ebola, aes(x = Date, y = Cases)) +
geom_line() +
facet_wrap(~ country, ncol = 4, scales = "free_y") +
theme_classic()
ggplot(ebola, aes(x = Date, y = Deaths)) +
geom_line() +
facet_wrap(~ country, ncol = 4, scales = "free_y") +
theme_classic()
```
Put all of the steps of this cleaning process into just a few "chaining" calls.
```{r, message = FALSE, fig.width = 8, fig.height = 4}
ebola <- read_csv(ebola_url) %>%
pivot_longer(cols = c(-Date, -Day), names_to = "variable", values_to = "count") %>%
mutate(Date = mdy(Date)) %>%
separate(variable, c("type", "country"), sep = "_") %>%
pivot_wider(names_from = type, values_from = count) %>%
filter(!is.na(Cases) & !is.na(Deaths))
ggplot(ebola, aes(x = Date, y = Cases)) +
geom_line() +
facet_wrap(~ country, ncol = 4) +
theme_classic()
ggplot(ebola, aes(x = Date, y = Deaths)) +
geom_line() +
facet_wrap(~ country, ncol = 4) +
theme_classic()
```
Use the `fct_reorder` function inside the `facet_wrap` function call to reorder the small-multiple graphs.
```{r, message = FALSE, fig.width = 8, fig.height = 4}
library(forcats)
ggplot(ebola, aes(x = Date, y = Cases)) +
geom_line() +
facet_wrap(~ fct_reorder(country, Cases, .fun = max, .desc = TRUE),
ncol = 4) +
theme_classic()
ggplot(ebola, aes(x = Date, y = Deaths)) +
geom_line() +
facet_wrap(~ fct_reorder(country, Deaths, .fun = max, .desc = TRUE),
ncol = 4) +