# (PART) Part I: Preliminaries {-}
# R Preliminaries
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week1.pdf) a pdf of the lecture slides covering this topic.
## R and R Studio
### What is R?
R is an open-source programming language that evolved from the S language. The S language was developed at Bell Labs in the 1970s, which is the same place (and about the same time) that the C programming language was developed.
R itself was developed in the 1990s--2000s at the University of Auckland. It is open-source software, freely and openly distributed under the GNU General Public License. The base version of R that you download when you install R on your computer includes the critical code for running R, but you can also install and run "packages" that people all over the world have developed to extend R.
With new developments, R is becoming more and more useful for a variety of programming tasks. However, where it really shines is in working with data and doing statistical analysis. R is currently popular in a number of fields, including:
- Statistics
- Machine learning
- Data journalism / data analysis
R has some of the same strengths (quick and easy to code, interfaces well with other languages, easy to work interactively) and weaknesses (slower than compiled languages) as Python. For data-related tasks, R and Python are fairly neck-and-neck. However, R is still the first choice of statisticians in most fields, so I would argue that R has an advantage if you want to have access to cutting-edge statistical methods.
> "The best thing about R is that it was developed by statisticians. The worst thing about R is that... it was developed by statisticians."
> -Bo Cowgill, Google, at the Bay Area R Users Group
### Open-source software
R is open-source software. Many other popular statistical programming languages, conversely, are proprietary. It's useful to know what it means for software to be "open-source", both conceptually and in terms of how you will be able to use and add to R in your own work.
R is free, and it's tempting to think of open-source software just as "free software". Things, however, are a little more subtle than that. It helps to consider some different meanings of the word "free". "Free" can mean:
- *Gratis*: Free as in beer
- *Libre*: Free as in speech
Open-source software is the *libre* type of free. This means that, with software that is open-source, you can:
- Access all of the code that makes up the software
- Change the code as you'd like for your own applications
- Build on the code with your own extensions
- Share the software and its code, as well as your extensions, with others
In practice, this means that, once you are familiar with the software, you can dig deeply into the code to figure out exactly how it's performing certain tasks. This can be useful for finding and eliminating bugs, and it can also help researchers figure out if there are any limitations in how the code works for their specific research.
It also means that you can build your own software on top of existing R software and its extensions. I explain more about R packages a bit later, but this open-source nature of R (and other languages, including Python) has created a large community of people worldwide who develop and share extensions to R. As a result, you can pull in packages that let you do all kinds of things in R, like visualizing Tweets, cleaning up accelerometer data, analyzing complex surveys, fitting machine learning models, and a wealth of other cool things.
### What is RStudio?
To get the R software, you'll [download R](https://www.r-project.org) from the R Project for Statistical Computing. This is enough for you to use R on your own computer. However, I would suggest one additional, free piece of software to improve your experience while working with R: RStudio.
RStudio is an integrated development environment (IDE) for R. This basically means that it provides you an interface for running R and coding in R, with a lot of nice extras that will make your life easier.
You download RStudio separately from R. You'll want to download and install R itself first, and then you can [download RStudio](https://www.rstudio.com/products/rstudio/download2/). You want the Desktop version with the free license.
The company that develops this IDE is a fantastic contributor to the global R community. RStudio currently:
- Develops and freely provides the RStudio IDE
- Provides excellent resources for learning and using R (for example, cheatsheets)
- Is producing some of the most-used R packages
- Employs some of the top people in R development
R has been advancing by leaps and bounds in terms of what it can do and the elegance with which it does it, in large part because of the enormous contributions of people involved with RStudio.
### Setting up
If you do not already have them, you will need to download and install both R and RStudio.
- Go to [CRAN](https://cran.r-project.org) and download the latest version of R for your system. Install.
- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/) and download the latest version of RStudio for your system. Install.
Defaults should be fine for everything when you install both R and RStudio. You will want the latest stable version, rather than the development version, for this course.
## The "package" system
### R packages
Your original download of R is only a starting point. To me, this is a bit like the toy train set that my son was obsessed with for a while. You first buy a very basic set that looks something like Figure \@ref(fig:toy-train-basic).
```{r toy-train-basic, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap = "The toy version of base R."}
knitr::include_graphics("figures/BrioBasicSet.jpg")
```
To take full advantage of R, you'll want to add on packages. In the case of the train set, at this point, a doting grandparent adds on extensively through birthday presents, so you end up with something that looks like Figure \@ref(fig:toy-train-fancy).
```{r toy-train-fancy, echo = FALSE, out.width = "800pt", fig.align = "center", fig.cap = "The toy version of what your R set-up will look like once you find cool packages to use for your research."}
knitr::include_graphics("figures/BrioFancySet.jpg")
```
The main source for installing packages for R remains the Comprehensive R Archive Network, or [CRAN](https://cran.r-project.org). However, [GitHub](https://github.com) is growing in popularity, especially for packages that are still in development. You can also create and share packages among your collaborators or co-workers, without ever posting them publicly.
### Installing from CRAN
The most popular place from which to get packages is currently CRAN. You can install packages from CRAN using R code. For example, telephone keypads include letters for each number (Figure \@ref(fig:phone-keypad)), which allow companies to have "named" phone numbers that are easier for people to remember, like 1-800-GO-FEDEX and 1-800-FLOWERS.
```{r phone-keypad, echo = FALSE, out.width = "150pt", fig.align = "center", fig.cap="Telephone keypad with letters corresponding to each number."}
knitr::include_graphics("figures/telephone_keypad.png")
```
The `phonenumber` package is a cool little package that will convert between numbers and letters based on the telephone keypad. Since this package is on CRAN, you can install the package to your computer using the `install.packages` function:
```{r, eval = FALSE, message = FALSE, warning = FALSE, results = FALSE}
install.packages("phonenumber")
```
This downloads the package from CRAN and saves it in a special location on your computer where R can load it when you're ready to use it.
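As a minimal sketch, you can check whether a package is already installed before downloading it again. This uses base R's `requireNamespace` function, which returns `TRUE` if the package can be found on your computer; the object name `pkg_installed` is just an illustrative choice.

```{r}
# Check whether the package is already installed (TRUE) or not (FALSE)
pkg_installed <- requireNamespace("phonenumber", quietly = TRUE)
pkg_installed

# Only suggest installing the package if it is missing
if (!pkg_installed) {
  message("Run install.packages(\"phonenumber\") to install it.")
}
```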
### Loading an installed package
Once you have installed a package, it will be saved to your computer, but you won't be able to access its functions until you load it in your R session. You can load a package in an R session using the `library` function, with the package name inside the parentheses.
```{r message = FALSE, warning = FALSE, results = FALSE}
library(phonenumber)
```
```{block, type = 'rmdtip'}
One thing that people often find confusing when they start using R is knowing when to use and not use quotation marks. The general rule is that you use quotation marks when you want to refer to a character string literally, but no quotation marks when you want to refer to the value in a previously-defined object. For example, if you saved the string `"Anderson"` as the object `my_name` (`my_name <- "Anderson"`), then in later code, if you type `my_name` (no quotation marks), you'll get `"Anderson"`, while if you type out `"my_name"` (with quotation marks), you'll get `"my_name"` (what you typed, literally).
One thing that makes this rule confusing is that there are a few cases in R where you really should (by this rule) use quotation marks, but the function is coded to let you be lazy and get away without them. One example is the `library` function. In the above code, you want to literally load the package "phonenumber", rather than load whatever character string is saved in the object named `phonenumber`. However, `library` is one of the functions where you can be lazy and skip the quotation marks, and it will still load "phonenumber" for you (although, if you want, this function also works if you follow the rule and call `library("phonenumber")` instead).
```
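The quotation mark rule from the tip above can be tried directly. This short sketch uses the same `my_name` object described there:

```{r}
my_name <- "Anderson"
my_name      # No quotation marks: R looks up the object and prints its value
"my_name"    # Quotation marks: R treats this as a literal character string
```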
Once a package is loaded, you can use all its exported (i.e., public) functions by calling them directly. For example, the `phonenumber` package has a function called `letterToNumber` that converts a character string to a number. Once you've loaded `phonenumber` using `library`, you can use this function in your R session:
```{r}
fedex_number <- "GoFedEx"
letterToNumber(fedex_number)
```
```{block, type = "rmdnote"}
R vectors can have several different *classes*. One common class is the character class, which is the class of the character string we're using here ("GoFedEx"). You'll always put character strings in quotation marks. Another key class is numeric (numbers). Later in the course, we'll introduce other classes that vectors can have, including factors and dates.
```
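If you are ever unsure of an object's class, base R's `class` function will report it. A quick sketch:

```{r}
class("GoFedEx")  # A character string
class(3)          # A number
class("3")        # Still a character string, because of the quotation marks
```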
## Basic code conventions of R
### R's MVP: The *gets arrow*
The *gets arrow*, `<-`, is R's assignment operator. It takes whatever you've created on the right hand side of the `<-` and saves it as an object with the name you put on the left hand side of the `<-`. The basic structure of a call with a gets arrow looks like this:
```{r eval = FALSE}
## Note: Generic code
[name of object] <- [thing I want to save]
```
```{block, type = "rmdwarning"}
Sometimes, we'll show "generic" code in a code block, that doesn't actually work if you put it in R, but instead shows the generic structure of an R call. We'll try to always include a comment with any generic code, so you'll know not to try to run it in R.
```
For example, if I just type `"GoFedEx"` at the R prompt, R will print that string back to me, but won't save it anywhere for me to use later:
```{r}
"GoFedEx"
```
However, if I assign `"GoFedEx"` to an object using a gets arrow, I can print it out or use it later by typing ("referencing") that object name:
```{r}
fedex_number <- "GoFedEx"
fedex_number
letterToNumber(fedex_number)
```
### Assignment operator wars: `<-` vs. `=`
You can make assignments in R using either the gets arrow (`<-`) or `=`. When you read other people's code, you'll see both. R gurus advise using `<-` rather than `=` when coding in R, and as you move to doing more complex things, some subtle problems might crop up if you use `=`. I have heard from someone in the know that you can tell the age of a programmer by whether he or she uses the gets arrow or `=`, with `=` more common among the young and hip. For this course, however, I am asking you to code according to [Hadley Wickham's R style guide](http://adv-r.had.co.nz/Style.html), which specifies using the gets arrow for assignment.
While you will be coding with the gets arrow exclusively in this course, it will be helpful for you to know that the two assignment operators do pretty much the same thing:
```{r}
one_to_ten <- 1:10
one_to_ten
one_to_ten = 1:10
one_to_ten
```
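One place where the two operators behave differently is inside a function call. The following is a minimal sketch of that difference (the object name `demo_values` is only for illustration): within a call, `=` matches a value to a parameter name, while `<-` still creates an object as a side effect.

```{r}
# Inside a function call, `=` matches the vector to the parameter `x`
mean(x = c(2, 4, 6))

# Inside a function call, `<-` still assigns, so `demo_values` now exists
mean(demo_values <- c(2, 4, 6))
demo_values
```

In this course you will not need to assign inside a function call; the sketch is only meant to show why the two operators are not fully interchangeable.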
### Naming objects
When you assign objects, you will need to choose names for them. This object name is what you will type later in your code to reference the object and use it in functions, figures, etc. For example, with the following code, I am assigning the character string "GoFedEx" to an object that I am naming `fedex_number`:
```{r}
fedex_number <- "GoFedEx"
```
There are only two fixed rules for naming objects in R:
- Use only letters, numbers, underscores, and periods
- Start with a letter, never with a number or an underscore
In addition to these fixed rules, there are also some guidelines for naming objects that you should adopt now, since they will make your life easier as you advance to writing more complex code in R. The following three guidelines for naming objects are from [Hadley Wickham's R style guide](http://adv-r.had.co.nz/Style.html):
- Use lower case for variable names (`fedex_number`, not `FedExNumber`)
- Use an underscore as a separator (`fedex_number`, not `fedex.number` or `fedexNumber`)
- Avoid using names that are already defined in R (e.g., don't name an object `mean`, because a function named `mean` already exists)
Another good practice is to name objects after nouns (e.g., `fedex_number`) and later, when you start writing functions, name those after verbs (e.g., `call_fedex`). You'll want your object names to be short enough that they don't take forever to type as you're coding, but not so short that you can't remember what they stand for.
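To see how R treats names that break the fixed rules, one option is base R's `make.names` function, which converts any character string into a name that R will accept. This is only a sketch for exploring the rules, not something you need to use when naming your own objects.

```{r}
# A name that follows the rules is returned unchanged
make.names("fedex_number")

# A space is not allowed in a name, so R replaces it with a period
make.names("fedex number")

# A name cannot start with a number, so R adds a letter at the start
make.names("2nd_number")
```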
```{block, type = "rmdtip"}
Sometimes, you'll want to create an object that you won't want to keep for very long. For example, you might want to create a small object to test some code, but you plan to not need the object again once you've done that. You may want to come up with some short, generic object names that you use for these kinds of objects, so that you'll know that you can delete them without problems when you want to clean up your R session.
There are all kinds of traditions for these placeholder variable names in computer science. `foo` and `bar` are two popular choices, as are, evidently, `xyzzy`, `spam`, `ham`, and `norf`. There are different placeholder names in different languages: for example, `toto`, `truc`, and `azerty` (French); and `pippo`, `pluto`, `paperino` (Disney character names; Italian). See the Wikipedia page on [metasyntactic variables](https://en.wikipedia.org/wiki/Metasyntactic_variable) to find out more.
```
### Commenting code
Sometimes, you'll want to include notes in your code. Most programming languages let you do this by using a *comment character* to start the line with your comment. In R, the comment character is the hash symbol, `#`. R will skip any line that starts with `#` in a script. For example, if you run the following code:
```{r}
# Don't print this.
"But print this"
```
R will only print the second, uncommented line.
You can also use a comment in the middle of a line, to add a note on what you're doing in that line of the code. R will skip any part of the code from the hash symbol on. For example:
```{r}
"Print this" ## But not this, it's a comment.
```
## R's most basic object types
An R *object* stores some type of data that you want to use later in your R code, without fully recreating it. The content of R objects can vary from very simple (the `"GoFedEx"` string in the example code above) to very complex objects with lots of elements (for example, a machine learning model).
There are a variety of different object types in R, shaped to fit different types of data, ranging from simple to complex. In this section, we'll start by describing two object types that you will use most often in basic data analysis, **vectors** (1-dimensional objects) and **dataframes** (2-dimensional objects).
### Vectors
To get an initial grasp of the *vector* object type in R, think of it as a 1-dimensional object, or a string of values. All values in a vector must be of the same class (i.e., all numbers, all characters, all dates). If you try to create a vector with elements from different classes (like "FedEx", which is a character, and 3, a number), R will coerce all of the elements to the most generic class of any of the elements (i.e., "FedEx" and "3" will both become characters, since "3" can be changed to a character, but "FedEx" can't be changed to a number).
To create a vector from different elements, you'll use the concatenation function, `c`, to join them together, with commas between the elements. For example, to create a vector with the first five elements of the Fibonacci sequence, you can run:
```{r}
fibonacci <- c(1, 1, 2, 3, 5)
fibonacci
```
Here is an example of creating a vector using elements with the character class instead of numbers (note the quotation marks used around each element for character strings):
```{r}
one_to_five <- c("one", "two", "three", "four", "five")
one_to_five
```
If you mix classes when you create the vector, R will coerce all the elements to the most generic of the elements' classes:
```{r}
mixed_classes <- c(1, 3, "five")
mixed_classes
```
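You can confirm the coercion with the `class` function. In this short sketch, adding a single character string changes the class of the whole vector:

```{r}
class(c(1, 3, 5))        # All numbers
class(c(1, 3, "five"))   # One character string makes the whole vector character
```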
A vector's *length* is the number of elements in the vector. You can use the `length` function to determine a vector's length:
```{r}
length(mixed_classes)
```
Once you create an object, you will often want to reference the whole object in future code. However, there will be some times when you'll want to reference just certain elements of the object (for example, the first three values). You can pull out certain values from a vector by using indexing with square brackets (`[...]`) to identify the locations of the elements you want to pull, with a numeric vector inside the brackets that lists the numbered positions of the elements you want to get:
```{r}
fibonacci[2] # Get the second value
fibonacci[c(1, 5)] # Get first and fifth values
fibonacci[1:3] # Get the first three values
```
You can also use logic to pull out some values of a vector. For example, you might only want to pull out even values from the `fibonacci` vector. We'll cover using logical statements to index vectors later in the book.
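As a brief preview (a sketch only; the vector is recreated under the hypothetical name `fib_values` so the example stands alone), a logical statement inside the square brackets keeps the elements for which the statement is true. Here `%%` gives the remainder after division, so `fib_values %% 2 == 0` is `TRUE` for even values.

```{r}
fib_values <- c(1, 1, 2, 3, 5)
fib_values %% 2 == 0               # TRUE where the value is even
fib_values[fib_values %% 2 == 0]   # Keep only the even values
```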
### Dataframes
A dataframe is a 2-dimensional object, made of one or more vectors of the same length stuck together side-by-side. It is the closest thing R has to an Excel spreadsheet-type structure. You can create dataframes using the `data.frame` function. However, most often you will create a dataframe by reading in data from a file, using something like `read.csv`.
To create a dataframe using `data.frame`, in this case by sticking together vectors you already have saved as R objects, you can run:
```{r}
fibonacci_seq <- data.frame(num_in_seq = one_to_five,
                            fibonacci_num = fibonacci)
fibonacci_seq
```
Note that this call requires that the `one_to_five` and `fibonacci` vectors are the same length, although they don't have to be (and in this case aren't) the same class of objects (`one_to_five` is a character class, `fibonacci` is numeric).
You can also create a dataframe using `data.frame` even if you don't have the vectors for the columns saved as objects. Instead, in this case, you can put the vector assignment directly within the `data.frame` call:
```{r}
fibonacci_seq <- data.frame(num_in_seq = c("one", "two", "three",
                                           "four", "five"),
                            fibonacci_num = c(1, 1, 2, 3, 5))
fibonacci_seq
```
```{block, type = "rmdnote"}
You can put more than one function call in a single line of R code, as in this example (the `c` creates a vector, while the `data.frame` creates a dataframe, using the vectors created by the calls to `c`). When you use multiple functions within a single R call, R will evaluate starting from the inner-most parentheses out, much like the order of operations in a math equation with parentheses.
```
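The inside-out order of evaluation described in the note above can be sketched by writing the same calculation two ways. The object names `step_vector` and `step_mean` are made up for this illustration.

```{r}
# R first runs c() to build the vector, then mean(), then round()
round(mean(c(1.5, 2.5, 4.1)), digits = 1)

# The same steps, written as one call at a time
step_vector <- c(1.5, 2.5, 4.1)
step_mean <- mean(step_vector)
round(step_mean, digits = 1)
```

Both versions give the same result; the nested version just skips saving the intermediate objects.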
The general format for using `data.frame` is:
```{r eval = FALSE}
## Note: Generic code
[name of object] <- data.frame([1st column name] = [1st column content],
                               [2nd column name] = [2nd column content])
```
with an equals sign between the column name and column content for each column, and commas between each of the columns.
You can use square-bracket indexing (`[..., ...]`) for dataframes, too, but now they'll have two dimensions (rows, then columns). Put the rows you want before the comma, the columns after. If you want all of something (e.g., all rows in the dataframe), leave the designated spot blank. Here are two examples of using square-bracket indexing to pull a subset of the `fibonacci_seq` dataframe we created above:
```{r}
fibonacci_seq[1:2, 2] # First two rows, second column
fibonacci_seq[5, ] # Last row, all columns
```
```{block type = "rmdnote"}
If you forget to put the comma in the indexing for a dataframe (e.g., `fibonacci_seq[1:2]`), you will index out the *columns* that fall at that position or positions. To avoid confusion, I suggest that you always use indexing with a comma when working with dataframes.
```
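To make the difference concrete, here is a small sketch using a made-up dataframe (`small_df` and its columns are hypothetical) with three rows and three columns:

```{r}
small_df <- data.frame(letter = c("a", "b", "c"),
                       number = c(1, 2, 3),
                       even = c(FALSE, TRUE, FALSE))
small_df[1:2, ]   # With the comma: first two rows, all columns
small_df[1:2]     # Without the comma: first two columns, all rows
```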
So far, we've only shown how to create dataframes from scratch within an R session. Usually, however, you'll create R dataframes instead by reading in data from an outside file using `read.csv` and related functions. For example, you might want to analyze data on all the guests that came on the *Daily Show*, *circa* Jon Stewart. If you have this data in a comma-separated (csv) file on your computer called "daily_show_guests.csv", you can read it into your R session with the following code:
```{r echo = FALSE}
daily_show <- read.csv("data/daily_show_guests.csv",
                       header = TRUE,
                       skip = 4)
```
```{r eval = FALSE}
daily_show <- read.csv("daily_show_guests.csv",
                       header = TRUE,
                       skip = 4)
```
In this code, the `read.csv` function is reading in the data from the file "daily_show_guests.csv", while the gets arrow (`<-`) assigns that data to the object `daily_show`, which you can then reference in later code to explore and plot the data.
Once you've read in the data and saved the resulting dataframe as an object, you can use square-bracket indexing to look at the first two rows in the data:
```{r}
daily_show[1:2, ]
```
You can use the functions `dim`, `nrow`, and `ncol` to figure out the dimensions (number of rows and columns) of a dataframe:
```{r}
dim(daily_show)
nrow(daily_show)
ncol(daily_show)
```
## Using R functions
### Function structure
In general, functions in R take the following structure:
```{r, eval = FALSE}
## Generic code
function.name(parameter 1 = argument 1, parameter 2 = argument 2,
              parameter 3 = argument 3)
```
The result of the function will be output to your R session, unless you choose to save the output in an object:
```{r, eval = FALSE}
## Generic code
new_object <- function.name(parameter 1 = argument 1,
                            parameter 2 = argument 2,
                            parameter 3 = argument 3)
```
Here are some example function calls, to give you examples of this structure:
```{r}
head(daily_show)
head(daily_show, n = 3)
```
```{r echo = FALSE}
daily_show <- read.csv("data/daily_show_guests.csv",
                       skip = 4,
                       header = TRUE)
```
```{r eval = FALSE}
daily_show <- read.csv("daily_show_guests.csv",
                       skip = 4,
                       header = TRUE)
```
Within the function call, *parameters* allow you to customize the function to run in a certain way (e.g., use a certain dataframe as an input, give output in a certain format). Some function parameters will have *default arguments*, which means that you don't have to put a value for that parameter for the function to run, but you can if you want the function to do something other than the default.
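A small sketch of a default argument, using base R's `round` function: its `digits` parameter has a default argument of 0, so you only need to set it when you want something other than rounding to a whole number.

```{r}
round(3.14159)              # `digits` uses its default argument, 0
round(3.14159, digits = 2)  # Override the default to keep two decimal places
```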
### Function help files
You can find out more about a function, including what parameters it has and what their default values are (if any), by using `?` before the function name in the R console. For example, to find out more about the `read.csv` command, run:
```{r eval = FALSE}
?read.csv
```
From the "Usage" section of the help file, you can figure out that the only required parameter is `file`, the pathname of the file that you want to read in, since this is the only parameter in the "Usage" example without a default value:
```
read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)
```
You can also see from this "Usage" section that the default value of `header` is `TRUE`, the default value of `sep` is a comma, etc.
The "Arguments" section explains each of the parameters, and possible arguments that each can take. For example, here is the explanation of the `nrows` parameter in the `read.csv` function:
```
integer: the maximum number of rows to read in. Negative and other
invalid values are ignored.
```
From this, you can determine that you should put in a whole number, 1 or higher, and the function will only read in that many lines of the dataframe when you run `read.csv`.
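Here is a minimal sketch of `nrows` in action. So that it runs without a file on your computer, it uses the `text` parameter of `read.csv` to read a few made-up lines of csv data; the object names and the data are hypothetical.

```{r}
# Made-up csv content: one header line and three rows of data
csv_lines <- "guest,year
Guest A,1999
Guest B,2000
Guest C,2001"

all_rows <- read.csv(text = csv_lines, header = TRUE)
first_two <- read.csv(text = csv_lines, header = TRUE, nrows = 2)

nrow(all_rows)    # All three rows were read in
nrow(first_two)   # Only the first two rows were read in
```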
### Function parameters
Each function parameter has a name (e.g., `nrows`, `header`, `file`). The safest way to call a function in R is to use the structure `parameter name = argument value` for every parameter, like this:
```{r eval = FALSE}
head(x = daily_show, n = 3)
```
However, you can also give argument values by position. For example, in the `head` function, the first parameter is `x`, the object you want to look at, and the second is `n`, the number of elements you want to include when you look at the object. If you know this, you can call `head` using the shorter call:
```{r eval = FALSE}
head(daily_show, 3)
```
If you use position alone, you will have problems if you don't include arguments in exactly the right order. However, if you use parameter names to set each argument, it doesn't matter what order you include arguments when calling a function:
```{r eval = FALSE}
# These two calls return the exact same object
head(x = daily_show, n = 3)
head(n = 3, x = daily_show)
```
Because code tends to be more robust to errors when you use parameter names to set arguments, we recommend giving arguments by name rather than by position, at least while you're learning R. It's too easy to forget the exact order and get errors in your code. However, there is one exception: the first argument to a function is almost always required (i.e., there's not a default value), and you very quickly learn what the first parameter of a function is once you start using the function regularly. Therefore, it's fine to use position alone to specify the first argument in a function, but for now always use the parameter name to set any later arguments:
```{r eval = FALSE}
head(daily_show, n = 3)
```
```{block, type='rmdtip'}
Using the full parameter names for arguments can take a bit more time, since it requires more typing. However, RStudio helps you out with that by offering *code completion*. Once you start typing the first few letters of a parameter name within a function call, try pressing the tab key. All possible arguments that start with those letters should show up, and you can scroll through to pick the right one, or keep typing until the argument you want is at the top of the list of choices, and then press the tab key again.
```
## In-course Exercise
### About the dataset
For today's class, you'll be using a dataset of all the guests on Jon Stewart's *The Daily Show*. This data was originally collected by Nate Silver's website, [FiveThirtyEight](http://fivethirtyeight.com), and is available on [FiveThirtyEight's GitHub page](https://github.com/fivethirtyeight/data/tree/master/daily-show-guests) under the [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/). The only change made to the original file was to add (commented) attribution information at the start of the file.
**First, check out a bit more about this data and its source:**
* Check out [the Creative Commons license](http://creativecommons.org/licenses/by/4.0/). What are we allowed to do with this data? What restrictions are there on using the data?
* It's often helpful to use prior knowledge to help check out or validate your dataset. One thing we might want to know about this data is if it covers the whole time that Jon Stewart hosted *The Daily Show*. Find out the dates he started and finished as host.
* Briefly browse around [FiveThirtyEight's GitHub data page](https://github.com/fivethirtyeight/data). What are some other datasets available that you find interesting? You can scroll to the bottom of the page to get to the compiled README.md content, which gives the full titles and links to relevant datasets. You can also click on any dataset to get more information.
* Look at [the GitHub page for this *Daily Show* data](https://github.com/fivethirtyeight/data/tree/master/daily-show-guests). How many columns will be in this dataset? What kind of information does the data include?
```{block, type = "rmdnote"}
In this exercise, you're using data posted by [FiveThirtyEight](http://fivethirtyeight.com) on [GitHub](https://github.com). We'll be using a lot of data that's on GitHub this semester, and GitHub is being used behind-the-scenes for both this book and the course note slides. We'll talk more about GitHub later, but you might find it interesting to explore a bit now. It's a place where people can post, work on, and share code in a number of programming languages-- it's been referred to as "Facebook for Nerds". You can search GitHub repositories and code specifically by programming language, so it can be a good way to find example R code from which to learn.
```
If you have extra time:
* Check out [the related article](http://fivethirtyeight.com/datalab/every-guest-jon-stewart-ever-had-on-the-daily-show/) on FiveThirtyEight. What are some specific questions they used this data to answer for this article?
* Who is Nate Silver?
### Getting the data onto your computer
First, download the data from GitHub onto your computer. Make a directory (folder) on your computer specifically for this course. I strongly recommend that you put it somewhere where the file path will not have any spaces in it. For example, putting it in your home directory under the name "R_Prog_Course" would be great; putting it in a directory called "R Prog Course" would not. Put the *Daily Show* data in your directory for this course.
**Take the following steps to get the data onto your computer**
* If you do not yet have a directory (folder) just for this course, make one someplace straightforward like in your home directory. Do not use any spaces in the directory name.
* Download the file [from GitHub](https://github.com/geanders/RProgrammingForResearch/blob/master/data/daily_show_guests.csv). Right click on `Raw` and then choose "Download linked file". Put the file into the directory you created for this course.
* Find out what your home directory is in R. To do this, make sure R is set to your home directory using `setwd("~")`, and then get R to print that home directory path using `getwd()`.
* Outside of R, open Finder (or your system's equivalent). Go to your home directory (mine, for example, is `/Users/brookeanderson`). Figure out how to get from your home directory to the directory you created for this course. For example, from my home directory, I would go to `RProgrammingForResearch` and then to `data`.
* Go back into R. Set R's working directory to your directory for this class using the `setwd` command, now that you know the pathname for the directory. For example, I would use `setwd("~/RProgrammingForResearch/data")`.
* Use the `list.files` command to make sure that the "daily_show_guests.csv" file is in your current working directory.
The full R code for this task might look something like:
```{r, eval=FALSE}
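setwd("~")                               # Set the working directory to your home directory
getwd()                                  # Print the path to your home directory
setwd("~/RProgrammingForResearch/data")  # Move to your directory for this course
getwd()                                  # Check that the working directory changed
list.files()                             # List the files in the working directory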
"daily_show_guests.csv" %in% list.files()
```
If you have extra time:
* See if you can figure out the last line of code in the example R code.
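One way to work out what that last line does is to try `%in%` on a small made-up vector. The sketch below uses a hypothetical vector of file names (`example_files`) standing in for the output of `list.files()`, so the file names other than "daily_show_guests.csv" are invented for illustration.

```{r, eval=FALSE}
# Hypothetical file names, standing in for the output of list.files()
example_files <- c("notes.txt", "daily_show_guests.csv", "homework1.R")

# Is the value on the left found anywhere in the vector on the right?
"daily_show_guests.csv" %in% example_files   # TRUE
"missing_file.csv" %in% example_files        # FALSE

# file.exists() is a related base R check that asks about a file directly,
# relative to the current working directory
in_wd <- file.exists("daily_show_guests.csv")
in_wd
```

With your real working directory, `in_wd` should be `TRUE` only if the download step worked and you set the working directory to the right place.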
### Getting the data into R
Now that you have the dataset in your working directory, you can read it into R. This dataset is in a *.csv* (comma separated values) format. (We will talk more about different file formats next week.) You can read csv files into R using the `read.csv` function.
**Read the data into your R session**
* Use the `read.csv` function to read the data into R and save it as the object `daily_show`.
* Use the help file for the `read.csv` function to figure out how this function works. To pull that up, use `?read.csv`. Why are we using the option `header = TRUE`? Can you figure out why we're using the `skip` option, and why it's set to 4?
* Note that you need to put the file name in quotation marks.
* What would have happened if you'd used `read.csv` but hadn't saved the result as the object `daily_show`? (For example, you'd run the code `read.csv("daily_show_guests.csv")` rather than `daily_show <- read.csv("daily_show_guests.csv")`.)
Example R code:
```{r, eval=FALSE}
daily_show <- read.csv("daily_show_guests.csv", header = TRUE, skip = 4)
```
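If the `header` and `skip` options are still unclear, it may help to try them on a tiny file you create yourself. The sketch below writes a made-up csv file (with invented guests and only two comment lines at the top) to a temporary location and then reads it back in. The object names `toy_file` and `toy_data` are just for illustration.

```{r, eval=FALSE}
# Write a small made-up csv file with two comment lines at the top
toy_file <- tempfile(fileext = ".csv")
writeLines(c("# Made-up attribution line 1",
             "# Made-up attribution line 2",
             "YEAR,Show,Raw_Guest_List",
             "1999,1/11/99,Guest A",
             "1999,1/12/99,Guest B"),
           con = toy_file)

# skip = 2 jumps over the two comment lines; header = TRUE then uses
# the next line as the column names
toy_data <- read.csv(toy_file, header = TRUE, skip = 2)
toy_data

# Without the assignment arrow, R prints the data but does not save it
read.csv(toy_file, header = TRUE, skip = 2)
```

Try changing `skip` to 0 or `header` to `FALSE` in this toy example to see how the result changes.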
If you have extra time:
* Say this was a really big dataset. You want to check out just the first 10 rows to make sure that you've got your code right before you take the time to pull in the whole dataset. Use the help file for `read.csv` to figure out how to only read in a few rows.
* Look through the help file for other options available for `read.csv`. Can you think of examples when some of these options would be useful?
Example R code:
```{r, eval=FALSE}
daily_show_first10 <- read.csv("daily_show_guests.csv", header = TRUE,
                               skip = 4, nrows = 10)
daily_show_first10
```
### Checking out the data
You now have the data available in the `daily_show` object. You'll want to check it out to make sure it read in correctly, and also to get a feel for the data. Throughout, you can use the help pages to figure out more about any of the functions being used (for example, `?dim`).
**Take the following steps to check out the dataset**
* Use indexing to look at the first two rows of the dataset. Based on this, what does each row "measure"? What information do you get for each "measurement"? As a reminder, indexing uses square brackets immediately after the object name. If the object has two dimensions, like a dataframe (rows and columns), you put the rows you want, then a comma, then the columns you want. If you want all rows (or columns), you leave that space blank. For example, if you wanted to get the first two rows and the first three columns, you'd use `daily_show[1:2, 1:3]`. If you wanted to get the first five rows and all the columns, you'd use `daily_show[1:5, ]`.
* Use the `dim` function to find out how many rows and columns this dataframe has. Based on what you found out about the data from the GitHub page, does it have the number of columns you expected? Based on what you know about the data (all the guests who came on The Daily Show with Jon Stewart), do you think it has about the right number of rows?
* Use the `head` function to look at the first few rows of the dataframe. Does it look like the rows go in order by date? What was the date of Jon Stewart's first show? Does it look like this dataset covers that first show?
* Use the `tail` function to look at the last few rows of the dataframe. What is the last show date covered by the dataframe? Who was the last guest?
Example R code:
```{r, eval=FALSE}
daily_show[1:2, ]
dim(daily_show)
head(daily_show)
tail(daily_show)
```
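If you'd like to practice these functions on something smaller first, here is a minimal sketch using a made-up dataframe (`toy_df`, which is not part of the course data). It follows the same pattern of rows first, then a comma, then columns.

```{r, eval=FALSE}
# A small made-up dataframe with 8 rows and 3 columns
toy_df <- data.frame(id = 1:8,
                     letter = letters[1:8],
                     value = seq(10, 80, by = 10))

toy_df[1:2, ]        # First two rows, all columns
toy_df[1:2, 1:2]     # First two rows, first two columns
toy_df[ , 3]         # All rows, third column only (returns a vector)

dim(toy_df)          # Number of rows, then number of columns
head(toy_df)         # First six rows by default
tail(toy_df, n = 3)  # Last three rows
```

Because you know exactly what went into `toy_df`, it is easy to check that each line returns what you expect before you try the same thing on `daily_show`.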
If you have extra time:
* Say you wanted to look at the first ten rows of the dataframe, rather than the first six. How could you use an option with `head` to do this?
Example R code:
```{r, eval=FALSE}
head(daily_show, n = 10)
```
### Using the data to answer questions
Nate Silver was a guest on *The Daily Show*. Let's use this data to figure out how many times he was a guest and when he was on the show.
**Find out more about Nate Silver on The Daily Show**
* Use the `subset` function to create a new dataframe that only has the rows of `daily_show` when Nate Silver was a guest. Put it in the object `nate_silver`.
* Print out the full `nate_silver` dataframe by typing `nate_silver`. (You could just use this to answer both questions, but still try the next steps. They would be important with a bigger dataset.)
* To count the number of times Nate Silver was a guest, you'll need to count the number of rows in the new dataset. You can either use the `dim` function or the `nrow` function to do this. What additional information does the `dim` function give you?
* To get the dates when Nate Silver was a guest, you can print out just the `Show` column of the dataframe. There are a few ways you can do this using indexing: `nate_silver[ , 3]` (since `Show` is the third column), `nate_silver[ , "Show"]`, or `nate_silver$Show`. Try these.
Example R code:
```{r, eval=FALSE}
nate_silver <- subset(daily_show,
                      Raw_Guest_List == "Nate Silver")
nate_silver
dim(nate_silver)
nrow(nate_silver)
nate_silver[ , 3]
nate_silver[ , "Show"]
```
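The same steps can be tried on a small made-up dataframe, which makes it easier to see that the three ways of pulling out a column give the same result. In this sketch, `toy_guests` and its column names are hypothetical and only meant to mimic the structure of the real data.

```{r, eval=FALSE}
# Made-up guest list; the column names here are hypothetical
toy_guests <- data.frame(year = c(2001, 2001, 2002, 2003),
                         show = c("3/1/01", "9/4/01", "5/6/02", "7/8/03"),
                         guest = c("Guest A", "Guest B", "Guest A", "Guest C"))

# Keep only the rows where the logical condition is TRUE
guest_a <- subset(toy_guests, guest == "Guest A")
guest_a

nrow(guest_a)   # Number of rows only
dim(guest_a)    # Number of rows and number of columns

# Three ways to pull out the `show` column
by_position <- guest_a[ , 2]
by_name <- guest_a[ , "show"]
by_dollar <- guest_a$show
identical(by_position, by_name)
identical(by_name, by_dollar)
```

Indexing by name (`"show"` or `$show`) is often safer than indexing by position, since it still works if the column order changes.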
If you have extra time:
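* When you print out the `Show` column, why does it also print out something underneath about Levels? Hint: This has to do with the class that R has saved this column as. What class is it currently? What other classes might we want to consider converting it to as we continue working with the dataset? Check out the example code below to get some ideas. (If you are using R 4.0.0 or later, `read.csv` reads text columns in as character vectors by default, so you will only see Levels if you read the data in with `stringsAsFactors = TRUE`.)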
* Was Nate Silver the only statistician to be a guest on the show?
* What were the occupations that were only represented by one guest visit? You can use the `table` function to count the number of times each value of `GoogleKnowlege_Occupation` shows up. Save these counts in a new object and then pull out only the values that equal 1 (so, occupations with only one guest visit). (Note that "Statistician" doesn't show up-- there was only one person who was a guest, but he had three visits.) Pick your favorite "one-off" example and find out who the guest was for that occupation.
Example R code:
```{r, eval=FALSE}
class(nate_silver$Show)
```
```{r, eval = FALSE}
range(nate_silver$Show)
```
```{r, eval=FALSE}
nate_silver$Show <- as.Date(nate_silver$Show,
                            format = "%m/%d/%y")
range(nate_silver$Show)
diff(range(nate_silver$Show)) # Time between Nate's first and last shows
```
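To see why converting to the Date class matters, you can compare how `range` behaves on dates stored as text and on the same dates after conversion. The dates in this sketch are made up, and `date_text` and `dates` are hypothetical object names.

```{r, eval=FALSE}
# Made-up dates stored as text, in month/day/two-digit-year form
date_text <- c("10/7/12", "3/15/08", "6/20/11")
class(date_text)

# As text, range() orders the values alphabetically, not by calendar date
range(date_text)

# Convert to the Date class, telling R how the text is laid out
dates <- as.Date(date_text, format = "%m/%d/%y")
class(dates)
range(dates)          # Earliest and latest calendar dates
diff(range(dates))    # Time between the earliest and latest dates
```

The `format` argument describes how the text is laid out: `%m` is the month, `%d` is the day, and `%y` is a two-digit year.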
```{r, eval=FALSE}
statisticians <- subset(daily_show,
                        GoogleKnowlege_Occupation == "Statistician")
statisticians
```
```{r, eval=FALSE}
num_visits <- table(daily_show[ , 2])
head(num_visits) # Note: This is a table (similar to a named vector) rather than a dataframe
names(num_visits[num_visits == 1])
subset(daily_show, GoogleKnowlege_Occupation == "chess player")
subset(daily_show, GoogleKnowlege_Occupation == "mathematician")
subset(daily_show, GoogleKnowlege_Occupation == "orca trainer")
subset(daily_show, GoogleKnowlege_Occupation == "Puzzle Creator")
subset(daily_show, GoogleKnowlege_Occupation == "Scholar")
```
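The counting step can be easier to follow on a short made-up vector. In this sketch, `occupations` is a hypothetical vector with one entry per guest visit; it is not taken from the real data.

```{r, eval=FALSE}
# Made-up occupations, one entry per guest visit
occupations <- c("actor", "author", "actor", "chef", "author", "pilot")

# Count how many times each value appears
visit_counts <- table(occupations)
visit_counts

# Keep only the counts that equal 1, then pull out the matching names
one_offs <- names(visit_counts[visit_counts == 1])
one_offs   # "chef" "pilot"
```

The logical test `visit_counts == 1` is used inside the square brackets to keep only the counts that equal 1, and `names` then gives the occupations that go with those counts.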