
# Exploring data #2

The video lectures for this chapter are embedded at relevant places in the text, 
with links to download a pdf of the associated slides for each video. 
You can also access [a full playlist for the videos for this chapter](https://www.youtube.com/playlist?list=PLuGPtwgRXxqIPno8Y8NMN1xpopJvK9RRQ).


```{r echo = FALSE, message = FALSE, warning = FALSE}
library(tidyverse)
library(knitr)
library(faraway)
data(worldcup)
library(ggthemes)
```

---

## Simple statistical tests in R

<iframe width="667" height="417" src="https://www.youtube.com/embed/JrM0bFff7Pw?list=PLuGPtwgRXxqIPno8Y8NMN1xpopJvK9RRQ" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week7_part_2.pdf) 
a pdf of the lecture slides for this video.

<iframe width="667" height="417" src="https://www.youtube.com/embed/QyQTFumtLCE?list=PLuGPtwgRXxqIPno8Y8NMN1xpopJvK9RRQ" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week7_part_3.pdf) 
a pdf of the lecture slides for this video.

Let's pull the fatal accident data just for the county that includes Las Vegas, NV. 
Each US county has a unique identifier (FIPS code), composed of a two-digit state FIPS and a three-digit county FIPS code. The state FIPS for Nevada is 32; the county FIPS for Clark County is 003. Therefore, we can filter down to Clark County data in the FARS data we collected with the following code:

```{r message = FALSE, error = FALSE}
library(readr)
library(dplyr)
clark_co_accidents <- read_csv("data/accident.csv") %>% 
  filter(STATE == 32 & COUNTY == 3)
```
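As a minimal sketch of how the two pieces of a FIPS code fit together, the base R function `sprintf` can pad each piece with leading zeros and paste them into the five-digit county identifier. The object names here (`state_fips`, `county_fips`, `full_fips`) are just for illustration and are not columns in the FARS data:

```{r}
state_fips <- 32   # Nevada
county_fips <- 3   # Clark County

# Pad the state code to two digits and the county code to three digits
full_fips <- sprintf("%02d%03d", state_fips, county_fips)
full_fips
```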

We can also check the number of accidents: 

```{r}
clark_co_accidents %>% 
  count()
```

We want to test whether a fatal accident is more likely to occur on a Friday or Saturday than on other days of the week. Let's start by cleaning up the data a bit: 

```{r message = FALSE, warning = FALSE}
library(tidyr)
library(lubridate)
clark_co_accidents <- clark_co_accidents %>% 
  select(DAY, MONTH, YEAR) %>% 
  unite(date, DAY, MONTH, YEAR, sep = "-") %>% 
  mutate(date = dmy(date))
```

Here's what the data looks like now: 

```{r}
clark_co_accidents %>% 
  slice(1:5)
```

Next, let's get the count of accidents by date: 

```{r}
clark_co_accidents <- clark_co_accidents %>% 
  group_by(date) %>% 
  count() %>% 
  ungroup()
clark_co_accidents %>% 
  slice(1:3)
```

We're missing the dates without a fatal crash, so let's add those. First, create a dataframe
with all dates in 2016:

```{r}
all_dates <- tibble(date = seq(ymd("2016-01-01"), 
                                   ymd("2016-12-31"), by = 1))
all_dates %>% 
  slice(1:5)
```

Then merge this with the original dataset on Las Vegas fatal crashes and make any day missing from the fatal crashes dataset have a "0" for number of fatal accidents (`n`):

```{r}
clark_co_accidents <- clark_co_accidents %>% 
  right_join(all_dates, by = "date") %>% 
  # If `n` is missing, set to 0. Otherwise keep value.
  mutate(n = ifelse(is.na(n), 0, n))
clark_co_accidents %>% 
  slice(1:3)
```

Next, let's add some information about day of week and weekend: 

```{r}
clark_co_accidents <- clark_co_accidents %>% 
  mutate(weekday = wday(date, label = TRUE), 
         weekend = weekday %in% c("Fri", "Sat"))
clark_co_accidents %>% 
  slice(1:3)
```

Now let's calculate the probability that a day has at least one fatal crash, separately for weekends and weekdays: 

```{r}
clark_co_accidents <- clark_co_accidents %>% 
  mutate(any_crash = n > 0)
crash_prob <- clark_co_accidents %>% 
  group_by(weekend) %>% 
  summarize(n_days = n(),
            crash_days = sum(any_crash)) %>% 
  mutate(prob_crash_day = crash_days / n_days)
crash_prob
```

In R, you can use `prop.test` to test if two proportions are equal. Inputs include the number of "successes" in each group (`x =`) and the total number of trials in each group (`n =`):

```{r}
prop.test(x = crash_prob$crash_days, 
          n = crash_prob$n_days)
```
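By default, `prop.test` runs a two-sided test (are the two proportions different?). Since our question is directional, one option is a one-sided test, which you can request with the `alternative` argument. The following is only a sketch: the counts of crash days are made up for illustration and are not the Clark County results, although the numbers of days are real (2016 had 106 Fridays and Saturdays and 260 other days). With `alternative = "greater"`, the alternative hypothesis is that the proportion in the first group is larger than in the second, so the order of the groups matters:

```{r}
# Hypothetical counts of days with at least one fatal crash
example_crash_days <- c(weekend = 70, other = 140)
# Number of days of each type in 2016
example_n_days <- c(weekend = 106, other = 260)

example_test <- prop.test(x = example_crash_days,
                          n = example_n_days,
                          alternative = "greater")
example_test$p.value
```

In a dataframe grouped on a logical column, the `FALSE` group is listed first, so check the order of the rows before picking between `"greater"` and `"less"`.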

I won't be teaching in this course how to find the correct statistical test. That's 
something you'll hopefully learn in a statistics course. There are also a variety of books that can help you with this, including some that you can access free online through CSU's library. One serviceable introduction is "Statistical Analysis with R for Dummies".

You can create an object from the output of any statistical test in R. Typically, this will be (at least at some level) in an object class called a "list":

```{r}
vegas_test <- prop.test(x = crash_prob$crash_days, 
                        n = crash_prob$n_days)
is.list(vegas_test)
```

So far, we've mostly worked with two object types in R, **dataframes** and **vectors**. In the next subsection we'll look more at two object classes we haven't looked at much,
**matrices** and **lists**. Both have important roles once you start applying more
advanced methods to analyze your data. 

<iframe width="667" height="417" src="https://www.youtube.com/embed/YvTdwfgRHqU?list=PLuGPtwgRXxqIPno8Y8NMN1xpopJvK9RRQ" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week7_part_1.pdf) 
a pdf of the lecture slides for this video.

## Matrices

A matrix is like a data frame, but all the values in all columns must be of the same class (e.g., numeric, character). R uses matrices a lot for its underlying math (e.g., for the linear algebra operations required for fitting regression models). R can do matrix operations quite quickly.

You can create a matrix with the `matrix` function. Input a vector with the values to fill the matrix and `ncol` to set the number of columns:

```{r cab}
foo <- matrix(1:10, ncol = 5)
foo
```

By default, the matrix will fill up by column. You can fill it by row by setting the `byrow` argument to `TRUE`:

```{r cac}
foo <- matrix(1:10, ncol = 5, byrow = TRUE)
foo
```
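To give a flavor of the matrix operations that R uses for its underlying math, here is a small sketch using base R's transpose function (`t`) and matrix multiplication operator (`%*%`). The object name `small_mat` is made up for this example:

```{r}
small_mat <- matrix(1:6, ncol = 3)  # 2 rows, 3 columns, filled by column
dim(small_mat)

t(small_mat)                        # transpose: 3 rows, 2 columns
small_mat %*% t(small_mat)          # matrix product: 2 rows, 2 columns
```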

In certain situations, you might want to work with a matrix instead of a data frame (for example, in cases where you were concerned about speed -- a matrix is more memory efficient than the corresponding data frame). If you want to convert a data frame to a matrix, you can use the `as.matrix` function:

```{r cad}
foo <- tibble(col_1 = 1:2, col_2 = 3:4,
                  col_3 = 5:6, col_4 = 7:8,
                  col_5 = 9:10)
(foo <- as.matrix(foo))
```

You can index matrices with square brackets, just like data frames:

```{r cae}
foo[1, 1:2]
```

You cannot, however, use `dplyr` functions with matrices:

```{r caf, eval = FALSE}
foo %>% filter(col_1 == 1)
```

All elements in a matrix must have the same class. 

The matrix will default to make all values the most general class of any of the values, in any column. For example, if we replaced one numeric value with the character "a", everything would turn into a character:

```{r cag}
foo[1, 1] <- "a"
foo
```
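The same thing happens if you convert a data frame whose columns have different classes. As a small sketch (the data frame `mixed_df` is made up for this example), converting a data frame with one numeric column and one character column gives a matrix in which every value is a character string:

```{r}
mixed_df <- data.frame(id = 1:2, name = c("a", "b"))
mixed_mat <- as.matrix(mixed_df)

mixed_mat
typeof(mixed_mat)   # the type shared by every value in the matrix
```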

---

## Lists

A list has different elements, just like a data frame has different columns. However, the different elements of a list can have different lengths (unlike the columns of a data frame). The different elements can also have different classes.

```{r cah}
bar <- list(some_letters = letters[1:3],
            some_numbers = 1:5, 
            some_logical_values = c(TRUE, FALSE))
bar
```

To index an element from a list, use double square brackets. You can use bracket indexing either with numbers (which element in the list?) or with names:

```{r cai}
bar[[1]]
bar[["some_numbers"]]
```

You can also index lists with the `$` operator:

```{r caii}
bar$some_logical_values
```

To access a specific value within a list element, you can add a second index after the one that selects the element. For example, to get the second value in the first element:

```{r caiii}
bar[[1]][[2]]
```
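One detail worth checking for yourself is the difference between single and double square brackets. In the sketch below (using a small list, `baz`, made up for this example), single brackets return a shorter list, while double brackets return the element itself:

```{r}
baz <- list(some_letters = letters[1:3],
            some_numbers = 1:5)

class(baz["some_numbers"])    # single brackets: still a list
class(baz[["some_numbers"]])  # double brackets: the vector stored in the list
```

This matters when you pass the result on to another function, since many functions expect a vector rather than a list.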

Lists can be used to contain data with an unusual structure and / or lots of different components. For example, the information from fitting a regression is often stored as a list:

```{r caj}
my_mod <- glm(rnorm(10) ~ c(1:10))
is.list(my_mod)
```

The `names` function returns the name of each element in the list:

```{r cak}
head(names(my_mod), 3)
my_mod[["coefficients"]]
```

A list can even contain other lists. We can use the `str` function to see the structure of a list:

```{r cakk}
a_list <- list(list("a", "b"), list(1, 2))

str(a_list)
```

Sometimes you'll see unnecessary lists-of-lists, perhaps created when importing data into R, or a list with multiple elements that you would like to combine. You can remove a level of hierarchy from a list using the `flatten` function from the `purrr` package:

```{r cakl, warning = FALSE}
library(purrr)
a_list
flatten(a_list)
```

Let's look at the list object from the statistical test we ran for Las Vegas: 

```{r}
str(vegas_test)
```

Using `str` to print out the list's structure doesn't produce the easiest-to-digest output. We can use the `jsonedit` function from the `listviewer` package to create a widget in the `viewer` pane to more easily explore our list.

```{r, eval=FALSE}
library(listviewer)
jsonedit(vegas_test)
```


We can pull out an element using the `$` notation: 

```{r}
vegas_test$p.value
```

Or using the `[[` notation:

```{r}
vegas_test[[4]]
```

You may have noticed, though, that this output is not a tidy dataframe. 
Ack! That means we can't use all the tidyverse tricks we've learned so far in the course!
Fortunately, David Robinson noticed this problem and came up with a package called `broom` that can "tidy up" a lot of these kinds of objects.

The `broom` package has three main functions: 

- `glance`: Return a one-row, tidy dataframe from a model or other R object
- `tidy`: Return a tidy dataframe from a model or other R object
- `augment`: "Augment" the dataframe you input to the statistical function

Here is the output for `tidy` for the `vegas_test` object (`augment`
won't work for this type of object, and `glance` gives the same thing as `tidy`): 

```{r}
library(broom)
tidy(vegas_test)
```
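The same function works on output from other statistical tests. As a sketch, here is `tidy` applied to a t-test on the `sleep` dataset that comes with R (this dataset and the object names `sleep_test` and `sleep_tidy` are not part of the Las Vegas example):

```{r}
library(broom)

sleep_test <- t.test(extra ~ group, data = sleep)
sleep_tidy <- tidy(sleep_test)

sleep_tidy
sleep_tidy$p.value
```

Because the result is a dataframe, the p-value is now just a column that you can pull out, filter on, or plot.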

---

## Apply a test multiple times

<iframe width="738" height="461" src="https://www.youtube.com/embed/RvPS6HZD3t8?list=PLuGPtwgRXxqIPno8Y8NMN1xpopJvK9RRQ" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week7_part_6.pdf) 
a pdf of the lecture slides for this video.

<iframe width="646" height="404" src="https://www.youtube.com/embed/PWLzdd_n61o?list=PLuGPtwgRXxqIPno8Y8NMN1xpopJvK9RRQ" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week7_part_7.pdf) 
a pdf of the lecture slides for this video.

<iframe width="646" height="404" src="https://www.youtube.com/embed/djNbWXhhDRs?list=PLuGPtwgRXxqIPno8Y8NMN1xpopJvK9RRQ" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week7_part_8.pdf) 
a pdf of the lecture slides for this video.

Often, we don't want to apply a statistical test just once to our entire data set. Instead, we may want to apply it separately to each of several subsets of the data.

Let's look at an example from the `microbiome` package.

```{r, warning = FALSE, message = FALSE}
library(microbiome)
data(peerj32)
print(names(peerj32))
```

Like we saw before, the list-like `phyloseq` objects require a little tidying before we can use them easily.

```{r}
library(dplyr)
peerj32_tibble <- (psmelt(peerj32$phyloseq)) %>%
  dplyr::select(-Sample) %>%
  rename_all(tolower) %>%
  as_tibble()
```

```{r, echo=FALSE}
slice(peerj32_tibble, 1:3)
```

We can make a plot from the resulting tibble to help us explore the data.

```{r, fig.height=8, fig.width=10}
library(ggplot2)
ggplot(peerj32_tibble, aes(x = subject, y = abundance, color = sex)) +
  geom_point() +
  theme_bw() +
  facet_wrap(~phylum, ncol = 4, scales = "free") +
  theme(axis.text.x = element_blank(),
        legend.position = "top") +
  xlab("Subject ID")
```

We can group the data by `phylum` and `group` and then use the `nest` function from the `tidyr` package to create a new dataframe containing the nested data:

```{r}
library(tidyr)
nested_data <- peerj32_tibble %>%
  group_by(phylum, group) %>%
  nest()
```

```{r}
slice(nested_data, 1:3)
```

We can then use the `map` function from the `purrr` package to apply functions to each nested dataframe. Let's start by counting the rows in each nested dataframe and keeping only the dataframes with more than 25 rows.

```{r}
library(purrr)

filtered_data <- nested_data %>%
  mutate(n_rows = purrr::map_int(data, nrow)) %>%  # integer column, not a list column
  filter(n_rows > 25)
```

Now let's perform a t-test on each of the nested dataframes. Remember, each nested dataframe is one unique combination of phylum and group.

```{r}
t_tests <- filtered_data %>%
  mutate(t_test = purrr::map(data, ~t.test(abundance ~ sex, data = .x)))
```

```{r}
t_tests
```

The resulting dataframe contains a new column, which contains the model objects. Just as we mapped the `t.test` function, we can map the *tidying* functions from the `broom` package to extract the information we want from the nested model objects.

```{r}
summary <- t_tests %>%
  mutate(summary = purrr::map(t_test, broom::glance))
```

```{r}
summary
```

The final step is to return a tidy dataframe. We can do this using the `unnest` function from the `tidyr` package. Note: it is also wise to `ungroup` after you are done operating on your grouped variables. You may run into problems later on if you forget that your dataframe is grouped!

```{r}
unnested <- summary %>%
  unnest(summary) %>%
  ungroup() %>%
  dplyr::select(group, phylum, p.value)
```

```{r}
unnested
```
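The whole nest, map, tidy, and unnest sequence can also be written as a single pipeline. Here is a compact sketch on the `mtcars` dataset that comes with R, so you can run it without the `microbiome` package. It tests, within each number of cylinders (`cyl`), whether fuel efficiency (`mpg`) differs by transmission type (`am`). The choice of dataset and the object name `mtcars_tests` are assumptions made for this illustration:

```{r}
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

mtcars_tests <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%                                   # one row per value of cyl
  mutate(n_rows = map_int(data, nrow),         # rows in each nested dataframe
         t_test = map(data, ~ t.test(mpg ~ am, data = .x)),
         tidied = map(t_test, tidy)) %>%       # one-row dataframe per test
  unnest(tidied) %>%
  ungroup() %>%
  dplyr::select(cyl, n_rows, p.value)

mtcars_tests
```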

We can then plot the results. Let's do this for the "LGG" (non-placebo) group, using our principles for good graphics.

```{r}
p_data <- unnested %>%
  filter(group == "LGG") %>%
  mutate(phylum = fct_reorder(phylum, p.value))
  
ggplot(p_data, aes(y = p.value, x = phylum)) +
  facet_wrap(~group) +
  geom_bar(stat="identity", fill = "grey90", color = "grey20") +
  coord_flip() +
  theme_bw() +
  ggtitle("Difference in abundance by gender") +
  xlab("") + ylab("p-value from t-test")
```

## Regression models

<iframe width="667" height="417" src="https://www.youtube.com/embed/qGL6kdJIVU0?list=PLuGPtwgRXxqIPno8Y8NMN1xpopJvK9RRQ" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week7_part_4.pdf) 
a pdf of the lecture slides for this video.

<iframe width="667" height="417" src="https://www.youtube.com/embed/XW8QB8nlSIc?list=PLuGPtwgRXxqIPno8Y8NMN1xpopJvK9RRQ" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week7_part_5.pdf) 
a pdf of the lecture slides for this video.

### Formula structure

*Regression models* can be used to estimate how the expected value of a *dependent variable* changes as *independent variables* change.

In R, regression formulas take this structure:

```{r eval = FALSE}
## Generic code
[response variable] ~ [indep. var. 1] +  [indep. var. 2] + ...
```

Notice that a tilde, `~`, is used to separate the independent and dependent variables and that a plus sign, `+`, is used to join independent variables. This format mimics the statistical notation:

$$
Y_i \sim X_1 + X_2 + X_3
$$

You will use this type of structure in R for a lot of different function calls, including those for linear models (fit with the `lm` function) and generalized linear models (fit with the `glm` function).

There are some conventions that can be used in R formulas. Common ones include:

```{r echo = FALSE}
for_convs <- tibble(Convention = c("`I()`", "`:`", "`*`", "`.`",
                                   "`1`", "`-`"),
                    Meaning = c("evaluate the formula inside `I()` before fitting (e.g., `I(x1 + x2)`)",
                                "fit the interaction between `x1` and `x2` variables",
                                "fit the main effects and interaction for both variables (e.g., `x1*x2` equals `x1 + x2 + x1:x2`)",
                                "include as independent variables all variables other than the response (e.g., `y ~ .`)",
                                "intercept (e.g., `y ~ 1` for an intercept-only model)",
                                "do not include a variable in the data frame as an independent variable (e.g., `y ~ . - x1`); usually used in conjunction with `.` or `1`"))
pander::pander(for_convs, split.cells = c(1,1,58),
               justify = c("center", "left"))
```
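You can check how R expands one of these conventions without fitting a model. As a small sketch, the base R function `terms` lists the terms in a formula, which lets you confirm that `x1 * x2` is shorthand for `x1 + x2 + x1:x2`. The variable names `y`, `x1`, and `x2` are placeholders and do not need to exist in your R session:

```{r}
labels_star <- attr(terms(y ~ x1 * x2), "term.labels")
labels_long <- attr(terms(y ~ x1 + x2 + x1:x2), "term.labels")

labels_star
identical(labels_star, labels_long)
```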

### Linear models

To fit a linear model, you can use the function `lm()`. This function is part of the `stats` package, which comes installed with base R. In this function, you can use the `data` option to specify the data frame from which to get the vectors. 

```{r, echo=FALSE}
library(forcats)
nepali_fct <- as_tibble(nepali) %>%
              mutate(sex = fct_recode(factor(sex), Male = "1", Female = "2"))
```

```{r}
mod_a <- lm(wt ~ ht, data = nepali)
```

This previous call fits the model:

$$ Y_{i} = \beta_{0} + \beta_{1}X_{1,i} + \epsilon_{i} $$

where:

- $Y_{i}$ : weight of child $i$
- $X_{1,i}$ : height of child $i$


If you run the `lm` function without saving it as an object, R will fit the regression and print out the function call and the estimated model coefficients:

```{r}
lm(wt ~ ht, data = nepali)
```

However, to be able to use the model later for things like predictions and model assessments, you should save the output of the function as an R object:

```{r}
mod_a <- lm(wt ~ ht, data = nepali)
```

This object has a special class, `lm`:

```{r}
class(mod_a)
```

This class is a special type of list object. If you use `is.list` to check, you can confirm that this object is a list:

```{r}
is.list(mod_a)
```

There are a number of functions that you can apply to an `lm` object. These include:

```{r echo = FALSE}
mod_objects <- tibble(Function = c("`summary`", "`coefficients`", 
                                   "`fitted`",
                                   "`plot`", "`residuals`"),
                          Description = c("Get a variety of information on the model, including coefficients and p-values for the coefficients",
                                   "Pull out just the coefficients for a model",
                                   "Get the fitted values from the model (for the data used to fit the model)",
                                   "Create plots to help assess model assumptions",
                                   "Get the model residuals"))
pander::pander(mod_objects, split.cells = c(1,1,58),
               justify = c("center", "left"))
```

For example, you can get the coefficients from the model by running:

```{r}
coefficients(mod_a)
```

The estimated coefficient for the intercept is always given under the name "(Intercept)". Estimated coefficients for independent variables are given based on their column names in the original data ("ht" here, for $\beta_1$, or the estimated increase in expected weight for a one unit increase in height).
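To make the interpretation of the coefficients concrete, here is a sketch that uses them to compute an expected value by hand and then compares that to the result from the `predict` function. It uses the `cars` dataset that comes with R (stopping distance versus speed) rather than the `nepali` data, and the object names are made up for this example:

```{r}
cars_mod <- lm(dist ~ speed, data = cars)
cars_coef <- coefficients(cars_mod)

# Expected stopping distance at a speed of 10: intercept + slope * 10
by_hand <- cars_coef[["(Intercept)"]] + cars_coef[["speed"]] * 10
from_predict <- predict(cars_mod, newdata = data.frame(speed = 10))

all.equal(by_hand, unname(from_predict))
```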

You can use the output from a `coefficients` call to plot a regression line based on the model fit on top of points showing the original data (Figure \@ref(fig:modelcoefplot)). 

```{r modelcoefplot, fig.height = 3.5, fig.width = 5, warning = FALSE, fig.align = "center", fig.cap = "Example of using the output from a coefficients call to add a regression line to a scatterplot."}
mod_coef <- coefficients(mod_a)
ggplot(nepali, aes(x = ht, y = wt)) + 
  geom_point(size = 0.2) + 
  xlab("Height (cm)") + ylab("Weight (kg)") + 
  geom_abline(aes(intercept = mod_coef[1],
                  slope = mod_coef[2]), col = "blue")
```

```{block, type = "rmdnote"}
You can also add a linear regression line to a scatterplot by adding the geom `geom_smooth` using the argument `method = "lm"`.
```

You can use the function `residuals` on an `lm` object to pull out the residuals from the model fit:

```{r}
head(residuals(mod_a))
```

The result of a `residuals` call is a vector with one element for each of the non-missing observations (rows) in the data frame you used to fit the model. Each value gives the difference between the observed value and the model fitted value for that observation. The residuals are in the same order as the observations in the original data frame.

```{block, type = "rmdtip"}
You can also use the shorter function `coef` as an alternative to `coefficients` and the shorter function `resid` as an alternative to `residuals`.
```
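If you want to convince yourself of what the residuals are, you can calculate them by hand as the observed values minus the fitted values. This sketch uses the `cars` dataset that comes with R, which has no missing values, so the rows of the data line up one-to-one with the fitted values. The object names are made up for this example:

```{r}
cars_mod <- lm(dist ~ speed, data = cars)

# Residual = observed value - fitted value
by_hand_resid <- cars$dist - fitted(cars_mod)

all.equal(unname(by_hand_resid), unname(resid(cars_mod)))
```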

As noted in the subsection on simple statistics functions, the `summary` function returns different output depending on the type of object that is input to the function. If you input a regression model object to `summary`, the function gives you a lot of information about the model. For example, here is the output returned by running `summary` for the linear regression model object we just created:

```{r}
summary(mod_a)
```

This output includes a lot of useful elements, including (1) basic summary statistics for the residuals (to meet model assumptions, the median should be around zero and the absolute values fairly similar for the first and third quartiles), (2) coefficient estimates, standard errors, and p-values, and (3) some model summary statistics, including residual standard error, degrees of freedom, number of missing observations, and F-statistic.

The object returned by the `summary()` function when it is applied to an `lm` object is a list, which you can confirm using the `is.list` function:

```{r}
is.list(summary(mod_a))
```

With any list, you can use the `names` function to get the names of all of the different elements of the object:

```{r}
names(summary(mod_a))
```

You can use the `$` operator to pull out any element of the list. For example, to pull out the table with information on the estimated model coefficients, you can run:

```{r}
summary(mod_a)$coefficients
```

The `plot` function, like the `summary` function, will give different output depending on the class of the object that you input. For an `lm` object, you can use the `plot` function to get a number of useful diagnostic plots that will help you check regression assumptions (Figure \@ref(fig:plotlmexample)):

```{r eval = FALSE}
plot(mod_a)
```

```{r plotlmexample, echo = FALSE, out.width = '\\textwidth', fig.align = "center", fig.cap = "Example output from running the plot function with an lm object as the input."}
oldpar <- par(mfrow = c(2, 2))
plot(mod_a)
par(oldpar)
```

You can also use binary variables or factors as independent variables in regression models. For example, in `nepali_fct`, a version of the `nepali` dataset in which `sex` has been recoded as a factor variable with the levels "Male" and "Female", you can fit a linear model of weight regressed on sex with the call:

```{r}
mod_b <- lm(wt ~ sex, data = nepali_fct)
```

This call fits the model:

$$ Y_{i} = \beta_{0} + \beta_{1}X_{1,i} + \epsilon_{i} $$

where $X_{1,i}$ is the sex of child $i$, coded as 0 for male and 1 for female. 

Here are the estimated coefficients from fitting this model:

```{r}
summary(mod_b)$coefficients
```

You'll notice that, in addition to an estimated intercept (`(Intercept)`), the other estimated coefficient is `sexFemale` rather than just `sex`, although the column name in the data frame input to `lm` for this variable is `sex`. 

This is because, when a factor or binary variable is input as an independent variable in a linear regression model, R will fit an estimated coefficient for all levels of factors *except* the first factor level. By default, this first factor level is used as the baseline level, and so its estimated mean is given by the estimated intercept, while the other model coefficients give the estimated *difference* from this baseline. 

For example, the model fit above tells us that the estimated mean weight of males is `r round(coef(mod_b)[1], 1)`, while the estimated mean weight of females is `r round(coef(mod_b)[1], 1)` + `r round(coef(mod_b)[2], 1)` = `r round(coef(mod_b)[1], 1) + round(coef(mod_b)[2], 1)`.
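You can check this interpretation against the group means themselves. The following sketch uses the `mtcars` dataset that comes with R, with transmission type (`am`) recoded as a factor. The dataset, the factor labels, and the object names are assumptions made for this example:

```{r}
mtcars_fct <- transform(mtcars,
                        am = factor(am, levels = c(0, 1),
                                    labels = c("automatic", "manual")))
am_mod <- lm(mpg ~ am, data = mtcars_fct)

coef(am_mod)                                     # intercept and difference
tapply(mtcars_fct$mpg, mtcars_fct$am, mean)      # mean of mpg in each group
```

The intercept matches the mean for the baseline level ("automatic"), and the intercept plus the second coefficient matches the mean for the other level.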

<!-- If you would prefer that a different level of the factor be the baseline (for example, "Female" rather than "Male" for the previous regression), you can do that by using the `levels` argument in the `factor` function to reset factor levels. For example: -->

<!-- ```{r} -->
<!-- nepali_reset <- nepali %>% -->
<!--   mutate(sex = factor(sex, levels = c("2", "1"))) -->

<!-- mod_b_reset <- lm(wt ~ sex, data = nepali_reset) -->
<!-- summary(mod_b_reset)$coef -->
<!-- ``` -->

<!-- Now, `(Intercept)` gives the estimated mean weight for females, while the second estimated coefficient gives the estimated mean difference for males compared to the expected value for females. -->

---

## Handling model objects

The `broom` package contains tools for converting statistical objects into nice tidy data frames. The tools in the `broom` package make it easy to process statistical results in R using the tools of the `tidyverse`.

### broom::tidy

The `tidy()` function returns a data frame with information on the fitted model terms. For example, when applied to one of our linear models we get:

```{r tidy_example}
  library(broom)
  kable(tidy(mod_a), digits = 3)
```

```{r tidy_class}
  class(tidy(mod_a))
```

You can pass arguments to the `tidy()` function. For example, include confidence intervals:

```{r tidy_cis}
  kable(tidy(mod_a, conf.int = TRUE), digits = 3)
```

---

### broom::augment

The `augment()` function adds information about a fitted model to the dataset used to fit the model. For example, when applied to one of our linear models we get information on the fitted values and residuals included in the output:

```{r augment_example, warning = FALSE}
  kable(head(broom::augment(mod_a), 3), digits = 3)
```

---

### broom::glance

The `glance()` function returns a one-row summary of a fitted model object. For example:

```{r glance_example, warning = FALSE}
  kable(glance(mod_a), digits = 3)
```
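One way to keep the three functions straight is by the number of rows each returns. As a sketch, using a model fit to the `cars` dataset that comes with R (50 observations; the object name `cars_mod` is made up for this example):

```{r}
library(broom)

cars_mod <- lm(dist ~ speed, data = cars)

nrow(tidy(cars_mod))     # one row per model term (intercept and slope)
nrow(augment(cars_mod))  # one row per observation used to fit the model
nrow(glance(cars_mod))   # one row for the whole model
```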

---

### References-- statistics in R

One great (and free online for CSU students through our library) book to find out more about using R for basic statistics is:

- [Introductory Statistics with R](http://discovery.library.colostate.edu/Record/.b44705323)

If you want all the details about fitting linear models and GLMs in R, Julian Faraway's books are fantastic. He has one on linear models and one on extensions including logistic and Poisson models:

- [Linear Models with R](http://discovery.library.colostate.edu/Record/.b41119691) (also free online through the CSU library)
- [Extending the Linear Model with R](http://www.amazon.com/Extending-Linear-Model-Generalized-Nonparametric/dp/158488424X/ref=sr_1_1?ie=UTF8&qid=1442252668&sr=8-1&keywords=extending+linear+model+r)

---

## Functions

<iframe width="646" height="404" src="https://www.youtube.com/embed/4QAKphImJd8?list=PLuGPtwgRXxqIPno8Y8NMN1xpopJvK9RRQ" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week7_part_9.pdf) 
a pdf of the lecture slides for this video.

<iframe width="646" height="404" src="https://www.youtube.com/embed/LRdh_FMaIUc?list=PLuGPtwgRXxqIPno8Y8NMN1xpopJvK9RRQ" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week7_part_10.pdf) 
a pdf of the lecture slides for this video.

<iframe width="646" height="404" src="https://www.youtube.com/embed/ph3-4ILsL3Q?list=PLuGPtwgRXxqIPno8Y8NMN1xpopJvK9RRQ" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week7_part_11.pdf) 
a pdf of the lecture slides for this video.

As you move to larger projects, you will find yourself using the same code a lot.

Examples include: 

- Reading in data from a specific type of equipment (air pollution monitor, accelerometer)
- Running a specific type of analysis
- Creating a specific type of plot or map


If you find yourself cutting and pasting a lot, convert the code to a function.

Advantages of writing functions include: 

- Coding is more efficient
- Easier to change your code (if you've cut and pasted code and you want to change something, you have to change it everywhere, which is an easy way to accidentally create bugs in your code)
- Easier to share code with others

You can name a function anything you want (although try to avoid names of preexisting functions). You then define any inputs (arguments; separate multiple arguments with commas) and put the code to run in braces:

```{r, eval = FALSE}
## Note: this code will not run
[function name] <- function([any arguments]){
        [code to run]
}
```

Here is an example of a very basic function. This function takes a number as input and adds 1 to that number.

```{r}
add_one <- function(number){
  out <- number + 1
  return(out)
}

add_one(number = 3)
add_one(number = -1)
```

- Functions can input any type of R object (for example, vectors, data frames, even other functions and ggplot objects)
- Similarly, functions can output any type of R object
- When defining a function, you can set default values for some of the parameters
- You can explicitly specify the value to return from the function

For example, the following function inputs a data frame (`datafr`) and a one-element vector (`child_id`) and returns only rows in the data frame where its `id` column matches `child_id`. It includes a default value for `datafr`, but not for `child_id`. 

```{r}
subset_nepali <- function(datafr = nepali, child_id){
  datafr <- datafr %>%
            filter(id == child_id)
  return(datafr)
}
```

If an argument is not given for a parameter with a default, the function will run using the default value for that parameter. For example:

```{r}
subset_nepali(child_id = "120011")
```


If an argument is not given for a parameter without a default, the function call will result in an error. For example:

```{r error = TRUE}
subset_nepali(datafr = nepali)
```

By default, a function returns the value of the last expression it evaluates, although whether you use `return` can affect printing behavior when you run the function. (When the last expression is an assignment, the value is returned invisibly.) For example, I could have written the `subset_nepali` function like this:

```{r}
subset_nepali <- function(datafr = nepali, child_id){
  datafr <- datafr %>%
            filter(id == child_id)
}
```

In this case, the output will not automatically print out when you call the function without assigning it to an R object:

```{r}
subset_nepali(child_id = "120011")
```

However, the output can be assigned to an R object in the same way as when the function was defined with `return`:

```{r}
first_childs_data <- subset_nepali(child_id = "120011")
first_childs_data
```

```{block, type = 'rmdwarning'}
R is very "good" at running functions! It will look for (scope) the variables in your function in various places (environments). So your functions may run even when you don't expect them to, potentially, with unexpected results!
```
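Here is a small sketch of what that warning can look like in practice. The object and function names (`multiplier`, `scale_value`) are hypothetical. The function uses a variable that is not one of its parameters, so R looks for it outside the function and finds it in the global environment:

```{r}
multiplier <- 10

# `multiplier` is not a parameter of this function, so R searches
# for it in the environment where the function was defined
scale_value <- function(x){
  x * multiplier
}

scale_value(2)

# Changing the global object silently changes what the function returns
multiplier <- 100
scale_value(2)
```

One way to avoid this kind of surprise is to pass everything the function needs as an explicit parameter.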

The `return` function can also be used to return an object other than the last defined object (although this doesn't tend to be something you need to do very often). If you do not use `return`, as in the following code, the function will output "Test output":

```{r}
subset_nepali <- function(datafr = nepali, child_id){
  datafr <- datafr %>%
            filter(id == child_id)
  a <- "Test output"
}
(subset_nepali(child_id = "120011"))
```

Conversely, you can use `return` to output `datafr`, even though it's not the last object defined:

```{r}
subset_nepali <- function(datafr = nepali, child_id){
  datafr <- datafr %>%
            filter(id == child_id)
  a <- "Test output"
  return(datafr)
}
subset_nepali(child_id = "120011")
```

---

### if / else statements

There are other control structures you can use in your R code. Two that you will commonly use within R functions are `if` and `if` / `else` statements.

An `if` statement tells R that, **if** a certain condition is true, **do** run some code. For example, if you wanted to print out only odd numbers between 1 and 5, one way to do that is with an `if` statement. (Note: The `%%` operator in R returns the remainder of the first value (`i`) divided by the second value (2).) 

```{r}
for(i in 1:5){
  if(i %% 2 == 1){
    print(i)
  }
}
```
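To see what `%%` is doing on its own, here is a short sketch outside of a loop (the object name `values` is just for illustration). Because `%%` works on whole vectors, you can also use it in a logical statement to pull out the odd values directly:

```{r}
values <- 1:5

# Remainder after dividing each value by 2
values %% 2

# Keep only the values with a remainder of 1 (the odd values)
values[values %% 2 == 1]
```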

The `if` statement runs some code if a condition is true, but does nothing if it is false. If you'd like different code to run depending on whether the condition is true or false, you can use an if / else or an if / else if / else statement. 

```{r}
for(i in 1:5){
  if(i %% 2 == 1){
    print(i)
  } else {
    print(paste(i, "is even"))
  }
}
```

What would this code do?

```{r eval = FALSE}
for(i in 1:100){
  if(i %% 3 == 0 & i %% 5 == 0){
    print("FizzBuzz")
  } else if(i %% 3 == 0){
    print("Fizz")
  } else if(i %% 5 == 0){
    print("Buzz")
  } else {
    print(i)
  }
}
```

If / else statements are extremely useful in functions.

In R, the `if` statement evaluates everything in the parentheses and, if that evaluates to `TRUE`, runs everything in the braces. This means that you can trigger code in an `if` statement with a single-value logical vector: 

```{r}
weekend <- TRUE
if(weekend){
  print("It's the weekend!")
}
```

This functionality can be useful with parameters you choose to include when writing your own functions (e.g., `print = TRUE`).
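As a minimal sketch of that idea, here is a variation on the earlier `add_one` function. The function name `add_one_message` and the parameter `print_message` are hypothetical choices for this example. The logical parameter controls whether the function prints a message while it runs:

```{r}
add_one_message <- function(number, print_message = TRUE){
  out <- number + 1
  # Only runs when `print_message` is TRUE
  if(print_message){
    message("Adding 1 to ", number)
  }
  return(out)
}

add_one_message(number = 3)
add_one_message(number = 3, print_message = FALSE)
```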

### Some other control structures

The control structure you are most likely to use in data analysis with R is the "if / else" statement. However, there are a few other control structures you may occasionally find useful: 

- `next`
- `break`
- `while`

You can use the `next` structure to skip to the next round of a loop when a certain condition is met. For example, we could have used this code to print out odd numbers between 1 and 5:

```{r}
i <- 0
while(i < 5){
  i <- 1 + i
  if(i %% 2 == 0){
    next
  }
  print(i)
}
```

You can use `break` to break out of a loop if a certain condition is met. For example, the following code will break out of the loop once `i` is over 2, so it will only print the numbers 1 through 3:

```{r}
i <- 0
while(i <= 5){
  if(i > 2){
    break
  }
  i <- 1 + i
  print(i)
}
```


## In-course exercise for Chapter 7

### Exploring taxonomic profiling data

- We'll be using a package on Bioconductor called `microbiome`. You'll need to install that package from Bioconductor. This uses code that's different from the default you use to download a package from CRAN. Go to [the Bioconductor page for the microbiome package](https://bioconductor.org/packages/release/bioc/html/microbiome.html) and figure out how to install this package based on instructions on that page. 
- The `microbiome` package includes tools for exploring and analysing microbiome profiling data. This package has a website with tutorial information [here](https://bioconductor.org/packages/devel/bioc/vignettes/microbiome/inst/doc/vignette.html#installation). We want to explore a dataset on genus-level microbiota profiling (`atlas1006`). Navigate to the tutorial webpage to figure out how you can get this example raw data loaded in your R session. Use the `class` and `str` functions to start exploring this data. Is it in a dataframe (tibble)? Is it in a tidy format? How is the data structured?
- On the [microbiome page](https://bioconductor.org/packages/devel/bioc/vignettes/microbiome/inst/doc/vignette.html), find the documentation describing the `atlas1006` data. Look through this documentation to figure out what information is included in the data. Also, check the helpfile for this dataset and look up the original article describing the data (you can find the article information in the help resources).
- The `atlas1006` data is stored in a special object class called a "phyloseq" object (you should have seen this when you used `class` with the object). You can pull certain parts of this data using special functions called "accessors". One is `get_variable`. Try running `get_variable` with the `atlas1006` data. What do you think this data is showing?
- Which different nationalities are represented by the study subjects, based on the dataframe you extracted in the last step? How many samples have each nationality? Which different BMI groups are included? Does it look like the study was balanced among these groups? 
- Based on the data you extracted, does it look like diversity varies much between males and females? Across BMI groups?
- Discuss what steps you would need to take to create the following plot. To start, don't write any code, just develop a plan. Talk about what the dataset should look like right before you create the plot and what functions you could use to get the data from its current format to that format.
- Try to write the code to create this plot. This will include some code for cleaning the data and some code for plotting. I will add one example answer after class, but I'd like you to try to figure it out yourselves first.


```{r echo = FALSE, warning = FALSE, message = FALSE, fig.width = 8, fig.height = 4}
library(tidyverse)
library(ggthemes)
library(microbiome)

data(atlas1006)

atlas1006 %>% 
  get_variable() %>% 
  ggplot(aes(x = bmi_group, y = diversity, color = sex)) + 
  geom_boxplot(alpha = 0.2) + 
  facet_wrap(~ nationality)
```


#### Example R code

Install the `microbiome` package from Bioconductor: 

```{r eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("microbiome")
```

Load the `atlas1006` example data in the `microbiome` package and explore it with `str` and `class`: 

```{r}
library(microbiome)
data(atlas1006)

class(atlas1006)
str(atlas1006)
```

Pull out the data frame that contains information on each study subject in the `atlas1006` data by using the `get_variable` accessor function:

```{r}
get_variable(atlas1006) %>% head()
```

**Note:** Since the first argument to `get_variable` is the phyloseq object (here, the `atlas1006` data object), you can pipe into the function, like this: 

```{r eval = FALSE}
atlas1006 %>% 
  get_variable() %>% 
  head()
```

Find out the different nationalities included in the data: 

```{r}
atlas1006 %>% 
  get_variable() %>% 
  as_tibble() %>% # Change output from a data.frame to a tibble
  group_by(nationality) %>% 
  count()
```

It looks like most subjects were from Central Europe, with the next-largest group from 
Scandinavia. **Note:** If you wanted to rearrange this summary to give the nationalities
in order of the number of subjects in each, you could add on a `forcats` function: 

```{r}
atlas1006 %>% 
  get_variable() %>% 
  as_tibble() %>% # Change output from a data.frame to a tibble
  mutate(nationality = fct_infreq(nationality)) %>% 
  group_by(nationality) %>% 
  count() 
```

Find out the different BMI groups included in the data and if the study seemed to be balanced across those groups: 

```{r}
atlas1006 %>% 
  get_variable() %>% 
  as_tibble() %>% # Change output from a data.frame to a tibble
  group_by(bmi_group) %>% 
  count()
```

There were six different BMI groups. Over 100 study subjects had this information missing. 
The samples were not evenly distributed across these BMI groups. Instead, the most common (lean) had
almost 500 subjects, while the smaller BMI-group samples were around 20 people. 

See if it looks like diversity varies much between males and females:

```{r}
atlas1006 %>% 
  get_variable() %>% 
  as_tibble() %>% 
  group_by(sex) %>% 
  summarize(mean_diversity = mean(diversity))
```

See if it looks like diversity varies much across BMI groups:

```{r}
atlas1006 %>% 
  get_variable() %>% 
  as_tibble() %>% 
  group_by(bmi_group) %>% 
  summarize(mean_diversity = mean(diversity))
```


Here is the code for the plot: 

```{r}
library(tidyverse)
library(ggthemes)
library(microbiome)

data(atlas1006)

atlas1006 %>% 
  get_variable() %>% 
  ggplot(aes(x = bmi_group, y = diversity, color = sex)) + 
  geom_boxplot(alpha = 0.2) + 
  facet_wrap(~ nationality)
```

### More with baby names

Let's look at baby names that we started looking at last class, based on the letter they start with. 

- For the full dataframe, what proportion of baby names start with each letter?
See if you can create a figure to help show this. Create the same plot 
using the names of people from our class. 
- What proportion of names start with "C" or "S" across the full dataset? 

#### Example R code

For the full dataframe, what proportion of baby names start with each letter?

To start, create a column with the first letter of each name. You can use 
functions in the `stringr` package to do this. The easiest might be to 
use the position of the first letter to pull that information.

```{r}
library(babynames)
library(stringr)
top_letters <- babynames %>% 
  mutate(first_letter = str_sub(name, 1, 1))

top_letters %>% 
  select(name, first_letter) %>% 
  slice(1:5)
```

Now we can group by letter and figure out these proportions. First, while the data is
grouped, count the number of names with each letter. Then, ungroup and use mutate
to divide this by the total number of names:

```{r}
top_letters <- top_letters %>% 
  group_by(first_letter) %>% 
  summarize(n = sum(n)) %>% 
  ungroup() %>% 
  mutate(prop = n / sum(n)) %>% 
  arrange(desc(prop))

top_letters
```

Here's one way to visualize this: 

```{r fig.height = 5, fig.width = 4}
library(scales)
top_letters %>% 
  mutate(first_letter = fct_reorder(first_letter, prop)) %>% 
  ggplot(aes(x = first_letter)) + 
  geom_bar(aes(weight = prop)) + 
  coord_flip() + 
  scale_y_continuous(labels = percent) + 
  labs(x = "", y = "Percent of names that start with ...")
```

Create the same plot using the names of people in our class. First, create a 
dataframe with the names of people in our class: 

```{r fig.width = 3, fig.height = 4}
student_list <- tibble(name = c("Burton", "Caroline", "Chaoyu", "Collin",
                                "Daniel", "Eric", "Erin", "Heather",
                                "Jacob", "Jessica", "Khum", "Kyle",
                                "Matthew", "Molly", "Nikki", "Rachel",
                                "Sere",  "Shayna", "Sherry", "Sunny"))
student_list <- student_list %>% 
  mutate(first_letter = str_sub(name, 1, 1))
student_list

library(scales)
student_list %>% 
  group_by(first_letter) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(prop = n / sum(n)) %>% 
  mutate(first_letter = fct_reorder(first_letter, prop)) %>% 
  ggplot(aes(x = first_letter)) + 
  geom_bar(aes(weight = prop)) + 
  coord_flip() + 
  scale_y_continuous(labels = percent) + 
  labs(x = "", y = "Percent of students with\na name that starts with ...")
```
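If you would like to cross-check the class proportions without the tidyverse, one option is a base R sketch along these lines. It assumes the same class list as above, stored here in a hypothetical vector called `student_names`, and uses `table` and `prop.table` to count and then convert to proportions:

```{r}
student_names <- c("Burton", "Caroline", "Chaoyu", "Collin",
                   "Daniel", "Eric", "Erin", "Heather",
                   "Jacob", "Jessica", "Khum", "Kyle",
                   "Matthew", "Molly", "Nikki", "Rachel",
                   "Sere",  "Shayna", "Sherry", "Sunny")

# Pull the first letter of each name
first_letters <- substr(student_names, start = 1, stop = 1)

# Count names per letter, then convert the counts to proportions
letter_props <- prop.table(table(first_letters))
sort(letter_props, decreasing = TRUE)
```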

What proportion of names start with "C" or "S" across the full dataset? You can 
create a dataframe that (1) pulls out the first letter of each name (just like 
we did for the last part of the question) and (2) tests whether that first letter
is a "C" or an "S" (using a logical statement):

```{r}
c_or_s <- babynames %>% 
  mutate(first_letter = str_sub(name, 1, 1),
         c_or_s = first_letter %in% c("C", "S"))

c_or_s %>% 
  select(name, first_letter, c_or_s) %>% 
  slice(1:5)
```

Next, group by this logical column (`c_or_s`) and figure out the number of 
baby names for each group. Then, to get the proportion of the total, ungroup
and mutate to divide by the total number across the data: 

```{r}
c_or_s %>% 
  group_by(c_or_s) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(prop = n / sum(n))
```

### Running a simple statistical test

In the last part of the in-course exercise, we found out that about 14% of babies born in the 
United States between 1980 and 1995 had names that started with a "C" or "S" (268,910
babies out of 1,924,665). 

- What is the proportion of people with names that start with a "C" or "S" in our class?
- Use a simple statistical test to test the hypothesis that the class comes from a 
binomial distribution with the same probability of having a name that starts with "C" or "S" as babies born in the US over the time tracked by this data. (Hint: You will be comparing two proportions. Try googling for a statistical test in R that does that.)
- See if you can figure out a way to make a single "tidy" pipeline for the whole analysis (and output the result as a tidy dataframe). Does the `tidy` function from `broom` give different information from this test than the output we got for the Shapiro-Wilk test?
- You may get the warning "Chi-squared approximation may be incorrect". See if you can 
figure out this warning if you get it with the test you used.

---

#### Example R code

Here is a dataframe with names in our class: 

```{r}
library(stringr)
student_list <- tibble(name = c("Burton", "Caroline", "Chaoyu", "Collin",
                                    "Daniel", "Eric", "Erin", "Heather",
                                    "Jacob", "Jessica", "Khum", "Kyle",
                                    "Matthew", "Molly", "Nikki", "Rachel",
                                    "Sere",  "Shayna", "Sherry", "Sunny"))
student_list <- student_list %>% 
  mutate(first_letter = str_sub(name, 1, 1))
student_list %>% 
  slice(1:3)
```

Let's get the total number of students, and then the total number with a name that 
starts with "C" or "S": 

```{r}
tot_students <- student_list %>% 
  count()
tot_students

c_or_s_students <- student_list %>% 
  mutate(c_or_s = first_letter %in% c("C", "S")) %>% 
  group_by(c_or_s) %>% 
  count()
c_or_s_students
```

The proportion of students with names starting with "C" or "S" is `r c_or_s_students$n[2]` / 
`r tot_students$n[1]` = `r round(c_or_s_students$n[2] / tot_students$n[1], 2)`.

You could run a statistical test comparing these two proportions (check the help file for the 
function to figure out where to include each piece):

```{r}
prop.test(x = c(268910, 7), n = c(1924665, 20))
```

There are a few different ways you could run this test. 
For example, you could also test whether the proportion in our class is consistent with the null
hypothesis that you were drawn from a binomial distribution with a proportion of 0.14
(in-line with the national values):

```{r}
prop.test(x = 7, n = 20, p = 0.14)
```

You can also see if we can pipe into `prop.test` by summing up the number of successes ("1": the person's name starts with "C" or "S"). Because this is using an unsummarized
form of the data, it lets us use some of the tidyverse tools more easily: 

```{r}
library(purrr)
library(broom)
student_list %>% 
  mutate(c_or_s = first_letter %in% c("C", "S")) %>% 
  pull("c_or_s") %>% 
  sum() %>% 
  prop.test(n = 20, p = 0.14) %>% 
  tidy()
```


Finally, when we run this test, we get the warning that "Chi-squared approximation may be incorrect". Based on googling 'r prop.test "Chi-squared approximation may be incorrect"', 
it sounds like we might be getting this warning because we have a pretty low number of 
people in the class. One recommendation is to use `binom.test`, which will run as an exact binomial test:

```{r}
binom.test(x = 7, n = 20, p = 0.14)
```
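To get a feel for why the class size matters, you can look at the expected counts under the null hypothesis. A common rule of thumb (and, as far as I know, the check `prop.test` uses to decide whether to give this warning) is that the chi-squared approximation becomes questionable when any expected count is below 5. Here is a quick sketch of that arithmetic, with hypothetical object names:

```{r}
n_students <- 20
p_null <- 0.14

# Expected number of students in each group if the null hypothesis is true
expected <- c(c_or_s = n_students * p_null,
              other = n_students * (1 - p_null))
expected

# Is any expected count below the rule-of-thumb threshold of 5?
any(expected < 5)
```

With only 20 students, we would expect fewer than three "C" or "S" names under the null hypothesis, which is why an exact test is a safer choice here.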

### Using regression models to explore data #1

For this exercise, you will need the following packages. If you do not have them already, you will need to install them. 

```{r}
library(ggplot2)
library(broom)
library(ggfortify)
```

For this part of the exercise, you'll use a dataset on weather, air pollution, and mortality counts in Chicago, IL. This dataset is called `chicagoNMMAPS` and is part of the `dlnm` package. Change the name of the data frame to `chic` (this object name is shorter and will be easier to work with). Check out the data a bit to see what variables you have, and then perform the following tasks:

- Write out (on paper, not in R) the regression equation for regressing dewpoint temperature on temperature. 
- Try fitting a linear regression of dew point temperature (`dptp`) regressed on temperature (`temp`). In other words, test whether the dependent variable of 
dewpoint temperature is linearly associated with the independent variable of temperature. Save this model as the object `mod_1`. 
- Based on this regression, does there seem to be a relationship between temperature and dewpoint temperature in Chicago? (Hint: Try using `glance` and `tidy` from the `broom` package on the model object to get more information about the model you fit.) What is the coefficient for temperature (in other words, for every 1 degree increase in temperature, how much do we expect the dewpoint temperature to change?)? What is the p-value for the coefficient for temperature?
- Plot temperature (x-axis) versus dewpoint temperature (y-axis) for Chicago. Add in the regression line from the model you fit by using the results from `augment`.
- Use `autoplot` on the model object to generate some model diagnostic plots (make sure you have the `ggfortify` package loaded and installed).


---

#### Example R code:

The regression equation for the model you want to fit, regressing dewpoint temperature on temperature, is:

$$
Y_t \sim \beta_0 + \beta_1 X_t
$$

where $Y_t$ is the dewpoint temperature on day $t$, $X_t$ is the temperature on day $t$, and $\beta_0$ and $\beta_1$ are model coefficients. 

Install and load the `dlnm` package and then load the `chicagoNMMAPS` data. Change the name of the data frame to `chic`, so it will be shorter to call for the rest of your work. 

```{r, message = FALSE, warning = FALSE}
# install.packages("dlnm")
library(dlnm)
data("chicagoNMMAPS")
chic <- chicagoNMMAPS
```

Fit a linear regression of `dptp` on `temp` and save as the object `mod_1`:

```{r}
mod_1 <- lm(dptp ~ temp, data = chic)
mod_1
```

Use functions from the `broom` package to pull the same information about the model in a "tidy" format. To find out if there is evidence for a linear association between temperature and dewpoint temperature, use the `tidy` function to get model coefficients in a tidy format:

```{r}
tidy(mod_1)
```

There does seem to be an association between temperature and dewpoint temperature: a unit increase in temperature is associated with a `r round(coef(mod_1)[2], 1)` unit increase in dewpoint temperature. The p-value for the temperature coefficient is <2e-16. This is far below 0.05, which suggests we would be very unlikely to see such a strong association by chance if the null hypothesis, that the two variables are not associated, were true.

You can also check overall model summaries using the `glance` function:

```{r}
glance(mod_1)
```

To create plots of the observations and the fit model, use the `augment` function to add model output (e.g., predictions, residuals) to the original data frame of observed temperatures and dew point temperatures:

```{r warning = FALSE, message = FALSE}
augment(mod_1) %>% 
  slice(1:3)
```

Plot these two variables and add in the fitted line from the model (note: I've used the `color` option to make the color of the points gray). Use the output from `augment` to create a plot of the original data, with the predicted values used to plot a fitted line. 

```{r warning = FALSE, message = FALSE, fig.width = 5, fig.height = 4}
augment(mod_1) %>% 
  ggplot(aes(x = temp, y = dptp)) + 
  geom_point(size = 0.8, alpha = 0.5, col = "gray") + 
  geom_line(aes(x = temp, y = .fitted), color = "red", size = 2) + 
  theme_classic()
```

Create some plots to check model assumptions for the model you fit by using the `autoplot` function on your model object:

```{r, fig.width = 6, fig.height = 6}
autoplot(mod_1)
```


---

### Using regression models to explore data #2

- Try fitting the regression from the last part of the in-course exercise as a GLM, using `glm()` (but still assuming the outcome variable is normally distributed). Are your coefficients different?
- Does $PM_{10}$ vary by day of the week? (Hint: The `dow` variable is a factor that gives day of the week. You can do an ANOVA analysis by fitting a linear model using this variable as the independent variable. Some of the overall model summaries will compare this model to an intercept-only model.) What day of the week is $PM_{10}$ generally highest? (Check the model coefficients to figure this out.) Try to write out (on paper) the regression equation for the model you're fitting.
- Try using `glm()` to run a Poisson regression of respiratory deaths (`resp`) on temperature during summer days. Start by creating a subset with just summer days called `summer`. (Hint: Use the `month` function with the argument `label = TRUE` from `lubridate` to do this. Just pull out the subset where the month is June, July, or August.) Try to write out the regression equation for the model you're fitting.
- The coefficient for the temperature variable in this model is our best estimate (based on this model) of the **log relative risk** for a one degree Celsius increase in temperature. What is the **relative risk** associated with a one degree Celsius increase?

---

#### Example R code:

Try fitting the model from the last part of the in-course exercise using `glm()`. Call it `mod_1a`. Compare the coefficients for the two models. You can use the `tidy` function on an `lm` or `glm` object to pull out just the model coefficients and associated model results. Here, I've used a pipeline of code to create a tidy data frame that merges these "tidy" coefficient outputs (from the two models) into a single data frame:

```{r}
mod_1a <- glm(dptp ~ temp, data = chic)

tidy(mod_1) %>% 
  select(term, estimate) %>% 
  inner_join(mod_1a %>% tidy() %>% select(term, estimate), by = "term") %>% 
  rename(estimate_lm_mod = estimate.x,
         estimate_glm_mod = estimate.y)
```

The results from the two models are identical.
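This is what you would expect: when you don't specify a `family`, `glm()` uses a Gaussian family with an identity link, which is the same model that `lm()` fits. Here is a small self-contained sketch of the same comparison using the built-in `mtcars` data as a stand-in (the object names `lm_fit` and `glm_fit` are just for this example):

```{r}
lm_fit <- lm(mpg ~ wt, data = mtcars)

# Spelling out the default family that `glm()` uses
glm_fit <- glm(mpg ~ wt, data = mtcars,
               family = gaussian(link = "identity"))

# The coefficient estimates agree (up to numerical tolerance)
all.equal(coef(lm_fit), coef(glm_fit))
```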

As a note, you could have also just run `tidy` on each model object, without merging them together into a single data frame:

```{r}
tidy(mod_1)
tidy(mod_1a)
```


Fit a model of $PM_{10}$ regressed on day of week, where day of week is a factor. 

```{r}
mod_2 <- lm(pm10 ~ dow, data = chic)
tidy(mod_2)
```

Use `glance` to check some of the overall summaries of this model. The `statistic` column here is the F statistic from a test comparing this model to an intercept-only model.

```{r}
glance(mod_2)
```

As a note, you may have heard in previous statistics classes that you can use the `anova()` command to compare this model to a model with only an intercept (i.e., one that only fits a global mean and uses that as the expected value for all of the observations). Note that, in this case, the F value from `anova` for this model comparison is the same as the `statistic` you got in the overall summary statistics you get with `glance` in the previous code. 

```{r}
anova(mod_2)
```
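If you would like to see this comparison written out explicitly, here is a sketch using the built-in `PlantGrowth` data as a stand-in for the Chicago data (the object names `null_mod` and `full_mod` are hypothetical). It fits the intercept-only model and the model with a factor as two separate objects, compares them with `anova`, and then checks that the F statistic matches the overall F statistic from the model summary:

```{r}
# Intercept-only model: fits a single global mean
null_mod <- lm(weight ~ 1, data = PlantGrowth)
# Model with a factor (treatment group) as the independent variable
full_mod <- lm(weight ~ group, data = PlantGrowth)

# Explicit comparison of the two models
model_comparison <- anova(null_mod, full_mod)
model_comparison

# Overall F statistic reported for the full model
summary(full_mod)$fstatistic["value"]
```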

The overall p-value from `anova` for the model with day-of-week coefficients versus the model that just has an intercept is < 2.2e-16. This is well below 0.05, which suggests that day-of-week is associated with PM10 concentration, as a model that includes day-of-week does a much better job of explaining variation in PM10 than a model without it does. 

Use a boxplot to visually compare PM10 by day of week. 

```{r, fig.height = 3, fig.width = 6, warning = FALSE}
ggplot(chic, aes(x = dow, y = pm10)) + 
  geom_boxplot()
```

Now try the same plot, but try using the `ylim` function to change the limits on the y-axis for the graph, so you can get a better idea of the pattern by day of week (some of the extreme values are very high, which makes it hard to compare by eye when the y-axis extends to include them all).

```{r, fig.height = 3, fig.width = 6, message = FALSE}
ggplot(chic, aes(x = dow, y = pm10)) + 
  geom_boxplot() + 
  ylim(c(0, 100))
```

Create a subset called `summer` with just the summer days:

```{r}
library(lubridate)
summer <- chic %>%
  mutate(month = month(date, label = TRUE)) %>% 
  filter(month %in% c("Jun", "Jul", "Aug"))
summer %>% 
  slice(1:3)
```

Use `glm()` to fit a Poisson model of respiratory deaths regressed on temperature. Since you want to fit a Poisson model, use the option `family = poisson(link = "log")`. 

```{r}
mod_3 <- glm(resp ~ temp, data = summer,
             family = poisson(link = "log"))
glance(mod_3)
tidy(mod_3)
```

Use the fitted model coefficient to determine the relative risk for a one degree Celsius increase in temperature. First, remember that you can use the `tidy()` function to read out the model coefficients. The second of these is the value for the temperature coefficient, so you can filter to the row where `term` is `"temp"` to get just that value. That's the log relative risk; take the exponent to get the relative risk.

```{r}
tidy(mod_3) %>% 
  filter(term == "temp") %>% 
  mutate(rr = exp(estimate)) # Estimate is log relative risk. Take exponent for relative risk
```
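To make the conversion concrete, here is the arithmetic with a made-up coefficient. The value 0.01 below is hypothetical, not the estimate from `mod_3`. Because the model is linear on the log scale, the log relative risk for a larger change in temperature is the coefficient multiplied by the size of that change:

```{r}
log_rr <- 0.01  # hypothetical log relative risk per one degree Celsius

# Relative risk for a one degree Celsius increase
rr <- exp(log_rr)
rr

# The same value expressed as a percent increase in risk
(rr - 1) * 100

# Relative risk for a five degree Celsius increase
exp(5 * log_rr)
```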

As a note, you can use the `conf.int` parameter in `tidy` to also pull confidence intervals:

```{r}
tidy(mod_3, conf.int = TRUE)
```

You could use this to get the confidence interval for relative risk (check out the `mutate_at` function if you haven't seen it before):

```{r}
tidy(mod_3, conf.int = TRUE) %>% 
  select(term, estimate, conf.low, conf.high) %>% 
  filter(term == "temp") %>% 
  mutate_at(vars(estimate:conf.high), exp)
```

### Trying out nesting and mapping

- We'll be using data that's in an R package called "nycflights13". This data package can be installed from CRAN. Install the package and then load the package and its "flights" dataset. 
So that this data will be easier to work with, remove all columns except for those for 
the departure delay (`dep_delay`), the carrier (`carrier`), the hour the flight was 
supposed to leave (`hour`), and the airport the flight left from (`origin`). Also, 
limit the dataset to only flights that left from LaGuardia Airport ("LGA").
- We want to figure out if the probability of a flight leaving 15 minutes late or more
increases over the day for flights leaving LaGuardia. Filter out all the rows where the 
departure delay is missing and then create a new column called
`late_dep` that is true if the flight left 15 minutes or more late and false otherwise. 
What proportion of all flights leaving from LaGuardia leave 15 minutes late or later?
- Next, determine what proportion of all flights are delayed based on the hour that the
flight was scheduled to depart (`hour`). Create a plot showing how the probability of 
leaving 15 minutes or more late changes by hour.
- Fit a generalized linear model for the association between the binary variable of whether
the flight was 15 minutes or more late (`late_dep`) and the hour the flight was 
scheduled to leave (`hour`). Use a binomial model (add `family = binomial(link = "logit")`
in the `glm` call). The estimate from the model for `hour` will be an estimate
of the log odds ratio for a one-hour increase in scheduled departure time. Take the
exponent of this estimate with `exp` to get an estimated odds ratio for a one-hour
increase in scheduled departure time. Is this estimate larger than 1.0?
- Next, we want to see if this association is similar across airlines. First, create a 
dataframe called `nested_flights` where the `flights`
data is grouped by airline (`carrier`) and then nest the data, so that there is a "data" list-column
where each item is a dataframe of flight delay data for a specific carrier.
- Use the `map` function from `purrr` inside a `mutate` statement to apply the `glm` code you used earlier for the
whole dataset, but in this case for the data for each airline carrier. Then use the `map` 
function inside a `mutate` statement again to "tidy" the data. 
- Clean the data up a bit. Remove the columns for `data` and
`glm_result` and then `unnest` the dataframe list-column with the tidy version of the model
results. Filter to get only the estimates for the "hour" term. Then calculate an odds ratio
(`or`) by taking the exponent (check out the `exp` function) of the original estimate. 
- The package has a dataframe with the full name of each carrier (`airlines`). Join this data
into the data you've been working with, so you have the full names of airlines.
- Finally, create the following plot with each airline's odds ratio for the change in the chance of a delay per one-hour increase in the scheduled hour of departure: 

```{r echo = FALSE}
library(purrr)
library(tidyr)
library(dplyr)
library(forcats)
library(broom)
library(ggplot2)
library(nycflights13)
data(flights)
data(airlines)

flights %>% 
  select(dep_delay, carrier, hour, origin) %>% 
  filter(origin == "LGA") %>% 
  filter(!is.na(dep_delay)) %>% 
  mutate(late_dep = dep_delay >= 15) %>% 
  group_by(carrier) %>% 
  nest() %>% 
  mutate(glm_result = purrr::map(data, ~ glm(late_dep ~ hour, 
                                      data = .x, family = binomial(link = "logit")))) %>% 
  mutate(glm_tidy = purrr::map(glm_result, ~ tidy(.x))) %>% 
  select(-data, -glm_result) %>% 
  unnest(glm_tidy) %>% 
  filter(term == "hour") %>% 
  mutate(or = exp(estimate)) %>% 
  left_join(airlines, by = "carrier") %>% 
  ungroup() %>% 
  mutate(name = fct_reorder(name, or)) %>% 
  ggplot(aes(x = or, y = name)) + 
  geom_point() + 
  geom_vline(xintercept = 1, linetype = 3) + 
  labs(x = "Odds ratio for one-hour increase\nin scheduled departure time", y = "")
```


---

#### Example R code:

Install the "nycflights13" package from CRAN. Load the package and its "flights" dataset.

```{r}
library(ggplot2)
library(nycflights13)
data(flights)
```

So that this data will be easier to work with, remove all columns except for those for 
the departure delay (`dep_delay`), the carrier (`carrier`), the hour the flight was 
supposed to leave (`hour`), and the airport the flight left from (`origin`). Also, 
limit the dataset to only flights that left from LaGuardia Airport ("LGA"). 

```{r}
flights <- flights %>% 
  select(dep_delay, carrier, hour, origin) %>% 
  filter(origin == "LGA")

flights
```

Filter out all the rows where the 
departure delay is missing and then create a new column called
`late_dep` that is true if the flight left 15 minutes or more late and false otherwise. 

```{r}
flights <- flights %>% 
  filter(!is.na(dep_delay)) %>% 
  mutate(late_dep = dep_delay >= 15)

flights
```

What proportion of all flights leaving from LaGuardia leave 15 minutes late or later? To 
check this, remember that `TRUE` is saved as a "1" and `FALSE` is saved as a "0". That means that
we can take the mean of a logical vector to get the proportion of trials that are true.

```{r}
flights %>% pull("late_dep") %>% mean()
```
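Here is a tiny sketch of that idea with a made-up logical vector (the object `late` is hypothetical, not taken from the flights data):

```{r}
late <- c(TRUE, FALSE, FALSE, TRUE)

# Logical values are treated as 1 (TRUE) and 0 (FALSE) in arithmetic
as.numeric(late)

# Number of TRUE values
sum(late)

# Proportion of TRUE values
mean(late)
```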

Determine what proportion of all flights are delayed based on the hour that the
flight was scheduled to depart (`hour`). Create a plot showing how the probability of 
leaving 15 minutes or more late changes by hour.

```{r}
flights_late <- flights %>% 
  group_by(hour) %>% 
  summarize(prob_late = mean(late_dep))

flights_late

flights_late %>% 
  ggplot(aes(x = hour, y = prob_late)) + 
  geom_line()
```

Fit a generalized linear model for the association between the binary variable of whether
the flight was 15 minutes or more late (`late_dep`) and the hour the flight was 
scheduled to leave (`hour`). Use a binomial model (add `family = binomial(link = "logit")`
in the `glm` call). The estimate from the model for `hour` will be an estimate
of the log odds ratio for a one-hour increase in scheduled departure time. Take the
exponent of this estimate with `exp` to get an estimated odds ratio for a one-hour
increase in scheduled departure time. Is this estimate larger than 1.0?

```{r}
glm(late_dep ~ hour, data = flights, family = binomial(link = "logit")) 

library(broom)
glm(late_dep ~ hour, data = flights, family = binomial(link = "logit")) %>% 
  tidy() %>%                 # Tidy the model results
  filter(term == "hour") %>% # Only look at the estimate for `hour`
  mutate(or = exp(estimate)) # Estimate is log odds ratio. Take exponent for odds ratio
```
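
To see why exponentiating the `hour` coefficient gives an odds ratio, here is a rough sketch on simulated data (the `sim_flights` data, its coefficients, and the object names are all made up for illustration). It compares the odds predicted by the model at two hours that are one hour apart with the exponentiated coefficient:

```{r}
set.seed(1)

# Simulated (hypothetical) data: the chance of a late departure rises with hour
sim_flights <- data.frame(hour = sample(5:23, 500, replace = TRUE))
sim_flights$late_dep <- rbinom(500, size = 1,
                               prob = plogis(-3 + 0.1 * sim_flights$hour))

sim_mod <- glm(late_dep ~ hour, data = sim_flights,
               family = binomial(link = "logit"))

# Predicted probabilities at 10:00 and 11:00, converted to odds
sim_probs <- predict(sim_mod, newdata = data.frame(hour = c(10, 11)),
                     type = "response")
sim_odds <- sim_probs / (1 - sim_probs)

# The ratio of the two odds matches the exponentiated coefficient
unname(sim_odds[2] / sim_odds[1])
unname(exp(coef(sim_mod)["hour"]))
```

Because the model is linear on the log-odds scale, the same ratio would result for any other pair of hours that are one hour apart.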

Next, we want to see if this association is similar across airlines. First, create a 
dataframe called `nested_flights` where the `flights`
data is grouped by airline (`carrier`) and then nest the data, so that there is a "data" list-column
where each item is a dataframe of flight delay data for a specific carrier: 

```{r}
nested_flights <- flights %>% 
  group_by(carrier) %>% 
  nest()

nested_flights
```

To check the contents of the list-column, try: 

```{r}
nested_flights$data[[1]] %>% # Get the first element of the "data" column
  head()
```

Use the `map` function from `purrr` inside a `mutate` statement to apply the `glm` code you used earlier for the
whole dataset, but in this case separately to the data for each airline: 

```{r}
library(purrr)
library(tidyr)

prob_late <- nested_flights %>% 
  mutate(glm_result = purrr::map(data, ~ glm(late_dep ~ hour, 
                                      data = .x, family = binomial(link = "logit")))) 

# Check the results for the first element of the "glm_result" column: 
prob_late$glm_result[[1]]
```
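
If the `~` and `.x` shorthand is unfamiliar, this small sketch (using a made-up list called `example_list`) shows three ways of writing the same `map` call. In the formula version, `.x` stands for the element of the list currently being processed:

```{r}
library(purrr)

# Hypothetical list with two numeric vectors
example_list <- list(c(1, 2, 3), c(10, 20, 30))

# Formula shorthand: `.x` is the current element
map(example_list, ~ mean(.x))

# The same thing with an anonymous function
map(example_list, function(x) mean(x))

# The same thing by passing the function directly
map(example_list, mean)
```

In the `glm` example, the formula shorthand is convenient because the current element (`.x`) needs to go to the `data` argument rather than the first argument of `glm`.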

Then use the `map` 
function inside a `mutate` statement again to "tidy" the model results with the `tidy` function from `broom`. 

```{r}
prob_late <- prob_late %>% 
  mutate(glm_tidy = purrr::map(glm_result, ~ tidy(.x)))
```

Remove the columns for `data` and
`glm_result`:

```{r}
prob_late <- prob_late %>% 
  select(-data, -glm_result)
```

Then `unnest` the dataframe list-column with the tidy version of the model
results: 

```{r}
prob_late <- prob_late %>% 
  unnest(glm_tidy)
```
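
The full nest, map, and unnest pattern can be easier to follow on a tiny dataset. The sketch below uses a made-up dataframe (`toy`) and a simple group mean in place of a model; all of the names are hypothetical:

```{r}
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical data with two groups
toy <- tibble(group = c("a", "a", "b", "b"),
              value = c(1, 3, 10, 20))

toy_summary <- toy %>% 
  group_by(group) %>% 
  nest() %>%                                     # One row per group, with a "data" list-column
  mutate(stats = map(data, ~ tibble(mean_value = mean(.x$value)))) %>% 
  select(-data) %>%                              # Drop the nested raw data
  unnest(stats) %>%                              # Expand the list-column of small dataframes
  ungroup()

toy_summary
```

Each element of the `stats` list-column is a one-row dataframe, so unnesting gives one row per group, just as unnesting `glm_tidy` gives one row per model term for each carrier.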

Filter to get only the estimates for the "hour" term: 

```{r}
prob_late <- prob_late %>% 
  filter(term == "hour")
```

Then calculate an odds ratio
(`or`) by taking the exponent (check the `exp` function) of the original estimate. 

```{r}
prob_late <- prob_late %>% 
  mutate(or = exp(estimate))
head(prob_late)
```

The `nycflights13` package has a dataframe with the full name of each carrier (`airlines`). Join this data
into the data you've been working with, so you have the full names of airlines: 

```{r}
prob_late <- left_join(prob_late, airlines, by = "carrier")
```

Finally, create the following plot with each airline's association between hour in the 
day and the chance of a delay: 


```{r}
library(forcats)
data(airlines)
prob_late %>% 
  ungroup() %>% 
  mutate(name = fct_reorder(name, or)) %>%  # Order airlines by odds ratio (`or`, calculated earlier)
  ggplot(aes(x = or, y = name)) + 
  geom_point() + 
  geom_vline(xintercept = 1, linetype = 3) + 
  labs(x = "Odds ratio for one-hour increase in scheduled departure time", y = "")
```

### Writing functions

- Say that you have a four-letter character string (e.g., "ling") and that you want
to move the last letter to the front of the string to create a new four-letter
character string (e.g., "glin"). Write some R code to do this.
- Next, write a function named `move_letter` that does the same thing---takes a four-letter character string
(e.g., "ling") and creates a new four-letter character string where the last
letter in the original string has been moved to the front of the string (e.g.,
"glin"). It should input one parameter (`word`, the original four-letter character
string or a vector of four-letter character strings). 
- Read the word list at https://raw.githubusercontent.com/dwyl/english-words/master/words.txt
into an R dataframe called `word_list`. It will only have one column; name this column
`word`. 
- Write code that can take a vector of character strings (e.g., `c('ling', 'scat', 'soil')`)
and return a logical vector that says whether each character string is a word in the 
`word_list` dataframe you created in the last step (e.g., `c(FALSE, TRUE, TRUE)` for 
a vector where the first value isn't a word but the second and third are). You may find 
the `pull` function useful in writing this code (to pull the 
`word` column out of the `word_list` dataframe).
- Write a function called `is_word` that inputs (1) a vector of character strings
and (2) a dataframe with a column called `word` that lists real words. The
function should return a logical vector saying whether each character string is
a real word. The function should have two arguments: `words_to_check`, which is
a vector of character strings (e.g., `c('ling', 'scat', 'soil')`), and `real_word_list`, which is a dataframe with a
column called `word` of real English words (e.g., the `word_list` dataframe you created from the word list on GitHub). Set the `real_word_list` argument
to have the default value of `word_list`, the dataframe you created earlier in this
exercise by reading in the word list from GitHub.
- Try using the function with a different word list. As an example, you could read in 
and use the word list of Google's top 10,000 English words from https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-no-swears.txt.
- Try using these functions to solve the word puzzle problem in the last homework.

----

#### Example R code

Say that you have a four-letter character string (e.g., "ling") and that you want
to move the last letter to the front of the string to create a new four-letter
character string (e.g., "glin"). You can use functions from the 
`stringr` package to help with this.

```{r}
library(stringr)

# Start with an example word. You can use the example word from 
# the problem statement ('ling').
word <- "ling"

# Break the word into two parts
first_three_letters <- str_sub(word, 1, 3)
last_letter <- str_sub(word, 4, 4)

first_three_letters
last_letter

# Put the parts back together in the right order
new_word <- str_c(last_letter, first_three_letters) # You could also use `paste` or `paste0` here
new_word
```

Write a function named `move_letter` that takes a four-letter character string
(e.g., "ling") and creates a new four-letter character string where the last
letter in the original string has been moved to the front of the string (e.g.,
"glin"). It should input one parameter (`word`, the original four-letter character
string or a vector of four-letter character strings). 

To do this, take the code that you just wrote and put it inside a function named
`move_letter`:

```{r}
move_letter <- function(word){
  # Break the word into two parts
  first_three_letters <- str_sub(word, 1, 3)
  last_letter <- str_sub(word, 4, 4)
  
  # Put the parts back together in the right order
  str_c(last_letter, first_three_letters)
}

# Check the function
move_letter(word = "ling")

# Try with some other four-letter strings
move_letter(word = "cats")
move_letter(word = "oils")

# Check with a vector of four-letter character strings
move_letter(c("cats", "oils"))
```

Notice that `word` is now an argument of the function. It's as if you
assigned the string or vector of strings that you want to convert to the object
name `word` and then ran all the code. The output of the function is the value of the
last expression in the function code (`str_c(last_letter, first_three_letters)`).
Also, note that you can add comments inside the function with `#`, just like you can
with other R code. 
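
The function above assumes that every word has exactly four letters. As a sketch of one way to relax that assumption (`move_last_letter` is a hypothetical name, not part of the exercise), you could use negative positions in `str_sub`, which count from the end of the string:

```{r}
library(stringr)

# Hypothetical generalization: works for words of any length
move_last_letter <- function(word){
  # Everything except the last letter (position -2 is second from the end)
  all_but_last <- str_sub(word, 1, -2)
  # Only the last letter (position -1 is the end of the string)
  last_letter <- str_sub(word, -1, -1)
  
  # Put the parts back together in the new order
  str_c(last_letter, all_but_last)
}

move_last_letter(c("ling", "soils"))
```

For four-letter words, this version should give the same result as `move_letter`.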

Read the word list at https://raw.githubusercontent.com/dwyl/english-words/master/words.txt
into an R dataframe called `word_list`. It will only have one column; name this column
`word`. 

```{r}
library(readr)

word_list <- read_csv("https://raw.githubusercontent.com/dwyl/english-words/master/words.txt",
                  col_names = "word")
```

Write code that can take a vector of character strings (e.g., `c('ling', 'scat', 'soil')`)
and return a logical vector that says whether each character string is a word in the 
`word_list` dataframe you created in the last step (e.g., `c(FALSE, TRUE, TRUE)` for 
a vector where the first value isn't a word but the second and third are).

```{r}
library(dplyr) # For the `pull` function

words_to_check <- c("ling", "scat", "soil")
words_to_check %in% pull(word_list, "word")
```

This code works because you can use `pull` to extract a column (as a vector) from 
a dataframe, and then you can use the `%in%` operator to see if each value in one vector
(the `words_to_check` vector in this example) is one of the values in the second
vector (the `word` column from the `word_list` dataframe in this example).
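
A small sketch with made-up vectors (`candidates` and `small_word_list` are hypothetical names) shows how `%in%` behaves. The result always has one value for each element of the vector on the left-hand side:

```{r}
# Hypothetical vectors for illustration
candidates <- c("ling", "zzzz", "soil")
small_word_list <- c("soil", "scat", "ling", "cats")

# One logical value per element of `candidates`
candidates %in% small_word_list

# Switching the order asks a different question:
# one logical value per element of `small_word_list`
small_word_list %in% candidates
```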

Write a function called `is_word` that inputs (1) a vector of character strings
and (2) a dataframe with a column called `word` that lists real words. The
function should return a logical vector saying whether each character string is
a real word. The function should have two arguments: `words_to_check`, which is
a vector of character strings, and `real_word_list`, which is a dataframe with a
column called `word` of real English words. Set the `real_word_list` argument
to have the default value of `word_list`, the dataframe you created earlier in this
exercise by reading in the word list from GitHub.

```{r}
# Put the code you wrote inside a function
is_word <- function(words_to_check, real_word_list = word_list){
  words_to_check %in% pull(real_word_list, "word")
}

# Check the function
is_word(words_to_check = c("ling", "scat", "soil"))
```

Note that, if you want to use the `word_list` dataframe for your real word list, you 
don't have to specify that when you call the `is_word` function, since you 
set that as your default value. 
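
As a generic sketch of how default argument values work (`add_suffix` is a made-up function, unrelated to the word lists), a default is used whenever the argument is left out of the call and is overridden whenever the argument is supplied:

```{r}
# Hypothetical function with a default value for `suffix`
add_suffix <- function(x, suffix = "_new"){
  paste0(x, suffix)
}

add_suffix("word")                   # Uses the default suffix
add_suffix("word", suffix = "_old")  # Overrides the default
```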

If you want to use a different word list, you can specify a different value for 
the `real_word_list` argument when you run the function. For example, to use Google's
top 10,000 English word list from GitHub instead, use:

```{r}
google_words <- read_csv("https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-no-swears.txt", 
                         col_names = "word")

is_word(words_to_check = c("ling", "scat", "soil"), real_word_list = google_words)
```

In this case, there was a rarer word ("ling") that counted as a word when the original 
word list was used, but not when the Google top-10,000 words list was used.

Try using these functions to solve the word puzzle problem in the last homework.

```{r}
word_list %>% 
  filter(str_detect(word, "^m.{7}$")) %>%                        # This is using regular expressions
  separate(word, into = c("first_word", "second_word"),          # Note that you can use `separate` with a number to use position
           sep = 4, remove = FALSE) %>%                          
  mutate(second_word = move_letter(second_word)) %>%             # Use the first function here
  filter(is_word(first_word) & is_word(second_word)) %>%         # Use the second function here
  unite("new_word", c("first_word", "second_word"), sep = " ")   # You can use `unite` to put the two new words in a phrase in one column.
```
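
To see what the first two steps of this pipeline do, here is a sketch on a small made-up dataframe (`toy_words` is hypothetical). The regular expression `"^m.{7}$"` matches strings that start with "m" and are followed by exactly seven more characters, and `separate` with `sep = 4` splits each string after its fourth character:

```{r}
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical word list for illustration
toy_words <- tibble(word = c("maillots", "mailbags", "mail", "sailboat"))

toy_split <- toy_words %>% 
  filter(str_detect(word, "^m.{7}$")) %>%                # Keep eight-letter strings starting with "m"
  separate(word, into = c("first_word", "second_word"),
           sep = 4, remove = FALSE)                      # Split after the fourth character

toy_split
```

Only "maillots" and "mailbags" pass the filter here: "mail" is too short and "sailboat" does not start with "m".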

You can get the choices down to even fewer options if you match the new words against
the Google top-10,000 words list, by using the `real_word_list` option in the `is_word`
function you wrote. (However, "maillots" is not a common enough word to be in that 
shorter list, so you still need to start from the original, longer word list.)

```{r}
word_list %>% 
  filter(str_detect(word, "^m.{7}$")) %>%                        
  separate(word, into = c("first_word", "second_word"),          
           sep = 4, remove = FALSE) %>%                          
  mutate(second_word = move_letter(second_word)) %>%             
  filter(is_word(first_word, real_word_list = google_words) & 
           is_word(second_word, real_word_list = google_words)) %>%         
  unite("new_word", c("first_word", "second_word"), sep = " ")  
```


<!-- ### Using a function and `purrr` to create state-specific plots -->

<!-- Next, you will write a function to create state-specific plots from this data, then use it to create plots for the states of Colorado, Texas, California, and New York.  -->

<!-- - The FARS data includes a column called `STATE`, but it gives state as a one- or two-digit code, rather than by name. These codes are the state Federal Information Processing Standard (FIPS) codes. A dataset with state names and FIPS codes is available at http://www2.census.gov/geo/docs/reference/state.txt. Read that data into an R object called `state_fips` and clean it so the first few lines look like this (hint: to change the `state` column to an integer class, you can use the function `as.integer`): -->

<!-- ```{r check11, echo = FALSE, warning = FALSE, message = FALSE} -->
<!-- state_fips <- read_delim("http://www2.census.gov/geo/docs/reference/state.txt", -->
<!--                          delim = "|") %>% -->
<!--   rename(state = STATE, -->
<!--          state_name = STATE_NAME) %>% -->
<!--   select(state, state_name) %>% -->
<!--   mutate(state = as.integer(state)) -->
<!-- state_fips %>% slice(1:5) -->
<!-- ``` -->

<!-- - Read the 2015 FARS data into an R object named `accident`. Use all the date and time information to create a column named `date` with the date and time of the accident. Include information on whether the accident was related to drunk driving (FALSE if there were 0 drunk drivers, TRUE if there were one or more), and create columns that gives whether the accident was during the day (7 AM to 7 PM) or not as well as the month of the accident (for this last column, you can either retain it from the original data or recalculate it based on the new `date` variable). Filter out any values where the date-time does not render (i.e., `date` is a missing value). The first few rows of the cleaned data frame should look like this: -->

<!-- ```{r check10, warning = FALSE, message = FALSE, echo = FALSE} -->
<!-- accident <- read_csv("data/accident.csv") %>% -->
<!--   select(STATE, DAY:MINUTE, DRUNK_DR) %>% -->
<!--   rename(state = STATE,  -->
<!--          drunk_dr = DRUNK_DR) %>% -->
<!--   select(-DAY_WEEK) %>% -->
<!--   unite(date, DAY:MINUTE, sep = "-") %>% -->
<!--   mutate(date = dmy_hm(date),  -->
<!--          drunk_dr = drunk_dr >= 1, -->
<!--          daytime = hour(date) %in% c(7:19), -->
<!--          month = month(date)) %>% -->
<!--   filter(!is.na(date)) -->
<!-- accident %>% slice(1:5) -->
<!-- ``` -->

<!-- - Join the information from `state_fips` into the `accident` data frame. There may be a few locations in the `state_fips` data frame that are not included in the `accident` data frame (e.g., Virgin Islands), so when you join keep all observations in `accident` but only the observations in `state_fips` that match at least one row of `accident`. The first few rows of the joined dataset should look like this: -->

<!-- ```{r echo = FALSE} -->
<!-- accident <- accident %>% -->
<!--   left_join(state_fips, by = "state") -->
<!-- accident %>% slice(1:5) -->
<!-- ``` -->

<!-- - Summarize the data to get the total number of accidents in Colorado in each month, separated by (1) daytime and nighttime and (2) related or unrelated to drunk driving  (in other words, in January, how many daytime accidents were there that were unrelated to drunk driving? How many nighttime accidents that were unrelated to drunk driving? etc.). The summarized data should look like this: -->

<!-- ```{r echo = FALSE} -->
<!-- accident %>% -->
<!--   filter(state_name == "Colorado") %>% -->
<!--   group_by(daytime, month, drunk_dr) %>% -->
<!--   summarize(accidents = n()) -->
<!-- ``` -->

<!-- - Write a function that inputs a data frame (`df`) and outputs this type of summary data frame (like the one just created for Colorado) for whatever data is in the input data frame. Below are some examples of how this function should work: -->

<!-- ```{r echo = FALSE} -->
<!-- summarize_fars <- function(df){ -->
<!--   df %>% -->
<!--     group_by(daytime, month, drunk_dr) %>% -->
<!--     summarize(accidents = n()) -->
<!-- } -->
<!-- ``` -->

<!-- ```{r} -->
<!-- colorado_data <- accident %>%  -->
<!--   filter(state_name == "Colorado") -->

<!-- colorado_summary <- summarize_fars(df = colorado_data) -->
<!-- head(colorado_summary) -->

<!-- # Note also that you can pipe with the new function: -->
<!-- accident %>%  -->
<!--   filter(state_name == "Texas") %>%  -->
<!--   summarize_fars() %>%  -->
<!--   tbl_df() %>%  -->
<!--   slice(1:3) -->
<!-- ``` -->

<!-- - Once you've written the function, see if you can figure out what the following code does. How does the new function fit in? (Note: We could have achieved the same thing with basic `dplyr` code, but this framework will allow you to ultimately do a lot more than you can with `dplyr`.) -->

<!-- ```{r eval = FALSE} -->
<!-- library(purrr) -->

<!-- accident %>%  -->
<!--   filter(state_name %in% c("Colorado", "Texas", "California", "New York")) %>%  -->
<!--   group_by(state_name) %>% -->
<!--   nest() %>%  -->
<!--   mutate(summary = map(data, summarize_fars)) %>%  -->
<!--   select(-data) %>%  -->
<!--   unnest()  -->
<!-- ``` -->

<!-- - Write code to create boxplots for Colorado of the distribution of total accidents within each month. Create separate boxplots for daytime and nighttime accidents, and facet by whether the accident was related to drunk driving. The plot should look like the plot below.  -->

<!-- ```{r check9, echo = FALSE, fig.width = 6, fig.height = 3, fig.align = "center"} -->
<!-- accident %>% -->
<!--   filter(state_name == "Colorado") %>% -->
<!--   mutate(drunk_dr = factor(drunk_dr, labels = c("Unrelated to\ndrunk driving", -->
<!--                                                 "Related to\ndrunk driving")), -->
<!--          daytime = factor(daytime, labels = c("Nighttime", "Daytime"))) %>% -->
<!--   group_by(daytime, month, drunk_dr) %>% -->
<!--   summarize(accidents = n()) %>% -->
<!--   ggplot(aes(x = daytime, y = accidents, group = daytime)) +  -->
<!--   geom_boxplot() +  -->
<!--   facet_wrap(~ drunk_dr, ncol = 2) + -->
<!--   xlab("Time of day") + ylab("# of monthly accidents") +  -->
<!--   ggtitle(paste("Fatal accidents in", "Colorado", "in 2015")) -->
<!-- ``` -->

<!-- - Now write a function called `plot_fars` to create a plot like the one you just made for Colorado for any data frame with the format of `accident` (i.e., same number, types, and names of columns). Test it on subsets of the data for several states (Colorado, Texas, California, and New York). (*Hint*: To get a function to print out a plot created with ggplot, you will need to explicitly print the output from your function. See the examples of using the function below.)  -->

<!-- ```{r check8, echo = FALSE, fig.width = 6, fig.height = 3, fig.align = "center"} -->
<!-- plot_fars <- function(df){ -->
<!--   fars_plot <- df %>% -->
<!--     mutate(drunk_dr = factor(drunk_dr, labels = c("Unrelated to\ndrunk driving", -->
<!--                                                   "Related to\ndrunk driving")), -->
<!--            daytime = factor(daytime, labels = c("Nighttime", "Daytime"))) %>% -->
<!--     group_by(daytime, month, drunk_dr) %>% -->
<!--     summarize(accidents = n()) %>% -->
<!--     ggplot(aes(x = daytime, y = accidents, group = daytime)) + -->
<!--     geom_boxplot() + -->
<!--     facet_wrap(~ drunk_dr, ncol = 2) + -->
<!--     xlab("Time of day") + ylab("# of monthly accidents")  -->
<!-- } -->
<!-- ``` -->

<!-- Here are some examples of what should happen when you run this function: -->

<!-- ```{r, fig.width = 6, fig.height = 3, fig.align = "center"} -->
<!-- co_plot <- plot_fars(df = filter(accident, state_name == "Colorado")) -->
<!-- print(co_plot) -->

<!-- accident %>%  -->
<!--   filter(state_name == "Texas") %>%  -->
<!--   plot_fars() %>%  -->
<!--   print() -->
<!-- ``` -->

<!-- - Once you have written this function, what happens when you run the following code? -->

<!-- ```{r eval = FALSE, fig.width = 6, fig.height = 3, fig.align = "center"} -->
<!-- library(purrr) -->

<!-- state_plots <- accident %>%  -->
<!--   filter(state_name %in% c("Colorado", "Texas", "California", "New York")) %>%  -->
<!--   group_by(state_name) %>% -->
<!--   nest() %>%  -->
<!--   mutate(plots = map(data, plot_fars))  -->

<!-- class(state_plots[["plots"]]) -->
<!-- class(state_plots[["plots"]][[1]]) -->
<!-- print(state_plots[["plots"]][[1]])$plot -->
<!-- ``` -->

<!-- - Install the `cowplot` package (this is a `ggplot2` extension) and then try running the following code. What happens when you run this code? -->

<!-- ```{r eval = FALSE} -->
<!-- plot_grid(plotlist = state_plots[["plots"]],  -->
<!--           ncol = 2, labels = "AUTO") -->
<!-- ``` -->

<!-- #### Example R code -->

<!-- Here is the code to read the dataset with state names and FIPS codes at http://www2.census.gov/geo/docs/reference/state.txt into an R object called `state_fips` and clean it so the first few lines: -->

<!-- ```{r check7, echo = FALSE, warning = FALSE, message = FALSE} -->
<!-- state_fips <- read_delim("http://www2.census.gov/geo/docs/reference/state.txt", -->
<!--                          delim = "|") %>% -->
<!--   rename(state = STATE, -->
<!--          state_name = STATE_NAME) %>% -->
<!--   select(state, state_name) %>% -->
<!--   mutate(state = as.integer(state)) -->
<!-- state_fips %>% slice(1:5) -->
<!-- ``` -->

<!-- Note that you can read this file directly from the website using `read_delim`.  -->

<!-- Read the 2015 FARS data into an R object named `accident`. Use all the date and time information to create a column named `date` with the date and time of the accident. Include information on whether the accident was related to drunk driving (FALSE if there were 0 drunk drivers, TRUE if there were one or more), and create columns that gives whether the accident was during the day (7 AM to 7 PM) or not as well as the month of the accident (for this last column, you can either retain it from the original data or recalculate it based on the new `date` variable). Filter out any values where the date-time does not render (i.e., `date` is a missing value). You can use the following code to do all this: -->

<!-- ```{r check6, warning = FALSE, message = FALSE, echo = FALSE} -->
<!-- accident <- read_csv("data/accident.csv") %>% -->
<!--   select(STATE, DAY:MINUTE, DRUNK_DR) %>% -->
<!--   rename(state = STATE,  -->
<!--          drunk_dr = DRUNK_DR) %>% -->
<!--   select(-DAY_WEEK) %>% -->
<!--   unite(date, DAY:MINUTE, sep = "-") %>% -->
<!--   mutate(date = dmy_hm(date),  -->
<!--          drunk_dr = drunk_dr >= 1, -->
<!--          daytime = hour(date) %in% c(7:19), -->
<!--          month = month(date)) %>% -->
<!--   filter(!is.na(date)) -->

<!-- accident %>% slice(1:5) -->
<!-- ``` -->

<!-- A few notes: -->

<!-- - Notice that `select` is using the `:` operator to pick several columns in a row. -->
<!-- - Some of the column names in all caps are changed to lower case to make them easier to work with.  -->
<!-- - The `DAY_WEEK` column is in the middle of other date columns, but if you remove it, you can use `unite` with `:` to join together all the date-time columns and then use `lubridate` to change this column into the right class.  -->
<!-- - A logical operator is used inside a `mutate` call to create a column of whether the accident involved drunk driving (one or more drunk drivers involved) -->
<!-- - The `hour` function from `lubridate` is used to check if the time of the accident falls in "daytime" or not -->
<!-- - Some of the accidents are missing some date information. A `filter` is used to filter that out.  -->

<!-- Join the information from `state_fips` into the `accident` data frame. There may be a few locations in the `state_fips` data frame that are not included in the `accident` data frame (e.g., Virgin Islands), so when you join keep all observations in `accident` but only the observations in `state_fips` that match at least one row of `accident`. You can use the following code for this: -->

<!-- ```{r echo = FALSE} -->
<!-- accident <- accident %>% -->
<!--   left_join(state_fips, by = "state") -->

<!-- accident %>% slice(1:5) -->
<!-- ``` -->

<!-- Summarize the data to get the total number of accidents, separated by (1) daytime and nighttime and (2) related or unrelated to drunk driving, in each month (in other words, in January, how many daytime accidents were there that were unrelated to drunk driving? How many nighttime accidents that were unrelated to drunk driving? etc.). You can do that with this code: -->

<!-- ```{r} -->
<!-- accident %>% -->
<!--   filter(state_name == "Colorado") %>% -->
<!--   group_by(daytime, month, drunk_dr) %>% -->
<!--   summarize(accidents = n()) -->
<!-- ``` -->

<!-- As a note, you may want to create a table (for example, for a report) from the data at this stage. You could use `unite` then `pivot_wider` to do this pretty easily: -->

<!-- ```{r check5} -->
<!-- accident %>% -->
<!--   filter(state_name == "Colorado") %>% -->
<!--   mutate(daytime = factor(daytime, labels = c("Nighttime", "Daytime")), -->
<!--          drunk_dr = factor(drunk_dr,  -->
<!--                            labels = c("Not drunk driving", "Drunk driving"))) %>%  -->
<!--   group_by(daytime, month, drunk_dr) %>% -->
<!--   summarize(accidents = n()) %>%  -->
<!--   ungroup() %>%  -->
<!--   unite(category, daytime, drunk_dr, sep = " / ") %>%  -->
<!--   pivot_wider(names_from = category, values_from = accidents) %>%  -->
<!--   knitr::kable() -->
<!-- ``` -->

<!-- Write a function that inputs a data frame (`df`) and outputs this type of summary data frame (like the one just created for Colorado). You can do that with the following code. Note that, because it inputs a data frame and outputs a data frame, you can include it in a pipeline.  -->

<!-- ```{r} -->
<!-- summarize_fars <- function(df){ -->
<!--   df %>% -->
<!--     group_by(daytime, month, drunk_dr) %>% -->
<!--     summarize(accidents = n()) -->
<!-- } -->
<!-- ``` -->

<!-- Once you've written the function, see if you can figure out what the following code does. This code limits the data to data from four states and then applies the `summarize_fars` function that you just wrote to the subset of data from each state. Finally, since we `nest` to do that, the pipeline includes some lines to `unnest` the data to get back to an unnested data frame: -->

<!-- ```{r check4, eval = FALSE} -->
<!-- library(purrr) -->

<!-- accident %>%  -->
<!--   filter(state_name %in% c("Colorado", "Texas", "California", "New York")) %>%  -->
<!--   group_by(state_name) %>% -->
<!--   nest() %>%  -->
<!--   mutate(summary = map(data, summarize_fars)) %>%  -->
<!--   select(-data) %>%  -->
<!--   unnest()  -->
<!-- ``` -->

<!-- Write code to create boxplots for Colorado of the distribution of total accidents within each month. Create separate boxplots for daytime and nighttime accidents, and facet by whether the accident was related to drunk driving. You can do that with this code: -->

<!-- ```{r check3, fig.width = 6, fig.height = 3, fig.align = "center"} -->
<!-- accident %>% -->
<!--   filter(state_name == "Colorado") %>% -->
<!--   mutate(drunk_dr = factor(drunk_dr, labels = c("Unrelated to\ndrunk driving", -->
<!--                                                 "Related to\ndrunk driving")), -->
<!--          daytime = factor(daytime, labels = c("Nighttime", "Daytime"))) %>% -->
<!--   group_by(daytime, month, drunk_dr) %>% -->
<!--   summarize(accidents = n()) %>% -->
<!--   ggplot(aes(x = daytime, y = accidents, group = daytime)) +  -->
<!--   geom_boxplot() +  -->
<!--   facet_wrap(~ drunk_dr, ncol = 2) + -->
<!--   xlab("Time of day") + ylab("# of monthly accidents") +  -->
<!--   ggtitle(paste("Fatal accidents in", "Colorado", "in 2015")) -->
<!-- ``` -->


<!-- Now write a function called `plot_fars` to create a plot like the one you just made for Colorado for any data frame with the format of `accident` (i.e., same number, types, and names of columns). Test it on subsets of the data for several states (Colorado, Texas, California, and New York). (*Hint*: To get a function to print out a plot created with ggplot, you must explicitly print the plot object. For example, you could assign the plot to `fars_plot`, and then you would run `print(fars_plot)` within your loop as the last step.) -->

<!-- ```{r check2, echo = FALSE, fig.width = 6, fig.height = 3, fig.align = "center"} -->
<!-- plot_fars <- function(df){ -->
<!--   fars_plot <- df %>% -->
<!--     mutate(drunk_dr = factor(drunk_dr, labels = c("Unrelated to\ndrunk driving", -->
<!--                                                   "Related to\ndrunk driving")), -->
<!--            daytime = factor(daytime, labels = c("Nighttime", "Daytime"))) %>% -->
<!--     group_by(daytime, month, drunk_dr) %>% -->
<!--     summarize(accidents = n()) %>% -->
<!--     ggplot(aes(x = daytime, y = accidents, group = daytime)) + -->
<!--     geom_boxplot() + -->
<!--     facet_wrap(~ drunk_dr, ncol = 2) + -->
<!--     xlab("Time of day") + ylab("# of monthly accidents")  -->
<!--   print(fars_plot) -->
<!-- } -->
<!-- ``` -->

<!-- Notice how similar this function is to the code you wrote in the previous step.  -->

<!-- Once you have written this function, what happens when you run the following code? The following code applies this function to the subset of data from each of four states. The output (`state_plots`) is a nested data frame, where the new `plots` column is a list of `ggplot` objects. If you run `print` on this list, it will print each of these plots out separately (use the arrow buttons in the "Plots" Pane in RStudio to browse through these plots). -->

<!-- ```{r check1, fig.width = 6, fig.height = 3, fig.align = "center", eval = FALSE} -->
<!-- library(purrr) -->

<!-- state_plots <- accident %>%  -->
<!--   filter(state_name %in% c("Colorado", "Texas", "California", "New York")) %>%  -->
<!--   group_by(state_name) %>% -->
<!--   nest() %>%  -->
<!--   mutate(plots = map(data, plot_fars))  -->

<!-- state_plots -->
<!-- class(state_plots) -->
<!-- class(state_plots[["plots"]]) -->
<!-- class(state_plots[["plots"]][[1]]) -->
<!-- state_plots[["plots"]][[1]]$plot -->
<!-- ``` -->

<!-- Install the `cowplot` package (this is a `ggplot2` extension) and then try running the following code. What happens? The `plot_grid` function, if you input a list with `ggplot` objects using the `plotlist` argument, will print all the plots out on the same page.  -->

<!-- ```{r fig.width = 12, fig.height = 8, fig.align = "center", eval = FALSE} -->
<!-- library(cowplot) -->
<!-- plot_grid(plotlist = state_plots[["plots"]],  -->
<!--           ncol = 2, labels = "AUTO") -->
<!-- ``` -->

<!-- --- -->