---
layout: topic
title: "Introduction to ggplot2 and dplyr"
author: Jes
output: html_document
---


**Assigned Reading:**

> Chapters 1.1, 1.2, 2, 3, 9 and 10 in: Wickham, H. 2009. *ggplot2: Elegant Graphics for Data Analysis.* Springer. DOI: [Stanford Full Text](https://link.springer.com/book/10.1007%2F978-0-387-98141-3)
>
> The sections on mutating and filtering joins in [this vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html)

**Optional Reading:**

> Chapter 4 in: Wickham, H. 2009. *ggplot2: Elegant Graphics for Data Analysis.* Springer. DOI: [Stanford Full Text](https://link.springer.com/book/10.1007%2F978-0-387-98141-3)

RStudio has produced several helpful [cheatsheets](https://www.rstudio.com/resources/cheatsheets/) on data management and visualization with ggplot2 and tidyr. You may want to view or print one or several of them.


```{r include = FALSE}
# This code block sets up the R session when the page is rendered to HTML
# include = FALSE means that this block will not be included in the HTML document

# Write every code block to the html document 
knitr::opts_chunk$set(echo = TRUE)

# Write the results of every code block to the html document 
knitr::opts_chunk$set(eval = TRUE)

# Define the directory where images generated by knit will be saved
knitr::opts_chunk$set(fig.path = "images/02-A/")

# Set the web address where R will look for files from this repository
# Do not change this address
repo_url <- "https://raw.githubusercontent.com/fukamilab/BIO202/master/"
```

### Before Class

Download [this R script](https://github.com/FukamiLab/BIO202/blob/master/code/02-A-tidyr-exercise-code.R) and save it in your code folder in the research project you created on the first day of class. You will need to click the 'Raw' button and then download the text file that appears. Be sure to remove the .txt extension that may have been added when you saved the file to your computer so that the file ends in .R. You may want to run the first part of the script that downloads the data to be sure that there will be no issues with data access during class.

### Key Points

+ Data should be organized so that observations are in rows and attributes (variables) are in columns.
+ The `tidyr` and `dplyr` packages contain functions for manipulating data:
    + `spread` and `gather` (tidyr) convert data tables between long and wide formats.
    + `separate` and `unite` (tidyr) split one variable into several or combine several variables into one.
    + `filter` (dplyr) creates a subset of a data table.
    + `mutate` (dplyr) creates new variables.
    + `group_by` and `summarise` (dplyr) are used to summarize observations according to the values of their variables.
+ The pipe operator `%>%` is used to link several operations together (i.e., function composition).
+ The `ggplot2` package is a set of functions for visualizing data.
    + A ggplot2 plot contains three components: (1) the data, (2) the aesthetic mappings between variables and visual properties, and (3) layers describing how to display the observations.
    + The aesthetic mapping `aes` must minimally describe which variables define the plot coordinate space.
    + `facet_wrap` creates a separate panel for each subset of the data defined by a variable.
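
The data manipulation functions listed above can be chained together with the pipe. The following is a minimal sketch on a small made-up table, not on the course data: `counts_wide` and all of its column names are hypothetical and exist only for illustration.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data: one row per site-year, one column per species count
counts_wide <- data.frame(
  site_year = c("A_2015", "B_2015", "A_2016"),
  sp1 = c(3, 0, 5),
  sp2 = c(1, 4, 2),
  stringsAsFactors = FALSE
)

counts_long <- counts_wide %>%
  # gather: wide to long, one row per site-year-species observation
  gather(key = "species", value = "count", sp1, sp2) %>%
  # separate: split one variable into two
  separate(site_year, into = c("site", "year"), sep = "_") %>%
  # filter: keep only rows where the species was observed
  filter(count > 0) %>%
  # mutate: add a new variable computed from an existing one
  mutate(log_count = log(count))

# group_by + summarise: total count per site
site_totals <- counts_long %>%
  group_by(site) %>%
  summarise(total = sum(count))

# spread: long back to wide; combinations removed by filter become NA
counts_back <- counts_long %>%
  select(-log_count) %>%
  spread(key = species, value = count)
```

Each step receives the result of the previous one as its first argument, so the pipeline reads top to bottom in the order the operations are applied. Recent versions of tidyr also provide `pivot_longer` and `pivot_wider`, which supersede `gather` and `spread`; the older functions still work.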

### Manipulating data with dplyr

Open the R project that you created on the first day of class. When RStudio opens, open the R script file that you downloaded for today's class (see above). This script downloads a data set from our course website and then manipulates and summarises the data. The data set contains three tables from the Winter 2016 BIO46 class on lichen microorganisms:

1. **trees** is a table where observations are trees on which lichens were collected. Columns describe each tree's location, elevation, and species name.
2. **lichens** is a table where each observation is a lichen that was collected. Columns describe the tree on which the lichen was collected, the team that collected the lichen, the collection date, and the lichen species name.
3. **algae** is a table where each observation is a fragment of a lichen from which we attempted to amplify and sequence green algal ITS2 rDNA. We attempted to amplify algal DNA from four sections of each lichen in order to quantify algal diversity within a single lichen. If sequencing was successful, the DNA sequence is given along with a code identifying its unique haplotype (GenotypeID). The TaxonID column indicates haplotypes that matched those reported in [Werth and Sork (2010)](http://dw.doi.org/10.3732/ajb.0900276).
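
Because each lichen was collected from a tree, information from one table can be attached to another with a join, as covered in the assigned vignette. The sketch below uses two made-up tables: `trees_toy`, `lichens_toy`, and the column names `TreeID`, `LichenID`, and `TreeSpecies` are assumptions for illustration and may not match the names in the course data.

```r
library(dplyr)

# Hypothetical toy tables; the real tables and column names may differ
trees_toy <- data.frame(
  TreeID = c("T1", "T2", "T3"),
  TreeSpecies = c("oak", "bay", "oak"),
  stringsAsFactors = FALSE
)
lichens_toy <- data.frame(
  LichenID = c("L1", "L2", "L3"),
  TreeID = c("T1", "T1", "T2"),
  stringsAsFactors = FALSE
)

# Mutating join: add the tree columns to every lichen row
lichens_with_trees <- left_join(lichens_toy, trees_toy, by = "TreeID")

# Filtering joins: keep or drop trees depending on whether a lichen matches
trees_sampled <- semi_join(trees_toy, lichens_toy, by = "TreeID")
trees_unsampled <- anti_join(trees_toy, lichens_toy, by = "TreeID")
```

A mutating join adds columns from the second table, while a filtering join only changes which rows of the first table are kept.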

The R script explores algal diversity among lichens and trees. Working in pairs, execute each line of code and then add a comment (using `#`) above each line of code with a description of what the code does. 

When you have finished commenting the code, try to complete the following challenge:

> **Challenge:**
>
> How could you modify the code that creates the `lichenXgeno` table so that it displays the fraction of successful sequences belonging to each GenotypeID for each lichen?


### Plotting data with ggplot2

Working with your partner, create the following plots using `ggplot2`:

1. A plot that maps the location where each lichen was collected and colors each point by the Chao1 richness estimator.
2. A version of the above plot split into three panels, one for each tree species.
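
As a starting point, here is a minimal sketch of the same plot structure using R's built-in `mtcars` data set rather than the lichen data. The variables `wt`, `mpg`, `hp`, and `cyl` are stand-ins: two continuous variables for the coordinates, one continuous variable for color, and one grouping variable with three levels for the panels.

```r
library(ggplot2)

# Data + aesthetic mappings: x and y define the coordinate space,
# color maps a third variable onto point color
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg, color = hp)) +
  # Layer: draw each observation as a point
  geom_point() +
  # One panel per value of cyl (mtcars has three: 4, 6 and 8)
  facet_wrap(~ cyl)

print(p)
```

Removing the `facet_wrap` line gives a single plot with all observations in one panel.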