This repository has been archived by the owner on Oct 1, 2020. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 44
/
03-A-data-exploration.Rmd
290 lines (192 loc) · 9.46 KB
/
03-A-data-exploration.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
---
layout: topic
title: "Data exploration"
author: Tad, Priscilla San Juan
output: html_document
---
**Assigned Reading:**
> Zuur, A. F., E. N. Ieno, and C. S. Elphick. 2010. A protocol for data exploration to avoid common statistical problems. *Methods in Ecology and Evolution* **1**: 3-14. [DOI: 10.1111/j.2041-210X.2009.00001.x](https://dx.doi.org/10.1111/j.2041-210X.2009.00001.x)
```{r include = FALSE}
# This code block sets up the r session when the page is rendered to html
# include = FALSE means that it will not be included in the html document
# Write every code block to the html document
knitr::opts_chunk$set(echo = TRUE)
# Write the results of every code block to the html document
knitr::opts_chunk$set(eval = TRUE)
# Define the directory where images generated by knit will be saved
knitr::opts_chunk$set(fig.path = "images/03-A/")
# Set the web address where R will look for files from this repository
# Do not change this address
repo_url <- "https://raw.githubusercontent.com/fukamilab/BIO202/master/"
```
### Key Points
#### Data exploration
1. Outliers Y & X
+ Outliers in response variable vs in covariates - to be dealt with differently
+ Transformation less desirable for response variable than in covariates?
2. Homogeneity Y
3. Normality Y
+ Example of importance of biological intuition and graphical investigation
+ Fig 5b - transformation to get normality not desirable
4. Zero troble Y
+ Zero-inflated GLM
+ Double zeros, or joint absences - what do they mean? e.g., spatial clumping
+ Multivariate analysis that ignores double zeros
5. Collinearity X
+ VIF of 10, 3, or 2 - reason for these values?
+ VIFs, or common sense or biological knowledge
6. Relationships Y & X
+ Multi-panel scatter plots for checking for outliers
7. Interactions
+ Are data balanced? Use coplot (Fig 11).
8. Independence Y
+ Mixed effects, etc. to deal with non-independence
+ ACF or variograms for checking for temporal and spatial non-independence
Not all steps always needed
#### Common themes
+ Biological intuition = key to make decisions about stats
+ Hypothesis testing vs hypothesis generation
+ One solution - two data sets - one to create hypotheses and one to test them
+ But only practical for large data sets
+ Emphasis on graphical tools
### Analysis Example
The code below relies on the `phyloseq` package from [Bioconductor](https://www.bioconductor.org/). See these [instructions](https://joey711.github.io/phyloseq/install.html) for how to install it.
```{r, include=FALSE, eval=FALSE}
# Load packages
library(phyloseq) # from Bioconducter - see instructions for installing
library(readr)
library(lattice)
library(qpcR)
library(ggplot2)
library(plotly)
# Load data
otu <- as.matrix((read_csv("~/Documents/R/Eco_stats_proj/Data/raw/OTU_tab_renamed.csv", col_names = TRUE)[,-1]))
tax <- as.matrix((read_csv("~/Documents/R/Eco_stats_proj/Data/raw/TAX_reorder_otu_name.csv", col_names = TRUE)[,-1]))
# Add in rownames
rownames(otu) <- paste0("OTU", 1:nrow(otu))
rownames(tax) <- paste0("OTU", 1:nrow(tax))
# Read file into phyloseq
OTU <- otu_table(otu, taxa_are_rows = TRUE)
TAX <- tax_table(tax)
# Combining OTU and TAX data
data <- phyloseq(OTU, TAX)
# Reading the sample data file
sample = sample_data(data.frame(read_csv("~/Documents/R/Eco_stats_proj/Data/raw/samplesinfo.csv")))
# Naming the the first rows using the first column
row.names(sample) = sample$ID
# Merging OTU/TAX file "Bac.data" with sampledata (includes MM and NEG)
merged <- merge_phyloseq(data, sample)
# Removing MM and NEG from merged_dataset
Real <- as.vector(data.frame(sample_data(merged))$Bird_species)
Real <- (!(Real%in%c("MM", "NEG", "RCRW")))
Real[is.na(Real)] = FALSE
merged_data = prune_samples(Real, merged)
# Remove OTUs with bad BLAST matches
kingdom <- as.vector(data.frame(tax_table(merged_data))$kingdom)
tax.keep <- kingdom=="Bacteria"
tax.keep[is.na(tax.keep)] = FALSE
merged_data <- prune_taxa(tax.keep,merged_data)
# Remove chloroplast (likely plant DNA)
class <- as.vector(data.frame(tax_table(merged_data))$class)
class <- (!(class%in%c("Chloroplast")))
class[is.na(class)] = FALSE
merged_data = prune_taxa(class, merged_data)
# Subset by species - BTSA
BTSA <- as.vector(data.frame(sample_data(merged_data))$Bird_species)
BTSA <- BTSA%in%c("BTSA")
BTSA[is.na(BTSA)] = FALSE
data.BTSA <- prune_samples(BTSA, merged_data)
# Subset by species - CCRO
CCRO <- as.vector(data.frame(sample_data(merged_data))$Bird_species)
CCRO <- CCRO%in%c("CCRO")
CCRO[is.na(CCRO)] = FALSE
data.CCRO <- prune_samples(CCRO, merged_data)
# Subset by species - OBNT
OBNT <- as.vector(data.frame(sample_data(merged_data))$Bird_species)
OBNT <- OBNT%in%c("OBNT")
OBNT[is.na(OBNT)] = FALSE
data.OBNT <- prune_samples(OBNT, merged_data)
# Subset by species - RCWA
RCWA <- as.vector(data.frame(sample_data(merged_data))$Bird_species)
RCWA <- RCWA%in%c("RCWA")
RCWA[is.na(RCWA)] = FALSE
data.RCWA <- prune_samples(RCWA, merged_data)
# Subset by species - SWTH
SWTH <- as.vector(data.frame(sample_data(merged_data))$Bird_species)
SWTH <- SWTH%in%c("SWTH")
SWTH[is.na(SWTH)] = FALSE
data.SWTH <- prune_samples(SWTH, merged_data)
# Subset by species - YWAR
YWAR <- as.vector(data.frame(sample_data(merged_data))$Bird_species)
YWAR <- YWAR%in%c("YWAR")
YWAR[is.na(YWAR)] = FALSE
data.YWAR <- prune_samples(YWAR, merged_data)
```
#### Checking for outliers
Zurr et al. recommends plotting your data using boxplots and dotcharts to detect outliers. Before removing suspected outliers, make sure they are actually outliers!
Since my data is multivariate, I used sequence count per sample for outlier detection in the following examples.
**Boxplot**
![](images/03-A/boxplot.png)
```{r, eval = FALSE}
boxplot(sample_sums(merged_data), ylab = "Sequences per sample")
```
**Cleveland dotchart**
![](images/03-A/dotchart.png)
```{r, eval = FALSE}
dotchart(sample_sums(merged_data), xlab = "Sequences per sample",
ylab = "Samples", labels = F)
```
**Plotly - great visualization tool**
![](images/03-A/ggplot.png)
```{r, eval = FALSE}
seq_num <- data.frame(ID=sample_data(merged_data)$ID, Seq=sample_sums(merged_data))
dot_gg <- ggplot(seq_num, aes(x=Seq,y=ID)) + geom_point(col="coral") +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
ggplotly(dot_gg)
```
**Dotchart for multiple variables**
Since most of the variables in my dataset are categorical, here I used raw data (sequence count) subsetted by bird species.
![](images/03-A/dotchart_variable.png)
```{r, eval = FALSE}
Z <- qpcR:::cbind.na(sample_sums(data.BTSA), sample_sums(data.CCRO), sample_sums(data.OBNT),
sample_sums(data.RCWA), sample_sums(data.SWTH), sample_sums(data.YWAR))
colnames(Z) <- c("BTSA", "CCRO", "OBNT", "RCWA", "SWTH", "YWAR")
dotplot(as.matrix(Z), groups = FALSE,
strip = strip.custom(bg = 'white',
par.strip.text = list(cex = 0.8)),
scales = list(x = list(relation = "free"),
y = list(relation = "free"),
draw = FALSE),
col = 1, cex = 0.5, pch = 16,
xlab = "Number of sequences per sample",
ylab = "Order of the data from text file")
```
***
#### Your data's distribution
Some statistical tests assume normal distributions, making it important to check the shape of your data. Plot a histogram with all of your data and then groups of your data.
> "If you want to apply a statistical test to determine whether there is significant group separation in a discriminant analysis, however, normality of observations of a particular variable within each group is important"
**Histogram**
![](images/03-A/histogram.png)
```{r, eval = FALSE}
hist(sample_sums(merged_data), breaks = 100,col = "#a6bddb", xlab="Sequences per sample", main = ' ')
```
***
### Discussion Questions
**Q1:** When should you let go of an outlier?
**Q2:** How can you check for independence using multivariate data?
> "Hence, it is important to check whether there is dependence in the raw data before doing the analysis, and also the residuals afterwards. These checks can be made by plotting the response variable vs. time or spatial coordinates."
**Q3:** _To transform or not to transform? That is the question._
> "There are three main reasons for a transformation: **to reduce the effect of outliers** (especially in covariates), **to stabilize the variance** and **to linearize relationships**. However, using more advanced techniques like GLS and GAMs, heterogenity and nonlinearity problems can be solved, making transformation less important."
**Q4:** What are some data exploration techniques you have used?
***
### After-class follow-up
+ beeswarm functions recommended by Anna to look at data distribution:
[ggbeeswarm](https://cran.r-project.org/web/packages/ggbeeswarm/index.html)
[ggbeeswarm vignette](https://cran.r-project.org/web/packages/ggbeeswarm/vignettes/usageExamples.pdf)
[beeswarm](http://www.cbs.dtu.dk/~eklund/beeswarm/)
+ O'Hara and Kotze 2010 Do not log-transform count data:
[Publication](http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2010.00021.x/abstract)
[Blog post](https://www.r-bloggers.com/do-not-log-transform-count-data-bitches/)
+ Paper Tad mentioned that uses "principal coordinate analysis of a truncated distance matrix" (PCNM) to incorporate spaptial trends in multivariate data:
[Toju et al. 2017](http://web.stanford.edu/~fukamit/toju-et-al-2017-accepted.pdf)
Also, from Sandra: the PCNM (truncated distance matrix) method is discussed starting on page 244 of this book: https://link.springer.com/content/pdf/10.1007%2F978-1-4419-7976-6.pdf