This repository has been archived by the owner on Oct 1, 2020. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 44
/
03-B-linear-models.Rmd
183 lines (132 loc) · 5.65 KB
/
03-B-linear-models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
---
layout: topic
title: "Linear models"
author: Tad, SR Schachat
output: html_document
---
**Assigned Reading:**
> Appendix A from: Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A. and Smith, G. M. 2009. *Mixed Effects Models and Extensions in Ecology with R.* Springer. [Here's the whole book](https://link-springer-com.stanford.idm.oclc.org/book/10.1007%2F978-0-387-87458-6) or [you can download just the appendix](https://link.springer.com/content/pdf/bbm%3A978-0-387-87458-6%2F1.pdf).
```{r include = FALSE}
# This code block sets up the r session when the page is rendered to html
# include = FALSE means that it will not be included in the html document
# Write every code block to the html document
knitr::opts_chunk$set(echo = TRUE)
# Write the results of every code block to the html document
knitr::opts_chunk$set(eval = TRUE)
# Define the directory where images generated by knit will be saved
knitr::opts_chunk$set(fig.path = "images/03-B/")
# Set the web address where R will look for files from this repository
# Do not change this address
repo_url <- "https://raw.githubusercontent.com/fukamilab/BIO202/master/"
```
### Key Points
#### Data exploration: outliers and collinearity
+ Normality not needed for explanatory variables
+ Correlation - "value of 0.6 (and -0.6) is not large enough to worry us"?
#### Linear regression: model selection
+ Which interactions to include - opinions vary (p. 537)
+ `summary(M1)`, `drop1(M1, test="F")`, `anova(M1)` - Is `drop1()` the best method?
+ Which explanatory variables to drop - opinions vary (p. 541)
+ `step(M1)` - easy one-line command; AIC-based backwards selection
#### Linear regression: model validation
+ Mostly done graphically
+ But not very powerful when n is small (as common in ecological data?)
+ Model selection, then model validation - Is this always ok?
#### Linear regression: model interpretation
+ Graphical presentation - Data should be plotted.
#### Additive modeling
+ Zuur et al.'s favorite tool? (more in Ch 3)
+ Try model with all explanatory variables, to look at potential non-linear effect
+ Difficulties: overfitting and need for large data?
***
### Analysis Example
In this example, we'll use linear regression to examine whether there is any sort of relationship beween Damage Type (DT) diversity and herbivorized leaf area on fossil leaves.
First, let's read in the data:
```{r results = 'hide'}
# Read in a data file from the data folder on the BIO 202 GitHub repository
data <- read.csv("https://raw.githubusercontent.com/FukamiLab/BIO202/master/data/03-B-permian.csv")
data <- data[order(data$dts),]
```
Now, we can simply plot DT diversity against herbivorized leaf area:
```{r}
plot(data$dts~data$rem)
```
Not a particularly clear relationship there! We can try running a simple linear regression:
```{r}
fit <- lm(dts~0+rem, data=data)
summary(fit)
```
Very low p-value (< 2.2e-16), very unconvincing R-squared (0.10).
We can try plotting a different line for each site (as in Fig. A.3 from the reading, on page 544).
```{r}
fit <- lm(dts~0+rem+site, data=data)
summary(fit)
D1 <- data[data$site=="CCP",]
D2 <- data[data$site=="MCF",]
P1 <- predict(fit, newdata = D1)
P2 <- predict(fit, newdata = D2)
plot(data$dts~data$rem)
lines(D1$rem, P1, col="blue")
lines(D2$rem, P2, col="red")
```
We can try plotting a different line for each species instead of each site.
```{r}
fit <- lm(dts~0+rem+sp, data=data)
D1 <- data[data$sp=="cau",]
D2 <- data[data$sp=="cev",]
D3 <- data[data$sp=="cta",]
D4 <- data[data$sp=="mgi",]
D5 <- data[data$sp=="mta",]
P1 <- predict(fit, newdata = D1)
P2 <- predict(fit, newdata = D2)
P3 <- predict(fit, newdata = D3)
P4 <- predict(fit, newdata = D4)
P5 <- predict(fit, newdata = D5)
plot(data$dts~data$rem)
lines(D1$rem, P1, co="red")
lines(D2$rem, P2, col="blue")
lines(D3$rem, P3, col="green")
lines(D4$rem, P4, col="purple")
lines(D5$rem, P5, col="yellow")
```
To be thorough, we can try adding all possible predictor variables, dropping them, and seeing what happens.
```{r}
fit.full <- lm(dts~0+rem+sp+site+area, data=data)
drop1(fit.full, test="F")
```
Now a run-of-the-mill GAM:
```{r}
library(mgcv)
AM1 <- gam(dts ~ s(rem) + sp + site + s(area), data = data)
anova(AM1)
```
All predictors except "site" are significant.
Now a fancy GAM with cubic regression spline with shrinkage.
```{r}
AM2 <- gam(dts ~ s(rem, bs = "cs") + sp + site + s(area, bs = "cs"), data = data)
anova(AM2)
```
Values change but no difference in significance at p = 0.05. So let's get rid of "site."
```{r}
AM3 <- gam(dts ~ s(rem, bs = "cs") + sp + s(area, bs = "cs"), data = data)
anova(AM3)
```
We can plot the smoothing function.
```{r}
plot(AM3)
```
### Discussion Questions
1. Page 537: Of the 7 possible ways to treat interaction terms, which do you prefer and why?
2. Page 537: What are "main terms" and why do they matter?
3. Page 537: Why are 4-way interactions a bad thing? What are "Cook distance values" and "non-convergence"?
4. Page 539: Review: What is RSS (residual sum of squares) and what does it tell us?
5. Page 540: What are "nested models" in this context?
6. Page 541: Of the 3 possible ways to treat non-significant explanatory variables, which do you prefer and why?
7. Page 542: How does AIC "judge" model fit?
8. Page 546: What is it about the fit of the lines in figure A.3 (pg. 544) that suggests that a linear model may not be the best fit?
9. Page 546: What is a Generalized Additive Model and how does it work?
10. Page 546: What is a thin plate spline and how is it relevant here?
11. Page 551: When might the Aikake weight be useful?
***
### After-class follow-up
+ df = 0 for site in the `drop1()` analysis above because some species were observed only in one of the two sites.