This repository has been archived by the owner on Sep 15, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 5
/
r4asme08 cox.Rmd
440 lines (323 loc) · 15.7 KB
/
r4asme08 cox.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
---
title: "8: Cox regression in cohort studies"
subtitle: "R 4 ASME"
authors: Authors – Lakmal Mudalige & Andrea Mazzella [(GitHub)](https://github.com/andreamazzella)
output: html_notebook
---
-------------------------------------------------------------------------------
## Contents
* Cox regression models
* simple
* adjusting for follow-up and age
* adjusting for other covariates
* Comparing Poisson and Cox
NB: the time-to-event (Kaplan-Meier) approach is not used here, but it was explored (with the same dataset!) in [R4SME 03 Survival](https://github.com/andreamazzella/R4SME).
-------------------------------------------------------------------------------
## 0. Packages and options
```{r message=FALSE, warning=FALSE}
# Load packages
library("haven")
library("magrittr")
library("survival")
library("epiDisplay")
library("tidyverse")
# Limit significant digits to 3, reduce scientific notation
options(digits = 3, scipen = 9)
```
-------------------------------------------------------------------------------
# Trinidad
## 1. Data manipulation
Read in dataset trinmlsh.dta. This contains data from a cohort study on cardiovascular risk factors and mortality among ~300 men from Trinidad.
```{r}
# Import the dataset
trin <- read_dta("trinmlsh.dta")
# Preview
trin
```
```{r include=FALSE}
# Explore the data
glimpse(trin)
```
Categorical variables need to be factorised and labelled. I manually took the labels from the Stata file...
The practical later asks to regroup some smoking levels, let's do it now.
```{r data management, include=FALSE}
trin %<>%
mutate(
ethgp = factor(ethgp,
levels = c(1:5),
labels = c("African", "Indian", "European", "mixed", "Chin/Sem")),
alc = factor(alc,
levels = c(0:3),
labels = c("none", "1-4/wk", "5-14/wk", ">=15/wk")),
smokenum = factor(smokenum,
levels = c(0:5),
labels = c("non-smok", "ex-smok", "1-9/d", "10-19/d", "20-29/d", ">=30/d")),
chdstart = factor(chdstart,
levels = c(0, 1),
labels = c("no", "yes")))
# Regroup smoking
trin %$% table(smokenum, useNA = "ifany")
trin$smok3 <- as.factor(ifelse(trin$smokenum == "non-smok" , "non-smok",
ifelse(trin$smokenum == "ex-smok", "ex-smok","smoker"))) %>%
fct_relevel("non-smok", "ex-smok", "smoker")
trin %$% table(smokenum, smok3, useNA = "ifany")
glimpse(trin)
summary(trin)
```
Now create a survival object to assess all-cause mortality, coded as `death`. Note that Surv() can take time data in two different formats: either a combination of data of entry and data of exit (like in session 7), or as a time difference. In this case, `years` codes this time difference, so we'll use it.
```{r}
# Survival object
trin_surv <- trin %$% Surv(time = years, event = death)
```
-------------------------------------------------------------------------------
## 2. Cox regression
We can now examine the smoking-specific mortality rates (per 1,000 person-years). Let's first use the classical technique and then let's use Cox regression.
```{r}
# Calculate rates
pyears(trin_surv ~ smok3, data = trin, scale = 1) %>%
summary(n = F, rate = T, ci.r = T, scale = 1000)
39 / 30.5
53.4 / 30.5
```
The package "survival" contains `coxph()`, the function for Cox regression. Exactly like `pyears()`, `coxph()` takes as first argument a formula with the survival object on the left and the exposure on the right.
To get the confidence intervals you need to use `confint()` and then exponentiate them.
Output:
- The rate ratios are in the column exp(coef).
- Wald's and LRT also visible
- Unfortunately there doesn't seem to be an option to show the base level.
```{r}
# Create Cox model
(cox_smok3 <- coxph(trin_surv ~ smok3, data = trin))
# Calculating 95% CIs for HRs
confint(cox_smok3) %>% exp()
```
The mortality rate ratio appears to increase for each level of smoking; there is some weak evidence for this (LRT p = 0.07).
-------------------------------------------------------------------------------
## 3. Cox with a numerical exposure
So, let's analyse the same association but with smoking coded as quantitative.
```{r}
# Cox with numerical exposure
(cox_smok3_lin <- coxph(trin_surv ~ as.numeric(smok3), data = trin))
confint(cox_smok3_lin) %>% exp()
```
Now there is good evidence for a linear trend: from one level of smoking to the next, the rate increases by 1.30.
-------------------------------------------------------------------------------
## 4. Timescale set as current age
Cox regression always automatically adjusts for a time variable: the way we made our survival object earlier, the timescale was set as time since entry. But what if adjusting for current age gave us different results?
This could happen if there are differences in age between the smoking levels. Let's check this.
```{r}
trin %>% group_by(smok3) %>%
drop_na() %>%
summarise("N" = n(),
"1st quartile" = quantile(ageent, 0.25),
"median" = median(ageent),
"3rd quartile" = quantile(ageent, 0.75))
#-- Wanna feel a bit extra? Represent this visually:
#trin %>% drop_na() %>% ggplot(aes(x = smok3, y = ageent)) + geom_boxplot()
#-- If you want to be QUITE extra:
#trin %>% drop_na() %>% ggplot(aes(x = smok3, y = ageent)) + geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) + geom_jitter(aes(x = reorder(smok3, desc(ageent)), y = ageent, colour = as.factor(death)), width=0.15) + labs(title = "Age at entry by smoking status", subtitle = "An unnecessarily *extra* graph", y = "Age at entry (years)", x = "", colour = "CHD death") + theme_bw() + scale_colour_brewer(type = "qual", palette = 2)
```
The age of entry seems similar in all three levels, so the results shouldn't be very different with a different timescale, but ASME wants us to check anyway with Cox regression, so let's do it.
To set the timescale as current age, we need to put the date of birth in the "origin" argument. Also, all these dates need to be converted into numbers of years (since 01/01/1970, the base date in R) and this is how you do it. If this Surv() function seems too complex, there is another version in the comments.
```{r}
# Survival object set for current age
trin_surv_age <- trin %$% Surv(as.numeric(timein) / 365.25,
as.numeric(timeout) / 365.25,
death,
origin = as.numeric(timebth) / 365.25)
#-- You may prefer to create some intermediate object and have a simpler Surv():
#trin$timein_yr <- as.numeric(timein) / 365.25
#trin$timeout_yr <- as.numeric(timeout) / 365.25
#trin$timebth_yr <- as.numeric(timebth) / 365.25
#trin_surv_age_yr <- trin %$% Surv(timein_yr, timeout_yr, death, origin = timebth_yr)
#-- Another way of calculating these years:
#trin$timein_yr <- Epi::cal.yr(trin$timein, format="%d/%m/%Y")
# Cox using the timeage as the time
(cox_smok3_age <- coxph(trin_surv_age ~ smok3, data = trin))
confint(cox_smok3_age) %>% exp()
```
Surprise! (not.) The estimates are almost the same as before doing all this.
-------------------------------------------------------------------------------
# Primary Biliary Cirrhosis (PBC)
This dataset contains information from an RCT comparing the immunosuppressant azathioprine with placebo.
* Outcome: `death`
* Treatment: `treat`
* Time:
* `time` (follow-up in years)
* `age` in years
* Other variables:
* `logb0` (log serum bilirubin concentration)
Given these time variables, how would you set up your survival object?
```{r include=FALSE}
# Read in the pbc1bas.dta dataset
pbc <- read_dta("pbc1bas.dta")
glimpse(pbc)
# Data management
pbc %<>% mutate(death = d,
treat = factor(treat, levels = c(1, 2), labels = c("placebo", "azath")),
cenc0 = factor(cenc0, levels = c(0, 1), labels = c("no", "yes")),
cir0 = factor(cir0, levels = c(0, 1), labels = c("no", "yes")),
gh0 = factor(gh0, levels = c(0, 1), labels = c("no", "yes")),
asc0 = factor(asc0, levels = c(0, 1), labels = c("no", "yes"))) %>%
select(-d)
glimpse(pbc)
summary(pbc)
```
-------------------------------------------------------------------------------
## 5. Comparing Poisson and Cox
Assess the relationship between treatment and mortality adjusting for baseline bilirubin, first with Poisson, and then with Cox regression.
```{r}
# Poisson model
glm(death ~ offset(log(time)) + treat + logb0, family = poisson(), data = pbc) %>% idr.display()
# Create a survival object
surv_pbc <- Surv(pbc$time, pbc$death)
# Cox model
(cox_bilir <- coxph(surv_pbc ~ treat + logb0, data = pbc))
exp(confint(cox_bilir))
```
Poisson: HR 0.73 (0.49, 1.10), LRT p = 0.13
Cox: HR 0.70 (0.43, 0.99), LRT p = 0.04
Unlike Poisson, Cox regression also adjusts for time in study, and its results provide good evidence for an effect of azathioprine.
-------------------------------------------------------------------------------
## 6. Cox'ing the Poisson
In order to make the Poisson model more similar to Cox, we need to account for time in study. We do this by performing a Lexis expansion, either with `survival::survSplit()` or `Epi::splitLexis()`, and then adding the newly-created "period" categorical variable as another covariate.
```{r survSplit}
# Explore distribution of time
hist(pbc$time)
# Split the survival object
pbc_split <- survSplit(surv_pbc ~ .,
data = pbc,
cut = c(2, 4, 6),
episode = "period")
# Factorise variable and label values
pbc_split$period <- factor(pbc_split$period,
levels = c(1, 2, 3, 4),
labels = c("0-2y", "2-4y", "4-6y", "6-12y"))
#View(pbc_split)
table(pbc_split$period, useNA = "ifany")
#-- Broken -- Fit a Poisson model
glm(death ~ offset(log(time)) + treat + logb0 + period, # period goes here
family = poisson(),
data = pbc_split) %>% # use the Lexis-split data, not the old one
idr.display()
#-- Also broken
#strat <- glm(death ~ offset(log(time)) + treat + logb0 + strata(period), # period goes here
# family = poisson(),
# data = pbc_split)
#strat$coefficients %>% exp()
#confint(strat) %>% exp()
```
*Issue* Whilst the overall adjusted HR is similar to the Stata output, the HRs for the period bands are completely wrong. I have no idea why this happens?
*Workaround* You can use the alternative way of calculating this:
```{r splitLexis}
library("Epi")
# Create a Lexis object
pbc$time.in <- 0
Lexis.pbc <-
pbc %>% Lexis(
entry = list(per = time.in),
exit = list(per = time),
exit.status = death,
data = .
)
# Lexis split by period
Lexis.pbc.per <-
splitLexis(Lexis.pbc, breaks = c(0, 2, 4, 6), time.scale = "per")
# Create a variable with the period
Lexis.pbc.per$per <- timeBand(Lexis.pbc.per, "per", type = "factor")
# Fit a Poisson model to the split Lexis object
PCB.m2 <-
glm(
lex.Xst ~ offset(log(lex.dur)) + per + as.factor(treat) + logb0,
family = poisson(),
data = Lexis.pbc.per
)
exp(PCB.m2$coefficients)
confint(PCB.m2) %>% exp()
```
-------------------------------------------------------------------------------
# No such a thing as too much Whitehall
Read in whitehal.dta.
```{r include=FALSE}
whitehall <- read_stata("whitehal.dta")
# Factorise job grade
whitehall$grade <- factor(whitehall$grade,
levels = c(1, 2),
labels = c("higher", "lower"))
glimpse(whitehall)
```
Build a Cox model to estimate the effect of job grade (`grade`) on cardiac mortality (`chd`), first with the follow-up scale. Afterwards, assess whether age is a confounder with both Cox and Poisson techniques. The time variables are `timein`, `timeout` and `timebth`.
```{r}
# Survival object in follow-up scale
surv_white <- whitehall %$% Surv(time = as.numeric(timein) / 365.25,
time2 = as.numeric(timeout) / 365.25,
event = chd)
# Cox
(cox_white <- coxph(surv_white ~ grade, data = whitehall))
confint(cox_white) %>% exp()
```
Now, let's assess if age is a confounder.
```{r}
# Create survival object
surv_white_age <- whitehall %$% Surv(time = as.numeric(timein) / 365.25,
time2 = as.numeric(timeout) / 365.25,
event = chd,
origin = as.numeric(timebth) / 365.25)
# Check youngest age at entry and oldest age at exit
summary(surv_white_age) # 40, 86
# Cox
(cox_white_age <- coxph(surv_white_age ~ grade, data = whitehall))
confint(cox_white_age) %>% exp()
```
This HR is much lower than the previous one, indicating that age is a confounder for the effect of grade.
The same with Poisson:
```{r}
# Split the survival object
split_white <- survSplit(surv_white_age ~ .,
data = whitehall,
cut = c(40, 65, 90),
episode = "age")
# Factorise variable and label values
split_white$age <- factor(split_white$age,
levels = c(1, 2, 3, 4),
labels = c("<=40", "40-65", "65-90", ">=90"))
View(split_white)
table(split_white$age, useNA = "ifany")
split_white$pyears <- split_white %$% as.numeric(timeout - timein) / 365.25
#-- BROKEN Fit a Poisson model
glm(chd ~ offset(log(pyears)) + grade + age, # period goes here
family = poisson(),
data = split_white) %>% # use the Lexis-split data, not the old one
idr.display()
```
*Same issue* as in chunk "survSplit", adj HR estimate is similar but not the same, and the age variable HR is completely off.
*Same workaround* Use the Epi package.
```{r LM}
# Create a variable with follow up time
# For the difftime function enter the two dates that we want the difference between - NB the arguments must be in as.date class
whitehall$follow.up <- difftime(whitehall$timeout, whitehall$timein, units = c("days"))/365.25
# Convert all the date variables for decimals for all Surv/Lexis objects, object spltting, COx/Poission regression
whitehall$timein <- cal.yr(whitehall$timein, format="%d/%m/%Y")
whitehall$timeout <- cal.yr(whitehall$timeout, format="%d/%m/%Y")
whitehall$timebth <- cal.yr(whitehall$timebth, format="%d/%m/%Y")
# The follow up variable can be transferred to numeric using the following command
whitehall$follow.up <- as.numeric(whitehall$follow.up)
# Cox model to estimate the effect of grade - NB cox surival object with follow up time as time scale
coxph(Surv(whitehall$follow.up, whitehall$chd) ~ grade, data= whitehall)
# Cox model to estimate the effect of grade - with the age as the time scale
coxph(Surv(whitehall$timein, whitehall$timeout, whitehall$chd, origin = whitehall$timebth) ~ grade, data= whitehall)
# There is a significant change in the HR once changing the time scale from follow up time to age - therefore there is evidence of confodunding
# Poisson regression
whitehall.m1 <- glm(chd ~ offset(log(follow.up)) + as.factor(grade), family = poisson(), data = whitehall)
exp(whitehall.m1$coefficients)
# Lexis expansion of the whitehall dataset
Lexis.whitehall.chd <- whitehall %>% Lexis(entry = list(per=timein), exit = list( per=timeout, age = timeout - timebth), exit.status = chd, data = . )
# Lexis split by age
Lexis.whitehall.chd<- splitLexis(Lexis.whitehall.chd, breaks = c(0, 40, seq(50,80,5)), time.scale="age" )
# Create a variable with the period
Lexis.whitehall.chd$currage <- timeBand(Lexis.whitehall.chd, "age", type = "factor")
# Fit a poission model to the lexis model
whitehall.m2 <- glm(lex.Xst ~ offset(log(lex.dur)) + currage + as.factor(grade), family = poisson(), data = Lexis.whitehall.chd)
exp(whitehall.m2$coefficients)
```
-------------------------------------------------------------------------------