-
Notifications
You must be signed in to change notification settings - Fork 15
/
08-abundance.Rmd
410 lines (320 loc) · 15.6 KB
/
08-abundance.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
```{r, echo=FALSE}
library(mia)
library(miaViz)
library(dplyr)
library(ANCOMBC)
library(DESeq2)
se <- readRDS("data/se.rds")
tse <- readRDS("data/tse.rds")
tse <- transformCounts(tse, method = "relabundance")
```
# Differential abundance analysis
Here, we analyse abundances with three different methods: **Wilcoxon test** (CLR), **DESeq2**,
and **ANCOM-BC**. All of these test statistical differences between groups.
We will analyse Genus level abundances.
We might want to first perform prevalence filtering to reduce the amount of multiple tests. In this particular dataset, all genera pass a prevalence threshold of 10%, therefore, we do not perform filtering.
## Wilcoxon test
A Wilcoxon test estimates the difference in an outcome between two groups. It is a
non-parametric alternative to a t-test, which means that the Wilcoxon test
does not make any assumptions about the data.
Let's first combine the data for the testing purpose.
```{r}
# Agglomerates data to Genus level
tse_genus <- agglomerateByRank(tse, rank = "Genus")
# Perform clr transformation. A Pseudocount of 1 needs to be added,
# because the data contains zeros and the clr transformation includes a
# log transformation.
tse_genus <- transformCounts(tse_genus, method = "clr", pseudocount = 1)
# Does transpose, so samples are in rows, then creates a data frame.
abundance_analysis_data <- data.frame(t(assay(tse_genus, "clr")))
# We will analyse whether abundances differ depending on the"patient_status".
# There are two groups: "ADHD" and "control".
# Let's include that to the data frame.
abundance_analysis_data <- cbind(
abundance_analysis_data,
patient_status = colData(tse_genus)$patient_status
)
```
Now we can start with the Wilcoxon test. We test all the taxa by looping through columns,
and store individual p-values to a vector. Then we create a data frame from collected
data.
The code below does the Wilcoxon test only for columns that contain abundances,
not for columns that contain patient status.
```{r}
genera <- names(abundance_analysis_data[, !names(abundance_analysis_data) %in% "patient_status"])
wilcoxon_p <- c() # Initialize empty vector for p-values
# Do "for loop" over selected column names
for (i in genera) {
result <- wilcox.test(abundance_analysis_data[, i] ~ patient_status,
data = abundance_analysis_data)
# Stores p-value to the vector with this column name
wilcoxon_p[[i]] <- result$p.value
}
wilcoxon_p <- data.frame(taxa = names(wilcoxon_p),
p_raw = unlist(wilcoxon_p))
```
Multiple tests were performed. These are not independent, so we need
to adjust p-values for multiple testing. Otherwise, we would increase
the chance of a type I error drastically depending on our p-value
threshold. By applying a p-value adjustment, we can keep the false
positive rate at a level that is acceptable. What is acceptable
depends on our research goals. Here we use the fdr method, but there
are several other methods as well.
```{r}
wilcoxon_p$p_adjusted <- p.adjust(wilcoxon_p$p_raw, method = "fdr")
```
```{r}
# prepare a dataframe to plot p values
df <- data.frame(x = c(wilcoxon_p$p_raw, wilcoxon_p$p_adjusted),
type=rep(c("raw", "fdr"),
c(length(wilcoxon_p$p_raw),
length(wilcoxon_p$p_adjusted))))
# make a histrogram of p values and adjusted p values
wilcoxon_plot <- ggplot(df) +
geom_histogram(aes(x=x, fill=type)) +
labs(x = "p-value", y = "Frequency")
wilcoxon_plot
```
## DESeq2
Our second analysis method is [DESeq2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8). This method performs the data
normalization automatically. It also takes care of the p-value
adjustment, so we don't have to worry about that.
DESeq2 utilizes a negative binomial distribution to detect differences in
read counts between groups. Its normalization takes care of the
differences between library sizes and compositions. DESeq2 analysis
includes multiple steps, but they are done automatically. More
information can be found, e.g., from Harvard Chan Bioinformatic Core's
tutorial [Introduction to DGE -
ARCHIVED](https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html)
Now let us show how to do this. First, run the DESeq2 analysis.
```{r}
# Creates DESeq2 object from the data. Uses "patient_status" to create groups.
ds2 <- DESeqDataSet(tse_genus, ~patient_status)
# Does the analysis
dds <- DESeq(ds2)
# Gets the results from the object
res <- results(dds)
# Creates a data frame from results
df <- as.data.frame(res)
# Adds taxon column that includes names of taxa
df$taxon <- rownames(df)
# Orders the rows of data frame in increasing order firstly based on column
# "log2FoldChange" and secondly based on "padj" column
df <- df %>% arrange(log2FoldChange, padj)
knitr::kable(head(df)) %>%
kableExtra::kable_styling("striped") %>%
kableExtra::scroll_box(width = "100%")
```
## ANCOM-BC
[The analysis of composition of microbiomes with bias correction (ANCOM-BC)](https://www.nature.com/articles/s41467-020-17041-7)
is a recently developed method for differential abundance testing. It is based on an
[earlier published approach](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4450248/).
The former version of this method could be recommended as part of several approaches:
A [recent study](https://www.biorxiv.org/content/10.1101/2021.05.10.443486v1.full)
compared several mainstream methods and found that among another method, ANCOM produced the most consistent results and is probably a conservative approach. Please note that based on this and other comparisons, no single method can be recommended across all datasets. Rather, it could be recommended to apply several methods and look at the overlap/differences.
As the only method, ANCOM-BC incorporates the so called *sampling fraction* into the model. The latter term could be empirically estimated by the ratio of the library size to the microbial load. Variations in this sampling fraction would bias differential abundance analyses if ignored. Furthermore, this method provides p-values, and confidence intervals for each taxon.
It also controls the FDR and it is computationally simple to implement.
As we will see below, to obtain results, all that is needed is to pass
a phyloseq object to the `ancombc()` function. Therefore, below we first convert
our `tse` object to a `phyloseq` object. Then, we specify the formula. In this formula, other covariates could potentially be included to adjust for confounding.
Please check the [function documentation](https://rdrr.io/github/FrederickHuangLin/ANCOMBC/man/ancombc.html)
to learn about the additional arguments that we specify below. Also, see [here](https://www.bioconductor.org/packages/release/bioc/vignettes/ANCOMBC/inst/doc/ANCOMBC.html) for another example for more than 1 group comparison.
```{r warning = FALSE}
# currently, ancombc requires the phyloseq format, but we can convert this easily
pseq <- makePhyloseqFromTreeSummarizedExperiment(tse)
pseq_genus <- phyloseq::tax_glom(pseq, taxrank = "Genus")
out = ancombc(
phyloseq = pseq_genus,
formula = "patient_status",
p_adj_method = "fdr",
zero_cut = 0.90, # by default prevalence filter of 10% is applied
lib_cut = 0,
group = "patient_status",
struc_zero = TRUE,
neg_lb = TRUE,
tol = 1e-5,
max_iter = 100,
conserve = TRUE,
alpha = 0.05,
global = TRUE
)
res <- out$res
```
The object `out` contains all relevant information. Again, see the
[documentation of the function](https://rdrr.io/github/FrederickHuangLin/ANCOMBC/man/ancombc.html)
under **Value** for an explanation of all the output objects. Our question can be answered
by looking at the `res` object, which now contains dataframes with the coefficients,
standard errors, p-values and q-values. Conveniently, there is a dataframe `diff_abn`.
Here, we can find all differentially abundant taxa. Below we show the first 6 entries of this dataframe:
```{r}
knitr::kable(head(res$diff_abn)) %>% kableExtra::kable_styling("striped") %>%
kableExtra::scroll_box(width = "100%")
```
In total, this method detects `r sum(out$res$diff_abn$patient_statusControl)` differentially abundant taxa.
## Comparison of the methods
Let's compare results that we got from the methods.
As we can see from the scatter plot, DESeq2 gives lower p-values than Wilcoxon test.
```{r}
mf <- data.frame(df$padj, wilcoxon_p$p_adjusted)
p <- ggplot(mf, aes(x = df$padj, y = wilcoxon_p$p_adjusted)) +
labs(x = "DESeq2 adjusted p-value", y = "Wilcoxon test adjusted p-value") +
geom_count() +
scale_size_area(max_size = 10)
print(p)
```
Prints number of p-values under 0.05
```{r}
print(paste0("Wilcoxon test p-values under 0.05: ", sum(wilcoxon_p$p_adjusted<0.05, na.rm = TRUE), "/", length(wilcoxon_p$p_adjusted)))
print(paste0("DESeq2 p-values under 0.05: ", sum(df$padj<0.05, na.rm = TRUE), "/", length(df$padj)))
print(paste0("ANCOM p-values under 0.05: ", sum(out$res$diff_abn$patient_statusControl), "/", length(out$res$diff_abn$patient_statusControl)))
```
We can also look at the intersection of identified taxa
```{r}
# to let R check this for us, we need to make sure,
# to use the same tax names (I call it labels here) everywhere.
wilcox_labels <- tibble(
wilcox_labels_new = rowData(tse_genus)$Genus,
taxa = colnames(data.frame(t(assay(tse_genus, "clr"))))
)
wilcox_taxa <-wilcoxon_p %>%
left_join(wilcox_labels, by = "taxa") %>%
filter(p_adjusted <= 0.05) %>%
.$wilcox_labels_new
deseq2_taxa <- filter(df, padj <= 0.05) %>%
.$taxon %>%
stringr::str_remove("Genus:") %>%
stringr::str_remove("Order:")
# for ancom we need to assign genus names to ids
taxid_df <- tibble::rownames_to_column(
as.data.frame(rowData(tse)),
"taxid") %>%
select(taxid, Genus)
ancom_taxa <- tibble::rownames_to_column(out$res$diff_abn, "taxid") %>%
left_join(taxid_df, by = "taxid") %>%
filter(patient_statusControl) %>%
.$Genus
# all methods identified in common:
Reduce(intersect, list(deseq2_taxa, wilcox_taxa, ancom_taxa))
```
## Comparison of abundance
In previous steps, we got information which taxa vary between ADHD and control groups.
Let's plot those taxa in the boxplot, and compare visually if abundances of those taxa
differ in ADHD and control samples. For comparison, let's plot also taxa that do not
differ between ADHD and control groups.
Let's first gather data about taxa that have highest p-values.
```{r}
# There are some taxa that do not include Genus level information. They are
# excluded from analysis.
# str_detect finds if the pattern is present in values of "taxon" column.
# Subset is taken, only those rows are included that do not include the pattern.
df <- df[ !stringr::str_detect(df$taxon, "Genus:uncultured"), ]
# Sorts p-values in decreasing order. Takes 3 first ones. Takes those rows that match
# with p-values. Takes taxa.
highest3 <- df[df$padj %in% sort(df$padj, decreasing = TRUE)[1:3], ]$taxon
# From clr transformed table, takes only those taxa that had highest p-values
highest3 <- assay(tse_genus, "clr")[highest3, ]
# Transposes the table
highest3 <- t(highest3)
# Adds colData that includes patient status infomation
highest3 <- data.frame(highest3, as.data.frame(colData(tse_genus)))
# Some taxa names are that long that they don't fit nicely into title. So let's add there
# a line break after e.g. "Genus". Here the dot after e.g. Genus is replaced with
# ": \n"
colnames(highest3)[1:3] <- lapply(colnames(highest3)[1:3], function(x){
# Replaces the first dot
temp <- stringr::str_replace(x, "[.]", ": ")
# Replace all other dots and underscores with space
temp <- stringr::str_replace_all(temp, c("[.]" = " ", "_" = " "))
# Adds line break so that 25 characters is the maximal width
temp <- stringr::str_wrap(temp, width = 25)
})
```
Next, let's do the same but for taxa with lowest p-values.
```{r}
# Sorts p-values in increasing order. Takes 3rd first ones. Takes those rows that match
# with p-values. Takes taxa.
lowest3 <- df[df$padj %in% sort(df$padj, decreasing = FALSE)[1:3], ]$taxon
# From clr transformed table, takes only those taxa that had lowest p-values
lowest3 <-assay(tse_genus, "clr")[lowest3, ]
# Transposes the table
lowest3 <- t(lowest3)
# Adds colData that includes patient status infomation
lowest3 <- data.frame(lowest3, as.data.frame(colData(tse_genus)))
# Some taxa names are that long that they don't fit nicely into title. So let's add there
# a line break after e.g. "Genus". Here the dot after e.g. Genus is replaced with
# ": \n"
colnames(lowest3)[1:3] <- lapply(colnames(lowest3)[1:3], function(x){
# Replaces the first dot
temp <- stringr::str_replace(x, "[.]", ": ")
# Replace all other dots and underscores with space
temp <- stringr::str_replace_all(temp, c("[.]" = " ", "_" = " "))
# Adds line break so that 25 characters is the maximal width
temp <- stringr::str_wrap(temp, width = 25)
})
```
Then we can plot these six different taxa. Let's arrange them into the same picture.
```{r}
# Puts plots in the same picture
gridExtra::grid.arrange(
# Plot 1
ggplot(highest3, aes(x = patient_status, y = highest3[,1])) +
geom_boxplot() +
ylab("CLR abundances") + # y axis title
ggtitle(names(highest3)[1]) + # main title
theme(title = element_text(size = 7),
axis.text = element_text(size = 7),
axis.title.x=element_blank()), # makes titles smaller, removes x axis title
# Plot 2
ggplot(highest3, aes(x = patient_status, y = highest3[,2])) +
geom_boxplot() +
ylab("CLR abundances") + # y axis title
ggtitle(names(highest3)[2]) + # main title
theme(title = element_text(size = 7),
axis.text = element_text(size = 7),
axis.title.x=element_blank()), # makes titles smaller, removes x axis title
# Plot 3
ggplot(highest3, aes(x = patient_status, y = highest3[,3])) +
geom_boxplot() +
ylab("CLR abundances") + # y axis title
ggtitle(names(highest3)[3]) + # main title
theme(title = element_text(size = 7),
axis.text = element_text(size = 7),
axis.title.x=element_blank()), # makes titles smaller, removes x axis title
# Plot 4
ggplot(lowest3, aes(x = patient_status, y = lowest3[,1])) +
geom_boxplot() +
ylab("CLR abundances") + # y axis title
ggtitle(names(lowest3)[1]) + # main title
theme(title = element_text(size = 7),
axis.text = element_text(size = 7),
axis.title.x=element_blank()), # makes titles smaller, removes x axis title
# Plot 5
ggplot(lowest3, aes(x = patient_status, y = lowest3[,2])) +
geom_boxplot() +
ylab("CLR abundances") + # y axis title
ggtitle(names(lowest3)[2]) + # main title
theme(title = element_text(size = 7),
axis.text = element_text(size = 7),
axis.title.x=element_blank()), # makes titles smaller, removes x axis title
# Plot 6
ggplot(lowest3, aes(x = patient_status, y = lowest3[,3])) +
geom_boxplot() +
ylab("CLR abundances") + # y axis title
ggtitle(names(lowest3)[3]) + # main title
theme(title = element_text(size = 7),
axis.text = element_text(size = 7),
axis.title.x=element_blank()), # makes titles smaller, removes x axis title
# 3 columns and 2 rows
ncol = 3,
nrow = 2
)
```
We plotted those taxa that have the highest and lowest p values according to DESeq2. Can you create a plot that shows the difference in abundance in "[Ruminococcus]_gauvreauii_group", which is the other taxon that was identified by all tools. Try for yourself! Below you find one way how to do it.
```{r}
select(
abundance_analysis_data,
patient_status,
Ruminococcus_gauvreauii_group = contains("gauvreauii_group")) %>%
ggplot(aes(patient_status, Ruminococcus_gauvreauii_group)) +
geom_boxplot()
```