-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path1_microbiome_r.Rmd
569 lines (462 loc) · 29.4 KB
/
1_microbiome_r.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
---
title: Microbiome analysis in R
---
```{r startup, include = FALSE}
# install package dependency for r markdown files
p <- c("rmarkdown", "knitr", "DT")
install_package <- function(p) {
if (!requireNamespace(p, quietly = TRUE)) {
install.packages(p, repos = "http://cran.us.r-project.org/")
}
}
invisible(lapply(p, install_package))
# xaringanExtra
if (!requireNamespace("xaringanExtra", quietly = TRUE)) {
devtools::install_github("gadenbuie/xaringanExtra", upgrade = "never")
}
```
```{r xaringanExtra-clipboard, echo=FALSE}
htmltools::tagList(
xaringanExtra::use_clipboard(
button_text = "<i class=\"fa fa-clipboard\"></i>",
success_text = "<i class=\"fa fa-check\" style=\"color: #90BE6D\"></i>",
error_text = "<i class=\"fa fa-times-circle\" style=\"color: #F94144\"></i>"
),
rmarkdown::html_dependency_font_awesome()
)
```
**BEFORE YOU START:** This is a tutorial to analyze microbiome data with R. The tutorial starts from the processed output from metagenomic sequencing, i.e. a feature matrix. It's suitable for **R users** who wants to have hand-on tour of the microbiome world. This tutorial covers the common microbiome analysis e.g. alpha/beta diversity, differential abundance analysis.
- The demo data-set comes from the [QIIME 2](https://qiime2.org/) tutorial - [Moving Pictures](https://docs.qiime2.org/2020.11/tutorials/moving-pictures/).
- This is a **beginner tutorial**. You still have time to run away if you're an experienced bioinformatician. `r emo::ji("ghost")`
## What you will work on
After sequencing, you will get some ([fastq](https://en.wikipedia.org/wiki/FASTQ_format)/[fast5](https://nanoporetech.com/nanopore-sequencing-data-analysis)/[h5](http://files.pacb.com/software/instrument/2.0.0/bas.h5%20Reference%20Guide.pdf)) files from the sequencing. These files contain the nucleotide information. If the sequenced sample is a mixed community, i.e. metagenomics, you need to identify where the reads come from (specific species and samples). There are plenty of strategies for metagenomics such as amplicon and shotgun sequencing. But you have more or less similar *structural profiles*:
- Feature table (Necessary), a matrix containing the abundances of detected microbial features (OTUs, ASVs, microbial markers)
- Taxonomy table (Optional), an array indicating the taxonomic assignment of features, which can be integrated in [biom format](https://biom-format.org/documentation/format_versions/biom-2.1.html).
- Phylogenetic tree (Optional), a phylogenetic tree indicating the evolutional similarity of microbial features, potentially used when calculating phylogenetic diversity metrics (optionally integrated in [biom format](https://biom-format.org/documentation/format_versions/biom-2.1.html)).
- Metadata, a data frame containing the grouping information, etc (optionally integrated in [biom format](https://biom-format.org/documentation/format_versions/biom-2.1.html)).
## Preparation
**Start with QIIME 2 output** Here QIIME 2 data is adopted as an example. If you don't have QIIME 2 output, you can skip the `bash code` for `QIIME 2`. All data is imported using R built-ins, but QIIME 2 outputs need extra steps (Here the QIIME 2 tools is used. Some R tricks shall also work if you unzip the files there). QIIME 2 data files are defined in [qza](https://docs.qiime2.org/2020.11/concepts/) format , which are actually just *zipped* files. Below are data from the [Moving Pictures](https://docs.qiime2.org/2020.11/tutorials/moving-pictures/) in QIIME 2.
```{bash, echo = FALSE}
ls *.qza sample-metadata.tsv
```
Download [sample-metadata.tsv](https://data.qiime2.org/2020.11/tutorials/moving-pictures/sample_metadata.tsv), [table.qza](https://docs.qiime2.org/2020.11/data/tutorials/moving-pictures/table.qza), [taxonomy.qza](https://docs.qiime2.org/2020.11/data/tutorials/moving-pictures/taxonomy.qza) and [rooted-tree.qza](https://docs.qiime2.org/2020.11/data/tutorials/moving-pictures/rooted-tree.qza), and name them accordingly.
As the R built-ins do not directly accept `.qza` or `.biom` files, we have to extract the content inside the `.qza` files. You need QIIME 2 (qiime2-2020.11) [installed](https://docs.qiime2.org/2020.11/install/). Open the terminal and run the [bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)) script below. Windows users shall enable [WSL 2](https://docs.microsoft.com/en-us/windows/wsl/install-win10) (Windows Subsystem for Linux) or use [visual machines](https://docs.qiime2.org/2020.11/install/virtual/virtualbox/) instead.
```{bash, message = FALSE, results = FALSE, engine.opts = "-i"}
conda activate qiime2-2020.11
for i in *.qza; do
qiime tools export --input-path $i --output-path .
done
biom convert -i feature-table.biom -o feature-table.tsv --to-tsv
conda deactivate
```
And you will have
```{bash, echo = FALSE}
ls *.tsv tree.nwk
```
Install and load relevant packages: [phyloseq](https://github.com/joey711/phyloseq), [tidyverse](https://ggplot2.tidyverse.org/), [vegan](https://cran.r-project.org/web/packages/vegan/index.html), [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) , [ANCOMBC](http://www.bioconductor.org/packages/release/bioc/html/ANCOMBC.html), [ComplexHeatmap](https://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html). The following code are solely written in R. You can open the [Rstudio](https://rstudio.com/) and repeat it now.
```{r, message = FALSE}
p1 <- c("tidyverse", "vegan", "BiocManager")
p2 <- c("phyloseq", "ANCOMBC", "DESeq2", "ComplexHeatmap")
load_package <- function(p) {
if (!requireNamespace(p, quietly = TRUE)) {
ifelse(p %in% p1,
install.packages(p, repos = "http://cran.us.r-project.org/"),
BiocManager::install(p))
}
library(p, character.only = TRUE, quietly = TRUE)
}
invisible(lapply(c(p1,p2), load_package))
```
## Build phyloseq object
To build a phyloseq project, the phyloseq description is [here](https://joey711.github.io/phyloseq/import-data.html).
```{r, warning = FALSE}
otu <- read.table(file = "feature-table.tsv", sep = "\t", header = T, row.names = 1,
skip = 1, comment.char = "")
taxonomy <- read.table(file = "taxonomy.tsv", sep = "\t", header = T ,row.names = 1)
# clean the taxonomy, Greengenes format
tax <- taxonomy %>%
select(Taxon) %>%
separate(Taxon, c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), "; ")
tax.clean <- data.frame(row.names = row.names(tax),
Kingdom = str_replace(tax[,1], "k__",""),
Phylum = str_replace(tax[,2], "p__",""),
Class = str_replace(tax[,3], "c__",""),
Order = str_replace(tax[,4], "o__",""),
Family = str_replace(tax[,5], "f__",""),
Genus = str_replace(tax[,6], "g__",""),
Species = str_replace(tax[,7], "s__",""),
stringsAsFactors = FALSE)
tax.clean[is.na(tax.clean)] <- ""
tax.clean[tax.clean=="__"] <- ""
for (i in 1:nrow(tax.clean)){
if (tax.clean[i,7] != ""){
tax.clean$Species[i] <- paste(tax.clean$Genus[i], tax.clean$Species[i], sep = " ")
} else if (tax.clean[i,2] == ""){
kingdom <- paste("Unclassified", tax.clean[i,1], sep = " ")
tax.clean[i, 2:7] <- kingdom
} else if (tax.clean[i,3] == ""){
phylum <- paste("Unclassified", tax.clean[i,2], sep = " ")
tax.clean[i, 3:7] <- phylum
} else if (tax.clean[i,4] == ""){
class <- paste("Unclassified", tax.clean[i,3], sep = " ")
tax.clean[i, 4:7] <- class
} else if (tax.clean[i,5] == ""){
order <- paste("Unclassified", tax.clean[i,4], sep = " ")
tax.clean[i, 5:7] <- order
} else if (tax.clean[i,6] == ""){
family <- paste("Unclassified", tax.clean[i,5], sep = " ")
tax.clean[i, 6:7] <- family
} else if (tax.clean[i,7] == ""){
tax.clean$Species[i] <- paste("Unclassified ",tax.clean$Genus[i], sep = " ")
}
}
metadata <- read.table(file = "sample-metadata.tsv", sep = "\t", header = T, row.names = 1)
OTU = otu_table(as.matrix(otu), taxa_are_rows = TRUE)
TAX = tax_table(as.matrix(tax.clean))
SAMPLE <- sample_data(metadata)
TREE = read_tree("tree.nwk")
# merge the data
ps <- phyloseq(OTU, TAX, SAMPLE,TREE)
```
There's one loop to clean the Greengene format taxonomy. See what has happened there?
**Before**
```r
tax
```
```{r, echo = FALSE}
DT::datatable(tax, options = list(scrollX=T))
```
**After**
```r
taxa.clean
```
```{r, echo = FALSE}
DT::datatable(tax.clean, options = list(scrollX=T))
```
> Tips: The cleaning process helps when you want to collapse the feature table at one specific level, e.g. species or genus. The taxonomy of feature is identified by e.g. the [LCA algorithm](https://en.wikipedia.org/wiki/Lowest_common_ancestor). It's possible that some features are merely classified at genus or family level due to the [sequence homology](https://en.wikipedia.org/wiki/Sequence_homology). The loop is just to iteratively fill in the gaps with respective lowest taxonomic classification, resulting in more details in visualization.
An integrated phyloseq project looks like
```{r, echo = FALSE}
ps
```
## Check the sequencing depth with rarefaction curves
```{r, rarefaction, echo=FALSE}
rarecurve(t(otu_table(ps)), step=50, cex=0.5)
```
```{r, include = FALSE}
depth <- colSums(otu_table(ps))
```
The sequencing depth is `r sprintf("%.3f", mean(depth))` ± `r sprintf("%.3f", sd(depth))` (mean ± SD), ranging from `r min(depth)` to `r max(depth)`.
As we could see, a sequencing depth of 1000 has captured most taxonomic tags. To minimize the bias from varying sequencing depth, rarefaction is recommended before calculating the diversity metrics. However, it's [not always the rule of thumb](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003531). To keep consistence with the [Moving pictures tutorial](https://docs.qiime2.org/2020.11/tutorials/moving-pictures/), all samples are rarefied (re-sampled) to the sequencing depth of 1103. As it's random sampling process, the result is unlikely to be identical as the [Moving pictures tutorial](https://docs.qiime2.org/2020.11/tutorials/moving-pictures/)
```{r, message = FALSE}
set.seed(111) # keep result reproductive
ps.rarefied = rarefy_even_depth(ps, rngseed=1, sample.size=1103, replace=F)
```
After rarefaction, the phyloseq looks like
```{r, echo = FALSE}
ps.rarefied
```
```{r, include = FALSE}
depth_r <- colSums(otu_table(ps.rarefied))
```
The sequencing depth of all remaining `r ncol(otu_table(ps.rarefied))` samples is `r mean(depth_r)`.
## Alpha diversity
Alpha diversity metrics assess the species diversity within the ecosystems, telling you how diverse a sequenced community is.
```{r, alpha_diversity}
plot_richness(ps.rarefied, x="body.site", measures=c("Observed", "Shannon")) +
geom_boxplot() +
theme_classic() +
theme(strip.background = element_blank(), axis.text.x.bottom = element_text(angle = -90))
```
We could see tongue samples had the lowest alpha diversity. Next, some statistics: pairwise test with Wilcoxon rank-sum test, corrected by FDR method. For the number of observed tags,
```{r, warning = FALSE, results = FALSE, message = FALSE}
rich = estimate_richness(ps.rarefied, measures = c("Observed", "Shannon"))
wilcox.observed <- pairwise.wilcox.test(rich$Observed,
sample_data(ps.rarefied)$body.site,
p.adjust.method = "BH")
tab.observed <- wilcox.observed$p.value %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "group1") %>%
gather(key="group2", value="p.adj", -group1) %>%
na.omit()
tab.observed
```
```{r, echo = FALSE}
DT::datatable(tab.observed, rownames = FALSE)
```
Similarly for Shannon diversity,
```{r, warning = FALSE, results = FALSE, message = FALSE}
wilcox.shannon <- pairwise.wilcox.test(rich$Shannon,
sample_data(ps.rarefied)$body.site,
p.adjust.method = "BH")
tab.shannon <- wilcox.shannon$p.value %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "group1") %>%
gather(key="group2", value="p.adj", -group1) %>%
na.omit()
tab.shannon
```
```{r, echo = FALSE}
DT::datatable(tab.shannon, rownames = FALSE)
```
## Beta diversity
Beta diversity metrics assess the dissimilarity between ecosystem, telling you to what extend one community is different from another. Here are some demo code using Bray Curtis and binary Jaccard distance metrics. I personally do not recommend directly using the UniFrac metrics supported by phyloseq, especially when your tree is not dichotomous. This will happen when some children branches lose their parents in rarefaction.`r emo::ji("sad")` You need to re-root your tree in e.g. QIIME 2 if you run into this case, see [here](https://github.com/joey711/phyloseq/issues/936). Besides, the UniFrac distance calculation by phyloseq has not been checked with the source code, see the posts in [phyloseq issues](https://github.com/joey711/phyloseq/issues/956) and [QIIME 2 forum](https://forum.qiime2.org/t/very-different-weighted-unifrac-results-for-qiime2-versus-phyloseq/4591). So it's safer to use QIIME 2 version given that both QIIME and UniFrac concepts came from the [Knight lab](https://knightlab.ucsd.edu/).
#### [PCoA](https://mb3is.megx.net/gustame/dissimilarity-based-methods/principal-coordinates-analysis) and [PERMANOVA/ADONIS](https://mb3is.megx.net/gustame/hypothesis-tests/manova/npmanova)
- *Bray curtis*
`distance` is in overlapped use, so we specify the package by `phyloseq::distance`
```{r, pcoa_bray}
dist = phyloseq::distance(ps.rarefied, method="bray")
ordination = ordinate(ps.rarefied, method="PCoA", distance=dist)
plot_ordination(ps.rarefied, ordination, color="body.site") +
theme_classic() +
theme(strip.background = element_blank())
```
*PERMANOVA/ADONIS*
```{r, results = FALSE, message = FALSE}
metadata <- data.frame(sample_data(ps.rarefied))
test.adonis <- adonis(dist ~ body.site, data = metadata)
test.adonis <- as.data.frame(test.adonis$aov.tab)
test.adonis
```
```{r, echo = FALSE}
DT::datatable(test.adonis, options = list(scrollX=T))
```
*PAIRWISE PERMANOVA*
```{r, results = FALSE, message = FALSE}
cbn <- combn(x=unique(metadata$body.site), m = 2)
p <- c()
for(i in 1:ncol(cbn)){
ps.subs <- subset_samples(ps.rarefied, body.site %in% cbn[,i])
metadata_sub <- data.frame(sample_data(ps.subs))
permanova_pairwise <- adonis(phyloseq::distance(ps.subs, method = "bray") ~ body.site,
data = metadata_sub)
p <- c(p, permanova_pairwise$aov.tab$`Pr(>F)`[1])
}
p.adj <- p.adjust(p, method = "BH")
p.table <- cbind.data.frame(t(cbn), p=p, p.adj=p.adj)
p.table
```
```{r, echo = FALSE}
DT::datatable(p.table, rownames = FALSE)
```
As we could see, there's no difference in composition between the left and right palm samples. Similar result can be found using binary Jaccard metrics. Adjust the script of PERMANOVA/ADONIS accordingly with the code below and give it a try.
```r
dist = distance(ps.rarefied, method="jaccard", binary = TRUE)
```
And
```r
permanova_pairwise <- adonis(distance(ps.subs, method="jaccard", binary = TRUE) ~ body.site, data = metadata_sub)
```
#### [NMDS](https://mb3is.megx.net/gustame/dissimilarity-based-methods/nmds) and [ANOSIM](https://mb3is.megx.net/gustame/hypothesis-tests/anosim)
The microbiome data is compositional and sparse. Non-parametric multidimentional scaling (NMDS) is favored also, usually aligned with ANOSIM test.
- *Binary Jaccard*
```{r, nmds_jaccard, results = FALSE, message = FALSE}
dist = phyloseq::distance(ps.rarefied, method="jaccard", binary = TRUE)
ordination = ordinate(ps.rarefied, method="NMDS", distance=dist)
```
```{r}
plot_ordination(ps.rarefied, ordination, color="body.site") +
theme_classic() +
theme(strip.background = element_blank())
```
*ANOSIM*
```{r}
metadata <- data.frame(sample_data(ps.rarefied))
anosim(dist, metadata$body.site)
```
*PAIRWISE ANOSIM*
```{r}
cbn <- combn(x=unique(metadata$body.site), m = 2)
p <- c()
for(i in 1:ncol(cbn)){
ps.subs <- subset_samples(ps.rarefied, body.site %in% cbn[,i])
metadata_sub <- data.frame(sample_data(ps.subs))
permanova_pairwise <- anosim(phyloseq::distance(ps.subs, method="jaccard", binary = TRUE),
metadata_sub$body.site)
p <- c(p, permanova_pairwise$signif[1])
}
p.adj <- p.adjust(p, method = "BH")
p.table <- cbind.data.frame(t(cbn), p=p, p.adj=p.adj)
```
```{r, echo = FALSE}
DT::datatable(p.table, rownames = FALSE)
```
The NMDS and ANOSIM result were similar to that from PCoA and ADONIS.
## Abundance bar plot
Here the demos use a un-rarefied table to visualize all detected tags.
- *Phylum*
```{r, phylum_abun, message = FALSE}
ps.rel = transform_sample_counts(ps, function(x) x/sum(x)*100)
# agglomerate taxa
glom <- tax_glom(ps.rel, taxrank = 'Phylum', NArm = FALSE)
ps.melt <- psmelt(glom)
# change to character for easy-adjusted level
ps.melt$Phylum <- as.character(ps.melt$Phylum)
ps.melt <- ps.melt %>%
group_by(body.site, Phylum) %>%
mutate(median=median(Abundance))
# select group median > 1
keep <- unique(ps.melt$Phylum[ps.melt$median > 1])
ps.melt$Phylum[!(ps.melt$Phylum %in% keep)] <- "< 1%"
#to get the same rows together
ps.melt_sum <- ps.melt %>%
group_by(Sample,body.site,Phylum) %>%
summarise(Abundance=sum(Abundance))
ggplot(ps.melt_sum, aes(x = Sample, y = Abundance, fill = Phylum)) +
geom_bar(stat = "identity", aes(fill=Phylum)) +
labs(x="", y="%") +
facet_wrap(~body.site, scales= "free_x", nrow=1) +
theme_classic() +
theme(strip.background = element_blank(),
axis.text.x.bottom = element_text(angle = -90))
```
`< 1%` indicates the rare taxa in each group, with median relative abundance < 1%.
- *Genus*
```{r, genus_abun, message = FALSE}
ps.rel = transform_sample_counts(ps, function(x) x/sum(x)*100)
# agglomerate taxa
glom <- tax_glom(ps.rel, taxrank = 'Genus', NArm = FALSE)
ps.melt <- psmelt(glom)
# change to character for easy-adjusted level
ps.melt$Genus <- as.character(ps.melt$Genus)
ps.melt <- ps.melt %>%
group_by(body.site, Genus) %>%
mutate(median=median(Abundance))
# select group mean > 1
keep <- unique(ps.melt$Genus[ps.melt$median > 2.5])
ps.melt$Genus[!(ps.melt$Genus %in% keep)] <- "< 2.5%"
#to get the same rows together
ps.melt_sum <- ps.melt %>%
group_by(Sample,body.site,Genus) %>%
summarise(Abundance=sum(Abundance))
ggplot(ps.melt_sum, aes(x = Sample, y = Abundance, fill = Genus)) +
geom_bar(stat = "identity", aes(fill=Genus)) +
labs(x="", y="%") +
facet_wrap(~body.site, scales= "free_x", nrow=1) +
theme_classic() +
theme(legend.position = "right",
strip.background = element_blank(),
axis.text.x.bottom = element_text(angle = -90))
```
`< 2.5%` indicates the rare taxa in each group, with median relative abundance < 2.5%.
## Differential abundance analysis
Differential abundance analysis allows you to identify differentially abundant taxa between groups. There's many methods, here two commonly used ones are given as examples. Features are collapsed at the species level prior to the calculation.
```{r, message = FALSE}
sample_data(ps)$body.site <- as.factor(sample_data(ps)$body.site) # factorize for DESeq2
ps.taxa <- tax_glom(ps, taxrank = 'Species', NArm = FALSE)
```
Let's see what features are enriched differentially between tongue and gut samples.
### DESeq2
DESeq2 is a software designed for RNA-seq, but also used in [microbiome analysis](https://joey711.github.io/phyloseq-extensions/DESeq2.html), and the detailed use of DESeq2 can be found [here](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#variations-to-the-standard-workflow).
> You may be troubled by the "zero" issues in microbiome analysis.
> Previously, a pseudo count is suggested to avoid the errors.
> However, the pseduo count violates the negative binomial distribution assumption of DESeq2, increasing false postives. [Phyloseq discussion](https://github.com/joey711/phyloseq/issues/642https://github.com/joey711/phyloseq/issues/642#issuecomment-243211388)
> According to the recent update, `estimateSizeFactors` introduces alternate estimators (`type="iterate"` and `type="poscounts"`),
> which can be used even when all genes contain a sample with a zero ([Biostar Discussion 1](https://support.bioconductor.org/p/63229/),
> [Biostar Discussion 2](https://support.bioconductor.org/p/100201/)).
> For DESeq2, it's suggested to use alternative estimators for the size factors rather than use tranformed counts.
> Meanwhile, filtering sparse features (zeros in almost every observation) also benefits DESeq2 modelling.
> It's highly suggested to go through the official [DESeq2 manual](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) if the experiment desgin is complex.
>
> Below we just filter features with more than 90% of zeros, for a quick demo.
```{r, message = FALSE}
# pairwise comparison between gut and tongue
ps.taxa.sub <- subset_samples(ps.taxa, body.site %in% c("gut", "tongue"))
# filter sparse features, with > 90% zeros
ps.taxa.pse.sub <- prune_taxa(rowSums(otu_table(ps.taxa.sub) == 0) < ncol(otu_table(ps.taxa.sub)) * 0.9, ps.taxa.sub)
ps_ds = phyloseq_to_deseq2(ps.taxa.pse.sub, ~ body.site)
# use alternative estimator on a condition of "every gene contains a sample with a zero"
ds <- estimateSizeFactors(ps_ds, type="poscounts")
ds = DESeq(ds, test="Wald", fitType="parametric")
alpha = 0.05
res = results(ds, alpha=alpha)
res = res[order(res$padj, na.last=NA), ]
taxa_sig = rownames(res[1:20, ]) # select bottom 20 with lowest p.adj values
ps.taxa.rel <- transform_sample_counts(ps, function(x) x/sum(x)*100)
ps.taxa.rel.sig <- prune_taxa(taxa_sig, ps.taxa.rel)
# Only keep gut and tongue samples
ps.taxa.rel.sig <- prune_samples(colnames(otu_table(ps.taxa.pse.sub)), ps.taxa.rel.sig)
```
Visualize these biomarkers in heatmap using R package [ComplexHeatmap](https://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html), which is an improved package on the basis of pheatmap. A detailed reference book about ComplexHeatmap is [here](https://jokergoo.github.io/ComplexHeatmap-reference/book/). I like it because of its flexibility relative to built-in heatmap(), gplots::heatmap.2(), pheatmap::pheatmap() and heatmap3::heatmap(). However, the mentioned ones can also be adopted given the same theory of heatmap. The added function [ComplexHeatmap::pheatmap](https://jokergoo.github.io/ComplexHeatmap-reference/book/integrate-with-other-packages.html#pheatmap) inherited most settings of pheatmap, so I adopt it here to make code easily understood.
To make `pheatmap` specific, it is written as `ComplexHeatmap::pheatmap`
```{r, deseq2}
matrix <- as.matrix(data.frame(otu_table(ps.taxa.rel.sig)))
rownames(matrix) <- as.character(tax_table(ps.taxa.rel.sig)[, "Species"])
metadata_sub <- data.frame(sample_data(ps.taxa.rel.sig))
# Define the annotation color for columns and rows
annotation_col = data.frame(
Subject = as.factor(metadata_sub$subject),
`Body site` = as.factor(metadata_sub$body.site),
check.names = FALSE
)
rownames(annotation_col) = rownames(metadata_sub)
annotation_row = data.frame(
Phylum = as.factor(tax_table(ps.taxa.rel.sig)[, "Phylum"])
)
rownames(annotation_row) = rownames(matrix)
# ann_color should be named vectors
phylum_col = RColorBrewer::brewer.pal(length(levels(annotation_row$Phylum)), "Paired")
names(phylum_col) = levels(annotation_row$Phylum)
ann_colors = list(
Subject = c(`subject-1` = "red", `subject-2` = "blue"),
`Body site` = c(gut = "purple", tongue = "yellow"),
Phylum = phylum_col
)
ComplexHeatmap::pheatmap(matrix, scale= "row",
annotation_col = annotation_col,
annotation_row = annotation_row,
annotation_colors = ann_colors)
```
### ANCOM-BC
Here I will introduce another statistical method specially designed for Microbiome data. I like its theory behind - sequencing is a process of sampling. For a microbial community under adequate sequencing, the ratio of two specific taxa should be theoretically the "same" when calculated using either the profiled relative abundances or the absolute abundance. The theory can be found in their [published paper](https://www.nature.com/articles/s41467-020-17041-7). ANCOM is one recommended method by QIIME 2 and similar theory is also adopted by another QIIME 2 plugin- [Geneiss](https://docs.qiime2.org/2020.11/tutorials/gneiss/). ANCOM is originally written in R and hosted in NIH, but the link has expired now. The [q2-compositon](https://docs.qiime2.org/2020.11/plugins/available/composition/) plugin is one accessible version to my knowledge. In that QIIME has no official R API, I recommend using improved ANCOM version e.g. [ANCOM 2](https://github.com/FrederickHuangLin/ANCOM) or [ANCOM-BC](https://github.com/FrederickHuangLin/ANCOMBC). These two have not been included in QIIME 2 and maintained by the lab members of the original ANCOM paper. The key differences between ANCOM 2 and ANCOM are how they treat *zeros* and the introduced process of repeated data, and more can be found in the [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5682008/). Considering the structure of code, ANCOM 2 is more like a wrapped R function of ANCOM, especially relative to ANCOM-BC. ANCOM-BC inherited the methodology of pre-processing *zeros* in ANCOM but estimated the sampling bias by linear regression. Besides, it provides *p* values and according to the author, it is blazingly fast while controlling the FDR and maintaining adequate power. ANCOM-BC is maintained at [bioconductor](http://www.bioconductor.org/packages/release/bioc/html/ANCOMBC.html), and a vignette is [here](http://www.bioconductor.org/packages/release/bioc/vignettes/ANCOMBC/inst/doc/ANCOMBC.html). It's worth noting that ANCOM-BC is in development and now (Jan 2021) [only support testing for co-variates and global test](https://github.com/FrederickHuangLin/ANCOMBC/issues/5).
ANCOM-BC is adopted here to repeat the test above. Here it takes *gut* label as a reference to other *body site* groups.
```{r, message = FALSE, results = FALSE, warning = FALSE}
# pairwise comparison between gut and tongue
ps.taxa.sub <- subset_samples(ps.taxa, body.site %in% c("gut", "tongue"))
# ancombc
out <- ancombc(phyloseq = ps.taxa.sub, formula = "body.site",
p_adj_method = "holm", zero_cut = 0.90, lib_cut = 1000,
group = "body.site", struc_zero = TRUE, neg_lb = TRUE, tol = 1e-5,
max_iter = 100, conserve = TRUE, alpha = 0.05, global = TRUE)
res <- out$res
# select the bottom 20 with lowest p values
res.or_p <- rownames(res$q_val["body.sitetongue"])[base::order(res$q_val[,"body.sitetongue"])]
taxa_sig <- res.or_p[1:20]
ps.taxa.rel.sig <- prune_taxa(taxa_sig, ps.taxa.rel)
# Only keep gut and tongue samples
ps.taxa.rel.sig <- prune_samples(colnames(otu_table(ps.taxa.sub)), ps.taxa.rel.sig)
```
> Heatmap may not be a good choice to visualize ANCOM-BC results.
> `W` statistic is the suggested considering the concept of infering absolute variance by ANCOM-BC ([Github Answer](https://github.com/FrederickHuangLin/ANCOMBC/issues/10#issuecomment-731500283)).
> You can follow the official [ANCOM-BC tutorial](https://bioconductor.org/packages/devel/bioc/vignettes/ANCOMBC/inst/doc/ANCOMBC.html)。
> Here we just take a quick look at the results through heatmap.
Repeat heatmap script for the ANCOM result
```{r, ancombc}
matrix <- as.matrix(data.frame(otu_table(ps.taxa.rel.sig)))
rownames(matrix) <- as.character(tax_table(ps.taxa.rel.sig)[, "Species"])
metadata_sub <- data.frame(sample_data(ps.taxa.rel.sig))
# Define the annotation color for columns and rows
annotation_col = data.frame(
Subject = as.factor(metadata_sub$subject),
`Body site` = as.factor(metadata_sub$body.site),
check.names = FALSE
)
rownames(annotation_col) = rownames(metadata_sub)
annotation_row = data.frame(
Phylum = as.factor(tax_table(ps.taxa.rel.sig)[, "Phylum"])
)
rownames(annotation_row) = rownames(matrix)
# ann_color should be named vectors
phylum_col = RColorBrewer::brewer.pal(length(levels(annotation_row$Phylum)), "Paired")
names(phylum_col) = levels(annotation_row$Phylum)
ann_colors = list(
Subject = c(`subject-1` = "red", `subject-2` = "blue"),
`Body site` = c(`gut` = "purple", tongue = "yellow"),
Phylum = phylum_col
)
ComplexHeatmap::pheatmap(matrix,
annotation_col = annotation_col,
annotation_row = annotation_row,
annotation_colors = ann_colors)
```
The tour shall end here. In that you know how to import the microbiome data in R, you can continue exploring your data according to diverse tutorials including [phyloseq](https://joey711.github.io/phyloseq/) from U.S., [ampvis2](https://madsalbertsen.github.io/ampvis2/articles/ampvis2.html) from Denmark, and [MicrobiotaProcess](https://yulab-smu.top/MicrobiotaProcessWorkshop/articles/MicrobiotaProcessWorkshop.html) from China.
If you want a GUI software for your microbiome analysis, you may like [microbiomeanalyst](https://www.microbiomeanalyst.ca/), which was recently published on [Nature protocol](https://www.nature.com/articles/s41596-019-0264-1).