Commit

Merge pull request #4 from PacificBiosciences/develop
Merge v0.3 from develop
proteinosome authored Aug 4, 2022
2 parents 030f1bc + dab848b commit caead89
Showing 7 changed files with 96 additions and 49 deletions.
46 changes: 25 additions & 21 deletions README.md
@@ -118,6 +118,7 @@ nextflow run main.nf --help
removal) (default: trim with cutadapt)
--colorby Columns in metadata TSV file to use for coloring the MDS plot
in HTML report (default: condition)
+--run_picrust2 Run PICRUSt2 pipeline (default: false)
--download_db Download databases needed for taxonomy classification only. Will not
run the pipeline. Databases will be downloaded to a folder "databases"
in the Nextflow pipeline directory.
@@ -247,15 +248,15 @@ primers removal?
The "besttax" assignment uses the `assignTaxonomy` Naive-Bayes classifier function from DADA2
to carry out taxonomy assignment. It uses 3 databases to classify the ASVs
(requiring a minimum bootstrap of 80 using the minBoot parameter) and the priority of assignment
-is Silva 138, followed by GTDB r202, then lastly RefSeq + RDP. This means, for example,
-if an ASV is not assigned at Species level using Silva, it will check whether it can be assigned
-with GTDB. This ensures we assign as many ASVs as possible.
+is GTDB r207, followed by Silva v138, then lastly RefSeq + RDP. This means, for example,
+if an ASV is not assigned at Species level using GTDB, it will check whether it can be assigned
+with Silva. This ensures we assign as many ASVs as possible.
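The fallback order described above can be sketched with a toy example. Everything here is made up for illustration (ASV IDs, species names, and the TSV layout); the pipeline itself does this in R with DADA2's `assignTaxonomy`, not with `awk`:

```shell
# Toy tables: one species call per ASV from each database ("NA" = unassigned).
printf 'asv1\tEscherichia coli\nasv2\tNA\nasv3\tNA\n' > gtdb.tsv
printf 'asv1\tNA\nasv2\tBacillus subtilis\nasv3\tNA\n' > silva.tsv
printf 'asv1\tNA\nasv2\tNA\nasv3\tStaphylococcus aureus\n' > refseq.tsv

# Take GTDB first, fall back to Silva, then RefSeq + RDP, and record
# which database supplied each name.
paste gtdb.tsv silva.tsv refseq.tsv | awk -F'\t' -v OFS='\t' '{
    sp = $2; db = "GTDB r207"
    if (sp == "NA") { sp = $4; db = "Silva v138" }
    if (sp == "NA") { sp = $6; db = "RefSeq + RDP" }
    print $1, sp, db
}' > best_taxonomy_toy.tsv
cat best_taxonomy_toy.tsv
```

Each ASV ends up with the first non-missing assignment in priority order, which is the behaviour the paragraph above describes.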

This process is done first at Species level, then at Genus level. In addition, if any ASV
is assigned as "uncultured" or "metagenome", it will go through the iterative assignment
process just like the unclassified ASVs. Note that while this method will assign
a high amount of ASVs, there may be issues such as how the taxonomy is annotated
in different databases.
is assigned as "uncultured" or "metagenome" (many entries like this in Silva),
it will go through the iterative assignment process just like the unclassified ASVs.
Note that while this method will assign a high amount of ASVs, there may be issues
such as how the taxonomy is annotated in different databases.

As such, there is also a VSEARCH taxonomy classification using only the GTDB database (r207), in the file
`results/vsearch_merged_freq_tax.tsv`, that may provide a more consistent annotation. This uses
@@ -285,7 +286,8 @@ primers removal?
* Can I download the databases for taxonomic classification manually?

The taxonomy classification step of the pipeline requires a few databases that will be downloaded with the
-`--download_db` parameter into a "databases" folder. These databases can also be downloaded
+`--download_db` parameter into a "databases" folder. All the databases are also
+collected on [Zenodo](https://zenodo.org/record/6912512). These databases can also be downloaded
manually from the following links if the download script above does not work. The GTDB database
for VSEARCH will require some processing using the `QIIME 2` package. See `scripts/download_db.sh` for details.
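For manual downloads, the direct links that appear in this commit's `download_db` process can be collected into a file and fed to `wget`. The URLs below are copied verbatim from the diff; the `wget -i` line is shown as a comment and not run here:

```shell
# Database URLs copied verbatim from the download_db process in this commit.
cat > db_urls.txt <<'EOF'
https://zenodo.org/record/4587955/files/silva_nr99_v138.1_wSpecies_train_set.fa.gz?download=1
https://zenodo.org/record/6655692/files/GTDB_bac120_arc53_ssu_r207_fullTaxo.fa.gz?download=1
https://zenodo.org/record/4735821/files/RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz?download=1
https://zenodo.org/record/6912512/files/GTDB_ssu_all_r207.qza?download=1
https://zenodo.org/record/6912512/files/GTDB_ssu_all_r207.taxonomy.qza?download=1
EOF
# To fetch them all (not executed here):
#   wget -N --content-disposition -i db_urls.txt
wc -l < db_urls.txt
```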

@@ -311,33 +313,35 @@ primers removal?

## References
### QIIME 2
-1. Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol 37, 852–857 (2019).
+* Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol 37, 852–857 (2019).
* For individual citations of plugins, you can use the `--citations` command for the relevant plugins. For example, if you want
to cite `VSEARCH` plugin, type `qiime feature-classifier classify-consensus-vsearch --citations` after activating the conda
environment. You can activate the environment installed by the pipeline by typing `conda activate $HOME/nf_conda/$ENV` (change
`$ENV` to the name of the environment you want to activate).
### DADA2
-2. Callahan, B. J. et al. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods 13, 581–583 (2016).
+* Callahan, B. J. et al. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods 13, 581–583 (2016).
### Seqkit
-3. Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLOS ONE 11, e0163962 (2016).
+* Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLOS ONE 11, e0163962 (2016).
### Cutadapt
-4. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12 (2011).
+* Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12 (2011).
### GTDB database
-5. Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol 36, 996–1004 (2018).
-6. Parks, D. H. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat Biotechnol 38, 1079–1086 (2020).
+* Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol 36, 996–1004 (2018).
+* Parks, D. H. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat Biotechnol 38, 1079–1086 (2020).
### SILVA database
-7. Yilmaz, P. et al. The SILVA and “All-species Living Tree Project (LTP)” taxonomic frameworks. Nucleic Acids Research 42, D643–D648 (2014).
-8. Quast, C. et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research 41, D590–D596 (2013).
+* Yilmaz, P. et al. The SILVA and “All-species Living Tree Project (LTP)” taxonomic frameworks. Nucleic Acids Research 42, D643–D648 (2014).
+* Quast, C. et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research 41, D590–D596 (2013).
### RDP database
-9. Cole, J. R. et al. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Res 42, D633–D642 (2014).
+* Cole, J. R. et al. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Res 42, D633–D642 (2014).
### RefSeq database
-10. O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44, D733–D745 (2016).
+* O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44, D733–D745 (2016).
### Krona plot
-11. Ondov, B. D., Bergman, N. H. & Phillippy, A. M. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics 12, 385 (2011).
+* Ondov, B. D., Bergman, N. H. & Phillippy, A. M. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics 12, 385 (2011).
* We use the QIIME 2 plugin implementation here: https://github.com/kaanb93/q2-krona
### Phyloseq and Tidyverse (for HTML report visualization)
-12. McMurdie, P. J. & Holmes, S. phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLOS ONE 8, e61217 (2013).
-13. Wickham, H. et al. Welcome to the Tidyverse. Journal of Open Source Software 4, 1686 (2019).
+* McMurdie, P. J. & Holmes, S. phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLOS ONE 8, e61217 (2013).
+* Wickham, H. et al. Welcome to the Tidyverse. Journal of Open Source Software 4, 1686 (2019).
+### PICRUSt2
+* Douglas, G. M. et al. PICRUSt2 for prediction of metagenome functions. Nat Biotechnol 38, 685–688 (2020).

## DISCLAIMER
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA,
1 change: 1 addition & 0 deletions container_def/pb-16s-pbtools.yml
@@ -8,3 +8,4 @@ dependencies:
- csvtk=0.23.0
- kraken2=2.1.2
- bracken=2.6.2
+- picrust2=2.5.0
1 change: 1 addition & 0 deletions container_def/pb-16s-pbtools_docker/pb-16s-pbtools.yml
@@ -8,3 +8,4 @@ dependencies:
- csvtk=0.23.0
- kraken2=2.1.2
- bracken=2.6.2
+- picrust2=2.5.0
1 change: 1 addition & 0 deletions env/pb-16s-pbtools.yml
@@ -8,3 +8,4 @@ dependencies:
- csvtk=0.23.0
- kraken2=2.1.2
- bracken=2.6.2
+- picrust2=2.5.0
51 changes: 41 additions & 10 deletions main.nf
@@ -70,6 +70,8 @@ def helpMessage() {
removal) (default: trim with cutadapt)
--colorby Columns in metadata TSV file to use for coloring the MDS plot
in HTML report (default: condition)
+--run_picrust2 Run PICRUSt2 pipeline. Note that pathway inference with 16S using PICRUSt2
+    has not been tested systematically (default: false)
--download_db Download databases needed for taxonomy classification only. Will not
run the pipeline. Databases will be downloaded to a folder "databases"
in the Nextflow pipeline directory.
@@ -81,10 +83,11 @@
params.help = false
if (params.help) exit 0, helpMessage()
params.version = false
-version = "0.2"
+version = "0.3"
if (params.version) exit 0, log.info("$version")
params.download_db = false
params.skip_primer_trim = false
+params.run_picrust2 = false
params.filterQ = 20
params.min_len = 1000
params.max_len = 1600
@@ -127,7 +130,7 @@ params.vsearch_db = "$projectDir/databases/GTDB_ssu_all_r207.qza"
params.vsearch_tax = "$projectDir/databases/GTDB_ssu_all_r207.taxonomy.qza"
params.silva_db = "$projectDir/databases/silva_nr99_v138.1_wSpecies_train_set.fa.gz"
params.refseq_db = "$projectDir/databases/RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz"
-params.gtdb_db = "$projectDir/databases/GTDB_bac120_arc122_ssu_r202_fullTaxo.fa.gz"
+params.gtdb_db = "$projectDir/databases/GTDB_bac120_arc53_ssu_r207_fullTaxo.fa.gz"
params.maxreject = 100
params.maxaccept = 100
params.rarefaction_depth = null
@@ -389,6 +392,7 @@ process dada2_denoise {
path "dada2-ccs_table.qza", emit: asv_freq
path "dada2-ccs_stats.qza", emit:asv_stats
path "seqtab_nochim.rds", emit: dada2_rds
+path "plot_error_model.pdf"

script:
"""
@@ -781,6 +785,28 @@ process export_biom {
"""
}

+process picrust2 {
+    conda (params.enable_conda ? "$projectDir/env/pb-16s-pbtools.yml" : null)
+    container "kpinpb/pb-16s-nf-tools:latest"
+    label 'cpu32'
+    publishDir "$params.outdir/results"
+
+    input:
+    path dada2_asv
+    path vsearch_biom
+
+    output:
+    path "picrust2"
+
+    script:
+    """
+    picrust2_pipeline.py -s $dada2_asv -i $vsearch_biom \
+        -o picrust2 \
+        --in_traits EC,KO \
+        --verbose -p $task.cpus
+    """
+}

process barplot {
conda (params.enable_conda ? "$projectDir/env/qiime2-2022.2-py38-linux-conda.yml" : null)
container "kpinpb/pb-16s-nf-qiime:latest"
@@ -935,7 +961,7 @@ process download_db {
echo "Downloading SILVA sequences for VSEARCH..."
wget -N --content-disposition 'https://zenodo.org/record/4587955/files/silva_nr99_v138.1_wSpecies_train_set.fa.gz?download=1'
echo "Downloading GTDB sequences for VSEARCH..."
-wget -N --content-disposition 'https://zenodo.org/record/4735821/files/GTDB_bac120_arc122_ssu_r202_fullTaxo.fa.gz?download=1'
+wget -N --content-disposition 'https://zenodo.org/record/6655692/files/GTDB_bac120_arc53_ssu_r207_fullTaxo.fa.gz?download=1'
echo "Downloading RefSeq + RDP sequences for VSEARCH..."
wget -N --content-disposition 'https://zenodo.org/record/4735821/files/RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz?download=1'
@@ -944,13 +970,15 @@ process download_db {
wget -N --content-disposition 'https://data.qiime2.org/2022.2/common/silva-138-99-tax.qza'
echo "Downloading GTDB database"
-wget -N --content-disposition --no-check-certificate 'https://data.gtdb.ecogenomic.org/releases/release207/207.0/genomic_files_all/ssu_all_r207.tar.gz'
-tar xfz ssu_all_r207.tar.gz
-mv ssu_all_r207.fna GTDB_ssu_all_r207.fna
-grep "^>" GTDB_ssu_all_r207.fna | awk '{print gensub(/^>(.*)/, "\\\\1", "g", \$1),gensub(/^>.*\\ (d__.*)\\ \\[.*\\[.*\\[.*/, "\\\\1", "g", \$0)}' OFS=\$'\\t' > GTDB_ssu_all_r207.taxonomy.tsv
-qiime tools import --type 'FeatureData[Sequence]' --input-path GTDB_ssu_all_r207.fna --output-path GTDB_ssu_all_r207.qza
-qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat \
-    --input-path GTDB_ssu_all_r207.taxonomy.tsv --output-path GTDB_ssu_all_r207.taxonomy.qza
+wget -N --content-disposition --no-check-certificate 'https://zenodo.org/record/6912512/files/GTDB_ssu_all_r207.taxonomy.qza?download=1'
+wget -N --content-disposition --no-check-certificate 'https://zenodo.org/record/6912512/files/GTDB_ssu_all_r207.qza?download=1'
+# wget -N --content-disposition --no-check-certificate 'https://data.gtdb.ecogenomic.org/releases/release207/207.0/genomic_files_all/ssu_all_r207.tar.gz'
+# tar xfz ssu_all_r207.tar.gz
+# mv ssu_all_r207.fna GTDB_ssu_all_r207.fna
+# grep "^>" GTDB_ssu_all_r207.fna | awk '{print gensub(/^>(.*)/, "\\\\1", "g", \$1),gensub(/^>.*\\ (d__.*)\\ \\[.*\\[.*\\[.*/, "\\\\1", "g", \$0)}' OFS=\$'\\t' > GTDB_ssu_all_r207.taxonomy.tsv
+# qiime tools import --type 'FeatureData[Sequence]' --input-path GTDB_ssu_all_r207.fna --output-path GTDB_ssu_all_r207.qza
+# qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat \
+#     --input-path GTDB_ssu_all_r207.taxonomy.tsv --output-path GTDB_ssu_all_r207.taxonomy.qza
"""
}
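The commented-out `grep`/`awk` pipeline above (its backslashes doubled for Nextflow's script block) turned GTDB FASTA headers into a two-column taxonomy TSV. A plain-shell sketch of the same transformation on a fabricated record — the exact GTDB header layout is an assumption here:

```shell
# Fabricated GTDB-style record for illustration only.
cat > toy.fna <<'EOF'
>RS_GCF_000657795.2~NZ_JHVW01000031.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli [location=chromosome] [ssu_len=1537] [contig_len=44701]
ACGTACGT
EOF

# Keep the sequence ID and the d__... taxonomy string, drop the [...] tags.
grep '^>' toy.fna | awk '{
    id = substr($1, 2)              # strip the leading ">"
    tax = $0
    sub(/^>[^ ]+ /, "", tax)        # drop ">id " prefix
    sub(/ \[.*/, "", tax)           # drop trailing "[...]" annotations
    print id "\t" tax
}' > toy_taxonomy.tsv
cat toy_taxonomy.tsv
```

The resulting TSV is what `qiime tools import --type 'FeatureData[Taxonomy]'` consumes in the commented-out step.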

@@ -1000,6 +1028,9 @@ workflow pb16S {
dada2_assignTax(filter_dada2.out.asv_seq_fasta, filter_dada2.out.asv_seq, filter_dada2.out.asv_freq,
params.silva_db, params.gtdb_db, params.refseq_db, params.dadaAssign_script)
export_biom(filter_dada2.out.asv_freq, dada2_assignTax.out.best_nb_tax, class_tax.out.tax_tsv)
+if (params.run_picrust2){
+    picrust2(filter_dada2.out.asv_seq_fasta, export_biom.out.biom_vsearch)
+}
barplot(filter_dada2.out.asv_freq, dada2_assignTax.out.best_nb_tax_qza, class_tax.out.tax_vsearch, metadata_file)
if (params.skip_primer_trim){
html_rep_skip_cutadapt(dada2_assignTax.out.best_nb_tax_tsv, metadata_file, qiime2_manifest,
36 changes: 18 additions & 18 deletions scripts/dada2_assign_tax.R
@@ -32,8 +32,10 @@ gtdb_spec <- as.data.frame(gtdb_spec)
colnames(gtdb_spec)[grepl("tax.", colnames(gtdb_spec))] <-
gsub("tax\\.(.*)", "\\1", colnames(gtdb_spec)[grepl("tax.", colnames(gtdb_spec))])
# Rename species
-gtdb_spec[, 'Species'] <- gsub("(.*)\\ (.*)\\(.*\\)", "\\2", gtdb_spec[, 'Species'])
-buf <- data.frame("Assignment" = rep("GTDB r202", nrow(gtdb_spec)))
+gtdb_spec[, 'Species'] <- gsub("(.*)\\(.*\\)", "\\1", gtdb_spec[, 'Species'])
+# Replace GTDB underscore with space
+gtdb_spec[, 'Species'] <- gsub("_", "\\ ", gtdb_spec[, 'Species'])
+buf <- data.frame("Assignment" = rep("GTDB r207", nrow(gtdb_spec)))
gtdb_spec <- cbind(gtdb_spec, buf)
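The two `gsub()` calls above clean up GTDB species labels. The same effect can be mimicked in plain shell to see what they do (the input string is made up):

```shell
# Drop the trailing "(accession)" and turn underscores into spaces,
# mirroring the two gsub() calls, on a fabricated GTDB species string.
echo 'Escherichia_coli(RS_GCF_000005845.2)' | sed -E 's/\(.*\)$//; s/_/ /g'
```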
refseq_spec <- assignTaxonomy(seqs, refFasta = refseq_db, minBoot = minBoot_num,
multithread = threads, outputBootstraps = TRUE)
@@ -46,37 +48,35 @@ buf <- data.frame("Assignment" = rep("RefSeq + RDP", nrow(refseq_spec)))
refseq_spec <- cbind(refseq_spec, buf)

# Iteratively join at species level
-final_spec <- silva_spec
-final_spec[is.na(final_spec[, 'Species']), ] <- gtdb_spec[is.na(final_spec[, 'Species']), ]
+final_spec <- gtdb_spec
+final_spec[is.na(final_spec[, 'Species']), ] <- silva_spec[is.na(final_spec[, 'Species']), ]
final_spec[is.na(final_spec[, 'Species']), ] <- refseq_spec[is.na(final_spec[, 'Species']), ]
# For those that have been replaced by either Silva or RefSeq, paste the species name
class_nonGTDB <- final_spec[is.na(gtdb_spec[, 'Species']), ]
class_nonGTDB <- class_nonGTDB[!is.na(class_nonGTDB[, 'Species']),]
final_spec[rownames(class_nonGTDB), 'Species'] <-
paste(final_spec[rownames(class_nonGTDB), 'Genus', drop=TRUE],
final_spec[rownames(class_nonGTDB), 'Species', drop=TRUE], sep=" ")

# Iteratively join at genus level
final_spec[is.na(final_spec[, 'Genus']), ] <- gtdb_spec[is.na(final_spec[, 'Genus']), ]
final_spec[is.na(final_spec[, 'Genus']), ] <- silva_spec[is.na(final_spec[, 'Genus']), ]
final_spec[is.na(final_spec[, 'Genus']), ] <- refseq_spec[is.na(final_spec[, 'Genus']), ]

# uncultured or metagenome
final_spec[grepl("uncultured", final_spec[, 'Species'], ignore.case = TRUE)] <-
gtdb_spec[grepl("uncultured", final_spec[, 'Species'], ignore.case = TRUE)]
final_spec[grepl("uncultured", final_spec[, 'Species'], ignore.case = TRUE)] <-
refseq_spec[grepl("uncultured", final_spec[, 'Species'], ignore.case = TRUE)]
final_spec[grepl("metagenome", final_spec[, 'Species'], ignore.case = TRUE)] <-
gtdb_spec[grepl("metagenome", final_spec[, 'Species'], ignore.case = TRUE)]
final_spec[grepl("metagenome", final_spec[, 'Species'], ignore.case = TRUE)] <-
refseq_spec[grepl("metagenome", final_spec[, 'Species'], ignore.case = TRUE)]

# uncultured or metagenome at genus level
final_spec[grepl("uncultured", final_spec[, 'Genus'], ignore.case = TRUE)] <-
gtdb_spec[grepl("uncultured", final_spec[, 'Genus'], ignore.case = TRUE)]
final_spec[grepl("uncultured", final_spec[, 'Genus'], ignore.case = TRUE)] <-
refseq_spec[grepl("uncultured", final_spec[, 'Genus'], ignore.case = TRUE)]
final_spec[grepl("metagenome", final_spec[, 'Genus'], ignore.case = TRUE)] <-
gtdb_spec[grepl("metagenome", final_spec[, 'Genus'], ignore.case = TRUE)]
final_spec[grepl("metagenome", final_spec[, 'Genus'], ignore.case = TRUE)] <-
refseq_spec[grepl("metagenome", final_spec[, 'Genus'], ignore.case = TRUE)]

-# If still NA, revert back to Silva
-final_spec[is.na(final_spec[, 'Species']), ] <- silva_spec[is.na(final_spec[, 'Species']), ]
-final_spec[is.na(final_spec[, 'Genus']), ] <- silva_spec[is.na(final_spec[, 'Genus']), ]
+# If still NA, revert back to GTDB
+final_spec[is.na(final_spec[, 'Species']), ] <- gtdb_spec[is.na(final_spec[, 'Species']), ]
+final_spec[is.na(final_spec[, 'Genus']), ] <- gtdb_spec[is.na(final_spec[, 'Genus']), ]
rownames(final_spec) <- otu_id

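The "uncultured"/"metagenome" rescue above treats such labels like missing values and re-queries the fallback databases. A toy shell version with made-up data (the real code does this across all three databases in R):

```shell
# Primary calls with an uninformative label get re-filled from a fallback table.
printf 'asv1\tuncultured bacterium\nasv2\tLactobacillus iners\n' > primary.tsv
printf 'asv1\tEscherichia coli\nasv2\tNA\n' > fallback.tsv

paste primary.tsv fallback.tsv | awk -F'\t' -v OFS='\t' '{
    sp = $2
    # Treat "uncultured"/"metagenome" like NA, but only replace when the
    # fallback actually has a name.
    if (sp ~ /uncultured|metagenome/ && $4 != "NA") sp = $4
    print $1, sp
}' > rescued.tsv
cat rescued.tsv
```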
# For each row, look for the deepest level assigned and assign confidence value
@@ -105,7 +105,7 @@ for (i in 1:nrow(final_spec)){
tosave_row <- data.frame(
"Feature ID" = rownames(row),
"Taxon" = paste(
-paste(c("d__", "p__", "c__", "o__", "f__", "g__", "s__"), c(row[, c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")], paste(row[, "Genus"], row[, "Species"], sep=" ")), sep = ""),
+paste(c("d__", "p__", "c__", "o__", "f__", "g__", "s__"), c(row[, c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")]), sep=""),
collapse = "; "),
"Confidence" = conf,
"Assignment Database" = row['Assignment']
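The corrected `paste()` builds a QIIME-style lineage string, one prefixed rank per level joined by "; ". The same construction in shell, with a fabricated lineage:

```shell
# Join rank prefixes to rank values with "; ", as the fixed paste() does.
prefixes='d__ p__ c__ o__ f__ g__ s__'
values='Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Escherichia|Escherichia coli'
taxon=''
i=1
for p in $prefixes; do
    v=$(printf '%s' "$values" | cut -d'|' -f"$i")
    if [ -z "$taxon" ]; then taxon="$p$v"; else taxon="$taxon; $p$v"; fi
    i=$((i + 1))
done
echo "$taxon"
```

This prints the full seven-rank string ending in `s__Escherichia coli`; the bug being fixed was that the Genus name was pasted into the Species slot a second time.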
@@ -119,4 +119,4 @@ write.table(to_save, "best_taxonomy.tsv", quote = FALSE,

write.table(to_save2, "best_taxonomy_withDB.tsv", quote = FALSE,
sep = "\t",row.names = FALSE,
col.names = c("Feature ID", "Taxon", "Confidence", "Assignment Database"))
col.names = c("Feature ID", "Taxon", "Confidence", "Assignment Database"))
9 changes: 9 additions & 0 deletions scripts/run_dada_ccs.R
@@ -125,6 +125,11 @@
#
#

+# Added by Khi Pin to save all outputs from script
+con <- file("dada2.log")
+sink(con, append=TRUE)
+sink(con, append=TRUE, type="message")

cat(R.version$version.string, "\n")
errQuit <- function(mesg, status=1) { message("Error: ", mesg); q(status=status) }
args <- commandArgs(TRUE)
@@ -229,6 +234,10 @@ cat("3) Learning Error Rates\n")
err <- suppressWarnings(learnErrors(filts, nreads=nreads.learn,
errorEstimationFunction=dada2:::PacBioErrfun,
multithread=multithread, BAND_SIZE=32))
+err_plot <- plotErrors(err)
+pdf("plot_error_model.pdf", width=12, height=8, useDingbats=FALSE)
+err_plot
+dev.off()

### PROCESS ALL SAMPLES ###
# Loop over rest in streaming fashion with learned error rates
