Commit

Merge pull request #4 from PacificBiosciences/develop
Merge v0.3 from develop
proteinosome authored Aug 4, 2022
2 parents 030f1bc + dab848b commit caead89
Showing 7 changed files with 96 additions and 49 deletions.
46 changes: 25 additions & 21 deletions README.md
@@ -118,6 +118,7 @@ nextflow run main.nf --help
removal) (default: trim with cutadapt)
--colorby Columns in metadata TSV file to use for coloring the MDS plot
in HTML report (default: condition)
+--run_picrust2 Run PICRUSt2 pipeline (default: false)
--download_db Download databases needed for taxonomy classification only. Will not
run the pipeline. Databases will be downloaded to a folder "databases"
in the Nextflow pipeline directory.
@@ -247,15 +248,15 @@ primers removal?
The "besttax" assignment uses the `assignTaxonomy` Naive-Bayes classifier function from DADA2
to carry out taxonomy assignment. It uses 3 databases to classify the ASVs
(requiring a minimum bootstrap of 80 using the minBoot parameter) and the priority of assignment
-is Silva 138, followed by GTDB r202, then lastly RefSeq + RDP. This means, for example,
-if an ASV is not assigned at Species level using Silva, it will check whether it can be assigned
-with GTDB. This ensures we assign as many ASVs as possible.
+is GTDB r207, followed by Silva v138, then lastly RefSeq + RDP. This means, for example,
+if an ASV is not assigned at Species level using GTDB, it will check whether it can be assigned
+with Silva. This ensures we assign as many ASVs as possible.
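The fallback order described above can be sketched with a toy example. Everything here is made up for illustration (ASV IDs, species names, and the TSV layout); the pipeline itself does this in R with DADA2's `assignTaxonomy`, not with `awk`:

```shell
# Toy tables: one species call per ASV from each database ("NA" = unassigned).
printf 'asv1\tEscherichia coli\nasv2\tNA\nasv3\tNA\n' > gtdb.tsv
printf 'asv1\tNA\nasv2\tBacillus subtilis\nasv3\tNA\n' > silva.tsv
printf 'asv1\tNA\nasv2\tNA\nasv3\tStaphylococcus aureus\n' > refseq.tsv

# Take GTDB first, fall back to Silva, then RefSeq + RDP, and record
# which database supplied each name.
paste gtdb.tsv silva.tsv refseq.tsv | awk -F'\t' -v OFS='\t' '{
    sp = $2; db = "GTDB r207"
    if (sp == "NA") { sp = $4; db = "Silva v138" }
    if (sp == "NA") { sp = $6; db = "RefSeq + RDP" }
    print $1, sp, db
}' > best_taxonomy_toy.tsv
cat best_taxonomy_toy.tsv
```

Each ASV ends up with the first non-missing assignment in priority order, which is the behaviour the paragraph above describes.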

This process is done first at Species level, then at Genus level. In addition, if any ASV
is assigned as "uncultured" or "metagenome", it will go through the iterative assignment
process just like the unclassified ASVs. Note that while this method will assign
a high amount of ASVs, there may be issues such as how the taxonomy is annotated
in different databases.
is assigned as "uncultured" or "metagenome" (many entries like this in Silva),
it will go through the iterative assignment process just like the unclassified ASVs.
Note that while this method will assign a high amount of ASVs, there may be issues
such as how the taxonomy is annotated in different databases.

As such, there is also a VSEARCH taxonomy classification using only the GTDB database (r207), in the file
`results/vsearch_merged_freq_tax.tsv`, that may provide a more consistent annotation. This uses
@@ -285,7 +286,8 @@ primers removal?
* Can I download the databases for taxonomic classification manually?

The taxonomy classification step of the pipeline requires a few databases that will be downloaded with the
-`--download_db` parameter into a "databases" folder. These databases can also be downloaded
+`--download_db` parameter into a "databases" folder. All the databases are also
+collected on [Zenodo](https://zenodo.org/record/6912512). These databases can also be downloaded
manually from the following links if the download script above does not work. The GTDB database
for VSEARCH will require some processing using the `QIIME 2` package. See `scripts/download_db.sh` for details.
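For manual downloads, the direct links that appear in this commit's `download_db` process can be collected into a file and fed to `wget`. The URLs below are copied verbatim from the diff; the `wget -i` line is shown as a comment and not run here:

```shell
# Database URLs copied verbatim from the download_db process in this commit.
cat > db_urls.txt <<'EOF'
https://zenodo.org/record/4587955/files/silva_nr99_v138.1_wSpecies_train_set.fa.gz?download=1
https://zenodo.org/record/6655692/files/GTDB_bac120_arc53_ssu_r207_fullTaxo.fa.gz?download=1
https://zenodo.org/record/4735821/files/RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz?download=1
https://zenodo.org/record/6912512/files/GTDB_ssu_all_r207.qza?download=1
https://zenodo.org/record/6912512/files/GTDB_ssu_all_r207.taxonomy.qza?download=1
EOF
# To fetch them all (not executed here):
#   wget -N --content-disposition -i db_urls.txt
wc -l < db_urls.txt
```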

@@ -311,33 +313,35 @@ primers removal?

## References
### QIIME 2
-1. Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol 37, 852–857 (2019).
+* Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol 37, 852–857 (2019).
* For individual citations of plugins, you can use the `--citations` command for the relevant plugins. For example, if you want
to cite `VSEARCH` plugin, type `qiime feature-classifier classify-consensus-vsearch --citations` after activating the conda
environment. You can activate the environment installed by the pipeline by typing `conda activate $HOME/nf_conda/$ENV` (change
`$ENV` to the name of the environment you want to activate).
### DADA2
-2. Callahan, B. J. et al. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods 13, 581–583 (2016).
+* Callahan, B. J. et al. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods 13, 581–583 (2016).
### Seqkit
-3. Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLOS ONE 11, e0163962 (2016).
+* Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLOS ONE 11, e0163962 (2016).
### Cutadapt
-4. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12 (2011).
+* Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12 (2011).
### GTDB database
-5. Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol 36, 996–1004 (2018).
-6. Parks, D. H. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat Biotechnol 38, 1079–1086 (2020).
+* Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol 36, 996–1004 (2018).
+* Parks, D. H. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat Biotechnol 38, 1079–1086 (2020).
### SILVA database
-7. Yilmaz, P. et al. The SILVA and “All-species Living Tree Project (LTP)” taxonomic frameworks. Nucleic Acids Research 42, D643–D648 (2014).
-8. Quast, C. et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research 41, D590–D596 (2013).
+* Yilmaz, P. et al. The SILVA and “All-species Living Tree Project (LTP)” taxonomic frameworks. Nucleic Acids Research 42, D643–D648 (2014).
+* Quast, C. et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research 41, D590–D596 (2013).
### RDP database
-9. Cole, J. R. et al. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Res 42, D633–D642 (2014).
+* Cole, J. R. et al. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Res 42, D633–D642 (2014).
### RefSeq database
-10. O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44, D733–D745 (2016).
+* O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44, D733–D745 (2016).
### Krona plot
-11. Ondov, B. D., Bergman, N. H. & Phillippy, A. M. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics 12, 385 (2011).
+* Ondov, B. D., Bergman, N. H. & Phillippy, A. M. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics 12, 385 (2011).
* We use the QIIME 2 plugin implementation here: https://github.com/kaanb93/q2-krona
### Phyloseq and Tidyverse (for HTML report visualization)
-12. McMurdie, P. J. & Holmes, S. phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLOS ONE 8, e61217 (2013).
-13. Wickham, H. et al. Welcome to the Tidyverse. Journal of Open Source Software 4, 1686 (2019).
+* McMurdie, P. J. & Holmes, S. phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLOS ONE 8, e61217 (2013).
+* Wickham, H. et al. Welcome to the Tidyverse. Journal of Open Source Software 4, 1686 (2019).
+### PICRUSt2
+* Douglas, G. M. et al. PICRUSt2 for prediction of metagenome functions. Nat Biotechnol 38, 685–688 (2020).

## DISCLAIMER
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA,
1 change: 1 addition & 0 deletions container_def/pb-16s-pbtools.yml
@@ -8,3 +8,4 @@ dependencies:
- csvtk=0.23.0
- kraken2=2.1.2
- bracken=2.6.2
+- picrust2=2.5.0
1 change: 1 addition & 0 deletions container_def/pb-16s-pbtools_docker/pb-16s-pbtools.yml
@@ -8,3 +8,4 @@ dependencies:
- csvtk=0.23.0
- kraken2=2.1.2
- bracken=2.6.2
+- picrust2=2.5.0
1 change: 1 addition & 0 deletions env/pb-16s-pbtools.yml
@@ -8,3 +8,4 @@ dependencies:
- csvtk=0.23.0
- kraken2=2.1.2
- bracken=2.6.2
+- picrust2=2.5.0
51 changes: 41 additions & 10 deletions main.nf
@@ -70,6 +70,8 @@ def helpMessage() {
removal) (default: trim with cutadapt)
--colorby Columns in metadata TSV file to use for coloring the MDS plot
in HTML report (default: condition)
+--run_picrust2 Run PICRUSt2 pipeline. Note that pathway inference with 16S using PICRUSt2
+    has not been tested systematically (default: false)
--download_db Download databases needed for taxonomy classification only. Will not
run the pipeline. Databases will be downloaded to a folder "databases"
in the Nextflow pipeline directory.
@@ -81,10 +83,11 @@
params.help = false
if (params.help) exit 0, helpMessage()
params.version = false
-version = "0.2"
+version = "0.3"
if (params.version) exit 0, log.info("$version")
params.download_db = false
params.skip_primer_trim = false
+params.run_picrust2 = false
params.filterQ = 20
params.min_len = 1000
params.max_len = 1600
@@ -127,7 +130,7 @@ params.vsearch_db = "$projectDir/databases/GTDB_ssu_all_r207.qza"
params.vsearch_tax = "$projectDir/databases/GTDB_ssu_all_r207.taxonomy.qza"
params.silva_db = "$projectDir/databases/silva_nr99_v138.1_wSpecies_train_set.fa.gz"
params.refseq_db = "$projectDir/databases/RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz"
-params.gtdb_db = "$projectDir/databases/GTDB_bac120_arc122_ssu_r202_fullTaxo.fa.gz"
+params.gtdb_db = "$projectDir/databases/GTDB_bac120_arc53_ssu_r207_fullTaxo.fa.gz"
params.maxreject = 100
params.maxaccept = 100
params.rarefaction_depth = null
@@ -389,6 +392,7 @@ process dada2_denoise {
path "dada2-ccs_table.qza", emit: asv_freq
path "dada2-ccs_stats.qza", emit:asv_stats
path "seqtab_nochim.rds", emit: dada2_rds
+path "plot_error_model.pdf"

script:
"""
@@ -781,6 +785,28 @@ process export_biom {
"""
}

+process picrust2 {
+    conda (params.enable_conda ? "$projectDir/env/pb-16s-pbtools.yml" : null)
+    container "kpinpb/pb-16s-nf-tools:latest"
+    label 'cpu32'
+    publishDir "$params.outdir/results"
+
+    input:
+    path dada2_asv
+    path vsearch_biom
+
+    output:
+    path "picrust2"
+
+    script:
+    """
+    picrust2_pipeline.py -s $dada2_asv -i $vsearch_biom \
+        -o picrust2 \
+        --in_traits EC,KO \
+        --verbose -p $task.cpus
+    """
+}

process barplot {
conda (params.enable_conda ? "$projectDir/env/qiime2-2022.2-py38-linux-conda.yml" : null)
container "kpinpb/pb-16s-nf-qiime:latest"
@@ -935,7 +961,7 @@ process download_db {
echo "Downloading SILVA sequences for VSEARCH..."
wget -N --content-disposition 'https://zenodo.org/record/4587955/files/silva_nr99_v138.1_wSpecies_train_set.fa.gz?download=1'
echo "Downloading GTDB sequences for VSEARCH..."
-wget -N --content-disposition 'https://zenodo.org/record/4735821/files/GTDB_bac120_arc122_ssu_r202_fullTaxo.fa.gz?download=1'
+wget -N --content-disposition 'https://zenodo.org/record/6655692/files/GTDB_bac120_arc53_ssu_r207_fullTaxo.fa.gz?download=1'
echo "Downloading RefSeq + RDP sequences for VSEARCH..."
wget -N --content-disposition 'https://zenodo.org/record/4735821/files/RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz?download=1'
@@ -944,13 +970,15 @@ process download_db {
wget -N --content-disposition 'https://data.qiime2.org/2022.2/common/silva-138-99-tax.qza'
echo "Downloading GTDB database"
-wget -N --content-disposition --no-check-certificate 'https://data.gtdb.ecogenomic.org/releases/release207/207.0/genomic_files_all/ssu_all_r207.tar.gz'
-tar xfz ssu_all_r207.tar.gz
-mv ssu_all_r207.fna GTDB_ssu_all_r207.fna
-grep "^>" GTDB_ssu_all_r207.fna | awk '{print gensub(/^>(.*)/, "\\\\1", "g", \$1),gensub(/^>.*\\ (d__.*)\\ \\[.*\\[.*\\[.*/, "\\\\1", "g", \$0)}' OFS=\$'\\t' > GTDB_ssu_all_r207.taxonomy.tsv
-qiime tools import --type 'FeatureData[Sequence]' --input-path GTDB_ssu_all_r207.fna --output-path GTDB_ssu_all_r207.qza
-qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat \
-    --input-path GTDB_ssu_all_r207.taxonomy.tsv --output-path GTDB_ssu_all_r207.taxonomy.qza
+wget -N --content-disposition --no-check-certificate 'https://zenodo.org/record/6912512/files/GTDB_ssu_all_r207.taxonomy.qza?download=1'
+wget -N --content-disposition --no-check-certificate 'https://zenodo.org/record/6912512/files/GTDB_ssu_all_r207.qza?download=1'
+# wget -N --content-disposition --no-check-certificate 'https://data.gtdb.ecogenomic.org/releases/release207/207.0/genomic_files_all/ssu_all_r207.tar.gz'
+# tar xfz ssu_all_r207.tar.gz
+# mv ssu_all_r207.fna GTDB_ssu_all_r207.fna
+# grep "^>" GTDB_ssu_all_r207.fna | awk '{print gensub(/^>(.*)/, "\\\\1", "g", \$1),gensub(/^>.*\\ (d__.*)\\ \\[.*\\[.*\\[.*/, "\\\\1", "g", \$0)}' OFS=\$'\\t' > GTDB_ssu_all_r207.taxonomy.tsv
+# qiime tools import --type 'FeatureData[Sequence]' --input-path GTDB_ssu_all_r207.fna --output-path GTDB_ssu_all_r207.qza
+# qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat \
+#     --input-path GTDB_ssu_all_r207.taxonomy.tsv --output-path GTDB_ssu_all_r207.taxonomy.qza
"""
}
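The commented-out `grep`/`awk` pipeline above (its backslashes doubled for Nextflow's script block) turned GTDB FASTA headers into a two-column taxonomy TSV. A plain-shell sketch of the same transformation on a fabricated record — the exact GTDB header layout is an assumption here:

```shell
# Fabricated GTDB-style record for illustration only.
cat > toy.fna <<'EOF'
>RS_GCF_000657795.2~NZ_JHVW01000031.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli [location=chromosome] [ssu_len=1537] [contig_len=44701]
ACGTACGT
EOF

# Keep the sequence ID and the d__... taxonomy string, drop the [...] tags.
grep '^>' toy.fna | awk '{
    id = substr($1, 2)              # strip the leading ">"
    tax = $0
    sub(/^>[^ ]+ /, "", tax)        # drop ">id " prefix
    sub(/ \[.*/, "", tax)           # drop trailing "[...]" annotations
    print id "\t" tax
}' > toy_taxonomy.tsv
cat toy_taxonomy.tsv
```

The resulting TSV is what `qiime tools import --type 'FeatureData[Taxonomy]'` consumes in the commented-out step.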

@@ -1000,6 +1028,9 @@ workflow pb16S {
dada2_assignTax(filter_dada2.out.asv_seq_fasta, filter_dada2.out.asv_seq, filter_dada2.out.asv_freq,
params.silva_db, params.gtdb_db, params.refseq_db, params.dadaAssign_script)
export_biom(filter_dada2.out.asv_freq, dada2_assignTax.out.best_nb_tax, class_tax.out.tax_tsv)
+if (params.run_picrust2){
+    picrust2(filter_dada2.out.asv_seq_fasta, export_biom.out.biom_vsearch)
+}
barplot(filter_dada2.out.asv_freq, dada2_assignTax.out.best_nb_tax_qza, class_tax.out.tax_vsearch, metadata_file)
if (params.skip_primer_trim){
html_rep_skip_cutadapt(dada2_assignTax.out.best_nb_tax_tsv, metadata_file, qiime2_manifest,
36 changes: 18 additions & 18 deletions scripts/dada2_assign_tax.R
@@ -32,8 +32,10 @@ gtdb_spec <- as.data.frame(gtdb_spec)
colnames(gtdb_spec)[grepl("tax.", colnames(gtdb_spec))] <-
gsub("tax\\.(.*)", "\\1", colnames(gtdb_spec)[grepl("tax.", colnames(gtdb_spec))])
# Rename species
-gtdb_spec[, 'Species'] <- gsub("(.*)\\ (.*)\\(.*\\)", "\\2", gtdb_spec[, 'Species'])
-buf <- data.frame("Assignment" = rep("GTDB r202", nrow(gtdb_spec)))
+gtdb_spec[, 'Species'] <- gsub("(.*)\\(.*\\)", "\\1", gtdb_spec[, 'Species'])
+# Replace GTDB underscore with space
+gtdb_spec[, 'Species'] <- gsub("_", "\\ ", gtdb_spec[, 'Species'])
+buf <- data.frame("Assignment" = rep("GTDB r207", nrow(gtdb_spec)))
gtdb_spec <- cbind(gtdb_spec, buf)
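The two `gsub()` calls above clean up GTDB species labels. The same effect can be mimicked in plain shell to see what they do (the input string is made up):

```shell
# Drop the trailing "(accession)" and turn underscores into spaces,
# mirroring the two gsub() calls, on a fabricated GTDB species string.
echo 'Escherichia_coli(RS_GCF_000005845.2)' | sed -E 's/\(.*\)$//; s/_/ /g'
```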
refseq_spec <- assignTaxonomy(seqs, refFasta = refseq_db, minBoot = minBoot_num,
multithread = threads, outputBootstraps = TRUE)
@@ -46,37 +48,35 @@ buf <- data.frame("Assignment" = rep("RefSeq + RDP", nrow(refseq_spec)))
refseq_spec <- cbind(refseq_spec, buf)

# Iteratively join at species level
-final_spec <- silva_spec
-final_spec[is.na(final_spec[, 'Species']), ] <- gtdb_spec[is.na(final_spec[, 'Species']), ]
+final_spec <- gtdb_spec
+final_spec[is.na(final_spec[, 'Species']), ] <- silva_spec[is.na(final_spec[, 'Species']), ]
final_spec[is.na(final_spec[, 'Species']), ] <- refseq_spec[is.na(final_spec[, 'Species']), ]
# For those that have been replaced by either Silva or RefSeq, paste the species name
class_nonGTDB <- final_spec[is.na(gtdb_spec[, 'Species']), ]
class_nonGTDB <- class_nonGTDB[!is.na(class_nonGTDB[, 'Species']),]
final_spec[rownames(class_nonGTDB), 'Species'] <-
paste(final_spec[rownames(class_nonGTDB), 'Genus', drop=TRUE],
final_spec[rownames(class_nonGTDB), 'Species', drop=TRUE], sep=" ")

# Iteratively join at genus level
final_spec[is.na(final_spec[, 'Genus']), ] <- gtdb_spec[is.na(final_spec[, 'Genus']), ]
final_spec[is.na(final_spec[, 'Genus']), ] <- silva_spec[is.na(final_spec[, 'Genus']), ]
final_spec[is.na(final_spec[, 'Genus']), ] <- refseq_spec[is.na(final_spec[, 'Genus']), ]

# uncultured or metagenome
final_spec[grepl("uncultured", final_spec[, 'Species'], ignore.case = TRUE)] <-
gtdb_spec[grepl("uncultured", final_spec[, 'Species'], ignore.case = TRUE)]
final_spec[grepl("uncultured", final_spec[, 'Species'], ignore.case = TRUE)] <-
refseq_spec[grepl("uncultured", final_spec[, 'Species'], ignore.case = TRUE)]
final_spec[grepl("metagenome", final_spec[, 'Species'], ignore.case = TRUE)] <-
gtdb_spec[grepl("metagenome", final_spec[, 'Species'], ignore.case = TRUE)]
final_spec[grepl("metagenome", final_spec[, 'Species'], ignore.case = TRUE)] <-
refseq_spec[grepl("metagenome", final_spec[, 'Species'], ignore.case = TRUE)]

# uncultured or metagenome at genus level
final_spec[grepl("uncultured", final_spec[, 'Genus'], ignore.case = TRUE)] <-
gtdb_spec[grepl("uncultured", final_spec[, 'Genus'], ignore.case = TRUE)]
final_spec[grepl("uncultured", final_spec[, 'Genus'], ignore.case = TRUE)] <-
refseq_spec[grepl("uncultured", final_spec[, 'Genus'], ignore.case = TRUE)]
final_spec[grepl("metagenome", final_spec[, 'Genus'], ignore.case = TRUE)] <-
gtdb_spec[grepl("metagenome", final_spec[, 'Genus'], ignore.case = TRUE)]
final_spec[grepl("metagenome", final_spec[, 'Genus'], ignore.case = TRUE)] <-
refseq_spec[grepl("metagenome", final_spec[, 'Genus'], ignore.case = TRUE)]

-# If still NA, revert back to Silva
-final_spec[is.na(final_spec[, 'Species']), ] <- silva_spec[is.na(final_spec[, 'Species']), ]
-final_spec[is.na(final_spec[, 'Genus']), ] <- silva_spec[is.na(final_spec[, 'Genus']), ]
+# If still NA, revert back to GTDB
+final_spec[is.na(final_spec[, 'Species']), ] <- gtdb_spec[is.na(final_spec[, 'Species']), ]
+final_spec[is.na(final_spec[, 'Genus']), ] <- gtdb_spec[is.na(final_spec[, 'Genus']), ]
rownames(final_spec) <- otu_id

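The "uncultured"/"metagenome" rescue above treats such labels like missing values and re-queries the fallback databases. A toy shell version with made-up data (the real code does this across all three databases in R):

```shell
# Primary calls with an uninformative label get re-filled from a fallback table.
printf 'asv1\tuncultured bacterium\nasv2\tLactobacillus iners\n' > primary.tsv
printf 'asv1\tEscherichia coli\nasv2\tNA\n' > fallback.tsv

paste primary.tsv fallback.tsv | awk -F'\t' -v OFS='\t' '{
    sp = $2
    # Treat "uncultured"/"metagenome" like NA, but only replace when the
    # fallback actually has a name.
    if (sp ~ /uncultured|metagenome/ && $4 != "NA") sp = $4
    print $1, sp
}' > rescued.tsv
cat rescued.tsv
```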
# For each row, look for the deepest level assigned and assign confidence value
@@ -105,7 +105,7 @@ for (i in 1:nrow(final_spec)){
tosave_row <- data.frame(
"Feature ID" = rownames(row),
"Taxon" = paste(
-paste(c("d__", "p__", "c__", "o__", "f__", "g__", "s__"), c(row[, c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")], paste(row[, "Genus"], row[, "Species"], sep=" ")), sep = ""),
+paste(c("d__", "p__", "c__", "o__", "f__", "g__", "s__"), c(row[, c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")]), sep=""),
collapse = "; "),
"Confidence" = conf,
"Assignment Database" = row['Assignment']
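The corrected `paste()` builds a QIIME-style lineage string, one prefixed rank per level joined by "; ". The same construction in shell, with a fabricated lineage:

```shell
# Join rank prefixes to rank values with "; ", as the fixed paste() does.
prefixes='d__ p__ c__ o__ f__ g__ s__'
values='Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Escherichia|Escherichia coli'
taxon=''
i=1
for p in $prefixes; do
    v=$(printf '%s' "$values" | cut -d'|' -f"$i")
    if [ -z "$taxon" ]; then taxon="$p$v"; else taxon="$taxon; $p$v"; fi
    i=$((i + 1))
done
echo "$taxon"
```

This prints the full seven-rank string ending in `s__Escherichia coli`; the bug being fixed was that the Genus name was pasted into the Species slot a second time.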
@@ -119,4 +119,4 @@ write.table(to_save, "best_taxonomy.tsv", quote = FALSE,

write.table(to_save2, "best_taxonomy_withDB.tsv", quote = FALSE,
sep = "\t",row.names = FALSE,
col.names = c("Feature ID", "Taxon", "Confidence", "Assignment Database"))
col.names = c("Feature ID", "Taxon", "Confidence", "Assignment Database"))
9 changes: 9 additions & 0 deletions scripts/run_dada_ccs.R
@@ -125,6 +125,11 @@
#
#

+# Added by Khi Pin to save all outputs from script
+con <- file("dada2.log")
+sink(con, append=TRUE)
+sink(con, append=TRUE, type="message")

cat(R.version$version.string, "\n")
errQuit <- function(mesg, status=1) { message("Error: ", mesg); q(status=status) }
args <- commandArgs(TRUE)
@@ -229,6 +234,10 @@ cat("3) Learning Error Rates\n")
err <- suppressWarnings(learnErrors(filts, nreads=nreads.learn,
errorEstimationFunction=dada2:::PacBioErrfun,
multithread=multithread, BAND_SIZE=32))
+err_plot <- plotErrors(err)
+pdf("plot_error_model.pdf", width=12, height=8, useDingbats=FALSE)
+err_plot
+dev.off()

### PROCESS ALL SAMPLES ###
# Loop over rest in streaming fashion with learned error rates
