add pipeline to generate, format, filter SILVA database #29

nbokulich · 2020-06-03T21:01:03Z

inputs:

start with output of add method to automatically download data files from SILVA #28 ? or start with files from SILVA?

steps:

1. add method to automatically download data files from SILVA #28 or parse_silva_taxonomy? any way to make auto-download optional?
2. screen_seqs
3. filter_seqs_length_by_taxon
4. dereplicate (question: use one or more modes for taxonomy derep?)
5. any other processing steps?
6. evaluate with evaluate_taxonomy and cross_validate (with kfold disabled)

outputs:

trained NB classifier
rep seqs
rep taxonomy
should we add steps to build an alignment/tree? or can we use the SILVA rep tree for this?
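
To make the question about taxonomy derep modes concrete, here is a minimal Python sketch of what two candidate modes could do. It is not the RESCRIPt implementation: the function names, the `uniq`/`lca` mode labels, and the toy data are assumptions for illustration. `uniq` keeps one representative per unique (sequence, taxonomy) pair, while `lca` keeps one representative per unique sequence and assigns it the ranks shared by all of its duplicates.

```python
from collections import defaultdict


def lca(taxonomies):
    """Return the leading ranks shared by all semicolon-delimited taxonomies."""
    split = [t.split(";") for t in taxonomies]
    shared = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return ";".join(shared)


def dereplicate(seqs, taxa, mode="uniq"):
    """Hypothetical derep sketch.

    seqs: dict of seq_id -> sequence
    taxa: dict of seq_id -> taxonomy string
    mode: "uniq" groups by (sequence, taxonomy); "lca" groups by sequence only
    """
    if mode not in ("uniq", "lca"):
        raise ValueError(f"unknown mode: {mode}")
    groups = defaultdict(list)
    for seq_id, seq in seqs.items():
        key = (seq, taxa[seq_id]) if mode == "uniq" else seq
        groups[key].append(seq_id)
    rep_seqs, rep_taxa = {}, {}
    for ids in groups.values():
        rep = ids[0]  # first ID seen stands in for the whole group
        rep_seqs[rep] = seqs[rep]
        # In "uniq" mode all taxonomies in a group are identical,
        # so the LCA is just that taxonomy.
        rep_taxa[rep] = lca([taxa[i] for i in ids])
    return rep_seqs, rep_taxa


# Toy data (made up for illustration).
seqs = {"a": "ACGT", "b": "ACGT", "c": "ACGT", "d": "GGCC"}
taxa = {
    "a": "k__B;p__F;g__X",
    "b": "k__B;p__F;g__Y",
    "c": "k__B;p__F;g__X",
    "d": "k__A",
}
uniq_seqs, uniq_taxa = dereplicate(seqs, taxa, mode="uniq")
lca_seqs, lca_taxa = dereplicate(seqs, taxa, mode="lca")
```

With the toy data, `uniq` retains three records (identical sequences with conflicting genus labels are kept apart), while `lca` retains two and truncates the conflicting record to the phylum level. Running more than one mode would let the evaluation step compare the resulting classifiers.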


mikerobeson · 2020-06-03T23:04:44Z

If using NR, the dereplicate step will likely not do anything, unless the user is using the full db or making use of qiime feature-classifier extract-reads. I am not sure if we should have separate pipelines or include, as an additional flag, the option to further make amplicon region set(s). Perhaps the user would be able to provide sets of primers? The idea would be: for each primer set, the extract-reads command would be run, followed by the remaining steps, i.e. dereplicate, etc. This would make it very easy for us to make a bunch of 16S variable-region-specific classifiers.
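
A rough Python sketch of the per-primer-set idea follows. It is only a stand-in for `extract-reads`: it uses exact string matching (no degenerate bases or mismatches), trims the primers as a choice made here, and all function names, primer sequences, and reference sequences are hypothetical toy values.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse-complement an unambiguous DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def extract_amplicon(seq, fwd, rev):
    """Return the region between the primers, or None if either is missing.

    Exact matching only; a real tool would tolerate mismatches
    and IUPAC degenerate bases.
    """
    start = seq.find(fwd)
    if start == -1:
        return None
    start += len(fwd)
    end = seq.find(reverse_complement(rev), start)
    if end == -1:
        return None
    return seq[start:end]


def build_region_references(seqs, primer_sets):
    """For each named primer pair, collect the amplicons that could be extracted.

    The remaining pipeline steps (dereplicate, train, evaluate) would then
    run once per entry in the returned dict.
    """
    refs = {}
    for name, (fwd, rev) in primer_sets.items():
        region = {}
        for seq_id, seq in seqs.items():
            amplicon = extract_amplicon(seq, fwd, rev)
            if amplicon:
                region[seq_id] = amplicon
        refs[name] = region
    return refs


# Toy inputs (made up; not real 16S primers).
toy_seqs = {"s1": "TTACGTAAAAGCCCTT", "s2": "TTTTTTTT"}
toy_primers = {"toy_region": ("ACGT", "GGGC")}
toy_refs = build_region_references(toy_seqs, toy_primers)
```

Because each primer set yields its own reference, sequences that were distinct at full length can become identical after extraction, which is where the dereplicate step starts to matter even for NR.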

mikerobeson · 2020-06-03T23:16:21Z

About the SILVA rep tree. I am thinking we can simply filter the curated SILVA alignment based on the seqIDs that survive the processing through rescript. Then we can run q2-phylogeny on that alignment, after masking. I guess another route would be to manually extract the tree from ARB and prune it based on the seqIDs we have remaining, just to be consistent. My preference would be for making the tree ourselves, just so that it is automated.
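
As a minimal sketch of the alignment-filtering idea, assuming the alignment is held as a dict of seqID to aligned sequence: keep only the surviving seqIDs, then drop columns that are gap-only among the survivors. This is not the masking step mentioned above, and the function name and toy alignment are hypothetical.

```python
GAP_CHARS = "-."


def filter_alignment(alignment, keep_ids):
    """Keep rows whose ID survived processing, then drop gap-only columns.

    alignment: dict of seq_id -> aligned sequence (all the same length)
    keep_ids: set of seq IDs remaining after the pipeline
    """
    kept = {i: s for i, s in alignment.items() if i in keep_ids}
    if not kept:
        return {}
    length = len(next(iter(kept.values())))
    # A column is retained if at least one surviving row has a residue there.
    columns = [
        j for j in range(length)
        if any(s[j] not in GAP_CHARS for s in kept.values())
    ]
    return {i: "".join(s[j] for j in columns) for i, s in kept.items()}


# Toy alignment (made up for illustration).
toy_alignment = {"a": "AC-GT", "b": "A--GT", "c": "ACTGT"}
filtered = filter_alignment(toy_alignment, {"a", "b"})
```

In the toy case, removing row `c` leaves the third column gap-only, so it is dropped and the remaining rows stay aligned to each other.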

nbokulich · 2020-06-03T23:17:48Z

I like the idea of pruning the existing alignment

nbokulich · 2020-06-03T23:19:55Z

re: derep, good point.

Out of curiosity, how does SILVA dereplicate the taxonomy in the NR database? do they do any consensus/majority rule?

this issue will be to make a Q2 pipeline. We should make another larger pipeline (using snakemake, or just a shell script) to make the full formatted release including dbs for V4 and maybe other subdomains.

mikerobeson · 2020-06-03T23:29:00Z

Not sure, specifically, how they do the consensus taxonomy. But there is, indeed, quite a bit of manual curation involved. See here.

mikerobeson · 2020-06-12T16:05:41Z

I suppose we can consider making a separate pipeline (or optional steps to insert at step 5) to generate an amplicon-region-specific reference?

nbokulich · 2020-06-12T16:11:05Z

yeah, maybe input a list of primer pairs and the pipeline could (optionally) generate amplicon-specific references for each

nbokulich mentioned this issue Jun 4, 2020

allow handling of compressed SILVA files. #31

Closed
