add pipeline to generate, format, filter SILVA database #29

Open
nbokulich opened this issue Jun 3, 2020 · 7 comments

@nbokulich
Collaborator

inputs:

  1. start with the output of #28 (add method to automatically download data files from SILVA)? or start with files from SILVA?

steps:

  1. #28 (add method to automatically download data files from SILVA) or parse_silva_taxonomy? any way to make auto-download optional?
  2. screen_seqs
  3. filter_seqs_length_by_taxon
  4. dereplicate (question: use one or more modes for taxonomy derep?)
  5. any other processing steps?
  6. evaluate with evaluate_taxonomy and cross_validate (with kfold disabled)

outputs:

  1. trained NB classifier
  2. rep seqs
  3. rep taxonomy
  4. should we add steps to build an alignment/tree? or can we use the SILVA rep tree for this?
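
To make the proposed ordering of steps 2 to 4 concrete, here is a minimal pure-Python sketch. It borrows the step names from the list above, but the function bodies, parameters (`max_ambiguous`, `min_lens`), and filtering criteria are toy assumptions for illustration, not the plugin's actual implementation.

```python
from functools import partial
from typing import Callable, Dict, List, Tuple

# A reference record: (sequence, semicolon-delimited taxonomy string).
Record = Tuple[str, str]
RefDB = Dict[str, Record]


def screen_seqs(db: RefDB, max_ambiguous: int = 5) -> RefDB:
    """Toy criterion: drop sequences with max_ambiguous or more non-ACGT bases."""
    return {
        sid: (seq, tax)
        for sid, (seq, tax) in db.items()
        if sum(base not in "ACGT" for base in seq) < max_ambiguous
    }


def filter_seqs_length_by_taxon(db: RefDB, min_lens: Dict[str, int]) -> RefDB:
    """Toy criterion: apply the minimum length of the first label found in the taxonomy."""
    kept = {}
    for sid, (seq, tax) in db.items():
        threshold = next((n for label, n in min_lens.items() if label in tax), 0)
        if len(seq) >= threshold:
            kept[sid] = (seq, tax)
    return kept


def dereplicate(db: RefDB) -> RefDB:
    """Keep the first record seen for each unique (sequence, taxonomy) pair."""
    seen = set()
    kept = {}
    for sid, rec in db.items():
        if rec not in seen:
            seen.add(rec)
            kept[sid] = rec
    return kept


def run_pipeline(db: RefDB, steps: List[Callable[[RefDB], RefDB]]) -> RefDB:
    """Apply each step in order; optional steps are simply left out of the list."""
    for step in steps:
        db = step(db)
    return db


toy_db = {
    "s1": ("ACGTACGTAC", "d__Bacteria;p__Firmicutes"),
    "s2": ("ACGTACGTAC", "d__Bacteria;p__Firmicutes"),      # duplicate of s1
    "s3": ("ACGNNNNNNN", "d__Bacteria;p__Proteobacteria"),  # 7 ambiguous bases
    "s4": ("ACGT", "d__Archaea;p__Euryarchaeota"),          # too short
}

result = run_pipeline(toy_db, [
    partial(screen_seqs, max_ambiguous=5),
    partial(filter_seqs_length_by_taxon, min_lens={"Archaea": 8, "Bacteria": 8}),
    dereplicate,
])
print(sorted(result))
```

Passing the steps as a list keeps optional stages (such as an auto-download step) easy to switch on or off without changing the driver.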
@mikerobeson
Collaborator

mikerobeson commented Jun 3, 2020

If using NR, the dereplicate step will likely not do anything, unless the user is working with the full db or making use of qiime feature-classifier extract-reads. I am not sure if we should have separate pipelines or include, as an additional flag, the option to also generate amplicon-region set(s). Perhaps the user could provide sets of primers? The idea would be: for each primer set, run the extract-reads command, followed by the remaining steps, i.e. dereplicate, etc. This would make it very easy for us to make a bunch of 16S variable-region-specific classifiers.
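
A rough sketch of the per-primer-set idea, in plain Python. The primer sequences, record IDs, and helper names here are hypothetical, and primer matching is exact for simplicity (no mismatches or degenerate bases), so this only illustrates the control flow and why dereplication becomes useful again once reads are trimmed to an amplicon region.

```python
from typing import Dict, Optional, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def extract_region(seq: str, fwd: str, rev: str) -> Optional[str]:
    """Return the region between the primers, or None if either primer is not found."""
    start = seq.find(fwd)
    if start == -1:
        return None
    start += len(fwd)
    # The reverse primer anneals to the opposite strand, so search for its reverse complement.
    end = seq.find(reverse_complement(rev), start)
    if end == -1:
        return None
    return seq[start:end]


def dereplicate(seqs: Dict[str, str]) -> Dict[str, str]:
    """Keep the first ID seen for each distinct sequence."""
    first_seen: Dict[str, str] = {}
    for sid, seq in seqs.items():
        first_seen.setdefault(seq, sid)
    return {sid: seq for seq, sid in first_seen.items()}


def build_amplicon_refs(
    ref: Dict[str, str], primer_sets: Dict[str, Tuple[str, str]]
) -> Dict[str, Dict[str, str]]:
    """For each named primer pair, extract the amplicon region, then dereplicate."""
    out = {}
    for name, (fwd, rev) in primer_sets.items():
        amplicons = {}
        for sid, seq in ref.items():
            region = extract_region(seq, fwd, rev)
            if region:
                amplicons[sid] = region
        out[name] = dereplicate(amplicons)
    return out


ref = {
    "r1": "TTGGCCACGTACAACCGG",
    "r2": "CAGGCCACGTACAACCTT",  # differs from r1 only outside the amplicon
    "r3": "TTGGCCTTTTTTAACCGG",
    "r4": "TTTTTTTTTT",          # no primer sites
}
refs = build_amplicon_refs(ref, {"toy": ("GGCC", "GGTT")})
print(refs["toy"])
```

Here r1 and r2 are distinct full-length sequences but collapse to one amplicon after extraction, which is the case where dereplicating an NR database would start to matter.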

@mikerobeson
Collaborator

mikerobeson commented Jun 3, 2020

About the SILVA rep tree: I am thinking we can simply filter the curated SILVA alignment based on the seqIDs that survive processing through rescript. Then we can run q2-phylogeny on that alignment, after masking. I guess another route would be to manually extract the tree from ARB and prune it based on the seqIDs we have remaining, just to be consistent. My preference would be for making the tree ourselves, just so that it is automated.
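
The alignment-filtering step could look roughly like the sketch below. It assumes the alignment is available as aligned FASTA and that the sequence ID is the first whitespace-delimited token of each header; both are assumptions, and the function names are made up for illustration. Masking and tree building are left to q2-phylogeny.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_alignment(lines: Iterable[str], keep_ids: Set[str]) -> Dict[str, str]:
    """Keep only aligned records whose ID survived upstream processing."""
    kept = {}
    for header, aligned in parse_fasta(lines):
        seq_id = header.split()[0]
        if seq_id in keep_ids:
            kept[seq_id] = aligned
    return kept


alignment = """>s1 Bacteria
AC-GT
>s2 Bacteria
ACCGT
>s3 Archaea
A--GT
""".splitlines()

survivors = {"s1", "s3"}
filtered = filter_alignment(alignment, survivors)
print(filtered)
```

Because rows are only dropped and never realigned, the surviving records keep the same alignment length, so the result can go straight to masking.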

@nbokulich
Collaborator Author

I like the idea of pruning the existing alignment

@nbokulich
Collaborator Author

re: derep, good point.

Out of curiosity, how does SILVA dereplicate the taxonomy in the NR database? do they do any consensus/majority rule?
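
For reference, the two options named in the question can be sketched as follows. This only illustrates what "consensus" (here, a lowest-common-ancestor prefix) and "majority rule" would mean for a group of identical sequences; it is not a description of how SILVA actually builds NR, and the labels are invented.

```python
from collections import Counter
from typing import List


def lca_consensus(taxonomies: List[str]) -> str:
    """Return the longest prefix of ranks shared by every label."""
    split = [t.split(";") for t in taxonomies]
    shared = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return ";".join(shared)


def majority_consensus(taxonomies: List[str]) -> str:
    """Return the most common full label (ties go to the label seen first)."""
    return Counter(taxonomies).most_common(1)[0][0]


labels = [
    "Bacteria;Firmicutes;Bacilli",
    "Bacteria;Firmicutes;Bacilli",
    "Bacteria;Firmicutes;Clostridia",
]
lca = lca_consensus(labels)
majority = majority_consensus(labels)
print(lca)
print(majority)
```

The LCA route is conservative and loses depth when labels disagree, while majority rule keeps depth at the risk of propagating a mislabel.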

this issue will be to make a Q2 pipeline. We should make another larger pipeline (using snakemake, or just a shell script) to make the full formatted release including dbs for V4 and maybe other subdomains.

@mikerobeson
Collaborator

Not sure, specifically, how they do the consensus taxonomy, but there is indeed quite a bit of manual curation involved. See here.

@mikerobeson
Collaborator

I suppose we can consider making a separate pipeline (or optional steps to insert at step 5) to generate an amplicon-region-specific reference?

@nbokulich
Collaborator Author

yeah, maybe input a list of primer pairs and the pipeline could (optionally) generate amplicon-specific references for each
