prototype specified prior testing for q2-feature-classifier #107

Open
BenKaehler opened this issue Feb 15, 2017 · 1 comment

@BenKaehler
Collaborator

The machine learning classifiers used for taxonomic assignment usually assume that all taxonomies are equally likely (e.g., Wang et al., 2007). That assumption can be relaxed for q2-feature-classifier.

To prototype setting the prior probabilities for the classifier, copy

https://github.com/caporaso-lab/short-read-tax-assignment/blob/dev/ipynb/mock-community/generate-tax-assignments-qiime2.ipynb

to

https://github.com/caporaso-lab/short-read-tax-assignment/tree/dev/ipynb/simulated-community

then experiment with setting fit_prior to False (the uniform prior assumption) or class_prior to the appropriate probabilities, and assess the impact on classification. Those arguments can initially be set in the nb_params dictionary, but ultimately it would be good to extend method_paramaters_combinations and gen_param_sweep so that the comparison can be automated.
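The effect of the two settings can be sketched on toy data. This assumes the underlying classifier is scikit-learn's MultinomialNB, which exposes fit_prior and class_prior; the count matrix, the taxon labels, and the two parameter dictionaries are made up for illustration and are not the contents of nb_params in the notebook.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy k-mer count matrix: one reference sequence per row, one k-mer per column.
X = np.array([[3, 1],
              [1, 3]])
y = np.array(["taxon_a", "taxon_b"])

# Hypothetical parameter dictionaries in the spirit of nb_params.
uniform_params = {"alpha": 0.01, "fit_prior": False}           # uniform prior
weighted_params = {"alpha": 0.01, "class_prior": [0.05, 0.95]}  # specified prior

# class_prior follows the sorted order of the labels: taxon_a, then taxon_b.
uniform = MultinomialNB(**uniform_params).fit(X, y)
weighted = MultinomialNB(**weighted_params).fit(X, y)

# A query whose k-mer profile is equally compatible with both taxa,
# so the posterior is decided entirely by the prior.
query = np.array([[2, 2]])
print(uniform.predict_proba(query))
print(weighted.predict_proba(query))
print(weighted.predict(query))
```

Because the likelihoods tie for this query, the posterior under the uniform prior is split evenly, while the specified prior pushes the call towards taxon_b. Real reads will rarely tie exactly, but the prior has the most influence on exactly these ambiguous cases.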

The prior probabilities can be found in, for example,

https://github.com/caporaso-lab/short-read-tax-assignment/blob/dev/data/simulated-community/sake/expected-composition.txt

and the data for testing can be found in

https://github.com/caporaso-lab/short-read-tax-assignment/blob/dev/data/simulated-community
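One way to turn an expected composition into a class_prior vector is sketched below. It assumes the composition has already been read into a mapping from taxonomy label to relative abundance (the layout of expected-composition.txt is not described here, so parsing is left out), and that priors must be listed in the sorted label order that scikit-learn uses. The function name, the labels, and the pseudocount are illustrative assumptions.

```python
def class_prior_from_composition(classes, composition, pseudocount=1e-6):
    """Return prior probabilities in the same order as `classes`.

    classes: class labels in the order the classifier uses them
    composition: mapping of taxonomy label -> expected relative abundance
    pseudocount: small weight added to every class so that classes absent
        from the composition do not end up with a prior of exactly zero
    """
    weights = [composition.get(label, 0.0) + pseudocount for label in classes]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical labels; the reference contains one class the sample lacks.
classes = sorted(["k__Fungi;s__A", "k__Fungi;s__B", "k__Fungi;s__C"])
composition = {"k__Fungi;s__A": 0.75, "k__Fungi;s__B": 0.25}

prior = class_prior_from_composition(classes, composition)
for label, p in zip(classes, prior):
    print(label, round(p, 6))
```

The pseudocount is a design choice rather than a requirement: a prior of exactly zero makes a class impossible to predict, which may or may not be wanted when testing.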

Reference:

Wang, Q., Garrity, G. M., Tiedje, J. M., and Cole, J. R. (2007). Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16):5261–5267.

@nbokulich
Contributor

@BenKaehler and I have discussed this a bit more:

Take-home messages:

  1. Prior weights should probably be modified for expected presence/absence (think of a vaginal sample vs., say, a soil sample). The point is that class probabilities are not uniform, so classifying as though they were introduces bias.
  2. Systematic bias may also be created by some species having more copies (and so more variants) of the same amplicon. Abundance is not the driving factor in this non-uniformity; presence/absence is, followed by intra-species variation.

Some problems

  1. Multiple gene copy numbers distort expected vs. observed abundance, especially if we do not know the number of copies.
  2. Multiple sequence variants distort presence/absence. We know the sequence variants if the full genome has been sequenced and all variants (within each strain) are in the expected-sequences.tsv for a mock community.
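The first problem can be made concrete with a small worked example. It assumes unbiased amplification and known copy numbers, which is exactly what is often not available; the species names and numbers are made up.

```python
def expected_read_fractions(cell_fractions, copy_numbers):
    """Read fractions expected when each cell contributes one amplicon per gene copy."""
    weighted = {t: cell_fractions[t] * copy_numbers[t] for t in cell_fractions}
    total = sum(weighted.values())
    return {t: w / total for t, w in weighted.items()}


# Two hypothetical species mixed at equal cell abundance, with 1 and 4 gene copies.
fractions = expected_read_fractions(
    {"species_a": 0.5, "species_b": 0.5},
    {"species_a": 1, "species_b": 4},
)
print(fractions)
```

Equal cell abundances become a 20/80 split in reads, so an abundance-based prior built from cell counts would be off by that factor unless copy number is accounted for.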

Some possible solutions

  1. mock-18 and mock-19 were prepared by mixing plasmid clones of 16S rRNA genes. So we know the precise abundance and precise sequence (presence/absence) of each sequence.
  2. mock-26 was prepared by mixing ITS amplicons of pure fungal species, so we know the precise abundance. We have one expected sequence for each species added to the mock community, which is probably the most abundant sequence variant. But there are most likely other sequence variants present that are not accounted for in expected-sequences.tsv, so we could have a presence/absence problem (i.e., dada2 should detect these variants, but they will not be expected; the sequences are probably not dramatically different). @BenKaehler indicates that approximate is fine.
  3. We use simulated communities (i.e., reference sequences compiled at known proportions) to test and calibrate prior probabilities. These would be simple, small, and fast for testing purposes.

Miscellaneous notes on discussion

Problem with calculating prior probabilities on observed compositions

dada2 gives us what it thinks are the unique amplicons. If the process were perfect, the class distribution in the result would be zero for classes absent from the sample and, for classes present in the sample, the normalised number of unique reads that fall into each class.

How to calculate prior probabilities?

Do we want to use expected-sequences, expected-taxonomy, or trueish-taxonomies?
I had assumed expected-taxonomy, as that gives the expected abundance of each expected class label.
However, @BenKaehler notes:
we want to weight those taxonomic labels depending on how many different expected sequences have each label
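A minimal sketch of that weighting, assuming the expected sequences are available as a mapping from sequence identifier to taxonomic label (the identifiers, labels, and function name are hypothetical): each label is weighted by the number of distinct expected sequences that carry it, then normalised.

```python
from collections import Counter


def label_weights(sequence_taxonomy):
    """Normalised count of distinct expected sequences per taxonomic label."""
    counts = Counter(sequence_taxonomy.values())
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical expected sequences: three variants share one label.
expected = {
    "seq1": "g__Lactobacillus",
    "seq2": "g__Lactobacillus",
    "seq3": "g__Lactobacillus",
    "seq4": "g__Saccharomyces",
}

weights = label_weights(expected)
print(weights)
```

Labels that never appear among the expected sequences get no entry, which corresponds to the zero probability for absent classes described above. Abundance is ignored here; whether and how to combine it with these counts is the open question in this thread.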
