prototype specified prior testing for q2-feature-classifier #107

BenKaehler · 2017-02-15T03:39:38Z

The machine learning classifiers used for taxonomic assignment usually assume that all taxonomies are equally likely (e.g., Wang et al., 2007). That assumption can be relaxed for q2-feature-classifier.

To prototype setting the prior probabilities for the classifier, copy

https://github.com/caporaso-lab/short-read-tax-assignment/blob/dev/ipynb/mock-community/generate-tax-assignments-qiime2.ipynb

to

https://github.com/caporaso-lab/short-read-tax-assignment/tree/dev/ipynb/simulated-community

then experiment with setting fit_prior to False (the uniform prior assumption) or class_prior to the appropriate probabilities, and assess the impact on classification. Those arguments can be set in the nb_params dictionary initially, but ultimately it would be good to extend method_paramaters_combinations and gen_param_sweep to facilitate automation.
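A minimal sketch of the two settings, assuming the underlying classifier is scikit-learn's MultinomialNB, where class_prior is a list of probabilities ordered to match the sorted class labels. The taxonomy labels, the proportions, the helper name build_class_prior, the floor value, and the flat layout of the parameter dictionaries are all hypothetical and only illustrate the idea:

```python
def build_class_prior(expected, classes, floor=1e-6):
    """Return (sorted labels, priors aligned to them, summing to 1).

    expected: dict of taxonomy label -> expected relative abundance
    classes:  every label the classifier is trained on
    floor:    hypothetical small weight for labels missing from `expected`,
              so that no class ends up with a prior of exactly zero
    """
    ordered = sorted(set(classes))
    raw = [max(expected.get(label, 0.0), floor) for label in ordered]
    total = sum(raw)
    return ordered, [weight / total for weight in raw]


# Hypothetical labels and proportions, for illustration only.
classes = ["k__Bacteria; g__A", "k__Bacteria; g__B", "k__Bacteria; g__C"]
expected = {"k__Bacteria; g__A": 0.75, "k__Bacteria; g__B": 0.25}

ordered, class_prior = build_class_prior(expected, classes)

# Uniform prior assumption: do not learn priors from the training data.
nb_params_uniform = {"fit_prior": False, "class_prior": None}
# Specified prior: pass the expected probabilities explicitly.
nb_params_specified = {"fit_prior": True, "class_prior": class_prior}
```

Whether absent classes should get a small floor or a prior of exactly zero is a design choice to test; a zero prior makes a class impossible to assign.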

The prior probabilities can be found in, for example,

https://github.com/caporaso-lab/short-read-tax-assignment/blob/dev/data/simulated-community/sake/expected-composition.txt

and the data for testing can be found in

https://github.com/caporaso-lab/short-read-tax-assignment/blob/dev/data/simulated-community

Reference:

Wang, Q., Garrity, G. M., Tiedje, J. M., and Cole, J. R. (2007). Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16):5261–5267.

nbokulich · 2017-04-21T15:04:42Z

@BenKaehler and I have discussed this a bit more:

Take-home messages:

Prior weights should probably be modified for expected presence/absence (think of a vaginal sample vs., say, a soil sample). The point is that class probabilities are not uniform, so classifying as if they were introduces bias.
There is a possibility that systematic bias is created by some species having more copies (and so more variants) of the same amplicon. Abundance isn't the driving factor in this non-uniformity; presence/absence is, followed by intra-species variation.

Some problems

Multiple gene copy numbers distort expected vs. observed abundance, especially if we do not know the number of copies.
Multiple sequence variants distort presence/absence. We know the sequence variants if the full genome has been sequenced and all variants (within each strain) are in the expected-sequences.tsv for a mock community.
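A small worked example of the copy-number distortion, assuming amplicon counts are simply proportional to cell abundance times gene copy number (no PCR or extraction bias). The taxa, abundances, and copy numbers are made up:

```python
def observed_proportions(cell_abundance, copy_number):
    """Amplicon proportions expected from cell abundance and gene copy number."""
    weighted = {
        taxon: cell_abundance[taxon] * copy_number[taxon]
        for taxon in cell_abundance
    }
    total = sum(weighted.values())
    return {taxon: weight / total for taxon, weight in weighted.items()}


# Two taxa mixed at equal cell abundance, with different copy numbers.
cell_abundance = {"taxon_A": 0.5, "taxon_B": 0.5}
copy_number = {"taxon_A": 1, "taxon_B": 4}

observed = observed_proportions(cell_abundance, copy_number)
# taxon_A is observed at 0.2 and taxon_B at 0.8, despite the 50/50 mix.
```

If the copy numbers are unknown, the expected 50/50 composition cannot be corrected toward the observed 20/80 split, which is the problem described above.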

Some possible solutions

mock-18 and mock-19 were prepared by mixing plasmid clones of 16S rRNA genes. So we know the precise abundance and precise sequence (presence/absence) of each sequence.
mock-26 was prepared by mixing ITS amplicons of pure fungal species, so we know the precise abundance. We have one expected sequence for each species added to the mock community, which is probably the most abundant sequence variant. But there are most likely other sequence variants present that are not accounted for in expected-sequences.tsv, so we could have a presence/absence problem (i.e., dada2 should detect these variants, but they will not be expected; the sequences are probably not dramatically different). @BenKaehler indicates that approximate is fine.
We use simulated communities (i.e., reference sequences compiled at known proportions) to test and calibrate prior probabilities. These would be simple, small, and fast for testing purposes.
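A rough sketch of the simulated-community idea, assuming a community is just reference sequence IDs drawn at known proportions. The function name, the IDs, and the proportions are hypothetical, and this is not how the existing simulated-community data were generated:

```python
import random
from collections import Counter


def simulate_community(proportions, n_reads, seed=0):
    """Draw n_reads reference IDs according to the known proportions."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    ids = list(proportions)
    weights = [proportions[ref_id] for ref_id in ids]
    return Counter(rng.choices(ids, weights=weights, k=n_reads))


# Hypothetical reference IDs and known proportions.
known = {"ref_1": 0.7, "ref_2": 0.2, "ref_3": 0.1}
counts = simulate_community(known, n_reads=1000, seed=42)
```

Because the proportions are known exactly, they can be used directly as the specified prior and compared against the uniform prior on the same simulated reads.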

Miscellaneous notes on discussion

Problem with calculating prior probabilities on observed compositions

dada2 is giving us what it thinks are the unique amplicons. If the process were perfect, then the class distribution in the result would be zero for classes absent from the sample and the normalised number of unique reads that fall into each class for those classes present in the sample.

How to calculate prior probabilities?

Do we want to use expected-sequences or expected-taxonomy or trueish-taxonomies?
I had assumed expected-taxonomy, as that gives the expected abundance of each expected class label.
However, @BenKaehler notes:
we want to weight those taxonomic labels depending on how many different expected sequences have each label
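One way to read that suggestion, as a sketch: weight each taxonomic label by the number of distinct expected sequences that carry it, then normalise. The input layout (pairs of sequence and label) and the function name are assumptions, not the format of expected-sequences.tsv or expected-taxonomy:

```python
from collections import defaultdict


def presence_weighted_prior(sequence_labels):
    """Prior per label, proportional to its number of distinct sequences.

    sequence_labels: iterable of (sequence, taxonomy label) pairs
    """
    variants = defaultdict(set)
    for sequence, label in sequence_labels:
        variants[label].add(sequence)  # duplicates of a variant count once
    total = sum(len(seqs) for seqs in variants.values())
    return {label: len(seqs) / total for label, seqs in variants.items()}


# Hypothetical expected sequences: two variants for one label, one for another.
pairs = [
    ("ACGT", "g__A"),
    ("ACGA", "g__A"),
    ("ACGT", "g__A"),  # repeated variant, ignored by the set
    ("TTGA", "g__B"),
]
prior = presence_weighted_prior(pairs)
# g__A gets 2/3 and g__B gets 1/3; labels with no expected sequence are absent.
```

This ignores abundance entirely, in line with the take-home message that presence/absence and intra-species variation, not abundance, drive the non-uniformity.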
