Collapse will create different features for the same level, because of non rank-aware padding #137
Comments
this is really quite specific to greengenes... other databases usually do not have empty annotations, and might use different rank padding conventions. Furthermore, is this really desirable? collapsing features assigned to different taxonomic ranks seems to make many assumptions. For the sake of plotting, it might be convenient to collapse the following, but in terms of interpretation these taxonomies could mean very different things:
i.e., features assigned these different taxonomies should not be considered the same at family or genus level... it might be convenient to lump these together as "all clostridiales with unknown family and genus", but these could in fact belong to very different groups, so lumping together could smooth over important differences.
Hello @nbokulich, In the proposed PR, the padding would only happen if there is a rank convention for the database used for classification, in particular, only for Many databases follow the convention of "aligned" rank names, e.g. the single letter for greengenes and PR2, or none for SILVA (taxonomy dump lookup). In case of no convention, note that this PR would do the current padding (adding One caveat: the rank is inferred using the string before the first
would pad to:
The code can be made robust to such edge cases, and so, I'd say that having the possibility to fix the above issue is indeed desirable. Here's a few thoughts about your points:
In fact, the real issue is that QIIME2 creates features with empty taxonomic rank (
will first be created:
and then, after the collapse, the following features are obtained:
In this case, the three different ASVs that were only assigned to I agree that padding should be avoided when the rank does not exist in the first place (and notably if it exists elsewhere for "unassigned" assignments), but isn't it an issue that created features such as
hey @FranckLejzerowicz, I recommend creating a separate plugin for this action (q2-greengenes?), but curious what others think
I indeed tried to identify where this would break (and how it could be improved to solve the issue). Note that if just one semicolon-delimited taxon does not have the same characters before its first underscore as the others at that rank, this PR would do nothing. Hence, the assumptions about taxonomic conventions that are not universal could be handled by adding a parameter:
Since I agree with your first point above on lumping different things, the issue remains that users would get features created by padding to the collapsing level. i.e., collapsing
at genus level would yield:
Collapsing to:
Not sure the user readily understands the assumptions behind this feature, vs.
Improvement Description
Taxon path padding can be made rank aware to avoid the following situation:
Say you have taxonomic classifications such as:
(the first is only assigned with min. confidence to o__Clostridiales, and the other one, the unassigned species of Clostridiales: o__Clostridiales; f__; g__; s__).
Current Behavior
Collapsing the above example to genus would not collapse to the same level, but result in two separate features:
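A minimal sketch of how the split can arise, assuming the padding step fills missing levels with a prefix-less `__` placeholder before truncating to the requested level. The function name `collapse_naive`, its parameters, and the two full lineages are illustrative assumptions, not the plugin's actual code or data.

```python
def collapse_naive(lineage, level, max_observed_level, pad="__"):
    """Pad a lineage with rank-less placeholders, then truncate to `level`.

    Assumption: `pad` is a bare "__" that carries no rank prefix.
    """
    ranks = [r.strip() for r in lineage.split(";")]
    if len(ranks) < max_observed_level:
        ranks.extend([pad] * (max_observed_level - len(ranks)))
    return ";".join(ranks[:level])


# Hypothetical Greengenes-style lineages (illustrative only).
short = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales"
empty = ("k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; "
         "f__; g__; s__")

GENUS = 6  # kingdom=1 ... genus=6, species=7
a = collapse_naive(short, GENUS, max_observed_level=7)
b = collapse_naive(empty, GENUS, max_observed_level=7)

print(a)
print(b)
print("same feature after collapse:", a == b)
```

Under these assumptions the first lineage ends in `;__;__` while the second ends in `;f__;g__`, so the two strings differ and become separate features.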
Proposed Behavior
Make both of these collapse to the same taxon, which for the above would be:
Note: this would only be feasible if the ranks are homogeneous (which is the case for e.g. GreenGenes and other good databases).
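One way the rank-aware padding could look, sketched under the assumption that each taxon carries its rank as the string before the first underscore (as in Greengenes). The helper names `infer_rank_prefixes` and `collapse_rank_aware` are hypothetical and are not taken from the PR; when prefixes are not consistent across all lineages, the sketch falls back to the prefix-less `__` padding.

```python
def infer_rank_prefixes(lineages):
    """Return one prefix per level (e.g. ['k', 'p', ...]).

    Returns None if any level uses different prefixes across lineages,
    or if a taxon has no underscore at all (no rank convention).
    """
    prefixes = []
    for lineage in lineages:
        for i, taxon in enumerate(t.strip() for t in lineage.split(";")):
            if "_" not in taxon:
                return None
            prefix = taxon.split("_", 1)[0]
            if i == len(prefixes):
                prefixes.append(prefix)   # first time this level is seen
            elif prefixes[i] != prefix:
                return None               # ranks are not homogeneous
    return prefixes


def collapse_rank_aware(lineage, level, prefixes=None):
    """Pad missing levels with '<prefix>__' when known, else with '__'."""
    ranks = [t.strip() for t in lineage.split(";")]
    for i in range(len(ranks), level):
        known = prefixes is not None and i < len(prefixes)
        ranks.append((prefixes[i] if known else "") + "__")
    return ";".join(ranks[:level])


# Hypothetical Greengenes-style lineages (illustrative only).
short = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales"
empty = ("k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; "
         "f__; g__; s__")

prefixes = infer_rank_prefixes([short, empty])
GENUS = 6
a = collapse_rank_aware(short, GENUS, prefixes)
b = collapse_rank_aware(empty, GENUS, prefixes)
print(a == b)
```

With homogeneous prefixes both lineages collapse to the same string ending in `;f__;g__`. Whether merging them is desirable for interpretation is the open question raised in the comments above.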
Comments
The current padding happens in a function nested within the _collapse_table() function of _util.py. I'd say this could be moved into a function with a larger scope, controlled by a command-line parameter, to let users decide whether they prefer to remove taxa that would need padding. For example, one may want to get rid of things not annotated to genus after collapsing to genus (in the above example, the two entries would be deleted).
Note: I am making a PR for this - see below
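The "remove instead of pad" option could be sketched as follows. This assumes Greengenes-style `x__Name` taxa, and it treats a lineage as unannotated at a level when it is either too short or has an empty name at that level. The functions `is_annotated` and `collapse_dropping_unannotated` and the feature IDs are hypothetical, not part of the plugin.

```python
def is_annotated(lineage, level):
    """True if the lineage has a non-empty name at the given 1-based level."""
    ranks = [t.strip() for t in lineage.split(";")]
    if len(ranks) < level:
        return False                      # would need padding
    name = ranks[level - 1].split("__", 1)[-1]
    return bool(name)                     # 'g__' has an empty name


def collapse_dropping_unannotated(taxonomy, level):
    """Map feature id -> collapsed lineage, skipping unannotated features."""
    kept = {}
    for feature_id, lineage in taxonomy.items():
        if is_annotated(lineage, level):
            ranks = [t.strip() for t in lineage.split(";")]
            kept[feature_id] = ";".join(ranks[:level])
    return kept


# Hypothetical feature table annotations (illustrative only).
taxonomy = {
    "asv1": "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales",
    "asv2": ("k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; "
             "f__; g__; s__"),
    "asv3": ("k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; "
             "f__Clostridiaceae; g__Clostridium; s__"),
}

GENUS = 6
result = collapse_dropping_unannotated(taxonomy, GENUS)
print(sorted(result))
```

Here the two entries from the example (`asv1` and `asv2`) are dropped at genus level and only the feature with a named genus is kept.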