Collapse will create different features for the same level, because of non rank-aware padding #137

Open
FranckLejzerowicz opened this issue Mar 18, 2021 · 4 comments

@FranckLejzerowicz

Improvement Description
Taxon path padding can be made rank aware to avoid the following situation:
Say you have taxonomic classifications such as:

k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales
k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__

(The first is assigned only down to o__Clostridiales at the minimum confidence; the second is an unassigned species of Clostridiales: o__Clostridiales; f__; g__; s__.)

Current Behavior
Collapsing the above example to genus would not collapse to the same level, but result in two separate features:

k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;__;__
k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__; g__
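
The behavior described above can be reproduced with a minimal sketch. This is not the q2-taxa source; `pad_and_truncate` and its parameters are hypothetical names that only assume padding with bare `__` followed by truncation to the requested level.

```python
def pad_and_truncate(taxon, level, max_observed_level):
    """Pad a ';'-separated taxon path with bare '__', then keep `level` ranks."""
    # Split on ';' and strip the whitespace around each rank label.
    ranks = [r.strip() for r in taxon.split(';')]
    # Short paths are padded with '__' regardless of which rank is missing.
    if len(ranks) < max_observed_level:
        ranks.extend(['__'] * (max_observed_level - len(ranks)))
    return ';'.join(ranks[:level])


short = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales"
full = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__"

# Genus is the 6th rank; the deepest observed path has 7 ranks.
collapsed = {pad_and_truncate(t, level=6, max_observed_level=7) for t in (short, full)}
print(sorted(collapsed))  # two distinct labels for the same order
```

Because `__` and `f__`/`g__` are different strings, the two paths stay separate features after collapsing.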

Proposed Behavior
Make both of these collapse to the same taxon, which for the above would be:

k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__; g__

Note: this would only be feasible if the ranks are homogeneous (which is the case for e.g. GreenGenes and other good databases).
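
A rough sketch of what rank-aware padding could look like, assuming Greengenes-style single-letter prefixes. `RANK_PREFIXES` and `rank_aware_collapse` are hypothetical names for illustration, not part of q2-taxa or of the PR.

```python
# Assumption: one homogeneous prefix per rank position (Greengenes-style).
RANK_PREFIXES = ['k', 'p', 'c', 'o', 'f', 'g', 's']


def rank_aware_collapse(taxon, level, prefixes=RANK_PREFIXES):
    """Pad missing ranks with '<prefix>__' instead of bare '__', then truncate."""
    ranks = [r.strip() for r in taxon.split(';')]
    for position in range(len(ranks), level):
        if position < len(prefixes):
            ranks.append(prefixes[position] + '__')
        else:
            # No known prefix for this position: fall back to bare padding.
            ranks.append('__')
    return ';'.join(ranks[:level])


short = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales"
full = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__"

# Both paths now yield the same genus-level label.
print(rank_aware_collapse(short, level=6))
print(rank_aware_collapse(full, level=6))
```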

Comments
The current padding happens in a function nested within the _collapse_table() function of _util.py. I'd say this could be moved into a function with a larger scope, controlled by a command-line parameter, to let the user decide whether to remove taxa that would need padding. For example, one may want to get rid of everything not annotated to genus after collapsing to genus (in the above example, both entries would be deleted).
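
The "remove instead of pad" option could be sketched as follows. This is only an illustration under stated assumptions: `collapse_or_drop` and `drop_unannotated` are hypothetical names, the taxonomy is a plain dict, and the ASV3 entry is made up for the example.

```python
def collapse_or_drop(taxonomy, level, drop_unannotated=False):
    """Collapse each taxon to `level` ranks, optionally dropping unannotated ones."""
    collapsed = {}
    for feature_id, taxon in taxonomy.items():
        ranks = [r.strip() for r in taxon.split(';')]
        # A rank counts as annotated when it exists and is not an
        # empty 'x__' placeholder.
        annotated = len(ranks) >= level and ranks[level - 1].split('__', 1)[-1] != ''
        if drop_unannotated and not annotated:
            continue
        # Current-style padding with bare '__' for paths that are too short.
        ranks.extend(['__'] * (level - len(ranks)))
        collapsed[feature_id] = ';'.join(ranks[:level])
    return collapsed


taxonomy = {
    "ASV1": "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales",
    "ASV2": "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__",
    # Illustrative entry that is annotated down to genus.
    "ASV3": "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Blautia; s__",
}

print(collapse_or_drop(taxonomy, level=6, drop_unannotated=True))
```

With `drop_unannotated=True`, the two entries from the example above are removed and only the genus-annotated feature remains.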

Note: I am making a PR for this - see below

@nbokulich
Member

this is really quite specific to greengenes... other databases usually do not have empty annotations, and might use different rank padding conventions.

Furthermore, is this really desirable? collapsing features assigned to different taxonomic ranks seems to make many assumptions. For the sake of plotting, it might be convenient to collapse the following, but in terms of interpretation these taxonomies could mean very different things:

k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales
k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__

i.e., features assigned these different taxonomies should not be considered the same at family or genus level... it might be convenient to lump these together as "all clostridiales with unknown family and genus", but these could in fact belong to very different groups, so lumping together could smooth over important differences.

@FranckLejzerowicz
Author

FranckLejzerowicz commented Mar 18, 2021

Hello @nbokulich,

In the proposed PR, the padding would only happen if there is a rank convention for the database used for classification; in particular, only for ;-separated taxonomic fields that have homogeneous rank labels throughout the taxonomy.

Many databases follow the convention of "aligned" rank names, e.g. the single letter for greengenes and PR2, or none for SILVA (taxonomy dump lookup). In case of no convention, note that this PR would do the current padding (adding __ down to the collapsing level).

One caveat: the rank is inferred from the string before the first _ character. If no _ character is present, the padding would remain __ (current behaviour). However, an unusual taxonomy would create unusual padding,
e.g. the unlikely, poor taxonomy:

taxon_name1; taxon_name1.1; taxon_name1.1.1
taxon_name1; taxon_name1.2; taxon_name1.2.1
taxon_name2; taxon_name2.1

would pad to:

taxon_taxon_name1; taxon_taxon_name1.1; taxon_taxon_name1.1.1
taxon_taxon_name1; taxon_taxon_name1.2; taxon_taxon_name1.2.1
taxon_taxon_name2; taxon_taxon_name2.1; taxon_
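
The inference rule in the caveat above can be sketched in a few lines. `rank_prefix` is a hypothetical helper, not the PR's code; it only assumes that the rank is whatever precedes the first underscore.

```python
def rank_prefix(name):
    """Return the text before the first '_' in a rank label, or None without '_'."""
    head, sep, _ = name.strip().partition('_')
    # `sep` is empty when the label contains no underscore at all.
    return head if sep else None


for name in ("k__Bacteria", "taxon_name1.1", "Bacteria"):
    print(name, "->", rank_prefix(name))
```

Under this rule, every label in the unlikely taxonomy yields the same prefix, `taxon`, which is why it gets treated as a rank name.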

The code can be made robust to such edge cases, and so, I'd say that having the possibility to fix the above issue is indeed desirable. Here are a few thoughts about your points:

  • As always, the user must be aware of what he/she is doing, and could well want to group things that contain different signals (Tikhonov et al. etc...):
    • a user deciding to collapse at family or genus also decides to disregard the fact that "features assigned these different taxonomies should not be considered the same at family or genus level", or rather would consider all unassigned features the same genus and all features assigned to an unknown genus the same genus too (= current QIIME2 behaviour).
    • collapsing the above to k__Bacteria; p__Firmicutes; c__Clostridia, or even to k__Bacteria; p__Firmicutes, would lump together things that - I agree - "belong to very different groups", but it is still possible.
    • you further repeat that "lumping together could smooth over important differences", so in fact these points lead us to question the very existence of qiime taxa collapse: maybe it should be forbidden to collapse, or never above genus level? But it is possible, because we assume the user makes well-informed choices :)
  • I would like to hear more about the assumptions you're referring to:
    • I agree that an ASV assigned only to a given taxonomic rank (e.g. k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales) and another ASV assigned to another, deeper rank (e.g. k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__) are different. My understanding (about what this assumes) is that the former could be a novel species (within the scope of the current database) while the latter matches well an already known reference... which unfortunately is an empty annotation. What other assumptions did you have in mind? (just curious)
    • isn't greengenes a very often-used database in QIIME2, which makes this issue not that specific after all?

In fact, the real issue is that QIIME2 creates features with empty taxonomic rank (__ padding) for sequences not assigned down to the collapsing rank. Let's illustrate with another example and collapse at genus level:
for these ASVs:

ASV1    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales
ASV2    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales
ASV3    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales
ASV4    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__

will first be created:

ASV1    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; __; __
ASV2    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; __; __
ASV3    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; __; __
ASV4    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__

and then, after the collapse, these features are obtained:

k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; __; __
k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__

In this case, the three different ASVs that were only assigned to o__Clostridiales - and thus that are potentially novel, i.e. unexpected in the microbial system - would be lumped together, creating the issue you highlighted.
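
To make the grouping concrete, here is a small sketch that pads the four ASVs with bare `__` and groups them by their genus-level label. `pad_to_level` is a hypothetical helper that mimics the described behavior, not QIIME 2 code.

```python
from collections import defaultdict


def pad_to_level(taxon, level):
    """Pad with bare '__' up to `level` ranks, then truncate to `level`."""
    ranks = [r.strip() for r in taxon.split(';')]
    ranks.extend(['__'] * (level - len(ranks)))
    return '; '.join(ranks[:level])


order_only = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales"
asvs = {
    "ASV1": order_only,
    "ASV2": order_only,
    "ASV3": order_only,
    "ASV4": order_only + "; f__; g__; s__",
}

# Group ASV identifiers by their collapsed (genus-level) label.
groups = defaultdict(list)
for asv_id, taxon in asvs.items():
    groups[pad_to_level(taxon, level=6)].append(asv_id)

for label, members in groups.items():
    print(label, members)
```

ASV1 to ASV3 end up under the `__; __` label, while ASV4 keeps its own `f__; g__` feature.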

I agree that padding should be avoided when the ranks do not exist in the first place (and notably if they exist elsewhere for "unassigned" assignments), but isn't it an issue that created features such as k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; __; __ are confusing, and shouldn't they be discarded instead?

@nbokulich
Member

hey @FranckLejzerowicz ,
your explanation above makes it clear that the PR connected to this would cause more problems than it would solve, as this makes assumptions about taxonomic conventions that are not universal. Semicolon-delimited taxonomy is the only convention you mention that is followed by Q2.

I recommend creating a separate plugin for this action (q2-greengenes?), but curious what others think

@FranckLejzerowicz
Author

FranckLejzerowicz commented Mar 19, 2021

I indeed tried to identify where this would break (and how it could be improved to solve the issue). Note that if, at even one semicolon-delimited rank position, the characters before the first underscore are not the same across all taxa, this PR would do nothing. Hence, the concern about taxonomic conventions that are not universal could be addressed by adding a parameter:

  --p-conventional-ranks / --p-no-conventional-ranks
                         The taxonomy is labeled with conventional ranks, e.g. k__Bacteria
                         (and not just Bacteria).                              [default: False]
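
One way the "do nothing unless the labels are homogeneous" check could be combined with such a flag is sketched below. `homogeneous_prefixes` is a hypothetical function; only the `conventional_ranks` name mirrors the parameter proposed above, and none of this is taken from the PR itself.

```python
def homogeneous_prefixes(taxa, conventional_ranks=False):
    """Return one prefix per rank position, or None when padding should stay '__'."""
    if not conventional_ranks:
        return None  # flag off: keep the current behaviour
    by_position = {}
    for taxon in taxa:
        for position, name in enumerate(taxon.split(';')):
            head, sep, _ = name.strip().partition('_')
            if not sep:
                return None  # a label without '_' has no inferable rank
            by_position.setdefault(position, set()).add(head)
    # Every rank position must carry exactly one prefix across all taxa.
    if any(len(heads) != 1 for heads in by_position.values()):
        return None
    return [next(iter(by_position[i])) for i in range(len(by_position))]


greengenes_like = [
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales",
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__",
]
mixed = ["k__Bacteria; p__Firmicutes", "k__Bacteria; c__Clostridia"]

print(homogeneous_prefixes(greengenes_like, conventional_ranks=True))
print(homogeneous_prefixes(mixed, conventional_ranks=True))
```

A `None` result would mean falling back to the existing `__` padding, so a taxonomy without a consistent convention is left untouched.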

Since I agree with your first point above on lumping different things, the issue remains that users would get features created by padding to the collapsing level.

i.e., collapsing

ASV1   k__Bacteria
ASV2   k__Bacteria
ASV3   k__Bacteria
ASV4   k__Bacteria
ASV5   k__Bacteria
ASV6   k__Bacteria

at genus level would yield:

ASV1   k__Bacteria;__;__;__;__
ASV2   k__Bacteria;__;__;__;__
ASV3   k__Bacteria;__;__;__;__
ASV4   k__Bacteria;__;__;__;__
ASV5   k__Bacteria;__;__;__;__
ASV6   k__Bacteria;__;__;__;__

Collapsing to:

k__Bacteria;__;__;__;__

Not sure the user readily understands the assumptions behind this feature, vs. k__Bacteria;p__;__;__;__, k__Bacteria;p__;o__;__;__, or k__Bacteria;p__;o__;f__;__
Sorry for the long messages and for wasting your time if this is not relevant.
