add support for `FeatureTable` and `FeatureData` generation from `classify-kraken` results #36

gregcaporaso · 2023-04-04T22:20:11Z

gregcaporaso
Apr 4, 2023
Collaborator

Ultimately we'll want FeatureTable and FeatureData[Taxonomy] results to use these data in downstream applications. One option would be to use Bracken to go from SampleData[Kraken2Output] and/or SampleData[Kraken2Report] to a FeatureTable and FeatureData[Taxonomy]. Should we start thinking about that, or is there another approach that is planned?

misialq · 2023-04-05T14:42:08Z

misialq
Apr 5, 2023
Maintainer

Hey @gregcaporaso,
yes, I'm already planning to write an action for Bracken (haven't started working on it yet, though). This would only work when we classify reads directly (not MAGs), right?

0 replies

gregcaporaso · 2023-04-05T15:32:58Z

gregcaporaso
Apr 5, 2023
Collaborator Author

@misialq, I wasn't certain about that but that's what makes sense to me too. Do you have thoughts on the pathway from Kraken results on MAGs to a feature table? We do have FeatureTable[PresenceAbsence], which could be relevant for MAGs, at least to get us started.

0 replies

nbokulich · 2023-04-16T10:21:16Z

nbokulich
Apr 16, 2023
Maintainer

Hey this is a complex issue, with a few other issues linked. I opened the following issue after looking into this a bit, and discussing with @misialq that FeatureTable generation should be separated from taxonomic classification of MAGs (e.g., with kraken2): #33

Regarding feature tables:

- For reads, bracken can be used to estimate abundances and generate a table as @misialq noted.
- For MAGs, feature table generation should happen prior to taxonomic classification, hence the issue above. This would most likely happen at the point of MAG dereplication, see this issue: Implement MAG dereplication #4

Regarding FeatureData[Taxonomy]:

- For read-based profiling, the problem is that we are working with raw sequences (a SampleData type), not dereplicated FeatureData[Sequence] data. I do not think that a FeatureData[Taxonomy] is useful to map back to the raw sequences, though technically this would be straightforward to put together from the kraken2 output files. So a FeatureTable[Frequency] of taxa X samples might be all we need/can get.
- For dereplicated MAG-based profiling, a FeatureData[Taxonomy] is more meaningful.

0 replies

gregcaporaso · 2023-04-17T22:53:33Z

gregcaporaso
Apr 17, 2023
Collaborator Author

Thanks @nbokulich!

> For reads, bracken can be used to estimate abundances and generate a table as @misialq noted.

That seems like the way to go. Just FYI, @colinvwood is doing some exploration of this on our end as we need to use this for some experimental analyses that we're running.

> For MAGs, feature table generation should happen prior to taxonomic classification

We were thinking that it may be possible to generate a FeatureTable[PresenceAbsence] from taxonomically annotated MAG data that we're currently generating. That would at least let us get a view of what taxa are expected to be present based on MAG data (also helpful for some experimental analyses that we're working on). Does that seem problematic to you?

> For read-based profiling, the problem is that we are working with raw sequences...

It may still be useful for practical purposes to generate FeatureData[Taxonomy] (even if the information in there is redundant) so we can use it e.g. to create a taxa barplot or to collapse by taxonomy.

0 replies

nbokulich · 2023-04-18T05:13:29Z

nbokulich
Apr 18, 2023
Maintainer

Hey @gregcaporaso ,

> We were thinking that it may be possible to generate a FeatureTable[PresenceAbsence] from taxonomically annotated MAG data that we're currently generating

If the MAGs are not yet dereplicated then this would make a massively sparse table with MAGs unique to samples. Without abundance information I am not sure what value this would have. But I propose that we discuss MAGs->table on the other issue to keep it separate from taxonomic classification.

> It may still be useful for practical purposes to generate FeatureData[Taxonomy] (even if the information in there is redundant) so we can use it e.g. to create a taxa barplot or to collapse by taxonomy.

For read-based profiling the table we will get (via bracken) will already be collapsed by taxonomy. The only other way would be a massive sparse presence-absence table with each feature observed once, which seems like an inefficient solution.

For the purposes of taxa barplots, I think that we should instead alter the action to make FeatureData[Taxonomy] optional — if not passed then the labels are pulled from the index. This would allow it to operate on a collapsed table or directly on ASV abundances — both are requests that have come up on the forum with some open issues in q2-taxa. What do you think about this?

0 replies

nbokulich · 2023-04-18T11:18:58Z

nbokulich
Apr 18, 2023
Maintainer

regardless of the decision we make here wrt where/how MAG abundances are tabulated, I put in this PR to make it possible for collapsed feature tables (e.g., from bracken) to be passed to barplot:
qiime2/q2-taxa#153

0 replies

gregcaporaso · 2023-04-18T16:33:03Z

gregcaporaso
Apr 18, 2023
Collaborator Author

> But I propose that we discuss MAGs->table on the other issue

👍

> The only other way would be a massive sparse presence-absence table with each feature observed once, which seems like an inefficient solution.

Yep, I agree that that wouldn't make sense or be useful. I was thinking that the feature ids in the table would be taxa, but that we might want to have a way to create a corresponding FeatureData[Taxonomy] as well, even if the ids and the taxa labels are redundant (see example below) to facilitate use with existing actions. Generating barplots is the main action of interest for me right now though, so I agree that your other solution is better (as long as we still retain the ability to view the barplot at different taxonomic levels). We'll test your q2-taxa PR out.

Here's an example of what the FeatureData[Taxonomy] could look like. This is a bit silly, but it would let us (a) define meaningful features when doing read-based profiling, and (b) use existing actions that work with FeatureTable and FeatureData[Taxonomy].

Feature ID	Taxon
k__Bacteria; p__Verrucomicrobia; c__Verrucomicrobiae; o__Verrucomicrobiales; f__Verrucomicrobiaceae; g__Akkermansia; s__muciniphila	k__Bacteria; p__Verrucomicrobia; c__Verrucomicrobiae; o__Verrucomicrobiales; f__Verrucomicrobiaceae; g__Akkermansia; s__muciniphila
k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Comamonadaceae; g__Acidovorax	k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Comamonadaceae; g__Acidovorax
k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__[Mogibacteriaceae]; g__; s__	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__[Mogibacteriaceae]; g__; s__
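A redundant taxonomy like the one above could be derived mechanically from the feature IDs of an already-collapsed table. The following is only a minimal sketch of that idea, not how q2-taxa or this plugin does it: the function names `dummy_taxonomy_tsv` and `truncate_to_level` are made up for illustration, and it assumes that ranks in a feature ID are separated by semicolons as in the example rows.

```python
def dummy_taxonomy_tsv(feature_ids):
    """Build a two-column taxonomy TSV in which each taxon label equals its feature ID.

    Hypothetical helper: assumes the feature IDs of a collapsed table are
    already full taxon strings.
    """
    lines = ["Feature ID\tTaxon"]
    seen = set()
    for fid in feature_ids:
        if fid in seen:
            continue  # keep each feature ID only once
        seen.add(fid)
        lines.append(f"{fid}\t{fid}")
    return "\n".join(lines) + "\n"


def truncate_to_level(taxon, level):
    """Keep only the first `level` ranks of a semicolon-delimited taxon string."""
    if level < 1:
        raise ValueError("level must be at least 1")
    ranks = [rank.strip() for rank in taxon.split(";")]
    return "; ".join(ranks[:level])
```

Truncating the labels this way is one possible route to viewing the same table at a coarser taxonomic level; counts for features that end up with the same truncated label would still need to be summed.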

0 replies

nbokulich · 2023-04-18T18:24:30Z

nbokulich
Apr 18, 2023
Maintainer

yeah this is how my PR in q2-taxa handles taxonomies/labels pulled from tables without accompanying FeatureData[Taxonomy] inputs, by creating a dummy taxonomy from those labels.

> (b) use existing actions that work with FeatureTable and FeatureData[Taxonomy].

Can you think of some use cases? Aside from barplot, most other downstream actions that also accept taxonomy, e.g., for making heatmaps, sample classification, maybe differential abundance (?) might as well just operate on the collapsed table directly without a dummy taxonomy. The only other action that I can think of right now where this would be useful is for filtering a feature table based on taxonomy, but surely here another solution could be used.

0 replies

gregcaporaso · 2023-04-18T19:01:11Z

gregcaporaso
Apr 18, 2023
Collaborator Author

@nbokulich, I'll follow up with you on this within about the next week. We're working through some analyses, so I may either have specific examples or ideas of when it's needed, or not (in which case that will suggest we don't need this after all). Thanks for the q2-taxa PR!

0 replies

gregcaporaso · 2023-04-19T17:38:06Z

gregcaporaso
Apr 19, 2023
Collaborator Author

> For reads, bracken can be used to estimate abundances and generate a table as @misialq noted.

@colinvwood and I are going to write a prototype of a function that reads Bracken reports and generates a FeatureTable[Frequency] artifact, as we need to do this for an analysis we're working on. We'll be happy to contribute that to the plugin, or just toss it if this is already in progress. Just wanted to mention this, @misialq and @nbokulich, so you know that we're working on it in case it's helpful for you.

0 replies

misialq · 2023-04-20T05:43:34Z

misialq
Apr 20, 2023
Maintainer

Hey @gregcaporaso, thanks, that's great! I haven't gotten to that part quite yet, so it would be wonderful if you could contribute that to the plugin. I'm wrapping up rewriting the database building action to add the Bracken DB (PR coming very soon) and will then move to expanding the reads-based classification by the Bracken step - I will then ping you guys on the respective PR. Thanks!

0 replies

gregcaporaso · 2023-04-20T16:51:12Z

gregcaporaso
Apr 20, 2023
Collaborator Author

That all sounds great @misialq. We are currently running Bracken outside of QIIME 2 for our analysis, so we should have some good context for reviewing PRs. Turns out generating feature tables from the Bracken output is very straightforward - we have that working now (again, outside of QIIME 2 for now) but we can contribute that whenever it makes sense.
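For readers following along, here is a rough sketch of what such a conversion might look like. It is not the code referred to above. It assumes Bracken's per-sample tab-separated output with a header row containing `name` and `new_est_reads` columns, uses the `name` column as the feature ID, and fills taxa missing from a sample with zero. The function name `bracken_to_table` is hypothetical.

```python
import csv
import io


def bracken_to_table(reports):
    """Combine per-sample Bracken outputs into a taxa x samples count table.

    `reports` maps a sample ID to the text of that sample's Bracken output.
    Returns (sample_ids, table), where table maps each taxon name to a list
    of estimated read counts ordered like sample_ids.
    """
    counts = {}  # taxon name -> {sample ID: estimated reads}
    for sample_id, text in reports.items():
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for row in reader:
            counts.setdefault(row["name"], {})[sample_id] = int(row["new_est_reads"])

    sample_ids = sorted(reports)
    # Taxa that are absent from a sample get a count of zero.
    table = {
        taxon: [per_sample.get(s, 0) for s in sample_ids]
        for taxon, per_sample in sorted(counts.items())
    }
    return sample_ids, table
```

Turning the result into a FeatureTable[Frequency] artifact would be a separate step and is not shown here.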

0 replies

misialq · 2023-04-20T18:44:49Z

misialq
Apr 20, 2023
Maintainer

Great! I'll get back to you on that most likely some time next week.

0 replies

misialq · 2023-05-24T07:30:28Z

misialq
May 24, 2023
Maintainer

Hey @gregcaporaso,

after some more discussions with @nbokulich, we came up with this little diagram showing the proposed way forward for taxonomic classification in the shotgun workflows (see the diagram below). It includes the additional steps of mapping reads to dereplicated MAGs (if MAGs were used as classification target) or reference sequences (if reads were used as input for a classifier different than Kraken2; I'm currently working on Kaiju), followed by RPKM estimation to get an abundance table normalized by genome length of every taxon. This means we will require some small modifications of the current state + some new actions:

1. Kraken2 classification - one action with 2 input types:
   - reads -> SampleData[Kraken2Reports % Reads] + FeatureTable[Frequency % Unnormalized]
   - MAGs -> SampleData[Kraken2Reports % MAGs] + FeatureData[Taxonomy] (MAG-to-taxon mapping)
2. Bracken abundance re-estimation - one action with one allowed input type:
   - SampleData[Kraken2Reports % Reads] -> FeatureTable[Frequency % Normalized]
   - the MAGs property is disallowed here to prevent users from inputting MAG classification results
3. Kaiju classification - one action with 2 input types:
   - reads -> FeatureTable[Frequency % Unnormalized]
   - MAGs -> FeatureData[Taxonomy] (MAG-to-taxon mapping)
4. MAG dereplication action
5. Action for mapping reads to reference genomes/dereplicated MAGs
6. Action for abundance normalization - a pipeline using the mapping action from 5., followed by RPKM estimation to obtain FeatureTable[Frequency % Normalized]
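
As a point of reference for the RPKM step, the usual definition is reads mapped to a feature, divided by the feature length in kilobases and by the total mapped reads in millions. The sketch below applies that definition to one sample. The function name `rpkm` and the choice of total mapped reads in the sample as the denominator are assumptions for illustration, not a description of the planned action.

```python
def rpkm(read_counts, lengths_bp):
    """Compute RPKM for one sample.

    `read_counts` maps a feature (e.g. a genome or MAG) to reads mapped to it;
    `lengths_bp` maps the same features to their lengths in base pairs.
    RPKM = reads * 1e9 / (length_bp * total_mapped_reads).
    """
    total = sum(read_counts.values())
    if total == 0:
        # No mapped reads in this sample: report zero for every feature.
        return {feature: 0.0 for feature in read_counts}
    return {
        feature: reads * 1e9 / (lengths_bp[feature] * total)
        for feature, reads in read_counts.items()
    }
```

With equal read counts, a genome twice as long gets half the RPKM, which is the length normalization described above.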

How does that sound? For visibility, I'll be creating more issues to track those steps. For now, I'll start by adjusting those Kraken2 action(s) for which we have PRs in the works.

3 replies

misialq May 24, 2023
Maintainer

Hmmm, actually, if we want to do it this way, we'll need to split the classify-kraken action into two: one for reads and another one for MAGs as the first one would need to output a FeatureTable artifact and the other one a FeatureData artifact (next to the kraken reports, which we get in both cases). Unless we output all those in both cases and sometimes one or the other is just empty. Any thoughts anyone?

nbokulich May 24, 2023
Maintainer

hey @misialq — yeah, two separate actions would be necessary now that you mention it. I was thinking on the one hand that:

we could keep as one action that only outputs the kraken reports. A separate action could be used to create a FeatureData[Taxonomy] from the kraken report. And as the report also contains abundances, these could be parsed by the bracken action anyway.

HOWEVER,

we discussed that a property could be used to differentiate KrakenReport % Reads from KrakenReport % MAGs. This would need to be fixed, I think, as with all other outputs and could not be conditional on the input. So this alone would necessitate two separate actions. Right?

misialq May 25, 2023
Maintainer

That's right - I already included the property in the diagram above - this would, however, not require splitting this into two actions. We could just use a TypeMap to determine which of those is produced when reads/MAGs are used as inputs.

The reason why I thought we'd need two actions is that when we use reads as input we want to produce a FeatureTable[Frequency % Unnormalized] (+ reports) but when we use MAGs we should get FeatureData[Taxonomy] (+ reports). The reason for outputting FeatureTable[Frequency % Unnormalized] in the first case is to just keep it consistent with other classification actions here, like Kaiju.
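
To make the distinction concrete, here is a plain-Python stand-in for the idea. It does not use the QIIME 2 TypeMap API; the dictionaries and function names are hypothetical, and the type names are written as strings purely for illustration. It shows the report property following the input kind, the second output differing by input kind (the part that suggests two actions), and the Bracken step accepting only reports from reads.

```python
# Hypothetical lookup: input kind -> property attached to the Kraken2 reports.
REPORT_PROPERTY = {"reads": "Reads", "MAGs": "MAGs"}

# Hypothetical lookup: input kind -> the second output discussed above.
SECOND_OUTPUT = {
    "reads": "FeatureTable[Frequency % Unnormalized]",
    "MAGs": "FeatureData[Taxonomy]",
}


def classify_outputs(input_kind):
    """Return the (reports type, second output type) pair for an input kind."""
    if input_kind not in REPORT_PROPERTY:
        raise ValueError(f"unknown input kind: {input_kind!r}")
    reports = f"SampleData[Kraken2Reports % {REPORT_PROPERTY[input_kind]}]"
    return reports, SECOND_OUTPUT[input_kind]


def bracken_accepts(reports_type):
    """Bracken re-estimation only makes sense for reports produced from reads."""
    return reports_type == "SampleData[Kraken2Reports % Reads]"
```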
