Question: Identity threshold #462

stas-malavin · 2024-11-26T22:32:55Z

Hello,
I seem to miss something in the algorithm, could you please specify how do you reject a read from being mapped without identity threshold? Or there is one implicitly?
Thank you!

marcelm · 2024-11-27T10:26:45Z

There’s no explicit threshold. The implicit requirement is that there is at least one seed in the query that can also be found in the reference. Our seeds are randstrobes that consist of two $k$-mers, where $k$ depends on read length but is typically 20. Note that not all pairs of 20-mers are indexed, so two specific 20-mers need to match between query and reference.

I don’t think the read mappers I’m aware of implement an explicit identity threshold. If you wanted to have such a threshold, you’d need to add this as a postprocessing step (for example, by inspecting the NM:i tag in the SAM output).

stas-malavin · 2024-11-27T10:57:15Z

Hi Marcel, thanks for the quick reply. Again, I'm sorry for a stupid newbie question. I usually use bbmap which has a minid parameter, i.e., 'Approximate minimum alignment identity to look for', according to the manual, with the default value of 0.76.

I imagine a situation, where a metagenome contains two closely related species. If a read of, say, 150 bp covers half the conservative region and half the variable region (e.g. of the SSU gene or some protein-coding gene with a conservative domain), could it be so that both 20-mers hit the conservative part, and you got an ambiguous alignment?

That's why we typically calculate actual alignment when k-mers match, right? And as we calculate, why not implementing an id threshold?

marcelm · 2024-11-27T14:06:45Z

You get more than just one randstrobe per query. For a 150 bp read, for example, you get about 30 on average (they overlap each other). As long as one of them hits the unique part of the reference, the read will be mapped correctly, even when you run strobealign in mapping-only mode (-x), in which it will not compute alignments.

A more difficult situation would arise when most of the read is in the conserved region and only a small part of it covers a region that is variable. Then it could be that all randstrobes are in in a conserved region, making the mapping location ambiguous. In mapping-only mode, strobealign picks one of the possible locations at random, but in alignment mode, it computes alignments for the candidate locations and uses that to determine which location is the more likely one.

If I understand correctly, the minid parameter in bbmap is a way to trade speed for sensitivity. The documentation says this:

minid=0.76              Approximate minimum alignment identity to look for. 
                        Higher is faster and less sensitive.

So this appears to be intrinsic to the way their algorithm works: If you want the program to find alignments with lower identity (i.e., increase sensitivity), you can reduce this parameter, but it will be slower.

Strobealign’s algorithm does not need to make this tradeoff, so it doesn’t offer this particular knob.

Of course something like this could be done in a separate step where strobealign would filter out alignments that are below a certain percentage identity, but I don’t see what the advantage would be: Assume strobealign finds a match at 75% identity and our threshold is 76% – why should we throw that match away? (The point is that bbmap would possibly nt have found that match in the first place because it was tuned to find matches with at least 76% identity.)

stas-malavin · 2024-11-27T14:25:23Z

OK, I think I got the idea. Just to clarify—for the alignments, strobealign has some score threshold, right?

marcelm · 2024-11-27T14:57:25Z

No, there’s no threshold either, but there is soft clipping: If extending the alignment fully towards the 5' or 3' end would lead to a bad alignment score, the alignment is truncated. Together with the first requirement of at least two 20-mers matching means that I typically don’t see really bad alignments. There are sometimes quite short alignments (with many soft-clipped bases), but you could get rid of these by filtering by alignment score (for example with sth. like samtools view -e '[AS]>=100').

If during your testing of strobealign you find alignments that make no sense for your application and that you think strobealign should not have reported, please let us know.

stas-malavin · 2024-11-28T11:51:13Z

If I use an --aemb flag, I don't have sam to filter. Do short alignments show up in the resulting tsv?

I actually want to map metagenomic reads to just one genome, not to the whole spectrum of diversity from which the reads were obtained. In this case, will I get all the reads mapped (sounds odd)?

Next, I'm working with eukaryotes for which I have not enough coverage to assemble MAGs. I download WGS project for a species whose 18S is 3% different from what I found in the metagenome, and assemble it. Now I map the metagenomic reads onto this assembly. How do I tell strobealign to map just the reads that are, say, not less than 90% similar to the reference?

marcelm · 2024-11-28T15:08:01Z

If I use an --aemb flag, I don't have sam to filter. Do short alignments show up in the resulting tsv?

In the .tsv that you get with --aemb, all matches are counted, even short ones. (I’m avoiding the term "alignment" because no actual base-level alignments are computed.)

I actually want to map metagenomic reads to just one genome, not to the whole spectrum of diversity from which the reads were obtained. In this case, will I get all the reads mapped (sounds odd)?

Ah, I understand now. Yes, I believe most reads would be mapped.

Next, I'm working with eukaryotes for which I have not enough coverage to assemble MAGs. I download WGS project for a species whose 18S is 3% different from what I found in the metagenome, and assemble it. Now I map the metagenomic reads onto this assembly. How do I tell strobealign to map just the reads that are, say, not less than 90% similar to the reference?

Because no alignments are computed when you output abundances (--aemb) or PAF (-x), similarity is also not computed, so there’s no way to filter by similarity in these modes. You would need to use SAM output, then filter the way I suggested earlier, and then convert to PAF or an abundances .tsv.

That said, while the actual alignment is not available, strobealign knows the length of the match that it finds. When using -x, this is output in column 11 in the output file. When a query matches the reference only partially, this length is much less than the read length. So maybe one way to get something similar to what you want is to filter by this length. Currently, you would have to use PAF output and then, again, do this filtering manually, but if it is indeed helpful, this could be added to strobealign as an option for --aemb. I am unsure whether this is maybe too specialized, @ksahlin would need to decide.

stas-malavin · 2024-12-01T08:36:51Z

Hi Marcel, thank you very much for the explanations and a helpful discussion!
I'll try to update this thread with the results I get using the approach you suggested.

ksahlin · 2024-12-10T10:41:00Z

I am unsure whether this is maybe too specialized, @ksahlin would need to decide.

If it's a matter of an if-statement when printing mappings to PAF I would, in principle, be okay with adding such a parameter.

However, I doubt it will be very accurate though. If a read has say 0.9 in accuracy under say, uniform substitutions, it may only have one or two syncmer (partial MCS) matches - and maybe 0 randstrobe matches. It would not yield an accurate sequence identity estimate. We probably would need extension alignment here. Minimap2 has a flag (-c I think) which does extension to get the cigar flags added to the PAF format.

stas-malavin · 2024-12-10T11:41:11Z

Minimap2 has a flag (-c I think) which does extension to get the cigar flags added to the PAF format.

So, I can just calculate identity from cigar flags and then filter? That simple?

ksahlin · 2024-12-10T12:47:39Z

Yes, you can use the cigar string or the NM flag (which is the number of mismatches) in the SAM file.

As Marcel wrote above:

I don’t think the read mappers I’m aware of implement an explicit identity threshold. If you wanted to have such a threshold, you’d need to add this as a postprocessing step (for example, by inspecting the NM:i tag in the SAM output).

It works well as long as you are aware of eventual softclipping of reads (they are not included in the CIGAR string or NM flag) and difine your desired behaviour for softclipps accordingly.

stas-malavin · 2024-12-12T11:23:36Z

Great, thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question: Identity threshold #462

Question: Identity threshold #462

stas-malavin commented Nov 26, 2024

marcelm commented Nov 27, 2024

stas-malavin commented Nov 27, 2024

marcelm commented Nov 27, 2024

stas-malavin commented Nov 27, 2024

marcelm commented Nov 27, 2024 •

edited

Loading

stas-malavin commented Nov 28, 2024

marcelm commented Nov 28, 2024

stas-malavin commented Dec 1, 2024

ksahlin commented Dec 10, 2024 •

edited

Loading

stas-malavin commented Dec 10, 2024

ksahlin commented Dec 10, 2024 •

edited

Loading

stas-malavin commented Dec 12, 2024

Question: Identity threshold #462

Question: Identity threshold #462

Comments

stas-malavin commented Nov 26, 2024

marcelm commented Nov 27, 2024

stas-malavin commented Nov 27, 2024

marcelm commented Nov 27, 2024

stas-malavin commented Nov 27, 2024

marcelm commented Nov 27, 2024 • edited Loading

stas-malavin commented Nov 28, 2024

marcelm commented Nov 28, 2024

stas-malavin commented Dec 1, 2024

ksahlin commented Dec 10, 2024 • edited Loading

stas-malavin commented Dec 10, 2024

ksahlin commented Dec 10, 2024 • edited Loading

stas-malavin commented Dec 12, 2024

marcelm commented Nov 27, 2024 •

edited

Loading

ksahlin commented Dec 10, 2024 •

edited

Loading

ksahlin commented Dec 10, 2024 •

edited

Loading