Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: Identity threshold #462

Open
stas-malavin opened this issue Nov 26, 2024 · 12 comments
Open

Question: Identity threshold #462

stas-malavin opened this issue Nov 26, 2024 · 12 comments

Comments

@stas-malavin
Copy link

Hello,
I seem to miss something in the algorithm, could you please specify how do you reject a read from being mapped without identity threshold? Or there is one implicitly?
Thank you!

@marcelm
Copy link
Collaborator

marcelm commented Nov 27, 2024

There’s no explicit threshold. The implicit requirement is that there is at least one seed in the query that can also be found in the reference. Our seeds are randstrobes that consist of two $k$-mers, where $k$ depends on read length but is typically 20. Note that not all pairs of 20-mers are indexed, so two specific 20-mers need to match between query and reference.

I don’t think the read mappers I’m aware of implement an explicit identity threshold. If you wanted to have such a threshold, you’d need to add this as a postprocessing step (for example, by inspecting the NM:i tag in the SAM output).

@stas-malavin
Copy link
Author

Hi Marcel, thanks for the quick reply. Again, I'm sorry for a stupid newbie question. I usually use bbmap which has a minid parameter, i.e., 'Approximate minimum alignment identity to look for', according to the manual, with the default value of 0.76.

I imagine a situation, where a metagenome contains two closely related species. If a read of, say, 150 bp covers half the conservative region and half the variable region (e.g. of the SSU gene or some protein-coding gene with a conservative domain), could it be so that both 20-mers hit the conservative part, and you got an ambiguous alignment?

That's why we typically calculate actual alignment when k-mers match, right? And as we calculate, why not implementing an id threshold?

@marcelm
Copy link
Collaborator

marcelm commented Nov 27, 2024

You get more than just one randstrobe per query. For a 150 bp read, for example, you get about 30 on average (they overlap each other). As long as one of them hits the unique part of the reference, the read will be mapped correctly, even when you run strobealign in mapping-only mode (-x), in which it will not compute alignments.

A more difficult situation would arise when most of the read is in the conserved region and only a small part of it covers a region that is variable. Then it could be that all randstrobes are in in a conserved region, making the mapping location ambiguous. In mapping-only mode, strobealign picks one of the possible locations at random, but in alignment mode, it computes alignments for the candidate locations and uses that to determine which location is the more likely one.

If I understand correctly, the minid parameter in bbmap is a way to trade speed for sensitivity. The documentation says this:

minid=0.76              Approximate minimum alignment identity to look for. 
                        Higher is faster and less sensitive.

So this appears to be intrinsic to the way their algorithm works: If you want the program to find alignments with lower identity (i.e., increase sensitivity), you can reduce this parameter, but it will be slower.

Strobealign’s algorithm does not need to make this tradeoff, so it doesn’t offer this particular knob.

Of course something like this could be done in a separate step where strobealign would filter out alignments that are below a certain percentage identity, but I don’t see what the advantage would be: Assume strobealign finds a match at 75% identity and our threshold is 76% – why should we throw that match away? (The point is that bbmap would possibly nt have found that match in the first place because it was tuned to find matches with at least 76% identity.)

@stas-malavin
Copy link
Author

OK, I think I got the idea. Just to clarify—for the alignments, strobealign has some score threshold, right?

@marcelm
Copy link
Collaborator

marcelm commented Nov 27, 2024

No, there’s no threshold either, but there is soft clipping: If extending the alignment fully towards the 5' or 3' end would lead to a bad alignment score, the alignment is truncated. Together with the first requirement of at least two 20-mers matching means that I typically don’t see really bad alignments. There are sometimes quite short alignments (with many soft-clipped bases), but you could get rid of these by filtering by alignment score (for example with sth. like samtools view -e '[AS]>=100').

If during your testing of strobealign you find alignments that make no sense for your application and that you think strobealign should not have reported, please let us know.

@stas-malavin
Copy link
Author

If I use an --aemb flag, I don't have sam to filter. Do short alignments show up in the resulting tsv?

I actually want to map metagenomic reads to just one genome, not to the whole spectrum of diversity from which the reads were obtained. In this case, will I get all the reads mapped (sounds odd)?

Next, I'm working with eukaryotes for which I have not enough coverage to assemble MAGs. I download WGS project for a species whose 18S is 3% different from what I found in the metagenome, and assemble it. Now I map the metagenomic reads onto this assembly. How do I tell strobealign to map just the reads that are, say, not less than 90% similar to the reference?

@marcelm
Copy link
Collaborator

marcelm commented Nov 28, 2024

If I use an --aemb flag, I don't have sam to filter. Do short alignments show up in the resulting tsv?

In the .tsv that you get with --aemb, all matches are counted, even short ones. (I’m avoiding the term "alignment" because no actual base-level alignments are computed.)

I actually want to map metagenomic reads to just one genome, not to the whole spectrum of diversity from which the reads were obtained. In this case, will I get all the reads mapped (sounds odd)?

Ah, I understand now. Yes, I believe most reads would be mapped.

Next, I'm working with eukaryotes for which I have not enough coverage to assemble MAGs. I download WGS project for a species whose 18S is 3% different from what I found in the metagenome, and assemble it. Now I map the metagenomic reads onto this assembly. How do I tell strobealign to map just the reads that are, say, not less than 90% similar to the reference?

Because no alignments are computed when you output abundances (--aemb) or PAF (-x), similarity is also not computed, so there’s no way to filter by similarity in these modes. You would need to use SAM output, then filter the way I suggested earlier, and then convert to PAF or an abundances .tsv.

That said, while the actual alignment is not available, strobealign knows the length of the match that it finds. When using -x, this is output in column 11 in the output file. When a query matches the reference only partially, this length is much less than the read length. So maybe one way to get something similar to what you want is to filter by this length. Currently, you would have to use PAF output and then, again, do this filtering manually, but if it is indeed helpful, this could be added to strobealign as an option for --aemb. I am unsure whether this is maybe too specialized, @ksahlin would need to decide.

@stas-malavin
Copy link
Author

Hi Marcel, thank you very much for the explanations and a helpful discussion!
I'll try to update this thread with the results I get using the approach you suggested.

@ksahlin
Copy link
Owner

ksahlin commented Dec 10, 2024

I am unsure whether this is maybe too specialized, @ksahlin would need to decide.

If it's a matter of an if-statement when printing mappings to PAF I would, in principle, be okay with adding such a parameter.

However, I doubt it will be very accurate though. If a read has say 0.9 in accuracy under say, uniform substitutions, it may only have one or two syncmer (partial MCS) matches - and maybe 0 randstrobe matches. It would not yield an accurate sequence identity estimate. We probably would need extension alignment here. Minimap2 has a flag (-c I think) which does extension to get the cigar flags added to the PAF format.

@stas-malavin
Copy link
Author

Minimap2 has a flag (-c I think) which does extension to get the cigar flags added to the PAF format.

So, I can just calculate identity from cigar flags and then filter? That simple?

@ksahlin
Copy link
Owner

ksahlin commented Dec 10, 2024

Yes, you can use the cigar string or the NM flag (which is the number of mismatches) in the SAM file.

As Marcel wrote above:

I don’t think the read mappers I’m aware of implement an explicit identity threshold. If you wanted to have such a threshold, you’d need to add this as a postprocessing step (for example, by inspecting the NM:i tag in the SAM output).

It works well as long as you are aware of eventual softclipping of reads (they are not included in the CIGAR string or NM flag) and difine your desired behaviour for softclipps accordingly.

@stas-malavin
Copy link
Author

Great, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants