-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Question: Identity threshold #462
Comments
There’s no explicit threshold. The implicit requirement is that there is at least one seed in the query that can also be found in the reference. Our seeds are randstrobes that consist of two I don’t think the read mappers I’m aware of implement an explicit identity threshold. If you wanted to have such a threshold, you’d need to add this as a postprocessing step (for example, by inspecting the |
Hi Marcel, thanks for the quick reply. Again, I'm sorry for a stupid newbie question. I usually use I imagine a situation, where a metagenome contains two closely related species. If a read of, say, 150 bp covers half the conservative region and half the variable region (e.g. of the SSU gene or some protein-coding gene with a conservative domain), could it be so that both 20-mers hit the conservative part, and you got an ambiguous alignment? That's why we typically calculate actual alignment when k-mers match, right? And as we calculate, why not implementing an id threshold? |
You get more than just one randstrobe per query. For a 150 bp read, for example, you get about 30 on average (they overlap each other). As long as one of them hits the unique part of the reference, the read will be mapped correctly, even when you run strobealign in mapping-only mode ( A more difficult situation would arise when most of the read is in the conserved region and only a small part of it covers a region that is variable. Then it could be that all randstrobes are in in a conserved region, making the mapping location ambiguous. In mapping-only mode, strobealign picks one of the possible locations at random, but in alignment mode, it computes alignments for the candidate locations and uses that to determine which location is the more likely one. If I understand correctly, the
So this appears to be intrinsic to the way their algorithm works: If you want the program to find alignments with lower identity (i.e., increase sensitivity), you can reduce this parameter, but it will be slower. Strobealign’s algorithm does not need to make this tradeoff, so it doesn’t offer this particular knob. Of course something like this could be done in a separate step where strobealign would filter out alignments that are below a certain percentage identity, but I don’t see what the advantage would be: Assume strobealign finds a match at 75% identity and our threshold is 76% – why should we throw that match away? (The point is that |
OK, I think I got the idea. Just to clarify—for the alignments, strobealign has some score threshold, right? |
No, there’s no threshold either, but there is soft clipping: If extending the alignment fully towards the 5' or 3' end would lead to a bad alignment score, the alignment is truncated. Together with the first requirement of at least two 20-mers matching means that I typically don’t see really bad alignments. There are sometimes quite short alignments (with many soft-clipped bases), but you could get rid of these by filtering by alignment score (for example with sth. like If during your testing of strobealign you find alignments that make no sense for your application and that you think strobealign should not have reported, please let us know. |
If I use an I actually want to map metagenomic reads to just one genome, not to the whole spectrum of diversity from which the reads were obtained. In this case, will I get all the reads mapped (sounds odd)? Next, I'm working with eukaryotes for which I have not enough coverage to assemble MAGs. I download WGS project for a species whose 18S is 3% different from what I found in the metagenome, and assemble it. Now I map the metagenomic reads onto this assembly. How do I tell strobealign to map just the reads that are, say, not less than 90% similar to the reference? |
In the
Ah, I understand now. Yes, I believe most reads would be mapped.
Because no alignments are computed when you output abundances ( That said, while the actual alignment is not available, strobealign knows the length of the match that it finds. When using |
Hi Marcel, thank you very much for the explanations and a helpful discussion! |
If it's a matter of an if-statement when printing mappings to PAF I would, in principle, be okay with adding such a parameter. However, I doubt it will be very accurate though. If a read has say 0.9 in accuracy under say, uniform substitutions, it may only have one or two syncmer (partial MCS) matches - and maybe 0 randstrobe matches. It would not yield an accurate sequence identity estimate. We probably would need extension alignment here. Minimap2 has a flag (-c I think) which does extension to get the cigar flags added to the PAF format. |
So, I can just calculate identity from cigar flags and then filter? That simple? |
Yes, you can use the cigar string or the NM flag (which is the number of mismatches) in the SAM file. As Marcel wrote above:
It works well as long as you are aware of eventual softclipping of reads (they are not included in the CIGAR string or NM flag) and difine your desired behaviour for softclipps accordingly. |
Great, thanks! |
Hello,
I seem to miss something in the algorithm, could you please specify how do you reject a read from being mapped without identity threshold? Or there is one implicitly?
Thank you!
The text was updated successfully, but these errors were encountered: