Alignment with ksw2/WFA2 - Take 2 #251
Great that you found out the cause!
I investigated the unexpected ksw2 score computation. As I could have guessed, "it’s not a bug, it’s a feature": I noticed the weird scores have to do with the handling of N, which gets a score of -1. This behavior is different from SSW, but it makes sense. The only slight problem is that -1 is less bad in our scoring scheme (with +4 for match, -8 for mismatch) than in the BWA-MEM scoring scheme (+1 for match, -4 for mismatch), but I don’t see how this would have any more than a very minor influence on mapping results.
I see. I guess it makes sense to have a lower penalty on N compared to actual mismatches, as there is higher uncertainty for N.
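To make the comparison of the two scoring schemes concrete, here is a small sketch. It assumes that any column in which either base is N scores -1 under both schemes; the function names are made up for illustration and are not part of ksw2, SSW, or this project.

```python
def ungapped_score(query, ref, match, mismatch, n_score=-1):
    """Score an ungapped alignment column by column.

    Assumption: a column where either base is N scores n_score,
    independent of the match/mismatch values.
    """
    total = 0
    for q, r in zip(query, ref):
        if q == "N" or r == "N":
            total += n_score
        elif q == r:
            total += match
        else:
            total -= mismatch
    return total


def n_to_mismatch_ratio(mismatch, n_score=-1):
    """Size of the N penalty as a fraction of the mismatch penalty."""
    return abs(n_score) / mismatch


# (match, mismatch) pairs as quoted in the comment above
OURS = (4, 8)
BWA_MEM = (1, 4)

print(ungapped_score("ACGN", "ACGT", *OURS))     # three matches and one N
print(ungapped_score("ACGN", "ACGT", *BWA_MEM))
print(n_to_mismatch_ratio(OURS[1]), n_to_mismatch_ratio(BWA_MEM[1]))
```

Under this assumption an N costs one eighth of a mismatch with +4/-8 but one quarter of a mismatch with +1/-4, which is the sense in which -1 is "less bad" here.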
Here are some accuracy measurements.
Paired-end
Average difference: -0.0094225
Single-end
Average difference: -0.00712
Runtime
I have only looked at the runtime for the above accuracy measurements (so sample size per dataset is 1). Roughly, read lengths 200 and 300 bp on the repetitive genomes benefit the most: rye/2x300 and maize/2x300 reduce mapping runtime by 17% and 19%, respectively. Mapping time on 100 bp reads seems to be quite unaffected. The decrease in accuracy is unfortunate, but maybe not too bad. At the same time, the speedup isn’t that super great either. I’m not sure what to think of this.
In these benchmarks, is the 'before' a partial (piecewise) alignment approach, or is it the old 'semi-global' one with SSW that I implemented? I strongly prefer a partial alignment approach so that we could eventually integrate split (supplementary) alignments as well as long-read alignment. What is your opinion on this? While I am perfectly happy paying this very small decrease in mapping accuracy, I care more about what happens with alignments around SNPs and indels. Could you point me to two commits that you want benchmarked against each other for this? I would be happy to set off an evaluation (it will just have to wait until the end of next week).
- rescue_mate still uses SSW
- WFA2 is used to align the NAM itself
- ksw2 is used (twice) to extend the NAM towards the read ends
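The piecewise scheme in the list above yields three partial alignments (left extension, NAM, right extension) that have to be joined into one CIGAR. A minimal sketch of that joining step follows; `merge_cigars`, `cigar_string`, and the example values are hypothetical and not taken from the PR, and the CIGARs are assumed to be lists of (length, operation) pairs.

```python
def merge_cigars(*pieces):
    """Concatenate CIGAR pieces, fusing equal operations that meet at a seam.

    Each piece is a list of (length, op) tuples (assumed representation).
    """
    merged = []
    for piece in pieces:
        for length, op in piece:
            if length == 0:
                continue  # skip empty runs
            if merged and merged[-1][1] == op:
                # Same operation on both sides of the seam: extend the run
                merged[-1] = (merged[-1][0] + length, op)
            else:
                merged.append((length, op))
    return merged


def cigar_string(cigar):
    return "".join(f"{length}{op}" for length, op in cigar)


# Made-up pieces for illustration only
left_extension = [(10, "M")]
nam_alignment = [(50, "M")]
right_extension = [(5, "M"), (1, "D"), (12, "M")]

print(cigar_string(merge_cigars(left_extension, nam_alignment, right_extension)))
```

Fusing runs at the seams avoids emitting adjacent operations of the same type, such as 10M50M, where the pieces meet.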
"before" is essentially v0.9.0, so it’s SSW.
I guess it would be a good first step towards split alignments. I forgot to mention that the decrease in accuracy is smaller than the improvement we got from introducing the soft-clipping end bonus. So accuracy would still be higher than in 0.8.0.
I just rebased this PR on top of main and force-pushed it. The commits to compare against each other would be v0.9.0 (0771d1c) and 8311650.
This supersedes #242 and is a much smaller PR because the refactoring has already been done in #249. As before:
Compared to #242, this ensures that indels are left-aligned even in the left extension. To do the left extension, we reverse query and reference, do a right extension (this is how ksw2 is intended to be used) and then reverse the resulting CIGAR. However, when doing so, we need to set a flag to tell ksw2 to right-align indels so that they are left-aligned after reversing the CIGAR. The version in #242 did not do this.
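The effect of the reversal on gap placement can be reproduced with a toy aligner. The sketch below is not ksw2: it is a plain global alignment with a linear gap penalty (the gap value of 12 is an assumption), and `right_align_gaps` stands in for the ksw2 flag mentioned above (ksw2's header documents a `KSW_EZ_RIGHT` flag for right-aligning gaps, which is presumably the one meant). All function names are hypothetical.

```python
from itertools import groupby


def align_global(query, ref, right_align_gaps=False, match=4, mismatch=8, gap=12):
    """Toy global aligner; returns a CIGAR as a list of (length, op) tuples."""
    n, m = len(query), len(ref)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = -gap * i
    for j in range(1, m + 1):
        H[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if query[i - 1] == ref[j - 1] else -mismatch
            H[i][j] = max(H[i - 1][j - 1] + sub, H[i - 1][j] - gap, H[i][j - 1] - gap)

    # Traceback from the end. Taking gap moves first on ties pushes gaps
    # as far right as possible; taking the diagonal first pushes them left.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        diag_ok = i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + (
            match if query[i - 1] == ref[j - 1] else -mismatch)
        del_ok = j > 0 and H[i][j] == H[i][j - 1] - gap  # consumes ref only
        ins_ok = i > 0 and H[i][j] == H[i - 1][j] - gap  # consumes query only
        if diag_ok and not (right_align_gaps and (del_ok or ins_ok)):
            ops.append("M"); i -= 1; j -= 1
        elif del_ok:
            ops.append("D"); j -= 1
        else:
            ops.append("I"); i -= 1
    return [(len(list(run)), op) for op, run in groupby(reversed(ops))]


def left_extend(query, ref, right_align_gaps=True):
    """Align reversed sequences, then reverse the CIGAR back."""
    cigar = align_global(query[::-1], ref[::-1], right_align_gaps=right_align_gaps)
    return cigar[::-1]


# One T is deleted from a TTT run in the reference
query, ref = "ACGTTAC", "ACGTTTAC"
print(left_extend(query, ref))                          # gap ends up leftmost
print(left_extend(query, ref, right_align_gaps=False))  # gap ends up rightmost
```

With the flag set, the deletion lands at the left end of the T run in the original orientation; without it, the default left-alignment in reversed coordinates turns into right-alignment after the CIGAR is reversed, which matches the behavior described for #242.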
This should help with getting rid of the (likely incorrect) increased indel precision/recall in #245.
To Do
- The fixops and count_edits functions should possibly be moved somewhere else.