-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Question about SyncmerIterator #463
Comments
Hi and thanks for doing this analysis. I’ve so far only had time to look into the first issue, and I think it’s a bug that the very first k-mer is treated differently. What I’m wondering about is whether the same applies here: strobealign/src/randstrobes.cpp Line 113 in d7e73cc
This should also only find the first minimum. Please give me some time for looking into your other question/observation. |
This makes the two for loops that find the minimum hash identical. Thanks to @drtconway for noticing the discrepancy! See #463
I think you're right, that perhaps the test on the line you cite should be To a certain extent, it doesn't necessarily matter exactly which precise definition you use for syncmers, so long as the implementation expresses that definition on all code paths. The other issue is probably more important - if I give the sequence to be indexed on the reverse complement strand, will I get the same sequence of syncmers (in reverse order)? Should that be the case? (I think it should!) What I haven't thought through carefully is how the strandedness issue intersects with the formation of strobemers. I note that neither the Syncmer nor Strobemer papers mention strandedness. |
Yes, and syncmers just need to be generated in the same way for the reference and the query. Even if the comparisons are a bit inconsistent, it doesn’t matter so much as long as it’s inconsistent in the same way.
I looked into this now as well and as a first exercise, wrote a test to ensure that syncmers are canonical (see #465/#464). You’re correct that the order of s-mers is reversed in the vector, but this does not matter because the s-mer that determines whether the syncmer is chosen or not is the one in the middle. The strobealign/src/indexparameters.hpp Line 32 in d7e73cc
So the relevant s-mer is in the same position both in forward and reverse direction. However, there’s a slight inconsistency once again due to the way how ties are handled: Because the syncmer was chosen if the leftmost minimum was at that middle position, this is different in the reverse complemented version of the k-mer because the leftmost minimum would then be the rightmost one. The changes I made in #464 happen to fix that issue as well.
Yes, I think this also needs a bit more research. I’d like to switch to non-canonical syncmers and randstrobes because my impression is that it would both simplify the code in many places, speed up mapping, and make it easier to think about what happens, but this loses some accuracy, see the discussion in #405. |
Hi, As mentioned elsewhere I have been 'off' github because an old university email address was discontinued so I haven’t gotten any notifications from GitHub for about three weeks. I now get the context behind the commit to sample all syncmers in homopolymer and other STR regions: d7c5cbe#comments Given this discussion I feel more inclined to accept sampling such regions more frequently if it means that syncmers are identical in forward and reverse directions. Initially I was sceptical because some seeds may become very repetitive, causing lower accuracy and more runtime. But I am inclined to see a working implementation of canonical strobemers, which makes it justified. |
This makes the two for loops that find the minimum hash identical. Thanks to @drtconway for noticing the discrepancy! See #463
Hi,
I'm studying your implementation of
SyncmerIterator
, and I'm a bit confused. It's possible there's a bug in the code, but I think it's much more likely that I just don't understand the intended semantics correctly.The code runs accumulators for for the current k-mer and the current s-mer, keeping both the forward and reverse complement *-mers, which makes sense.
It then makes a canonical choice of which s-mer to hash, and adds that to the queue.
What follows is a set of cases that evaluate or re-evaluate the position of the minimising s-mer.
The first question I have is whether it matters that the branch at
strobealign/src/randstrobes.cpp
Line 92 in d7e73cc
strobealign/src/randstrobes.cpp
Line 107 in d7e73cc
The more important question pertains to the canonical choice of k-mers. I note that the code at
strobealign/src/randstrobes.cpp
Line 119 in d7e73cc
However, if I present the code with the reverse complement of the original sequence, the same set of s-mers will be derived, but in the opposite order, so that the position of the minimiser will be at the opposite end of the queue, and so the test will not pass.
My understanding of the intended semantics is that the set of syncmers would be the same regardless of which strand the original sequence is on, and certainly by reporting the hash of the min k-mer, the code creates that impression.
I'd love to understand where my reasoning has gone off the rails!
Thanks,
Tom.
The text was updated successfully, but these errors were encountered: