-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Produce a syncmer if any minimum is at the required position #464
Conversation
I added a commit with a test that ensures syncmers are canonical. The test actually fails on |
Given discussion in #463, I now fully accept this commit.
Great, seemed important to get this handled. |
Instead of iterating in reverse to find the rightmost minimum, we can iterate forward and use <= for the comparison. This makes the loop more similar to the "last s-mer within first k-mer" case a couple of lines earlier.
This makes the two for loops that find the minimum hash identical. Thanks to @drtconway for noticing the discrepancy! See #463
This changes the way in which ties are handled during syncmer computation, that is, what happens when there are multiple s-mers in the k-mer that have the same minimum hash value (most likely because the s-mers are identical). Previously, a syncmer would be returned only if the s-mer at position t happened to be the rightmost minimum. Now it will be returned if any of the s-mers with minimum hash is at positon t. This increases the number of produced syncmers.
We do not need to track the position of the minimum, only its value
bb384f8
to
97fb900
Compare
I measured whether avoiding repetitive syncmers would be beneficial, and it turns out it has very little effect in terms of runtime or accuracy. When in doubt, I think keeping it simple is the best option, so I have removed the respective commits from this PR. The relevant commit in this PR is therefore 92c6105. It leads to the following changes in accuracy. Comparing accuracy (paf)
Average difference se: +0.0016 |
ok, agreed. Approved. |
Because of @drtconway’s questions in #463, I also looked a bit closer at the SyncmerIterator implementation, and my impression is also that the handling of ties is possibly inconsistent. In Edgar’s syncmer paper, it says that "ties may be handled in different ways". Ties occur when there are multiple s-mers in the k-mer that have the same minimum hash value. One of the loops in the current code picks the rightmost minimum, another the leftmost one, so I’m not entirely sure what the intention is. I think the current implementation outputs a syncmer if the hash of the s-mer at position$t$ is the leftmost minimum (or maybe rightmost in some circumstances).
I tried to simply make the three places consistent where the minimum hash value and its position are updated, but this leads to slightly fewer syncmers being generated (I haven’t fully worked out why).
Instead, my suggestion is to change it such that a syncmer is generated if the hash of the s-mer at position$t$ is minimal. The advantage is that we no longer need to keep track of where the minimum occurred and can therefore remove
qs_min_pos
, slightly simplifying the code.Slightly more syncmers are generated, but there’s essentially zero effect on accuracy and runtime.