Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Produce a syncmer if any minimum is at the required position #464

Merged
merged 8 commits into from
Dec 20, 2024
Merged

Conversation

marcelm
Copy link
Collaborator

@marcelm marcelm commented Dec 3, 2024

Because of @drtconway’s questions in #463, I also looked a bit closer at the SyncmerIterator implementation, and my impression is also that the handling of ties is possibly inconsistent. In Edgar’s syncmer paper, it says that "ties may be handled in different ways". Ties occur when there are multiple s-mers in the k-mer that have the same minimum hash value. One of the loops in the current code picks the rightmost minimum, another the leftmost one, so I’m not entirely sure what the intention is. I think the current implementation outputs a syncmer if the hash of the s-mer at position $t$ is the leftmost minimum (or maybe rightmost in some circumstances).

I tried to simply make the three places consistent where the minimum hash value and its position are updated, but this leads to slightly fewer syncmers being generated (I haven’t fully worked out why).

Instead, my suggestion is to change it such that a syncmer is generated if the hash of the s-mer at position $t$ is minimal. The advantage is that we no longer need to keep track of where the minimum occurred and can therefore remove qs_min_pos, slightly simplifying the code.

Slightly more syncmers are generated, but there’s essentially zero effect on accuracy and runtime.

@marcelm
Copy link
Collaborator Author

marcelm commented Dec 4, 2024

I added a commit with a test that ensures syncmers are canonical. The test actually fails on main (for the case that the syncmers are generated over a long string consisting only of A).

@ksahlin
Copy link
Owner

ksahlin commented Dec 10, 2024

Given discussion in #463, I now fully accept this commit.

However, there’s a slight inconsistency once again due to the way how ties are handled: Because the syncmer was chosen if the leftmost minimum was at that middle position, this is different in the reverse complemented version of the k-mer because the leftmost minimum would then be the rightmost one. The changes I made in #464 happen to fix that issue as well.

Great, seemed important to get this handled.

Instead of iterating in reverse to find the rightmost minimum, we can
iterate forward and use <= for the comparison.

This makes the loop more similar to the "last s-mer within first k-mer" case
a couple of lines earlier.
This makes the two for loops that find the minimum hash identical.

Thanks to @drtconway for noticing the discrepancy!

See #463
This changes the way in which ties are handled during syncmer computation,
that is, what happens when there are multiple s-mers in the k-mer that have
the same minimum hash value (most likely because the s-mers are identical).

Previously, a syncmer would be returned only if the s-mer at position t
happened to be the rightmost minimum. Now it will be returned if any of the
s-mers with minimum hash is at positon t.

This increases the number of produced syncmers.
We do not need to track the position of the minimum, only its value
@marcelm marcelm force-pushed the syncmers branch 2 times, most recently from bb384f8 to 97fb900 Compare December 20, 2024 09:18
@marcelm
Copy link
Collaborator Author

marcelm commented Dec 20, 2024

I measured whether avoiding repetitive syncmers would be beneficial, and it turns out it has very little effect in terms of runtime or accuracy. When in doubt, I think keeping it simple is the best option, so I have removed the respective commits from this PR.

The relevant commit in this PR is therefore 92c6105. It leads to the following changes in accuracy.

Comparing accuracy (paf)

library 5a7de44 92c6105 difference
sim5-drosophila-50-se 76.9207 76.9434 +0.0227
sim5-drosophila-75-se 83.4332 83.4574 +0.0242
sim5-drosophila-100-se 84.9742 84.9698 -0.0044
sim5-drosophila-150-se 87.5069 87.5199 +0.0130
sim5-drosophila-200-se 89.2447 89.2588 +0.0141
sim5-drosophila-300-se 91.2737 91.2808 +0.0071
sim5-maize-50-se 39.0947 39.1029 +0.0082
sim5-maize-75-se 51.1997 51.1972 -0.0025
sim5-maize-100-se 58.7611 58.7602 -0.0009
sim5-maize-150-se 70.7927 70.7971 +0.0044
sim5-maize-200-se 77.6704 77.6729 +0.0025
sim5-maize-300-se 84.6004 84.6080 +0.0076
sim5-CHM13-50-se 74.2728 74.2958 +0.0230
sim5-CHM13-75-se 82.7277 82.7271 -0.0006
sim5-CHM13-100-se 85.3533 85.3325 -0.0208
sim5-CHM13-150-se 88.6346 88.6368 +0.0022
sim5-CHM13-200-se 90.4095 90.4160 +0.0065
sim5-CHM13-300-se 92.1489 92.1565 +0.0076
sim5-rye-50-se 36.5498 36.5431 -0.0067
sim5-rye-75-se 49.1617 49.1549 -0.0068
sim5-rye-100-se 57.3582 57.3552 -0.0030
sim5-rye-150-se 70.0147 69.9844 -0.0303
sim5-rye-200-se 77.0137 76.9929 -0.0208
sim5-rye-300-se 83.8602 83.8594 -0.0008
sim5-ecoli50-50-se 10.7429 10.7365 -0.0064
sim5-ecoli50-75-se 13.6476 13.6556 +0.0080
sim5-ecoli50-100-se 15.8012 15.8174 +0.0162
sim5-ecoli50-150-se 20.1191 20.0893 -0.0298
sim5-ecoli50-200-se 23.3852 23.3933 +0.0081
sim5-ecoli50-300-se 28.3011 28.3073 +0.0062

Average difference se: +0.0016

@ksahlin
Copy link
Owner

ksahlin commented Dec 20, 2024

ok, agreed. Approved.

@marcelm marcelm merged commit bff84e9 into main Dec 20, 2024
11 checks passed
@marcelm marcelm deleted the syncmers branch December 20, 2024 09:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants