Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always pick first syncmer as main #472

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Always pick first syncmer as main #472

wants to merge 10 commits into from

Conversation

marcelm
Copy link
Collaborator

@marcelm marcelm commented Dec 17, 2024

As discussed: Always pick the first syncmer as main.

As suggested in an e-mail by @ksahlin, this changes the way the base syncmer is chosen when multi-context seeds are enabled. The rule is: Pick the syncmer with odd hash value. If there’s a tie, use the one with the smaller hash value (as before).

I also tested two other rules, but these would make randstrobes non-canonical:

  • XOR the two hashes together and pick the first if the resulting value is odd
  • Always pick the first syncmer

@marcelm
Copy link
Collaborator Author

marcelm commented Dec 17, 2024

Edit: Outdated, refers to results for the "pick odd hash" rule

Results for accuracy (single-end mapping-only mode) on SIM5, using --mcs

library v0.15.0 odd as base difference
sim5-drosophila-50-se 81.4027 81.4438 +0.0411
sim5-drosophila-75-se 85.0652 85.0851 +0.0199
sim5-drosophila-100-se 86.7170 86.7240 +0.0070
sim5-drosophila-150-se 89.0129 89.0167 +0.0038
sim5-drosophila-200-se 90.1490 90.1298 -0.0192
sim5-drosophila-300-se 91.3855 91.3698 -0.0157
sim5-drosophila-500-se 92.1896 92.2030 +0.0134
sim5-maize-50-se 40.3932 40.4007 +0.0075
sim5-maize-75-se 51.8179 51.8444 +0.0265
sim5-maize-100-se 59.4661 59.4536 -0.0125
sim5-maize-150-se 71.7984 71.7702 -0.0282
sim5-maize-200-se 78.5476 78.5126 -0.0350
sim5-maize-300-se 84.9285 84.9088 -0.0197
sim5-maize-500-se 89.2947 89.2780 -0.0167
sim5-CHM13-50-se 77.1630 77.1342 -0.0288
sim5-CHM13-75-se 84.0977 84.1002 +0.0025
sim5-CHM13-100-se 86.8231 86.8427 +0.0196
sim5-CHM13-150-se 90.0521 90.0083 -0.0438
sim5-CHM13-200-se 91.2412 91.2513 +0.0101
sim5-CHM13-300-se 92.2348 92.2550 +0.0202
sim5-CHM13-500-se 92.7584 92.7547 -0.0037
sim5-rye-50-se 37.5398 37.5169 -0.0229
sim5-rye-75-se 49.6446 49.6333 -0.0113
sim5-rye-100-se 57.8679 57.9150 +0.0471
sim5-rye-150-se 70.8919 70.8721 -0.0198
sim5-rye-200-se 77.7910 77.7393 -0.0517
sim5-rye-300-se 84.1465 84.1485 +0.0020
sim5-rye-500-se 88.6140 88.5623 -0.0517

Average difference se: -0.0057

The table for SIM3 looks very similar.

@marcelm
Copy link
Collaborator Author

marcelm commented Dec 17, 2024

Results for the XOR variant

(not canonical!)

  • SIM5, mapping-only, single-end, mcs
library v0.15.0 XOR difference
sim5-drosophila-50-se 81.4027 82.1746 +0.7719
sim5-drosophila-75-se 85.0652 85.3098 +0.2446
sim5-drosophila-100-se 86.7170 86.8685 +0.1515
sim5-drosophila-150-se 89.0129 89.0143 +0.0014
sim5-drosophila-200-se 90.1490 90.1004 -0.0486
sim5-drosophila-300-se 91.3855 91.3633 -0.0222
sim5-drosophila-500-se 92.1896 92.1590 -0.0306
sim5-maize-50-se 40.3932 40.5216 +0.1284
sim5-maize-75-se 51.8179 51.8814 +0.0635
sim5-maize-100-se 59.4661 59.5326 +0.0665
sim5-maize-150-se 71.7984 72.0019 +0.2035
sim5-maize-200-se 78.5476 78.8174 +0.2698
sim5-maize-300-se 84.9285 85.1639 +0.2354
sim5-maize-500-se 89.2947 89.4971 +0.2024
sim5-CHM13-50-se 77.1630 77.3339 +0.1709
sim5-CHM13-75-se 84.0977 84.2414 +0.1437
sim5-CHM13-100-se 86.8231 86.9453 +0.1222
sim5-CHM13-150-se 90.0521 90.0507 -0.0014
sim5-CHM13-200-se 91.2412 91.2356 -0.0056
sim5-CHM13-300-se 92.2348 92.1425 -0.0923
sim5-CHM13-500-se 92.7584 92.5621 -0.1963
sim5-rye-50-se 37.5398 37.5052 -0.0346
sim5-rye-75-se 49.6446 49.6990 +0.0544
sim5-rye-100-se 57.8679 57.9129 +0.0450
sim5-rye-150-se 70.8919 71.0859 +0.1940
sim5-rye-200-se 77.7910 78.0933 +0.3023
sim5-rye-300-se 84.1465 84.3859 +0.2394
sim5-rye-500-se 88.6140 88.7453 +0.1313

Average difference se: +0.1182

Results when always picking the first syncmer as base

(not canonical!)

  • SIM5, mapping-only, single-end, mcs
library v0.15.0 always first difference
sim5-drosophila-50-se 81.4027 82.8590 +1.4563
sim5-drosophila-75-se 85.0652 85.5390 +0.4738
sim5-drosophila-100-se 86.7170 86.9826 +0.2656
sim5-drosophila-150-se 89.0129 88.9839 -0.0290
sim5-drosophila-200-se 90.1490 90.0605 -0.0885
sim5-drosophila-300-se 91.3855 91.3050 -0.0805
sim5-drosophila-500-se 92.1896 92.1746 -0.0150
sim5-maize-50-se 40.3932 40.6672 +0.2740
sim5-maize-75-se 51.8179 52.0802 +0.2623
sim5-maize-100-se 59.4661 59.5956 +0.1295
sim5-maize-150-se 71.7984 72.0178 +0.2194
sim5-maize-200-se 78.5476 78.8700 +0.3224
sim5-maize-300-se 84.9285 85.1941 +0.2656
sim5-maize-500-se 89.2947 89.5820 +0.2873
sim5-CHM13-50-se 77.1630 78.1865 +1.0235
sim5-CHM13-75-se 84.0977 84.5674 +0.4697
sim5-CHM13-100-se 86.8231 87.1812 +0.3581
sim5-CHM13-150-se 90.0521 90.1135 +0.0614
sim5-CHM13-200-se 91.2412 91.2329 -0.0083

Average difference se: +0.2972

(For SIM3, it’s less pronounced with an average difference of +0.0171.)

@ksahlin
Copy link
Owner

ksahlin commented Dec 17, 2024

well.... "WTF lol" is the only response I can come up with at the moment.

@ksahlin
Copy link
Owner

ksahlin commented Dec 17, 2024

We should consider using "first syncmer as base" then I guess. But we should also try to understand this increase in performance.

Does it come with significant slowdown?

I start to understand why you want to move away from canonicity in randstrobes. But the downside is the lower accuracy in inversions. However, the modification you did here does not seem to affect accuracy, rather the opposite.

@ksahlin
Copy link
Owner

ksahlin commented Dec 17, 2024

I also tested two other rules, but these would make randstrobes non-canonical:
XOR the two hashes together and pick the first if the resulting value is odd

If so, how about XOR the two hashes together and pick the minimum hash if the resulting value is odd, and the maximum hash if the resulting value is even. This would keep it canonical-ish. However, maybe the increase in accuracy comes from precisely the non-canonicity? Seems difficult to beat your "first syncmer as base" approach.

@marcelm
Copy link
Collaborator Author

marcelm commented Dec 18, 2024

Does it come with significant slowdown?

No, but I only tested one of the fruit fly libraries.

I start to understand why you want to move away from canonicity in randstrobes. But the downside is the lower accuracy in inversions. However, the modification you did here does not seem to affect accuracy, rather the opposite.

If so, how about XOR the two hashes together and pick the minimum hash if the resulting value is odd, and the maximum hash if the resulting value is even. This would keep it canonical-ish. However, maybe the increase in accuracy comes from precisely the non-canonicity? Seems difficult to beat your "first syncmer as base" approach.

I’ll need to let strobealign gather some statistics to answer this better. I would guess it is because 'first as base' ensures that every syncmer is picked as base exactly once.

@marcelm marcelm changed the title Choose base syncmer differently Always pick first syncmer as main Jan 14, 2025
@marcelm marcelm force-pushed the base-hash branch 2 times, most recently from 575dabd to bfa3ed4 Compare January 14, 2025 15:39
@marcelm
Copy link
Collaborator Author

marcelm commented Jan 14, 2025

I updated the PR as discussed: In this version, we always pick the first syncmer/strobe as base.

There’s one problem: Accuracy in non-MCS mode actually decreases, so we should probably not merge this PR as-is.

library before (3d14c09) after (9fa9d41) difference
sim3-drosophila-50-se 83.4670 83.0725 -0.3945
sim3-drosophila-75-se 87.3669 87.2637 -0.1032
sim3-drosophila-100-se 88.5559 88.4734 -0.0825
sim3-drosophila-200-se 91.4254 91.4186 -0.0068
sim3-drosophila-300-se 92.7891 92.7600 -0.0291
sim3-maize-100-se 65.1409 65.0592 -0.0817
sim3-maize-150-se 77.0961 77.1671 +0.0710
sim3-CHM13-100-se 89.6101 89.5067 -0.1034
sim3-CHM13-150-se 91.7412 91.6404 -0.1008
sim3-CHM13-300-se 93.5871 93.4427 -0.1444
sim3-rye-50-se 40.6777 40.5560 -0.1217
sim3-rye-150-se 75.9597 75.9887 +0.0290

One solution would be to use the old method in non-MCS mode and the new in MCS mode, but this means we would create a different index depending on whether --mcs is active.

@marcelm marcelm marked this pull request as ready for review January 14, 2025 15:52
@ksahlin
Copy link
Owner

ksahlin commented Jan 14, 2025

Do you mean that this commit is using the mcs method to represent the randstrobe hash value instead of "the old method" that uses h1 + h2? Then yes, we can use that old method to represent the hash value -- which is anyways better for the benchmarks in the paper as it represents the old version.

@marcelm
Copy link
Collaborator Author

marcelm commented Jan 14, 2025

I’m not sure I made clear what I meant.

Current strobealign uses this randstrobe_hash function (always, i.e., no matter whether --mcs is used):

randstrobe_hash_t randstrobe_hash(
    syncmer_hash_t hash1, syncmer_hash_t hash2, randstrobe_hash_t main_hash_mask
) {
    // Make the function symmetric
    if (hash1 > hash2) {
        std::swap(hash1, hash2);
    }
    return ((hash1 & main_hash_mask) | (hash2 & ~main_hash_mask)) & RANDSTROBE_HASH_MASK;
}

This PR removes the if() so that the function becomes:

randstrobe_hash_t randstrobe_hash(
    syncmer_hash_t hash1, syncmer_hash_t hash2, randstrobe_hash_t main_hash_mask
) {
    return ((hash1 & main_hash_mask) | (hash2 & ~main_hash_mask)) & RANDSTROBE_HASH_MASK;
}

This change increases accuracy in MCS mode but decreases it in non-MCS mode.

So to get the best of both worlds, we would need to use the first function in non-MCS mode and the second in MCS mode. We currently don’t have such a distinction. If we do want to make the distinction, this means that the hashes computed in non-MCS mode and in MCS mode are different and that therefore the index is different. This is doable, but we need to make sure that an non-MCS index stored to disk is not used in MCS mode and vice versa.

@marcelm
Copy link
Collaborator Author

marcelm commented Jan 14, 2025

I looked at the loss in accuracy in non-MCS mode a bit more in detail. All of the four reads I looked had lots of errors and received exactly one hit in non-MCS mode and none in MCS mode. Index statistics look like this. Before:

Index statistics
  Total strobemers:          28518206
  Distinct strobemers:       25053364 (100.00%)
    1 occurrence:            24236818 ( 96.74%)
    2..100 occurrences:        814809 (  3.25%)
    >100 occurrences:            1737 (  0.01%)

After:

Index statistics
  Total strobemers:          28518206
  Distinct strobemers:       25313326 (100.00%)
    1 occurrence:            24491955 ( 96.76%)
    2..100 occurrences:        820953 (  3.24%)
    >100 occurrences:             418 (  0.00%)

So it seems to me that the hit was no longer found because only its swapped version occurrs in the read.

@ksahlin
Copy link
Owner

ksahlin commented Jan 14, 2025

So to get the best of both worlds, we would need to use the first function in non-MCS mode and the second in MCS mode. We currently don’t have such a distinction. If we do want to make the distinction, this means that the hashes computed in non-MCS mode and in MCS mode are different and that therefore the index is different. This is doable, but we need to make sure that an non-MCS index stored to disk is not used in MCS mode and vice versa.

I see. I definitely think we should use the first (old) method as default and in our benchmarks. Can we output a warning or include a flag in the index on which method was used? I wonder how many users store the index before mapping..

@marcelm
Copy link
Collaborator Author

marcelm commented Jan 15, 2025

I pushed a commit that removes PartialHit because I realized that we no longer need to keep track of the partial hits we have seen. Overall, this PR simplifies the MCS code quite a bit, and actually makes MCS mode faster although it’s still slower than non-MCs mode.

I have a suggestion for how to fix the issue that accuracy in non-MCS mode increases, but this actually deserves its own PR.

Since the first syncmer is always the base, we partial_start is always
identical to start and partial_end is always identical to start + k.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants