Always pick first syncmer as main #472

marcelm · 2024-12-17T11:14:59Z

As discussed: Always pick the first syncmer as main.

As suggested in an e-mail by @ksahlin, this changes the way the base syncmer is chosen when multi-context seeds are enabled. The rule is: Pick the syncmer with odd hash value. If there’s a tie, use the one with the smaller hash value (as before).

~~I also tested two other rules, but these would make randstrobes non-canonical:~~

~~XOR the two hashes together and pick the first if the resulting value is odd~~
~~Always pick the first syncmer~~

marcelm · 2024-12-17T11:19:34Z

Edit: Outdated, refers to results for the "pick odd hash" rule

Results for accuracy (single-end mapping-only mode) on SIM5, using --mcs

library	v0.15.0	odd as base	difference
sim5-drosophila-50-se	81.4027	81.4438	+0.0411
sim5-drosophila-75-se	85.0652	85.0851	+0.0199
sim5-drosophila-100-se	86.7170	86.7240	+0.0070
sim5-drosophila-150-se	89.0129	89.0167	+0.0038
sim5-drosophila-200-se	90.1490	90.1298	-0.0192
sim5-drosophila-300-se	91.3855	91.3698	-0.0157
sim5-drosophila-500-se	92.1896	92.2030	+0.0134
sim5-maize-50-se	40.3932	40.4007	+0.0075
sim5-maize-75-se	51.8179	51.8444	+0.0265
sim5-maize-100-se	59.4661	59.4536	-0.0125
sim5-maize-150-se	71.7984	71.7702	-0.0282
sim5-maize-200-se	78.5476	78.5126	-0.0350
sim5-maize-300-se	84.9285	84.9088	-0.0197
sim5-maize-500-se	89.2947	89.2780	-0.0167
sim5-CHM13-50-se	77.1630	77.1342	-0.0288
sim5-CHM13-75-se	84.0977	84.1002	+0.0025
sim5-CHM13-100-se	86.8231	86.8427	+0.0196
sim5-CHM13-150-se	90.0521	90.0083	-0.0438
sim5-CHM13-200-se	91.2412	91.2513	+0.0101
sim5-CHM13-300-se	92.2348	92.2550	+0.0202
sim5-CHM13-500-se	92.7584	92.7547	-0.0037
sim5-rye-50-se	37.5398	37.5169	-0.0229
sim5-rye-75-se	49.6446	49.6333	-0.0113
sim5-rye-100-se	57.8679	57.9150	+0.0471
sim5-rye-150-se	70.8919	70.8721	-0.0198
sim5-rye-200-se	77.7910	77.7393	-0.0517
sim5-rye-300-se	84.1465	84.1485	+0.0020
sim5-rye-500-se	88.6140	88.5623	-0.0517

Average difference se: -0.0057

The table for SIM3 looks very similar.

marcelm · 2024-12-17T11:27:05Z

Results for the XOR variant

(not canonical!)

SIM5, mapping-only, single-end, mcs

library	v0.15.0	XOR	difference
sim5-drosophila-50-se	81.4027	82.1746	+0.7719
sim5-drosophila-75-se	85.0652	85.3098	+0.2446
sim5-drosophila-100-se	86.7170	86.8685	+0.1515
sim5-drosophila-150-se	89.0129	89.0143	+0.0014
sim5-drosophila-200-se	90.1490	90.1004	-0.0486
sim5-drosophila-300-se	91.3855	91.3633	-0.0222
sim5-drosophila-500-se	92.1896	92.1590	-0.0306
sim5-maize-50-se	40.3932	40.5216	+0.1284
sim5-maize-75-se	51.8179	51.8814	+0.0635
sim5-maize-100-se	59.4661	59.5326	+0.0665
sim5-maize-150-se	71.7984	72.0019	+0.2035
sim5-maize-200-se	78.5476	78.8174	+0.2698
sim5-maize-300-se	84.9285	85.1639	+0.2354
sim5-maize-500-se	89.2947	89.4971	+0.2024
sim5-CHM13-50-se	77.1630	77.3339	+0.1709
sim5-CHM13-75-se	84.0977	84.2414	+0.1437
sim5-CHM13-100-se	86.8231	86.9453	+0.1222
sim5-CHM13-150-se	90.0521	90.0507	-0.0014
sim5-CHM13-200-se	91.2412	91.2356	-0.0056
sim5-CHM13-300-se	92.2348	92.1425	-0.0923
sim5-CHM13-500-se	92.7584	92.5621	-0.1963
sim5-rye-50-se	37.5398	37.5052	-0.0346
sim5-rye-75-se	49.6446	49.6990	+0.0544
sim5-rye-100-se	57.8679	57.9129	+0.0450
sim5-rye-150-se	70.8919	71.0859	+0.1940
sim5-rye-200-se	77.7910	78.0933	+0.3023
sim5-rye-300-se	84.1465	84.3859	+0.2394
sim5-rye-500-se	88.6140	88.7453	+0.1313

Average difference se: +0.1182

Results when always picking the first syncmer as base

(not canonical!)

SIM5, mapping-only, single-end, mcs

library	v0.15.0	always first	difference
sim5-drosophila-50-se	81.4027	82.8590	+1.4563
sim5-drosophila-75-se	85.0652	85.5390	+0.4738
sim5-drosophila-100-se	86.7170	86.9826	+0.2656
sim5-drosophila-150-se	89.0129	88.9839	-0.0290
sim5-drosophila-200-se	90.1490	90.0605	-0.0885
sim5-drosophila-300-se	91.3855	91.3050	-0.0805
sim5-drosophila-500-se	92.1896	92.1746	-0.0150
sim5-maize-50-se	40.3932	40.6672	+0.2740
sim5-maize-75-se	51.8179	52.0802	+0.2623
sim5-maize-100-se	59.4661	59.5956	+0.1295
sim5-maize-150-se	71.7984	72.0178	+0.2194
sim5-maize-200-se	78.5476	78.8700	+0.3224
sim5-maize-300-se	84.9285	85.1941	+0.2656
sim5-maize-500-se	89.2947	89.5820	+0.2873
sim5-CHM13-50-se	77.1630	78.1865	+1.0235
sim5-CHM13-75-se	84.0977	84.5674	+0.4697
sim5-CHM13-100-se	86.8231	87.1812	+0.3581
sim5-CHM13-150-se	90.0521	90.1135	+0.0614
sim5-CHM13-200-se	91.2412	91.2329	-0.0083

Average difference se: +0.2972

(For SIM3, it’s less pronounced with an average difference of +0.0171.)

ksahlin · 2024-12-17T11:32:03Z

well.... "WTF lol" is the only response I can come up with at the moment.

ksahlin · 2024-12-17T11:49:56Z

We should consider using "first syncmer as base" then I guess. But we should also try to understand this increase in performance.

Does it come with significant slowdown?

I start to understand why you want to move away from canonicity in randstrobes. But the downside is the lower accuracy in inversions. However, the modification you did here does not seem to affect accuracy, rather the opposite.

ksahlin · 2024-12-17T21:49:31Z

I also tested two other rules, but these would make randstrobes non-canonical:
XOR the two hashes together and pick the first if the resulting value is odd

If so, how about XOR the two hashes together and pick the minimum hash if the resulting value is odd, and the maximum hash if the resulting value is even. This would keep it canonical-ish. However, maybe the increase in accuracy comes from precisely the non-canonicity? Seems difficult to beat your "first syncmer as base" approach.

marcelm · 2024-12-18T07:55:33Z

Does it come with significant slowdown?

No, but I only tested one of the fruit fly libraries.

I start to understand why you want to move away from canonicity in randstrobes. But the downside is the lower accuracy in inversions. However, the modification you did here does not seem to affect accuracy, rather the opposite.

If so, how about XOR the two hashes together and pick the minimum hash if the resulting value is odd, and the maximum hash if the resulting value is even. This would keep it canonical-ish. However, maybe the increase in accuracy comes from precisely the non-canonicity? Seems difficult to beat your "first syncmer as base" approach.

I’ll need to let strobealign gather some statistics to answer this better. I would guess it is because 'first as base' ensures that every syncmer is picked as base exactly once.

marcelm · 2025-01-14T15:52:04Z

I updated the PR as discussed: In this version, we always pick the first syncmer/strobe as base.

There’s one problem: Accuracy in non-MCS mode actually decreases, so we should probably not merge this PR as-is.

library	before (`3d14c09`)	after (`9fa9d41`)	difference
sim3-drosophila-50-se	83.4670	83.0725	-0.3945
sim3-drosophila-75-se	87.3669	87.2637	-0.1032
sim3-drosophila-100-se	88.5559	88.4734	-0.0825
sim3-drosophila-200-se	91.4254	91.4186	-0.0068
sim3-drosophila-300-se	92.7891	92.7600	-0.0291
sim3-maize-100-se	65.1409	65.0592	-0.0817
sim3-maize-150-se	77.0961	77.1671	+0.0710
sim3-CHM13-100-se	89.6101	89.5067	-0.1034
sim3-CHM13-150-se	91.7412	91.6404	-0.1008
sim3-CHM13-300-se	93.5871	93.4427	-0.1444
sim3-rye-50-se	40.6777	40.5560	-0.1217
sim3-rye-150-se	75.9597	75.9887	+0.0290

One solution would be to use the old method in non-MCS mode and the new in MCS mode, but this means we would create a different index depending on whether --mcs is active.

Use option -m to pass --mcs to strobealign

ksahlin · 2025-01-14T16:20:03Z

Do you mean that this commit is using the mcs method to represent the randstrobe hash value instead of "the old method" that uses h1 + h2? Then yes, we can use that old method to represent the hash value -- which is anyways better for the benchmarks in the paper as it represents the old version.

marcelm · 2025-01-14T16:33:40Z

I’m not sure I made clear what I meant.

Current strobealign uses this randstrobe_hash function (always, i.e., no matter whether --mcs is used):

randstrobe_hash_t randstrobe_hash(
    syncmer_hash_t hash1, syncmer_hash_t hash2, randstrobe_hash_t main_hash_mask
) {
    // Make the function symmetric
    if (hash1 > hash2) {
        std::swap(hash1, hash2);
    }
    return ((hash1 & main_hash_mask) | (hash2 & ~main_hash_mask)) & RANDSTROBE_HASH_MASK;
}

This PR removes the if() so that the function becomes:

randstrobe_hash_t randstrobe_hash(
    syncmer_hash_t hash1, syncmer_hash_t hash2, randstrobe_hash_t main_hash_mask
) {
    return ((hash1 & main_hash_mask) | (hash2 & ~main_hash_mask)) & RANDSTROBE_HASH_MASK;
}

This change increases accuracy in MCS mode but decreases it in non-MCS mode.

So to get the best of both worlds, we would need to use the first function in non-MCS mode and the second in MCS mode. We currently don’t have such a distinction. If we do want to make the distinction, this means that the hashes computed in non-MCS mode and in MCS mode are different and that therefore the index is different. This is doable, but we need to make sure that an non-MCS index stored to disk is not used in MCS mode and vice versa.

marcelm · 2025-01-14T16:39:49Z

I looked at the loss in accuracy in non-MCS mode a bit more in detail. All of the four reads I looked had lots of errors and received exactly one hit in non-MCS mode and none in MCS mode. Index statistics look like this. Before:

Index statistics
  Total strobemers:          28518206
  Distinct strobemers:       25053364 (100.00%)
    1 occurrence:            24236818 ( 96.74%)
    2..100 occurrences:        814809 (  3.25%)
    >100 occurrences:            1737 (  0.01%)

After:

Index statistics
  Total strobemers:          28518206
  Distinct strobemers:       25313326 (100.00%)
    1 occurrence:            24491955 ( 96.76%)
    2..100 occurrences:        820953 (  3.24%)
    >100 occurrences:             418 (  0.00%)

So it seems to me that the hit was no longer found because only its swapped version occurrs in the read.

ksahlin · 2025-01-14T16:50:40Z

So to get the best of both worlds, we would need to use the first function in non-MCS mode and the second in MCS mode. We currently don’t have such a distinction. If we do want to make the distinction, this means that the hashes computed in non-MCS mode and in MCS mode are different and that therefore the index is different. This is doable, but we need to make sure that an non-MCS index stored to disk is not used in MCS mode and vice versa.

I see. I definitely think we should use the first (old) method as default and in our benchmarks. Can we output a warning or include a flag in the index on which method was used? I wonder how many users store the index before mapping..

marcelm · 2025-01-15T09:11:17Z

I pushed a commit that removes PartialHit because I realized that we no longer need to keep track of the partial hits we have seen. Overall, this PR simplifies the MCS code quite a bit, and actually makes MCS mode faster although it’s still slower than non-MCs mode.

I have a suggestion for how to fix the issue that accuracy in non-MCS mode increases, but this actually deserves its own PR.

Since the first syncmer is always the base, we partial_start is always identical to start and partial_end is always identical to start + k.

marcelm changed the title ~~Choose base syncmer differently~~ Always pick first syncmer as main Jan 14, 2025

marcelm force-pushed the base-hash branch 2 times, most recently from 575dabd to bfa3ed4 Compare January 14, 2025 15:39

marcelm marked this pull request as ready for review January 14, 2025 15:52

marcelm added 5 commits January 14, 2025 17:00

Optionally enable mcs in compare-baseline script

4d5d31c

Use option -m to pass --mcs to strobealign

samdiff: Report queries that became unmapped

b619eb8

Print Nam::is_rc attribute

3d14c09

Always use first syncmer as base

9fa9d41

Get rid of first_strobe_is_main (it’s always true)

7f5ac33

marcelm force-pushed the base-hash branch from bfa3ed4 to 8ece1ed Compare January 14, 2025 16:03

marcelm mentioned this pull request Jan 15, 2025

Rescue using partial hits, even in non-MCS mode #476

Open

marcelm added 2 commits January 15, 2025 13:45

Update baseline commit

f2d9e9f

Remove the partial_start and partial_end attributes

a7a2c1d

Since the first syncmer is always the base, we partial_start is always identical to start and partial_end is always identical to start + k.

marcelm force-pushed the base-hash branch from c0667ff to 43a74d6 Compare January 15, 2025 12:50

marcelm added 3 commits January 15, 2025 21:16

PartialHit is no longer needed

c9d29ec

Bump .sti format version

50b65d0

Remove outdated comment

9ea6fa5

marcelm force-pushed the base-hash branch from 43a74d6 to 9ea6fa5 Compare January 15, 2025 20:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Always pick first syncmer as main #472

Always pick first syncmer as main #472

marcelm commented Dec 17, 2024 •

edited

Loading

marcelm commented Dec 17, 2024 •

edited

Loading

marcelm commented Dec 17, 2024 •

edited

Loading

ksahlin commented Dec 17, 2024

ksahlin commented Dec 17, 2024

ksahlin commented Dec 17, 2024

marcelm commented Dec 18, 2024

marcelm commented Jan 14, 2025 •

edited

Loading

ksahlin commented Jan 14, 2025

marcelm commented Jan 14, 2025

marcelm commented Jan 14, 2025

ksahlin commented Jan 14, 2025

marcelm commented Jan 15, 2025 •

edited

Loading

Always pick first syncmer as main #472

Are you sure you want to change the base?

Always pick first syncmer as main #472

Conversation

marcelm commented Dec 17, 2024 • edited Loading

marcelm commented Dec 17, 2024 • edited Loading

marcelm commented Dec 17, 2024 • edited Loading

Results for the XOR variant

Results when always picking the first syncmer as base

ksahlin commented Dec 17, 2024

ksahlin commented Dec 17, 2024

ksahlin commented Dec 17, 2024

marcelm commented Dec 18, 2024

marcelm commented Jan 14, 2025 • edited Loading

ksahlin commented Jan 14, 2025

marcelm commented Jan 14, 2025

marcelm commented Jan 14, 2025

ksahlin commented Jan 14, 2025

marcelm commented Jan 15, 2025 • edited Loading

marcelm commented Dec 17, 2024 •

edited

Loading

marcelm commented Dec 17, 2024 •

edited

Loading

marcelm commented Dec 17, 2024 •

edited

Loading

marcelm commented Jan 14, 2025 •

edited

Loading

marcelm commented Jan 15, 2025 •

edited

Loading