-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Always pick first syncmer as main #472
base: main
Are you sure you want to change the base?
Conversation
Edit: Outdated, refers to results for the "pick odd hash" rule Results for accuracy (single-end mapping-only mode) on SIM5, using
Average difference se: -0.0057 The table for SIM3 looks very similar. |
Results for the XOR variant(not canonical!)
Average difference se: +0.1182 Results when always picking the first syncmer as base(not canonical!)
Average difference se: +0.2972 (For SIM3, it’s less pronounced with an average difference of +0.0171.) |
well.... "WTF lol" is the only response I can come up with at the moment. |
We should consider using "first syncmer as base" then I guess. But we should also try to understand this increase in performance. Does it come with significant slowdown? I start to understand why you want to move away from canonicity in randstrobes. But the downside is the lower accuracy in inversions. However, the modification you did here does not seem to affect accuracy, rather the opposite. |
If so, how about XOR the two hashes together and pick the minimum hash if the resulting value is odd, and the maximum hash if the resulting value is even. This would keep it canonical-ish. However, maybe the increase in accuracy comes from precisely the non-canonicity? Seems difficult to beat your "first syncmer as base" approach. |
No, but I only tested one of the fruit fly libraries.
I’ll need to let strobealign gather some statistics to answer this better. I would guess it is because 'first as base' ensures that every syncmer is picked as base exactly once. |
575dabd
to
bfa3ed4
Compare
I updated the PR as discussed: In this version, we always pick the first syncmer/strobe as base. There’s one problem: Accuracy in non-MCS mode actually decreases, so we should probably not merge this PR as-is.
One solution would be to use the old method in non-MCS mode and the new in MCS mode, but this means we would create a different index depending on whether |
Do you mean that this commit is using the mcs method to represent the randstrobe hash value instead of "the old method" that uses |
I’m not sure I made clear what I meant. Current strobealign uses this randstrobe_hash function (always, i.e., no matter whether randstrobe_hash_t randstrobe_hash(
syncmer_hash_t hash1, syncmer_hash_t hash2, randstrobe_hash_t main_hash_mask
) {
// Make the function symmetric
if (hash1 > hash2) {
std::swap(hash1, hash2);
}
return ((hash1 & main_hash_mask) | (hash2 & ~main_hash_mask)) & RANDSTROBE_HASH_MASK;
} This PR removes the randstrobe_hash_t randstrobe_hash(
syncmer_hash_t hash1, syncmer_hash_t hash2, randstrobe_hash_t main_hash_mask
) {
return ((hash1 & main_hash_mask) | (hash2 & ~main_hash_mask)) & RANDSTROBE_HASH_MASK;
} This change increases accuracy in MCS mode but decreases it in non-MCS mode. So to get the best of both worlds, we would need to use the first function in non-MCS mode and the second in MCS mode. We currently don’t have such a distinction. If we do want to make the distinction, this means that the hashes computed in non-MCS mode and in MCS mode are different and that therefore the index is different. This is doable, but we need to make sure that an non-MCS index stored to disk is not used in MCS mode and vice versa. |
I looked at the loss in accuracy in non-MCS mode a bit more in detail. All of the four reads I looked had lots of errors and received exactly one hit in non-MCS mode and none in MCS mode. Index statistics look like this. Before:
After:
So it seems to me that the hit was no longer found because only its swapped version occurrs in the read. |
I see. I definitely think we should use the first (old) method as default and in our benchmarks. Can we output a warning or include a flag in the index on which method was used? I wonder how many users store the index before mapping.. |
I pushed a commit that removes I have a suggestion for how to fix the issue that accuracy in non-MCS mode increases, but this actually deserves its own PR. |
Since the first syncmer is always the base, we partial_start is always identical to start and partial_end is always identical to start + k.
As discussed: Always pick the first syncmer as main.
As suggested in an e-mail by @ksahlin, this changes the way the base syncmer is chosen when multi-context seeds are enabled. The rule is: Pick the syncmer with odd hash value. If there’s a tie, use the one with the smaller hash value (as before).I also tested two other rules, but these would make randstrobes non-canonical:XOR the two hashes together and pick the first if the resulting value is oddAlways pick the first syncmer