Experiment: Fast path for exact matches #461

Draft
wants to merge 1 commit into main

marcelm (Collaborator) commented Nov 26, 2024

Thinking about ways to alleviate the multi-context seeds (mcs) slowdown, I tried adding a "fast path" for exact matches. (This has nothing to do with mcs, but if we add an unrelated optimization at the same time, the slowdown feels less bad.) The first randstrobe from the query is looked up in the index. If it yields a unique hit (only one entry for that hash in the index), the query and reference are compared, and if they match exactly, the match is output and everything else is skipped: looking up other hits in the index, merging them into NAMs, and computing ungapped alignments for the top NAMs.

This is just a hack for the moment (with some code duplication), and it is incomplete because it does not check the reverse-complemented sequence.
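
To make the control flow concrete, here is a minimal Python sketch of the idea, not the actual code in this PR. It makes several simplifying assumptions: a plain k-mer stands in for the first randstrobe, the index is an ordinary dictionary, and the names `build_index`, `fast_path_exact_match`, and `K` are hypothetical. Like the hack described above, it only checks the forward strand.

```python
from typing import Optional

K = 8  # hypothetical seed length; a plain k-mer stands in for a randstrobe


def build_index(refs: dict[str, str], k: int = K) -> dict[str, list[tuple[str, int]]]:
    """Map every k-mer to the list of (reference name, position) where it occurs."""
    index: dict[str, list[tuple[str, int]]] = {}
    for name, seq in refs.items():
        for pos in range(len(seq) - k + 1):
            index.setdefault(seq[pos:pos + k], []).append((name, pos))
    return index


def fast_path_exact_match(
    query: str,
    refs: dict[str, str],
    index: dict[str, list[tuple[str, int]]],
    k: int = K,
) -> Optional[tuple[str, int]]:
    """Return (reference name, position) if the fast path applies, else None.

    None means "fall back to the regular pipeline" (more seed lookups,
    merging hits, ungapped alignment); it does not mean "unmapped".
    """
    if len(query) < k:
        return None
    hits = index.get(query[:k], [])
    if len(hits) != 1:  # only take the fast path for a unique hit
        return None
    name, pos = hits[0]
    # Compare the whole query against the reference at the implied position.
    # A slice that runs past the reference end is shorter and cannot match.
    if refs[name][pos:pos + len(query)] == query:
        return name, pos
    return None


if __name__ == "__main__":
    refs = {"chr1": "ACGTTGCAAGGCTTAACCGGATAT"}
    index = build_index(refs)
    print(fast_path_exact_match("TGCAAGGCTTAACCGG", refs, index))  # ('chr1', 4)
```

Because the full query is compared against the reference before anything is reported, a misleading seed hit can only cause a fallback to the slow path in this sketch, never a wrong exact match.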

I tested this on sim5-drosophila-100 first, and there was no improvement (although ~30% of all CIGARs are 100=), but it does help a little with the sim3-drosophila-100 library, which appears to get 3% faster.

I’ll need to do some more measurements.

…ique

This is a hack for the moment that is also incomplete as it does not check
the reverse-complemented sequence.

Speed improvements:

* Less than 1% on sim5-drosophila-100 (where ~30% of all CIGARs are `100=`)
* Perhaps 3% on sim3-drosophila-100

ksahlin (Owner) commented Dec 10, 2024

Copying my comment from an individual commit into the PR discussion instead:

This optimization could be dangerous because randstrobes (like any other approximate seeds) are not guaranteed to match at the smallest edit distance, so a single unique hit could actually be suboptimal (see fig below). This happens when mutations permute the hash values.

I'm not sure how common this is in practice, but I am sure it happens occasionally for large datasets. It is typically solved by looking at some more seeds.

One practical fix might be to look at two seeds instead: if the probability of this happening for one seed is p, it would be p^2 for two seeds under a good hash function.
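
As a rough illustration of the p^2 argument, the following sketch computes the chance that every one of n seeds points to a suboptimal location, assuming the seeds fail independently (which is what "a good hash function" is taken to mean here). The function name and the example value of p are hypothetical and not measured on any dataset.

```python
def prob_all_seeds_misleading(p: float, n_seeds: int) -> float:
    """Probability that all n_seeds seeds are misleading at once.

    Assumes each seed is misleading with probability p, independently
    of the others.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n_seeds < 1:
        raise ValueError("n_seeds must be at least 1")
    return p ** n_seeds


if __name__ == "__main__":
    p = 1e-3  # hypothetical per-seed probability, for illustration only
    for n in (1, 2, 3):
        print(n, prob_all_seeds_misleading(p, n))
```

Under the independence assumption, each additional seed multiplies the failure probability by another factor of p, so a second lookup already reduces it from p to p^2.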

Image

ksahlin (Owner) commented Dec 10, 2024

I see, your commit also includes a Hamming/extension alignment and requires a single exact match? Then I guess there is no real limitation introduced. But it is still good to keep my comment above in mind.
