Experiment: Fast path for exact matches #461

marcelm · 2024-11-26T15:52:16Z

While thinking about ways to alleviate the slowdown from multi-context seeds (MCS), I tried adding a "fast path" for exact matches. (This has nothing to do with MCS, but if we add an unrelated optimization at the same time, the slowdown feels less bad.) The first randstrobe from the query is looked up in the index. If it results in a unique hit (only one entry for that hash in the index), the query and reference are compared, and if they match exactly, the match is output and everything else is skipped: looking up further hits in the index, merging them into NAMs, and computing ungapped alignments for the top NAMs.
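The fast path described above can be sketched roughly as follows. This is a toy illustration, not the code in this PR: it uses plain k-mers in place of randstrobes, a Python dict in place of the real index, and the names `build_index`, `fast_path` and `K` are made up for the example.

```python
from collections import defaultdict

K = 8  # toy seed length (assumption; a real randstrobe links two syncmers)


def build_index(reference, k=K):
    """Map the hash of every k-mer to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[hash(reference[pos:pos + k])].append(pos)
    return index


def fast_path(query, reference, index, k=K):
    """Return (ref_start, cigar) if the first seed is unique and the whole
    query matches the reference exactly there; otherwise return None so the
    caller can fall back to the normal pipeline."""
    if len(query) < k:
        return None
    hits = index.get(hash(query[:k]), [])
    if len(hits) != 1:
        return None  # seed missing or not unique: no shortcut
    start = hits[0]
    if reference[start:start + len(query)] != query:
        return None  # mismatch, or the query runs past the reference end
    return start, f"{len(query)}="
```

The exact string comparison is what makes the shortcut safe to take: a unique seed hit alone only proposes a location, and anything short of a full match falls through to the regular seed lookup and NAM merging.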

This is just a hack for the moment (with some code duplication), and it is incomplete because it does not check the reverse-complemented sequence.
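For illustration, checking the reverse-complemented sequence as well could look something like the sketch below. The helper names are hypothetical and this is not taken from the commit; it only shows the comparison step for both orientations.

```python
# Translation table for DNA complement; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def exact_match_strand(query, ref_window):
    """Return '+' if the query equals the reference window, '-' if its
    reverse complement does, and None if neither matches exactly."""
    if query == ref_window:
        return "+"
    if reverse_complement(query) == ref_window:
        return "-"
    return None
```

In a complete fast path, the seed lookup would presumably also have to be done for the first seed of the reverse-complemented read, since that read starts with a different seed.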

I tested this on sim5-drosophila-100 first, and there was no improvement (although ~30% of all CIGARs are `100=`), but it does help a little with the sim3-drosophila-100 library, which appears to get about 3% faster.

I’ll need to do some more measurements.

…ique

This is a hack for the moment that is also incomplete as it does not check the reverse-complemented sequence.

Speed improvements:
* Less than 1% on sim5-drosophila-100 (where ~30% of all CIGARs are `100=`)
* Perhaps 3% on sim3-drosophila-100

ksahlin · 2024-12-10T14:28:14Z

Copying my comment on an individual commit into the PR discussion instead:

This optimization could be dangerous: randstrobes (like any other approximate seeds) are not guaranteed to match at the location with the smallest edit distance, so a single unique hit could actually be suboptimal (see fig below). This happens when mutations permute the hash values.

I'm not sure how common this is in practice, but I am sure it happens occasionally in large datasets. It is typically solved by looking at some more seeds.

One way to solve this in practice may be to look at two seeds instead: if the probability that this happens for one seed is p, it would be p^2 for two seeds under a good hash function.
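To spell out the arithmetic, assume that "a good hash function" means the two seeds fail independently, each giving a misleading unique hit with probability p. Then:

$$
P(\text{both seeds misleading}) = p \cdot p = p^2
$$

With a purely illustrative value of $p = 10^{-3}$ (not a measured rate), this gives $p^2 = 10^{-6}$.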

ksahlin · 2024-12-10T14:30:39Z

I see, your commit also includes a Hamming/extension alignment and requires a single exact match? Then I guess no real limitation is introduced. It is still good to keep my comment above in mind.
