Experiment: Fast path for exact matches #461
Draft
Thinking about ways to alleviate the multi-context seeds slowdown, I tried adding a "fast path" for exact matches. (This has nothing to do with mcs, but if we add an unrelated optimization at the same time, the slowdown feels less bad.) The first randstrobe from the query is looked up in the index. If it yields a unique hit (only one entry for that hash in the index), the query is compared to the reference at that position. If they match exactly, the match is output and everything else is skipped: looking up other hits in the index, merging them into NAMs, and ungapped alignments for the top NAMs.
This is just a hack for the moment (with some code duplication), and it is incomplete because it does not check the reverse-complemented sequence.
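To make the control flow concrete, here is a minimal standalone sketch of the idea, not the code in this PR. `ToyIndex`, `SeedHit`, `ExactMatch` and `try_exact_fast_path` are hypothetical names invented for illustration, the index is a plain hash map rather than the real index structure, and the seed hash is passed in instead of being computed from the query. Like the hack described above, it only checks the forward strand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration; not the project's actual types.
struct SeedHit {
    std::size_t ref_pos;  // where the seed starts in the reference
};

// Toy index: seed hash -> all reference positions with that hash.
using ToyIndex = std::unordered_map<std::uint64_t, std::vector<SeedHit>>;

struct ExactMatch {
    std::size_t ref_start;  // where the whole query starts in the reference
    std::string cigar;      // e.g. "100=" for a 100 bp exact match
};

// Returns a match if the first seed has exactly one hit in the index and the
// query equals the reference at the implied position. Returns std::nullopt
// otherwise, in which case the caller falls back to the regular path
// (all hits, NAM merging, ungapped/gapped alignment).
// Forward strand only: the reverse complement is not checked.
std::optional<ExactMatch> try_exact_fast_path(
    const ToyIndex& index,
    std::string_view reference,
    std::string_view query,
    std::uint64_t first_seed_hash,
    std::size_t first_seed_query_pos)
{
    assert(first_seed_query_pos <= query.size());
    if (query.empty()) {
        return std::nullopt;
    }
    auto it = index.find(first_seed_hash);
    if (it == index.end() || it->second.size() != 1) {
        return std::nullopt;  // no hit, or the hash is not unique
    }
    const std::size_t seed_ref_pos = it->second.front().ref_pos;
    if (seed_ref_pos < first_seed_query_pos) {
        return std::nullopt;  // query would start before the reference
    }
    const std::size_t ref_start = seed_ref_pos - first_seed_query_pos;
    if (ref_start > reference.size() ||
        query.size() > reference.size() - ref_start) {
        return std::nullopt;  // query would run past the reference end
    }
    if (reference.substr(ref_start, query.size()) != query) {
        return std::nullopt;  // at least one mismatch
    }
    return ExactMatch{ref_start, std::to_string(query.size()) + "="};
}
```

The point of the sketch is that every failure mode returns early with "no result", so the fast path can be tried first without changing what the regular path does afterwards.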
I tested this on sim5-drosophila-100 first, and there was no improvement (although ~30% of all CIGARs are `100=`), but it does help a little bit with the sim3-drosophila-100 library, which appears to get 3% faster. I’ll need to do some more measurements.