Debug bucket borders #111

Closed
wants to merge 15 commits into from
@eaasna eaasna commented Sep 11, 2024

Fixed with seqan/seqan#2541

--

Use lib/seqan from eaasna/seqan#1

This PR investigates a segmentation fault that occurs when searching the human reference genome for matches to the mouse reference genome. The memory error arises because one of the buckets in the QGramDir is defined such that bucketBegin > bucketEnd. For each k-mer, the QGramDir stores its bucketBegin index, which points into the QGramSA, where the positions of the k-mer are stored. The bucketEnd is inferred from the bucketBegin of the next k-mer. Because the SWIFT index uses open addressing, two hash functions are applied to each k-mer. The value of the first hash function (e.g. hash1(AA) = 14) is stored in the BucketMap at the position given by the second hash function, e.g. hash2(hash1(AA)) = hash2(14). K-mer lookups probe the BucketMap until a matching hash value or an empty bucket is found.
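The lookup and the failure condition described above can be sketched in a few lines. This is a toy model, not SeqAn's implementation: `ToyBucketMap`, `request`, `find`, and `findBrokenBuckets` are hypothetical names, `hash2` is assumed to be a plain modulo, and probing is assumed to be linear. The directory is assumed to be indexed by slot, so that `dir[b]` is the bucketBegin and `dir[b + 1]` the inferred bucketEnd.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Sentinel marking an unused slot (assumption: no real hash1 value equals it).
constexpr std::uint64_t EMPTY = ~std::uint64_t{0};

// Toy open-addressing map: each slot stores a hash1 value or EMPTY.
struct ToyBucketMap
{
    std::vector<std::uint64_t> slots;

    explicit ToyBucketMap(std::size_t n) : slots(n, EMPTY)
    {
        assert(n > 0);
    }

    // Hypothetical second hash function: plain modulo over the table size.
    std::size_t hash2(std::uint64_t h1) const { return h1 % slots.size(); }

    // Return the slot of h1, claiming the first free slot if h1 is absent.
    // Assumes the table is never completely full.
    std::size_t request(std::uint64_t h1)
    {
        std::size_t s = hash2(h1);
        while (slots[s] != EMPTY && slots[s] != h1)
            s = (s + 1) % slots.size(); // linear probing (assumption)
        slots[s] = h1;
        return s;
    }

    // Probe until a matching hash value or an empty slot is found.
    std::optional<std::size_t> find(std::uint64_t h1) const
    {
        std::size_t s = hash2(h1);
        for (std::size_t probes = 0; probes < slots.size(); ++probes)
        {
            if (slots[s] == h1)
                return s;
            if (slots[s] == EMPTY)
                return std::nullopt;
            s = (s + 1) % slots.size();
        }
        return std::nullopt;
    }
};

// Report every bucket b whose begin exceeds its inferred end, dir[b] > dir[b + 1].
inline std::vector<std::size_t> findBrokenBuckets(std::vector<std::size_t> const & dir)
{
    std::vector<std::size_t> broken;
    for (std::size_t b = 0; b + 1 < dir.size(); ++b)
        if (dir[b] > dir[b + 1])
            broken.push_back(b);
    return broken;
}
```

In this model a scan such as `findBrokenBuckets(dir)` run right after index construction would flag the offending bucket before any lookup dereferences an invalid QGramSA range.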

The QGramDir is built in two steps:

  1. count k-mers
  2. calculate the cumulative sum

After this, the suffix array is built.
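The construction order above (count, cumulative sum, then suffix array) can be illustrated with a minimal sketch. It deliberately uses direct addressing with a 2-bit encoding instead of the open-addressing BucketMap, so it only works for small k and is not SeqAn's code. All names (`ToyQGramIndex`, `buildToyIndex`, `kmerRank`, `encodeBase`) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: 2-bit code for A, C, G, T (any other character maps to 0).
inline std::uint64_t encodeBase(char c)
{
    switch (c)
    {
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 0;
    }
}

// Hypothetical helper: rank of the k-mer starting at text[pos].
inline std::uint64_t kmerRank(std::string const & text, std::size_t pos, std::size_t k)
{
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < k; ++i)
        h = (h << 2) | encodeBase(text[pos + i]);
    return h;
}

struct ToyQGramIndex
{
    std::vector<std::size_t> dir; // dir[h] = bucketBegin of k-mer h, dir[h + 1] = its bucketEnd
    std::vector<std::size_t> sa;  // text positions, grouped by k-mer
};

inline ToyQGramIndex buildToyIndex(std::string const & text, std::size_t k)
{
    assert(k <= 12); // direct addressing needs 4^k + 1 directory entries

    ToyQGramIndex index;
    std::size_t const numKmers = std::size_t{1} << (2 * k);
    index.dir.assign(numKmers + 1, 0);
    if (text.size() < k)
        return index;
    std::size_t const numPositions = text.size() - k + 1;

    // Step 1: count k-mers, shifted by one so the sum below yields bucket begins.
    for (std::size_t pos = 0; pos < numPositions; ++pos)
        ++index.dir[kmerRank(text, pos, k) + 1];

    // Step 2: cumulative sum turns counts into bucket borders.
    for (std::size_t h = 1; h <= numKmers; ++h)
        index.dir[h] += index.dir[h - 1];

    // Afterwards: fill the suffix array, using a copy of the bucket begins as write cursors.
    std::vector<std::size_t> next(index.dir.begin(), index.dir.end() - 1);
    index.sa.resize(numPositions);
    for (std::size_t pos = 0; pos < numPositions; ++pos)
        index.sa[next[kmerRank(text, pos, k)]++] = pos;

    return index;
}
```

For example, `buildToyIndex("ACGTAC", 2)` places both occurrences of `AC` (positions 0 and 4) in one bucket. Because the borders come from a cumulative sum of non-negative counts, they are non-decreasing in this sketch, so a bucket with bucketBegin > bucketEnd would indicate that a count or a border was written to the wrong slot.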

It is unclear why the QGramDir is sometimes faulty, but it seems to be triggered by the lexicographically last 32-mer TTTTT....

@eaasna eaasna changed the title Check bucket borders Debug bucket borders Sep 20, 2024
@eaasna eaasna closed this Dec 12, 2024