Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sentinel character in BWT increases input alphabet size? #38

Open
cpockrandt opened this issue May 14, 2018 · 1 comment
Open

Sentinel character in BWT increases input alphabet size? #38

cpockrandt opened this issue May 14, 2018 · 1 comment

Comments

@cpockrandt
Copy link
Collaborator

I noticed when building a csa_wt<wt_pc<>> that the input sequence (an int_vector) cannot contain a zero since the sentinel character in the BWT will be represented by zero.

As far as I understand the code, the alphabet size of the input sequence increases by one (the sentinel character) when building the wavelet tree which increases the wavelet tree size. Wouldn't it be more efficient in terms of space to represent the sentinel character by some character from the input alphabet and store its position?

In SeqAn2 we replace the sentinel character by some character of the input alphabet and store the position of the sentinel in the BWT. When counting characters in a prefix of the BWT in the wavelet tree, the rank is decremented by one iff. the character to be counted is equal to the sentinel replacement and its position is within the prefix.

For collections of strings (which is yet to come), the sentinels would be replaced by the least common character of the input sequence (assuming that rare characters in the input text will also occur less frequently in the pattern to be searched) and we would keep a (sparse) bitvector storing the sentinel positions with rank support.

Is this already implemented in the SDSL (or in progress) or have you discussed this approach earlier @mpetri @simongog?

FYI @h-2

@mpetri
Copy link

mpetri commented May 15, 2018

I think simon has implemented this 2-3 years ago. Check out the "implicit_sentinel" branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants