Sentinel character in BWT increases input alphabet size? #38

cpockrandt · 2018-05-14T17:12:26Z

I noticed when building a csa_wt<wt_pc<>> that the input sequence (an int_vector) cannot contain a zero since the sentinel character in the BWT will be represented by zero.

As far as I understand the code, the alphabet size of the input sequence increases by one (the sentinel character) when building the wavelet tree which increases the wavelet tree size. Wouldn't it be more efficient in terms of space to represent the sentinel character by some character from the input alphabet and store its position?

In SeqAn2 we replace the sentinel character by some character of the input alphabet and store the position of the sentinel in the BWT. When counting characters in a prefix of the BWT in the wavelet tree, the rank is decremented by one iff. the character to be counted is equal to the sentinel replacement and its position is within the prefix.

For collections of strings (which is yet to come), the sentinels would be replaced by the least common character of the input sequence (assuming that rare characters in the input text will also occur less frequently in the pattern to be searched) and we would keep a (sparse) bitvector storing the sentinel positions with rank support.

Is this already implemented in the SDSL (or in progress) or have you discussed this approach earlier @mpetri @simongog?

FYI @h-2

The text was updated successfully, but these errors were encountered:

mpetri · 2018-05-15T01:18:33Z

I think simon has implemented this 2-3 years ago. Check out the "implicit_sentinel" branch.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sentinel character in BWT increases input alphabet size? #38

Sentinel character in BWT increases input alphabet size? #38

cpockrandt commented May 14, 2018

mpetri commented May 15, 2018

Sentinel character in BWT increases input alphabet size? #38

Sentinel character in BWT increases input alphabet size? #38

Comments

cpockrandt commented May 14, 2018

mpetri commented May 15, 2018