You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I noticed when building a csa_wt<wt_pc<>> that the input sequence (an int_vector) cannot contain a zero since the sentinel character in the BWT will be represented by zero.
As far as I understand the code, the alphabet size of the input sequence increases by one (the sentinel character) when building the wavelet tree which increases the wavelet tree size. Wouldn't it be more efficient in terms of space to represent the sentinel character by some character from the input alphabet and store its position?
In SeqAn2 we replace the sentinel character by some character of the input alphabet and store the position of the sentinel in the BWT. When counting characters in a prefix of the BWT in the wavelet tree, the rank is decremented by one iff. the character to be counted is equal to the sentinel replacement and its position is within the prefix.
For collections of strings (which is yet to come), the sentinels would be replaced by the least common character of the input sequence (assuming that rare characters in the input text will also occur less frequently in the pattern to be searched) and we would keep a (sparse) bitvector storing the sentinel positions with rank support.
Is this already implemented in the SDSL (or in progress) or have you discussed this approach earlier @mpetri@simongog?
I noticed when building a
csa_wt<wt_pc<>>
that the input sequence (an int_vector) cannot contain a zero since the sentinel character in the BWT will be represented by zero.As far as I understand the code, the alphabet size of the input sequence increases by one (the sentinel character) when building the wavelet tree which increases the wavelet tree size. Wouldn't it be more efficient in terms of space to represent the sentinel character by some character from the input alphabet and store its position?
In SeqAn2 we replace the sentinel character by some character of the input alphabet and store the position of the sentinel in the BWT. When counting characters in a prefix of the BWT in the wavelet tree, the rank is decremented by one iff. the character to be counted is equal to the sentinel replacement and its position is within the prefix.
For collections of strings (which is yet to come), the sentinels would be replaced by the least common character of the input sequence (assuming that rare characters in the input text will also occur less frequently in the pattern to be searched) and we would keep a (sparse) bitvector storing the sentinel positions with rank support.
Is this already implemented in the SDSL (or in progress) or have you discussed this approach earlier @mpetri @simongog?
FYI @h-2
The text was updated successfully, but these errors were encountered: