[feature request] document counting #7
I'll have to think this through.

Sadakane's document counting structure belongs to a family of data structures that store information in the internal nodes of the suffix tree. While the final structures can be compressed well, the construction algorithms usually store the data explicitly.

The `SadaCount` construction algorithm uses the LCP array to simulate a single-threaded traversal of the suffix tree. It maps each internal node to its inorder rank(s) and stores the information in a single array. This approach works well enough in GCSA2, where the effective text size is ~10^10 for a 1000GP graph. For the GBWT of the same data, with an effective text size over 10^12, we would need something much faster and more space-efficient.

The bottleneck seems to be LCP array construction. Assuming that we can build it efficiently, the rest of the construction should not be a problem. Each GBWT node corresponds to a subtree of the root of the suffix tree, so we can traverse the subtrees independently in parallel and compress the intermediate results.
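For intuition, here is a minimal, hedged sketch of the standard stack-based pass over the LCP array (in the style of Abouelhoda et al.) that this kind of construction performs, reporting one interval per internal suffix tree node; it is illustrative only, not the actual GCSA2 or GBWT code:

```cpp
#include <cstdint>
#include <functional>
#include <stack>
#include <vector>

// Illustrative sketch: simulate a suffix tree traversal over the LCP array.
// lcp[i] = length of the common prefix of the suffixes ranked i-1 and i
// (lcp[0] is unused). Each report(depth, lb, rb) corresponds to one internal
// suffix tree node; a SadaCount-style construction records per-node data
// during such a single-threaded pass.
void enumerate_lcp_intervals(const std::vector<uint64_t>& lcp,
                             const std::function<void(uint64_t, uint64_t, uint64_t)>& report) {
    struct Interval { uint64_t depth, lb; };
    std::stack<Interval> st;
    st.push({0, 0});                                   // root interval
    for (uint64_t i = 1; i < lcp.size(); i++) {
        uint64_t lb = i - 1;
        while (lcp[i] < st.top().depth) {              // close intervals ending at i-1
            Interval node = st.top(); st.pop();
            report(node.depth, node.lb, i - 1);
            lb = node.lb;
        }
        if (lcp[i] > st.top().depth) { st.push({lcp[i], lb}); }
    }
    while (!st.empty()) {                              // close intervals reaching the end
        Interval node = st.top(); st.pop();
        report(node.depth, node.lb, lcp.size() - 1);
    }
}
```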
Thank you for the swift response! Yes, I now see that in the case of GCSA2, both the alphabet size and the suffix lengths are much smaller than in the GBWT.
Luckily, the last author of the r-index paper (https://arxiv.org/pdf/1809.02792v2.pdf) has also addressed efficient LCP construction: https://arxiv.org/pdf/1901.05226.pdf. My understanding is that Belazzougui’s algorithm should work well with GBWT (there, σ^2 looks scary but it's really <global σ> * <local σ>), and once LCP is built, it can be sampled similarly to SA to use only O(#runs) space, as per Lemma 5 in the r-index paper.
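A hedged sketch of the sampling idea follows, assuming it suffices to keep one LCP value per BWT run boundary; whether exactly this sample set satisfies Lemma 5 of the r-index paper depends on details not reproduced here:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch only: keep LCP values at positions where a BWT run starts, so the
// number of stored samples is O(#runs) rather than O(n).
std::unordered_map<uint64_t, uint64_t>
sample_lcp_at_run_starts(const std::string& bwt, const std::vector<uint64_t>& lcp) {
    std::unordered_map<uint64_t, uint64_t> samples;
    for (uint64_t i = 0; i < bwt.size(); i++) {
        if (i == 0 || bwt[i] != bwt[i - 1]) {   // a new run begins at position i
            samples[i] = lcp[i];
        }
    }
    return samples;
}
```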
That algorithm is not nearly fast or space-efficient enough. Extrapolating from the results in the paper, LCP array construction for the 1000GP GBWT would require 5-6 days and over 2 TB of memory. The irreducible LCP algorithm (https://doi.org/10.1007/978-3-642-02441-2_17) is more promising. I had an implementation in the RLCSA that required minimal working space on top of the FM-index and the run-length encoded PLCP array. It was reasonably fast on highly repetitive texts, and it should be possible to parallelize it to take advantage of tens of CPU cores.
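For context, a minimal sketch of the plain Φ-based PLCP computation that the irreducible-LCP idea builds on; it assumes the text and suffix array fit in memory, whereas the RLCSA version instead works from the FM-index and run-length encodes the result:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch of Phi-based PLCP construction (Karkkainen, Manzini & Puglisi).
// PLCP[j] >= PLCP[j-1] - 1, so the match length 'l' never restarts from zero;
// the irreducible-LCP variant exploits the same property while explicitly
// computing only the irreducible values.
std::vector<uint64_t> plcp_from_sa(const std::string& text,
                                   const std::vector<uint64_t>& sa) {
    const uint64_t n = text.size();
    std::vector<uint64_t> phi(n);            // phi[j] = suffix preceding j in SA order
    for (uint64_t i = 1; i < n; i++) { phi[sa[i]] = sa[i - 1]; }
    phi[sa[0]] = n;                          // sentinel: the first suffix has no predecessor

    std::vector<uint64_t> plcp(n, 0);
    uint64_t l = 0;
    for (uint64_t j = 0; j < n; j++) {
        if (phi[j] == n) { l = 0; plcp[j] = 0; continue; }
        while (j + l < n && phi[j] + l < n && text[j + l] == text[phi[j] + l]) { l++; }
        plcp[j] = l;
        if (l > 0) { l--; }
    }
    return plcp;                             // LCP[i] = plcp[sa[i]]
}
```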
I believe we would need roughly the following:
I assume you are referring to "Sampled Longest Common Prefix Array"? Beller et al.'s algorithm (queue-based) can be tweaked to use
The algorithms of Nishimoto and Tabei are fundamentally sequential. The one based on the algorithm by Beller et al. corresponds to a breadth-first traversal of the suffix tree, and it outputs the LCP values at ST node boundaries. I don't see a way of parallelizing that over tens of CPU cores. In general, stringology research is still mostly concerned with texts that are gigabytes or at most tens of gigabytes in size. The GBWT works with terabyte-scale data, and it often has to resort to simple construction algorithms that can be broken down into a large number of independent tasks.
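To make the "breadth-first traversal" point concrete, here is a hedged sketch of the queue-based algorithm with a deliberately naive rank (real implementations use a wavelet tree); the single shared LCP array and the level-by-level processing order are what make it hard to split into independent tasks:

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Sketch only: queue-based LCP construction in the style of Beller et al.
// Dequeued intervals correspond to suffix tree nodes visited in breadth-first
// (string depth) order, and each new LCP value is written at the right
// boundary of a backward extension.
std::vector<int64_t> beller_style_lcp(const std::string& bwt) {
    const uint64_t n = bwt.size();
    std::vector<uint64_t> C(257, 0);                      // C[c] = #symbols smaller than c
    for (unsigned char c : bwt) { C[c + 1]++; }
    for (int c = 1; c <= 256; c++) { C[c] += C[c - 1]; }
    auto rank = [&](int c, uint64_t pos) {                // occurrences of c in bwt[0..pos)
        uint64_t r = 0;
        for (uint64_t i = 0; i < pos; i++) { if ((unsigned char) bwt[i] == c) { r++; } }
        return r;
    };

    std::vector<int64_t> lcp(n + 1, -1);                  // -1 = not assigned yet
    std::queue<std::pair<std::pair<uint64_t, uint64_t>, uint64_t>> q;
    q.push({{0, n - 1}, 0});                              // ([i, j], string depth)
    while (!q.empty()) {
        auto [iv, d] = q.front(); q.pop();
        for (int c = 0; c < 256; c++) {
            uint64_t lb = C[c] + rank(c, iv.first);       // backward extension by c
            uint64_t rb = C[c] + rank(c, iv.second + 1);  // exclusive right end
            if (lb >= rb) { continue; }                   // c does not occur in the interval
            if (rb < n && lcp[rb] == -1) {
                lcp[rb] = (int64_t) d;                    // LCP value at the interval boundary
                q.push({{lb, rb - 1}, d + 1});
            }
        }
    }
    return lcp;
}
```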
Fair enough. I briefly checked the current lock-free queue landscape, but I don't think those are going to cut it either (the best implementations achieve throughput around 10M operations/second in balanced enqueue/dequeue scenarios).
I've been using this one: https://github.com/max0x7ba/atomic_queue
This is what you tested?
Yes, this one. I also looked at https://github.com/mpoeter/xenium. That said, I only looked at the benchmark results and have no hands-on experience. Overall I agree that the simplest solution is the best; it would be interesting to check later whether more complicated approaches are more performant, but it's not at all obvious that they will be.
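In case it helps future benchmarking, a rough single-producer/single-consumer throughput sketch; the `atomic_queue::AtomicQueue` type, the header path, and the `push`/`pop` calls are taken from that project's README and should be checked against the actual library:

```cpp
#include <atomic_queue/atomic_queue.h>  // https://github.com/max0x7ba/atomic_queue

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

// Rough throughput check, assuming the README interface: a fixed-capacity
// AtomicQueue<T, SIZE> with blocking push()/pop(). Values start from 1
// because this queue variant reserves a NIL element (0 by default) -- an
// assumption based on the project documentation.
int main() {
    constexpr uint64_t kItems = 10'000'000;
    static atomic_queue::AtomicQueue<uint64_t, 65536> queue;

    auto start = std::chrono::steady_clock::now();
    std::thread producer([&] {
        for (uint64_t i = 1; i <= kItems; i++) { queue.push(i); }
    });
    uint64_t sum = 0;
    for (uint64_t i = 0; i < kItems; i++) { sum += queue.pop(); }
    producer.join();
    double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    std::printf("checksum %llu, %.1f M ops/s\n",
                (unsigned long long) sum, kItems / secs / 1e6);
    return 0;
}
```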
In the PrivVG project, document frequencies will be necessary in order to satisfy the differential privacy definition (a single new document might inflate counts a lot if it visits the same edge many times). While `FastLocate` is good enough for initial prototyping, it won't scale well. I noticed there's support for that in GCSA2 (`SadaCount` / `SadaSparse`); can we please have that in GBWT as well? (cc: @ekg)
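For reference, a hedged sketch of the prototyping approach described above: locate every occurrence in the matched range and count distinct documents. The `locate_occurrences` and `document_of` helpers are hypothetical placeholders standing in for `FastLocate`-style queries, not the actual GBWT API:

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for FastLocate-style queries (the real GBWT API
// differs): one sequence identifier per occurrence of the pattern, and a
// mapping from sequence to document (e.g. a sample).
std::vector<uint64_t> locate_occurrences() {
    return {3, 7, 3, 12, 7, 3};    // placeholder data for illustration
}
uint64_t document_of(uint64_t sequence_id) {
    return sequence_id / 2;        // placeholder: two haplotypes per sample
}

// Document frequency by brute force: one locate result per occurrence, which
// is the part that stops scaling once a pattern has millions of occurrences.
// A Sadakane-style counting structure would return this number directly.
uint64_t document_frequency() {
    std::unordered_set<uint64_t> docs;
    for (uint64_t seq_id : locate_occurrences()) {
        docs.insert(document_of(seq_id));
    }
    return docs.size();
}

int main() {
    std::cout << "distinct documents: " << document_frequency() << '\n';
    return 0;
}
```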