Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature request] document counting #7

Open
lomereiter opened this issue Jan 29, 2021 · 9 comments
Open

[feature request] document counting #7

lomereiter opened this issue Jan 29, 2021 · 9 comments

Comments

@lomereiter
Copy link

In PrivVG project, document frequencies will be necessary in order to satisfy the differential privacy definition (a single new document might inflate counts a lot if it visits the same edge many times). While FastLocate is good enough for initial prototyping, it won't scale well.

I noticed there's support for that in GCSA2 (SadaCount/SadaSparse), can we please have that in GBWT as well? (cc: @ekg)

@jltsiren
Copy link
Owner

I'll have to think this through.

Sadakane's document counting structure belongs to a family of data structures that store information in the internal nodes of the suffix tree. While the final structures can be compressed well, the construction algorithms usually store the data explicitly.

The SadaCount construction algorithm uses the LCP array to simulate a single-threaded traversal of the suffix tree. It maps each internal node to its inorder rank(s) and stores the information in a single array. This approach works well enough in GCSA2, where the effective text size is ~10^10 for a 1000GP graph. For the GBWT of the same data with the effective text size over 10^12, we would need something much faster and more space-efficient.

The bottleneck seems to be LCP array construction. Assuming that we can build it efficiently, the rest of the construction should not be a problem. Each GBWT node corresponds to a subtree of the root of the suffix tree, so we can traverse the subtrees independently in parallel and compress the intermediate results.

@lomereiter
Copy link
Author

lomereiter commented Jan 30, 2021 via email

@jltsiren
Copy link
Owner

jltsiren commented Feb 2, 2021

That algorithm is not nearly fast or space-efficient enough. Extrapolating from the results in the paper, LCP array construction for the 1000GP GBWT would require 5-6 days and over 2 TB memory.

The irreducible LCP algorithm (https://doi.org/10.1007/978-3-642-02441-2_17) is more promising. I had an implementation in the RLCSA that required minimal working space on top of the FM-index and the run-length encoded PLCP array. It was reasonably fast with highly repetitive texts, and it should be possible to parallelize it to take advantage of tens of CPU cores.

@jltsiren
Copy link
Owner

jltsiren commented Feb 2, 2021

I believe we would need roughly the following:

  • SDSL: Run-length encoded bitvector that can be built incrementally. (We will be dealing with bitvectors with over 10^12 zeros and ones.)
  • SDSL / GBWT: PLCP using the RLE bitvector.
  • GBWT: PLCP construction from (dynamic?) GBWT using multithreaded irreducible LCP algorithm.
  • GBWT: SadaSparse (or another variant of Sadakane's document counting structure if it makes more sense).
  • GBWT: Multithreaded SadaSparse construction using FastLocate and the PLCP array.

@lomereiter
Copy link
Author

I assume you are referring to "Sampled Longest Common Prefix Array"?

Beller et al's algorithm (queue-based) can be tweaked to use O(#runs) construction space using very similar ideas, please take a look at section 3 here: https://arxiv.org/pdf/2004.01493.pdf (compare their "incremental relation" and lemma 5 in your paper)

@jltsiren
Copy link
Owner

jltsiren commented Feb 3, 2021

The algorithms of Nishimoto and Tabei are fundementally sequential. The one based on the algorithm by Beller et al. corresponds to a breadth-first traversal of the suffix tree, and it outputs the LCP values at ST node boundaries. I don't see a way of parallelizing that over tens of CPU cores.

In general, stringology research is still mostly concerned about texts that are gigabytes or at most tens of gigabytes in size. The GBWT works with terabyte-scale data, and it often has to resort to simple construction algorithms that can be broken down into a large number of independent tasks.

@lomereiter
Copy link
Author

Fair enough. I briefly checked current lock-free queue landscape, but I don't think they are going to cut it either (best implementations achieve throughput around 10M operations/second in balanced enqueue/dequeue scenarios).

@ekg
Copy link
Contributor

ekg commented Feb 3, 2021 via email

@lomereiter
Copy link
Author

Yes, this one. I also looked at https://github.com/mpoeter/xenium. That said, I only looked at the benchmark results and have no hands-on experience.

Overall I agree that the simplest solution is the best, it would be interesting to check later if more complicated approaches are more performant but it's not at all obvious that they'll be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants