Fixes for get partition boundaries #475

Merged: 15 commits, Nov 30, 2023

@leerho leerho commented Oct 26, 2023

This applies the same fix that eliminates duplicate entries when using getPartitionBoundaries(...) for small values of N, in all the quantile sketches.
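
To illustrate why small N produces duplicate entries, here is a minimal sketch. It is not the library's code: the class `BoundaryDedupSketch` and its `boundaries` method are hypothetical, and it reads a plain sorted array instead of a sketch. When more partitions are requested than there are distinct items, several evenly spaced ranks land on the same item, so repeated boundaries are dropped.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical names, not the DataSketches implementation. */
final class BoundaryDedupSketch {

  /**
   * Picks boundaries at evenly spaced ranks of an already sorted array and
   * drops any boundary that equals the previous one.
   */
  static List<Integer> boundaries(int[] sorted, int numParts) {
    if (sorted.length == 0 || numParts < 1) {
      throw new IllegalArgumentException("need data and at least one partition");
    }
    List<Integer> out = new ArrayList<>();
    long n = sorted.length;
    for (int i = 0; i <= numParts; i++) {
      // Index of the item at normalized rank i / numParts, clamped to the last item.
      int idx = (int) Math.min(n - 1, (i * n) / numParts);
      int value = sorted[idx];
      // Skip a boundary that repeats the one just emitted.
      if (out.isEmpty() || out.get(out.size() - 1) != value) {
        out.add(value);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Three items, six requested partitions: repeated boundaries collapse.
    System.out.println(boundaries(new int[] {1, 2, 3}, 6)); // prints [1, 2, 3]
  }
}
```

Without the duplicate check, the same call would return seven boundaries, most of them repeated.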

@leerho leerho marked this pull request as draft October 26, 2023 16:02
- This is a large number of changes.

- The problem detected by the Druid team is fixed, so now the
"getPartitionBoundaries" works for input streams that are larger than
Integer.MAX_VALUE.

- This fix applies to both the KllItemsSketch and the classic
ItemsSketch.  These are the only two sketches, for now, that will
support the "getPartitionBoundaries" functionality. This is enforced via
a new "PartitioningFeature" API interface.

- In addition, there is a new "partitions" package that addresses the
limited accuracy of our quantiles sketches when they are asked to
partition very large input streams.  This package can partition
streams of almost unlimited size with very small variation in the
resulting partition sizes. I have tested this with streams as large as
30E12 elements.

- I have reduced code duplication in a number of places. Specifically,
all the quantile sketch sorted view classes use only 3 iterator
implementations: float, double and generic. Further
consolidation can be done across the sorted view classes
themselves, but that will have to be done later.

- Javadocs have been improved in a number of places and I have fixed
spelling errors where I saw them.
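
As a rough illustration of the two points about stream sizes above Integer.MAX_VALUE and about restricting the feature through an interface, here is a minimal sketch. It does not reproduce the actual "PartitioningFeature" API: the `Partitionable` interface, the `LongRankPartitioner` class and the method signature are all assumptions made for this example. It only shows rank arithmetic done in `long` so that totals beyond the `int` range stay correct.

```java
import java.util.Arrays;

/** Hypothetical capability interface; the name and signature are assumptions. */
interface Partitionable {
  /** Returns the cumulative item count at the end of each partition. */
  long[] partitionEndRanks(long totalN, int numParts);
}

/** Illustrative only: not the DataSketches implementation. */
final class LongRankPartitioner implements Partitionable {

  @Override
  public long[] partitionEndRanks(long totalN, int numParts) {
    if (totalN < 1 || numParts < 1) {
      throw new IllegalArgumentException("totalN and numParts must be positive");
    }
    long q = totalN / numParts; // whole items per partition
    long r = totalN % numParts; // leftover items to spread across partitions
    long[] ends = new long[numParts];
    for (int i = 1; i <= numParts; i++) {
      // Equals floor(totalN * i / numParts) without forming the product
      // totalN * i, which could overflow a long for very large streams.
      ends[i - 1] = q * i + (r * i) / numParts;
    }
    return ends;
  }

  public static void main(String[] args) {
    long n = 3_000_000_000L; // larger than Integer.MAX_VALUE
    long[] ends = new LongRankPartitioner().partitionEndRanks(n, 4);
    System.out.println(Arrays.toString(ends));
    System.out.println((int) n); // an int cast wraps around to a negative value
  }
}
```

Only classes that implement the interface expose the partitioning call, which is one way to limit a feature to the sketches that support it.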
@leerho leerho marked this pull request as ready for review November 16, 2023 23:19
leerho commented Nov 16, 2023

This is the final review before I start the release process for Java - 5.0.0

1. providing a main(...)
2. Initiate via TestNG
3. Direct programmatic access.

I also split out the reporting method, as it was duplicated code.
The code included here works fine for moderate-sized partitioning
tasks. As an example, using the test code in the test branch, the
partitioning task of splitting a data set of 1 billion items into 324
partitions of about 3M items each completed in under 3 minutes on a
single CPU. For much larger partitioning tasks, it is
recommended that this code be adapted to run in a parallelized systems
environment.
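
The figures quoted above can be checked with a little arithmetic. The class and method names below are hypothetical and exist only for this calculation. The only inputs are the numbers stated in the comment: 1 billion items, 324 partitions, and an upper bound of 3 minutes.

```java
/** Illustrative arithmetic only; names are hypothetical. */
final class PartitionTaskArithmetic {

  /** Mean number of items per partition, rounded to the nearest item. */
  static long meanPartitionSize(long totalItems, int numParts) {
    return Math.round((double) totalItems / numParts);
  }

  /** Lower bound on throughput when the task finishes within maxSeconds. */
  static double minItemsPerSecond(long totalItems, long maxSeconds) {
    return (double) totalItems / maxSeconds;
  }

  public static void main(String[] args) {
    long total = 1_000_000_000L;
    // About 3.09 million items per partition.
    System.out.println(meanPartitionSize(total, 324));
    // More than 5.5 million items per second if the run took under 180 seconds.
    System.out.println(minItemsPerSecond(total, 180));
  }
}
```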

I made some minor tweaks to the test code examples.
@leerho leerho merged commit e06ae7d into master Nov 30, 2023
4 checks passed
@leerho leerho deleted the Fixes_for_getPartitionBoundaries branch November 30, 2023 19:54
3 participants