-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Dynamic Pruning in Cardinality Aggregation #13821
Support Dynamic Pruning in Cardinality Aggregation #13821
Conversation
91f7a04
to
c2775ac
Compare
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
c2775ac
to
8fc9bee
Compare
❌ Gradle check result for c2775ac: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 91f7a04: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 8fc9bee: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: bowenlan-amzn <[email protected]>
8fc9bee
to
85133c4
Compare
❌ Gradle check result for 85133c4: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
❕ Gradle check result for 9d4701c: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #13821 +/- ##
============================================
+ Coverage 71.42% 71.66% +0.24%
- Complexity 59978 61795 +1817
============================================
Files 4985 5105 +120
Lines 282275 290175 +7900
Branches 40946 41917 +971
============================================
+ Hits 201603 207968 +6365
- Misses 63999 65043 +1044
- Partials 16673 17164 +491 ☔ View full report in Codecov by Sentry. |
❌ Gradle check result for d53182b: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 77fceea: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Pretty interesting, seems this failure can be reproduced consistently with the seed
|
❌ Gradle check result for 09f957c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 09f957c: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
The backport to
To backport manually, run these commands in your terminal: # Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-13821-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 2f8cb079c2c220304b8e6965853a15732c3bf57c
# Push it to GitHub
git push --set-upstream origin backport/backport-13821-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x Then, create a pull request where the |
…t#13821) * Cardinality aggregation dynamic pruning changes Signed-off-by: bowenlan-amzn <[email protected]> * Reading Signed-off-by: bowenlan-amzn <[email protected]> * remaining disjunction scorer full understand Signed-off-by: bowenlan-amzn <[email protected]> * utilize competitive iterator api to perform pruning Signed-off-by: bowenlan-amzn <[email protected]> * handle missing input Signed-off-by: bowenlan-amzn <[email protected]> * add change log Signed-off-by: bowenlan-amzn <[email protected]> * clean up Signed-off-by: bowenlan-amzn <[email protected]> * Clean up Signed-off-by: bowenlan-amzn <[email protected]> * Test fix Signed-off-by: bowenlan-amzn <[email protected]> * Do all the scoring within Cardinality Signed-off-by: bowenlan-amzn <[email protected]> * clean unnecessary Signed-off-by: bowenlan-amzn <[email protected]> * fix Signed-off-by: bowenlan-amzn <[email protected]> * Add dynamic flag for this feature Signed-off-by: bowenlan-amzn <[email protected]> * Add random test, small bug fix Signed-off-by: bowenlan-amzn <[email protected]> * address comment Signed-off-by: bowenlan-amzn <[email protected]> * Address comments Signed-off-by: bowenlan-amzn <[email protected]> * address comments Signed-off-by: bowenlan-amzn <[email protected]> --------- Signed-off-by: bowenlan-amzn <[email protected]> Co-authored-by: Rishabh Maurya <[email protected]>
* Cardinality aggregation dynamic pruning changes * Reading * remaining disjunction scorer full understand * utilize competitive iterator api to perform pruning * handle missing input * add change log * clean up * Clean up * Test fix * Do all the scoring within Cardinality * clean unnecessary * fix * Add dynamic flag for this feature * Add random test, small bug fix * address comment * Address comments * address comments --------- Signed-off-by: bowenlan-amzn <[email protected]> Co-authored-by: Rishabh Maurya <[email protected]>
…t#13821) * Cardinality aggregation dynamic pruning changes Signed-off-by: bowenlan-amzn <[email protected]> * Reading Signed-off-by: bowenlan-amzn <[email protected]> * remaining disjunction scorer full understand Signed-off-by: bowenlan-amzn <[email protected]> * utilize competitive iterator api to perform pruning Signed-off-by: bowenlan-amzn <[email protected]> * handle missing input Signed-off-by: bowenlan-amzn <[email protected]> * add change log Signed-off-by: bowenlan-amzn <[email protected]> * clean up Signed-off-by: bowenlan-amzn <[email protected]> * Clean up Signed-off-by: bowenlan-amzn <[email protected]> * Test fix Signed-off-by: bowenlan-amzn <[email protected]> * Do all the scoring within Cardinality Signed-off-by: bowenlan-amzn <[email protected]> * clean unnecessary Signed-off-by: bowenlan-amzn <[email protected]> * fix Signed-off-by: bowenlan-amzn <[email protected]> * Add dynamic flag for this feature Signed-off-by: bowenlan-amzn <[email protected]> * Add random test, small bug fix Signed-off-by: bowenlan-amzn <[email protected]> * address comment Signed-off-by: bowenlan-amzn <[email protected]> * Address comments Signed-off-by: bowenlan-amzn <[email protected]> * address comments Signed-off-by: bowenlan-amzn <[email protected]> --------- Signed-off-by: bowenlan-amzn <[email protected]> Co-authored-by: Rishabh Maurya <[email protected]>
…t#13821) (opensearch-project#14203) * Cardinality aggregation dynamic pruning changes * Reading * remaining disjunction scorer full understand * utilize competitive iterator api to perform pruning * handle missing input * add change log * clean up * Clean up * Test fix * Do all the scoring within Cardinality * clean unnecessary * fix * Add dynamic flag for this feature * Add random test, small bug fix * address comment * Address comments * address comments --------- Signed-off-by: bowenlan-amzn <[email protected]> Co-authored-by: Rishabh Maurya <[email protected]> Signed-off-by: kkewwei <[email protected]>
…t#13821) * Cardinality aggregation dynamic pruning changes Signed-off-by: bowenlan-amzn <[email protected]> * Reading Signed-off-by: bowenlan-amzn <[email protected]> * remaining disjunction scorer full understand Signed-off-by: bowenlan-amzn <[email protected]> * utilize competitive iterator api to perform pruning Signed-off-by: bowenlan-amzn <[email protected]> * handle missing input Signed-off-by: bowenlan-amzn <[email protected]> * add change log Signed-off-by: bowenlan-amzn <[email protected]> * clean up Signed-off-by: bowenlan-amzn <[email protected]> * Clean up Signed-off-by: bowenlan-amzn <[email protected]> * Test fix Signed-off-by: bowenlan-amzn <[email protected]> * Do all the scoring within Cardinality Signed-off-by: bowenlan-amzn <[email protected]> * clean unnecessary Signed-off-by: bowenlan-amzn <[email protected]> * fix Signed-off-by: bowenlan-amzn <[email protected]> * Add dynamic flag for this feature Signed-off-by: bowenlan-amzn <[email protected]> * Add random test, small bug fix Signed-off-by: bowenlan-amzn <[email protected]> * address comment Signed-off-by: bowenlan-amzn <[email protected]> * Address comments Signed-off-by: bowenlan-amzn <[email protected]> * address comments Signed-off-by: bowenlan-amzn <[email protected]> --------- Signed-off-by: bowenlan-amzn <[email protected]> Co-authored-by: Rishabh Maurya <[email protected]>
Description
First I want to thanks Rishabh's work on this which I learnt from.
Background
Cardinality aggregation count the number of distinct values of a field.
Some common use cases are counting the visitors of a service, detect anomly of unique IP address or event types and count unique product or brands in e-commerce, etc.
Idea
By defaults, aggregation works on every document that matches the top level query, collects them in an iterative manner.
However, once we collect one document, we don't need to care about the values of this document anymore, since they are already counted as the distinct values in cardinality aggregation results.
So the idea is to stop working on other documents that also have these collected values as the aggregation going on.
Implementation
To implement this idea, we need to know all the possible values of the field and the documents of each value before the aggregation started.
Lucene provide a dictionary of terms and a list the documents containing these terms.
We can build a disjunction of the documentID iterators for these term values. Every time after we collect values of a doc, we remove the iterators of the collected. So we actually skipped many documents for the next doc iteration and won't see these collected values anymore.
Lucene provide a dictionary of terms and the documents that contain these terms.
However, for numeric field, lucene provides a tree structure that let user traverse the numeric values in sorted order. So it would be different to do the same kind of dynamic pruning for numeric field. It's still possible but we can work on it as a follow up considering numeric cardinality is less common comparing to keyword term cardinality.
One shard may contain multiple segments, and by default segments are processed in sequential. If the seen values are saved at shard level, for the subsequential segments, there's no need to build the iterator for these values.
This will be a follow up tasks.
Public Facing Changes
New cluster setting that can dynamically tune the threshold of supported maximum cardinality for doing the pruning
search.dynamic_pruning.cardinality_aggregation_threshold
Set to 0 can disable the feature.
Currently use 100 which should be a safe number.
opensearch-project/documentation-website#7334
opensearch-project/documentation-website#7341
Benchmark
Add cardinality aggregation operations in big5 workload
opensearch-project/opensearch-benchmark-workloads#311
Performance improvements before and after
Related Issues
Resolves #11959
Check List
API changes companion pull request created.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.