Datehistogram improvement #170

bowenlan-amzn · 2023-11-16T04:57:44Z

Description

[Describe what this change achieves]

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
Commits are signed per the DCO using --signoff
Commit changes are listed out in CHANGELOG.md file (See: Changelog)
Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: bowenlan-amzn <[email protected]>

jainankitk · 2023-11-17T13:56:19Z

...in/java/org/opensearch/search/aggregations/bucket/histogram/AutoDateHistogramAggregator.java

+            while (i < targetBuckets) {
+                // Calculate the lower bucket bound
+                final byte[] lower = new byte[8];
+                NumericUtils.longToSortableBytes(Math.max(roundedLow, low), lower, 0);
+                // Calculate the upper bucket bound
+                final byte[] upper = new byte[8];
+                roundedLow = preparedRounding.round(roundedLow + interval);
+                // Subtract -1 if the minimum is roundedLow as roundedLow itself
+                // is included in the next bucket
+                NumericUtils.longToSortableBytes(Math.min(roundedLow - 1, high), upper, 0);
+
+                filters[i++] = context.searcher().createWeight(new PointRangeQuery(field, lower, upper, 1) {


This logic has a problem: it focuses on creating the required number of target buckets, whereas it should limit bucket creation to the lower and upper bounds. For example, I see the following response, which contains empty buckets (2016 and 2017) beyond the upper bound of the query range:

% curl -s -X GET "localhost:9200/nyc_taxis/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 01:04:06",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "auto_date_histogram": {
        "field": "dropoff_datetime",
        "buckets": "4"
      }
    }
  }
}'
{
  "took" : 4295,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 100,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "dropoffs_over_time" : {
      "buckets" : [
        {
          "key_as_string" : "2015-01-01 00:00:00",
          "key" : 1420070400000,
          "doc_count" : 100
        },
        {
          "key_as_string" : "2016-01-01 00:00:00",
          "key" : 1451606400000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "2017-01-01 00:00:00",
          "key" : 1483228800000,
          "doc_count" : 0
        }
      ],
      "interval" : "1y"
    }
  }
}
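One possible way to bound the loop is sketched below. It is a simplified, standalone illustration and not the aggregator code: `BoundedBuckets`, `round`, and `ranges` are hypothetical names, and `round` is a fixed-interval floor standing in for `preparedRounding.round`. The only change relative to the loop in the diff is the extra `roundedLow <= high` condition, which stops bucket creation once the upper bound has been covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the bucket-range loop from the diff.
final class BoundedBuckets {

    // Assumption: stand-in for preparedRounding.round, flooring to a multiple of interval.
    static long round(long value, long interval) {
        return Math.floorDiv(value, interval) * interval;
    }

    // Returns inclusive [lower, upper] ranges covering [low, high].
    // Stops at targetBuckets or as soon as the next bucket would start past high.
    static List<long[]> ranges(long low, long high, long interval, int targetBuckets) {
        List<long[]> out = new ArrayList<>();
        long roundedLow = round(low, interval);
        while (out.size() < targetBuckets && roundedLow <= high) {
            long lower = Math.max(roundedLow, low);
            roundedLow = round(roundedLow + interval, interval);
            // roundedLow now belongs to the next bucket, so the upper bound is roundedLow - 1
            long upper = Math.min(roundedLow - 1, high);
            out.add(new long[] { lower, upper });
        }
        return out;
    }
}
```

With this condition, a data range that fits inside a single interval yields one range rather than `targetBuckets` ranges, so no trailing empty buckets are created.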

jainankitk · 2023-11-17T13:57:35Z

...in/java/org/opensearch/search/aggregations/bucket/histogram/AutoDateHistogramAggregator.java

+            roundingInfosLoop: do {
+                RoundingInfo curRoundingInfo = roundingInfos[roundingIdx];
+                for (int curInnerInterval: curRoundingInfo.innerIntervals) {
+                    if (bestDuration <= curInnerInterval * curRoundingInfo.roughEstimateDurationMillis) {
+                        interval = curInnerInterval * curRoundingInfo.roughEstimateDurationMillis;
+                        break roundingInfosLoop;
+                    }
+                }
+                roundingIdx++;


We probably need to advance preparedRounding to the next index as well? Otherwise it always stays at index 0 (second-level rounding).
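A minimal sketch of the idea, assuming the interval search returns the rounding index together with the interval so the caller can re-prepare the rounding for that index. `IntervalChoice` and `choose` are hypothetical names, and the arrays passed in are simplified stand-ins for `roundingInfos`, not the real `RoundingInfo` type.

```java
// Hypothetical holder for the result of the interval search.
final class IntervalChoice {
    final int roundingIdx; // index the caller would use to prepare the rounding
    final long interval;   // chosen bucket width in milliseconds

    IntervalChoice(int roundingIdx, long interval) {
        this.roundingIdx = roundingIdx;
        this.interval = interval;
    }

    // Picks the first (rounding, inner interval) pair whose duration covers bestDuration.
    // roughEstimateMillis[i] and innerIntervals[i] describe rounding level i.
    static IntervalChoice choose(long bestDuration, long[] roughEstimateMillis, int[][] innerIntervals) {
        for (int idx = 0; idx < roughEstimateMillis.length; idx++) {
            for (int inner : innerIntervals[idx]) {
                long candidate = inner * roughEstimateMillis[idx];
                if (bestDuration <= candidate) {
                    return new IntervalChoice(idx, candidate);
                }
            }
        }
        // Nothing is large enough: fall back to the coarsest level and its largest inner interval.
        int last = roughEstimateMillis.length - 1;
        int[] inners = innerIntervals[last];
        return new IntervalChoice(last, inners[inners.length - 1] * roughEstimateMillis[last]);
    }
}
```

In the aggregator, the returned `roundingIdx` would presumably be passed to `prepareRounding` so that `preparedRounding` matches the level the interval was taken from.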

jainankitk · 2023-11-17T14:00:00Z

...in/java/org/opensearch/search/aggregations/bucket/histogram/AutoDateHistogramAggregator.java

+            for (i = 0; i < filters.length; i++) {
+                long bucketOrd = bucketOrds.add(
+                    owningBucketOrd,
+                    preparedRounding.round(NumericUtils.sortableBytesToLong(((PointRangeQuery) filters[i].getQuery()).getLowerPoint(), 0))
+                );
+                if (bucketOrd < 0) { // already seen
+                    bucketOrd = -1 - bucketOrd;
+                }
+                incrementBucketDocCount(bucketOrd, counts[i]);
+            }
+            throw new CollectionTerminatedException();


We can be aggressive during bucket creation, and invoke the logic to merge buckets if the bucket count exceeds the target:

do {
    try (LongKeyedBucketOrds oldOrds = bucketOrds) {
        preparedRounding = prepareRounding(++roundingIdx);
        long[] mergeMap = new long[Math.toIntExact(oldOrds.size())];
        bucketOrds = new LongKeyedBucketOrds.FromSingle(context.bigArrays());
        LongKeyedBucketOrds.BucketOrdsEnum ordsEnum = oldOrds.ordsEnum(0);
        while (ordsEnum.next()) {
            long oldKey = ordsEnum.value();
            long newKey = preparedRounding.round(oldKey);
            long newBucketOrd = bucketOrds.add(0, newKey);
            mergeMap[(int) ordsEnum.ord()] = newBucketOrd >= 0 ? newBucketOrd : -1 - newBucketOrd;
        }
        merge(mergeMap, bucketOrds.size());
    }
} while (roundingIdx < roundingInfos.length - 1
    && (bucketOrds.size() > targetBuckets * roundingInfos[roundingIdx].getMaximumInnerInterval()
        || max - min > targetBuckets * roundingInfos[roundingIdx].getMaximumRoughEstimateDurationMillis()));
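The merge step can be illustrated outside the aggregator with plain maps. This is a sketch under simplifying assumptions: `BucketMerge`, `mergeToCoarser`, and `mergeUntilFits` are hypothetical names, bucket keys are re-rounded with a fixed-interval floor instead of `preparedRounding.round`, and the stop condition is simply "at most `targetBuckets` buckets" rather than the full condition in the loop above.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model of merging buckets into a coarser rounding.
final class BucketMerge {

    // Re-rounds every bucket key to the coarser interval and sums the counts that collide.
    static TreeMap<Long, Long> mergeToCoarser(Map<Long, Long> counts, long coarserInterval) {
        TreeMap<Long, Long> merged = new TreeMap<>();
        for (Map.Entry<Long, Long> e : counts.entrySet()) {
            long newKey = Math.floorDiv(e.getKey(), coarserInterval) * coarserInterval;
            merged.merge(newKey, e.getValue(), Long::sum);
        }
        return merged;
    }

    // Moves through increasingly coarse intervals until the bucket count fits the target
    // or no coarser interval is left.
    static TreeMap<Long, Long> mergeUntilFits(Map<Long, Long> counts, long[] intervals, int targetBuckets) {
        TreeMap<Long, Long> current = new TreeMap<>(counts);
        for (int i = 0; i < intervals.length && current.size() > targetBuckets; i++) {
            current = mergeToCoarser(current, intervals[i]);
        }
        return current;
    }
}
```

Document counts are preserved across merges because colliding keys are summed, which mirrors what the `mergeMap` passed to `merge` achieves for bucket ordinals.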

bowenlan-amzn added 2 commits November 13, 2023 09:14

Reading

c418a0c

Signed-off-by: bowenlan-amzn <[email protected]>

apply the optimization on autodatehistogram

2f18078

Signed-off-by: bowenlan-amzn <[email protected]>

jainankitk reviewed Nov 17, 2023

bowenlan-amzn mentioned this pull request Nov 17, 2023

[Date Histogram] Apply the optimization to AutoDateHistogram opensearch-project/OpenSearch#11220

Closed

jainankitk deleted the branch jainankitk:date-histo November 29, 2023 14:50

jainankitk closed this Nov 29, 2023

bowenlan-amzn deleted the datehistogram-improvement branch January 4, 2024 17:19

bowenlan-amzn restored the datehistogram-improvement branch January 4, 2024 17:20

bowenlan-amzn deleted the datehistogram-improvement branch January 4, 2024 17:20
