Datehistogram improvement #170

bowenlan-amzn

Description

[Describe what this change achieves]

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: bowenlan-amzn <[email protected]>
Comment on lines +366 to +377
while (i < targetBuckets) {
// Calculate the lower bucket bound
final byte[] lower = new byte[8];
NumericUtils.longToSortableBytes(Math.max(roundedLow, low), lower, 0);
// Calculate the upper bucket bound
final byte[] upper = new byte[8];
roundedLow = preparedRounding.round(roundedLow + interval);
// Subtract -1 if the minimum is roundedLow as roundedLow itself
// is included in the next bucket
NumericUtils.longToSortableBytes(Math.min(roundedLow - 1, high), upper, 0);

filters[i++] = context.searcher().createWeight(new PointRangeQuery(field, lower, upper, 1) {
Owner

This logic is flawed: the loop keeps creating filters until it reaches the target bucket count, when it should stop at the upper and lower bounds of the data. For example, the query below covers only 2015, yet the response contains empty buckets for 2016 and 2017:

% curl -s -X GET "localhost:9200/nyc_taxis/_search?pretty" -H 'Content-Type: application/json' -d'{"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-01-01 01:04:06","lt": "2016-01-01 00:00:00"}}},"aggs": {"dropoffs_over_time": {"auto_date_histogram": {"field": "dropoff_datetime","buckets": "4"}}}}'
{
  "took" : 4295,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 100,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "dropoffs_over_time" : {
      "buckets" : [
        {
          "key_as_string" : "2015-01-01 00:00:00",
          "key" : 1420070400000,
          "doc_count" : 100
        },
        {
          "key_as_string" : "2016-01-01 00:00:00",
          "key" : 1451606400000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "2017-01-01 00:00:00",
          "key" : 1483228800000,
          "doc_count" : 0
        }
      ],
      "interval" : "1y"
    }
  }
}
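
One way to read the suggestion above is to let the loop terminate on the upper bound rather than on `targetBuckets`. The following is only a minimal sketch under simplifying assumptions: it uses a fixed-width interval, and `roundDown` and `bucketRanges` are hypothetical stand-ins (not part of the PR) for `preparedRounding.round` and the filter-building loop. Calendar units such as years are not fixed-width, so the real code would still need to go through the prepared rounding.

```java
import java.util.ArrayList;
import java.util.List;

final class BoundedBuckets {
    // Hypothetical stand-in for preparedRounding.round:
    // floors a value to the nearest lower multiple of a fixed interval.
    static long roundDown(long value, long interval) {
        return Math.floorDiv(value, interval) * interval;
    }

    // Builds inclusive [lower, upper] ranges that cover [low, high] and nothing beyond it.
    static List<long[]> bucketRanges(long low, long high, long interval) {
        List<long[]> ranges = new ArrayList<>();
        long roundedLow = roundDown(low, interval);
        // Stop once the next bucket would start past the upper bound,
        // instead of looping until a target bucket count is reached.
        while (roundedLow <= high) {
            long lower = Math.max(roundedLow, low);
            roundedLow += interval;
            // roundedLow now marks the start of the next bucket, so subtract 1
            // to keep the current upper bound inclusive.
            long upper = Math.min(roundedLow - 1, high);
            ranges.add(new long[] { lower, upper });
        }
        return ranges;
    }
}
```

With `low = 5`, `high = 25` and `interval = 10`, this yields the three ranges `[5, 9]`, `[10, 19]` and `[20, 25]`, and no trailing empty range.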

Comment on lines +349 to +357
roundingInfosLoop: do {
RoundingInfo curRoundingInfo = roundingInfos[roundingIdx];
for (int curInnerInterval: curRoundingInfo.innerIntervals) {
if (bestDuration <= curInnerInterval * curRoundingInfo.roughEstimateDurationMillis) {
interval = curInnerInterval * curRoundingInfo.roughEstimateDurationMillis;
break roundingInfosLoop;
}
}
roundingIdx++;
Owner

We probably need to advance preparedRounding to the next index as well. Otherwise it always stays at index 0 (second-level rounding).
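
A minimal sketch of the idea, assuming simplified stand-in types: the selection returns the chosen rounding index together with the interval, so the caller can prepare the rounding for that same index. `RoundingInfo`, `Choice` and `choose` below are hypothetical illustrations, not the classes from the PR.

```java
final class RoundingSelection {
    // Hypothetical, simplified stand-in for the RoundingInfo used in the diff.
    static final class RoundingInfo {
        final long roughEstimateDurationMillis;
        final int[] innerIntervals;

        RoundingInfo(long roughEstimateDurationMillis, int[] innerIntervals) {
            this.roughEstimateDurationMillis = roughEstimateDurationMillis;
            this.innerIntervals = innerIntervals;
        }
    }

    // Pairs the chosen interval with the index of the rounding it came from.
    static final class Choice {
        final int roundingIdx;
        final long interval;

        Choice(int roundingIdx, long interval) {
            this.roundingIdx = roundingIdx;
            this.interval = interval;
        }
    }

    // Picks the first interval that is at least bestDuration and reports which
    // rounding index produced it, so the rounding can be prepared for that index.
    static Choice choose(RoundingInfo[] infos, long bestDuration) {
        for (int idx = 0; idx < infos.length; idx++) {
            RoundingInfo info = infos[idx];
            for (int inner : info.innerIntervals) {
                long candidate = inner * info.roughEstimateDurationMillis;
                if (bestDuration <= candidate) {
                    return new Choice(idx, candidate);
                }
            }
        }
        // Nothing was large enough: fall back to the coarsest available interval.
        RoundingInfo last = infos[infos.length - 1];
        int[] inners = last.innerIntervals;
        return new Choice(infos.length - 1, inners[inners.length - 1] * last.roughEstimateDurationMillis);
    }
}
```

In the aggregator, the returned index would then presumably be passed to something like `prepareRounding(choice.roundingIdx)` so that the interval and the rounding stay in step.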

Comment on lines +398 to +408
for (i = 0; i < filters.length; i++) {
long bucketOrd = bucketOrds.add(
owningBucketOrd,
preparedRounding.round(NumericUtils.sortableBytesToLong(((PointRangeQuery) filters[i].getQuery()).getLowerPoint(), 0))
);
if (bucketOrd < 0) { // already seen
bucketOrd = -1 - bucketOrd;
}
incrementBucketDocCount(bucketOrd, counts[i]);
}
throw new CollectionTerminatedException();
Owner

We can be aggressive during bucket creation and invoke the merge logic if the number of buckets exceeds the target bucket count:

do {
    try (LongKeyedBucketOrds oldOrds = bucketOrds) {
        preparedRounding = prepareRounding(++roundingIdx);
        long[] mergeMap = new long[Math.toIntExact(oldOrds.size())];
        bucketOrds = new LongKeyedBucketOrds.FromSingle(context.bigArrays());
        LongKeyedBucketOrds.BucketOrdsEnum ordsEnum = oldOrds.ordsEnum(0);
        while (ordsEnum.next()) {
            long oldKey = ordsEnum.value();
            long newKey = preparedRounding.round(oldKey);
            long newBucketOrd = bucketOrds.add(0, newKey);
            mergeMap[(int) ordsEnum.ord()] = newBucketOrd >= 0 ? newBucketOrd : -1 - newBucketOrd;
        }
        merge(mergeMap, bucketOrds.size());
    }
} while (roundingIdx < roundingInfos.length - 1
    && (bucketOrds.size() > targetBuckets * roundingInfos[roundingIdx].getMaximumInnerInterval()
        || max - min > targetBuckets * roundingInfos[roundingIdx].getMaximumRoughEstimateDurationMillis()));
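
The core of the snippet above is the merge map, which records for each old bucket ordinal the ordinal of the coarser bucket it collapses into. The following is a self-contained sketch of that mapping using plain Java collections. `BucketMergeSketch`, `roundDown`, `buildMergeMap` and `mergeCounts` are hypothetical names, a `LinkedHashMap` stands in for `LongKeyedBucketOrds`, and a fixed-width interval stands in for the prepared rounding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BucketMergeSketch {
    // Hypothetical stand-in for preparedRounding.round at a coarser, fixed-width interval.
    static long roundDown(long key, long interval) {
        return Math.floorDiv(key, interval) * interval;
    }

    // mergeMap[oldOrd] is the ordinal of the coarser bucket that the old bucket falls into.
    static int[] buildMergeMap(long[] oldKeys, long coarserInterval) {
        Map<Long, Integer> newOrds = new LinkedHashMap<>();
        int[] mergeMap = new int[oldKeys.length];
        for (int oldOrd = 0; oldOrd < oldKeys.length; oldOrd++) {
            long newKey = roundDown(oldKeys[oldOrd], coarserInterval);
            Integer newOrd = newOrds.get(newKey);
            if (newOrd == null) {
                // First time this coarser key is seen: assign the next ordinal.
                newOrd = newOrds.size();
                newOrds.put(newKey, newOrd);
            }
            mergeMap[oldOrd] = newOrd;
        }
        return mergeMap;
    }

    // Folds the old per-bucket doc counts into the coarser buckets.
    static long[] mergeCounts(long[] oldCounts, int[] mergeMap) {
        int newSize = 0;
        for (int ord : mergeMap) {
            newSize = Math.max(newSize, ord + 1);
        }
        long[] merged = new long[newSize];
        for (int oldOrd = 0; oldOrd < oldCounts.length; oldOrd++) {
            merged[mergeMap[oldOrd]] += oldCounts[oldOrd];
        }
        return merged;
    }
}
```

For old keys `{0, 10, 20, 30, 40}` and a coarser interval of 20, the merge map is `{0, 0, 1, 1, 2}`, and doc counts `{1, 2, 3, 4, 5}` fold into `{3, 7, 5}`.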

@jainankitk jainankitk deleted the branch jainankitk:date-histo November 29, 2023 14:50
@jainankitk jainankitk closed this Nov 29, 2023
@bowenlan-amzn bowenlan-amzn deleted the datehistogram-improvement branch January 4, 2024 17:19
@bowenlan-amzn bowenlan-amzn restored the datehistogram-improvement branch January 4, 2024 17:20
@bowenlan-amzn bowenlan-amzn deleted the datehistogram-improvement branch January 4, 2024 17:20
2 participants