Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite EARLIEST/LATEST query operators to EARLIEST_BY/LATEST_BY #15095

Merged
merged 13 commits into from
Oct 11, 2023

Conversation

gargvishesh
Copy link
Contributor

Description

EARLIEST and LATEST operators implicitly reference the __time column for calculation of the aggregate value. Since the reference isn't explicit, Calcite sometimes fails to update the __time column name when there's column renaming --such as in the case of nested queries -- resulting in column not found errors.

This change rewrites these operators to EARLIEST_BY and LATEST_BY during query processing to make the reference explicit to Calcite.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.


SqlAggFunction aggFunction;

switch (getName()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this switch is really needed:

  • AggregatorType is an enum; you could probably save it in the constructor - so that no string based switch is necessary
    • and probably the enum could have a method or something which returns the other
  • or you may also accept a second AggregatorType in the constructor ( what should it be rewritten to) so there will be no question what the "other" should be....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified to take the latter approach of passing the replacement SqlAggFunction in the constructor itself.


switch (operands.size()) {
case 1:
return aggFunction.createCall(pos, operands.get(0), new SqlIdentifier(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note:
you could probably use the list constructor of the call; so there is less duplication

List<SqlNode> newOperands = new ArrayList<SqlNode>();
newOperands.add(operands.get(0));
newOperands.add(new SqlIdentifier("__time", pos));
if (operands.size() == 2)
  newOperands.add(operands.get(1));
return aggFunction.createCall(pos, newOperands);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - updated the code likewise.

@@ -340,5 +345,47 @@ private static class EarliestLatestSqlAggFunction extends SqlAggFunction
Optionality.FORBIDDEN
);
}

@Override
public SqlNode rewriteCall(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if after this patch constructors of classes like LongLastAggregatorFactory should still accept timeColumn as null or not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I think it shouldn't. Maybe can take up those modifications in a separate PR.

return aggFunction.createCall(pos, operands.get(0), new SqlIdentifier(
"__time", pos));
case 2:
return aggFunction.createCall(pos, operands.get(0), new SqlIdentifier(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should validate the new call created as well - since we're hard coding the __time now in the function call. It may or may not be present in the aggregation scope - which is actually a bad thing that the LATEST/EARLIEST invocations have.

Copy link
Contributor Author

@gargvishesh gargvishesh Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new call will be validated post the rewrite. But there was still the issue of validator treating rewritten query as originating from the user and flagging unrelatable col __time not found in any table (row x, col y) error. I've now addressed it by introducing a custom SQLIdentifier class.

}
catch (CalciteContextException e) {
if (e.getCause() instanceof SqlValidatorException) {
throw DruidException.forPersona(DruidException.Persona.ADMIN)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
throw DruidException.forPersona(DruidException.Persona.ADMIN)
throw DruidException.forPersona(DruidException.Persona.USER)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chose ADMIN Persona here since the exceptions in existing tests such as testTimeColumnAggregationsOnLookups had had the same -- not sure why though. If that's incorrect, shall update the tests as well.

Copy link
Contributor

@LakshSingla LakshSingla Oct 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the reason you are looking for: https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/planner/QueryHandler.java#L692

However, in this case, it seems like we do know the exact reason why the query failed - It's due to the absence of a time column. So perhaps change the wording to be more assertive. Also, we should change it to user, because of the reason highlighted in the link above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if the reason will always be absence of time column - what if there's a case of a join where __time column is coming from both the tables, in that case I think it could also be an ambiguous column.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what will admin do with these planning errors even if they are hints. The hints are meant for the end-user. I don't want to hold this PR so would let it merge as it is.

@abhishekagarwal87 abhishekagarwal87 merged commit c6ca990 into apache:master Oct 11, 2023
81 checks passed
cryptoe pushed a commit to cryptoe/druid that referenced this pull request Oct 12, 2023
…che#15095)

EARLIEST and LATEST operators implicitly reference the __time column for calculation of the aggregate value. Since the reference isn't explicit, Calcite sometimes fails to update the __time column name when there's column renaming --such as in the case of nested queries -- resulting in column not found errors.

This change rewrites these operators to EARLIEST_BY and LATEST_BY during query processing to make the reference explicit to Calcite.

(cherry picked from commit c6ca990)
@cryptoe cryptoe mentioned this pull request Oct 12, 2023
LakshSingla pushed a commit that referenced this pull request Oct 12, 2023
)

EARLIEST and LATEST operators implicitly reference the __time column for calculation of the aggregate value. Since the reference isn't explicit, Calcite sometimes fails to update the __time column name when there's column renaming --such as in the case of nested queries -- resulting in column not found errors.

This change rewrites these operators to EARLIEST_BY and LATEST_BY during query processing to make the reference explicit to Calcite.

(cherry picked from commit c6ca990)
ektravel pushed a commit to ektravel/druid that referenced this pull request Oct 16, 2023
…che#15095)

EARLIEST and LATEST operators implicitly reference the __time column for calculation of the aggregate value. Since the reference isn't explicit, Calcite sometimes fails to update the __time column name when there's column renaming --such as in the case of nested queries -- resulting in column not found errors.

This change rewrites these operators to EARLIEST_BY and LATEST_BY during query processing to make the reference explicit to Calcite.
@abhishekagarwal87 abhishekagarwal87 added this to the 28.0 milestone Oct 19, 2023
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
…che#15095)

EARLIEST and LATEST operators implicitly reference the __time column for calculation of the aggregate value. Since the reference isn't explicit, Calcite sometimes fails to update the __time column name when there's column renaming --such as in the case of nested queries -- resulting in column not found errors.

This change rewrites these operators to EARLIEST_BY and LATEST_BY during query processing to make the reference explicit to Calcite.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants