-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Intervals are updated properly for Unnest queries #15020
Conversation
sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidQuery.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let us add some comments and assumptions in the getFiltration
method's Javadoc:
- FilteredDataSource is always on the UnnestDataSource
- UnnestDataSource's analysis gives the base data source which cannot have filters.
Also, please add why this special handling is required.
We should add a test in the query's test which filters on unnest on top of a lookup.
@@ -788,6 +790,28 @@ static Pair<DataSource, Filtration> getFiltration( | |||
{ | |||
if (!canUseIntervalFiltering(dataSource)) { | |||
return Pair.of(dataSource, toFiltration(filter, virtualColumnRegistry.getFullRowSignature(), false)); | |||
} else if (dataSource instanceof UnnestDataSource) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It won't make a difference with the logic as is, but let's put this branch at the top since it is the most specific of the branches.
I think there are cases when there are multiple What if the base table of the above chain has a filter of itself? (Is it possible in the first place?) If not then we should clearly highlight that assumption in the getFiltration method. Otherwise, it would be good if the code could handle cases like these automatically. Also, UNNEST on MSQ got merged so let's make sure that the tests pass with them and we prune the segments properly. |
A filteredDS is created only inside the rel for Unnest, ensuring it only grabs the outermost filter and, if possible, pushes it down inside the data source. So a chain of Filter->Unnest->Filter is typically impossible when the query is done through SQL. Also, Calcite has filter reduction rules that push filters deep into base data sources for better query planning. Hence the case that can be seen is a bunch of unnests followed by a terminal filteredDS like Unnest->Unnest->FilteredDS. I have UTs resembling the same. A base table with a chain of filters is synonymous with a filteredDS. In case there are filters present in the getFiltration call we still update the interval by: 1) creating a filtration from the filteredDS's filter and 2) Updating the interval of the outer filter with the intervals in step 1 and you'll see these 2 calls in the code I'll merge the updated master into this branch to ensure that the tests pass correctly after this change. |
Let's note this comment somewhere as the assumption while optimizing the filter. This might not be intuitive for someone who is just messing with that part of the code, and since its implicitly baked into the code, it would be good to add it as a comment. Also, if we can, then let's add a check that there is no multiple nesting of the filtered data sources (defensive check) |
@@ -1798,7 +1798,7 @@ public void testTimeColumnAggregationFromExtern() throws IOException | |||
.setExpectedValidationErrorMatcher( | |||
new DruidExceptionMatcher(DruidException.Persona.ADMIN, DruidException.Category.INVALID_INPUT, "general") | |||
.expectMessageIs( | |||
"Query planning failed for unknown reason, our best guess is this " | |||
"Query could not be planned. A possible reason is " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit. this is a cosmetic change, it was too small to open a separate PR for this so just adding it here
A defensive check was added, code was slightly reformatted along with a comment |
final DataSource baseOfFilterDataSource = filteredDataSource.getBase(); | ||
if (baseOfFilterDataSource instanceof FilteredDataSource) { | ||
throw DruidException.defensive("Cannot create a filteredDataSource using another filteredDataSource as a base"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of getBase(), we need to do a recursive check on the filteredDataSource.getChildren()
.
i.e. FilteredDS -> UnnestDS -> FilteredDS is disallowed ❌ .
We are only accounting for the top-level filtered DS, therefore any intervals in any nested DS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rethinking about it, I see that we are only handling the cases where UnnestDS is at the top level. A query chain like QueryDS -> UnnestDS -> FilteredDS -> .... won't set the intervals specified in the FilteredDS properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The QueryDS would be handled separately where it would find its own UnnestDS and FilteredDS and set the interval for the inner subquery. The outer query should still be eternity. This is similar to how Druid handles scan over queryDS. I'll add the examples. I agree with the other comment that we need to recursively for the case of Unnest->Filter->Unnest->Filter. I'll update the comment adding that a chain cannot start with a filterDS as a filteredDS is only created as the base of an UnnestDS
} else if (dataSource instanceof FilteredDataSource) { | ||
// A filteredDS is created only inside the rel for Unnest, ensuring it only grabs the outermost filter | ||
// and, if possible, pushes it down inside the data source. | ||
// So a chain of Filter->Unnest->Filter is typically impossible when the query is done through SQL. | ||
// Also, Calcite has filter reduction rules that push filters deep into base data sources for better query planning. | ||
// Hence, the case that can be seen is a bunch of unnests followed by a terminal filteredDS like Unnest->Unnest->FilteredDS. | ||
// A base table with a chain of filters is synonymous with a filteredDS. | ||
// In case there are filters present in the getFiltration call we still update the interval by: | ||
// 1) creating a filtration from the filteredDS's filter and | ||
// 2) Updating the interval of the outer filter with the intervals in step 1, and you'll see these 2 calls in the code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: push this branch above due to the same reason, perhaps we can move it to another PR if there are no further comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the PR, and for entertaining the questions I had around UNNEST.
Fixes a bug where the unnest queries were not updated with the correct intervals.
Previously unnest queries were not updated with the correct intervals. For example
if I run a query as
The following plan shows up
The actual interval should have been
With this change, we have added unit tests to ensure the intervals are correctly updated
This PR has: