
Intervals are updated properly for Unnest queries #15020

Merged
merged 13 commits into from
Oct 3, 2023

Conversation


@somu-imply somu-imply commented Sep 21, 2023

Previously, unnest queries were not planned with the correct intervals. For example, if I run a query such as

select * from foo
, UNNEST(MV_TO_ARRAY("dim3")) as ud(d3) 
where __time >= TIMESTAMP '2000-01-02 00:00:00' and __time <= TIMESTAMP '2000-01-03 00:10:00' 

The following plan shows up

{
  "queryType": "scan",
  "dataSource": {
    "type": "unnest",
    "base": {
      "type": "filter",
      "base": {
        "type": "table",
        "name": "foo"
      },
      "filter": {
        "type": "range",
        "column": "__time",
        "matchValueType": "LONG",
        "lower": 946771200000,
        "upper": 946858200000
      }
    },
    "virtualColumn": {
      "type": "expression",
      "name": "j0.unnest",
      "expression": "\"dim3\"",
      "outputType": "STRING"
    },
    "unnestFilter": null
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "resultFormat": "compactedList",
  "limit": 1001,
  "columns": [
    "__time",
    "dim1",
    "dim2",
    "dim3",
    "j0.unnest",
    "m1",
    "m2"
  ],
  "legacy": false,
  "context": {
    "enableUnnest": true,
    "queryId": "a6147ad4-13c4-4c9c-9213-42e3d808d00f",
    "sqlOuterLimit": 1001,
    "sqlQueryId": "a6147ad4-13c4-4c9c-9213-42e3d808d00f",
    "useNativeQueryExplain": true
  },
  "granularity": {
    "type": "all"
  }
}

The actual interval should have been:

"intervals": {
    "type": "intervals",
    "intervals": [
      "2000-01-02T00:00:00.000Z/2000-01-03T00:10:00.001Z"
    ]
  }
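As a sanity check on the epoch-millisecond bounds in the plan above (a standalone sketch, not part of the PR), converting them back to UTC timestamps shows they match the SQL filter, and the inclusive upper bound becomes a half-open interval end one millisecond later:

```python
from datetime import datetime, timedelta, timezone

# The range filter in the plan stores the __time bounds as epoch milliseconds.
lower_ms = 946771200000
upper_ms = 946858200000

lower = datetime.fromtimestamp(lower_ms / 1000, tz=timezone.utc)
upper = datetime.fromtimestamp(upper_ms / 1000, tz=timezone.utc)

print(lower.isoformat())  # 2000-01-02T00:00:00+00:00
print(upper.isoformat())  # 2000-01-03T00:10:00+00:00

# Druid intervals are half-open [start, end), so the inclusive "<= upper"
# filter bound becomes an exclusive interval end one millisecond later,
# which is why the expected interval ends at 2000-01-03T00:10:00.001Z.
interval_end = upper + timedelta(milliseconds=1)
```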

With this change, we have added unit tests to ensure the intervals are correctly updated.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@somu-imply somu-imply marked this pull request as ready for review September 21, 2023 03:35

@LakshSingla LakshSingla left a comment


Let us add some comments and assumptions in the getFiltration method's Javadoc:

  1. FilteredDataSource is always on the UnnestDataSource
  2. UnnestDataSource's analysis gives the base data source which cannot have filters.

Also, please add why this special handling is required.
We should also add a query test which filters on an unnest on top of a lookup.

@@ -788,6 +790,28 @@ static Pair<DataSource, Filtration> getFiltration(
{
if (!canUseIntervalFiltering(dataSource)) {
return Pair.of(dataSource, toFiltration(filter, virtualColumnRegistry.getFullRowSignature(), false));
} else if (dataSource instanceof UnnestDataSource) {

nit: It won't make a difference to the logic as-is, but let's put this branch at the top since it is the most specific of the branches.


LakshSingla commented Sep 25, 2023

I think there are cases when there are multiple FilteredDataSources stacked on top of each other directly or indirectly.
For example -> FilteredDS -> UnnestDS -> FilteredDS -> UnnestDS.
How would this code handle those cases? Will it ignore the second filter?

What if the base table of the above chain has a filter of itself? (Is it possible in the first place?) If not then we should clearly highlight that assumption in the getFiltration method. Otherwise, it would be good if the code could handle cases like these automatically.

Also, UNNEST on MSQ got merged so let's make sure that the tests pass with them and we prune the segments properly.


somu-imply commented Sep 25, 2023

A filteredDS is created only inside the rel for Unnest, ensuring it only grabs the outermost filter and, if possible, pushes it down inside the data source. So a chain of Filter->Unnest->Filter is typically impossible when the query is planned through SQL. Also, Calcite has filter reduction rules that push filters deep into base data sources for better query planning. Hence, the case that can arise is a bunch of unnests followed by a terminal filteredDS, like Unnest->Unnest->FilteredDS; I have UTs resembling the same. A base table with a chain of filters is synonymous with a filteredDS.

In case there are filters present in the getFiltration call, we still update the interval by:

  1. creating a filtration from the filteredDS's filter, and
  2. updating the interval of the outer filter with the intervals from step 1.

You'll see these two calls in the code.

I'll merge the updated master into this branch to ensure that the tests pass correctly after this change.
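The chain shape described above (Unnest -> Unnest -> FilteredDS) can be illustrated with a small standalone sketch. The class names below are simplified, hypothetical stand-ins for Druid's DataSource implementations, not the actual Java API:

```python
from dataclasses import dataclass
from typing import Optional

# Simplified, hypothetical stand-ins for Druid's data source tree.
@dataclass
class TableDataSource:
    name: str

@dataclass
class FilteredDataSource:
    base: object
    time_filter: str  # carries the __time bounds to hoist into "intervals"

@dataclass
class UnnestDataSource:
    base: object

def terminal_filter(ds) -> Optional[str]:
    """Walk through a chain of unnests to the terminal FilteredDataSource
    and return its filter, so the planner can build a filtration from it
    and update the outer query's intervals."""
    while isinstance(ds, UnnestDataSource):
        ds = ds.base
    return ds.time_filter if isinstance(ds, FilteredDataSource) else None

chain = UnnestDataSource(UnnestDataSource(
    FilteredDataSource(TableDataSource("foo"), "__time range")))
print(terminal_filter(chain))  # __time range
```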


LakshSingla commented Sep 25, 2023

Let's note this comment somewhere as the assumption made while optimizing the filter. This might not be intuitive for someone who is just messing with that part of the code, and since it's implicitly baked into the code, it would be good to add it as a comment.

Also, if we can, let's add a defensive check that there is no multiple nesting of filtered data sources.
This shouldn't add runtime cost, and at the price of slightly increased code complexity, we make sure that queries don't silently pass with incorrect pruning as they did earlier.

@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Sep 25, 2023
@@ -1798,7 +1798,7 @@ public void testTimeColumnAggregationFromExtern() throws IOException
.setExpectedValidationErrorMatcher(
new DruidExceptionMatcher(DruidException.Persona.ADMIN, DruidException.Category.INVALID_INPUT, "general")
.expectMessageIs(
- "Query planning failed for unknown reason, our best guess is this "
+ "Query could not be planned. A possible reason is "

Nit: this is a cosmetic change; it was too small to open a separate PR for, so I'm just adding it here.

@somu-imply
Contributor Author

A defensive check has been added, and the code was slightly reformatted along with a comment.

Comment on lines +816 to +819
final DataSource baseOfFilterDataSource = filteredDataSource.getBase();
if (baseOfFilterDataSource instanceof FilteredDataSource) {
throw DruidException.defensive("Cannot create a filteredDataSource using another filteredDataSource as a base");
}

Instead of getBase(), we need to do a recursive check on filteredDataSource.getChildren(),
i.e. FilteredDS -> UnnestDS -> FilteredDS is disallowed ❌.
We are only accounting for the top-level filtered DS, so any intervals in a nested filtered DS would be missed.


Rethinking about it, I see that we are only handling the cases where UnnestDS is at the top level. A query chain like QueryDS -> UnnestDS -> FilteredDS -> .... won't set the intervals specified in the FilteredDS properly.


@somu-imply somu-imply Sep 27, 2023


The QueryDS would be handled separately: it would find its own UnnestDS and FilteredDS and set the interval for the inner subquery, while the outer query should still be eternity. This is similar to how Druid handles a scan over a QueryDS. I'll add the examples. I agree with the other comment that we need to check recursively for the case of Unnest->Filter->Unnest->Filter. I'll also update the comment to add that a chain cannot start with a FilteredDS, as a filteredDS is only created as the base of an UnnestDS.
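A recursive version of the defensive check over getChildren(), as suggested in the review, might look like the following sketch (again with simplified, hypothetical stand-in classes rather than Druid's actual Java API):

```python
# Minimal, hypothetical stand-ins for Druid's data source tree.
class DataSource:
    def get_children(self):
        return []

class TableDataSource(DataSource):
    pass

class FilteredDataSource(DataSource):
    def __init__(self, base):
        self.base = base
    def get_children(self):
        return [self.base]

class UnnestDataSource(DataSource):
    def __init__(self, base):
        self.base = base
    def get_children(self):
        return [self.base]

def contains_filtered(ds):
    """Recursively check whether any descendant is a FilteredDataSource,
    instead of only inspecting the immediate base."""
    return any(isinstance(c, FilteredDataSource) or contains_filtered(c)
               for c in ds.get_children())

# FilteredDS -> UnnestDS -> FilteredDS should be rejected by the
# defensive check, even though the immediate base is an UnnestDS:
bad = FilteredDataSource(UnnestDataSource(FilteredDataSource(TableDataSource())))
print(contains_filtered(bad.base))  # True
```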

Comment on lines 804 to 813
} else if (dataSource instanceof FilteredDataSource) {
// A filteredDS is created only inside the rel for Unnest, ensuring it only grabs the outermost filter
// and, if possible, pushes it down inside the data source.
// So a chain of Filter->Unnest->Filter is typically impossible when the query is done through SQL.
// Also, Calcite has filter reduction rules that push filters deep into base data sources for better query planning.
// Hence, the case that can be seen is a bunch of unnests followed by a terminal filteredDS like Unnest->Unnest->FilteredDS.
// A base table with a chain of filters is synonymous with a filteredDS.
// In case there are filters present in the getFiltration call we still update the interval by:
// 1) creating a filtration from the filteredDS's filter and
// 2) Updating the interval of the outer filter with the intervals in step 1, and you'll see these 2 calls in the code

nit: push this branch above for the same reason; perhaps we can move it to another PR if there are no further comments.


@LakshSingla LakshSingla left a comment


LGTM! Thanks for the PR, and for entertaining the questions I had around UNNEST.

@LakshSingla LakshSingla merged commit cb05028 into apache:master Oct 3, 2023
74 checks passed
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
ektravel pushed a commit to ektravel/druid that referenced this pull request Oct 16, 2023
Fixes a bug where the unnest queries were not updated with the correct intervals.
Labels
Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Querying Area - SQL
4 participants