
Intervals are updated properly for Unnest queries #15020

Merged
merged 13 commits into from
Oct 3, 2023

Conversation


@somu-imply somu-imply commented Sep 21, 2023

Previously, unnest queries were not planned with the correct intervals. For example, if I run a query such as

select * from foo
, UNNEST(MV_TO_ARRAY("dim3")) as ud(d3) 
where __time >= TIMESTAMP '2000-01-02 00:00:00' and __time <= TIMESTAMP '2000-01-03 00:10:00' 

The following plan shows up

{
  "queryType": "scan",
  "dataSource": {
    "type": "unnest",
    "base": {
      "type": "filter",
      "base": {
        "type": "table",
        "name": "foo"
      },
      "filter": {
        "type": "range",
        "column": "__time",
        "matchValueType": "LONG",
        "lower": 946771200000,
        "upper": 946858200000
      }
    },
    "virtualColumn": {
      "type": "expression",
      "name": "j0.unnest",
      "expression": "\"dim3\"",
      "outputType": "STRING"
    },
    "unnestFilter": null
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "resultFormat": "compactedList",
  "limit": 1001,
  "columns": [
    "__time",
    "dim1",
    "dim2",
    "dim3",
    "j0.unnest",
    "m1",
    "m2"
  ],
  "legacy": false,
  "context": {
    "enableUnnest": true,
    "queryId": "a6147ad4-13c4-4c9c-9213-42e3d808d00f",
    "sqlOuterLimit": 1001,
    "sqlQueryId": "a6147ad4-13c4-4c9c-9213-42e3d808d00f",
    "useNativeQueryExplain": true
  },
  "granularity": {
    "type": "all"
  }
}

The actual interval should have been:

"intervals": {
    "type": "intervals",
    "intervals": [
      "2000-01-02T00:00:00.000Z/2000-01-03T00:10:00.001Z"
    ]
  }
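As a sanity check on the epoch-millisecond bounds in the plan above (a standalone sketch, not part of the PR), converting them back to UTC timestamps shows they match the SQL filter, and the inclusive upper bound becomes a half-open interval end one millisecond later:

```python
from datetime import datetime, timedelta, timezone

# The range filter in the plan stores the __time bounds as epoch milliseconds.
lower_ms = 946771200000
upper_ms = 946858200000

lower = datetime.fromtimestamp(lower_ms / 1000, tz=timezone.utc)
upper = datetime.fromtimestamp(upper_ms / 1000, tz=timezone.utc)

print(lower.isoformat())  # 2000-01-02T00:00:00+00:00
print(upper.isoformat())  # 2000-01-03T00:10:00+00:00

# Druid intervals are half-open [start, end), so the inclusive "<= upper"
# filter bound becomes an exclusive interval end one millisecond later,
# which is why the expected interval ends at 2000-01-03T00:10:00.001Z.
interval_end = upper + timedelta(milliseconds=1)
```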

With this change, we have added unit tests to ensure the intervals are correctly updated.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@somu-imply somu-imply marked this pull request as ready for review September 21, 2023 03:35

@LakshSingla LakshSingla left a comment


Let us add some comments and assumptions in the getFiltration method's Javadoc:

  1. FilteredDataSource is always on the UnnestDataSource
  2. UnnestDataSource's analysis gives the base data source which cannot have filters.

Also, please add why this special handling is required.
We should also add a query test which filters on an unnest on top of a lookup.

@@ -788,6 +790,28 @@ static Pair<DataSource, Filtration> getFiltration(
{
if (!canUseIntervalFiltering(dataSource)) {
return Pair.of(dataSource, toFiltration(filter, virtualColumnRegistry.getFullRowSignature(), false));
} else if (dataSource instanceof UnnestDataSource) {

nit: It won't make a difference to the logic as-is, but let's put this branch at the top since it is the most specific of the branches.


LakshSingla commented Sep 25, 2023

I think there are cases when there are multiple FilteredDataSources stacked on top of each other directly or indirectly.
For example -> FilteredDS -> UnnestDS -> FilteredDS -> UnnestDS.
How would this code handle those cases? Will it ignore the second filter?

What if the base table of the above chain has a filter of itself? (Is it possible in the first place?) If not then we should clearly highlight that assumption in the getFiltration method. Otherwise, it would be good if the code could handle cases like these automatically.

Also, UNNEST on MSQ got merged so let's make sure that the tests pass with them and we prune the segments properly.


somu-imply commented Sep 25, 2023

A filteredDS is created only inside the rel for Unnest, ensuring it only grabs the outermost filter and, if possible, pushes it down inside the data source. So a chain of Filter->Unnest->Filter is typically impossible when the query is planned through SQL. Also, Calcite has filter reduction rules that push filters deep into base data sources for better query planning. Hence, the case that can arise is a bunch of unnests followed by a terminal filteredDS, like Unnest->Unnest->FilteredDS; I have UTs resembling the same. A base table with a chain of filters is synonymous with a filteredDS.

In case there are filters present in the getFiltration call, we still update the interval by:

  1. creating a filtration from the filteredDS's filter, and
  2. updating the interval of the outer filter with the intervals from step 1.

You'll see these two calls in the code.

I'll merge the updated master into this branch to ensure that the tests pass correctly after this change.
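The chain shape described above (Unnest -> Unnest -> FilteredDS) can be illustrated with a small standalone sketch. The class names below are simplified, hypothetical stand-ins for Druid's DataSource implementations, not the actual Java API:

```python
from dataclasses import dataclass
from typing import Optional

# Simplified, hypothetical stand-ins for Druid's data source tree.
@dataclass
class TableDataSource:
    name: str

@dataclass
class FilteredDataSource:
    base: object
    time_filter: str  # carries the __time bounds to hoist into "intervals"

@dataclass
class UnnestDataSource:
    base: object

def terminal_filter(ds) -> Optional[str]:
    """Walk through a chain of unnests to the terminal FilteredDataSource
    and return its filter, so the planner can build a filtration from it
    and update the outer query's intervals."""
    while isinstance(ds, UnnestDataSource):
        ds = ds.base
    return ds.time_filter if isinstance(ds, FilteredDataSource) else None

chain = UnnestDataSource(UnnestDataSource(
    FilteredDataSource(TableDataSource("foo"), "__time range")))
print(terminal_filter(chain))  # __time range
```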


LakshSingla commented Sep 25, 2023

Let's note this comment somewhere as the assumption made while optimizing the filter. This might not be intuitive for someone who is just messing with that part of the code, and since it's implicitly baked into the code, it would be good to add it as a comment.

Also, if we can, let's add a defensive check that there is no multiple nesting of filtered data sources.
This shouldn't add runtime cost, and at the price of slightly increased code complexity, we make sure that queries don't silently pass with incorrect pruning as they did earlier.

@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Sep 25, 2023
@@ -1798,7 +1798,7 @@ public void testTimeColumnAggregationFromExtern() throws IOException
.setExpectedValidationErrorMatcher(
new DruidExceptionMatcher(DruidException.Persona.ADMIN, DruidException.Category.INVALID_INPUT, "general")
.expectMessageIs(
- "Query planning failed for unknown reason, our best guess is this "
+ "Query could not be planned. A possible reason is "

Nit: this is a cosmetic change; it was too small to open a separate PR for, so I'm just adding it here.

@somu-imply
Contributor Author

A defensive check has been added, and the code was slightly reformatted along with a comment.

Comment on lines +816 to +819
final DataSource baseOfFilterDataSource = filteredDataSource.getBase();
if (baseOfFilterDataSource instanceof FilteredDataSource) {
throw DruidException.defensive("Cannot create a filteredDataSource using another filteredDataSource as a base");
}

Instead of getBase(), we need to do a recursive check on filteredDataSource.getChildren(),
i.e. FilteredDS -> UnnestDS -> FilteredDS is disallowed ❌.
We are only accounting for the top-level filtered DS, so any intervals in a nested filtered DS would be missed.


Rethinking about it, I see that we are only handling the cases where UnnestDS is at the top level. A query chain like QueryDS -> UnnestDS -> FilteredDS -> .... won't set the intervals specified in the FilteredDS properly.


@somu-imply somu-imply Sep 27, 2023


The QueryDS would be handled separately: it would find its own UnnestDS and FilteredDS and set the interval for the inner subquery, while the outer query should still be eternity. This is similar to how Druid handles a scan over a QueryDS. I'll add the examples. I agree with the other comment that we need to check recursively for the case of Unnest->Filter->Unnest->Filter. I'll also update the comment to add that a chain cannot start with a FilteredDS, as a filteredDS is only created as the base of an UnnestDS.
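A recursive version of the defensive check over getChildren(), as suggested in the review, might look like the following sketch (again with simplified, hypothetical stand-in classes rather than Druid's actual Java API):

```python
# Minimal, hypothetical stand-ins for Druid's data source tree.
class DataSource:
    def get_children(self):
        return []

class TableDataSource(DataSource):
    pass

class FilteredDataSource(DataSource):
    def __init__(self, base):
        self.base = base
    def get_children(self):
        return [self.base]

class UnnestDataSource(DataSource):
    def __init__(self, base):
        self.base = base
    def get_children(self):
        return [self.base]

def contains_filtered(ds):
    """Recursively check whether any descendant is a FilteredDataSource,
    instead of only inspecting the immediate base."""
    return any(isinstance(c, FilteredDataSource) or contains_filtered(c)
               for c in ds.get_children())

# FilteredDS -> UnnestDS -> FilteredDS should be rejected by the
# defensive check, even though the immediate base is an UnnestDS:
bad = FilteredDataSource(UnnestDataSource(FilteredDataSource(TableDataSource())))
print(contains_filtered(bad.base))  # True
```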

Comment on lines 804 to 813
} else if (dataSource instanceof FilteredDataSource) {
// A filteredDS is created only inside the rel for Unnest, ensuring it only grabs the outermost filter
// and, if possible, pushes it down inside the data source.
// So a chain of Filter->Unnest->Filter is typically impossible when the query is done through SQL.
// Also, Calcite has filter reduction rules that push filters deep into base data sources for better query planning.
// Hence, the case that can be seen is a bunch of unnests followed by a terminal filteredDS like Unnest->Unnest->FilteredDS.
// A base table with a chain of filters is synonymous with a filteredDS.
// In case there are filters present in the getFiltration call we still update the interval by:
// 1) creating a filtration from the filteredDS's filter and
// 2) Updating the interval of the outer filter with the intervals in step 1, and you'll see these 2 calls in the code

nit: push this branch above for the same reason; perhaps we can move it to another PR if there are no further comments.


@LakshSingla LakshSingla left a comment


LGTM! Thanks for the PR, and for entertaining the questions I had around UNNEST.

@LakshSingla LakshSingla merged commit cb05028 into apache:master Oct 3, 2023
74 checks passed
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
ektravel pushed a commit to ektravel/druid that referenced this pull request Oct 16, 2023
Fixes a bug where the unnest queries were not updated with the correct intervals.
Labels
Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Querying Area - SQL
4 participants