
Updating plans when using joins with unnest on the left #15075

Merged: 11 commits into apache:master on Oct 7, 2023

Conversation

@somu-imply (Contributor) commented Oct 3, 2023

Previously, a query such as

with t1 as (
select * from foo, unnest(MV_TO_ARRAY("dim3")) as u(d3)
)
select * from t1 JOIN "numFoo" as t2
ON t1.d3 = t2."dim1"

would be planned as a join between a query data source on the left and a query data source on the right. Although the results were correct, this limits performance: query data sources are materialized at the broker, where the number of rows is capped by maxSubqueryRows.

Additionally, native queries like

{
  "queryType" : "scan",
  "dataSource" : {
    "type" : "join",
    "left" : {
      "type" : "unnest",
      "base" : {
        "type" : "table",
        "name" : "foo"
      },
      "virtualColumn" : {
        "type" : "expression",
        "name" : "j0.unnest",
        "expression" : "\"dim3\"",
        "outputType" : "STRING"
      },
      "unnestFilter" : null
    },
    "right" : {
      "type" : "query",
      "query" : {
        "queryType" : "scan",
        "dataSource" : {
          "type" : "table",
          "name" : "numfoo"
        },
        "intervals" : {
          "type" : "intervals",
          "intervals" : [ "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z" ]
        },
        "resultFormat" : "compactedList",
        "columns" : [ "__time", "cnt", "d1", "d2", "dim1", "dim2", "dim3", "dim4", "dim5", "dim6", "f1", "f2", "l1", "l2", "m1", "m2", "unique_dim1" ],
        "legacy" : false,
        "context" : {
          "defaultTimeout" : 300000,
          "maxScatterGatherBytes" : 9223372036854775807,
          "sqlCurrentTimestamp" : "2000-01-01T00:00:00Z",
          "sqlQueryId" : "dummy",
          "vectorSize" : 2,
          "vectorize" : "force",
          "vectorizeVirtualColumns" : "force"
        },
        "granularity" : {
          "type" : "all"
        }
      }
    },
    "rightPrefix" : "_j0.",
    "condition" : "(\"j0.unnest\" == \"_j0.dim1\")",
    "joinType" : "INNER"
  },
  "intervals" : {
    "type" : "intervals",
    "intervals" : [ "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z" ]
  },
  "resultFormat" : "compactedList",
  "columns" : [ "__time", "_j0.__time", "_j0.cnt", "_j0.d1", "_j0.d2", "_j0.dim1", "_j0.dim2", "_j0.dim3", "_j0.dim4", "_j0.dim5", "_j0.dim6", "_j0.f1", "_j0.f2", "_j0.l1", "_j0.l2", "_j0.m1", "_j0.m2", "_j0.unique_dim1", "cnt", "dim1", "dim2", "dim3", "j0.unnest", "m1", "m2", "unique_dim1" ],
  "legacy" : false,
  "context" : {
    "defaultTimeout" : 300000,
    "maxScatterGatherBytes" : 9223372036854775807,
    "sqlCurrentTimestamp" : "2000-01-01T00:00:00Z",
    "sqlQueryId" : "dummy",
    "vectorSize" : 2,
    "vectorize" : "force",
    "vectorizeVirtualColumns" : "force"
  },
  "granularity" : {
    "type" : "all"
  }
}

would fail with an error:

java.lang.ClassCastException: org.apache.druid.query.UnnestDataSource cannot be cast to org.apache.druid.query.TableDataSource
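
The error comes from code that assumed the join's left input could be cast to a table. A minimal illustration of the failing pattern (hypothetical snippet, not the exact Druid source):

// With unnest on the left, the join's left datasource is an UnnestDataSource,
// so an unconditional cast to TableDataSource throws ClassCastException.
final DataSource left = joinDataSource.getLeft();
final TableDataSource table = (TableDataSource) left; // fails for unnest-on-left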

This PR does the following to address this:

  1. Refactors getAnalysis() for JoinDataSource to correctly use the base datasource when the left side of a join is an UnnestDataSource
  2. Updates createSegmentMapFunction for JoinDataSource so the segment map function is built correctly (see the sketch after this list)
  3. Adds the machinery needed to produce the correct query plan
  4. Adds unit tests to cover these cases
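
A minimal sketch of the segment-map change in point 2 (assumed shape based on Druid's DataSource interface; illustrative, not the PR's verbatim code):

// When the left input is itself a JoinDataSource, delegate to its own
// segment map function and compose this join's mapping on top, instead
// of assuming the left input is a concrete table.
final Function<SegmentReference, SegmentReference> baseMapFn;
if (left instanceof JoinDataSource) {
  baseMapFn = left.createSegmentMapFunction(query, cpuTimeAccumulator);
} else {
  baseMapFn = Function.identity();
}
return baseMapFn.andThen(joinMapFn); // joinMapFn: this join's own mapping (hypothetical name)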

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

// Will need an instanceof check here.
// Future work should look into whether flattenJoin
// can be refactored to omit these instanceof checks.
while (current instanceof JoinDataSource || current instanceof UnnestDataSource || current instanceof FilteredDataSource) {
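
A sketch of how the loop might walk each wrapper down to its base (accessor names like getLeft() and getBase() are assumed; not the verbatim change):

// Walk joins down their left input, and unnest/filtered wrappers down to
// their base, until the concrete base datasource is reached.
while (current instanceof JoinDataSource || current instanceof UnnestDataSource || current instanceof FilteredDataSource) {
  if (current instanceof JoinDataSource) {
    current = ((JoinDataSource) current).getLeft();
  } else if (current instanceof UnnestDataSource) {
    current = ((UnnestDataSource) current).getBase();
  } else {
    current = ((FilteredDataSource) current).getBase();
  }
}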
Contributor:
Can we add test cases for a self join with an unnest datasource, if we do not have them already?

Contributor Author:

Thanks, added a test with a self join on an unnest data source.

@pranavbhole (Contributor):

looks good to me.

@@ -476,10 +476,18 @@ private Function<SegmentReference, SegmentReference> createSegmentMapFunctionInt
.orElse(null)
)
);

final Function<SegmentReference, SegmentReference> baseMapFn;
if (left instanceof JoinDataSource) {
Member:

This seems worth a comment on what is going on. Is it still OK to do if left is not concrete?

Contributor Author:

I'll add a comment explaining why we use the instanceof check here instead of the isConcrete() check.

Contributor Author:

Comment is added

joinDataSource.getConditionAnalysis()
)
);
} else if (current instanceof UnnestDataSource) {
Member:
It doesn't seem intuitive to me that we can flatten away unnest and filtered datasources; could we add comments explaining why it's OK? Is it still OK if the unnest datasource is wrapping a join datasource? Does it flatten through it? Where do the unnest and the filters go in that case?

@somu-imply (Contributor Author) commented Oct 6, 2023:

I'll add comments. The getAnalysis() of an UnnestDataSource or a FilteredDataSource always delegates to its base, so flattening through a Join -> Unnest -> Join scenario to get the base data source makes sense: it walks down until it finds the concrete base data source. In this PR, the filters on the FilteredDataSource and the UnnestDataSource are not pushed down to the left of the join; the unnest filter and the filter on the FilteredDataSource remain on their data sources. I have added a unit test for Join -> Unnest -> Join and will add another for Join -> Unnest -> Filter -> Join.
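
A minimal sketch of the delegation described above (assumed shape; illustrative rather than the exact Druid source):

// Inside a wrapper datasource such as UnnestDataSource or FilteredDataSource,
// analysis always delegates to the base. This is why flattening through these
// wrappers to reach the concrete base datasource is safe.
@Override
public DataSourceAnalysis getAnalysis()
{
  return base.getAnalysis();
}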

Contributor Author:

Comment and unit test added

@soumyava soumyava merged commit 57ab8e1 into apache:master Oct 7, 2023
81 checks passed
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
ektravel pushed a commit to ektravel/druid that referenced this pull request Oct 16, 2023
* Updating plans when using joins with unnest on the left

* Correcting segment map function for hashJoin

* The changes done here are not reflected into MSQ yet so these tests might not run in MSQ

* native tests

* Self joins with unnest data source

* Making this pass

* Addressing comments by adding explanation and new test
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
* Updating plans when using joins with unnest on the left

* Correcting segment map function for hashJoin

* The changes done here are not reflected into MSQ yet so these tests might not run in MSQ

* native tests

* Self joins with unnest data source

* Making this pass

* Addressing comments by adding explanation and new test
5 participants