
Updating plans when using joins with unnest on the left #15075

Merged: 11 commits into apache:master on Oct 7, 2023

Conversation

@somu-imply (Contributor) commented Oct 3, 2023

Previously, a query such as

with t1 as (
select * from foo, unnest(MV_TO_ARRAY("dim3")) as u(d3)
)
select * from t1 JOIN "numFoo" as t2
ON t1.d3 = t2."dim1"

would be planned as a join between a query data source on the left and a query data source on the right. Although the results were correct, this limits performance: query data sources are materialized at the broker, where the number of rows is capped by maxSubqueryRows.

Additionally, native queries like

{
  "queryType" : "scan",
  "dataSource" : {
    "type" : "join",
    "left" : {
      "type" : "unnest",
      "base" : {
        "type" : "table",
        "name" : "foo"
      },
      "virtualColumn" : {
        "type" : "expression",
        "name" : "j0.unnest",
        "expression" : "\"dim3\"",
        "outputType" : "STRING"
      },
      "unnestFilter" : null
    },
    "right" : {
      "type" : "query",
      "query" : {
        "queryType" : "scan",
        "dataSource" : {
          "type" : "table",
          "name" : "numfoo"
        },
        "intervals" : {
          "type" : "intervals",
          "intervals" : [ "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z" ]
        },
        "resultFormat" : "compactedList",
        "columns" : [ "__time", "cnt", "d1", "d2", "dim1", "dim2", "dim3", "dim4", "dim5", "dim6", "f1", "f2", "l1", "l2", "m1", "m2", "unique_dim1" ],
        "legacy" : false,
        "context" : {
          "defaultTimeout" : 300000,
          "maxScatterGatherBytes" : 9223372036854775807,
          "sqlCurrentTimestamp" : "2000-01-01T00:00:00Z",
          "sqlQueryId" : "dummy",
          "vectorSize" : 2,
          "vectorize" : "force",
          "vectorizeVirtualColumns" : "force"
        },
        "granularity" : {
          "type" : "all"
        }
      }
    },
    "rightPrefix" : "_j0.",
    "condition" : "(\"j0.unnest\" == \"_j0.dim1\")",
    "joinType" : "INNER"
  },
  "intervals" : {
    "type" : "intervals",
    "intervals" : [ "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z" ]
  },
  "resultFormat" : "compactedList",
  "columns" : [ "__time", "_j0.__time", "_j0.cnt", "_j0.d1", "_j0.d2", "_j0.dim1", "_j0.dim2", "_j0.dim3", "_j0.dim4", "_j0.dim5", "_j0.dim6", "_j0.f1", "_j0.f2", "_j0.l1", "_j0.l2", "_j0.m1", "_j0.m2", "_j0.unique_dim1", "cnt", "dim1", "dim2", "dim3", "j0.unnest", "m1", "m2", "unique_dim1" ],
  "legacy" : false,
  "context" : {
    "defaultTimeout" : 300000,
    "maxScatterGatherBytes" : 9223372036854775807,
    "sqlCurrentTimestamp" : "2000-01-01T00:00:00Z",
    "sqlQueryId" : "dummy",
    "vectorSize" : 2,
    "vectorize" : "force",
    "vectorizeVirtualColumns" : "force"
  },
  "granularity" : {
    "type" : "all"
  }
}

would fail with an error:

java.lang.ClassCastException: org.apache.druid.query.UnnestDataSource cannot be cast to org.apache.druid.query.TableDataSource
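
The error comes from code that assumed the join's left input could be cast to a table. A minimal illustration of the failing pattern (hypothetical snippet, not the exact Druid source):

// With unnest on the left, the join's left datasource is an UnnestDataSource,
// so an unconditional cast to TableDataSource throws ClassCastException.
final DataSource left = joinDataSource.getLeft();
final TableDataSource table = (TableDataSource) left; // fails for unnest-on-left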

This PR does the following to address this:

  1. Refactors getAnalysis() for JoinDataSource to correctly use the base datasource when the left side of a join is an UnnestDataSource
  2. Updates createSegmentMapFunction for JoinDataSource so the segment map function is built correctly (see the sketch after this list)
  3. Adds the machinery needed to produce the correct query plan
  4. Adds unit tests to cover these cases
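
A minimal sketch of the segment-map change in point 2 (assumed shape based on Druid's DataSource interface; illustrative, not the PR's verbatim code):

// When the left input is itself a JoinDataSource, delegate to its own
// segment map function and compose this join's mapping on top, instead
// of assuming the left input is a concrete table.
final Function<SegmentReference, SegmentReference> baseMapFn;
if (left instanceof JoinDataSource) {
  baseMapFn = left.createSegmentMapFunction(query, cpuTimeAccumulator);
} else {
  baseMapFn = Function.identity();
}
return baseMapFn.andThen(joinMapFn); // joinMapFn: this join's own mapping (hypothetical name)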

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

// Will need an instanceof check here.
// Future work should look into whether flattenJoin
// can be refactored to omit these instanceof checks.
while (current instanceof JoinDataSource || current instanceof UnnestDataSource || current instanceof FilteredDataSource) {
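
A sketch of how the loop might walk each wrapper down to its base (accessor names like getLeft() and getBase() are assumed; not the verbatim change):

// Walk joins down their left input, and unnest/filtered wrappers down to
// their base, until the concrete base datasource is reached.
while (current instanceof JoinDataSource || current instanceof UnnestDataSource || current instanceof FilteredDataSource) {
  if (current instanceof JoinDataSource) {
    current = ((JoinDataSource) current).getLeft();
  } else if (current instanceof UnnestDataSource) {
    current = ((UnnestDataSource) current).getBase();
  } else {
    current = ((FilteredDataSource) current).getBase();
  }
}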
Contributor:
Can we add test cases for a self join with an unnest datasource, if we do not have them already?

Contributor Author:

Thanks, added a test with a self join on an unnest data source.

@pranavbhole (Contributor):

looks good to me.

@@ -476,10 +476,18 @@ private Function<SegmentReference, SegmentReference> createSegmentMapFunctionInt
.orElse(null)
)
);

final Function<SegmentReference, SegmentReference> baseMapFn;
if (left instanceof JoinDataSource) {
Member:

This seems worth a comment on what is going on. Is it still OK to do if left is not concrete?

Contributor Author:

I'll add a comment explaining why we use the instanceof check here instead of the isConcrete() check.

Contributor Author:

Comment is added

joinDataSource.getConditionAnalysis()
)
);
} else if (current instanceof UnnestDataSource) {
Member:
It doesn't seem intuitive to me that we can flatten away unnest and filtered datasources; could we add comments explaining why it's OK? Is it still OK if the unnest datasource is wrapping a join datasource? Does it flatten through it? Where do the unnest and the filters go in that case?

@somu-imply (Contributor Author) commented Oct 6, 2023:

I'll add comments. The getAnalysis() of an UnnestDataSource or a FilteredDataSource always delegates to its base, so flattening through a Join -> Unnest -> Join scenario to get the base data source makes sense: it walks down until it finds the concrete base data source. In this PR, the filters on the FilteredDataSource and the UnnestDataSource are not pushed down to the left of the join; the unnest filter and the filter on the FilteredDataSource remain on their data sources. I have added a unit test for Join -> Unnest -> Join and will add another for Join -> Unnest -> Filter -> Join.
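
A minimal sketch of the delegation described above (assumed shape; illustrative rather than the exact Druid source):

// Inside a wrapper datasource such as UnnestDataSource or FilteredDataSource,
// analysis always delegates to the base. This is why flattening through these
// wrappers to reach the concrete base datasource is safe.
@Override
public DataSourceAnalysis getAnalysis()
{
  return base.getAnalysis();
}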

Contributor Author:

Comment and unit test added

@soumyava soumyava merged commit 57ab8e1 into apache:master Oct 7, 2023
81 checks passed
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
ektravel pushed a commit to ektravel/druid that referenced this pull request Oct 16, 2023
* Updating plans when using joins with unnest on the left

* Correcting segment map function for hashJoin

* The changes done here are not reflected into MSQ yet so these tests might not run in MSQ

* native tests

* Self joins with unnest data source

* Making this pass

* Addressing comments by adding explanation and new test
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
* Updating plans when using joins with unnest on the left

* Correcting segment map function for hashJoin

* The changes done here are not reflected into MSQ yet so these tests might not run in MSQ

* native tests

* Self joins with unnest data source

* Making this pass

* Addressing comments by adding explanation and new test
5 participants