Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join context hints #17541

Merged
merged 18 commits into from
Dec 17, 2024
Merged

Join context hints #17541

merged 18 commits into from
Dec 17, 2024

Conversation

adarshsanjeev
Copy link
Contributor

@adarshsanjeev adarshsanjeev commented Dec 6, 2024

Rebased and addressed some issues of #17406

Currently, using the query context, all joins in a query can be hinted to either broadcast or sort merge join. However, this does not allow this to be set at a join level.

Add support for hints to queries, which can dictate the join type.

Syntax

select /*+ sort_merge */ w1.cityName, w2.countryName
from
(
  select /*+ broadcast */ w3.cityName AS cityName, w4.countryName AS countryName from wikipedia w3 LEFT JOIN wikipedia-set2 w4 ON w3.regionName = w4.regionName
) w1
JOIN wikipedia-set1 w2 ON w1.cityName = w2.cityName
where w1.cityName='New York';

Note
Calcite does not expose if the hint is at a particular join level, and allows inheritance of hints from parent nodes. This causes join hints to to affect inner queries. Additional join hints can be added to inner queries as a workaround.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added Area - Batch Ingestion Area - Querying Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Dec 6, 2024
@adarshsanjeev adarshsanjeev marked this pull request as ready for review December 11, 2024 09:11

import java.util.Arrays;

public enum JoinHint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to not have this as an enum - and you could probably use as(JoinAlgorithm.class) or asJoinAlgorithm() to obtain the current value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an issue with that, Calcite only seems to accept certain keywords as hints, and this is sort_merge. Trying to replace the hint name with sortMerge seems to make it not read by calcite.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't really mattter - I was just thinking something like:

public abstract class DruidHint {
  abstract T as(Class<T> clazz);

  public static DruidHint fromRelHint(RelHint rh) {
    // factory method
  }

  static abstrat class DruidJoinHint extends DruidHint {}

  static class SortMerge extends DruidJoinHint {
     String name() { "sort_merge"; }
     as() ... { return
  }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some changes to make it more in line with this, please take a look

Comment on lines 53 to 55
@JsonCreator
public static JoinHint fromString(final String id)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be serializable by jackson; if it gets on the wire it should be a JoinAlgorithm already

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might have been to deserializer it for the planner context at some point

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this JoinHint should be serialized
if its not in use I think it would be better to remove it


import java.util.Arrays;

public enum JoinHint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't really mattter - I was just thinking something like:

public abstract class DruidHint {
  abstract T as(Class<T> clazz);

  public static DruidHint fromRelHint(RelHint rh) {
    // factory method
  }

  static abstrat class DruidJoinHint extends DruidHint {}

  static class SortMerge extends DruidJoinHint {
     String name() { "sort_merge"; }
     as() ... { return
  }
}

Comment on lines 53 to 55
@JsonCreator
public static JoinHint fromString(final String id)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this JoinHint should be serialized
if its not in use I think it would be better to remove it

private static HintStrategyTable createHintStrategies()
{
return HintStrategyTable.builder()
.hintStrategy(JoinHint.BROADCAST.name(), HintPredicates.JOIN)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the usage of name() intended here?
because I think that will return BROADCAST the "name" of the enum key and not the id!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: this method returning HintStrategyTable could live inside DruidHint if that's created

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to DruidHint


public static JoinAlgorithm getJoinAlgorithm(Join join, PlannerContext plannerContext)
{
RelHint closestHint = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I was thinking to use DruidJoinHint here.. so that you can't even set closesHint to a non-join hint...but its not really important

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping the RelHint allows keeping track the closestHint's path length, to compare in the for loop. We do have a check to see if it's a join hint in the loop so this should be fine

@cryptoe cryptoe merged commit bb4416a into apache:master Dec 17, 2024
77 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Querying
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants