feat(optimizer): support PullUpCorrelatedPredicateAggRule #15026

Merged
chenzl25 merged 4 commits into main from dylan/support_tpch_subquery_unnest on Feb 7, 2024

@chenzl25 chenzl25 commented Feb 6, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional; recommended for new SQL features. See Sqlsmith: Sql feature generation #7934.)
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Comment on lines -1728 to +1637
- └─LogicalProject { exprs: [ps_partkey, ps_suppkey, (0.5:Decimal * sum(l_quantity)) as $expr2] }
-   └─LogicalAgg { group_key: [ps_partkey, ps_suppkey], aggs: [sum(l_quantity)] }
-     └─LogicalJoin { type: LeftOuter, on: IsNotDistinctFrom(ps_partkey, l_partkey) AND IsNotDistinctFrom(ps_suppkey, l_suppkey), output: [ps_partkey, ps_suppkey, l_quantity] }
-       ├─LogicalAgg { group_key: [ps_partkey, ps_suppkey], aggs: [] }
-       │ └─LogicalJoin { type: LeftSemi, on: (ps_partkey = p_partkey), output: [ps_partkey, ps_suppkey] }
-       │   ├─LogicalSource { source: partsupp, columns: [ps_partkey, ps_suppkey, ps_availqty, ps_supplycost, ps_comment, _row_id], time_range: (Unbounded, Unbounded) }
-       │   └─LogicalProject { exprs: [p_partkey] }
-       │     └─LogicalSource { source: part, columns: [p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment, _row_id], time_range: (Unbounded, Unbounded) }
-       └─LogicalProject { exprs: [l_partkey, l_suppkey, l_quantity] }
-         └─LogicalFilter { predicate: IsNotNull(l_partkey) AND IsNotNull(l_suppkey) }
-           └─LogicalSource { source: lineitem, columns: [l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment, _row_id], time_range: (Unbounded, Unbounded) }
+ └─LogicalProject { exprs: [(0.5:Decimal * sum(l_quantity)) as $expr2, l_partkey, l_suppkey] }
+   └─LogicalAgg { group_key: [l_partkey, l_suppkey], aggs: [sum(l_quantity)] }
+     └─LogicalSource { source: lineitem, columns: [l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment, _row_id], time_range: (Unbounded, Unbounded) }

Contributor Author

@chenzl25 chenzl25 Feb 6, 2024

That is the TPCH Q20 we want to optimize in this PR.

Comment on lines 28 to 56
/// Pull up correlated predicates from the right agg side of Apply to the `on` clause of Join.
///
/// Before:
///
/// ```text
///     LogicalApply
///    /            \
///  LHS          Project
///                 |
///               Agg [group by nothing]
///                 |
///               Project
///                 |
///               Filter [correlated_input_ref(yyy) = xxx]
/// ```
///
/// After:
///
/// ```text
///     LogicalApply [yyy = xxx]
///    /            \
///  LHS          Project
///                 |
///               Agg [group by xxx]
///                 |
///               Project
///                 |
///               Filter
/// ```

Contributor Author

A diagram to explain how this rule works. It tries to pull up the correlated expression from the Filter to the Apply.
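
The effect of the rewrite can be illustrated outside the optimizer with a toy model. The following is only a sketch under stated assumptions: tables are plain slices, the aggregate is `sum`, and the function names `correlated_sum` and `decorrelated_sum` are hypothetical and are not RisingWave APIs. It shows that evaluating the subquery once per outer row gives the same answer as grouping the inner side by `xxx` once and matching on `yyy = xxx`.

```rust
use std::collections::HashMap;

/// Inner-side row: (xxx, value). `xxx` is the column compared with the
/// correlated reference `yyy` coming from the outer side.
/// (Hypothetical toy type, not part of RisingWave.)
type InnerRow = (i32, i64);

/// Before the rule: for each outer `yyy`, filter the inner rows and
/// aggregate with no group key. SQL `sum` over zero rows is NULL (None).
fn correlated_sum(outer: &[i32], inner: &[InnerRow]) -> Vec<(i32, Option<i64>)> {
    outer
        .iter()
        .map(|&yyy| {
            let mut acc: Option<i64> = None;
            for &(xxx, v) in inner {
                if xxx == yyy {
                    acc = Some(acc.unwrap_or(0) + v);
                }
            }
            (yyy, acc)
        })
        .collect()
}

/// After the rule: aggregate the inner side once, grouped by `xxx`, then
/// look each outer row up by `yyy = xxx`. A missing group yields NULL.
fn decorrelated_sum(outer: &[i32], inner: &[InnerRow]) -> Vec<(i32, Option<i64>)> {
    let mut groups: HashMap<i32, i64> = HashMap::new();
    for &(xxx, v) in inner {
        *groups.entry(xxx).or_insert(0) += v;
    }
    outer
        .iter()
        .map(|&yyy| (yyy, groups.get(&yyy).copied()))
        .collect()
}
```

In this model the inner side is scanned once instead of once per outer row, which is the shape the new Q20 plan takes with its `group_key: [l_partkey, l_suppkey]` aggregation over `lineitem`.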

Contributor

@lmatz lmatz left a comment

New Q20 LGTM, thanks.
Better than Flink's, as this plan is now bushier, while Flink's is left-deep and one level deeper.

Contributor

@st1page st1page left a comment

LGTM!

Comment on lines +83 to +86
// It could be too restrictive to require the group key to be empty. We can relax this in the future if necessary.
if !group_key.is_empty() {
    return None;
}

Contributor

Is simply adding the correlated key (xxx) into the group key correct?

Contributor Author

No, simply adding those group keys is not correct. We need to handle the new_agg parent input reference in a more sophisticated way instead of the current simple shifting.

Comment on lines 146 to 152
// If there is a count aggregate, bail out and leave for general subquery unnesting to deal.
if agg_calls
    .iter()
    .any(|agg_call| agg_call.agg_kind == AggKind::Count)
{
    return None;
};

Contributor

Why? Is count very special here?

Contributor Author

Yes, here is the corner case. I haven't come up with a way to deal with it yet. If you have some ideas, feel free to share. It is also related to TPCH Q17.

create table t (a int, b int);

create table t2 (c int, d int);

insert into t values (1, 2);

flush;

select * from t where t.a > (select count(*) from t2 where b = d);

Contributor Author

When the group by is empty, count returns 0 instead of null on empty input.
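
To make the corner case concrete, here is a minimal sketch that models the query above with `t = (1, 2)` and an empty `t2`. It is an illustration only: the helper names `correlated_count`, `grouped_count`, and `where_a_gt` are hypothetical, and the "grouped" form models a naive pull-up of `b = d` into a group key followed by a left join, not the actual RisingWave plan.

```rust
use std::collections::HashMap;

/// Correlated evaluation: `count(*)` with an empty group key always
/// produces one row, so zero matching rows gives 0, not NULL.
fn correlated_count(b: i32, t2_d: &[i32]) -> Option<i64> {
    Some(t2_d.iter().filter(|&&d| d == b).count() as i64)
}

/// Naive pulled-up form: group `t2` by `d`, then left-join on `b = d`.
/// A key with no matching rows has no group, so the join pads with NULL.
fn grouped_count(b: i32, t2_d: &[i32]) -> Option<i64> {
    let mut groups: HashMap<i32, i64> = HashMap::new();
    for &d in t2_d {
        *groups.entry(d).or_insert(0) += 1;
    }
    groups.get(&b).copied()
}

/// SQL `a > x`: a NULL operand makes the predicate NULL, which a WHERE
/// clause treats as not satisfied.
fn where_a_gt(a: i32, x: Option<i64>) -> bool {
    matches!(x, Some(n) if i64::from(a) > n)
}
```

With `a = 1`, `b = 2`, and no rows in `t2`, the correlated form compares `1 > 0` and keeps the row, while the grouped form compares `1 > NULL` and drops it. The two plans disagree, which is why the rule bails out when it sees a count.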

Contributor

@st1page st1page Feb 7, 2024

Yes, I cannot find a way to rewrite it... But it should be OK for Q17 because the aggregator is an AVG. We should choose one of the following to optimize #14799:

  • Keep AVG and other similar aggregators in the plan node and delay their rewriting (currently RW rewrites them when creating the Agg plan node).
  • Consider the Project and Agg together in the rule later.
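
A small sketch of why AVG is safe even though it is expressed through a count. It assumes AVG is modeled as `sum / count` with SQL NULL propagation; the helper name `avg_from_parts` is hypothetical and this is not how RisingWave implements the rewrite.

```rust
/// `avg(x)` modeled as `sum(x) / count(x)` with SQL NULL propagation:
/// if either operand is NULL, the result is NULL. A zero count only
/// occurs together with a NULL sum, so it also maps to NULL here.
fn avg_from_parts(sum: Option<f64>, count: Option<i64>) -> Option<f64> {
    match (sum, count) {
        (Some(s), Some(c)) if c != 0 => Some(s / c as f64),
        _ => None,
    }
}
```

With an empty group key and no matching rows, the parts are `(NULL, 0)`. After grouping and a left join, a missing group gives `(NULL, NULL)`. Both evaluate to NULL, so the final AVG agrees in the two plans even though the intermediate count does not.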

@chenzl25 chenzl25 enabled auto-merge February 7, 2024 03:59
@chenzl25 chenzl25 added this pull request to the merge queue Feb 7, 2024
Merged via the queue into main with commit 8e3c526 Feb 7, 2024
26 of 27 checks passed
@chenzl25 chenzl25 deleted the dylan/support_tpch_subquery_unnest branch February 7, 2024 04:47