Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(over window): fix error in using aggregate function result as win… #12551

Merged
merged 8 commits into from
Oct 10, 2023
Merged

Conversation

jetjinser
Copy link
Contributor

…dow function argument

rewrote the partition by and group by expressions (input_ref) in the window function to avoid out of index problems.

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Fixes #10666.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

…dow function argument

rewrote the `partition by` and `group by` expressions (`input_ref`) in the window function
to avoid out of index problems.
@jetjinser
Copy link
Contributor Author

prepare table

dev=> create table t (id int, a int, b int, c int);
CREATE_TABLE

dev=> insert into t values (1, 72, 23, 84), (2, 98, 13, 29), (3, 74, 56, 43), (4, 19, 56, 83), (5, 52, 68, 20);
INSERT 0 5

dev=> select * from t;
 id | a  | b  | c
----+----+----+----
  1 | 72 | 23 | 84
  2 | 98 | 13 | 29
  3 | 74 | 56 | 43
  4 | 19 | 56 | 83
  5 | 52 | 68 | 20
(5 rows)

execute query

dev=> select row_number() over (order by a) from t group by a;
ERROR:  QueryError: Feature is not yet implemented: Window function with empty PARTITION BY is not supported yet
No tracking issue yet. Feel free to submit a feature request at https://github.com/risingwavelabs/risingwave/issues/new?labels=type%2Ffeature&template=feature_request.yml

dev=> select row_number() over (partition by a order by a) from t group by a;
 row_number
------------
          1
          1
          1
          1
          1
(5 rows)

dev=> select sum((sum(b))) over (partition by a order by a) from t group by a;
 sum
-----
  56
  68
  23
  56
  13
(5 rows)

explain query

dev=> explain select row_number() over (partition by a order by a) from t group by a;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 BatchExchange { order: [], dist: Single }
 └─BatchProject { exprs: [row_number] }
   └─BatchOverWindow { window_functions: [row_number() OVER(PARTITION BY t.a ORDER BY t.a ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)] }
     └─BatchSort { order: [t.a ASC, t.a ASC] }
       └─BatchHashAgg { group_key: [t.a], aggs: [] }
         └─BatchExchange { order: [], dist: HashShard(t.a) }
           └─BatchScan { table: t, columns: [a] }
(7 rows)

dev=> explain select row_number() over (partition by a order by a desc) from t group by a;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 BatchExchange { order: [], dist: Single }
 └─BatchProject { exprs: [row_number] }
   └─BatchOverWindow { window_functions: [row_number() OVER(PARTITION BY t.a ORDER BY t.a DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)] }
     └─BatchSort { order: [t.a ASC, t.a DESC] }
       └─BatchHashAgg { group_key: [t.a], aggs: [] }
         └─BatchExchange { order: [], dist: HashShard(t.a) }
           └─BatchScan { table: t, columns: [a] }
(7 rows)

dev=> explain select sum((sum(b))) over (partition by a order by a) from t group by a;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 BatchExchange { order: [], dist: Single }
 └─BatchProject { exprs: [sum] }
   └─BatchOverWindow { window_functions: [sum(sum(t.b)) OVER(PARTITION BY t.a ORDER BY t.a ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)] }
     └─BatchSort { order: [t.a ASC, t.a ASC] }
       └─BatchHashAgg { group_key: [t.a], aggs: [sum(t.b)] }
         └─BatchExchange { order: [], dist: HashShard(t.a) }
           └─BatchScan { table: t, columns: [a, b] }
(7 rows)

@st1page
Copy link
Contributor

st1page commented Sep 27, 2023

thanks! could you also help to add some planner test?
https://github.com/risingwavelabs/risingwave/blob/abaf83b784ed8113ad7a3a76e50f95bf48efbcaa/src/frontend/planner_test/README.md

@stdrc stdrc requested review from stdrc and st1page September 27, 2023 05:37
@jetjinser
Copy link
Contributor Author

jetjinser commented Sep 29, 2023

Hi, I just added some planner_test and e2e_test.
However, I noticed that some differences between rw(e2e test) and pgsql(db fiddle):
pgsql seems to have such a priority order: order by > order by in over window > no order by, so when order by x only in over window, result should be sorted as key x.

I don't know yet how rw handles the sort key, but I'll start checking.

edit:

It dawned on me that the order by in the window function should not affect the final result. I tried mysql 8.0 and its results were consistent with our tests in e2e. My current consideration is that maybe this is where pgsql is not reasonable enough?

tested mysql: 8.0.
tested pgsql: 9.4, 9.5, 9.6, 10, 11, 12, 13, 14, 15.

edit*:

documented: If ORDER BY is not given, the rows are returned in whatever order the system finds fastest to produce.

Maybe this can be seen as compatible behavior.

Copy link
Member

@stdrc stdrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM! Thx for your contribution!

Copy link
Member

@stdrc stdrc Oct 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you plz test some cases like the following where partition by and order by are not the same as group by?

select
    a, b,
    sum( sum(c) ) over (partition by a order by b)
from t
group by a, b;

Or, if we want to go further:

select
    a, b,
    sum( sum(c) ) over (partition by a, avg(d) order by max(e), b)
from t
group by a, b;

Also plz add streaming version of these tests. You may refer to how OverWindow is e2e-tested.

Comment on lines 939 to 944
- sql: |
CREATE TABLE t (a int, b int);
SELECT a, row_number() OVER (PARTITION BY a ORDER BY a DESC) FROM t;
expected_outputs:
- batch_plan
- stream_plan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems not a test for agg.

Comment on lines 1652 to 1670
- sql: |
CREATE TABLE t(a int, b int);
SELECT a, sum((sum(b))) OVER (PARTITION BY a ORDER BY a) FROM t GROUP BY a;
batch_plan: |-
BatchExchange { order: [], dist: Single }
└─BatchProject { exprs: [t.a, sum] }
└─BatchOverWindow { window_functions: [sum(sum(t.b)) OVER(PARTITION BY t.a ORDER BY t.a ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)] }
└─BatchSort { order: [t.a ASC, t.a ASC] }
└─BatchHashAgg { group_key: [t.a], aggs: [sum(t.b)] }
└─BatchExchange { order: [], dist: HashShard(t.a) }
└─BatchScan { table: t, columns: [t.a, t.b], distribution: SomeShard }
stream_plan: |-
StreamMaterialize { columns: [a, sum], stream_key: [a], pk_columns: [a], pk_conflict: NoCheck }
└─StreamProject { exprs: [t.a, sum] }
└─StreamOverWindow { window_functions: [sum(sum(t.b)) OVER(PARTITION BY t.a ORDER BY t.a ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)] }
└─StreamProject { exprs: [t.a, sum(t.b)] }
└─StreamHashAgg { group_key: [t.a], aggs: [sum(t.b), count] }
└─StreamExchange { dist: HashShard(t.a) }
└─StreamTableScan { table: t, columns: [t.a, t.b, t._row_id], pk: [t._row_id], dist: UpstreamHashShard(t._row_id) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but similarly, will better to have different partition by, order by and group by.

@jetjinser
Copy link
Contributor Author

will it feel better now?

Copy link
Member

@stdrc stdrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

a,
sum((sum(b))) over (partition by a order by a)
from t
group by a;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to add order by a to ensure consistent result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is for consistent result, should order by constraints be added to every query?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Should be added to every query in this file.

@jetjinser jetjinser added this pull request to the merge queue Oct 10, 2023
Merged via the queue into risingwavelabs:main with commit 6ec0b95 Oct 10, 2023
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/fix Bug fix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: frontend panic when using aggregate function result as window function argument
4 participants