fix(over window): fix error in using aggregate function result as win… #12551

jetjinser · 2023-09-26T18:39:07Z

…dow function argument

rewrote the partition by and group by expressions (input_ref) in the window function to avoid out of index problems.

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Fixes #10666.

Checklist

I have written necessary rustdoc comments
I have added necessary unit tests and integration tests
I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
All checks passed in ./risedev check (or alias, ./risedev c)
My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)

My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

…dow function argument rewrote the `partition by` and `group by` expressions (`input_ref`) in the window function to avoid out of index problems.

jetjinser · 2023-09-26T18:46:50Z

prepare table

dev=> create table t (id int, a int, b int, c int);
CREATE_TABLE

dev=> insert into t values (1, 72, 23, 84), (2, 98, 13, 29), (3, 74, 56, 43), (4, 19, 56, 83), (5, 52, 68, 20);
INSERT 0 5

dev=> select * from t;
 id | a  | b  | c
----+----+----+----
  1 | 72 | 23 | 84
  2 | 98 | 13 | 29
  3 | 74 | 56 | 43
  4 | 19 | 56 | 83
  5 | 52 | 68 | 20
(5 rows)

execute query

dev=> select row_number() over (order by a) from t group by a;
ERROR:  QueryError: Feature is not yet implemented: Window function with empty PARTITION BY is not supported yet
No tracking issue yet. Feel free to submit a feature request at https://github.com/risingwavelabs/risingwave/issues/new?labels=type%2Ffeature&template=feature_request.yml

dev=> select row_number() over (partition by a order by a) from t group by a;
 row_number
------------
          1
          1
          1
          1
          1
(5 rows)

dev=> select sum((sum(b))) over (partition by a order by a) from t group by a;
 sum
-----
  56
  68
  23
  56
  13
(5 rows)

explain query

dev=> explain select row_number() over (partition by a order by a) from t group by a;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 BatchExchange { order: [], dist: Single }
 └─BatchProject { exprs: [row_number] }
   └─BatchOverWindow { window_functions: [row_number() OVER(PARTITION BY t.a ORDER BY t.a ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)] }
     └─BatchSort { order: [t.a ASC, t.a ASC] }
       └─BatchHashAgg { group_key: [t.a], aggs: [] }
         └─BatchExchange { order: [], dist: HashShard(t.a) }
           └─BatchScan { table: t, columns: [a] }
(7 rows)

dev=> explain select row_number() over (partition by a order by a desc) from t group by a;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 BatchExchange { order: [], dist: Single }
 └─BatchProject { exprs: [row_number] }
   └─BatchOverWindow { window_functions: [row_number() OVER(PARTITION BY t.a ORDER BY t.a DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)] }
     └─BatchSort { order: [t.a ASC, t.a DESC] }
       └─BatchHashAgg { group_key: [t.a], aggs: [] }
         └─BatchExchange { order: [], dist: HashShard(t.a) }
           └─BatchScan { table: t, columns: [a] }
(7 rows)

dev=> explain select sum((sum(b))) over (partition by a order by a) from t group by a;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 BatchExchange { order: [], dist: Single }
 └─BatchProject { exprs: [sum] }
   └─BatchOverWindow { window_functions: [sum(sum(t.b)) OVER(PARTITION BY t.a ORDER BY t.a ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)] }
     └─BatchSort { order: [t.a ASC, t.a ASC] }
       └─BatchHashAgg { group_key: [t.a], aggs: [sum(t.b)] }
         └─BatchExchange { order: [], dist: HashShard(t.a) }
           └─BatchScan { table: t, columns: [a, b] }
(7 rows)

st1page · 2023-09-27T04:11:45Z

thanks! could you also help to add some planner test?
https://github.com/risingwavelabs/risingwave/blob/abaf83b784ed8113ad7a3a76e50f95bf48efbcaa/src/frontend/planner_test/README.md

jetjinser · 2023-09-29T11:52:31Z

Hi, I just added some planner_test and e2e_test.
However, I noticed that some differences between rw(e2e test) and pgsql(db fiddle):
pgsql seems to have such a priority order: order by > order by in over window > no order by, so when order by x only in over window, result should be sorted as key x.

I don't know yet how rw handles the sort key, but I'll start checking.

edit:

It dawned on me that the order by in the window function should not affect the final result. I tried mysql 8.0 and its results were consistent with our tests in e2e. My current consideration is that maybe this is where pgsql is not reasonable enough?

tested mysql: 8.0.
tested pgsql: 9.4, 9.5, 9.6, 10, 11, 12, 13, 14, 15.

edit*:

documented: If ORDER BY is not given, the rows are returned in whatever order the system finds fastest to produce.

Maybe this can be seen as compatible behavior.

stdrc

Rest LGTM! Thx for your contribution!

stdrc · 2023-10-07T07:54:21Z

e2e_test/batch/aggregate/with_over_window.slt.part

Could you plz test some cases like the following where partition by and order by are not the same as group by?

select a, b, sum( sum(c) ) over (partition by a order by b) from t group by a, b;

Or, if we want to go further:

select a, b, sum( sum(c) ) over (partition by a, avg(d) order by max(e), b) from t group by a, b;

Also plz add streaming version of these tests. You may refer to how OverWindow is e2e-tested.

stdrc · 2023-10-07T07:55:55Z

src/frontend/planner_test/tests/testdata/input/agg.yaml

+- sql: |
+    CREATE TABLE t (a int, b int);
+    SELECT a, row_number() OVER (PARTITION BY a ORDER BY a DESC) FROM t;
+  expected_outputs:
+    - batch_plan
+    - stream_plan


This seems not a test for agg.

stdrc · 2023-10-07T07:57:38Z

src/frontend/planner_test/tests/testdata/output/agg.yaml

+- sql: |
+    CREATE TABLE t(a int, b int);
+    SELECT a, sum((sum(b))) OVER (PARTITION BY a ORDER BY a) FROM t GROUP BY a;
+  batch_plan: |-
+    BatchExchange { order: [], dist: Single }
+    └─BatchProject { exprs: [t.a, sum] }
+      └─BatchOverWindow { window_functions: [sum(sum(t.b)) OVER(PARTITION BY t.a ORDER BY t.a ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)] }
+        └─BatchSort { order: [t.a ASC, t.a ASC] }
+          └─BatchHashAgg { group_key: [t.a], aggs: [sum(t.b)] }
+            └─BatchExchange { order: [], dist: HashShard(t.a) }
+              └─BatchScan { table: t, columns: [t.a, t.b], distribution: SomeShard }
+  stream_plan: |-
+    StreamMaterialize { columns: [a, sum], stream_key: [a], pk_columns: [a], pk_conflict: NoCheck }
+    └─StreamProject { exprs: [t.a, sum] }
+      └─StreamOverWindow { window_functions: [sum(sum(t.b)) OVER(PARTITION BY t.a ORDER BY t.a ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)] }
+        └─StreamProject { exprs: [t.a, sum(t.b)] }
+          └─StreamHashAgg { group_key: [t.a], aggs: [sum(t.b), count] }
+            └─StreamExchange { dist: HashShard(t.a) }
+              └─StreamTableScan { table: t, columns: [t.a, t.b, t._row_id], pk: [t._row_id], dist: UpstreamHashShard(t._row_id) }


Looks good, but similarly, will better to have different partition by, order by and group by.

jetjinser · 2023-10-08T07:09:34Z

will it feel better now?

stdrc

LGTM!

stdrc · 2023-10-09T09:22:31Z

e2e_test/batch/aggregate/with_over_window.slt.part

+    a,
+    sum((sum(b))) over (partition by a order by a)
+from t
+group by a;


Better to add order by a to ensure consistent result.

If it is for consistent result, should order by constraints be added to every query?

Yes. Should be added to every query in this file.

To ensure consistent result

#12551)

#12551) (#12720) Co-authored-by: jinser <[email protected]>

fix(over window): fix error in using aggregate function result as win…

8ba0605

…dow function argument rewrote the `partition by` and `group by` expressions (`input_ref`) in the window function to avoid out of index problems.

stdrc requested review from stdrc and st1page September 27, 2023 05:37

jetjinser added 2 commits September 29, 2023 19:16

test: add planner_test agg with window function

0d7caa6

test: add e2e_test agg with window function

fc9f48b

stdrc reviewed Oct 7, 2023

View reviewed changes

jetjinser added 4 commits October 8, 2023 12:16

test: add more complex agg with window e2e_test

593706c

test: fix the agg planner_test query missed agg

7cd9457

test: more complex agg with window planner_test

3b0645c

test: add "over_window with agg" streaming e2e_test

5e18a8a

stdrc approved these changes Oct 9, 2023

View reviewed changes

test: add group by to every agg + window query

4147b8c

To ensure consistent result

lmatz added the need-cherry-pick-v1.3 label Oct 9, 2023

jetjinser added this pull request to the merge queue Oct 10, 2023

Merged via the queue into risingwavelabs:main with commit 6ec0b95 Oct 10, 2023
9 checks passed

stdrc added type/fix Bug fix need-cherry-pick-v1.3 and removed need-cherry-pick-v1.3 labels Oct 10, 2023

stdrc pushed a commit that referenced this pull request Oct 10, 2023

fix(over window): fix error in using aggregate function result as win… (

e346272

#12551)

stdrc mentioned this pull request Oct 10, 2023

fix(over window): fix error in using aggregate function result as win… (#12551) #12720

Merged

lmatz pushed a commit that referenced this pull request Oct 10, 2023

fix(over window): fix error in using aggregate function result as win… (

fdaedaf

#12551) (#12720) Co-authored-by: jinser <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(over window): fix error in using aggregate function result as win… #12551

fix(over window): fix error in using aggregate function result as win… #12551

jetjinser commented Sep 26, 2023

jetjinser commented Sep 26, 2023

st1page commented Sep 27, 2023

jetjinser commented Sep 29, 2023 •

edited

Loading

stdrc left a comment

stdrc Oct 7, 2023 •

edited

Loading

stdrc Oct 7, 2023

stdrc Oct 7, 2023

jetjinser commented Oct 8, 2023

stdrc left a comment

stdrc Oct 9, 2023

jetjinser Oct 9, 2023

stdrc Oct 9, 2023

fix(over window): fix error in using aggregate function result as win… #12551

fix(over window): fix error in using aggregate function result as win… #12551

Conversation

jetjinser commented Sep 26, 2023

What's changed and what's your intention?

Checklist

Documentation

Release note

jetjinser commented Sep 26, 2023

prepare table

execute query

explain query

st1page commented Sep 27, 2023

jetjinser commented Sep 29, 2023 • edited Loading

edit:

edit*:

stdrc left a comment

Choose a reason for hiding this comment

stdrc Oct 7, 2023 • edited Loading

Choose a reason for hiding this comment

stdrc Oct 7, 2023

Choose a reason for hiding this comment

stdrc Oct 7, 2023

Choose a reason for hiding this comment

jetjinser commented Oct 8, 2023

stdrc left a comment

Choose a reason for hiding this comment

stdrc Oct 9, 2023

Choose a reason for hiding this comment

jetjinser Oct 9, 2023

Choose a reason for hiding this comment

stdrc Oct 9, 2023

Choose a reason for hiding this comment

jetjinser commented Sep 29, 2023 •

edited

Loading

stdrc Oct 7, 2023 •

edited

Loading