Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: union causes unreasonable long key #14314

Closed
fuyufjh opened this issue Jan 3, 2024 · 6 comments
Closed

Discussion: union causes unreasonable long key #14314

fuyufjh opened this issue Jan 3, 2024 · 6 comments
Assignees
Milestone

Comments

@fuyufjh
Copy link
Member

fuyufjh commented Jan 3, 2024

Problems

We observed several times that incorrect usage union leads to extraordinary long stream key, usually including a jsonb column. For example,

create materialized view mv1
select * from t1
union
select * from t2
union
select * from t3

RisingWave will optimize this query into

Fragment 0
StreamMaterialize
└── StreamAppendOnlyDedup 
    └── StreamExchange Hash([0, 1, 2, 3, 4, 5, 6, 7]) from 1

Fragment 1
StreamProject
└── StreamUnion { all: true }
    ├── StreamExchange Hash([8, 9]) from 2
    ├── StreamExchange Hash([8, 9]) from 3
    └── StreamExchange Hash([8, 9]) from 4

Note the StreamExchange and StreamAppendOnlyDedup (or StreamGroupTopN for non-append-only cases) are on the full columns, so the cost is high.

Even worse, the long stream key will be used by downstream MVs, which leads to even higher cost.

Solution?

More than one approaches can somehow mitigate this problem.

  1. Disallow union by default (can be worked-around by setting a session variable)
  2. Disallow any aggregation on a JSONB column by default (ditto.)
  3. Use upstream stream key for dedup operator instead of distinct columns

Personally I prefer 2.

@github-actions github-actions bot added this to the release-1.6 milestone Jan 3, 2024
@st1page
Copy link
Contributor

st1page commented Jan 3, 2024

Use upstream stream key for dedup operator instead of distinct columns

I guess it is not correct SQL semantics

Disallow any aggregation on a JSONB column by default (ditto.)

+1 for this.

@st1page
Copy link
Contributor

st1page commented Jan 3, 2024

Use upstream stream key for dedup operator instead of distinct columns

I guess it is not correct SQL semantics

a case

dev=> create table t1(v int, k int primary key);
CREATE_TABLE
dev=> create table t2(v int, k int primary key);
CREATE_TABLE
dev=> insert into t1 values (1,1);
INSERT 0 1
dev=> insert into t2 values (2,1);
INSERT 0 1
dev=> select distinct on (k) * from (select * from t1 union all select * from t2);
 v | k 
---+---
 2 | 1
(1 row)

dev=> select * from t1 union select * from t2;
 v | k 
---+---
 1 | 1
 2 | 1
(2 rows)

@fuyufjh
Copy link
Member Author

fuyufjh commented Jan 4, 2024

Then, let's try out

  1. Disallow any aggregation on a JSONB column by default (ditto.)

?

Note that the aggregation is not only limited to group by. select distinct, union or group top-N are all included.

@yuhao-su
Copy link
Contributor

yuhao-su commented Jan 4, 2024

So in general we don't want jsonb in stream key?

@fuyufjh
Copy link
Member Author

fuyufjh commented Jan 4, 2024

So in general we don't want jsonb in stream key?

Yes. It's mostly a mistake and will lead to very long stream key.

@xxchan
Copy link
Member

xxchan commented Jan 5, 2024

#7981 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants