Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve: output stream_key of Dedup should be upstream's stream_key instead of all columns #14298

Closed
fuyufjh opened this issue Jan 2, 2024 · 4 comments

Comments

@fuyufjh
Copy link
Member

fuyufjh commented Jan 2, 2024

Output stream_key of Dedup should be upstream's stream_key instead of all columns, because it's strictly better.

By the way, a related problem: now AppendOnlyDedup uses PK as its dedup key.

@github-actions github-actions bot added this to the release-1.6 milestone Jan 2, 2024
@st1page
Copy link
Contributor

st1page commented Jan 2, 2024

Currently, we use the dedup_cols as the stream key which means the keys in the distinct on clause.

fn stream_key(&self) -> Option<Vec<usize>> {
what do you mean about the "all columns"?

@fuyufjh
Copy link
Member Author

fuyufjh commented Jan 3, 2024

Currently, we use the dedup_cols as the stream key which means the keys in the distinct on clause.

fn stream_key(&self) -> Option<Vec<usize>> {

what do you mean about the "all columns"?

I mean for select distinct (without on (<columns>)), all the columns will be used as stream key.

@fuyufjh
Copy link
Member Author

fuyufjh commented Jan 3, 2024

Yesterday I read some code around this part, and I am not sure whether it's good.

  1. For distinct on (<columns>), it seems reasonable to use the specified distinct columns as stream key. Usually in this cases the distinct columns are selected by users, only including the useful ones.
  2. Additionally, using upstream stream key is not 100% "strictly better" considering hidden key row_id. It makes the work become a bit tricky.

Perhaps I should close this issue first.

@fuyufjh fuyufjh closed this as not planned Won't fix, can't repro, duplicate, stale Jan 3, 2024
@st1page
Copy link
Contributor

st1page commented Jan 3, 2024

Sadly, there are two invariants on the stream key

  1. It should be a unique key
  2. The dist key must be a subset of the stream key

So if we want to change the stream_key of the Dedup, we have to also change its to_stream implementation to modify its distribution enforcement logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants