Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

non-append-only distinct may output adjacent noop updates #17030

Closed
BugenZhao opened this issue May 30, 2024 · 0 comments · Fixed by #17048
Closed

non-append-only distinct may output adjacent noop updates #17030

BugenZhao opened this issue May 30, 2024 · 0 comments · Fixed by #17048
Assignees
Labels
component/streaming Stream processing related issue. type/enhancement Improvements to existing implementation.
Milestone

Comments

@BugenZhao
Copy link
Member

BugenZhao commented May 30, 2024

Using DISTINCT on a non-append-only relation will generate a plan like this:

dev=> create table t (v int);
CREATE_TABLE

dev=> explain create materialized view mv as select distinct v from t;
                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 StreamMaterialize { columns: [v], stream_key: [v], pk_columns: [v], pk_conflict: NoCheck }
 └─StreamProject { exprs: [t.v] }
   └─StreamHashAgg { group_key: [t.v], aggs: [count] }
     └─StreamExchange { dist: HashShard(t.v) }
       └─StreamTableScan { table: t, columns: [v, _row_id] }
(5 rows)

If we insert an existing value into the table, we'll still get a chunk with adjacent noop updates in StreamMaterialize.

2024-05-30T17:04:40.659745+08:00 DEBUG   rw-streaming actor{otel.name="Actor 27" actor_id=27 prev_epoch=6542110140137472 curr_epoch=6542110205739008}:executor{otel.name="Materialize 1B00000004"}:executor{otel.name="Project 1B00000003"}:executor{otel.name="HashAgg 1B00000002"}: events::stream::message::chunk: 
+----+-----+-------+
|    | t.v | count |
+----+-----+-------+
| U- | 233 | 1     |
| U+ | 233 | 2     |
+----+-----+-------+
 cardinality=2 capacity=2
2024-05-30T17:04:40.659848+08:00 DEBUG   rw-streaming actor{otel.name="Actor 27" actor_id=27 prev_epoch=6542110140137472 curr_epoch=6542110205739008}:executor{otel.name="Materialize 1B00000004"}:executor{otel.name="Project 1B00000003"}: events::stream::message::chunk: 
+----+-----+
|    | t.v |
+----+-----+
| U- | 233 |
| U+ | 233 |
+----+-----+
 cardinality=2 capacity=2

If the output is further used as a dimension table in Join, the noop update will cause an amplification per-row, resulting in extremely high latency. Note that this does not only happen on DISTINCT, LATERAL JOIN could also generate a plan like this.


This is mainly because we add an extra row_count agg-call for internal use, which is then stripped with the following Project. We are missing the optimization here:

if prev_row_count == 0 && curr_row_count == 0 || prev_outputs == curr_outputs {
// No rows exist, or output is not changed.
None

Also, if there's no column-pruning dispatchers (typically in MV on MV), we'll also miss the optimization in #14652.

// Apply output indices after calculating the vnode.
let chunk = if self.output_indices.len() < chunk.columns().len() {
chunk
.project(&self.output_indices)
.eliminate_adjacent_noop_update()
} else {
chunk.project(&self.output_indices)
};

Following the idea of #10949, I'm considering whether we should apply the optimization to Projects, or at least specifically the Project in the case of this issue.

@BugenZhao BugenZhao added type/enhancement Improvements to existing implementation. component/streaming Stream processing related issue. labels May 30, 2024
@github-actions github-actions bot added this to the release-1.10 milestone May 30, 2024
@BugenZhao BugenZhao self-assigned this May 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/streaming Stream processing related issue. type/enhancement Improvements to existing implementation.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant