Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug(StreamHashJoin): redundant/unexpected output ops for update #14835

Closed
xiangjinwu opened this issue Jan 29, 2024 · 5 comments
Closed

bug(StreamHashJoin): redundant/unexpected output ops for update #14835

xiangjinwu opened this issue Jan 29, 2024 · 5 comments
Assignees
Labels
component/streaming Stream processing related issue. no-issue-activity type/bug Something isn't working

Comments

@xiangjinwu
Copy link
Contributor

xiangjinwu commented Jan 29, 2024

Describe the bug

Refer to repro steps and expected behavior.

Error message/log

No response

To Reproduce

Add tracing::warn!("proj-input: {}", chunk.to_pretty_with_schema(&sch)); in phase 1 Message::Chunk.
Also let sch = input.schema().clone(); before inner.execute().
(Are there simpler ways to inspect on input/output chunks of a specific executor?)

let st = input.execute().map(move |msg| {
let this = this.clone();
let msg = msg?;
let is_fence: bool;
#[auto_enum(Future)]
let fut = match msg {
Message::Chunk(chunk) => {
is_fence = false;

create table posts (pid int primary key, title varchar);
create table comments (cid int primary key, pid int);
create materialized view num_comments as select pid, count(*) as num from comments group by pid;
create materialized view stat_by_posts as select p.pid, title, coalesce(num, 0) from posts p left join num_comments n on p.pid = n.pid;
insert into posts values (1, 'hello');
insert into comments values (11, 1);
insert into comments values (12, 1);

The input chunk of project executor, which is the output chunk of join executor, appears as following in the added log:

+----+-----------+-------------+------------------+------------------+
|    | posts.pid | posts.title | num_comments.num | num_comments.pid |
+----+-----------+-------------+------------------+------------------+
| U- | 1         | hello       | 1                | 1                |
| U+ | 1         | hello       |                  |                  |
| U- | 1         | hello       |                  |                  |
| U+ | 1         | hello       | 2                | 1                |
+----+-----------+-------------+------------------+------------------+

Note the above uses a left outer join. If it was inner join, the chunks logged would be:

+---+-----------+-------------+------------------+------------------+
|   | posts.pid | posts.title | num_comments.num | num_comments.pid |
+---+-----------+-------------+------------------+------------------+
| - | 1         | hello       | 1                | 1                |
| + | 1         | hello       | 2                | 1                |
+---+-----------+-------------+------------------+------------------+

Expected behavior

By intuition, the ideal output of join executor in this case is (regardless of left outer or inner):

+----+-----------+-------------+------------------+------------------+
|    | posts.pid | posts.title | num_comments.num | num_comments.pid |
+----+-----------+-------------+------------------+------------------+
| U- | 1         | hello       | 1                | 1                |
| U+ | 1         | hello       | 2                | 1                |
+----+-----------+-------------+------------------+------------------+

Semantically, they look equivalent and we may think it has little impact on downstream processing. However, when sinking with upsert behavior, it results in unnecessary deletion where a single insert in sufficient.

How did you deploy RisingWave?

./risedev playground

The version of RisingWave

main (f16a623)

Additional context

Derived from a customer reported issue about upsert sink containing unexpected tombstones (i.e. deletions), even when derived upstream key matches with specified downstream key. cc @tabVersion @st1page

@xiangjinwu xiangjinwu added type/bug Something isn't working component/streaming Stream processing related issue. labels Jan 29, 2024
@github-actions github-actions bot added this to the release-1.7 milestone Jan 29, 2024
@yuhao-su
Copy link
Contributor

It is a legitimate output for the current join implementation. Doing such optimization can be complicated, not to mention we can't avoid the "redundant" U- U+ mentioned above if the two input are in different chunks.

If it really frustrating to have unnecessary delete to sink downstream, we can consider adding a compaction before sink.

@BugenZhao
Copy link
Member

Are there simpler ways to inspect on input/output chunks of a specific executor?

Yes, but currently not for specific executor.

#12024 (comment)

@fuyufjh
Copy link
Member

fuyufjh commented Mar 26, 2024

we can consider adding a compaction before sink.

Done in #15345

However, personally, I still hope to complete this issue. #15345 is somehow a "patch" instead of a native solution.

Copy link
Contributor

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

@yuhao-su
Copy link
Contributor

should be resolved in #17450

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/streaming Stream processing related issue. no-issue-activity type/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants