Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(stream): compact stream chunk by dropping useless rows in sink #11070

Merged
merged 55 commits into from
Sep 13, 2023

Conversation

xx01cyx
Copy link
Contributor

@xx01cyx xx01cyx commented Jul 19, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR compacts a stream chunk by converting UPDATE into INSERT & DELETE, as well as dropping useless rows. Specifically, UPDATE K->K will be dropped. The chunk compaction is used in sink executor and project executor.

Related issues: #10949
Fix #10853

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR contains user-facing changes.
Click here for Documentation

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

  • Installation and deployment
  • Connector (sources & sinks)
  • SQL commands, functions, and operators
  • RisingWave cluster configuration changes
  • Other (please specify in the release note below)

Release note

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

license-eye has totally checked 3827 files.

Valid Invalid Ignored Fixed
1714 1 2112 0
Click to see the invalid file list
  • src/stream/src/common/compact_chunk.rs

src/stream/src/common/compact_chunk.rs Outdated Show resolved Hide resolved
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@BugenZhao BugenZhao self-requested a review July 20, 2023 04:02
@xx01cyx xx01cyx requested a review from st1page July 20, 2023 05:32
@codecov
Copy link

codecov bot commented Jul 20, 2023

Codecov Report

Merging #11070 (7ef5945) into main (da89875) will increase coverage by 0.03%.
Report is 69 commits behind head on main.
The diff coverage is 90.26%.

@@            Coverage Diff             @@
##             main   #11070      +/-   ##
==========================================
+ Coverage   69.76%   69.80%   +0.03%     
==========================================
  Files        1404     1405       +1     
  Lines      235006   235243     +237     
==========================================
+ Hits       163958   164204     +246     
+ Misses      71048    71039       -9     
Flag Coverage Δ
rust 69.80% <90.26%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
src/common/src/array/mod.rs 88.06% <ø> (ø)
src/common/src/row/project.rs 74.13% <0.00%> (-5.87%) ⬇️
src/connector/src/sink/kafka.rs 38.82% <0.00%> (ø)
src/connector/src/sink/kinesis.rs 0.00% <0.00%> (ø)
src/connector/src/source/external.rs 35.51% <0.00%> (ø)
src/frontend/src/optimizer/plan_node/stream.rs 13.69% <ø> (ø)
src/stream/src/from_proto/sink.rs 0.00% <0.00%> (ø)
src/connector/src/sink/mod.rs 37.37% <50.00%> (ø)
src/stream/src/executor/sink.rs 91.74% <87.50%> (+0.12%) ⬆️
src/common/src/array/stream_chunk.rs 77.04% <93.33%> (+15.33%) ⬆️
... and 5 more

... and 10 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

src/common/src/array/compact_chunk.rs Outdated Show resolved Hide resolved
src/common/src/array/data_chunk_iter.rs Outdated Show resolved Hide resolved
@TennyZhuang
Copy link
Contributor

A quick question. Should we compact per chunk? Why not per barrier in SinkExecutor?

@st1page st1page linked an issue Jul 24, 2023 that may be closed by this pull request
2 tasks
@st1page
Copy link
Contributor

st1page commented Sep 12, 2023

implement some new logic to maintain the UpdateDelete and UpdateInsert in the input chunk. PTAL again @fuyufjh , thanks. Maybe we can open it in project executor?

@fuyufjh
Copy link
Member

fuyufjh commented Sep 12, 2023

implement some new logic to maintain the UpdateDelete and UpdateInsert in the input chunk. PTAL again @fuyufjh , thanks. Maybe we can open it in project executor?

-1. It seems to be too heavy (and also obscure) for being used in ProjectExecutor, especially for the event reordering

@fuyufjh
Copy link
Member

fuyufjh commented Sep 12, 2023

From my point of view, this PR has been diverged a lot from its original motivation (#10853), and has become somehow over-designed.

Also slightly -1 for #10949, because it's a problem imagined by ourselves. Even if we need to solve it, we need to carefully choose the implementation (must be very fast) and where to put it on (such as before UDF and after column-pruning Project)

I tend to close this PR now and restart from the very beginning: solve #10853 at SinkExecutor.

@st1page
Copy link
Contributor

st1page commented Sep 12, 2023

I tend to close this PR now and restart from the very beginning: solve #10853 at SinkExecutor.

Fine, Let's merge this PR and I will solve it in sink executor later

Copy link
Contributor

@TennyZhuang TennyZhuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loooks good enough to merge

@fuyufjh
Copy link
Member

fuyufjh commented Sep 12, 2023

I tend to close this PR now and restart from the very beginning: solve #10853 at SinkExecutor.

Fine, Let's merge this PR and I will solve it in sink executor later

Misunderstood. 😇 I mean close, not merge.

@st1page
Copy link
Contributor

st1page commented Sep 12, 2023

solve #10853 at SinkExecutor

solve #10853 at SinkExecutor need this PR absolutely

@fuyufjh
Copy link
Member

fuyufjh commented Sep 12, 2023

Anyway, you can merge it if you insist, because it only changes the Sink. Although I think my proposal is a little bit better in terms of efficiency.

It's still unacceptable to me to do the same thing in ProjectExecutor. The reason has been states in previous comments:

  • I do think it might be dangerous to reorder events in the process of streaming. After all, no operator does this now. But sorry I can't come up with a good example now.
  • I hope to preserve the U-/U+ rows. There are still lots of operators can benefit more or less from this. Reordering all deletes before inserts will completely break U-/U+.

@st1page
Copy link
Contributor

st1page commented Sep 12, 2023

Anyway, you can merge it if you insist, because it only changes the Sink. Although I think my proposal is a little bit better in terms of efficiency.

It's still unacceptable to me to do the same thing in ProjectExecutor. The reason has been states in previous comments:

  • I do think it might be dangerous to reorder events in the process of streaming. After all, no operator does this now. But sorry I can't come up with a good example now.
  • I hope to preserve the U-/U+ rows. There are still lots of operators can benefit more or less from this. Reordering all deletes before inserts will completely break U-/U+.

But both issues are solved now 😇 The only remind issue here is the performance overhead now,

@fuyufjh
Copy link
Member

fuyufjh commented Sep 12, 2023

Anyway, you can merge it if you insist, because it only changes the Sink. Although I think my proposal is a little bit better in terms of efficiency.
It's still unacceptable to me to do the same thing in ProjectExecutor. The reason has been states in previous comments:

  • I do think it might be dangerous to reorder events in the process of streaming. After all, no operator does this now. But sorry I can't come up with a good example now.
  • I hope to preserve the U-/U+ rows. There are still lots of operators can benefit more or less from this. Reordering all deletes before inserts will completely break U-/U+.

But both issues are solved now 😇 The only remind issue here is the performance overhead now,

Sorry, I missed some changes in the file. Just had a huddle with @st1page and here is the notes:

I hope to preserve the U-/U+ rows. There are still lots of operators can benefit more or less from this. Reordering all deletes before inserts will completely break U-/U+.

This doesn't apply on current implementation.

I do think it might be dangerous to reorder events in the process of streaming. After all, no operator does this now. But sorry I can't come up with a good example now.

There is no reordering now. This PR finds out all pair of deletes and insert and set their invisibility to false. I have concern with this behavior, because I'm not 100% sure it's safe.

For example, assume there is a unique(v) constraint in a MView as well as its input stream.

   k   v
--------
+  1   5
-  1   5  <-- to eliminate
+  2   5
-  2   5
+  1   5  <-- to eliminate

If we eliminate the rows, then after + 1 5 and + 2 5, the constraint will be broken temporarily.

In order words, this is somehow another kind of reordering: logically, it delays the delete to the position right before its paired insert.

I know we don't have unique constraint and it's not a good example, but I do have such concern. After all, it's hard to say how much it can improve.

@st1page st1page closed this Sep 13, 2023
@st1page st1page reopened this Sep 13, 2023
@st1page st1page enabled auto-merge September 13, 2023 06:39
@st1page st1page added this pull request to the merge queue Sep 13, 2023
@st1page st1page changed the title feat(stream): compact stream chunk by dropping useless rows feat(stream): compact stream chunk by dropping useless rows in sink Sep 13, 2023
Merged via the queue into main with commit 690a3b4 Sep 13, 2023
@st1page st1page deleted the cyx/compact-stream-chunk branch September 13, 2023 07:11
Li0k pushed a commit that referenced this pull request Sep 15, 2023
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: st1page <[email protected]>
Co-authored-by: stonepage <[email protected]>
Co-authored-by: Bugen Zhao <[email protected]>
Co-authored-by: TennyZhuang <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Oct 9, 2023
@StrikeW StrikeW mentioned this pull request Jun 7, 2024
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

perf(streaming): compact useless intermediate result in stream chunk sink emitting more rows than expected
6 participants