
refactor: deprecate StreamChunkWithState #14524

Merged
merged 47 commits into main from kanzhen/deprecate_chunk_with_state
Jan 25, 2024

Conversation

Rossil2012
Contributor

@Rossil2012 Rossil2012 commented Jan 12, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Resolve #14384.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@Rossil2012 Rossil2012 changed the title from "Kanzhen/deprecate chunk with state" to "refactor: deprecate StreamChunkWithState" Jan 12, 2024
@Rossil2012 Rossil2012 marked this pull request as ready for review January 15, 2024 06:22
@Rossil2012 Rossil2012 requested a review from tabVersion January 15, 2024 06:22
Contributor

@tabVersion tabVersion left a comment

I think it is on the right track

src/common/src/catalog/mod.rs (outdated review thread; resolved)
src/connector/src/parser/mod.rs (outdated review thread; resolved)
src/source/src/source_desc.rs (outdated review thread; resolved)
src/source/src/source_desc.rs (outdated review thread; resolved)
src/stream/src/executor/source/mod.rs (outdated review thread; resolved)
src/stream/src/executor/source/fs_source_executor.rs (outdated review thread; resolved)
@BugenZhao
Member

Is the PR ready for review?

@BugenZhao BugenZhao self-requested a review January 16, 2024 03:02
@tabVersion tabVersion marked this pull request as draft January 16, 2024 04:35
@tabVersion
Contributor

Is the PR ready for review?

No, it still lacks the "write to state table" part.

Comment on lines 52 to 64
pub fn get_split_offset_mapping_from_chunk(
    chunk: &StreamChunk,
    partition_idx: usize,
    offset_idx: usize,
) -> Option<HashMap<SplitId, String>> {
    let mut split_offset_mapping = HashMap::new();
    for (_, row) in chunk.rows() {
        let split_id = row.datum_at(partition_idx).unwrap().into_utf8().into();
        let offset = row.datum_at(offset_idx).unwrap().into_utf8();
        split_offset_mapping.insert(split_id, offset.to_string());
    }
    Some(split_offset_mapping)
}
Member

@xxchan xxchan Jan 16, 2024

This is an additional round of iteration compared with StreamChunkWithState. We might need to benchmark.

Contributor

We also do it in the parser; this just moves the logic here. Iterating over each chunk to get the latest offsets is inevitable.

Member

@xxchan xxchan Jan 16, 2024

Yes, I know that. I'm emphasizing "additional" here: we need to iterate here, but we still need to iterate in the original place for other things, so it's not "just moving".
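To make the discussion above concrete, here is a minimal standalone sketch of the iteration in question. The `(split_id, offset)` pairs are a hypothetical stand-in for the real `StreamChunk` rows; the point is that the latest offset per split falls out of overwriting `HashMap` inserts, which is why a full pass over the chunk's rows is needed:

```rust
use std::collections::HashMap;

// Hypothetical simplification: each row of a chunk is a (split_id, offset) pair.
fn latest_offset_per_split(rows: &[(&str, &str)]) -> HashMap<String, String> {
    let mut mapping = HashMap::new();
    for (split_id, offset) in rows {
        // Rows arrive in order, so a later insert for the same split
        // overwrites the earlier one; the map ends up holding the latest offset.
        mapping.insert(split_id.to_string(), offset.to_string());
    }
    mapping
}

fn main() {
    let rows = [("split-0", "1"), ("split-1", "7"), ("split-0", "2")];
    let mapping = latest_offset_per_split(&rows);
    assert_eq!(mapping["split-0"], "2"); // the later row wins
    assert_eq!(mapping["split-1"], "7");
    println!("{mapping:?}");
}
```

This per-row pass is the same shape of work as `get_split_offset_mapping_from_chunk` above, on top of whatever iteration the parser already does, which is the benchmarking concern raised in this thread.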

@lmatz
Contributor

lmatz commented Jan 23, 2024

The committer owns the performance too; please check https://www.notion.so/risingwave-labs/Manually-Build-Image-and-Run-Performance-Longevity-Test-b784f1eae1cf48889b2645d020b6b7d3?pvs=4

Member

@BugenZhao BugenZhao left a comment

Generally LGTM

src/connector/src/source/manager.rs (outdated review thread; resolved)
src/connector/src/source/nexmark/source/reader.rs (outdated review thread; resolved)
@Rossil2012
Contributor Author

Before refactor:
[screenshot: 2024-01-25 10:39:52]
After refactor:
[screenshot: 2024-01-25 10:40:58]

Is the performance degradation acceptable?

@BugenZhao
Member

Is the performance degradation acceptable?

LGTM

@BugenZhao
Member

Another question: previously we only recorded the offset of the last record for each split every time we yielded a chunk. Now we make it per-record.

  • Do we ensure that the offset for each record corresponds exactly?
  • Is there or will there be any downstream relying on this invariant?

@BugenZhao BugenZhao requested a review from StrikeW January 25, 2024 03:46
@Rossil2012
Contributor Author

  • Do we ensure that the offset for each record corresponds exactly?

Not sure about what "corresponds" indicates. For now, we ensure the offset for each columns is distinct, and the latest offset of each split in one chunk equals the split_offset_mapping before the refactor.

  • Is there or will there be any downstream relying on this invariant?

IIUC, there are two cases where offsets are used. In the Source Executor, the offsets of partitions are persisted and used in recovery. In the Fetch Executor, offsets are used to track whether the file being read has reached its end, so we can move on to another file. This refactor covers both cases.
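As an illustration of the second case only, here is a hypothetical standalone sketch (not the actual Fetch Executor code; the maps, names, and size comparison are all assumptions for illustration): a file split can be considered exhausted once its latest offset reaches the file's known size.

```rust
use std::collections::HashMap;

// Hypothetical sketch: decide which file splits are fully read by comparing
// the latest offset per split against the file's total size.
fn finished_splits(
    latest_offsets: &HashMap<String, u64>,
    file_sizes: &HashMap<String, u64>,
) -> Vec<String> {
    let mut done = Vec::new();
    for (split, offset) in latest_offsets {
        if let Some(size) = file_sizes.get(split) {
            // Offset has caught up with the file size: nothing left to read.
            if offset >= size {
                done.push(split.clone());
            }
        }
    }
    done
}

fn main() {
    let offsets = HashMap::from([("a.json".to_string(), 100), ("b.json".to_string(), 40)]);
    let sizes = HashMap::from([("a.json".to_string(), 100), ("b.json".to_string(), 80)]);
    // Only a.json has reached its end; b.json still has bytes to read.
    assert_eq!(finished_splits(&offsets, &sizes), vec!["a.json".to_string()]);
    println!("done: {:?}", finished_splits(&offsets, &sizes));
}
```

The point is that both the recovery case and this EOF-tracking case only ever need the latest offset per split, which the per-chunk mapping provides.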

@StrikeW
Contributor

StrikeW commented Jan 25, 2024

I'd like to take a look.

@@ -42,3 +49,51 @@ pub async fn barrier_to_message_stream(mut rx: UnboundedReceiver<Barrier>) {
}
bail!("barrier reader closed unexpectedly");
}

pub fn get_split_offset_mapping_from_chunk(
Contributor

It seems we still rely on split_offset_mapping to update the source's internal state. Do you plan to refactor that part in the future? We should ensure the semantics are the same as before, I think.

Contributor Author

split_offset_mapping is actually the latest offset of each partition; we cannot avoid this round of iteration to compute it.

Contributor

@StrikeW StrikeW left a comment

Generally LGTM

@Rossil2012 Rossil2012 added this pull request to the merge queue Jan 25, 2024
Merged via the queue into main with commit 372c2d7 Jan 25, 2024
26 of 27 checks passed
@Rossil2012 Rossil2012 deleted the kanzhen/deprecate_chunk_with_state branch January 25, 2024 11:14
@BugenZhao
Member

Not sure about what "corresponds" indicates. For now, we ensure the offset for each columns is distinct

I assume you meant to say rows instead of columns. That's exactly what I was going to ask!

Development

Successfully merging this pull request may close these issues.

refactor: deprecate StreamChunkWithState for source state
6 participants