fix(cdc): commit offset to upstream after checkpoint has commit #16058

StrikeW · 2024-04-01T16:06:17Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

resolve #15464

Add WaitEpochWorker in the source executor to wait for an epoch to be committed, for the cdc connectors that need it
Add JniDbzSourceRegistry so that the source executor can get the source handler
Add get_encoded_offset() to SplitMetaData to get the offset from a split
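
The flow described above (wait until an epoch is committed, then acknowledge the offset to upstream) can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in this PR: `CommitTask`, `wait_epoch`, `run_worker`, and the `committed_up_to` parameter are hypothetical names, and the real worker is asynchronous and talks to the state store and the JNI source handler.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical message: an epoch paired with the source offset read up to that epoch.
struct CommitTask {
    epoch: u64,
    offset: String,
}

/// Stand-in for waiting on the state store. In this sketch every epoch up to
/// `committed_up_to` counts as committed; anything later is reported as an error.
fn wait_epoch(epoch: u64, committed_up_to: u64) -> Result<(), String> {
    if epoch <= committed_up_to {
        Ok(())
    } else {
        Err(format!("epoch {} was not committed", epoch))
    }
}

/// Runs the worker loop on its own thread and returns the offsets that were
/// acknowledged to upstream, in order.
fn run_worker(tasks: Vec<CommitTask>, committed_up_to: u64) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<CommitTask>();
    let worker = thread::spawn(move || {
        let mut acked = Vec::new();
        // The loop ends once every sender has been dropped.
        for task in rx {
            match wait_epoch(task.epoch, committed_up_to) {
                // Only acknowledge an offset after its epoch is durable.
                Ok(()) => acked.push(task.offset),
                Err(e) => eprintln!("skip offset commit: {}", e),
            }
        }
        acked
    });
    for task in tasks {
        tx.send(task).unwrap();
    }
    drop(tx);
    worker.join().unwrap()
}

fn main() {
    let tasks = vec![
        CommitTask { epoch: 1, offset: "lsn-100".to_string() },
        CommitTask { epoch: 2, offset: "lsn-200".to_string() },
        CommitTask { epoch: 3, offset: "lsn-300".to_string() },
    ];
    // Epoch 3 is treated as not yet committed, so its offset is not acknowledged.
    let acked = run_worker(tasks, 2);
    println!("{:?}", acked);
}
```

Running the offset commit on a separate worker keeps the source executor's barrier handling from blocking on the wait.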

related: #15312

Checklist

I have written necessary rustdoc comments
I have added necessary unit tests and integration tests
I have added test labels as necessary. See details.
I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
All checks passed in ./risedev check (or alias, ./risedev c)
My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)

My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

github-actions

license-eye has totally checked 4959 files.

Valid	Invalid	Ignored	Fixed
2131	1	2827	0

Invalid file list:

java/connector-node/risingwave-source-cdc/src/main/java/io/debezium/embedded/EmbeddedEngineChangeEvent.java

src/connector/src/source/base.rs

src/connector/src/source/kinesis/split.rs

src/connector/src/source/iceberg/mod.rs

src/stream/src/common/table/watermark.rs

src/stream/src/common/table/state_table.rs

hzxa21 · 2024-04-11T06:13:13Z

src/stream/src/executor/source/source_executor.rs

+                            }
+                        }
+                        Err(e) => {
+                            tracing::error!(


At first glance I thought we might miss commit_cdc_offset if we reach here. But I realize that try_wait_epoch on a committed epoch won't fail unless the node is exiting. We may want to add a comment here for this assumption.

We don't need to guarantee that the offset is committed to upstream on every checkpoint; it should not be a big deal if some checkpoints fail to commit the offset. The upstream just needs to retain those logs for a while, and they will eventually be discarded after a future checkpoint.
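
To illustrate why a skipped commit is harmless, here is a small sketch under the assumption that upstream acknowledgements are monotonic, so a later commit subsumes any earlier one. `UpstreamLog`, `commit_offset`, and `replay` are hypothetical names, and offsets are modeled as plain integers.

```rust
/// Hypothetical model of an upstream log that is retained from `retained_from` onward.
struct UpstreamLog {
    retained_from: u64,
}

impl UpstreamLog {
    /// Acknowledge that everything before `offset` may be discarded.
    /// Acknowledgements never move backwards.
    fn commit_offset(&mut self, offset: u64) {
        self.retained_from = self.retained_from.max(offset);
    }
}

/// Replays a sequence of checkpoints; `None` marks a checkpoint whose
/// offset commit was skipped. Returns the final retention point.
fn replay(commits: &[Option<u64>]) -> u64 {
    let mut log = UpstreamLog { retained_from: 0 };
    for offset in commits.iter().flatten() {
        log.commit_offset(*offset);
    }
    log.retained_from
}

fn main() {
    // Skipping the middle commit only delays log cleanup; the end state is the same.
    let with_gap = replay(&[Some(100), None, Some(300)]);
    let without_gap = replay(&[Some(100), Some(200), Some(300)]);
    println!("{} {}", with_gap, without_gap);
}
```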

I agree that as long as we don't miss the offset commit for a long time, it is not a big deal. However, my point is not about whether we will miss it. My point is that try_wait_epoch is not expected to fail in this case. I originally thought we should panic here, but logging an error is more robust. How about just tweaking the message to be "Unexpected failure in try_wait_epoch {}" and adding a comment for this assumption?
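
The suggestion could look roughly like this. It is only a sketch: `Outcome` and `on_wait_result` are hypothetical, errors are plain strings, and `eprintln!` stands in for `tracing::error!`.

```rust
/// Hypothetical outcome of handling one wait result.
#[derive(Debug, PartialEq)]
enum Outcome {
    Committed(u64),
    Logged(String),
}

/// Handles the result of waiting for `epoch` to be committed.
fn on_wait_result(epoch: u64, result: Result<(), String>) -> Outcome {
    match result {
        // The epoch is durable, so the offset can be committed to upstream.
        Ok(()) => Outcome::Committed(epoch),
        // Assumption from the review thread: waiting on a committed epoch is
        // not expected to fail unless the node is exiting, so log instead of
        // panicking. A later checkpoint will commit a newer offset anyway.
        Err(e) => {
            let msg = format!("Unexpected failure in try_wait_epoch {}", e);
            eprintln!("{}", msg);
            Outcome::Logged(msg)
        }
    }
}

fn main() {
    println!("{:?}", on_wait_result(7, Ok(())));
    println!("{:?}", on_wait_result(8, Err("node is shutting down".to_string())));
}
```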

hzxa21 · 2024-04-11T06:19:16Z

src/connector/src/source/reader/reader.rs

@@ -105,6 +105,11 @@ impl SourceReader {
        }
    }

+    /// Postgres and Oracle connectors need to commit the offset to upstream.
+    pub fn need_commit_offset_to_upstream(&self) -> bool {
+        matches!(&self.config, ConnectorProperties::PostgresCdc(_))


How about mysql? I thought all dbz-based cdc needs to commit offset.

Only pg and oracle have this kind of protocol.
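
For readers unfamiliar with the `matches!` gate in the diff above, a stripped-down sketch follows. The enum here is a simplified stand-in: the real `ConnectorProperties` variants carry connector configs, and the variant list is illustrative only.

```rust
/// Simplified stand-in for the connector config enum (variants carry no payload here).
#[allow(dead_code)]
#[derive(Debug)]
enum ConnectorProperties {
    MysqlCdc,
    PostgresCdc,
    MongodbCdc,
    Kafka,
}

/// Returns true only for connectors that must acknowledge offsets to upstream.
/// As in the diff, only the Postgres variant is matched.
fn need_commit_offset_to_upstream(config: &ConnectorProperties) -> bool {
    matches!(config, ConnectorProperties::PostgresCdc)
}

fn main() {
    for config in [
        ConnectorProperties::MysqlCdc,
        ConnectorProperties::PostgresCdc,
        ConnectorProperties::MongodbCdc,
        ConnectorProperties::Kafka,
    ] {
        println!("{:?}: {}", config, need_commit_offset_to_upstream(&config));
    }
}
```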

Do you mean that recordCommitter.markProcessed() and recordCommitter.markBatchFinished() are no-ops for other dbz-based cdc sources?

No, only one step in recordCommitter.markBatchFinished() is a no-op for cdc sources other than pg and oracle.

/**
 * Commits the given offset with the source database. Used by some connectors
 * like Postgres and Oracle to indicate how far the source TX log can be
 * discarded.
 */
default void commitOffset(Map<String, ?> partition, Map<String, ?> offset) { }
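
The same idea, a default no-op hook that only some connectors override, can be expressed in Rust with a trait default method. This is an analogy, not Debezium's or RisingWave's API: `SourceTask`, `MysqlTask`, `PostgresTask`, and `mark_batch_finished` are hypothetical, and the `"lsn"` key is an assumed offset field.

```rust
use std::collections::HashMap;

/// Hypothetical analogue of the Java interface: the default implementation does nothing.
trait SourceTask {
    fn commit_offset(&mut self, _offset: &HashMap<String, String>) {}
}

/// A connector that keeps the default no-op.
struct MysqlTask;
impl SourceTask for MysqlTask {}

/// A connector that overrides the hook to record the acknowledged position.
struct PostgresTask {
    flushed_lsn: Option<String>,
}
impl SourceTask for PostgresTask {
    fn commit_offset(&mut self, offset: &HashMap<String, String>) {
        self.flushed_lsn = offset.get("lsn").cloned();
    }
}

/// Calling the hook for every connector is safe: it only has an effect
/// where the connector overrides it.
fn mark_batch_finished(task: &mut dyn SourceTask, offset: &HashMap<String, String>) {
    task.commit_offset(offset);
}

fn main() {
    let mut offset = HashMap::new();
    offset.insert("lsn".to_string(), "0/16B3748".to_string());

    let mut mysql = MysqlTask;
    let mut pg = PostgresTask { flushed_lsn: None };
    mark_batch_finished(&mut mysql, &offset);
    mark_batch_finished(&mut pg, &offset);
    println!("{:?}", pg.flushed_lsn);
}
```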

I got your point. To be cautious, we should call markProcessed and markBatchFinished for all cdc connectors.

src/connector/src/source/base.rs

...ector-service/src/main/java/com/risingwave/connector/source/core/DbzChangeEventConsumer.java

StrikeW · 2024-04-12T05:21:32Z

I evaluated the effectiveness of this PR with the chaos test: the "Cannot seek to the last known offset" error is gone, which means #15464 can be resolved.
But ch_benchmark_q4 still encounters inconsistency in the cdc source tables (tracked in #15312). I think we can merge this PR first and continue investigating the cdc table inconsistency issue.

hzxa21 · 2024-04-12T07:34:15Z

#15312 (comment) Can we be sure that the extra rows in RW are not caused by this PR?

StrikeW · 2024-04-12T07:39:31Z

I also ran it against the nightly image (750), and the symptom is the same: RW has more rows than the upstream pg.

hzxa21 · 2024-04-12T07:52:53Z

src/connector/src/source/base.rs

+            PostgresCdc(split) => split.start_offset().clone().unwrap_or_default(),
+            MongodbCdc(split) => split.start_offset().clone().unwrap_or_default(),
+            CitusCdc(split) => split.start_offset().clone().unwrap_or_default(),
+            _ => "".to_string(),


unreachable! instead of an empty string?
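
A sketch of what this suggestion would mean, using a hypothetical, simplified `Split` enum (the real split types wrap connector-specific structs) and a hypothetical `cdc_start_offset` helper:

```rust
/// Simplified stand-in for the split enum; CDC variants carry an optional start offset.
enum Split {
    PostgresCdc(Option<String>),
    MysqlCdc(Option<String>),
    Kafka,
}

/// Returns the start offset of a CDC split.
/// Non-CDC splits are treated as a programming error instead of silently
/// yielding an empty string.
fn cdc_start_offset(split: &Split) -> String {
    match split {
        Split::PostgresCdc(offset) | Split::MysqlCdc(offset) => {
            offset.clone().unwrap_or_default()
        }
        _ => unreachable!("only CDC splits carry a start offset"),
    }
}

fn main() {
    let pg = Split::PostgresCdc(Some("0/16B3748".to_string()));
    let mysql = Split::MysqlCdc(None);
    println!("{:?}", cdc_start_offset(&pg));
    println!("{:?}", cdc_start_offset(&mysql));
    // Calling cdc_start_offset(&Split::Kafka) would panic.
    let _ = Split::Kafka;
}
```

The trade-off is that `unreachable!` turns a silent wrong value into a panic, which is only appropriate if callers are guaranteed never to pass a non-CDC split.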

hzxa21

Rest LGTM!

BugenZhao

LGTM for the executor part

StrikeW added 5 commits March 29, 2024 16:11

add wait_epoch to state table

8c7b3b6

framework of cdc offset commit

8c022b1

add get offset method

7536a77

Merge remote-tracking branch 'origin/main' into siyuan/cdc-commit-offset

8876fdf

todo: call java object to commit offset

1e3a061

github-actions bot added the type/fix Bug fix label Apr 1, 2024

github-actions bot reviewed Apr 1, 2024

split reader control worker

2d1644a

StrikeW linked an issue Apr 2, 2024 that may be closed by this pull request

feat: checkpoint commit callback (for cdc-connector) #15464

Closed

StrikeW changed the title ~~fix(cdc): commit offset to upstream after checkpoint has commit~~ fix(cdc): commit offset to upstream after checkpoint has commit (WIP) Apr 2, 2024

StrikeW added 3 commits April 2, 2024 22:36

introduce global JniDbzSourceRegistry to allow source exec can get th…

1410ab2

…e handler

impl commit_cdc_offset

e5bd37b

Merge remote-tracking branch 'origin/main' into siyuan/cdc-commit-offset

2444602

StrikeW added ci/run-backwards-compat-tests Run backwards compatibility tests in your PR. ci/run-build ci/skip-ci ci/run-build-other ci/run-e2e-source-tests labels Apr 7, 2024

refine and tested

e07f7e4

StrikeW force-pushed the siyuan/cdc-commit-offset branch from 3f0e91a to e07f7e4 Compare April 7, 2024 10:00

minor

44fc124

StrikeW mentioned this pull request Apr 7, 2024

chaos mesh daily test: ch-benchmark-pg-cdc data verification failed #15312

Closed

StrikeW changed the title ~~fix(cdc): commit offset to upstream after checkpoint has commit (WIP)~~ fix(cdc): commit offset to upstream after checkpoint has commit Apr 7, 2024

StrikeW marked this pull request as ready for review April 7, 2024 14:38

StrikeW removed ci/run-backwards-compat-tests Run backwards compatibility tests in your PR. ci/run-build ci/skip-ci ci/run-build-other ci/run-e2e-source-tests labels Apr 7, 2024

fix

0b75cc8

StrikeW added 3 commits April 9, 2024 11:41

fix ui

646f6e4

fix error repoprt

b071e60

fix ci

683cc41

StrikeW requested review from BugenZhao and wenym1 April 9, 2024 05:24

BugenZhao reviewed Apr 9, 2024

cleanup

7bc3c92

StrikeW requested a review from BugenZhao April 9, 2024 08:46

BugenZhao requested a review from xxchan April 11, 2024 05:52

hzxa21 reviewed Apr 11, 2024

remove get_encoded_offset()

fa0cde7

github-actions bot removed the ci/run-e2e-single-node-tests label Apr 11, 2024

StrikeW requested a review from hzxa21 April 12, 2024 05:21

hzxa21 reviewed Apr 12, 2024

hzxa21 approved these changes Apr 12, 2024

BugenZhao approved these changes Apr 12, 2024

StrikeW added 2 commits April 12, 2024 16:14

fix comments

952a1d2

fix

1d4a940

StrikeW added this pull request to the merge queue Apr 12, 2024

Merged via the queue into main with commit cc795da Apr 12, 2024
28 of 29 checks passed

StrikeW deleted the siyuan/cdc-commit-offset branch April 12, 2024 12:03

StrikeW added the need-cherry-pick-release-1.8 label Apr 15, 2024

github-actions bot mentioned this pull request Apr 15, 2024

cherrypick fix(cdc): commit offset to upstream after checkpoint has commit (#16058) to branch release-1.8 #16297

Closed

StrikeW added a commit that referenced this pull request Apr 15, 2024

cherry pick #16058

fec6037

StrikeW mentioned this pull request Apr 30, 2024

fix(cdc): don't wait epoch commit for cdc connectors except postgres #16551

Merged

9 tasks

StrikeW mentioned this pull request May 10, 2024

Postgres-cdc: the size of WAL for a replication slot accumulated for a long time #16697

Closed
