feat(stream): wait committed epoch in state table init_epoch #19223

wenym1 · 2024-11-01T05:57:27Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

The init_epoch part of #18312.

It is required that after init_epoch returns, we can always read the consistent data on prev_epoch, and otherwise, we will read inconsistent data and pollute the downstream. init_epoch is called during recovery and configuration. Previously, in recovery, this is ensured by CN waiting the latest hummock version, and in configuration change, this is ensured by pausing the barrier injection and do a CN-wide try_wait_epoch call in the WaitEpochCommit streaming service rpc to wait for the committed epoch bumping up globally..

In this PR, in the init_epoch of StateTable, we will change to wait for the committed_epoch of the state table to bump up to the prev_epoch. With this PR, during recovery there is no need for CN to wait for the latest hummock version, and for configuration change without update vnode bitmap (replace table, sink into table, not including scale), there is no need to do the pause-wait-resume barrier injection.

Deadlock can easily happen. Bumping up committed_epoch depends on collecting barriers from all actors, but if init_epoch blocks the handling of initial barrier, when init_epoch waits for bumping up committed_epoch, it blocks barrier handling, and causes committed_epoch never bump up, and cause deadlock.

To avoid deadlock, we should strictly follow the order when handling the initial barrier. The order should be, receive first barrier, yield first barrier, and then StateTable::init_epoch. In this PR, we just ensure that all usages of StateTable::init_epoch will be included in the changed code so that we can carefully review the code. The StateTable::init_epoch method will become async, and therefore all direct usages on it will be included in the changed code. Some non-async utils method that calls StateTable::init_epoch is required to be async as well, and therefore will be included in the changed code as well. There are two utils methods, both named init_epoch, that were already async, and call StateTable::init_epoch. To include the usages of these two methods in the changed code, the two methods are both renamed to init_epoch_after_yield_barrier, so that the usage can also be included in the changed code. Moreover, if the handling order is incorrect, deadlock must happen if the problematic code is reached. Hence, if we can pass all the CI tests, we can be confident that the correctness is fulfilled.

Checklist

I have written necessary rustdoc comments
I have added necessary unit tests and integration tests
I have added test labels as necessary. See details.
I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
All checks passed in ./risedev check (or alias, ./risedev c)
My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)

My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

…init-epoch-await

wenym1 · 2024-11-08T09:34:54Z

Hints for reviewers:

In storage side, the major changes are as followed.

In the init of LocalHummockStorage, we change to wait for the prev_epoch to be the committed_epoch no matter whether it is is_replicated or not. Previously we only do this when is_replicated is true.
In clear_shared_buffer, we no longer pass the version_id to wait for. In event handler, since we don't need to wait for bumping the version id, the handling logic becomes non-async. As a result, in the InitRequest of streaming control bidi-stream, we won't pass the version_id.
Some unit tests are updated to adapt to the change.

In streaming side, the major changes are as followed.

All stateful executor should strictly follow an order in initialization: receive first barrier -> save some barrier information -> yield barrier -> call StateTable::init_epoch.
In this PR, we ensure that all usages should be included in the changed code. Changes in most executors are trivial, simply reordering some codes. now.rs and temporal_join.rs may require extra attention, because the change is relatively not that trivial.
The init_epoch of StateTable is changed from non-async to async, and therefore all direct usages on the methods will need to add .await, and therefore included in the changed code.
For any non-async methods that previously call StateTable::init_epoch, they become async, and therefore all usages on these methods will need to add .await, and therefore included in the changed code.
For async methods that previously call StateTable::init_epoch, they will renamed to init_after_yield_barrier, and included in the changed code.

hzxa21

LGTM on the storage change.

hzxa21 · 2024-11-11T04:20:09Z

src/storage/src/hummock/event_handler/hummock_event_handler.rs

-            let mut latest_version = if let Some(CacheRefillerEvent {
-                pinned_version,
-                new_pinned_version,
-            }) = self.refiller.clear()


Just a note: after this PR, the refiller buffer won't get cleared on handle_clear. This won't affect correctness but may slightly delay future version update because a new version update needs to wait for previous refiller events to complete before it can be effective.

How come we no longer need to clear refiller?

The refiller was previously clear for the logic of waiting for receiving the latest hummock version. The event handler should wait for receiving the latest hummock in meta node, so that the newly spawned actors can see the latest hummock version after recovery. However, after this PR, the spawned actors will explicitly wait for the committed_epoch of the state table to bump up to the prev_epoch in init_epoch, so that we don't need to wait for the latest hummock version globally during recovery.

BugenZhao

The streaming part LGTM

src/stream/src/common/table/state_table.rs

src/stream/src/common/log_store_impl/in_mem.rs

stdrc

Although in this PR we seem correctly changed all occurrences, it looks still dangerous to me that some other developer can easily break the order.

stdrc · 2024-11-12T06:32:53Z

src/stream/src/executor/approx_percentile/global.rs

        let first_barrier = expect_first_barrier(&mut input_stream).await?;
-        let mut state = self.state;
-        state.init(first_barrier.epoch).await?;
+        let first_epoch = first_barrier.epoch;
        yield Message::Barrier(first_barrier);
+        let mut state = self.state;
+        state.init(first_epoch).await?;


Is it possible to have a declmacro to enforce the order of yielding and using the first barrier? Like

handle_first_barrier!(&mut input_stream, |barrier| { self.state.init(barrier.epoch).await?; });

which internally calls expect_first_barrier + yield Barrier + the provided init block.

It's difficult, because yield barrier in the custom macro will fail to compile in the try_stream proc macro.

Although in this PR we seem correctly changed all occurrences, it looks still dangerous to me that some other developer can easily break the order.

I think it's fine, because if the order is not followed, deadlock must happen. As long as the newly developed features are covered in CI test, which we are supposed to do, we can be confident that the order is correctly handled, and otherwise something must be wrong.

OK, then maybe we can add some notes on this in the doc of init_epoch.

wenym1 added 3 commits November 1, 2024 13:51

feat(stream): wait committed epoch in state table init_epoch

572b017

Merge branch 'main' into yiming/init-epoch-await

b281378

fix

e2f66a3

github-actions bot added type/feature ci/run-e2e-single-node-tests ci/run-e2e-test-other-backends labels Nov 1, 2024

fix ut

4f36536

wenym1 added the ci/main-cron/run-all label Nov 1, 2024

wenym1 and others added 6 commits November 1, 2024 17:19

Merge branch 'main' into yiming/init-epoch-await

42ccfa4

Merge branch 'main' into yiming/init-epoch-await

05b231a

fix bench

27df405

fix

ac3825c

test

1e9d0ab

Merge remote-tracking branch 'origin/kwannoel/mysql-fix' into yiming/…

18a7f90

…init-epoch-await

wenym1 force-pushed the yiming/init-epoch-await branch from 7fd5a68 to 18a7f90 Compare November 4, 2024 10:04

wenym1 added 2 commits November 4, 2024 18:05

remove comma

d9894c8

Merge branch 'kwannoel/mysql-fix' into yiming/init-epoch-await

307233f

graphite-app bot requested a review from a team November 4, 2024 10:48

Merge branch 'main' into yiming/init-epoch-await

76a8bda

wenym1 requested review from stdrc, hzxa21, chenzl25, BugenZhao and st1page November 5, 2024 06:47

wenym1 requested review from Li0k, kwannoel and zwang28 November 8, 2024 09:35

hzxa21 approved these changes Nov 11, 2024

View reviewed changes

Merge branch 'main' into yiming/init-epoch-await

5f31714

BugenZhao approved these changes Nov 12, 2024

View reviewed changes

src/stream/src/common/table/state_table.rs Outdated Show resolved Hide resolved

src/stream/src/common/log_store_impl/in_mem.rs Outdated Show resolved Hide resolved

wenym1 added 2 commits November 12, 2024 13:40

Merge branch 'main' into yiming/init-epoch-await

3499d39

address comment

d4cb8b5

wenym1 enabled auto-merge November 12, 2024 06:14

stdrc reviewed Nov 12, 2024

View reviewed changes

wenym1 disabled auto-merge November 12, 2024 06:41

wenym1 removed the ci/main-cron/run-all label Nov 12, 2024

add comment

9d98b9c

wenym1 added this pull request to the merge queue Nov 12, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 12, 2024

wenym1 added this pull request to the merge queue Nov 12, 2024

Merged via the queue into main with commit fe65509 Nov 12, 2024
32 of 34 checks passed

wenym1 deleted the yiming/init-epoch-await branch November 12, 2024 08:35

hzxa21 mentioned this pull request Dec 17, 2024

Some tables doesn't sync for PG CDC v2.1.0-rt.8 #19681

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(stream): wait committed epoch in state table init_epoch #19223

feat(stream): wait committed epoch in state table init_epoch #19223

wenym1 commented Nov 1, 2024 •

edited

Loading

wenym1 commented Nov 8, 2024

hzxa21 left a comment

hzxa21 Nov 11, 2024

kwannoel Nov 11, 2024

wenym1 Nov 11, 2024

BugenZhao left a comment

stdrc left a comment

stdrc Nov 12, 2024 •

edited

Loading

wenym1 Nov 12, 2024

stdrc Nov 12, 2024

feat(stream): wait committed epoch in state table init_epoch #19223

feat(stream): wait committed epoch in state table init_epoch #19223

Conversation

wenym1 commented Nov 1, 2024 • edited Loading

What's changed and what's your intention?

Checklist

Documentation

Release note

wenym1 commented Nov 8, 2024

hzxa21 left a comment

Choose a reason for hiding this comment

hzxa21 Nov 11, 2024

Choose a reason for hiding this comment

kwannoel Nov 11, 2024

Choose a reason for hiding this comment

wenym1 Nov 11, 2024

Choose a reason for hiding this comment

BugenZhao left a comment

Choose a reason for hiding this comment

stdrc left a comment

Choose a reason for hiding this comment

stdrc Nov 12, 2024 • edited Loading

Choose a reason for hiding this comment

wenym1 Nov 12, 2024

Choose a reason for hiding this comment

stdrc Nov 12, 2024

Choose a reason for hiding this comment

wenym1 commented Nov 1, 2024 •

edited

Loading

stdrc Nov 12, 2024 •

edited

Loading