Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(stream): wait committed epoch in state table init_epoch #19223

Merged
merged 17 commits into from
Nov 12, 2024

Conversation

wenym1
Copy link
Contributor

@wenym1 wenym1 commented Nov 1, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

The init_epoch part of #18312.

It is required that after init_epoch returns, we can always read the consistent data on prev_epoch, and otherwise, we will read inconsistent data and pollute the downstream. init_epoch is called during recovery and configuration. Previously, in recovery, this is ensured by CN waiting the latest hummock version, and in configuration change, this is ensured by pausing the barrier injection and do a CN-wide try_wait_epoch call in the WaitEpochCommit streaming service rpc to wait for the committed epoch bumping up globally..

In this PR, in the init_epoch of StateTable, we will change to wait for the committed_epoch of the state table to bump up to the prev_epoch. With this PR, during recovery there is no need for CN to wait for the latest hummock version, and for configuration change without update vnode bitmap (replace table, sink into table, not including scale), there is no need to do the pause-wait-resume barrier injection.

Deadlock can easily happen. Bumping up committed_epoch depends on collecting barriers from all actors, but if init_epoch blocks the handling of initial barrier, when init_epoch waits for bumping up committed_epoch, it blocks barrier handling, and causes committed_epoch never bump up, and cause deadlock.

To avoid deadlock, we should strictly follow the order when handling the initial barrier. The order should be, receive first barrier, yield first barrier, and then StateTable::init_epoch. In this PR, we just ensure that all usages of StateTable::init_epoch will be included in the changed code so that we can carefully review the code. The StateTable::init_epoch method will become async, and therefore all direct usages on it will be included in the changed code. Some non-async utils method that calls StateTable::init_epoch is required to be async as well, and therefore will be included in the changed code as well. There are two utils methods, both named init_epoch, that were already async, and call StateTable::init_epoch. To include the usages of these two methods in the changed code, the two methods are both renamed to init_epoch_after_yield_barrier, so that the usage can also be included in the changed code. Moreover, if the handling order is incorrect, deadlock must happen if the problematic code is reached. Hence, if we can pass all the CI tests, we can be confident that the correctness is fulfilled.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@wenym1 wenym1 force-pushed the yiming/init-epoch-await branch from 7fd5a68 to 18a7f90 Compare November 4, 2024 10:04
@graphite-app graphite-app bot requested a review from a team November 4, 2024 10:48
@wenym1
Copy link
Contributor Author

wenym1 commented Nov 8, 2024

Hints for reviewers:

In storage side, the major changes are as followed.

  1. In the init of LocalHummockStorage, we change to wait for the prev_epoch to be the committed_epoch no matter whether it is is_replicated or not. Previously we only do this when is_replicated is true.
  2. In clear_shared_buffer, we no longer pass the version_id to wait for. In event handler, since we don't need to wait for bumping the version id, the handling logic becomes non-async. As a result, in the InitRequest of streaming control bidi-stream, we won't pass the version_id.
  3. Some unit tests are updated to adapt to the change.

In streaming side, the major changes are as followed.

  1. All stateful executor should strictly follow an order in initialization: receive first barrier -> save some barrier information -> yield barrier -> call StateTable::init_epoch.
  2. In this PR, we ensure that all usages should be included in the changed code. Changes in most executors are trivial, simply reordering some codes. now.rs and temporal_join.rs may require extra attention, because the change is relatively not that trivial.
  3. The init_epoch of StateTable is changed from non-async to async, and therefore all direct usages on the methods will need to add .await, and therefore included in the changed code.
  4. For any non-async methods that previously call StateTable::init_epoch, they become async, and therefore all usages on these methods will need to add .await, and therefore included in the changed code.
  5. For async methods that previously call StateTable::init_epoch, they will renamed to init_after_yield_barrier, and included in the changed code.

@wenym1 wenym1 requested review from Li0k, kwannoel and zwang28 November 8, 2024 09:35
Copy link
Collaborator

@hzxa21 hzxa21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on the storage change.

let mut latest_version = if let Some(CacheRefillerEvent {
pinned_version,
new_pinned_version,
}) = self.refiller.clear()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note: after this PR, the refiller buffer won't get cleared on handle_clear. This won't affect correctness but may slightly delay future version update because a new version update needs to wait for previous refiller events to complete before it can be effective.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come we no longer need to clear refiller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refiller was previously clear for the logic of waiting for receiving the latest hummock version. The event handler should wait for receiving the latest hummock in meta node, so that the newly spawned actors can see the latest hummock version after recovery. However, after this PR, the spawned actors will explicitly wait for the committed_epoch of the state table to bump up to the prev_epoch in init_epoch, so that we don't need to wait for the latest hummock version globally during recovery.

Copy link
Member

@BugenZhao BugenZhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The streaming part LGTM

src/stream/src/common/table/state_table.rs Outdated Show resolved Hide resolved
src/stream/src/common/log_store_impl/in_mem.rs Outdated Show resolved Hide resolved
@wenym1 wenym1 enabled auto-merge November 12, 2024 06:14
Copy link
Member

@stdrc stdrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although in this PR we seem correctly changed all occurrences, it looks still dangerous to me that some other developer can easily break the order.

Comment on lines 54 to +58
let first_barrier = expect_first_barrier(&mut input_stream).await?;
let mut state = self.state;
state.init(first_barrier.epoch).await?;
let first_epoch = first_barrier.epoch;
yield Message::Barrier(first_barrier);
let mut state = self.state;
state.init(first_epoch).await?;
Copy link
Member

@stdrc stdrc Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have a declmacro to enforce the order of yielding and using the first barrier? Like

handle_first_barrier!(&mut input_stream, |barrier| {
    self.state.init(barrier.epoch).await?;
});

which internally calls expect_first_barrier + yield Barrier + the provided init block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's difficult, because yield barrier in the custom macro will fail to compile in the try_stream proc macro.

Although in this PR we seem correctly changed all occurrences, it looks still dangerous to me that some other developer can easily break the order.

I think it's fine, because if the order is not followed, deadlock must happen. As long as the newly developed features are covered in CI test, which we are supposed to do, we can be confident that the order is correctly handled, and otherwise something must be wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then maybe we can add some notes on this in the doc of init_epoch.

@wenym1 wenym1 added this pull request to the merge queue Nov 12, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 12, 2024
@wenym1 wenym1 added this pull request to the merge queue Nov 12, 2024
Merged via the queue into main with commit fe65509 Nov 12, 2024
32 of 34 checks passed
@wenym1 wenym1 deleted the yiming/init-epoch-await branch November 12, 2024 08:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants