feat: enable shuffle for snapshot backfill #18063
Merged
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
In #17735, we did not enable shuffled backfill for snapshot backfill, because when the backfill test ran in CI with shuffled backfill enabled, the stream got stuck and never finished.
After investigating the logs, we found that a deadlock was occurring. When the downstream creating job consumes the upstream log store, it waits for the upstream epoch to be committed. However, in the current code, this wait back-pressures the upstream and blocks it from handling data, so the upstream epoch cannot finish and be committed until the upstream is unblocked. The deadlock did not occur without shuffled backfill because, in that case, the dispatchers between the upstream mv executors and the downstream backfill executor are all local exchanges, which have large buffers and are less likely to be back-pressured. With shuffled backfill enabled, however, the remote exchanges have smaller buffers, so we are likely to hit back-pressure and then deadlock.
In this PR, we resolve the deadlock by concurrently polling the upstream while waiting for the upstream epoch to be committed, and we enable shuffled backfill for snapshot backfill.
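To illustrate the idea, here is a minimal sketch using only the Rust standard library. It is not the code in this PR: the actual executors are async, and the function name, the bounded channel standing in for a small remote-exchange buffer, and the `committed` flag standing in for the upstream epoch commit are all assumptions made for illustration. The downstream keeps draining the channel while it waits, so the upstream is never blocked on a full buffer and can reach the commit.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{sync_channel, RecvTimeoutError};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the downstream wait. Returns the upstream
/// messages that were buffered while waiting for the epoch to be committed.
fn wait_for_commit_while_polling(buffer_size: usize, message_count: usize) -> Vec<usize> {
    // A bounded channel models an exchange with a small buffer.
    let (tx, rx) = sync_channel::<usize>(buffer_size);
    // Set by the upstream once its epoch is "committed".
    let committed = Arc::new(AtomicBool::new(false));

    let upstream = {
        let committed = Arc::clone(&committed);
        thread::spawn(move || {
            for i in 0..message_count {
                // Blocks while the buffer is full (back-pressure).
                tx.send(i).expect("downstream hung up");
            }
            // The epoch can only be committed after all data is dispatched.
            committed.store(true, Ordering::Release);
        })
    };

    // Waiting on `committed` alone would deadlock as soon as the upstream
    // fills the buffer. Instead, keep polling the upstream while waiting.
    let mut pending = Vec::new();
    loop {
        match rx.recv_timeout(Duration::from_millis(1)) {
            Ok(msg) => pending.push(msg),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {}
        }
        if committed.load(Ordering::Acquire) {
            break;
        }
    }

    // The upstream has finished sending; drain whatever is still buffered.
    while let Ok(msg) = rx.try_recv() {
        pending.push(msg);
    }
    upstream.join().expect("upstream thread panicked");
    pending
}
```

Calling `wait_for_commit_while_polling(2, 100)` completes even though the buffer holds only two messages. If the polling loop were replaced with a plain wait on `committed`, the upstream would block on its third `send` and the wait would never return.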
Checklist
./risedev check (or alias, ./risedev c)
Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.