
feat: Introduce scale-in in recovery. #13270

Merged · 10 commits from peng/try-gen-scale-in into main · Nov 8, 2023
Conversation

@shanicky (Contributor) commented Nov 6, 2023:

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR attempts to replace the original migration scheme by generating an offline scale-in plan during recovery. This is a preliminary version and will be optimized later.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

@shanicky force-pushed the peng/try-gen-scale-in branch 3 times, most recently from 651346d to 52c82c0 on November 6, 2023 10:21
@yezizp2012 (Member) left a comment:

Rest LGTM.

src/meta/src/barrier/recovery.rs — 3 review threads (outdated, resolved)
let mut cluster = Cluster::start(config.clone()).await?;
let mut session = cluster.start_session();

session.run("CREATE TABLE t1 (v1 int);").await?;
Reviewer (Member):

Can we create one more complicated materialized view here, so that we can test the no-shuffle dispatcher as well?

@shanicky (Contributor, Author):

Added a slightly more complex MV, which will also cover single-fragment migration.
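
For context, a hypothetical extension of the test above might look like the sketch below. The SQL and the MV names are illustrative, not the PR's actual test; as far as I understand, an MV defined directly over another MV is a common way to obtain a no-shuffle dispatcher, but whether a given plan produces one depends on the optimizer.

// Illustrative only -- not the PR's actual test.
// mv1 adds an aggregation on top of t1; mv2, defined directly over mv1,
// commonly results in a no-shuffle dispatcher between fragments.
session
    .run("CREATE MATERIALIZED VIEW mv1 AS SELECT v1, count(*) AS cnt FROM t1 GROUP BY v1;")
    .await?;
session
    .run("CREATE MATERIALIZED VIEW mv2 AS SELECT * FROM mv1;")
    .await?;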

@shanicky force-pushed the peng/try-gen-scale-in branch from 3074a7a to 5cf5bb6 on November 6, 2023 16:26
@BugenZhao (Member):

The diff is really hard to review as it does not link scale.rs to scale_controller.rs. 🥵 Is there any way to fix this?


codecov bot commented Nov 7, 2023

Codecov Report

Merging #13270 (93048da) into main (7b3f8fc) will decrease coverage by 0.03%.
Report is 2 commits behind head on main.
The diff coverage is 9.94%.

@@            Coverage Diff             @@
##             main   #13270      +/-   ##
==========================================
- Coverage   67.76%   67.73%   -0.03%     
==========================================
  Files        1525     1525              
  Lines      259263   259415     +152     
==========================================
+ Hits       175693   175719      +26     
- Misses      83570    83696     +126     
Flag   Coverage Δ
rust   67.73% <9.94%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

Files                                    Coverage Δ
src/common/src/config.rs                 84.40% <ø> (ø)
src/meta/src/barrier/mod.rs              82.47% <100.00%> (+0.19%) ⬆️
src/meta/src/manager/env.rs              61.97% <100.00%> (+0.19%) ⬆️
src/meta/node/src/lib.rs                  1.32% <0.00%> (-0.01%) ⬇️
src/meta/src/barrier/command.rs          34.29% <33.33%> (+3.06%) ⬆️
src/meta/service/src/scale_service.rs     0.00% <0.00%> (ø)
src/meta/src/stream/stream_manager.rs    65.80% <13.72%> (-3.34%) ⬇️
src/meta/src/barrier/recovery.rs         44.42% <6.25%> (-7.61%) ⬇️
src/meta/src/stream/scale.rs             10.50% <6.28%> (+0.53%) ⬆️

... and 5 files with indirect coverage changes


@shanicky (Contributor, Author) commented Nov 7, 2023:

The diff is really hard to review as it does not link scale.rs to scale_controller.rs. 🥵 Is there any way to fix this?

Maybe we can move the code from scale.rs into StreamManager and delete scale.rs; let me try this.

@yezizp2012 (Member) left a comment:

LGTM, waiting for review comments from @BugenZhao.

@BugenZhao (Member):

generating an offline scale-in scheme during recovery

Can you elaborate more on this? Do you mean we won't get stuck if there aren't enough parallel units, but an auto scale-in will be triggered after a timeout?

impl GlobalStreamManager {

#[derive(Debug, Clone, Copy)]
pub struct RescheduleOptions {
    pub resolve_no_shuffle_upstream: bool,
Reviewer (Member):

When will it be false? Could you please leave some documentation here?

@shanicky (Contributor, Author):

In the previous reschedule implementation, it would throw an error if the provided fragment list had a no-shuffle downstream. After PR #10985, we added this parameter, which automatically resolves to the root of the no-shuffle dependency relationship. During recovery, we need this parameter to prevent errors.

Reviewer (Member):

Cool.
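
A minimal sketch of the kind of doc comment the reviewer asked for, based on the explanation above; the struct and field come from the diff, but the wording is illustrative rather than the PR's actual documentation:

// Illustrative doc comment only; not the PR's actual text.
#[derive(Debug, Clone, Copy)]
pub struct RescheduleOptions {
    /// Whether fragments that sit downstream of a no-shuffle dispatcher should be
    /// resolved up to the root of their no-shuffle dependency chain instead of
    /// returning an error (behavior added in #10985). Recovery sets this to `true`
    /// so the generated scale-in plan does not fail on such fragments.
    pub resolve_no_shuffle_upstream: bool,
}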

@shanicky (Contributor, Author) commented Nov 8, 2023:

generating an offline scale-in scheme during recovery

Can you elaborate more on this? Do you mean we won't get stuck if there aren't enough parallel units, but an auto scale-in will be triggered after a timeout?

Indeed, the previous implementation would trigger migration when a machine had been offline for a certain period of time. If there were no available parallel units in the cluster, it would get stuck in the recovery state. This PR offers an alternative through offline scaling, which scales in the parallel units on the corresponding machine.
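
A minimal sketch of the recovery-time flow described above, assuming hypothetical helper names (`list_expired_workers`, `generate_scale_in_plan`, `apply_reschedule`); this is illustrative pseudocode, not the PR's actual implementation:

impl GlobalStreamManager {
    // Illustrative only: all helper names below are hypothetical.
    async fn scale_in_expired_workers(&self) -> MetaResult<()> {
        // Workers that have been offline longer than the allowed period.
        let expired_workers = self.list_expired_workers().await?;
        if expired_workers.is_empty() {
            return Ok(());
        }
        // Old behavior: wait for replacement parallel units and migrate actors,
        // which could leave recovery stuck if no replacements ever appeared.
        // New behavior: generate an offline reschedule plan that scales in the
        // parallel units on the expired workers and apply it during recovery.
        let plan = self.generate_scale_in_plan(&expired_workers).await?;
        self.apply_reschedule(plan).await?;
        Ok(())
    }
}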

@shanicky requested a review from @BugenZhao on November 8, 2023 07:46
@BugenZhao (Member) left a comment:

Great

@shanicky added this pull request to the merge queue on Nov 8, 2023
The github-merge-queue bot removed this pull request from the merge queue due to failed status checks on Nov 8, 2023
@shanicky added this pull request to the merge queue on Nov 8, 2023
The github-merge-queue bot removed this pull request from the merge queue due to failed status checks on Nov 8, 2023
@shanicky force-pushed the peng/try-gen-scale-in branch from 7edadcc to 93048da on November 8, 2023 14:46
@shanicky enabled auto-merge on November 8, 2023 14:47
@shanicky added this pull request to the merge queue on Nov 8, 2023
Merged via the queue into main with commit 306801b on Nov 8, 2023
7 checks passed
@shanicky deleted the peng/try-gen-scale-in branch on November 8, 2023 15:31
Development

Successfully merging this pull request may close these issues.

Recovery supports generating reschedule plans offline and executing them.
3 participants