feat: Introduce scale-in in recovery. #13270
Conversation
Force-pushed from 651346d to 52c82c0.
Rest LGTM.
let mut cluster = Cluster::start(config.clone()).await?;
let mut session = cluster.start_session();

session.run("CREATE TABLE t1 (v1 int);").await?;
Can we create a more complicated materialized view here, so that we can test the no-shuffle dispatcher as well?
Added a slightly more complex mv, which will also cover single-fragment migration.
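For illustration, a hedged sketch of what such a test addition might look like, following the `session.run` pattern from the snippet above (the view name and query are assumptions, not the PR's actual code). An aggregation typically introduces a hash-shuffled exchange between fragments, while the table's materialize fragment usually feeds a simple downstream fragment via a no-shuffle dispatcher, so a view like this can exercise both dispatcher kinds:

// Hypothetical test addition; `mv1` and the query are illustrative.
session
    .run("CREATE MATERIALIZED VIEW mv1 AS SELECT v1, count(*) AS cnt FROM t1 GROUP BY v1;")
    .await?;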
Force-pushed from 3074a7a to 5cf5bb6.
The diff is really hard to review as it does not link
Codecov Report
@@            Coverage Diff             @@
##             main   #13270      +/-   ##
==========================================
- Coverage   67.76%   67.73%   -0.03%
==========================================
  Files        1525     1525
  Lines      259263   259415     +152
==========================================
+ Hits       175693   175719      +26
- Misses      83570    83696     +126
... and 5 files with indirect coverage changes
Maybe we can move the code from
LGTM, waiting for review comments from @BugenZhao.
Can you elaborate more on this? Do you mean we won't get stuck if there aren't enough parallel units, but an automatic scale-in will be triggered after a timeout?
impl GlobalStreamManager {

#[derive(Debug, Clone, Copy)]
pub struct RescheduleOptions {
    pub resolve_no_shuffle_upstream: bool,
When will it be `false`? Could you please leave some documentation here?
In the previous reschedule implementation, an error was thrown if the provided fragment list had a no-shuffle downstream. After PR #10985, we added this parameter, which automatically resolves the fragments to the root of the no-shuffle dependency relationship. During recovery, we need this parameter to prevent errors.
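As a hedged illustration of the documentation being requested, the field could be annotated like this (the wording is ours, summarizing the explanation above, not text from the PR):

#[derive(Debug, Clone, Copy)]
pub struct RescheduleOptions {
    /// If a fragment in the provided list has a no-shuffle downstream,
    /// resolve the reschedule up to the root of the no-shuffle dependency
    /// relationship (behavior added in PR #10985) instead of returning an
    /// error. Recovery sets this to `true` so offline scale-in does not
    /// fail on such fragments.
    pub resolve_no_shuffle_upstream: bool,
}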
Cool.
Indeed, the previous implementation would trigger migration when a machine had been offline for a certain period of time. If there were no available parallel units in the cluster, it would get stuck in the recovery state. This PR offers an alternative through offline scaling, which scales in the parallel units on the corresponding machine.
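A minimal, self-contained sketch of that idea (illustrative only; the types and function names below are assumptions, not RisingWave's actual API): drop the offline workers' parallel units from every fragment that runs on them, and leave a fragment alone when scaling in would remove all of its units.

use std::collections::{HashMap, HashSet};

type ParallelUnitId = u32;
type FragmentId = u32;

/// Illustrative scale-in planning: for each fragment, remove the parallel
/// units that live on offline workers. Fragments that would lose *all* of
/// their units are skipped, since they cannot be scaled to zero and would
/// still need migration.
fn plan_offline_scale_in(
    fragment_units: &HashMap<FragmentId, HashSet<ParallelUnitId>>,
    offline_units: &HashSet<ParallelUnitId>,
) -> HashMap<FragmentId, HashSet<ParallelUnitId>> {
    let mut plan = HashMap::new();
    for (fragment, units) in fragment_units {
        let to_remove: HashSet<ParallelUnitId> =
            units.intersection(offline_units).copied().collect();
        if !to_remove.is_empty() && to_remove.len() < units.len() {
            plan.insert(*fragment, to_remove);
        }
    }
    plan
}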
Great
Signed-off-by: Shanicky Chen <[email protected]>
…le_controller` module to separate file.
Signed-off-by: Shanicky Chen <[email protected]>
…rror handling during recovery and scaling in.
Signed-off-by: Shanicky Chen <[email protected]>
…ify `scale` module.
… generate_stable_resize_plan, reschedule_actors methods; modify worker parallelism
… for resolving upstream dependencies.
Signed-off-by: Shanicky Chen <[email protected]>
Force-pushed from 7edadcc to 93048da.
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
This PR attempts to replace the original migration scheme by generating an offline scale-in plan during recovery. This is a preliminary version and will be optimized later.
Checklist
./risedev check (or alias, ./risedev c)