
fix(scale): move reschedule_lock to ScaleController & use universal scale_controller #15037

Merged
Merged 3 commits into main from peng/inner-reschedule-lock on Feb 7, 2024

Conversation

@shanicky (Contributor) commented Feb 6, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Closes #15018.

We need mutual exclusion between recovery and the auto scaling loop. Previously, the lock lived inside the stream manager, which made it inaccessible to the recovery path. By moving it into the scale controller, recovery can now acquire the lock as well.

Additionally, we mistakenly constructed separate scale controllers in the stream manager and in recovery, so each held its own lock and the mutual exclusion was ineffective. This pull request fixes that as well. Roughly, the ownership structure after this change looks like the sketch below.
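The following is an illustrative sketch only (assuming tokio's RwLock); `StreamManagerSketch` and `RecoverySketch` are hypothetical stand-ins for the stream manager and the recovery path, not the actual RisingWave types:

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

/// After this change the reschedule lock lives inside the ScaleController.
struct ScaleController {
    reschedule_lock: RwLock<()>,
}

/// The stream manager keeps only a shared handle to the controller.
struct StreamManagerSketch {
    scale_controller: Arc<ScaleController>,
}

/// Recovery reuses the same handle instead of building its own controller
/// (and therefore its own, ineffective, private lock).
struct RecoverySketch {
    scale_controller: Arc<ScaleController>,
}

fn main() {
    let ctl = Arc::new(ScaleController { reschedule_lock: RwLock::new(()) });
    let stream_mgr = StreamManagerSketch { scale_controller: ctl.clone() };
    let recovery = RecoverySketch { scale_controller: ctl };
    // Both components now contend on one and the same lock.
    assert!(Arc::ptr_eq(&stream_mgr.scale_controller, &recovery.scale_controller));
}
```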

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

@github-actions github-actions bot added the type/fix Bug fix label Feb 6, 2024
@shanicky shanicky force-pushed the peng/inner-reschedule-lock branch from d104b17 to e72c48f on February 6, 2024 13:25
@yezizp2012 (Member) left a comment


LGTM, could you please share some details about why that problem occurred?

@kwannoel (Contributor) commented Feb 7, 2024

Just asking some questions to better understand the problem and solution.

We need mutual exclusion between recovery and the auto scaling loop. Previously, the lock lived inside the stream manager, which made it inaccessible to the recovery path. By moving it into the scale controller, recovery can now acquire the lock as well.

During recovery, the stream manager is not online; is that why the lock can't be acquired during recovery?
Why is the scale controller available during recovery, but not the stream manager?

During recovery, why do we need to acquire the reschedule lock? Is it to prevent any scaling from happening?

I suppose the scale controller controls the scaling process, so conceptually it also makes sense for it to hold the RwLock.

Additionally, we mistakenly constructed separate scale controllers in the stream manager and in recovery, so each held its own lock and the mutual exclusion was ineffective. This pull request fixes that as well.

Could you elaborate on how it led to the panic in #15018?

@shanicky shanicky changed the title fix(scale): move reschedule_lock to ScaleController fix(scale): move reschedule_lock to ScaleController & use universal scale_controller Feb 7, 2024
@shanicky (Contributor, Author) commented Feb 7, 2024

Just asking some questions to better understand the problem and solution.

We need mutual exclusion between recovery and the auto scaling loop. Previously, the lock lived inside the stream manager, which made it inaccessible to the recovery path. By moving it into the scale controller, recovery can now acquire the lock as well.

During recovery, the stream manager is not online; is that why the lock can't be acquired during recovery? Why is the scale controller available during recovery, but not the stream manager?

During recovery, why do we need to acquire the reschedule lock? Is it to prevent any scaling from happening?

I suppose the scale controller controls the scaling process, so conceptually it also makes sense for it to hold the RwLock.

Additionally, we mistakenly constructed separate scale controllers in the stream manager and in recovery, so each held its own lock and the mutual exclusion was ineffective. This pull request fixes that as well.

Could you elaborate on how it led to the panic in #15018?

First off, our barrier control and stream control are currently managed by two distinct managers with a lot of disconnected logic. This will be unified in the future.

During startup, the recovery loop and the auto scale loop are started at the same time. For a newly added node, both will generate a plan simultaneously, which is clearly incorrect. Previously, because the recovery loop couldn't acquire the lock, there was no way to make the two mutually exclusive; we relied instead on the barrier generated by the auto scale loop failing to prevent issues. However, in rare circumstances, if the recovery loop is fast enough, the auto scale loop can produce an outdated plan that is still successfully applied, which leads to problems.

Therefore, we moved the lock into the scale controller itself, so the recovery loop and the auto scale loop now contend for the same lock and are mutually exclusive. Moreover, previously each component constructed its own scale controller, and therefore its own lock, which defeated the purpose of mutual exclusion. Conceptually, the interaction looks like the sketch below.
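This is an illustrative, self-contained sketch only (assuming tokio; the loop bodies are placeholders, not the real recovery or scaling code), showing how the two loops block on the same write lock once they share one controller:

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

struct ScaleController {
    reschedule_lock: RwLock<()>,
}

async fn recovery_loop(ctl: Arc<ScaleController>) {
    // Recovery holds the write guard, so no reschedule plan can be
    // generated or applied while actors are being rebuilt.
    let _guard = ctl.reschedule_lock.write().await;
    // ... recreate actors, inject the recovery barrier ...
}

async fn auto_scale_loop(ctl: Arc<ScaleController>) {
    // The auto scale loop contends on the same lock, so a plan built from a
    // pre-recovery view of the cluster can no longer be applied mid-recovery.
    let _guard = ctl.reschedule_lock.write().await;
    // ... generate and apply a reschedule plan for the newly added worker ...
}

#[tokio::main]
async fn main() {
    let ctl = Arc::new(ScaleController {
        reschedule_lock: RwLock::new(()),
    });
    // Both tasks receive a clone of the same Arc, hence the same lock.
    let recovery = tokio::spawn(recovery_loop(ctl.clone()));
    let scaling = tokio::spawn(auto_scale_loop(ctl));
    let _ = tokio::join!(recovery, scaling);
}
```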

@shanicky shanicky enabled auto-merge February 7, 2024 09:03
@shanicky shanicky force-pushed the peng/inner-reschedule-lock branch from e047313 to cfe8845 on February 7, 2024 09:47
@shanicky shanicky added this pull request to the merge queue Feb 7, 2024
Merged via the queue into main with commit 85f0023 Feb 7, 2024
27 checks passed
@shanicky shanicky deleted the peng/inner-reschedule-lock branch February 7, 2024 10:22
shanicky added commits that referenced this pull request Feb 7, 2024
Labels: type/fix Bug fix
Projects: None yet
3 participants