-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(scale): move reschedule_lock to ScaleController & use universal scale_controller #15037
Conversation
d104b17
to
e72c48f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, could you please share some details about why that problem occurred?
Just asking some questions to better understand the problem and solution.
During recovery, the stream manager is not online, is that why the lock is not acquirable on recovery? During recovery why do we need to acquire the reschedule lock? Is it to prevent any scaling from happening? I suppose the scaling controller controls the scaling process, and so conceptually it also makes sense that it should hold the rw lock.
Could you elaborate on how it led to the panic in #15018? |
First off, our barrier control and stream control are currently managed by two distinct managers with a lot of disconnected logic. This will be unified in the future. During startup, our recovery loop and auto scale loop are initiated at the same time. For a newly added node, this means they will both generate a plan simultaneously, which is clearly incorrect. Previously, because the recovery loop couldn't acquire the lock, the two couldn't interfere with each other. We used to rely on the barrier generated by the auto scale loop failing to prevent issues. However, in some rare circumstances, if the recovery loop is sufficiently fast, it may result in the auto scale loop producing an outdated plan that could still be successfully deployed, leading to problems. Therefore, we have moved the lock into the scale controller itself, so now both the recovery loop and the auto scale loop can possess the lock concurrently. Moreover, previously the individual scale controllers led to separate locks within each controller, which did not serve the purpose of mutual exclusion effectively. |
e047313
to
cfe8845
Compare
…cale_controller (#15037) Signed-off-by: Shanicky Chen <[email protected]>
…cale_controller (#15037) Signed-off-by: Shanicky Chen <[email protected]>
…cale_controller (#15037) (#15050) Signed-off-by: Shanicky Chen <[email protected]>
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Closes #15018 .
We need to implement mutual exclusion between the recovery and the auto scaling loop. Previously, the lock was inside the stream manager, which made it unattainable for recovery. By moving it to the scale controller, we can now acquire the lock.
Additionally, we mistakenly used separate scale controllers in both the stream manager and recovery, which resulted in the lock becoming ineffective. This pull request has also addressed and fixed that issue.
Checklist
./risedev check
(or alias,./risedev c
)