
fix(scale): move reschedule_lock to ScaleController & use universal scale_controller #15037

Merged
Merged 3 commits into main from peng/inner-reschedule-lock on Feb 7, 2024

Conversation

@shanicky (Contributor) commented Feb 6, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Closes #15018.

We need mutual exclusion between recovery and the auto scaling loop. Previously, the lock lived inside the stream manager, which made it inaccessible to the recovery path. By moving it into the scale controller, recovery can now acquire the lock as well.

Additionally, we mistakenly constructed separate scale controllers in the stream manager and in recovery, so each held its own lock and the mutual exclusion was ineffective. This pull request fixes that as well. Roughly, the ownership structure after this change looks like the sketch below.
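The following is an illustrative sketch only (assuming tokio's RwLock); `StreamManagerSketch` and `RecoverySketch` are hypothetical stand-ins for the stream manager and the recovery path, not the actual RisingWave types:

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

/// After this change the reschedule lock lives inside the ScaleController.
struct ScaleController {
    reschedule_lock: RwLock<()>,
}

/// The stream manager keeps only a shared handle to the controller.
struct StreamManagerSketch {
    scale_controller: Arc<ScaleController>,
}

/// Recovery reuses the same handle instead of building its own controller
/// (and therefore its own, ineffective, private lock).
struct RecoverySketch {
    scale_controller: Arc<ScaleController>,
}

fn main() {
    let ctl = Arc::new(ScaleController { reschedule_lock: RwLock::new(()) });
    let stream_mgr = StreamManagerSketch { scale_controller: ctl.clone() };
    let recovery = RecoverySketch { scale_controller: ctl };
    // Both components now contend on one and the same lock.
    assert!(Arc::ptr_eq(&stream_mgr.scale_controller, &recovery.scale_controller));
}
```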

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

@github-actions github-actions bot added the type/fix Bug fix label Feb 6, 2024
@shanicky shanicky force-pushed the peng/inner-reschedule-lock branch from d104b17 to e72c48f on February 6, 2024 13:25
@yezizp2012 (Member) left a comment


LGTM, could you please share some details about why that problem occurred?

@kwannoel (Contributor) commented Feb 7, 2024

Just asking some questions to better understand the problem and solution.

We need mutual exclusion between recovery and the auto scaling loop. Previously, the lock lived inside the stream manager, which made it inaccessible to the recovery path. By moving it into the scale controller, recovery can now acquire the lock as well.

During recovery, the stream manager is not online; is that why the lock can't be acquired during recovery?
Why is the scale controller available during recovery, but not the stream manager?

During recovery, why do we need to acquire the reschedule lock? Is it to prevent any scaling from happening?

I suppose the scale controller controls the scaling process, so conceptually it also makes sense for it to hold the RwLock.

Additionally, we mistakenly constructed separate scale controllers in the stream manager and in recovery, so each held its own lock and the mutual exclusion was ineffective. This pull request fixes that as well.

Could you elaborate on how it led to the panic in #15018?

@shanicky shanicky changed the title fix(scale): move reschedule_lock to ScaleController fix(scale): move reschedule_lock to ScaleController & use universal scale_controller Feb 7, 2024
@shanicky (Contributor, Author) commented Feb 7, 2024

Just asking some questions to better understand the problem and solution.

We need mutual exclusion between recovery and the auto scaling loop. Previously, the lock lived inside the stream manager, which made it inaccessible to the recovery path. By moving it into the scale controller, recovery can now acquire the lock as well.

During recovery, the stream manager is not online; is that why the lock can't be acquired during recovery? Why is the scale controller available during recovery, but not the stream manager?

During recovery, why do we need to acquire the reschedule lock? Is it to prevent any scaling from happening?

I suppose the scale controller controls the scaling process, so conceptually it also makes sense for it to hold the RwLock.

Additionally, we mistakenly constructed separate scale controllers in the stream manager and in recovery, so each held its own lock and the mutual exclusion was ineffective. This pull request fixes that as well.

Could you elaborate on how it led to the panic in #15018?

First off, our barrier control and stream control are currently managed by two distinct managers with a lot of disconnected logic. This will be unified in the future.

During startup, the recovery loop and the auto scale loop are started at the same time. For a newly added node, both will generate a plan simultaneously, which is clearly incorrect. Previously, because the recovery loop couldn't acquire the lock, there was no way to make the two mutually exclusive; we relied instead on the barrier generated by the auto scale loop failing to prevent issues. However, in rare circumstances, if the recovery loop is fast enough, the auto scale loop can produce an outdated plan that is still successfully applied, which leads to problems.

Therefore, we moved the lock into the scale controller itself, so the recovery loop and the auto scale loop now contend for the same lock and are mutually exclusive. Moreover, previously each component constructed its own scale controller, and therefore its own lock, which defeated the purpose of mutual exclusion. Conceptually, the interaction looks like the sketch below.
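This is an illustrative, self-contained sketch only (assuming tokio; the loop bodies are placeholders, not the real recovery or scaling code), showing how the two loops block on the same write lock once they share one controller:

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

struct ScaleController {
    reschedule_lock: RwLock<()>,
}

async fn recovery_loop(ctl: Arc<ScaleController>) {
    // Recovery holds the write guard, so no reschedule plan can be
    // generated or applied while actors are being rebuilt.
    let _guard = ctl.reschedule_lock.write().await;
    // ... recreate actors, inject the recovery barrier ...
}

async fn auto_scale_loop(ctl: Arc<ScaleController>) {
    // The auto scale loop contends on the same lock, so a plan built from a
    // pre-recovery view of the cluster can no longer be applied mid-recovery.
    let _guard = ctl.reschedule_lock.write().await;
    // ... generate and apply a reschedule plan for the newly added worker ...
}

#[tokio::main]
async fn main() {
    let ctl = Arc::new(ScaleController {
        reschedule_lock: RwLock::new(()),
    });
    // Both tasks receive a clone of the same Arc, hence the same lock.
    let recovery = tokio::spawn(recovery_loop(ctl.clone()));
    let scaling = tokio::spawn(auto_scale_loop(ctl));
    let _ = tokio::join!(recovery, scaling);
}
```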

@shanicky shanicky enabled auto-merge February 7, 2024 09:03
@shanicky shanicky force-pushed the peng/inner-reschedule-lock branch from e047313 to cfe8845 on February 7, 2024 09:47
@shanicky shanicky added this pull request to the merge queue Feb 7, 2024
Merged via the queue into main with commit 85f0023 Feb 7, 2024
27 checks passed
@shanicky shanicky deleted the peng/inner-reschedule-lock branch February 7, 2024 10:22
shanicky added commits that referenced this pull request Feb 7, 2024
Labels: type/fix Bug fix
Projects: None yet
3 participants