feat: config change to scale in/out #3284

Closed
1 of 4 tasks
KeXiangWang opened this issue Jun 17, 2022 · 4 comments

@KeXiangWang
Contributor

KeXiangWang commented Jun 17, 2022

Introduce config change to support scaling in/out.

To implement config change, we introduce a new pair of barriers: Pause and Resume.
The Pause barrier marks the end of the previous configuration.
The Resume barrier marks the start of the new configuration.

The config change process contains three steps:

  1. Pause the entire stream from source executor
  2. Apply all configuration changes, including creating new batch and streaming actors
  3. Resume the stream
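
The three steps above can be sketched roughly as follows. This is only an illustration, not the actual implementation: `Mutation`, `Barrier`, `Stream`, and `change_config` are hypothetical names, and the whole stream is collapsed into a single struct. The one point it tries to show is that reconfiguration is only accepted between a Pause barrier and a Resume barrier.

```rust
/// Hypothetical mutation carried by a barrier (names are illustrative only).
#[derive(Debug, Clone, PartialEq)]
enum Mutation {
    Pause,
    Resume,
}

/// Hypothetical barrier: an epoch plus an optional mutation.
#[derive(Debug, Clone, PartialEq)]
struct Barrier {
    epoch: u64,
    mutation: Option<Mutation>,
}

/// Stand-in for the whole streaming graph, reduced to the state we care about.
#[derive(Debug, Default)]
struct Stream {
    paused: bool,
    config_version: u32,
    log: Vec<String>,
}

impl Stream {
    /// Apply the mutation carried by a barrier and record it.
    fn handle(&mut self, barrier: &Barrier) {
        match barrier.mutation {
            Some(Mutation::Pause) => self.paused = true,
            Some(Mutation::Resume) => self.paused = false,
            None => {}
        }
        self.log
            .push(format!("epoch {}: {:?}", barrier.epoch, barrier.mutation));
    }

    /// Configuration changes are rejected unless the stream is paused.
    fn reconfigure(&mut self) -> Result<(), String> {
        if !self.paused {
            return Err("stream must be paused before reconfiguring".to_string());
        }
        self.config_version += 1;
        Ok(())
    }
}

/// Step 1: pause. Step 2: change the configuration. Step 3: resume.
fn change_config(stream: &mut Stream, epoch: u64) -> Result<(), String> {
    stream.handle(&Barrier { epoch, mutation: Some(Mutation::Pause) });
    stream.reconfigure()?;
    stream.handle(&Barrier { epoch: epoch + 1, mutation: Some(Mutation::Resume) });
    Ok(())
}
```

Calling `reconfigure` on a running stream returns an error, while `change_config` wraps the same call in the Pause/Resume pair.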

Steps to scale in/out for each fragment

  • Build new channels
  • Build new actors
  • Update channel of existing actors
  • Recover data from old actors (scale out) or fetch data from outdated actors (scale in)
  • Delete outdated actors
  • Delete outdated channels
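
As a rough sketch of the ordering above, the per-fragment steps could be derived from the old and new actor sets. Everything here is an assumption made for illustration: the `Step` enum, `plan_fragment_scaling`, and the simplification that a channel is identified by the id of the actor it belongs to.

```rust
use std::collections::BTreeSet;

/// Hypothetical per-fragment steps, in the order listed above.
#[derive(Debug, PartialEq)]
enum Step {
    BuildChannels(Vec<u32>),
    BuildActors(Vec<u32>),
    UpdateChannels(Vec<u32>),
    /// State goes into new actors (scale out) or out of outdated ones (scale in).
    TransferState { into: Vec<u32>, out_of: Vec<u32> },
    DeleteActors(Vec<u32>),
    DeleteChannels(Vec<u32>),
}

/// Diff the old and new actor sets of one fragment and emit the ordered steps.
/// Assumption: a channel is keyed by the id of the actor it belongs to.
fn plan_fragment_scaling(old: &BTreeSet<u32>, new: &BTreeSet<u32>) -> Vec<Step> {
    let added: Vec<u32> = new.difference(old).copied().collect();
    let removed: Vec<u32> = old.difference(new).copied().collect();
    let kept: Vec<u32> = old.intersection(new).copied().collect();

    vec![
        Step::BuildChannels(added.clone()),
        Step::BuildActors(added.clone()),
        Step::UpdateChannels(kept),
        Step::TransferState { into: added, out_of: removed.clone() },
        Step::DeleteActors(removed.clone()),
        Step::DeleteChannels(removed),
    ]
}
```

The plan always builds new channels and actors before touching existing ones, and deletes outdated actors before their channels, which matches the order of the list.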

List of tasks:

@skyzh
Contributor

skyzh commented Jun 21, 2022

Pause the entire stream from source executor

Do we consider scaling the sources themselves in this step?

Update channel of existing actors

I believe this will need to be split into two steps: pre-mutate and mutate.

  • In the pre-mutate step, a barrier is sent from the sources through the actors that exist before the change. It carries the info "what the graph structure should be in the next epoch", and prepares merge executors to wait for barriers from the new actors.
  • In the mutate step, the actual connections are mutated on the graph. The barrier is sent to both old and new actors.
  • This deserves a new design doc, and I wonder who will take over this task.
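
To make the two-step idea concrete, here is a minimal sketch of how a merge executor might track it. This is one interpretation under stated assumptions, not an existing API: `Phase`, `MergeExecutor`, `on_barrier`, and `expected` are hypothetical names, upstream actors are plain `u32` ids, and I assume that between pre-mutate and mutate the executor waits on the union of old and new upstreams.

```rust
use std::collections::BTreeSet;

/// Hypothetical payload of the two barriers discussed above.
#[derive(Debug, Clone, PartialEq)]
enum Phase {
    /// Announces the upstream set that should take effect in the next epoch.
    PreMutate { next_upstreams: BTreeSet<u32> },
    /// Switches the connections to the previously announced set.
    Mutate,
}

/// Hypothetical merge executor state: current upstreams plus a pending set.
#[derive(Debug)]
struct MergeExecutor {
    upstreams: BTreeSet<u32>,
    pending: Option<BTreeSet<u32>>,
}

impl MergeExecutor {
    fn new(upstreams: BTreeSet<u32>) -> Self {
        MergeExecutor { upstreams, pending: None }
    }

    /// Record the announced graph on pre-mutate, apply it on mutate.
    fn on_barrier(&mut self, phase: Phase) -> Result<(), String> {
        match phase {
            Phase::PreMutate { next_upstreams } => {
                self.pending = Some(next_upstreams);
                Ok(())
            }
            Phase::Mutate => match self.pending.take() {
                Some(next) => {
                    self.upstreams = next;
                    Ok(())
                }
                None => Err("mutate barrier arrived without pre-mutate".to_string()),
            },
        }
    }

    /// Upstreams whose barrier must arrive before the next barrier is aligned.
    /// Assumption: while a change is pending, both old and new actors count.
    fn expected(&self) -> BTreeSet<u32> {
        match &self.pending {
            Some(next) => self.upstreams.union(next).copied().collect(),
            None => self.upstreams.clone(),
        }
    }
}
```

With upstreams `{1, 2}` and an announced set `{2, 3}`, the executor would wait on `{1, 2, 3}` for the mutate barrier and on `{2, 3}` afterwards.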

Recover data from old actors (scale out) or fetch data from outdated actors (scale in)

Luckily, this step is unnecessary thanks to our shared storage design.

Delete outdated actors

Delete outdated channels

This also needs a design, as it is not well supported for now. We may need an actor manager that tracks all actors running in the background and aborts them (or waits for them to exit gracefully) when they need to be stopped. Luckily, dropping actors is currently handled by the stop mutation, which you might find useful.
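
A minimal sketch of such an actor manager, under heavy assumptions: `ActorManager` and `ActorHandle` are hypothetical names, actors are modeled as plain OS threads from the standard library rather than async tasks, and only the graceful path (signal, then wait for exit) is shown, since a std thread cannot be aborted from outside.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical handle: a stop flag shared with the actor, plus its join handle.
struct ActorHandle {
    stop: Arc<AtomicBool>,
    join: JoinHandle<()>,
}

/// Hypothetical manager that tracks every background actor by id.
#[derive(Default)]
struct ActorManager {
    actors: HashMap<u32, ActorHandle>,
}

impl ActorManager {
    /// Start an actor; returns false if the id is already running.
    fn spawn(&mut self, actor_id: u32) -> bool {
        if self.actors.contains_key(&actor_id) {
            return false;
        }
        let stop = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&stop);
        let join = thread::spawn(move || {
            // Stand-in for the actor's processing loop.
            while !flag.load(Ordering::Acquire) {
                thread::sleep(Duration::from_millis(1));
            }
        });
        self.actors.insert(actor_id, ActorHandle { stop, join });
        true
    }

    /// Ask an actor to stop and wait until it has exited.
    /// Returns false if the id is unknown or the actor panicked.
    fn stop(&mut self, actor_id: u32) -> bool {
        match self.actors.remove(&actor_id) {
            Some(handle) => {
                handle.stop.store(true, Ordering::Release);
                handle.join.join().is_ok()
            }
            None => false,
        }
    }

    /// Number of actors currently tracked.
    fn running(&self) -> usize {
        self.actors.len()
    }
}
```

Keeping the handles in one map is what makes "delete outdated actors" a lookup plus a join instead of something each executor has to coordinate on its own.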

@skyzh
Contributor

skyzh commented Jun 21, 2022

Pause the entire stream from source executor

We also need to wait for the checkpoint, which is worth mentioning. The concurrent checkpoint PR is a really large change: there will be multiple barriers flowing through the system at the same time. I believe you'll need to take this into account.
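
One way to picture the concern, as a sketch only: with several barriers in flight, the configuration change should not start until every epoch up to and including the pause epoch has been checkpointed. `CheckpointTracker` and its methods are hypothetical names, and epochs are plain `u64` values.

```rust
use std::collections::BTreeSet;

/// Hypothetical tracker for barriers that were injected but not yet checkpointed.
#[derive(Debug, Default)]
struct CheckpointTracker {
    in_flight: BTreeSet<u64>,
}

impl CheckpointTracker {
    /// A barrier for `epoch` has been sent into the stream.
    fn inject(&mut self, epoch: u64) {
        self.in_flight.insert(epoch);
    }

    /// The checkpoint for `epoch` finished; returns false if it was not in flight.
    fn complete(&mut self, epoch: u64) -> bool {
        self.in_flight.remove(&epoch)
    }

    /// Safe to reconfigure only when no epoch <= `pause_epoch` is still in flight.
    fn can_reconfigure(&self, pause_epoch: u64) -> bool {
        self.in_flight.range(..=pause_epoch).next().is_none()
    }
}
```

Checkpoints may complete out of order here, so the check looks at every in-flight epoch up to the pause epoch rather than only the latest one.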

@KeXiangWang
Contributor Author

Thanks for your suggestions! I'm designing this part in collaboration with Ting Sun. We will write up a detailed design doc later.

Pause the entire stream from source executor

We also need to wait for the checkpoint, which is worth mentioning. The concurrent checkpoint PR is a really large change: there will be multiple barriers flowing through the system at the same time. I believe you'll need to take this into account.

@BugenZhao
Member

Duplicate of #3750.

@BugenZhao closed this as not planned on Nov 18, 2022