feat: config change to scale in/out #3284

KeXiangWang · 2022-06-17T00:10:23Z

Introduce config change to support scaling in/out.

To implement config change, we introduce a new pair of barriers: Pause and Resume.
The Pause barrier marks the end of the previous configuration.
The Resume barrier marks the start of the new configuration.

The config change process consists of three steps:

1. Pause the entire stream from source executor
2. Do all the configuration changes, including creating new batch & streaming actors
3. Resume the stream
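
The three steps above can be sketched roughly as follows. This is only a toy model, not the actual implementation: `ToySource`, `BarrierKind`, and `change_config` are hypothetical names introduced for illustration, and the sketch assumes a source simply refuses to emit rows between the Pause and Resume barriers.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List


class BarrierKind(Enum):
    PAUSE = auto()   # marks the end of the previous configuration
    RESUME = auto()  # marks the start of the new configuration


@dataclass
class ToySource:
    """Hypothetical stand-in for a source executor."""
    paused: bool = False
    emitted: List[str] = field(default_factory=list)

    def on_barrier(self, kind: BarrierKind) -> None:
        # Forward the barrier downstream, then switch state.
        self.emitted.append(kind.name)
        self.paused = kind is BarrierKind.PAUSE

    def emit(self, row: str) -> bool:
        # While paused, no data rows enter the stream.
        if self.paused:
            return False
        self.emitted.append(row)
        return True


def change_config(sources: List[ToySource], apply_changes: Callable[[], None]) -> None:
    # Step 1: pause the entire stream from the source executors.
    for source in sources:
        source.on_barrier(BarrierKind.PAUSE)
    # Step 2: apply all configuration changes while no data is flowing.
    apply_changes()
    # Step 3: resume the stream under the new configuration.
    for source in sources:
        source.on_barrier(BarrierKind.RESUME)
```

The point of the sketch is the ordering: every row seen downstream falls either before Pause (old configuration) or after Resume (new configuration), never in between.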

Steps to scale in/out for each fragment:

1. Build new channels
2. Build new actors
3. Update channel of existing actors
4. Recover data from old actors (scale out) or fetch data from outdated actors (scale in)
5. Delete outdated actors
6. Delete outdated channels
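
As a rough illustration of the per-fragment ordering, here is a small planner that turns the old and new actor sets into the six steps listed above. Everything in it is an assumption made for the sketch: the function name `plan_fragment_rescale`, the step labels, and the simplification that every upstream actor has one channel to every actor in the fragment.

```python
from typing import List, Set, Tuple

Channel = Tuple[str, str]  # (upstream actor, downstream actor)


def plan_fragment_rescale(
    upstreams: List[str], old: Set[str], new: Set[str]
) -> List[Tuple[str, list]]:
    """Return the ordered steps to move one fragment from `old` to `new` actors."""
    added = sorted(new - old)     # actors to build (scale out)
    removed = sorted(old - new)   # outdated actors (scale in)
    kept = sorted(old & new)      # actors that survive the change

    new_channels: List[Channel] = [(u, a) for u in upstreams for a in added]
    dead_channels: List[Channel] = [(u, a) for u in upstreams for a in removed]

    transfers = []
    if added:
        # Scale out: new actors recover data from the old ones.
        transfers.append(("recover", kept, added))
    if removed:
        # Scale in: surviving actors fetch data from the outdated ones.
        transfers.append(("fetch", removed, kept))

    return [
        ("build_channels", new_channels),
        ("build_actors", added),
        # Upstream actors must re-point their outputs if anything changed.
        ("update_channels", list(upstreams) if (added or removed) else []),
        ("transfer_state", transfers),
        ("delete_actors", removed),
        ("delete_channels", dead_channels),
    ]
```

For example, `plan_fragment_rescale(["u1"], {"a1", "a2"}, {"a1", "a2", "a3"})` builds channel `("u1", "a3")` and actor `a3` first, and leaves both delete steps empty.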

List of tasks:

feat: Add pause barrier and resume barrier support #3292
Add support to scale out
Add support to scale in
Add Scale Service and relevant scale interfaces

skyzh · 2022-06-21T15:59:53Z

> Pause the entire stream from source executor

Do we consider scaling the sources in this step?

> Update channel of existing actors

I believe this will need to be split into two steps: pre-mutate and mutate.

1. In the pre-mutate step, a barrier will be sent from the sources, and through the actors, that exist before the change. It will carry the info "what the graph structure should be in the next epoch", and prepare merge executors to wait for barriers from the new actors.
2. In the mutate step, the actual connections will be mutated on the graph. The barrier will be sent to both old and new actors.

This deserves a new design doc, and I wonder who will take over this task.
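
To make the two-step idea concrete, here is a toy merge executor that aligns barriers across its upstream actors. It is a sketch under my own assumptions, not the real executor: `Barrier`, `ToyMerge`, and the `next_upstreams` / `mutate` fields are hypothetical, and data messages are left out entirely.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set


@dataclass(frozen=True)
class Barrier:
    epoch: int
    # Set only on the pre-mutate barrier: the upstream set for the next epoch.
    next_upstreams: Optional[FrozenSet[str]] = None
    # True only on the mutate barrier.
    mutate: bool = False


class ToyMerge:
    def __init__(self, upstreams: Set[str]) -> None:
        self.upstreams = set(upstreams)  # actors we currently consume from
        self.expected = set(upstreams)   # actors we wait on for the next barrier
        self.pending: Optional[Set[str]] = None
        self._seen: Set[str] = set()

    def on_barrier(self, actor: str, barrier: Barrier) -> bool:
        """Record a barrier from `actor`; return True once it is fully aligned."""
        if actor not in self.expected:
            raise ValueError(f"unexpected barrier from {actor}")
        self._seen.add(actor)
        if self._seen != self.expected:
            return False
        self._seen = set()
        if barrier.next_upstreams is not None:
            # Pre-mutate: keep the old inputs, but also wait for the new actors.
            self.pending = set(barrier.next_upstreams)
            self.expected = self.upstreams | self.pending
        elif barrier.mutate and self.pending is not None:
            # Mutate: the barrier came from old and new actors; switch over.
            self.upstreams = self.pending
            self.expected = set(self.pending)
            self.pending = None
        return True
```

With `ToyMerge({"a1"})` and a pre-mutate barrier carrying `{"a1", "a2"}`, the mutate barrier is only aligned after it arrives from both `a1` and `a2`, which is the "wait for barrier on new actors" behavior described above.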

> Recover data from old actors (scale out) or fetch data from outdated actors (scale in)

Luckily, this step is unnecessary thanks to our shared storage design.

> Delete outdated actors
>
> Delete outdated channels

This also needs a design; it is not well supported for now. Maybe we will need an actor manager that manages all actors running in the background, and aborts them (or waits for them to exit gracefully) when we need to stop them. Luckily, dropping actors is currently handled by the stop mutation; you might find it useful.
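
A minimal sketch of what such an actor manager could look like, written with Python's `asyncio` purely for illustration. `ToyActorManager`, its `spawn` / `stop` methods, and the stop-event convention are all assumptions of this sketch, not an existing interface.

```python
import asyncio
from typing import Any, Callable, Coroutine, Dict, Tuple

# An actor body receives a stop event it is expected to watch.
ActorBody = Callable[[asyncio.Event], Coroutine[Any, Any, None]]


class ToyActorManager:
    """Tracks background actors and stops them gracefully or by force."""

    def __init__(self) -> None:
        self._actors: Dict[int, Tuple[asyncio.Task, asyncio.Event]] = {}

    def spawn(self, actor_id: int, body: ActorBody) -> None:
        stop = asyncio.Event()
        self._actors[actor_id] = (asyncio.create_task(body(stop)), stop)

    async def stop(self, actor_id: int, grace: float = 1.0) -> str:
        task, stop = self._actors.pop(actor_id)
        stop.set()  # ask the actor to exit on its own
        done, _pending = await asyncio.wait({task}, timeout=grace)
        if done:
            return "exited"
        task.cancel()  # grace period is over: abort the actor
        await asyncio.gather(task, return_exceptions=True)
        return "aborted"


async def polite(stop: asyncio.Event) -> None:
    await stop.wait()          # exits as soon as it is asked to


async def stubborn(stop: asyncio.Event) -> None:
    await asyncio.sleep(3600)  # ignores the stop request


async def demo() -> Tuple[str, str]:
    manager = ToyActorManager()
    manager.spawn(1, polite)
    manager.spawn(2, stubborn)
    return await manager.stop(1, grace=0.5), await manager.stop(2, grace=0.05)
```

Running `asyncio.run(demo())` stops the first actor gracefully and aborts the second after its grace period expires.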

skyzh · 2022-06-21T16:01:14Z

> Pause the entire stream from source executor

We also need to wait for the checkpoint; this is worth mentioning. The concurrent checkpoint PR is a really large change: there will be multiple barriers flowing in the system at once. I believe you'll need to take this into account.
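
One way to picture the problem is a tracker of in-flight barriers, where the configuration change may only start once every barrier up to and including the pause barrier has been collected. This is only a sketch of the idea; `InFlightBarriers` and its method names are hypothetical, and it assumes epochs are plain increasing integers.

```python
from typing import Set


class InFlightBarriers:
    """Tracks barriers that were injected but not yet fully collected."""

    def __init__(self) -> None:
        self._pending: Set[int] = set()

    def inject(self, epoch: int) -> None:
        self._pending.add(epoch)

    def collect(self, epoch: int) -> None:
        # Called once the barrier of `epoch` has passed through all actors.
        self._pending.discard(epoch)

    def safe_to_reconfigure(self, pause_epoch: int) -> bool:
        # Safe only when no barrier at or before the pause epoch is in flight.
        return all(epoch > pause_epoch for epoch in self._pending)
```

With epochs 1, 2, and 3 in flight and 3 being the pause barrier, collecting only epoch 1 is not enough; the change has to wait until 2 and 3 are collected as well.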

KeXiangWang · 2022-06-22T03:36:14Z

Thanks for your suggestion! I'm designing this part in collaboration with Ting Sun. We will put together a detailed doc later.

> > Pause the entire stream from source executor
>
> We also need to wait for the checkpoint; this is worth mentioning. The concurrent checkpoint PR is a really large change: there will be multiple barriers flowing in the system at once. I believe you'll need to take this into account.

BugenZhao · 2022-11-18T05:49:56Z

Duplicate of #3750.

KeXiangWang added the type/feature label Jun 17, 2022

KeXiangWang assigned KeXiangWang and Sunt-ing Jun 17, 2022

BugenZhao self-assigned this Jun 17, 2022

KeXiangWang assigned shanicky Jun 27, 2022

This was referenced Jul 4, 2022

feat(metrics): add sampled (de)serialization duration metrics for RPC #3618

Merged

feat(metrics): add backpressure metrics #3636

Merged

BugenZhao closed this as not planned Nov 18, 2022
