Discussion: ensure scale can complete in time #15490

hzxa21 · 2024-03-06T10:05:06Z

When barrier latency is high due to insufficient parallelism, users may want to scale their streaming jobs to accelerate computation, provided resources are sufficient.

The current implementation of scaling has the following properties:

Irrelevant actors won't be dropped and rebuilt.
It relies on barriers (Pause, ConfigChange, Resume) to complete the scaling process.

This results in a dilemma:

When barrier latency is very high (e.g. 2 hours), the user has to wait a very long time for the scale to complete, which is counter-intuitive.
The user can manually trigger a recovery (e.g. by restarting a node) to accelerate scaling, because recovery clears in-flight barriers while actors are rebuilt. But if that is the faster path, the existing scaling implementation via Pause/ConfigChange/Resume barriers is overkill. Not to mention that recovery causes full operator cache invalidation.

This makes me think that it is a flaw in the current scaling mechanism and we should improve it. Some ideas after discussion with @wenym1:

Find the first aligned barrier in the source and transform it into a Pause barrier to trigger scaling immediately.
Make scaling not rely on barriers (@wenym1 can comment more on the details).
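
To make the first idea concrete, here is a toy sketch of one possible reading of it. Nothing here is RisingWave code: `Barrier`, `left_source`, `schedule_pause_at_tail`, and `promote_first_barrier_at_source` are hypothetical names, and modeling pending barriers as a simple queue is an assumption made only for illustration. It compares appending a new Pause barrier behind everything already queued with rewriting the earliest barrier that has not yet left the source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Kind(Enum):
    CHECKPOINT = "checkpoint"
    PAUSE = "pause"


@dataclass
class Barrier:
    epoch: int
    kind: Kind = Kind.CHECKPOINT
    # Hypothetical flag: True once the sources have already emitted this
    # barrier downstream, so it can no longer be rewritten.
    left_source: bool = False


def schedule_pause_at_tail(queue: List[Barrier], next_epoch: int) -> int:
    """Assumed model of the current behavior: Pause is a brand-new barrier
    queued behind every barrier that is already pending.

    Returns how many barriers sit ahead of the Pause barrier."""
    queue.append(Barrier(next_epoch, Kind.PAUSE))
    return len(queue) - 1


def promote_first_barrier_at_source(queue: List[Barrier]) -> Optional[int]:
    """One reading of the first idea: turn the earliest barrier that is
    still at the source into a Pause barrier.

    Returns how many barriers sit ahead of it, or None if every pending
    barrier has already left the source."""
    for ahead, barrier in enumerate(queue):
        if not barrier.left_source:
            barrier.kind = Kind.PAUSE
            return ahead
    return None
```

In this model, with one barrier already in flight and two still at the source, appending leaves three barriers ahead of Pause, while promoting leaves only one. Whether the real barrier manager can rewrite a barrier in this way is exactly the open question of this discussion.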

BugenZhao · 2024-06-25T05:14:44Z

Find the first aligned barrier in the source and transform it into a Pause barrier to trigger scaling immediately.

Make scaling not rely on barriers (@wenym1 can comment more on the details).

I believe #13396 can be addressed by adopting a very similar idea if we find it feasible. BTW, is it possible now to share more of the details?

hzxa21 added the type/feature label Mar 6, 2024

github-actions bot added this to the release-1.8 milestone Mar 6, 2024

BugenZhao mentioned this issue Mar 26, 2024

Discussion: Decouple cancel/drop mview from barrier #13396

Open

shanicky self-assigned this Apr 8, 2024

shanicky modified the milestones: release-1.8, release-1.10 May 8, 2024

BugenZhao mentioned this issue Jun 25, 2024

Discussion(meta): while configuring, only wait for the barriers related to the specific parts of the graph to be collected #17422

Open

shanicky modified the milestones: release-1.10, release-1.11 Jul 10, 2024

wenym1 mentioned this issue Jul 16, 2024

refactor: actor wait barrier manager inject barrier #17613

Merged

9 tasks

shanicky modified the milestones: release-2.0, future-release-2.1 Aug 19, 2024

shanicky modified the milestones: release-2.1, future-release-2.2 Oct 8, 2024

shanicky modified the milestones: release-2.2, release-2.3 Dec 27, 2024
