Discussion: ensure scale can complete in time #15490

Open
hzxa21 opened this issue Mar 6, 2024 · 1 comment

hzxa21 commented Mar 6, 2024

When barrier latency is high due to insufficient parallelism, users may want to scale out their streaming jobs to accelerate computation, provided enough resources are available.

The current implementation of scaling has the following properties:

  1. Irrelevant actors won't be dropped and rebuilt.
  2. It relies on barriers (Pause, ConfigChange, Resume) to complete the scaling process.

This results in a dilemma:

  • When barrier latency is very high (e.g. 2 hours), the user has to wait a very long time for scaling to complete, which is counter-intuitive.
  • The user can manually trigger a recovery (e.g. by restarting a node) to accelerate scaling, because recovery clears in-flight barriers while actors are rebuilt. But if that is the faster path, the existing scaling implementation via Pause/ConfigChange/Resume barriers is overkill. On top of that, recovery invalidates the entire operator cache.
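
To make the trade-off concrete, here is a rough toy model of the two paths. It is only a sketch under stated assumptions, not how the meta service actually accounts for time: it assumes the three control barriers queue behind every in-flight barrier, and the names `ScaleOutcome`, `scale_via_barriers`, and `scale_via_recovery`, as well as all numbers, are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScaleOutcome:
    wait_seconds: float       # time until scaling takes effect
    cache_invalidated: bool   # whether operator caches are lost


def scale_via_barriers(in_flight_barrier_seconds: List[float],
                       control_barrier_seconds: float = 1.0) -> ScaleOutcome:
    """Barrier-based scaling (toy assumption): Pause, ConfigChange and Resume
    are only processed after every in-flight barrier has drained."""
    wait = sum(in_flight_barrier_seconds) + 3 * control_barrier_seconds
    return ScaleOutcome(wait_seconds=wait, cache_invalidated=False)


def scale_via_recovery(rebuild_seconds: float) -> ScaleOutcome:
    """Manual recovery (toy assumption): in-flight barriers are cleared, so only
    the actor rebuild time counts, but all operator caches are dropped."""
    return ScaleOutcome(wait_seconds=rebuild_seconds, cache_invalidated=True)


if __name__ == "__main__":
    # Hypothetical numbers: two in-flight barriers that each need an hour to drain.
    print(scale_via_barriers([3600.0, 3600.0]))
    print(scale_via_recovery(60.0))
```

In this model the barrier path keeps caches warm but its wait grows with the backlog of in-flight barriers, while the recovery path has a roughly constant wait and always pays the cache cost.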

This makes me think that it is a flaw in the current scaling mechanism and we should improve it. Some ideas after discussion with @wenym1:

  • Find the first aligned barrier in the source and transform it into a Pause barrier to trigger scaling immediately.
  • Make scaling not rely on barriers (@wenym1 can comment more on the details).
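
One possible reading of the first idea, as a toy sketch only: instead of appending a new Pause barrier behind the backlog, pick the earliest in-flight barrier that is already aligned at the source and rewrite it into a Pause. The `Barrier` type, its `aligned_at_source` flag, and `promote_first_aligned_barrier` are hypothetical names for illustration and do not correspond to existing RisingWave code.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Barrier:
    epoch: int
    kind: str                 # "Checkpoint" or "Pause" in this toy model
    aligned_at_source: bool   # hypothetical flag: barrier is aligned in the source


def promote_first_aligned_barrier(
    in_flight: List[Barrier],
) -> Tuple[List[Barrier], Optional[int]]:
    """Return a copy of the in-flight list where the first source-aligned barrier
    has become a Pause barrier, plus that barrier's epoch (None if none qualifies)."""
    for i, barrier in enumerate(in_flight):
        if barrier.aligned_at_source:
            promoted = replace(barrier, kind="Pause")
            return in_flight[:i] + [promoted] + in_flight[i + 1:], promoted.epoch
    # Nothing is aligned yet: leave the barriers untouched.
    return list(in_flight), None


if __name__ == "__main__":
    barriers = [
        Barrier(epoch=1, kind="Checkpoint", aligned_at_source=False),
        Barrier(epoch=2, kind="Checkpoint", aligned_at_source=True),
        Barrier(epoch=3, kind="Checkpoint", aligned_at_source=True),
    ]
    updated, pause_epoch = promote_first_aligned_barrier(barriers)
    print(pause_epoch, [b.kind for b in updated])
```

The sketch leaves open how downstream actors that have already seen earlier barriers would react to the rewritten one, which is part of what needs discussion.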

BugenZhao commented

  • Find the first aligned barrier in the source and transform it into a Pause barrier to trigger scaling immediately.
  • Make scaling not rely on barriers (@wenym1 can comment more on the details).

I believe #13396 can be addressed by adopting a very similar idea if we find it feasible. BTW, is it possible now to share more on the details?
