You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When barrier latency is high due to insufficient parallelism, user may want to scale their streaming jobs accordingly when resources are sufficient to accelerate computation.
The current implementation of scaling has the following properties:
Irrelevant actors won't be dropped and rebuilt.
It relies on barriers (Pause, ConfigChange, Resume) to complete the scaling process.
This results in a dilemma:
When barrier latency is super high (e.g. 2hrs), user needs to wait for a very long time until scale completes, which is counter-intuitive.
Though user can manually trigger a recovery (e.g. by restart a node) to accelerate scaling because it will clear in-flight barriers during actor rebuilt, this means the existing scaling implementation via Pause/ConfigChange/Resume barrier is an overkill. Not to mention that this will cause full operator cache invalidation.
This makes me think that it is a flaw in the current scaling mechanism and we should improve it. Some ideas after discussion with @wenym1:
Find the first aligned barrier in source and transform it into Pause barrier to trigger scaling immediately.
Make scaling to not relying on barrier (@wenym1 can comment more one the details).
The text was updated successfully, but these errors were encountered:
When barrier latency is high due to insufficient parallelism, user may want to scale their streaming jobs accordingly when resources are sufficient to accelerate computation.
The current implementation of scaling has the following properties:
This results in a dilemma:
This makes me think that it is a flaw in the current scaling mechanism and we should improve it. Some ideas after discussion with @wenym1:
The text was updated successfully, but these errors were encountered: