feat: compute node unregisters from meta for graceful shutdown #17662

Merged (11 commits) on Jul 26, 2024

Commit 4dcfb9d: remove no-actor optimization & move actor op handler to async context
Task list completed (task-list-completed), started 2024-07-29 07:00:06

3 / 4 tasks completed

1 task still to be completed

Required Tasks

Task Status
The compute node will first unregister from the meta service, so that subsequent batch queries and streaming jobs won't be scheduled here. [Incomplete]
Then, it sends a Shutdown message on the barrier control stream, triggering a recovery on the new set of compute nodes. [Incomplete]
After that, the compute node waits for the connection to be reset. [Incomplete]
Finally, it exits the entrypoint function and then the process gracefully. (The full sequence is sketched below.) [Incomplete]
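A minimal sketch of this four-step sequence, assuming a hypothetical bidirectional barrier control stream and an `unregister` RPC future; the names (`ControlEvent`, `graceful_shutdown`, `unregister`) are illustrative, not the actual RisingWave APIs:

```rust
use futures::{Sink, SinkExt, Stream, StreamExt};

/// Message sent upstream on the barrier control stream; the `Shutdown`
/// variant is assumed to correspond to the new shutdown signal.
enum ControlEvent {
    Shutdown,
}

async fn graceful_shutdown<S, E>(
    unregister: impl std::future::Future<Output = Result<(), E>>,
    mut barrier_stream: S,
) -> Result<(), E>
where
    S: Sink<ControlEvent, Error = E> + Stream<Item = Result<ControlEvent, E>> + Unpin,
{
    // 1. Unregister from the meta service so that subsequent batch queries
    //    and streaming jobs are no longer scheduled onto this node.
    unregister.await?;

    // 2. Send Shutdown on the barrier control stream, letting the meta node
    //    trigger a recovery on the remaining set of compute nodes.
    barrier_stream.send(ControlEvent::Shutdown).await?;

    // 3. Wait for the meta node to reset the connection, i.e. for the
    //    control stream to terminate.
    while barrier_stream.next().await.is_some() {}

    // 4. Return to the entrypoint, which then exits the process gracefully.
    Ok(())
}
```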
I have written necessary rustdoc comments [Completed]
I have added necessary unit tests and integration tests [Completed]
All checks passed in ./risedev check (or alias, ./risedev c) [Completed]
My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users) [Incomplete]
#17802 Graphite [Incomplete]
#17662 Graphite 👈 [Incomplete]
#17633 Graphite [Incomplete]
#17586 Graphite [Incomplete]
#17608 Graphite [Incomplete]
main [Incomplete]
we always clear the executor cache when scaling in, so there might be no big difference in streaming performance, [Incomplete]
recovery does not affect batch availability, [Incomplete]
online scaling can be less responsive (depending on the number of in-flight barriers), which may not fit within the default killing timeout of 30s in Kubernetes, [Incomplete]
Explicitly send an Err(Shutdown) through the stream from the compute node.
We need to be able to recognize this error and differentiate it from other errors in the meta node, for (slightly) different handling behavior, e.g., not ignoring this error even if it's not associated with an in-flight barrier. [Incomplete]
Close the channel and let the meta service acknowledge it by receiving a None.
I'm just concerned whether this is reliable enough, i.e., whether there are any other unexpected scenarios where the stream gets disconnected without an error. [Incomplete]
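A sketch of how the meta node might tell these two options apart when reading the control stream; the event names and the `is_shutdown` predicate are assumptions for illustration, not the actual meta-node code:

```rust
use futures::{Stream, StreamExt};

/// How the meta node might classify one read from a compute node's barrier
/// control stream (names assumed for illustration).
enum ControlStreamEvent<M, E> {
    /// A normal message associated with barrier handling.
    Message(M),
    /// Option 1: the compute node explicitly sent Err(Shutdown); this must be
    /// handled even if it is not associated with an in-flight barrier.
    GracefulShutdown,
    /// Any other error keeps its existing handling.
    Error(E),
    /// Option 2: the stream ended with None; harder to tell apart from an
    /// unexpected disconnect, which is the reliability concern above.
    Closed,
}

async fn next_event<M, E>(
    stream: &mut (impl Stream<Item = Result<M, E>> + Unpin),
    is_shutdown: impl Fn(&E) -> bool,
) -> ControlStreamEvent<M, E> {
    match stream.next().await {
        Some(Ok(msg)) => ControlStreamEvent::Message(msg),
        Some(Err(e)) if is_shutdown(&e) => ControlStreamEvent::GracefulShutdown,
        Some(Err(e)) => ControlStreamEvent::Error(e),
        None => ControlStreamEvent::Closed,
    }
}
```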