feat: compute node unregisters from meta for graceful shutdown #17662
Task list completed / task-list-completed
Started 2024-07-29 07:00:06
3 / 4 tasks completed
1 task still to be completed
Details
Required Tasks
Task | Status |
---|---|
The compute node will first unregister from the meta service, so that following batch queries and streaming jobs won't be scheduled here. | Incomplete |
Then, it sends a Shutdown message on the barrier control stream, triggering a recovery on the new set of compute nodes. | Incomplete |
After that, the compute node waits for the connection to be reset. | Incomplete |
Finally, it exits the entrypoint function and then the process gracefully. | Incomplete |
I have written necessary rustdoc comments | Completed |
I have added necessary unit tests and integration tests | Completed |
All checks passed in ./risedev check (or alias, ./risedev c) | Completed |
My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users) | Incomplete |
#17802 | Incomplete |
#17662 👈 | Incomplete |
#17633 | Incomplete |
#17586 | Incomplete |
#17608 | Incomplete |
main | Incomplete |
We always clear the executor cache when scaling in, so there might be no big difference in streaming performance. | Incomplete |
Recovery does not affect batch availability. | Incomplete |
Scaling online can be less responsive (depending on the number of in-flight barriers), which may not fit within the default killing timeout of 30s in Kubernetes. | Incomplete |
Explicitly send an Err(Shutdown) through the stream from the compute node. We need to be able to recognize this error and differentiate it from other errors in the meta node, for (slightly) different handling behavior, e.g., do not ignore this error even if it's not associated with an in-flight barrier. | Incomplete |
Close the channel and let meta service acknowledge it by receiving a None. I'm just concerned whether this is reliable enough, i.e., if there are any other unexpected scenarios where the stream is disconnected without an error. | Incomplete |
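
The four shutdown steps and the first notification option (an explicit Err(Shutdown) on the stream) can be sketched roughly as follows. This is only an illustration built on std channels, not the actual RisingWave code: `StreamError`, `ControlMessage`, `MetaAction`, `graceful_shutdown`, `handle_event` and `simulate_shutdown` are hypothetical names, the second channel merely stands in for "the connection gets reset", and ignoring other errors when no barrier is in flight is an assumption inferred from the description above.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical error type carried on the barrier control stream.
#[derive(Debug, PartialEq)]
enum StreamError {
    /// The compute node is shutting down on purpose.
    Shutdown,
    /// Any other failure reported by the compute node.
    Other(String),
}

/// Hypothetical stream item: a completed barrier epoch or an error.
type ControlMessage = Result<u64, StreamError>;

/// What the meta side decides after observing one stream event.
#[derive(Debug, PartialEq)]
enum MetaAction {
    Continue,
    Ignore,
    Recover { graceful: bool },
}

/// Compute-node side: the four steps from the task list.
fn graceful_shutdown(
    unregister_from_meta: impl FnOnce(),
    control_tx: Sender<ControlMessage>,
    connection_reset: Receiver<()>,
) {
    // Step 1: unregister so that no new work is scheduled on this node.
    unregister_from_meta();
    // Step 2: announce the shutdown. A send error means the meta side is
    // already gone, so there is nobody left to notify.
    let _ = control_tx.send(Err(StreamError::Shutdown));
    // Step 3: wait until the peer drops its end of the connection.
    // `recv` returns Err once every sender has been dropped.
    while connection_reset.recv().is_ok() {}
    // Step 4: returning lets the caller leave the entrypoint and exit.
}

/// Meta side: classify one event (`None` means the stream was closed).
fn handle_event(event: Option<ControlMessage>, has_inflight_barrier: bool) -> MetaAction {
    match event {
        Some(Ok(_epoch)) => MetaAction::Continue,
        // Never ignored, even without an in-flight barrier.
        Some(Err(StreamError::Shutdown)) => MetaAction::Recover { graceful: true },
        Some(Err(StreamError::Other(_))) if has_inflight_barrier => {
            MetaAction::Recover { graceful: false }
        }
        // Assumption: other errors without an in-flight barrier are ignored.
        Some(Err(StreamError::Other(_))) => MetaAction::Ignore,
        // A bare disconnect cannot be told apart from a crash in this sketch.
        None => MetaAction::Recover { graceful: false },
    }
}

/// Single-threaded walk-through: the "meta" side resets the connection
/// up front so that step 3 does not block.
fn simulate_shutdown() -> MetaAction {
    let (control_tx, control_rx) = channel::<ControlMessage>();
    let (reset_tx, reset_rx) = channel::<()>();
    drop(reset_tx);
    graceful_shutdown(|| { /* unregister RPC would go here */ }, control_tx, reset_rx);
    handle_event(control_rx.recv().ok(), false)
}
```

In this sketch, `simulate_shutdown()` yields a graceful recovery because the explicit error arrives before the channel closes. Under the second option (closing the channel only), the meta side would see just `None`, which is exactly the case the concern above is about: it looks the same as any other unexpected disconnect.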