Describe the bug
RisingWave may crash due to a temporary network failure and recover later.
In cloud deployments, Kubernetes is responsible for health checks (e.g. in RisingWave's internal test cloud setup, the health check interval is 10s). If the health check fails several times in a row, the failing pod is restarted.
However, with a large file cache, RisingWave's recovery can be slow, and the compute node may be killed by the k8s health check over and over again.
For example, in one run the compute nodes kept restarting and could not recover after a failure, yet there were no unexpected logs between launches.
The k8s event log shows that the compute node was killed by the health check:
LAST SEEN TYPE REASON OBJECT MESSAGE
53m (x3 over 113m) Normal Created Pod/benchmark-risingwave-compute-c-0 Created container compute
53m (x3 over 113m) Normal Started Pod/benchmark-risingwave-compute-c-0 Started container compute
29m (x9 over 54m) Normal Pulled Pod/benchmark-risingwave-compute-c-0 Container image "ghcr.io/risingwavelabs/risingwave:git-52483b367f65b9f4d8224567fe15f4d2a9407232" already present on machine
4m50s (x174 over 54m) Warning Unhealthy Pod/benchmark-risingwave-compute-c-0 Startup probe failed: dial tcp 10.0.63.110:5688: connect: connection refused
53m Normal Killing Pod/benchmark-risingwave-compute-c-0 Container compute failed startup probe, will be restarted
9m45s (x125 over 47m) Warning BackOff Pod/benchmark-risingwave-compute-c-0 Back-off restarting failed container compute in pod benchmark-risingwave-compute-c-0_benchmark-yao-manual(0cf79b54-eed1-4185-a5a6-3db8e9fde3b8)
0s (x161 over 47m) Warning BackOff Pod/benchmark-risingwave-compute-c-0 Back-off restarting failed container compute in pod benchmark-risingwave-compute-c-0_benchmark-yao-manual(0cf79b54-eed1-4185-a5a6-3db8e9fde3b8)
0s (x215 over 59m) Warning Unhealthy Pod/benchmark-risingwave-compute-c-0 Startup probe failed: dial tcp 10.0.63.110:5688: connect: connection refused
0s (x195 over 57m) Warning BackOff Pod/benchmark-risingwave-compute-c-0 Back-off restarting failed container compute in pod benchmark-risingwave-compute-c-0_benchmark-yao-manual(0cf79b54-eed1-4185-a5a6-3db8e9fde3b8)
The health check failed to connect to the compute node's serving port.
However, the compute node's log shows that it was serving normally for at least a short period.
To verify that the port could actually be connected to, I checked it manually:
But because of the slow file cache recovery, the compute node missed several health checks and was killed only seconds after it finally came online.
Solution:
Lazy-load the file cache.
The file cache should be usable as soon as it is opened, with its recovery scheduled in the background. While recovery is in progress, any operation on the file cache behaves like an operation on a 0-capacity file cache. Once recovery completes, the cache serves normally.
Lazy loading is useful to all file cache users, so it will be implemented in the file cache engine (foyer), which also keeps RisingWave simple.
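Below is a minimal sketch of the idea in Rust. It assumes a hypothetical LazyFileCache type, not foyer's actual API: open returns immediately, recovery runs on a background thread, and until it finishes every get/insert behaves as if the cache had zero capacity.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical lazy-loading file cache, for illustration only
/// (this is not foyer's actual API).
struct LazyFileCache {
    /// Flipped to true once background recovery has finished.
    recovered: Arc<AtomicBool>,
    // A real implementation would also hold the cache index, device handles, etc.
}

impl LazyFileCache {
    /// `open` returns immediately; rebuilding the on-disk cache index is
    /// scheduled in the background so the serving port can come up right away.
    fn open() -> Self {
        let recovered = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&recovered);
        thread::spawn(move || {
            // Scan the cache files and rebuild the in-memory index here.
            // This can take a long time for a large cache.
            flag.store(true, Ordering::Release);
        });
        Self { recovered }
    }

    /// Before recovery finishes, every lookup misses, as if capacity were 0.
    fn get(&self, _key: u64) -> Option<Vec<u8>> {
        if !self.recovered.load(Ordering::Acquire) {
            return None;
        }
        // Normal lookup path once recovered.
        None
    }

    /// Before recovery finishes, inserts are silently dropped.
    fn insert(&self, _key: u64, _value: Vec<u8>) {
        if !self.recovered.load(Ordering::Acquire) {
            return;
        }
        // Normal insert path once recovered.
    }
}

fn main() {
    let cache = LazyFileCache::open();
    // During recovery this is just a cache miss instead of a blocked startup.
    assert!(cache.get(42).is_none());
    cache.insert(42, vec![0u8; 16]);
}
```

With this approach, the startup probe sees an open serving port as soon as the process starts, and the only cost during recovery is extra cache misses.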
Error message/log
No response
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
No response
Additional context
No response