bug: compute node recovery fails when file cache is large on cloud #12656

Closed
MrCroxx opened this issue Oct 7, 2023 · 0 comments · Fixed by #12714
Assignees
Labels
type/bug Something isn't working
Milestone

MrCroxx commented Oct 7, 2023

Describe the bug

RisingWave may crash due to a temporary network failure and is expected to recover afterwards.

In the cloud, k8s is responsible for health checks (e.g., on RisingWave's internal test cloud setup, a health check runs every 10s). If health checks fail several times, the failing pod is re-deployed.

However, with a large file cache, the recovery of RisingWave can be slow, so the compute node may be killed repeatedly by the k8s health check before it finishes recovering.
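
The timing problem can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the RisingWave deployment: it assumes the usual Kubernetes startup probe behavior, where a container is restarted after roughly `period * failure_threshold` of failed probes (plus any initial delay). The function names and the threshold used in the usage note are hypothetical; only the 10s period comes from the description above.

```rust
use std::time::Duration;

/// Rough upper bound on how long a container may take to start listening
/// before the startup probe gives up (assumed model:
/// initial delay + period * failure threshold).
pub fn startup_budget(
    initial_delay: Duration,
    period: Duration,
    failure_threshold: u32,
) -> Duration {
    initial_delay + period * failure_threshold
}

/// Returns true if a recovery of the given length outlasts the probe budget,
/// i.e. the container would be killed before it starts serving.
pub fn killed_before_ready(recovery: Duration, budget: Duration) -> bool {
    recovery > budget
}
```

For instance, with a 10s period and an assumed failure threshold of 3, the budget is 30s, so a file cache recovery that takes a minute can never complete before the container is restarted.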

For example, in the case below, compute nodes keep restarting and cannot recover after a failure, yet there are no unexpected log entries between launches.

image

With the k8s event log, we can see that the compute node is killed by health check:

LAST SEEN               TYPE      REASON      OBJECT                                 MESSAGE
53m (x3 over 113m)      Normal    Created     Pod/benchmark-risingwave-compute-c-0   Created container compute
53m (x3 over 113m)      Normal    Started     Pod/benchmark-risingwave-compute-c-0   Started container compute
29m (x9 over 54m)       Normal    Pulled      Pod/benchmark-risingwave-compute-c-0   Container image "ghcr.io/risingwavelabs/risingwave:git-52483b367f65b9f4d8224567fe15f4d2a9407232" already present on machine
4m50s (x174 over 54m)   Warning   Unhealthy   Pod/benchmark-risingwave-compute-c-0   Startup probe failed: dial tcp 10.0.63.110:5688: connect: connection refused
53m                     Normal    Killing     Pod/benchmark-risingwave-compute-c-0   Container compute failed startup probe, will be restarted
9m45s (x125 over 47m)   Warning   BackOff     Pod/benchmark-risingwave-compute-c-0   Back-off restarting failed container compute in pod benchmark-risingwave-compute-c-0_benchmark-yao-manual(0cf79b54-eed1-4185-a5a6-3db8e9fde3b8)
0s (x161 over 47m)      Warning   BackOff     Pod/benchmark-risingwave-compute-c-0   Back-off restarting failed container compute in pod benchmark-risingwave-compute-c-0_benchmark-yao-manual(0cf79b54-eed1-4185-a5a6-3db8e9fde3b8)
0s (x215 over 59m)      Warning   Unhealthy   Pod/benchmark-risingwave-compute-c-0   Startup probe failed: dial tcp 10.0.63.110:5688: connect: connection refused
0s (x195 over 57m)      Warning   BackOff     Pod/benchmark-risingwave-compute-c-0   Back-off restarting failed container compute in pod benchmark-risingwave-compute-c-0_benchmark-yao-manual(0cf79b54-eed1-4185-a5a6-3db8e9fde3b8)

The health check failed to connect to the serving port of the compute node.

But the log shows that the compute node served normally at least for a short period.

image

To make sure the port was reachable, I checked it manually:

image

But because of the slow file cache recovery, the compute node missed some health checks and was killed seconds after it finally came up.

image

Solution:

Lazily load the file cache.

The file cache should be created as soon as it is opened, with its recovery scheduled in the background. During recovery, any operation on the file cache behaves like an operation on a zero-capacity file cache. Once the file cache is fully recovered, it serves normally.

Lazy loading can be useful for all file cache users, so it will be implemented in the file cache engine (foyer). That will also keep RisingWave simple.
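
The proposed behavior could be sketched roughly as follows. This is an illustration only, using the standard library and an in-memory map in place of a real file cache; `LazyFileCache` and all of its methods are hypothetical names, not foyer's API. `open` returns immediately, recovery runs on a background thread, and until it finishes every lookup misses and every insert is dropped, as with a zero-capacity cache.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical lazily recovered cache (an in-memory stand-in for a file cache).
pub struct LazyFileCache {
    ready: Arc<AtomicBool>,
    entries: Arc<Mutex<HashMap<u64, Vec<u8>>>>,
}

impl LazyFileCache {
    /// Returns immediately; `recover` (the slow part) runs in the background.
    /// The join handle is returned only so callers can wait if they want to.
    pub fn open<F>(recover: F) -> (Self, JoinHandle<()>)
    where
        F: FnOnce() -> HashMap<u64, Vec<u8>> + Send + 'static,
    {
        let ready = Arc::new(AtomicBool::new(false));
        let entries = Arc::new(Mutex::new(HashMap::new()));
        let (r, e) = (Arc::clone(&ready), Arc::clone(&entries));
        let handle = thread::spawn(move || {
            let recovered = recover();
            *e.lock().unwrap() = recovered;
            // Publish the recovered state only after it is fully installed.
            r.store(true, Ordering::Release);
        });
        (Self { ready, entries }, handle)
    }

    /// True once background recovery has finished.
    pub fn is_ready(&self) -> bool {
        self.ready.load(Ordering::Acquire)
    }

    /// While recovering, behaves like a zero-capacity cache: always a miss.
    pub fn get(&self, key: u64) -> Option<Vec<u8>> {
        if !self.is_ready() {
            return None;
        }
        self.entries.lock().unwrap().get(&key).cloned()
    }

    /// While recovering, the insert is dropped and `false` is returned.
    pub fn insert(&self, key: u64, value: Vec<u8>) -> bool {
        if !self.is_ready() {
            return false;
        }
        self.entries.lock().unwrap().insert(key, value);
        true
    }
}
```

With this shape, the compute node can start listening on its serving port right after `open` returns, so the startup probe is not blocked on recovery; the cost is a lower hit rate until recovery completes.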
