bug: compute node recovery fails when file cache is large on cloud #12656

MrCroxx · 2023-10-07T09:48:55Z

Describe the bug

RisingWave may crash due to a temporary network failure and recover later.

On cloud, k8s is responsible for health checks (e.g. in RisingWave's internal test cloud setup, the health check runs every 10s). If the health check fails several times in a row, the failing pod is restarted.

However, with a large file cache, RisingWave recovery can be slow, so the compute node may be killed by the k8s health check again and again before it finishes starting.
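To make the timing problem concrete, here is a rough sketch of the startup budget a probe gives a container. The 10s period comes from the setup above; the failure threshold of 3 and the function names are assumptions for illustration only, and the formula (initial delay plus threshold times period) is an approximation of how k8s counts probe failures.

```rust
use std::time::Duration;

/// Approximate time a container gets to start before the probe gives up:
/// the initial delay plus `failure_threshold` failed checks spaced `period` apart.
/// (Hypothetical helper, not part of RisingWave or k8s.)
fn startup_budget(initial_delay: Duration, period: Duration, failure_threshold: u32) -> Duration {
    initial_delay + period * failure_threshold
}

/// A node survives startup only if its recovery fits inside the budget.
fn survives_startup(recovery: Duration, budget: Duration) -> bool {
    recovery <= budget
}
```

With a 10s period and an assumed threshold of 3, the budget is about 30s, so any file cache recovery that takes longer than that ends in a restart, and the next attempt starts the same slow recovery from scratch.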

For example, in one case the compute nodes kept restarting and could not recover after a failure, yet there were no unexpected logs between launches.

The k8s event log shows that the compute node was killed by the health check (a startup probe):

LAST SEEN               TYPE      REASON      OBJECT                                 MESSAGE
53m (x3 over 113m)      Normal    Created     Pod/benchmark-risingwave-compute-c-0   Created container compute
53m (x3 over 113m)      Normal    Started     Pod/benchmark-risingwave-compute-c-0   Started container compute
29m (x9 over 54m)       Normal    Pulled      Pod/benchmark-risingwave-compute-c-0   Container image "ghcr.io/risingwavelabs/risingwave:git-52483b367f65b9f4d8224567fe15f4d2a9407232" already present on machine
4m50s (x174 over 54m)   Warning   Unhealthy   Pod/benchmark-risingwave-compute-c-0   Startup probe failed: dial tcp 10.0.63.110:5688: connect: connection refused
53m                     Normal    Killing     Pod/benchmark-risingwave-compute-c-0   Container compute failed startup probe, will be restarted
9m45s (x125 over 47m)   Warning   BackOff     Pod/benchmark-risingwave-compute-c-0   Back-off restarting failed container compute in pod benchmark-risingwave-compute-c-0_benchmark-yao-manual(0cf79b54-eed1-4185-a5a6-3db8e9fde3b8)
0s (x161 over 47m)      Warning   BackOff     Pod/benchmark-risingwave-compute-c-0   Back-off restarting failed container compute in pod benchmark-risingwave-compute-c-0_benchmark-yao-manual(0cf79b54-eed1-4185-a5a6-3db8e9fde3b8)
0s (x215 over 59m)      Warning   Unhealthy   Pod/benchmark-risingwave-compute-c-0   Startup probe failed: dial tcp 10.0.63.110:5688: connect: connection refused
0s (x195 over 57m)      Warning   BackOff     Pod/benchmark-risingwave-compute-c-0   Back-off restarting failed container compute in pod benchmark-risingwave-compute-c-0_benchmark-yao-manual(0cf79b54-eed1-4185-a5a6-3db8e9fde3b8)

The health check failed to connect to the serving port of the compute node.

However, the log shows that the compute node did serve normally, at least for a short period.

To make sure the port was reachable, I checked it manually:

Because file cache recovery is slow, the compute node had already missed several health checks by then, and it was killed seconds after it finally came up.

Solution:

Lazy-load the file cache.

The file cache should be usable as soon as it is opened, with its recovery scheduled in the background. During recovery, any operation on the file cache behaves like an operation on a zero-capacity file cache. Once the file cache is fully recovered, it serves normally.
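A minimal sketch of this behavior, using only the Rust standard library. `LazyStore`, `Entries`, and the `recover` closure are hypothetical names for illustration and are not foyer's actual API; an in-memory map stands in for the recovered file cache.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};
use std::thread;

/// Stand-in for the recovered cache contents (assumption: a plain in-memory map).
type Entries = HashMap<u64, Vec<u8>>;

/// Usable immediately after `open`; acts like a zero-capacity cache until recovery finishes.
#[derive(Clone)]
struct LazyStore {
    recovered: Arc<OnceLock<Mutex<Entries>>>,
}

impl LazyStore {
    /// Returns at once. The slow `recover` step runs on a background thread.
    fn open<F>(recover: F) -> (Self, thread::JoinHandle<()>)
    where
        F: FnOnce() -> Entries + Send + 'static,
    {
        let recovered = Arc::new(OnceLock::new());
        let slot = Arc::clone(&recovered);
        let handle = thread::spawn(move || {
            // Publish the recovered entries; from here on the store serves normally.
            let _ = slot.set(Mutex::new(recover()));
        });
        (Self { recovered }, handle)
    }

    /// True once background recovery has completed.
    fn is_ready(&self) -> bool {
        self.recovered.get().is_some()
    }

    /// Always a miss while recovery is still running.
    fn get(&self, key: u64) -> Option<Vec<u8>> {
        let entries = self.recovered.get()?;
        entries.lock().unwrap().get(&key).cloned()
    }

    /// Dropped (returns false) while recovery is still running.
    fn insert(&self, key: u64, value: Vec<u8>) -> bool {
        match self.recovered.get() {
            Some(entries) => {
                entries.lock().unwrap().insert(key, value);
                true
            }
            None => false,
        }
    }
}
```

Since `open` no longer blocks on recovery, the compute node can bind its serving port and pass the startup probe right away; the cost is cache misses until recovery completes.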

Lazy loading is useful to every file cache user, so it will be implemented in the file cache engine (foyer). This also keeps RisingWave simple.

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response


MrCroxx added the type/bug Something isn't working label Oct 7, 2023

MrCroxx self-assigned this Oct 7, 2023

MrCroxx mentioned this issue Oct 7, 2023

feat: introduce lazy store foyer-rs/foyer#151

Merged

3 tasks

github-actions bot added this to the release-1.3 milestone Oct 7, 2023

fuyufjh modified the milestones: release-1.3, release-1.4 Oct 10, 2023

MrCroxx mentioned this issue Oct 11, 2023

feat(storage): filecache lazy load, unit-level refill, reduce insert latency #12714

Merged

8 tasks

MrCroxx closed this as completed in #12714 Oct 13, 2023
