
storage: potential leaked pinned version/snapshot #9576

Closed
zwang28 opened this issue May 4, 2023 · 4 comments
Labels: component/storage (Storage), type/bug (Something isn't working)
Milestone: release-0.20

Comments

zwang28 (Contributor) commented May 4, 2023

Describe the bug

In the compute node, the pinned hummock version is released asynchronously by sending a message in drop. However, if the send fails, only an error is reported, so the corresponding pinned version can never be released until the compute node is restarted.

tracing::warn!("failed to send req unpin version id: {}", self.version_id);

A similar problem exists in the frontend:

error!("failed to send release epoch: {}", err);

To Reproduce

No response

Expected behavior

No response

Additional context

Should we panic instead of reporting an error? @wenym1 @Li0k

@zwang28 added the type/bug and component/storage labels May 4, 2023
@github-actions added this to the release-0.20 milestone May 4, 2023
@zwang28 changed the title from "storage: potential leaked pinned version" to "storage: potential leaked pinned version/snapshot" May 4, 2023
Li0k (Contributor) commented May 4, 2023

After re-reading the existing implementation, ReadVersion does hold a reference to PinnedVersion, which is only released on VersionUpdate or Drop. If the system cannot properly destroy LocalHummockStorage, the reference to PinnedVersion may never be cleaned up.

#[derive(Clone)]
/// A container of information required for reading from hummock.
pub struct HummockReadVersion {
    /// Local version for staging data.
    staging: StagingVersion,

    /// Remote version for committed data.
    committed: CommittedVersion,
}

The current assumption is that LocalHummockStorage is always destroyed before the EventHandler, so the send operation cannot fail. But when that assumption is broken, this can lead to leaks. So I would prefer to panic when the send fails and let the recovery process clean up.
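
As a sketch of this proposal (reusing the illustrative PinnedVersionGuard from above, not the actual patch), the drop handler would fail fast instead of warning:

impl Drop for PinnedVersionGuard {
    fn drop(&mut self) {
        let event = HummockEvent::UnpinVersion { version_id: self.version_id };
        // Fail fast: a lost unpin request means the event handler is gone,
        // so surface the error and rely on recovery instead of leaking.
        if self.event_tx.send(event).is_err() {
            panic!("failed to send req unpin version id: {}", self.version_id);
        }
    }
}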

wenym1 (Contributor) commented May 5, 2023

On the CN side, we have several long-running tokio worker tasks that handle events such as version pin/unpin. A failure to send one of these events means the corresponding worker has panicked for some reason, and after such a panic the CN cannot work properly until it is restarted.

Therefore, instead of panicking on a send error, I think we can monitor the join handles of these worker tasks, and when we find that any of them has panicked, either panic the whole CN process to trigger a restart, or reset everything without shutting down; see the sketch below.
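
A minimal sketch of that monitoring idea, assuming tokio workers and an abort-on-panic policy (watch_worker and the names are illustrative):

use tokio::task::JoinHandle;

/// Spawn a watchdog that awaits a worker's join handle and aborts the whole
/// process if the worker panicked, so the node restarts cleanly instead of
/// limping along and leaking pinned versions/snapshots.
fn watch_worker(name: &'static str, handle: JoinHandle<()>) {
    tokio::spawn(async move {
        match handle.await {
            Ok(()) => tracing::info!("worker {} exited normally", name),
            Err(e) if e.is_panic() => {
                tracing::error!("worker {} panicked: {}", name, e);
                // Alternative per the comment above: reset in-process state
                // instead of exiting.
                std::process::abort();
            }
            Err(e) => tracing::error!("worker {} cancelled: {}", name, e),
        }
    });
}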

hzxa21 (Collaborator) commented May 10, 2023

related: #9732

zwang28 (Contributor, Author) commented May 22, 2023

Closing because this is a potential improvement rather than an existing bug.

@zwang28 closed this as not planned May 22, 2023