-
Notifications
You must be signed in to change notification settings - Fork 591
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: potential leaked pinned version/snapshot #9576
Comments
After re-reading the existing implementation logic, ReadVersion does hold a reference to PinnedVersion and is only released when VersionUpdate or Drop. When the system cannot properly Destory LocalHummockStorage, the ref of PinnedVersion may not be cleaned up. #[derive(Clone)]
/// A container of information required for reading from hummock.
pub struct HummockReadVersion {
/// Local version for staging data.
staging: StagingVersion,
/// Remote version for committed data.
committed: CommittedVersion,
} Although the current assumption is that LocalHummockStorage is always Destroyed before the EventHandler, so the send operation cannot fail. But when the assumptions are broken, this can lead to leaks. So I prefer to panic, when it fails, and recover by recovering. |
In the CN side, we have several long running worker tokio tasks to handle events like version pin/unpin, and other events. Any failure to send these events is because these long running workers panic for some reasons. After these worker panics, the CN cannot work well until gets restarted. Therefore, instead of panic in send error, I think we can monitor the join handles of these worker tokio tasks, and when we finds out that any of these task panics, we either panic the whole CN process to trigger a restart, or we reset everything without shutdown. |
related: #9732 |
Close because this is a potential improvement rather than existing bug. |
Describe the bug
In compute node, pinned hummock version is released asynchronously via sending message in
drop
. However if thesend
fails, only error is reported. Thus, corresponding pinned version can never been released, until compute node is restarted.risingwave/src/storage/src/hummock/local_version/pinned_version.rs
Line 67 in 7d49b88
risingwave/src/storage/src/hummock/event_handler/mod.rs
Line 163 in 7d49b88
Similar problem in frontend:
risingwave/src/frontend/src/scheduler/hummock_snapshot_manager.rs
Line 127 in 7d49b88
To Reproduce
No response
Expected behavior
No response
Additional context
Should we panic instead of reporting error? @wenym1 @Li0k
The text was updated successfully, but these errors were encountered: