
storage: potential leaked pinned version/snapshot #9576

Closed
zwang28 opened this issue May 4, 2023 · 4 comments
Labels: component/storage (Storage), type/bug (Something isn't working)
Milestone: release-0.20

Comments

zwang28 (Contributor) commented May 4, 2023

Describe the bug

In the compute node, the pinned hummock version is released asynchronously by sending a message in drop. However, if the send fails, only an error is reported, so the corresponding pinned version can never be released until the compute node is restarted.

tracing::warn!("failed to send req unpin version id: {}", self.version_id);

A similar problem exists in the frontend:

error!("failed to send release epoch: {}", err);

To Reproduce

No response

Expected behavior

No response

Additional context

Should we panic instead of reporting an error? @wenym1 @Li0k

@zwang28 added the type/bug and component/storage labels May 4, 2023
@github-actions added this to the release-0.20 milestone May 4, 2023
@zwang28 changed the title from "storage: potential leaked pinned version" to "storage: potential leaked pinned version/snapshot" May 4, 2023
Li0k (Contributor) commented May 4, 2023

After re-reading the existing implementation, ReadVersion does hold a reference to PinnedVersion, which is only released on VersionUpdate or Drop. If the system cannot properly destroy LocalHummockStorage, the reference to PinnedVersion may never be cleaned up.

#[derive(Clone)]
/// A container of information required for reading from hummock.
pub struct HummockReadVersion {
    /// Local version for staging data.
    staging: StagingVersion,

    /// Remote version for committed data.
    committed: CommittedVersion,
}

The current assumption is that LocalHummockStorage is always destroyed before the EventHandler, so the send operation cannot fail. But when that assumption is broken, this can lead to leaks. So I would prefer to panic when the send fails and let the recovery process clean up.
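
As a sketch of this proposal (reusing the illustrative PinnedVersionGuard from above, not the actual patch), the drop handler would fail fast instead of warning:

impl Drop for PinnedVersionGuard {
    fn drop(&mut self) {
        let event = HummockEvent::UnpinVersion { version_id: self.version_id };
        // Fail fast: a lost unpin request means the event handler is gone,
        // so surface the error and rely on recovery instead of leaking.
        if self.event_tx.send(event).is_err() {
            panic!("failed to send req unpin version id: {}", self.version_id);
        }
    }
}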

wenym1 (Contributor) commented May 5, 2023

On the CN side, we have several long-running tokio worker tasks that handle events such as version pin/unpin. A failure to send one of these events means the corresponding worker has panicked for some reason, and after such a panic the CN cannot work properly until it is restarted.

Therefore, instead of panicking on a send error, I think we can monitor the join handles of these worker tasks, and when we find that any of them has panicked, either panic the whole CN process to trigger a restart, or reset everything without shutting down; see the sketch below.
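
A minimal sketch of that monitoring idea, assuming tokio workers and an abort-on-panic policy (watch_worker and the names are illustrative):

use tokio::task::JoinHandle;

/// Spawn a watchdog that awaits a worker's join handle and aborts the whole
/// process if the worker panicked, so the node restarts cleanly instead of
/// limping along and leaking pinned versions/snapshots.
fn watch_worker(name: &'static str, handle: JoinHandle<()>) {
    tokio::spawn(async move {
        match handle.await {
            Ok(()) => tracing::info!("worker {} exited normally", name),
            Err(e) if e.is_panic() => {
                tracing::error!("worker {} panicked: {}", name, e);
                // Alternative per the comment above: reset in-process state
                // instead of exiting.
                std::process::abort();
            }
            Err(e) => tracing::error!("worker {} cancelled: {}", name, e),
        }
    });
}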

hzxa21 (Collaborator) commented May 10, 2023

related: #9732

zwang28 (Contributor, Author) commented May 22, 2023

Closing because this is a potential improvement rather than an existing bug.

@zwang28 closed this as not planned May 22, 2023