[Checkpoint] wait for checkpoint service to stop during reconfig #17556
Conversation
Force-pushed from 9c7a2d4 to 949b084
Discussed this with @mwtian in the office the other day. Can we make sure it is actually safe to cancel …
I had the same question. It might be better to keep the pre-existing graceful shutdown logic, and do a …
I was thinking that if canceling checkpoint creation is problematic, a panic would cause inconsistency as well, which I have not observed. The in-memory data seems per-epoch and does not seem to be concurrently accessed. I will add a fail point, and if there is a strong preference, I will add the wait for processing to finish.
```diff
     certified_checkpoint_output,
     state.clone(),
     metrics.clone(),
 );

-        spawn_monitored_task!(aggregator.run());
+        tasks.spawn(aggregator.run());
```
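The change above collects the checkpoint tasks in a `JoinSet` instead of fire-and-forget spawning, so reconfig can later wait for every one of them. A minimal stdlib sketch of the same pattern (using `std::thread` handles in place of a tokio `JoinSet`; the names here are illustrative, not from the PR):

```rust
use std::thread;

/// Spawn workers and keep their handles so a later shutdown step can
/// wait for all of them (stdlib analogue of collecting tasks in a
/// tokio `JoinSet` and draining it with `join_next`).
fn spawn_and_join(n: i32) -> Vec<i32> {
    // Keeping the handles is the point: without them there is nothing
    // to wait on at shutdown time.
    let handles: Vec<_> = (0..n).map(|i| thread::spawn(move || i * 2)).collect();
    let mut out: Vec<i32> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    out.sort();
    out
}

fn main() {
    println!("{:?}", spawn_and_join(3));
}
```

Draining the collected handles gives the "wait for checkpoint service to stop" behavior the PR title describes.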
We are also losing the `spawn_monitored_task!` here. It should be easy to change the macro to take a `JoinSet`.
Using monitored_future!() now.
The difference is that the panic wipes all in-memory state, so we would only be dealing with crash-recovery issues. Still, the fact that we only do this at the end of the epoch is convincing to me. Between that and the fact that simtest would be able to find problems with this, I think it's okay as is.
This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
Force-pushed from 949b084 to bf03889
Updated to wait for checkpoint service tasks to shut down.
crates/sui-node/src/lib.rs (outdated)
```diff
-        drop(checkpoint_service_exit);
+        // Stop the old checkpoint service and wait for them to finish.
+        let _ = checkpoint_service_exit.send(());
+        while let Some(_result) = checkpoint_service_tasks.join_next().await {}
```
Do we want to add a timeout here, in case something wrong in the service prevents it from stopping?
Added a panic in tests if this times out.
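The bounded-wait idea discussed here can be sketched with the standard library: a worker acknowledges shutdown over a channel, and the waiter gives up after a budget (an analogue of wrapping the `JoinSet` drain in `tokio::time::timeout`; the function and durations are illustrative, not from the PR):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Wait for a worker to acknowledge shutdown, but bound the wait.
/// Returns true if the worker stopped within the budget.
fn wait_for_shutdown(worker_delay: Duration, budget: Duration) -> bool {
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(worker_delay); // simulated in-flight checkpoint work
        let _ = done_tx.send(());    // acknowledge shutdown
    });
    match done_rx.recv_timeout(budget) {
        Ok(()) => true,
        // In a test build one could panic here instead, surfacing a
        // checkpoint service that never stops rather than hanging reconfig.
        Err(_) => false,
    }
}

fn main() {
    let fast = wait_for_shutdown(Duration::from_millis(10), Duration::from_millis(500));
    let slow = wait_for_shutdown(Duration::from_millis(500), Duration::from_millis(10));
    println!("fast stopped in time: {fast}, slow stopped in time: {slow}");
}
```

Panicking only in tests, as the comment above describes, keeps production reconfig from crashing on a slow shutdown while still catching hangs in CI.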
By wrapping …
Description

Currently during reconfig, `CheckpointService` tasks, including `CheckpointBuilder` and `CheckpointAggregator`, are notified to shut down, but reconfig does not wait for them to finish shutting down. There can be a race between the reconfig loop proceeding to drop the epoch db handle while `CheckpointBuilder` tries to read from the epoch db when creating a new checkpoint. The race can result in panics.

Test plan
CI
Simulation
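The overall fix in the description — signal shutdown, then wait before tearing down shared state — can be sketched with stdlib threads in place of tokio tasks (an exit flag stands in for the `checkpoint_service_exit` channel; all names are illustrative):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Signal a checkpoint-style worker to exit, then wait for it to finish
/// before dropping shared per-epoch state. Returns how many loop
/// iterations the worker completed.
fn reconfig_shutdown() -> u32 {
    let exit = Arc::new(AtomicBool::new(false));
    let exit_clone = exit.clone();
    let worker = thread::spawn(move || {
        let mut iterations = 0u32;
        while !exit_clone.load(Ordering::Relaxed) {
            iterations += 1; // simulated checkpoint building against epoch db
            thread::sleep(Duration::from_millis(1));
        }
        iterations
    });
    thread::sleep(Duration::from_millis(20));
    exit.store(true, Ordering::Relaxed); // notify shutdown...
    // ...and, crucially, wait for the worker before any epoch state
    // is dropped — the step the pre-fix code was missing.
    worker.join().unwrap()
}

fn main() {
    let iters = reconfig_shutdown();
    println!("worker ran {iters} iterations before clean shutdown");
}
```

Without the final join, the worker could still be mid-iteration when the caller drops state it reads, which is exactly the race the description attributes to the dropped epoch db handle.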
Release notes
Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.