Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Checkpoint] wait for checkpoint service to stop during reconfig #17556

Merged
merged 2 commits into from
Oct 18, 2024

Conversation

mwtian
Copy link
Member

@mwtian mwtian commented May 7, 2024

Description

Currently during reconfig, CheckpointService tasks, including CheckpointBuilder and CheckpointAggregator, are notified to shut down. But reconfig does not wait for them to finish shutting down. There can be a race between the reconfig loop proceeding to drop the epoch db handle, while CheckpointBuilder tries to read from epoch db when creating a new checkpoint. The race can result in panics.

Test plan

CI
Simulation


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:

Copy link

vercel bot commented May 7, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Oct 18, 2024 4:42am
3 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview Oct 18, 2024 4:42am
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview Oct 18, 2024 4:42am
sui-typescript-docs ⬜️ Ignored (Inspect) Visit Preview Oct 18, 2024 4:42am

@halfprice
Copy link
Contributor

Discussed this with @mwtian in the office the other day. Can we make sure it is actually safe to cancel make_checkpoint and aggregator's run_and_notify, and won't leave inconsistent state in the memory? Maybe consider using fail_points to cancel these tasks while it is running, and see if it causes any errors.

@mystenmark
Copy link
Contributor

Discussed this with @mwtian in the office the other day. Can we make sure it is actually safe to cancel make_checkpoint and aggregator's run_and_notify, and won't leave inconsistent state in the memory? Maybe consider using fail_points to cancel these tasks while it is running, and see if it causes any errors.

I had the same question. It might be better to keep the pre-existing graceful shutdown logic, and do a while join_set.join_next().is_some() to wait for shutdown to complete

@mwtian
Copy link
Member Author

mwtian commented May 10, 2024

I was thinking that if canceling checkpoint creation is problematic, panic would cause inconsistency as well which I have not observed. The in-memory data seems per epoch and they do not seem to be concurrently accessed. I will add a fail point and if there is a strong preference, I will add the wait for processing to finish.

certified_checkpoint_output,
state.clone(),
metrics.clone(),
);

spawn_monitored_task!(aggregator.run());
tasks.spawn(aggregator.run());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are also losing the spawn_monitored_task! here. should be easy to change the macro to take a joinset.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using monitored_future!() now.

@mystenmark
Copy link
Contributor

I was thinking that if canceling checkpoint creation is problematic, panic would cause inconsistency as well which I have not observed. The in-memory data seems per epoch and they do not seem to be concurrently accessed. I will add a fail point and if there is a strong preference, I will add the wait for processing to finish.

The difference is that the panic wipes all in memory state so we would only be dealing with crash-recovery issues. Still, the fact that we only do this at the end of the epoch is convincing to me. Between that and the fact that simtest would be able to find problems with this, I think its okay as is.

Copy link
Contributor

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jul 23, 2024
@mwtian mwtian removed the Stale label Jul 30, 2024
@mwtian mwtian force-pushed the checkpoint-service-reconfig branch from 949b084 to bf03889 Compare October 15, 2024 22:31
@mwtian
Copy link
Member Author

mwtian commented Oct 15, 2024

Updated to wait for checkpoint service tasks to shut down.

drop(checkpoint_service_exit);
// Stop the old checkpoint service and wait for them to finish.
let _ = checkpoint_service_exit.send(());
while let Some(_result) = checkpoint_service_tasks.join_next().await {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add a timeout here in case something is wrong in the service to prevent it from stopping?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a panic in tests if this times out.

@mwtian
Copy link
Member Author

mwtian commented Oct 18, 2024

By wrapping build.run() with epoch_store_clone.within_alive_epoch(), IIUC it also forces cancellation when the epoch ends. I'm not sure if the same wrapping needs to be applied to aggregator.run() as well. Will leave this for another PR if needed.

@mwtian mwtian merged commit 8d2bb84 into main Oct 18, 2024
47 checks passed
@mwtian mwtian deleted the checkpoint-service-reconfig branch October 18, 2024 05:09
mwtian added a commit that referenced this pull request Oct 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants