[Checkpoint] wait for checkpoint service to stop during reconfig #17556

mwtian · 2024-05-07T22:20:52Z

Description

Currently, during reconfig, the CheckpointService tasks (CheckpointBuilder and CheckpointAggregator) are notified to shut down, but reconfig does not wait for them to finish shutting down. This creates a race: the reconfig loop can proceed to drop the epoch db handle while CheckpointBuilder is still reading from the epoch db to create a new checkpoint. The race can result in panics.
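The ordering problem described above can be illustrated with a minimal sketch. This is not the code from this PR: it uses standard-library threads and channels as a stand-in for the async tasks, and `EpochDb`, `read`, and `reconfig_with_wait` are hypothetical names chosen for illustration. It only shows the intended order: signal shutdown, wait for the worker to exit, and close the db afterwards.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, TryRecvError};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the per-epoch database handle.
struct EpochDb {
    closed: AtomicBool,
}

impl EpochDb {
    /// Fails once the db has been closed, which models the read that
    /// would panic in the race described above.
    fn read(&self) -> Result<u64, &'static str> {
        if self.closed.load(Ordering::SeqCst) {
            Err("epoch db already closed")
        } else {
            Ok(42)
        }
    }
}

/// Signals shutdown, waits for the worker to exit, and only then closes
/// the db. Returns how many reads the worker saw fail.
fn reconfig_with_wait() -> usize {
    let db = Arc::new(EpochDb { closed: AtomicBool::new(false) });
    let (exit_tx, exit_rx) = mpsc::channel::<()>();

    let worker_db = Arc::clone(&db);
    let worker = thread::spawn(move || {
        let mut failed_reads = 0usize;
        // Keep "building checkpoints" until the exit signal arrives
        // (or the sender goes away).
        while matches!(exit_rx.try_recv(), Err(TryRecvError::Empty)) {
            if worker_db.read().is_err() {
                failed_reads += 1;
            }
            thread::yield_now();
        }
        failed_reads
    });

    // 1. Notify the worker to stop.
    let _ = exit_tx.send(());
    // 2. Wait for the worker to actually finish.
    let failed_reads = worker.join().expect("worker panicked");
    // 3. Only now close the epoch db.
    db.closed.store(true, Ordering::SeqCst);
    failed_reads
}

fn main() {
    println!("failed reads: {}", reconfig_with_wait());
}
```

Because the db is closed only after `join` returns, the worker cannot observe a closed db in this sketch. Skipping step 2 would allow step 3 to run while the worker is still reading.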

Test plan

CI
Simulation

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

halfprice · 2024-05-10T18:39:42Z

Discussed this with @mwtian in the office the other day. Can we make sure it is actually safe to cancel make_checkpoint and the aggregator's run_and_notify, and that doing so won't leave inconsistent state in memory? Maybe consider using fail_points to cancel these tasks while they are running, and see if that causes any errors.

mystenmark · 2024-05-10T21:21:23Z

> Discussed this with @mwtian in the office the other day. Can we make sure it is actually safe to cancel make_checkpoint and the aggregator's run_and_notify, and that doing so won't leave inconsistent state in memory? Maybe consider using fail_points to cancel these tasks while they are running, and see if that causes any errors.

I had the same question. It might be better to keep the pre-existing graceful shutdown logic, and loop on while join_set.join_next().await.is_some() to wait for shutdown to complete.
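The graceful variant suggested here can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: standard-library threads and a `Vec` of join handles stand in for the async tasks and the join set, and `Progress`, `spawn_worker`, and `graceful_shutdown` are hypothetical names. Each worker checks the stop flag only between work items, so an item that has started always completes, and the caller drains every handle before continuing.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical bookkeeping: counts work items started and finished.
#[derive(Default)]
struct Progress {
    started: AtomicU64,
    finished: AtomicU64,
}

fn spawn_worker(stop: Arc<AtomicBool>, progress: Arc<Progress>) -> JoinHandle<()> {
    thread::spawn(move || {
        // The stop flag is checked only between items, so an item that
        // has started always runs to completion.
        while !stop.load(Ordering::SeqCst) {
            progress.started.fetch_add(1, Ordering::SeqCst);
            thread::yield_now(); // stand-in for real work
            progress.finished.fetch_add(1, Ordering::SeqCst);
        }
    })
}

/// Requests shutdown, drains every worker handle, and returns the
/// (started, finished) counters observed afterwards.
fn graceful_shutdown() -> (u64, u64) {
    let stop = Arc::new(AtomicBool::new(false));
    let progress = Arc::new(Progress::default());
    let mut tasks: Vec<JoinHandle<()>> = (0..2)
        .map(|_| spawn_worker(Arc::clone(&stop), Arc::clone(&progress)))
        .collect();

    // Ask the workers to stop, then wait for all of them, similar in
    // spirit to looping until the join set is empty.
    stop.store(true, Ordering::SeqCst);
    while let Some(task) = tasks.pop() {
        task.join().expect("worker panicked");
    }

    (
        progress.started.load(Ordering::SeqCst),
        progress.finished.load(Ordering::SeqCst),
    )
}

fn main() {
    let (started, finished) = graceful_shutdown();
    println!("started={started} finished={finished}");
}
```

After the drain loop, the two counters are equal: no item was abandoned halfway, which is the property the question about cancelling mid-checkpoint is concerned with.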

mwtian · 2024-05-10T21:56:43Z

My thinking was that if canceling checkpoint creation were problematic, a panic would cause inconsistency as well, and I have not observed that. The in-memory data appears to be per epoch and does not seem to be accessed concurrently. I will add a fail point, and if there is a strong preference, I will also wait for in-flight processing to finish.

mystenmark · 2024-05-22T03:04:55Z

crates/sui-core/src/checkpoints/mod.rs

            certified_checkpoint_output,
            state.clone(),
            metrics.clone(),
        );

-        spawn_monitored_task!(aggregator.run());
+        tasks.spawn(aggregator.run());


We are also losing the spawn_monitored_task! here. It should be easy to change the macro to take a JoinSet.

mwtian · 2024-10-18

Using monitored_future!() now.

mystenmark · 2024-05-22T03:06:19Z

> My thinking was that if canceling checkpoint creation were problematic, a panic would cause inconsistency as well, and I have not observed that. The in-memory data appears to be per epoch and does not seem to be accessed concurrently. I will add a fail point, and if there is a strong preference, I will also wait for in-flight processing to finish.

The difference is that a panic wipes all in-memory state, so we would only be dealing with crash-recovery issues. Still, the fact that we only do this at the end of the epoch is convincing to me. Between that and the fact that simtest would be able to find problems with this, I think it's okay as is.

github-actions · 2024-07-23T01:53:39Z

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

mwtian · 2024-10-15T22:31:51Z

Updated to wait for checkpoint service tasks to shut down.

halfprice · 2024-10-17T20:00:32Z

crates/sui-node/src/lib.rs

-                drop(checkpoint_service_exit);
+                // Stop the old checkpoint service and wait for them to finish.
+                let _ = checkpoint_service_exit.send(());
+                while let Some(_result) = checkpoint_service_tasks.join_next().await {}


Do we want to add a timeout here, in case something goes wrong in the service and prevents it from stopping?

mwtian · 2024-10-18

Added a panic in tests if this times out.
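A bounded wait of the kind discussed in this thread can be sketched as below. The PR's actual timeout mechanism is not shown in this conversation, so this is only an assumption-laden illustration: it uses standard-library channels and `recv_timeout` instead of an async timeout, and `wait_for_stop`, `shutdown_responsive_worker`, and `shutdown_stuck_worker` are hypothetical names. A caller could turn the timeout result into a panic in test builds.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Waits until the worker reports completion, or gives up after `timeout`.
fn wait_for_stop(done_rx: &Receiver<()>, timeout: Duration) -> Result<(), RecvTimeoutError> {
    done_rx.recv_timeout(timeout)
}

/// A worker that exits as soon as it is asked to.
/// Returns true if the wait timed out.
fn shutdown_responsive_worker() -> bool {
    let (exit_tx, exit_rx) = mpsc::channel::<()>();
    let (done_tx, done_rx) = mpsc::channel::<()>();
    let worker = thread::spawn(move || {
        let _ = exit_rx.recv(); // block until asked to stop
        let _ = done_tx.send(()); // report completion
    });

    let _ = exit_tx.send(());
    let timed_out = wait_for_stop(&done_rx, Duration::from_secs(5)).is_err();
    worker.join().expect("worker panicked");
    timed_out
}

/// A worker that stays blocked until released, simulating a service that
/// fails to stop. Returns true if the wait timed out.
fn shutdown_stuck_worker() -> bool {
    let (release_tx, release_rx) = mpsc::channel::<()>();
    let (done_tx, done_rx) = mpsc::channel::<()>();
    let worker = thread::spawn(move || {
        let _ = release_rx.recv(); // simulates being stuck
        let _ = done_tx.send(());
    });

    let timed_out = matches!(
        wait_for_stop(&done_rx, Duration::from_millis(50)),
        Err(RecvTimeoutError::Timeout)
    );
    // Release the worker so the sketch can exit cleanly.
    let _ = release_tx.send(());
    worker.join().expect("worker panicked");
    timed_out
}

fn main() {
    println!("responsive worker timed out: {}", shutdown_responsive_worker());
    println!("stuck worker timed out: {}", shutdown_stuck_worker());
}
```

The timeout does not stop the stuck worker by itself; it only lets the caller notice the problem and decide what to do, for example fail loudly in tests.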

mwtian · 2024-10-18T04:49:14Z

By wrapping build.run() with epoch_store_clone.within_alive_epoch(), IIUC it also forces cancellation when the epoch ends. I'm not sure if the same wrapping needs to be applied to aggregator.run() as well. Will leave this for another PR if needed.

mwtian requested review from lxfind, halfprice and mystenmark May 7, 2024 22:21

mwtian force-pushed the checkpoint-service-reconfig branch from 9c7a2d4 to 949b084 Compare May 7, 2024 22:25

mystenmark reviewed May 22, 2024

github-actions bot added the Stale label Jul 23, 2024

mwtian removed the Stale label Jul 30, 2024

[Checkpoint] wait for checkpoint service to stop during reconfig

bf03889

mwtian force-pushed the checkpoint-service-reconfig branch from 949b084 to bf03889 Compare October 15, 2024 22:31

mwtian requested a review from mystenmark October 16, 2024 02:52

halfprice approved these changes Oct 17, 2024

.

e10bac2

mwtian enabled auto-merge (squash) October 18, 2024 04:42

mwtian merged commit 8d2bb84 into main Oct 18, 2024
47 checks passed

mwtian deleted the checkpoint-service-reconfig branch October 18, 2024 05:09

mwtian added a commit that referenced this pull request Oct 22, 2024

Revert "[Checkpoint] wait for checkpoint service to stop during recon…

fd68a36

…fig (#17556)" This reverts commit 8d2bb84.
