Skip to content

Commit

Permalink
agent: eliminate potential races while rebooting
Browse files Browse the repository at this point in the history
The agent had a few conditions under which it can enter the
MonitoringUpdate state before it successfully uncordons the node:

* Brupop updates its state into Rebooted before the reboot terminates
  the process
* After rebooting, an error ocurrs after updating the state but before
  performing the uncordon.

We avoid this conditions by:
* Exiting the agent process in cases where the desired state is Rebooted
  but we are not yet running the new version.
* Always uncordoning the node before marking that we have successfully
  transitioned into the Rebooted state.
* Defensively uncordoning the node once again when we enter the
  Monitoring state.
  • Loading branch information
cbgbt committed Jun 25, 2024
1 parent 6d16aac commit 73aae1b
Showing 1 changed file with 21 additions and 1 deletion.
22 changes: 21 additions & 1 deletion agent/src/agentclient.rs
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,10 @@ const NUM_RETRIES: usize = 5;
const AGENT_SLEEP_DURATION: Duration = Duration::from_secs(5);
const UPDATE_CHECK_INTERVAL: Duration = Duration::from_secs(120);

// After issuing an update, the agent sleeps before gracefully exiting.
// This avoids entering a kubernetes CrashLoop that repeatedly restarts the agent.
const REBOOT_SLEEP_INTERVAL: Duration = Duration::from_secs(180);

type SimpleRateLimiter = RateLimiter<NotKeyed, InMemoryState, DefaultClock, NoOpMiddleware>;

lazy_static! {
Expand Down Expand Up @@ -476,6 +480,9 @@ impl<T: APIServerClient> BrupopAgent<T> {
event!(Level::INFO, "Rebooting node to complete update");

if running_desired_version(bottlerocket_shadow_spec).await? {
// Uncordon the node before marking the reboot as complete
self.handle_recover().await?;

// Make sure the status in BottlerocketShadow is up-to-date from the reboot
self.update_status_in_shadow(
bottlerocket_shadow,
Expand All @@ -486,17 +493,30 @@ impl<T: APIServerClient> BrupopAgent<T> {
),
)
.await?;
self.handle_recover().await?;
} else {
// This branch must never complete successfully, or Brupop will assume that
// we've performed the state transition when we haven't.
// Returning an `Err` value is acceptable.
self.prepare_for_update().await?;
apiclient::boot_into_update().await.context(
agentclient_error::UpdateActionsSnafu {
action: "Reboot".to_string(),
},
)?;

// Sleep to avoid entering k8s crash loop backoff
sleep(REBOOT_SLEEP_INTERVAL).await;
// Gracefully exit while waiting for a reboot.
event!(Level::INFO, "Agent will exit while it awaits reboot.");
std::process::exit(0);
}
}
BottlerocketShadowState::MonitoringUpdate => {
// The node should be uncordoned before transitioning to `RebootedIntoUpdate`.
// Defensively uncordon again.
event!(Level::INFO, "Ensuring that the node is uncordoned");
self.handle_recover().await?;

event!(Level::INFO, "Monitoring node's healthy condition");
// TODO: we left space here for customer if they need add customized criteria
// which uses to decide to transition from MonitoringUpdate to WaitingForUpdate.
Expand Down

0 comments on commit 73aae1b

Please sign in to comment.