Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support updating the RoT bootloader (stage0) #5882

Merged
merged 1 commit into from
Jun 20, 2024
Merged

Support updating the RoT bootloader (stage0) #5882

merged 1 commit into from
Jun 20, 2024

Conversation

labbott
Copy link
Contributor

@labbott labbott commented Jun 12, 2024

No description provided.

@labbott labbott marked this pull request as draft June 12, 2024 12:43
update-common/src/artifacts/update_plan.rs Outdated Show resolved Hide resolved
update-common/src/artifacts/update_plan.rs Outdated Show resolved Hide resolved
@@ -98,6 +98,10 @@ pub(crate) struct StartRackUpdateArgs {
#[clap(flatten)]
component_ids: ComponentIdSelector,

/// Force update the RoT Bootloader even if the version is the same.
#[clap(long, help_heading = "Update options")]
force_update_rot_bootloader: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, did you need this to deal with FWID not matching after embootleby or other SWD flashing? (trailing flash blocks not in erased state)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just for consistency with the other artifact force flags.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is to match what we offer for the other flags. I'd rather have it for consistency sake.

wicket/src/ui/panes/update.rs Outdated Show resolved Hide resolved
@@ -946,6 +977,78 @@ impl UpdateDriver {
},
)
.register();

// Send the bootloader update to the RoT.
// XXX is this where we mutex?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wicket is all manually driven, right? By that I mean that each stage0 update needs to be triggered by a human.
In any case, we don't want more than one stage0 update in process at any one time and we want manual intervention if any fail. If we loose a device because of a bricked RoT, we don't want to proceed.
This isn't a real issue until the chain of trust and trust quorum are fully built out, but bricked RoTs are going to at least require field service to touch the machines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each stage0 update is manually triggered but it's possible to kick off many updates in parallel. I thought we had discussed only allow one stage0 update to occur at a time across the rack.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wicket is all manually driven, right? By that I mean that each stage0 update needs to be triggered by a human.

Wicket is only manually driven to a point. A human kicks off "do all the updates for sled N", but then wicket drives RoT -> SP -> host updates autonomously. Multiple sleds can be updated in parallel.

wicketd/src/update_tracker.rs Outdated Show resolved Hide resolved
wicketd/src/update_tracker.rs Outdated Show resolved Hide resolved
@labbott labbott force-pushed the stage0 branch 6 times, most recently from 1342c24 to b93411d Compare June 18, 2024 20:44
Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM, just a handful of questions


// Older versions of the SP have a bug that prevents setting
// the active slot for the RoT bootloader. Check for these
// and skip the update until the SP gets updated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double-checking: we only care about SP version and not RoT version for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. If the RoT version supports rot-boot-info it should support stage0 update.

// artifact available and that all available RoT artifacts are the same
// version, so we can unwrap the first artifact here and assume its
// version matches any subsequent artifacts.
// TODO this needs to be fixed for multi version to work!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this TODO is for a subsequent PR? Will that be similar to the changes to allow multi-version non-bootloader RoT images?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, those changes should be very similar.

.await
{
Ok(v) => match v.into_inner() {
// the minimum we will ever return is 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What guarantees this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we request the HIGHEST_KNOWN_VERSION which is 3. If this message return success we should always return V3 (see https://github.com/oxidecomputer/hubris/blob/master/drv/lpc55-update-server/src/main.rs#L349, https://github.com/oxidecomputer/management-gateway-service/blob/main/gateway-messages/src/sp_to_mgs.rs#L616) . If we return an error that gets handled below and we should skip the update.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the SP is on an older version that doesn't know about 3 yet? I think (oxidecomputer/management-gateway-service#209 (comment)) that means we could get back a v2 response, right? Unless something else in the flow here guarantees we're on a new-enough SP already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the rot-boot-info is a separate message that gets sent to the RoT. If the SP doesn't know about the message it will return an error that it can't be deserialized and we will skip the bootloder update

 │                  │ Message: Failed to run `rot_boot_info`: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json",                │                   │ 
 │                  │ "x-request-id": "8afc7433-17e1-40c8-ac22-edb184b79063", "content-length": "242", "date": "Sun, 28 Dec 1986 00:08:31 GMT"}; value: Error { error_code: │                   │ 
 │                  │ Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 15 }: Error response from SP: sprot: failed to   │                   │ 
 │                  │ deserialize message", request_id: "8afc7433-17e1-40c8-ac22-edb184b79063" }. Will not proceed with update.                                             │                   │ 
 │                  │                                                                                                                     

If the SP does know about the message it will send it on to the RoT

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh thanks!

)
.await
.map_err(|error| {
SpComponentUpdateTerminalError::SetRotActiveSlotFailed {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit - should this be a new SetRotBootloaderActiveSlotFailed variant to get a more precise error message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make that change

@@ -2518,6 +3000,7 @@ impl<'a> SpComponentUpdateContext<'a> {
}
UpdateComponent::Sp => {
// Nothing special to do on the SP - just reset it.
// TODO fixup the SP to also set the active slot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for oxidecomputer/hubris#1808? Do we need to put this fix in before R9 and/or were you planning to do it on this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is related. It's not necessary for this PR to work and I was wary about adding more logic changes for backwards compatability since this PR is already pretty large.

)
.register();

// Now reset (again) to boot into the new stage0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to mutex this substep (or one of the other ones) (#5882 (comment))?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't figure out how to place the mutex. We want to mutex at the rack level and it seemed like this gets spawned off per SP. This is probably my complete inexperience with Rust synchronization though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's going to be a moderate-to-large mess no matter what, I think. How bad is landing this without it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this particular set of stage0 updates it's pretty low risk. We're going to have some subsequent stage0 updates to add epochs to avoid roll backs so I think we could target it there.

@labbott
Copy link
Contributor Author

labbott commented Jun 20, 2024

I just did a test on london going from release 8 to this repo. What's going to happen is the first time through we're going to get an error because the SP is too old

 │                  │ Message: Failed to run `rot_boot_info`: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json",                │                   │ 
 │                  │ "x-request-id": "8afc7433-17e1-40c8-ac22-edb184b79063", "content-length": "242", "date": "Sun, 28 Dec 1986 00:08:31 GMT"}; value: Error { error_code: │                   │ 
 │                  │ Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 15 }: Error response from SP: sprot: failed to   │                   │ 
 │                  │ deserialize message", request_id: "8afc7433-17e1-40c8-ac22-edb184b79063" }. Will not proceed with update.                                             │                   │ 
 │                  │                                                                                                                     

And then we're going to have to run the update again which should give a warning such as Message: fwid c23a53858e94932a95945f28730e41ae4a2d1a8db4776283245eda143b6b2994 is known to be missing a caboose. before we finally update stage0.

The only thing I have not been able to test is running the update on a prod/rel signed sled. There is one in dogfood so once this merges I'm going to plan to do an off cycle dogfood update to make sure this works.

@labbott labbott marked this pull request as ready for review June 20, 2024 19:00
@labbott labbott merged commit 3549436 into main Jun 20, 2024
20 checks passed
@labbott labbott deleted the stage0 branch June 20, 2024 22:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants