Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sled-agent] add preliminary "write boot disk OS" http endpoints #4633

Merged
merged 13 commits into from
Dec 8, 2023

Conversation

jgallagher
Copy link
Contributor

This PR adds two new endpoints to sled-agent:

  • POST /boot-disk/{boot_disk}/os/write to start a new update to one of the two boot disks
  • GET /boot-disk/{boot_disk}/os/write/status to get the status of the most-recently-started update to the specified boot disk

The actual drive-writing-machinery is extracted from installinator into installinator-common, which is now a new dependency of sled-agent. The bulk of the changes in this PR center around being able to run that drive-write from within the context of the pair of dropshot endpoints above, plus a bit of glue for unit testing with "in-memory disks".

TODO before merging:

  • test on madrid (both the sled-agent endpoints, and that nothing broke in the small installinator refactor)

Copy link
Contributor

@sunshowers sunshowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

dev-tools/omdb/tests/successes.out Outdated Show resolved Hide resolved
installinator-common/src/raw_disk_writer.rs Outdated Show resolved Hide resolved
Comment on lines 265 to 267
// A previous write is done, but we're
// immedately starting a new one, so discard the
// previous result.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha I had a note here asking if it's worth documenting whether we want an explicit clear step, but you have it right below!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, in hindsight since we did this for both the SP and wicketd, it seems reasonable to do it here too. Changed in 8a65c20.

sled-agent/src/boot_disk_os_writer.rs Outdated Show resolved Hide resolved
sled-agent/src/boot_disk_os_writer.rs Outdated Show resolved Hide resolved
}
})?;

let mut buf = vec![0; disk_block_size];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(General note, mostly for myself, that this can't be Vec::with_capacity(disk_block_size), because read_exact takes &mut [u8] which will be empty in the with_capacity case.)

sled-agent/src/boot_disk_os_writer.rs Outdated Show resolved Hide resolved
Comment on lines +857 to +862
/// Current progress of an OS image being written to disk.
#[derive(
Debug, Clone, Copy, PartialEq, Eq, Deserialize, JsonSchema, Serialize,
)]
#[serde(tag = "state", rename_all = "snake_case")]
pub enum BootDiskOsWriteProgress {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering whether the update engine might be a good fit here, it seems like this is at the same state of evolution we were with the installinator, before which we ended up writing the update engine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, I didn't even think of that, but you're definitely right. It kinda feels like overkill since there's no consumer for the spec, etc., I think? I don't know off the top of my head which would be more approachable; I'm inclined to leave this as-is but add a comment that if we come back to this and need it to do more we could consider replacing the hand rolled state machine - seem reasonable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that sounds fine.

@jgallagher
Copy link
Contributor Author

Tested on madrid (and demo'd today); only ran into one small problem which I fixed in 865d281: sled-agent was hardcoding a 1 MiB request limit, so that commit moves the hard-coding to sled-agent's config.toml and sets it to 2 GiB in all our configs.

Testing notes from madrid are below.


After setting up madrid:

  • booted from slot A (set up by manually updating the scrimlet OS, and confirmed via prtconf-v showing boot-source of disk:17)
  • using dd, confirmed that the hash of the raw disk device for A matched the hash of the zfs.img of this commit, and the hash of the B device did not:
BRM44220001 # shasum zfs.img
16be0b2ae1bf30e3b330b5ebb9e8140544db5df6  zfs.img

# A
BRM44220001 # dd status=none if=/dev/rdsk/c1t00A0750132752F12d0s0 bs=4096 count=204801 | shasum -
16be0b2ae1bf30e3b330b5ebb9e8140544db5df6  -

# B
BRM44220001 # dd status=none if=/dev/rdsk/c7t00A0750132748127d0s0 bs=4096 count=204801 | shasum -
7dff835fbfd5d5a92fa7a428637021baeeec4b4f  -

Confirmed the GET endpoint for both slots shows no update:

BRM44220001 # curl http://[fd00:1122:3344:102::1]:12345/boot-disk/A/os/write/status
{"status":"no_update_started"}
BRM44220001 # curl http://[fd00:1122:3344:102::1]:12345/boot-disk/B/os/write/status
{"status":"no_update_started"}

Start an update to slot B:

BRM44220001 # curl -X POST -T zfs.img 'http://[fd00:1122:3344:102::1]:12345/boot-disk/B/os/write?update_id=5ef7e0d5-9892-4806-a767-1df71d9d83bb&sha3_256_digest=6d01294beba485946d512afe9cbb82a657bf0c72eb884f22357dea902e50b62e'

Polling status every second while it runs:

{"status":"no_update_started"}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"receiving_uploaded_image","bytes_received":98821728}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"receiving_uploaded_image","bytes_received":388933376}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"receiving_uploaded_image","bytes_received":680485856}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"writing_image_to_disk","bytes_written":37359616}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"writing_image_to_disk","bytes_written":121917440}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"writing_image_to_disk","bytes_written":206319616}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"writing_image_to_disk","bytes_written":289697792}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"writing_image_to_disk","bytes_written":374255616}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"writing_image_to_disk","bytes_written":458887168}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"writing_image_to_disk","bytes_written":543580160}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"writing_image_to_disk","bytes_written":628396032}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"writing_image_to_disk","bytes_written":712097792}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"writing_image_to_disk","bytes_written":796778496}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"validating_written_image","bytes_read":24424448}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"validating_written_image","bytes_read":76963840}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"validating_written_image","bytes_read":129142784}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"validating_written_image","bytes_read":226209792}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"validating_written_image","bytes_read":302428160}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"validating_written_image","bytes_read":369467392}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"validating_written_image","bytes_read":440737792}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"validating_written_image","bytes_read":506818560}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"validating_written_image","bytes_read":576548864}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"validating_written_image","bytes_read":628310016}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"validating_written_image","bytes_read":680747008}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"validating_written_image","bytes_read":733007872}}
{"status":"in_progress","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb","progress":{"state":"validating_written_image","bytes_read":785166336}}
{"status":"complete","update_id":"5ef7e0d5-9892-4806-a767-1df71d9d83bb"}

Confirming the disk has the zfs.img contents we expect:

BRM44220001 # dd if=/dev/rdsk/c7t00A0750132748127d0s0 bs=4096 count=204801 | shasum -
16be0b2ae1bf30e3b330b5ebb9e8140544db5df6  -

Also manually tested several of the edge cases covered by the unit tests:

  • Can't start an update while another update is running
  • Must DELETE the state of a prior update before starting a new one
  • Uploading an image with a bad hash fails
  • Trying to DELETE with an incorrect UUID fails
  • ... maybe one or two others I've forgotton

@jgallagher jgallagher merged commit 76b835d into main Dec 8, 2023
22 checks passed
@jgallagher jgallagher deleted the john/sled-agent-write-os branch December 8, 2023 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants