Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rearrange buildomat jobs; rewrite releng process in rust and aggressively parallelize #5744

Merged
merged 39 commits into from
May 15, 2024

Conversation

iliana
Copy link
Contributor

@iliana iliana commented May 13, 2024

Note: documentation says cargo xtask releng but I am going to wire that up in a follow-up PR; the current equivalent is cargo run --release --bin omicron-releng.


Prior to this change we have five main "release engineering" Buildomat jobs that do operations beyond running the test suite:

  • a package job which runs omicron-package in various configurations,
  • a build OS images job which builds the host and trampoline images,
  • a TUF repo job which builds the final TUF repo (this is the build artifact we actually want),
  • a deploy job which uses the single-sled packages to test that a VM boots to SSH (this is a test we actually want),
  • and a CI tools job which builds common tools used by multiple jobs.

This looks like:

graph LR
    package --> host-image["build OS images"]
    package --> deploy
    package --> tuf-repo["TUF repo"]
    host-image --> tuf-repo
    ci-tools["CI tools"] --> deploy
    ci-tools --> tuf-repo
Loading

(There are also the currently-disabled a4x2 jobs but those are independent of this particular graph.)

I think the initial idea behind this was to reuse build artifacts where possible, but this is pretty complicated and adds a lot more output upload/download overhead than expected, which slows down the time to get the end artifact we actually want.

This PR changes the graph to:

graph LR
    package --> deploy
    tuf-repo["TUF repo"]
Loading

And the TUF repo job primarily runs a new releng binary, which runs all of the steps required to download and build all the components of the TUF repo in a single task, using a terrible job runner I wrote.

CI taking too long? busy, but can't get stuff done? try 30 to 40 async rust functions. 30 to 40 async rust fns: an easy iteration time winner. put them directly on a runtime with a half-baked job server. you will certainly not regret writing 30 to 40 async fns
(thank you to @lifning for the image)

The primary goal here was to reduce the time from pushing a commit to getting a TUF repo out the other end; this drops time-to-TUF-repo from ~80 minutes to ~45. In the process this also made it much easier to build a TUF repo (and iterate on that process) locally: just run cargo xtask releng (TODO: soon). It also deleted a lot of Bash.

One thing to note is that, in service of the mission to get time-to-TUF-repo down as much as possible, that job only uploads the TUF repo (and some logs). I also put all of the outputs for the package job into a single tarball for the deploy job to unpack. There are no longer separate uploads for the OS images and each zone; these can be extracted from the repo as we normally do.

@iliana iliana requested review from sunshowers and smklein May 13, 2024 15:51
Copy link
Contributor

@labbott labbott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it means it's really easy to build a TUF repo via xtask?

Comment on lines +109 to +119
client
.get(format!("{}/artifact/{}", base_url, hash))
.send()
.and_then(|response| response.json())
.await
.with_context(|| {
format!(
"failed to fetch hubris artifact {} from {}",
hash, base_url
)
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's unfortunate that permslip is still a private repo but I guess it's mostly just a wrapper around reqwest anyway

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an earlier revision (not pushed) I ended up copying the openapi.json out of the repo and generating a client with Progenitor, and then I realized how much that was slowing down rust-analyzer just for a single API call. So I went with this relatively-small wrapper.

dev-tools/releng/src/hubris.rs Show resolved Hide resolved
(we haven't used this since starting to use permslip for hubris
artifacts, anyway)
@iliana
Copy link
Contributor Author

iliana commented May 13, 2024

This looks like it means it's really easy to build a TUF repo via xtask?

That, and running it locally re-uses all the Omicron artifacts too, so iterating on this process is a lot easier. (I think it takes about 8 minutes total if the control plane is previously built; that time is mostly building the OS images.)

@david-crespo
Copy link
Contributor

Thanks for the great description, and looking forward to docs/releng.adoc. I never really knew what these jobs were doing.

Copy link
Contributor

@sunshowers sunshowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First-pass review, generally looks good!

@@ -3,24 +3,11 @@
#: name = "helios / package"
#: variety = "basic"
#: target = "helios-2.0"
#: rust_toolchain = "1.72.1"
#: rust_toolchain = "1.77.2"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to stay in sync with rust-toolchain.toml? Can we read from that file directly instead? To be honest I didn't know this existed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes:

[toolchain]
# NOTE: This toolchain is also specified in various jobs in
# .github/buildomat/jobs/. If you update it here, update those files too.
#
# We choose a specific toolchain (rather than "stable") for repeatability. The
# intent is to keep this up-to-date with recently-released stable Rust.
channel = "1.77.2"

If it's not in sync it doesn't actually break anything, but the job runs a bit slower (Buildomat downloads 1.72.1 and then rustup's Cargo proxy sees you actually wanted 1.77.2 and downloads that).

Buildomat feature request: oxidecomputer/buildomat#55

.github/buildomat/jobs/package.sh Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
Comment on lines 43 to 53
for (name, value) in command.get_envs() {
if let Some(value) = value {
write!(
command_str,
"{}={} ",
shell_words::quote(&name.to_string_lossy()),
shell_words::quote(&value.to_string_lossy())
)
.unwrap();
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good spot.

dev-tools/releng/src/hubris.rs Show resolved Hide resolved
dev-tools/releng/src/job.rs Outdated Show resolved Hide resolved
dev-tools/releng/src/job.rs Outdated Show resolved Hide resolved
dev-tools/releng/src/job.rs Outdated Show resolved Hide resolved
dev-tools/releng/src/job.rs Show resolved Hide resolved
dev-tools/releng/src/job.rs Outdated Show resolved Hide resolved
docs/releng.adoc Outdated Show resolved Hide resolved
docs/releng.adoc Outdated
also manage installation and uninstallation of these zones; see
how-to-run.adoc.)
. Some of the packaged artifacts are installed directly in the OS
images; `cargo xtask releng` unpacks these into a temporary directory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a bit of thinking to understand this — I assume it refers to stuff like putting the console assets directly in the file system of the control plane zone in the host image. But the line as-is is pretty cryptic; it could stand to be two or three sentences.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote this bullet, is it better?

docs/releng.adoc Outdated Show resolved Hide resolved
Copy link
Contributor

@sunshowers sunshowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked through again, just a few more comments but pretty happy otherwise!

dev-tools/releng/src/cmd.rs Outdated Show resolved Hide resolved
dev-tools/releng/src/hubris.rs Show resolved Hide resolved
dev-tools/releng/src/job.rs Outdated Show resolved Hide resolved
@iliana
Copy link
Contributor Author

iliana commented May 14, 2024

I think I've addressed all outstanding review feedback; I still need to figure out a pattern for external xtasks (clap is currently making me unhappy), but I think that's the last TODO here.

Copy link
Contributor

@sunshowers sunshowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love it, thanks for doing this work!

impl std::fmt::Display for Command {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
let command = self.inner.as_std();
for (name, value) in command.get_envs() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How noisy is this? I think this would print out the entire environment, which would be a lot: on my machine, export | wc -c returns 4923.

Copy link
Contributor Author

@iliana iliana May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only prints the environment variables explicitly set in the command builder. https://doc.rust-lang.org/stable/std/process/struct.Command.html#method.get_envs

struct Args {
/// ZFS dataset to use for `helios-build` when building the host image
#[clap(long, default_value_t = Self::default_dataset("host"))]
host_dataset: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is there some validation we can do on the dataset names? Would that be worth it at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think verifying the datasets already exist (one of the preflight checks we run) is sufficient.

@iliana
Copy link
Contributor Author

iliana commented May 15, 2024

Despite the documentation all saying cargo xtask releng I think I am going to go ahead and merge this (after I test it on london) so we get the CI time improvements. For now folks can use cargo run --release --bin omicron-releng. I'm still toying with a few different options for external xtasks and I think that probably deserves its own separate review.

@iliana iliana merged commit 59636c9 into main May 15, 2024
20 checks passed
@iliana iliana deleted the iliana/releng branch May 15, 2024 04:55
iliana added a commit that referenced this pull request Jun 21, 2024
I forgot to remove these in #5744. These scripts were adapted into
`cargo xtask releng` and aren't referenced anywhere in this repo
anymore.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants