ServiceZoneRequest is too general #4466

Merged: 50 commits from dap/sled-agent-services-type into main, Nov 30, 2023

Conversation

@davepacheco (Collaborator) commented Nov 8, 2023

tl;dr: While looking at adding a GET /services inventory API, I dug into the ServiceZoneRequest type that Sled Agent uses to describe a zone configured by RSS/Nexus. I found a lot of cases where it can represent configurations that are either invalid or at least never happen in practice. This is kind of a big problem -- I'll explain why later. This PR reworks this type so that it represents only valid configurations.

Notes for reviewers

This PR is unfortunately quite large, though not as big as it looks: a lot of it is whitespace changes, plus changes to schema files and to the sample inputs and outputs for the data format migration. It's also unfortunately delicate, because a lot of this stuff wasn't previously being verified at compile time and a lot of this behavior is not covered by existing CI tests.

I've tried to make it as easy as possible to review by adding these notes and annotating the PR with notes about why various changes were made. If there's anything I can do to help review this (an interactive session? something else?) please let me know!


A lot of the discussion here is about the difference between the switch zone and "Omicron zones" or "Omicron-managed zones". By the latter I mean zones whose configuration is determined entirely outside sled agent, by either RSS or Nexus.

The big changes here are:

  • Generally, sled agent referred to both Omicron zones and the switch zone as "services". I think we've discussed elsewhere that this is a little confusing. Elsewhere in Omicron we mostly use the RFD 248 definition of "service", where "the Nexus fleet" would be one service, of which one Nexus zone would be an instance. Then there are SMF services. So I have generally replaced the term "service" with "Omicron zone". Honestly, this was also just helpful for me going through the code to see what had been updated and what hadn't. (Nexus and the database still call these things "services" for now, but I think that will end up changing later. I also didn't rename the file or the ServiceManager. I'd be happy to do this if we want, but that felt like it would unnecessarily make this gigantic PR even bigger and harder to review.)
  • AllZoneRequests, ServiceZoneRequest, ServiceType, etc. have been broken up and reworked into OmicronZonesConfig, OmicronZoneConfig, OmicronZoneType, etc. (for Omicron zones) and similar types for switch zones. The original types still exist but are now all in services_migration.rs, which contains code for migrating from one set to the other. The new types are still in params.rs (for the Omicron zones) and services.rs (for the switch zone). (A rough sketch of the new types appears after this list.)
  • I added a CLI tool called services-ledger-check-migrate that can take any old-format "services" ledger and convert it to the new-format "zones" ledger (or just check that a bunch of ledgers can be converted successfully).
  • Since I was already changing this API, I went ahead and implemented a version number for the "PUT /omicron-zones" endpoint. I also added a GET /omicron-zones endpoint. These are both needed for the automated update project. They also help test the rest of this.
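
To make the new shape concrete, here is a rough sketch of the Omicron-zone types (field names, types, and variants are approximated from this description, not copied from the real definitions in params.rs):

    // Rough sketch only: approximated from the description above, not the
    // literal definitions in params.rs.
    use std::net::{Ipv6Addr, SocketAddrV6};
    use uuid::Uuid;

    // What "PUT /omicron-zones" carries: a version number plus the full set
    // of Omicron zones the sled should be running.
    pub struct OmicronZonesConfig {
        pub version: u64,
        pub zones: Vec<OmicronZoneConfig>,
    }

    // One Omicron-managed zone: a single id, a single underlay address, and
    // a kind-specific payload.
    pub struct OmicronZoneConfig {
        pub id: Uuid,
        pub underlay_address: Ipv6Addr,
        pub zone_type: OmicronZoneType,
    }

    // Per-kind data: a dataset (here reduced to a pool name) appears only on
    // variants that actually have one, so invalid combinations can't be
    // represented.
    pub enum OmicronZoneType {
        CockroachDb { address: SocketAddrV6, dataset_pool: String },
        Nexus { internal_address: SocketAddrV6 },
        InternalNtp { address: SocketAddrV6 },
        BoundaryNtp { address: SocketAddrV6, ntp_servers: Vec<String> },
        // ... other zone kinds elided
    }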

By far the most changes (and the most delicate ones) are in sled-agent/src/services.rs and sled-agent/src/rack_setup. services.rs has the most because ServiceZoneRequest was passed through several layers inside ServiceManager; splitting it up meant changing quite a lot of code, but in mostly mechanical ways.

Breaking changes

  • Sled Agent internal API: PUT /services is now PUT /omicron-zones and the request body is pretty different. I believe this is not a problem because the only consumer was RSS, which I'm updating here, and I think it's safe to assume that RSS and Sled Agent will be at the same version at rack setup time.
  • Sled Agent "services" ledger: there are more details in my PR comments, but this is basically the list of zones the sled agent is configured to run. There's migration code here that converts the old form to the new form. This can fail if some of my assumptions below are wrong. See testing notes below. (A rough sketch of the conversion's shape appears after this list.)
  • RSS service plan ledger: there are more details in the PR comments, but I believe there's no condition in which we would have read this file and used it so I think it's okay that the format and filename changed. If for some reason we do need to do this and we find the old service plan format, RSS will fail with a clear error message.
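
To illustrate how that migration can fail when an assumption is violated, here is a hypothetical sketch of the conversion's shape (the stand-in types and the function are mine, for illustration only; the real code lives in services_migration.rs):

    use std::net::Ipv6Addr;
    use uuid::Uuid;

    // Hypothetical stand-ins for the old and new ledger types.
    struct OldZoneRequest {
        id: Uuid,
        addresses: Vec<Ipv6Addr>,
        service_ids: Vec<Uuid>,
    }
    struct NewZoneConfig {
        id: Uuid,
        underlay_address: Ipv6Addr,
    }

    // Each assumption is checked explicitly; if one turns out to be wrong we
    // fail with a clear error instead of silently accepting the input.
    fn convert(old: &OldZoneRequest) -> Result<NewZoneConfig, String> {
        let [service_id] = old.service_ids.as_slice() else {
            return Err(format!(
                "zone {}: expected exactly one service, found {}",
                old.id,
                old.service_ids.len()
            ));
        };
        if *service_id != old.id {
            return Err(format!("zone {}: mismatched service id", old.id));
        }
        let [address] = old.addresses.as_slice() else {
            return Err(format!("zone {}: expected exactly one address", old.id));
        };
        Ok(NewZoneConfig { id: old.id, underlay_address: *address })
    }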

Risks / testing

I think the big risks here are:

  • I've done something wrong with the mechanical changes in services.rs. I expect such issues to show up when we try to set up any kind of zone, so successfully setting up every kind of zone should mitigate this risk. CI does this well enough for most zones, I think. I'm less sure about the switch zone, and I want to test deploying Omicron on real hardware.
  • I've made a bad assumption about what could appear in the services ledger file. To mitigate this, I've grabbed ledger files from rack2 (dogfood) and rack3 (colo) and added automated tests that convert them. The conversion code checks all the assumptions I could think of so we should notice if we fail to convert anything. I would like to test this with customer systems' ledgers as well but that may not be practical. I think it's unlikely that these would be meaningfully different because the ledgers are only created once during RSS and are likely to be just about the same across all deployments.

Motivation

You can probably skip the rest of this. It's a little pedantic but I wanted to record all the details somewhere.

Problems with ServiceZoneRequest

Ignoring the switch zone (for zones with zone_type != ZoneType::Switch):

  • The id appears at least twice in the struct and sometimes three times (.id, .dataset.id, .services.*.id). I believe these ids must all be the same for a given object but there's nothing to enforce that.
  • There's an addresses array, but it always has exactly one address.
  • There's a services array, but it always has exactly one service.
  • The "service_address" of a dataset appears multiple times.
  • For all intents and purposes, the "kind" appears at least twice and sometimes three times (.zone_type, .dataset kind, and .services.*.details). Aside from the NTP zones, the zone type, dataset kind (if present), and service details all match up exactly. (That is, for a Cockroach zone, the zone type, dataset type, and service type are all variants called CockroachDb -- for three different enums.) NTP zones are almost the same as the others except that the service variant can be either InternalNtp or BoundaryNtp.
  • Relatedly, the dataset can be present for any type of zone, even though it can only really be present for a subset of zone types.

Altogether, this structure is really pretty simple and constrained, but the definition allows all these things to vary independently. For example, you could have three different ids in there or you could have a Nexus zone with a CockroachDb dataset and an InternalNtp service definition!
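
To make the redundancy visible, here is an approximate sketch of the old shape (names and nesting are reconstructed from the list above, not the literal definitions; nothing in the type forces the ids or kinds to agree):

    use std::net::{Ipv6Addr, SocketAddrV6};
    use uuid::Uuid;

    // Approximation only; the real types differ in detail.
    pub enum ZoneKind { CockroachDb, Nexus, InternalNtp /* ... */ }

    pub struct DatasetRequest {
        pub id: Uuid,                      // id #2: should match the zone id
        pub kind: ZoneKind,                // "kind" #2
        pub service_address: SocketAddrV6, // repeats the service's address
    }

    pub struct ZoneService {
        pub id: Uuid,          // id #3
        pub details: ZoneKind, // "kind" #3 (really its own enum of per-service details)
    }

    pub struct ServiceZoneRequest {
        pub id: Uuid,                        // id #1
        pub zone_type: ZoneKind,             // "kind" #1
        pub addresses: Vec<Ipv6Addr>,        // always exactly one in practice
        pub dataset: Option<DatasetRequest>, // allowed for any kind of zone
        pub services: Vec<ZoneService>,      // always exactly one in practice
    }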

The switch zone is kind of special. Many code paths that involve ServiceZoneRequest assume they're not operating on ZoneType::Switch. For example, RSS/Nexus aren't supposed to be able to specify this zone. And similarly, some code paths accept a ServiceZoneRequest (a type that can represent the switch zone) but panic if they're actually given the switch zone. The switch zone version of this struct does have multiple addresses and multiple services.

Why this matters

First the basics:

  • It's hard for a reader to understand what configurations are legal. All of the conclusions I listed above were determined through code inspection. This took a while and also might be wrong. Even if it's right, it might become wrong as the code evolves.
  • Many code paths assert conditions that could be checked at compile-time. Others assume these conditions without verifying them.
  • As a result of both of these: it's not always clear what a hunk of code assumes about a ServiceZoneRequest, which adds risk to any changes.

But the real reason I'm here is: in writing RFD 433-style software inventory using this data, what should the inventory system assume about ServiceZoneRequest? It could either assume the much simpler, constrained world that we actually live in: the ids and kinds all match up, there's exactly one address, there's no switch zone, etc. Or it could support the existing, very general structure. In practice, it almost certainly can't do anything sensible with these configurations that shouldn't be possible. So we can think of this question as: at what point in the system do we attempt to translate the general ServiceZoneRequest into a simpler, more constrained version and bail out on failure? Options include:

  1. at the beginning: get rid of ServiceZoneRequest and only use the simpler one (that's what I opted for)
  2. at the sled agent "GET /services" boundary
  3. at the inventory collector
  4. at inventory consumers (e.g., the update planner)

The later in the process we do this, the more we propagate "the basics" problems above to other parts of the system. We also have to do much more work. For example, if we only constrain this in inventory consumers like the update planner, then we have to do all the work in the inventory collector, database schema, and the consumers to support the much more general structure. (As a silly example: if there really can be a lot of addresses here, we'd want to use a separate database table for the addresses, which would be more annoying for all consumers and much less efficient in the common case of only one address.)

Further, every option other than option (1) allows the possibility that the translation could fail if in fact our assumptions about real-world configurations were wrong. With option (1), we'll discover that at compile time. Even if we're confident about our assumptions today, it'd be easy to accidentally add a code path that violates them (e.g., adds another address) without realizing that will break consumers at runtime. The failure mode here is pretty bad: we'd fail to inventory that sled, and we wouldn't be able to provision or deprovision or reconfigure any Omicron zones on that sled.

@luqmana (Contributor) commented Nov 8, 2023

(quoting the list of problems with ServiceZoneRequest from the description above)

I think most (if not all) of these stem from what I assume was the original intention of having multiple "services" co-located in the same zone. AFAIK the only user of this was having a "dns" and "dns client" service in the same zone, but that's since been removed. If we're just gonna bite the bullet and make things reflect the situation in practice, then we can definitely simplify things here.

@davepacheco (Collaborator Author)

Yeah, I hope none of that came off as criticism of how we got here. I think there are historical reasons for all of the things I mentioned. I think it was me that added the multiple services per, uh, service, which we later ripped out, and I think that's where we got some of the extra ids. We also used to separate datasets and services and I think that's why there's redundancy there. We also used to not put the id into every zone's datasets, and I think that's partly why DatasetName looks the way it does. It all made sense. Fortunately I think we've converged on a model that'll let us simplify this nicely.

@smklein self-assigned this Nov 29, 2023
sled-agent/src/params.rs (outdated review thread, resolved)
sled-agent/src/services_migration.rs (outdated review thread, resolved)
sled-agent/src/services_migration.rs (outdated review thread, resolved)
// this wrong, we'll fail spuriously here and we'll have to figure
// out what happened. But the alternative is doing extra work to
// support a condition that we do not believe can ever happen in any
// system.
Contributor

I'm reading this before reading the other changes to RSS, but I think the current RSS just leaves the existing rss-service-plan.json on disk forever, right? Should we delete or convert those files that are sitting around on deployed systems?

Collaborator Author

I think we don't need to do this for the reason mentioned in this comment. We could! But it should never happen that we get here and find it. If we do, we might want to stop and figure out what happened. I also think there's value for debugging in keeping it around.

sled-agent/src/rack_setup/plan/service.rs (outdated review thread, resolved)
sled-agent/src/services.rs (review thread, resolved)
sled-agent/src/services.rs (outdated review thread, resolved)
// aren't, that's a problem, too.
if ledger_zone_config.omicron_version == request.version
&& ledger_zone_config.clone().to_omicron_zones_config().zones
!= request.zones
Contributor

Just noting since I asked about it earlier - it looks like we do care about zones ordering here (and maybe shouldn't, even if in practice it's probably fine?).

Collaborator Author

I think in practice it will be fine as-is because a given sled agent will always wind up receiving a given generation's config in the same order. It's only in-memory in RSS (and relative to what was stored on disk there) that it gets reordered. I'm not sure how to feel about it -- I could see going various ways on it. (This function for example could be more tolerant of it showing up in a different order, but maybe that would be a red flag and we would want to stop in that case?)
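
For reference, a sketch of what a more order-tolerant comparison could look like (assuming zones can be keyed by id; this is illustrative only, not what the PR does):

    use std::collections::BTreeMap;
    use uuid::Uuid;

    #[derive(Clone, PartialEq, Debug)]
    struct ZoneConfig {
        id: Uuid,
        // ... other fields elided
    }

    // Key both lists by zone id before comparing, so ordering differences
    // don't register as a conflict. The length check guards against duplicate
    // ids collapsing entries in the maps.
    fn same_zones(a: &[ZoneConfig], b: &[ZoneConfig]) -> bool {
        let by_id = |zs: &[ZoneConfig]| -> BTreeMap<Uuid, ZoneConfig> {
            zs.iter().map(|z| (z.id, z.clone())).collect()
        };
        a.len() == b.len() && by_id(a) == by_id(b)
    }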

@@ -2600,19 +3124,10 @@ impl ServiceManager {
if let Some((ip, _)) = underlay_info { vec![ip] } else { vec![] };
addresses.push(Ipv6Addr::LOCALHOST);
Contributor

Oh, this might answer my question above about MGS vs dendrite and listening on multiple addresses. What would you think about changing SwitchZoneConfig's addresses to also just underlay_address and then touching up anywhere that expected it to contain an underlay address and localhost? (E.g., how the MGS block above adds localhost explicitly and then optionally adds the underlay address if it exists)

Collaborator Author

I'd rather do this in a follow-up PR. Again, just trying to separate separable changes, especially given I don't think we have comprehensive automated test coverage of all these code paths.

Contributor

Yep, totally fair. It will be a much easier change to make once the rest of this lands, for sure.

.expect_err("unexpectedly changed a single zone version");
assert!(matches!(
error,
Error::RequestedConfigConflicts(vr) if vr == v2
Contributor

TIL this style of if works in matches!
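
For anyone else who hadn't seen it, a minimal standalone illustration of the guard syntax (unrelated to the test above):

    // matches! accepts the same pattern syntax as a match arm, including an
    // `if` guard: the whole expression is true only if the pattern matches
    // and the guard holds.
    fn main() {
        let expected: u32 = 2;
        let error = Some(2u32);
        assert!(matches!(error, Some(v) if v == expected));
        assert!(!matches!(error, Some(v) if v == expected + 1));
    }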

@smklein removed their assignment Nov 29, 2023
@smklein (Collaborator) left a comment

Thanks for this major refactor; this is definitely moving in the right direction.

@@ -229,252 +231,7 @@ pub struct Zpool {
pub disk_type: DiskType,
}

/// Describes service-specific parameters.
Collaborator

Makes sense; I had previously thought the overlap of both had been painful.

@@ -235,41 +283,13 @@ impl Plan {
Ok(u2_zpools)
}

pub async fn create(
log: &Logger,
pub fn create_transient(
Collaborator

Not sure I understand, why call this "create_transient" now?

Collaborator Author

create() still exists and works the same as before. It did/does three things: (1) loads some information from each sled, (2) assembles an in-memory representation of a plan that describes what things are going to be where, and (3) writes that plan durably to disk. For the tests, I only wanted (2). So I factored (2) into create_transient(), which I think of as "create a service plan, but only the in-memory part -- skip the part where you write the plan to disk" -- hence transient. I'm not attached to that name though.
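
A rough sketch of that factoring, with hypothetical signatures (the real functions take the actual plan inputs and error types from rack_setup):

    // Hypothetical sketch; argument lists and the persistence step are
    // stand-ins, not the real rack_setup code.
    struct Plan;
    struct PlanError;

    impl Plan {
        // Step (2) only: build the in-memory plan without persisting it.
        // Tests can call this directly.
        fn create_transient(/* sled info, config, ... */) -> Result<Plan, PlanError> {
            Ok(Plan)
        }

        // Steps (1)-(3): gather sled info, build the plan, write it to disk.
        async fn create(/* log, sled info, ... */) -> Result<Plan, PlanError> {
            // (1) load information from each sled (elided)
            // (2) assemble the in-memory representation
            let plan = Plan::create_transient()?;
            // (3) write the plan durably to disk (elided)
            Ok(plan)
        }
    }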

sled-agent/src/services.rs (outdated review thread, resolved)
/// functions that operate on either one or the other
enum ZoneArgs<'a> {
Omicron(&'a OmicronZoneConfigLocal),
SledLocal(&'a SwitchZoneConfigLocal),
Collaborator

Naming brainstorming: If we're discussing ownership, it might make sense to describe these as "Nexus-owned" vs "Sled-owned" services? I dunno if that helps, but it seems to be the distinction to me. (IMO the name "Omicron" for any zones doesn't actually provide extra context, since all these zones are kinda "Omicron zones").

@davepacheco (Collaborator Author)

I think I've addressed all the open feedback. If not, let me know. I think the only open item is the question around the name "Omicron-managed zones". While I'm generally inclined to land this PR and iterate as needed, this particular thing appears all over the internal API (OmicronZonesConfig and its related types, plus the URL omicron-zones) plus the ledger file omicron-zones.json so it's worth getting settled before landing this.

@davepacheco (Collaborator Author)

Thanks @jgallagher and @smklein for the reviews!

@davepacheco (Collaborator Author)

I wrote:

I think I've addressed all the open feedback. If not, let me know. I think the only open item is the question around the name "Omicron-managed zones". While I'm generally inclined to land this PR and iterate as needed, this particular thing appears all over the internal API (OmicronZonesConfig and its related types, plus the URL omicron-zones) plus the ledger file omicron-zones.json so it's worth getting settled before landing this.

I think the naming question is less significant than I was thinking. The omicron in PUT omicron-zones, and OmicronZonesConfig and its related types, and omicron-zones.json can all be read as "zones that are themselves part of Omicron", not necessarily "Omicron-managed". And it would be reasonable to keep those names even if we decided inside sled agent to call these "Nexus-managed" or "externally-managed" or something else.

@davepacheco (Collaborator Author)

Beyond the standard CI testing, I've:

  • successfully deployed Omicron to madrid (a 4-Gimlet, 1-Sidecar rig) with the TUF repo generated from commit 5e55c48 (tip of this branch)
  • provisioned an instance and ssh'd into it
  • cold-booted that rig
  • provisioned another instance and ssh'd into it
  • checked the Scrimlet sled-agent log for unexpected warnings (this was after the cold boot so I missed anything that RSS produced; it's also pretty manual and error-prone but I figured I'd at least look)
  • checked the sled-agent log from the helios-deploy job to see if there were any unexpected warnings from RSS (similar caveats)

@davepacheco enabled auto-merge (squash) November 30, 2023 19:46
@davepacheco disabled auto-merge November 30, 2023 19:46
@davepacheco enabled auto-merge (squash) November 30, 2023 19:46
@davepacheco merged commit e3887d5 into main Nov 30, 2023
21 of 22 checks passed
@davepacheco deleted the dap/sled-agent-services-type branch November 30, 2023 20:55
@davepacheco self-assigned this Dec 1, 2023
smklein added a commit that referenced this pull request Jun 21, 2024
This portion of the Sled Agent was added in #4466, migrating an old "services" ledger into a new, zones-oriented format.

This code served as a bridge between the old and new formats, but maintaining backwards compatibility here is a burden for modifying the zone format further.

Specifically: in #5050, which attempts to add the notion of a "transient zone filesystem zpool" to the request from Nexus, the translation of "old format -> new format" is no longer possible without additional information about where the transient zone filesystem should be placed.

Given that this conversion has been deployed for multiple months - across releases - I figured it would be easier to remove the old support, rather than making another bridge for old code compatibility.
iliana pushed a commit that referenced this pull request Jun 21, 2024