ServiceZoneRequest is too general #4466

Merged: 50 commits from dap/sled-agent-services-type into main, Nov 30, 2023

Conversation

@davepacheco (Collaborator) commented Nov 8, 2023

tl;dr: While looking at adding a GET /services inventory API, I dug into the ServiceZoneRequest type that Sled Agent uses to describe a zone configured by RSS/Nexus. I found a lot of cases where it can represent configurations that are either invalid or at least never happen in practice. This is kind of a big problem -- I'll explain why later. This PR reworks this type so that it represents only valid configurations.

Notes for reviewers

This PR is unfortunately quite large, though not as big as it looks: a lot of it is whitespace changes, plus changes to schema files and to the sample inputs and outputs for the data format migration. It's also unfortunately delicate, because a lot of this stuff wasn't previously being verified at compile time and a lot of this behavior is not covered by existing CI tests.

I've tried to make it as easy as possible to review by adding these notes and annotating the PR with notes about why various changes were made. If there's anything I can do to help review this (an interactive session? something else?) please let me know!


A lot of the discussion here is about the difference between the switch zone and "Omicron zones" or "Omicron-managed zones". By the latter I mean zones whose configuration is determined entirely outside sled agent, by either RSS or Nexus.

The big changes here are:

  • Generally, sled agent referred to both Omicron zones and the switch zone as "services". I think we've discussed elsewhere that this is a little confusing. Elsewhere in Omicron we mostly use the RFD 248 definition of "service", where "the Nexus fleet" would be one service, of which one Nexus zone would be an instance. Then there are SMF services. So I have generally replaced the term "service" with "Omicron zone". Honestly, this was also just helpful for me going through the code to see what had been updated and what hadn't. (Nexus and the database still call these things "services" for now, but I think that will end up changing later. I also didn't rename the file or the ServiceManager. I'd be happy to do this if we want, but that felt like it would unnecessarily make this gigantic PR even bigger and harder to review.)
  • AllZoneRequests, ServiceZoneRequest, ServiceType, etc. have been broken up and reworked into OmicronZonesConfig, OmicronZoneConfig, OmicronZoneType, etc. (for Omicron zones) and similar types for switch zones. The original types still exist but are now all in services_migration.rs, which contains code for migrating from one set to the other. The new types are still in params.rs (for the Omicron zones) and services.rs (for the switch zone). (A rough sketch of the new types appears after this list.)
  • I added a CLI tool called services-ledger-check-migrate that can take any old-format "services" ledger and convert it to the new-format "zones" ledger (or just check that a bunch of ledgers can be converted successfully).
  • Since I was already changing this API, I went ahead and implemented a version number for the "PUT /omicron-zones" endpoint. I also added a GET /omicron-zones endpoint. These are both needed for the automated update project. They also help test the rest of this.
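
To make the new shape concrete, here is a rough sketch of the Omicron-zone types (field names, types, and variants are approximated from this description, not copied from the real definitions in params.rs):

    // Rough sketch only: approximated from the description above, not the
    // literal definitions in params.rs.
    use std::net::{Ipv6Addr, SocketAddrV6};
    use uuid::Uuid;

    // What "PUT /omicron-zones" carries: a version number plus the full set
    // of Omicron zones the sled should be running.
    pub struct OmicronZonesConfig {
        pub version: u64,
        pub zones: Vec<OmicronZoneConfig>,
    }

    // One Omicron-managed zone: a single id, a single underlay address, and
    // a kind-specific payload.
    pub struct OmicronZoneConfig {
        pub id: Uuid,
        pub underlay_address: Ipv6Addr,
        pub zone_type: OmicronZoneType,
    }

    // Per-kind data: a dataset (here reduced to a pool name) appears only on
    // variants that actually have one, so invalid combinations can't be
    // represented.
    pub enum OmicronZoneType {
        CockroachDb { address: SocketAddrV6, dataset_pool: String },
        Nexus { internal_address: SocketAddrV6 },
        InternalNtp { address: SocketAddrV6 },
        BoundaryNtp { address: SocketAddrV6, ntp_servers: Vec<String> },
        // ... other zone kinds elided
    }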

By far the most changes (and the most delicate ones) are in sled-agent/src/services.rs and sled-agent/src/rack_setup. services.rs has the most because ServiceZoneRequest was passed through several layers inside ServiceManager; splitting it up meant changing quite a lot of code, but in mostly mechanical ways.

Breaking changes

  • Sled Agent internal API: PUT /services is now PUT /omicron-zones and the request body is pretty different. I believe this is not a problem because the only consumer was RSS, which I'm updating here, and I think it's safe to assume that RSS and Sled Agent will be at the same version at rack setup time.
  • Sled Agent "services" ledger: there are more details in my PR comments, but this is basically the list of zones the sled agent is configured to run. There's migration code here that converts the old form to the new form. This can fail if some of my assumptions below are wrong. See testing notes below. (A rough sketch of the conversion's shape appears after this list.)
  • RSS service plan ledger: there are more details in the PR comments, but I believe there's no condition in which we would have read this file and used it so I think it's okay that the format and filename changed. If for some reason we do need to do this and we find the old service plan format, RSS will fail with a clear error message.
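
To illustrate how that migration can fail when an assumption is violated, here is a hypothetical sketch of the conversion's shape (the stand-in types and the function are mine, for illustration only; the real code lives in services_migration.rs):

    use std::net::Ipv6Addr;
    use uuid::Uuid;

    // Hypothetical stand-ins for the old and new ledger types.
    struct OldZoneRequest {
        id: Uuid,
        addresses: Vec<Ipv6Addr>,
        service_ids: Vec<Uuid>,
    }
    struct NewZoneConfig {
        id: Uuid,
        underlay_address: Ipv6Addr,
    }

    // Each assumption is checked explicitly; if one turns out to be wrong we
    // fail with a clear error instead of silently accepting the input.
    fn convert(old: &OldZoneRequest) -> Result<NewZoneConfig, String> {
        let [service_id] = old.service_ids.as_slice() else {
            return Err(format!(
                "zone {}: expected exactly one service, found {}",
                old.id,
                old.service_ids.len()
            ));
        };
        if *service_id != old.id {
            return Err(format!("zone {}: mismatched service id", old.id));
        }
        let [address] = old.addresses.as_slice() else {
            return Err(format!("zone {}: expected exactly one address", old.id));
        };
        Ok(NewZoneConfig { id: old.id, underlay_address: *address })
    }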

Risks / testing

I think the big risks here are:

  • I've done something wrong with the mechanical changes in services.rs. I expect such issues to show up when we try to set up any kind of zone, so successfully setting up every kind of zone should mitigate this risk. CI does this well enough for most zones, I think. I'm less sure about the switch zone, and I want to test deploying Omicron on real hardware.
  • I've made a bad assumption about what could appear in the services ledger file. To mitigate this, I've grabbed ledger files from rack2 (dogfood) and rack3 (colo) and added automated tests that convert them. The conversion code checks all the assumptions I could think of so we should notice if we fail to convert anything. I would like to test this with customer systems' ledgers as well but that may not be practical. I think it's unlikely that these would be meaningfully different because the ledgers are only created once during RSS and are likely to be just about the same across all deployments.

Motivation

You can probably skip the rest of this. It's a little pedantic but I wanted to record all the details somewhere.

Problems with ServiceZoneRequest

Ignoring the switch zone (for zones with zone_type != ZoneType::Switch):

  • The id appears at least twice in the struct and sometimes three times (.id, .dataset.id, .services.*.id). I believe these ids must all be the same for a given object but there's nothing to enforce that.
  • There's an addresses array, but it always has exactly one address.
  • There's a services array, but it always has exactly one service.
  • The "service_address" of a dataset appears multiple times.
  • For all intents and purposes, the "kind" appears at least twice and sometimes three times (.zone_type, .dataset kind, and .services.*.details). Aside from the NTP zones, the zone type, dataset kind (if present), and service details all match up exactly. (That is, for a Cockroach zone, the zone type, dataset type, and service type are all variants called CockroachDb -- for three different enums.) NTP zones are almost the same as the others except that the service variant can be either InternalNtp or BoundaryNtp.
  • Relatedly, the dataset can be present for any type of zone, even though it can only really be present for a subset of zone types.

Altogether, this structure is really pretty simple and constrained, but the definition allows all these things to vary independently. For example, you could have three different ids in there or you could have a Nexus zone with a CockroachDb dataset and an InternalNtp service definition!
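
To make the redundancy visible, here is an approximate sketch of the old shape (names and nesting are reconstructed from the list above, not the literal definitions; nothing in the type forces the ids or kinds to agree):

    use std::net::{Ipv6Addr, SocketAddrV6};
    use uuid::Uuid;

    // Approximation only; the real types differ in detail.
    pub enum ZoneKind { CockroachDb, Nexus, InternalNtp /* ... */ }

    pub struct DatasetRequest {
        pub id: Uuid,                      // id #2: should match the zone id
        pub kind: ZoneKind,                // "kind" #2
        pub service_address: SocketAddrV6, // repeats the service's address
    }

    pub struct ZoneService {
        pub id: Uuid,          // id #3
        pub details: ZoneKind, // "kind" #3 (really its own enum of per-service details)
    }

    pub struct ServiceZoneRequest {
        pub id: Uuid,                        // id #1
        pub zone_type: ZoneKind,             // "kind" #1
        pub addresses: Vec<Ipv6Addr>,        // always exactly one in practice
        pub dataset: Option<DatasetRequest>, // allowed for any kind of zone
        pub services: Vec<ZoneService>,      // always exactly one in practice
    }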

The switch zone is kind of special. Many code paths that involve ServiceZoneRequest assume they're not operating on ZoneType::Switch. For example, RSS/Nexus aren't supposed to be able to specify this zone. And similarly, some code paths accept a ServiceZoneRequest (a type that can represent the switch zone) but panic if they're actually given the switch zone. The switch zone version of this struct does have multiple addresses and multiple services.

Why this matters

First the basics:

  • It's hard for a reader to understand what configurations are legal. All of the conclusions I listed above were determined through code inspection. This took a while and also might be wrong. Even if it's right, it might become wrong as the code evolves.
  • Many code paths assert conditions that could be checked at compile-time. Others assume these conditions without verifying them.
  • As a result of both of these: it's not always clear what a hunk of code assumes about a ServiceZoneRequest, which adds risk to any changes.

But the real reason I'm here is: in writing RFD 433-style software inventory using this data, what should the inventory system assume about ServiceZoneRequest? It could either assume the much simpler, constrained world that we actually live in: the ids and kinds all match up, there's exactly one address, there's no switch zone, etc. Or it could support the existing, very general structure. In practice, it almost certainly can't do anything sensible with these configurations that shouldn't be possible. So we can think of this question as: at what point in the system do we attempt to translate the general ServiceZoneRequest into a simpler, more constrained version and bail out on failure? Options include:

  1. at the beginning: get rid of ServiceZoneRequest and only use the simpler one (that's what I opted for)
  2. at the sled agent "GET /services" boundary
  3. at the inventory collector
  4. at inventory consumers (e.g., the update planner)

The later in the process we do this, the more we propagate "the basics" problems above to other parts of the system. We also have to do much more work. For example, if we only constrain this in inventory consumers like the update planner, then we have to do all the work in the inventory collector, database schema, and the consumers to support the much more general structure. (As a silly example: if there really can be a lot of addresses here, we'd want to use a separate database table for the addresses, which would be more annoying for all consumers and much less efficient in the common case of only one address.)

Further, every option other than option (1) allows the possibility that the translation could fail if in fact our assumptions about real-world configurations were wrong. With option (1), we'll discover that at compile time. Even if we're confident about our assumptions today, it'd be easy to accidentally add a code path that violates them (e.g., adds another address) without realizing that will break consumers at runtime. The failure mode here is pretty bad: we'd fail to inventory that sled, and we wouldn't be able to provision or deprovision or reconfigure any Omicron zones on that sled.

@luqmana (Contributor) commented Nov 8, 2023

(quoting the list of problems with ServiceZoneRequest from the description above)

I think most (if not all) of these stem from what I assume was the original intention of having multiple "services" co-located in the same zone. AFAIK the only user of this was having a "dns" and "dns client" service in the same zone, but that's since been removed. If we're just gonna bite the bullet and make things reflect the situation in practice, then we can definitely simplify things here.

@davepacheco (Collaborator Author)

Yeah, I hope none of that came off as criticism of how we got here. I think there are historical reasons for all of the things I mentioned. I think it was me that added the multiple services per, uh, service, which we later ripped out, and I think that's where we got some of the extra ids. We also used to separate datasets and services and I think that's why there's redundancy there. We also used to not put the id into every zone's datasets, and I think that's partly why DatasetName looks the way it does. It all made sense. Fortunately I think we've converged on a model that'll let us simplify this nicely.

@smklein self-assigned this Nov 29, 2023
sled-agent/src/params.rs (outdated review thread, resolved)
sled-agent/src/services_migration.rs (outdated review thread, resolved)
sled-agent/src/services_migration.rs (outdated review thread, resolved)
// this wrong, we'll fail spuriously here and we'll have to figure
// out what happened. But the alternative is doing extra work to
// support a condition that we do not believe can ever happen in any
// system.
Contributor

I'm reading this before reading the other changes to RSS, but I think the current RSS just leaves the existing rss-service-plan.json on disk forever, right? Should we delete or convert those files that are sitting around on deployed systems?

Collaborator Author

I think we don't need to do this for the reason mentioned in this comment. We could! But it should never happen that we get here and find it. If we do, we might want to stop and figure out what happened. I also think there's value for debugging in keeping it around.

sled-agent/src/rack_setup/plan/service.rs (outdated review thread, resolved)
sled-agent/src/services.rs (review thread, resolved)
sled-agent/src/services.rs (outdated review thread, resolved)
// aren't, that's a problem, too.
if ledger_zone_config.omicron_version == request.version
&& ledger_zone_config.clone().to_omicron_zones_config().zones
!= request.zones
Contributor

Just noting since I asked about it earlier - it looks like we do care about zones ordering here (and maybe shouldn't, even if in practice it's probably fine?).

Collaborator Author

I think in practice it will be fine as-is because a given sled agent will always wind up receiving a given generation's config in the same order. It's only in-memory in RSS (and relative to what was stored on disk there) that it gets reordered. I'm not sure how to feel about it -- I could see going various ways on it. (This function for example could be more tolerant of it showing up in a different order, but maybe that would be a red flag and we would want to stop in that case?)
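
For reference, a sketch of what a more order-tolerant comparison could look like (assuming zones can be keyed by id; this is illustrative only, not what the PR does):

    use std::collections::BTreeMap;
    use uuid::Uuid;

    #[derive(Clone, PartialEq, Debug)]
    struct ZoneConfig {
        id: Uuid,
        // ... other fields elided
    }

    // Key both lists by zone id before comparing, so ordering differences
    // don't register as a conflict. The length check guards against duplicate
    // ids collapsing entries in the maps.
    fn same_zones(a: &[ZoneConfig], b: &[ZoneConfig]) -> bool {
        let by_id = |zs: &[ZoneConfig]| -> BTreeMap<Uuid, ZoneConfig> {
            zs.iter().map(|z| (z.id, z.clone())).collect()
        };
        a.len() == b.len() && by_id(a) == by_id(b)
    }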

@@ -2600,19 +3124,10 @@ impl ServiceManager {
if let Some((ip, _)) = underlay_info { vec![ip] } else { vec![] };
addresses.push(Ipv6Addr::LOCALHOST);
Contributor

Oh, this might answer my question above about MGS vs dendrite and listening on multiple addresses. What would you think about changing SwitchZoneConfig's addresses to also just underlay_address and then touching up anywhere that expected it to contain an underlay address and localhost? (E.g., how the MGS block above adds localhost explicitly and then optionally adds the underlay address if it exists)

Collaborator Author

I'd rather do this in a follow-up PR. Again, just trying to separate separable changes, especially given I don't think we have comprehensive automated test coverage of all these code paths.

Contributor

Yep, totally fair. It will be a much easier change to make once the rest of this lands, for sure.

.expect_err("unexpectedly changed a single zone version");
assert!(matches!(
error,
Error::RequestedConfigConflicts(vr) if vr == v2
Contributor

TIL this style of if works in matches!
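
For anyone else who hadn't seen it, a minimal standalone illustration of the guard syntax (unrelated to the test above):

    // matches! accepts the same pattern syntax as a match arm, including an
    // `if` guard: the whole expression is true only if the pattern matches
    // and the guard holds.
    fn main() {
        let expected: u32 = 2;
        let error = Some(2u32);
        assert!(matches!(error, Some(v) if v == expected));
        assert!(!matches!(error, Some(v) if v == expected + 1));
    }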

@smklein removed their assignment Nov 29, 2023
@smklein (Collaborator) left a comment

Thanks for this major refactor; this is definitely moving in the right direction.

@@ -229,252 +231,7 @@ pub struct Zpool {
pub disk_type: DiskType,
}

/// Describes service-specific parameters.
Collaborator

Makes sense; I had previously thought the overlap of both had been painful.

@@ -235,41 +283,13 @@ impl Plan {
Ok(u2_zpools)
}

pub async fn create(
log: &Logger,
pub fn create_transient(
Collaborator

Not sure I understand, why call this "create_transient" now?

Collaborator Author

create() still exists and works the same as before. It did/does three things: (1) loads some information from each sled, (2) assembles an in-memory representation of a plan that describes what things are going to be where, and (3) writes that plan durably to disk. For the tests, I only wanted (2). So I factored (2) into create_transient(), which I think of as "create a service plan, but only the in-memory part -- skip the part where you write the plan to disk" -- hence transient. I'm not attached to that name though.
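
A rough sketch of that factoring, with hypothetical signatures (the real functions take the actual plan inputs and error types from rack_setup):

    // Hypothetical sketch; argument lists and the persistence step are
    // stand-ins, not the real rack_setup code.
    struct Plan;
    struct PlanError;

    impl Plan {
        // Step (2) only: build the in-memory plan without persisting it.
        // Tests can call this directly.
        fn create_transient(/* sled info, config, ... */) -> Result<Plan, PlanError> {
            Ok(Plan)
        }

        // Steps (1)-(3): gather sled info, build the plan, write it to disk.
        async fn create(/* log, sled info, ... */) -> Result<Plan, PlanError> {
            // (1) load information from each sled (elided)
            // (2) assemble the in-memory representation
            let plan = Plan::create_transient()?;
            // (3) write the plan durably to disk (elided)
            Ok(plan)
        }
    }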

sled-agent/src/services.rs (outdated review thread, resolved)
/// functions that operate on either one or the other
enum ZoneArgs<'a> {
Omicron(&'a OmicronZoneConfigLocal),
SledLocal(&'a SwitchZoneConfigLocal),
Collaborator

Naming brainstorming: If we're discussing ownership, it might make sense to describe these as "Nexus-owned" vs "Sled-owned" services? I dunno if that helps, but it seems to be the distinction to me. (IMO the name "Omicron" for any zones doesn't actually provide extra context, since all these zones are kinda "Omicron zones").

@davepacheco (Collaborator Author)

I think I've addressed all the open feedback. If not, let me know. I think the only open item is the question around the name "Omicron-managed zones". While I'm generally inclined to land this PR and iterate as needed, this particular thing appears all over the internal API (OmicronZonesConfig and its related types, plus the URL omicron-zones) plus the ledger file omicron-zones.json so it's worth getting settled before landing this.

@davepacheco (Collaborator Author)

Thanks @jgallagher and @smklein for the reviews!

@davepacheco (Collaborator Author)

I wrote:

I think I've addressed all the open feedback. If not, let me know. I think the only open item is the question around the name "Omicron-managed zones". While I'm generally inclined to land this PR and iterate as needed, this particular thing appears all over the internal API (OmicronZonesConfig and its related types, plus the URL omicron-zones) plus the ledger file omicron-zones.json so it's worth getting settled before landing this.

I think the naming question is less significant than I was thinking. The omicron in PUT omicron-zones, and OmicronZonesConfig and its related types, and omicron-zones.json can all be read as "zones that are themselves part of Omicron", not necessarily "Omicron-managed". And it would be reasonable to keep those names even if we decided inside sled agent to call these "Nexus-managed" or "externally-managed" or something else.

@davepacheco (Collaborator Author)

Beyond the standard CI testing, I've:

  • successfully deployed Omicron to madrid (a 4-Gimlet, 1-Sidecar rig) with the TUF repo generated from commit 5e55c48 (tip of this branch)
  • provisioned an instance and ssh'd into it
  • cold-booted that rig
  • provisioned another instance and ssh'd into it
  • checked the Scrimlet sled-agent log for unexpected warnings (this was after the cold boot so I missed anything that RSS produced; it's also pretty manual and error-prone but I figured I'd at least look)
  • checked the sled-agent log from the helios-deploy job to see if there were any unexpected warnings from RSS (similar caveats)

@davepacheco enabled auto-merge (squash) November 30, 2023 19:46
@davepacheco disabled auto-merge November 30, 2023 19:46
@davepacheco enabled auto-merge (squash) November 30, 2023 19:46
@davepacheco merged commit e3887d5 into main Nov 30, 2023
21 of 22 checks passed
@davepacheco deleted the dap/sled-agent-services-type branch November 30, 2023 20:55
@davepacheco self-assigned this Dec 1, 2023
smklein added a commit that referenced this pull request Jun 21, 2024
This portion of the Sled Agent was added in #4466, migrating an old "services" ledger into a new, zones-oriented format.

This code served as a bridge between the old and new formats, but maintaining backwards compatibility here is a burden for modifying the zone format further.

Specifically: in #5050, which attempts to add the notion of a "transient zone filesystem zpool" to the request from Nexus, the translation of "old format -> new format" is no longer possible without additional information about where the transient zone filesystem should be placed.

Given that this conversion has been deployed for multiple months - across releases - I figured it would be easier to remove the old support, rather than making another bridge for old code compatibility.
iliana pushed a commit that referenced this pull request Jun 21, 2024