ServiceZoneRequest is too general #4466

Conversation
I think most (if not all) of these stem from what I assume was the original intention of having multiple "services" co-located in the same zone. Afaik the only user of this was having a "dns" and "dns client" service in the same zone, but that's since been removed. If we're just gonna bite the bullet and make things reflect the situation in practice, then we can definitely simplify things here.
Yeah, I hope none of that came off as criticism of how we got here. I think there are historical reasons for all of the things I mentioned. I think it was me that added the multiple services per, uh, service, which we later ripped out, and I think that's where we got some of the extra ids. We also used to separate datasets and services and I think that's why there's redundancy there. We also used to not put the id into every zone's datasets, and I think that's partly why DatasetName looks the way it does. It all made sense. Fortunately I think we've converged on a model that'll let us simplify this nicely.
// this wrong, we'll fail spuriously here and we'll have to figure
// out what happened. But the alternative is doing extra work to
// support a condition that we do not believe can ever happen in any
// system.
I'm reading this before reading the other changes to RSS, but I think the current RSS just leaves the existing rss-service-plan.json on disk forever, right? Should we delete or convert those files that are sitting around on deployed systems?
I think we don't need to do this for the reason mentioned in this comment. We could! But it should never happen that we get here and find it. If we do, we might want to stop and figure out what happened. I also think there's value for debugging in keeping it around.
// aren't, that's a problem, too.
if ledger_zone_config.omicron_version == request.version
    && ledger_zone_config.clone().to_omicron_zones_config().zones
        != request.zones
Just noting since I asked about it earlier - it looks like we do care about zones ordering here (and maybe shouldn't, even if in practice it's probably fine?).
I think in practice it will be fine as-is because a given sled agent will always wind up receiving a given generation's config in the same order. It's only in-memory in RSS (and relative to what was stored on disk there) that it gets reordered. I'm not sure how to feel about it -- I could see going various ways on it. (This function for example could be more tolerant of it showing up in a different order, but maybe that would be a red flag and we would want to stop in that case?)
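For what it's worth, an order-insensitive comparison could look roughly like the sketch below, assuming a zone config type keyed by an id field (the names here are illustrative stand-ins, not the real sled-agent types); whether we would even want that tolerance is exactly the open question above.

```rust
use uuid::Uuid;

// Illustrative stand-in for the real zone config type.
#[derive(Clone, PartialEq)]
struct ZoneConfig {
    id: Uuid,
    // ... other fields elided
}

// Compare two zone lists while ignoring ordering, by sorting copies by id.
fn same_zones_ignoring_order(a: &[ZoneConfig], b: &[ZoneConfig]) -> bool {
    let mut a = a.to_vec();
    let mut b = b.to_vec();
    a.sort_by_key(|z| z.id);
    b.sort_by_key(|z| z.id);
    a == b
}
```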
@@ -2600,19 +3124,10 @@ impl ServiceManager {
    if let Some((ip, _)) = underlay_info { vec![ip] } else { vec![] };
addresses.push(Ipv6Addr::LOCALHOST);
Oh, this might answer my question above about MGS vs dendrite and listening on multiple addresses. What would you think about changing SwitchZoneConfig's addresses to also just underlay_address and then touching up anywhere that expected it to contain an underlay address and localhost? (E.g., how the MGS block above adds localhost explicitly and then optionally adds the underlay address if it exists.)
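To sketch what that might mean (field and method names here are made up, not the actual SwitchZoneConfig): the single optional underlay address could be turned back into the localhost-plus-underlay list wherever callers still need it, much like the block above does.

```rust
use std::net::Ipv6Addr;

// Hypothetical shape: one optional underlay address instead of a Vec.
struct SwitchZoneConfigSketch {
    underlay_address: Option<Ipv6Addr>,
    // ... other fields elided
}

impl SwitchZoneConfigSketch {
    // Rebuild the old `addresses` list (underlay address, if any, plus
    // localhost) for code paths that still expect both.
    fn listen_addresses(&self) -> Vec<Ipv6Addr> {
        let mut addresses = match self.underlay_address {
            Some(ip) => vec![ip],
            None => vec![],
        };
        addresses.push(Ipv6Addr::LOCALHOST);
        addresses
    }
}
```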
I'd rather do this in a follow-up PR. Again, just trying to separate separable changes, especially given I don't think we have comprehensive automated test coverage of all these code paths.
Yep, totally fair. It will be a much easier change to make once the rest of this lands, for sure.
.expect_err("unexpectedly changed a single zone version");
assert!(matches!(
    error,
    Error::RequestedConfigConflicts(vr) if vr == v2
TIL this style of if works in matches!
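For reference, the pattern inside matches! can carry an if guard just like a regular match arm does; a contrived standalone example:

```rust
fn main() {
    let value = Some(42);
    // The pattern accepts an `if` guard, just like a regular match arm.
    assert!(matches!(value, Some(n) if n > 40));
    assert!(!matches!(value, Some(n) if n > 100));
}
```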
Thanks for this major refactor; this is definitely moving in the right direction.
@@ -229,252 +231,7 @@ pub struct Zpool {
pub disk_type: DiskType,
}

/// Describes service-specific parameters. |
Makes sense; I had previously thought the overlap of both had been painful.
@@ -235,41 +283,13 @@ impl Plan {
Ok(u2_zpools)
}

pub async fn create(
log: &Logger,
pub fn create_transient(
Not sure I understand, why call this "create_transient" now?
create() still exists and works the same as before. It did/does three things: (1) loads some information from each sled, (2) assembles an in-memory representation of a plan that describes what things are going to be where, and (3) writes that plan durably to disk. For the tests, I only wanted (2). So I factored (2) into create_transient(), which I think of as "create a service plan, but only the in-memory part -- skip the part where you write the plan to disk" -- hence transient. I'm not attached to that name though.
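Roughly, the factoring looks like the sketch below (all names and types here are placeholders, not the actual rack_setup code):

```rust
// Illustrative-only sketch of the factoring; all names here are placeholders.
struct SledInfo;          // stand-in for per-sled info gathered from each sled
struct Plan;              // stand-in for the in-memory service plan
#[derive(Debug)]
struct PlanError;

impl Plan {
    /// (2) only: assemble the in-memory plan. Tests can call this directly
    /// without touching the network or the disk.
    fn create_transient(sled_info: Vec<SledInfo>) -> Result<Plan, PlanError> {
        let _ = sled_info; // decide which zones go on which sleds
        Ok(Plan)
    }

    /// The original entry point: (1) gather sled info, (2) assemble the plan,
    /// (3) write it durably to disk.
    fn create() -> Result<Plan, PlanError> {
        let sled_info = Vec::new();                    // (1) gather (elided)
        let plan = Plan::create_transient(sled_info)?; // (2) assemble
        // (3) persist `plan` to the on-disk ledger (elided)
        Ok(plan)
    }
}
```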
sled-agent/src/services.rs (outdated)
/// functions that operate on either one or the other
enum ZoneArgs<'a> {
    Omicron(&'a OmicronZoneConfigLocal),
    SledLocal(&'a SwitchZoneConfigLocal),
Naming brainstorming: If we're discussing ownership, it might make sense to describe these as "Nexus-owned" vs "Sled-owned" services? I dunno if that helps, but it seems to be the distinction to me. (IMO the name "Omicron" for any zones doesn't actually provide extra context, since all these zones are kinda "Omicron zones").
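To make the brainstorm concrete, that suggestion would look something like the following (illustrative only, reusing the payload types from the enum above):

```rust
// Illustrative rename only; payloads are the same as in the enum above.
enum ZoneArgs<'a> {
    /// Zones whose configuration is owned by RSS/Nexus
    NexusOwned(&'a OmicronZoneConfigLocal),
    /// The switch zone, which sled agent itself owns
    SledOwned(&'a SwitchZoneConfigLocal),
}
```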
I think I've addressed all the open feedback. If not, let me know. I think the only open item is the question around the name "Omicron-managed zones". While I'm generally inclined to land this PR and iterate as needed, this particular thing appears all over the internal API (
Thanks @jgallagher and @smklein for the reviews!
I wrote:
I think the naming question is less significant than I was thinking. The
Beyond the standard CI testing, I've:
This portion of the Sled Agent was added in #4466, migrating an old "services" ledger into a new, zones-oriented format. This code served as a bridge between the old and new formats, but maintaining backwards compatibility here is a burden for modifying the zone format further. Specifically: in #5050, which attempts to add the notion of a "transient zone filesystem zpool" to the request from Nexus, the "old format -> new format" translation is no longer possible without additional information about where the transient zone filesystem should be placed. Given that this conversion has been deployed for multiple months - across releases - I figured it would be easier to remove the old support, rather than making another bridge for old code compatibility.
tl;dr: While looking at adding a GET /services inventory API, I dug into the ServiceZoneRequest type that Sled Agent uses to describe a zone configured by RSS/Nexus. I found a lot of cases where it can represent configurations that are either invalid or at least never happen in practice. This is kind of a big problem -- I'll explain why later. This PR reworks this type so that it represents only valid configurations.

Notes for reviewers
This PR is unfortunately quite large, though not as big as it looks because there are a lot of whitespace changes plus changes to schema files and sample inputs and outputs for the data format migration. It's also unfortunately delicate because a lot of this stuff wasn't previously being verified at compile time and a lot of this behavior is not covered by existing CI tests.
I've tried to make it as easy as possible to review by adding these notes and annotating the PR with notes about why various changes were made. If there's anything I can do to help review this (an interactive session? something else?) please let me know!
A lot of the stuff here talks about the difference between the switch zone and "Omicron zones" or "Omicron-managed zones". By these I mean zones whose configuration is entirely determined externally to sled agent by either RSS or Nexus.
The big changes here are:
- AllZoneRequests, ServiceZoneRequest, ServiceType, etc. have been broken up and reworked into OmicronZonesConfig, OmicronZoneConfig, OmicronZoneType, etc. (for Omicron zones; roughly sketched below) and similar types for switch zones. The original types still exist but are now all in services_migration.rs, which contains code for migrating from one set to the other. The new types are still in params.rs (for the Omicron zones) and services.rs (for the switch zone).
- services-ledger-check-migrate, which can take any old-format "services" ledger and convert it to the new-format "zones" ledger (or just check that a bunch of ledgers can be converted successfully).
- A GET /omicron-zones endpoint. These are both needed for the automated update project. They also help test the rest of this.
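For orientation, the new types have roughly the shape below. This is a paraphrase of the description above, not the exact definitions; field types are guesses, and params.rs has the real thing.

```rust
use std::net::Ipv6Addr;
use uuid::Uuid;

// Rough paraphrase of the new types; not the exact definitions.
struct OmicronZonesConfig {
    // Generation number for the sled's whole set of Omicron zones.
    generation: u64,
    zones: Vec<OmicronZoneConfig>,
}

struct OmicronZoneConfig {
    id: Uuid,
    underlay_address: Ipv6Addr,
    zone_type: OmicronZoneType,
}

// One variant per kind of Omicron-managed zone, each carrying only the
// parameters that kind of zone actually uses.
enum OmicronZoneType {
    CockroachDb { /* dataset, listen address, ... */ },
    InternalNtp { /* ... */ },
    BoundaryNtp { /* ... */ },
    Nexus { /* ... */ },
    // ... and so on for the other zone kinds
}
```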
By far, most of the changes (and the most delicate) are in sled-agent/src/services.rs and sled-agent/src/rack_setup. services.rs has the most because ServiceZoneRequest was passed through several layers inside ServiceManager; by splitting it up I needed to change quite a lot of stuff, but in mostly mechanical ways.

Breaking changes
PUT /services is now PUT /omicron-zones and the request body is pretty different. I believe this is not a problem because the only consumer was RSS, which I'm updating here, and I think it's safe to assume that RSS and Sled Agent will be at the same version at rack setup time.

Risks / testing
I think the big risks here are:
Motivation
You can probably skip the rest of this. It's a little pedantic but I wanted to record all the details somewhere.
Problems with ServiceZoneRequest
Ignoring the switch zone (for zones with zone_type != ZoneType::Switch):
- id appears at least twice in the struct and sometimes three times (.id, .dataset.id, .services.*.id). I believe these ids must all be the same for a given object but there's nothing to enforce that.
- There's an addresses array, but it always has exactly one address.
- There's a services array, but it always has exactly one service.
- The zone's "type" shows up in three places (.zone_type, the .dataset kind, and .services.*.details). Aside from the NTP zones: the zone type, dataset kind (if present), and service details all match up exactly. (That is, for a Cockroach zone, the zone type, dataset type, and service type are all variants called CockroachDb -- for three different enums.) NTP zones are almost the same as the others except that the service variant can be either InternalNtp or BoundaryNtp.
- dataset can be present for any type of zone, even though it can only really be present for a subset of zone types.

Altogether, this structure is really pretty simple and constrained, but the definition allows all these things to vary independently. For example, you could have three different ids in there, or you could have a Nexus zone with a CockroachDb dataset and an InternalNtp service definition!
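To make that concrete, the struct has roughly the shape below (paraphrased from the points above, not the literal definition; the nested type names are approximate), and nothing in it ties the ids, the three type-like enums, or the dataset's presence together.

```rust
use std::net::Ipv6Addr;
use uuid::Uuid;

// Paraphrased shape only; names of the nested types are approximate.
struct ServiceZoneRequest {
    id: Uuid,
    zone_type: ZoneType,               // "type" number one
    addresses: Vec<Ipv6Addr>,          // always exactly one in practice
    dataset: Option<DatasetRequest>,   // allowed for any zone type
    services: Vec<ServiceZoneService>, // always exactly one in practice
}

struct DatasetRequest {
    id: Uuid,          // should match the outer id, but nothing enforces it
    kind: DatasetKind, // "type" number two
}

struct ServiceZoneService {
    id: Uuid,             // should also match the outer id
    details: ServiceType, // "type" number three
}

enum ZoneType { CockroachDb, Nexus, Ntp /* ... */ }
enum DatasetKind { CockroachDb /* ... */ }
enum ServiceType { CockroachDb, Nexus, InternalNtp, BoundaryNtp /* ... */ }
```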
The switch zone is kind of special. Many code paths that involve ServiceZoneRequest assume they're not operating on ZoneType::Switch. For example, RSS/Nexus aren't supposed to be able to specify this zone. And similarly, some code paths that operate on the switch zone accept a ServiceZoneRequest but panic if they're given the switch zone. The switch zone version of this struct does have multiple addresses and multiple services.

Why this matters
First the basics: the extra generality makes it hard to tell which configurations are actually possible with ServiceZoneRequest, which adds risk to any changes.

But the real reason I'm here is: in writing RFD 433-style software inventory using this data, what should the inventory system assume about ServiceZoneRequest? It could either assume the much simpler, constrained world that we actually live in: the ids and kinds all match up, there's exactly one address, there's no switch zone, etc. Or it could support the existing, very general structure. In practice, it almost certainly can't do anything sensible with the configurations that shouldn't be possible. So we can think of this question as: at what point in the system do we attempt to translate the general ServiceZoneRequest into a simpler, more constrained version and bail out on failure? Options include:

1. Eliminate ServiceZoneRequest and only use the simpler one (that's what I opted for).
2. Do the translation later, e.g., in the inventory collector, the database schema, or inventory consumers like the update planner.

The later in the process we do this, the more we propagate "the basics" problems above to other parts of the system. We also have to do much more work. For example, if we only constrain this in inventory consumers like the update planner, then we have to do all the work in the inventory collector, database schema, and the consumers to support the much more general structure. (As a silly example: if there really can be a lot of addresses here, we'd want to use a separate database table for the addresses, which would be more annoying for all consumers and much less efficient in the common case of only one address.)
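To illustrate what that translation amounts to under any of the later options, it is essentially a fallible conversion whose checks mirror the constraints listed earlier. The sketch below uses hypothetical types, building on the paraphrased ServiceZoneRequest shape sketched above.

```rust
use std::net::Ipv6Addr;
use uuid::Uuid;

// Hypothetical constrained form containing only what consumers need.
struct ConstrainedZone {
    id: Uuid,
    underlay_address: Ipv6Addr,
    // zone kind, dataset, etc. would be validated the same way
}

// Every check here is a potential runtime failure that option (1) turns
// into something that simply cannot be represented.
impl TryFrom<ServiceZoneRequest> for ConstrainedZone {
    type Error = String;

    fn try_from(req: ServiceZoneRequest) -> Result<Self, Self::Error> {
        if req.addresses.len() != 1 {
            return Err(format!(
                "expected exactly one address, found {}",
                req.addresses.len()
            ));
        }
        let service = match req.services.as_slice() {
            [service] => service,
            _ => return Err("expected exactly one service".to_string()),
        };
        if service.id != req.id {
            return Err("service id does not match zone id".to_string());
        }
        Ok(ConstrainedZone { id: req.id, underlay_address: req.addresses[0] })
    }
}
```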
Further, every option other than option (1) allows the possibility that the translation could fail if in fact our assumptions about real-world configurations were wrong. With option (1), we'll discover that at compile time. Even if we're confident about our assumptions today, it'd be easy to accidentally add a code path that violates them (e.g., adds another address) without realizing that it will break consumers at runtime. The failure mode here is pretty bad: we'd fail to inventory that sled, and we wouldn't be able to provision or deprovision or reconfigure any Omicron zones on that sled.