Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nexus/sled-agent] move some types around, straighten up the dependency graph #6138

Conversation

sunshowers
Copy link
Contributor

@sunshowers sunshowers commented Jul 22, 2024

Reorganize some of our data structures and simplify the dependency graph. In particular, there's now just one copy of OmicronZonesConfig and related types.

This is mostly code reorganization, but there are a couple of interesting bits in here. Scroll down to the Questions section for more.

Motivation

1. The necessity of sled-agent-types

The primary driver behind this change was turning sled-agent's API into a trait -- breaking the sled-agent/nexus dependency cycle was one of the main motivations for that work. To recall, we currently have this logical cycle in the graph (reference):

Screenshot from 2024-07-21 21-03-02

With the work, we're attempting to move somewhere closer to this graph:

Screenshot from 2024-07-21 21-07-30

But to make that happen, some common types need to be moved into a sled-agent-types crate, so that the upcoming sled-agent-api crate can depend on them.

Why not put them in sled-agent-api itself? Because:

  • In sled-agent-client, we often end up normalizing to a shared type -- so we either want to implement From conversions between these types and sled-agent-client types, or to replace them outright. So one of them has to depend on the other.
  • If there's even one type that we want to replace, the dependency direction must be sled-agent-client -> sled-agent-api.
  • But the client depending on the API trait is exactly the kind of circular dependency we're trying to avoid.
  • Another reason for trying to avoid a circular dependency here is that it has the potential to leave types "floating in the ether", just circulating via the schema with no definition in Rust. (This actually is a risk today with the way dependencies are set up!)

So the only feasible alternative is to put these types in a shared crate. This crate is sled-agent-types, which we recently introduced in #6122.

2. sled-agent-types <-> nexus-types

For types shared between Nexus and sled-agent, where should we put them? The options are:

  • omicron-common: this is where we've been putting these types in the past, and is not a terrible place to put them. However, if those types pull in extra dependencies it becomes a bit more problematic, because omicron-common is depended on by several workspaces outside omicron.
    • A secondary option would be putting these types behind an optional feature. But that would cause a lot of unnecessary rebuilds of omicron-common and all of its dependents (essentially splitting the build unit universe into two), while the option we eventually chose does not.
  • sled-agent-types or nexus-types: we could put them in one or the other, but we'd have to introduce a dependency between them at some point, and they're both logically peers -- neither direction seems like the right one.
  • nexus-config: This depends on tokio-postgres (and was actually split out of omicron-common for that reason). It wouldn't be great for nexus-client or sled-agent-client to transitively pull in tokio-postgres.
  • a new crate: This is the option we've chosen. It does add the complexity of another crate, but resolves all of the previous points. We're calling this nexus-sled-agent-shared to signal that it sits on top of omicron-common and below nexus-types and sled-agent-types.

Something to consider is whether we want to have two clients in omicron, one for internal consumers with more dependencies and one for external consumers with fewer ones. I'm not saying we should do this; it's just something to keep at the back of one's head.

The changes

There are a number of closely-related changes here -- most of them had to be atomic with this one. A description of the changes and the rationale for them:

1. Zone config types

These are OmicronZonesConfig, OmicronZoneConfig, OmicronZoneType, and OmicronZoneDataset. In this PR, I've moved them from sled-agent to nexus-sled-agent-shared.

  • These types are an important part of the sled-agent API, and nexus-types previously used the progenitor-generated copy of them.
  • However that has had some issues for us, e.g. sled agent client: use SocketAddr/SocketAddrV6 for socket addresses instead of strings #4988.
  • Personally speaking, having two copies of such a complex type floating around has been suboptimal for me while I've been writing blueprint code. For example, previously, ctrl-clicking on OmicronZonesConfig in vscode always pointed me at the progenitor macro which wasn't helpful. Now, it points at the definition in nexus-sled-agent-shared.
  • There's an extra transitive dependency this introduces, which is why they're in nexus-sled-agent-shared.

There's one incident that I found along the way that I believe is a bug fix. We have rules for DNS server IP addresses:

  • External DNS servers can be IPv4 or IPv6
  • Internal DNS servers are always v6

But the Progenitor-generated copy of OmicronZonesConfig uses strings to store address/port pairs. As a result, in some cases we managed to store an IPv4 address for an internal DNS server. With this PR, this case is now detected. (I think it would have been ultimately rejected at some point, but it's possible for the bad DNS server to end up being persisted.)

2. ZoneKind

There were previously two dataless enums describing the kind of Omicron zone floating around.

  1. ZoneKind in sled-agent-client, used by Nexus: exactly OmicronZoneType without any associated data.
  2. ZoneType in sled-agent, used by sled-agent -- identifying the substring used within the service (notably with BoundaryNtp and InternalNtp combined into one).

While looking around, I also noticed that there were actually four different string representations in use for Omicron zones:

  1. the zone prefix
  2. the service prefix
  3. a prefix for the Name instance
  4. a string used in reports and error messages.

Here, 1-3 are pretty similar to each other (they combine InternalNtp and BoundaryNtp down to one, and are generated by sled-agent's ZoneType).

There is also a ZoneType in nexus-db-model that corresponds to ZoneKind.

Based on this, I decided to combine it all down into one ZoneKind enum, with 4 functions returning 4 string representations: one each for 1-4.

  • There's no Display impl -- users must choose the representation they want, which I think is correct.
  • I've tried to carefully verify that there are no behavior changes introduced here.
  • An alternative would be to have a dedicated representation for 1-3 along with one for 4, but I tried it out and it seemed too unwieldy in practice. (For example, a few places want to use the zone prefix, but then report errors using the report string. Having it be one enum with four representations in string-land makes this easier. I do hope we won't ever need a fifth, though.)

3. Physical disk config

OmicronPhysicalDiskConfig and OmicronPhysicalDisksConfig are shared types used by both Nexus and sled-agent. I moved them from sled-storage to omicron-common, next to DiskIdentity (which was already in omicron-common and which these types use).

  • Previously, nexus-types used the sled-agent-client-generated copy of these types -- but as discussed in the motivation section, that is no longer an option.
  • Why not nexus-sled-agent-shared? Well, sled-storage also needs these types, and I wanted to avoid sled-storage depending on the additional dependencies in nexus-sled-agent-shared.
  • I also used replace to patch them in so that there's just one copy of these types throughout sled-agent and Nexus.

4. Recovery silo types

These are two types, RecoverySiloConfig and Certificate, that are primarily related to Nexus and that sled-agent was using via nexus-client. Users of these types needed to be moved into sled-agent-types, which (as described in the motivation section) cannot depend on nexus-client. So I've moved them into shared crates.

  • Certificate is a simple type that can go into omicron-common's api::internal::nexus.
  • RecoverySiloConfig depends on omicron-passwords, so it goes in nexus-sled-agent-shared.

We also patch these types in nexus-client. One behavior change as a result is that password hash strength is now validated on both the client and the server. I think this is okay, but it's worth mentioning explicitly.

5. UserId

So this is a really interesting one. It needs to be moved into a shared location because RecoverySiloConfig depends on it. But during this process I noticed that we weren't doing any validation on it in clients like wicketd. So this counts as a behavior change: UserId is now validated on both the client and the server.

Note that UserId is exposed by the Nexus external API. But this change itself has no impact on the external API (this can be seen by the fact that nexus.json doesn't have any changes in it)

(I realized that the name should probably be something closer to LocalUserId, but that would be a breaking change to the external API so I've deferred it. Happy to change the name if it makes sense, though.)

6. Inventory types

These are Inventory, InventoryDisk and InventoryZpool. I've moved them to nexus-sled-agent-shared as a replacement for the previous scheme where we were using client-generated versions of them.

7. DiskVariant, SledRole, other misc types

These are simple types depended upon by others that have been moved into shared crates -- locations chosen mostly based on which code depends on them.

OpenAPI changes

There are some changes to our OpenAPI specifications.

  • Most of the changes are to description instances now that they're no longer going through Progenitor twice.
  • There are some validation changes, as documented above.

Questions

  • What should the name of the new crate be?
    • Resolved in 2024-07-23 meeting: nexus-sled-agent-shared.
  • Verify that the behavior changes documented above are all reasonable.

Other notes

It would be nice to have tooling which asserts that e.g. none of the -types crates ever depend on -client crates. Haven't written this tooling though.

Previously I did similar things using guppy, but that's only available as a library and can take a while to build (the total number of build units for xtask goes from 254 to 279, which is significant). Turning this into a binary would solve this but requires some work.

Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
@sunshowers sunshowers marked this pull request as ready for review July 22, 2024 22:25
Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must have been tedious @sunshowers. Thanks for the cleanup.

common-extended/README.md Outdated Show resolved Hide resolved
nexus/db-model/src/network_interface.rs Show resolved Hide resolved
nexus/reconfigurator/planning/src/system.rs Show resolved Hide resolved
nexus/types/src/inventory.rs Outdated Show resolved Hide resolved
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
@sunshowers sunshowers enabled auto-merge (squash) July 24, 2024 21:54
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
@sunshowers sunshowers merged commit ca0e1ea into main Jul 25, 2024
25 checks passed
@sunshowers sunshowers deleted the sunshowers/spr/nexussled-agent-move-some-types-around-straighten-up-the-dependency-graph branch July 25, 2024 02:22
sunshowers added a commit that referenced this pull request Aug 13, 2024
Building on the changes in #6138, this PR turns the sled-agent API into a trait. This is
high-impact because it currently is part of a circular dependency with the
Nexus internal API -- this PR gets us to the point where the OpenAPI documents
can be generated independently of each other.

There is some code movement but there are no functional changes in this PR -- I
didn't use `replace` here to keep the PR small. I'll try out `replace` for some
of the types with `From` impls in a subsequent PR.

(I did change the internal `VpcFirewallRule` to `ResolvedVpcFirewallRule` to
match the other types in the destination file.)

I've also added some documentation about when `nexus-sled-agent-shared` should
be used.

Depends on #6283.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants