Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sled agent client: use SocketAddr/SocketAddrV6 for socket addresses instead of strings #4988

Closed
davepacheco opened this issue Feb 5, 2024 · 3 comments
Assignees

Comments

@davepacheco
Copy link
Collaborator

In the Sled Agent API, on the server side, most variants of OmicronZoneType have an address: SocketAddrV6 field or something like it. These wind up getting turned into Strings in the OpenAPI spec. So the Progenitor-generated client-side structure OmicronZoneType has these fields as Strings, too. This is starting to get pretty unwieldy because various code paths need to parse these values and deal with errors that shouldn't be possible.

Some notes on this from @ahl:

  • The behavior of turning SocketAddr into string in the OpenAPI spec comes from schemars.
  • There isn't a json schema format specifier like there is, say, for IpAddr: { "type": "string", "format": "ip" }
  • schemars also doesn't generate a "$ref" for SocketAddr.
  • If we did generate a $ref, we could use replace to swap in SocketAddr for the local type. This doesn't work right now because replace operates on named types, and there's no named type for this. It's just a String.

It sounds like the options here are:

  1. On the server, everywhere we use these, apply #[schemars(with = "some_function_that_emits_a_schema")] so that the JsonSchema for these has its own named type. On the client, replace that type with std::net::SocketAddr. I imagine we'd want to do this for SocketAddr, SocketAddrV4, and SocketAddrV6.
  2. Almost the same as the above, but define our own newtype for these on the server (instead of using the annotation) to add a name to the JsonSchema.
  3. Alternatively, fork schemars to change the schema it generates for the SocketAddr family of types.
@jgallagher
Copy link
Contributor

3. Alternatively, fork schemars to change the schema it generates for the SocketAddr family of types.

It looks like there's not an open issue on schemars about this; have we asked out of band?

@sunshowers
Copy link
Contributor

sunshowers commented Jul 21, 2024

Working on this, by using replace for all of OmicronZonesConfig -- required to break circular dependency between types and client crates as part of work to split the sled-agent API into a trait (a long chain of reasoning leads to this as the almost-necessary thing to to do. I'll try doing a writeup about this)

@sunshowers sunshowers self-assigned this Jul 21, 2024
sunshowers added a commit that referenced this issue Jul 25, 2024
…cy graph (#6138)

Reorganize some of our data structures and simplify the dependency
graph. In particular, there's now just one copy of `OmicronZonesConfig`
and related types.

This is _mostly_ code reorganization, but there are a couple of
interesting bits in here. Scroll down to the Questions section for more.

## Motivation

### 1. The necessity of sled-agent-types

The primary driver behind this change was turning sled-agent's API into
a trait -- breaking the sled-agent/nexus dependency cycle was one of the
main motivations for that work. To recall, we currently have this
logical cycle in the graph
([reference](https://rfd.shared.oxide.computer/rfd/0479#circular_deps)):

![Screenshot from 2024-07-21
21-03-02](https://github.com/user-attachments/assets/9227d036-ab37-4b1e-b4ae-e390a84e829b)

With the work, we're attempting to move somewhere closer to this graph:

![Screenshot from 2024-07-21
21-07-30](https://github.com/user-attachments/assets/4d93e98e-cfdc-46dc-9244-7fca8b72d5d1)

But to make that happen, some common types need to be moved into a
`sled-agent-types` crate, so that the upcoming `sled-agent-api` crate
can depend on them.

Why not put them in `sled-agent-api` itself? Because:

* In `sled-agent-client`, we often end up normalizing to a shared type
-- so we either want to implement `From` conversions between these types
and `sled-agent-client` types, or to `replace` them outright. So one of
them has to depend on the other.
* If there's even one type that we want to `replace`, the dependency
direction must be `sled-agent-client` -> `sled-agent-api`.
* But the client depending on the API trait is exactly the kind of
circular dependency we're trying to avoid.
* Another reason for trying to avoid a circular dependency here is that
it has the potential to leave types "floating in the ether", just
circulating via the schema with no definition in Rust. (This actually is
a risk today with the way dependencies are set up!)

So the only feasible alternative is to put these types in a shared
crate. This crate is `sled-agent-types`, which we recently introduced in
#6122.

### 2. sled-agent-types <-> nexus-types

For types shared between Nexus and sled-agent, where should we put them?
The options are:

* **omicron-common**: this is where we've been putting these types in
the past, and is not a terrible place to put them. However, if those
types pull in extra dependencies it becomes a bit more problematic,
because omicron-common is depended on by several workspaces outside
omicron.
* A secondary option would be putting these types behind an optional
feature. But that would cause a lot of unnecessary rebuilds of
omicron-common and all of its dependents (essentially splitting the
build unit universe into two), while the option we eventually chose does
not.
* **sled-agent-types** or **nexus-types**: we could put them in one or
the other, but we'd have to introduce a dependency between them at some
point, and they're both logically peers -- neither direction seems like
the right one.
* **nexus-config**: This depends on `tokio-postgres` (and was actually
split out of omicron-common for that reason). It wouldn't be great for
nexus-client or sled-agent-client to transitively pull in
tokio-postgres.
* **a new crate**: This is the option we've chosen. It does add the
complexity of another crate, but resolves all of the previous points.
We're calling this **nexus-sled-agent-shared** to signal that it sits on
top of omicron-common and below nexus-types and sled-agent-types.

Something to consider is whether we want to have two clients in omicron,
one for internal consumers with more dependencies and one for external
consumers with fewer ones. I'm not saying we should do this; it's just
something to keep at the back of one's head.

## The changes

There are a number of closely-related changes here -- most of them had
to be atomic with this one. A description of the changes and the
rationale for them:

### 1. Zone config types

These are `OmicronZonesConfig`, `OmicronZoneConfig`, `OmicronZoneType`,
and `OmicronZoneDataset`. In this PR, I've moved them from `sled-agent`
to `nexus-sled-agent-shared`.

* These types are an important part of the sled-agent API, and
`nexus-types` previously used the progenitor-generated copy of them.
* However that has had some issues for us, e.g.
#4988.
* Personally speaking, having two copies of such a complex type floating
around has been suboptimal for me while I've been writing blueprint
code. For example, previously, ctrl-clicking on `OmicronZonesConfig` in
vscode always pointed me at the progenitor macro which wasn't helpful.
Now, it points at the definition in `nexus-sled-agent-shared`.
* There's an extra transitive dependency this introduces, which is why
they're in `nexus-sled-agent-shared`.

There's one incident that I found along the way that I believe is a
**bug fix**. We have rules for DNS server IP addresses:

* External DNS servers can be IPv4 or IPv6
* Internal DNS servers are always v6

But the Progenitor-generated copy of `OmicronZonesConfig` uses strings
to store address/port pairs. As a result, in some cases we managed to
store an IPv4 address for an internal DNS server. With this PR, this
case is now detected. (I think it would have been ultimately rejected at
some point, but it's possible for the bad DNS server to end up being
persisted.)

### 2. `ZoneKind`

There were previously two dataless enums describing the kind of Omicron
zone floating around.

1. [`ZoneKind` in
`sled-agent-client`](https://github.com/oxidecomputer/omicron/blob/26959bfc1a073c91a920bd521da13a980d137e3f/clients/sled-agent-client/src/lib.rs#L87),
used by Nexus: exactly `OmicronZoneType` without any associated data.
2. [`ZoneType` in
`sled-agent`](https://github.com/oxidecomputer/omicron/blob/26959bfc1a073c91a920bd521da13a980d137e3f/sled-agent/src/params.rs#L265),
used by sled-agent -- identifying the substring used within the service
(notably with `BoundaryNtp` and `InternalNtp` combined into one).

While looking around, I also noticed that there were actually four
different string representations in use for Omicron zones:

1. the zone prefix
2. the service prefix
3. a prefix for the `Name` instance
4. a string used in reports and error messages.

Here, 1-3 are pretty similar to each other (they combine `InternalNtp`
and `BoundaryNtp` down to one, and are generated by sled-agent's
`ZoneType`).
  
There is also a `ZoneType` in nexus-db-model that corresponds to
`ZoneKind`.

Based on this, I decided to combine it all down into one `ZoneKind`
enum, with 4 functions returning 4 string representations: one each for
1-4.
* There's no `Display` impl -- users must choose the representation they
want, which I think is correct.
* I've tried to carefully verify that there are **no behavior changes**
introduced here.
* An alternative would be to have a dedicated representation for 1-3
along with one for 4, but I tried it out and it seemed too unwieldy in
practice. (For example, a few places want to use the zone prefix, but
then report errors using the report string. Having it be one enum with
four representations in string-land makes this easier. I do hope we
won't ever need a fifth, though.)

### 3. Physical disk config

`OmicronPhysicalDiskConfig` and `OmicronPhysicalDisksConfig` are shared
types used by both Nexus and sled-agent. I moved them from
`sled-storage` to `omicron-common`, next to `DiskIdentity` (which was
already in `omicron-common` and which these types use).

* Previously, `nexus-types` used the `sled-agent-client`-generated copy
of these types -- but as discussed in the motivation section, that is no
longer an option.
* Why not `nexus-sled-agent-shared`? Well, `sled-storage` also needs
these types, and I wanted to avoid sled-storage depending on the
additional dependencies in `nexus-sled-agent-shared`.
* I also used `replace` to patch them in so that there's just one copy
of these types throughout sled-agent and Nexus.

### 4. Recovery silo types

These are two types, `RecoverySiloConfig` and `Certificate`, that are
primarily related to Nexus and that `sled-agent` was using via
`nexus-client`. Users of these types needed to be moved into
sled-agent-types, which (as described in the motivation section) cannot
depend on nexus-client. So I've moved them into shared crates.

* `Certificate` is a simple type that can go into `omicron-common`'s
`api::internal::nexus`.
* `RecoverySiloConfig` depends on `omicron-passwords`, so it goes in
`nexus-sled-agent-shared`.

We also patch these types in `nexus-client`. One **behavior change** as
a result is that password hash strength is now validated on **both the
client and the server.** I think this is okay, but it's worth mentioning
explicitly.

### 5. `UserId`

So this is a really interesting one. It needs to be moved into a shared
location because `RecoverySiloConfig` depends on it. But during this
process I noticed that we weren't doing any validation on it in clients
like wicketd. So this counts as a **behavior change**: `UserId` is now
validated on **both the client and the server**.

Note that `UserId` is exposed by the Nexus external API. But this change
itself has no impact on the external API (this can be seen by the fact
that `nexus.json` doesn't have any changes in it)

(I realized that the name should probably be something closer to
`LocalUserId`, but that would be a breaking change to the external API
so I've deferred it. Happy to change the name if it makes sense,
though.)

### 6. Inventory types

These are `Inventory`, `InventoryDisk` and `InventoryZpool`. I've moved
them to `nexus-sled-agent-shared` as a replacement for the previous
scheme where we were using client-generated versions of them.

### 7. `DiskVariant`, `SledRole`, other misc types

These are simple types depended upon by others that have been moved into
shared crates -- locations chosen mostly based on which code depends on
them.

## OpenAPI changes

There are some changes to our OpenAPI specifications.

* Most of the changes are to `description` instances now that they're no
longer going through Progenitor twice.
* There are some validation changes, as documented above.

## Questions

- [x] What should the name of the new crate be?
  - Resolved in 2024-07-23 meeting: **nexus-sled-agent-shared**.
- [x] Verify that the behavior changes documented above are all
reasonable.

## Other notes

It would be nice to have tooling which asserts that e.g. none of the
`-types` crates ever depend on `-client` crates. Haven't written this
tooling though.

Previously I did similar things using
[guppy](https://github.com/guppy-rs/guppy), but that's only available as
a library and can take a while to build (the total number of build units
for `xtask` goes from 254 to 279, which is significant). Turning this
into a binary would solve this but requires some work.
@sunshowers
Copy link
Contributor

Addressed in #6138 -- writeup is in the PR summary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants