Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

document a pattern for making Sled-Agent-owned config to be Nexus-owned instead #7279

Merged
merged 5 commits into from
Dec 21, 2024

Conversation

davepacheco
Copy link
Collaborator

Recent and upcoming work have prompted discussions about how we take existing Sled-Agent-owned configuration and make it Nexus-owned instead. Here's my attempt to document this pattern for future reference. I'm afraid it can't be too prescriptive because each of the cases we've encountered is a little different. But I'm hopeful this can be a useful starting point when we go do things like add an artifact/image id to Omicron zone configs.

I may have gone overboard with Mermaid again.


**In the next release** (we'll call it "release 2"):

. Add `my_config: MyConfig` to appropriate spot in Sled Agent inventory. Note that it is _not_ an optional field. In this use case, Sled Agent always knows the current value of this field.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in a case like adding the artifact/image ID, we're also adding the requirement that sled agent needs to learn about this value explicitly, right?

As an example -- "release 1" of the sled agent might launch instances using /opt/oxide/propolis-server.tar.gz -- and use whatever file that is, regardless of "version".

Just wanted to call that out, as "part of release 2 may involve making sled agent explicitly aware of what state it happens to be managing", since we've declared it should be non-optional in inventory.

(also, to be clear, it's non-optional in the sled agent API call to Nexus, not the inventory database structure, right? Otherwise, old inventory collections would be invalidated)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in a case like adding the artifact/image ID, we're also adding the requirement that sled agent needs to learn about this value explicitly, right?

As an example -- "release 1" of the sled agent might launch instances using /opt/oxide/propolis-server.tar.gz -- and use whatever file that is, regardless of "version".

Just wanted to call that out, as "part of release 2 may involve making sled agent explicitly aware of what state it happens to be managing", since we've declared it should be non-optional in inventory.

The code needs to be aware that this value, which may previously have been hardcoded or even absent altogether, is now a variable to be reported in inventory (this item) and controlled by Nexus (item 4). Is that what you mean by "learn about this value explicitly"?

(also, to be clear, it's non-optional in the sled agent API call to Nexus, not the inventory database structure, right? Otherwise, old inventory collections would be invalidated)

Good point! I will update this to be more clear that (1) the inventory API structure can make this required but the inventory database structures should have this value optional, and (2) in release 3, this can probably be made required in the inventory database structures, too. (This is annoyingly tricky: we might still have old collections at this point but it seems unlikely.)

Copy link
Collaborator

@smklein smklein Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code needs to be aware that this value, which may previously have been hardcoded or even absent altogether, is now a variable to be reported in inventory (this item) and controlled by Nexus (item 4). Is that what you mean by "learn about this value explicitly"?

Yeah, basically just calling this out -- the phrasing "In this use case, Sled Agent always knows the current value of this field" is true, but if it was an "implicit" value before, Sled Agent might need to do some legwork to get an explicit value that can be reported.

Re: Inventory stuff

Yeah, totally - another spot where it might be useful to somehow provide a barrier. If we could ensure that all inventory collections are "from this release or the one prior, and no older", that could be a useful way to gradually make these values non-optional.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is annoyingly tricky: we might still have old collections at this point but it seems unlikely.)

Is this almost identical (or even exactly identical) to "changing an optional field to required in a blueprint requires dropping old blueprints", with the exception that inventory already rotates old collections out with relatively high frequency except in the case of errors (where we keep around the last good collection, which might be months old on a system like dogfood)? Do you think we #7278 should explicitly include inventory collections too? (It already implicitly covers them to some extent, since saving the reconfigurator state already includes them, but doesn't mention deleting old ones if they're still around.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is annoyingly tricky: we might still have old collections at this point but it seems unlikely.)

Is this almost identical (or even exactly identical) to "changing an optional field to required in a blueprint requires dropping old blueprints", with the exception that inventory already rotates old collections out with relatively high frequency except in the case of errors (where we keep around the last good collection, which might be months old on a system like dogfood)? Do you think we #7278 should explicitly include inventory collections too? (It already implicitly covers them to some extent, since saving the reconfigurator state already includes them, but doesn't mention deleting old ones if they're still around.)

Yes, it's analogous. Sure -- I'll update that issue.

Comment on lines +242 to +243
. Wait for at least one inventory cycle to complete successfully and verify that it contains the expected `my_config` field.
. Generate a new blueprint, make it the current target, and ensure that it executes successfully. It should make no actual changes to the system, but it will propagate the current values for `my_config` to the blueprint system and to sled ledgers.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely agree checking for the blueprint + ledgers is probably an ad-hoc operation that we need a human to do, but getting this standard process of:

  • Collect inventory
  • Make a new blueprint, check the diff, make sure it looks good
  • Set it as target, execute it

As a part of each release seems like it would be very useful for us to rely on.

(I think it would be nice to be able to assume "if you have a rack running on release N, part of that upgrade involved creating a blueprint at release N and making sure it could run")

**In the next release** (we'll call it "release 3"): all the optional fields can be made non-optional:

* Blueprints' in-memory structure can go from `my_config: Option<MyConfig>` to `my_config: MyConfig`.
* Blueprints' database representation can go from NULL-able columns to non-NULL-able ones, though only if we can populate the value or drop it from old blueprints. More work is needed here.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another spot where "if we know we added a blueprint during upgrade", it also might give us the leniency to "delete all older blueprints as the final step of an upgrade". Or perhaps as one of the first steps of upgrading to N+1

During release 1 and during release 2 _before_ Sled Agent has reported the configuration in inventory, things look like this:

```mermaid
sequenceDiagram
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW since you mentioned going overboard on Mermaid: I found this diagram and the one below it very helpful. (I didn't find the ones above as helpful, personally, but I don't think they detract anything from the doc, either.)

@davepacheco davepacheco enabled auto-merge (squash) December 21, 2024 00:41
@davepacheco davepacheco merged commit c140817 into main Dec 21, 2024
16 checks passed
@davepacheco davepacheco deleted the dap/doc-pattern branch December 21, 2024 04:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants