initial inventory for automated update #4291
Conversation
This commit matches commit 581e902 in branch dap/nexus-inventory. I've just rebased the changes onto "main" here.
I first read that as "chicken sandwiches" and was very confused. I'm not even that hungry!
```rust
    // if we try to ask MGS about it, we have to wait for MGS to time out
    // its attempt to reach it (currently several seconds). This choice
    // enables inventory to complete much faster, at the expense of not
    // being able to identify this particular condition.
```
I wonder if we should still try to query the SPs ignition says aren't present, but on some lower frequency (maybe even a separate background task entirely? and/or in a subsystem that's more related to faults than inventory, since "I can talk to an SP that ignition says isn't there" is definitely abnormal?). I'm nervous about baking in blind spots.
I think that makes sense. I'd like to defer it for now. I don't think making this choice now makes it any harder to do that in the future.
No argument on deferring it. Maybe create an issue after this lands so we don't lose track? Seems like the kind of thing that would only happen on an already-bad day.
```sql
    PRIMARY KEY (inv_collection_id, hw_baseboard_id)
);

CREATE TYPE IF NOT EXISTS omicron.public.caboose_which AS ENUM (
```
We chatted extensively about this; I'll attempt to summarize here:

- We don't love this enum; it feels a little goofy.
- One alternative is to instead have `caboose_slot_0`/`caboose_slot_1` foreign keys in `inv_sp`/`inv_rot` that refer to rows in `inv_caboose`. This is a 1-to-at-most-1 relationship, but it allows us to encode in the schema that either all or none of the caboose data is available.
- Currently, `inv_caboose` doesn't have a single primary key, so adding a foreign key to it is awkward at best. We could either add an artificial primary key to `inv_caboose`, or try to shift things a bit to use `sw_caboose_id` as a FK.
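For concreteness, here is a rough sketch of the two shapes being compared. The table and column names follow the discussion above but are illustrative only, not the actual dbinit.sql definitions:

```sql
-- Option A (this PR): one row per caboose observation, keyed by "which"
-- caboose was asked about.  caboose_which is the enum under discussion.
CREATE TABLE inv_caboose (
    inv_collection_id UUID NOT NULL,
    hw_baseboard_id UUID NOT NULL,
    which caboose_which NOT NULL,   -- e.g. SP slot 0/1, RoT slot A/B
    sw_caboose_id UUID NOT NULL,
    PRIMARY KEY (inv_collection_id, hw_baseboard_id, which)
);

-- Option B (the 58c010f direction): per-slot foreign keys on inv_sp / inv_rot
-- that point at caboose rows; NULL means that caboose wasn't collected.
CREATE TABLE inv_sp (
    inv_collection_id UUID NOT NULL,
    hw_baseboard_id UUID NOT NULL,
    caboose_slot_0 UUID,
    caboose_slot_1 UUID,
    PRIMARY KEY (inv_collection_id, hw_baseboard_id)
);
```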
I went ahead and did this in 58c010f. Honestly, I could go either way on the result. `CabooseWhich` does feel janky. But it also reflects exactly what we're getting from MGS: that is, each row in this table reflects one response from the get-caboose endpoint, and that essentially represents the parameters to that request. It's kind of a nice property that no row in an `inv_*` table represents data from multiple collection requests:

- it guides the schema design -- there's one table for each kind of observation (with maybe additional tables if there are a bunch of fields that can be present or absent together)
- it makes it easy to map the collection request responses to database rows
- it makes it easy to have the uniform set of (inv_collection_id, time_collected, source) fields. We do have those here, but it's arguably misleading because the source and time_collected fields on `inv_service_processor` and `inv_root_of_trust` don't apply to the caboose fields
- This example sounds awfully specific but I feel like there's something general here: imagine if we wanted in the future to update the database as we collect data instead of all at once at the end. We'd have this unfortunate situation of having to either insert an `inv_service_processor` record and then update it later or else hang onto it (don't insert it) until we've tried to collect all the things that might go into it.

I don't think these are big deals for this particular case. Rather, I came to this after exploring many different ways to structure this -- like an `inv_sled` table that might include pieces of information from both the sled agent (like the current host OS) and the SP (like the current host flash contents), etc. I really disliked this because in the face of partial failures you have all these partial rows and then everything has to be NULL-able. That's how I got to the "rows should not include data from multiple sources" rule.

Put differently: I don't think this specific violation of that rule is that bad, but without that rule, I found myself spinning in circles for a long time about how to design the schema. It's pretty compelling to just say "each source of observation is a table; each observation is a row; then apply the usual database normalization rules".

All that said, I'm kind of ambivalent in the end. I think I slightly prefer the previous thing with `caboose_which` but I'm interested in your thoughts.
All the properties you describe about `caboose_which` make sense. I think I still slightly-to-moderately prefer the changes in 58c010f, but the more I look at it the more I think it's largely a superficial preference. If you start to feel more strongly that you want to go back to `caboose_which` I could certainly live with that, even if it's just to maintain the schema design guidance.

This feels clearest to me on this issue in particular:

> imagine if we wanted in the future to update the database as we collect data instead of all at once at the end. We'd have this unfortunate situation of having to either insert an inv_service_processor record and then update it later or else hang onto it (don't insert it) until we've tried to collect all the things that might go into it.

I think it's the same either way. With `caboose_which`, partway through a collection insertion we could have rows in `inv_sp` that do not have corresponding rows in `inv_caboose`, which feels functionally the same as `inv_sp` having NULL `slotN_inv_caboose_id` foreign keys. The latter does mean when we do collect a caboose we have to do an insert+update instead of just an insert, but from a query / data representation point of view, either way you don't know whether the caboose is missing because we couldn't collect it or if it just hasn't been collected yet (absent other info like the presence of an error).
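To make the "ambiguous either way" point concrete, here's a hedged sketch of the lookup under the `caboose_which` design (illustrative names again, and assuming an enum value for SP slot 0). A caboose that was uncollectable and one that simply hasn't been inserted yet both show up as a NULL in the result, exactly as a NULL foreign key would:

```sql
SELECT sp.hw_baseboard_id, ic.sw_caboose_id
FROM inv_service_processor AS sp
LEFT JOIN inv_caboose AS ic
  ON  ic.inv_collection_id = sp.inv_collection_id
  AND ic.hw_baseboard_id   = sp.hw_baseboard_id
  AND ic.which             = 'sp_slot_0'
WHERE sp.inv_collection_id = $1;
```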
> I think it's the same either way.
Yeah, from a representation perspective, that's true. I had in my mind an implicit rule that we wouldn't want to write a record and then update it later in the same operation. But that's somewhat arbitrary, too. (Not doing this does make it more complicated to infer anything from partially-inserted collections, and to measure progress based on what's present, but now we're talking about several layers of hypotheticals that aren't worth dealing with now.)
As I was reading through `inv_root_of_trust` and `inv_service_processor` I was wondering where the references to the cabooses were, and then reached this comment thread and the remaining tables. Thinking about this a bit, I think we should stick with the `caboose_which` and `inv_caboose` tables as they are now rather than embedding fields in the sp and rot tables, which would require a write + update.

I don't think the slight convenience or aesthetically pleasing look of the sp and rot tables is strong enough to violate the rule of "one collection per source = one row in one table". That's a really powerful thing to allow us to reason about the system and my gut is telling me we'll be happy to have that later.
`nexus/types/src/inventory.rs` (Outdated)

```rust
    name: c.name,
    // The MGS API uses an `Option` here because old SP versions did not
    // supply it. But modern SP versions do. So we should never hit
    // this `unwrap_or()`.
```
Should we modify MGS to remove this `Option` altogether before (or as a part of) this PR? I'm inclined to say "yes"; it's a trivial change in MGS's `sp_component_caboose_get` endpoint.
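A minimal sketch of the sort of change being proposed, assuming the optional field is the caboose's `version` (the struct below is illustrative, not MGS's actual type definition):

```rust
pub struct SpComponentCaboose {
    pub board: String,
    pub git_commit: String,
    pub name: String,
    // Previously `Option<String>`, only because very old SP software didn't
    // report a version.  Requiring it here pushes the "missing version" case
    // out of every consumer, including the `unwrap_or()` noted above.
    pub version: String,
}
```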
Sure, I'll take a swing at that.
`nexus/types/src/inventory.rs` (Outdated)

```rust
/// with separate records, even though they might come from the same source
/// (in this case, a single MGS request).
///
/// We make heavy use of maps, sets, and Arcs here because many of these things
```
This part of the comment makes me nervous, but I think unnecessarily so. If we actually have `Arc`s pointing to each other, we can end up with undroppable cycles, but after reading over the structs I don't think we do, right? The `Arc<T>` types in `Collection` are:

- `BaseboardId` (does not contain any `Arc`s)
- `Caboose` (does not contain any `Arc`s)

and then `CabooseFound` keeps an `Arc<Caboose>` (which is also fine).
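A rough sketch of the ownership shape being described, with simplified stand-ins for the real `Collection` types: every `Arc` points "down" at a leaf value that contains no `Arc`s of its own, so many entries can share one `BaseboardId` or `Caboose` allocation without any possibility of a reference cycle.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

// Leaf values: no Arcs inside, so no cycles are possible.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct BaseboardId {
    part_number: String,
    serial_number: String,
}

#[derive(Debug)]
struct Caboose {
    board: String,
    name: String,
    version: String,
}

// Points at a shared Caboose; still no cycle.
#[derive(Debug)]
struct CabooseFound {
    caboose: Arc<Caboose>,
}

// Many entries can share the same BaseboardId / Caboose allocations.
#[derive(Debug, Default)]
struct Collection {
    baseboards: Vec<Arc<BaseboardId>>,
    cabooses_found: BTreeMap<Arc<BaseboardId>, CabooseFound>,
}
```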
I'll reword this a bit to clarify that no two objects point at "each other". It's more about the fact that some objects are pointed-to by many other things within the Collection.
```rust
            // `inv_service_processor` using an explicit list of columns
            // and values. Without the following statement, If a new
            // required column were added, this would only fail at
            // runtime.
```
Big 👍 on this comment (and the solution it's describing). Very clear what's going on here.
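The statement itself isn't shown in this excerpt, so purely as a hedged illustration: one common way to get this compile-time guarantee is to exhaustively destructure the model struct, so adding a required field without updating the insert stops the build. Roughly:

```rust
// Hypothetical model struct, for illustration only (not the actual
// omicron `InvServiceProcessor`).
struct InvServiceProcessor {
    inv_collection_id: String,
    hw_baseboard_id: String,
    sp_slot: i64,
}

fn insert_values(row: &InvServiceProcessor) -> Vec<String> {
    // Exhaustive destructuring: if a new required field is added to the
    // struct, this pattern (and the explicit column list it feeds) fails to
    // compile until someone updates it to match.
    let InvServiceProcessor { inv_collection_id, hw_baseboard_id, sp_slot } = row;
    vec![
        inv_collection_id.clone(),
        hw_baseboard_id.clone(),
        sp_slot.to_string(),
    ]
}
```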
```rust
        opctx.authorize(authz::Action::Modify, &authz::INVENTORY).await?;

        loop {
```
Can we get all the collection IDs to delete in a single query instead of looping and having to delete one at a time? This is grossly oversimplified (in particular, I'm putting the error count in directly as a column), but given

```sql
create table coll (id int primary key, started timestamp, nerrors int);
```

a query like this should return all IDs we need to prune:

```sql
select id from coll where id not in (
    -- keep the 3 most recent collections...
    (select id from coll order by started desc limit 3)
    union
    -- and the single most recent collection that had no errors (if it wasn't
    -- already saved by the "3 most recent" above)
    (select id from coll where nerrors = 0 order by started desc limit 1)
);
```
I think this is a promising approach. But I'm a little worried it'll take a while to make this real (first real SQL, then real Diesel code), and also that when we do, we may find the database winds up doing a table scan (or rejecting the SQL because we've configured it to disallow that). As an example: one of the stated assumptions is that the number of collections here could be huge. In that case, the highest-level subquery here will produce only 3 rows, but the query itself will be trying to select any collections not in that set, which will in turn return a very large result set. So we'll need a `LIMIT` there. We also want to start with the oldest ones, so we'll want an `ORDER BY` timestamp. At this point, there are enough variables here that I'm not sure what query plan Cockroach will use. I think ideal would be to do the subquery, then scan the index (by timestamp) and just skip over any rows that are in the subquery and stop when we reach the limit. I hope it will do that but it's hard to be sure until we do that work.

I think this is probably all solvable (or else we'll find out why it's not), but if the current code is at least correct and not pathological, I'd rather defer this than spend the time now to work all this out.
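A hedged sketch of what the pruning query might look like with those concerns folded in, still using the simplified `coll` table from above rather than the real schema:

```sql
-- same idea, but bounded: only return the oldest few prunable collections
-- per pass, so the result set stays small even if the table is huge
select id from coll where id not in (
    (select id from coll order by started desc limit 3)
    union
    (select id from coll where nerrors = 0 order by started desc limit 1)
)
order by started asc
limit 10;
```

Whether CockroachDB can satisfy the outer `order by`/`limit` from an index on `started` without scanning everything is exactly the open question above.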
```rust
        // break it up if these transactions become too big. But we'd need a
        // way to stop other clients from discovering a collection after we
        // start removing it and we'd also need to make sure we didn't leak a
        // collection if we crash while deleting it.
```
Do we also need to prune no-longer-referenced `hw_baseboard_id` or `sw_caboose` rows? It's a little hard to imagine `hw_baseboard_id` getting "too big" since it only gets a row for each physical component the rack sees, but maybe in a large/long-lived multirack deployment? `sw_caboose` gets a few new rows for every update, so probably grows somewhat faster but still not all that quickly.
In principle, eventually, yes. I think this is not urgent and we will notice before it becomes so.
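If and when it does become worth doing, the shape of it would presumably be something like the following hedged sketch (which assumes `inv_caboose` is the only table referencing `sw_caboose`; that may not hold for the real schema):

```sql
-- Remove caboose rows that no surviving collection references.
DELETE FROM sw_caboose
WHERE id NOT IN (SELECT sw_caboose_id FROM inv_caboose);
```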
```rust
}

impl diesel::query_builder::QueryFragment<diesel::pg::Pg> for InvCabooseInsert {
    fn walk_ast<'b>(
```
I have no useful suggestion here, but would like to register my complaint that this is much harder to read than the equivalent query written out in raw SQL would be.
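For readers who haven't run into this pattern: the complaint is about Diesel's `QueryFragment`, where the SQL is emitted piecewise from Rust. A toy example of the style (not the actual `InvCabooseInsert` implementation) looks roughly like this, versus the one line of SQL it produces:

```rust
use diesel::pg::Pg;
use diesel::query_builder::{AstPass, QueryFragment};
use diesel::QueryResult;

struct CountBaseboards;

impl QueryFragment<Pg> for CountBaseboards {
    // Emits: SELECT COUNT(*) FROM "hw_baseboard_id"
    fn walk_ast<'b>(&'b self, mut out: AstPass<'_, 'b, Pg>) -> QueryResult<()> {
        out.push_sql("SELECT COUNT(*) FROM ");
        out.push_identifier("hw_baseboard_id")?;
        Ok(())
    }
}
```

Even this trivial fragment is several times longer than its SQL; a full INSERT-from-SELECT CTE gets proportionally worse.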
This reverts commit 58c010f.
@jgallagher I've made a bunch of changes since your review, but hopefully no surprises. Besides the stuff that came up in your review:
That's all I've got planned so I think this is ready for re-review.
Looks like the helios deploy test failure is legit (or at least related to the changes):
@davepacheco This looks great. I only skimmed most of the DB transaction stuff, but the overall picture makes sense to me. I'll leave it to John to approve due to my skimming and his expertise.
```
@@ -2514,6 +2514,222 @@ CREATE TABLE IF NOT EXISTS omicron.public.bootstore_keys (
    generation INT8 NOT NULL
);

/*
```
I ❤️ this comment
```rust
    /// Prune inventory collections stored in the database, keeping at least
    /// `nkeep`.
    ///
    /// This function removes as many collections as possible while preserving
```
The latest `nkeep` are determined by timestamps, which aren't really global. Right now collection is at 10 min intervals, so as long as only one Nexus performs a collection per interval this should be fine. However, I could see problems arising around order, although quite unlikely due to our 500ms limitation around syncing.

I don't really think this is something worth worrying about but figured I'd ask for completeness' sake. Are there autoincrementing IDs we could use for collections rather than UUIDs as foreign keys, and then sort by those? Would this present other issues with dueling Nexuses?
Yeah, using timestamps is definitely a little fuzzy. In this case though I think that reflects the reality that collections are not atomic and they don't have a total order. Two Nexus instances could totally run collections concurrently that have start/done times that overlap (and I think that's fine). Consumers can decide if they want the most-recently-started or most-recently-finished (or even the-one-containing-the-most-recent-collection-time-for-the-specific-item-that-I-care-about). We could potentially use a sequence to assign a total order to these but I don't think it would have a useful semantic meaning -- at best it'd be a proxy for "which one committed to the database first" and I'm not sure that's useful.

Okay so my argument is basically "report the facts (the start/done timestamp) and let consumers decide what they want". But that just punts your question to "okay, well, which ones should we keep when we're pruning them?". And I think the answer here is to tune both the frequency and `nkeep` such that it doesn't really matter if we choose "wrong" -- i.e., if two collections start/finish at about the same time but for some reason a consumer might reasonably want either one, we should probably just keep both. But my expectation here is that all consumers for now would probably want the same thing, which is the latest "time_started" one, and as long as "nkeep" is more than 1 then it doesn't matter which ones we keep if two overlap because there's always a newer one which is what consumers actually want.
Ah, ok. That makes sense. I was actually thinking that we could eliminate the overlapping collections to a degree by having each nexus check that there isn't a collection currently running - or rather that one hasn't started within some bound (say collection_interval / 2) before kicking off another. With that, collections should be very close to totally ordered by time if not always so.
Thanks for taking a look @andrewjstone!
This looks great! Just a handful of small nits.
`nexus/db-model/src/inventory.rs` (Outdated)

```rust
    pub serial_number: String,
}

impl<'a> From<&'a BaseboardId> for HwBaseboardId {
```
Tiny nit / question - if we have to `.clone()` all the fields of `BaseboardId`, should this be `impl From<BaseboardId>` instead, and push the clone to the callsite? I'm not sure how this is used, but that might avoid some clones, if there are cases where a caller has a `BaseboardId` that they want to convert into a `HwBaseboardId` and not use again.

Similar question about other `From<&T>` impls in this file.
Agreed. Fixed in a55216d. I changed this one and the `SwCaboose` one. I did not change the `InvCollection` one because in that case, the source object (a `Collection`) is potentially huge.
```rust
    resolver: internal_dns::resolver::Resolver,
    creator: String,
    nkeep: u32,
    disable: bool,
```
Just making sure I understand: this is a setting at the Nexus config level (i.e., the TOML file baked into the Nexus zone) and cannot change at runtime, right? If we needed to flip this switch, how would we?
That's right. To my knowledge we do not yet have a way to apply dynamic config at runtime. The intent here is that if we really needed to, we could modify the TOML file inside each Nexus zone to disable this task. Then we'd restart Nexus. It's obviously not great but I've frequently found these sorts of facilities essential in the mitigation of production incidents in the past. (A step up might be a support API for pausing any background task in a particular Nexus instance by name. But that wouldn't survive Nexus restart without storing that config somewhere.)
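For illustration, the knob being described might look something like this in the Nexus zone's config file. The key names below are assumptions for the sketch, not the actual config schema:

```toml
# Hypothetical excerpt of a Nexus configuration TOML.
[background_tasks.inventory]
# How often to run a collection, in seconds.
period_secs = 600
# How many recent collections to keep when pruning.
nkeep = 3
# Escape hatch: set to true (and restart Nexus) to stop collecting.
disable = false
```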
```rust
        datastore.clone(),
    );

    // Nexus starts our very background task, so we should find a collection
```
Nit typo - "starts our very background"
Not a typo, but poorly written. I reworded it in a55216d.
`nexus/src/app/background/init.rs` (Outdated)

```
@@ -88,6 +96,30 @@ impl BackgroundTasks {
            (task, watcher_channel)
        };

        // Background task: inventory collector
        let task_inventory_collection = {
            let watcher = inventory_collection::InventoryCollector::new(
```
Nit / question - is this variable misnamed? Looks like `register` takes `watchers` as its last arg, but this is the task implementation itself, right?
Yeah, fixed in a55216d.
```
RotSlotB baseboard part "FAKE_SIM_SIDECAR" serial "SimSidecar1": board "SimSidecarRot"

errors:
error: MGS "http://[100::1]:12345": listing ignition targets: Communication Error: error sending request for url (http://[100::1]:12345/ignition): error trying to connect: tcp connect error: Network is unreachable (os error <<redacted>>): error sending request for url (http://[100::1]:12345/ignition): error trying to connect: tcp connect error: Network is unreachable (os error <<redacted>>): error trying to connect: tcp connect error: Network is unreachable (os error <<redacted>>): tcp connect error: Network is unreachable (os error <<redacted>>): Network is unreachable (os error <<redacted>>)
```
Ug, sorry for this; I really need to clean up the duplicated: duplicated: duplicated: errors from MGS
```rust
            let index = u16::try_from(i).map_err(|e| {
                Error::internal_error(&format!(
                    "failed to convert error index to u16 (too \
                    many errors in inventory collection?): {}",
```
Trivial nit - `rustfmt` won't line up split strings
Ugh. This keeps happening and I don't notice. I'm not sure why. I wonder if it happens when some other change (like a symbol rename) causes this block to be reformatted when I'm not actually working on it. Anyway, fixed in a55216d.
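For anyone unfamiliar with the nit: `rustfmt` treats string literal contents as opaque, so a literal split with a trailing `\` keeps whatever leading whitespace it happens to have even when the surrounding code gets re-indented. A small hedged illustration:

```rust
fn message(n: usize) -> String {
    // The `\` continuation strips the newline and leading whitespace from
    // the resulting *value*, but rustfmt never re-aligns the second line in
    // the *source*, so it drifts when nearby code is reformatted.
    format!(
        "failed to convert error index to u16 (too \
         many errors in inventory collection?): {}",
        n
    )
}

fn main() {
    println!("{}", message(7));
}
```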
```rust
    }
}

/// A SQL common table expression (CTE) used to insert into `inv_caboose`
```
I think this block comment is referencing code that no longer exists, right?
Yikes! Yes. Removed in a55216d.
I think I've addressed the outstanding feedback and I intend to land this once the repo re-opens after the latest customer update.
The RoT can report four different 512-byte pages (CMPA, and CFPA active/inactive/scratch). Given multiple RoT artifacts that are viable (match the right board, etc.) but are signed with different keys, these pages are required to identify which archive was signed with a key that the RoT will accept. This PR adds collection of these pages to the inventory system added in #4291.

The implementation here is fairly bulky but very mechanical, and is implemented almost identically to the way we collect cabooses: there's an `rot_page_which` to identify which of the four kinds of page it is, and a table for storing the relatively small number of raw page data values. Most of the changes in this PR resulted from "find where we're doing something for cabooses, then do the analogous thing for RoT pages". There are a couple minor quibbles in the unit tests that I'll point out by leaving comments below.

The RoT pages now show up when viewing a collection through omdb (note that the quite long base64 string is truncated; there's a command line flag to override the truncation and show the full string):

```console
$ omdb db inventory collections show e2f84867-010d-4ac3-bbf3-bc1e865da16b > x.txt
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using database URL postgresql://root@[::1]:43301/omicron?sslmode=disable
note: database schema version matches expected (11.0.0)
collection: e2f84867-010d-4ac3-bbf3-bc1e865da16b
collector:  e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c (likely a Nexus instance)
started:    2023-11-14T18:51:54.900Z
done:       2023-11-14T18:51:54.942Z
errors:     0

Sled SimGimlet00
    part number: FAKE_SIM_GIMLET
    power:       A2
    revision:    0
    MGS slot:    Sled 0 (cubby 0)
    found at:    2023-11-14 18:51:54.924602 UTC from http://[::1]:42341
    cabooses:
        SLOT      BOARD         NAME       VERSION  GIT_COMMIT
        SpSlot0   SimGimletSp   SimGimlet  0.0.1    ffffffff
        SpSlot1   SimGimletSp   SimGimlet  0.0.1    ffffffff
        RotSlotA  SimGimletRot  SimGimlet  0.0.1    eeeeeeee
        RotSlotB  SimGimletRot  SimGimlet  0.0.1    eeeeeeee
    RoT pages:
        SLOT          DATA_BASE64
        Cmpa          Z2ltbGV0LWNtcGEAAAAAAAAAAAAAAAAA...
        CfpaActive    Z2ltbGV0LWNmcGEtYWN0aXZlAAAAAAAA...
        CfpaInactive  Z2ltbGV0LWNmcGEtaW5hY3RpdmUAAAAA...
        CfpaScratch   Z2ltbGV0LWNmcGEtc2NyYXRjaAAAAAAA...
    RoT: active slot: slot A
    RoT: persistent boot preference: slot A
    RoT: pending persistent boot preference: -
    RoT: transient boot preference: -
    RoT: slot A SHA3-256: -
    RoT: slot B SHA3-256: -

Sled SimGimlet01
    part number: FAKE_SIM_GIMLET
    power:       A2
    revision:    0
    MGS slot:    Sled 1 (cubby 1)
    found at:    2023-11-14 18:51:54.935038 UTC from http://[::1]:42341
    cabooses:
        SLOT      BOARD         NAME       VERSION  GIT_COMMIT
        SpSlot0   SimGimletSp   SimGimlet  0.0.1    ffffffff
        SpSlot1   SimGimletSp   SimGimlet  0.0.1    ffffffff
        RotSlotA  SimGimletRot  SimGimlet  0.0.1    eeeeeeee
        RotSlotB  SimGimletRot  SimGimlet  0.0.1    eeeeeeee
    RoT pages:
        SLOT          DATA_BASE64
        Cmpa          Z2ltbGV0LWNtcGEAAAAAAAAAAAAAAAAA...
        CfpaActive    Z2ltbGV0LWNmcGEtYWN0aXZlAAAAAAAA...
        CfpaInactive  Z2ltbGV0LWNmcGEtaW5hY3RpdmUAAAAA...
        CfpaScratch   Z2ltbGV0LWNmcGEtc2NyYXRjaAAAAAAA...
    RoT: active slot: slot A
    RoT: persistent boot preference: slot A
    RoT: pending persistent boot preference: -
    RoT: transient boot preference: -
    RoT: slot A SHA3-256: -
    RoT: slot B SHA3-256: -

Switch SimSidecar0
    part number: FAKE_SIM_SIDECAR
    power:       A2
    revision:    0
    MGS slot:    Switch 0
    found at:    2023-11-14 18:51:54.904 UTC from http://[::1]:42341
    cabooses:
        SLOT      BOARD          NAME        VERSION  GIT_COMMIT
        SpSlot0   SimSidecarSp   SimSidecar  0.0.1    ffffffff
        SpSlot1   SimSidecarSp   SimSidecar  0.0.1    ffffffff
        RotSlotA  SimSidecarRot  SimSidecar  0.0.1    eeeeeeee
        RotSlotB  SimSidecarRot  SimSidecar  0.0.1    eeeeeeee
    RoT pages:
        SLOT          DATA_BASE64
        Cmpa          c2lkZWNhci1jbXBhAAAAAAAAAAAAAAAA...
        CfpaActive    c2lkZWNhci1jZnBhLWFjdGl2ZQAAAAAA...
        CfpaInactive  c2lkZWNhci1jZnBhLWluYWN0aXZlAAAA...
        CfpaScratch   c2lkZWNhci1jZnBhLXNjcmF0Y2gAAAAA...
    RoT: active slot: slot A
    RoT: persistent boot preference: slot A
    RoT: pending persistent boot preference: -
    RoT: transient boot preference: -
    RoT: slot A SHA3-256: -
    RoT: slot B SHA3-256: -

Switch SimSidecar1
    part number: FAKE_SIM_SIDECAR
    power:       A2
    revision:    0
    MGS slot:    Switch 1
    found at:    2023-11-14 18:51:54.915680 UTC from http://[::1]:42341
    cabooses:
        SLOT      BOARD          NAME        VERSION  GIT_COMMIT
        SpSlot0   SimSidecarSp   SimSidecar  0.0.1    ffffffff
        SpSlot1   SimSidecarSp   SimSidecar  0.0.1    ffffffff
        RotSlotA  SimSidecarRot  SimSidecar  0.0.1    eeeeeeee
        RotSlotB  SimSidecarRot  SimSidecar  0.0.1    eeeeeeee
    RoT pages:
        SLOT          DATA_BASE64
        Cmpa          c2lkZWNhci1jbXBhAAAAAAAAAAAAAAAA...
        CfpaActive    c2lkZWNhci1jZnBhLWFjdGl2ZQAAAAAA...
        CfpaInactive  c2lkZWNhci1jZnBhLWluYWN0aXZlAAAA...
        CfpaScratch   c2lkZWNhci1jZnBhLXNjcmF0Y2gAAAAA...
    RoT: active slot: slot A
    RoT: persistent boot preference: slot A
    RoT: pending persistent boot preference: -
    RoT: transient boot preference: -
    RoT: slot A SHA3-256: -
    RoT: slot B SHA3-256: -
```

There's also a new `omdb` subcommand to report the RoT pages (which does not truncate, but if we think it should that'd be easy to change):

$ omdb db inventory rot-pages
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL) note: using database URL postgresql://root@[::1]:43301/omicron?sslmode=disable note: database schema version matches expected (11.0.0) ID DATA_BASE64 099ba572-a978-4592-ae7a-452629377904 c2lkZWNhci1jZnBhLWluYWN0aXZlAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA= 0e9dc5b0-b190-43da-acb6-84450fdfdb94 c2lkZWNhci1jbXBhAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA= 80923bac-fbcc-46e0-b861-9dba906c14f7 Z2ltbGV0LWNmcGEtaW5hY3RpdmUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA= 98cc4225-a791-4092-99c6-81e27e8d8ffa c2lkZWNhci1jZnBhLWFjdGl2ZQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA= a32eaf95-a20e-4570-8860-e0fb584a2ff1 
c2lkZWNhci1jZnBhLXNjcmF0Y2gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA= c941810a-1c6a-4dda-9c71-41a0caf62ace Z2ltbGV0LWNtcGEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA= e96042d0-ae8a-435c-9118-1b71e8a9a651 Z2ltbGV0LWNmcGEtYWN0aXZlAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA= fdc27064-4338-4cbe-bfe5-622b11a9afbc Z2ltbGV0LWNmcGEtc2NyYXRjaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
This PR implements the first round of hardware/software inventory for automated update. See RFD 433 for background. There's a summary of the new data model in dbinit.sql.

I'm sorry this change is so big. Here are the key pieces:

- `nexus/types`: types related to software inventory (used in a few places)
- `schema/crdb` and `nexus/db-model`: database schema/model described in RFD 433
- `nexus/db-queries`: datastore queries to insert or delete an entire inventory `Collection`
- `nexus/inventory`: new crate with `Collector` and builder interface. This crate only collects inventory -- it doesn't do anything with the database.
- `nexus/src/app/background`: a new background task that uses these other pieces to collect inventory, write it to the database, and clean up old collections
- `omdb` support for showing inventory data from the database

What's not here (and will be in future PRs, not this one):
omicron-dev run-all
, as well as all tests that set up aControlPlaneTestContext
, now run a Management Gateway Service backed by the same simulated SPs used in the existing MGS tests. This was easy to do, convenient for future inventory work, and it was necessary to test theomdb
changes.omdb
does not callusdt::register_probes()
, so we don't have (for example) the diesel-dtrace probes inomdb
. I added a call inpool.rs
to cover these. This isn't quite once per process, but it's close, and ensures that anybody who uses our database layer will get these probes. This was a one line change.pool_authorized()
because it was non-pub
and there was only one caller and I found the name confusing.Here's some example output. I used
omicron-dev run-all
to get everything going:Here's
omdb nexus background-tasks show
for the new "inventory" task:Here's using
omdb
to poke around the inventory data: