Nexus inventory: Add collection of RoT CMPA and CFPA pages #4496

jgallagher · 2023-11-14T18:54:45Z

The RoT can report four different 512-byte pages (CMPA, and CFPA active/inactive/scratch). Given multiple RoT artifacts that are viable (match the right board, etc.) but are signed with different keys, these pages are required to identify which archive was signed with a key that the RoT will accept. This PR adds collection of these pages to the inventory system added in #4291.

The implementation here is fairly bulky but very mechanical, and is implemented almost identically to the way we collect cabooses: there's an rot_page_which to identify which of the four kinds of page it is, and a table for storing the relatively small number of raw page data values. Most of the changes in this PR resulted from "find where we're doing something for cabooses, then do the analogous thing for RoT pages". There are a couple minor quibbles in the unit tests that I'll point out by leaving comments below.

The RoT pages now show up when viewing a collection through omdb (note that the quite long base64 string is truncated; there's a command line flag to override the truncation and show the full string):

$ omdb db inventory collections show e2f84867-010d-4ac3-bbf3-bc1e865da16b > x.txt
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using database URL postgresql://root@[::1]:43301/omicron?sslmode=disable
note: database schema version matches expected (11.0.0)
collection: e2f84867-010d-4ac3-bbf3-bc1e865da16b
collector:  e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c (likely a Nexus instance)
started:    2023-11-14T18:51:54.900Z
done:       2023-11-14T18:51:54.942Z
errors:     0

Sled SimGimlet00
    part number: FAKE_SIM_GIMLET
    power:    A2
    revision: 0
    MGS slot: Sled 0 (cubby 0)
    found at: 2023-11-14 18:51:54.924602 UTC from http://[::1]:42341
    cabooses:
        SLOT     BOARD        NAME      VERSION GIT_COMMIT 
        SpSlot0  SimGimletSp  SimGimlet 0.0.1   ffffffff   
        SpSlot1  SimGimletSp  SimGimlet 0.0.1   ffffffff   
        RotSlotA SimGimletRot SimGimlet 0.0.1   eeeeeeee   
        RotSlotB SimGimletRot SimGimlet 0.0.1   eeeeeeee   
    RoT pages:
        SLOT         DATA_BASE64                         
        Cmpa         Z2ltbGV0LWNtcGEAAAAAAAAAAAAAAAAA... 
        CfpaActive   Z2ltbGV0LWNmcGEtYWN0aXZlAAAAAAAA... 
        CfpaInactive Z2ltbGV0LWNmcGEtaW5hY3RpdmUAAAAA... 
        CfpaScratch  Z2ltbGV0LWNmcGEtc2NyYXRjaAAAAAAA... 
    RoT: active slot: slot A
    RoT: persistent boot preference: slot A
    RoT: pending persistent boot preference: -
    RoT: transient boot preference: -
    RoT: slot A SHA3-256: -
    RoT: slot B SHA3-256: -

Sled SimGimlet01
    part number: FAKE_SIM_GIMLET
    power:    A2
    revision: 0
    MGS slot: Sled 1 (cubby 1)
    found at: 2023-11-14 18:51:54.935038 UTC from http://[::1]:42341
    cabooses:
        SLOT     BOARD        NAME      VERSION GIT_COMMIT 
        SpSlot0  SimGimletSp  SimGimlet 0.0.1   ffffffff   
        SpSlot1  SimGimletSp  SimGimlet 0.0.1   ffffffff   
        RotSlotA SimGimletRot SimGimlet 0.0.1   eeeeeeee   
        RotSlotB SimGimletRot SimGimlet 0.0.1   eeeeeeee   
    RoT pages:
        SLOT         DATA_BASE64                         
        Cmpa         Z2ltbGV0LWNtcGEAAAAAAAAAAAAAAAAA... 
        CfpaActive   Z2ltbGV0LWNmcGEtYWN0aXZlAAAAAAAA... 
        CfpaInactive Z2ltbGV0LWNmcGEtaW5hY3RpdmUAAAAA... 
        CfpaScratch  Z2ltbGV0LWNmcGEtc2NyYXRjaAAAAAAA... 
    RoT: active slot: slot A
    RoT: persistent boot preference: slot A
    RoT: pending persistent boot preference: -
    RoT: transient boot preference: -
    RoT: slot A SHA3-256: -
    RoT: slot B SHA3-256: -

Switch SimSidecar0
    part number: FAKE_SIM_SIDECAR
    power:    A2
    revision: 0
    MGS slot: Switch 0
    found at: 2023-11-14 18:51:54.904 UTC from http://[::1]:42341
    cabooses:
        SLOT     BOARD         NAME       VERSION GIT_COMMIT 
        SpSlot0  SimSidecarSp  SimSidecar 0.0.1   ffffffff   
        SpSlot1  SimSidecarSp  SimSidecar 0.0.1   ffffffff   
        RotSlotA SimSidecarRot SimSidecar 0.0.1   eeeeeeee   
        RotSlotB SimSidecarRot SimSidecar 0.0.1   eeeeeeee   
    RoT pages:
        SLOT         DATA_BASE64                         
        Cmpa         c2lkZWNhci1jbXBhAAAAAAAAAAAAAAAA... 
        CfpaActive   c2lkZWNhci1jZnBhLWFjdGl2ZQAAAAAA... 
        CfpaInactive c2lkZWNhci1jZnBhLWluYWN0aXZlAAAA... 
        CfpaScratch  c2lkZWNhci1jZnBhLXNjcmF0Y2gAAAAA... 
    RoT: active slot: slot A
    RoT: persistent boot preference: slot A
    RoT: pending persistent boot preference: -
    RoT: transient boot preference: -
    RoT: slot A SHA3-256: -
    RoT: slot B SHA3-256: -

Switch SimSidecar1
    part number: FAKE_SIM_SIDECAR
    power:    A2
    revision: 0
    MGS slot: Switch 1
    found at: 2023-11-14 18:51:54.915680 UTC from http://[::1]:42341
    cabooses:
        SLOT     BOARD         NAME       VERSION GIT_COMMIT 
        SpSlot0  SimSidecarSp  SimSidecar 0.0.1   ffffffff   
        SpSlot1  SimSidecarSp  SimSidecar 0.0.1   ffffffff   
        RotSlotA SimSidecarRot SimSidecar 0.0.1   eeeeeeee   
        RotSlotB SimSidecarRot SimSidecar 0.0.1   eeeeeeee   
    RoT pages:
        SLOT         DATA_BASE64                         
        Cmpa         c2lkZWNhci1jbXBhAAAAAAAAAAAAAAAA... 
        CfpaActive   c2lkZWNhci1jZnBhLWFjdGl2ZQAAAAAA... 
        CfpaInactive c2lkZWNhci1jZnBhLWluYWN0aXZlAAAA... 
        CfpaScratch  c2lkZWNhci1jZnBhLXNjcmF0Y2gAAAAA... 
    RoT: active slot: slot A
    RoT: persistent boot preference: slot A
    RoT: pending persistent boot preference: -
    RoT: transient boot preference: -
    RoT: slot A SHA3-256: -
    RoT: slot B SHA3-256: -

There's also a new omdb subcommand to report the RoT pages (which does not truncate, but if we think it should that'd be easy to change):

$ omdb db inventory rot-pages
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using database URL postgresql://root@[::1]:43301/omicron?sslmode=disable
note: database schema version matches expected (11.0.0)
ID                                   DATA_BASE64                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
099ba572-a978-4592-ae7a-452629377904 c2lkZWNhci1jZnBhLWluYWN0aXZlAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
0e9dc5b0-b190-43da-acb6-84450fdfdb94 c2lkZWNhci1jbXBhAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
80923bac-fbcc-46e0-b861-9dba906c14f7 Z2ltbGV0LWNmcGEtaW5hY3RpdmUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
98cc4225-a791-4092-99c6-81e27e8d8ffa c2lkZWNhci1jZnBhLWFjdGl2ZQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
a32eaf95-a20e-4570-8860-e0fb584a2ff1 c2lkZWNhci1jZnBhLXNjcmF0Y2gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
c941810a-1c6a-4dda-9c71-41a0caf62ace Z2ltbGV0LWNtcGEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
e96042d0-ae8a-435c-9118-1b71e8a9a651 Z2ltbGV0LWNmcGEtYWN0aXZlAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
fdc27064-4338-4cbe-bfe5-622b11a9afbc Z2ltbGV0LWNmcGEtc2NyYXRjaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

nexus/db-queries/src/db/datastore/inventory.rs

jgallagher · 2023-11-14T18:57:36Z

nexus/inventory/src/builder.rs

+        // either have kept the original data or returned Ok while returning an
+        // error? It seems a little strange we returned Err but accepted the new
+        // data.
+        assert_eq!(rot_page.page.data_base64, rot_page2.data_base64);


This is the second test issue I mentioned in the PR description. The caboose behaves this way so I made it match, but I'm not sure it's right - should we at least be logging a collection error if we hit this?

If I'm understanding it right: I think the current behavior is okay and also the improvements you alluded to would be good, too. My thinking is this:

It seems very possible (although probably uncommon) to get two different responses here due to the underlying data having changed in between the two calls (e.g., if we did an inventory collection during an update, we might see the caboose contents change). In this case, there's an intrinsic race, and I don't think it matters much which data we take, especially as long as we record which one it was (source and timestamp). A consumer that was depending on getting the later data really needs to wait for a collection to complete that starts after whatever operation they want it to be after. (I realize in the current implementation we'll take the later one anyway. Similarly if someone was depending on it being the earlier one, they'd have to look at a collection that finished before whatever operation they care about happened.)

The builder returns an error so that the caller can at least tell that something unusual happened and can report it. What level of reporting is appropriate? I think calling this a collection error is misleading. Those are intended to communicate that a particular collection is likely to be incomplete or invalid, which isn't the case in the example I mentioned and I don't think we want that sort of race to invalidate any concurrent collection. I think a warning-level log message (and counter bump, ideally) would be appropriate. If we wanted to augment collections with warnings or something like that, I think that'd also be fine, but I'm not sure it's worth doing right now.

The other possibility is that two MGS's reliably report inconsistent data for the same device even though the underlying data isn't changing. I can imagine this happening (i.e., due to a bug in one of them), but dealing with it in general seems pretty hard and it doesn't seem all that likely to me. Being able to tell in a support context that it's happened seems like the right level of effort.

That's just an explanation of how I think we got here. We could certainly change any of this!

This makes sense, thanks. I was only thinking of your third bullet point (or slight variations, such as an SP / RoT reporting inconsistent data itself, even if MGS isn't buggy), but you're right that this seems we're much more likely to hit this by racing with an update, which shouldn't cause a failure.

I think a warning-level log message (and counter bump, ideally) would be appropriate.

Collector does log this (at error!) level currently - want me to change them to warn!, and possibly include a "was an update happening simultaneously?" (either in the log message itself or in a comment around where the log message is generated)?

What do you mean by a counter bump?

I wouldn't bother changing it.

By counter bump I mean adding Oximeter-collected metrics to this subsystem and having an error counter for this case. That'd be welcome but obviously beyond the scope of this PR!

davepacheco

Nice!

dev-tools/omdb/src/bin/omdb/db.rs

davepacheco · 2023-11-15T17:33:29Z

nexus/inventory/src/builder.rs

+        // either have kept the original data or returned Ok while returning an
+        // error? It seems a little strange we returned Err but accepted the new
+        // data.
+        assert_eq!(rot_page.page.data_base64, rot_page2.data_base64);


If I'm understanding it right: I think the current behavior is okay and also the improvements you alluded to would be good, too. My thinking is this:

It seems very possible (although probably uncommon) to get two different responses here due to the underlying data having changed in between the two calls (e.g., if we did an inventory collection during an update, we might see the caboose contents change). In this case, there's an intrinsic race, and I don't think it matters much which data we take, especially as long as we record which one it was (source and timestamp). A consumer that was depending on getting the later data really needs to wait for a collection to complete that starts after whatever operation they want it to be after. (I realize in the current implementation we'll take the later one anyway. Similarly if someone was depending on it being the earlier one, they'd have to look at a collection that finished before whatever operation they care about happened.)

The builder returns an error so that the caller can at least tell that something unusual happened and can report it. What level of reporting is appropriate? I think calling this a collection error is misleading. Those are intended to communicate that a particular collection is likely to be incomplete or invalid, which isn't the case in the example I mentioned and I don't think we want that sort of race to invalidate any concurrent collection. I think a warning-level log message (and counter bump, ideally) would be appropriate. If we wanted to augment collections with warnings or something like that, I think that'd also be fine, but I'm not sure it's worth doing right now.

The other possibility is that two MGS's reliably report inconsistent data for the same device even though the underlying data isn't changing. I can imagine this happening (i.e., due to a bug in one of them), but dealing with it in general seems pretty hard and it doesn't seem all that likely to me. Being able to tell in a support context that it's happened seems like the right level of effort.

That's just an explanation of how I think we got here. We could certainly change any of this!

nexus/types/src/inventory.rs

davepacheco · 2023-11-15T17:38:36Z

schema/crdb/11.0.0/up2.sql

@@ -0,0 +1,2 @@
+CREATE UNIQUE INDEX IF NOT EXISTS root_of_trust_page_properties


This does make me slightly nervous (because the values are so big and the domain of possible values is so big) but I guess it's not really worse than any other string and in practice we don't expect to have that many values here.

I'm also slightly nervous about this, but (a) I believe these change even less frequently than cabooses (e.g., when keys are revoked, for which we don't yet have a process) and (b) I'm not sure what an alternative would be. Without this index we wouldn't be able to ensure uniqueness "by hand", right, since querying to see if a given page exists would require a full table scan?

nexus/db-queries/src/db/datastore/inventory.rs

@davepacheco

Noticed this while investigating #5299; I am unsure whether this will actually resolve #5299 though. I made this same mistake the first time I added something new to inventory; @davepacheco and I discussed the issues with trying to test for this at #4496 (comment). I'm not sure anything has materially changed there, but maybe testing this warrants a closer look.

jgallagher added 15 commits November 14, 2023 13:30

add db machinery for RoT pages (minus migration .sql)

7e7ca77

add schema migration

3642c31

add rot pages to CollectionBuilder tests

8040ce1

add inventory rot-pages subcommand to omdb

1970fec

add rot pages to collection insertion queries

1980df4

fix schema typo

907751c

fix semicolons nixing log statement details

5a8b72b

extended db-query inventory test for rot pages

1f81cfe

remove rot page rows when cleaning inventory collections

d31e28d

add stub rot page support to simulated SPs

2505935

add rot pages to nexus inventory collector

08e038d

add rot pages to omdb collections show

9036a45

omdb: truncate rot page base64 strings in output (overrideable)

e80f71b

fix simulated sp reporting nonsense trailing data in rot pages

ba8b920

cleanup after rebase: move schema migration from 10.0.0 to 11.0.0

62ccdfd

jgallagher requested a review from davepacheco November 14, 2023 18:54

jgallagher commented Nov 14, 2023

View reviewed changes

nexus/db-queries/src/db/datastore/inventory.rs Outdated Show resolved Hide resolved

jgallagher commented Nov 14, 2023

View reviewed changes

davepacheco approved these changes Nov 15, 2023

View reviewed changes

jgallagher added 6 commits November 20, 2023 15:13

bump schema to 12.0.0

f99b081

Merge branch 'main' into john/inventory-rot-cmpa-cfpa

78d2ca8

fixups from main merge

dd55df0

review feedback

275fc70

Merge branch 'main' into john/inventory-rot-cmpa-cfpa

d9889a9

bump schema after merging main

500ab90

jgallagher merged commit 5c6ad08 into main Nov 21, 2023
20 of 21 checks passed

jgallagher deleted the john/inventory-rot-cmpa-cfpa branch November 21, 2023 21:57

davepacheco assigned jgallagher Dec 1, 2023

jgallagher mentioned this pull request Dec 6, 2023

500 error when attempting to list uninitialized sleds #4621

Closed

jgallagher mentioned this pull request Mar 21, 2024

Prune inv_zpool when pruning collections #5301

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nexus inventory: Add collection of RoT CMPA and CFPA pages #4496

Nexus inventory: Add collection of RoT CMPA and CFPA pages #4496

jgallagher commented Nov 14, 2023

jgallagher Nov 14, 2023

davepacheco Nov 15, 2023

jgallagher Nov 15, 2023

davepacheco Nov 21, 2023

davepacheco left a comment

davepacheco Nov 15, 2023

davepacheco Nov 15, 2023

jgallagher Nov 15, 2023

davepacheco Nov 21, 2023

		@@ -0,0 +1,2 @@
		CREATE UNIQUE INDEX IF NOT EXISTS root_of_trust_page_properties

Nexus inventory: Add collection of RoT CMPA and CFPA pages #4496

Nexus inventory: Add collection of RoT CMPA and CFPA pages #4496

Conversation

jgallagher commented Nov 14, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

davepacheco left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment