-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nexus inventory: Add collection of RoT CMPA and CFPA pages #4496
Conversation
// either have kept the original data or returned Ok while returning an | ||
// error? It seems a little strange we returned Err but accepted the new | ||
// data. | ||
assert_eq!(rot_page.page.data_base64, rot_page2.data_base64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the second test issue I mentioned in the PR description. The caboose behaves this way so I made it match, but I'm not sure it's right - should we at least be logging a collection error if we hit this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm understanding it right: I think the current behavior is okay and also the improvements you alluded to would be good, too. My thinking is this:
- It seems very possible (although probably uncommon) to get two different responses here due to the underlying data having changed in between the two calls (e.g., if we did an inventory collection during an update, we might see the caboose contents change). In this case, there's an intrinsic race, and I don't think it matters much which data we take, especially as long as we record which one it was (source and timestamp). A consumer that was depending on getting the later data really needs to wait for a collection to complete that starts after whatever operation they want it to be after. (I realize in the current implementation we'll take the later one anyway. Similarly if someone was depending on it being the earlier one, they'd have to look at a collection that finished before whatever operation they care about happened.)
- The builder returns an error so that the caller can at least tell that something unusual happened and can report it. What level of reporting is appropriate? I think calling this a collection error is misleading. Those are intended to communicate that a particular collection is likely to be incomplete or invalid, which isn't the case in the example I mentioned and I don't think we want that sort of race to invalidate any concurrent collection. I think a warning-level log message (and counter bump, ideally) would be appropriate. If we wanted to augment collections with warnings or something like that, I think that'd also be fine, but I'm not sure it's worth doing right now.
- The other possibility is that two MGS's reliably report inconsistent data for the same device even though the underlying data isn't changing. I can imagine this happening (i.e., due to a bug in one of them), but dealing with it in general seems pretty hard and it doesn't seem all that likely to me. Being able to tell in a support context that it's happened seems like the right level of effort.
That's just an explanation of how I think we got here. We could certainly change any of this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense, thanks. I was only thinking of your third bullet point (or slight variations, such as an SP / RoT reporting inconsistent data itself, even if MGS isn't buggy), but you're right that this seems we're much more likely to hit this by racing with an update, which shouldn't cause a failure.
I think a warning-level log message (and counter bump, ideally) would be appropriate.
Collector
does log this (at error!
) level currently - want me to change them to warn!
, and possibly include a "was an update happening simultaneously?" (either in the log message itself or in a comment around where the log message is generated)?
What do you mean by a counter bump?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't bother changing it.
By counter bump I mean adding Oximeter-collected metrics to this subsystem and having an error counter for this case. That'd be welcome but obviously beyond the scope of this PR!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
// either have kept the original data or returned Ok while returning an | ||
// error? It seems a little strange we returned Err but accepted the new | ||
// data. | ||
assert_eq!(rot_page.page.data_base64, rot_page2.data_base64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm understanding it right: I think the current behavior is okay and also the improvements you alluded to would be good, too. My thinking is this:
- It seems very possible (although probably uncommon) to get two different responses here due to the underlying data having changed in between the two calls (e.g., if we did an inventory collection during an update, we might see the caboose contents change). In this case, there's an intrinsic race, and I don't think it matters much which data we take, especially as long as we record which one it was (source and timestamp). A consumer that was depending on getting the later data really needs to wait for a collection to complete that starts after whatever operation they want it to be after. (I realize in the current implementation we'll take the later one anyway. Similarly if someone was depending on it being the earlier one, they'd have to look at a collection that finished before whatever operation they care about happened.)
- The builder returns an error so that the caller can at least tell that something unusual happened and can report it. What level of reporting is appropriate? I think calling this a collection error is misleading. Those are intended to communicate that a particular collection is likely to be incomplete or invalid, which isn't the case in the example I mentioned and I don't think we want that sort of race to invalidate any concurrent collection. I think a warning-level log message (and counter bump, ideally) would be appropriate. If we wanted to augment collections with warnings or something like that, I think that'd also be fine, but I'm not sure it's worth doing right now.
- The other possibility is that two MGS's reliably report inconsistent data for the same device even though the underlying data isn't changing. I can imagine this happening (i.e., due to a bug in one of them), but dealing with it in general seems pretty hard and it doesn't seem all that likely to me. Being able to tell in a support context that it's happened seems like the right level of effort.
That's just an explanation of how I think we got here. We could certainly change any of this!
schema/crdb/11.0.0/up2.sql
Outdated
@@ -0,0 +1,2 @@ | |||
CREATE UNIQUE INDEX IF NOT EXISTS root_of_trust_page_properties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does make me slightly nervous (because the values are so big and the domain of possible values is so big) but I guess it's not really worse than any other string and in practice we don't expect to have that many values here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also slightly nervous about this, but (a) I believe these change even less frequently than cabooses (e.g., when keys are revoked, for which we don't yet have a process) and (b) I'm not sure what an alternative would be. Without this index we wouldn't be able to ensure uniqueness "by hand", right, since querying to see if a given page exists would require a full table scan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
Noticed this while investigating #5299; I am unsure whether this will actually resolve #5299 though. I made this same mistake the first time I added something new to inventory; @davepacheco and I discussed the issues with trying to test for this at #4496 (comment). I'm not sure anything has materially changed there, but maybe testing this warrants a closer look.
The RoT can report four different 512-byte pages (CMPA, and CFPA active/inactive/scratch). Given multiple RoT artifacts that are viable (match the right board, etc.) but are signed with different keys, these pages are required to identify which archive was signed with a key that the RoT will accept. This PR adds collection of these pages to the inventory system added in #4291.
The implementation here is fairly bulky but very mechanical, and is implemented almost identically to the way we collect cabooses: there's an
rot_page_which
to identify which of the four kinds of page it is, and a table for storing the relatively small number of raw page data values. Most of the changes in this PR resulted from "find where we're doing something for cabooses, then do the analogous thing for RoT pages". There are a couple minor quibbles in the unit tests that I'll point out by leaving comments below.The RoT pages now show up when viewing a collection through omdb (note that the quite long base64 string is truncated; there's a command line flag to override the truncation and show the full string):
There's also a new
omdb
subcommand to report the RoT pages (which does not truncate, but if we think it should that'd be easy to change):