what's involved with removing a sled? #4787
Comments
See also #4719. There is a ton of overlap between the two (removal of a sled seems like it implicitly "deactivates" all disks attached to that sled).
There are some related tickets for reference:
Besides the above,
There was also a "proof of concept" when @augustuswm had to remove sled 10 from rack3. The POC obviously didn't include disk/log data migration but covered all the database things, as noted in https://github.com/oxidecomputer/colo/issues/46.
#612 is also related to this.
For reference, I think this is now covered by #4872.
Here are some notes from a bit of digging I just did. It's not exactly comprehensive but I wanted to look and see if there were obvious pieces we may have missed. Broadly, we can divide sled state (or the cleanup actions for that state) into three categories:
In this issue, I'm mostly concerned with category 1. RFD 459 (which is still a work in progress) discusses categories 2 and 3.
To summarize category 2: instances on an expunged sled need to enter a failure path similar to what would happen if the sled rebooted. This depends on #4872. Crucible regions on an expunged sled need to be treated as gone forever, with Omicron and Crucible machinery getting kicked off to restore the expected number of copies for any affected volumes.
Category 3 is complicated -- see the RFD for more.
Database state
Obviously there's one record in the
Besides those direct consumers, physical disks are referenced by the
I've filed new tickets:
Other persistent state
Internal and external DNS generally need to be updated when a sled is expunged. This work has already been done via #4989 and #5212.
Metric data in ClickHouse presumably should remain unchanged, since it still reflects useful historical information about that sled.
Switches have state about sleds (e.g., routes). These are configured via Dendrite. Existing background tasks in Nexus take
Runtime state
Generally, Nexus doesn't keep in-memory state outside the main set of database tables. Sagas are an important exception: they may have in-memory state and even persistent state (outputs from saga nodes) that may be invalidated when a sled (or a component running on a sled) becomes expunged. These are reflected in a few issues:
For sagas that are trying forever to reach a component that's now gone: these should probably check the current target blueprint and stop trying to contact things that are known to be gone. They need to make a saga-specific decision about what that means. For regular actions, they could treat this as a failure (triggering an unwind); for unwind actions, we may want to design flows so that sagas don't need to do anything in this case. (For example, if the unwind action is cleaning something up that was on some instance that has since been expunged, the expungement of that instance probably ought to be responsible for cleaning that thing up, rather than the saga.)
#4259 is a little different in that a reliable persistent workflow (RPW) approach probably makes more sense.
That's about all I plan to do here for now. We may of course find new things during testing.
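The saga rule described in this comment (regular actions fail and unwind, unwind actions do nothing once the target is known to be gone) could be sketched roughly as below. This is only an illustration in Python, not Nexus code: `Phase`, `Outcome`, `decide`, and the idea of passing the blueprint's expunged components as a plain set of ids are all assumptions made for the example.

```python
from enum import Enum


class Phase(Enum):
    ACTION = "action"  # forward step of a saga node
    UNDO = "undo"      # unwind step of a saga node


class Outcome(Enum):
    RETRY = "retry"  # target is not known to be gone; keep trying
    FAIL = "fail"    # give up, which triggers an unwind
    SKIP = "skip"    # nothing left for the saga to clean up


def decide(target_id, expunged_ids, phase):
    """Decide what a saga node does after failing to reach `target_id`.

    `expunged_ids` stands in for the set of components that the current
    target blueprint marks as expunged (a hypothetical representation).
    """
    if target_id not in expunged_ids:
        return Outcome.RETRY
    if phase is Phase.ACTION:
        return Outcome.FAIL
    # Undo step against an expunged target: expungement itself is assumed
    # to own the cleanup, so the saga does nothing.
    return Outcome.SKIP
```

A real saga would still need its own judgment about what "gone" means for its particular target; the sketch only captures the default split between the two phases.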
It'd be useful to have a written summary of exactly what has to happen as part of sled removal. This is analogous to #4651 but will include more things because we have to clean up anything that's been associated with the sled in its lifetime.
Off the top of my head, I assume we need to update/remove entries from: sled, physical_disk, zpool. Maybe dataset too? There's also stuff related to instances and Crucible regions. What exactly has to happen there?
Next step for me would probably be to search the schema for foreign keys pointing at any of these things (e.g., sled_id) and repeat recursively until we find nothing new.
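The recursive search described here amounts to a breadth-first traversal over "which table references which". A minimal Python sketch follows, assuming the references have already been collected by hand into a map. The entries in `REFERENCES` and the helper name `tables_reaching` are illustrative assumptions, not taken from the actual schema.

```python
from collections import deque

# Hypothetical map: table -> list of (column, referenced_table).
# These entries are illustrative only and are not read from the real schema.
REFERENCES = {
    "physical_disk": [("sled_id", "sled")],
    "zpool": [("sled_id", "sled"), ("physical_disk_id", "physical_disk")],
    "dataset": [("pool_id", "zpool")],
    "unrelated": [("project_id", "project")],
}


def tables_reaching(roots, references):
    """Return every table that references one of `roots`, directly or transitively."""
    # Invert the map: referenced table -> set of tables that point at it.
    referrers = {}
    for table, refs in references.items():
        for _column, target in refs:
            referrers.setdefault(target, set()).add(table)

    seen = set(roots)
    queue = deque(roots)
    while queue:
        current = queue.popleft()
        for table in sorted(referrers.get(current, ())):
            if table not in seen:  # stop once nothing new turns up
                seen.add(table)
                queue.append(table)
    return seen - set(roots)


affected = tables_reaching(["sled"], REFERENCES)
print(sorted(affected))
```

Tracking a `seen` set is what makes "repeat until we find nothing new" terminate, even if the reference graph contains cycles.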