[Sled Agent] Expunged disks are not in use after omicron_physical_disks_ensure #5965
Conversation
Hello, all! I've added multiple reviewers to this PR because it's pretty cross-cutting, and I wanted to give a "heads-up" that I'm poking at several different subcomponents of the Sled Agent.
I recommend everyone start reviewing this PR from sled-agent/src/sled_agent.rs, in the omicron_physical_disks_ensure function. That provides a top-down view of the changes in this PR.
Basically:
- First, we update the set of control plane disks used by this Sled (same as before this PR). This can add or remove disks.
- Then we ask the following subsystems to "stop using old disks":
- StorageMonitor / dump devices
- Zone bundler
- Probe manager
- Instance Manager
- If and only if all those subsystems let go of old disks, we return to the caller (Nexus, presumably) and let them know that we've stopped using expunged devices.
Given that this touches so many different subsystems, I wanted to give folks an appropriate heads-up about "what is happening".
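For readers skimming the thread, here is a minimal, self-contained sketch of the ordering described above. It is plain Rust rather than the actual sled-agent code: `Subsystem`, `use_only_these_disks`, and `physical_disks_ensure` are illustrative stand-ins for the real StorageMonitor / ZoneBundler / ProbeManager / InstanceManager interfaces and the Sled Agent endpoint.

```rust
// Sketch of "update the disk set, then make every subsystem drop old disks,
// then report success". Not the sled-agent implementation.
use std::collections::BTreeSet;

type DiskId = u32;

struct Subsystem {
    name: &'static str,
    in_use: BTreeSet<DiskId>,
}

impl Subsystem {
    // Drop any disks not in `allowed`; report an error if we could not.
    fn use_only_these_disks(&mut self, allowed: &BTreeSet<DiskId>) -> Result<(), String> {
        self.in_use.retain(|d| allowed.contains(d));
        Ok(())
    }
}

fn physical_disks_ensure(
    control_plane_disks: &mut BTreeSet<DiskId>,
    requested: BTreeSet<DiskId>,
    subsystems: &mut [Subsystem],
) -> Result<(), String> {
    // Step 1: adopt the new set of control plane disks (adds and removals).
    *control_plane_disks = requested;

    // Step 2: tell every subsystem to stop using disks outside the new set.
    for s in subsystems.iter_mut() {
        s.use_only_these_disks(control_plane_disks)
            .map_err(|e| format!("{} failed to release old disks: {e}", s.name))?;
    }

    // Step 3: only now report to the caller (Nexus) that the expunged disks
    // are no longer in use.
    Ok(())
}

fn main() {
    let mut disks: BTreeSet<DiskId> = [1, 2, 3].into();
    let mut subsystems = vec![
        Subsystem { name: "dump-setup", in_use: [1, 2].into() },
        Subsystem { name: "zone-bundler", in_use: [2, 3].into() },
    ];
    // Expunge disk 3: the newly requested set omits it.
    physical_disks_ensure(&mut disks, [1, 2].into(), &mut subsystems).unwrap();
    assert!(subsystems.iter().all(|s| !s.in_use.contains(&3)));
    println!("all subsystems released the expunged disk");
}
```

The property the PR is after is the last step: the caller only hears "success" once every subsystem has dropped its references to the expunged disks.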
- @lifning - would you be willing to look at the storage monitor / dump setup?
- @bnaecker - would you be willing to look at the zone bundler?
- @rcgoodfellow - would you be willing to look at the probe manager?
- @hawkw - would you be willing to look at the instance manager?
sled-agent/src/instance_manager.rs (outdated)

```rust
for id in to_remove {
    info!(self.log, "only_use_disks: Removing instance"; "id" => ?id);
    self.instances.remove(&id);
}
```
What actually happens when we remove an instance from this map? What happens to the instance as a result?
Per the comment in sled_agent.rs, it seems like we want to mark as failed any instances which are using expunged disks, but I don't actually see where that happens.
Great question - we don't actually explicitly mark the instance failed here, we just stop it from running.
This is basically the same pathway a VMM could use to terminate itself.
This is the object we end up removing via the call to self.instances.remove:
omicron/sled-agent/src/instance.rs, lines 904 to 910 (at 30b6713):

```rust
/// A reference to a single instance running a running Propolis server.
pub struct Instance {
    tx: mpsc::Sender<InstanceRequest>,

    #[allow(dead_code)]
    runner_handle: tokio::task::JoinHandle<()>,
}
```
And by dropping it, we remove the only sender side of an mpsc of InstanceRequests. This should trigger the self-terminating case of the InstanceRunner:
omicron/sled-agent/src/instance.rs, lines 450 to 454 (at 30b6713):

```rust
None => {
    warn!(self.log, "Instance request channel closed; shutting down");
    self.terminate().await;
    break;
},
```
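As a standalone illustration of the drop-driven shutdown being described here, the sketch below uses hypothetical names (`Handle`, `Request`, `instance_runner`) rather than the real `Instance`/`InstanceRunner` types: the handle owns the only `mpsc::Sender`, so dropping the handle closes the channel, the runner's `recv()` returns `None`, and the runner cleans up and exits on its own.

```rust
// Toy version of "dropping the handle terminates the runner".
use tokio::sync::mpsc;

#[derive(Debug)]
enum Request {
    Ping,
}

struct Handle {
    tx: mpsc::Sender<Request>,
    runner: tokio::task::JoinHandle<()>,
}

async fn instance_runner(mut rx: mpsc::Receiver<Request>) {
    loop {
        match rx.recv().await {
            Some(req) => println!("runner got {req:?}"),
            // All senders dropped: mirrors the `None => { terminate(); break }`
            // arm quoted above.
            None => {
                println!("request channel closed; shutting down");
                break;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(8);
    let handle = Handle { tx, runner: tokio::spawn(instance_runner(rx)) };

    handle.tx.send(Request::Ping).await.unwrap();

    // Dropping the only sender is what `self.instances.remove(&id)` amounts
    // to for an `Instance`: the runner then sees `None` and exits.
    let Handle { tx, runner } = handle;
    drop(tx);
    runner.await.unwrap();
}
```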
This is kinda why I mentioned I'd be relying on the background task -- as implemented, I'd need to wait for a background task to notice that the instance is not running on the sled.
I can introduce an upcall to nexus, but I figured that could throw more of a wrench into the instance lifecycle revamp. Lemme know what you prefer!
Coming back to this after a couple days, I guess it's possible that we return from the omicron_physical_disks_ensure function before the instance runner has finished terminating. I'll explicitly terminate here too, which should ensure the runner is not continuing to process the request.
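A hedged sketch of the "explicitly terminate and wait" idea mentioned here, under the assumption of a request/acknowledgement shape like the channel example earlier in the thread; `Terminate`, `terminate()`, and the oneshot acknowledgement are illustrative names, not the actual sled-agent API.

```rust
// Explicit, awaited termination instead of relying on the runner noticing a
// closed channel at some later point.
use tokio::sync::{mpsc, oneshot};

enum Request {
    Terminate { done: oneshot::Sender<()> },
}

struct Handle {
    tx: mpsc::Sender<Request>,
}

impl Handle {
    async fn terminate(&self) {
        let (done_tx, done_rx) = oneshot::channel();
        // If the runner already exited, the send/recv errors are fine to
        // ignore: either way it is no longer doing work.
        let _ = self.tx.send(Request::Terminate { done: done_tx }).await;
        let _ = done_rx.await;
    }
}

async fn runner(mut rx: mpsc::Receiver<Request>) {
    while let Some(req) = rx.recv().await {
        match req {
            Request::Terminate { done } => {
                // ... tear down the VMM, release zpool-backed resources ...
                let _ = done.send(());
                break;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(8);
    let task = tokio::spawn(runner(rx));
    let handle = Handle { tx };

    // By the time this returns, the runner has finished its cleanup, so the
    // endpoint handler can safely report that the old disks are unused.
    handle.terminate().await;
    drop(handle);
    task.await.unwrap();
}
```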
Done in e360dae
> we don't actually explicitly mark the instance failed here, we just stop it from running.
That's good! Eventually, sled-agent will never mark instances failed. I think we do want to mark the vmm state as failed, though.
Ack, thanks for bearing with me. I'm giving this a shot in: f242e0a
Right now, I'm just passing an extra boolean to some of the termination functions to decide what VMM state to propagate. Later, I think it might be useful to expand this to a stringified message too, so we can gain some better diagnostic info on why an instance failed or stopped.
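A rough sketch of the boolean-controlled state propagation described here, not the code in f242e0a; the `VmmState` values, `mark_failed` flag, and function names are stand-ins.

```rust
// The termination path takes a flag saying whether the VMM should be
// reported as Failed (e.g. disk expungement) or Stopped (a normal stop).
#[derive(Debug, PartialEq)]
enum VmmState {
    Stopped,
    Failed,
}

struct Vmm {
    state: VmmState,
}

impl Vmm {
    // `mark_failed` distinguishes "torn down because its disk was expunged"
    // from an ordinary stop.
    fn terminate(&mut self, mark_failed: bool) {
        // ... stop Propolis, clean up zones, etc. ...
        self.state = if mark_failed { VmmState::Failed } else { VmmState::Stopped };
    }

    // Possible later shape hinted at above: carry a reason alongside the
    // state so the control plane can surface *why* the VMM failed.
    fn terminate_with_reason(&mut self, mark_failed: bool, reason: &str) {
        self.terminate(mark_failed);
        eprintln!("vmm terminated ({:?}): {reason}", self.state);
    }
}

fn main() {
    let mut vmm = Vmm { state: VmmState::Stopped };
    vmm.terminate_with_reason(true, "underlying physical disk was expunged");
    assert_eq!(vmm.state, VmmState::Failed);
}
```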
changes to StorageMonitor and DumpSetup look good to me! (one nit on a variable name)
sled-agent/src/dump_setup.rs (outdated)

```rust
// This is particularly useful for disk expungement, when a caller
// wants to ensure that the dump device is no longer accessing an
// old device.
let mut last_update_complete_tx = None;
```
The prefix last_ is ever-so-slightly confusing as to its purpose - there's no code path where it isn't .take()n later in the same iteration of the loop where it was = Some()'d. (Of course, current_ wouldn't be particularly clear either... perhaps update_and_archiving_complete_tx or something of the sort?)
How about evaluation_and_archiving_complete_tx? The fact that it's optional, IMO, implies that it may or may not be set, but we do respect that contract if it is set within a loop.
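For context on the pattern being named, the following is a self-contained toy version of a per-iteration optional completion sender, not the dump_setup code itself; the request names borrow from the suggestion above and are hypothetical.

```rust
// Each pass through the worker loop may carry an optional completion sender:
// stored as Some(tx) when the request arrives, and .take()n and fired once
// evaluation and archiving for that pass have finished.
use tokio::sync::{mpsc, oneshot};

enum Request {
    UpdateDisks { evaluation_and_archiving_complete_tx: oneshot::Sender<()> },
    ArchiveFiles,
}

async fn worker(mut rx: mpsc::Receiver<Request>) {
    while let Some(req) = rx.recv().await {
        // Set per-iteration; only some requests want a completion signal.
        let mut evaluation_and_archiving_complete_tx = None;
        match req {
            Request::UpdateDisks { evaluation_and_archiving_complete_tx: tx } => {
                evaluation_and_archiving_complete_tx = Some(tx);
            }
            Request::ArchiveFiles => {}
        }

        // ... re-evaluate dump/debug devices, archive old logs and dumps ...

        // Always consumed in the same iteration it was set, which is why the
        // `last_` prefix read oddly in review.
        if let Some(tx) = evaluation_and_archiving_complete_tx.take() {
            let _ = tx.send(());
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(4);
    let worker_task = tokio::spawn(worker(rx));

    tx.send(Request::ArchiveFiles).await.unwrap();

    // Ask the worker to re-evaluate its disks and wait until it has finished,
    // i.e. until the dump device is no longer using anything we removed.
    let (done_tx, done_rx) = oneshot::channel();
    tx.send(Request::UpdateDisks { evaluation_and_archiving_complete_tx: done_tx })
        .await
        .unwrap();
    done_rx.await.unwrap();
    println!("dump setup has released the old disks");

    drop(tx);
    worker_task.await.unwrap();
}
```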
Done in 691bc85
LGTM. Took this branch for a spin and ran the probes connectivity test in a4x2; things look good there.
```console
$ pfexec ../target/debug/commtest \
  --api-timeout 30m \
  http://198.51.100.23 run \
  --ip-pool-begin 198.51.100.40 \
  --ip-pool-end 198.51.100.70 \
  --icmp-loss-tolerance 500 \
  --test-duration 200s \
  --packet-rate 10
the api is up
logging in ... done
project does not exist, creating ... done
default ip pool does not exist, creating ...done
ip range does not exist, creating ... done
getting sled ids ... done
checking if probe0 exists
probe0 does not exist, creating ... done
checking if probe1 exists
probe1 does not exist, creating ... done
checking if probe2 exists
probe2 does not exist, creating ... done
checking if probe3 exists
API call error: Communication Error: error sending request for url (http://198.51.100.23/experimental/v1/probes/probe3?project=classone): connection error: Connection timed out (os error 145), retrying in 3 s
probe3 does not exist, creating ... done
testing connectivity to probes
addr           low    avg    high   last   sent  received  lost
198.51.100.41  0.644  1.186  2.225  1.006  1998  1988      9
198.51.100.40  0.666  1.095  2.556  1.259  1998  1999      0
198.51.100.42  0.691  1.289  5.446  0.867  1998  1997      0
198.51.100.43  0.623  1.089  3.092  1.288  1998  1770      2
all connectivity tests within loss tolerance
```
I'm wiring up the command to actually expunge sleds in #5994, which is built on top of this PR. The basic plumbing appears to work there, though I'm admittedly not running instances on a4x2.
```diff
@@ -333,7 +333,10 @@ impl InstanceManager {
     ///
     /// This function looks for transient zone filesystem usage on expunged
     /// zpools.
-    pub async fn only_use_disks(&self, disks: AllDisks) -> Result<(), Error> {
+    pub async fn use_only_these_disks(
```
<3
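A hedged sketch of what a use_only_these_disks-style filter amounts to, going by the doc comment in the diff above ("looks for transient zone filesystem usage on expunged zpools"); the types here (`Instance`, `ZpoolName`) are simplified stand-ins rather than the real sled-agent definitions.

```rust
// Collect instances whose backing zpool is no longer allowed, then tear them
// down and drop them from the map.
use std::collections::{BTreeMap, BTreeSet};

type InstanceId = u32;
type ZpoolName = String;

struct Instance {
    zpool: ZpoolName,
}

fn use_only_these_disks(
    instances: &mut BTreeMap<InstanceId, Instance>,
    allowed_zpools: &BTreeSet<ZpoolName>,
) {
    let to_remove: Vec<InstanceId> = instances
        .iter()
        .filter(|(_, instance)| !allowed_zpools.contains(&instance.zpool))
        .map(|(id, _)| *id)
        .collect();

    for id in to_remove {
        println!("use_only_these_disks: removing instance {id}");
        // Per the discussion earlier in this thread, the real code also
        // terminates the instance runner and marks the VMM failed here.
        instances.remove(&id);
    }
}

fn main() {
    let mut instances = BTreeMap::from([
        (1, Instance { zpool: "oxp_aaaa".to_string() }),
        (2, Instance { zpool: "oxp_bbbb".to_string() }),
    ]);
    let allowed: BTreeSet<ZpoolName> = BTreeSet::from(["oxp_aaaa".to_string()]);

    use_only_these_disks(&mut instances, &allowed);
    assert!(instances.contains_key(&1) && !instances.contains_key(&2));
}
```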
Provides an internal API to remove disks, and wires it into omdb. Additionally, expands omdb commands for visibility.

- `omdb db physical-disks` can be used to view all "control plane physical disks". This is similar to, but distinct from, the `omdb db inventory physical-disks` command, as it reports control plane disks that have been adopted in the control plane. This command is necessary for identifying the UUID of the associated control plane object, which is not observable via inventory.
- `omdb nexus sleds expunge-disk` can be used to expunge a physical disk from a sled. This relies on many prior patches to operate correctly, but with the combination of #5987, #5965, #5931, #5952, #5601, and #5599, we can observe the following behavior: expunging a disk leads to all "users" of that disk (zone filesystems, datasets, zone bundles, etc.) being removed.

I tested this PR on a4x2 using the following steps:

```bash
# Boot a4x2, confirm the Nexus zone is running
# From g0, zlogin oxz_switch

$ omdb db sleds
SERIAL  IP                             ROLE      POLICY      STATE   ID
g2      [fd00:1122:3344:103::1]:12345  -         in service  active  29fede5f-37e4-4528-bcf2-f3ee94924894
g0      [fd00:1122:3344:101::1]:12345  scrimlet  in service  active  6a2c7019-d055-4256-8bad-042b97aa0e5e
g1      [fd00:1122:3344:102::1]:12345  -         in service  active  a611b43e-3995-4cd4-9603-89ca6aca3dc5
g3      [fd00:1122:3344:104::1]:12345  scrimlet  in service  active  f62f2cfe-d17b-4bd6-ae64-57e8224d3672

# We'll plan on expunging a disk on g1, and observing the effects.
$ export SLED_ID=a611b43e-3995-4cd4-9603-89ca6aca3dc5
$ export OMDB_SLED_AGENT_URL=http://[fd00:1122:3344:102::1]:12345
$ omdb sled-agent zones list
"oxz_cockroachdb_b3fecda8-2eb8-4ff3-9cf6-90c94fba7c50"
"oxz_crucible_19831c98-3137-4af4-a93d-fc1a17c138f2"
"oxz_crucible_6adcb8ec-6c9e-4e8a-a8d4-bbf9ad44e2c4"
"oxz_crucible_74b2f587-10ce-4131-97fd-9832c52c8a41"
"oxz_crucible_9e422508-f4d5-4c24-8dde-0080c0916419"
"oxz_crucible_a47e9625-d189-4001-877a-cc3aa5b1f3eb"
"oxz_crucible_pantry_c3b4e3cb-3e23-4f5e-921b-04e4801924fd"
"oxz_external_dns_7e669b6f-a3fe-47a9-addd-20e42c58b8bb"
"oxz_internal_dns_1a45a6e8-5b03-4ab4-a3db-e83fb7767767"
"oxz_ntp_209ad0d0-a5e7-4ab8-ac8f-e99902697b32"
"oxz_oximeter_864efebb-790f-4b7a-8377-b2c82c87f5b8"

$ omdb db physical-disks | grep $SLED_ID
ID                                    SERIAL                 VENDOR            MODEL               SLED_ID                               POLICY      STATE
23524716-a331-4d57-aa71-8bd4dbc916f8  synthetic-serial-g1_0  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active
3ca1812b-55e3-47ed-861f-f667f626c8a0  synthetic-serial-g1_3  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active
40139afb-7076-45d9-84cf-b96eefe7acf8  synthetic-serial-g1_1  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active
5c8e33dd-1230-4214-af78-9be892d9f421  synthetic-serial-g1_4  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active
85780bbf-8e2d-481e-9013-34611572f191  synthetic-serial-g1_2  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active

# Let's expunge the "0th" disk here.
$ omdb nexus sleds expunge-disk 23524716-a331-4d57-aa71-8bd4dbc916f8 -w
$ omdb nexus blueprints regenerate -w
$ omdb nexus blueprints show $NEW_BLUEPRINT_ID

# Observe that the new blueprint for the sled expunges some zones -- minimally,
# the Crucible zone -- and no longer lists the "g1_0" disk. This should also be
# summarized in the blueprint metadata comment.

$ omdb nexus blueprints target set $NEW_BLUEPRINT_ID enabled -w
$ omdb sled-agent zones list
zones:
"oxz_crucible_19831c98-3137-4af4-a93d-fc1a17c138f2"
"oxz_crucible_74b2f587-10ce-4131-97fd-9832c52c8a41"
"oxz_crucible_9e422508-f4d5-4c24-8dde-0080c0916419"
"oxz_crucible_a47e9625-d189-4001-877a-cc3aa5b1f3eb"
"oxz_crucible_pantry_c3b4e3cb-3e23-4f5e-921b-04e4801924fd"
"oxz_ntp_209ad0d0-a5e7-4ab8-ac8f-e99902697b32"
"oxz_oximeter_864efebb-790f-4b7a-8377-b2c82c87f5b8"

# As we can see, the expunged zones have been removed.
# We can also access the sled agent logs from g1 to observe that the expected
# requests have been sent to adjust the set of control plane disks and expunge
# the expected zones.
```

This is a major part of #4719

Fixes #5370
omicron_physical_disks_ensure is an API exposed by Sled Agent that allows Nexus to control the set of "active" control plane disks. Although this API was exposed, it previously did not stop the Sled Agent from using expunged disks under all circumstances. This PR adjusts the endpoint to "flush out" all old usage of disks before returning.

Fixes #5929