
[Sled Agent] Expunged disks are not in use after omicron_physical_disks_ensure #5965

Merged (70 commits) Jul 15, 2024

Conversation

@smklein (Collaborator) commented Jun 27, 2024

omicron_physical_disks_ensure is an API exposed by the Sled Agent which allows Nexus to control the set of
"active" control plane disks.

Although this API was exposed, it did not previously stop the Sled Agent from using expunged disks in all
circumstances. This PR adjusts the endpoint to "flush out" all old usage of disks before returning.

This PR:

  • Ensures dump device management lets go of expunged U.2s
  • Ensures Zone bundles let go of expunged U.2s
  • Removes any network probes allocated with a transient filesystem on an expunged U.2
  • Removes any VMMs allocated with a transient filesystem on an expunged U.2

Fixes #5929

Collaborator Author (@smklein) left a comment:

Hello, all! I've added multiple reviewers to this PR because it's pretty cross-cutting, and I wanted to give a "heads-up" that I'm poking at several different subcomponents of the Sled Agent.

I recommend everyone start reviewing this PR from sled-agent/src/sled_agent.rs, in the omicron_physical_disks_ensure function. That provides a top-down view of the changes in this PR.

Basically:

  • First, we update the set of control plane disks used by this Sled (same as before this PR). This can add or remove disks.
  • Then we ask the following subsystems to "stop using old disks":
      • StorageMonitor / dump devices
      • Zone bundler
      • Probe manager
      • Instance Manager
  • If and only if all those subsystems let go of old disks, we return to the caller (Nexus, presumably) and let them know that we've stopped using expunged devices (sketched roughly below).
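
In rough pseudocode, that ordering looks something like the sketch below; the handle and method names are illustrative stand-ins, not the exact sled-agent APIs.

```rust
// Sketch only: illustrative handle/method names, not the real sled-agent code.
pub async fn omicron_physical_disks_ensure(
    &self,
    config: OmicronPhysicalDisksConfig,
) -> Result<DisksManagementResult, Error> {
    // 1. Update the set of control plane disks (this part predates this PR).
    let result =
        self.storage_manager.omicron_physical_disks_ensure(config).await?;

    // The now-current view of disks; anything not in it must be released.
    let disks = self.storage_manager.get_latest_disks().await;

    // 2. Ask each subsystem to stop using disks that are no longer present.
    self.storage_monitor.await_new_config(&disks).await?;     // dump devices
    self.zone_bundler.use_only_these_disks(&disks).await?;    // zone bundles
    self.probe_manager.use_only_these_disks(&disks).await?;   // probes
    self.instance_manager.use_only_these_disks(disks).await?; // VMMs

    // 3. Only once every subsystem has let go do we report success to Nexus.
    Ok(result)
}
```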

Given that this touches so many different subsystems, I wanted to give folks an appropriate heads-up about "what is happening".

  • @lifning - would you be willing to look at the storage monitor / dump setup?
  • @bnaecker - would you be willing to look at the zone bundler?
  • @rcgoodfellow - would you be willing to look at the probe manager?
  • @hawkw - would you be willing to look at the instance manager?


```rust
for id in to_remove {
    info!(self.log, "only_use_disks: Removing instance"; "id" => ?id);
    self.instances.remove(&id);
}
```
Member:

What actually happens when we remove an instance from this map? What happens to the instance as a result?

Per the comment in sled_agent.rs, it seems like we want to mark as failed any instances which are using expunged disks, but I don't actually see where that happens.

Collaborator Author (@smklein):

Great question - we don't actually explicitly mark the instance failed here; we just stop it from running.

This is basically the same pathway a VMM could use to terminate itself.
This is the object we end up removing via the call to self.instances.remove:

```rust
/// A reference to a single instance running a Propolis server.
pub struct Instance {
    tx: mpsc::Sender<InstanceRequest>,
    #[allow(dead_code)]
    runner_handle: tokio::task::JoinHandle<()>,
}
```

And by dropping it, we remove the only sender-side of an mpsc of InstanceRequests.

This should trigger the self-terminating case of the InstanceRunner:

```rust
None => {
    warn!(self.log, "Instance request channel closed; shutting down");
    self.terminate().await;
    break;
},
```
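
As a self-contained toy illustration of that drop-to-shutdown pattern (not the sled-agent code itself): once the last Sender is dropped, recv() returns None and the runner loop tears itself down.

```rust
use tokio::sync::mpsc;

#[derive(Debug)]
enum InstanceRequest {
    Ping, // stand-in for the real request variants
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<InstanceRequest>(8);

    let runner = tokio::spawn(async move {
        loop {
            match rx.recv().await {
                Some(req) => println!("handling {req:?}"),
                None => {
                    // The only Sender (held by the Instance handle) was dropped.
                    println!("instance request channel closed; shutting down");
                    // In the real runner, self.terminate().await runs here.
                    break;
                }
            }
        }
    });

    tx.send(InstanceRequest::Ping).await.unwrap();
    drop(tx); // Dropping the last Sender closes the channel.
    runner.await.unwrap();
}
```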

Collaborator Author (@smklein):

This is kinda why I mentioned I'd be relying on the background task -- as implemented, I'd need to wait for a background task to notice that the instance is not running on the sled.

I can introduce an upcall to nexus, but I figured that could throw more of a wrench into the instance lifecycle revamp. Lemme know what you prefer!

Collaborator Author (@smklein) commented Jul 5, 2024:

Coming back to this after a couple of days, I guess it's possible that we return from the omicron_physical_disks_ensure function before the instance runner has finished terminating. I'll explicitly terminate here too, which should ensure the runner is not still processing requests.
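
A rough sketch of what that could look like in the removal loop; the `terminate` call below is a hypothetical stand-in for whatever e360dae actually does, not a quote of it.

```rust
for id in to_remove {
    info!(self.log, "use_only_these_disks: removing instance"; "id" => ?id);
    if let Some(instance) = self.instances.remove(&id) {
        // Explicitly ask the runner to terminate and wait for it, rather
        // than relying only on the request channel closing when `instance`
        // is dropped at the end of this iteration.
        if let Err(e) = instance.terminate().await {
            warn!(
                self.log,
                "use_only_these_disks: failed to terminate instance";
                "id" => ?id,
                "error" => ?e,
            );
        }
    }
}
```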

Collaborator Author (@smklein):

Done in e360dae

Member:

> we don't actually explicitly mark the instance failed here, we just stop it from running.

That's good! Eventually, sled-agent will never mark instances failed. I think we do want to mark the vmm state as failed, though.

Collaborator Author (@smklein):

Ack, thanks for bearing with me. I'm giving this a shot in: f242e0a

Right now, I'm just passing an extra boolean to some of the termination functions to decide what VMM state to propagate. Later, I think it might be useful to expand this to a stringified message too, so we can gain some better diagnostic info on why an instance failed or stopped.
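
Roughly the shape of that change, as a sketch; the parameter and helper names below are illustrative and may not match what f242e0a actually does.

```rust
// Sketch only: the runner picks which terminal VMM state to publish based on
// why it is being torn down.
async fn terminate(&mut self, mark_failed: bool) {
    let terminal_state = if mark_failed {
        // e.g. the backing zpool was expunged out from under the VMM.
        VmmState::Failed
    } else {
        // A normal, requested stop.
        VmmState::Destroyed
    };
    // Hypothetical helpers standing in for the real state-publishing and
    // Propolis zone teardown logic.
    self.publish_vmm_state(terminal_state).await;
    self.tear_down_propolis_zone().await;
}
```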

Contributor (@lifning) left a comment:

changes to StorageMonitor and DumpSetup look good to me! (one nit on a variable name)

```rust
// This is particularly useful for disk expungement, when a caller
// wants to ensure that the dump device is no longer accessing an
// old device.
let mut last_update_complete_tx = None;
```
Contributor:

The prefix `last_` is ever-so-slightly confusing as to its purpose - there's no code path where it isn't `.take()`n later in the same iteration of the loop where it was `= Some()`'d. (Of course, `current_` wouldn't be particularly clear either... perhaps `update_and_archiving_complete_tx` or something of the sort?)

Collaborator Author (@smklein):

How about `evaluation_and_archiving_complete_tx`? The fact that it's optional, IMO, implies that it may or may not be set, but we do respect that contract if it is set within a loop iteration.
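
As a toy, self-contained illustration of the pattern under discussion (not the DumpSetup code itself): the oneshot sender is stashed when an update request arrives and `.take()`n at the end of the same loop pass, so a caller can await one full evaluation/archiving cycle.

```rust
use tokio::sync::{mpsc, oneshot};

enum Request {
    UpdateDisks { completion_tx: oneshot::Sender<()> },
}

async fn worker(mut rx: mpsc::Receiver<Request>) {
    loop {
        let mut evaluation_and_archiving_complete_tx = None;

        match rx.recv().await {
            Some(Request::UpdateDisks { completion_tx }) => {
                evaluation_and_archiving_complete_tx = Some(completion_tx);
                // ... update the set of dump/debug devices here ...
            }
            None => break,
        }

        // ... archive logs / core files for this pass here ...

        // Signal whoever requested the update that the pass has finished.
        if let Some(tx) = evaluation_and_archiving_complete_tx.take() {
            let _ = tx.send(());
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(8);
    let handle = tokio::spawn(worker(rx));

    let (completion_tx, completion_rx) = oneshot::channel();
    tx.send(Request::UpdateDisks { completion_tx }).await.unwrap();
    completion_rx.await.unwrap(); // Returns once the pass has completed.

    drop(tx);
    handle.await.unwrap();
}
```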

Collaborator Author (@smklein):

Done in 691bc85

Contributor (@rcgoodfellow) left a comment:

LGTM. Took this branch for a spin and ran the probes connectivity test in a4x2; things look good there.

```bash
$ pfexec ../target/debug/commtest \
    --api-timeout 30m \
    http://198.51.100.23 run \
    --ip-pool-begin 198.51.100.40 \
    --ip-pool-end 198.51.100.70 \
    --icmp-loss-tolerance 500 \
    --test-duration 200s \
    --packet-rate 10
the api is up
logging in ... done
project does not exist, creating ... done
default ip pool does not exist, creating ...done
ip range does not exist, creating ... done
getting sled ids ... done
checking if probe0 exists
probe0 does not exist, creating ... done
checking if probe1 exists
probe1 does not exist, creating ... done
checking if probe2 exists
probe2 does not exist, creating ... done
checking if probe3 exists
API call error: Communication Error: error sending request for url (http://198.51.100.23/experimental/v1/probes/probe3?project=classone): connection error: Connection timed out (os error 145), retrying in 3 s
probe3 does not exist, creating ... done
testing connectivity to probes
addr            low     avg     high    last    sent    received  lost
198.51.100.41   0.644   1.186   2.225   1.006   1998    1988      9
198.51.100.40   0.666   1.095   2.556   1.259   1998    1999      0
198.51.100.42   0.691   1.289   5.446   0.867   1998    1997      0
198.51.100.43   0.623   1.089   3.092   1.288   1998    1770      2
all connectivity tests within loss tolerance
```

Collaborator Author (@smklein) commented Jul 5, 2024:

I'm wiring up the command to actually expunge sleds in #5994 , which is built on top of this PR.

The basic plumbing appears to work there, though I'm admittedly not running instances on a4x2.

```diff
@@ -333,7 +333,10 @@ impl InstanceManager {
     ///
     /// This function looks for transient zone filesystem usage on expunged
     /// zpools.
-    pub async fn only_use_disks(&self, disks: AllDisks) -> Result<(), Error> {
+    pub async fn use_only_these_disks(
```
Member:

<3

@smklein mentioned this pull request Jul 11, 2024
Base automatically changed from stop-self-managing-disks to main Jul 15, 2024 20:34
@smklein enabled auto-merge (squash) Jul 15, 2024 20:38
@smklein merged commit ad6c92e into main Jul 15, 2024
19 checks passed
@smklein deleted the physical_disks_ensure_lets_go branch Jul 15, 2024 22:29
smklein added a commit that referenced this pull request Jul 15, 2024
Provides an internal API to remove disks, and wires it into omdb.
Additionally, expands omdb commands for visibility.

- `omdb db physical-disks` can be used to view all "control plane
physical disks". This is similar to, but distinct from, the `omdb db
inventory physical-disks` command, as it reports control plane disks
that have been adopted in the control plane. This command is necessary
for identifying the UUID of the associated control plane object, which
is not observable via inventory.
- `omdb nexus sleds expunge-disk` can be used to expunge a physical disk
from a sled. This relies on many prior patches to operate correctly, but
with the combination of: #5987, #5965, #5931, #5952, #5601, #5599, we
can observe the following behavior: expunging a disk leads to all
"users" of that disk (zone filesystems, datasets, zone bundles, etc)
being removed.

I tested this PR on a4x2 using the following steps:

```bash
# Boot a4x2, confirm the Nexus zone is running
# From g0, zlogin oxz_switch

$ omdb db sleds

SERIAL  IP                             ROLE      POLICY      STATE   ID                                   
 g2      [fd00:1122:3344:103::1]:12345  -         in service  active  29fede5f-37e4-4528-bcf2-f3ee94924894 
 g0      [fd00:1122:3344:101::1]:12345  scrimlet  in service  active  6a2c7019-d055-4256-8bad-042b97aa0e5e 
 g1      [fd00:1122:3344:102::1]:12345  -         in service  active  a611b43e-3995-4cd4-9603-89ca6aca3dc5 
 g3      [fd00:1122:3344:104::1]:12345  scrimlet  in service  active  f62f2cfe-d17b-4bd6-ae64-57e8224d3672

# We'll plan on expunging a disk on g1, and observing the effects.
$ export SLED_ID=a611b43e-3995-4cd4-9603-89ca6aca3dc5
$ export OMDB_SLED_AGENT_URL=http://[fd00:1122:3344:102::1]:12345
$ omdb sled-agent zones list

    "oxz_cockroachdb_b3fecda8-2eb8-4ff3-9cf6-90c94fba7c50"
    "oxz_crucible_19831c98-3137-4af4-a93d-fc1a17c138f2"
    "oxz_crucible_6adcb8ec-6c9e-4e8a-a8d4-bbf9ad44e2c4"
    "oxz_crucible_74b2f587-10ce-4131-97fd-9832c52c8a41"
    "oxz_crucible_9e422508-f4d5-4c24-8dde-0080c0916419"
    "oxz_crucible_a47e9625-d189-4001-877a-cc3aa5b1f3eb"
    "oxz_crucible_pantry_c3b4e3cb-3e23-4f5e-921b-04e4801924fd"
    "oxz_external_dns_7e669b6f-a3fe-47a9-addd-20e42c58b8bb"
    "oxz_internal_dns_1a45a6e8-5b03-4ab4-a3db-e83fb7767767"
    "oxz_ntp_209ad0d0-a5e7-4ab8-ac8f-e99902697b32"
    "oxz_oximeter_864efebb-790f-4b7a-8377-b2c82c87f5b8"

$ omdb db physical-disks | grep $SLED_ID
 ID                                    SERIAL                 VENDOR            MODEL               SLED_ID                               POLICY      STATE
 23524716-a331-4d57-aa71-8bd4dbc916f8  synthetic-serial-g1_0  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active 
 3ca1812b-55e3-47ed-861f-f667f626c8a0  synthetic-serial-g1_3  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active 
 40139afb-7076-45d9-84cf-b96eefe7acf8  synthetic-serial-g1_1  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active 
 5c8e33dd-1230-4214-af78-9be892d9f421  synthetic-serial-g1_4  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active 
 85780bbf-8e2d-481e-9013-34611572f191  synthetic-serial-g1_2  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active 

# Let's expunge the "0th" disk here.

$ omdb nexus sleds expunge-disk 23524716-a331-4d57-aa71-8bd4dbc916f8 -w
$ omdb nexus blueprints regenerate -w
$ omdb nexus blueprints show $NEW_BLUEPRINT_ID

# Observe that the new blueprint for the sled expunges some zones -- minimally,
# the Crucible zone -- and no longer lists the "g1_0" disk. This should also be
# summarized in the blueprint metadata comment.

$ omdb nexus blueprints target set $NEW_BLUEPRINT_ID enabled -w
$ omdb sled-agent zones list

zones:
    "oxz_crucible_19831c98-3137-4af4-a93d-fc1a17c138f2"
    "oxz_crucible_74b2f587-10ce-4131-97fd-9832c52c8a41"
    "oxz_crucible_9e422508-f4d5-4c24-8dde-0080c0916419"
    "oxz_crucible_a47e9625-d189-4001-877a-cc3aa5b1f3eb"
    "oxz_crucible_pantry_c3b4e3cb-3e23-4f5e-921b-04e4801924fd"
    "oxz_ntp_209ad0d0-a5e7-4ab8-ac8f-e99902697b32"
    "oxz_oximeter_864efebb-790f-4b7a-8377-b2c82c87f5b8"

# As we can see, the expunged zones have been removed.
# We can also access the sled agent logs from g1 to observe that the expected requests have been sent
# to adjust the set of control plane disks and expunge the expected zones.
```

This is a major part of
#4719
Fixes #5370
Successfully merging this pull request may close these issues:
  • Sled Agent: Ensure that all U.2 users "let go" on disk expungement