
Allow Nexus to specify Zone filesystem location #5931

Merged: 30 commits merged into main from nexus-zone-filesystems-2 on Jul 1, 2024
Conversation

@smklein (Collaborator) commented on Jun 21, 2024:

This is a newer version of #5050, focused on a slightly smaller scope.

This PR adds an optional `filesystem_pool` to Sled Agent's API, and to Nexus' blueprint for zone construction.
This allows Nexus to control which zpool is used for zone provisioning, rather than leaving that decision to the Sled Agent as before.

By giving more control to Nexus, Nexus can better understand the impact of faults (e.g., will pulling a physical disk cause a zone to fail?) and has more control over zone reallocation.

This field is optional largely for backwards compatibility; it will eventually become mandatory.

Fixes #5048
Part of #5929
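To make the shape of the change concrete, here is a minimal sketch. The types below are illustrative stand-ins, not Omicron's actual Sled Agent API or blueprint structures; only the `filesystem_pool` field name comes from the PR description.

```rust
// Minimal sketch of the idea described above; stand-in types, not Omicron's
// actual definitions.
#[derive(Clone, Debug, PartialEq)]
struct ZpoolName(String);

struct ZoneRequest {
    // New, optional field: when set, Nexus dictates which zpool hosts the
    // zone's transient root filesystem. When unset (older requests), the
    // Sled Agent falls back to picking a pool itself, as it did before.
    filesystem_pool: Option<ZpoolName>,
}

fn choose_filesystem_pool(req: &ZoneRequest, sled_agent_default: ZpoolName) -> ZpoolName {
    match &req.filesystem_pool {
        Some(pool) => pool.clone(), // Nexus chose the pool explicitly
        None => sled_agent_default, // backwards-compatible fallback
    }
}

fn main() {
    let req = ZoneRequest {
        filesystem_pool: Some(ZpoolName("oxp_chosen_by_nexus".to_string())),
    };
    let chosen = choose_filesystem_pool(&req, ZpoolName("oxp_fallback".to_string()));
    assert_eq!(chosen, ZpoolName("oxp_chosen_by_nexus".to_string()));
}
```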

Comment on lines +62 to +77
```rust
// Add a zpool to both sleds, just to ensure that all new zones can find
// a transient filesystem wherever they end up being placed.
let _sled_agent_zpools = DiskTestBuilder::new(&cptestctx)
    .on_sled(SledUuid::from_untyped_uuid(
        cptestctx.sled_agent.sled_agent.id,
    ))
    .with_zpool_count(1)
    .build()
    .await;
let _sled_agent2_zpools = DiskTestBuilder::new(&cptestctx)
    .on_sled(SledUuid::from_untyped_uuid(
        cptestctx.sled_agent2.sled_agent.id,
    ))
    .with_zpool_count(1)
    .build()
    .await;
```
@smklein (author) commented:

This "works" -- it forces zpools to be allocated on all sleds before we try to create a blueprint -- but it's a little gross.

I've punted cleanup of the DiskTest stuff into #5939, just to keep this PR more reasonably-sized.

@@ -703,6 +704,53 @@ pub struct TestZpool {
pub datasets: Vec<TestDataset>,
}

pub struct DiskTestBuilder<'a, N: NexusServer> {
@smklein (author) commented:

This builder specifically is cleaned up and used by default in #5939. However, doing that in a "not-just-additive" way changes a lot of code unrelated to this PR, so it was split out.

Comment on lines -146 to -149
// TODO-cleanup This currently returns `None` for zones that only have
// transient datasets. This should change to a non-optional value once Nexus
// is aware of them.
pub fn dataset(&self) -> Option<&OmicronZoneDataset> {
@smklein (author) commented:

I opted to change the name of this function as a way to address this "TODO".
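For illustration only, here is a sketch of what such a rename might look like; `durable_dataset` is an assumed name, and the types are stand-ins rather than the PR's actual code.

```rust
// Hypothetical sketch: the accessor renamed so the name says what it returns
// (only the zone's durable dataset), with the transient zone filesystem now
// carried separately in the optional `filesystem_pool` field.
struct OmicronZoneDataset {
    pool_name: String, // stand-in for the real dataset description
}

struct ZoneConfig {
    durable_dataset: Option<OmicronZoneDataset>,
    filesystem_pool: Option<String>, // stand-in for Option<ZpoolName>
}

impl ZoneConfig {
    /// Formerly `dataset()`: returns the durable dataset, if this zone has one.
    fn durable_dataset(&self) -> Option<&OmicronZoneDataset> {
        self.durable_dataset.as_ref()
    }
}

fn main() {
    let cfg = ZoneConfig { durable_dataset: None, filesystem_pool: None };
    assert!(cfg.durable_dataset().is_none());
}
```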

@smklein marked this pull request as ready for review on June 24, 2024 at 22:49
@smklein (author) commented on Jun 25, 2024:

As a follow-up: #5952 actually expunges these zones if the physical disk backing one of these transient zone filesystems goes away.

@smklein requested a review from @andrewjstone on June 26, 2024 at 22:10
@andrewjstone (Contributor) left a review:

Looks great. It's awesome to see this control progressing up to Nexus!

}
}

- for &zpool_id in resources.all_zpools(ZpoolFilter::InService) {
+ for &zpool_id in all_in_service_zpools {
@andrewjstone (Contributor) commented:

I know this isn't explicitly part of this PR, but it appears that we always choose the first zpool that doesn't have a zone of the same type on it. This means that we aren't really balancing the load across zpools. I'm not sure if this matters or not, but it seems like it might be somewhat accidental.
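For context, here is a minimal sketch of the first-fit behavior described above; the function and its inputs are illustrative stand-ins, not the actual planner code.

```rust
// Illustrative only: walk the in-service zpools in order and take the first
// one that doesn't already host a zone of the same kind. Deterministic, but
// it tends to pile zones onto the earliest pools rather than balancing load.
fn pick_zpool_first_fit<'a>(
    zpools_in_service: &[&'a str],
    has_zone_of_same_kind: impl Fn(&str) -> bool,
) -> Option<&'a str> {
    zpools_in_service
        .iter()
        .copied()
        .find(|zpool| !has_zone_of_same_kind(zpool))
}

fn main() {
    let pools = ["oxp_a", "oxp_b", "oxp_c"];
    // Pretend "oxp_a" already hosts a zone of this kind.
    let choice = pick_zpool_first_fit(&pools, |z| z == "oxp_a");
    assert_eq!(choice, Some("oxp_b"));
}
```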

@smklein (author) replied:

If I'm reading blame correctly, this disposition was originally made in #5797, when the `sled_alloc_zpool` function was added.

@jgallagher - was this to make pool allocation deterministic? I agree that we will probably want the ability to evenly distribute load across different zpools.

@jgallagher (Contributor) replied:

> @jgallagher - was this to make pool allocation deterministic?

Yes, and also it was only choosing CRDB zones, so I didn't put much thought into it.

We probably want something like `OmicronZonePlacement` but for disks? (Or to expand `OmicronZonePlacement` to also place disks.) It is deterministic but also distributes load as evenly as it can within the set of restrictions it has. I'll file an issue.
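And for contrast, a sketch of a deterministic-but-balanced selection in the spirit described here (pick the in-service zpool hosting the fewest zones, breaking ties by name). This is purely illustrative and is not `OmicronZonePlacement` itself.

```rust
use std::collections::BTreeMap;

// Illustrative only: choose the zpool with the fewest zones already placed on
// it, breaking ties by pool name so the result stays deterministic.
fn pick_zpool_balanced(zone_counts: &BTreeMap<String, usize>) -> Option<&str> {
    zone_counts
        .iter()
        .min_by_key(|(name, count)| (**count, String::clone(name)))
        .map(|(name, _)| name.as_str())
}

fn main() {
    let mut counts = BTreeMap::new();
    counts.insert("oxp_a".to_string(), 2);
    counts.insert("oxp_b".to_string(), 1);
    counts.insert("oxp_c".to_string(), 1);
    // "oxp_b" and "oxp_c" are tied with the fewest zones; the name tiebreak
    // deterministically picks "oxp_b".
    assert_eq!(pick_zpool_balanced(&counts), Some("oxp_b"));
}
```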

@smklein (author) replied:

Makes total sense, thanks for filing. I can look into the issue.

@smklein merged commit 98bf635 into main on Jul 1, 2024
20 checks passed
@smklein deleted the nexus-zone-filesystems-2 branch on July 1, 2024 at 23:26
smklein added a commit that referenced this pull request Jul 2, 2024
Builds on #5931

The `DiskTest` utility provides mechanisms to create zpools and datasets within a particular Sled Agent. Historically, it was used on a *single* sled agent within the control plane test context, but more recently we've added a fair amount of multi-sled tests.

This PR updates the `DiskTest` to be more usable with multiple sleds:

- It provides a `DiskTestBuilder` to help callers customize behavior
- It lets callers access virtual disks spread across multiple distinct sleds
smklein added a commit that referenced this pull request Jul 2, 2024
This is a direct follow-up to #5931

Before this PR, physical disk expungement would not change zones based on transient filesystem pool selection. Now, this selection is a factor, and expunging disks can cause zones to be expunged if they relied on those disks.

Part of #5929
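A minimal sketch of the kind of check this implies, under the assumption that each zone records an optional filesystem pool and that the planner knows which zpools sit on expunged disks; the types and names here are illustrative, not the planner's actual code.

```rust
use std::collections::HashSet;

// Illustrative stand-ins for the real blueprint/zone structures.
struct ZoneConfig {
    id: u32,
    filesystem_pool: Option<String>, // stand-in for Option<ZpoolName>
}

// Return the zones that must be expunged because their transient zone
// filesystem lives on a zpool whose backing disk was expunged.
fn zones_to_expunge<'a>(
    zones: &'a [ZoneConfig],
    expunged_zpools: &HashSet<String>,
) -> Vec<&'a ZoneConfig> {
    zones
        .iter()
        .filter(|zone| {
            zone.filesystem_pool
                .as_ref()
                .map_or(false, |pool| expunged_zpools.contains(pool))
        })
        .collect()
}

fn main() {
    let zones = vec![
        ZoneConfig { id: 1, filesystem_pool: Some("oxp_a".into()) },
        ZoneConfig { id: 2, filesystem_pool: Some("oxp_b".into()) },
        ZoneConfig { id: 3, filesystem_pool: None }, // older config, pool unknown
    ];
    let expunged: HashSet<String> = ["oxp_a".to_string()].into_iter().collect();
    let doomed = zones_to_expunge(&zones, &expunged);
    assert_eq!(doomed.len(), 1);
    assert_eq!(doomed[0].id, 1);
}
```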
smklein added a commit that referenced this pull request Jul 15, 2024
Provides an internal API to remove disks, and wires it into omdb. Additionally, expands omdb commands for visibility.

- `omdb db physical-disks` can be used to view all "control plane physical disks". This is similar to, but distinct from, the `omdb db inventory physical-disks` command, as it reports control plane disks that have been adopted in the control plane. This command is necessary for identifying the UUID of the associated control plane object, which is not observable via inventory.
- `omdb nexus sleds expunge-disk` can be used to expunge a physical disk from a sled. This relies on many prior patches to operate correctly, but with the combination of #5987, #5965, #5931, #5952, #5601, and #5599, we can observe the following behavior: expunging a disk leads to all "users" of that disk (zone filesystems, datasets, zone bundles, etc.) being removed.

I tested this PR on a4x2 using the following steps:

```bash
# Boot a4x2, confirm the Nexus zone is running
# From g0, zlogin oxz_switch

$ omdb db sleds

SERIAL  IP                             ROLE      POLICY      STATE   ID                                   
 g2      [fd00:1122:3344:103::1]:12345  -         in service  active  29fede5f-37e4-4528-bcf2-f3ee94924894 
 g0      [fd00:1122:3344:101::1]:12345  scrimlet  in service  active  6a2c7019-d055-4256-8bad-042b97aa0e5e 
 g1      [fd00:1122:3344:102::1]:12345  -         in service  active  a611b43e-3995-4cd4-9603-89ca6aca3dc5 
 g3      [fd00:1122:3344:104::1]:12345  scrimlet  in service  active  f62f2cfe-d17b-4bd6-ae64-57e8224d3672

# We'll plan on expunging a disk on g1, and observing the effects.
$ export SLED_ID=a611b43e-3995-4cd4-9603-89ca6aca3dc5
$ export OMDB_SLED_AGENT_URL=http://[fd00:1122:3344:102::1]:12345
$ omdb sled-agent zones list

    "oxz_cockroachdb_b3fecda8-2eb8-4ff3-9cf6-90c94fba7c50"
    "oxz_crucible_19831c98-3137-4af4-a93d-fc1a17c138f2"
    "oxz_crucible_6adcb8ec-6c9e-4e8a-a8d4-bbf9ad44e2c4"
    "oxz_crucible_74b2f587-10ce-4131-97fd-9832c52c8a41"
    "oxz_crucible_9e422508-f4d5-4c24-8dde-0080c0916419"
    "oxz_crucible_a47e9625-d189-4001-877a-cc3aa5b1f3eb"
    "oxz_crucible_pantry_c3b4e3cb-3e23-4f5e-921b-04e4801924fd"
    "oxz_external_dns_7e669b6f-a3fe-47a9-addd-20e42c58b8bb"
    "oxz_internal_dns_1a45a6e8-5b03-4ab4-a3db-e83fb7767767"
    "oxz_ntp_209ad0d0-a5e7-4ab8-ac8f-e99902697b32"
    "oxz_oximeter_864efebb-790f-4b7a-8377-b2c82c87f5b8"

$ omdb db physical-disks | grep $SLED_ID
 ID                                    SERIAL                 VENDOR            MODEL               SLED_ID                               POLICY      STATE
 23524716-a331-4d57-aa71-8bd4dbc916f8  synthetic-serial-g1_0  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active 
 3ca1812b-55e3-47ed-861f-f667f626c8a0  synthetic-serial-g1_3  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active 
 40139afb-7076-45d9-84cf-b96eefe7acf8  synthetic-serial-g1_1  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active 
 5c8e33dd-1230-4214-af78-9be892d9f421  synthetic-serial-g1_4  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active 
 85780bbf-8e2d-481e-9013-34611572f191  synthetic-serial-g1_2  synthetic-vendor  synthetic-model-U2  a611b43e-3995-4cd4-9603-89ca6aca3dc5  in service  active 

# Let's expunge the "0th" disk here.

$ omdb nexus sleds expunge-disk 23524716-a331-4d57-aa71-8bd4dbc916f8 -w
$ omdb nexus blueprints regenerate -w
$ omdb nexus blueprints show $NEW_BLUEPRINT_ID

# Observe that the new blueprint for the sled expunges some zones -- minimally,
# the Crucible zone -- and no longer lists the "g1_0" disk. This should also be
# summarized in the blueprint metadata comment.

$ omdb nexus blueprints target set $NEW_BLUEPRINT_ID enabled -w
$ omdb sled-agent zones list

zones:
    "oxz_crucible_19831c98-3137-4af4-a93d-fc1a17c138f2"
    "oxz_crucible_74b2f587-10ce-4131-97fd-9832c52c8a41"
    "oxz_crucible_9e422508-f4d5-4c24-8dde-0080c0916419"
    "oxz_crucible_a47e9625-d189-4001-877a-cc3aa5b1f3eb"
    "oxz_crucible_pantry_c3b4e3cb-3e23-4f5e-921b-04e4801924fd"
    "oxz_ntp_209ad0d0-a5e7-4ab8-ac8f-e99902697b32"
    "oxz_oximeter_864efebb-790f-4b7a-8377-b2c82c87f5b8"

# As we can see, the expunged zones have been removed.
# We can also access the sled agent logs from g1 to observe that the expected requests have been sent
# to adjust the set of control plane disks and expunge the expected zones.
```

This is a major part of #4719.
Fixes #5370

Successfully merging this pull request may close these issues.

[nexus] Make OmicronZoneConfig include selection of zone root filesystem