
Dataset ledger does not reflect ensured datasets #7095

Closed
andrewjstone opened this issue Nov 18, 2024 · 2 comments · Fixed by #7096
Labels
bug Something that isn't working.

Comments

@andrewjstone
Contributor

Background

I've been working on #6999, trying to find all the dependencies between reconfigurator execution steps and remove them so that all steps can run independently. Ideally, all steps are idempotent, and the failure of a single step gets reported as a warning without stopping other steps from running. We don't really want to run the steps in parallel or as separate background tasks, but we could if we wanted to. Eventually the structure of each step will change to be a `BoxedFuture`, so that execution is decoupled from the steps themselves and we can mostly prevent dependencies through a constrained API. The first step, however, is to find all the dependencies and remove them.

One of the dependency chains is currently deploy disks -> deploy datasets -> deploy zones -> record datasets, with plumb-firewall-rules mixed in but unrelated AFAICT. I was wondering what is forcing these to be dependencies so that I could go ahead and remove them. While reading that code I found a bug, which is the topic of this issue.

Bug

As part of ensuring datasets we persist them to a ledger. However, if there are errors when ensuring the datasets, we persist the ledger unconditionally. This is because the call to `datasets_ensure_internal` doesn't return a `Result`, but a `DatasetsManagementResult` that contains possible internal errors. These errors are not checked before the ledger is committed.
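To make the shape of the bug concrete, here is a minimal sketch of the flow described above. The types and function names mirror the real ones (`datasets_ensure_internal`, `DatasetsManagementResult`) but the bodies are illustrative, not the actual sled-agent code:

```rust
// Illustrative sketch, not sled-agent's real implementation.
#[derive(Debug, Default)]
struct DatasetsManagementResult {
    // Per-dataset errors ride inside the result rather than being
    // surfaced as an Err, so a caller can silently ignore them.
    errors: Vec<String>,
}

impl DatasetsManagementResult {
    fn has_error(&self) -> bool {
        !self.errors.is_empty()
    }
}

fn datasets_ensure_internal(fail: bool) -> DatasetsManagementResult {
    let mut result = DatasetsManagementResult::default();
    if fail {
        result.errors.push("zfs create failed".to_string());
    }
    result
}

fn ensure_and_commit_ledger(fail: bool, ledger_committed: &mut bool) -> DatasetsManagementResult {
    let result = datasets_ensure_internal(fail);
    // Bug: the ledger is committed unconditionally; result.errors is
    // never inspected first.
    *ledger_committed = true;
    result
}

fn main() {
    let mut committed = false;
    let result = ensure_and_commit_ledger(true, &mut committed);
    // The ledger now claims datasets that were never successfully ensured.
    assert!(result.has_error());
    assert!(committed);
}
```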

Now for the interesting part. I actually think this may be the right thing to do™. We should be writing down the desired state so that on cold boot the sled-agent can recreate these datasets if they don't already exist. I believe this was the only usage before #7006, so this wasn't actually problematic. If this is the primary use case, an argument can be made for persisting the desired state to the ledger before trying to ensure the datasets, in typical write-ahead-log fashion.

The bug arises because we later rely on the ledger to make decisions at runtime rather than looking at which datasets actually exist. In `check_requested_zone_datasets_exist` we call `self.inner.storage.datasets_config_list()`, which ends up reading this ledger of datasets that may not actually be running. This gives a false sense of safety: the zones may still fail to launch in the same manner as without this check. So the check doesn't make anything worse; it's just not correct.

If we really want to check that the datasets exist before launching zones, rather than bail early based on the ledger, we should check the actual datasets via the filesystem rather than reading the ledger.
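A hypothetical sketch of what consulting the filesystem ("reality") instead of the ledgered configuration ("intent") could look like. Shelling out to `zfs list` is one obvious way to ask ZFS whether a dataset exists right now; this is not the sled-agent's actual code, and the function name is made up:

```rust
use std::process::Command;

// Hypothetical helper: returns whether a ZFS dataset actually exists on
// this sled right now, by asking the zfs CLI rather than the ledger.
fn dataset_exists_on_disk(name: &str) -> bool {
    Command::new("zfs")
        .args(["list", "-H", "-o", "name", name])
        .output()
        // If `zfs` is missing or the dataset doesn't exist, report false.
        .map(|o| o.status.success())
        .unwrap_or(false)
}
```

Even this is only a point-in-time answer: the dataset can still vanish between the check and the zone launch.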

In essence, my suggestion is:

  1. Always write the ledger before ensuring datasets, and use it on cold boot.
  2. Check that the actual datasets exist before attempting to launch the corresponding zones, if we indeed don't want to attempt launching them otherwise. I'd probably do this one zone at a time though, because presumably we still want to launch the zones whose datasets do exist. On the next execution iteration, ensure-datasets and ensure-zones will presumably be called again and the system will fix itself.

Alternatively, we could avoid writing the desired state of the datasets to the ledger until they are created, but this feels wrong to me: the system may crash after the datasets are created but before the ledger is written, leaving the ledger out of date on cold boot.
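The write-ahead ordering from suggestion 1 can be sketched as follows. The function shape is illustrative, not the sled-agent API: persist the desired state first, then ensure, and return per-dataset failures as warnings instead of silently ledgering them:

```rust
use std::cell::RefCell;

// Illustrative write-ahead-log ordering: commit intent before acting on it.
fn ensure_datasets_wal_style<L, E>(mut commit_ledger: L, mut ensure: E) -> Vec<String>
where
    L: FnMut(),
    E: FnMut() -> Vec<String>, // returns per-dataset errors
{
    // 1. Write-ahead: the desired state is durable before we act, so a
    //    crash mid-ensure still leaves a correct ledger for cold boot.
    commit_ledger();
    // 2. Ensure datasets; failures are surfaced as warnings rather than
    //    deciding whether the ledger gets written.
    ensure()
}

fn main() {
    let order = RefCell::new(Vec::new());
    let warnings = ensure_datasets_wal_style(
        || order.borrow_mut().push("ledger"),
        || {
            order.borrow_mut().push("ensure");
            vec!["dataset X failed".to_string()]
        },
    );
    // The ledger write happens before the ensure attempt, and the
    // per-dataset error is reported rather than swallowed.
    assert_eq!(*order.borrow(), ["ledger", "ensure"]);
    assert_eq!(warnings, ["dataset X failed"]);
}
```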

@andrewjstone andrewjstone added the bug Something that isn't working. label Nov 18, 2024
@smklein
Collaborator

smklein commented Nov 18, 2024

The comments around `check_requested_zone_datasets_exist` try to clarify this somewhat:

    // Compares the incoming set of zones with the datasets this sled is
    // configured to use.
    //
    // If the requested zones contain any datasets not configured on this sled,
    // an error is returned.

This check is explicitly for configuration, rather than inventory. I figured I'd check configuration instead of inventory, as that specifies "intent from Nexus" rather than "what datasets actually exist".

As discussed in chat - we could try to observe the union of "all datasets reported by system inventory" x "all datasets in the latest configuration", but my fear is that any local check reliant on inventory would be inherently subject to a TOCTTOU. Basically: If we try to give an affirmative answer that "a dataset (or disk) exists", it might immediately be yanked from the system, powered down, etc.

In that case, we'd fall back to the behavior we have today: We'd try to boot the zone, and if the dataset is gone, booting the zone should fail.

@andrewjstone
Contributor Author

andrewjstone commented Nov 18, 2024

> The comments around `check_requested_zone_datasets_exist` try to clarify this somewhat:
>
>     // Compares the incoming set of zones with the datasets this sled is
>     // configured to use.
>     //
>     // If the requested zones contain any datasets not configured on this sled,
>     // an error is returned.
>
> This check is explicitly for configuration, rather than inventory. I figured I'd check configuration instead of inventory, as that specifies "intent from Nexus" rather than "what datasets actually exist".
>
> As discussed in chat - we could try to observe the union of "all datasets reported by system inventory" x "all datasets in the latest configuration", but my fear is that any local check reliant on inventory would be inherently subject to a TOCTTOU. Basically: If we try to give an affirmative answer that "a dataset (or disk) exists", it might immediately be yanked from the system, powered down, etc.
>
> In that case, we'd fall back to the behavior we have today: We'd try to boot the zone, and if the dataset is gone, booting the zone should fail.

Yeah, this was a misunderstanding on my part. Thanks for the clarification around the expunge use case: we want to error out if the request contains a dataset that should be expunged based on configuration.

As we discussed in chat, renaming the function to `check_requested_zone_datasets_configured` would suffice and resolve this issue. I'll go ahead and do that.
