Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect and un-delete phantom disks #4547

Merged
merged 16 commits into from
Dec 4, 2023

Conversation

jmpesp
Copy link
Contributor

@jmpesp jmpesp commented Nov 21, 2023

A "phantom" disk is a disk where the disk delete saga has run for it but unwound: this leaves that disk soft-deleted, but the resources and accounting for that disk remain. Users cannot request that the disk be deleted again, and it remains a phantom.

There are two fixes for this:

  1. Change the disk delete saga to undo the disk's soft delete and set the disk to faulted during an unwind. This way, users can request that disks be deleted repeatedly until it works.

  2. Create a background task that detects these phantom disks and does the same thing: un-delete them and set them to faulted.

This requires adding an index on id to the disk table, so the schema is bumped to 12.0.1.

Fixes oxidecomputer/customer-support#58.

A "phantom" disk is a disk where the disk delete saga has run for it but
unwound: this leaves that disk soft-deleted, but the resources and
accounting for that disk remain. Users cannot request that the disk be
deleted again, and it remains a phantom.

There are two fixes for this:

1. Change the disk delete saga to undo the disk's soft delete and set
   the disk to faulted during an unwind. This way, users can request
   that disks be deleted repeatedly until it works.

2. Create a background task that detects these phantom disks and does
   the same thing: un-delete them and set them to faulted.

This requires adding an index on `id` to the `disk` table, so the schema
is bumped to 12.0.1.

Fixes oxidecomputer/customer-support#58.
@@ -0,0 +1,4 @@
CREATE UNIQUE INDEX IF NOT EXISTS lookup_deleted_disk ON omicron.public.disk (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With my apologies: I just merged a 13.0.0 schema migration, so I think this will have to become 14.0.0 (or maybe even 15.0.0 if #4545 lands first)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very annoying problem. @davepacheco’s idea of named migrations and a canonical order list is sounding pretty good to me!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll wait for #4545, that's a higher priority.

Copy link
Collaborator

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, this looks like a solid improvement! I've a few questions around clarity and naming and such, but the overall structure looks very reasonable. Thanks!

dev-tools/omdb/src/bin/omdb/nexus.rs Outdated Show resolved Hide resolved
dev-tools/omdb/src/bin/omdb/nexus.rs Outdated Show resolved Hide resolved
nexus/db-model/src/schema.rs Outdated Show resolved Hide resolved
nexus/db-queries/src/db/datastore/disk.rs Show resolved Hide resolved
nexus/db-queries/src/db/datastore/disk.rs Show resolved Hide resolved
nexus/src/app/background/phantom_disks.rs Outdated Show resolved Hide resolved
nexus/src/app/background/phantom_disks.rs Outdated Show resolved Hide resolved
nexus/src/app/background/phantom_disks.rs Outdated Show resolved Hide resolved
nexus/tests/integration_tests/disks.rs Show resolved Hide resolved
schema/crdb/18.0.0/up01.sql Show resolved Hide resolved
Copy link
Collaborator

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the back-and-forth, this looks good to me!

@jmpesp jmpesp enabled auto-merge (squash) December 1, 2023 21:38
@jmpesp jmpesp merged commit c915eeb into oxidecomputer:main Dec 4, 2023
15 checks passed
@jmpesp jmpesp deleted the unwind_leaves_phantom_disk branch December 4, 2023 21:26
@askfongjojo
Copy link

@jmpesp, sorry this is a late-breaking question - what happens if some of the undeleted disks in the same project have the same name? Would they be renamed to honor the unique-name constraint?

@jmpesp
Copy link
Contributor Author

jmpesp commented Dec 11, 2023

@askfongjojo I had not considered that! As written in this PR the background task will not rename the disk, but I believe that the un-delete would fail due to the existing lookup_disk_by_project UNIQUE INDEX:

CREATE UNIQUE INDEX IF NOT EXISTS lookup_disk_by_project ON omicron.public.disk (
    project_id,
    name
) WHERE
    time_deleted IS NULL;

This would mean that a disk with the same name would shadow the phantom disk, consuming virtual resources and counting towards quotas. I'll open an issue to track fixing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants