-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect and un-delete phantom disks #4547
Conversation
A "phantom" disk is a disk where the disk delete saga has run for it but unwound: this leaves that disk soft-deleted, but the resources and accounting for that disk remain. Users cannot request that the disk be deleted again, and it remains a phantom. There are two fixes for this: 1. Change the disk delete saga to undo the disk's soft delete and set the disk to faulted during an unwind. This way, users can request that disks be deleted repeatedly until it works. 2. Create a background task that detects these phantom disks and does the same thing: un-delete them and set them to faulted. This requires adding an index on `id` to the `disk` table, so the schema is bumped to 12.0.1. Fixes oxidecomputer/customer-support#58.
schema/crdb/12.0.1/up01.sql
Outdated
@@ -0,0 +1,4 @@ | |||
CREATE UNIQUE INDEX IF NOT EXISTS lookup_deleted_disk ON omicron.public.disk ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With my apologies: I just merged a 13.0.0
schema migration, so I think this will have to become 14.0.0
(or maybe even 15.0.0
if #4545 lands first)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very annoying problem. @davepacheco’s idea of named migrations and a canonical order list is sounding pretty good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll wait for #4545, that's a higher priority.
Schema change from 12.0.1 to 15.0.0
Bumped schema again to 18.0.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, this looks like a solid improvement! I've a few questions around clarity and naming and such, but the overall structure looks very reasonable. Thanks!
print output should too!
add more logging to phantom disk background task
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the back-and-forth, this looks good to me!
@jmpesp, sorry this is a late-breaking question - what happens if some of the undeleted disks in the same project have the same name? Would they be renamed to honor the unique-name constraint? |
@askfongjojo I had not considered that! As written in this PR the background task will not rename the disk, but I believe that the un-delete would fail due to the existing CREATE UNIQUE INDEX IF NOT EXISTS lookup_disk_by_project ON omicron.public.disk (
project_id,
name
) WHERE
time_deleted IS NULL; This would mean that a disk with the same name would shadow the phantom disk, consuming virtual resources and counting towards quotas. I'll open an issue to track fixing this. |
A "phantom" disk is a disk where the disk delete saga has run for it but unwound: this leaves that disk soft-deleted, but the resources and accounting for that disk remain. Users cannot request that the disk be deleted again, and it remains a phantom.
There are two fixes for this:
Change the disk delete saga to undo the disk's soft delete and set the disk to faulted during an unwind. This way, users can request that disks be deleted repeatedly until it works.
Create a background task that detects these phantom disks and does the same thing: un-delete them and set them to faulted.
This requires adding an index on
id
to thedisk
table, so the schema is bumped to 12.0.1.Fixes oxidecomputer/customer-support#58.