500 error during instance creation due to space exhaustion #7294

Open
gjcolombo opened this issue Dec 20, 2024 · 1 comment
Labels: nexus (Related to nexus), storage (Related to storage)

gjcolombo commented Dec 20, 2024

Seen during ad hoc testing of #7211.

Environment: Local dev cluster with a single sled and 40 GiB vdevs (set via cargo xtask virtual-hardware create --vdev-size).

Repro steps:

  1. Upload a 2 GiB image to the cluster. Create an instance with a 10 GiB boot disk based on this image.
  2. SSH into the instance and touch a file.
  3. Take a snapshot of the disk.
  4. Use omdb db sleds expunge-disk to expunge one of the virtual disks containing one of the backing regions for the running instance's boot disk.
  5. Create an image from the snapshot in step 3.
  6. Try to create a new instance based on that image.

Expected: Instance creation works.

Observed: Instance creation fails with an internal server error.

Rummaging through the Nexus logs shows that one of the region creation tasks in the instance create saga is failing with the following error description:

Failed(ActionFailed { source_error: Object {"InternalError": Object {"internal_message": String("Failed to create region, unexpected state: Failed")}} })

I tried to create the same instance several more times, and most of them failed, though one creation attempt did go through successfully. Afterward I had a look at the Crucible agent logs across the system:

$ zoneadm list | grep crucible | grep -v pantry | xargs -I {} pfexec svcs -z {} -L crucible/agent | xargs -I {} pfexec looker -f {} -l erro
21:21:19.330Z ERRO crucible-agent (worker): Dataset ea426475-d493-4464-97a3-268c03ac3e03 creation failed: zfs create failed! out: err:cannot create 'oxp_1a6f67c9-4a81-4fea-90c8-0b2d30e5f156/crucible/regions/ea426475-d493-4464-97a3-268c03ac3e03': out of space
21:21:20.502Z ERRO crucible-agent (worker): Cannot find region "ea426475-d493-4464-97a3-268c03ac3e03" to remove: Dataset does not exist!
21:22:47.454Z ERRO crucible-agent (worker): Dataset 8bd92ec4-e15c-4d4b-a591-363b288be84f creation failed: zfs create failed! out: err:cannot create 'oxp_1a6f67c9-4a81-4fea-90c8-0b2d30e5f156/crucible/regions/8bd92ec4-e15c-4d4b-a591-363b288be84f': out of space
21:22:48.537Z ERRO crucible-agent (worker): Cannot find region "8bd92ec4-e15c-4d4b-a591-363b288be84f" to remove: Dataset does not exist!
21:46:15.447Z ERRO crucible-agent (worker): Dataset 11185890-3460-4df4-ae9c-cf6e870b82ff creation failed: zfs create failed! out: err:cannot create 'oxp_1a6f67c9-4a81-4fea-90c8-0b2d30e5f156/crucible/regions/11185890-3460-4df4-ae9c-cf6e870b82ff': out of space
21:46:16.256Z ERRO crucible-agent (worker): Cannot find region "11185890-3460-4df4-ae9c-cf6e870b82ff" to remove: Dataset does not exist!
21:46:32.102Z ERRO crucible-agent (worker): Dataset 39876460-fc7c-45e1-9566-25216c37df3f creation failed: zfs create failed! out: err:cannot create 'oxp_1a6f67c9-4a81-4fea-90c8-0b2d30e5f156/crucible/regions/39876460-fc7c-45e1-9566-25216c37df3f': out of space
21:46:32.840Z ERRO crucible-agent (worker): Cannot find region "39876460-fc7c-45e1-9566-25216c37df3f" to remove: Dataset does not exist!
21:46:44.575Z ERRO crucible-agent (worker): Dataset b9a1e47e-3b29-4ac0-997a-4194f46c2a89 creation failed: zfs create failed! out: err:cannot create 'oxp_1a6f67c9-4a81-4fea-90c8-0b2d30e5f156/crucible/regions/b9a1e47e-3b29-4ac0-997a-4194f46c2a89': out of space
21:46:45.487Z ERRO crucible-agent (worker): Cannot find region "b9a1e47e-3b29-4ac0-997a-4194f46c2a89" to remove: Dataset does not exist!
21:19:07.432Z ERRO crucible-agent (worker): Dataset c4317960-cbfe-4d44-9d73-5925854c01b8 creation failed: zfs create failed! out: err:cannot create 'oxp_3194ea28-4cbc-4fdb-bfc7-4970f3d6390c/crucible/regions/c4317960-cbfe-4d44-9d73-5925854c01b8': out of space
21:19:56.461Z ERRO crucible-agent (worker): Cannot find region "c4317960-cbfe-4d44-9d73-5925854c01b8" to remove: Dataset does not exist!
21:26:40.871Z ERRO crucible-agent (worker): Dataset b7c06fca-a28c-47c6-a403-4aa29506ecf4 creation failed: zfs create failed! out: err:cannot create 'oxp_3194ea28-4cbc-4fdb-bfc7-4970f3d6390c/crucible/regions/b7c06fca-a28c-47c6-a403-4aa29506ecf4': out of space
21:26:42.300Z ERRO crucible-agent (worker): Cannot find region "b7c06fca-a28c-47c6-a403-4aa29506ecf4" to remove: Dataset does not exist!
21:43:11.432Z ERRO crucible-agent (worker): Dataset 85548fb2-7660-4859-ad54-70e7c8726b13 creation failed: zfs create failed! out: err:cannot create 'oxp_3194ea28-4cbc-4fdb-bfc7-4970f3d6390c/crucible/regions/85548fb2-7660-4859-ad54-70e7c8726b13': out of space
21:43:12.118Z ERRO crucible-agent (worker): Cannot find region "85548fb2-7660-4859-ad54-70e7c8726b13" to remove: Dataset does not exist!
21:46:52.034Z ERRO crucible-agent (worker): Dataset c58d09d8-2b79-4f54-9344-15d4eec48c45 creation failed: zfs create failed! out: err:cannot create 'oxp_3194ea28-4cbc-4fdb-bfc7-4970f3d6390c/crucible/regions/c58d09d8-2b79-4f54-9344-15d4eec48c45': out of space
21:46:52.870Z ERRO crucible-agent (worker): Cannot find region "c58d09d8-2b79-4f54-9344-15d4eec48c45" to remove: Dataset does not exist!
21:47:11.233Z ERRO crucible-agent (worker): Dataset 36376376-2a9f-4b68-9091-37fc467f4210 creation failed: zfs create failed! out: err:cannot create 'oxp_3194ea28-4cbc-4fdb-bfc7-4970f3d6390c/crucible/regions/36376376-2a9f-4b68-9091-37fc467f4210': out of space
21:47:12.439Z ERRO crucible-agent (worker): Cannot find region "36376376-2a9f-4b68-9091-37fc467f4210" to remove: Dataset does not exist!
21:47:44.414Z ERRO crucible-agent (worker): Dataset c5111e31-2ab0-447e-a409-54ff8d0cc03a creation failed: zfs create failed! out: err:cannot create 'oxp_3194ea28-4cbc-4fdb-bfc7-4970f3d6390c/crucible/regions/c5111e31-2ab0-447e-a409-54ff8d0cc03a': out of space
21:47:45.377Z ERRO crucible-agent (worker): Cannot find region "c5111e31-2ab0-447e-a409-54ff8d0cc03a" to remove: Dataset does not exist!
21:47:56.387Z ERRO crucible-agent (worker): Dataset a6aabfb9-cf04-4956-a532-61f925502ea4 creation failed: zfs create failed! out: err:cannot create 'oxp_3194ea28-4cbc-4fdb-bfc7-4970f3d6390c/crucible/regions/a6aabfb9-cf04-4956-a532-61f925502ea4': out of space
21:47:57.073Z ERRO crucible-agent (worker): Cannot find region "a6aabfb9-cf04-4956-a532-61f925502ea4" to remove: Dataset does not exist!
21:46:32.106Z ERRO crucible-agent (worker): Dataset d238b33f-bbcf-4c3c-aedb-31f9c913ca03 creation failed: zfs create failed! out: err:cannot create 'oxp_905ebd85-8811-4426-a2fb-fff687e387cd/crucible/regions/d238b33f-bbcf-4c3c-aedb-31f9c913ca03': out of space
21:46:32.848Z ERRO crucible-agent (worker): Cannot find region "d238b33f-bbcf-4c3c-aedb-31f9c913ca03" to remove: Dataset does not exist!
21:47:19.128Z ERRO crucible-agent (worker): Dataset cbf4d908-5007-4579-b6cf-07bc1653f6ed creation failed: zfs create failed! out: err:cannot create 'oxp_905ebd85-8811-4426-a2fb-fff687e387cd/crucible/regions/cbf4d908-5007-4579-b6cf-07bc1653f6ed': out of space
21:47:20.771Z ERRO crucible-agent (worker): Cannot find region "cbf4d908-5007-4579-b6cf-07bc1653f6ed" to remove: Dataset does not exist!
21:47:56.392Z ERRO crucible-agent (worker): Dataset a67b0a36-7242-41b5-a4dc-737d44e65225 creation failed: zfs create failed! out: err:cannot create 'oxp_905ebd85-8811-4426-a2fb-fff687e387cd/crucible/regions/a67b0a36-7242-41b5-a4dc-737d44e65225': out of space
21:47:57.073Z ERRO crucible-agent (worker): Cannot find region "a67b0a36-7242-41b5-a4dc-737d44e65225" to remove: Dataset does not exist!
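
To see how the failures are distributed, the agent errors above can be grouped by the pool named in each dataset path. The following is a minimal Python sketch. The regular expression and the function name `count_out_of_space_by_pool` are illustrative assumptions based only on the message format shown in this log, not part of any Crucible tooling. The sample holds a handful of lines copied from the output above.

```python
import re
from collections import Counter

# Matches the "zfs create failed ... out of space" lines shown above and
# captures the pool name (the first path component of the dataset).
# This pattern is an assumption derived from the log text in this issue.
OUT_OF_SPACE = re.compile(
    r"creation failed: zfs create failed!.*"
    r"cannot create '([^/']+)/crucible/regions/[^']+': out of space"
)

def count_out_of_space_by_pool(lines):
    """Return a Counter mapping pool name to number of out-of-space failures."""
    counts = Counter()
    for line in lines:
        match = OUT_OF_SPACE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# A few lines copied verbatim from the log output above.
sample = [
    "21:21:19.330Z ERRO crucible-agent (worker): Dataset ea426475-d493-4464-97a3-268c03ac3e03 creation failed: zfs create failed! out: err:cannot create 'oxp_1a6f67c9-4a81-4fea-90c8-0b2d30e5f156/crucible/regions/ea426475-d493-4464-97a3-268c03ac3e03': out of space",
    "21:21:20.502Z ERRO crucible-agent (worker): Cannot find region \"ea426475-d493-4464-97a3-268c03ac3e03\" to remove: Dataset does not exist!",
    "21:19:07.432Z ERRO crucible-agent (worker): Dataset c4317960-cbfe-4d44-9d73-5925854c01b8 creation failed: zfs create failed! out: err:cannot create 'oxp_3194ea28-4cbc-4fdb-bfc7-4970f3d6390c/crucible/regions/c4317960-cbfe-4d44-9d73-5925854c01b8': out of space",
    "21:46:32.106Z ERRO crucible-agent (worker): Dataset d238b33f-bbcf-4c3c-aedb-31f9c913ca03 creation failed: zfs create failed! out: err:cannot create 'oxp_905ebd85-8811-4426-a2fb-fff687e387cd/crucible/regions/d238b33f-bbcf-4c3c-aedb-31f9c913ca03': out of space",
]

counts = count_out_of_space_by_pool(sample)
for pool, n in sorted(counts.items()):
    print(pool, n)
```

Counting the full output above by hand gives 5, 7, and 3 creation failures for the three pools (15 in total), each followed by a "Cannot find region" error for the same region ID.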

I know we have a few space-accounting issues open right now, but I was a little surprised to see this. There shouldn't be that much data in use on the system: we've got the original image, the first instance's disk, the region snapshots for that disk, and the image created from that snapshot. Even if everything is maximally sized and stored in triplicate, that's a total of 96 GiB, which is well under the 320 GiB of unexpunged disk that I'd expect the system to have. Even then, I'd expect most of this consumption to be accounted for already, since it's Crucible data and not, e.g., auxiliary zone filesystem space.
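
The estimate above can be checked with a few lines of arithmetic. The sketch below is one reading of the numbers: it assumes the 96 GiB figure counts the uploaded image at 2 GiB and the other three items at the 10 GiB disk size, and that 320 GiB corresponds to eight 40 GiB vdevs remaining after the expunge. Neither breakdown is stated explicitly in the report, so treat both as assumptions.

```python
# Sizes in GiB, taken from the repro steps. Which items the estimate counts
# at which size is an assumption, chosen because it reproduces 96 GiB.
sizes_gib = {
    "original image": 2,
    "first instance's disk": 10,
    "region snapshots of that disk": 10,
    "image created from the snapshot": 10,
}
REPLICAS = 3  # "stored in triplicate"

expected_usage_gib = REPLICAS * sum(sizes_gib.values())

VDEV_SIZE_GIB = 40   # from the --vdev-size setting in the environment
REMAINING_VDEVS = 8  # assumption: 320 GiB / 40 GiB per vdev
capacity_gib = VDEV_SIZE_GIB * REMAINING_VDEVS

print(f"expected usage: {expected_usage_gib} GiB")
print(f"unexpunged capacity: {capacity_gib} GiB")
print(f"headroom: {capacity_gib - expected_usage_gib} GiB")
```

Under these assumptions the expected usage is less than a third of the remaining capacity, which is why the out-of-space errors are surprising.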

Even if I have the math wrong, it seems like it would be nice to return something other than a 500 here (assuming we can figure out how to get the right error to bubble back out from the Crucible agent).
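
One possible shape for this, sketched in Python purely for illustration (Nexus itself is written in Rust): classify the agent's failure text and pick a status code from it. The function name `status_for_region_failure`, the substring match, and the choice of 507 Insufficient Storage are all assumptions, not existing Nexus or Crucible agent behavior.

```python
from http import HTTPStatus

def status_for_region_failure(agent_error: str) -> HTTPStatus:
    """Pick an HTTP status for a failed region creation (hypothetical helper).

    Assumes the agent's error text is available to the caller, which is
    exactly the part the report says still needs to be figured out.
    """
    if "out of space" in agent_error:
        # Capacity problem rather than a server bug.
        return HTTPStatus.INSUFFICIENT_STORAGE
    # Anything unrecognized stays an internal error.
    return HTTPStatus.INTERNAL_SERVER_ERROR

zfs_error = (
    "zfs create failed! out: err:cannot create "
    "'oxp_1a6f67c9-4a81-4fea-90c8-0b2d30e5f156/crucible/regions/"
    "ea426475-d493-4464-97a3-268c03ac3e03': out of space"
)
print(int(status_for_region_failure(zfs_error)))
print(int(status_for_region_failure("unexpected state: Failed")))
```

Matching on message text is brittle. A structured error kind reported by the agent would be a sturdier basis for the same decision.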

gjcolombo added the nexus and storage labels on Dec 20, 2024.

gjcolombo commented:

Probably related: #4234
