Duplicate resource accounting update for a stopped/deleted instance #5525
The complete nexus, sled-agent, and propolis log files can be found under: catacomb.eng.oxide.computer:/staff/core/omicron-5525
I revised the issue description to remove the association with multiple concurrent instance delete requests, because the resource reduction for a VMM happens when an instance is stopped and shouldn't be related to the delete operation. The Nexus log shows that only one instance-stop request was issued for the instance in question. Based on the propolis and sled-agent logs, the timeline of events is as follows:
14:27:48.081Z propolis received the API stop request
14:27:48.082Z sled-agent saw the VMM state transition to stopping
14:27:48.314Z sled-agent saw the VMM state transition to destroyed
On the Nexus side:
The problem seems to be caused by sled-agent issuing the VMM termination state change a second time at:
This may be a race condition similar to, but distinct from, #5042: it happens on stop, not start. Please close if it's a dup.
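For context on how duplicate notifications like this are normally fenced off: state reports are typically gated on a generation number, so a report only takes effect when its generation is newer than the one already recorded. Below is a minimal sketch of that pattern; all names (`Generation`, `VmmRecord`, `apply_update`) are invented for illustration and are not omicron's actual API.

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Generation(u64);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum VmmState {
    Stopping,
    Destroyed,
}

struct VmmRecord {
    state: VmmState,
    generation: Generation,
}

impl VmmRecord {
    /// Apply a reported state only if it carries a newer generation, and
    /// report whether it was applied, so the caller runs side effects
    /// (such as resource-accounting cleanup) at most once per transition.
    fn apply_update(&mut self, state: VmmState, generation: Generation) -> bool {
        if generation <= self.generation {
            return false; // duplicate or stale notification: ignore it
        }
        self.state = state;
        self.generation = generation;
        true
    }
}

fn main() {
    let mut vmm = VmmRecord {
        state: VmmState::Stopping,
        generation: Generation(4),
    };
    // The first "destroyed" report is applied; the replay is ignored.
    assert!(vmm.apply_update(VmmState::Destroyed, Generation(5)));
    assert!(!vmm.apply_update(VmmState::Destroyed, Generation(5)));
}
```

Under a scheme like this, a second "destroyed" report with a non-advancing generation is a no-op, so the accounting cleanup it would otherwise trigger runs at most once.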
…#5830)

Builds on #5081 and #5089, but more out of convenience than necessity.

# Summary

This PR attempts to validate that the "virtual provisioning collection {insert, delete}" operations are idempotent. Currently, our usage of `max_instance_gen` only **partially** prevents updates during instance provisioning deletions:

- If `max_instance_gen` is smaller than the observed instance generation number...
- ... we avoid deleting the `virtual_provisioning_resource` record (which is great)
- ... but we still decrement the `virtual_provisioning_collection` values (which is really not great).

This basically means that we can "only cause the project/silo/fleet usage values to decrement arbitrarily, with no other changes". This has been, mechanically, the root cause of our observed underflows (e.g., #5525).

# Details of this change

- All the changes in `nexus/db-queries/src/db/datastore/virtual_provisioning_collection.rs` are tests validating the idempotency of these operations.
- All the changes in `nexus/db-queries/src/db/queries/virtual_provisioning_collection_update.rs` are changes to the query that alter its functionality. The objective of these changes is to preserve the idempotency exercised by the newly added tests, and to prevent undercounting of virtual provisioning resources. If these changes are reverted, the newly added tests start failing, showing a lack of coverage.
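To make the failure mode concrete, here is a minimal, self-contained model of the two query shapes described above. Everything here is invented for illustration (it is not omicron's datastore code); the point is only that an unguarded collection decrement, replayed after the guarded per-resource delete, drives the usage counters negative, which is the underflow seen in #5525.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct CollectionUsage {
    cpus: i64,
    ram_gb: i64,
}

struct Datastore {
    /// instance id -> (cpus, ram_gb) per-instance provisioning record
    resources: HashMap<u64, (i64, i64)>,
    usage: CollectionUsage,
}

impl Datastore {
    /// Buggy shape: the record delete is effectively guarded (removing a
    /// missing record is a no-op), but the decrement is not, so it runs
    /// again on a replayed cleanup.
    fn delete_instance_buggy(&mut self, id: u64, cpus: i64, ram_gb: i64) {
        self.resources.remove(&id);
        self.usage.cpus -= cpus;
        self.usage.ram_gb -= ram_gb;
    }

    /// Idempotent shape: decrement only with values taken from a record
    /// that was actually removed; a replay finds nothing, changes nothing.
    fn delete_instance_fixed(&mut self, id: u64) {
        if let Some((cpus, ram_gb)) = self.resources.remove(&id) {
            self.usage.cpus -= cpus;
            self.usage.ram_gb -= ram_gb;
        }
    }
}

fn main() {
    let mut ds = Datastore {
        resources: HashMap::from([(1u64, (4, 8))]),
        usage: CollectionUsage { cpus: 4, ram_gb: 8 },
    };
    ds.delete_instance_buggy(1, 4, 8);
    ds.delete_instance_buggy(1, 4, 8); // replayed cleanup underflows
    assert_eq!(ds.usage, CollectionUsage { cpus: -4, ram_gb: -8 });
}
```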
We have a project on rack2 that shows negative compute usage, which manifested as an error during disk deletion:
A database query against the `virtual_provisioning_collection` table for the related project showed that it had negative compute usage. Here are the VMMs/instances that were in the project previously:
Based on the deletion timestamps of the instances and disks (not included here), the three most recently deleted instances are likely the ones that contributed to the usage accounting issue.
The Nexus log showed that one of the instances went through resource cleanup twice:
The duplicate reduction in usage matches the negative 4 vCPUs and 8 GB of memory in the collection table: consistent with a single 4 vCPU / 8 GB instance having its usage decremented one extra time.