
Duplicate resource accounting update for a stopped/deleted instance #5525

askfongjojo opened this issue Apr 12, 2024 · 3 comments


askfongjojo commented Apr 12, 2024

We have a project on rack2 that shows negative compute usage, which manifested as an error during disk deletion:

{
  "msg": "request completed",
  "v": 0,
  "name": "65a11c18-7f59-41ac-b9e7-680627f996e7",
  "level": 30,
  "time": "2024-04-12T14:28:00.793206966Z",
  "hostname": "oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7",
  "pid": 10277,
  "uri": "//v1/disks/hubris-disk-lpc55xpresso-279c894b-ed57-4d14-8100-5928cf03924c?project=hubris-build",
  "method": "DELETE",
  "req_id": "aa0543d3-0c79-4faa-9b86-8ada4e0ca201",
  "remote_addr": "172.20.16.182:47642",
  "local_addr": "172.30.2.5:443",
  "component": "dropshot_external",
  "file": "/home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/29ae98d/dropshot/src/server.rs:837",
  "error_message_external": "Internal Server Error",
  "error_message_internal": "saga ACTION error at node \"no_result1\": unexpected database error: value is too small for a byte count",
  "latency_us": 133503,
  "response_code": "500"
}

A database query against the virtual_provisioning_collection table for the related project showed that it had negative compute usage:

  a494b278-d167-4940-9ffb-6d900ea0e47d | 2024-04-12 14:28:00.75635+00  | Project         |                              0 |               -4 |     -8589934592

Here are the vmm/instances that were in the project previously:

root@[fd00:1122:3344:105::3]:32221/omicron> select id, name, ncpus, memory, state, time_created, time_deleted from instance where project_id = 'a494b278-d167-4940-9ffb-6d900ea0e47d' order by time_created;
                   id                  |                              name                               | ncpus |   memory   |   state   |         time_created          |         time_deleted
---------------------------------------+-----------------------------------------------------------------+-------+------------+-----------+-------------------------------+--------------------------------
  c306cd25-bc22-4b40-be8a-d1f7ed3850fd | hubris-build-oxide-rot-1-ad95a39e-de73-43f6-ad83-053fb8b91fad   |     4 | 8589934592 | destroyed | 2024-04-04 19:54:39.735178+00 | 2024-04-04 20:00:43.863439+00
  4709d422-537c-4021-8881-c3f2c9dc68fb | hubris-build-oxide-rot-1-dev-ad95a39e-de73-43f6-ad83-053fb8b91f |     4 | 8589934592 | destroyed | 2024-04-04 19:54:39.737293+00 | 2024-04-04 20:01:08.697802+00
  1d99250b-d425-4d93-bae1-9f87393cfaa4 | hubris-build-oxide-rot-1-50be9da6-43dd-48ed-b68f-591e3a969fbc   |     4 | 8589934592 | destroyed | 2024-04-04 20:03:43.664052+00 | 2024-04-04 20:10:40.70392+00
  b899cf15-547d-4d0a-ac04-2b137e85434e | hubris-build-oxide-rot-1-dev-50be9da6-43dd-48ed-b68f-591e3a969f |     4 | 8589934592 | destroyed | 2024-04-04 20:03:43.664272+00 | 2024-04-04 20:10:16.119842+00
  c25a7cf2-10ce-4c7a-b344-f8800734627c | hubris-build-oxide-rot-1-dev-4ac8baca-afec-4c0e-9782-3a42e32454 |     4 | 8589934592 | destroyed | 2024-04-04 20:20:07.852417+00 | 2024-04-04 20:26:30.777506+00
  02977cb7-180c-4bf7-af6f-3c2740c3f3d8 | hubris-build-oxide-rot-1-4ac8baca-afec-4c0e-9782-3a42e324542f   |     4 | 8589934592 | destroyed | 2024-04-04 20:20:07.857239+00 | 2024-04-04 20:26:21.589897+00
  51842add-0437-43a0-b8e4-24f6d936b694 | hubris-build-lpc55xpresso-cd98825f-d6db-40e0-b392-4198d68361e3  |     4 | 8589934592 | destroyed | 2024-04-11 21:14:39.701095+00 | 2024-04-11 21:18:24.44294+00
  951d3327-5d74-4934-82b9-921596296138 | hubris-build-oxide-rot-1-cd98825f-d6db-40e0-b392-4198d68361e3   |     4 | 8589934592 | destroyed | 2024-04-11 21:14:39.707547+00 | 2024-04-11 21:18:28.578122+00
  0b33f7f7-1629-4ba6-9a96-c40f3328f26a | hubris-build-rot-carrier-cd98825f-d6db-40e0-b392-4198d68361e3   |     4 | 8589934592 | destroyed | 2024-04-11 21:14:39.707748+00 | 2024-04-11 21:18:31.286399+00
  1f164a98-c0ef-4a3d-b773-7b4125d20512 | hubris-build-rot-carrier-fb64e4dc-9339-4c94-b96c-4082335fd518   |     4 | 8589934592 | destroyed | 2024-04-11 21:19:08.822619+00 | 2024-04-12 14:23:25.212172+00
  22ba56bb-e4d7-40f3-ad48-c497a19927fc | hubris-build-oxide-rot-1-fb64e4dc-9339-4c94-b96c-4082335fd518   |     4 | 8589934592 | destroyed | 2024-04-11 21:19:08.831067+00 | 2024-04-11 21:24:08.810187+00
  de9b28fb-329c-4dcb-84af-d433a78a4e16 | hubris-build-lpc55xpresso-fb64e4dc-9339-4c94-b96c-4082335fd518  |     4 | 8589934592 | destroyed | 2024-04-11 21:19:08.831157+00 | 2024-04-11 21:24:08.309922+00
  a1b7498a-af14-4e5b-8509-3c61ab3e3a34 | hubris-build-oxide-rot-1-1ebe6ce9-ee71-4f1a-bdc2-e87621887a3f   |     4 | 8589934592 | destroyed | 2024-04-12 14:12:23.851901+00 | 2024-04-12 14:18:28.260342+00
  5a2f39cc-2e64-4910-956c-c136dc38cb2f | hubris-build-oxide-rot-1-c2a05043-7b52-46b9-9a2b-84f591499979   |     4 | 8589934592 | destroyed | 2024-04-12 14:19:31.630662+00 | 2024-04-12 14:25:42.181182+00
  1d85d5e0-b38c-4550-8ba1-2988925a6927 | hubris-build-lpc55xpresso-279c894b-ed57-4d14-8100-5928cf03924c  |     4 | 8589934592 | destroyed | 2024-04-12 14:24:18.699901+00 | 2024-04-12 14:28:00.439503+00
  fe48a0be-d78b-4959-961d-92fa74d50bda | hubris-build-oxide-rot-1-279c894b-ed57-4d14-8100-5928cf03924c   |     4 | 8589934592 | destroyed | 2024-04-12 14:24:18.706574+00 | 2024-04-12 14:27:50.841866+00
  b3b42394-a7d9-4175-9c8c-75819408a328 | hubris-build-rot-carrier-279c894b-ed57-4d14-8100-5928cf03924c   |     4 | 8589934592 | destroyed | 2024-04-12 14:24:18.707483+00 | 2024-04-12 14:27:56.580457+00

Based on the deletion timestamps of the instances and disks (not included here), the three most recently deleted instances are likely the ones that contributed to the usage accounting issues.

The nexus log showed that one of the instances went through resource cleanup twice:

14:27:56.227Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): vmm is terminated, cleaning up resources
    file = nexus/src/app/instance.rs:1627
    instance_id = b3b42394-a7d9-4175-9c8c-75819408a328
    propolis_id = aab1f007-a6d2-4005-a4c3-f3673054344c 
...
14:27:56.264Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): vmm is terminated, cleaning up resources
    file = nexus/src/app/instance.rs:1627
    instance_id = b3b42394-a7d9-4175-9c8c-75819408a328
    propolis_id = aab1f007-a6d2-4005-a4c3-f3673054344c

The duplicate reduction in usage matches the negative 4 vCPUs and negative 8 GiB of memory in the collection table.
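For illustration, here is a minimal sketch of the failure mode (the `ByteCount` type here is hypothetical, not Omicron's actual implementation): a duplicate decrement drives the counter negative, and reading the negative total back as a byte count then fails, matching the "value is too small for a byte count" error above.

```rust
// Hypothetical ByteCount that rejects negative totals, loosely mirroring
// the "value is too small for a byte count" error in the 500 above.
#[derive(Debug)]
pub struct ByteCount(pub u64);

impl ByteCount {
    pub fn try_from_i64(v: i64) -> Result<Self, &'static str> {
        if v < 0 {
            Err("value is too small for a byte count")
        } else {
            Ok(ByteCount(v as u64))
        }
    }
}

fn main() {
    // One 8 GiB instance provisioned in the project.
    let mut ram_provisioned: i64 = 8_589_934_592;
    // First cleanup for the terminated VMM: correct decrement to zero.
    ram_provisioned -= 8_589_934_592;
    // Duplicate cleanup for the same VMM: the counter goes negative.
    ram_provisioned -= 8_589_934_592;
    assert_eq!(ram_provisioned, -8_589_934_592);
    // Converting the negative total back into a byte count fails, which is
    // what surfaced as the internal server error during disk deletion.
    assert!(ByteCount::try_from_i64(ram_provisioned).is_err());
}
```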


askfongjojo commented Apr 12, 2024

The complete nexus, sled-agent and propolis log files can be found under: catacomb.eng.oxide.computer:/staff/core/omicron-5525

@askfongjojo askfongjojo changed the title Duplicate resource accounting update when multiple stop/deletion requests are made on the same instance Duplicate resource accounting update for a stopped/deleted instance Apr 13, 2024
@askfongjojo

I revised the issue description to remove the association with multiple concurrent instance-delete requests: the resource reduction for a vmm happens when an instance is stopped and shouldn't be related to the delete operation. The nexus log shows that only one instance-stop request was issued for the instance in question.

Based on the propolis and sled-agent logs, the timeline of events is as follows:

14:27:48.081Z propolis received API stop request
14:27:48.292Z propolis completed shutdown

14:27:48.081Z INFO propolis-server (vm_controller): Requested state Stop via API
...
14:27:48.090Z INFO propolis-server (oximeter-producer): received request to begin graceful shutdown
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/773889e/dropshot/src/server.rs:266
    local_addr = [fd00:1122:3344:101::1:25a]:12400
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:36084
    req_id = 24af61b0-977c-4ac5-a3f6-354f80395f72
    uri = /instance
14:27:48.292Z INFO propolis-server: request completed
    latency_us = 95
    local_addr = [fd00:1122:3344:101::1:25a]:12400
    method = GET
    remote_addr = [fd00:1122:3344:101::1]:59741
    req_id = d45430ce-5268-4b4b-9a22-c9a62adace54
    response_code = 200
    uri = /instance/state-monitor
14:27:48.313Z INFO propolis-server: request completed
    latency_us = 31
    local_addr = [fd00:1122:3344:101::1:25a]:12400
    method = GET
    remote_addr = [fd00:1122:3344:101::1]:59741
    req_id = 69c9d15d-80cd-457e-9deb-7c07c29717d0
    response_code = 200
    uri = /instance/state-monitor

14:27:48.082Z sled-agent saw vmm state transition to stopping
14:27:48.292Z sled-agent published vmm state to nexus

14:27:48.082Z INFO SledAgent (InstanceManager): updated state after observing Propolis state change
    file = sled-agent/src/instance.rs:591
    instance_id = b3b42394-a7d9-4175-9c8c-75819408a328
    new_instance_state = InstanceRuntimeState { propolis_id: Some(aab1f007-a6d2-4005-a4c3-f3673054344c), dst_propolis_id: None, migration_id: None, gen: Generation(3), time_updated: 2024-04-12T14:24:25.277504787Z }
    new_vmm_state = VmmRuntimeState { state: Stopping, gen: Generation(5), time_updated: 2024-04-12T14:27:48.082093489Z }
    propolis_id = aab1f007-a6d2-4005-a4c3-f3673054344c
14:27:48.082Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    file = sled-agent/src/instance.rs:491
    instance_id = b3b42394-a7d9-4175-9c8c-75819408a328
    state = SledInstanceState { instance_state: InstanceRuntimeState { propolis_id: Some(aab1f007-a6d2-4005-a4c3-f3673054344c), dst_propolis_id: None, migration_id: None, gen: Generation(3), time_updated: 2024-04-12T14:24:25.277504787Z }, propolis_id: aab1f007-a6d2-4005-a4c3-f3673054344c, vmm_state: VmmRuntimeState { state: Stopping, gen: Generation(5), time_updated: 2024-04-12T14:27:48.082093489Z } }
14:27:48.292Z INFO SledAgent (InstanceManager): updated state after observing Propolis state change
    file = sled-agent/src/instance.rs:591
    instance_id = b3b42394-a7d9-4175-9c8c-75819408a328
    new_instance_state = InstanceRuntimeState { propolis_id: Some(aab1f007-a6d2-4005-a4c3-f3673054344c), dst_propolis_id: None, migration_id: None, gen: Generation(3), time_updated: 2024-04-12T14:24:25.277504787Z }
    new_vmm_state = VmmRuntimeState { state: Stopping, gen: Generation(6), time_updated: 2024-04-12T14:27:48.292621861Z }
    propolis_id = aab1f007-a6d2-4005-a4c3-f3673054344c
14:27:48.292Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    file = sled-agent/src/instance.rs:491
    instance_id = b3b42394-a7d9-4175-9c8c-75819408a328
    state = SledInstanceState { instance_state: InstanceRuntimeState { propolis_id: Some(aab1f007-a6d2-4005-a4c3-f3673054344c), dst_propolis_id: None, migration_id: None, gen: Generation(3), time_updated: 2024-04-12T14:24:25.277504787Z }, propolis_id: aab1f007-a6d2-4005-a4c3-f3673054344c, vmm_state: VmmRuntimeState { state: Stopping, gen: Generation(6), time_updated: 2024-04-12T14:27:48.292621861Z } }

14:27:48.314Z sled-agent saw vmm state transition to destroyed
14:27:52.540Z sled-agent published vmm state to nexus
14:27:56.235Z sled-agent published vmm state to nexus again

14:27:48.314Z INFO SledAgent (InstanceManager): updated state after observing Propolis state change
    file = sled-agent/src/instance.rs:591
    instance_id = b3b42394-a7d9-4175-9c8c-75819408a328
    new_instance_state = InstanceRuntimeState { propolis_id: None, dst_propolis_id: None, migration_id: None, gen: Generation(5), time_updated: 2024-04-12T14:27:48.313995841Z }
    new_vmm_state = VmmRuntimeState { state: Destroyed, gen: Generation(7), time_updated: 2024-04-12T14:27:48.313995841Z }
    propolis_id = aab1f007-a6d2-4005-a4c3-f3673054344c
14:27:48.314Z INFO SledAgent (ZoneBundler): creating zone bundle
    context = ZoneBundleContext { storage_dirs: ["/pool/int/01376d9a-434a-42d3-a4c8-dc78edcdfd06/debug/bundle/zone", "/pool/int/dbf26410-71a8-4120-bf16-42ce7f5a4acf/debug/bundle/zone"], cause: TerminatedInstance, extra_log_dirs: ["/pool/ext/6bfb4120-488d-4f3d-90ef-e9bfa523b388/crypt/debug", "/pool/ext/d732addc-cfe8-4c2c-8028-72eb4481b04e/crypt/debug", "/pool/ext/4c7ad252-55c2-4a1a-9d93-9dfcdfdfacca/crypt/debug", "/pool/ext/c99e6032-1d4f-47d2-9efe-ae2b2479554e/crypt/debug", "/pool/ext/653065d2-ab70-47c9-b832-34238fdc95ef/crypt/debug", "/pool/ext/923c930c-80f8-448d-8321-cebfc6c41760/crypt/debug", "/pool/ext/c764a8ae-6862-4eec-9db0-cc6ea478e4a7/crypt/debug", "/pool/ext/fcb0e4c7-e046-4cf5-ad35-3ad90e1eb90c/crypt/debug", "/pool/ext/677b0057-3a80-461b-aca8-c2cb501a7278/crypt/debug", "/pool/ext/ac789935-fa42-4d00-8967-df0d96dbb74e/crypt/debug"] }
    file = sled-agent/src/zone_bundle.rs:437
    zone_name = oxz_propolis-server_aab1f007-a6d2-4005-a4c3-f3673054344c
14:27:49.545Z INFO SledAgent (ZoneBundler): finished zone bundle
    file = sled-agent/src/zone_bundle.rs:1230
    metadata = ZoneBundleMetadata { id: ZoneBundleId { zone_name: "oxz_propolis-server_aab1f007-a6d2-4005-a4c3-f3673054344c", bundle_id: b94eaccc-a027-4091-b3e2-4cf807923cf2 }, time_created: 2024-04-12T14:27:48.316512323Z, version: 0, cause: TerminatedInstance }
14:27:49.545Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_aab1f007-a6d2-4005-a4c3-f3673054344c
    file = sled-agent/src/instance.rs:780
    instance_id = b3b42394-a7d9-4175-9c8c-75819408a328
...
14:27:52.540Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    file = sled-agent/src/instance.rs:491
    instance_id = b3b42394-a7d9-4175-9c8c-75819408a328
    state = SledInstanceState { instance_state: InstanceRuntimeState { propolis_id: None, dst_propolis_id: None, migration_id: None, gen: Generation(5), time_updated: 2024-04-12T14:27:48.313995841Z }, propolis_id: aab1f007-a6d2-4005-a4c3-f3673054344c, vmm_state: VmmRuntimeState { state: Destroyed, gen: Generation(8), time_updated: 2024-04-12T14:27:52.540001638Z } }
...
14:27:56.235Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    file = sled-agent/src/instance.rs:491
    instance_id = b3b42394-a7d9-4175-9c8c-75819408a328
    state = SledInstanceState { instance_state: InstanceRuntimeState { propolis_id: None, dst_propolis_id: None, migration_id: None, gen: Generation(5), time_updated: 2024-04-12T14:27:48.313995841Z }, propolis_id: aab1f007-a6d2-4005-a4c3-f3673054344c, vmm_state: VmmRuntimeState { state: Destroyed, gen: Generation(8), time_updated: 2024-04-12T14:27:52.540001638Z } }

On the nexus side,
14:27:56.227Z nexus cleaned up resources for the terminated vmm
14:27:56.264Z nexus cleaned up resources again (no state-change-related log lines around that time)

14:27:56.227Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): vmm is terminated, cleaning up resources
    file = nexus/src/app/instance.rs:1627
    instance_id = b3b42394-a7d9-4175-9c8c-75819408a328
    propolis_id = aab1f007-a6d2-4005-a4c3-f3673054344c
14:27:56.235Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): instance and vmm updated by sled agent
    file = nexus/src/app/instance.rs:1650
    instance_id = b3b42394-a7d9-4175-9c8c-75819408a328
    instance_updated = true
    propolis_id = aab1f007-a6d2-4005-a4c3-f3673054344c
    vmm_updated = true
14:27:56.236Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): received new runtime state from sled agent
    file = nexus/src/app/instance.rs:1514
    instance_id = b3b42394-a7d9-4175-9c8c-75819408a328
    instance_state = InstanceRuntimeState { propolis_id: None, dst_propolis_id: None, migration_id: None, gen: Generation(5), time_updated: 2024-04-12T14:27:48.313995841Z }
    propolis_id = aab1f007-a6d2-4005-a4c3-f3673054344c
    vmm_state = VmmRuntimeState { state: Destroyed, gen: Generation(8), time_updated: 2024-04-12T14:27:52.540001638Z }
...
14:27:56.264Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): vmm is terminated, cleaning up resources
    file = nexus/src/app/instance.rs:1627
    instance_id = b3b42394-a7d9-4175-9c8c-75819408a328
    propolis_id = aab1f007-a6d2-4005-a4c3-f3673054344c

The problem seems to be caused by sled-agent publishing the vmm termination state change a second time at 14:27:56.235Z, unnecessarily triggering another resource cleanup.
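One way to make such a republished update harmless (a sketch only; the types and names here are hypothetical, not Nexus's actual code) is to gate the cleanup on the VMM generation number actually advancing, so that the second publish of the same Generation(8) state is a no-op:

```rust
// Hypothetical sketch: apply a VMM state update (and its resource cleanup)
// only when the incoming generation advances past the stored one, so the
// republished Destroyed state at the same Generation(8) is ignored.
struct VmmRecord {
    gen: u64,
    cleanups_run: u32,
}

impl VmmRecord {
    fn apply_update(&mut self, incoming_gen: u64, destroyed: bool) {
        if incoming_gen <= self.gen {
            return; // stale or duplicate publish: nothing to do
        }
        self.gen = incoming_gen;
        if destroyed {
            self.cleanups_run += 1; // would decrement provisioning counters
        }
    }
}

fn main() {
    let mut vmm = VmmRecord { gen: 7, cleanups_run: 0 };
    vmm.apply_update(8, true); // first Destroyed publish at gen 8
    vmm.apply_update(8, true); // duplicate publish at the same gen: ignored
    assert_eq!(vmm.cleanups_run, 1);
}
```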

@askfongjojo

This may be a race condition similar to, but distinct from, #5042: it happens on stop rather than start. Please close if it's a dup.

smklein added a commit that referenced this issue May 30, 2024
…#5830)

Builds on #5081 and
#5089 , but more out of
convenience than necessity.

# Summary

This PR attempts to validate that the "virtual provisioning collection
{insert, delete}" operations are idempotent. Currently,
our usage of `max_instance_gen` only **partially** prevents updates
during instance provisioning deletions:
- If `max_instance_gen` is smaller than the observed instance generation
number...
- ... we avoid deleting the `virtual_provisioning_resource` record
(which is great)
- ... but we still decrement the `virtual_provisioning_collection`
values (which is really not great).

This basically means that we can "only cause the project/silo/fleet
usage values to decrement arbitrarily, with no other changes". This has
been, mechanically, the root cause of our observed underflows (e.g.,
#5525).

# Details of this change

- All the changes in
`nexus/db-queries/src/db/datastore/virtual_provisioning_collection.rs`
are tests validating idempotency of these operations.
- All the changes in
`nexus/db-queries/src/db/queries/virtual_provisioning_collection_update.rs`
are actually changes to the query which change functionality. The
objective of these changes is to preserve idempotency of the newly added
tests, and to prevent undercounting of virtual provisioning resources.
If these changes are reverted, the newly added tests start failing,
showing a lack of coverage.
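As a rough illustration of that idempotency property (with hypothetical in-memory stand-ins for the `virtual_provisioning_resource` and `virtual_provisioning_collection` tables; the real fix is a change to the SQL query), the collection decrement can be made conditional on the resource record actually being deleted:

```rust
use std::collections::HashSet;

// Hypothetical stand-ins: `resources` plays the role of
// virtual_provisioning_resource (per-instance records) and the counters
// play the role of virtual_provisioning_collection (aggregate usage).
struct Provisioning {
    resources: HashSet<String>,
    cpus_provisioned: i64,
    ram_provisioned: i64,
}

impl Provisioning {
    // Idempotent delete: decrement the collection counters only when this
    // call actually removed the resource record, so a repeated delete for
    // the same instance leaves the totals unchanged.
    fn delete_instance(&mut self, id: &str, ncpus: i64, ram: i64) {
        if self.resources.remove(id) {
            self.cpus_provisioned -= ncpus;
            self.ram_provisioned -= ram;
        }
    }
}

fn main() {
    let mut p = Provisioning {
        resources: HashSet::from(["b3b42394".to_string()]),
        cpus_provisioned: 4,
        ram_provisioned: 8_589_934_592,
    };
    p.delete_instance("b3b42394", 4, 8_589_934_592);
    p.delete_instance("b3b42394", 4, 8_589_934_592); // duplicate: no-op
    assert_eq!((p.cpus_provisioned, p.ram_provisioned), (0, 0));
}
```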