
Nexus returns communication errors when trying to delete a producer during vmm destroy operation #5367

Closed
askfongjojo opened this issue Apr 1, 2024 · 1 comment

Comments

@askfongjojo

askfongjojo commented Apr 1, 2024

As a workaround to unstick instances stuck in the stopping state, we make this kind of backdoor Nexus request to destroy a VMM in limbo:

curl "http://[fd00:1122:3344:106::3]:12221/instances/7d8207da-fafe-49e1-8e48-0ec38b3285bd" -H  "Content-Type: application/json" -XPUT --data @- <<EOF
{
  "instance_state": {
    "propolis_id": null,
    "dst_propolis_id": null,
    "migration_id": null,
    "gen": 40,
    "time_updated": "2024-03-08T01:23:45.678900000Z"
  },
  "propolis_id": "185aa460-e8b3-499b-aed8-fbbb30814d62",
  "vmm_state": {
    "state": "destroyed",
    "gen": 40,
    "time_updated": "2024-03-08T01:23:45.678900000Z"
  }
}
EOF

After #5326, such requests would frequently hit 500 errors like the one below:

17:57:22.653Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_internal): request completed
    error_message_external = Internal Server Error
    error_message_internal = failed to delete producer from collector: Communication Error: error sending request for url (http://[fd00:1122:3344:10a::3]:12223/producers/3fbf0053-2634-4a93-8d51-14beb2f98762): operation timed out
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/29ae98d/dropshot/src/server.rs:837
    latency_us = 18273761
    local_addr = [fd00:1122:3344:103::3]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:103::1]:58552
    req_id = 4915b2aa-be5c-425b-84ed-7ea86bc88bad
    response_code = 500
    uri = /instances/3fbf0053-2634-4a93-8d51-14beb2f98762
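
For what it's worth, the failing call can be exercised by hand against the collector. This is just a sketch, assuming the delete-producer call is an HTTP DELETE to the path shown in the error (the log doesn't record the method):

# Manually hit the collector endpoint that Nexus times out on.
# Assumption: delete-producer is a DELETE to this path on the oximeter zone.
curl -v -X DELETE "http://[fd00:1122:3344:10a::3]:12223/producers/3fbf0053-2634-4a93-8d51-14beb2f98762"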

The requests would complete after one or a few retries (sketched below). I looked at the oximeter log during the failure and there was no indication that it stopped serving requests (it was up and running the entire time). I haven't checked any switch zone logs for network partitions, but there was no other failure in the system at the time, AFAICT.
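
For reference, the manual retry amounted to something like the sketch below; the 5-attempt/5-second values are arbitrary, and state.json is assumed to hold the JSON body from the workaround above:

# Retry the backdoor PUT until Nexus returns a 2xx instead of a 500.
# state.json is assumed to contain the instance/vmm state payload shown above.
for i in 1 2 3 4 5; do
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    "http://[fd00:1122:3344:106::3]:12221/instances/7d8207da-fafe-49e1-8e48-0ec38b3285bd" \
    -H "Content-Type: application/json" -X PUT --data @state.json)
  case "$code" in
    2*) echo "succeeded on attempt $i"; break ;;
    *)  echo "attempt $i returned $code; retrying..."; sleep 5 ;;
  esac
done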

I've uploaded the relevant nexus and oximeter log files to catacomb:/staff/core/omicron-5367.

@bnaecker
Collaborator

bnaecker commented Apr 1, 2024

@rcgoodfellow confirmed that the log messages in both Nexus (timeouts) and oximeter (routing issues) fall within the same timeframe as the Dogfood routing problems caused by some interactions with racktest. With that, we're confident that all the issues here have been explained, and I'm going to close this one.
