
Nexus returns communication errors when trying to delete a producer during vmm destroy operation #5367

Closed
askfongjojo opened this issue Apr 1, 2024 · 1 comment

Comments

@askfongjojo

askfongjojo commented Apr 1, 2024

As a workaround to unstick instances stuck in the stopping state, we make this kind of backdoor Nexus request to destroy a VMM in limbo:

curl "http://[fd00:1122:3344:106::3]:12221/instances/7d8207da-fafe-49e1-8e48-0ec38b3285bd" -H  "Content-Type: application/json" -XPUT --data @- <<EOF
{
  "instance_state": {
    "propolis_id": null,
    "dst_propolis_id": null,
    "migration_id": null,
    "gen": 40,
    "time_updated": "2024-03-08T01:23:45.678900000Z"
  },
  "propolis_id": "185aa460-e8b3-499b-aed8-fbbb30814d62",
  "vmm_state": {
    "state": "destroyed",
    "gen": 40,
    "time_updated": "2024-03-08T01:23:45.678900000Z"
  }
}
EOF

After #5326, such requests would frequently hit 500 errors like the one below:

17:57:22.653Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_internal): request completed
    error_message_external = Internal Server Error
    error_message_internal = failed to delete producer from collector: Communication Error: error sending request for url (http://[fd00:1122:3344:10a::3]:12223/producers/3fbf0053-2634-4a93-8d51-14beb2f98762): operation timed out
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/29ae98d/dropshot/src/server.rs:837
    latency_us = 18273761
    local_addr = [fd00:1122:3344:103::3]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:103::1]:58552
    req_id = 4915b2aa-be5c-425b-84ed-7ea86bc88bad
    response_code = 500
    uri = /instances/3fbf0053-2634-4a93-8d51-14beb2f98762
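
For what it's worth, the failing call can be exercised by hand against the collector. This is just a sketch, assuming the delete-producer call is an HTTP DELETE to the path shown in the error (the log doesn't record the method):

# Manually hit the collector endpoint that Nexus times out on.
# Assumption: delete-producer is a DELETE to this path on the oximeter zone.
curl -v -X DELETE "http://[fd00:1122:3344:10a::3]:12223/producers/3fbf0053-2634-4a93-8d51-14beb2f98762"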

The requests would complete after one or a few retries (sketched below). I looked at the oximeter log during the failure and there was no indication that it stopped serving requests (it was up and running the entire time). I haven't checked any switch zone logs for network partitions, but there was no other failure in the system at the time, AFAICT.
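
For reference, the manual retry amounted to something like the sketch below; the 5-attempt/5-second values are arbitrary, and state.json is assumed to hold the JSON body from the workaround above:

# Retry the backdoor PUT until Nexus returns a 2xx instead of a 500.
# state.json is assumed to contain the instance/vmm state payload shown above.
for i in 1 2 3 4 5; do
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    "http://[fd00:1122:3344:106::3]:12221/instances/7d8207da-fafe-49e1-8e48-0ec38b3285bd" \
    -H "Content-Type: application/json" -X PUT --data @state.json)
  case "$code" in
    2*) echo "succeeded on attempt $i"; break ;;
    *)  echo "attempt $i returned $code; retrying..."; sleep 5 ;;
  esac
done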

I've uploaded the relevant nexus and oximeter log files to catacomb:/staff/core/omicron-5367.

@bnaecker
Collaborator

bnaecker commented Apr 1, 2024

@rcgoodfellow confirmed that the log messages in both Nexus (timeouts) and oximeter (routing issues) fall within the same timeframe as the Dogfood routing problems caused by some interactions with racktest. With that, we're confident that all the issues here have been explained, and I'm going to close this one.
