After #5326, such requests would frequently hit 500 errors like the one below:
17:57:22.653Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_internal): request completed
    error_message_external = Internal Server Error
    error_message_internal = failed to delete producer from collector: Communication Error: error sending request for url (http://[fd00:1122:3344:10a::3]:12223/producers/3fbf0053-2634-4a93-8d51-14beb2f98762): operation timed out
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/29ae98d/dropshot/src/server.rs:837
    latency_us = 18273761
    local_addr = [fd00:1122:3344:103::3]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:103::1]:58552
    req_id = 4915b2aa-be5c-425b-84ed-7ea86bc88bad
    response_code = 500
    uri = /instances/3fbf0053-2634-4a93-8d51-14beb2f98762
The requests would complete after one or a few retries. I looked at the oximeter log during the failure and there was no indication that it had stopped serving requests (it was up and running the entire time). I haven't checked any switch zone logs for network partitions, but there was no other failure in the system at the time AFAICT.
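For anyone scripting this workaround while the timeouts persist, the "retry until it succeeds" behaviour above could be sketched roughly as follows. This is only an illustration under stated assumptions: `retry_with_backoff` is a hypothetical helper, not something in omicron, and the closure stands in for whatever actually issues the backdoor request.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper (not part of omicron): run `op` until it succeeds or
/// `max_attempts` is reached, doubling the delay after each failure.
/// On success, returns the value together with the number of attempts used;
/// on exhaustion, returns the last error.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    initial_delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<(T, u32), E> {
    assert!(max_attempts > 0, "need at least one attempt");
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok((value, attempt)),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            // Otherwise wait, back off, and try again.
            Err(_) => {
                sleep(delay);
                delay *= 2;
                attempt += 1;
            }
        }
    }
}
```

The attempt count is returned so a script can report how many retries a given request needed, which is the number worth comparing against the Nexus log.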
I've uploaded the relevant nexus and oximeter log files to catacomb:/staff/core/omicron-5367.
@rcgoodfellow confirmed that the log messages in both Nexus (for timeouts) and oximeter (for routing issues) were logged during the same timeframe in which Dogfood had routing issues caused by some interactions with racktest. With that, we're confident that all the issues here have been explained, and I'm going to close this one.
As a workaround to unstick instances stuck in the stopping state, we make backdoor Nexus requests to destroy a VMM that is in limbo.