Instance stuck in stopping state when stopping instances en masse concurrently #5363
Comments
sled-agent and propolis logs have been uploaded to catacomb.eng.oxide.computer:/staff/core/omicron-5363.
I found that two nexus nodes were serving the sled-agent requests for the stuck-in-stopping instances, and I updated the issue description with the relevant log lines from both of them. I reviewed the sled-agent log lines to try to understand what happened between the VMM state moving to stopping and log archival. Besides background tasks, the sled-agent was also deprovisioning 3 other instances. They all happened to hit a 500 error (they were successfully destroyed in the end):
Looking at the nexus logs, these are all errors from oximeter producer deletion. Those failed requests had rather long latencies. I filed #5364 to report that as a potential issue. It isn't necessarily related to the issue reported here.
I just ran into this issue again (it is harder to reproduce after PRs #5366 and #5373). Taking a quick look at the sled-agent log again, it looks like the issue is that the VMM destroy event was never published to sled-agent; in fact, the propolis zone wasn't torn down completely. The propolis zone reported in this ticket was already removed during the last rack update, but the one related to the latest occurrence of the same issue can be seen here:
The last time sled-agent reported on it was while it was "Stopping" and it never made more progress:
The sled-agent and propolis logs for this newer instance have been uploaded to catacomb:/staff/core/omicron-5363/propolis-c4be0a7d.
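To make the failure mode concrete, here is a minimal sketch of why a missing terminal state report leaves an instance stuck. This is not the actual sled-agent code and the names are hypothetical; it only illustrates the assumption that zone teardown and the Nexus notification happen after the VMM is observed to reach a destroyed state, so if propolis stops reporting new states after "Stopping", that step is never reached.

```rust
// Illustrative sketch only; types and function names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
enum VmmState {
    Running,
    Stopping,
    Destroyed,
}

// Keep consuming state reports until the VMM reaches Destroyed. In the real
// system this would be a long-poll against propolis-server; here the caller
// supplies a closure that yields the next observed state.
fn wait_for_destroyed(mut next_state: impl FnMut() -> VmmState) {
    loop {
        let state = next_state();
        println!("observed state: {state:?}");
        if state == VmmState::Destroyed {
            // Only now would zone teardown and the Nexus notification run.
            break;
        }
        // If no state after Stopping is ever published, this loop never
        // reaches the teardown step and the instance stays in Stopping.
    }
}

fn main() {
    // Happy path: the VMM eventually reports Destroyed.
    let states = [VmmState::Running, VmmState::Stopping, VmmState::Destroyed];
    let mut it = states.into_iter();
    wait_for_destroyed(move || it.next().unwrap_or(VmmState::Destroyed));
}
```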
This appears to be a case of oxidecomputer/propolis#675. Hopping onto that node, I see threads deadlocked in similar mutex operations. Edit, for context:
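For reference, the shape of the problem looks like a classic lock-ordering inversion. The example below is a generic illustration of that class of bug, not the code from oxidecomputer/propolis#675: two threads acquire a pair of mutexes in opposite orders, so each can end up blocked forever on the lock the other holds, matching the "threads stuck in mutex operations" signature.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let a = Arc::new(Mutex::new(0u32));
    let b = Arc::new(Mutex::new(0u32));

    let (a1, b1) = (Arc::clone(&a), Arc::clone(&b));
    let t1 = thread::spawn(move || {
        let _ga = a1.lock().unwrap();             // t1 holds `a`
        thread::sleep(Duration::from_millis(50));
        let _gb = b1.lock().unwrap();             // waits for `b`, held by t2
    });

    let t2 = thread::spawn(move || {
        let _gb = b.lock().unwrap();              // t2 holds `b`
        thread::sleep(Duration::from_millis(50));
        let _ga = a.lock().unwrap();              // waits for `a`, held by t1
    });

    // Neither join returns: both threads are parked on the other's mutex.
    t1.join().unwrap();
    t2.join().unwrap();
}
```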
I've tested this use case extensively over the last 3 weeks and haven't been able to reproduce the issue with the fix from oxidecomputer/propolis#675. It seems prudent to close this ticket.
The issue was seen on rack2:
It is unclear if this is a concurrency issue. The logs I've looked at so far do not show any errors, but there were significant time gaps between events. Here is the timeline of major events:
propolis
2024-03-30T23:26:55.311292282Z: request received
2024-03-30T23:26:55.330018213Z: shutdown complete
sled-agent
2024-03-30T23:26:55.330195531Z: propolis state changed to stopping
2024-03-30T23:32:13.943149396Z: log archival beginning
Nexus log lines related to the propolis zone state changes (note: both sled 8's and sled 12's nexus zones were serving the sled-agent requests)