Instance stuck in stopping state when stopping instances en masse concurrently #5363

Closed
askfongjojo opened this issue Mar 31, 2024 · 6 comments
Labels
known issue To include in customer documentation and training

Comments


askfongjojo commented Mar 31, 2024

The issue was seen on rack2:

  • instance_id = 18f91470-9d6f-40db-aa91-092ca88427d8
  • propolis_id = 1348b7be-ffcc-45a9-88e6-f62905c56898

It is unclear if this is a concurrency issue. The logs I've looked at so far do not show any errors, but there were significant time gaps between events. Here is the timeline of major events:

propolis
2024-03-30T23:26:55.311292282Z: request received
2024-03-30T23:26:55.330018213Z: shutdown complete

23:26:55.311Z INFO propolis-server (vm_controller): Dropping VM controller
23:26:55.311Z INFO propolis-server (oximeter-producer): received request to begin graceful shutdown
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/c3e8467/dropshot/src/server.rs:266
    local_addr = [fd00:1122:3344:10b::1:182]:12400
    method = PUT
    remote_addr = [fd00:1122:3344:10b::1]:48386
    req_id = 71eeec63-c13e-450d-b5ec-b13342e2c6c4
    uri = /instance
23:26:55.330Z INFO propolis-server: request completed
    latency_us = 55
    local_addr = [fd00:1122:3344:10b::1:182]:12400
    method = GET
    remote_addr = [fd00:1122:3344:10b::1]:43168
    req_id = 449cc296-a359-4568-8076-d95c93b2ec41
    response_code = 200
    uri = /instance/state-monitor

sled-agent
2024-03-30T23:26:55.330195531Z: propolis state changed to stopping
2024-03-30T23:32:13.943149396Z: log archival beginning

23:26:55.330Z INFO SledAgent (InstanceManager): updated state after observing Propolis state change
    file = sled-agent/src/instance.rs:591
    instance_id = 18f91470-9d6f-40db-aa91-092ca88427d8
    new_instance_state = InstanceRuntimeState { propolis_id: Some(1348b7be-ffcc-45a9-88e6-f62905c56898), dst_propolis_id: None, migration_id: None, gen: Generation(3), time_updated: 2024-03-30T23:16:37.314302303Z }
    new_vmm_state = VmmRuntimeState { state: Stopping, gen: Generation(6), time_updated: 2024-03-30T23:26:55.330100179Z }
    propolis_id = 1348b7be-ffcc-45a9-88e6-f62905c56898
23:26:55.330Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    file = sled-agent/src/instance.rs:491
    instance_id = 18f91470-9d6f-40db-aa91-092ca88427d8
    state = SledInstanceState { instance_state: InstanceRuntimeState { propolis_id: Some(1348b7be-ffcc-45a9-88e6-f62905c56898), dst_propolis_id: None, migration_id: None, gen: Generation(3), time_updated: 2024-03-30T23:16:37.314302303Z }, propolis_id: 1348b7be-ffcc-45a9-88e6-f62905c56898, vmm_state: VmmRuntimeState { state: Stopping, gen: Generation(6), time_updated: 2024-03-30T23:26:55.330100179Z } }

23:32:13.943Z INFO SledAgent (DumpSetup-worker): Archiving 104 log files from oxz_propolis-server_1348b7be-ffcc-45a9-88e6-f62905c56898 zone
    file = sled-agent/src/dump_setup.rs:878

Nexus log lines related to the propolis zone state changes (note: both sled 8's and sled 12's nexus zones were serving the sled-agent requests):

BRM44220011 # grep 1348b7be-ffcc-45a9-88e6-f62905c56898 /pool/ext/e126ddcc-8bee-46ba-8199-2a74df0ba040/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus:default.log.1711841400 | looker
23:26:55.305Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): writing instance state returned from sled agent
    instance_id = 18f91470-9d6f-40db-aa91-092ca88427d8
    new_state = Some(SledInstanceState { instance_state: InstanceRuntimeState { propolis_id: Some(1348b7be-ffcc-45a9-88e6-f62905c56898), dst_propolis_id: None, migration_id: None, gen: Generation(3), time_updated: 2024-03-30T23:16:37.314302303Z }, propolis_id: 1348b7be-ffcc-45a9-88e6-f62905c56898, vmm_state: VmmRuntimeState { state: Stopping, gen: Generation(4), time_updated: 2024-03-30T23:26:55.305572005Z } })
23:26:55.313Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): attempted to write instance state from sled agent
    instance_id = 18f91470-9d6f-40db-aa91-092ca88427d8
    propolis_id = 1348b7be-ffcc-45a9-88e6-f62905c56898
    result = Ok((false, true))

BRM42220057 # grep 1348b7be-ffcc-45a9-88e6-f62905c56898  /pool/ext/416fd29e-d3b5-4fdf-8101-d0d163fa0706/crypt/debug/oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a/oxide-nexus:default.log.1711843200 | looker
23:26:55.306Z INFO 20b100d0-84c3-4119-aa9b-0c632b0b6a3a (ServerContext): received new runtime state from sled agent
    file = nexus/src/app/instance.rs:1495
    instance_id = 18f91470-9d6f-40db-aa91-092ca88427d8
    instance_state = InstanceRuntimeState { propolis_id: Some(1348b7be-ffcc-45a9-88e6-f62905c56898), dst_propolis_id: None, migration_id: None, gen: Generation(3), time_updated: 2024-03-30T23:16:37.314302303Z }
    propolis_id = 1348b7be-ffcc-45a9-88e6-f62905c56898
    vmm_state = VmmRuntimeState { state: Stopping, gen: Generation(5), time_updated: 2024-03-30T23:26:55.305762007Z }
23:26:55.329Z INFO 20b100d0-84c3-4119-aa9b-0c632b0b6a3a (ServerContext): instance and vmm updated by sled agent
    file = nexus/src/app/instance.rs:1631
    instance_id = 18f91470-9d6f-40db-aa91-092ca88427d8
    instance_updated = false
    propolis_id = 1348b7be-ffcc-45a9-88e6-f62905c56898
    vmm_updated = true
23:26:55.330Z INFO 20b100d0-84c3-4119-aa9b-0c632b0b6a3a (ServerContext): received new runtime state from sled agent
    file = nexus/src/app/instance.rs:1495
    instance_id = 18f91470-9d6f-40db-aa91-092ca88427d8
    instance_state = InstanceRuntimeState { propolis_id: Some(1348b7be-ffcc-45a9-88e6-f62905c56898), dst_propolis_id: None, migration_id: None, gen: Generation(3), time_updated: 2024-03-30T23:16:37.314302303Z }
    propolis_id = 1348b7be-ffcc-45a9-88e6-f62905c56898
    vmm_state = VmmRuntimeState { state: Stopping, gen: Generation(6), time_updated: 2024-03-30T23:26:55.330100179Z }
23:26:55.354Z INFO 20b100d0-84c3-4119-aa9b-0c632b0b6a3a (ServerContext): instance and vmm updated by sled agent
    file = nexus/src/app/instance.rs:1631
    instance_id = 18f91470-9d6f-40db-aa91-092ca88427d8
    instance_updated = false
    propolis_id = 1348b7be-ffcc-45a9-88e6-f62905c56898
    vmm_updated = true
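
For context, the `instance_updated = false` / `vmm_updated = true` pairs (and the `Ok((false, true))` result above) come from Nexus's generation-gated writes: the instance record stayed at Generation(3), so it was left alone, while the VMM record advanced through generations 4-6 as it moved to Stopping. A minimal sketch of that comparison (an illustration, not the actual Nexus datastore code):

```rust
// Minimal sketch of the generation-gated updates behind the
// `instance_updated` / `vmm_updated` booleans and the `Ok((false, true))`
// result in the Nexus logs above. Not the real Nexus datastore logic; just
// the comparison it implies: an incoming record is written only if its
// generation is strictly newer than the stored one.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Generation(u64);

/// Returns true (record rewritten) only when the incoming generation is newer.
fn try_update(stored_gen: &mut Generation, incoming_gen: Generation) -> bool {
    if incoming_gen > *stored_gen {
        *stored_gen = incoming_gen;
        true
    } else {
        false
    }
}

fn main() {
    // State as stored before the 23:26:55.330 report: instance at gen 3,
    // VMM at gen 5 (from the previous Stopping report).
    let mut instance_gen = Generation(3);
    let mut vmm_gen = Generation(5);

    // The 23:26:55.330 report carries instance gen 3 and VMM gen 6.
    let instance_updated = try_update(&mut instance_gen, Generation(3));
    let vmm_updated = try_update(&mut vmm_gen, Generation(6));

    // Matches the logged outcome: instance_updated = false, vmm_updated = true.
    println!("({instance_updated}, {vmm_updated})");
}
```
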
@askfongjojo (Author)

sled-agent and propolis logs have been uploaded to catacomb.eng.oxide.computer:/staff/core/omicron-5363.

@askfongjojo (Author)

By "en masse", I meant a terraform destroy of 75 VMs. I've destroyed up to 120 at a time in the past, and many times between 80 and 120. I had not encountered this particular error before.

@askfongjojo (Author)

I found that two nexus nodes were serving the sled-agent requests for the stuck-in-stopping instances, and I have updated the issue description with the relevant log lines from both of them.

I reviewed the sled-agent log lines to try to understand what happened between the VMM state moving to Stopping and the log archival. Besides background tasks, the sled-agent was also deprovisioning 3 other instances. They all happened to hit a 500 error (they were successfully destroyed in the end):

23:26:59.128Z WARN SledAgent (InstanceManager): Failed to publish instance state to Nexus: Notifying Nexus failed: Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "7c621728-8303-4c9e-8bce-633d8c81f05d", "content-length": "124", "date": "Sat, 30 Mar 2024 23:26:59 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "7c621728-8303-4c9e-8bce-633d8c81f05d" }
    file = sled-agent/src/instance.rs:545
    instance_id = 9a137e18-b22e-4419-999e-f18629d0c1d0
    retry_after = 181.657037ms

23:27:59.435Z WARN SledAgent (InstanceManager): Failed to publish instance state to Nexus: Notifying Nexus failed: Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "04a46873-ca82-47d2-aef0-60c726d20c19", "content-length": "124", "date": "Sat, 30 Mar 2024 23:27:58 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "04a46873-ca82-47d2-aef0-60c726d20c19" }
    file = sled-agent/src/instance.rs:545
    instance_id = b91a5c67-077e-4e05-a8c5-d85b997de150
    retry_after = 153.664517ms

23:28:26.114Z WARN SledAgent (InstanceManager): Failed to publish instance state to Nexus: Notifying Nexus failed: Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "9827a802-9200-4cd3-9ce3-e12ebff4ff23", "content-length": "124", "date": "Sat, 30 Mar 2024 23:28:25 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "9827a802-9200-4cd3-9ce3-e12ebff4ff23" }
    file = sled-agent/src/instance.rs:545
    instance_id = cf5f67c2-afa4-4942-b83e-47cfb56fb367
    retry_after = 364.755109ms

Looking at the nexus logs, these are all errors from oximeter producer deletion. Those failed requests had rather long latencies. I filed #5364 to report that as a potential issue. It isn't necessarily related to the issue reported here.
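
Those `retry_after` values reflect the sled-agent retrying the notification with backoff rather than giving up, which is consistent with the three instances being destroyed successfully in the end. A loose, self-contained sketch of that publish-with-retry pattern (assuming the tokio crate with its `macros`, `rt`, and `time` features; the backoff policy shown is illustrative, not the actual code in sled-agent/src/instance.rs):

```rust
// Illustration of the retry-with-backoff publishing pattern behind the
// "Failed to publish instance state to Nexus ... retry_after = ..." warnings.
// Self-contained sketch only; the delays and cap are made up for the example.

use std::time::Duration;

async fn publish_with_backoff<F, Fut, E>(mut publish: F)
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<(), E>>,
    E: std::fmt::Display,
{
    let mut delay = Duration::from_millis(100);
    loop {
        match publish().await {
            Ok(()) => return,
            Err(e) => {
                // Transient errors (e.g. a 500 from Nexus) are logged and
                // retried; the warning's `retry_after` is this delay.
                eprintln!("Failed to publish instance state to Nexus: {e}; retry_after = {delay:?}");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(30));
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let mut attempts = 0;
    publish_with_backoff(|| {
        attempts += 1;
        let fail = attempts < 3;
        async move {
            if fail {
                Err("Error Response: status: 500 Internal Server Error")
            } else {
                Ok(())
            }
        }
    })
    .await;
    println!("state published after {attempts} attempts");
}
```
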

@morlandi7 added this to the 7 milestone Apr 1, 2024

askfongjojo commented Apr 3, 2024

I just ran into this issue again (it has become harder to reproduce since PRs #5366 and #5373).
#5376 might be a variant of this same issue.

Taking a quick look at the sled-agent log again, it looks like the issue is that the VMM destroy event was never published to sled-agent; in fact, the propolis zone wasn't torn down completely. The propolis zone reported in this ticket's description was already removed during the last rack update, but the one from the latest occurrence of the same issue can still be seen here:

BRM42220009 # zoneadm list | grep c4be0a7d-8b89-4798-ba12-bbf06817bd0a
oxz_propolis-server_c4be0a7d-8b89-4798-ba12-bbf06817bd0a
BRM42220009 # svcs *c4be0a7d-8b89-4798-ba12-bbf06817bd0a*

The last time sled-agent reported on it was while it was "Stopping" and it never made more progress:

BRM42220009 # egrep 'c4be0a7d|0be1d087' /pool/ext/62a4c68a-2073-42d0-8e49-01f5e8b90cd4/crypt/debug/global/oxide-sled-agent:default.log.1712113793 | grep Generation | looker
....
00:26:12.368Z INFO SledAgent (InstanceManager): updated state after observing Propolis state change
    file = sled-agent/src/instance.rs:591
    instance_id = 0be1d087-2b2c-4c3e-837a-18b633f3ae68
    new_instance_state = InstanceRuntimeState { propolis_id: Some(c4be0a7d-8b89-4798-ba12-bbf06817bd0a), dst_propolis_id: None, migration_id: None, gen: Generation(3), time_updated: 2024-04-02T22:48:13.798433582Z }
    new_vmm_state = VmmRuntimeState { state: Stopping, gen: Generation(5), time_updated: 2024-04-03T00:26:12.368505444Z }
    propolis_id = c4be0a7d-8b89-4798-ba12-bbf06817bd0a
00:26:12.368Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    file = sled-agent/src/instance.rs:491
    instance_id = 0be1d087-2b2c-4c3e-837a-18b633f3ae68
    state = SledInstanceState { instance_state: InstanceRuntimeState { propolis_id: Some(c4be0a7d-8b89-4798-ba12-bbf06817bd0a), dst_propolis_id: None, migration_id: None, gen: Generation(3), time_updated: 2024-04-02T22:48:13.798433582Z }, propolis_id: c4be0a7d-8b89-4798-ba12-bbf06817bd0a, vmm_state: VmmRuntimeState { state: Stopping, gen: Generation(5), time_updated: 2024-04-03T00:26:12.368505444Z } }
00:26:12.390Z INFO SledAgent (InstanceManager): updated state after observing Propolis state change
    file = sled-agent/src/instance.rs:591
    instance_id = 0be1d087-2b2c-4c3e-837a-18b633f3ae68
    new_instance_state = InstanceRuntimeState { propolis_id: Some(c4be0a7d-8b89-4798-ba12-bbf06817bd0a), dst_propolis_id: None, migration_id: None, gen: Generation(3), time_updated: 2024-04-02T22:48:13.798433582Z }
    new_vmm_state = VmmRuntimeState { state: Stopping, gen: Generation(6), time_updated: 2024-04-03T00:26:12.390583850Z }
    propolis_id = c4be0a7d-8b89-4798-ba12-bbf06817bd0a
00:26:12.390Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    file = sled-agent/src/instance.rs:491
    instance_id = 0be1d087-2b2c-4c3e-837a-18b633f3ae68
    state = SledInstanceState { instance_state: InstanceRuntimeState { propolis_id: Some(c4be0a7d-8b89-4798-ba12-bbf06817bd0a), dst_propolis_id: None, migration_id: None, gen: Generation(3), time_updated: 2024-04-02T22:48:13.798433582Z }, propolis_id: c4be0a7d-8b89-4798-ba12-bbf06817bd0a, vmm_state: VmmRuntimeState { state: Stopping, gen: Generation(6), time_updated: 2024-04-03T00:26:12.390583850Z } }

The sled-agent and propolis logs for this newer instance have been uploaded to catacomb:/staff/core/omicron-5363/propolis-c4be0a7d.
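
For context on why a missed destroy event leaves the instance stuck: the sled-agent learns about VMM transitions by long-polling propolis-server's /instance/state-monitor endpoint (visible in the propolis log in the description) and only publishes states it has actually observed. A rough sketch of that control flow, using hypothetical stand-in names rather than the real sled-agent/propolis-client types:

```rust
// Rough sketch of the state-monitor loop that drives the sled-agent log lines
// above. The function and type names are hypothetical stand-ins, not the real
// API; the point is the control flow: the sled-agent only publishes states it
// has observed from propolis, so if propolis hangs in its teardown path and
// never reports Destroyed, the VMM stays in Stopping indefinitely.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VmmState {
    Running,
    Stopping,
    Destroyed,
}

struct Observed {
    generation: u64,
    state: VmmState,
}

fn monitor_vmm(
    // Hypothetical long poll: returns once the monitored generation advances
    // past the one we last saw.
    mut state_monitor: impl FnMut(u64) -> Option<Observed>,
    mut publish_to_nexus: impl FnMut(VmmState),
) {
    let mut current_gen = 0;
    loop {
        // If propolis is deadlocked (as in the next comment), this call never
        // returns and nothing further is ever published for this VMM.
        let Some(obs) = state_monitor(current_gen) else { break };
        current_gen = obs.generation;
        publish_to_nexus(obs.state);
        if obs.state == VmmState::Destroyed {
            // Only after observing Destroyed would the sled-agent tear down
            // the zone and report the terminal state to Nexus.
            break;
        }
    }
}

fn main() {
    // Mock propolis reporting Running (gen 4), Stopping (gen 5), Destroyed (gen 6).
    let mut script = vec![
        Observed { generation: 4, state: VmmState::Running },
        Observed { generation: 5, state: VmmState::Stopping },
        Observed { generation: 6, state: VmmState::Destroyed },
    ]
    .into_iter();
    monitor_vmm(|_gen| script.next(), |s| println!("publish to Nexus: {s:?}"));
}
```
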


pfmooney commented Apr 3, 2024

This appears to be a case of oxidecomputer/propolis#675. Hopping onto that node, I see threads deadlocked with similar mutex operations.

edit, for context:

-------- thread# 138 / lwp# 138 [in-memory worker 4] ---------
 fffff5ffef1639f7 lwp_park (0, 0, 0)
 fffff5ffef15b6a3 mutex_lock_impl (2ba4730, 0) + 113
 fffff5ffef15b813 mutex_lock (2ba4730) + 13
 00000000014055cf _ZN76_$LT$propolis..accessors..Node$LT$T$GT$$u20$as$u20$core..ops..drop..Drop$GT$4drop17h29f18860ed1bcf01E () + 8f
 0000000001421892 _ZN5alloc4sync16Arc$LT$T$C$A$GT$9drop_slow17hd80794aa0785a878E () + 12
 00000000013d4ff5 _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h4b97c75f8bb7d3ccE () + e25
 00000000013d56cf _ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17h2e8912362722ecf5E () + 6f
 00000000021af619 _ZN3std3sys3pal4unix6thread6Thread3new12thread_start17h9ad099a424212e51E () + 29
 fffff5ffef163667 _thrp_setup (fffff5ffee6e4240) + 77
 fffff5ffef1639b0 _lwp_start ()

--------------------- thread# 1 / lwp# 1 ---------------------
 fffff5ffef1639f7 lwp_park (0, 0, 0)
 fffff5ffef15b6a3 mutex_lock_impl (2ba4730, 0) + 113
 fffff5ffef15b813 mutex_lock (2ba4730) + 13
 00000000014055cf _ZN76_$LT$propolis..accessors..Node$LT$T$GT$$u20$as$u20$core..ops..drop..Drop$GT$4drop17h29f18860ed1bcf01E () + 8f
 0000000001421892 _ZN5alloc4sync16Arc$LT$T$C$A$GT$9drop_slow17hd80794aa0785a878E () + 12
 000000000140a4df _ZN8propolis9accessors17Accessor$LT$T$GT$6poison17h1875032ed78822cbE () + 51f
 0000000001360854 _ZN8propolis3vmm7machine7Machine10do_destroy17h9d9c6ac18754cb17E () + 24
 00000000013606c3 _ZN8propolis3vmm7machine7Machine7destroy17hd97a473d1aaa8ee4E () + 13
 0000000000f6d88c _ZN75_$LT$propolis_server..vm..VmController$u20$as$u20$core..ops..drop..Drop$GT$4drop17hd125d907be93e61bE () + 12c
 0000000001014c8d _ZN5alloc4sync16Arc$LT$T$C$A$GT$9drop_slow17hb4245e848b740c26E () + 1d
 0000000000e9ec88 _ZN79_$LT$propolis_server..vnc..PropolisVncServer$u20$as$u20$rfb..server..Server$GT$4stop28_$u7b$$u7b$closure$u7d$$u7d$17h4691720ee73a0769E () + 148
 0000000000def8aa _ZN95_$LT$futures_util..future..poll_fn..PollFn$LT$F$GT$$u20$as$u20$core..future..future..Future$GT$4poll17h54bcd78bc1140e45E () + 9ca
 0000000000dec442 _ZN5tokio7runtime4park16CachedParkThread8block_on17h9f45431f2bc4c7e0E () + 872
 0000000000e18195 _ZN5tokio7runtime7context7runtime13enter_runtime17h0a3054d090f8fb51E () + 145
 0000000000db340d _ZN5tokio7runtime7runtime7Runtime8block_on17ha258d52bf3d1be38E () + 5d
 0000000000e276f4 _ZN15propolis_server4main17h860bc799b12a7d43E () + 94
 0000000000de42c6 _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17hc748a204e6280757E () + 6
 0000000000e299e1 _ZN3std2rt10lang_start28_$u7b$$u7b$closure$u7d$$u7d$17hcea168323adede24E.llvm.9366677054160684672 () + 11
 0000000002189f13 _ZN3std2rt19lang_start_internal17hdc1ac086c909d074E () + 2d3
 0000000000e278ac main () + 2c
 0000000000d75ff7 _start_crt () + 87
 0000000000d75f58 _start () + 18
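
Both stacks are parked in `mutex_lock` on the same address (2ba4730) inside the `accessors::Node` Drop impl, with thread# 1 reaching it via `Accessor::poison` during `VmController` teardown. As a minimal, hypothetical illustration of that general shape (dropping a value whose Drop needs a lock that is already held), not propolis's actual accessor code:

```rust
// Minimal, hypothetical illustration of the deadlock pattern suggested by the
// stacks above: a Drop impl that takes a lock which the dropping code path is
// already holding. This is not propolis's actual accessor code, just the shape.

use std::sync::{Arc, Mutex};

struct Tree {
    live_nodes: usize,
}

struct Node {
    tree: Arc<Mutex<Tree>>,
}

impl Drop for Node {
    fn drop(&mut self) {
        // Like accessors::Node::drop in the stacks above: unlinking the node
        // requires the tree lock.
        let mut tree = self.tree.lock().unwrap();
        tree.live_nodes -= 1;
    }
}

fn poison(tree: &Arc<Mutex<Tree>>, node: Node) {
    // Take the tree lock for the whole teardown...
    let _guard = tree.lock().unwrap();
    // ...and then drop a Node while still holding it. Node::drop tries to take
    // the same non-reentrant mutex, so this deadlocks (or panics, depending on
    // platform/std version), and any other thread that later touches the lock
    // (like thread# 138 above) blocks forever too.
    drop(node);
}

fn main() {
    let tree = Arc::new(Mutex::new(Tree { live_nodes: 1 }));
    let node = Node { tree: Arc::clone(&tree) };
    poison(&tree, node); // hangs here, analogous to thread# 1 above
}
```

Running this sketch hangs in `poison`, analogous to thread# 1 above.
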

@askfongjojo added the "known issue" label Apr 3, 2024
@askfongjojo changed the milestone from 7 to 8 Apr 9, 2024
@askfongjojo changed the milestone from 8 to 9 Apr 24, 2024
@askfongjojo (Author)

I've tested this use case many times over the last 3 weeks and haven't been able to reproduce the issue with the fix from oxidecomputer/propolis#675 in place. Closing this ticket now seems prudent.
