Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instance stuck in stopping state as scrubbing seems to be getting into the way? #5376

Closed
askfongjojo opened this issue Apr 2, 2024 · 3 comments

Comments

@askfongjojo
Copy link

askfongjojo commented Apr 2, 2024

I have an instance that's stuck in stopping state after an attempt of instance-stop API request:

root@oxz_switch0:~# omdb db instances | grep 1b62a66f-3c8b-4656-a495-71b6816d8f49
1b62a66f-3c8b-4656-a495-71b6816d8f49 app-52                                                          stopping 3aa75e75-079c-4dac-ad83-55401d3c4938 0c7011f7-a4bf-4daf-90cc-1c2410103300 BRM42220057 

From the propolis log, it would appear that after VM controller has been removed, scrubbing is still trying to finish its job and causing the teardown to stall.

15:16:21.766Z INFO propolis-server (vnc-server): stopping VNC server
15:16:21.766Z INFO propolis-server (vm_controller): Dropping VM controller
15:16:21.766Z INFO propolis-server (oximeter-producer): received request to begin graceful shutdown
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/c3e8467/dropshot/src/server.rs:266
    local_addr = [fd00:1122:3344:104::1:e4]:12400
    method = PUT
    remote_addr = [fd00:1122:3344:104::1]:60771
    req_id = c80b6599-1ffd-47c6-9afd-65cd5b842c57
    uri = /instance
15:16:21.788Z INFO propolis-server: request completed
    latency_us = 80
    local_addr = [fd00:1122:3344:104::1:e4]:12400
    method = GET
    remote_addr = [fd00:1122:3344:104::1]:53432
    req_id = f916953e-8146-4ed6-967b-134bb6035b6e
    response_code = 200
    uri = /instance/state-monitor
15:16:31.262Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 1342208/4194304 sp:1342208
15:16:50.125Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 1510144/4194304 sp:1510144
15:17:08.946Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 1677824/4194304 sp:1677824
15:17:27.521Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 1845504/4194304 sp:1845504
15:17:46.187Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 2013440/4194304 sp:2013440
15:18:04.911Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 2181120/4194304 sp:2181120
15:18:23.824Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 2349056/4194304 sp:2349056
15:18:42.557Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 2516736/4194304 sp:2516736
15:19:01.060Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 2684416/4194304 sp:2684416
15:19:06.326Z INFO propolis-server: accepted connection
    local_addr = [fd00:1122:3344:104::1:e4]:12400
    remote_addr = [fd00:1122:3344:104::1]:56183
15:19:06.327Z INFO propolis-server: request completed
    error_message_external = Not Found
    error_message_internal = Server not initialized (no instance)
    latency_us = 59
    local_addr = [fd00:1122:3344:104::1:e4]:12400
    method = PUT
    remote_addr = [fd00:1122:3344:104::1]:56183
    req_id = a1af2b13-31f8-4357-bbcc-6d8fcc8d1c8c
    response_code = 404
    uri = /instance/state
15:19:19.757Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 2852352/4194304 sp:2852352
15:19:38.241Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 3020032/4194304 sp:3020032
15:19:56.704Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 3187712/4194304 sp:3187712
15:20:15.196Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 3355648/4194304 sp:3355648
15:20:33.661Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 3523328/4194304 sp:3523328
15:20:52.126Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 3691008/4194304 sp:3691008
15:21:10.587Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 3858944/4194304 sp:3858944
15:21:29.046Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 4026624/4194304 sp:4026624
15:21:47.510Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub at offset 4194304/4194304 sp:4194304
15:21:47.537Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub 6ff0b1c1-1252-4c16-ba60-0f78df566113 done in 464 seconds. Retries:0 scrub_size:131072 size:4194304 pause_milli:25
15:21:47.538Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Deactivate read only parent 6ff0b1c1-1252-4c16-ba60-0f78df566113
15:21:47.538Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Request to deactivate this guest
    session_id = c6b320dc-b9ae-498f-9165-bfa178732b9d
15:21:47.541Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): checking for deactivation
    session_id = c6b320dc-b9ae-498f-9165-bfa178732b9d
15:21:47.541Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): [0] deactivate job 23261 not InProgress flush, NO
     = downstairs
    session_id = c6b320dc-b9ae-498f-9165-bfa178732b9d
15:21:47.541Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): not ready to deactivate client 0
    session_id = c6b320dc-b9ae-498f-9165-bfa178732b9d
15:21:47.541Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): [1] deactivate job 23261 not InProgress flush, NO
     = downstairs
    session_id = c6b320dc-b9ae-498f-9165-bfa178732b9d
15:21:47.541Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): not ready to deactivate client 1
    session_id = c6b320dc-b9ae-498f-9165-bfa178732b9d
15:21:47.541Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): [2] deactivate job 23261 not InProgress flush, NO
     = downstairs
    session_id = c6b320dc-b9ae-498f-9165-bfa178732b9d
15:21:47.541Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): not ready to deactivate client 2
    session_id = c6b320dc-b9ae-498f-9165-bfa178732b9d
15:21:47.541Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): not ready to deactivate due to state Active
     = downstairs
    client = 0
    session_id = c6b320dc-b9ae-498f-9165-bfa178732b9d

The complete log has been uploaded to catacomb:/staff/core/crucible-5376.

@leftwo
Copy link
Contributor

leftwo commented Apr 2, 2024

Did it eventually finish?

From the propolis log, the scrub + deactivation did eventually finish:

15:21:47.544Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): checking for deactivation
15:21:47.544Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): ready to deactivate from state New    client = 0
15:21:47.544Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): ready to deactivate from state New    client = 1
15:21:47.544Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): ready to deactivate from state New    client = 2
15:21:47.544Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): All DS in the proper state! -> INIT
15:21:47.544Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Scrub of volume 6ff0b1c1-1252-4c16-ba60-0f78df566113 completed, remove parent
15:21:47.544Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): client task is exiting    client = 0
15:21:47.862Z INFO propolis-server (crucible-6ff0b1c1-1252-4c16-ba60-0f78df566113): Submitted removal for read only parent on 6ff0b1c1-1252-4c16-ba60-0f78df566113

@leftwo
Copy link
Contributor

leftwo commented Apr 2, 2024

Propolis started shutdown around:
15:16:21.766Z INFO propolis-server (oximeter-producer): received request to begin graceful shutdown

But Crucible did not let go till
15:21:47.862Z

We might need some propolis eyes on this to see why the shutdown can't make forward progress.

@askfongjojo
Copy link
Author

Same as #5363, I haven't been able to reproduce this issue after oxidecomputer/propolis#675. I'm marking this closed but will leave the link customer issue open for a bit longer until customer has confirmed that they aren't seeing instances stuck in stopping state any more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants