Instance hitting ext4-fs read/write error after running non-disk I/O heavy workload for some time #1572
The crucible downstairs logs are relatively empty. It's unclear if the connection timeout is due to networking issues or crucible-side connection issues. Within the 3 crucible zones, I see a connection established with the propolis zone but not among the crucible zones:
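A hedged sketch, not the exact commands or output: one way to make that connection check from the sled's global zone looks roughly like this.

```
# List established TCP connections inside each crucible zone to see who is
# talking to whom. The zone-name pattern is illustrative.
for z in $(zoneadm list | grep oxz_crucible); do
  echo "== $z =="
  zlogin "$z" netstat -an -f inet | grep ESTABLISHED
done
```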
This instance is now stopped, but I can find its logs using the zone from the initial omdb disks output:
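A hedged sketch of where I go looking in cases like this; the paths are from memory and may differ by release, so treat them as illustrative rather than the exact commands:

```
# Rotated/archived zone logs generally end up in the sled's debug dataset,
# so for the stopped instance's propolis zone, something like:
ls /pool/ext/*/crypt/debug/oxz_propolis-server_*/
# and, if the zone were still running, its live SMF logs:
zlogin $(zoneadm list | grep oxz_propolis-server) ls /var/svc/log/
```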
Over on sled 11, I can find the propolis logs:
I've added those logs along with the rest.
We have a timeout in the upstairs for client 2 starting at 06:30:54:331
We eventually time out, then try to connect again:
The upstairs tries and tries, and eventually it reconnects to client 2, almost 8 minutes later:
From omdb, that is this downstairs:
Looking at the downstairs log for d9d5:
Let's just stop right here. This is showing the service starting, and we are already at 06:38:54, which is way, way after the initial timeout started. The log continues as expected and shows the upstairs connecting.
So, where are the earlier logs for that downstairs service? They must exist somewhere, since the upstairs was working.
Looking on sled 21, where the downstairs zone is located, it looks like all the downstairs (and other) services were started at the same time:
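A hedged sketch of one way to compare start times across the zones; the commands are illustrative, not the original output:

```
# From the global zone on sled 21: show SMF service start times inside each
# crucible zone to see when the downstairs services actually came up.
for z in $(zoneadm list | grep oxz_crucible); do
  echo "== $z =="
  zlogin "$z" svcs -o stime,fmri
done
```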
So, this whole zone does not appear to exist before 06:38. Looking for logs for the propolis instance from the omdb disks command above, I only find these:
And none of those contain any info from the time just before the initial timeout, which shows up as the first line in the one log file that does have info in it:
And the downstairs zones themselves don't even exist at 06:30:54, so I don't know what that propolis was talking to. It's still a mystery; without the earlier logs, it's difficult to determine what the upstairs was talking to, and how.
I've copied and extracted its contents here: /staff/core/crucible-1572/zone-bundle-for-oxz_propolis-server_9ba5b631

So, without the earlier propolis log, and given the timings of what we do have, I can't figure out what has gone wrong with that instance. It also seems as if we have lost a log file for propolis in all this, as I can't find the startup messages from
Ah ha! I found something. It turns out there are two downstairs on the same sled; had I been paying attention, I might not have missed that:
```
root@oxz_switch0:~# pilot host exec -c 'uptime' 0-31
21 BRM42220031 ok: BRM42220031
```
Kernel panic issue is here: oxidecomputer/opte#618
So this disk was indeed very old: it was created on Sept 1, 2023, predating the downstairs spread change on Oct 2, 2023. With this finding, this issue can be closed, as the thing to fix is the opte panic. In a situation like this, a stop/start of the VM should fix the problem (I didn't try exactly that, but the VM did come back up after a platform update).
The serial console shows many of these errors:
This is the instance in question:
I was using the instance as a load generator against a database workload. It was idle for a few hours before I used it again to run iperf3 tests (as the server and then as the client).
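For reference, the iperf3 runs were of this general shape; the exact flags and target here are illustrative, not the original command lines:

```
# On the instance, first as the server:
iperf3 -s
# then, against another host, as the client (e.g. a 60 second run):
iperf3 -c <other-host> -t 60
```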
The propolis log shows the vm_state_driver constantly getting connection timeouts:
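A hedged sketch of pulling those out of the log; the path glob and grep pattern are illustrative:

```
# Count and sample the repeated connection timeouts reported by propolis
# (run inside the propolis zone, or against a copied log file).
grep -ci timeout /var/svc/log/*propolis*.log
grep -i  timeout /var/svc/log/*propolis*.log | head
```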
Logs are being copied to /staff/core/crucible-1572.