Instance hitting ext4-fs read/write error after running non-disk I/O heavy workload for some time #1572
The crucible downstairs logs are relatively empty. It's unclear if the connection timeout is due to networking issues or crucible-side connection issues. Within the 3 crucible zones, I see a connection established with the propolis zone but not among the crucible zones:
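A hedged sketch, not the exact commands or output: one way to make that connection check from the sled's global zone looks roughly like this.

```
# List established TCP connections inside each crucible zone to see who is
# talking to whom. The zone-name pattern is illustrative.
for z in $(zoneadm list | grep oxz_crucible); do
  echo "== $z =="
  zlogin "$z" netstat -an -f inet | grep ESTABLISHED
done
```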
This instance is now stopped, but I can find its logs using the zone from the initial omdb disks output:
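A hedged sketch of where I go looking in cases like this; the paths are from memory and may differ by release, so treat them as illustrative rather than the exact commands:

```
# Rotated/archived zone logs generally end up in the sled's debug dataset,
# so for the stopped instance's propolis zone, something like:
ls /pool/ext/*/crypt/debug/oxz_propolis-server_*/
# and, if the zone were still running, its live SMF logs:
zlogin $(zoneadm list | grep oxz_propolis-server) ls /var/svc/log/
```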
Over on sled 11, I can find the propolis logs:
I've added those logs along with the rest.
We have a timeout in the upstairs for client 2 starting at 06:30:54:331
We eventually time out, then try to connect again:
The upstairs tries and tries, and eventually it reconnects to client 2, almost 8 minutes later:
From omdb, that is this downstairs:
Looking at the downstairs log for d9d5:
Let's just stop right here. This is showing the service starting, and we are already at 06:38:54, which is way, way after the initial timeout started. The log continues as expected and shows the upstairs connecting.
So, where are the earlier logs for that downstairs service? They must exist somewhere, since the upstairs was working.
Looking on sled 21, where the downstairs zone is located, it looks like all the downstairs (and other) services were started at the same time:
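A hedged sketch of one way to compare start times across the zones; the commands are illustrative, not the original output:

```
# From the global zone on sled 21: show SMF service start times inside each
# crucible zone to see when the downstairs services actually came up.
for z in $(zoneadm list | grep oxz_crucible); do
  echo "== $z =="
  zlogin "$z" svcs -o stime,fmri
done
```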
So, this whole zone does not appear to exist before 06:38. Looking for logs for the propolis instance from the omdb disks command above, I only find these:
And none of those contain any info from the time just before the initial timeout, which shows up as the first line in the one log file that does have info in it:
And the downstairs zones themselves don't even exist at 06:30:54, so I don't know what that propolis was talking to. It's still a mystery; without the earlier logs, it's difficult to determine what the upstairs was talking to, and how.
I've copied and extracted its contents here: /staff/core/crucible-1572/zone-bundle-for-oxz_propolis-server_9ba5b631

So, without the earlier propolis log, and given the timings of what we do have, I can't figure out what has gone wrong with that instance. It also seems as if we have lost a log file for propolis in all this, as I can't find the startup messages from
Ah ha! I found something. It turns out there are two downstairs on the same sled; had I been paying attention, I might not have missed that:
```
root@oxz_switch0:~# pilot host exec -c 'uptime' 0-31
21 BRM42220031 ok: BRM42220031
```
Kernel panic issue is here: oxidecomputer/opte#618
So this disk was indeed very old: it was created on Sept 1, 2023, predating the downstairs spread change on Oct 2, 2023. With this finding, this issue can be closed, as the thing to fix is the opte panic. In a situation like this, a stop/start of the VM should fix the problem (I didn't try exactly that, but the VM did come back up after a platform update).
The serial console shows many of these errors:
This is the instance in question:
I was using the instance as a load generator against a database workload. It was idle for a few hours before I used it again to run iperf3 tests (as the server and then as the client).
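For reference, the iperf3 runs were of this general shape; the exact flags and target here are illustrative, not the original command lines:

```
# On the instance, first as the server:
iperf3 -s
# then, against another host, as the client (e.g. a 60 second run):
iperf3 -c <other-host> -t 60
```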
The propolis log shows the vm_state_driver constantly getting connection timeouts:
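A hedged sketch of pulling those out of the log; the path glob and grep pattern are illustrative:

```
# Count and sample the repeated connection timeouts reported by propolis
# (run inside the propolis zone, or against a copied log file).
grep -ci timeout /var/svc/log/*propolis*.log
grep -i  timeout /var/svc/log/*propolis*.log | head
```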
Logs are being copied to /staff/core/crucible-1572.