
mystery failure on london BRM42220036 #1553

Open

faithanalog opened this issue Nov 8, 2024 · 7 comments

@faithanalog
Contributor

This issue happened twice, both times with the 4krandw workload. The first time we didn't get the logs we wanted because we accidentally kernel panic'd the system trying to recover them. Second time around, we got them. I haven't looked into them yet, just writing this up before logging off for sleep.

Notably, the VMs seem to keep writing (to where?), and thus have abnormally high IOPS. In actuality the disk has failed out from under them.

it is extremely suspicious that both times, all the failing VMs were on BRM42220036 in particular, and not any of the other sleds.

logs: /staff/artemis/2024-11-08-londonBRM42220036-explosion/logs

  • contains upstairs & downstairs logs
  • folders are named by VM UUID
  • the logs for the failing VMs are tens of gigabytes long. Beware. I recommend accessing them from catacomb (see the filtering sketch below).
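
Given the size, it may be easier to filter before rendering; a rough sketch, assuming these are line-oriented bunyan JSON logs laid out per the directories above (the <vm-uuid> placeholder is one of the UUIDs listed next):

# pull only warning/error-level records (bunyan level >= 40) for one suspect VM, then render
grep -hE '"level": ?(40|50|60)' <vm-uuid>/*propolis* | looker | less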

suspect VM UUIDs, look at these in that dir:

0f5c9139-4d66-4ae9-885f-08967d2c13ce
4e453ffc-d871-49b3-88fa-2a816aff9bde
0881cb23-b061-487c-900e-179e92386fe2
b9242f22-f8c5-4804-8c3d-fbbe3a73e5cb
07a66988-1402-41e1-959b-1d114129d568
9d50966a-f98c-481e-ab3f-55c83a12e702
e1ab4a1d-535c-44e1-b8f6-f9813ddfc7be
2f28e04c-a25b-47ae-9312-faede629f8ee
c52b3d20-1eb7-427f-bcc0-30979cf6605c
48b70269-9f20-4841-8599-81c0a2184bf3

extra data: /staff/artemis/2024-11-08-londonBRM42220036-explosion/measure-4krandw

  • notably, lrstate seems to indicate that 4e453ffc-d871-49b3-88fa-2a816aff9bde, and none of the others, has 266 ds_live_repair_aborted and zero completed (lrstate reports propolis UUIDs, so you have to map them back to VM UUIDs). The rest have none aborted and none completed.

A sampling of dmesg from one of the guest VMs:

[   12.396400] EDAC MC: Ver: 3.0.0
[   13.257310] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s8: link becomes ready
[   15.005756] NET: Registered PF_PACKET protocol family
[ 6536.202312] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 6536.202323] clocksource:                       'acpi_pm' wd_nsec: 396359593 wd_now: 2a0773 wd_last: 146150 mask: ffffff
[ 6536.202326] clocksource:                       'tsc' cs_nsec: 5083341649 cs_now: be15eaea178 cs_last: bdf01dbbfdc mask: ffffffffffffffff
[ 6536.202328] clocksource:                       'tsc' is current clocksource.
[ 6536.202338] tsc: Marking TSC unstable due to clocksource watchdog
[ 6536.203745] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 6536.203753] sched_clock: Marking unstable (6536636056199, -432146598)<-(6536205179789, -1437751)
[ 6536.205910] clocksource: Not enough CPUs to check clocksource 'tsc'.
[ 6536.205930] clocksource: Switched to clocksource acpi_pm
[ 7026.601955] hrtimer: interrupt took 22633310 ns
[ 7893.772021] nvme nvme1: I/O 832 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772081] nvme nvme1: Abort status: 0x0
[ 7893.772109] nvme nvme1: I/O 838 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772155] nvme nvme1: Abort status: 0x0
[ 7893.772168] nvme nvme1: I/O 839 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772208] nvme nvme1: Abort status: 0x0
[ 7893.772221] nvme nvme1: I/O 840 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772253] nvme nvme1: Abort status: 0x0
[ 7893.772271] nvme nvme1: I/O 841 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772303] nvme nvme1: Abort status: 0x0
[ 7893.772316] nvme nvme1: I/O 842 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772348] nvme nvme1: Abort status: 0x0
[ 7893.772363] nvme nvme1: I/O 843 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772396] nvme nvme1: Abort status: 0x0
[ 7893.772416] nvme nvme1: I/O 844 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772453] nvme nvme1: Abort status: 0x0
[ 7893.772467] nvme nvme1: I/O 845 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772505] nvme nvme1: Abort status: 0x0
[ 7893.772518] nvme nvme1: I/O 846 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772550] nvme nvme1: Abort status: 0x0
[ 7893.772563] nvme nvme1: I/O 847 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772599] nvme nvme1: Abort status: 0x0
[ 7893.772618] nvme nvme1: I/O 848 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772650] nvme nvme1: Abort status: 0x0
[ 7893.772662] nvme nvme1: I/O 849 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772694] nvme nvme1: Abort status: 0x0
[ 7893.772707] nvme nvme1: I/O 850 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772739] nvme nvme1: Abort status: 0x0
[ 7893.772752] nvme nvme1: I/O 851 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772788] nvme nvme1: Abort status: 0x0
[ 7893.772801] nvme nvme1: I/O 853 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772834] nvme nvme1: Abort status: 0x0
[ 7893.772847] nvme nvme1: I/O 854 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772930] nvme nvme1: Abort status: 0x0
[ 7893.774908] nvme nvme1: I/O 855 (I/O Cmd) QID 1 timeout, aborting
[ 7893.774954] nvme nvme1: Abort status: 0x0
[ 7893.774969] nvme nvme1: I/O 856 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775001] nvme nvme1: Abort status: 0x0
[ 7893.775013] nvme nvme1: I/O 857 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775044] nvme nvme1: Abort status: 0x0
[ 7893.775057] nvme nvme1: I/O 858 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775088] nvme nvme1: Abort status: 0x0
[ 7893.775105] nvme nvme1: I/O 859 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775141] nvme nvme1: Abort status: 0x0
[ 7893.775155] nvme nvme1: I/O 874 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775186] nvme nvme1: Abort status: 0x0
[ 7893.775198] nvme nvme1: I/O 875 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775229] nvme nvme1: Abort status: 0x0
[ 7893.775242] nvme nvme1: I/O 876 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775273] nvme nvme1: Abort status: 0x0
[ 7893.775286] nvme nvme1: I/O 877 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775319] nvme nvme1: Abort status: 0x0
[ 7893.775332] nvme nvme1: I/O 878 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775364] nvme nvme1: Abort status: 0x0
[ 7893.775376] nvme nvme1: I/O 879 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775407] nvme nvme1: Abort status: 0x0
[ 7893.775423] nvme nvme1: I/O 880 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775455] nvme nvme1: Abort status: 0x0
[ 7893.775467] nvme nvme1: I/O 881 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775498] nvme nvme1: Abort status: 0x0
[ 7893.775510] nvme nvme1: I/O 882 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775541] nvme nvme1: Abort status: 0x0
[ 7893.775553] nvme nvme1: I/O 883 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775583] nvme nvme1: Abort status: 0x0
[ 7893.775597] nvme nvme1: I/O 891 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775627] nvme nvme1: Abort status: 0x0
[ 7893.775640] nvme nvme1: I/O 892 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775671] nvme nvme1: Abort status: 0x0
[ 7893.775689] nvme nvme1: I/O 895 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775719] nvme nvme1: Abort status: 0x0
[ 8068.581663] nvme1n1: I/O Cmd(0x2) @ LBA 0, 1 blocks, I/O Error (sct 0x0 / sc 0x4) 
[ 8068.582652] I/O error, dev nvme1n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 8068.582652] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[ 8068.620609] nvme1n1: I/O Cmd(0x2) @ LBA 0, 1 blocks, I/O Error (sct 0x0 / sc 0x4) 
[ 8068.620889] I/O error, dev nvme1n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 8068.620889] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[ 8068.624543] ldm_validate_partition_table(): Disk read failed.
[ 8068.625245] nvme1n1: I/O Cmd(0x2) @ LBA 0, 1 blocks, I/O Error (sct 0x0 / sc 0x4) 
[ 8068.625882] I/O error, dev nvme1n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 8068.625882] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[ 8068.628915]  nvme1n1: unable to read partition table
[ 8068.673398] nvme1n1: I/O Cmd(0x2) @ LBA 0, 1 blocks, I/O Error (sct 0x0 / sc 0x4) 
[ 8068.674387] I/O error, dev nvme1n1, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
[ 8068.677450] nvme1n1: I/O Cmd(0x2) @ LBA 0, 1 blocks, I/O Error (sct 0x0 / sc 0x4) 
[ 8068.678443] I/O error, dev nvme1n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 8068.678443] Buffer I/O error on dev nvme1n1, logical block 0, async page read
@faithanalog
Contributor Author

commits:

  • omicron 64eb23e10
  • crucible 4b0f256
  • propolis 86101ea

@mkeeter
Contributor

mkeeter commented Nov 11, 2024

We investigated this on Friday and made some progress, but the root cause remains elusive.

There appears to be network weather of some kind; the first warning in the logs is of a (partial) timeout. I'm using interleave-bunyan to combine all logs, then paging through:

matt@atrium /staff/core/crucible-1553/logs $ ./interleave-bunyan */*propolis* | looker | less
04:09:22.394Z WARN propolis-server (vm_state_driver): timeout 1/3
     = downstairs
    client = 2
    session_id = 822e8ddc-9945-46db-bb38-421e833e1052
    source = 4e453ffc-d871-49b3-88fa-2a816aff9bde/system-illumos-propolis-server:default.log.1731039289

Scattered partial timeouts continue, then we see our first disconnection at 4:13:04:

04:13:04.695Z WARN propolis-server (vm_state_driver): client task is sending Done(WriteFailed(IO Error: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }))
     = downstairs
    client = 1
    session_id = 4eba309e-a613-4c90-9742-62afd5671c95
    source = d3f3e6fd-61de-444c-832e-45af8aa04741/system-illumos-propolis-server:default.log.1731039301
04:13:04.702Z WARN propolis-server (vm_state_driver): downstairs client error Connection reset by peer (os error 131)
     = downstairs
    client = 1
    session_id = 4eba309e-a613-4c90-9742-62afd5671c95
    source = d3f3e6fd-61de-444c-832e-45af8aa04741/system-illumos-propolis-server:default.log.1731039301

Both the write and read sides fail with EPIPE and ECONNRESET respectively.

In this case, client 1 is at [fd00:1122:3344:101::e]:19001. I'm pretty sure this is a Downstairs running on the exact same sled, because I see 1122:3344:101 elsewhere in the logs:

04:01:46.351Z INFO propolis-server: listening
    local_addr = [fd00:1122:3344:101::1:1]:12400
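
A quick way to sanity-check the same-sled theory (assumption: other services on that sled log addresses within the same fd00:1122:3344:101::/64 underlay prefix):

# list which log files mention the sled's underlay prefix
grep -l 'fd00:1122:3344:101' */*propolis* | head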

Things continue to fail at various rates, getting worse over time. Eventually, we manage to get into a state with all 3x Downstairs faulted, from which there is no recovery (#1556). There's some low-hanging fruit w.r.t. reducing logging in Crucible, but that's not the root cause of our problems.
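
For a rough sense of which VMs were hit hardest, counting the resets per propolis log is a quick (approximate) tally:

# occurrences of 'Connection reset by peer' per log, highest first
grep -c 'Connection reset by peer' */*propolis* | awk -F: '{ print $NF, $0 }' | sort -rn | head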


Other things to be suspicious about:

  • The failed sled is one of the Scrimlets, so it's running extra stuff (including new sensor logging code!)
  • We see higher baseline RAM usage on both Scrimlets (30ish GiB instead of 15ish? see the debugging session recording for details)
    • However, only one of the Scrimlets is falling over!
    • Are there other services that happen to be running on BRM42220036, but not the other Scrimlet?
  • Artemis also noticed API errors, which points to the system being broadly unhappy

@faithanalog
Contributor Author

faithanalog commented Nov 11, 2024

We were running prstat sorted by RSS at the time.
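
(The invocation was along these lines; reconstructed after the fact, so the exact flags may differ.)

# per-process view sorted by resident set size, refreshing every 5 seconds
prstat -s rss 5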

On the problem scrimlet:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 16769 root     4029M 3279M sleep   59    0   7:38:30 0.2% oximeter/131
  1484 root     2830M 1822M sleep   59    0   1:47:23 0.1% mgs/132

On the scrimlet which did not keel over:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  1492 root     2869M 2736M sleep   59    0   1:45:04 0.1% mgs/132
(no oximeter)

I didn't catch a pmap on these at the time, but I suspect both mgs processes are actually using the same amount of RAM, and the problem scrimlet just has mgs partially swapped out (which is why RSS and SIZE don't line up there). On top of that, you've got oximeter adding an extra ~4 gigs of memory pressure.
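
A way to confirm that next time (sketch): pmap's extended output separates total size from what is actually resident, so a large gap for mgs would point at pages being swapped out.

# compare virtual size vs. RSS for each mgs process
pgrep mgs | xargs -n1 pmap -x | grep 'total Kb'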


Also on the topic of memory pressure, our propolis servers were climbing quite high in their userspace usage, getting into the hundreds of megs, or peaking towards the 1 gig limit of the version we were running. Unclear whether that's a root cause or a knock-on effect from something else. #1515 will bring our buffer length limit down to 50MiB.

@faithanalog
Contributor Author

I added the memstat/vmstat/prstat/swapsh loop outputs to /staff/core/crucible-1553/retry-first-crucible-failure/BRMxxxxxxxx
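
For reference, the collection loop was roughly of this shape (a sketch, not the exact script that produced those files):

# periodic memory snapshots: kernel memory breakdown, paging activity, per-process RSS, swap usage
while sleep 10; do
    date -u
    echo ::memstat | mdb -k
    vmstat 1 2
    prstat -c -s rss 1 1
    swap -sh
done >> /var/tmp/mem-snapshots.log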

@faithanalog
Contributor Author

faithanalog commented Nov 26, 2024

I ran a test while loading the system with memory, using this:

https://github.com/oxidecomputer/artemiscellaneous/tree/main/scripts/ballast
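
(For context: a crude stand-in for this kind of ballast on illumos would be to park a large file in swap-backed /tmp. Sketch only; this is not what the ballast script actually does.)

# hypothetical stand-in: pin roughly 24 gigs of tmpfs-backed memory; release it with rm
dd if=/dev/zero of=/tmp/ballast bs=1024k count=24576
rm /tmp/ballast   # when done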

I wanted to load in the amount of memory we'd normally see in use. Propolis uses about 1 gig of memory after the internal buffers are all filled up, and then each downstairs uses about 150MiB.

1.5 gigs per VM * 63 VMs / 4 sleds ≈ 24 gigs per sled

After loading 24 gigs, I noticed we had 94 gigs free per sled, per vmstat. That's because we haven't actually filled up the ZFS max cache of 100 GiB, since we haven't done any real load. So I load an additional 86 gigs, for 110 gigs loaded and about 8 gigs free per sled.

launching one VM with 110 gig ballast

I created one VM, rather than our normal 64.

The VM starts up fine and starts filling its disk.

I try making lots of API calls in parallel with

seq 1 256 | xargs -P256 -I'%' oxide instance list --project iodriver

and I see no problem. No timeouts, everything is snappy.

The guest fills its disk at 250MiB/s just fine, and is bottlenecked by my guest's /dev/urandom performance, not the IO stack.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   94.69    5.31    0.00    0.00

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
loop0            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
nvme0n1          0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00  480.40 245964.80     0.00   0.00    0.28   512.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.13  30.63
vda              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00

starting 4k randw with 110 gigs ballast

fio is going. So far, no problem; nothing in dmesg indicates an issue.
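
The 4k randw job is roughly of this shape (a sketch; not the exact iodriver job definition, and the target device path is illustrative):

# sustained 4KiB random writes against the data disk with O_DIRECT
fio --name=4krandw --filename=/dev/nvme1n1 --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=32 --direct=1 --time_based --runtime=3600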

[root@iodriver:~]# iostat -ykcd 15
Linux 6.1.71 (iodriver)         11/19/2024      _x86_64_        (1 CPU)


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.08    0.00    6.28   92.64    0.00    0.00

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
loop0             0.00         0.00         0.00         0.00          0          0          0
nvme0n1           0.00         0.00         0.00         0.00          0          0          0
nvme1n1        2832.40         0.00     11329.60         0.00          0     169944          0
vda               0.00         0.00         0.00         0.00          0          0          0

We are iowaiting a lot, but it seems steady.

starting 4k randw with 130 gigs ballast - swapping

OK, let's try making it swap. We are at 3 gigs free:

0 0 0 270264732 3274852 1209 14603 0 0 0 0 0 0 0 0  0 16553 308152 21498 1 1 98

We have a 256 gig swap device on hand. Let's try pushing 20 more gigs of memory. I'll remove the 4k randw load while I do this.

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr bk bk bk bk   in   sy   cs us sy id
 2 0 0 265839472 1093872 2657 82096 396 235423 242253 2097152 325409212 0 0 0 5 1146251 105421 452687 1 15 84
 1 0 0 265901224 1098288 2889 84716 456 223187 234064 2097152 328640875 0 0 0 0 1149639 104955 428453 0 16 84
 1 0 0 264822508 1014588 2599 82266 1172 225677 237598 2097152 326716953 0 0 0 0 1254992 97199 384936 0 20 80
 1 0 0 264864624 971860 3139 93285 716 240102 242010 2097152 316031082 0 0 0 0 1289866 217813 424202 1 20 79
 1 0 0 264995488 1028308 2313 90794 1336 239191 247682 2097152 317791405 0 0 0 0 1299172 261414 506989 1 21 78
 1 0 0 264086332 1048764 2094 88383 3812 198554 205354 1887440 306629706 0 0 0 3 1366812 356771 448033 2 21 78
 0 0 0 264109968 1081960 1497 84020 912 239776 248924 2097152 311298285 0 0 0 0 1329680 393893 559345 1 19 80
 3 0 0 264331736 1189100 3910 99438 3263 236762 291562 2097152 308577848 0 0 0 0 1311757 410205 502880 1 19 80
 0 0 0 264429340 1259252 2491 78372 634 237368 271441 2097152 310544848 0 0 0 0 1250579 392911 493539 1 21 78
 0 0 0 263504744 1383996 3932 86090 1108 251511 284390 2097152 313911750 0 0 0 0 1236474 395825 515020 1 21 78
 0 0 0 263630988 1476448 2535 75195 540 254346 267176 2097152 316031314 0 0 0 28 1313970 317680 510916 1 21 79
 0 0 0 263840340 1680772 3530 70121 1788 245780 251824 2097152 317630277 0 0 0 0 1280342 215462 510532 1 21 78
 0 0 0 263987396 1761664 4368 96875 5957 231950 239302 2097152 322024484 0 0 0 0 1369113 105548 503371 1 21 78
 1 0 0 263057428 1769500 3178 92116 1681 244132 258313 2097152 320170652 0 0 0 0 1350878 95602 481942 2 21 78
 0 0 0 263214964 1890244 2611 77242 1008 241854 253884 2097152 331610174 0 0 0 0 1220552 47975 490418 1 20 79
 1 0 0 263321668 2008312 2951 78153 1860 278810 291445 2097152 323456779 0 0 0 2 1190407 55393 355364 1 20 79
 3 0 0 262414772 2016468 2445 112189 440 193428 202181 2097152 266863838 0 0 0 0 1131927 104869 477782 0 18 82
 0 0 0 262648328 2102348 2053 116625 520 193862 209650 2097152 247489204 0 0 0 0 980492 109513 431555 1 17 82
 1 0 0 262743512 1997624 4003 80501 3121 234883 245510 2097152 319510448 0 0 0 0 1166172 84498 438802 1 19 80

OK, now we are swapping... though, because this ballast is "inactive" memory, we won't be touching it to thrash swap, I think.

BRM42220036 # swap -sh
total: 140.88G allocated + 1.01G reserved = 141.90G used, 250.62G available
BRM42220036 # swap -lh
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/oxi_a15c936d-bf01-4411-aab2-d401b66910c6/swap 171,1        4K  256.00G  243.99G

Let's try our 4krandw

seems fine.

run 4krandw with no ballast, then add 200 gigs

Now let's try applying memory pressure while it's running. We'll allocate 200 gigs this time, really pushing the envelope, while watching guest iostat.

This won't happen instantaneously, and indeed it will get slower the more we dip into swap.

115 gigs loaded

Starting to see a slowdown around 115 gigs allocated:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.81    0.00    4.67   94.52    0.00    0.00

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
loop0             0.00         0.00         0.00         0.00          0          0          0
nvme0n1           0.00         0.00         0.00         0.00          0          0          0
nvme1n1        1516.80         0.00      6067.20         0.00          0      30336          0
vda               0.00         0.00         0.00         0.00          0          0          0

But it has settled into a rough steady state here.

150 gigs loaded

We are getting some noise at 150 gigs allocated: still averaging about 6000KiB/s like above, but jumping between 5500 and 7500.

179 gigs loaded

In the guest we are averaging 3500-5000KiB/s. And prstat has frozen.

45 seconds later, it starts to plummet to 2500, then 300.

Tue Nov 19 11:06:40 PM UTC 2024
Linux 6.1.71 (iodriver)         11/19/2024      _x86_64_        (1 CPU)


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.61   99.39    0.00    0.00

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
loop0             0.00         0.00         0.00         0.00          0          0          0
nvme0n1           0.00         0.00         0.00         0.00          0          0          0
nvme1n1          76.40         0.00       305.60         0.00          0       1528          0
vda               0.00         0.00         0.00         0.00          0          0          0

Then my ssh connections freeze, both to the host and to the guest. The guest session actually freezes half-way through writing an iostat line. It remains frozen for about 30 seconds, and then recovers. When we get back, we have an iostat line with 0 IO, and some errors in dmesg:

Tue Nov 19 11:07:10 PM UTC 2024
Linux 6.1.71 (iodriver)         11/19/2024      _x86_64_        (1 CPU)


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.40   99.60    0.00    0.00

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
loop0             0.00         0.00         0.00         0.00          0          0          0
nvme0n1           0.00         0.00         0.00         0.00          0          0          0
nvme1n1           0.00         0.00         0.00         0.00          0          0          0
vda               0.00         0.00         0.00         0.00          0          0          0
[ 1308.910032] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 1308.910038] clocksource:                       'acpi_pm' wd_nsec: 228860371 wd_now: 806c8e wd_last: 73ec7e mask: ffffff
[ 1308.910040] clocksource:                       'tsc' cs_nsec: 4915830799 cs_now: 26309ba048c cs_last: 260c0d500c0 mask: ffffffffffffffff
[ 1308.910042] clocksource:                       'tsc' is current clocksource.
[ 1308.910043] tsc: Marking TSC unstable due to clocksource watchdog
[ 1308.910070] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 1308.910070] sched_clock: Marking unstable (1308952100921, -41834539)<-(1308910329617, -259726)
[ 1308.912285] clocksource: Not enough CPUs to check clocksource 'tsc'.
[ 1308.912368] clocksource: Switched to clocksource acpi_pm
[ 1319.900502] nvme nvme1: I/O 897 (I/O Cmd) QID 1 timeout, aborting
[ 1319.900859] nvme nvme1: Abort status: 0x0
[ 1319.900876] nvme nvme1: I/O 898 (I/O Cmd) QID 1 timeout, aborting
[ 1319.900905] nvme nvme1: Abort status: 0x0
[ 1319.900915] nvme nvme1: I/O 899 (I/O Cmd) QID 1 timeout, aborting
[ 1319.900939] nvme nvme1: Abort status: 0x0
[ 1319.900949] nvme nvme1: I/O 900 (I/O Cmd) QID 1 timeout, aborting
[ 1319.900973] nvme nvme1: Abort status: 0x0
[ 1319.900982] nvme nvme1: I/O 901 (I/O Cmd) QID 1 timeout, aborting
[ 1319.901008] nvme nvme1: Abort status: 0x0
[ 1319.901017] nvme nvme1: I/O 902 (I/O Cmd) QID 1 timeout, aborting
[ 1319.901040] nvme nvme1: Abort status: 0x0
[ 1319.901050] nvme nvme1: I/O 903 (I/O Cmd) QID 1 timeout, aborting
[ 1319.901072] nvme nvme1: Abort status: 0x0
[ 1319.901082] nvme nvme1: I/O 904 (I/O Cmd) QID 1 timeout, aborting
[ 1319.901106] nvme nvme1: Abort status: 0x0
[ 1319.901115] nvme nvme1: I/O 905 (I/O Cmd) QID 1 timeout, aborting

But then we recover, to a steady state of about 3000-4000KiB/s.

Tue Nov 19 11:08:36 PM UTC 2024
Linux 6.1.71 (iodriver)         11/19/2024      _x86_64_        (1 CPU)


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.41    0.00    3.67   95.92    0.00    0.00

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
loop0             0.00         0.00         0.00         0.00          0          0          0
nvme0n1           0.00         0.00         0.00         0.00          0          0          0
nvme1n1         996.60         0.00      3986.40         0.00          0      19932          0
vda               0.00         0.00         0.00         0.00          0          0          0

Prstat has also recovered, and we are done with our allocations.

So at this point our performance seems pretty well degraded. We are not recovering back to the ~11000KiB/s we saw earlier. You might think we are in live repair, and indeed we are. But the memory pressure also has a continuous impact on our performance: after I removed it, throughput slowly came back up to 7000KiB/s.

crucible upstairs/downstairs logs

are in /staff/core/crucible-1553/20241119-mempressure-test

recovery to full speed

After an hour I check back in. Live repair should be done by now:

Wed Nov 20 12:30:46 AM UTC 2024
Linux 6.1.71 (iodriver)         11/20/2024      _x86_64_        (1 CPU)


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.09    0.00   36.42   60.49    0.00    0.00

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
loop0             0.00         0.00         0.00         0.00          0          0          0
nvme0n1           0.00         0.00         0.00         0.00          0          0          0
nvme1n1       10143.00         0.00     40572.00         0.00          0     202860          0
vda               0.00         0.00         0.00         0.00          0          0          0


Wed Nov 20 12:30:51 AM UTC 2024
Linux 6.1.71 (iodriver)         11/20/2024      _x86_64_        (1 CPU)

Indeed we see higher speed here than we even did on the first test. Perhaps this is our IOs going into the void, but it seems more likely that in our first test our performance was already degraded and I just didn't know it.

@jclulow
Collaborator

jclulow commented Nov 26, 2024

> prstat has frozen

Whenever this happens it's usually good to understand why if possible. It's presumably trying to open or read something in /proc and is having to wait for a lock that it needs, but long blockages like that often mean, due to memory pressure or whatever else, we're sitting on something we shouldn't be sitting on.
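
One way to capture that next time it happens (a sketch, assuming the stuck prstat is still visible in the process table):

# user-level stack of the stuck prstat, then its kernel-side thread stacks
pstack $(pgrep -ox prstat)
echo "0t$(pgrep -ox prstat)::pid2proc | ::walk thread | ::findstack -v" | mdb -k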

@faithanalog
Contributor Author

faithanalog commented Dec 12, 2024

I ran a test with omicron main from a few days ago (1922c8ecb8a667c9b1efa876f75d72a7437b3c77), using the crucible and propolis it usually uses (I downloaded the TUF repo from CI). I saw many downstairs disconnects and reconnects, resulting in live repair.

This happened only during the write tests. It was also consistently the same downstairs instances disconnecting, sometimes multiple times within the span of a few minutes.

Curious, I ran 2 back to back tests with:

  • Omicron cfd5a8973
  • Propolis 220a6f3
  • two crucible commits
    • Crucible bd3cc54 - Latest commit on main
    • Crucible 3712da4 - A commit I made that reduces the replay buffer to 64MiB, staged on top of the above

With bd3cc54 (latest commit on main), I saw the same bad behavior as in my first test.

With 3712da4 (reduce replay buffer to 64MiB), I saw not a single disconnect, not a single live repair.

Comparing memory usage of propolis for the two with pmap:

for host in $(cat .ssh/config | awk '/londonBRM/{ print $2 }'); do ssh "$host" \
    'pgrep propolis-s | xargs -n1 pmap -S | grep "total Kb"'
done
  • bd3cc54 - average 1189MiB (1.16GiB) ram used in userspace
  • 3712da4 - average 197MiB ram used in userspace

Full propolis/crucible logs for each run:

  • [BAD] Omicron 1922c8ecb8a667c9b1efa876f75d72a7437b3c77 (CI build)
    • /staff/artemis/2024-12-09/logs
  • [BAD] Crucible bd3cc54 (as above)
    • /staff/artemis/2024-12-11/logs
  • [GOOD] Crucible 3712da4 (as above)
    • /staff/artemis/2024-12-10/logs

Of particular note on bd3cc54: observe VM f24c46b1-baca-4a42-b4a0-cc089e4c2fe0, which enters the Unlucky Bad State of all 3 downstairs going away, and then continually connecting and trying to live repair (with nobody to repair from) ad infinitum (as covered in #1556). Upstairs continues to accept writes at an alarming rate (as covered in #1555). The other VMs were not as unlucky and exhibited behavior similar to the 1922c8ecb8a667c9b1efa876f75d72a7437b3c77 run. As long as not all 3 disconnect, we seem to be able to recover automatically.

Across all 3 runs average performance was not significantly different, which indicates we are at least handling the live repair load reasonably.
