mystery failure on london BRM42220036 #1553
commits:
We investigated this on Friday and made some progress, but the root cause remains elusive. There appears to be network weather of some kind; the first warning in the logs is of a (partial) timeout. I'm using interleave-bunyan to combine all logs, then paging through:
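(interleave-bunyan's actual invocation isn't reproduced in this issue; purely as a hedged sketch of the underlying idea, merging several bunyan JSON-line logs into one timeline by their "time" field, assuming each file is already time-ordered:)

```python
#!/usr/bin/env python3
# Minimal sketch, not the interleave-bunyan tool itself: merge several
# bunyan-format log files into one stream ordered by each record's "time"
# field, so Upstairs and all three Downstairs can be read as one timeline.
import json
import sys
from heapq import merge  # merges iterables that are each already sorted

def records(path):
    """Yield (timestamp, raw line) pairs from one bunyan log file."""
    with open(path) as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines (panics, stack traces, ...)
            yield rec.get("time", ""), line.rstrip("\n")

if __name__ == "__main__":
    # bunyan timestamps are ISO 8601, so string comparison preserves time order
    for _, line in merge(*(records(p) for p in sys.argv[1:])):
        print(line)
```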
Scattered partial timeouts continue, then we see our first disconnection at 4:13:04:
Both the write and read sides fail with
In this case, client 1 is at
Things continue to fail at various rates, getting worse over time. Eventually, we manage to get into a state with all 3x Downstairs faulted, from which there is no recovery (#1556). There's some low-hanging fruit w.r.t. less logging in Crucible, but that's not the root cause of our problems. Other things to be suspicious about:
We were running
On the problem scrimlet:
On the scrimlet which did not keel over:
I didn't catch a pmap on these at the time, but I suspect both mgs are actually using the same amount of RAM, and the problem scrimlet has mgs swapping (which is why the RSS and SIZE don't line up there), and you've got oximeter adding an extra 4 gigs of memory pressure.
Also on the topic of memory pressure, our propolis servers were climbing quite high in their userspace usage, getting into the hundreds of megs, or peaking towards the 1 gig that the version we were running has as the limit. Unclear whether that's a root cause or a knock-on effect from something else. #1515 will bring our buffer length limit down to 50MiB.
I added the memstat/vmstat/prstat/swapsh loop outputs to
I ran a test while loading the system with memory, using this: https://github.com/oxidecomputer/artemiscellaneous/tree/main/scripts/ballast
I wanted to load on top of the amount of memory we'd normally see in use. Propolis uses about 1 gig of memory after the internal buffers are all filled up, and then each downstairs uses about 150MiB.
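(The linked ballast script is the real tool; its details aren't reproduced here. Purely as a hypothetical sketch of the general idea, allocate N GiB, touch every page so it is resident, then hold onto it:)

```python
#!/usr/bin/env python3
# Rough sketch of the ballast idea (not the artemiscellaneous script):
# allocate N GiB, write to every page so the memory is actually resident,
# then sleep while holding it to keep pressure on the sled.
import mmap
import sys
import time

GIB = 1 << 30
PAGE = mmap.PAGESIZE

def allocate_ballast(gib):
    buf = mmap.mmap(-1, gib * GIB)   # anonymous mapping
    for off in range(0, len(buf), PAGE):
        buf[off] = 1                 # touch each page to force it resident
    return buf

if __name__ == "__main__":
    ballast = allocate_ballast(int(sys.argv[1]))
    print(f"holding {len(ballast) // GIB} GiB; ctrl-c to release")
    while True:
        time.sleep(60)
```

Since the pages are touched once and then left idle, ballast like this counts as resident but mostly inactive memory.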
After loading 24 gigs, I noticed we had 94 gigs free per sled, per vmstat. That's because we haven't actually filled up the ZFS max cache of 100 GiB, since we haven't done any real load. So I load an additional 86 gigs, for 110 gigs loaded and about 8 gigs free per sled.

launching one VM with 110 gig ballast

I created one VM, rather than our normal 64. The VM starts up fine and starts filling its disk. I try making lots of API calls in parallel with
and I see no problem. No timeouts, everything is snappy. The guest fills its disk at 250MiB/s just fine, and is bottlenecked by my guest's
starting 4k randw with 110 gigs ballast

fio is going. So far no problem; nothing in dmesg indicates an issue.
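(The fio job behind "4krandw" isn't included in this issue; as a hypothetical sketch only, with job name, ioengine, depth, size, and runtime all guessed rather than the values actually used, a 4 KiB random-write workload run from inside the guest would look something like:)

```python
#!/usr/bin/env python3
# Hypothetical sketch of a "4krandw" workload: 4 KiB random writes via fio.
# All parameter values below are guesses, not the team's actual job; in the
# real test this presumably targeted the guest's Crucible-backed disk.
import subprocess

FIO_ARGS = [
    "fio",
    "--name=4krandw",
    "--rw=randwrite",    # random writes
    "--bs=4k",           # 4 KiB blocks
    "--direct=1",        # bypass the guest page cache
    "--ioengine=libaio", # assumes a Linux guest
    "--iodepth=32",
    "--size=10G",
    "--time_based",
    "--runtime=3600",
]

subprocess.run(FIO_ARGS, check=True)
```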
We are iowaiting a lot, but it seems steady.

starting 4k randw with 130 gigs ballast - swapping

OK, let's try making it swap. We are at 3 gigs free.
We have a 256 gig swap device on hand. Let's try pushing 20 gigs more memory. I'll remove the 4k randw load while I do this.
OK, now we are swapping... though, because this ballast is "inactive" memory, we won't be touching it to thrash swap, I think.
Let's try our 4krandw. Seems fine.

run 4krandw with no ballast, then add 200 gigs

Now let's try applying memory pressure while it's running. We'll allocate 200 gigs this time, to really push the envelope, while watching guest iostat. This won't happen instantaneously, and indeed it will get slower the more we dip into swap.

115 gigs loaded

Starting to see a slowdown around 115 gigs allocated
but it is holding a rough steady state here.

150 gigs loaded

We are getting some noise at 150 gigs allocated: averaging 6000KiB/s like above, but jumping between 5500-7500.

179 gigs loaded

In the guest we are averaging 3500-5000KiB/s. And 45 seconds later, it starts to plummet to 2500, then 300.
Then my ssh connections freeze, both to the host and to the guest. It actually freezes half-way through writing a guest iostat line on the guest. It remains frozen for about 30 seconds, and then recovers. When we get back, we have an iostat line with 0 IO, and some errors in
But then we recover, to a steady state of about 3000-4000.
Prstat has also recovered, and we are done with our allocations. So at this point it seems that our performance is pretty well degraded. We are not recovering back to the 11000 we saw earlier. You might think we are in live repair, and indeed, we are. But the memory pressure also has a continuous impact on our performance. I removed it, and slowly it comes back up to 7000KiB/s.

crucible upstairs/downstairs logs

are in

recovery to full speed

After an hour I check back in. Live repair should be done now:
Indeed we see higher speed here than we even did on the first test. Perhaps this is our IOs going into the void, but it seems more likely that in our first test our performance was already degraded and I just didn't know it.
Whenever this happens it's usually good to understand why, if possible. It's presumably trying to open or read something in /proc and is having to wait for a lock that it needs, but long blockages like that often mean that, due to memory pressure or whatever else, we're sitting on something we shouldn't be sitting on.
I ran a test with omicron main from a few days ago (1922c8ecb8a667c9b1efa876f75d72a7437b3c77), using the crucible and propolis it usually uses (I downloaded the TUF repo from CI). I saw many downstairs disconnect/reconnect, resulting in live repair. This happened only during the write tests, and it was consistently the same downstairs instances disconnecting, sometimes multiple times within the span of a few minutes. Curious, I ran 2 back-to-back tests with:
With bd3cc54 (latest commit on main), I saw the same bad behavior as my first test. With 3712da4 (reduce replay buffer to 64MiB), I saw not a single disconnect and not a single live repair. Comparing memory usage of propolis for the two with pmap:
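(The pmap numbers themselves aren't reproduced here. As a rough, hypothetical sketch of one way to gather such a comparison, assuming `pmap -x` ends its output with a per-process total line and that the processes match the name propolis-server:)

```python
#!/usr/bin/env python3
# Hypothetical sketch, not the comparison actually run above: print pmap's
# summary line for every propolis-server process so resident memory can be
# eyeballed across runs. Assumes pmap -x ends with a per-process "total" line.
import subprocess

def propolis_pids(pattern="propolis-server"):
    out = subprocess.run(["pgrep", "-f", pattern],
                         capture_output=True, text=True)
    return [int(p) for p in out.stdout.split()]

for pid in propolis_pids():
    pmap = subprocess.run(["pmap", "-x", str(pid)],
                          capture_output=True, text=True)
    lines = [l for l in pmap.stdout.splitlines() if l.strip()]
    if lines:
        print(pid, lines[-1])  # last non-empty line is the total
```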
Full propolis/crucible logs for each run:
Of particular note on bd3cc54: observe VM f24c46b1-baca-4a42-b4a0-cc089e4c2fe0, which enters the Unlucky Bad State of all 3 downstairs going away, and then continually connecting and trying to live repair (with nobody to repair from) ad infinitum (as covered in #1556). Upstairs continues to accept writes at an alarming rate (as covered in #1555). The other VMs were not as unlucky and exhibited similar behavior to 1922c8ecb8a667c9b1efa876f75d72a7437b3c77. As long as not all 3 disconnect, we seem to be able to recover automatically. Across all 3 runs, average performance was not significantly different, which indicates we are at least handling the live repair load reasonably.
This issue happened twice, both times with the 4krandw workload. The first time we didn't get the logs we wanted because we accidentally kernel panic'd the system trying to recover them. Second time around, we got them. I haven't looked into them yet, just writing this up before logging off for sleep.
Notably, the VMs seem to keep writing (to where?), and thus have abnormally high IOPS. In actuality the disk has failed out from under them.
It is extremely suspicious that both times, all the failing VMs were on BRM42220036 in particular, and not on any of the other sleds.
logs:
/staff/artemis/2024-11-08-londonBRM42220036-explosion/logs
suspect VM UUIDs, look at these in that dir:
extra data:
/staff/artemis/2024-11-08-londonBRM42220036-explosion/measure-4krandw
lrstate seems to indicate that 4e453ffc-d871-49b3-88fa-2a816aff9bde (lrstate has propolis UUIDs, gotta map them to the VM UUID), but none of the others, has 266 ds_live_repair_aborted and none completed. The rest have none aborted, none completed.

a sampling of dmesg from one of the guest VMs