system using only 8 of 10 U.2s on madrid BRM42220004 #5128
Comments
The call-stack is coming from here:
omicron/sled-hardware/src/illumos/partitions.rs Lines 46 to 52 in 65ebf72
Which is called for U.2s here: omicron/sled-hardware/src/illumos/partitions.rs Lines 147 to 150 in 65ebf72
For U.2s, we only expect a single partition: the one holding the ZFS zpool.
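To make that expectation concrete, here's a minimal sketch of that kind of check (not the actual code in partitions.rs; `DiskError` and `Partition` are illustrative stand-ins):

```rust
// Sketch of the expectation described above: a U.2 should expose exactly one
// partition, and that partition is assumed to back the ZFS zpool.
// `DiskError` and `Partition` are stand-ins, not Omicron's types.
#[derive(Debug)]
enum DiskError {
    UnexpectedPartitionLayout { found: usize },
}

#[derive(Debug)]
struct Partition {
    start_lba: u64,
    end_lba: u64,
}

fn expect_single_zpool_partition(parts: &[Partition]) -> Result<&Partition, DiskError> {
    match parts {
        // Exactly one partition: treat it as the one backing the zpool.
        [only] => Ok(only),
        // Zero partitions (the case hit on this sled) or more than one: error out.
        _ => Err(DiskError::UnexpectedPartitionLayout { found: parts.len() }),
    }
}

fn main() {
    // Example: a labeled disk whose GPT carries zero partitions.
    let parts: Vec<Partition> = Vec::new();
    match expect_single_zpool_partition(&parts) {
        Ok(p) => println!("using partition {:?}", p),
        Err(e) => println!("refusing to use disk: {:?}", e),
    }
}
```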
Given the surrounding context in partitions.rs:
So, to summarize:
We are indeed in this path. From the same log file, just before the errors above:
Here's what I got from MDB, poking at one of these devices:
Okay, I'm able to access some raw blocks from this device with MDB. Here we go! Note: our block size is 512, which in hex is 0x200.
Unless we have reason to believe otherwise from the control plane, the contents of the disk don't indicate that a partition was ever in use here. It's always possible we had something here and zeroed it out, but I'm not seeing ZFS headers or anything. I'm not sure who formatted this GPT, but it could have been that way for a while? Regardless, "GPT exists, has zero partitions" is a case that we should handle.
Specifically, by adding a single partition for the zpool.
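One possible shape for that fix, as a hedged sketch only (assuming we're allowed to write the label ourselves; `Gpt`, `Partition`, and `ensure_zpool_partition` are hypothetical stand-ins, not Omicron or libefi APIs):

```rust
// Sketch of handling "GPT exists but has zero partitions" by creating the one
// zpool partition ourselves. All types and values here are illustrative.
#[derive(Debug)]
struct Partition {
    start_lba: u64,
    end_lba: u64,
}

#[derive(Debug)]
struct Gpt {
    first_usable_lba: u64,
    last_usable_lba: u64,
    partitions: Vec<Partition>,
}

fn ensure_zpool_partition(gpt: &mut Gpt) -> &Partition {
    if gpt.partitions.is_empty() {
        // No partitions at all: lay one down spanning the whole usable range so
        // the rest of the stack can treat this disk like any other U.2.
        gpt.partitions.push(Partition {
            start_lba: gpt.first_usable_lba,
            end_lba: gpt.last_usable_lba,
        });
    }
    // Either the pre-existing partition zero or the one we just created.
    &gpt.partitions[0]
}

fn main() {
    // A label like the one on this disk: valid GPT, zero partitions (illustrative values).
    let mut gpt = Gpt { first_usable_lba: 34, last_usable_lba: 1_000_000, partitions: Vec::new() };
    println!("zpool partition: {:?}", ensure_zpool_partition(&mut gpt));
}
```

The open policy question, debated later in this thread, is whether creating the partition automatically like this is safe, or whether it risks papering over a disk that genuinely had data on it.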
I'm not sure I would expect anything at the first usable LBA, FWIW. I'd do the same check on a disk that does have partitions and pools that work, and make sure you're not just lucking onto a region that is ordinarily all zeroes.
Good point. Here's what I'm seeing on another disk. Maybe I need to dig deeper; this also shows a zeroed first block, even though it claims to have a ZFS partition:
Huh, on a "known good" disk, I am starting to see data around offset 0x23fd0, which is around LBA 288 (or LBA 32 within the ZFS partition). This includes the zpool name, metaslab stuff, etc., and looks like a legit pool. Lemme check for that info on the misbehaving U.2s...
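For reference, the offset/LBA arithmetic behind those numbers (a small worked example; the LBA-256 partition start is inferred from the 288-vs-32 figures above, not read from the label):

```rust
// Offset <-> LBA arithmetic for the numbers above, assuming the 512-byte
// (0x200) block size noted earlier.
fn main() {
    let block_size: u64 = 0x200; // 512 bytes
    let offset: u64 = 0x23fd0;   // where non-zero data starts appearing

    let lba = offset / block_size;            // 0x23fd0 / 0x200 = 287, i.e. ~288
    let partition_start_lba: u64 = 288 - 32;  // inferred partition start: LBA 256

    println!(
        "offset {:#x} = absolute LBA {}, or LBA {} within the partition",
        offset,
        lba,
        lba.saturating_sub(partition_start_lba),
    );
}
```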
I'm seeing something very odd on the other disk:
ENDLBA < STARTLBA? That seems weird.
Also seeing nothing at ~LBA 288 and onwards (where we've seen zpool metadata on other valid disks). Just zeroes.
For anyone trying to manually check the zpool metadata:
On any normal U.2, I see this metadata; on these two, I don't.
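Since the exact commands didn't make it into this transcript, here's a rough stand-in for that manual check (a hedged sketch, not the original commands): read a few blocks starting at LBA 288 from the raw device and look for printable strings like the pool name. The device path is a placeholder.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

const BLOCK_SIZE: u64 = 512;

fn main() -> std::io::Result<()> {
    // Placeholder device path; substitute the raw device for the U.2 in question.
    let mut dev = File::open("/dev/rdsk/c1t1d0p0")?;

    // Jump to LBA 288, where zpool metadata showed up on the healthy disks.
    dev.seek(SeekFrom::Start(288 * BLOCK_SIZE))?;

    // Read a handful of blocks and print printable ASCII runs of 6+ chars,
    // enough to spot the pool name / metaslab strings on a healthy disk.
    let mut buf = vec![0u8; (16 * BLOCK_SIZE) as usize];
    dev.read_exact(&mut buf)?;

    let mut run = String::new();
    for &b in &buf {
        if b.is_ascii_graphic() || b == b' ' {
            run.push(b as char);
        } else {
            if run.len() >= 6 {
                println!("{run}");
            }
            run.clear();
        }
    }
    if run.len() >= 6 {
        println!("{run}");
    }
    Ok(())
}
```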
To add one more recap here:
Fixing the "GPT but no partitions" case doesn't seem terrible -- we should still be able to lay down the expected zpool partition ourselves. Fixing the "GPT exists, AND has a partition, BUT it sucks" case seems more difficult. It's much trickier to determine "is this partition actually, truly, genuinely unusable?" in a way that's completely safe and wouldn't accidentally destroy valid data. In this situation, the "end LBA" is smaller than the "start LBA", which looks "obviously bad" to me, a human, but that's a weird heuristic for a program to use, and certainly not the only way in which we could have a "partition zero" that looks invalid.
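To make the heuristic question concrete, this is roughly the check under discussion (a sketch for illustration, not a proposal for what Omicron should ship; `Partition` is a stand-in):

```rust
// Illustration of the "does partition zero even look sane?" question.
// Catching end < start is easy; deciding a partition is *truly* unusable
// without risking valid data is the hard part.
struct Partition {
    start_lba: u64,
    end_lba: u64,
}

fn looks_obviously_invalid(p: &Partition, first_usable: u64, last_usable: u64) -> bool {
    // The case seen on this disk: the end LBA is below the start LBA.
    p.end_lba < p.start_lba
        // Or the partition falls outside the GPT's usable range entirely.
        || p.start_lba < first_usable
        || p.end_lba > last_usable
}

fn main() {
    // The shape seen on this disk: END LBA below START LBA (illustrative values).
    let p = Partition { start_lba: 1_000_000, end_lba: 4_096 };
    println!("obviously invalid? {}", looks_obviously_invalid(&p, 34, 2_000_000));
}
```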
Sean is doing a rework of the disk adoption process, and he's going to take this issue. He's already debugged it all anyway, so he may as well get full credit 🥇
This was originally observed under #5111 (where this sled went through the "add sled" flow). But we also found that after a fresh install of Omicron that included BRM42220004 (without the "add sled" flow), the same issue happened and there were only 8 Crucible zones on it.
The errors are: