system using only 8 of 10 U.2s on madrid BRM42220004 #5128

Open · davepacheco opened this issue Feb 23, 2024 · 14 comments
@davepacheco (Collaborator):

This was originally observed under #5111 (where this sled went through the "add sled" flow). But we also found that after a fresh install of Omicron that included BRM42220004 (without the "add sled" flow), the same issue happened and there were only 8 Crucible zones on it.

The errors are:

00:26:53.592Z ERRO SledAgent (StorageManager): Persistent error:not queueing disk
    disk_id = DiskIdentity { vendor: "1b96", serial: "A079E3F8", model: "WUS4C6432DSP3X3" }
    err = PooledDisk(BadPartitionLayout { path: "/devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440,0:wd,raw", why: "Expected 0 partitions, only saw 1" })
    file = sled-storage/src/manager.rs:472
00:26:54.179Z ERRO SledAgent (StorageManager): Persistent error:not queueing disk
    disk_id = DiskIdentity { vendor: "1b96", serial: "A079DE8D", model: "WUS4C6432DSP3X3" }
    err = PooledDisk(ZpoolCreate(CreateError { err: Execution(CommandFailure(CommandFailureInfo { command: "/usr/sbin/zpool create oxp_d6149e62-ae84-4209-b943-4053fb9a8713 /devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:a", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "cannot open '/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:a': No such device or address\\n" })) }))
    file = sled-storage/src/manager.rs:472
00:28:50.331Z ERRO SledAgent (StorageManager): Persistent error:not queueing disk
    disk_id = DiskIdentity { vendor: "1b96", serial: "A079DE8D", model: "WUS4C6432DSP3X3" }
    err = PooledDisk(ZpoolCreate(CreateError { err: Execution(CommandFailure(CommandFailureInfo { command: "/usr/sbin/zpool create oxp_6ad8e920-ad0c-4629-8208-19dbf938a354 /devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:a", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "cannot open '/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:a': No such device or address\\n" })) }))
    file = sled-storage/src/manager.rs:472
00:28:50.605Z ERRO SledAgent (StorageManager): Persistent error:not queueing disk
    disk_id = DiskIdentity { vendor: "1b96", serial: "A079E3F8", model: "WUS4C6432DSP3X3" }
    err = PooledDisk(BadPartitionLayout { path: "/devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440,0:wd,raw", why: "Expected 0 partitions, only saw 1" })
    file = sled-storage/src/manager.rs:472
@smklein (Collaborator) commented Feb 23, 2024

The call stack comes from here, in parse_partition_types:

return Err(PooledDiskError::BadPartitionLayout {
    path: path.to_path_buf(),
    why: format!(
        "Expected {} partitions, only saw {}",
        partitions.len(),
        N
    ),
});

(Note that the format arguments are transposed relative to the wording: partitions.len() is the count we found and N is the count we expected, which is why the log above reads "Expected 0 partitions, only saw 1" when we actually expected one partition and found zero.)

Which is called for U.2s here:

DiskVariant::U2 => {
    partitions.truncate(U2_EXPECTED_PARTITION_COUNT);
    parse_partition_types(&path, &partitions, &U2_EXPECTED_PARTITIONS)
}

For U.2s, we only expect a single partition: the one holding the ZFS Zpool:

static U2_EXPECTED_PARTITIONS: [Partition; U2_EXPECTED_PARTITION_COUNT] =
    [Partition::ZfsPool];

Given the surrounding context in internal_ensure_partition_layout, here's what I suspect is happening:

  1. Sled Agent sees a raw disk from libdevinfo. It's a U.2
  2. Sled Agent tries to ensure this disk has a GPT with the right partitions. First, it checks if the GPT exists.
  3. AFAICT, the GPT does exist. I think we're taking this pathway:
    Ok(gpt) => {
        // This should be the common steady-state case
        info!(log, "Disk at {} already has a GPT", paths.devfs_path);
        gpt
    }
  4. This means we aren't trying to write anything. It just checks that "oh, someone already made the GPT, let's just see if it's formatted okay".
  5. It's not. We bail.

So, to summarize:

  • If a U.2 has a GPT with a single ZFS partition -> Return Ok
  • If a U.2 has NO GPT -> Sled formats GPT + Zpool with Zpool::create, Return Ok
  • If a U.2 has a GPT but it does not contain the partitions we expect -> Err. This is the case we're hitting.
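
As a minimal sketch of that dispatch (hypothetical stand-ins, not the actual internal_ensure_partition_layout code; Gpt, read_gpt, and create_zpool_with_gpt are invented for illustration):

// Simplified sketch of the behavior summarized above; these types and
// helpers are hypothetical stand-ins, not the real sled-hardware code.
struct Gpt {
    partition_count: usize,
}

fn read_gpt(_path: &str) -> Result<Gpt, ()> {
    unimplemented!() // stand-in for the libefi read
}

fn create_zpool_with_gpt(_path: &str) -> Result<(), String> {
    unimplemented!() // stand-in for Zpool::create
}

fn ensure_u2_layout(path: &str) -> Result<(), String> {
    match read_gpt(path) {
        // No GPT at all: safe to format; `zpool create` writes both the
        // GPT and the single expected ZFS partition.
        Err(()) => create_zpool_with_gpt(path),
        // GPT with exactly the expected single partition: steady state.
        Ok(gpt) if gpt.partition_count == 1 => Ok(()),
        // GPT exists but doesn't match the expected layout (including
        // the zero-entry GPT seen here): we verify, we never repair.
        Ok(gpt) => Err(format!(
            "bad partition layout: saw {} partitions, expected 1",
            gpt.partition_count
        )),
    }
}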

@jgallagher (Contributor):

3. AFAICT, the GPT does exist. I think we're taking this pathway:

Ok(gpt) => {
    // This should be the common steady-state case
    info!(log, "Disk at {} already has a GPT", paths.devfs_path);
    gpt
}

We are indeed in this path. From the same log file, just before the errors above:

00:26:53.592Z INFO SledAgent (StorageManager): Disk at /devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440,0 already has a GPT
    file = sled-hardware/src/illumos/partitions.rs:103
00:28:50.605Z INFO SledAgent (StorageManager): Disk at /devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440,0 already has a GPT
    file = sled-hardware/src/illumos/partitions.rs:103

@smklein (Collaborator) commented Feb 23, 2024

Here's what I got from MDB, poking at one of these devices:

BRM42220004 # mdb /dev/rdsk/c10t0014EE81000BC440d0p0
> ::load disk_label
> 
> ::help gpt

NAME
  gpt - dump an EFI GPT

SYNOPSIS
  [ addr ] ::gpt [-ag]

DESCRIPTION
  Display an EFI GUID Partition Table.
  
  -a Display the alternate GPT
  -g Show unique GUID for each table entry

ATTRIBUTES

  Target: raw
  Module: disk_label
  Interface Stability: Unstable
>
> ::gpt
Signature: EFI PART (valid)
Revision: 1.0
HeaderSize: 92 bytes
HeaderCRC32: 0x2bd8ebf7 (should be 0x2bd8ebf7)
Reserved1: 0 (should be 0x0)
MyLBA: 1 (should be 1)
AlternateLBA: 6251233967
FirstUsableLBA: 34
LastUsableLBA: 6251233934
DiskGUID: 6601dc84-5899-e48f-877b-a768969d4f59
PartitionEntryLBA: 2
NumberOfPartitionEntries: 0
SizeOfPartitionEntry: 0x80 bytes
PartitionEntryArrayCRC32: 0 (should be 0)

PART TYPE                STARTLBA      ENDLBA        ATTR     NAME
> ::gpt -a
Signature: EFI PART (valid)
Revision: 1.0
HeaderSize: 92 bytes
HeaderCRC32: 0x603f95a (should be 0x603f95a)
Reserved1: 0 (should be 0x0)
MyLBA: 6251233967 (should be 6251233967)
AlternateLBA: 1
FirstUsableLBA: 34
LastUsableLBA: 6251233934
DiskGUID: 6601dc84-5899-e48f-877b-a768969d4f59
PartitionEntryLBA: 6251233935
NumberOfPartitionEntries: 0
SizeOfPartitionEntry: 0x80 bytes
PartitionEntryArrayCRC32: 0 (should be 0)

PART TYPE                STARTLBA      ENDLBA        ATTR     NAME

@smklein (Collaborator) commented Feb 23, 2024

Okay, I'm able to access some raw blocks from this device with MDB. Here we go!

Note: Our block size is 512, which in hex is 0x200.
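
So a ::dump offset is just LBA × 0x200: LBA 1 → 0x200, LBA 34 → 0x4400, LBA 256 → 0x20000, LBA 288 → 0x24000. Those are the offsets used below.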

BRM42220004 # mdb /dev/rdsk/c10t0014EE81000BC440d0p0
> ::load disk_label

LBA 0:
> 0::dump -f -l 0x200
      \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
000:  00000000 00000000 00000000 00000000  ................
010:  00000000 00000000 00000000 00000000  ................
020:  00000000 00000000 00000000 00000000  ................
030:  00000000 00000000 00000000 00000000  ................
040:  00000000 00000000 00000000 00000000  ................
050:  00000000 00000000 00000000 00000000  ................
060:  00000000 00000000 00000000 00000000  ................
070:  00000000 00000000 00000000 00000000  ................
080:  00000000 00000000 00000000 00000000  ................
090:  00000000 00000000 00000000 00000000  ................
0a0:  00000000 00000000 00000000 00000000  ................
0b0:  00000000 00000000 00000000 00000000  ................
0c0:  00000000 00000000 00000000 00000000  ................
0d0:  00000000 00000000 00000000 00000000  ................
0e0:  00000000 00000000 00000000 00000000  ................
0f0:  00000000 00000000 00000000 00000000  ................
100:  00000000 00000000 00000000 00000000  ................
110:  00000000 00000000 00000000 00000000  ................
120:  00000000 00000000 00000000 00000000  ................
130:  00000000 00000000 00000000 00000000  ................
140:  00000000 00000000 00000000 00000000  ................
150:  00000000 00000000 00000000 00000000  ................
160:  00000000 00000000 00000000 00000000  ................
170:  00000000 00000000 00000000 00000000  ................
180:  00000000 00000000 00000000 00000000  ................
190:  00000000 00000000 00000000 00000000  ................
1a0:  00000000 00000000 00000000 00000000  ................
1b0:  00000000 00000000 00000000 00000000  ................
1c0:  0200eeff ffff0100 0000ffff ffff0000  ................
1d0:  00000000 00000000 00000000 00000000  ................
1e0:  00000000 00000000 00000000 00000000  ................
1f0:  00000000 00000000 00000000 000055aa  ..............U.

LBA 1: The GPT header itself
> 0x200::dump -f -l 0x200
      \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
200:  45464920 50415254 00000100 5c000000  EFI PART....\...
210:  f7ebd82b 00000000 01000000 00000000  ...+............
220:  af429a74 01000000 22000000 00000000  .B.t....".......
230:  8e429a74 01000000 6601dc84 5899e48f  .B.t....f...X...
240:  877ba768 969d4f59 02000000 00000000  .{.h..OY........
250:  00000000 80000000 00000000 00000000  ................
260:  00000000 00000000 00000000 00000000  ................
270:  00000000 00000000 00000000 00000000  ................
280:  00000000 00000000 00000000 00000000  ................
290:  00000000 00000000 00000000 00000000  ................
2a0:  00000000 00000000 00000000 00000000  ................
2b0:  00000000 00000000 00000000 00000000  ................
2c0:  00000000 00000000 00000000 00000000  ................
2d0:  00000000 00000000 00000000 00000000  ................
2e0:  00000000 00000000 00000000 00000000  ................
2f0:  00000000 00000000 00000000 00000000  ................
300:  00000000 00000000 00000000 00000000  ................
310:  00000000 00000000 00000000 00000000  ................
320:  00000000 00000000 00000000 00000000  ................
330:  00000000 00000000 00000000 00000000  ................
340:  00000000 00000000 00000000 00000000  ................
350:  00000000 00000000 00000000 00000000  ................
360:  00000000 00000000 00000000 00000000  ................
370:  00000000 00000000 00000000 00000000  ................
380:  00000000 00000000 00000000 00000000  ................
390:  00000000 00000000 00000000 00000000  ................
3a0:  00000000 00000000 00000000 00000000  ................
3b0:  00000000 00000000 00000000 00000000  ................
3c0:  00000000 00000000 00000000 00000000  ................
3d0:  00000000 00000000 00000000 00000000  ................
3e0:  00000000 00000000 00000000 00000000  ................
3f0:  00000000 00000000 00000000 00000000  ................

LBA 34, which should be the first usable LBA:
> 0x4400::dump -f -l 0x200
       \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
4400:  00000000 00000000 00000000 00000000  ................
4410:  00000000 00000000 00000000 00000000  ................
4420:  00000000 00000000 00000000 00000000  ................
4430:  00000000 00000000 00000000 00000000  ................
4440:  00000000 00000000 00000000 00000000  ................
4450:  00000000 00000000 00000000 00000000  ................
4460:  00000000 00000000 00000000 00000000  ................
4470:  00000000 00000000 00000000 00000000  ................
4480:  00000000 00000000 00000000 00000000  ................
4490:  00000000 00000000 00000000 00000000  ................
44a0:  00000000 00000000 00000000 00000000  ................
44b0:  00000000 00000000 00000000 00000000  ................
44c0:  00000000 00000000 00000000 00000000  ................
44d0:  00000000 00000000 00000000 00000000  ................
44e0:  00000000 00000000 00000000 00000000  ................
44f0:  00000000 00000000 00000000 00000000  ................
4500:  00000000 00000000 00000000 00000000  ................
4510:  00000000 00000000 00000000 00000000  ................
4520:  00000000 00000000 00000000 00000000  ................
4530:  00000000 00000000 00000000 00000000  ................
4540:  00000000 00000000 00000000 00000000  ................
4550:  00000000 00000000 00000000 00000000  ................
4560:  00000000 00000000 00000000 00000000  ................
4570:  00000000 00000000 00000000 00000000  ................
4580:  00000000 00000000 00000000 00000000  ................
4590:  00000000 00000000 00000000 00000000  ................
45a0:  00000000 00000000 00000000 00000000  ................
45b0:  00000000 00000000 00000000 00000000  ................
45c0:  00000000 00000000 00000000 00000000  ................
45d0:  00000000 00000000 00000000 00000000  ................
45e0:  00000000 00000000 00000000 00000000  ................
45f0:  00000000 00000000 00000000 00000000  ................

@smklein (Collaborator) commented Feb 23, 2024

Unless we have reason to believe otherwise from the control plane, the contents of the disk don't indicate that a partition was ever in use here. It's always possible we had something here and zeroed it out, but I'm not seeing ZFS headers or anything.

I'm not sure who formatted this GPT, but it could have been that way for a while?

Regardless, the case of "GPT exists, has zero partitions" is a case that we should handle.

@smklein (Collaborator) commented Feb 23, 2024

Regardless, the case of "GPT exists, has zero partitions" is a case that we should handle.

Specifically, by adding a single partition for the zpool via zpool create. A sketch of what that could look like is below.
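
Something like this (a hypothetical helper only, reusing the same zpool create invocation from the error logs above; not the actual sled-storage code):

use std::process::Command;

// Hypothetical sketch of the fix: an empty GPT has no partition entries
// and therefore no data we could destroy, so treat it like the "no GPT"
// case and let `zpool create` lay down the single expected partition.
fn format_empty_gpt_disk(zpool_name: &str, dev_path: &str) -> Result<(), String> {
    let status = Command::new("/usr/sbin/zpool")
        .args(["create", zpool_name, dev_path])
        .status()
        .map_err(|e| e.to_string())?;
    if status.success() {
        Ok(())
    } else {
        Err(format!("zpool create {zpool_name} failed with {status}"))
    }
}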

@jclulow (Collaborator) commented Feb 23, 2024

I'm not sure I would expect anything at the first usable LBA FWIW. I'd do the same check on a disk that does have partitions and pools that work, and make sure you're not just lucking onto a region that is ordinarily zeroes.

@smklein (Collaborator) commented Feb 23, 2024

I'm not sure I would expect anything at the first usable LBA FWIW. I'd do the same check on a disk that does have partitions and pools that work, and make sure you're not just lucking onto a region that is ordinarily zeroes.

Good point. Here's what I'm seeing on another disk. Maybe I need to dig deeper; this also shows a zeroed first block, even though it claims to have a ZFS partition.

BRM42220004 # mdb /dev/rdsk/c3t0014EE81000BC4F1d0p0
> ::load disk_label
> ::gpt
Signature: EFI PART (valid)
Revision: 1.0
HeaderSize: 92 bytes
HeaderCRC32: 0x17b6bd4b (should be 0x17b6bd4b)
Reserved1: 0 (should be 0x0)
MyLBA: 1 (should be 1)
AlternateLBA: 6251233967
FirstUsableLBA: 34
LastUsableLBA: 6251233934
DiskGUID: 3c61bf3f-81e5-ec55-ec1c-ff1aa537e314
PartitionEntryLBA: 2
NumberOfPartitionEntries: 9
SizeOfPartitionEntry: 0x80 bytes
PartitionEntryArrayCRC32: 0xf1791746 (should be 0xf1791746)

PART TYPE                STARTLBA      ENDLBA        ATTR     NAME
0    EFI_USR             256           6251217550    0        zfs
1    EFI_UNUSED         
2    EFI_UNUSED         
3    EFI_UNUSED         
4    EFI_UNUSED         
5    EFI_UNUSED         
6    EFI_UNUSED         
7    EFI_UNUSED         
8    EFI_RESERVED        6251217551    6251233934    0 

This should be LBA 256?
> 0x20000::dump -f -l 0x200
        \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
20000:  00000000 00000000 00000000 00000000  ................
20010:  00000000 00000000 00000000 00000000  ................
20020:  00000000 00000000 00000000 00000000  ................
20030:  00000000 00000000 00000000 00000000  ................
20040:  00000000 00000000 00000000 00000000  ................
20050:  00000000 00000000 00000000 00000000  ................
20060:  00000000 00000000 00000000 00000000  ................
20070:  00000000 00000000 00000000 00000000  ................
20080:  00000000 00000000 00000000 00000000  ................
20090:  00000000 00000000 00000000 00000000  ................
200a0:  00000000 00000000 00000000 00000000  ................
200b0:  00000000 00000000 00000000 00000000  ................
200c0:  00000000 00000000 00000000 00000000  ................
200d0:  00000000 00000000 00000000 00000000  ................
200e0:  00000000 00000000 00000000 00000000  ................
200f0:  00000000 00000000 00000000 00000000  ................
20100:  00000000 00000000 00000000 00000000  ................
20110:  00000000 00000000 00000000 00000000  ................
20120:  00000000 00000000 00000000 00000000  ................
20130:  00000000 00000000 00000000 00000000  ................
20140:  00000000 00000000 00000000 00000000  ................
20150:  00000000 00000000 00000000 00000000  ................
20160:  00000000 00000000 00000000 00000000  ................
20170:  00000000 00000000 00000000 00000000  ................
20180:  00000000 00000000 00000000 00000000  ................
20190:  00000000 00000000 00000000 00000000  ................
201a0:  00000000 00000000 00000000 00000000  ................
201b0:  00000000 00000000 00000000 00000000  ................
201c0:  00000000 00000000 00000000 00000000  ................
201d0:  00000000 00000000 00000000 00000000  ................
201e0:  00000000 00000000 00000000 00000000  ................
201f0:  00000000 00000000 00000000 00000000  ................

@smklein (Collaborator) commented Feb 23, 2024

Huh, on a "known good" disk, I'm starting to see data around offset 0x23fd0, which is around LBA 288 (or LBA 32 within the ZFS partition). This includes the zpool name, "meta slab" stuff, etc., and looks like a legit pool. Lemme check for that info on the misbehaving U.2s...
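
(That lines up with the ZFS vdev label layout, if I'm remembering it right: label L0 sits at the very start of the partition and begins with 8 KiB of blank space plus an 8 KiB boot block header, so the name/value pairs carrying the pool name start 16 KiB, i.e. 32 512-byte LBAs, into the partition.)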

@smklein (Collaborator) commented Feb 23, 2024

On /dev/rdsk/c10t0014EE81000BC440d0p0, aka /devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440, this disk is still zeroed around LBA 288, which is where I started seeing zpool metadata on the known-good disk.

I'm seeing something very odd on the other disk:

BRM42220004 # mdb /devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:q,raw
> 
> 
> ::load disk_label
> 
> 
> ::gpt
Signature: EFI PART (valid)
Revision: 1.0
HeaderSize: 92 bytes
HeaderCRC32: 0xcbd22368 (should be 0xcbd22368)
Reserved1: 0 (should be 0x0)
MyLBA: 1 (should be 1)
AlternateLBA: 6251233967
FirstUsableLBA: 34
LastUsableLBA: 6251233934
DiskGUID: c9172905-e06f-4af9-b61a-ceb23c9add2a
PartitionEntryLBA: 2
NumberOfPartitionEntries: 1
SizeOfPartitionEntry: 0x80 bytes
PartitionEntryArrayCRC32: 0x8242bb87 (should be 0x8242bb87)

PART TYPE                STARTLBA      ENDLBA        ATTR     NAME
0    EFI_USR             256           255           0        

ENDLBA < STARTLBA? That seems weird.

@smklein (Collaborator) commented Feb 23, 2024

On /dev/rdsk/c10t0014EE81000BC440d0p0, aka /devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440, this is still zeroed around LBA 288, which is where I started seeing zpool metadata.

I'm seeing something very odd on the other disk:

...

Also seeing nothing at ~LBA 288 and onwards (where we've seen zpool metadata on other valid disks). Just zeroes.

@smklein (Collaborator) commented Feb 23, 2024

For anyone trying to manually check the zpool metadata:

$ mdb /dev/rdsk/<pick whatever disk you want>p0
> ::load disk_label
# With `::gpt`: You should see partitions 0 - 8, and partition 0 starts at LBA 256
> ::gpt
# This is LBA 288, which is 32 LBAs into the first partition. You should see zpool metadata.
> 0x24000::dump -l 0x200

On any normal U.2: I see this metadata. On these two: I don't.

@smklein (Collaborator) commented Feb 23, 2024

To add one more recap here:

  • If a U.2 has a GPT with a single ZFS partition -> Return Ok
  • If a U.2 has NO GPT -> Sled formats GPT + Zpool with Zpool::create, Return Ok
  • If a U.2 has a GPT but it does not contain the partitions we expect -> Err. This is the case we're hitting for /devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440,0.
  • If a U.2 has a GPT, and it does contain at least the one partition that we expect, but that partition is "invalid" for reasons that prevent us from opening it -> Err. This is the case we're hitting for /devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0.

Fixing the "GPT but no partitions" case doesn't seem terrible -- we should be able to still zpool create there anyway.

Fixing the "GPT exists, AND has a partition, BUT it sucks" case seems more difficult. It seems much trickier to determine "is this partition actually, truly, genuinely unusable?" in a way that's completely safe, and wouldn't accidentally destroy valid data. In this situation, the "end LBA" is smaller than the "start LBA" which kinda looks like "obviously bad" to me a human, but that's a weird heuristic for a program to use - and certainly not the only way in which we could have a "partition zero" that looks invalid.

andrewjstone assigned smklein and unassigned andrewjstone on Feb 26, 2024
@andrewjstone (Contributor):

Sean is doing a rework of the disk adoption process, and he's going to take this issue. He's already debugged it all anyway, so he may as well get full credit 🥇
