sled-agent needs to deal with SSD overprovisioning #5158

Merged (8 commits, Mar 11, 2024)
15 changes: 15 additions & 0 deletions Cargo.lock


1 change: 1 addition & 0 deletions Cargo.toml
@@ -245,6 +245,7 @@ itertools = "0.12.1"
key-manager = { path = "key-manager" }
kstat-rs = "0.2.3"
libc = "0.2.153"
libnvme = { git = "https://github.com/oxidecomputer/libnvme", rev = "6fffcc81d2c423ed2d2e6c5c2827485554c4ecbe" }
linear-map = "1.2.0"
macaddr = { version = "1.0.1", features = ["serde_std"] }
maplit = "1.0.2"
1 change: 1 addition & 0 deletions sled-hardware/Cargo.toml
@@ -28,6 +28,7 @@ omicron-workspace-hack.workspace = true
[target.'cfg(target_os = "illumos")'.dependencies]
illumos-devinfo = { git = "https://github.com/oxidecomputer/illumos-devinfo", branch = "main" }
libefi-illumos = { git = "https://github.com/oxidecomputer/libefi-illumos", branch = "master" }
libnvme.workspace = true

[dev-dependencies]
illumos-utils = { workspace = true, features = ["testing"] }
6 changes: 5 additions & 1 deletion sled-hardware/src/disk.rs
@@ -38,6 +38,8 @@ pub enum PooledDiskError {
CannotFormatMissingDevPath { path: Utf8PathBuf },
#[error("Formatting M.2 devices is not yet implemented")]
CannotFormatM2NotImplemented,
#[error(transparent)]
NvmeFormatAndResize(#[from] NvmeFormattingError),
}

/// A partition (or 'slice') of a disk.
@@ -196,9 +198,11 @@ impl PooledDisk {
) -> Result<Self, PooledDiskError> {
let paths = &unparsed_disk.paths;
let variant = unparsed_disk.variant;
let identity = unparsed_disk.identity();
// Ensure the GPT has the right format. This does not necessarily
// mean that the partitions are populated with the data we need.
let partitions = ensure_partition_layout(&log, &paths, variant)?;
let partitions =
ensure_partition_layout(&log, &paths, variant, identity)?;

// Find the path to the zpool which exists on this disk.
//
2 changes: 1 addition & 1 deletion sled-hardware/src/illumos/mod.rs
@@ -26,7 +26,7 @@ mod gpt;
mod partitions;
mod sysconf;

pub use partitions::ensure_partition_layout;
pub use partitions::{ensure_partition_layout, NvmeFormattingError};

#[derive(thiserror::Error, Debug)]
enum Error {
199 changes: 197 additions & 2 deletions sled-hardware/src/illumos/partitions.rs
@@ -4,10 +4,14 @@

//! illumos-specific mechanisms for parsing disk info.

use std::collections::HashMap;
use std::sync::OnceLock;

use crate::illumos::gpt;
use crate::{DiskPaths, DiskVariant, Partition, PooledDiskError};
use camino::Utf8Path;
use illumos_utils::zpool::ZpoolName;
use omicron_common::disk::DiskIdentity;
use slog::info;
use slog::Logger;
use uuid::Uuid;
@@ -17,6 +21,70 @@ use illumos_utils::zpool::MockZpool as Zpool;
#[cfg(not(test))]
use illumos_utils::zpool::Zpool;

/// NVMe devices use a meta size of 0 as we don't support writing additional
/// metadata.
static NVME_LBA_META_SIZE: u32 = 0;
/// NVMe devices default to using 4k logical block addressing unless overridden.
static DEFAULT_NVME_LBA_DATA_SIZE: u64 = 4096;

/// NVMe device settings for a particular NVMe model.
struct NvmeDeviceSettings {
/// The desired disk size for dealing with overprovisioning.
size: u32,
/// An override for the default 4k LBA formatting.
lba_data_size_override: Option<u64>,
Contributor Author (papertigers): Nothing uses this today, so I am open to removing it from this PR in favor of not having dead code.

}

/// A mapping from model to desired settings.
/// A device not found in this lookup table will not be modified by sled-agent.
static PREFERRED_NVME_DEVICE_SETTINGS: OnceLock<
HashMap<&'static str, NvmeDeviceSettings>,
> = OnceLock::new();

fn preferred_nvme_device_settings(
) -> &'static HashMap<&'static str, NvmeDeviceSettings> {
PREFERRED_NVME_DEVICE_SETTINGS.get_or_init(|| {
Contributor:
Tiny nit: if having to do this dance is annoying, you could use https://docs.rs/once_cell/latest/once_cell/sync/struct.Lazy.html instead, so you could use PREFERRED_NVME_DEVICE_SETTINGS directly instead of having to go through this helper. We use it elsewhere, and eventually it will land in the std lib.

Contributor Author (papertigers):
I don't feel strongly about this one. I am okay leaving it as is if that works for you.

Contributor:
Yep, totally fine.
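
For illustration, a minimal sketch of the suggested Lazy approach (assuming a once_cell dependency and the NvmeDeviceSettings type above; most entries elided):

use once_cell::sync::Lazy;
use std::collections::HashMap;

// With Lazy, the map initializes itself on first access, so the static can
// be used directly and the helper function above becomes unnecessary.
static PREFERRED_NVME_DEVICE_SETTINGS: Lazy<
    HashMap<&'static str, NvmeDeviceSettings>,
> = Lazy::new(|| {
    HashMap::from([(
        "WUS4C6432DSP3X3",
        NvmeDeviceSettings { size: 3200, lba_data_size_override: None },
    )])
    // ...remaining entries as in the table below...
});

// Call sites then index the map directly, e.g.:
//   PREFERRED_NVME_DEVICE_SETTINGS.get(identity.model.as_str())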

HashMap::from([
(
"WUS4C6432DSP3X3",
NvmeDeviceSettings { size: 3200, lba_data_size_override: None },
Contributor (jgallagher):
Two thoughts, neither of which are necessarily relevant to landing this PR (and are kinda half baked; maybe @smklein has more thoughts here?):

  • This list being built into sled-agent means adding support for more disks requires an OS update. Is that okay?
  • If being baked into sled-agent is okay, do we want to define this in code, or would it make sense to put it in the sled-agent config.toml (which would require some work to pass it down into sled-hardware, admittedly).

rmustacc:

Those are good questions, @jgallagher.

Just to clarify, I think this means that if we want to support adopting a disk that needs non-default changes (i.e. something other than the 4K sector size) then we will need to do something. If we're going to move it at all, then we should probably make nexus own this list and tell sled-agent (not unreasonable per se). I think that'll fit better with the general firmware update and related flows that we've been discussing as well for disks. What I'd suggest is that for the moment we keep this here and then work to move it into nexus as part of the disk adoption path and make this something we can send down.

What I'd expect in terms of user interface flow over time is that we'd see something like:

  • sled-agent detects a new disk and tells nexus.
  • nexus will prompt the operator and if this is not a known / expected disk we will make the prompt to adopt very clear.
  • At adoption time, nexus will send down instructions on how to transform the disk; things like resizing are something we may want to do in the future if we, for example, feel good about undoing the overprovisioning.
  • This may mean that a nexus update is required for us not to show the scarier unsupported disk warning and it may still require an OS update to fully update sled-agent on the transformations required.

I'm not sure it's worth an intermediate step to put it in the toml file.

Contributor:
FWIW, I much prefer the nexus path laid out here to using the toml file. Thanks for the details @rmustacc.

Contributor Author (papertigers):
Thanks for the helpful writeup @rmustacc. What you outlined here makes sense to me and likely ties into some of the same work that's going to happen with nvme firmware upgrades. I can keep this in the back of my mind as I start to figure out what that's going to look like.

Collaborator (smklein):
After #5172 merges, I'm hoping this pathway will get much easier, but I totally agree with Robert here. Expect that Nexus will end up sending a struct to Sled Agent, basically saying "Please use (and potentially format) these disks" -- and Sled Agent won't try writing to storage until that happens.

I'm okay with this being an explicit lookup table in Sled Agent until we get to that point. I don't think this PR needs to block on 5172
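
As a rough illustration of that direction, a hypothetical sketch of such a struct (none of these names exist today; fields are invented for illustration):

use omicron_common::disk::DiskIdentity;

/// Hypothetical shape of a "please use (and potentially format) these
/// disks" instruction from Nexus. Not an existing API.
pub struct DiskAdoptionInstruction {
    /// Which disk (vendor/model/serial) the instruction applies to.
    pub identity: DiskIdentity,
    /// Optional resize target for overprovisioning (plays the role of
    /// NvmeDeviceSettings::size in the lookup table above).
    pub resize_to: Option<u32>,
    /// Optional LBA data size in bytes, if a non-default format is wanted.
    pub lba_data_size: Option<u64>,
}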

),
(
"WUS5EA138ESP7E1",
NvmeDeviceSettings { size: 3200, lba_data_size_override: None },
),
(
"WUS5EA138ESP7E3",
NvmeDeviceSettings { size: 3200, lba_data_size_override: None },
),
(
"WUS5EA176ESP7E1",
NvmeDeviceSettings { size: 6400, lba_data_size_override: None },
),
(
"WUS5EA176ESP7E3",
NvmeDeviceSettings { size: 6400, lba_data_size_override: None },
),
])
})
}

#[derive(Debug, thiserror::Error)]
pub enum NvmeFormattingError {
#[error(transparent)]
NvmeInit(#[from] libnvme::NvmeInitError),
#[error(transparent)]
Nvme(#[from] libnvme::NvmeError),
#[error("Device is missing expected LBA format")]
LbaFormatMissing,
#[error("Device has {0} active namespaces but we expected 1")]
UnexpectedNamespaces(usize),
#[error(transparent)]
InfoError(#[from] libnvme::controller_info::NvmeInfoError),
#[error("Could not find NVMe controller for disk with serial {0}")]
NoController(String),
}

// The expected layout of an M.2 device within the Oxide rack.
//
// Partitions beyond this "expected partition" array are ignored.
@@ -79,8 +147,11 @@ pub fn ensure_partition_layout(
log: &Logger,
paths: &DiskPaths,
variant: DiskVariant,
identity: &DiskIdentity,
) -> Result<Vec<Partition>, PooledDiskError> {
internal_ensure_partition_layout::<libefi_illumos::Gpt>(log, paths, variant)
internal_ensure_partition_layout::<libefi_illumos::Gpt>(
log, paths, variant, identity,
)
}

// Same as the [ensure_partition_layout], but with generic parameters
@@ -89,6 +160,7 @@ fn internal_ensure_partition_layout<GPT: gpt::LibEfiGpt>(
log: &Logger,
paths: &DiskPaths,
variant: DiskVariant,
identity: &DiskIdentity,
) -> Result<Vec<Partition>, PooledDiskError> {
// Open the "Whole Disk" as a raw device to be parsed by the
// libefi-illumos library. This lets us peek at the GPT before
@@ -119,6 +191,11 @@
};
match variant {
DiskVariant::U2 => {
// First we need to check that this disk is of the proper
// size and correct logical block address formatting.
ensure_size_and_formatting(log, identity)?;
Contributor:
Mostly for my own education: what does this do with disks that already have data on them?

Contributor Author (papertigers):
We don't touch a disk with a label/zpool, as changing the LBA format erases all data on the disk. This should only be performed when StorageManager finds a new disk that has not been set up yet, i.e. first boot, adding a disk, etc.

rmustacc:

And I expect that when we transform to an explicit adoption model then we'll want this to not occur until that adoption process is initiated.

Contributor:
This only gets run if there is no GPT table; it's hidden by GitHub above the diff.
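
For readers without the hidden context, a simplified sketch of that guard (not the verbatim source; the real logic lives in internal_ensure_partition_layout):

// Sketch only: ensure_size_and_formatting() is reached exclusively on the
// "no GPT found" path, so disks that already carry a label/zpool are
// never resized or reformatted.
match GPT::read(&path) {
    Ok(_gpt) => {
        // An existing GPT was found: parse and validate the partitions
        // as before; the formatting path is never taken.
    }
    Err(_) => {
        // Brand-new disk (first boot, newly added, etc.): safe to adjust
        // size and LBA formatting before the zpool is created.
        ensure_size_and_formatting(log, identity)?;
        // ...then create the zpool on the disk.
    }
}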


// If we were successful we can create a zpool on this disk.
info!(log, "Formatting zpool on disk {}", paths.devfs_path);
// If a zpool does not already exist, create one.
let zpool_name = ZpoolName::new_external(Uuid::new_v4());
@@ -154,13 +231,124 @@ fn internal_ensure_partition_layout<GPT: gpt::LibEfiGpt>(
}
}

fn ensure_size_and_formatting(
Contributor:
Is there any problem with this running again after a failure, on a reboot or restart of sled-agent?

Contributor Author (papertigers):
So while testing I found that if a disk is left in the unattached state then sled-agent will ignore it. That's due to these lines. So, if we were to detach blkdev from the disk and then take a sled-agent reboot, we wouldn't be notified of that disk again, I believe.

Now if we were in the process of formatting a disk and the box panicked or rebooted for some reason, I don't know what would happen. Perhaps @rmustacc could shed some light on that failure scenario.

Regardless, I don't know what the best course of action is for the first failure mode; maybe we need an enhancement in how we detect disks from devinfo?

rmustacc:

> Now if we were in the process of formatting a disk and the box panicked or rebooted for some reason, I don't know what would happen. Perhaps @rmustacc could shed some light on that failure scenario.

There is nothing persistent about a blkdev attach/detach, so the kernel will attempt to reattach if it can. Because we don't support namespace management and aren't doing anything here, this should likely be okay, though I can't comment on what happens if a device fails in the middle of, say, a format nvm command.

> Regardless, I don't know what the best course of action is for the first failure mode; maybe we need an enhancement in how we detect disks from devinfo?

I would suggest we file a follow up bug that moves this to looking for the NVMe node and not the blkdev node. This is for a few reasons:

  • An NVMe device may not have any blkdev instances attached, as mentioned at the start of this comment.
  • An NVMe device may have the wrong number of namespaces created if it came from another system. This would show up as two distinct block devices or possibly no block devices, which would likely be confusing.

Contributor Author (papertigers), Mar 11, 2024:

Filed this as #5241

log: &Logger,
identity: &DiskIdentity,
) -> Result<(), NvmeFormattingError> {
use libnvme::namespace::NamespaceDiscoveryLevel;
use libnvme::Nvme;

let mut controller_found = false;

if let Some(nvme_settings) =
preferred_nvme_device_settings().get(identity.model.as_str())
{
let nvme = Nvme::new()?;
for controller in nvme.controller_discovery()? {
let controller = controller?.write_lock().map_err(|(_, e)| e)?;
let controller_info = controller.get_info()?;
// Make sure we are operating on the correct NVMe device.
if controller_info.serial() != identity.serial {
continue;
};
controller_found = true;
let nsdisc = controller
.namespace_discovery(NamespaceDiscoveryLevel::Active)?;
let namespaces =
nsdisc.into_iter().collect::<Result<Vec<_>, _>>()?;
if namespaces.len() != 1 {
return Err(NvmeFormattingError::UnexpectedNamespaces(
namespaces.len(),
));
}
// Safe because verified there is exactly one namespace.
let namespace = namespaces.into_iter().next().unwrap();

// NB: Only some vendors such as WDC support adjusting the size
// of the disk to deal with overprovisioning. This will need to be
// abstracted away if/when we ever start using another vendor with
// this capability.
let size = controller.wdc_resize_get()?;

// First we need to detach blkdev from the namespace.
namespace.blkdev_detach()?;

// Resize the device if needed to ensure we get the expected
// durability level in terms of drive writes per day.
if size != nvme_settings.size {
controller.wdc_resize_set(nvme_settings.size)?;
info!(
log,
"Resized {} from {size} to {}",
identity.serial,
nvme_settings.size
)
}

// Find the LBA format we want to use for the device.
let wanted_data_size = nvme_settings
.lba_data_size_override
.unwrap_or(DEFAULT_NVME_LBA_DATA_SIZE);
let desired_lba = controller_info
.lba_formats()
.collect::<Result<Vec<_>, _>>()?
.into_iter()
.find(|lba| {
lba.meta_size() == NVME_LBA_META_SIZE
&& lba.data_size() == wanted_data_size
})
.ok_or_else(|| NvmeFormattingError::LbaFormatMissing)?;

// If the controller isn't formatted to our desired LBA we need to
// issue a format request.
let ns_info = namespace.get_info()?;
let current_lba = ns_info.current_format()?;
if current_lba.id() != desired_lba.id() {
controller
.format_request()?
.set_lbaf(desired_lba.id())?
// TODO map this to libnvme::BROADCAST_NAMESPACE once added
.set_nsid(u32::MAX)?
// No secure erase
.set_ses(0)?
.execute()?;

info!(
log,
"Formatted disk with serial {} to an LBA with data size \
{wanted_data_size}",
identity.serial,
);
}

// Attach blkdev to the namespace again
namespace.blkdev_attach()?;
}
} else {
info!(
log,
"There are no preferred NVMe settings for disk model {}; nothing to \
do for disk with serial {}",
identity.model,
identity.serial
);
Collaborator (smklein), commenting on lines +327 to +334:
Thanks for handling this case -- this means that all the file-backed vdevs in testing will still work.

return Ok(());
}

if !controller_found {
return Err(NvmeFormattingError::NoController(identity.serial.clone()));
}

Ok(())
}

#[cfg(test)]
mod test {
use super::*;
use crate::DiskPaths;
use camino::Utf8PathBuf;
use illumos_utils::zpool::MockZpool;
use omicron_test_utils::dev::test_setup_log;
use omicron_test_utils::dev::{mock_disk_identity, test_setup_log};
use std::path::Path;

struct FakePartition {
@@ -196,6 +384,7 @@ mod test {
&log,
&DiskPaths { devfs_path, dev_path: None },
DiskVariant::U2,
&mock_disk_identity(),
);
match result {
Err(PooledDiskError::CannotFormatMissingDevPath { .. }) => {}
@@ -229,6 +418,7 @@
dev_path: Some(Utf8PathBuf::from(DEV_PATH)),
},
DiskVariant::U2,
&mock_disk_identity(),
)
.expect("Should have succeeded partitioning disk");

@@ -253,6 +443,7 @@
dev_path: Some(Utf8PathBuf::from(DEV_PATH))
},
DiskVariant::M2,
&mock_disk_identity(),
)
.is_err());

@@ -290,6 +481,7 @@
dev_path: Some(Utf8PathBuf::from(DEV_PATH)),
},
DiskVariant::U2,
&mock_disk_identity(),
)
.expect("Should be able to parse disk");

@@ -332,6 +524,7 @@
dev_path: Some(Utf8PathBuf::from(DEV_PATH)),
},
DiskVariant::M2,
&mock_disk_identity(),
)
.expect("Should be able to parse disk");

@@ -371,6 +564,7 @@
dev_path: Some(Utf8PathBuf::from(DEV_PATH)),
},
DiskVariant::M2,
&mock_disk_identity(),
)
.expect_err("Should have failed parsing empty GPT"),
PooledDiskError::BadPartitionLayout { .. }
@@ -396,6 +590,7 @@
dev_path: Some(Utf8PathBuf::from(DEV_PATH)),
},
DiskVariant::U2,
&mock_disk_identity(),
)
.expect_err("Should have failed parsing empty GPT"),
PooledDiskError::BadPartitionLayout { .. }