Extract storage manager related code from sled-agent (#4332)
This is a large PR. The changes are in many ways *just* moving code around, and at a cursory glance may appear to be a giant waste of time. To justify these changes, I will first present the goals I believe we should strive to meet in general sled-agent development. I'll then go into how I believe these goals can be realized via code patterns, and then discuss the concrete changes in this PR and how they implement those patterns. Lastly, I'll list a few open questions about the implementation in the PR.

I want to emphasize that a lot of what I talk about below is already done in sled-agent, at least part way. We can pick the best patterns already in use and standardize on them to build a more coherent and yet decoupled architecture for easier testing and understanding.

# Goals

This PR is an attempt to start making sled-agent easier to maintain. To make things easier to maintain, I believe we should attempt to realize a few key goals:

1. Initial startup and control code should read linearly. You shouldn't have to jump around to follow the gist of what's going on.
2. Tasks, like functions, should live at a certain abstraction level. It must be clear what state a task is managing and how it gets mutated.
3. A developer must be able to come into the codebase, know where a given piece of functionality should live, and clearly follow existing patterns, such that their new code doesn't break or cross abstractions that would make following the code harder in the future.
4. Tests for individual abstractions and task-driven code should be easy to write. You shouldn't need to spin up the whole sled-agent in order to test an algorithm or behavior.
5. Hardware should be abstracted out, and much of the code shouldn't deal with hardware directly.

# Patterns

## Task/Handle

In my opinion, the most important pattern to follow for async Rust software is what I call the "task/handle" pattern. This pattern helps code maintenance by explicitly managing state and making task-driven code easier to test. All tasks that provide services to other tasks should follow this pattern. In this pattern, all state is contained in a type owned by the task and inaccessible directly to other tasks. The task has a corresponding handle which must be `Send` and `Clone`, and can be used to make requests of the task. The handle interacts with the task solely via message passing. Mutexes are never used.
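
For concreteness, here is a minimal sketch of the pattern. The types and messages are illustrative, not the actual `StorageManager` API:

```rust
use tokio::sync::{mpsc, oneshot};

// Requests sent from the handle to the task. Fire-and-forget operations
// carry no reply channel; queries carry a oneshot sender for the response.
enum Request {
    AddDisk(String),
    DiskCount(oneshot::Sender<usize>),
}

// All state is owned by the task and inaccessible to other tasks.
struct Manager {
    rx: mpsc::Receiver<Request>,
    disks: Vec<String>,
}

impl Manager {
    // State manipulation is synchronous, so it can be tested by constructing
    // `Manager` directly and calling this method, without spawning a task.
    fn handle_request(&mut self, req: Request) {
        match req {
            Request::AddDisk(disk) => self.disks.push(disk),
            Request::DiskCount(reply) => {
                let _ = reply.send(self.disks.len());
            }
        }
    }

    async fn run(mut self) {
        while let Some(req) = self.rx.recv().await {
            self.handle_request(req);
        }
    }
}

// The handle is `Send + Clone` and interacts with the task solely via
// message passing over the FIFO channel.
#[derive(Clone)]
struct ManagerHandle {
    tx: mpsc::Sender<Request>,
}

impl ManagerHandle {
    async fn add_disk(&self, disk: String) {
        self.tx.send(Request::AddDisk(disk)).await.unwrap();
    }

    async fn disk_count(&self) -> usize {
        let (reply, rx) = oneshot::channel();
        self.tx.send(Request::DiskCount(reply)).await.unwrap();
        rx.await.unwrap()
    }
}
```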

A few things fall directly out of this pattern:

* Code that manipulates state inside and outside the task can be synchronous.
* We don't have to worry about mutex behavior with regards to cancellation, await points, etc.
* Testing becomes easier:
  * We can test a bunch of the code without spawning a task in many cases. We just [directly construct the state object and call functions](https://github.com/oxidecomputer/omicron/pull/4332/files#diff-08315639e20a9323f2740455a7813d83ca43b05dca197ebbda66c13f750d446bR634-R656), whether sync or async.
  * When testing sequential operations that are fire-and-forget, we [know when they have been processed by the task loop](https://github.com/oxidecomputer/omicron/pull/4332/files#diff-08315639e20a9323f2740455a7813d83ca43b05dca197ebbda66c13f750d446bR702-R709), because our next request will only run after those operations. This is a direct result of the FIFO channel between handle and task. This is not possible with state shared outside the task via a mutex. We must poll that mutex-protected state until it either changes to the value we expect or we get a timeout. If we expect no change, we must use a side-channel.
  * We can write complex state handling algorithms, such as those in maghemite or bootstore, in a deterministic, synchronous style that allows property-based testing and model checking in as straightforward a manner as possible.
* We can completely [fake the task](https://github.com/oxidecomputer/omicron/pull/4332/files#diff-08315639e20a9323f2740455a7813d83ca43b05dca197ebbda66c13f750d446bR163-R223) without changing any of the calling code except the constructor. The real handle can instead send messages to a fake task. Importantly, this strategy can also be used to abstract out hardware for unit tests and simulation.

Beyond all that though, the understanding of how state is mutated and accessed is clear. State is only mutated inside the task, and can only be retrieved from code that has a copy of the handle. There is no need for separate `_locked` methods, and no need to worry about order of locking for different bits of task state. This can make the task code itself much shorter and easier to read, as demonstrated by the new `StorageManager` code in `sled_storage`. We can also maintain internal stats for the task, and all of this can be retrieved from the handle for reporting in `omdb`.

There is one common question/problem with this pattern: how do you choose a bounded channel size for handle-to-task messages?

This can be tricky, but in general, you want the channel to *act* unbounded, such that sends never fail. This makes the channels reliable, such that we never drop messages inside the process, and the caller doesn't have to choose what to do when overloaded. This simplifies things drastically for developers. However, you also don't want to make the channel actually unbounded, because that can lead to runaway memory growth and pathological behaviors, such that requests get slower over time until the system crashes.

After discussions with @jgallagher and reflections from @sunshowers and @davepacheco on RFD 400, the solution is to choose a large enough bound such that we should never hit it in practice unless we are truly overloaded. Hitting the bound means that requests are building up faster than they can be processed and we will eventually topple over. So when we hit this bound, we just [go ahead and panic](https://github.com/oxidecomputer/omicron/pull/4332/files#diff-08315639e20a9323f2740455a7813d83ca43b05dca197ebbda66c13f750d446bR75-R77).

Picking a channel bound is hard to do empirically, but practically, if requests are mostly mutating task-local state, a bound of 1024 or even 8192 should be plenty. Tasks that must perform longer-running operations can spawn helper tasks as necessary, or include their own handles for replies rather than synchronously waiting. Memory for the queue can be kept small by boxing large messages.
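
As a sketch, assuming a tokio `mpsc` channel (the bound, message type, and helper here are illustrative), the send-or-panic policy looks something like this:

```rust
use tokio::sync::mpsc;

// Illustrative bound: large enough that it is only hit when the system is
// truly overloaded.
const QUEUE_SIZE: usize = 1024;

enum Request {
    // Box large payloads so each queued message stays small.
    Upsert(Box<[u8; 4096]>),
}

fn channel() -> (mpsc::Sender<Request>, mpsc::Receiver<Request>) {
    mpsc::channel(QUEUE_SIZE)
}

fn send_or_panic(tx: &mpsc::Sender<Request>, req: Request) {
    // `try_send` fails when the channel is full (we are toppling over
    // anyway) or the task has exited (a bug); panic loudly in both cases.
    if let Err(e) = tx.try_send(req) {
        panic!("failed to send request: {e}");
    }
}
```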

It should also be noted that mutexes also have queues that can build up and cause pathological slowness. The difference is that these queues are implicit and hard to track.

## Isolate subsystems via message-driven decoupling

We have a bunch of managers (`HardwareManager`, `StorageManager`, `ServiceManager`, `InstanceManager`) that interact with each other to provide sled-agent services. However, much of that interaction is ad hoc, with state shared between tasks via an `Inner` struct protected by a mutex. It's hard to isolate each of these managers and test them independently. By ensuring their API is well defined via a `Handle`, we can fake out different managers as needed for independent testing. However, we can go even farther, without the need for dependency injection, by having different managers announce their updates via [broadcast or watch channels](https://github.com/oxidecomputer/omicron/pull/4332/files#diff-08315639e20a9323f2740455a7813d83ca43b05dca197ebbda66c13f750d446bR129-R133).

With this strategy, we can drive tests for a task by using the handle to both trigger operations and then wait for the outflow of messages that should occur, rather than mocking out the handle API of another task directly and checking the interaction via the fake. This is especially useful for lower-level services that others depend upon, such as `StorageManager`, and we should do this when we can, rather than having tasks talk directly to each other. This strategy leads directly to the last pattern I really want to mention: the `monitor` pattern.
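
Here is a rough sketch of what announcing updates over a `watch` channel and testing against it could look like. `StorageResources` is a stand-in for whatever snapshot type a manager publishes, not the actual type in the PR:

```rust
use tokio::sync::watch;

// Stand-in for the snapshot of state a manager publishes.
#[derive(Clone, Debug, Default)]
struct StorageResources {
    disks: Vec<String>,
}

// Inside the manager task: publish a new snapshot whenever state changes.
fn announce(tx: &watch::Sender<StorageResources>, updated: StorageResources) {
    // `send_replace` never blocks; receivers only ever see the latest value.
    tx.send_replace(updated);
}

// A test (or another subsystem) subscribes and waits for the change it
// expects, with no mocking of the manager's handle API.
async fn wait_for_disk(
    rx: &mut watch::Receiver<StorageResources>,
    name: &str,
) {
    while !rx.borrow_and_update().disks.iter().any(|d| d == name) {
        rx.changed().await.expect("manager task exited");
    }
}
```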

## High-level interaction via monitors

Remember that a primary goal of these patterns is to make the sled-agent easier to read, and to make it easier to discover what is happening at any given point in time. This has to happen amid the inherent concurrency of all the separate behaviors occurring in the sled-agent, such as monitoring for hardware and faults, or handling user requests from Nexus. If our low-level managers announce updates via channels, we can have high-level `Monitors` that wait for those updates and then dispatch them to other subsystems or across clients to other sleds or services. This helps further decouple services for easier testing, and also allows us to understand all the asynchrony in the system in fewer spots. We can put RAS code in these monitors and use them as central points to maintain stats and track the overall load on the system.
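
A monitor, then, is little more than a loop over those update channels. A hypothetical sketch, reusing the `StorageResources` stand-in from above, with the Nexus notification stubbed out:

```rust
use tokio::sync::watch;

// A high-level monitor: waits on low-level manager updates and dispatches
// them to other subsystems. All the asynchrony lives in this one loop.
struct StorageMonitor {
    rx: watch::Receiver<StorageResources>,
    // Stats maintained centrally, retrievable for omdb-style reporting.
    updates_seen: u64,
}

impl StorageMonitor {
    async fn run(mut self) {
        while self.rx.changed().await.is_ok() {
            let snapshot = self.rx.borrow_and_update().clone();
            self.updates_seen += 1;
            // Dispatch to other subsystems (e.g. notify Nexus, poke the
            // zone bundler). Stubbed out here.
            notify_nexus(&snapshot).await;
        }
    }
}

async fn notify_nexus(_snapshot: &StorageResources) {
    // Illustrative stub.
}
```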

# The punchline

How do the patterns above satisfy the goals? By decoupling tasks and interacting solely via message passing, we make testing easier (goal 4). By separating out managers/services into separate tasks and maintaining state locally, we satisfy goals 2 and 5. By making the tasks clear in their responsibilities, and their APIs evident via their handles, we help satisfy goal 3. Goal 3 is also satisfied to a degree by the elimination of shared state and the decoupling via monitors. Lastly, by combining all the patterns, we can spawn all our top-level tasks and monitors in one place after initializing any global state. This allows the code to flow linearly (goal 1).

Importantly, it's also easy to violate goal 3 by dropping some mutexes into the middle of a task and oversharing handles between subsystems. I believe we can prevent this, and also make testing and APIs clearer, by separating subsystems into their own Rust packages and then adding monitors for them in the sled-agent. I took a first cut at this with the `StorageManager`, as that was the subsystem I was most familiar with.

# Concrete Code changes

Keeping in line with my stated goals and patterns, I abstracted the storage manager code into its own package called `sled-storage`. The `StorageManager` uses the task/handle pattern, and there is a monitor for it in sled-agent that takes notifications and updates Nexus and the zone bundler. There are also a bunch of unit tests where none existed before. The bulk of the rest of the code changes fell out of this. In particular, now that we have a separate `sled-storage` package, all high-level disk management code lives there. Construction and formatting of partitions still happens in `sled-hardware`, but anything above the zpool level occurs in `sled-storage`. This allows testing of real storage code on any illumos-based system using file-backed zpools. Importantly, the key-management code now lives at this abstraction level, rather than in `sled-hardware`, which makes more sense since encryption only operates on datasets in ZFS.
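
To illustrate, a test might build a pool on a file-backed vdev along these lines. This is a sketch assuming the `Zpool`/`ZpoolName` wrappers from `illumos-utils` (see the zpool.rs diff below) and an illumos host; the helper name is made up:

```rust
use camino::Utf8PathBuf;
use illumos_utils::zpool::{Zpool, ZpoolName};
use uuid::Uuid;

// Create a zpool backed by a plain file, so storage code can be exercised
// without real U.2/M.2 devices. Illustrative only.
fn create_file_backed_zpool(dir: &Utf8PathBuf) -> ZpoolName {
    let vdev = dir.join("test-vdev");
    // A sparse 64 MiB file is enough for a test pool.
    let file = std::fs::File::create(&vdev).unwrap();
    file.set_len(64 * 1024 * 1024).unwrap();
    let name = ZpoolName::new_external(Uuid::new_v4());
    Zpool::create(&name, &vdev).unwrap();
    name
}
```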

I'd like to do similar things to the other managers, and I think that's a good way to continue cleaning up sled-agent.

The other large change in this codebase is simplifying the startup of the bootstrap agent such that all tasks that run for the lifetime of the sled-agent process are spawned in `long_running_tasks`. There is a struct called `LongRunningTaskHandles` that allows interaction with all the "managers" spawned here. This change also has the side effect of removing the `wait_while_handling_hardware_updates` concurrent future code, since we now have a hardware monitor running from the beginning of time and can never miss updates. It also eliminates the need to start a separate hardware manager when the sled-agent starts up. Because we now know which tasks are long-running, we don't have to save their tokio `JoinHandle`s to signal ownership, and the handles struct can simply be cloned.
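
The shape of this is roughly as follows; the field set and stand-in handle types here are hypothetical, not the exact contents in the PR:

```rust
// Illustrative stand-ins for the real handle types.
#[derive(Clone)]
pub struct StorageHandle;
#[derive(Clone)]
pub struct HardwareMonitorHandle;

// One clonable struct holding a handle to every task that lives for the
// lifetime of the process. No JoinHandles are stored, so cloning is cheap.
#[derive(Clone)]
pub struct LongRunningTaskHandles {
    pub storage_manager: StorageHandle,
    pub hardware_monitor: HardwareMonitorHandle,
}

// Spawn everything in one place, after global state is initialized, so the
// startup code reads linearly.
pub fn spawn_all_longrunning_tasks() -> LongRunningTaskHandles {
    LongRunningTaskHandles {
        storage_manager: StorageHandle,
        hardware_monitor: HardwareMonitorHandle,
    }
}
```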

# Open Questions

* I'm not really a fan of the name `long_running_task_handles`. Is there a better name for these?
* Instead of calling `spawn_all_longrunning_tasks`, should we just call the individual spawn functions directly in `bootstrap/pre_server`? Doing it this way would allow the `ServiceManager` to also live in the long-running handles and remove some ordering issues. For instance, we wait for the boot disk inside `spawn_all_longrunning_tasks`, and we also upsert synthetic disks. Neither of these is really part of spawning tasks.
andrewjstone authored Nov 14, 2023
1 parent a1d0738 commit 2fc0dfd
Showing 58 changed files with 3,689 additions and 2,775 deletions.
33 changes: 31 additions & 2 deletions Cargo.lock

(Generated file; diff not rendered.)

3 changes: 3 additions & 0 deletions Cargo.toml

```diff
@@ -57,6 +57,7 @@ members = [
     "rpaths",
     "sled-agent",
     "sled-hardware",
+    "sled-storage",
     "sp-sim",
     "test-utils",
     "tufaceous-lib",
@@ -122,6 +123,7 @@ default-members = [
     "rpaths",
     "sled-agent",
     "sled-hardware",
+    "sled-storage",
     "sp-sim",
     "test-utils",
     "tufaceous-lib",
@@ -329,6 +331,7 @@ similar-asserts = "1.5.0"
 sled = "0.34"
 sled-agent-client = { path = "clients/sled-agent-client" }
 sled-hardware = { path = "sled-hardware" }
+sled-storage = { path = "sled-storage" }
 slog = { version = "2.7", features = [ "dynamic-keys", "max_level_trace", "release_max_level_debug" ] }
 slog-async = "2.8"
 slog-dtrace = "0.2"
```
2 changes: 2 additions & 0 deletions clients/nexus-client/Cargo.toml

```diff
@@ -10,6 +10,8 @@ futures.workspace = true
 ipnetwork.workspace = true
 omicron-common.workspace = true
 omicron-passwords.workspace = true
+sled-hardware.workspace = true
+sled-storage.workspace = true
 progenitor.workspace = true
 regress.workspace = true
 reqwest = { workspace = true, features = ["rustls-tls", "stream"] }
```
33 changes: 33 additions & 0 deletions clients/nexus-client/src/lib.rs

```diff
@@ -388,3 +388,36 @@ impl From<omicron_common::api::internal::shared::ExternalPortDiscovery>
         }
     }
 }
+
+impl From<sled_hardware::DiskVariant> for types::PhysicalDiskKind {
+    fn from(value: sled_hardware::DiskVariant) -> Self {
+        match value {
+            sled_hardware::DiskVariant::U2 => types::PhysicalDiskKind::U2,
+            sled_hardware::DiskVariant::M2 => types::PhysicalDiskKind::M2,
+        }
+    }
+}
+
+impl From<sled_hardware::Baseboard> for types::Baseboard {
+    fn from(b: sled_hardware::Baseboard) -> types::Baseboard {
+        types::Baseboard {
+            serial_number: b.identifier().to_string(),
+            part_number: b.model().to_string(),
+            revision: b.revision(),
+        }
+    }
+}
+
+impl From<sled_storage::dataset::DatasetKind> for types::DatasetKind {
+    fn from(k: sled_storage::dataset::DatasetKind) -> Self {
+        use sled_storage::dataset::DatasetKind::*;
+        match k {
+            CockroachDb => Self::Cockroach,
+            Crucible => Self::Crucible,
+            Clickhouse => Self::Clickhouse,
+            ClickhouseKeeper => Self::ClickhouseKeeper,
+            ExternalDns => Self::ExternalDns,
+            InternalDns => Self::InternalDns,
+        }
+    }
+}
```
1 change: 1 addition & 0 deletions clients/sled-agent-client/Cargo.toml

```diff
@@ -14,5 +14,6 @@ regress.workspace = true
 reqwest = { workspace = true, features = [ "json", "rustls-tls", "stream" ] }
 serde.workspace = true
 slog.workspace = true
+sled-storage.workspace = true
 uuid.workspace = true
 omicron-workspace-hack.workspace = true
```
25 changes: 25 additions & 0 deletions clients/sled-agent-client/src/lib.rs

```diff
@@ -6,6 +6,7 @@
 use async_trait::async_trait;
 use std::convert::TryFrom;
 use std::str::FromStr;
+use uuid::Uuid;
 
 progenitor::generate_api!(
@@ -528,3 +529,27 @@ impl TestInterfaces for Client {
             .expect("disk_finish_transition() failed unexpectedly");
     }
 }
+
+impl From<sled_storage::dataset::DatasetKind> for types::DatasetKind {
+    fn from(k: sled_storage::dataset::DatasetKind) -> Self {
+        use sled_storage::dataset::DatasetKind::*;
+        match k {
+            CockroachDb => Self::CockroachDb,
+            Crucible => Self::Crucible,
+            Clickhouse => Self::Clickhouse,
+            ClickhouseKeeper => Self::ClickhouseKeeper,
+            ExternalDns => Self::ExternalDns,
+            InternalDns => Self::InternalDns,
+        }
+    }
+}
+
+impl From<sled_storage::dataset::DatasetName> for types::DatasetName {
+    fn from(n: sled_storage::dataset::DatasetName) -> Self {
+        Self {
+            pool_name: types::ZpoolName::from_str(&n.pool().to_string())
+                .unwrap(),
+            kind: n.dataset().clone().into(),
+        }
+    }
+}
```
2 changes: 1 addition & 1 deletion common/src/disk.rs

```diff
@@ -5,7 +5,7 @@
 //! Disk related types shared among crates
 
 /// Uniquely identifies a disk.
-#[derive(Debug, Clone, PartialEq, Eq, Hash)]
+#[derive(Debug, Clone, PartialEq, Eq, Hash, Ord, PartialOrd)]
 pub struct DiskIdentity {
     pub vendor: String,
     pub serial: String,
```
3 changes: 3 additions & 0 deletions illumos-utils/Cargo.toml

```diff
@@ -44,3 +44,6 @@ toml.workspace = true
 [features]
 # Enable to generate MockZones
 testing = ["mockall"]
+# Useful for tests that want real functionality and ability to run without
+# pfexec
+tmp_keypath = []
```
33 changes: 32 additions & 1 deletion illumos-utils/src/lib.rs

```diff
@@ -4,6 +4,9 @@
 
 //! Wrappers around illumos-specific commands.
 
+#[allow(unused)]
+use std::sync::atomic::{AtomicBool, Ordering};
+
 use cfg_if::cfg_if;
 
 pub mod addrobj;
@@ -93,7 +96,7 @@ mod inner {
 
     // Helper function for starting the process and checking the
     // exit code result.
-    pub fn execute(
+    pub fn execute_helper(
         command: &mut std::process::Command,
     ) -> Result<std::process::Output, ExecutionError> {
         let output = command.output().map_err(|err| {
@@ -108,6 +111,34 @@
     }
 }
 
+// Due to feature unification, the `testing` feature is enabled when some tests
+// don't actually want to use it. We allow them to opt out of the use of the
+// free function here. We also explicitly opt-in where mocks are used.
+//
+// Note that this only works if the tests that use mocks and those that don't
+// are run sequentially. However, this is how we do things in CI with nextest,
+// so there is no problem currently.
+//
+// We can remove all this when we get rid of the mocks.
+#[cfg(any(test, feature = "testing"))]
+pub static USE_MOCKS: AtomicBool = AtomicBool::new(false);
+
+pub fn execute(
+    command: &mut std::process::Command,
+) -> Result<std::process::Output, ExecutionError> {
+    cfg_if! {
+        if #[cfg(any(test, feature = "testing"))] {
+            if USE_MOCKS.load(Ordering::SeqCst) {
+                mock_inner::execute_helper(command)
+            } else {
+                inner::execute_helper(command)
+            }
+        } else {
+            inner::execute_helper(command)
+        }
+    }
+}
+
 cfg_if! {
     if #[cfg(any(test, feature = "testing"))] {
         pub use mock_inner::*;
```
33 changes: 25 additions & 8 deletions illumos-utils/src/zfs.rs

```diff
@@ -20,7 +20,16 @@ pub const ZONE_ZFS_RAMDISK_DATASET_MOUNTPOINT: &str = "/zone";
 pub const ZONE_ZFS_RAMDISK_DATASET: &str = "rpool/zone";
 
 pub const ZFS: &str = "/usr/sbin/zfs";
+
+/// This path is intentionally on a `tmpfs` to prevent copy-on-write behavior
+/// and to ensure it goes away on power off.
+///
+/// We want to minimize the time the key files are in memory, and so we
+/// rederive the keys and recreate the files on demand when creating and
+/// mounting encrypted filesystems. We then zero them and unlink them.
 pub const KEYPATH_ROOT: &str = "/var/run/oxide/";
+// Use /tmp so we don't have to worry about running tests with pfexec
+pub const TEST_KEYPATH_ROOT: &str = "/tmp";
 
 /// Error returned by [`Zfs::list_datasets`].
 #[derive(thiserror::Error, Debug)]
@@ -158,19 +167,27 @@ impl fmt::Display for Keypath {
     }
 }
 
+#[cfg(not(feature = "tmp_keypath"))]
 impl From<&DiskIdentity> for Keypath {
     fn from(id: &DiskIdentity) -> Self {
-        let filename = format!(
-            "{}-{}-{}-zfs-aes-256-gcm.key",
-            id.vendor, id.serial, id.model
-        );
-        let mut path = Utf8PathBuf::new();
-        path.push(KEYPATH_ROOT);
-        path.push(filename);
-        Keypath(path)
+        build_keypath(id, KEYPATH_ROOT)
+    }
+}
+
+#[cfg(feature = "tmp_keypath")]
+impl From<&DiskIdentity> for Keypath {
+    fn from(id: &DiskIdentity) -> Self {
+        build_keypath(id, TEST_KEYPATH_ROOT)
     }
 }
 
+fn build_keypath(id: &DiskIdentity, root: &str) -> Keypath {
+    let filename =
+        format!("{}-{}-{}-zfs-aes-256-gcm.key", id.vendor, id.serial, id.model);
+    let path: Utf8PathBuf = [root, &filename].iter().collect();
+    Keypath(path)
+}
+
 #[derive(Debug)]
 pub struct EncryptionDetails {
     pub keypath: Keypath,
```
37 changes: 34 additions & 3 deletions illumos-utils/src/zpool.rs

```diff
@@ -39,6 +39,13 @@ pub struct CreateError {
     err: Error,
 }
 
+#[derive(thiserror::Error, Debug)]
+#[error("Failed to destroy zpool: {err}")]
+pub struct DestroyError {
+    #[from]
+    err: Error,
+}
+
 #[derive(thiserror::Error, Debug)]
 #[error("Failed to list zpools: {err}")]
 pub struct ListError {
@@ -89,7 +96,7 @@ impl FromStr for ZpoolHealth {
 }
 
 /// Describes a Zpool.
-#[derive(Clone, Debug)]
+#[derive(Clone, Debug, PartialEq, Eq)]
 pub struct ZpoolInfo {
     name: String,
     size: u64,
@@ -121,6 +128,17 @@ impl ZpoolInfo {
     pub fn health(&self) -> ZpoolHealth {
         self.health
     }
+
+    #[cfg(any(test, feature = "testing"))]
+    pub fn new_hardcoded(name: String) -> ZpoolInfo {
+        ZpoolInfo {
+            name,
+            size: 1024 * 1024 * 64,
+            allocated: 1024,
+            free: 1024 * 1023 * 64,
+            health: ZpoolHealth::Online,
+        }
+    }
 }
 
 impl FromStr for ZpoolInfo {
@@ -167,7 +185,10 @@ pub struct Zpool {}
 
 #[cfg_attr(any(test, feature = "testing"), mockall::automock, allow(dead_code))]
 impl Zpool {
-    pub fn create(name: ZpoolName, vdev: &Utf8Path) -> Result<(), CreateError> {
+    pub fn create(
+        name: &ZpoolName,
+        vdev: &Utf8Path,
+    ) -> Result<(), CreateError> {
         let mut cmd = std::process::Command::new(PFEXEC);
         cmd.env_clear();
         cmd.env("LC_ALL", "C.UTF-8");
@@ -189,7 +210,17 @@
         Ok(())
     }
 
-    pub fn import(name: ZpoolName) -> Result<(), Error> {
+    pub fn destroy(name: &ZpoolName) -> Result<(), DestroyError> {
+        let mut cmd = std::process::Command::new(PFEXEC);
+        cmd.env_clear();
+        cmd.env("LC_ALL", "C.UTF-8");
+        cmd.arg(ZPOOL).arg("destroy");
+        cmd.arg(&name.to_string());
+        execute(&mut cmd).map_err(Error::from)?;
+        Ok(())
+    }
+
+    pub fn import(name: &ZpoolName) -> Result<(), Error> {
         let mut cmd = std::process::Command::new(PFEXEC);
         cmd.env_clear();
         cmd.env("LC_ALL", "C.UTF-8");
```
1 change: 1 addition & 0 deletions installinator/Cargo.toml

```diff
@@ -27,6 +27,7 @@ omicron-common.workspace = true
 reqwest.workspace = true
 sha2.workspace = true
 sled-hardware.workspace = true
+sled-storage.workspace = true
 slog.workspace = true
 slog-async.workspace = true
 slog-envlogger.workspace = true
```
(Remaining file diffs not loaded.)