Skip to content

Commit

Permalink
Add some unit tests for sled-agent Instance creation
Browse files Browse the repository at this point in the history
At time of writing, instance creation roughly looks like:

- nexus -> sled-agent: `instance_put_state`
  - sled-agent: `InstanceManager::ensure_state`
    - sled-agent: `Instance::propolis_ensure`
      - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
      - sled-agent: `Instance::setup_propolis_locked` (*blocking!*)
        - `RunningZone::install` and `Zones::boot`
        - `illumos_utils::svc::wait_for_service`
        - `self::wait_for_http_server` for propolis-server itself
      - sled-agent: `Instance::ensure_propolis_and_tasks`
        - sled-agent: spawn `Instance::monitor_state_task`
      - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
    - sled-agent: return ok result
- nexus: `handle_instance_put_result`

Or at least, it does in the happy path. omicron#3927 saw propolis zone
creation take longer than the minute nexus's call to sled-agent's
`instance_put_state`. That might've looked something like:

- nexus -> sled-agent: `instance_put_state`
  - sled-agent: `InstanceManager::ensure_state`
    - sled-agent: `Instance::propolis_ensure`
      - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
      - sled-agent: `Instance::setup_propolis_locked` (*blocking!*)
        - `RunningZone::install` and `Zones::boot`
- nexus: i've been waiting a whole minute for this. connection timeout!
- nexus: `handle_instance_put_result`
    - sled-agent: [...] return... oh, they hung up. :(

To avoid this timeout being implicit at the *Dropshot configuration*
layer (that is to say, we should still have *some* timeout),
we could consider a small refactor to make `instance_put_state` not a
blocking call -- especially since it's already sending nexus updates on
its progress via out-of-band `cpapi_instances_put` calls! That might look
something like:

- nexus -> sled-agent: `instance_put_state`
  - sled-agent: `InstanceManager::ensure_state`
    - sled-agent: spawn {
      - sled-agent: `Instance::propolis_ensure`
        - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
        - sled-agent: `Instance::setup_propolis_locked` (blocking!)
        - sled-agent: `Instance::ensure_propolis_and_tasks`
          - sled-agent: spawn `Instance::monitor_state_task`
        - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
        - sled-agent -> nexus: a cpapi call equivalent to the `handle_instance_put_result` nexus currently invokes after getting the response from the blocking call

(With a way for nexus to cancel an instance creation by ID, and a timeout
in sled-agent itself for terminating the attempt and reporting the failure
back to nexus, and a shorter threshold for logging the event of an instance
creation taking a long time.)

Before such a change, though, we should really have some more tests around
sled-agent's instance creation code at all! So here's a few.
  • Loading branch information
lif committed Nov 30, 2023
1 parent a04f5e3 commit 74d4b87
Show file tree
Hide file tree
Showing 5 changed files with 551 additions and 5 deletions.
27 changes: 25 additions & 2 deletions illumos-utils/src/running_zone.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1174,24 +1174,30 @@ pub struct ZoneBuilder<'a> {
zone_root_path: Option<&'a Utf8Path>,
zone_image_paths: Option<&'a [Utf8PathBuf]>,
zone_type: Option<&'a str>,
unique_name: Option<Uuid>, // actually optional
// actually optional - that is, skipping this field in the builder will
// still result in an `Ok(InstalledZone)` from `.install()`, rather than
// an `Err(InstallZoneError::IncompleteBuilder)`.
unique_name: Option<Uuid>,
datasets: Option<&'a [zone::Dataset]>,
filesystems: Option<&'a [zone::Fs]>,
data_links: Option<&'a [String]>,
devices: Option<&'a [zone::Device]>,
opte_ports: Option<Vec<(Port, PortTicket)>>,
bootstrap_vnic: Option<Link>, // actually optional
// actually optional (as above)
bootstrap_vnic: Option<Link>,
links: Option<Vec<Link>>,
limit_priv: Option<Vec<String>>,
fake_cfg: Option<FakeZoneBuilderConfig>,
}

impl<'a> ZoneBuilder<'a> {
/// Logger to which status messages are written during zone installation.
pub fn with_log(mut self, log: Logger) -> Self {
self.log = Some(log);
self
}

/// Allocates the NIC used for control plane communication.
pub fn with_underlay_vnic_allocator(
mut self,
vnic_allocator: &'a VnicAllocator<Etherstub>,
Expand All @@ -1200,11 +1206,14 @@ impl<'a> ZoneBuilder<'a> {
self
}

/// Filesystem path at which the installed zone will reside.
pub fn with_zone_root_path(mut self, root_path: &'a Utf8Path) -> Self {
self.zone_root_path = Some(root_path);
self
}

/// The directories that will be searched for the image tarball for the
/// provided zone type ([`Self::with_zone_type`]).
pub fn with_zone_image_paths(
mut self,
image_paths: &'a [Utf8PathBuf],
Expand All @@ -1213,56 +1222,67 @@ impl<'a> ZoneBuilder<'a> {
self
}

/// The name of the type of zone being created (e.g. "propolis-server")
pub fn with_zone_type(mut self, zone_type: &'a str) -> Self {
self.zone_type = Some(zone_type);
self
}

/// Unique ID of the instance of the zone being created.
pub fn with_unique_name(mut self, uuid: Uuid) -> Self {
self.unique_name = Some(uuid);
self
}

/// ZFS datasets to be accessed from within the zone.
pub fn with_datasets(mut self, datasets: &'a [zone::Dataset]) -> Self {
self.datasets = Some(datasets);
self
}

/// Filesystems to mount within the zone.
pub fn with_filesystems(mut self, filesystems: &'a [zone::Fs]) -> Self {
self.filesystems = Some(filesystems);
self
}

/// Additional network device names to add to the zone.
pub fn with_data_links(mut self, links: &'a [String]) -> Self {
self.data_links = Some(links);
self
}

/// Device nodes to pass through to the zone.
pub fn with_devices(mut self, devices: &'a [zone::Device]) -> Self {
self.devices = Some(devices);
self
}

/// OPTE devices for the guest network interfaces.
pub fn with_opte_ports(mut self, ports: Vec<(Port, PortTicket)>) -> Self {
self.opte_ports = Some(ports);
self
}

/// NIC to use for creating a bootstrap address on the switch zone.
pub fn with_bootstrap_vnic(mut self, vnic: Link) -> Self {
self.bootstrap_vnic = Some(vnic);
self
}

/// Physical NICs possibly provisioned to the zone.
pub fn with_links(mut self, links: Vec<Link>) -> Self {
self.links = Some(links);
self
}

/// The maximum set of privileges any process in this zone can obtain.
pub fn with_limit_priv(mut self, limit_priv: Vec<String>) -> Self {
self.limit_priv = Some(limit_priv);
self
}

// (used in unit tests)
fn fake_install(self) -> Result<InstalledZone, InstallZoneError> {
let zone = self
.zone_type
Expand Down Expand Up @@ -1300,6 +1320,9 @@ impl<'a> ZoneBuilder<'a> {
.ok_or(InstallZoneError::IncompleteBuilder)
}

/// Create the zone with the provided parameters.
/// Returns `Err(InstallZoneError::IncompleteBuilder)` if a necessary
/// parameter was not provided.
pub async fn install(self) -> Result<InstalledZone, InstallZoneError> {
if self.fake_cfg.is_some() {
return self.fake_install();
Expand Down
41 changes: 38 additions & 3 deletions sled-agent/src/fakes/nexus.rs
Original file line number Diff line number Diff line change
Expand Up @@ -8,13 +8,18 @@
//! to operate correctly.
use dropshot::{
endpoint, ApiDescription, FreeformBody, HttpError, HttpResponseOk, Path,
RequestContext,
endpoint, ApiDescription, FreeformBody, HttpError, HttpResponseOk,
HttpResponseUpdatedNoContent, Path, RequestContext, TypedBody,
};
use hyper::Body;
use internal_dns::ServiceName;
use omicron_common::api::external::Error;
use omicron_common::api::internal::nexus::UpdateArtifactId;
use omicron_common::api::internal::nexus::{
SledInstanceState, UpdateArtifactId,
};
use schemars::JsonSchema;
use serde::Deserialize;
use uuid::Uuid;

/// Implements a fake Nexus.
///
Expand All @@ -28,6 +33,13 @@ pub trait FakeNexusServer: Send + Sync {
) -> Result<Vec<u8>, Error> {
Err(Error::internal_error("Not implemented"))
}
fn cpapi_instances_put(
&self,
_instance_id: Uuid,
_new_runtime_state: SledInstanceState,
) -> Result<(), Error> {
Err(Error::internal_error("Not implemented"))
}
}

/// Describes the server context type.
Expand All @@ -52,9 +64,32 @@ async fn cpapi_artifact_download(
))
}

#[derive(Deserialize, JsonSchema)]
struct InstancePathParam {
instance_id: Uuid,
}

#[endpoint {
method = PUT,
path = "/instances/{instance_id}",
}]
async fn cpapi_instances_put(
request_context: RequestContext<ServerContext>,
path_params: Path<InstancePathParam>,
new_runtime_state: TypedBody<SledInstanceState>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
let context = request_context.context();
context.cpapi_instances_put(
path_params.into_inner().instance_id,
new_runtime_state.into_inner(),
)?;
Ok(HttpResponseUpdatedNoContent())
}

fn api() -> ApiDescription<ServerContext> {
let mut api = ApiDescription::new();
api.register(cpapi_artifact_download).unwrap();
api.register(cpapi_instances_put).unwrap();
api
}

Expand Down
Loading

0 comments on commit 74d4b87

Please sign in to comment.