
Commit

move specific background tasks into submodule (#5930)
davepacheco authored Jun 26, 2024
1 parent 6e29409 commit 49f6e01
Showing 26 changed files with 275 additions and 261 deletions.
@@ -2,132 +2,10 @@
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! # Nexus Background Tasks
//!
//! A **background task** in Nexus is any operation that can be activated both
//! periodically and by an explicit signal. This is aimed at RFD 373-style
//! "reliable persistent workflows", also called "reconcilers" or "controllers".
//! These are a kind of automation that examines some _current_ state, compares
//! it to some _intended_ state, and potentially takes action to try to bring
//! the current state in sync with the intended state. Our canonical example is
//! that we want to have Nexus monitor the intended DNS configuration. When it
//! changes, we want to propagate the new configuration to all DNS servers. We
//! implement this with three different background tasks:
//!
//! 1. `DnsConfigWatcher` reads the DNS configuration from the database, stores
//! it in memory, and makes it available via a `tokio::sync::watch` channel.
//! 2. `DnsServersWatcher` reads the list of DNS servers from the database,
//! stores it in memory, and makes it available via a `tokio::sync::watch`
//! channel.
//! 3. `DnsPropagator` uses the watch channels provided by the other two
//! background tasks to notice when either the DNS configuration or the list
//! of DNS servers has changed. It uses the latest values to make a request
//! to each server to update its configuration.
//!
//! When Nexus changes the DNS configuration, it will update the database with
//! the new configuration and then explicitly activate the `DnsConfigWatcher`.
//! When it reads the new config, it will send it to its watch channel, and that
//! will activate the `DnsPropagator`. If any of this fails, or if Nexus
//! crashes at any point, then the periodic activation of every background task
//! will eventually cause the latest config to be propagated to all of the
//! current servers.
//!
//! The background task framework here is pretty minimal: essentially what it
//! gives you is that you just write an idempotent function that you want to
//! happen periodically or on-demand, wrap it in an impl of `BackgroundTask`,
//! register that with the `Driver`, and you're done. The framework will take
//! care of:
//!
//! * providing a way for Nexus at-large to activate your task
//! * activating your task periodically
//! * ensuring that the task is activated only once at a time in this Nexus
//! (but note that it may always be running concurrently in other Nexus
//! instances)
//! * providing basic visibility into whether the task is running, when the task
//! last ran, etc.
//!
//! We may well want to extend the framework as we build more tasks in general
//! and reconcilers specifically. But we should be mindful not to create
//! footguns for ourselves! See "Design notes" below.
//!
//! ## Notes for background task implementors
//!
//! Background tasks are not necessarily just for reconcilers. That's just the
//! design center. The first two DNS background tasks above aren't reconcilers
//! in any non-trivial sense.
//!
//! Background task activations do not accept input, by design. See "Design
//! notes" below.
//!
//! Generally, you probably don't want to have your background task do retries.
//! If things fail, you rely on the periodic reactivation to try again.
//!
//! ## Design notes
//!
//! The underlying design for RFD 373-style reconcilers is inspired by a few
//! related principles:
//!
//! * the principle in distributed systems of having exactly one code path to
//! achieve a thing, and then always using that path to do that thing (as
//! opposed to having separate paths for, say, the happy path vs. failover,
//! and having one of those paths rarely used)
//! * the [constant-work pattern][1], which basically suggests that a system can
//! be more robust and scalable if it's constructed in a way that always does
//! the same amount of work. Imagine if we made requests to the DNS servers
//! to incrementally update their config every time the DNS data changed.
//! This system does more work as users make more requests. During overloads,
//! things can fall over. Compare with a system whose frontend merely updates
//! the DNS configuration that _should_ exist and whose backend periodically
//! scans the complete intended state and then sets its own state accordingly.
//! The backend does the same amount of work no matter how many requests were
//! made, making it more resistant to overload. A big downside of this
//! approach is increased latency from the user making a request to seeing it
//! applied. This can be mitigated (sacrificing some, but not all, of the
//! "constant work" property) by triggering a backend scan operation when user
//! requests complete.
//! * the design pattern in distributed systems of keeping two copies of data in
//! sync using both event notifications (like a changelog) _and_ periodic full
//! scans. The hope is that a full scan never finds a change that wasn't
//! correctly sync'd, but incorporating an occasional full scan into the
//! design ensures that such bugs are found and their impact repaired
//! automatically.
//!
//! [1]: https://aws.amazon.com/builders-library/reliability-and-constant-work/
//!
//! Combining these, we get a design pattern for a "reconciler" where:
//!
//! * The reconciler is activated by explicit request (when we know it has work
//! to do) _and_ periodically (to deal with all manner of transient failures)
//! * The reconciler's activity is idempotent: given the same underlying state
//! (e.g., database state), it always attempts to do the same thing.
//! * Each activation of the reconciler accepts no input. That is, even when we
//! think we know what changed, we do not use that information. This ensures
//! that the reconciler really is idempotent and its actions are based solely
//! on the state that it's watching. Put differently: having reconcilers
//! accept an explicit hint about what changed (and then doing something
//! differently based on that) bifurcates the code: there's the common case
//! where that hint is available and the rarely-exercised case when it's not
//! (e.g., because Nexus crashed and it's the subsequent periodic activation
//! that's propagating this change). This is what we're trying to avoid.
//! * We do allow reconcilers to be triggered by a `tokio::sync::watch` channel
//! -- but again, not using the _data_ from that channel. There are two big
//! advantages here: (1) reduced latency from when a change is made to when
//! the reconciler applies it, and (2) (arguably another way to say the same
//! thing) we can space out the periodic activations much further, knowing
//! that most of the time we're not increasing latency by doing this. This
//! compromises the "constant-work" pattern a bit: we might wind up running
//! the reconciler more often during busy times than during idle times, and we
//! could find that overloads something. However, the _operation_ of the
//! reconciler can still be constant work, and there's no more than that
//! amount of work going on at any given time.
//!
//! `watch` channels are a convenient primitive here because they only store
//! one value. With a little care, we can ensure that the writer never blocks
//! and the readers can all see the latest value. (By design, reconcilers
//! generally only care about the latest state of something, not any
//! intermediate states.) We don't have to worry about an unbounded queue, or
//! handling a full queue, or other forms of backpressure.
//! Manages execution of background tasks
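The activation pattern described in the module documentation above can be sketched roughly as follows. This is an illustrative sketch only, not Nexus's actual task loop: `reconcile_once` is a hypothetical placeholder for the idempotent work, and the loop just shows how a periodic tick and a `tokio::sync::watch` notification can both trigger an activation without the activation consuming any input.

// Illustrative sketch (not the Nexus implementation): a reconciler loop
// activated both periodically and by a `tokio::sync::watch` notification.
// The value carried by the channel is deliberately ignored -- activations
// accept no input.
use std::time::Duration;
use tokio::sync::watch;
use tokio::time::{interval, MissedTickBehavior};

async fn reconciler_loop(mut trigger: watch::Receiver<()>) {
    let mut ticker = interval(Duration::from_secs(60));
    ticker.set_missed_tick_behavior(MissedTickBehavior::Delay);
    loop {
        tokio::select! {
            _ = ticker.tick() => (),
            changed = trigger.changed() => {
                // Stop if the sender side has been dropped.
                if changed.is_err() {
                    break;
                }
            }
        }
        // Idempotent: read the full intended state and converge to it.
        reconcile_once().await;
    }
}

async fn reconcile_once() {
    // hypothetical placeholder for the actual reconciliation work
}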
use super::BackgroundTask;
use super::TaskHandle;
use assert_matches::assert_matches;
use chrono::Utc;
use futures::future::BoxFuture;
@@ -149,16 +27,6 @@ use tokio::sync::watch;
use tokio::sync::Notify;
use tokio::time::MissedTickBehavior;

/// An operation activated both periodically and by an explicit signal
///
/// See module-level documentation for details.
pub trait BackgroundTask: Send + Sync {
fn activate<'a>(
&'a mut self,
opctx: &'a OpContext,
) -> BoxFuture<'a, serde_json::Value>;
}
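For illustration, a minimal, hypothetical implementation of this trait might look like the sketch below. `ExampleCounter` is not part of Nexus; it only shows the shape of an activation: it takes no input beyond the `OpContext`, does its idempotent work, and returns a `serde_json::Value` describing what happened.

// Hypothetical example implementation -- not part of Nexus.
use futures::future::BoxFuture;
use futures::FutureExt;
use nexus_db_queries::context::OpContext;

struct ExampleCounter {
    activations: usize,
}

impl BackgroundTask for ExampleCounter {
    fn activate<'a>(
        &'a mut self,
        _opctx: &'a OpContext,
    ) -> BoxFuture<'a, serde_json::Value> {
        async move {
            // Do the idempotent work here; then report status as JSON.
            self.activations += 1;
            serde_json::json!({ "activations": self.activations })
        }
        .boxed()
    }
}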

/// Drives the execution of background tasks
///
/// Nexus has only one Driver. All background tasks are registered with the
@@ -170,21 +38,6 @@ pub struct Driver {
tasks: BTreeMap<TaskHandle, Task>,
}

/// Identifies a background task
///
/// This is returned by [`Driver::register()`] to identify the corresponding
/// background task. It's then accepted by functions like
/// [`Driver::activate()`] and [`Driver::task_status()`] to identify the task.
#[derive(Clone, Debug, Ord, PartialOrd, PartialEq, Eq)]
pub struct TaskHandle(String);

impl TaskHandle {
/// Returns the unique name of this background task
pub fn name(&self) -> &str {
&self.0
}
}

/// Driver-side state of a background task
struct Task {
/// what this task does (for developers)
@@ -466,14 +319,14 @@ impl<T: Send + Sync> GenericWatcher for watch::Receiver<T> {
mod test {
use super::BackgroundTask;
use super::Driver;
use crate::app::background::common::ActivationReason;
use crate::app::sagas::SagaRequest;
use assert_matches::assert_matches;
use chrono::Utc;
use futures::future::BoxFuture;
use futures::FutureExt;
use nexus_db_queries::context::OpContext;
use nexus_test_utils_macros::nexus_test;
use nexus_types::internal_api::views::ActivationReason;
use std::time::Duration;
use std::time::Instant;
use tokio::sync::mpsc;
103 changes: 52 additions & 51 deletions nexus/src/app/background/init.rs
@@ -2,30 +2,31 @@
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Background task initialization
use super::abandoned_vmm_reaper;
use super::bfd;
use super::blueprint_execution;
use super::blueprint_load;
use super::common;
use super::crdb_node_id_collector;
use super::dns_config;
use super::dns_propagation;
use super::dns_servers;
use super::external_endpoints;
use super::instance_watcher;
use super::inventory_collection;
use super::metrics_producer_gc;
use super::nat_cleanup;
use super::phantom_disks;
use super::physical_disk_adoption;
use super::region_replacement;
use super::region_replacement_driver;
use super::service_firewall_rules;
use super::sync_service_zone_nat::ServiceZoneNatTracker;
use super::sync_switch_configuration::SwitchPortSettingsManager;
use super::v2p_mappings::V2PManager;
//! Specific background task initialization
use super::tasks::abandoned_vmm_reaper;
use super::tasks::bfd;
use super::tasks::blueprint_execution;
use super::tasks::blueprint_load;
use super::tasks::crdb_node_id_collector;
use super::tasks::dns_config;
use super::tasks::dns_propagation;
use super::tasks::dns_servers;
use super::tasks::external_endpoints;
use super::tasks::instance_watcher;
use super::tasks::inventory_collection;
use super::tasks::metrics_producer_gc;
use super::tasks::nat_cleanup;
use super::tasks::phantom_disks;
use super::tasks::physical_disk_adoption;
use super::tasks::region_replacement;
use super::tasks::region_replacement_driver;
use super::tasks::service_firewall_rules;
use super::tasks::sync_service_zone_nat::ServiceZoneNatTracker;
use super::tasks::sync_switch_configuration::SwitchPortSettingsManager;
use super::tasks::v2p_mappings::V2PManager;
use super::Driver;
use super::TaskHandle;
use crate::app::oximeter::PRODUCER_LEASE_DURATION;
use crate::app::sagas::SagaRequest;
use nexus_config::BackgroundTaskConfig;
@@ -47,76 +48,76 @@ use uuid::Uuid;
pub struct BackgroundTasks {
/// interface for working with background tasks (activation, checking
/// status, etc.)
pub driver: common::Driver,
pub driver: Driver,

/// task handle for the internal DNS config background task
pub task_internal_dns_config: common::TaskHandle,
pub task_internal_dns_config: TaskHandle,
/// task handle for the internal DNS servers background task
pub task_internal_dns_servers: common::TaskHandle,
pub task_internal_dns_servers: TaskHandle,
/// task handle for the external DNS config background task
pub task_external_dns_config: common::TaskHandle,
pub task_external_dns_config: TaskHandle,
/// task handle for the external DNS servers background task
pub task_external_dns_servers: common::TaskHandle,
pub task_external_dns_servers: TaskHandle,

/// task handle for pruning metrics producers with expired leases
pub task_metrics_producer_gc: common::TaskHandle,
pub task_metrics_producer_gc: TaskHandle,

/// task handle for the task that keeps track of external endpoints
pub task_external_endpoints: common::TaskHandle,
pub task_external_endpoints: TaskHandle,
/// external endpoints read by the background task
pub external_endpoints: tokio::sync::watch::Receiver<
Option<external_endpoints::ExternalEndpoints>,
>,
/// task handle for the ipv4 nat entry garbage collector
pub nat_cleanup: common::TaskHandle,
pub nat_cleanup: TaskHandle,

/// task handle for the switch bfd manager
pub bfd_manager: common::TaskHandle,
pub bfd_manager: TaskHandle,

/// task handle for the task that collects inventory
pub task_inventory_collection: common::TaskHandle,
pub task_inventory_collection: TaskHandle,

/// task handle for the task that adopts physical disks
pub task_physical_disk_adoption: common::TaskHandle,
pub task_physical_disk_adoption: TaskHandle,

/// task handle for the task that detects phantom disks
pub task_phantom_disks: common::TaskHandle,
pub task_phantom_disks: TaskHandle,

/// task handle for blueprint target loader
pub task_blueprint_loader: common::TaskHandle,
pub task_blueprint_loader: TaskHandle,

/// task handle for blueprint execution background task
pub task_blueprint_executor: common::TaskHandle,
pub task_blueprint_executor: TaskHandle,

/// task handle for collecting CockroachDB node IDs
pub task_crdb_node_id_collector: common::TaskHandle,
pub task_crdb_node_id_collector: TaskHandle,

/// task handle for the service zone nat tracker
pub task_service_zone_nat_tracker: common::TaskHandle,
pub task_service_zone_nat_tracker: TaskHandle,

/// task handle for the switch port settings manager
pub task_switch_port_settings_manager: common::TaskHandle,
pub task_switch_port_settings_manager: TaskHandle,

/// task handle for the opte v2p manager
pub task_v2p_manager: common::TaskHandle,
pub task_v2p_manager: TaskHandle,

/// task handle for the task that detects if regions need replacement and
/// begins the process
pub task_region_replacement: common::TaskHandle,
pub task_region_replacement: TaskHandle,

/// task handle for the task that drives region replacements forward
pub task_region_replacement_driver: common::TaskHandle,
pub task_region_replacement_driver: TaskHandle,

/// task handle for the task that polls sled agents for instance states.
pub task_instance_watcher: common::TaskHandle,
pub task_instance_watcher: TaskHandle,

/// task handle for propagation of VPC firewall rules for Omicron services
/// with external network connectivity,
pub task_service_firewall_propagation: common::TaskHandle,
pub task_service_firewall_propagation: TaskHandle,

/// task handle for deletion of database records for VMMs abandoned by their
/// instances.
pub task_abandoned_vmm_reaper: common::TaskHandle,
pub task_abandoned_vmm_reaper: TaskHandle,
}

impl BackgroundTasks {
@@ -136,7 +137,7 @@ impl BackgroundTasks {
),
producer_registry: &ProducerRegistry,
) -> BackgroundTasks {
let mut driver = common::Driver::new();
let mut driver = Driver::new();

let (task_internal_dns_config, task_internal_dns_servers) = init_dns(
&mut driver,
@@ -494,19 +495,19 @@ impl BackgroundTasks {
}
}

pub fn activate(&self, task: &common::TaskHandle) {
pub fn activate(&self, task: &TaskHandle) {
self.driver.activate(task);
}
}

fn init_dns(
driver: &mut common::Driver,
driver: &mut Driver,
opctx: &OpContext,
datastore: Arc<DataStore>,
dns_group: DnsGroup,
resolver: internal_dns::resolver::Resolver,
config: &DnsTasksConfig,
) -> (common::TaskHandle, common::TaskHandle) {
) -> (TaskHandle, TaskHandle) {
let dns_group_name = dns_group.to_string();
let metadata = BTreeMap::from([("dns_group".to_string(), dns_group_name)]);
