Do not double count region snapshot records!
`decrease_crucible_resource_count_and_soft_delete_volume` does not
disambiguate cases where the snapshot_addr of one region_snapshot is
shared with another record, which can occur when the Crucible Agent
reclaims ports from destroyed daemons (see also oxidecomputer#4049, which
makes the simulated Crucible agent do this).

Several invocations of the snapshot create and snapshot delete sagas
could race in such a way that one of these ports would be reclaimed and
then used in a different snapshot, leaving two region_snapshot records
with overlapping lifetimes! This would confuse our reference counting,
which was written with the naive assumption that this kind of port reuse
**wouldn't** occur with overlapping lifetimes. Spoiler alert, it can:

    root@[fd00:1122:3344:101::3]:32221/omicron> select * from region_snapshot where snapshot_addr = '[fd00:1122:3344:102::7]:19016';
                   dataset_id              |              region_id               |             snapshot_id              |         snapshot_addr         | volume_references
    ---------------------------------------+--------------------------------------+--------------------------------------+-------------------------------+--------------------
      80790bfd-4b81-4381-9262-20912e3826cc | 0387bbb7-1d54-4683-943c-6c17d6804de9 | 1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0
      80790bfd-4b81-4381-9262-20912e3826cc | ff20e066-8815-4eb6-ac84-fab9b9103462 | bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 0
    (2 rows)
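
The same condition can be spotted directly with a query like the
following (a sketch that uses only the columns shown above):

    -- list every snapshot_addr currently shared by more than one
    -- region_snapshot record
    SELECT snapshot_addr, count(*)
      FROM region_snapshot
      GROUP BY snapshot_addr
      HAVING count(*) > 1;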

One way to solve this would be to create a UNIQUE INDEX on
`snapshot_addr` here, but then these collisions would cause snapshot
creation to return a 500 error to the user.
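
For illustration only, that rejected alternative would have looked
roughly like this (hypothetical DDL, not part of this commit):

    -- force one region_snapshot row per address; a snapshot that reuses
    -- a reclaimed port would then fail its INSERT, which Nexus would
    -- surface as a 500 error
    CREATE UNIQUE INDEX ON region_snapshot (snapshot_addr);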

This commit adds a sixth column: `deleting`, a boolean that is true when
the region snapshot is part of a volume's `resources_to_clean_up`, and
false otherwise. This is used to select (as part of the transaction for
`decrease_crucible_resource_count_and_soft_delete_volume`) only the
region_snapshot records that were decremented as part of that
transaction, and to skip records that were already marked for deletion.
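
The schema migration itself is not rendered in this diff; from the model
and schema changes below it amounts to roughly the following (a sketch,
the exact migration text may differ):

    ALTER TABLE region_snapshot
        ADD COLUMN IF NOT EXISTS deleting BOOL NOT NULL DEFAULT false;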

This works because the overlapping lifetimes of the records in the DB do
**not** imply overlapping lifetimes of the actual read-only downstairs
daemons: for a port to be reclaimed, the original daemon has to be
DELETEd, which happens after the decrement transaction has already
computed which resources to clean up:

1) a region_snapshot record is created:

                snapshot_id              |         snapshot_addr         | volume_references | deleting |
    -------------------------------------+-------------------------------+-------------------+-----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |  false   |

2) its `volume_references` is incremented as part of `volume_create`:

                snapshot_id              |         snapshot_addr         | volume_references | deleting |
    -------------------------------------+-------------------------------+-------------------+-----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 1 |  false   |

3) when the volume is deleted, the decrement transaction (sketched in SQL
   after this walkthrough) will:

  a) decrease `volume_references` by 1

                snapshot_id              |         snapshot_addr         | volume_references | deleting |
    -------------------------------------+-------------------------------+-------------------+-----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |  false   |

  b) note any `region_snapshot` records whose `volume_references` went to
     0 and have `deleting` = false, and return those in the list of
     resources to clean up:

     [ 1a800928-8f93-4cd3-9df1-4129582ffc20 ]

  c) set `deleting` = true for any region_snapshot records whose
     `volume_references` went to 0 and have `deleting` = false

                snapshot_id              |         snapshot_addr         | volume_references | deleting |
    -------------------------------------+-------------------------------+-------------------+-----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true   |

4) That read-only snapshot daemon is DELETEd, freeing up the port.
   Another snapshot creation occurs, using that reclaimed port:

                snapshot_id              |         snapshot_addr         | volume_references | deleting |
    -------------------------------------+-------------------------------+-------------------+-----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true   |
    bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 0 |  false   |

5) That new snapshot is incremented as part of `volume_create`:

                snapshot_id              |         snapshot_addr         | volume_references | deleting |
    -------------------------------------+-------------------------------+-------------------+-----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true   |
    bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 1 |  false   |

6) It is later deleted, and the decrement transaction will:

  a) decrease `volume_references` by 1:

                snapshot_id              |         snapshot_addr         | volume_references | deleting |
    -------------------------------------+-------------------------------+-------------------+-----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true   |
    bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 0 |  false   |

  b) note any `region_snapshot` records whose `volume_references` went to
     0 and have `deleting` = false, and return those in the list of
     resources to clean up:

     [ bdd9614e-f089-4a94-ae46-e10b96b79ba3 ]

  c) set `deleting` = true for any region_snapshot records whose
     `volume_references` went to 0 and have `deleting` = false

                snapshot_id              |         snapshot_addr         | volume_references | deleting |
    -------------------------------------+-------------------------------+-------------------+-----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true   |
    bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 0 |   true   |
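
Put together, the decrement transaction sketched in steps 3 and 6 boils
down to three statements — roughly the SQL generated by the Diesel code
in the volume.rs diff below, with `<read_only_targets>` standing in for
the volume's read-only target addresses:

    -- (a) decrement rows that are still referenced and not already
    --     marked as deleting
    UPDATE region_snapshot
       SET volume_references = volume_references - 1
     WHERE snapshot_addr IN (<read_only_targets>)
       AND volume_references > 0
       AND deleting = false;

    -- (b) collect the rows this transaction just drove to zero; only
    --     these are returned as resources to clean up
    SELECT * FROM region_snapshot
     WHERE snapshot_addr IN (<read_only_targets>)
       AND volume_references = 0
       AND deleting = false;

    -- (c) mark exactly those rows so a later transaction that reuses the
    --     same snapshot_addr cannot return them a second time
    UPDATE region_snapshot
       SET deleting = true
     WHERE snapshot_addr IN (<read_only_targets>)
       AND volume_references = 0
       AND deleting = false;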
jmpesp committed Sep 15, 2023
1 parent aceb744 commit 7de4154
Showing 10 changed files with 405 additions and 6 deletions.
2 changes: 1 addition & 1 deletion Cargo.lock


3 changes: 3 additions & 0 deletions nexus/db-model/src/region_snapshot.rs
@@ -32,4 +32,7 @@ pub struct RegionSnapshot {

// how many volumes reference this?
pub volume_references: i64,

// true if part of a volume's `resources_to_clean_up` already
pub deleting: bool,
}
3 changes: 2 additions & 1 deletion nexus/db-model/src/schema.rs
@@ -856,6 +856,7 @@ table! {
snapshot_id -> Uuid,
snapshot_addr -> Text,
volume_references -> Int8,
deleting -> Bool,
}
}

@@ -1130,7 +1131,7 @@ table! {
///
/// This should be updated whenever the schema is changed. For more details,
/// refer to: schema/crdb/README.adoc
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(4, 0, 0);
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(5, 0, 0);

allow_tables_to_appear_in_same_query!(
system_update,
21 changes: 21 additions & 0 deletions nexus/db-queries/src/db/datastore/region_snapshot.rs
@@ -10,9 +10,11 @@ use crate::db::error::public_error_from_diesel_pool;
use crate::db::error::ErrorHandler;
use crate::db::model::RegionSnapshot;
use async_bb8_diesel::AsyncRunQueryDsl;
use async_bb8_diesel::OptionalExtension;
use diesel::prelude::*;
use omicron_common::api::external::CreateResult;
use omicron_common::api::external::DeleteResult;
use omicron_common::api::external::LookupResult;
use uuid::Uuid;

impl DataStore {
@@ -31,6 +33,25 @@ impl DataStore {
.map_err(|e| public_error_from_diesel_pool(e, ErrorHandler::Server))
}

pub async fn region_snapshot_get(
&self,
dataset_id: Uuid,
region_id: Uuid,
snapshot_id: Uuid,
) -> LookupResult<Option<RegionSnapshot>> {
use db::schema::region_snapshot::dsl;

dsl::region_snapshot
.filter(dsl::dataset_id.eq(dataset_id))
.filter(dsl::region_id.eq(region_id))
.filter(dsl::snapshot_id.eq(snapshot_id))
.select(RegionSnapshot::as_select())
.first_async::<RegionSnapshot>(self.pool())
.await
.optional()
.map_err(|e| public_error_from_diesel_pool(e, ErrorHandler::Server))
}

pub async fn region_snapshot_remove(
&self,
dataset_id: Uuid,
52 changes: 49 additions & 3 deletions nexus/db-queries/src/db/datastore/volume.rs
@@ -120,6 +120,9 @@ impl DataStore {
.filter(
rs_dsl::snapshot_addr.eq(read_only_target.clone()),
)
.filter(
rs_dsl::deleting.eq(false)
)
.set(
rs_dsl::volume_references
.eq(rs_dsl::volume_references + 1),
@@ -637,17 +640,44 @@ impl DataStore {
}
};

// Decrease the number of uses for each referenced region snapshot.
// Decrease the number of uses for each non-deleted referenced
// region snapshot.

use db::schema::region_snapshot::dsl;

diesel::update(dsl::region_snapshot)
.filter(
dsl::snapshot_addr
.eq_any(&crucible_targets.read_only_targets),
)
.filter(dsl::volume_references.gt(0))
.filter(dsl::deleting.eq(false))
.set(dsl::volume_references.eq(dsl::volume_references - 1))
.execute(conn)?;

// Then, note anything that was set to zero from the above
// UPDATE, and then mark all those as deleted.
let snapshots_to_delete: Vec<RegionSnapshot> =
dsl::region_snapshot
.filter(
dsl::snapshot_addr
.eq_any(&crucible_targets.read_only_targets),
)
.filter(dsl::volume_references.eq(0))
.filter(dsl::deleting.eq(false))
.select(RegionSnapshot::as_select())
.load(conn)?;

diesel::update(dsl::region_snapshot)
.filter(
dsl::snapshot_addr
.eq_any(&crucible_targets.read_only_targets),
)
.filter(dsl::volume_references.eq(0))
.filter(dsl::deleting.eq(false))
.set(dsl::deleting.eq(true))
.execute(conn)?;

// Return what results can be cleaned up
let result = CrucibleResources::V1(CrucibleResourcesV1 {
// The only use of a read-write region will be at the top level of a
@@ -673,6 +703,7 @@
.eq(region_dsl::id)
.and(dsl::dataset_id.eq(dataset_dsl::id))),
)
.filter(dsl::deleting.eq(true))
.filter(
dsl::volume_references
.eq(0)
@@ -707,12 +738,27 @@
// delete a read-only downstairs running for a
// snapshot that doesn't exist will return a 404,
// causing the saga to error and unwind.
//
// XXX are any other filters required, except the
// three from snapshots_to_delete?
.filter(
dsl::snapshot_addr.eq_any(
&crucible_targets.read_only_targets,
),
)
.filter(dsl::volume_references.eq(0))
// XXX is there a better way to do this, rather than
// one filter per struct field?
.filter(dsl::dataset_id.eq_any(snapshots_to_delete.iter().map(|x| x.dataset_id)))
.filter(dsl::region_id.eq_any(snapshots_to_delete.iter().map(|x| x.region_id)))
.filter(dsl::snapshot_id.eq_any(snapshots_to_delete.iter().map(|x| x.snapshot_id)))
.filter(dsl::deleting.eq(true))
.filter(
dsl::volume_references
.eq(0)
// Despite the SQL specifying that this column is NOT NULL,
// this null check is required for this function to work!
.or(dsl::volume_references.is_null()),
)
.inner_join(
dataset_dsl::dataset
.on(dsl::dataset_id.eq(dataset_dsl::id)),
@@ -960,7 +1006,7 @@ impl DataStore {

#[derive(Default, Debug, Serialize, Deserialize)]
pub struct CrucibleTargets {
read_only_targets: Vec<String>,
pub read_only_targets: Vec<String>,
}

// Serialize this enum into the `resources_to_clean_up` column to handle
1 change: 1 addition & 0 deletions nexus/src/app/sagas/snapshot_create.rs
@@ -1275,6 +1275,7 @@ async fn ssc_start_running_snapshot(
snapshot_id,
snapshot_addr,
volume_references: 0, // to be filled later
deleting: false,
})
.await
.map_err(ActionError::action_failed)?;
1 change: 1 addition & 0 deletions nexus/tests/integration_tests/snapshots.rs
@@ -1094,6 +1094,7 @@ async fn test_region_snapshot_create_idempotent(
snapshot_addr: "[::]:12345".to_string(),

volume_references: 1,
deleting: false,
};

datastore.region_snapshot_create(region_snapshot.clone()).await.unwrap();