Do not double count region snapshot records!

`decrease_crucible_resource_count_and_soft_delete_volume` does not
disambiguate cases where the snapshot_addr of one region_snapshot record
is shared with another, which can occur when the Crucible Agent reclaims
ports from destroyed daemons (see also oxidecomputer#4049, which makes
the simulated Crucible agent do this).

Several invocations of the snapshot create and snapshot delete sagas
could race in such a way that one of these ports would be reclaimed and
then used for a different snapshot, with the lifetimes of the two
region_snapshot records overlapping! This would confuse our reference
counting, which was written with the naive assumption that this port
reuse **wouldn't** occur with overlapping record lifetimes. Spoiler
alert, it can:

    root@[fd00:1122:3344:101::3]:32221/omicron> select * from region_snapshot where snapshot_addr = '[fd00:1122:3344:102::7]:19016';
                   dataset_id              |              region_id               |             snapshot_id              |         snapshot_addr         | volume_references
    ---------------------------------------+--------------------------------------+--------------------------------------+-------------------------------+--------------------
      80790bfd-4b81-4381-9262-20912e3826cc | 0387bbb7-1d54-4683-943c-6c17d6804de9 | 1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0
      80790bfd-4b81-4381-9262-20912e3826cc | ff20e066-8815-4eb6-ac84-fab9b9103462 | bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 0
    (2 rows)

One way to solve this would be to create a UNIQUE INDEX on
`snapshot_addr` here, but then in these cases the snapshot creation
would return a 500 error to the user.
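
For reference, that rejected approach would have looked roughly like the
sketch below (not part of this commit; it assumes the `omicron.public`
schema used by `dbinit.sql`):

```
-- NOT what this commit does: a unique constraint on snapshot_addr would make
-- the second region_snapshot INSERT fail outright when a reclaimed port is
-- reused, and that constraint violation would surface to the user as a 500.
CREATE UNIQUE INDEX ON omicron.public.region_snapshot (snapshot_addr);
```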

This commit adds a sixth column, `deleting`: a boolean that is true when
the region snapshot is part of a volume's `resources_to_clean_up`, and
false otherwise. The transaction in
`decrease_crucible_resource_count_and_soft_delete_volume` uses it to
select only the region_snapshot records that it decremented itself, and
to skip re-deleting records that are already marked.
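
The corresponding schema migration is not rendered in this excerpt (the
diff below only shows `SCHEMA_VERSION` moving from 4.0.0 to 5.0.0), but
it presumably boils down to something like this sketch:

```
-- Sketch of the assumed migration; the actual file under schema/crdb/ is not
-- shown here. Existing rows default to false, i.e. "not yet being cleaned up".
ALTER TABLE omicron.public.region_snapshot
    ADD COLUMN IF NOT EXISTS deleting BOOL NOT NULL DEFAULT false;
```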

This works because overlapping lifetimes of the records in the DB do
**not** imply overlapping lifetimes of the actual read-only downstairs
daemons: for a port to be reclaimed, the original daemon has to be
DELETEd, and that happens only after the decrement transaction has
already computed which resources to clean up:

1) a region_snapshot record is created:

```
            snapshot_id              |         snapshot_addr         | volume_references | deleting |
-------------------------------------+-------------------------------+-------------------+----------
1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |  false  |
```

2) it is incremented as part of `volume_create`:

```
            snapshot_id              |         snapshot_addr         | volume_references | deleting |
-------------------------------------+-------------------------------+-------------------+----------
1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 1 |  false  |
```

3) when the volume is deleted, the decrement transaction will:

  a) decrease `volume_references` by 1

```
            snapshot_id              |         snapshot_addr         | volume_references | deleting |
-------------------------------------+-------------------------------+-------------------+----------
1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |  false  |
```

  b) note any `region_snapshot` records whose `volume_references` went to
     0 and have `deleting` = false, and return those in the list of
     resources to clean up:

     [ 1a800928-8f93-4cd3-9df1-4129582ffc20 ]

  c) set `deleting` = true for any region_snapshot records whose
     `volume_references` went to 0 and have `deleting` = false

```
            snapshot_id              |         snapshot_addr         | volume_references | deleting |
-------------------------------------+-------------------------------+-------------------+----------
1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true  |
```

4) That read-only snapshot daemon is DELETEd, freeing up the port.
   Another snapshot creation occurs, using that reclaimed port:

```
            snapshot_id              |         snapshot_addr         | volume_references | deleting |
-------------------------------------+-------------------------------+-------------------+----------
1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true  |
bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 0 |  false  |
```

5) That new snapshot is incremented as part of `volume_create`:

```
            snapshot_id              |         snapshot_addr         | volume_references | deleting |
-------------------------------------+-------------------------------+-------------------+----------
1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true  |
bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 1 |  false  |
```

6) It is later deleted, and the decrement transaction will:

  a) decrease `volume_references` by 1:

```
            snapshot_id              |         snapshot_addr         | volume_references | deleting |
-------------------------------------+-------------------------------+-------------------+----------
1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true  |
bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 0 |  false  |
```

  b) note any `region_snapshot` records whose `volume_references` went to
     0 and have `deleting` = false, and return those in the list of
     resources to clean up:

     [ bdd9614e-f089-4a94-ae46-e10b96b79ba3 ]

  c) set `deleting` = true for any region_snapshot records whose
     `volume_references` went to 0 and have `deleting` = false

```
            snapshot_id              |         snapshot_addr         | volume_references | deleting |
-------------------------------------+-------------------------------+-------------------+----------
1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true  |
bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 0 |   true  |
```
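
In plain SQL, the decrement portion of the transaction now amounts to
something like the sketch below (the real code builds these statements
with Diesel, as shown in the `volume.rs` changes further down;
`<targets>` is a placeholder for the volume's read-only target
addresses):

```
-- (a) decrement only records that are not already being cleaned up
UPDATE region_snapshot
   SET volume_references = volume_references - 1
 WHERE snapshot_addr IN (<targets>)
   AND volume_references > 0
   AND deleting = false;

-- (b) collect the records that this transaction just drove to zero; these are
--     returned as the resources to clean up
SELECT * FROM region_snapshot
 WHERE snapshot_addr IN (<targets>)
   AND volume_references = 0
   AND deleting = false;

-- (c) mark exactly those records, so that a later snapshot reusing the same
--     port is not swept up by someone else's delete
UPDATE region_snapshot
   SET deleting = true
 WHERE snapshot_addr IN (<targets>)
   AND volume_references = 0
   AND deleting = false;
```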

jmpesp committed Sep 15, 2023
1 parent aceb744 commit 5e19a9b

Showing 10 changed files with 417 additions and 6 deletions.
2 changes: 1 addition & 1 deletion Cargo.lock

3 changes: 3 additions & 0 deletions nexus/db-model/src/region_snapshot.rs
@@ -32,4 +32,7 @@ pub struct RegionSnapshot {

// how many volumes reference this?
pub volume_references: i64,

// true if part of a volume's `resources_to_clean_up` already
pub deleting: bool,
}
3 changes: 2 additions & 1 deletion nexus/db-model/src/schema.rs
@@ -856,6 +856,7 @@ table! {
snapshot_id -> Uuid,
snapshot_addr -> Text,
volume_references -> Int8,
deleting -> Bool,
}
}

@@ -1130,7 +1131,7 @@ table! {
///
/// This should be updated whenever the schema is changed. For more details,
/// refer to: schema/crdb/README.adoc
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(4, 0, 0);
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(5, 0, 0);

allow_tables_to_appear_in_same_query!(
system_update,
21 changes: 21 additions & 0 deletions nexus/db-queries/src/db/datastore/region_snapshot.rs
@@ -10,9 +10,11 @@ use crate::db::error::public_error_from_diesel_pool;
use crate::db::error::ErrorHandler;
use crate::db::model::RegionSnapshot;
use async_bb8_diesel::AsyncRunQueryDsl;
use async_bb8_diesel::OptionalExtension;
use diesel::prelude::*;
use omicron_common::api::external::CreateResult;
use omicron_common::api::external::DeleteResult;
use omicron_common::api::external::LookupResult;
use uuid::Uuid;

impl DataStore {
@@ -31,6 +33,25 @@ impl DataStore {
.map_err(|e| public_error_from_diesel_pool(e, ErrorHandler::Server))
}

pub async fn region_snapshot_get(
&self,
dataset_id: Uuid,
region_id: Uuid,
snapshot_id: Uuid,
) -> LookupResult<Option<RegionSnapshot>> {
use db::schema::region_snapshot::dsl;

dsl::region_snapshot
.filter(dsl::dataset_id.eq(dataset_id))
.filter(dsl::region_id.eq(region_id))
.filter(dsl::snapshot_id.eq(snapshot_id))
.select(RegionSnapshot::as_select())
.first_async::<RegionSnapshot>(self.pool())
.await
.optional()
.map_err(|e| public_error_from_diesel_pool(e, ErrorHandler::Server))
}

pub async fn region_snapshot_remove(
&self,
dataset_id: Uuid,
64 changes: 61 additions & 3 deletions nexus/db-queries/src/db/datastore/volume.rs
@@ -120,6 +120,7 @@ impl DataStore {
.filter(
rs_dsl::snapshot_addr.eq(read_only_target.clone()),
)
.filter(rs_dsl::deleting.eq(false))
.set(
rs_dsl::volume_references
.eq(rs_dsl::volume_references + 1),
@@ -637,17 +638,44 @@
}
};

// Decrease the number of uses for each referenced region snapshot.
// Decrease the number of uses for each non-deleted referenced
// region snapshot.

use db::schema::region_snapshot::dsl;

diesel::update(dsl::region_snapshot)
.filter(
dsl::snapshot_addr
.eq_any(&crucible_targets.read_only_targets),
)
.filter(dsl::volume_references.gt(0))
.filter(dsl::deleting.eq(false))
.set(dsl::volume_references.eq(dsl::volume_references - 1))
.execute(conn)?;

// Then, note anything that was set to zero from the above
// UPDATE, and then mark all those as deleted.
let snapshots_to_delete: Vec<RegionSnapshot> =
dsl::region_snapshot
.filter(
dsl::snapshot_addr
.eq_any(&crucible_targets.read_only_targets),
)
.filter(dsl::volume_references.eq(0))
.filter(dsl::deleting.eq(false))
.select(RegionSnapshot::as_select())
.load(conn)?;

diesel::update(dsl::region_snapshot)
.filter(
dsl::snapshot_addr
.eq_any(&crucible_targets.read_only_targets),
)
.filter(dsl::volume_references.eq(0))
.filter(dsl::deleting.eq(false))
.set(dsl::deleting.eq(true))
.execute(conn)?;

// Return what results can be cleaned up
let result = CrucibleResources::V1(CrucibleResourcesV1 {
// The only use of a read-write region will be at the top level of a
@@ -673,6 +701,7 @@
.eq(region_dsl::id)
.and(dsl::dataset_id.eq(dataset_dsl::id))),
)
.filter(dsl::deleting.eq(true))
.filter(
dsl::volume_references
.eq(0)
@@ -707,12 +736,41 @@
// delete a read-only downstairs running for a
// snapshot that doesn't exist will return a 404,
// causing the saga to error and unwind.
//
// XXX are any other filters required, except the
// three from snapshots_to_delete?
.filter(
dsl::snapshot_addr.eq_any(
&crucible_targets.read_only_targets,
),
)
.filter(dsl::volume_references.eq(0))
// XXX is there a better way to do this, rather than
// one filter per struct field?
.filter(
dsl::dataset_id.eq_any(
snapshots_to_delete
.iter()
.map(|x| x.dataset_id),
),
)
.filter(dsl::region_id.eq_any(
snapshots_to_delete.iter().map(|x| x.region_id),
))
.filter(
dsl::snapshot_id.eq_any(
snapshots_to_delete
.iter()
.map(|x| x.snapshot_id),
),
)
.filter(dsl::deleting.eq(true))
.filter(
dsl::volume_references
.eq(0)
// Despite the SQL specifying that this column is NOT NULL,
// this null check is required for this function to work!
.or(dsl::volume_references.is_null()),
)
.inner_join(
dataset_dsl::dataset
.on(dsl::dataset_id.eq(dataset_dsl::id)),
@@ -960,7 +1018,7 @@ impl DataStore {

#[derive(Default, Debug, Serialize, Deserialize)]
pub struct CrucibleTargets {
read_only_targets: Vec<String>,
pub read_only_targets: Vec<String>,
}

// Serialize this enum into the `resources_to_clean_up` column to handle
1 change: 1 addition & 0 deletions nexus/src/app/sagas/snapshot_create.rs
@@ -1275,6 +1275,7 @@ async fn ssc_start_running_snapshot(
snapshot_id,
snapshot_addr,
volume_references: 0, // to be filled later
deleting: false,
})
.await
.map_err(ActionError::action_failed)?;
1 change: 1 addition & 0 deletions nexus/tests/integration_tests/snapshots.rs
@@ -1094,6 +1094,7 @@ async fn test_region_snapshot_create_idempotent(
snapshot_addr: "[::]:12345".to_string(),

volume_references: 1,
deleting: false,
};

datastore.region_snapshot_create(region_snapshot.clone()).await.unwrap();