Simulated Crucible agent should be stricter #4049

Merged: 14 commits merged into oxidecomputer:main from stricter_sled_agent on Sep 9, 2024

Conversation

Contributor

@jmpesp jmpesp commented Sep 7, 2023

There are a few more info messages here that will aid debugging, plus a few ways the simulated Crucible agent is now stricter:

  • it should be impossible to create a snapshot of a region in a bad state.

  • it should be an error to try to delete a snapshot for a non-existent region (the real sled agent would return a 404).

  • it should be an error to attempt to create a running snapshot for a snapshot that does not exist.

The actual Crucible agent reuses freed ports, so the simulated one should too; a rough sketch of these checks follows below.

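As an illustration only (the type and method names below are hypothetical, not the actual simulated-agent code), the checks and port reuse described above amount to something like this:

```rust
use std::collections::{BTreeSet, HashMap};

use anyhow::{bail, Result};

#[derive(Debug)]
enum State {
    Requested,
    Created,
    Tombstoned,
    Destroyed,
    Failed,
}

struct SimulatedAgent {
    regions: HashMap<String, State>,
    used_ports: BTreeSet<u16>,
}

impl SimulatedAgent {
    /// Refuse to snapshot a region that is missing or in a bad state.
    fn create_snapshot(&self, region_id: &str) -> Result<()> {
        match self.regions.get(region_id) {
            None => bail!("region {region_id} does not exist"),
            Some(State::Failed | State::Destroyed | State::Tombstoned) => {
                bail!("cannot create snapshot of region {region_id} in a bad state")
            }
            Some(_) => Ok(()),
        }
    }

    /// Hand out the lowest port not currently in use, so ports freed by
    /// deleted daemons get reused (mirroring the real agent's behaviour).
    fn reserve_port(&mut self) -> u16 {
        let mut port = 19000;
        while self.used_ports.contains(&port) {
            port += 1;
        }
        self.used_ports.insert(port);
        port
    }

    /// Free a port when its daemon is deleted so it can be handed out again.
    fn release_port(&mut self, port: u16) {
        self.used_ports.remove(&port);
    }
}
```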
jmpesp added a commit to jmpesp/omicron that referenced this pull request Sep 15, 2023
`decrease_crucible_resource_count_and_soft_delete_volume` does not
disambiguate cases where the snapshot_addr of a region_snapshot is
duplicated with another one, which can occur due to the Crucible Agent
reclaiming ports from destroyed daemons (see also oxidecomputer#4049, which makes the
simulated Crucible agent do this).

Several invocations of the snapshot create and snapshot delete sagas
could race in such a way that one of these ports would be reclaimed, and
then be used in a different snapshot, and the lifetime of both of these
would overlap! This would confuse our reference counting, which was
written with a naive assumption that this port reuse **wouldn't** occur
with these overlapping lifetimes. Spoiler alert, it can:

    root@[fd00:1122:3344:101::3]:32221/omicron> select * from region_snapshot where snapshot_addr = '[fd00:1122:3344:102::7]:19016';
                   dataset_id              |              region_id               |             snapshot_id              |         snapshot_addr         | volume_references
    ---------------------------------------+--------------------------------------+--------------------------------------+-------------------------------+--------------------
      80790bfd-4b81-4381-9262-20912e3826cc | 0387bbb7-1d54-4683-943c-6c17d6804de9 | 1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0
      80790bfd-4b81-4381-9262-20912e3826cc | ff20e066-8815-4eb6-ac84-fab9b9103462 | bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 0
    (2 rows)

One way to solve this would be to create a UNIQUE INDEX on
`snapshot_addr` here, but then in these cases the snapshot creation
would return a 500 error to the user.

This commit adds a sixth column: `deleting`, a boolean that is true when
the region snapshot is part of a volume's `resources_to_clean_up`, and
false otherwise. This is used to select (as part of the transaction
for `decrease_crucible_resource_count_and_soft_delete_volume`) only the
region_snapshot records that were decremented as part of that
transaction, and skip re-deleting them otherwise.

This works because the overlapping lifetime of the records in the DB is
**not** the overlapping lifetime of the actual read-only downstairs
daemon: for the port to be reclaimed, the original daemon has to be
DELETEd, which happens after the decrement transaction has already
computed which resources to clean up:

1) a snapshot record is created:

                snapshot_id              |         snapshot_addr         | volume_references | deleted |
    -------------------------------------+-------------------------------+-------------------+----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |  false  |

2) it is incremented as part of `volume_create`:

                snapshot_id              |         snapshot_addr         | volume_references | deleted |
    -------------------------------------+-------------------------------+-------------------+----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 1 |  false  |

3) when the volume is deleted, then the decrement transaction will:

  a) decrease `volume_references` by 1

                snapshot_id              |         snapshot_addr         | volume_references | deleted |
    -------------------------------------+-------------------------------+-------------------+----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |  false  |

  b) note any `region_snapshot` records whose `volume_references` went to
     0 and have `deleted` = false, and return those in the list of
     resources to clean up:

     [ 1a800928-8f93-4cd3-9df1-4129582ffc20 ]

  c) set deleted = true for any region_snapshot records whose
     `volume_references` went to 0 and have deleted = false

                snapshot_id              |         snapshot_addr         | volume_references | deleted |
    -------------------------------------+-------------------------------+-------------------+----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true  |

4) That read-only snapshot daemon is DELETEd, freeing up the port.
   Another snapshot creation occurs, using that reclaimed port:

                snapshot_id              |         snapshot_addr         | volume_references | deleted |
    -------------------------------------+-------------------------------+-------------------+----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true  |
    bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 0 |  false  |

5) That new snapshot is incremented as part of `volume_create`:

                snapshot_id              |         snapshot_addr         | volume_references | deleted |
    -------------------------------------+-------------------------------+-------------------+----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true  |
    bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 1 |  false  |

6) It is later deleted, and the decrement transaction will:

  a) decrease `volume_references` by 1:

                snapshot_id              |         snapshot_addr         | volume_references | deleted |
    -------------------------------------+-------------------------------+-------------------+----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true  |
    bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 0 |  false  |

  b) note any `region_snapshot` records whose `volume_references` went to
     0 and have `deleted` = false, and return those in the list of
     resources to clean up:

     [ bdd9614e-f089-4a94-ae46-e10b96b79ba3 ]

  c) set deleted = true for any region_snapshot records whose
     `volume_references` went to 0 and have deleted = false

                snapshot_id              |         snapshot_addr         | volume_references | deleted |
    -------------------------------------+-------------------------------+-------------------+----------
    1a800928-8f93-4cd3-9df1-4129582ffc20 | [fd00:1122:3344:102::7]:19016 |                 0 |   true  |
    bdd9614e-f089-4a94-ae46-e10b96b79ba3 | [fd00:1122:3344:102::7]:19016 |                 0 |   true  |
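In other words, the decrement is conditional on `deleting` being false, and the transaction that drives a record to zero is the only one that claims it for cleanup. A simplified in-memory sketch of that logic (illustrative names only; the real version is a CockroachDB transaction driven from Nexus):

```rust
struct RegionSnapshot {
    snapshot_id: String,
    snapshot_addr: String,
    volume_references: u64,
    deleting: bool,
}

/// Decrement the reference counts for the snapshots a deleted volume used,
/// and return only the records that *this* call drove to zero. Records
/// already marked `deleting` are skipped, so a newer snapshot that reuses
/// the same snapshot_addr is never swept up by an older volume's cleanup.
fn soft_delete_volume_snapshots(
    table: &mut [RegionSnapshot],
    addrs_in_volume: &[&str],
) -> Vec<String> {
    let mut resources_to_clean_up = Vec::new();
    for record in table.iter_mut() {
        if record.deleting
            || !addrs_in_volume.contains(&record.snapshot_addr.as_str())
        {
            continue;
        }
        // a) decrease volume_references by 1
        record.volume_references = record.volume_references.saturating_sub(1);
        // b) + c) note which records went to zero in this call and mark them
        //    `deleting` so no later transaction re-deletes them
        if record.volume_references == 0 {
            record.deleting = true;
            resources_to_clean_up.push(record.snapshot_id.clone());
        }
    }
    resources_to_clean_up
}
```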
jmpesp added a commit that referenced this pull request Oct 10, 2023
@jmpesp jmpesp requested review from leftwo and removed request for jordanhendricks September 6, 2024 21:02
```rust
match region.state {
    State::Failed | State::Destroyed | State::Tombstoned => {
        bail!(
            "cannot create snapshot of region in state {:?}",
```
Contributor

Nit: Do you care to include the region ID in the bail/error message here? Or will it be easy enough to determine which region failed if you have many running at the same time?

Contributor Author

Good idea, done in 2a3cbd5
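For reference, the stricter check with the region ID included looks roughly like this (the field and type names are assumed for illustration; the actual wording landed in 2a3cbd5):

```rust
use anyhow::{bail, Result};

#[derive(Debug)]
enum State {
    Requested,
    Created,
    Tombstoned,
    Destroyed,
    Failed,
}

// Hypothetical region shape, just enough to show the error message.
struct Region {
    id: String,
    state: State,
}

fn check_snapshot_allowed(region: &Region) -> Result<()> {
    match region.state {
        State::Failed | State::Destroyed | State::Tombstoned => bail!(
            "cannot create snapshot of region {} in state {:?}",
            region.id,
            region.state,
        ),
        _ => Ok(()),
    }
}
```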


```rust
let map =
    self.running_snapshots.entry(id).or_insert_with(|| HashMap::new());

// If a running snapshot exists already, return it - this endpoint must
// be idempotent.
if let Some(running_snapshot) = map.get(&name.to_string()) {
    self.used_ports.remove(&port_number);
```
Contributor

Nit: Could you put the `let port_number = ...` below this `if`? Then you don't have to remove it here.

Contributor Author

Good catch! Done in 4b705af
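The suggested reordering ends up looking roughly like the following (placeholder types; the running-snapshot value is reduced to a bare port number for brevity, and the real change is in 4b705af):

```rust
use std::collections::{BTreeSet, HashMap};

struct Agent {
    // region id -> (snapshot name -> port of its running snapshot)
    running_snapshots: HashMap<String, HashMap<String, u16>>,
    used_ports: BTreeSet<u16>,
}

impl Agent {
    fn create_running_snapshot(&mut self, id: &str, name: &str) -> u16 {
        // If a running snapshot exists already, return it - this endpoint
        // must be idempotent - and since no port has been reserved yet,
        // there is nothing to undo.
        if let Some(port) =
            self.running_snapshots.get(id).and_then(|map| map.get(name))
        {
            return *port;
        }

        // Only now reserve a port for the new running snapshot.
        let port_number = self.next_free_port();
        self.running_snapshots
            .entry(id.to_string())
            .or_default()
            .insert(name.to_string(), port_number);
        port_number
    }

    fn next_free_port(&mut self) -> u16 {
        let mut port = 19000;
        while self.used_ports.contains(&port) {
            port += 1;
        }
        self.used_ports.insert(port);
        port
    }
}
```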

```rust
use super::*;
use omicron_test_utils::dev::test_setup_log;

/// Validate that the Crucible agent reuses ports
```
Contributor

Is this testing the sim agent, or the real agent?

Contributor Author

In this case it's the sim one

Contributor

If it's all testing the sim, we should make it clear in the comments that that's what we are testing, so I don't come back later and think this is testing a real agent :)

Contributor Author

Done! see bbca204
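Hypothetical wording of the clarified comment (the actual change is in bbca204); the point is only that the doc comment names the simulated agent explicitly:

```rust
/// Validate that the *simulated* Crucible agent reuses ports freed by
/// deleted running snapshots; the real agent is not exercised here.
#[test]
fn test_simulated_agent_port_reuse() {
    // test body elided
}
```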

Contributor

@leftwo leftwo left a comment

Just the one request: the comments for these tests should make clear that they're testing the simulated agent.

@jmpesp jmpesp enabled auto-merge (squash) September 9, 2024 21:10
@jmpesp jmpesp merged commit 69af861 into oxidecomputer:main Sep 9, 2024
16 checks passed
@jmpesp jmpesp deleted the stricter_sled_agent branch September 10, 2024 16:09