From 8fa1da2e81599df50ac035a1948d17533ffc31bc Mon Sep 17 00:00:00 2001
From: Sean Klein
Date: Tue, 27 Aug 2024 16:11:33 -0700
Subject: [PATCH] Add notes on fixing broken schema updates (#6448)

Addresses feedback on https://github.com/oxidecomputer/omicron/pull/6436
---
 schema/crdb/README.adoc | 111 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 101 insertions(+), 10 deletions(-)

diff --git a/schema/crdb/README.adoc b/schema/crdb/README.adoc
index e017c01316..3567821ea6 100644
--- a/schema/crdb/README.adoc
+++ b/schema/crdb/README.adoc
@@ -126,19 +126,47 @@ same `NEW_VERSION`:**, then your `OLD_VERSION` has changed and so _your_ new
 version that came in from "main").
 * Update the version in `dbinit.sql` to match the new `NEW_VERSION`.
 
-=== General notes
+=== Constraints on Schema Updates
 
-CockroachDB's representation of the schema includes some opaque
-internally-generated fields that are order dependent, like the names of
-anonymous CHECK constraints. Our schema comparison tools intentionally ignore
-these values. As a result, when performing schema changes, the order of new
-tables and constraints should generally not be important.
+==== Adding a new column without a default value [[add_column_constraint]]
+
+When adding a new non-nullable column to an existing table, that column must
+have a `DEFAULT` value to back-fill any rows that may already exist in the
+table. Without this default value, the schema upgrade might fail with an
+error like `null value in column "..." violates not-null constraint`.
+Unfortunately, the schema upgrade might NOT fail with that error if no rows
+are present in the table when the schema is updated. This results in an
+inconsistent state, where the schema upgrade succeeds on some deployments but
+fails on others.
+
+If you'd like to add a column without a default value, and a `DEFAULT` value
+makes sense as a one-time back-fill, we recommend the following:
+
+. Add the column with a `DEFAULT` value.
+. Drop the `DEFAULT` constraint.
+
+If a `DEFAULT` value does not make sense, then you need to implement a
+multi-step process instead (both paths are sketched at the end of this
+section):
+
+. Add the column without a `NOT NULL` constraint.
+. Migrate existing data to a non-null value.
+. Once all data has been migrated to a non-null value, alter the table again
+to add the `NOT NULL` constraint.
 
-As convention, however, we recommend keeping the `db_metadata` file at the end
-of `dbinit.sql`, so that the database does not contain a version until it is
-fully populated.
+For the time being, if you can write the data migration in SQL (e.g., using a
+SQL `UPDATE`), then you can do this with a single new schema version where the
+second step is an `UPDATE`. See schema version 54
+(`blueprint-add-external-ip-id`) for an example of this (though that one did
+not make the new column `NOT NULL` -- you'd want to do that in another step).
+Update the `validate_data_migration()` test in
+`nexus/tests/integration_tests/schema.rs` to add a test for this.
 
-=== Scenario-specific gotchas
+In the future, when schema updates happen while the control plane is online,
+this may not be a tenable path because the operation may take a very long time
+on large tables.
+
+If you cannot write the data migration in SQL, you would need to figure out a
+different way to backfill the data before you can apply the step that adds the
+`NOT NULL` constraint. This is likely a substantial project.
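+
+To make both paths concrete, here is a minimal sketch, using an invented
+`widget` table with `status` and `time_deleted` columns (these names are for
+illustration only, not a real Omicron schema change). Each statement below
+would live in its own `upN.sql` file:
+
+[source,sql]
+----
+-- Path 1: a one-time DEFAULT makes sense.
+-- First step: add the column with a DEFAULT, back-filling existing rows.
+ALTER TABLE omicron.public.widget
+    ADD COLUMN IF NOT EXISTS status STRING NOT NULL DEFAULT 'unknown';
+-- Second step: drop the DEFAULT, which was only needed for the back-fill.
+ALTER TABLE omicron.public.widget
+    ALTER COLUMN status DROP DEFAULT;
+
+-- Path 2: no static DEFAULT makes sense.
+-- First step: add the column as nullable.
+ALTER TABLE omicron.public.widget
+    ADD COLUMN IF NOT EXISTS status STRING;
+-- Second step: back-fill existing rows. Here the value is derived from
+-- another column, which is why a static DEFAULT would not have worked; the
+-- WHERE clause keeps the statement idempotent.
+UPDATE omicron.public.widget
+    SET status = CASE WHEN time_deleted IS NULL THEN 'ready' ELSE 'deleted' END
+    WHERE status IS NULL;
+-- Third step: once every row is non-null, add the NOT NULL constraint.
+ALTER TABLE omicron.public.widget
+    ALTER COLUMN status SET NOT NULL;
+----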
 
 ==== Renaming columns
 
@@ -151,3 +179,66 @@ functions as a workaround.)
 An (imperfect) workaround is to use the `#[diesel(column_name = foo)]` attribute
 in Rust code to preserve the existing name of a column in the database while
 giving its corresponding struct field a different, more meaningful name.
+
+Note that this constraint does not apply to renaming tables: the statement
+`ALTER TABLE IF EXISTS ... RENAME TO ...` is valid and idempotent.
+
+=== Fixing Broken Schema Updates
+
+WARNING: This section is somewhat speculative: what "broken" means may differ
+significantly from one schema update to the next. Take this as a
+recommendation based on experience, but not as a hard truth that will apply
+to all broken schema updates.
+
+In cases where a schema update cannot complete successfully, additional steps
+may be necessary to enable schema updates to proceed (for example, if a
+schema update tried to <<add_column_constraint,add a column without a default
+value>>). In these situations, the goal should be the following:
+
+. Fix the schema update such that deployments which have not applied it yet
+do not fail.
+.. It is important to update the *exact* `upN.sql` file which failed, rather
+than re-numbering or otherwise changing the order of schema updates.
+Internally, Nexus tracks which individual step of a schema update has been
+applied, to avoid applying older schema upgrades which may no longer be
+relevant.
+. Add a follow-up named schema update to ensure that deployments which have
+*already* applied the broken update arrive at the same state. This is only
+necessary if the broken update could have applied successfully on some
+deployments. This schema update should be added like any other "new" schema
+update, appended to the list of all updates, rather than re-ordering history.
+This schema update will run on systems that applied either version of the
+earlier schema update.
+. Determine whether any of the schema versions after the broken one need to
+change because they depended on the specific behavior that you changed to
+_fix_ that version.
+
+We can use the following terminology here:
+
+* `S(bad)`: The particular `upN.sql` schema update which is "broken".
+* `S(fixed)`: That same `upN.sql` file after being updated to a non-broken
+version.
+* `S(converge)`: Some later schema update that converges the deployment to a
+known-good state.
+
+**This process is risky**. By changing the contents of the old schema update
+`S(bad)` to `S(fixed)`, we create two divergent histories on our deployments:
+one where `S(bad)` may have been applied, and one where only `S(fixed)` was
+applied.
+
+Although the goal of `S(converge)` is to make sure that these deployments end
+up looking the same, there is no guarantee that the other schema updates
+between `S(bad)` and `S(converge)` will behave identically in these two
+variant update timelines. When fixing broken schema updates, do so with
+caution, and consider all schema updates between `S(bad)` and `S(converge)`:
+these updates must be able to complete successfully regardless of whether
+`S(bad)` or `S(fixed)` was applied. The sketch below makes this concrete.
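+
+In this sketch, the table and column names are invented for illustration;
+they do not correspond to a real Omicron schema change:
+
+[source,sql]
+----
+-- S(bad), as originally shipped: fails on any deployment whose table
+-- already contains rows, but succeeds (and records the step as applied)
+-- wherever the table happens to be empty.
+--
+--     ALTER TABLE omicron.public.widget ADD COLUMN status STRING NOT NULL;
+
+-- S(fixed): the same upN.sql file, edited in place, so that deployments
+-- which have not yet applied the step can do so successfully.
+ALTER TABLE omicron.public.widget
+    ADD COLUMN IF NOT EXISTS status STRING NOT NULL DEFAULT 'unknown';
+
+-- S(converge): a brand-new schema update appended after all existing ones.
+-- Deployments that ran S(bad) have no DEFAULT on the column; deployments
+-- that ran S(fixed) do. Dropping the DEFAULT (a no-op where none exists)
+-- converges both histories on the same schema.
+ALTER TABLE omicron.public.widget
+    ALTER COLUMN status DROP DEFAULT;
+----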
+
+=== General notes
+
+CockroachDB's representation of the schema includes some opaque
+internally-generated fields that are order dependent, like the names of
+anonymous CHECK constraints. Our schema comparison tools intentionally ignore
+these values. As a result, when performing schema changes, the order of new
+tables and constraints should generally not be important.
+
+As a convention, however, we recommend keeping the `db_metadata` row insertion
+at the end of `dbinit.sql`, so that the database does not contain a version
+until it is fully populated.
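+
+For illustration, the tail of `dbinit.sql` takes roughly the following shape
+(consult `dbinit.sql` itself for the authoritative column set and version
+string):
+
+[source,sql]
+----
+-- Keeping this INSERT as the last statement in dbinit.sql means a database
+-- only reports a version once every other object has been created.
+INSERT INTO omicron.public.db_metadata (
+    singleton,
+    time_created,
+    time_modified,
+    version,
+    target_version
+) VALUES
+    (TRUE, NOW(), NOW(), '<current version>', NULL)
+ON CONFLICT DO NOTHING;
+----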