Add notes on fixing broken schema updates (#6448)
Addresses feedback on #6436
smklein authored Aug 27, 2024
1 parent c313f99 commit 8fa1da2
Showing 1 changed file with 101 additions and 10 deletions.
111 changes: 101 additions & 10 deletions schema/crdb/README.adoc
@@ -126,19 +126,47 @@
same `NEW_VERSION`:**, then your `OLD_VERSION` has changed and so _your_
new version that came in from "main").
* Update the version in `dbinit.sql` to match the new `NEW_VERSION`.

=== Constraints on Schema Updates

==== Adding a new column without a default value [[add_column_constraint]]

When adding a new non-nullable column to an existing table, that column must
have a default value so that any rows already present in the table can be
back-filled. Without a default, the schema upgrade may fail with an error like
`null value in column "..." violates not-null constraint`. Unfortunately, the
upgrade might NOT fail with that error if the table happens to be empty when
the schema is updated. The result is an inconsistent state, where the same
schema upgrade succeeds on some deployments but fails on others.
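
For example, a statement like the following exhibits the problem (the
`omicron.public.widget` table and `widget_count` column here are hypothetical):

[source,sql]
----
-- Broken: adds a non-nullable column with no DEFAULT. This only succeeds if
-- omicron.public.widget happens to contain zero rows when the update runs.
ALTER TABLE omicron.public.widget
    ADD COLUMN IF NOT EXISTS widget_count INT8 NOT NULL;
----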

If you'd like the column to end up without a default value, but a `DEFAULT`
makes sense as a one-time back-fill, we recommend the following two steps
(sketched below):

. Adding the column with a `DEFAULT` value.
. Dropping the `DEFAULT` constraint.
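
A minimal sketch of that two-step approach, using the same hypothetical table:

[source,sql]
----
-- Step 1 (e.g. up1.sql): add the column with a DEFAULT so that any existing
-- rows are back-filled.
ALTER TABLE omicron.public.widget
    ADD COLUMN IF NOT EXISTS widget_count INT8 NOT NULL DEFAULT 0;

-- Step 2 (e.g. up2.sql): drop the DEFAULT now that every row has a value.
ALTER TABLE omicron.public.widget
    ALTER COLUMN widget_count DROP DEFAULT;
----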

If a `DEFAULT` value does not make sense, then you need to implement a
multi-step process.

. Add the column without a `NOT NULL` constraint.
. Migrate existing data to a non-null value.
. Once all data has been migrated to a non-null value, alter the table again to
add the `NOT NULL` constraint.

For the time being, if you can write the data migration in SQL (e.g., using a
SQL `UPDATE`), then you can do all of this within a single new schema version
whose second step is that `UPDATE`. See schema version 54
(`blueprint-add-external-ip-id`) for an example, though that one did not make
the new column `NOT NULL`; you'd want to do that in a further step. Also add a
test for the data migration to `validate_data_migration()` in
`nexus/tests/integration_tests/schema.rs`.
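
As a rough sketch (hypothetical table, column, and back-fill expression), such
a schema version might consist of three steps:

[source,sql]
----
-- up1.sql: add the column as nullable.
ALTER TABLE omicron.public.widget
    ADD COLUMN IF NOT EXISTS widget_count INT8;

-- up2.sql: back-fill existing rows; the WHERE clause keeps this idempotent.
UPDATE omicron.public.widget
    SET widget_count = 0
    WHERE widget_count IS NULL;

-- up3.sql: require the column once all rows have a value.
ALTER TABLE omicron.public.widget
    ALTER COLUMN widget_count SET NOT NULL;
----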

In the future when schema updates happen while the control plane is online,
this may not be a tenable path because the operation may take a very long time
on large tables.

If you cannot write the data migration in SQL, you will need to find another
way to back-fill the data before you can apply the step that adds the
`NOT NULL` constraint. This is likely to be a substantial project.

==== Renaming columns

@@ -151,3 +179,66 @@
functions as a workaround.)
An (imperfect) workaround is to use the `#[diesel(column_name = foo)]` attribute
in Rust code to preserve the existing name of a column in the database while
giving its corresponding struct field a different, more meaningful name.

Note that this constraint does not apply to renaming tables: the statement
`ALTER TABLE IF EXISTS ... RENAME TO ...` is valid and idempotent.
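
For example, with hypothetical names, the following can be re-run safely: once
the rename has happened, the old name no longer exists and the statement
becomes a no-op.

[source,sql]
----
ALTER TABLE IF EXISTS omicron.public.widget RENAME TO gadget;
----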

=== Fixing Broken Schema Updates

WARNING: This section is somewhat speculative. What "broken" means may differ
significantly from one schema update to the next. Take this as a
recommendation based on experience, not as a hard truth that will apply to all
broken schema updates.

In cases where a schema update cannot complete successfully, additional steps
may be necessary before schema updates can proceed (for example, if a schema
update attempted <<add_column_constraint>>). In these situations, the goals
should be the following:

. Fix the schema update such that deployments which have not applied it yet
do not fail.
.. It is important to update the *exact* `upN.sql` file which failed, rather than
re-numbering or otherwise changing the order of schema updates. Internally, Nexus
tracks which individual step of a schema update has been applied, to avoid applying
older schema upgrades which may no longer be relevant.
. Add a follow-up named schema update to ensure that deployments which have
*already* applied the broken update arrive at the same state. This is only
necessary if the broken update could have applied successfully on some
deployment. This follow-up should be added like any other "new" schema update,
appended to the end of the list of all updates rather than re-ordering
history; it will run on systems that applied either version of the earlier
schema update.
. Determine whether any of the schema versions after the broken one need to
change because they depended on the specific behavior that you changed to _fix_
that version.

We can use the following terminology here:

* `S(bad)`: The particular `upN.sql` schema update which is "broken".
* `S(fixed)`: That same `upN.sql` file after being updated to a non-broken version.
* `S(converge)`: Some later schema update that converges the deployment to a known-good
state.
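
As a hypothetical sketch, reusing the example from <<add_column_constraint>>:
suppose `S(bad)` added a non-nullable column with no `DEFAULT`, and so
succeeded only on deployments where the table happened to be empty. `S(fixed)`
and `S(converge)` might then look like:

[source,sql]
----
-- S(fixed): the same upN.sql file within the original schema version,
-- rewritten so that it also succeeds on non-empty tables.
ALTER TABLE omicron.public.widget
    ADD COLUMN IF NOT EXISTS widget_count INT8 NOT NULL DEFAULT 0;

-- S(converge): a step in a NEW schema version appended to the end of the
-- list. Dropping a DEFAULT that was never set is a no-op, so deployments that
-- applied S(bad) and deployments that applied S(fixed) both end up in the
-- same schema state: the column exists, is NOT NULL, and has no DEFAULT.
ALTER TABLE omicron.public.widget
    ALTER COLUMN widget_count DROP DEFAULT;
----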

**This process is risky**. By changing the contents of the old schema update `S(bad)`
to `S(fixed)`, we create two divergent histories on our deployments: one where `S(bad)`
may have been applied, and one where only `S(fixed)` was applied.

Although the goal of `S(converge)` is to make sure that these deployments end
up looking the same, there is no guarantee that the other schema updates
between `S(bad)` and `S(converge)` will behave identically in these two
variant update timelines. When fixing broken schema updates, do so with
caution, and consider all schema updates between `S(bad)` and `S(converge)`:
each of them must be able to complete successfully regardless of whether
`S(bad)` or `S(fixed)` was applied.

=== General notes

CockroachDB's representation of the schema includes some opaque
internally-generated fields that are order dependent, like the names of
anonymous CHECK constraints. Our schema comparison tools intentionally ignore
these values. As a result, when performing schema changes, the order of new
tables and constraints should generally not be important.

As convention, however, we recommend keeping the `db_metadata` row insertion at
the end of `dbinit.sql`, so that the database does not contain a version until
it is fully populated.
