Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include enums in dbinit_equal_sum_of_all_up test #4472

Merged
merged 7 commits into from
Nov 10, 2023
Merged

Conversation

david-crespo
Copy link
Contributor

While working on #4261 I ran into created a situation where the order of items in a DB enum matters because I want to sort by it. In this case it's about specificity — the type is silo or fleet and I want to sort by it to ensure I get the most "specific" thing first. @zephraph noticed I had an order mismatch between my migration adding the enum and the copy in dbinit.sql, and it wasn't caught by the tests.

So, here I've added the list of user-defined enums to the set of things we compare when we ask whether the result of our migrations matches the contents of dbinit.sql. There were two existing enums that failed the test because the copy in dbinit.sql was sorted alphabetically, but a migration added a new item at the end of the list. Very easy to fix, though now are enums aren't sorted nicely, which is sad.

The question is: should we enforce enum order? Seems like we might as well if it doesn't cost us much. But should we write application logic like #4261 that relies on enum order? In that case, at least, I have a reasonable alternative because the list of things I'm sorting is at most 2 long, so I can easily get the whole list and pick the element I want in app code instead of doing an order by + limit 1 in SQL. It is made quite clear in a comment, but it still feels a little dicey.

@david-crespo david-crespo marked this pull request as ready for review November 8, 2023 19:49
@david-crespo david-crespo changed the title include enums in DB sum_of_all_up check Include enums in DB sum_of_all_up check Nov 8, 2023
@david-crespo david-crespo changed the title Include enums in DB sum_of_all_up check Include enums in dbinit_equal_sum_of_all_up test Nov 8, 2023
@@ -168,13 +168,14 @@ async fn query_crdb_schema_version(crdb: &CockroachInstance) -> String {
//
// Note that for the purposes of schema comparisons, we don't care about parsing
// the contents of the database, merely the schema and equality of contained data.
#[derive(Eq, PartialEq, Clone, Debug)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to get rid of Eq to make this work because floats don't impl Eq. Fortunately it seems our equality checks only need PartialEq anyway.

@zephraph
Copy link
Contributor

zephraph commented Nov 8, 2023

I spent some time reading through Cockroach's RFC for adding enums to get a little bit of a better understanding of how they're implemented. Inserts at the beginning or the middle of a enum are supported and work fine w/ alter.

In the cases that were updated in the db init, the migration itself was wrong. For the clickhouse_keeper one, the migration should've been

-ALTER TYPE omicron.public.service_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper';
+ALTER TYPE omicron.public.service_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper' AFTER 'clickhouse';

I think this justifies the test if nothing else. Running dbinit on an empty db should result in the same enum as running the migrations from the first schema version. I don't know what bugs, if any, we might run into here, but this does seem like a correctness case that the test does a good job of solving.

@@ -191,7 +191,6 @@ CREATE UNIQUE INDEX IF NOT EXISTS lookup_switch_by_rack ON omicron.public.switch

CREATE TYPE IF NOT EXISTS omicron.public.service_kind AS ENUM (
'clickhouse',
'clickhouse_keeper',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure what I'm about to say is absolutely taboo, but I think we should actually update the migration and not this. The specific update would be an AFTER 'clickhouse' being added to the alter clause of the enum. For enums where clickhouse already exists that migration step won't be re-run, but for new systems it'll be the same as what we have today.

Copy link
Contributor Author

@david-crespo david-crespo Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's very cool. I didn't know you could insert it into a particular spot. We might have issues with the order number comparison if we do that. I think when you insert in the middle of two existing members, it's going to give it an decimal order number halfway between them. Then again I think we will also have this problem if we ever delete a variant. Important not-really-edge cases to consider.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be an issue because the enum isn't stored as a decimal. You can find an overview of the format here. Postgres stores enums as floats so you'd [1.0, 2.0] and inserting a value between them would end up with [1.0, 1.5, 2.0]. CRDB just does some fancy byte index padding which essentially ends up being a similar concept. So this should still work as expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m saying those order numbers won’t match what’s generated from scratch in dbinit.sql, which would be [1.0, 2.0, 3.0], so the test will fail as currently written.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! To my surprise, this still passes. I don't think we can change the migrations that have already been applied on customer racks, but I agree the AFTER trick is what we should do in the future.

diff --git a/schema/crdb/4.0.0/up1.sql b/schema/crdb/4.0.0/up1.sql
index b26b646d6..ba67d8394 100644
--- a/schema/crdb/4.0.0/up1.sql
+++ b/schema/crdb/4.0.0/up1.sql
@@ -1 +1 @@
-ALTER TYPE omicron.public.service_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper';
+ALTER TYPE omicron.public.service_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper' AFTER 'clickhouse';

diff --git a/schema/crdb/4.0.0/up2.sql b/schema/crdb/4.0.0/up2.sql
index 018d39c13..e71cb22f3 100644
--- a/schema/crdb/4.0.0/up2.sql
+++ b/schema/crdb/4.0.0/up2.sql
@@ -1 +1 @@
-ALTER TYPE omicron.public.dataset_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper';
+ALTER TYPE omicron.public.dataset_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper' AFTER 'clickhouse';

diff --git a/schema/crdb/dbinit.sql b/schema/crdb/dbinit.sql
index 6e8ab0e39..da842cbfe 100644
--- a/schema/crdb/dbinit.sql
+++ b/schema/crdb/dbinit.sql
@@ -191,6 +191,7 @@ CREATE UNIQUE INDEX IF NOT EXISTS lookup_switch_by_rack ON omicron.public.switch
 
 CREATE TYPE IF NOT EXISTS omicron.public.service_kind AS ENUM (
   'clickhouse',
+  'clickhouse_keeper',
   'cockroach',
   'crucible',
   'crucible_pantry',
@@ -201,7 +202,6 @@ CREATE TYPE IF NOT EXISTS omicron.public.service_kind AS ENUM (
   'ntp',
   'oximeter',
   'tfport',
-  'clickhouse_keeper',
   'mgd'
 );
 
@@ -402,9 +402,9 @@ CREATE TYPE IF NOT EXISTS omicron.public.dataset_kind AS ENUM (
   'crucible',
   'cockroach',
   'clickhouse',
+  'clickhouse_keeper',
   'external_dns',
-  'internal_dns',
-  'clickhouse_keeper'
+  'internal_dns'
 );
 
 /*

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have to leave the migrations as is we should at least comment on why the enum arms are out of order

Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this - I feel bad that the initial db comparison tooling wasn't comprehensive enough (ugh why would CRDB silently drop user-defined types from InformationSchema) but I really appreciate you handling this case.

similar_asserts::assert_eq!(
self.enums,
other.enums,
"Enums did not match. Members must have the same order in dbinit.sql and migrations. If a migration adds a member, it should be added at the end of the definition in dbinit.sql."

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

@@ -202,6 +201,7 @@ CREATE TYPE IF NOT EXISTS omicron.public.service_kind AS ENUM (
'ntp',
'oximeter',
'tfport',
'clickhouse_keeper',
Copy link
Collaborator

@smklein smklein Nov 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zephraph asked the following extremely relevant question:

is it unsafe to update the migrations to add the AFTER clause? That migration won't rerun on customer deployments, right? Even if it did, the enum is already present so it wouldn't try to readd it

So there are a few cases we need to consider:

  1. Someone hasn't deployed yet. In this case, whatever dbinit.sql says goes, so it "matches" trivially during formatting.
  2. Someone ran with an old version of dbinit.sql , and applied the schema migrations to get to this order.
    Before this PR: clickhouse_keeper appears at the end of the enum in the deployed DB, which doesn't match dbinit.sql
    After this PR:dbinit.sql matches the order applied by schema changes.
  3. Someone ran with a newish version of dbinit.sql, and hasn't applied any schema changes.
    Before this PR: clickhouse_keeper does not appear at the end of the enum in the deployed DB.
    After this PR: the enum order is no longer correct! In the deployed DB, clickhouse_keeper is early in the order, but this doesn't match dbinit.sql, which thinks it shows up later.

By making this change, we "fix" case (2), but we "break" case (3).

I point this out to identify: if you rely on enum order after this change, there are still deployments that exist where we do not have such a guarantee.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this is to say -- we can make this change, and if you want, but I think it's still unsafe to rely on order of our enums with possible deployments unless we add a "schema change" that validates order.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also, to be explicit, the changes that you've added to the integration tests are extremely useful still, in either situation)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a pain-in-the-ass, but we could:

This is a "full table scan" operation but the data sizes are really small so I don't think it's so bad
Then we add that as a schema change while updating dbinit.sql at the same time
New deployments -> Use dbinit.sql, new order
Old deployments -> Use the schema change, land on the same order as dbinit.sql

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor note... it doesn't have to be at the end. It'll still work being inserted after clickhouse as we intended clickhouse_keeper to originally be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this approach (even if it makes me slightly nervous b/c we have to do a small amount of data migration). There are two primary justifications that lead me to believe this is a good idea

  1. Given that enums are ordered, having this integration test is valuable. It's caught this case where migrations result in an effectively different enum shape than what we intended in dbinit
  2. We have an active case of database bifurcation. Installations which were or will be provisioned from a schema version greater 4.0.0 will effectively have different values for the service_kind and dataset_kind than those provisioned with an earlier version that had to run that migration. I don't have a full understanding of what challenges this could cause outside of ordering issues but it seems safer to sand down this sharp edge before we cut ourselves on it.

Copy link
Contributor Author

@david-crespo david-crespo Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I did it in 338828a. Rather than keep clickhouse_keeper2 around to confuse people in perpetuity, I decided to do the same thing again to change it back to clickhouse_keeper after fixing the order.

See the readme I added for the migration summarizing what it does: https://github.com/oxidecomputer/omicron/blob/338828a24837790f7b96a30dde7aecf7963748d0/schema/crdb/10.0.0/README.md

@david-crespo
Copy link
Contributor Author

Still working on the above comments, but the show enums change avoids the question of spurious differences in enumsortorder, making the test only care about the actual resulting order, which is what we care about. And the diff on failure is much easier to read.

Differences (-left|+right):
                         [
                             "crucible",
                             "cockroach",
                             "clickhouse",
-                            "clickhouse_keeper",
                             "external_dns",
                             "internal_dns",
+                            "clickhouse_keeper",
                         ],
                     ),
                 ),
             },

Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes make sense to me -- good catch to not leave clickhouse_keeper2 lying around!

Thanks for doing this!

@david-crespo
Copy link
Contributor Author

Only the “nexus applies update on boot” test fails on our friend the table scan prohibition. Easiest fix would be to temporarily turn that flag off.

{"msg":"Failed to ensure schema version: Internal Error: unexpected database error: query `UPDATE omicron.public.dataset\nSET kind = 'clickhouse_keeper2'\nWHERE kind = 'clickhouse_keeper'` contains a full table/index scan which is explicitly disallowed","v":0,"name":"nexus_applies_update_on_boot","level":40,"time":"2023-11-10T05:27:37.02062162Z","hostname":"ip-10-150-1-55","pid":115570,"component":"nexus","component":"ServerContext","name":"a49e7c37-3b13-4252-86d5-39ce75834297"}

@zephraph
Copy link
Contributor

I believe that we can narrowly enable the FTS for only the impacted query:

/// SQL used to enable full table scans for the duration of the current
/// transaction.
///
/// We normally disallow table scans in effort to identify scalability issues
/// during development. There are some rare cases where we do want to do full
/// scans on small tables. This SQL can be used to re-enable table scans for
/// the duration of the current transaction.
///
/// This should only be used when we know a table will be very small, it's not
/// paginated (if it were paginated, we could scan an index), and there's no
/// good way to limit the query based on an index.
///
/// This SQL appears to have no effect when used outside of a transaction.
/// That's intentional. We do not want to use `SET` (rather than `SET LOCAL`)
/// here because that would change the behavior for any code that happens to use
/// the same pooled connection after this SQL gets run.
///
/// **BE VERY CAREFUL WHEN USING THIS.**
pub const ALLOW_FULL_TABLE_SCAN_SQL: &str =
"set local disallow_full_table_scans = off; \
set local large_full_scan_rows = 1000;";

@david-crespo david-crespo enabled auto-merge (squash) November 10, 2023 16:18
@david-crespo david-crespo merged commit 0315715 into main Nov 10, 2023
19 of 20 checks passed
@david-crespo david-crespo deleted the check-schema-enums branch November 10, 2023 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants