Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix IP pools data migration #4903

Merged
merged 3 commits into from
Jan 26, 2024
Merged

Fix IP pools data migration #4903

merged 3 commits into from
Jan 26, 2024

Conversation

david-crespo
Copy link
Contributor

@david-crespo david-crespo commented Jan 26, 2024

Closes #4875

Problem

After the IP pools migrations on the dogfood rack, the default pool was not marked is_default=true for the oxide silo when it should have been.

Diagnosis

When checking for silo-scoped default pools overriding a fleet-scoped default, I neglected to require that the silo-scoped defaults in question were non-deleted. This means that if there was a deleted pool with silo_id=<oxide silo id> and is_default=true, that would be considered an overriding default and leave us with is_default=false on the default pool.

Well, I can't check silo_id and is_default on the pools because those columns have been dropped, but there is a deleted pool called oxide-default that says in the description it was meant as the default pool for only the oxide silo.

oot@[fd00:1122:3344:105::3]:32221/omicron> select * from omicron.public.ip_pool;
                   id                  |        name        |          description           |         time_created          |         time_modified         |         time_deleted          | rcgen
---------------------------------------+--------------------+--------------------------------+-------------------------------+-------------------------------+-------------------------------+--------
  1efa49a2-3f3a-43ab-97ac-d38658069c39 | oxide-default      | oxide silo-only pool - default | 2023-08-31 05:33:00.11079+00  | 2023-08-31 05:33:00.11079+00  | 2023-08-31 06:03:22.426488+00 |     1

I think we can be pretty confident this is what got us.

Fix

Add AND time_deleted IS NULL to the subquery.

Mitigation in existing systems

Already done. Dogfood is the only long-running system where the bad migration ran, and all I had to do there was use the API to set is_default=true for the (default pool, oxide silo) link.


// fleet-level pool that was not default becomes non-default on all silos
assert_row(&ip_pool_resources[4].values, POOL4, SILO1, false);
assert_row(&ip_pool_resources[5].values, POOL4, SILO2, false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These asserts are the same as before, there are just a couple more cases, and I made them denser so I could stare at them more easily for 5 hours.

FROM omicron.public.ip_pool p0
WHERE p0.silo_id = s.id
AND p0.is_default
AND p0.time_deleted IS NULL
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and time_deleted is null is the fix.

@@ -975,9 +977,11 @@ fn before_23_0_0(client: &Client) -> BoxFuture<'_, ()> {
// no corresponding silo.
client.batch_execute(&format!("INSERT INTO ip_pool
(id, name, description, time_created, time_modified, time_deleted, rcgen, silo_id, is_default) VALUES
('{POOL0}', 'pool2', '', now(), now(), now(), 1, '{SILO2}', true),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this deleted pool added to the test data but the fix commented out, the test fails in the way we expect, which matches the problem on dogfood: the (pool3, silo2) link gets is_default=false instead of is_default=true.

@david-crespo david-crespo requested a review from zephraph January 26, 2024 04:29
@david-crespo david-crespo merged commit d5dace6 into main Jan 26, 2024
20 checks passed
@david-crespo david-crespo deleted the fix-ip-pools-migration branch January 26, 2024 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IP pool migration failed to keep fleet default as default
1 participant