IP pool migration failed to keep fleet default as default #4875

Closed
david-crespo opened this issue Jan 23, 2024 · 0 comments · Fixed by #4903
david-crespo commented Jan 23, 2024

This bit of data migration from #4261 did not behave as expected on dogfood. The `default` pool was a fleet-level default, and there was a second pool, `oxide-pool`, associated directly with the `oxide` silo but as non-default. In this situation, as the comment describes, the migration is supposed to associate both `default` and `oxide-pool` with silo `oxide`, with `default` keeping `is_default = true` and `oxide-pool` getting `is_default = false`. When we ran the update, however, both had `is_default = false`, leaving silo `oxide` without a default pool.

```
-- Special handling is required for conflicts between a fleet default and a
-- silo default. If pool P1 is a fleet default and pool P2 is a silo default
-- on silo S1, we cannot link both to S1 with is_default = true. What we
-- really want in that case is:
--
-- row 1: (P1, S1, is_default=false)
-- row 2: (P2, S1, is_default=true)
--
-- i.e., we want to link both, but have the silo default take precedence. The
-- AND NOT EXISTS here causes is_default to be false in row 1 if there is a
-- conflicting silo default pool. row 2 is inserted in up5.
p.is_default AND NOT EXISTS (
    SELECT 1 FROM omicron.public.ip_pool
    WHERE silo_id = s.id AND is_default
)
```
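
Concretely, for the dogfood scenario above, the link rows the migration should have produced for silo `oxide` look roughly like this (an illustrative sketch; the actual link-table name and columns from #4261 may differ):

```
-- expected link rows for silo `oxide` after the migration (illustrative)
-- (pool,        silo,   is_default)
--  (default,    oxide,  true)    -- fleet default carried over as the silo default
--  (oxide-pool, oxide,  false)   -- existing non-default association preserved
-- what dogfood actually had: both rows with is_default = false
```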

This is despite our lovely test for this very scenario. Indeed, these tests were added for exactly this change.

```
// pool3 did not previously have a corresponding silo, so now it's associated
// with both silos as a new resource in each.
//
// Additionally, silo1 already had a default pool (pool1), but silo2 did
// not have one. As a result, pool3 becomes the new default pool for silo2.
```

@david-crespo david-crespo changed the title IP pool migration failed to keep fleet default as defafult IP pool migration failed to keep fleet default as default Jan 23, 2024
@david-crespo david-crespo added this to the 6 milestone Jan 23, 2024
@david-crespo david-crespo self-assigned this Jan 23, 2024
david-crespo added a commit that referenced this issue Jan 26, 2024
Closes #4875

## Problem

After the IP pools migrations on the dogfood rack, the `default` pool
was not marked `is_default=true` for the `oxide` silo when it should
have been.

## Diagnosis

When checking for silo-scoped default pools overriding a fleet-scoped
default, I neglected to require that the silo-scoped defaults in
question were non-deleted. This means that if there was a deleted pool
with `silo_id=<oxide silo id>` and `is_default=true`, that would be
considered an overriding default and leave us with `is_default=false` on
the `default` pool.
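
In other words, at migration time the overriding-default check was effectively the following, with no filter on `time_deleted` (shown standalone here for illustration; in the migration it is correlated with the silo being linked):

```
SELECT 1 FROM omicron.public.ip_pool
WHERE silo_id = s.id AND is_default
-- a soft-deleted row like `oxide-default` (silo_id = <oxide silo id>,
-- is_default = true, time_deleted set) still matches, so the fleet default
-- was inserted with is_default = false for the `oxide` silo
```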

Well, I can't check `silo_id` and `is_default` on the pools because
those columns have been dropped, but there is a deleted pool called
`oxide-default` that says in the description it was meant as the default
pool for only the `oxide` silo.

```
root@[fd00:1122:3344:105::3]:32221/omicron> select * from omicron.public.ip_pool;
                   id                  |        name        |          description           |         time_created          |         time_modified         |         time_deleted          | rcgen
---------------------------------------+--------------------+--------------------------------+-------------------------------+-------------------------------+-------------------------------+--------
  1efa49a2-3f3a-43ab-97ac-d38658069c39 | oxide-default      | oxide silo-only pool - default | 2023-08-31 05:33:00.11079+00  | 2023-08-31 05:33:00.11079+00  | 2023-08-31 06:03:22.426488+00 |     1
```

I think we can be pretty confident this is what got us.

## Fix

Add `AND time_deleted IS NULL` to the subquery.
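
Applied to the snippet quoted above, the corrected expression looks like this (a sketch; the actual migration file may differ in formatting):

```
p.is_default AND NOT EXISTS (
    SELECT 1 FROM omicron.public.ip_pool
    WHERE silo_id = s.id AND is_default AND time_deleted IS NULL
)
```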

## Mitigation in existing systems

Already done. Dogfood is the only long-running system where the bad
migration ran, and all I had to do there was use the API to set
`is_default=true` for the (`default` pool, `oxide` silo) link.