IP pool migration failed to keep fleet default as default #4875

Closed
david-crespo opened this issue Jan 23, 2024 · 0 comments · Fixed by #4903
david-crespo commented Jan 23, 2024

This bit of data migration from #4261 did not behave as expected on dogfood. The `default` pool was a fleet-level default, and there was a second pool, `oxide-pool`, associated directly with the `oxide` silo but as non-default. In this situation, as the comment describes, the migration is supposed to associate both `default` and `oxide-pool` with silo `oxide`, with `default` keeping `is_default = true` and `oxide-pool` getting `is_default = false`. When we ran the update, however, both had `is_default = false`, leaving silo `oxide` without a default pool.

```
-- Special handling is required for conflicts between a fleet default and a
-- silo default. If pool P1 is a fleet default and pool P2 is a silo default
-- on silo S1, we cannot link both to S1 with is_default = true. What we
-- really want in that case is:
--
-- row 1: (P1, S1, is_default=false)
-- row 2: (P2, S1, is_default=true)
--
-- i.e., we want to link both, but have the silo default take precedence. The
-- AND NOT EXISTS here causes is_default to be false in row 1 if there is a
-- conflicting silo default pool. row 2 is inserted in up5.
p.is_default AND NOT EXISTS (
    SELECT 1 FROM omicron.public.ip_pool
    WHERE silo_id = s.id AND is_default
)
```
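
Concretely, for the dogfood scenario above, the link rows the migration should have produced for silo `oxide` look roughly like this (an illustrative sketch; the actual link-table name and columns from #4261 may differ):

```
-- expected link rows for silo `oxide` after the migration (illustrative)
-- (pool,        silo,   is_default)
--  (default,    oxide,  true)    -- fleet default carried over as the silo default
--  (oxide-pool, oxide,  false)   -- existing non-default association preserved
-- what dogfood actually had: both rows with is_default = false
```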

This is despite our lovely test for this very scenario. Indeed, these tests were added for exactly this change.

```
// pool3 did not previously have a corresponding silo, so now it's associated
// with both silos as a new resource in each.
//
// Additionally, silo1 already had a default pool (pool1), but silo2 did
// not have one. As a result, pool3 becomes the new default pool for silo2.
```

@david-crespo david-crespo changed the title IP pool migration failed to keep fleet default as defafult IP pool migration failed to keep fleet default as default Jan 23, 2024
@david-crespo david-crespo added this to the 6 milestone Jan 23, 2024
@david-crespo david-crespo self-assigned this Jan 23, 2024
david-crespo added a commit that referenced this issue Jan 26, 2024
Closes #4875

## Problem

After the IP pools migrations on the dogfood rack, the `default` pool
was not marked `is_default=true` for the `oxide` silo when it should
have been.

## Diagnosis

When checking for silo-scoped default pools overriding a fleet-scoped
default, I neglected to require that the silo-scoped defaults in
question were non-deleted. This means that if there was a deleted pool
with `silo_id=<oxide silo id>` and `is_default=true`, that would be
considered an overriding default and leave us with `is_default=false` on
the `default` pool.
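
In other words, at migration time the overriding-default check was effectively the following, with no filter on `time_deleted` (shown standalone here for illustration; in the migration it is correlated with the silo being linked):

```
SELECT 1 FROM omicron.public.ip_pool
WHERE silo_id = s.id AND is_default
-- a soft-deleted row like `oxide-default` (silo_id = <oxide silo id>,
-- is_default = true, time_deleted set) still matches, so the fleet default
-- was inserted with is_default = false for the `oxide` silo
```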

Well, I can't check `silo_id` and `is_default` on the pools because
those columns have been dropped, but there is a deleted pool called
`oxide-default` that says in the description it was meant as the default
pool for only the `oxide` silo.

```
root@[fd00:1122:3344:105::3]:32221/omicron> select * from omicron.public.ip_pool;
                   id                  |        name        |          description           |         time_created          |         time_modified         |         time_deleted          | rcgen
---------------------------------------+--------------------+--------------------------------+-------------------------------+-------------------------------+-------------------------------+--------
  1efa49a2-3f3a-43ab-97ac-d38658069c39 | oxide-default      | oxide silo-only pool - default | 2023-08-31 05:33:00.11079+00  | 2023-08-31 05:33:00.11079+00  | 2023-08-31 06:03:22.426488+00 |     1
```

I think we can be pretty confident this is what got us.

## Fix

Add `AND time_deleted IS NULL` to the subquery.
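
Applied to the snippet quoted above, the corrected expression looks like this (a sketch; the actual migration file may differ in formatting):

```
p.is_default AND NOT EXISTS (
    SELECT 1 FROM omicron.public.ip_pool
    WHERE silo_id = s.id AND is_default AND time_deleted IS NULL
)
```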

## Mitigation in existing systems

Already done. Dogfood is the only long-running system where the bad
migration ran, and all I had to do there was use the API to set
`is_default=true` for the (`default` pool, `oxide` silo) link.