IP pools data model and API rework #4261

david-crespo · 2023-10-11T19:24:12Z

Closes #2148
Closes #4002
Closes #4003
Closes #4006

Background

#3985 (and followups #3998 and #4007) made it possible to associate an IP pool with a silo so that instances created in that silo would get their ephemeral IPs from said pool by default (i.e., without the user having to say anything other than "I want an ephemeral IP"). An IP pool associated with a silo was not accessible for ephemeral IP allocation from other silos — if a disallowed pool was specified by name at instance create time, the request would 404.

However! That was the quick version, and the data model left much to be desired. The relation was modeled by adding a nullable silo_id and sort-of-not-really-nullable is_default column directly on the IP pool table, which has the following limitations (and there are probably more):

A given IP pool could only be associated with at most one silo, could not be shared
The concept of default was treated as a property of the pool itself, rather than a property of the association with another resource, which is quite strange. Even if you could associate the pool with multiple silos, you could not have it be the default for one and not for the other
There is no way to create an IP pool without associating it with either the fleet or a silo
Extending this model to allow association at the project level would be inelegant — we'd have to add a project_id column (which I did in [nexus] Silo IP pools schema change #3981 before removing it in [nexus] Silo IP pools #3985)

More broadly (and vaguely), the idea of an IP pool "knowing" about silos or projects doesn't really make sense. Entities aren't really supposed to know about each other unless they have a parent-child relationship.

Changes in this PR

No such thing as fleet-scoped pool, only silo

Thanks to @zephraph for encouraging me to make this change. It is dramatically easier to explain "link silo to IP pool" than it is to explain "link resource (fleet or silo) to IP pool". The way to recreate the behavior of a single default pool for the fleet is to simply associate a pool with all silos. Data migrations ensure that existing fleet-scoped pools will be associated with all silos. There can only be one default pool for a silo, so in the rare case where pool A is a fleet default and pool B is default on silo S, we associate both A and B with S, but only B is made silo default pool.

API

These endpoints are added. They're pretty self-explanatory.

ip_pool_silo_link                        POST     /v1/system/ip-pools/{pool}/silos
ip_pool_silo_list                        GET      /v1/system/ip-pools/{pool}/silos
ip_pool_silo_unlink                      DELETE   /v1/system/ip-pools/{pool}/silos/{silo}
ip_pool_silo_update                      PUT      /v1/system/ip-pools/{pool}/silos/{silo}

The silo_id and is_default fields are removed from the IpPool response as they are now a property of the IpPoolLink, not the pool itself.

I also fixed the silo-scoped IP pools list (/v1/ip-pools) and fetch (/v1/ip-pools/{pool}) endpoints, which a) did not actually filter for the current silo, allowing any user to fetch any pool, and b) took a spurious project query param that didn't do anything.

DB

The association between IP pools and fleet or silo (or eventually projects, but not here) is now modeled through a polymorphic join table called ip_pool_resource:

ip_pool_id	resource_type	resource_id	is_default
123	silo	23	true
123	silo	4	false
65	~~fleet~~	~~FLEET_ID~~	~~true~~

Now, instead of setting the association with a silo or fleet at IP pool create or update time, there are separate endpoints for adding and removing an association. A pool can be associated with any number of resources, but a unique index ensures that a given resource can only have one default pool.

Default IP pool logic

If an instance ephemeral IP or a floating IP is created with a pool specified, we simply use that pool if it exists and is linked to the user's silo.

If an instance ephemeral IP or a floating IP is created without a pool unspecified, we look for a default pool for the current silo. If there is a pool linked with the current silo with is_default=true, use that. Otherwise, there is no default pool for the given scope and IP allocation will fail, which means the instance create or floating IP create request will fail.

The difference introduced in this PR is that we do not fall back to fleet default if there is no silo default because we have removed the concept of a fleet-scoped pool.

Tests and test helpers

This is the source of a lot of noise in this PR. Because there can no longer be a fleet default pool, we can no longer rely on that for tests. The test setup was really confusing. We assumed a default IP pool existed, but we still had to populate it (add a range) if we had to do anything with it. Now, we don't assume it exists, we create it and add a range and associate it with a silo all in one helper.

What do customers have to do when they upgrade?

They should not have to do anything at upgrade time.

If they were relying on a single fleet default pool to automatically be used by new silos, when they create silos in the future they will have to manually associate each new silo with the desired pool. We are working on ways to make that easier or more automatic, but that's not in this change. It is less urgent because silo creation is an infrequent operation.

If they are not using the previously fleet default IP pool named default and do not want it to exist, they can simply delete any IP ranges it contains, unlink it from all silos and delete it. If they are not using it, there should not be any IPs allocated from it (which means they can delete it).

david-crespo · 2023-10-12T05:19:49Z

schema/crdb/7.0.0/up5.sql

+SELECT id, 'silo', silo_id, is_default
+FROM ip_pool
+WHERE silo_id IS NOT null
+  AND time_deleted IS null;


@smklein integration_tests::schema::versions_have_idempotent_up is mad at me for up4 and up5 because the is_default and silo_id columns get deleted from ip_pool in up7, so these inserts fail when run a second time. I can't figure out how to make this insert conditional on those columns existing. Moving up6 and up7 into their own separate migration 8.0.0 migration makes the test pass. Is that good, though? Almost feels like too easy of a workaround.

david-crespo · 2023-10-20T22:30:08Z

Tested updated association post body with the TS generator and it is working exactly as desired:

/**
 * Parameters for associating an IP pool with a resource (fleet, silo)
 */
export type IpPoolAssociationCreate =
  | { isDefault: boolean; resourceType: 'silo'; silo: NameOrId }
  | { isDefault: boolean; resourceType: 'fleet' }

zephraph · 2023-12-14T17:57:19Z

schema/crdb/21.0.0/up5.sql

+WHERE ipr.resource_type = 'silo' -- technically unnecessary because there is only silo
+  AND ipr.is_default = true
+  AND ip.is_default = true -- both being default is the conflict being resolved
+  AND ip.silo_id IS NOT NULL


Should this be AND ip.silo_id IS null b/c that's what indicates a fleet ip pool?

Nope, we're looking at silo pools in ip_pool, lining them up against entries in ip_pool_resource that are type=silo but came from a fleet pool in up4, preferring the former as a default. I admit this is pure evil!

…check

As raised in Matrix, this backports a fix from #4261 where a new floating IP can be allocated using the name or ID of an IP pool which is bound to another silo. --------- Co-authored-by: David Crespo <[email protected]>

zephraph · 2023-12-19T15:25:12Z

nexus/db-queries/src/db/datastore/ip_pool.rs

+            .map(|ip_pool| {
+                let authz_pool =
+                    authz::IpPool::new(authz::FLEET, ip_pool.id(), lookup_type);
+                (authz_pool, ip_pool)


This is interesting, I haven't noticed where we've done something like this explicitly in the data store before. I'm not sure why the authz / resource tuple was necessary.

Ahh, I found it below. This is a non-standard lookup so the id/name generated lookups that hang off of nexus wouldn't be available and thus its existence.

Right. An edge case that doesn't really fit in the look up by name, id, or composite id model. Arguably it's a composite ID, which is why I'm using it as a sort of catchall for this.

zephraph · 2023-12-19T15:45:56Z

nexus/db-queries/src/db/datastore/ip_pool.rs

+        let (authz_pool1_for_silo, ip_pool) = datastore
            .ip_pools_fetch_default(&opctx)
            .await
            .expect("Failed to get silo's default IP pool");


I see, because this doesn't have a normal lookup implemented this fetch implementation acts as one. Seems fine.

zephraph · 2023-12-19T15:57:22Z

nexus/src/external_api/http_entrypoints.rs

+    path = "/v1/system/ip-pools/{pool}/silos/{silo}",
+    tags = ["system/networking"],
+}]
+async fn ip_pool_silo_update(


A minor nit here, but the expressed intention of this endpoint and its operation name feel disjointed. By the operation name there's an implication that you may be able to update other properties of the pool whereas you're only granted the option to update its default state. Isn't a huge deal, but I think we should eventually either expand this to be a full update or update the op name to reflect that it only impacts the default.

Yeah, it could be ip_pool_silo_set_default. There are no other properties of the link that can be updated. Another option would be ip_pool_silo_link_update to make clearer it's about updating the link and not the pool or the silo.

zephraph

Overall I'm feeling pretty good about the state of this PR. There's undoubtedly areas that I've missed in review but it feels holistically better than where we where before it.

Given we have time before the next release I feel it would be helpful to land this soon and be able to iterate on it as we feel it out through usage.

Bugfix for #4261. Using `distinct` to eliminate dupes from the left outer join works, but it's better to just use the well-known name of the service IP pool to exclude it from whatever operations it needs to be excluded from. Potential followup related to #4762: if the ID was well-known to, we could do the `is_internal` check without having to hit the DB.

david-crespo added 9 commits October 10, 2023 16:30

migrations and diesel schema for ip_pool_resource join table

580594e

add pools at rack setup time in the new way, update fetch_default_for

3065a61

update service pool lookup logic

f3809b5

delete silo_id and is_default from ip_pool model, test still passes

6404e27

rip out silo_id and is_default on ip_pools in db

94d0ebf

let's add an endpoint!

1de49fc

use joinable to make our joins more cute

7259531

get all ip_pools integration tests passing

4c8c5ea

instance ephemeral IP test passes

5386043

david-crespo self-assigned this Oct 11, 2023

david-crespo added this to the 3 milestone Oct 11, 2023

david-crespo commented Oct 12, 2023

View reviewed changes

shuffle migration files in anticipation of merging main

a1eb6e4

This was referenced Oct 17, 2023

dbinit is getting too order dependent #4286

Closed

Make dbinit.sql slightly less order-dependent #4288

Merged

david-crespo added 9 commits October 18, 2023 11:25

Merge branch 'main' into ip-pools-rework

9b4edaa

unauthorized test

49c9927

test for pagination logic

9c8d2d5

exclude service pool from list and block delete/update/modification

ff3a4e2

fix the tests!

db29bd3

get conflict/update logic right around is_default

9cbc423

add dissociate endpoint as DELETE /associate

ee5549c

rename endpoints and params to match convention better

0ab0cfc

change API to allow associating silo by name. it's beautiful

2b4f326

david-crespo added 5 commits October 23, 2023 16:02

move migration up a number before merging main

e2b7600

Merge branch 'main' into ip-pools-rework

acda55a

get silo ID so we can test association by name or id

80a4d22

pluralize /associations for restiness, prep for list endpoint

9c101ac

endpoint to list associations for IP pool

32b2f75

zephraph reviewed Dec 14, 2023

View reviewed changes

david-crespo added 2 commits December 14, 2023 11:57

unlink takes authz_silo and authz_pool directly and does better auth …

3830c2a

…check

simplify and clarify migrations by using a subquery

06fcc99

FelixMcFelix mentioned this pull request Dec 15, 2023

Fix FIP creation being able to access IP Pools in other silos #4705

Merged

morlandi7 modified the milestones: 5, 6 Dec 18, 2023

zephraph reviewed Dec 19, 2023

View reviewed changes

zephraph approved these changes Dec 19, 2023

View reviewed changes

david-crespo mentioned this pull request Jan 2, 2024

Want a way to test data migrations #4747

Closed

david-crespo added 3 commits January 3, 2024 13:07

move migration up

e02a0e6

Merge branch 'main' into ip-pools-rework (with conflicts)

ba1eb0e

resolve conflicts

ea4cbfa

david-crespo force-pushed the ip-pools-rework branch from 9a3ffb8 to ea4cbfa Compare January 3, 2024 22:51

add ip pool resource type to list of allowed enum names

8aa01b9

david-crespo mentioned this pull request Jan 4, 2024

Give end user a way to tell which IP pool is silo default #4752

Closed

david-crespo added 3 commits January 5, 2024 11:35

clean up merge leftovers and TODOs that are becoming issues

584cb14

move migrations up one

4976404

Merge branch 'main' into ip-pools-rework

16cbd29

This was referenced Jan 5, 2024

Internal IP pool should not be linkable to other silos #4762

Closed

Need operator endpoint to list IP pools for a silo #4763

Closed

david-crespo enabled auto-merge (squash) January 5, 2024 18:10

david-crespo merged commit bf49833 into main Jan 5, 2024
20 of 21 checks passed

david-crespo deleted the ip-pools-rework branch January 5, 2024 18:13

david-crespo mentioned this pull request Jan 12, 2024

Fix duplicate entries in IP pools list #4808

Merged

david-crespo mentioned this pull request Jan 23, 2024

IP pool migration failed to keep fleet default as default #4875

Closed

sunshowers mentioned this pull request Feb 23, 2024

[nexus] add basic support for expunged sled policy and decommissioned sled state #5032

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IP pools data model and API rework #4261

IP pools data model and API rework #4261

david-crespo commented Oct 11, 2023 •

edited

Loading

david-crespo Oct 12, 2023

david-crespo commented Oct 20, 2023

zephraph Dec 14, 2023

david-crespo Dec 14, 2023 •

edited

Loading

zephraph Dec 19, 2023

zephraph Dec 19, 2023

david-crespo Dec 19, 2023

zephraph Dec 19, 2023

zephraph Dec 19, 2023

david-crespo Dec 19, 2023

zephraph left a comment

IP pools data model and API rework #4261

IP pools data model and API rework #4261

Conversation

david-crespo commented Oct 11, 2023 • edited Loading

Background

Changes in this PR

No such thing as fleet-scoped pool, only silo

API

DB

Default IP pool logic

Tests and test helpers

What do customers have to do when they upgrade?

Choose a reason for hiding this comment

david-crespo commented Oct 20, 2023

Choose a reason for hiding this comment

david-crespo Dec 14, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zephraph left a comment

Choose a reason for hiding this comment

david-crespo commented Oct 11, 2023 •

edited

Loading

david-crespo Dec 14, 2023 •

edited

Loading