Need to support live attach/detach of floating IPs from instances #4630

Closed
FelixMcFelix opened this issue Dec 6, 2023 · 0 comments · Fixed by #4694
@FelixMcFelix

#4559 defines the API and plumbing for floating IPs to be used by instances. However, they can only be attached and detached from instances during their creation and deletion, respectively. We need to be able to perform these modifications on running instances.

At a high level (based on my current prototype), I think this needs:

  • New endpoints on nexus, likely modeled on disk-attach.
  • New endpoints on sled-agent to add/remove an external IP.
  • Additional sagas to coordinate nexus and sled-agent operations.

There's a question of how this will interact with migration, which Greg, Mike, Trey, Luqman, and I discussed earlier this week. In an ideal world, I think we would send sled-agent messages to both the source and target sleds (in addition to ensuring that new NAT bindings are propagated out to DPD for the currently active VM). For an initial implementation, we are probably best served by returning a 503 with a Retry-After header if a migration is in progress, while possibly blocking new migrations using a sentinel all-zeroes migration_id.
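
As a rough illustration of the initial approach described above, the sketch below models the migration slot as an optional ID, with an all-zeroes value acting as the blocking sentinel. Every name here (`InstanceRecord`, `GateError`, `begin_ip_change`, the 5-second retry hint) is hypothetical and chosen for the example; it is not taken from Nexus.

```rust
// Hypothetical types for illustration only; none of these names come from
// the actual Nexus code.

/// Sentinel "all-zeroes" migration ID: reserves the migration slot without
/// naming a real migration.
const BLOCKING_MIGRATION_ID: u128 = 0;

/// Arbitrary retry hint chosen for this sketch.
const RETRY_AFTER_SECS: u32 = 5;

#[derive(Debug, PartialEq)]
enum GateError {
    /// Would map to HTTP 503 with a `Retry-After` header (in seconds).
    ServiceUnavailable { retry_after_secs: u32 },
}

#[derive(Debug, Default)]
struct InstanceRecord {
    /// `None` means no migration is in progress or reserved.
    migration_id: Option<u128>,
}

impl InstanceRecord {
    /// Reserve the instance for an external IP change, or ask the caller
    /// to retry later if the migration slot is already occupied.
    fn begin_ip_change(&mut self) -> Result<(), GateError> {
        match self.migration_id {
            None => {
                self.migration_id = Some(BLOCKING_MIGRATION_ID);
                Ok(())
            }
            Some(_) => Err(GateError::ServiceUnavailable {
                retry_after_secs: RETRY_AFTER_SECS,
            }),
        }
    }

    /// Release the reservation; a real migration ID is left untouched.
    fn end_ip_change(&mut self) {
        if self.migration_id == Some(BLOCKING_MIGRATION_ID) {
            self.migration_id = None;
        }
    }

    /// A real migration may only start while the slot is empty.
    fn begin_migration(&mut self, id: u128) -> Result<(), GateError> {
        match self.migration_id {
            None => {
                self.migration_id = Some(id);
                Ok(())
            }
            Some(_) => Err(GateError::ServiceUnavailable {
                retry_after_secs: RETRY_AFTER_SECS,
            }),
        }
    }
}
```

In a real deployment the check-and-set would have to be a single conditional database update rather than an in-memory mutation; the `&mut self` borrow only stands in for that atomicity.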

@FelixMcFelix FelixMcFelix added enhancement New feature or request. api Related to the API. networking Related to the networking. Sled Agent Related to the Per-Sled Configuration and Management labels Dec 6, 2023
@FelixMcFelix FelixMcFelix self-assigned this Dec 6, 2023
@FelixMcFelix FelixMcFelix added this to the 6 milestone Dec 18, 2023
FelixMcFelix added a commit that referenced this issue Jan 24, 2024
This PR adds new endpoints to attach and detach external IPs to/from an
individual instance at runtime, when instances are either stopped or
started. These new endpoints are:

* POST `/v1/floating-ips/{floating_ip}/attach`
* POST `/v1/floating-ips/{floating_ip}/detach`
* POST `/v1/instances/{instance}/external-ips/ephemeral`
* DELETE `/v1/instances/{instance}/external-ips/ephemeral`

These follow and enforce the same rules as external IPs registered
during instance creation: at most one ephemeral IP, and at most 32
external IPs total.
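
The two limits above can be sketched as a small validation function. This is only an illustration: `IpKind`, `AttachError`, and `check_attach` are hypothetical names, and the sketch assumes every entry in the slice counts toward the total of 32, which may not match how the real implementation counts addresses.

```rust
// Hypothetical names for illustration; only the limits (1 and 32) come
// from the text above.
const MAX_EXTERNAL_IPS_PER_INSTANCE: usize = 32;
const MAX_EPHEMERAL_IPS_PER_INSTANCE: usize = 1;

#[derive(Debug, Clone, Copy, PartialEq)]
enum IpKind {
    Ephemeral,
    Floating,
}

#[derive(Debug, PartialEq)]
enum AttachError {
    TooManyEphemeral,
    TooManyExternal,
}

/// Check whether `new` may be attached to an instance that already holds
/// the external IPs listed in `existing`.
fn check_attach(existing: &[IpKind], new: IpKind) -> Result<(), AttachError> {
    if existing.len() >= MAX_EXTERNAL_IPS_PER_INSTANCE {
        return Err(AttachError::TooManyExternal);
    }
    let ephemeral = existing
        .iter()
        .filter(|kind| **kind == IpKind::Ephemeral)
        .count();
    if new == IpKind::Ephemeral && ephemeral >= MAX_EPHEMERAL_IPS_PER_INSTANCE {
        return Err(AttachError::TooManyEphemeral);
    }
    Ok(())
}
```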

`/v1/floating-ips/{floating_ip}/attach` includes a `kind` field to
account for future API resources which a FIP may be bound to -- such as
internet gateways, load balancers, and services.
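
One way to picture the role of `kind` is as an enum with a single variant today and room for more later. The sketch below is an assumption-laden illustration: the `ParentKind` type and the `"instance"` string are invented for the example and are not confirmed as the actual wire format.

```rust
use std::str::FromStr;

// Hypothetical enum: only instances are supported today, and further
// variants could be added without changing the endpoint path.
#[derive(Debug, PartialEq)]
enum ParentKind {
    Instance,
}

impl FromStr for ParentKind {
    type Err = String;

    /// Parse the (assumed) string form of the `kind` field.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "instance" => Ok(ParentKind::Instance),
            other => Err(format!("unsupported parent kind: {other}")),
        }
    }
}
```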

## Interaction with other instance lifecycle changes and sagas

Both external IP modify sagas begin with an atomic update to external IP
attach state conditioned on
$\mathit{state} \in \{\mathit{started}, \mathit{stopped}\}$. As a
result, we know that an external IP saga can only ever start before any
other instance state change occurs. We then only need to think about how
these other sagas/events must behave when called *during* an
attach/detach, keeping in mind that these are worst-case orderings:
attach/detach are likely to complete quickly.
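
The conditional update that opens each saga can be sketched as a guarded state transition. The types below (`InstanceState`, `IpAttachState`, `Row`, `begin_attach`) are hypothetical stand-ins; in practice this would presumably be a single database `UPDATE` with a `WHERE` clause, and the `&mut` borrow here merely models that atomicity.

```rust
// Hypothetical state enums for illustration; the variant sets are
// assumptions beyond `started`/`stopped` and `attaching`/`detaching`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InstanceState {
    Starting,
    Started,
    Stopping,
    Stopped,
    Migrating,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum IpAttachState {
    Detached,
    Attaching,
    Attached,
    Detaching,
}

struct Row {
    instance_state: InstanceState,
    ip_state: IpAttachState,
}

/// Move the IP from `Detached` to `Attaching`, but only if the instance is
/// `Started` or `Stopped`. Returns whether the transition happened.
fn begin_attach(row: &mut Row) -> bool {
    let instance_ok = matches!(
        row.instance_state,
        InstanceState::Started | InstanceState::Stopped
    );
    if instance_ok && row.ip_state == IpAttachState::Detached {
        row.ip_state = IpAttachState::Attaching;
        true
    } else {
        false
    }
}
```

Because the check and the write happen together, a second attach attempt, or one against an instance in any other state, leaves the row untouched.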

### Instance start & migrate

Both of these sagas alter an instance's functional sled ID, which
controls whether NAT entry insertion and OPTE port state updates are
performed. If an IP attach/detach is incomplete when either saga reaches
`instance_ensure_dpd_config` or `instance_ensure_registered` (i.e., any
IP associated with the target instance is in attaching/detaching state),
the start/migrate will unwind with an HTTP 503.

Generally, neither should unwind in practice since IP attach/detach are
fast operations -- particularly when the instance was previously stopped.
This is used solely to guarantee that only one saga is accessing a given
external IP at a time, and that the update target remains unchanged.
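
The guard described in this section might look roughly like the following. `IpAttachState`, `Unwind`, and `check_ips_settled` are hypothetical names used for the sketch; only the attaching/detaching states and the 503 status come from the text.

```rust
// Hypothetical types for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum IpAttachState {
    Detached,
    Attaching,
    Attached,
    Detaching,
}

/// Signals that the calling saga should unwind with the given HTTP status.
#[derive(Debug, PartialEq)]
struct Unwind {
    http_status: u16,
}

/// Called by start/migrate before touching NAT or OPTE state: fail with a
/// 503 if any of the instance's IPs is still mid-attach or mid-detach.
fn check_ips_settled(ips: &[IpAttachState]) -> Result<(), Unwind> {
    let busy = ips.iter().any(|state| {
        matches!(state, IpAttachState::Attaching | IpAttachState::Detaching)
    });
    if busy {
        Err(Unwind { http_status: 503 })
    } else {
        Ok(())
    }
}
```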

### Instance stop & delete

These operations are either not sagaized (stop), or cannot unwind
(delete), and so we cannot block them using IP attach state. IP
attach/detach will unwind if a given sled-agent is no longer responsible
for an instance. Instance delete will force-detach IP addresses bound to
an instance, and if this is seen then IP attach will deliberately unwind
to potentially clean up NAT state. OPTE/DPD undo operations are
best-effort in such a case to prevent stuck sagas.

Instance stop and IP attach may interleave such that the latter adds
additional NAT entries after other network state is cleared. Because we
cannot unwind in this case, `instance_ensure_dpd_config` will now
attempt to remove leftover conflicting RPW entries if they are detected,
since we know they are a deviation from intended state.
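
A minimal sketch of this "remove the conflicting entry, then insert" behaviour is shown below, using an in-memory map in place of the real NAT table. The key and target layouts (`NatKey`, `NatTarget`) and both function names are assumptions made for the example.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

// Hypothetical layouts: external address plus port range as the key, and
// an opaque sled/instance pair as the target.
type NatKey = (Ipv4Addr, u16, u16);

#[derive(Debug, Clone, PartialEq)]
struct NatTarget {
    sled: u32,
    instance: u32,
}

/// Plain insert: fails with the existing target if the key is already
/// bound to something else.
fn try_insert(
    table: &mut HashMap<NatKey, NatTarget>,
    key: NatKey,
    want: &NatTarget,
) -> Result<(), NatTarget> {
    match table.get(&key) {
        Some(existing) if existing != want => Err(existing.clone()),
        _ => {
            table.insert(key, want.clone());
            Ok(())
        }
    }
}

/// Authoritative ensure: if the first insert conflicts, treat the existing
/// entry as stale, remove it, and insert again. Returns the stale entry.
fn ensure_nat_entry(
    table: &mut HashMap<NatKey, NatTarget>,
    key: NatKey,
    want: &NatTarget,
) -> Option<NatTarget> {
    match try_insert(table, key, want) {
        Ok(()) => None,
        Err(stale) => {
            table.remove(&key);
            try_insert(table, key, want)
                .expect("insert cannot conflict after removal");
            Some(stale)
        }
    }
}
```

The ensure is idempotent for a matching entry and only overwrites when the stored target differs from the intended one.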

## Additional/supporting changes

* Pool/floating IP specifiers in instance create now take `NameOrId`,
parameter names changed to match.
* External IP create/bind in instance create no longer double-resolves
name on saga unwind.
* `views::ExternalIp` can now contain `FloatingIp` body.
* DPD NAT insert/remove functions now perform single rule update via ID
instead of index into the EIP list -- index-based was unstable under
live addition/removal.
* NAT RPW ensure is now more authoritative, and will remove conflicting
entries if an initial insert fails.
* Pool `NameOrId` resolution for floating IP allocation pulled up from
`Datastore` into `Nexus`.
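
To illustrate why the list above moves NAT updates from index-based to ID-based lookup, the sketch below removes an entry by a stable ID. `ExternalIp` and `remove_by_id` are hypothetical names, and the addresses are documentation-range placeholders.

```rust
// Hypothetical record for illustration.
#[derive(Debug, Clone, PartialEq)]
struct ExternalIp {
    id: u32,
    addr: &'static str,
}

/// Remove the entry with the given ID, wherever it currently sits in the
/// list. A saved index would go stale as soon as an earlier entry is
/// attached or detached; the ID does not.
fn remove_by_id(ips: &mut Vec<ExternalIp>, id: u32) -> Option<ExternalIp> {
    let pos = ips.iter().position(|ip| ip.id == id)?;
    Some(ips.remove(pos))
}
```

For example, with three IPs attached, the third sits at index 2; after the first is detached, index 2 no longer exists, while a lookup by ID still finds the intended entry.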

---

Closes #4630 and #4628.