background task for service zone nat #4857

internet-diglett · 2024-01-20T02:33:36Z

Currently the logic for configuring NAT for service zones is deeply nested and crosses sled-agent http API boundaries. The cleanest way to deliver eventual consistency for service zone nat entries was to pull the zone information from inventory and use that to generate nat entries to reconcile against the ipv4_nat_entry table. This covers us in the following scenarios:

RSS:

User provides configuration to RSS
RSS process ultimately creates a sled plan and service plan
Application of service plan by sled-agents creates zones
zone create makes direct calls to dendrite to configure NAT (it is the only way it can be done at this time)
eventually the Nexus zones are launched and handoff to Nexus is complete
inventory task is run, recording zone locations to db
service zone nat background task reads inventory from db and uses the data to generate records for ipv4_nat_entry table, then triggers dendrite sync.
sync is ultimately a noop because nat entries already exist in dendrite (dendrite operations are idempotent)

Cold boot:

sled-agents create switch zones if they are managing a scrimlet, and subsequently create zones written to their ledgers. This may result in direct calls to dendrite.
Once nexus is back up, inventory will resume being collected
service zone nat background task will read inventory from db to reconcile entries in ipv4_nat_entry table and then trigger dendrite sync.
If nat is out of date on dendrite, it will be updated on trigger.

Dendrite crash

If dendrite crashes and restarts, it will immediately contact Nexus for re-sync (pre-existing logic from earlier NAT RPW work)
service zone and instance nat entries are now present in rpw table, so all nat entries will be restored

Migration / Relocation of service zone

New zone gets created on a sled in the rack. Direct call to dendrite will be made (it uses the same logic as pre-nexus to create zone).
Inventory task will record new location of service zone
Service zone nat background task will use inventory to update table, adding and removing the necessary nat entries and triggering a dendrite update

Considerations

Because this relies on data from the inventory task which runs on a periodic timer (600s), and because this task also runs on a periodic timer (30s), there may be some latency for picking up changes. A few potential avenues for improvement:

Plumb additional logic into service zone nat configuration that enables direct updates to the ipv4_nat_entry table once nexus is online. Of note, this would further bifurcate the logic of pre-nexus and post-nexus state management. At this moment, it seems that this is the most painful approach. An argument can be made that we ultimately should be lifting the nat configuration logic out of the service zone creation instead.
Decrease the timer for the inventory task. This is the simplest change, however this would result in more frequent collection, increasing overhead. I do not know how much this would increase overhead. Maybe it is negligible.
Plumb in the ability to trigger the inventory collection task for interesting control plane events. This would allow us to keep the relatively infrequent timing intervals but allow us to refresh on-demand when needed.

I took this for a spin in a4x2. After the system came up, I rebooted scrimlet0. When it came back up most of the NAT entries came back with it, and I was able to hit the Nexus API successfully without any manual prodding 🎉 . The NAT entry for external DNS was missing however. I noted where I believe the issue is in the code review.

nexus/db-queries/src/db/datastore/ip_pool.rs

nexus/db-queries/src/db/datastore/ipv4_nat_entry.rs

nexus/src/app/background/init.rs

nexus/src/app/background/sync_service_zone_nat.rs

rcgoodfellow

LGTM, thanks Levon! Let's just get this rebased on main and run through another round of testing before we merge.

smklein · 2024-01-22T18:46:07Z

nexus/src/app/background/sync_service_zone_nat.rs

+            }
+
+            // reconcile service zone nat entries
+            let result = match self.datastore.ipv4_nat_sync_service_zones(opctx, &ipv4_nat_values).await {


We discussed this in chat, but this whole flow has some implications for our calls to ensure_nat_entry / ensure_ipv4_nat_entry:

If you're writing code in Nexus to make an instance with a NAT entry, you must call these functions explicitly. No one else will add your NAT entry to the DB!

If you're writing code in Nexus to make a service with a NAT entry, you should not call these functions, and should rely on the "Nexus -> Sled Agent -> Inventory -> Service Zone NAT RPW" pathway" to ensure that these entries get populated. If you tried to add the entry to the DB, you'd risk a race condition between "collection via inventory / RPW" and "explicitly inserting the record".

I think that's okay, but we should add documentation around the ensure_nat_entry function to make that distinction extremely clear! I think it's kinda subtle that the same table is treated quite differently depending on the purpose of the NAT entry.

Also, for posterity, I created this drawing to map out the data flow here.

My "TLDR" of the above is that I wanted to ensure we avoided having loops in this data flow graph:

(Source: https://docs.google.com/drawings/d/19MkoKsgZ8vuPng6uKaCF1hG2hiHI9735jv9ThLgajVM/edit?usp=sharing )

I think this is great feedback, adding some documentation to it now. I think one day we may get to a place where we can lift the NAT logic out of the service zone creation functionality in sled-agent, which could help us move towards a more consistent pattern of interacting with this table.

internet-diglett · 2024-01-23T22:19:04Z

I've added a stopgap that was discussed in chat, we are now checking to ensure that we have a minimum number of each service zone type that we're interested in before reconciling the entries.

smklein · 2024-01-23T22:23:06Z

nexus/src/app/background/sync_service_zone_nat.rs

+// Minumum number of nexus zones that should be present in a valid
+// set of service zone nat configurations.
+const MIN_NEXUS_COUNT: usize = 3;


I'm not sure we should do this -- it's possible that one of the three Nexus instances have gone away, and we'd still want our NAT entries to be up-to-date.

This is a reasonable goal to try to achieve in the "graceful shutdown" case -- if we want three Nexus instances to run, we should provision a 4th one before removing one of the original 3 -- but if a sled is yanked from the rack, we do have two Nexus instances. That's just a truth! We should aspire to have the blueprint creator create a new Nexus service, provision it, and get the NAT entries populated, but it's very possible to run under our redundancy expectations. That's why we use redundancy!

Would setting all of the minimums to 1 be a reasonable compromise? If we don't have at least 1 Nexus, NTP, and ExternalDns zone that can be found in inventory, I would think we have more serious problems, no?

Chatting with some folks in the update sync today, it sounds like the inventory system -- which gathers the inventory in a "collection", which contains a lot of objects -- may, at any time, simply "not report a sled" within that collection. Theoretically, that means we could see an "inventory collection" that doesn't contain any sleds which contain Nexus, NTP, external DNS, etc.

This has some weird implications for depending on the inventory system as the source-of-truth here, but without a full implementation of blueprints, I acknowledge that there isn't a great alternative yet.

So: "Could we set all the minimums to 1?" That would stop us from propagating the state of the inventory system if we saw a blip that eliminated all of these critical zones. So in that sense, it's arguably better than not doing the check! I also think it avoids "breaking" NAT when we're under-provisioned, as I mentioned in the case below.

If we're on the same page that, eventually, the right source-of-truth is "info from the blueprint, somehow", I think this is a reasonable intermediate step.

Yeah, it's my understanding that blueprints are the future, inventory is what we're using for now.

(also, I appreciate you tolerating all this churn. Getting RPWs + NAT propagation right is tricky, and having this portion of the system be "not-totally-ready" does make this extra hairy. Thank you for pushing through regardless)

No worries! I know this is critical so I appreciate the extra eyes and feedback!

smklein · 2024-01-23T22:24:39Z

nexus/src/app/background/sync_service_zone_nat.rs

+                });
+            }
+
+            if nexus_count < MIN_NEXUS_COUNT {


Example where this could go awry:

Sled pulled (or crashes, or catches fire) we mark it as "Removed"

We identify that the services on the sled are not present

Inventory reports that two Nexus instances are running

2 < 3

So we fail the RPW here, and never update any NAT entries until redundancy is fully restored?

Yes, that is correct. Will we not attempt to move the service zone to a new sled in this situation?

As soon as we implement service re-provisioning, this would happen, but I think the scope of "new service provisions" is going to start with only Crucible and NTP.

Until that is fully implemented, this check would just stop NAT propagation

rcgoodfellow · 2024-01-24T08:25:20Z

I took this for another spin in a4x2 after the rebase. Setting aside for the moment the general startup issues I had on cold boot I did notice some strange behavior that seems specific to NAT logic.

I started my test by letting a4x2 come all the way up to where I could reach the Omicron external API. I then rebooted scrimlet0. This worked out well. When the scrimlet came back up all the way, all the expected NAT entries were there and I could hit the API just fine.

Next, I rebooted scrimlet1. When this scrimlet came back up, it appeared to get stuck in time synchronization. I checked chronyc tracking in the NTP zone, and it was not tracking time.cloudflare.com like it should be. I then realized it had no connectivity, e.g

root@oxz_ntp_0634836d:~# host time.cloudflare.com
;; communications error to 1.1.1.1#53: timed out
;; communications error to 1.1.1.1#53: timed out
;; communications error to 9.9.9.9#53: timed out
;; no servers could be reached

Looking at the sled agent logs, things were indeed stuck on time sync.

08:13:42.465Z WARN SledAgent (ServiceManager): Time not yet synchronized (retrying in 1.440876112s)
    error = "No sync TimeSync { sync: false, ref_id: 2139029761, ip_addr: ::, stratum: 10, ref_time: 1706083959.2016006, correction: 0.0 }"

Looking at the NAT entries on the switches. Scrimlet 0 went from 7 to 2 (due to the reboot of scrimlet 1), and scrimlet 1 only had 3 entires. Looking at the OPTE entries on scrimlet 1 where the NTP zone with no connectivity is, we see

 /opt/oxide/opte/bin/opteadm list-ports
LINK                             MAC ADDRESS              IPv4 ADDRESS     EPHEMERAL IPv4   FLOATING IPv4    IPv6 ADDRESS                             EXTERNAL IPv6                            FLOATING IPv6                            STATE
opte0                            A8:40:25:FF:98:EC        172.30.3.5       None                              None                                     None                                     None                                     running

This MAC address has a NAT entry on scrimlet 1, but not scrimlet 0, which I suspect is the connectivity problem. But the question is why is this NAT entry missing from scrimlet 0.

We weren't filtering the soft-deleted entries when calculating the diff between entries to add and entries to delete. This caused us to skip re-adding entries when an exact match was previously soft-deleted.

rcgoodfellow · 2024-01-26T07:43:26Z

I tested this after d3501dc, and things look much better. I restarted both scrimlets one after the other, and after each restart, all the NAT entries came back on the scrimlet that had been restarted.

Overview --- This PR ensures the rest of our `dpd` configuration is covered by a RPW to help recover state in the event of `dendrite` crashing, the switch zone restarting / being replaced, or the sled restarting. This is accomplished via a background task in Nexus that periodically ensures `dpd` is up to date with the tables in Nexus. The tradeoffs of this design is that we don't track versioning and reconcile the entire state every time, but since the actual number of ports will never be that high (relative to something like NAT entries) the tradeoff of less efficiency for much greater simplicity seems to make sense today, and it requires much less rework in Nexus and Dendrite should we choose to replace this strategy down the road. Tasks --- - [x] Ensure that Service Zones configured during rss, cold boot, and nexus have their NAT entries added to the NAT RPW table (extracted into #4857) - [x] Create background task that periodically reconciles switch port configuration for dendrite instances - [x] Move switch zone uplink SMF property updates to RPW - [x] Move routing updates (via mg) to RPW - [x] Static Routing - [x] BGP - [x] Move bootstore updates to RPW - [x] Move loopback address management to RPW - [x] Move Nexus-side switch zone service on-demand lookups as outlined in #5092 Verifications Performed --- - [x] Basic instance deployment - [x] Loopback Address Creation - [x] BGP configuration (a4x2) - [ ] BGP configuration modification (a4x2) - [x] Static routing - [x] Static routing configuration modification Related --- Closes #4715 Closes #4650 Depends on oxidecomputer/dendrite#838 --------- Co-authored-by: Levon Tarver <[email protected]>

internet-diglett added 6 commits January 16, 2024 23:37

update instructions for deploying dev env

d13a4ea

WIP: background task for syncronizing NAT information for service zones

112765e

WIP: ensure service zone nat entries are tracked by RPW

085b0db

wrap up not TODOs

402f513

revert changes to inventory_collection.rs

56ba5be

add config for nat sync task

c794b1a

internet-diglett mentioned this pull request Jan 20, 2024

Rpws for all networking #4822

Merged

15 tasks

rcgoodfellow self-requested a review January 20, 2024 03:13

internet-diglett changed the title ~~Rpw for service zone nat~~ background task for service zone nat Jan 20, 2024

rcgoodfellow reviewed Jan 20, 2024

View reviewed changes

PR fixes

fe1e3e8

rcgoodfellow approved these changes Jan 21, 2024

View reviewed changes

internet-diglett added 2 commits January 22, 2024 13:02

Merge branch 'main' into rpw-for-service-zone-nat

337bb23

update schema

929ba7c

smklein reviewed Jan 22, 2024

View reviewed changes

internet-diglett added 2 commits January 22, 2024 19:37

add additional documentation

d62b814

create tracking issue for TODO

9eb1111

internet-diglett enabled auto-merge (squash) January 22, 2024 21:39

internet-diglett disabled auto-merge January 22, 2024 21:39

internet-diglett added 3 commits January 23, 2024 19:16

set mandatory minimums for service zone nat entries

b92b0bd

adjust minimum ntp count so job will actually run

1c6e60e

remove comment, issue is WIP

22f36d6

smklein reviewed Jan 23, 2024

View reviewed changes

internet-diglett added 3 commits January 23, 2024 22:59

adjust minimum count for all services to 1

3ebf9df

bump schema version

5811317

Merge branch 'main' into rpw-for-service-zone-nat

28f61b0

internet-diglett enabled auto-merge (squash) January 24, 2024 03:52

internet-diglett disabled auto-merge January 24, 2024 03:55

BUGFIX: nat entries missing after sled restart

d3501dc

We weren't filtering the soft-deleted entries when calculating the diff between entries to add and entries to delete. This caused us to skip re-adding entries when an exact match was previously soft-deleted.

Merge branch 'main' into rpw-for-service-zone-nat

a193845

internet-diglett enabled auto-merge (squash) January 26, 2024 21:36

internet-diglett added 2 commits January 26, 2024 22:00

Merge branch 'main' into rpw-for-service-zone-nat

8c6a23e

bump schema

6cd49f7

internet-diglett merged commit 5215d85 into main Jan 26, 2024
20 checks passed

internet-diglett deleted the rpw-for-service-zone-nat branch January 26, 2024 23:48

internet-diglett mentioned this pull request Feb 1, 2024

Be more resilient during service provisioning. #2933

Closed

internet-diglett mentioned this pull request Mar 5, 2024

Cold boot should handle scrimlet sled-agent restarts #4592

Closed

rcgoodfellow mentioned this pull request Jul 12, 2024

NAT entries go missing if switch zone is restarted in early networking. #6068

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

background task for service zone nat #4857

background task for service zone nat #4857

internet-diglett commented Jan 20, 2024 •

edited

Loading

rcgoodfellow left a comment •

edited

Loading

rcgoodfellow left a comment

smklein Jan 22, 2024

smklein Jan 22, 2024 •

edited

Loading

internet-diglett Jan 22, 2024

internet-diglett commented Jan 23, 2024

smklein Jan 23, 2024

internet-diglett Jan 23, 2024

smklein Jan 23, 2024

internet-diglett Jan 23, 2024

smklein Jan 23, 2024

internet-diglett Jan 23, 2024

smklein Jan 23, 2024

internet-diglett Jan 23, 2024

smklein Jan 23, 2024

rcgoodfellow commented Jan 24, 2024

rcgoodfellow commented Jan 26, 2024

background task for service zone nat #4857

background task for service zone nat #4857

Conversation

internet-diglett commented Jan 20, 2024 • edited Loading

RSS:

Cold boot:

Dendrite crash

Migration / Relocation of service zone

Considerations

Related

rcgoodfellow left a comment • edited Loading

Choose a reason for hiding this comment

rcgoodfellow left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smklein Jan 22, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

internet-diglett commented Jan 23, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rcgoodfellow commented Jan 24, 2024

rcgoodfellow commented Jan 26, 2024

internet-diglett commented Jan 20, 2024 •

edited

Loading

rcgoodfellow left a comment •

edited

Loading

smklein Jan 22, 2024 •

edited

Loading