background task for service zone nat #4857

Merged: 21 commits into main, Jan 26, 2024

Conversation

@internet-diglett (Contributor) commented Jan 20, 2024

Currently, the logic for configuring NAT for service zones is deeply nested and crosses sled-agent HTTP API boundaries. The cleanest way to deliver eventual consistency for service zone NAT entries was to pull the zone information from inventory and use it to generate NAT entries to reconcile against the ipv4_nat_entry table. This covers us in the following scenarios:

RSS:

  • User provides configuration to RSS
  • RSS process ultimately creates a sled plan and service plan
  • Application of the service plan by sled-agents creates zones
  • Zone creation makes direct calls to dendrite to configure NAT (this is the only way it can be done at this time)
  • Eventually the Nexus zones are launched and handoff to Nexus is complete
  • The inventory task runs, recording zone locations to the DB
  • The service zone NAT background task reads inventory from the DB, uses the data to generate records for the ipv4_nat_entry table, then triggers a dendrite sync
  • The sync is ultimately a no-op because the NAT entries already exist in dendrite (dendrite operations are idempotent)

Cold boot:

  • sled-agents create switch zones if they are managing a scrimlet, and subsequently create the zones recorded in their ledgers. This may result in direct calls to dendrite.
  • Once Nexus is back up, inventory collection resumes
  • The service zone NAT background task reads inventory from the DB to reconcile entries in the ipv4_nat_entry table and then triggers a dendrite sync
  • If the NAT state in dendrite is out of date, it is updated when the sync is triggered

Dendrite crash:

  • If dendrite crashes and restarts, it will immediately contact Nexus for a re-sync (pre-existing logic from earlier NAT RPW work)
  • Service zone and instance NAT entries are now both present in the RPW table, so all NAT entries will be restored

Migration / relocation of a service zone:

  • A new zone gets created on a sled in the rack. A direct call to dendrite is made (zone creation uses the same logic as in the pre-Nexus phase).
  • The inventory task records the new location of the service zone
  • The service zone NAT background task uses inventory to update the table, adding and removing the necessary NAT entries and triggering a dendrite update (sketched below)
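
Putting the scenarios together, one activation of the background task can be pictured roughly as follows. The ipv4_nat_sync_service_zones datastore call is the one reviewed below; the other helper names (latest_collection, nat_entries_from_inventory, trigger_dendrite_sync) are placeholders for illustration, not the actual functions.

// Rough sketch of one activation of the service zone NAT background
// task (placeholder helper and type names; error handling trimmed).
async fn reconcile_service_zone_nat(
    datastore: &DataStore,
    opctx: &OpContext,
) -> Result<(), Error> {
    // 1. Read the most recent inventory collection from the database.
    let collection = datastore.latest_collection(opctx).await?;

    // 2. Derive the desired NAT entries for the service zones found in
    //    that collection (Nexus, NTP, external DNS).
    let ipv4_nat_values = nat_entries_from_inventory(&collection)?;

    // 3. Reconcile the ipv4_nat_entry table against the desired set:
    //    add missing entries, soft-delete stale ones.
    datastore
        .ipv4_nat_sync_service_zones(opctx, &ipv4_nat_values)
        .await?;

    // 4. Nudge dendrite to re-sync; dendrite's operations are
    //    idempotent, so this is a no-op when the dataplane is already
    //    up to date.
    trigger_dendrite_sync().await?;
    Ok(())
}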

Considerations

Because this relies on data from the inventory task, which runs on a periodic timer (600s), and because this task itself also runs on a periodic timer (30s), there may be some latency in picking up changes. A few potential avenues for improvement:

  • Plumb additional logic into service zone NAT configuration that enables direct updates to the ipv4_nat_entry table once Nexus is online. Of note, this would further bifurcate the logic of pre-Nexus and post-Nexus state management. At this moment, this seems to be the most painful approach. An argument can be made that we should ultimately lift the NAT configuration logic out of service zone creation instead.

  • Decrease the timer for the inventory task. This is the simplest change; however, it would result in more frequent collection and therefore more overhead. I do not know how much overhead this would add; it may be negligible.

  • Plumb in the ability to trigger the inventory collection task for interesting control plane events. This would let us keep the relatively infrequent timer intervals while still allowing an on-demand refresh when needed (see the sketch below).
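
For that last option, one generic way to combine a periodic timer with an on-demand trigger is a tokio watch channel. This is a minimal sketch under that assumption, not the actual Nexus background-task plumbing:

use std::time::Duration;
use tokio::sync::watch;

/// Run `collect` on a periodic timer, and also whenever an activation
/// signal arrives (e.g. after an interesting control plane event).
/// Sketch only; names and structure are illustrative.
async fn run_inventory_task<F, Fut>(
    mut activate_rx: watch::Receiver<()>,
    period: Duration,
    mut collect: F,
) where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = ()>,
{
    loop {
        tokio::select! {
            // Periodic activation (600s today).
            _ = tokio::time::sleep(period) => {}
            // On-demand activation requested elsewhere in Nexus.
            res = activate_rx.changed() => {
                if res.is_err() {
                    return; // all senders dropped; shut down
                }
            }
        }
        collect().await;
    }
}

A caller holding the corresponding watch::Sender could then send a value after, say, relocating a service zone, instead of waiting up to 600s for the next collection.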

Related

Closes #4650
Extracted from #4822

@internet-diglett mentioned this pull request Jan 20, 2024
@rcgoodfellow self-requested a review January 20, 2024
@internet-diglett changed the title from "Rpw for service zone nat" to "background task for service zone nat" Jan 20, 2024
@rcgoodfellow (Contributor) left a comment

Thanks Levon!

The approach of using inventory as the source of truth for reconciling the service entries seems solid.

I took this for a spin in a4x2. After the system came up, I rebooted scrimlet0. When it came back up, most of the NAT entries came back with it, and I was able to hit the Nexus API successfully without any manual prodding 🎉. The NAT entry for external DNS was missing, however. I noted where I believe the issue is in the code review.

Review comment threads (marked outdated/resolved) on:

  • nexus/db-queries/src/db/datastore/ip_pool.rs
  • nexus/db-queries/src/db/datastore/ipv4_nat_entry.rs
  • nexus/src/app/background/init.rs
  • nexus/src/app/background/sync_service_zone_nat.rs
@rcgoodfellow (Contributor) left a comment

LGTM, thanks Levon! Let's just get this rebased on main and run through another round of testing before we merge.

}

// reconcile service zone nat entries
let result = match self.datastore.ipv4_nat_sync_service_zones(opctx, &ipv4_nat_values).await {
Collaborator left a comment:

We discussed this in chat, but this whole flow has some implications for our calls to ensure_nat_entry / ensure_ipv4_nat_entry:

  • If you're writing code in Nexus to make an instance with a NAT entry, you must call these functions explicitly. No one else will add your NAT entry to the DB!
  • If you're writing code in Nexus to make a service with a NAT entry, you should not call these functions, and should rely on the "Nexus -> Sled Agent -> Inventory -> Service Zone NAT RPW" pathway to ensure that these entries get populated. If you tried to add the entry to the DB, you'd risk a race condition between "collection via inventory / RPW" and "explicitly inserting the record".

I think that's okay, but we should add documentation around the ensure_nat_entry function to make that distinction extremely clear! I think it's kinda subtle that the same table is treated quite differently depending on the purpose of the NAT entry.
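
Something along these lines in the datastore would make that distinction explicit. The signature here is a guess; only the doc comment is the point of the sketch:

/// Ensure a NAT entry exists in the `ipv4_nat_entry` table.
///
/// NOTE: whether to call this depends on what the entry is for:
///
/// * Instance NAT entries: Nexus code that creates an instance with a
///   NAT entry must call this explicitly; nothing else will insert the
///   record on its behalf.
///
/// * Service zone NAT entries: do NOT call this directly. These entries
///   are populated via the "Nexus -> Sled Agent -> Inventory -> Service
///   Zone NAT RPW" pathway, and inserting them here as well risks racing
///   with that reconciliation.
pub async fn ensure_ipv4_nat_entry(
    &self,
    opctx: &OpContext,
    nat_entry: Ipv4NatValues,
) -> CreateResult<Ipv4NatEntry> {
    // ... existing implementation elided in this sketch ...
    todo!()
}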

@smklein (Collaborator) commented Jan 22, 2024

Also, for posterity, I created this drawing to map out the data flow here.

My "TLDR" of the above is that I wanted to ensure we avoided having loops in this data flow graph:

[diagram: NAT entry data flow]

(Source: https://docs.google.com/drawings/d/19MkoKsgZ8vuPng6uKaCF1hG2hiHI9735jv9ThLgajVM/edit?usp=sharing )

@internet-diglett (Contributor, Author) replied:

I think this is great feedback, adding some documentation to it now. I think one day we may get to a place where we can lift the NAT logic out of the service zone creation functionality in sled-agent, which could help us move towards a more consistent pattern of interacting with this table.

@internet-diglett enabled auto-merge (squash) January 22, 2024
@internet-diglett (Contributor, Author) commented:

I've added a stopgap that was discussed in chat: we now check that a minimum number of each service zone type we're interested in is present before reconciling the entries.

Comment on lines 28 to 30
// Minumum number of nexus zones that should be present in a valid
// set of service zone nat configurations.
const MIN_NEXUS_COUNT: usize = 3;
Collaborator left a comment:

I'm not sure we should do this -- it's possible that one of the three Nexus instances has gone away, and we'd still want our NAT entries to be up-to-date.

This is a reasonable goal to try to achieve in the "graceful shutdown" case -- if we want three Nexus instances to run, we should provision a 4th one before removing one of the original 3 -- but if a sled is yanked from the rack, we do have two Nexus instances. That's just a truth! We should aspire to have the blueprint creator create a new Nexus service, provision it, and get the NAT entries populated, but it's very possible to run under our redundancy expectations. That's why we use redundancy!

@internet-diglett (Contributor, Author) replied:

Would setting all of the minimums to 1 be a reasonable compromise? If we don't have at least 1 Nexus, NTP, and ExternalDns zone that can be found in inventory, I would think we have more serious problems, no?

Collaborator replied:

Chatting with some folks in the update sync today, it sounds like the inventory system -- which gathers the inventory in a "collection", which contains a lot of objects -- may, at any time, simply "not report a sled" within that collection. Theoretically, that means we could see an "inventory collection" that doesn't contain any sleds which contain Nexus, NTP, external DNS, etc.

This has some weird implications for depending on the inventory system as the source-of-truth here, but without a full implementation of blueprints, I acknowledge that there isn't a great alternative yet.

So: "Could we set all the minimums to 1?" That would stop us from propagating the state of the inventory system if we saw a blip that eliminated all of these critical zones. So in that sense, it's arguably better than not doing the check! I also think it avoids "breaking" NAT when we're under-provisioned, as I mentioned in the case below.

If we're on the same page that, eventually, the right source-of-truth is "info from the blueprint, somehow", I think this is a reasonable intermediate step.

@internet-diglett (Contributor, Author) replied:

Yeah, it's my understanding that blueprints are the future, inventory is what we're using for now.

Collaborator replied:

(also, I appreciate you tolerating all this churn. Getting RPWs + NAT propagation right is tricky, and having this portion of the system be "not-totally-ready" does make this extra hairy. Thank you for pushing through regardless)

@internet-diglett (Contributor, Author) replied:

No worries! I know this is critical so I appreciate the extra eyes and feedback!

});
}

if nexus_count < MIN_NEXUS_COUNT {
Collaborator left a comment:

Example where this could go awry:

  • A sled is pulled (or crashes, or catches fire) and we mark it as "Removed"
  • We identify that the services on the sled are not present
  • Inventory reports that two Nexus instances are running
  • 2 < 3
  • So we fail the RPW here, and never update any NAT entries until redundancy is fully restored?

@internet-diglett (Contributor, Author) replied:

Yes, that is correct. Will we not attempt to move the service zone to a new sled in this situation?

Collaborator replied:

As soon as we implement service re-provisioning, this would happen, but I think the scope of "new service provisions" is going to start with only Crucible and NTP.

Until that is fully implemented, this check would just stop NAT propagation.

@internet-diglett enabled auto-merge (squash) January 24, 2024
@rcgoodfellow (Contributor) commented:

I took this for another spin in a4x2 after the rebase. Setting aside for the moment the general startup issues I had on cold boot, I did notice some strange behavior that seems specific to the NAT logic.

I started my test by letting a4x2 come all the way up to where I could reach the Omicron external API. I then rebooted scrimlet0. This worked out well. When the scrimlet came back up all the way, all the expected NAT entries were there and I could hit the API just fine.

Next, I rebooted scrimlet1. When this scrimlet came back up, it appeared to get stuck in time synchronization. I checked chronyc tracking in the NTP zone, and it was not tracking time.cloudflare.com like it should be. I then realized it had no connectivity, e.g.:

root@oxz_ntp_0634836d:~# host time.cloudflare.com
;; communications error to 1.1.1.1#53: timed out
;; communications error to 1.1.1.1#53: timed out
;; communications error to 9.9.9.9#53: timed out
;; no servers could be reached

Looking at the sled agent logs, things were indeed stuck on time sync.

08:13:42.465Z WARN SledAgent (ServiceManager): Time not yet synchronized (retrying in 1.440876112s)
    error = "No sync TimeSync { sync: false, ref_id: 2139029761, ip_addr: ::, stratum: 10, ref_time: 1706083959.2016006, correction: 0.0 }"

Looking at the NAT entries on the switches: scrimlet 0 went from 7 entries down to 2 (due to the reboot of scrimlet 1), and scrimlet 1 only had 3 entries. Looking at the OPTE ports on scrimlet 1, where the NTP zone with no connectivity lives, we see:

 /opt/oxide/opte/bin/opteadm list-ports
LINK                             MAC ADDRESS              IPv4 ADDRESS     EPHEMERAL IPv4   FLOATING IPv4    IPv6 ADDRESS                             EXTERNAL IPv6                            FLOATING IPv6                            STATE
opte0                            A8:40:25:FF:98:EC        172.30.3.5       None                              None                                     None                                     None                                     running

This MAC address has a NAT entry on scrimlet 1, but not on scrimlet 0, which I suspect is the connectivity problem. But the question is why this NAT entry is missing from scrimlet 0.

We weren't filtering the soft-deleted entries when calculating the diff between entries to add and entries to delete. This caused us to skip re-adding entries when an exact match was previously soft-deleted.
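
The fix described above amounts to excluding soft-deleted rows when computing which entries to add back. A rough sketch of the corrected diff logic (field and helper names are assumptions, not the exact datastore code):

// Entries derived from inventory (desired) vs. rows already present in
// ipv4_nat_entry (current). A row that has been soft-deleted must not
// count as "already present", otherwise we never re-add an entry whose
// exact match was previously soft-deleted.
let live: Vec<_> = current_entries
    .iter()
    .filter(|row| row.version_removed.is_none()) // ignore soft-deleted rows
    .collect();

let to_add: Vec<_> = desired_entries
    .iter()
    .filter(|want| !live.iter().any(|have| have.matches(want)))
    .collect();

let to_delete: Vec<_> = live
    .iter()
    .filter(|have| !desired_entries.iter().any(|want| have.matches(want)))
    .collect();
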
@rcgoodfellow (Contributor) commented:

I tested this after d3501dc, and things look much better. I restarted both scrimlets one after the other, and after each restart, all the NAT entries came back on the scrimlet that had been restarted.

@internet-diglett enabled auto-merge (squash) January 26, 2024
@internet-diglett merged commit 5215d85 into main Jan 26, 2024
20 checks passed
@internet-diglett deleted the rpw-for-service-zone-nat branch January 26, 2024
internet-diglett added a commit that referenced this pull request Mar 16, 2024
Overview
---
This PR ensures the rest of our `dpd` configuration is covered by an RPW to help recover state in the event of `dendrite` crashing, the switch zone restarting / being replaced, or the sled restarting. This is accomplished via a background task in Nexus that periodically ensures `dpd` is up to date with the tables in Nexus. The tradeoff of this design is that we don't track versioning and instead reconcile the entire state every time, but since the actual number of ports will never be that high (relative to something like NAT entries), trading some efficiency for much greater simplicity seems to make sense today, and it requires much less rework in Nexus and Dendrite should we choose to replace this strategy down the road.
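
In other words, this is plain full-state reconciliation rather than a versioned log like the NAT RPW. A minimal sketch of the pattern; all names here are illustrative, not the actual dpd client API:

use std::collections::HashMap;

// Recompute the desired switch port configuration from Nexus's tables
// and push the difference to dpd on every activation; no version
// counters involved. Illustrative names only.
async fn reconcile_switch_ports(
    datastore: &DataStore,
    dpd: &DpdClient,
) -> Result<(), Error> {
    let desired: HashMap<PortId, PortSettings> =
        datastore.switch_port_settings_all().await?;
    let actual: HashMap<PortId, PortSettings> =
        dpd.port_settings_list().await?;

    // Apply anything missing or different; dpd calls are idempotent.
    for (port, settings) in &desired {
        if actual.get(port) != Some(settings) {
            dpd.port_settings_apply(port, settings).await?;
        }
    }

    // Clear configuration dpd has that Nexus no longer knows about.
    for port in actual.keys() {
        if !desired.contains_key(port) {
            dpd.port_settings_clear(port).await?;
        }
    }
    Ok(())
}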

Tasks
---
- [x] Ensure that Service Zones configured during rss, cold boot, and
nexus have their NAT entries added to the NAT RPW table (extracted into
#4857)
- [x] Create background task that periodically reconciles switch port
configuration for dendrite instances
- [x] Move switch zone uplink SMF property updates to RPW
- [x] Move routing updates (via mg) to RPW
    - [x] Static Routing
    - [x] BGP
- [x] Move bootstore updates to RPW
- [x] Move loopback address management to RPW
- [x] Move Nexus-side switch zone service on-demand lookups as outlined
in #5092

Verifications Performed
---
- [x] Basic instance deployment
- [x] Loopback Address Creation
- [x] BGP configuration (a4x2)
- [ ] BGP configuration modification (a4x2)
- [x] Static routing
- [x] Static routing configuration modification

Related
---
Closes #4715
Closes #4650

Depends on oxidecomputer/dendrite#838

---------

Co-authored-by: Levon Tarver <[email protected]>
Successfully merging this pull request may close these issues:

Service NAT entries missing after dendrite restart