background task for service zone nat #4857
Conversation
Thanks Levon!
The approach of using inventory as the source of truth for reconciling the service entries seems solid.
I took this for a spin in a4x2. After the system came up, I rebooted scrimlet0. When it came back up, most of the NAT entries came back with it, and I was able to hit the Nexus API successfully without any manual prodding 🎉. The NAT entry for external DNS was missing, however. I noted where I believe the issue is in the code review.
LGTM, thanks Levon! Let's just get this rebased on main and run through another round of testing before we merge.
// reconcile service zone nat entries
let result = match self.datastore.ipv4_nat_sync_service_zones(opctx, &ipv4_nat_values).await {
We discussed this in chat, but this whole flow has some implications for our calls to `ensure_nat_entry` / `ensure_ipv4_nat_entry`:
- If you're writing code in Nexus to make an instance with a NAT entry, you must call these functions explicitly. No one else will add your NAT entry to the DB!
- If you're writing code in Nexus to make a service with a NAT entry, you should not call these functions, and should rely on the "Nexus -> Sled Agent -> Inventory -> Service Zone NAT RPW" pathway to ensure that these entries get populated. If you tried to add the entry to the DB, you'd risk a race condition between "collection via inventory / RPW" and "explicitly inserting the record".

I think that's okay, but we should add documentation around the `ensure_nat_entry` function to make that distinction extremely clear! I think it's kinda subtle that the same table is treated quite differently depending on the purpose of the NAT entry.
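For illustration, the doc comment I have in mind might look something like this (the wording and the elided signature are hypothetical, not the actual `ensure_nat_entry` code):

```rust
/// Ensure a NAT entry exists in the `ipv4_nat_entry` table.
///
/// Callers creating *instance* NAT entries must call this explicitly;
/// nothing else will insert the record for them.
///
/// Callers creating *service zone* NAT entries should NOT call this.
/// Those entries are populated by the
/// "Nexus -> Sled Agent -> Inventory -> Service Zone NAT RPW" pathway,
/// and inserting them directly risks racing with that reconciler.
pub async fn ensure_nat_entry(/* signature elided */) {
    /* body elided */
}
```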
Also, for posterity, I created this drawing to map out the data flow here.
My "TLDR" of the above is that I wanted to ensure we avoided having loops in this data flow graph:
(Source: https://docs.google.com/drawings/d/19MkoKsgZ8vuPng6uKaCF1hG2hiHI9735jv9ThLgajVM/edit?usp=sharing )
I think this is great feedback; I'm adding some documentation for it now. I think one day we may get to a place where we can lift the NAT logic out of the service zone creation functionality in sled-agent, which could help us move towards a more consistent pattern of interacting with this table.
I've added a stopgap that was discussed in chat: we now check that we have a minimum number of each service zone type we're interested in before reconciling the entries.
// Minimum number of nexus zones that should be present in a valid
// set of service zone nat configurations.
const MIN_NEXUS_COUNT: usize = 3;
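To make the stopgap concrete, here is a minimal sketch of the kind of check being described; `MIN_NEXUS_COUNT` comes from the snippet above, while the other constants and the helper name are illustrative rather than the exact code in the PR:

```rust
// Illustrative sketch only: constants other than MIN_NEXUS_COUNT and the
// helper name are hypothetical.
const MIN_NEXUS_COUNT: usize = 3;
const MIN_NTP_COUNT: usize = 1;
const MIN_EXTERNAL_DNS_COUNT: usize = 1;

/// Returns true only if the inventory collection contains at least the
/// expected number of each critical service zone type. If it doesn't,
/// the task skips reconciliation rather than tearing down NAT entries
/// based on a possibly-incomplete collection.
fn collection_meets_minimums(
    nexus_count: usize,
    ntp_count: usize,
    external_dns_count: usize,
) -> bool {
    nexus_count >= MIN_NEXUS_COUNT
        && ntp_count >= MIN_NTP_COUNT
        && external_dns_count >= MIN_EXTERNAL_DNS_COUNT
}
```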
I'm not sure we should do this -- it's possible that one of the three Nexus instances has gone away, and we'd still want our NAT entries to be up-to-date.
This is a reasonable goal to try to achieve in the "graceful shutdown" case -- if we want three Nexus instances to run, we should provision a 4th one before removing one of the original 3 -- but if a sled is yanked from the rack, we do have two Nexus instances. That's just a truth! We should aspire to have the blueprint creator create a new Nexus service, provision it, and get the NAT entries populated, but it's very possible to run under our redundancy expectations. That's why we use redundancy!
Would setting all of the minimums to 1 be a reasonable compromise? If we don't have at least 1 Nexus, NTP, and ExternalDns zone that can be found in inventory, I would think we have more serious problems, no?
Chatting with some folks in the update sync today, it sounds like the inventory system -- which gathers the inventory in a "collection", which contains a lot of objects -- may, at any time, simply "not report a sled" within that collection. Theoretically, that means we could see an "inventory collection" that doesn't contain any sleds which contain Nexus, NTP, external DNS, etc.
This has some weird implications for depending on the inventory system as the source-of-truth here, but without a full implementation of blueprints, I acknowledge that there isn't a great alternative yet.
So: "Could we set all the minimums to 1?" That would stop us from propagating the state of the inventory system if we saw a blip that eliminated all of these critical zones. So in that sense, it's arguably better than not doing the check! I also think it avoids "breaking" NAT when we're under-provisioned, as I mentioned in the case below.
If we're on the same page that, eventually, the right source-of-truth is "info from the blueprint, somehow", I think this is a reasonable intermediate step.
Yeah, it's my understanding that blueprints are the future, inventory is what we're using for now.
(also, I appreciate you tolerating all this churn. Getting RPWs + NAT propagation right is tricky, and having this portion of the system be "not-totally-ready" does make this extra hairy. Thank you for pushing through regardless)
No worries! I know this is critical so I appreciate the extra eyes and feedback!
}); | ||
} | ||
|
||
if nexus_count < MIN_NEXUS_COUNT { |
Example where this could go awry:
- A sled is pulled (or crashes, or catches fire); we mark it as "Removed"
- We identify that the services on the sled are not present
- Inventory reports that two Nexus instances are running
- 2 < 3
- So we fail the RPW here, and never update any NAT entries until redundancy is fully restored?
Yes, that is correct. Will we not attempt to move the service zone to a new sled in this situation?
As soon as we implement service re-provisioning, this would happen, but I think the scope of "new service provisions" is going to start with only Crucible and NTP.
Until that is fully implemented, this check would just stop NAT propagation.
I took this for another spin in a4x2. I started my test by letting […]. Next, I rebooted scrimlet1. When this scrimlet came back up, it appeared to get stuck in time synchronization. I checked […].

Looking at the sled agent logs, things were indeed stuck on time sync.

Looking at the NAT entries on the switches: Scrimlet 0 went from 7 to 2 (due to the reboot of scrimlet 1), and scrimlet 1 only had 3 entries. Looking at the OPTE entries on scrimlet 1, where the NTP zone with no connectivity is, we see […].

This MAC address has a NAT entry on scrimlet 1, but not scrimlet 0, which I suspect is the connectivity problem. But the question is why this NAT entry is missing from scrimlet 0.
We weren't filtering the soft-deleted entries when calculating the diff between entries to add and entries to delete. This caused us to skip re-adding entries when an exact match was previously soft-deleted.
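Roughly, the fix amounts to ignoring soft-deleted rows before computing the diff. Here's a small sketch of that idea; the type and field names are illustrative, not the actual schema or query:

```rust
use std::collections::HashSet;
use std::time::SystemTime;

// Illustrative types only; the real models and schema differ.
#[derive(Clone, PartialEq, Eq, Hash)]
struct NatEntry {
    external_ip: String,
    mac: String,
}

struct NatRow {
    entry: NatEntry,
    // Soft-deleted rows have `time_deleted` set rather than being removed.
    time_deleted: Option<SystemTime>,
}

/// Compute which desired entries still need to be added. Soft-deleted rows
/// must be filtered out first; otherwise a desired entry that exactly matches
/// a previously soft-deleted row would be skipped and never re-added.
fn entries_to_add(desired: &[NatEntry], existing: &[NatRow]) -> Vec<NatEntry> {
    let live: HashSet<&NatEntry> = existing
        .iter()
        .filter(|row| row.time_deleted.is_none())
        .map(|row| &row.entry)
        .collect();
    desired.iter().filter(|e| !live.contains(e)).cloned().collect()
}
```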
I tested this after d3501dc, and things look much better. I restarted both scrimlets one after the other, and after each restart, all the NAT entries came back on the scrimlet that had been restarted.
Overview
---
This PR ensures the rest of our `dpd` configuration is covered by an RPW to help recover state in the event of `dendrite` crashing, the switch zone restarting / being replaced, or the sled restarting. This is accomplished via a background task in Nexus that periodically ensures `dpd` is up to date with the tables in Nexus. The tradeoff of this design is that we don't track versioning and we reconcile the entire state every time, but since the actual number of ports will never be that high (relative to something like NAT entries), trading some efficiency for much greater simplicity seems to make sense today, and it requires much less rework in Nexus and Dendrite should we choose to replace this strategy down the road.

Tasks
---
- [x] Ensure that Service Zones configured during rss, cold boot, and nexus have their NAT entries added to the NAT RPW table (extracted into #4857)
- [x] Create background task that periodically reconciles switch port configuration for dendrite instances
- [x] Move switch zone uplink SMF property updates to RPW
- [x] Move routing updates (via mg) to RPW
  - [x] Static Routing
  - [x] BGP
- [x] Move bootstore updates to RPW
- [x] Move loopback address management to RPW
- [x] Move Nexus-side switch zone service on-demand lookups as outlined in #5092

Verifications Performed
---
- [x] Basic instance deployment
- [x] Loopback Address Creation
- [x] BGP configuration (a4x2)
- [ ] BGP configuration modification (a4x2)
- [x] Static routing
- [x] Static routing configuration modification

Related
---
Closes #4715
Closes #4650
Depends on oxidecomputer/dendrite#838

---------

Co-authored-by: Levon Tarver <[email protected]>
Currently the logic for configuring NAT for service zones is deeply nested and crosses sled-agent http API boundaries. The cleanest way to deliver eventual consistency for service zone nat entries was to pull the zone information from inventory and use that to generate nat entries to reconcile against the `ipv4_nat_entry` table. This covers us in the following scenarios (a rough sketch of one reconciliation pass follows the list):

- RSS: […] the `ipv4_nat_entry` table, then triggers dendrite sync.
- Cold boot: […] the `ipv4_nat_entry` table and then trigger dendrite sync.
- Dendrite crash
- Migration / Relocation of service zone
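As referenced above, here is a rough sketch of what one reconciliation pass does under this design. Apart from the `ipv4_nat_entry` table and the `ipv4_nat_sync_service_zones` call quoted in the review, the names below are placeholders rather than the actual Nexus APIs:

```rust
// Placeholder types standing in for the real inventory and NAT models.
struct ServiceZone {
    // zone kind, external IP, MAC, ...
}
struct Ipv4NatValues {
    // columns of the ipv4_nat_entry table
}

/// One pass of the RPW: read the newest inventory collection, derive the
/// desired NAT values for every service zone found, reconcile them against
/// the `ipv4_nat_entry` table, then trigger a dendrite sync.
async fn reconcile_pass(zones_from_inventory: &[ServiceZone]) -> Result<(), String> {
    let desired: Vec<Ipv4NatValues> =
        zones_from_inventory.iter().map(nat_values_for_zone).collect();
    // In the real task this is the datastore call quoted in the review
    // (`ipv4_nat_sync_service_zones`).
    sync_service_zone_nat(&desired).await?;
    trigger_dendrite_sync().await
}

fn nat_values_for_zone(_zone: &ServiceZone) -> Ipv4NatValues {
    Ipv4NatValues {}
}
async fn sync_service_zone_nat(_values: &[Ipv4NatValues]) -> Result<(), String> {
    Ok(())
}
async fn trigger_dendrite_sync() -> Result<(), String> {
    Ok(())
}
```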
Considerations
Because this relies on data from the inventory task which runs on a periodic timer (600s), and because this task also runs on a periodic timer (30s), there may be some latency for picking up changes. A few potential avenues for improvement:
- Plumb additional logic into service zone nat configuration that enables direct updates to the `ipv4_nat_entry` table once nexus is online. Of note, this would further bifurcate the logic of pre-nexus and post-nexus state management. At this moment, it seems that this is the most painful approach. An argument can be made that we ultimately should be lifting the nat configuration logic out of the service zone creation instead.
- Decrease the timer for the inventory task. This is the simplest change; however, it would result in more frequent collection, increasing overhead. I do not know how much this would increase overhead. Maybe it is negligible.
- Plumb in the ability to trigger the inventory collection task for interesting control plane events. This would allow us to keep the relatively infrequent timing intervals but allow us to refresh on-demand when needed (a rough sketch of this idea follows the list).
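As a rough sketch of that third option, plain tokio primitives can illustrate a periodic task that can also be woken on demand; the real Nexus background-task plumbing may look quite different:

```rust
use std::{sync::Arc, time::Duration};
use tokio::sync::Notify;

// Keep the periodic timer, but let interesting control-plane events wake the
// collection task immediately instead of waiting out the interval.
async fn run_inventory_collector(wakeup: Arc<Notify>) {
    loop {
        tokio::select! {
            // Normal periodic collection (600s today).
            _ = tokio::time::sleep(Duration::from_secs(600)) => {}
            // On-demand activation triggered elsewhere via `wakeup.notify_one()`.
            _ = wakeup.notified() => {}
        }
        collect_inventory().await;
    }
}

async fn collect_inventory() {
    // Gather the inventory collection as the task does today.
}
```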
Related
Closes #4650
Extracted from #4822