Rpws for all networking #4822
Conversation
```
@@ -153,64 +151,14 @@ async fn inventory_activate(
    Ok(collection)
}

/// Determine which sleds to inventory based on what's in the database
```
I'm probably going to undo the commit containing this change (moving the sled querying logic to common). Initially I thought that I should potentially query the sled agents directly, but it seems that using the records stored by this background job is sufficient even though it adds some latency.
Currently the logic for configuring NAT for service zones is deeply nested and crosses sled-agent http API boundaries. The cleanest way to deliver eventual consistency for service zone nat entries was to pull the zone information from inventory and use that to generate nat entries to reconcile against the `ipv4_nat_entry` table. This covers us in the following scenarios:

### RSS:

* User provides configuration to RSS
* RSS process ultimately creates a sled plan and service plan
* Application of service plan by sled-agents creates zones
* zone create makes direct calls to dendrite to configure NAT (it is the only way it can be done at this time)
* eventually the Nexus zones are launched and handoff to Nexus is complete
* inventory task is run, recording zone locations to db
* service zone nat background task reads inventory from db and uses the data to generate records for `ipv4_nat_entry` table, then triggers dendrite sync.
* sync is ultimately a noop because nat entries already exist in dendrite (dendrite operations are idempotent)

### Cold boot:

* sled-agents create switch zones if they are managing a scrimlet, and subsequently create zones written to their ledgers. This may result in direct calls to dendrite.
* Once nexus is back up, inventory will resume being collected
* service zone nat background task will read inventory from db to reconcile entries in `ipv4_nat_entry` table and then trigger dendrite sync.
* If nat is out of date on dendrite, it will be updated on trigger.

### Dendrite crash

* If dendrite crashes and restarts, it will immediately contact Nexus for re-sync (pre-existing logic from earlier NAT RPW work)
* service zone and instance nat entries are now present in rpw table, so all nat entries will be restored

### Migration / Relocation of service zone

* New zone gets created on a sled in the rack. Direct call to dendrite will be made (it uses the same logic as pre-nexus to create zone).
* Inventory task will record new location of service zone
* Service zone nat background task will use inventory to update table, adding and removing the necessary nat entries and triggering a dendrite update

Considerations
---

Because this relies on data from the inventory task which runs on a periodic timer (600s), and because this task also runs on a periodic timer (30s), there may be some latency for picking up changes. A few potential avenues for improvement:

* Plumb additional logic into service zone nat configuration that enables direct updates to the `ipv4_nat_entry` table once nexus is online. Of note, this would further bifurcate the logic of pre-nexus and post-nexus state management. At this moment, it seems that this is the most painful approach. An argument can be made that we ultimately should be lifting the nat configuration logic _out_ of the service zone creation instead.
* Decrease the timer for the inventory task. This is the simplest change, however this would result in more frequent collection, increasing overhead. I do not know _how much_ this would increase overhead. Maybe it is negligible.
* Plumb in the ability to trigger the inventory collection task for interesting control plane events. This would allow us to keep the _relatively_ infrequent timing intervals but allow us to refresh on-demand when needed.

Related
---

Closes #4650
Extracted from #4822
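The reconcile step described above is essentially a set difference between the NAT entries that inventory says should exist and the entries already recorded in the `ipv4_nat_entry` table. Below is a minimal, hypothetical Rust sketch of that diffing logic; `NatEntry` and its fields are illustrative stand-ins rather than the actual omicron model types.

```rust
// Sketch only: stand-in types for the diff between desired NAT state
// (derived from inventory) and current NAT state (the `ipv4_nat_entry`
// table). Not the actual omicron types.
use std::collections::HashSet;
use std::net::Ipv4Addr;

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct NatEntry {
    external_ip: Ipv4Addr,
    first_port: u16,
    last_port: u16,
}

/// Compute which entries to add and which to remove so the table converges
/// on what inventory says should exist.
fn reconcile(
    desired: &HashSet<NatEntry>,
    current: &HashSet<NatEntry>,
) -> (Vec<NatEntry>, Vec<NatEntry>) {
    let to_add = desired.difference(current).cloned().collect();
    let to_remove = current.difference(desired).cloned().collect();
    (to_add, to_remove)
}

fn main() {
    let desired: HashSet<NatEntry> = [NatEntry {
        external_ip: Ipv4Addr::new(10, 0, 0, 1),
        first_port: 0,
        last_port: 16383,
    }]
    .into_iter()
    .collect();
    let current = HashSet::new();

    let (to_add, to_remove) = reconcile(&desired, &current);
    // In the real task the additions/removals would be applied to the
    // `ipv4_nat_entry` table and a dendrite sync would then be triggered;
    // dendrite operations being idempotent makes re-application harmless.
    println!("add {}, remove {}", to_add.len(), to_remove.len());
}
```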
c794b1a to b88e966
This reverts commit a3d0f56. Switching to a "push"-based approach instead of "pulling" from dendrite. The push approach is simpler to implement, and it is easier to replace it with a pull approach later than to go the other way if we implemented pull first and then decided we preferred push. Another big motivator, possibly the most important one, is that omicron has access to information we would otherwise have to plumb into dendrite, like switch location and, in the near future, *rack* location, so it makes more sense for this logic to live in omicron today.
A few initial high-level observations.
@rcgoodfellow I actually had a similar thought and was going to talk to you about it, but it seems you are thinking along the same lines. I will pull out the saga nodes, replace them with RPW triggers, and give it a test.
@davepacheco I think this actually ended up being the golden nugget I needed. We were already iterating over entries in the rack table, but those records also track whether the rack has been initialized. It seems that after adding a db query (and an associated index to avoid a full table scan), I am now able to iterate over initialized racks instead of all entries in the rack table, which eliminates the entire need for any additional records / toggles / etc.
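For illustration, a hypothetical diesel sketch of that narrowed query; the `rack` schema here is a simplified stand-in (omicron's real table has more columns and different types), and the partial index mentioned in the comment is one way to satisfy the "avoid a full table scan" requirement.

```rust
use diesel::prelude::*;

// Hypothetical, simplified schema: illustrative only, not omicron's real
// `rack` table definition.
diesel::table! {
    rack (id) {
        id -> Text,
        initialized -> Bool,
    }
}

#[derive(Queryable, Debug)]
struct Rack {
    id: String,
    initialized: bool,
}

/// Fetch only racks that have completed initialization. Pairing this filter
/// with a partial index (e.g. `CREATE INDEX ... ON rack (id) WHERE
/// initialized = true`) keeps the query from scanning the whole table.
fn initialized_racks(conn: &mut PgConnection) -> QueryResult<Vec<Rack>> {
    use self::rack::dsl;

    dsl::rack
        .filter(dsl::initialized.eq(true))
        .select((dsl::id, dsl::initialized))
        .load(conn)
}
```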
* Do not use prebuilt clients for dpd and mgd throughout nexus
* Get infra address lot from db
Going to rename the RPW since it's not just a "switch port" RPW anymore, but I think we're finally ready to land this thing.
```
// I hate this. I know how to replace this transaction with
// CTEs but for the life of me I can't get it to work in
// diesel. I gave up and just extended the logic inside
// of the transaction instead of chasing diesel trait bound errors.
```
While it's fresh: I don't want to block your PR, but if you have the equivalent SQL for a CTE, could you post it in an issue? I can collaborate with you on making that work separately from this PR.
```
    .await
    .ok();

let db_lot = match found_lot {
```
In the context of a transaction, this "SELECT or else INSERT" behavior seems fine to me. "INSERT ... ON CONFLICT ... RETURNING" can really only tell you the value you just inserted anyway, so it'll omit any value which exists in the DB already.
(As you're already well-aware, CTEs are the right way to reduce this load with a smaller number of queries)
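For reference, a rough sketch of the kind of single-statement CTE being discussed: insert the lot if it is missing and return the row either way, working around the fact that `INSERT ... ON CONFLICT ... RETURNING` only reports newly inserted rows. The table, columns, and unique constraint here are hypothetical, not the actual omicron schema.

```rust
use diesel::prelude::*;
use diesel::sql_types::Text;

// Hypothetical schema: assumes `address_lot(id TEXT, name TEXT)` with a
// unique constraint on `name`. The real omicron table differs.
const UPSERT_ADDRESS_LOT: &str = r#"
WITH inserted AS (
    INSERT INTO address_lot (id, name)
    VALUES (gen_random_uuid()::text, $1)
    ON CONFLICT (name) DO NOTHING
    RETURNING id, name
)
SELECT id, name FROM inserted
UNION ALL
SELECT id, name FROM address_lot
WHERE name = $1
  AND NOT EXISTS (SELECT 1 FROM inserted)
"#;

#[derive(QueryableByName, Debug)]
struct AddressLotRow {
    #[diesel(sql_type = Text)]
    id: String,
    #[diesel(sql_type = Text)]
    name: String,
}

/// Insert-or-fetch in one statement instead of SELECT-then-INSERT inside
/// the transaction.
fn upsert_address_lot(
    conn: &mut PgConnection,
    name: &str,
) -> QueryResult<AddressLotRow> {
    diesel::sql_query(UPSERT_ADDRESS_LOT)
        .bind::<Text, _>(name)
        .get_result(conn)
}
```

Whether wiring this through diesel is worth it compared to the SELECT-or-INSERT inside the transaction is exactly the trade-off discussed above.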
```
use db::schema::address_lot::dsl as lot_dsl;
use db::schema::address_lot_block::dsl as block_dsl;

let address_lot_id = lot_dsl::address_lot
```
nitpick: This could be a single query with a join, but I think this is also fine as-is
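For what it's worth, a hypothetical diesel sketch of the single-query-with-a-join alternative; the table definitions are simplified stand-ins for omicron's `address_lot` / `address_lot_block` schema, which uses different column types.

```rust
use diesel::prelude::*;

// Illustrative, simplified schema only.
diesel::table! {
    address_lot (id) {
        id -> Text,
        name -> Text,
    }
}

diesel::table! {
    address_lot_block (id) {
        id -> Text,
        address_lot_id -> Text,
        first_address -> Text,
        last_address -> Text,
    }
}

diesel::joinable!(address_lot_block -> address_lot (address_lot_id));
diesel::allow_tables_to_appear_in_same_query!(address_lot, address_lot_block);

/// Look up the blocks for a named lot in one round trip instead of two.
fn blocks_for_lot(
    conn: &mut PgConnection,
    lot_name: &str,
) -> QueryResult<Vec<(String, String)>> {
    use self::address_lot::dsl as lot_dsl;
    use self::address_lot_block::dsl as block_dsl;

    lot_dsl::address_lot
        .inner_join(block_dsl::address_lot_block)
        .filter(lot_dsl::name.eq(lot_name))
        .select((block_dsl::first_address, block_dsl::last_address))
        .load(conn)
}
```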
```
let blocks = blocks
    .into_iter()
    .filter(|b| {
        !db_blocks.iter().any(|db_b| {
            db_b.first_address == b.first_address
                || db_b.last_address == b.last_address
        })
    })
    .collect::<Vec<_>>();
```
To be clear, this is looking for cases where:

- We have `blocks` that come from params
- In the database, we looked up `db_blocks`, which come from the same address lot and happen to have the same "first address" or the same "last address" as the parameter-supplied `blocks` (lines 99-114)

And this particular filter identifies:

- If there are any `blocks` where `db_blocks` has the same "first address" or "last address", we filter them out of `blocks`?

Am I misunderstanding this? Aren't we explicitly looking for that overlap - why then filter it out?
We're supposed to be filtering for blocks to insert, but I may have gotten my wires crossed here. I'll take a second look here.
Yeah, this actually doesn't work quite right for what we're planning to return below. I'll need to tweak it. It should be inserting the correct stuff though.
@smklein this should be fixed now
Things look good in a4x2. Onward to on-rack testing.
Overview

This PR ensures the rest of our `dpd` configuration is covered by an RPW to help recover state in the event of `dendrite` crashing, the switch zone restarting / being replaced, or the sled restarting. This is accomplished via a background task in Nexus that periodically ensures `dpd` is up to date with the tables in Nexus. The tradeoff of this design is that we don't track versioning and instead reconcile the entire state every time, but since the actual number of ports will never be that high (relative to something like NAT entries), trading some efficiency for much greater simplicity seems to make sense today, and it requires much less rework in Nexus and Dendrite should we choose to replace this strategy down the road.
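To make the shape of that background task concrete, here is a minimal, hypothetical sketch of a timer-driven full reconcile loop. It assumes a tokio runtime (rt, macros, and time features) and uses stand-in function names; it is not the actual Nexus background-task machinery.

```rust
// A minimal sketch of a "reconcile everything on a timer" loop.
// Names and intervals are illustrative only.
use std::time::Duration;

/// Stand-in for the real work: read the desired switch configuration from
/// the Nexus tables and push it to dpd, relying on idempotent apply
/// operations instead of version tracking.
async fn reconcile_switch_config() {
    println!("reconciling dpd configuration");
}

#[tokio::main]
async fn main() {
    let mut ticker = tokio::time::interval(Duration::from_secs(30));
    // Bounded here so the sketch terminates; the real task runs for the
    // lifetime of Nexus.
    for _ in 0..3 {
        ticker.tick().await;
        reconcile_switch_config().await;
    }
}
```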
Tasks

* Replace `HashMap<SwitchLocation, Client>` with on-demand lookups, and eventually DNS #5092

Verifications Performed

Related

Closes #4715
Closes #4650
Depends on https://github.com/oxidecomputer/dendrite/pull/838