
Rpws for all networking #4822

Merged: 48 commits merged into main from rpws-for-all-networking on Mar 16, 2024

Conversation

@internet-diglett (Contributor) commented Jan 16, 2024

Overview

This PR ensures the rest of our dpd configuration is covered by an RPW, to help recover state in the event of dendrite crashing, the switch zone restarting or being replaced, or the sled restarting. This is accomplished via a background task in Nexus that periodically ensures dpd is up to date with the tables in Nexus. The tradeoff of this design is that we don't track versioning and instead reconcile the entire state every time; but since the actual number of ports will never be that high (relative to something like NAT entries), trading some efficiency for much greater simplicity makes sense today, and it requires much less rework in Nexus and Dendrite should we choose to replace this strategy down the road.
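
In rough terms, the background task boils down to the loop below. This is a minimal sketch only: `Datastore`, `DpdClient`, and `PortSettings` are illustrative placeholders (written synchronously for brevity), not the actual Nexus or dpd APIs.

```rust
use anyhow::Result;

// Illustrative placeholders for the Nexus datastore and dpd client.
struct PortSettings { /* links, addresses, routes, BGP config, ... */ }

trait Datastore {
    /// Load the complete desired networking configuration from the Nexus tables.
    fn desired_port_settings(&self) -> Result<Vec<PortSettings>>;
}

trait DpdClient {
    /// Assumed idempotent: re-applying configuration that already exists is a no-op.
    fn apply_port_settings(&self, settings: &PortSettings) -> Result<()>;
}

/// One activation of the RPW: no versioning is tracked, we simply push the
/// entire desired state to every switch's dpd. Anything dendrite lost (crash,
/// switch zone replacement, sled restart) gets recreated; everything else is
/// a harmless re-apply.
fn reconcile(db: &dyn Datastore, switches: &[Box<dyn DpdClient>]) -> Result<()> {
    let desired = db.desired_port_settings()?;
    for dpd in switches {
        for settings in &desired {
            dpd.apply_port_settings(settings)?;
        }
    }
    Ok(())
}
```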

Tasks

Verifications Performed

  • Basic instance deployment
  • Loopback Address Creation
  • BGP configuration (a4x2)
  • BGP configuration modification (a4x2)
  • Static routing
  • Static routing configuration modification

Related

Closes #4715
Closes #4650

Depends on https://github.com/oxidecomputer/dendrite/pull/838

@rcgoodfellow linked an issue Jan 18, 2024 that may be closed by this pull request
@rcgoodfellow mentioned this pull request Jan 19, 2024
@@ -153,64 +151,14 @@ async fn inventory_activate(
Ok(collection)
}

/// Determine which sleds to inventory based on what's in the database
@internet-diglett (Contributor, PR author):

I'm probably going to undo the commit containing this change (moving the sled querying logic to common). Initially I thought I might need to query the sled agents directly, but it seems that using the records stored by this background job is sufficient, even though it adds some latency.

internet-diglett added a commit that referenced this pull request Jan 26, 2024
Currently the logic for configuring NAT for service zones is deeply
nested and crosses sled-agent HTTP API boundaries. The cleanest way to
deliver eventual consistency for service zone NAT entries was to pull
the zone information from inventory and use that to generate NAT entries
to reconcile against the `ipv4_nat_entry` table (a rough sketch of the
reconcile step follows the scenarios below). This covers us in the
following scenarios:

### RSS:
* User provides configuration to RSS
* RSS process ultimately creates a sled plan and service plan
* Application of service plan by sled-agents creates zones
* zone create makes direct calls to dendrite to configure NAT (it is the
only way it can be done at this time)
* eventually the Nexus zones are launched and handoff to Nexus is
complete
* inventory task is run, recording zone locations to db
* service zone nat background task reads inventory from db and uses the
data to generate records for `ipv4_nat_entry` table, then triggers
dendrite sync.
* sync is ultimately a noop because nat entries already exist in
dendrite (dendrite operations are idempotent)
 
### Cold boot:
* sled-agents create switch zones if they are managing a scrimlet, and
subsequently create zones written to their ledgers. This may result in
direct calls to dendrite.
* Once nexus is back up, inventory will resume being collected
* service zone nat background task will read inventory from db to
reconcile entries in `ipv4_nat_entry` table and then trigger dendrite
sync.
* If nat is out of date on dendrite, it will be updated on trigger.

### Dendrite crash
* If dendrite crashes and restarts, it will immediately contact Nexus
for re-sync (pre-existing logic from earlier NAT RPW work)
* service zone and instance nat entries are now present in rpw table, so
all nat entries will be restored

### Migration / Relocation of service zone
* New zone gets created on a sled in the rack. Direct call to dendrite
will be made (it uses the same logic as pre-nexus to create zone).
* Inventory task will record new location of service zone
* Service zone nat background task will use inventory to update table,
adding and removing the necessary nat entries and triggering a dendrite
update
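
Conceptually, the reconcile step is a set difference between the entries generated from inventory and the rows already in `ipv4_nat_entry`. The sketch below uses hypothetical types and field names, not the actual omicron definitions.

```rust
/// Hypothetical, simplified stand-in for a row of `ipv4_nat_entry`.
#[derive(Clone, PartialEq)]
struct NatEntry {
    external_address: std::net::Ipv4Addr,
    first_port: u16,
    last_port: u16,
    sled_address: std::net::Ipv6Addr,
}

/// Compare the entries generated from the latest inventory against the rows
/// currently in the database, returning what must be added and removed.
/// After the table is updated, a dendrite sync is triggered; because dendrite
/// operations are idempotent, re-syncing existing entries is a no-op.
fn reconcile_service_zone_nat(
    from_inventory: &[NatEntry],
    in_database: &[NatEntry],
) -> (Vec<NatEntry>, Vec<NatEntry>) {
    // Entries present in inventory but missing from the table need to be added.
    let to_add = from_inventory
        .iter()
        .filter(|e| !in_database.contains(e))
        .cloned()
        .collect();
    // Entries in the table that no longer correspond to a service zone
    // (e.g. after a zone migrates to another sled) need to be removed.
    let to_remove = in_database
        .iter()
        .filter(|e| !from_inventory.contains(e))
        .cloned()
        .collect();
    (to_add, to_remove)
}
```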


Considerations
---
Because this relies on data from the inventory task which runs on a
periodic timer (600s), and because this task also runs on a periodic
timer (30s), there may be some latency for picking up changes. A few
potential avenues for improvement:

* Plumb additional logic into service zone nat configuration that
enables direct updates to the `ipv4_nat_entry` table once nexus is
online. Of note, this would further bifurcate the logic of pre-nexus and
post-nexus state management. At this moment, it seems that this is the
most painful approach. An argument can be made that we ultimately should
be lifting the nat configuration logic _out_ of the service zone
creation instead.

* Decrease the timer for the inventory task. This is the simplest
change, however this would result in more frequent collection,
increasing overhead. I do not know _how much_ this would increase
overhead. Maybe it is negligible.

* Plumb in the ability to trigger the inventory collection task for
interesting control plane events. This would allow us to keep the
_relatively_ infrequent timing intervals but allow us to refresh
on-demand when needed.

Related
---
Closes #4650 
Extracted from #4822
Levon Tarver and others added 5 commits February 5, 2024 10:44
This reverts commit a3d0f56.

Switching to a "push"-based approach instead of "pulling" from dendrite.
This will be a simpler implementation, and it can more easily be replaced
with a pull approach later than if we had implemented pull first and then
decided we preferred push. Another big motivator, possibly the most
important one, is that omicron has access to important information that
we'd otherwise have to plumb into dendrite, like switch location and, in
the near future, *rack* location, so it makes more sense for this to live
in omicron today.
@internet-diglett marked this pull request as ready for review February 7, 2024 20:10
@rcgoodfellow (Contributor):

A few initial high-level observations.

@internet-diglett (Contributor, PR author) commented Feb 8, 2024

@rcgoodfellow I actually had a similar thought and was going to talk to you about it, but it seems you're thinking along the same lines. I'll pull out the saga [nodes], update them with RPW triggers, and give it a test.

@internet-diglett marked this pull request as draft February 20, 2024 23:57
@internet-diglett (Contributor, PR author):

> Also, elsewhere we assume "handoff to Nexus completed" == "there is a record in the Rack table". For example see here. This is also specified in RFD 278 ("The transfer-of-control operation should synchronously write some state to the database that allows both this Nexus instance and others to tell that transfer-of-control has happened.")

@davepacheco I think this actually ended up being the golden nugget I needed: we were already iterating over entries in the rack table, and there is an additional `initialized` column, set during the handoff, that we can filter on.

After adding a db query (and an associated index to avoid a full table scan), I'm now able to iterate over initialized racks instead of all entries in the rack table, which eliminates the need for any additional records, toggles, etc.
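
Conceptually, the query looks something like this minimal synchronous diesel sketch; the placeholder schema and the index shown are assumptions for illustration, not the actual omicron definitions.

```rust
use diesel::prelude::*;

// Placeholder schema; assumes diesel's uuid support is enabled.
// The real `rack` table has more columns.
diesel::table! {
    rack (id) {
        id -> Uuid,
        initialized -> Bool,
    }
}

#[derive(Queryable)]
struct Rack {
    id: uuid::Uuid,
    initialized: bool,
}

/// Fetch only racks for which handoff to Nexus has completed. A partial
/// index along the lines of
///   CREATE INDEX ... ON rack (id) WHERE initialized = true;
/// keeps this from requiring a full table scan.
fn initialized_racks(conn: &mut PgConnection) -> QueryResult<Vec<Rack>> {
    use self::rack::dsl;

    dsl::rack
        .filter(dsl::initialized.eq(true))
        .load(conn)
}
```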

@internet-diglett marked this pull request as ready for review March 11, 2024 23:07
@internet-diglett (Contributor, PR author):

Going to rename the RPW since it's not just a "switch port" RPW anymore, but I think we're finally ready to land this thing.

Comment on lines +57 to +60
// I hate this. I know how to replace this transaction with
// CTEs but for the life of me I can't get it to work in
// diesel. I gave up and just extended the logic inside
// of the transaction instead of chasing diesel trait bound errors.
Collaborator:

While it's fresh: I don't want to block your PR, but if you have the equivalent SQL for a CTE, could you post it in an issue? I can collaborate with you on making that work separately from this PR.


.await
.ok();

let db_lot = match found_lot {
Collaborator:

In the context of a transaction, this "SELECT or else INSERT" behavior seems fine to me. `INSERT ... ON CONFLICT ... RETURNING` can really only tell you the values you just inserted anyway, so it'll omit any value that already exists in the DB.

(As you're already well-aware, CTEs are the right way to reduce this load with a smaller number of queries)
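
For reference, the "SELECT or else INSERT" pattern being discussed looks roughly like the following sketch, using a hypothetical one-column table rather than the code in this diff.

```rust
use diesel::prelude::*;

// Placeholder table for illustration only; not the actual omicron schema.
diesel::table! {
    address_lot (name) {
        name -> Text,
    }
}

#[derive(Queryable)]
struct AddressLot {
    name: String,
}

/// "SELECT, or else INSERT" inside a transaction. Unlike a bare
/// `INSERT ... ON CONFLICT DO NOTHING ... RETURNING`, this also hands back
/// rows that existed before the call, not just freshly inserted ones.
fn ensure_lot(conn: &mut PgConnection, lot_name: &str) -> QueryResult<AddressLot> {
    conn.transaction(|conn| {
        use self::address_lot::dsl;

        // Try to find an existing row first...
        let found: Option<AddressLot> = dsl::address_lot
            .filter(dsl::name.eq(lot_name))
            .first(conn)
            .optional()?;

        match found {
            Some(lot) => Ok(lot),
            // ...and only insert if it wasn't there.
            None => diesel::insert_into(dsl::address_lot)
                .values(dsl::name.eq(lot_name))
                .get_result(conn),
        }
    })
}
```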

use db::schema::address_lot::dsl as lot_dsl;
use db::schema::address_lot_block::dsl as block_dsl;

let address_lot_id = lot_dsl::address_lot
Collaborator:

nitpick: This could be a single query with a join, but I think this is also fine as-is

Comment on lines 118 to 126
let blocks = blocks
.into_iter()
.filter(|b| {
!db_blocks.iter().any(|db_b| {
db_b.first_address == b.first_address
|| db_b.last_address == b.last_address
})
})
.collect::<Vec<_>>();
Collaborator:

To be clear, this is looking for cases where:

  • We have blocks that come from params
  • In the database, we looked up db_blocks, which come from the same address lot and happen to have the same "first address" or the same "last address" as the parameter-supplied blocks (lines 99-114)

And this particular filter identifies:

  • If there are any blocks where db_blocks has the same "first address" or "last address", we filter them out of blocks?

Am I misunderstanding this? Aren't we explicitly looking for that overlap? Why then filter it out?

@internet-diglett (Contributor, PR author):

We're supposed to be filtering for blocks to insert, but I may have gotten my wires crossed here. I'll take a second look.

@internet-diglett (Contributor, PR author):

Yeah, this actually doesn't work quite right for what we're planning to return below. I'll need to tweak it. It should be inserting the correct stuff though.

@internet-diglett (Contributor, PR author):

@smklein this should be fixed now

@rcgoodfellow (Contributor) left a comment:

Things look good in a4x2. Onward to on-rack testing.

@internet-diglett merged commit 05ed790 into main on Mar 16, 2024
20 checks passed
@internet-diglett deleted the rpws-for-all-networking branch March 16, 2024 21:29
Successfully merging this pull request may close these issues.

  • Network RPWs for All
  • Service NAT entries missing after dendrite restart