Rpws for all networking #4822
Conversation
```
@@ -153,64 +151,14 @@ async fn inventory_activate(
    Ok(collection)
}

/// Determine which sleds to inventory based on what's in the database
```
I'm probably going to undo the commit containing this change (moving the sled querying logic to common). Initially I thought that I should potentially query the sled agents directly, but it seems that using the records stored by this background job is sufficient even though it adds some latency.
Currently the logic for configuring NAT for service zones is deeply nested and crosses sled-agent http API boundaries. The cleanest way to deliver eventual consistency for service zone nat entries was to pull the zone information from inventory and use that to generate nat entries to reconcile against the `ipv4_nat_entry` table. This covers us in the following scenarios:

### RSS:

* User provides configuration to RSS
* RSS process ultimately creates a sled plan and service plan
* Application of service plan by sled-agents creates zones
* zone create makes direct calls to dendrite to configure NAT (it is the only way it can be done at this time)
* eventually the Nexus zones are launched and handoff to Nexus is complete
* inventory task is run, recording zone locations to db
* service zone nat background task reads inventory from db and uses the data to generate records for `ipv4_nat_entry` table, then triggers dendrite sync.
* sync is ultimately a noop because nat entries already exist in dendrite (dendrite operations are idempotent)

### Cold boot:

* sled-agents create switch zones if they are managing a scrimlet, and subsequently create zones written to their ledgers. This may result in direct calls to dendrite.
* Once nexus is back up, inventory will resume being collected
* service zone nat background task will read inventory from db to reconcile entries in `ipv4_nat_entry` table and then trigger dendrite sync.
* If nat is out of date on dendrite, it will be updated on trigger.

### Dendrite crash

* If dendrite crashes and restarts, it will immediately contact Nexus for re-sync (pre-existing logic from earlier NAT RPW work)
* service zone and instance nat entries are now present in rpw table, so all nat entries will be restored

### Migration / Relocation of service zone

* New zone gets created on a sled in the rack. Direct call to dendrite will be made (it uses the same logic as pre-nexus to create zone).
* Inventory task will record new location of service zone
* Service zone nat background task will use inventory to update table, adding and removing the necessary nat entries and triggering a dendrite update

Considerations
---

Because this relies on data from the inventory task which runs on a periodic timer (600s), and because this task also runs on a periodic timer (30s), there may be some latency for picking up changes. A few potential avenues for improvement:

* Plumb additional logic into service zone nat configuration that enables direct updates to the `ipv4_nat_entry` table once nexus is online. Of note, this would further bifurcate the logic of pre-nexus and post-nexus state management. At this moment, it seems that this is the most painful approach. An argument can be made that we ultimately should be lifting the nat configuration logic _out_ of the service zone creation instead.
* Decrease the timer for the inventory task. This is the simplest change, however this would result in more frequent collection, increasing overhead. I do not know _how much_ this would increase overhead. Maybe it is negligible.
* Plumb in the ability to trigger the inventory collection task for interesting control plane events. This would allow us to keep the _relatively_ infrequent timing intervals but allow us to refresh on-demand when needed.

Related
---

Closes #4650
Extracted from #4822
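The reconcile step described above is essentially a set difference between the NAT entries that inventory says should exist and the entries already recorded in the `ipv4_nat_entry` table. Below is a minimal, hypothetical Rust sketch of that diffing logic; `NatEntry` and its fields are illustrative stand-ins rather than the actual omicron model types.

```rust
// Sketch only: stand-in types for the diff between desired NAT state
// (derived from inventory) and current NAT state (the `ipv4_nat_entry`
// table). Not the actual omicron types.
use std::collections::HashSet;
use std::net::Ipv4Addr;

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct NatEntry {
    external_ip: Ipv4Addr,
    first_port: u16,
    last_port: u16,
}

/// Compute which entries to add and which to remove so the table converges
/// on what inventory says should exist.
fn reconcile(
    desired: &HashSet<NatEntry>,
    current: &HashSet<NatEntry>,
) -> (Vec<NatEntry>, Vec<NatEntry>) {
    let to_add = desired.difference(current).cloned().collect();
    let to_remove = current.difference(desired).cloned().collect();
    (to_add, to_remove)
}

fn main() {
    let desired: HashSet<NatEntry> = [NatEntry {
        external_ip: Ipv4Addr::new(10, 0, 0, 1),
        first_port: 0,
        last_port: 16383,
    }]
    .into_iter()
    .collect();
    let current = HashSet::new();

    let (to_add, to_remove) = reconcile(&desired, &current);
    // In the real task the additions/removals would be applied to the
    // `ipv4_nat_entry` table and a dendrite sync would then be triggered;
    // dendrite operations being idempotent makes re-application harmless.
    println!("add {}, remove {}", to_add.len(), to_remove.len());
}
```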
c794b1a to b88e966
This reverts commit a3d0f56. Switching to a "push"-based approach instead of "pulling" from dendrite. The push approach is simpler to implement, and it is easier to replace it with a pull approach later than to go the other way if we implemented pull first and then decided we preferred push. Another big motivator, possibly the most important one, is that omicron has access to information we would otherwise have to plumb into dendrite, like switch location and, in the near future, *rack* location, so it makes more sense for this logic to live in omicron today.
A few initial high-level observations.
@rcgoodfellow I actually had a similar thought and was going to talk to you about it, but it seems you are thinking along the same lines. I will pull out the saga nodes, replace them with RPW triggers, and give it a test.
@davepacheco I think this actually ended up being the golden nugget I needed. We were already iterating over entries in the rack table, but those records also track whether the rack has been initialized. It seems that after adding a db query (and an associated index to avoid a full table scan), I am now able to iterate over initialized racks instead of all entries in the rack table, which eliminates the entire need for any additional records / toggles / etc.
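For illustration, a hypothetical diesel sketch of that narrowed query; the `rack` schema here is a simplified stand-in (omicron's real table has more columns and different types), and the partial index mentioned in the comment is one way to satisfy the "avoid a full table scan" requirement.

```rust
use diesel::prelude::*;

// Hypothetical, simplified schema: illustrative only, not omicron's real
// `rack` table definition.
diesel::table! {
    rack (id) {
        id -> Text,
        initialized -> Bool,
    }
}

#[derive(Queryable, Debug)]
struct Rack {
    id: String,
    initialized: bool,
}

/// Fetch only racks that have completed initialization. Pairing this filter
/// with a partial index (e.g. `CREATE INDEX ... ON rack (id) WHERE
/// initialized = true`) keeps the query from scanning the whole table.
fn initialized_racks(conn: &mut PgConnection) -> QueryResult<Vec<Rack>> {
    use self::rack::dsl;

    dsl::rack
        .filter(dsl::initialized.eq(true))
        .select((dsl::id, dsl::initialized))
        .load(conn)
}
```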
* Do not use prebuilt clients for dpd and mgd throughout nexus
* Get infra address lot from db
Going to rename the RPW since it's not just a "switch port" RPW anymore, but I think we're finally ready to land this thing.
```
// I hate this. I know how to replace this transaction with
// CTEs but for the life of me I can't get it to work in
// diesel. I gave up and just extended the logic inside
// of the transaction instead of chasing diesel trait bound errors.
```
While it's fresh: I don't want to block your PR, but if you have the equivalent SQL for a CTE, could you post it in an issue? I can collaborate with you on making that work separately from this PR.
```
    .await
    .ok();

let db_lot = match found_lot {
```
In the context of a transaction, this "SELECT or else INSERT" behavior seems fine to me. "INSERT ... ON CONFLICT ... RETURNING" can really only tell you the value you just inserted anyway, so it'll omit any value which exists in the DB already.
(As you're already well-aware, CTEs are the right way to reduce this load with a smaller number of queries)
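For reference, a rough sketch of the kind of single-statement CTE being discussed: insert the lot if it is missing and return the row either way, working around the fact that `INSERT ... ON CONFLICT ... RETURNING` only reports newly inserted rows. The table, columns, and unique constraint here are hypothetical, not the actual omicron schema.

```rust
use diesel::prelude::*;
use diesel::sql_types::Text;

// Hypothetical schema: assumes `address_lot(id TEXT, name TEXT)` with a
// unique constraint on `name`. The real omicron table differs.
const UPSERT_ADDRESS_LOT: &str = r#"
WITH inserted AS (
    INSERT INTO address_lot (id, name)
    VALUES (gen_random_uuid()::text, $1)
    ON CONFLICT (name) DO NOTHING
    RETURNING id, name
)
SELECT id, name FROM inserted
UNION ALL
SELECT id, name FROM address_lot
WHERE name = $1
  AND NOT EXISTS (SELECT 1 FROM inserted)
"#;

#[derive(QueryableByName, Debug)]
struct AddressLotRow {
    #[diesel(sql_type = Text)]
    id: String,
    #[diesel(sql_type = Text)]
    name: String,
}

/// Insert-or-fetch in one statement instead of SELECT-then-INSERT inside
/// the transaction.
fn upsert_address_lot(
    conn: &mut PgConnection,
    name: &str,
) -> QueryResult<AddressLotRow> {
    diesel::sql_query(UPSERT_ADDRESS_LOT)
        .bind::<Text, _>(name)
        .get_result(conn)
}
```

Whether wiring this through diesel is worth it compared to the SELECT-or-INSERT inside the transaction is exactly the trade-off discussed above.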
```
use db::schema::address_lot::dsl as lot_dsl;
use db::schema::address_lot_block::dsl as block_dsl;

let address_lot_id = lot_dsl::address_lot
```
nitpick: This could be a single query with a join, but I think this is also fine as-is
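For what it's worth, a hypothetical diesel sketch of the single-query-with-a-join alternative; the table definitions are simplified stand-ins for omicron's `address_lot` / `address_lot_block` schema, which uses different column types.

```rust
use diesel::prelude::*;

// Illustrative, simplified schema only.
diesel::table! {
    address_lot (id) {
        id -> Text,
        name -> Text,
    }
}

diesel::table! {
    address_lot_block (id) {
        id -> Text,
        address_lot_id -> Text,
        first_address -> Text,
        last_address -> Text,
    }
}

diesel::joinable!(address_lot_block -> address_lot (address_lot_id));
diesel::allow_tables_to_appear_in_same_query!(address_lot, address_lot_block);

/// Look up the blocks for a named lot in one round trip instead of two.
fn blocks_for_lot(
    conn: &mut PgConnection,
    lot_name: &str,
) -> QueryResult<Vec<(String, String)>> {
    use self::address_lot::dsl as lot_dsl;
    use self::address_lot_block::dsl as block_dsl;

    lot_dsl::address_lot
        .inner_join(block_dsl::address_lot_block)
        .filter(lot_dsl::name.eq(lot_name))
        .select((block_dsl::first_address, block_dsl::last_address))
        .load(conn)
}
```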
```
let blocks = blocks
    .into_iter()
    .filter(|b| {
        !db_blocks.iter().any(|db_b| {
            db_b.first_address == b.first_address
                || db_b.last_address == b.last_address
        })
    })
    .collect::<Vec<_>>();
```
To be clear, this is looking for cases where:

- We have `blocks` that come from params
- In the database, we looked up `db_blocks`, which come from the same address lot and happen to have the same "first address" or the same "last address" as the parameter-supplied `blocks` (lines 99-114)

And this particular filter identifies:

- If there are any `blocks` where `db_blocks` has the same "first address" or "last address", we filter them out of `blocks`?

Am I misunderstanding this? Aren't we explicitly looking for that overlap - why then filter it out?
We're supposed to be filtering for blocks to insert, but I may have gotten my wires crossed here. I'll take a second look here.
Yeah, this actually doesn't work quite right for what we're planning to return below. I'll need to tweak it. It should be inserting the correct stuff though.
@smklein this should be fixed now
Things look good in a4x2. Onward to on-rack testing.
Overview

This PR ensures the rest of our `dpd` configuration is covered by an RPW to help recover state in the event of `dendrite` crashing, the switch zone restarting / being replaced, or the sled restarting. This is accomplished via a background task in Nexus that periodically ensures `dpd` is up to date with the tables in Nexus. The tradeoff of this design is that we don't track versioning and instead reconcile the entire state every time, but since the actual number of ports will never be that high (relative to something like NAT entries), trading some efficiency for much greater simplicity seems to make sense today, and it requires much less rework in Nexus and Dendrite should we choose to replace this strategy down the road.
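To make the shape of that background task concrete, here is a minimal, hypothetical sketch of a timer-driven full reconcile loop. It assumes a tokio runtime (rt, macros, and time features) and uses stand-in function names; it is not the actual Nexus background-task machinery.

```rust
// A minimal sketch of a "reconcile everything on a timer" loop.
// Names and intervals are illustrative only.
use std::time::Duration;

/// Stand-in for the real work: read the desired switch configuration from
/// the Nexus tables and push it to dpd, relying on idempotent apply
/// operations instead of version tracking.
async fn reconcile_switch_config() {
    println!("reconciling dpd configuration");
}

#[tokio::main]
async fn main() {
    let mut ticker = tokio::time::interval(Duration::from_secs(30));
    // Bounded here so the sketch terminates; the real task runs for the
    // lifetime of Nexus.
    for _ in 0..3 {
        ticker.tick().await;
        reconcile_switch_config().await;
    }
}
```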
Tasks

* Replace `HashMap<SwitchLocation, Client>` with on-demand lookups, and eventually DNS #5092

Verifications Performed

Related

Closes #4715
Closes #4650
Depends on https://github.com/oxidecomputer/dendrite/pull/838