[nexus][sled-agent] Try more to load firewall rules during service bringup #5054

smklein · 2024-02-13T20:48:44Z

Fixes #5053, with this solution implemented:

Nexus Internal Endpoint to Ask for FW Rules, Triggered After Individual Zones Launch. Rather than waiting for "all zones to launch", the Sled Agent could just re-apply firewall rules after any zones that need FW rules come up.

I used a bit of a brute-force solution here: "just do it each time we try to launch a set of zones". This could certainly be more fine-grained, but it gets the job done.


smklein · 2024-02-13T20:51:58Z

sled-agent/src/sled_agent.rs

+                // Load as many services as we can, and don't exit immediately
+                // upon failure...
+                let load_services_result =
+                    self.inner.services.load_services().await.map_err(|err| {
+                        BackoffError::transient(Error::from(err))
+                    });
+
+                // ... and request firewall rule updates for as many services as
+                // we can. Note that we still make this request even if we only
+                // partially load some services.
+                let firewall_result = self
+                    .request_firewall_update()
                    .await
-                    .map_err(|err| BackoffError::transient(err))
+                    .map_err(|err| BackoffError::transient(err));
+
+                // Only complete if we have loaded all services and firewall
+                // rules successfully.
+                load_services_result.and(firewall_result)


By running these two operations back-to-back, we basically "load as many firewall rules as we can" for whichever services have loaded, even if some of them failed.
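Here is a minimal, self-contained sketch of the "attempt both, then combine" pattern in the diff. `load_services`, `request_firewall_update`, `attempt` and `StepError` are hypothetical stand-ins for `self.inner.services.load_services()`, the sled-agent's `request_firewall_update()` and their real (async) error types. The sketch only shows that both steps run on every attempt, and that `Result::and` reports the services error first when both fail.

```rust
/// Hypothetical error type standing in for the sled-agent's real errors.
#[derive(Debug, PartialEq)]
enum StepError {
    Services(String),
    Firewall(String),
}

/// Hypothetical stand-in for loading services; records that it ran.
fn load_services(fail: bool, log: &mut Vec<&'static str>) -> Result<(), StepError> {
    log.push("load_services");
    if fail {
        Err(StepError::Services("disk not ready".to_string()))
    } else {
        Ok(())
    }
}

/// Hypothetical stand-in for requesting a firewall update; records that it ran.
fn request_firewall_update(fail: bool, log: &mut Vec<&'static str>) -> Result<(), StepError> {
    log.push("request_firewall_update");
    if fail {
        Err(StepError::Firewall("nexus unreachable".to_string()))
    } else {
        Ok(())
    }
}

/// One retry attempt: both steps always run, and the attempt succeeds
/// only if both did.
fn attempt(
    services_fail: bool,
    firewall_fail: bool,
    log: &mut Vec<&'static str>,
) -> Result<(), StepError> {
    // No `?` here: an early return would skip the firewall request.
    let load_services_result = load_services(services_fail, log);
    // Request firewall rules even if only some services came up.
    let firewall_result = request_firewall_update(firewall_fail, log);
    // `Result::and` returns the services error if there is one,
    // otherwise the firewall result.
    load_services_result.and(firewall_result)
}
```

In the real loop, an `Err` from the combined result is wrapped as `BackoffError::transient`, so the whole attempt is retried, including the firewall request.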

I could make this trigger more selectively, only for services that need firewall rules (e.g., the request is only made after adding a Nexus / External DNS / boundary NTP zone), but requesting on every attempt felt more "obviously not missing anything".

I could push the triggers down into the ServiceManager if folks feel strongly about this.
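For comparison, here is a rough sketch of the more selective trigger described above. `ZoneKind`, `needs_firewall_rules` and `should_request_firewall_update` are hypothetical. The set of zones that need rules is taken only from the examples in the comment (Nexus, External DNS, boundary NTP) and is not an authoritative list.

```rust
/// Hypothetical subset of zone kinds; the real sled-agent types differ.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum ZoneKind {
    Nexus,
    ExternalDns,
    BoundaryNtp,
    InternalNtp,
    Crucible,
}

/// Zones assumed to need firewall rules, per the examples in the comment.
fn needs_firewall_rules(kind: ZoneKind) -> bool {
    matches!(kind, ZoneKind::Nexus | ZoneKind::ExternalDns | ZoneKind::BoundaryNtp)
}

/// Only ask Nexus for firewall rules if this batch launched such a zone.
fn should_request_firewall_update(launched: &[ZoneKind]) -> bool {
    launched.iter().copied().any(needs_firewall_rules)
}
```

This skips requests when only zones without firewall rules start. The risk is that a zone kind missing from the predicate would silently never get its rules, and that risk is the reason given above for requesting on every attempt instead.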


luqmana

Thanks for putting this together so quickly, Sean!

luqmana · 2024-02-13T21:41:48Z


Hmm, like this we'll continue to hit the endpoint if we get stuck in the disk error loop, right? I wonder if constantly trying to update FW rules causes any issues. Oh, but there's the backoff in the retry loop, so it should be fine.
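The backoff mentioned here is what limits how often a stuck sled-agent hits the Nexus endpoint. The PR doesn't state the retry policy's parameters. The sketch below therefore assumes a plain exponential backoff with no jitter: a 1 s initial delay, doubling each time, capped at 60 s. It only shows that a loop that keeps failing settles at one request per cap interval. `backoff_delay` and `attempts_within` are hypothetical helpers.

```rust
use std::time::Duration;

/// Delay after failed attempt number `attempt` (0-based): `initial * 2^attempt`,
/// capped at `max`. This is an assumed policy, not the sled-agent's actual one.
fn backoff_delay(attempt: u32, initial: Duration, max: Duration) -> Duration {
    // Saturate the multiplier instead of overflowing on large attempt counts.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    initial.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// How many attempts (and thus firewall-update requests) start within
/// `window`, assuming every attempt fails immediately.
fn attempts_within(window: Duration, initial: Duration, max: Duration) -> u32 {
    assert!(!initial.is_zero(), "a zero initial delay would never advance time");
    let mut next_start = Duration::ZERO;
    let mut attempts = 0;
    while next_start <= window {
        attempts += 1;
        next_start += backoff_delay(attempts - 1, initial, max);
    }
    attempts
}
```

With these assumed parameters, a sled stuck for 10 minutes makes 15 attempts and then settles at one request per minute. The real numbers depend on the retry policy the sled-agent actually uses.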

FelixMcFelix

Thanks for the very quick turnaround! This looks about right to me from what we talked over.

[nexus][sled-agent] Try harder to load firewall rules during service bringup (9d5d238)

smklein commented Feb 13, 2024


Comments (3b45391)

smklein requested review from luqmana and FelixMcFelix February 13, 2024 20:54

smklein marked this pull request as ready for review February 13, 2024 20:54

expectorate (23bb273)

luqmana approved these changes Feb 13, 2024


FelixMcFelix approved these changes Feb 13, 2024


smklein merged commit 8b4d2e9 into main Feb 13, 2024
21 checks passed

smklein deleted the firewall-shenanigans branch February 13, 2024 22:47
