Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconfigurator: Record external networking allocations when realizing a blueprint #5045

Merged
merged 10 commits into from
Feb 21, 2024

Conversation

jgallagher
Copy link
Contributor

@jgallagher jgallagher commented Feb 12, 2024

This PR expands blueprint execution to record external IPs and created NICs for Nexus, Boundary NTP, and External DNS in crdb before sending new zone requests to sled-agents.

The implementation has a very obvious TOCTOU race, but I think it's okay (as explained in the comments inline). If two Nexuses try to realize the same blueprint simultaneously and both see no records present, only one will succeed to insert, and the other will spuriously fail. Assuming that failure causes a retry, subsequent attempts to realize the same blueprint will succeed, as the required records will be present. If this seems wrong, please holler!

I'd like to give this a spin on either madrid or one of the software testbeds before merging, but I think this is ready for review.

Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

// one. Bail out here if this zone already has one or more allocated
// NICs but not the one we think it needs.
//
// This doesn't check the allocated NIC's subnet against our NICs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable.

// Only attempt to allocate `external_ip` if it isn't already assigned
// to this zone.
//
// Checking for the existing of the external IP and then creating it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me.

// if somehow this same service got a _different_ NIC allocated
// to it in the TOCTOU race window above. That should be
// impossible with the way we generate blueprints, so we'll just
// return a scary error here and expect to never see it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✔️

.await
.expect("failed to expand service IP pool");

// Generate the values we care about. (Other required zone config params
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strategy of champion testers everywhere!

allocated_ip.first_port == SqlU16(first)
&& allocated_ip.last_port == SqlU16(last)
})
.unwrap_or(true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does port_range being None mean here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two kinds of external IPs that have separate methods (ensure_external_service_ip and ensure_external_service_snat_ip) that both want to call this method. In the former case, all we have is an IP address, no port range, so when it calls this function it passes None. In the latter case, we have a SourceNatConfig value, which includes an IP and a port range, so when it calls this function it passes Some(_).

In practice, the non-SNAT IPs are given a port range of (0, 65535), but it seemed weird to hardcode that here when it's a decision being made elsewhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm interesting, should we use a type that's isomorphic to Option but not actually that? Definitely was surprised.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(In particular, Ord doesn't make sense I think)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, where is Ord coming into play?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, what I mean is that if we define our own type, we can choose to not implement Ord on it, since it doesn't make sense in this case. We could also define other operations like subset_of/superset_of, which is what I think this code is trying to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(But this is followup material, doesn't have to be done here.)

Copy link
Contributor Author

@jgallagher jgallagher Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want subset/superset here; this is just a straight equality check (except for having to convert to SqlU16 to actually do the comparison). Is there a clearer way to express that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, no I think it was just that I misunderstood what this was doing :)

Comment on lines +264 to +267
format!(
"failed to allocate IP to {zone_type} {service_id}: \
{external_ip}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a note here saying that this is expected in case of a race? (Or maybe this is already done at a higher level.)

I'm thinking about future debugging sessions where a simple error in the logs can be confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I hope the error we get from the datastore would make it clear whether a failure is due to a duplication issue or some other transient db issue. I'm hesitant to label this as "expected if we're racing" when the underlying error might be some other kind of db problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, just trying to think it through.

Comment on lines +317 to +322
.with_context(|| {
format!(
"failed to allocate snat IP to {zone_type} {service_id}: \
{snat:?}"
)
})?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment applies here I guess, and maybe it does suggest that there is (or should be) a higher-level note.

service_id: Uuid,
external_ip: IpAddr,
ip_name: &Name,
) -> anyhow::Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on use anyhow::Result?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objection, I just have a light preference for this because when I see Result<()> I have to either remember or look up which typealias it is.

@andrewjstone
Copy link
Contributor

On testbed, I'm seeing this warning as soon as I generate a blueprint for g2, which certainly doesn't have a nexus zone yet:

00:25:24.307Z WARN c33a7560-3972-4cc0-8418-aa5db7cd5026 (ServerContext): zone has unexpected NICs allocated
    allocated_nics = [ServiceNetworkInterface { identity: ServiceNetworkInterfaceIdentity { id: 3fac0bfd-5bba-402c-940c-99b7f1bda592, name: Name(Name("nexus-5a581037-15b9-4196-9d0b-cf20a77e86b5")), description: "nexus service vNIC", time_created: 2024-02-13T23:28:19.951897Z, time_modified: 2024-02-13T23:28:19.951897Z, time_deleted: None }, service_id: 5a581037-15b9-4196-9d0b-cf20a77e86b5, vpc_id: 001de000-074c-4000-8000-000000000000, subnet_id: 001de000-c470-4000-8000-000000000002, mac: MacAddr(MacAddr(MacAddr6([168, 64, 37, 255, 194, 97]))), ip: V4(Ipv4Network { addr: 172.30.2.5, prefix: 32 }), slot: 1, primary: true }]
    background_task = blueprint_executor
    comment =
    file = nexus/blueprint-execution/src/resource_allocation.rs:195
    want_nic = NetworkInterface { id: 3fac0bfd-5bba-402c-940c-99b7f1bda592, ip: 172.30.2.5, kind: Service(5a581037-15b9-4196-9d0b-cf20a

@andrewjstone
Copy link
Contributor

I think the previous error is preventing blueprint execution. No omicron-zones.json is showing up and the ntp zone isn't starting. I'm going to test on main to see if I can get nexus deploying without this code.

@bnaecker
Copy link
Collaborator

@andrewjstone is this possibly being caused by #5056?

@andrewjstone
Copy link
Contributor

@andrewjstone is this possibly being caused by #5056?

Ah very possibly.

@jgallagher
Copy link
Contributor Author

Yeah, it definitely is, and I strongly suspect the NIC it's warning about is for one of the existing Nexus zones. This PR attempts to reconcile all of the external IP allocations, which should be a no-op for anything set up during RSS, but is failing because of the mismatched slot number. Fixing #5056 is at the top of my list.

jgallagher added a commit that referenced this pull request Feb 15, 2024
…5065)

This makes a several minor changes to plumb slots through:

* `IncompleteNetworkInterface` now stores an optional slot, just like it
stores an optional IP/MAC address
* In `network_interfaces::InsertQuery`, if the incoming slot is set, we
use it directly instead of running the `NextItem`-based subquery
* Adds a partial index to ensure uniqueness of a slot within a single
`parent_id` (I believe this is correct, but would love confirmation from
someone more familiar!)
* `IncompleteNetworkInterface::new_service()` now takes a _non-optional_
IP, MAC address, and slot. This matches how it was called in all
non-test code.
* Tweaked the Nexus internal API used for RSS handoff to include the
slot in the description of NICs.

This is a partial fix for #5056, and should produce correct behavior on
new systems that run through RSS, even without a fix for #5055 (because
we bypass `NextItem` altogether with this change). In particular, I
think this should unblock testing of #5045 on madrid / testbeds. It does
not address the already-recorded-NICs-with-incorrect-slots on systems
like dogfood; I'll take care of that in a subsequent PR.
jgallagher added a commit that referenced this pull request Feb 15, 2024
…ist-eips` (#5064)

This PR has some omdb commands I wanted during dev/debug of #5045. It
expands the output of `list-vnics` to include the parent's ID
(particularly useful when trying to determine the external IP of a
specific Nexus instance, for example):

```
 IP                PORTS        KIND       STATE     OWNER_KIND  OWNER_ID                              OWNER_DESCRIPTION
 10.1.1.3/32       0/65535      floating   Attached  instance    4e6fb33a-7ba2-4a5e-abc5-dc9b047c01e0  v6/some-vm2
 10.1.1.4/32       0/16383      SNAT       Attached  instance    4e6fb33a-7ba2-4a5e-abc5-dc9b047c01e0  v6/some-vm2
 10.1.1.5/32       0/65535      ephemeral  Attached  instance    4e6fb33a-7ba2-4a5e-abc5-dc9b047c01e0  v6/some-vm2
 172.20.26.1/32    0/65535      floating   Attached  service     edd99650-5df1-4241-815d-253e4ef2399c  ExternalDns
 172.20.26.2/32    0/65535      floating   Attached  service     f500d564-c40a-4eca-ac8a-a26b435f2037  ExternalDns
 172.20.26.3/32    0/65535      floating   Attached  service     65a11c18-7f59-41ac-b9e7-680627f996e7  Nexus
 172.20.26.4/32    0/65535      floating   Attached  service     20b100d0-84c3-4119-aa9b-0c632b0b6a3a  Nexus
 172.20.26.5/32    0/65535      floating   Attached  service     2898657e-4141-4c05-851b-147bffc6bbbd  Nexus
 172.20.26.6/32    0/16383      SNAT       Attached  service     c3ec3d1a-3172-4d36-bfd3-f54a04d5ba55  Ntp
```

and adds a `list-vnics` command to show allocated vnics:

```
 IP                 MAC                SLOT  PRIMARY  KIND      SUBNET           PARENT_ID                             DESCRIPTION
 172.30.0.5/32      A8:40:25:F8:A5:8C  1     true     instance  172.30.0.0/22    2a4afdda-e269-48bc-913f-01ad57c50543  default primary interface for p4
 172.30.0.5/32      A8:40:25:F5:AF:F0  1     true     instance  172.30.0.0/22    be705808-d507-4693-9a97-186c92970e7b  default primary interface for updateinst
 172.30.0.5/32      A8:40:25:F7:3B:00  1     true     instance  172.30.0.0/22    0ab1939f-af6e-4ea2-a155-71f210e937fc  a sample nic
 172.30.1.5/32      A8:40:25:FF:B0:9C  1     true     service   172.30.1.0/24    edd99650-5df1-4241-815d-253e4ef2399c  external_dns service vNIC
 172.30.1.6/32      A8:40:25:FF:D0:B4  1     true     service   172.30.1.0/24    f500d564-c40a-4eca-ac8a-a26b435f2037  external_dns service vNIC
 172.30.2.5/32      A8:40:25:FF:A6:83  1     true     service   172.30.2.0/24    65a11c18-7f59-41ac-b9e7-680627f996e7  nexus service vNIC
 172.30.2.6/32      A8:40:25:FF:B4:C1  1     true     service   172.30.2.0/24    20b100d0-84c3-4119-aa9b-0c632b0b6a3a  nexus service vNIC
 172.30.2.7/32      A8:40:25:FF:C6:59  0     true     service   172.30.2.0/24    2898657e-4141-4c05-851b-147bffc6bbbd  nexus service vNIC
 172.30.3.5/32      A8:40:25:FF:B2:52  1     true     service   172.30.3.0/24    c3ec3d1a-3172-4d36-bfd3-f54a04d5ba55  ntp service vNIC
 172.30.3.6/32      A8:40:25:FF:A0:F9  1     true     service   172.30.3.0/24    6ea2684c-115e-48a6-8453-ab52d1cecd73  ntp service vNIC
```

(This command immediately revealed issues with the slot number recording
on dogfood, which led to opening #5056.)
@jgallagher
Copy link
Contributor Author

@andrewjstone I've caught up with main which has fixed #5056, if you want to give this another try.

@andrewjstone
Copy link
Contributor

@andrewjstone I've caught up with main which has fixed #5056, if you want to give this another try.

Will do!

@andrewjstone
Copy link
Contributor

@andrewjstone I've caught up with main which has fixed #5056, if you want to give this another try.

Will do!

I re-ran this and blueprints are executing successfully. Unfortunately I didn't remember to add an extra nexus node to the code, so I'm stuck at 3 nexuses for the original RSS nodes. However, the bug appears to be fixed regarding #5056, since the executor is successfully deploying zones.

@jgallagher
Copy link
Contributor Author

I ran this through a test on madrid, and it was almost successful. I was able to add a 4th Nexus, it did go to the one sled that didn't have one from RSS, the allocations for external IP and vNIC looked valid and were recorded in cockroach, and the background NAT RPW set up the NAT rules for it after the inventory collection picked it up. I ran into two problems (one extremely minor that may not even be worth fixing, and one serious):

  1. omdb db network list-eips failed to show the new external IP, because it tries to map the service name from the services table. We're not creating an entry in services due to its planned removal (retire "services" table #4947). I'll see if I can fix up omdb here.
  2. sled-agent didn't get OPTE firewall rules for the new service, so even though Nexus existed and had NAT plumbing, it wasn't accepting inbound connections.

I'm going to poke around 2 a bit and will open an issue soon.

@jgallagher
Copy link
Contributor Author

Haha 🤦‍♂️ the test failure on the latest commit is legit. I think "always request new firewall rules when getting zones" is probably okay once Nexus is up, but can't gate RSS. I need to think about this more.

@jgallagher
Copy link
Contributor Author

Haha 🤦‍♂️ the test failure on the latest commit is legit. I think "always request new firewall rules when getting zones" is probably okay once Nexus is up, but can't gate RSS. I need to think about this more.

We chatted about this in the update sync. I reverted the last commit that attempted to quickly fix the firewall rule problem, and will land this as-is. Everything here is correct and necessary to the best of our knowledge; it just isn't fully complete and functional. I'll open a followup PR with the firewall rule changes later.

@jgallagher jgallagher merged commit 55ef8da into main Feb 21, 2024
20 checks passed
@jgallagher jgallagher deleted the john/blueprint-nexus-allocate-networking branch February 21, 2024 22:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants