
Add on-demand communication probes. #4585

Merged: 3 commits from commprobe into main on Mar 9, 2024
Conversation


@rcgoodfellow (Contributor) commented Nov 30, 2023

This PR adds the following.

Probes

A new first-class Omicron element called a probe is introduced. A probe is similar to an instance but is underpinned by a zone instead of an HVM. They are managed much like instances in terms of lifecycle. They have network interfaces on a VPC that are externally reachable via ephemeral IP addresses. They also have network interfaces on the underlay network. The primary function of probes in this initial PR is for network testing. They come with daemons like thundermuffin pre-installed as SMF services to facilitate communications testing both through boundary services and within the rack. Probes may become more general over time to facilitate more kinds of testing beyond networking.
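
For a sense of the shape, here is a minimal sketch of what a probe-creation request might look like; the type and field names below are illustrative assumptions, not the actual Omicron types added by this PR.

```rust
// Illustrative sketch only: `ProbeCreate` and its fields are assumptions for
// exposition, not the actual Omicron API types introduced by this PR.
use uuid::Uuid;

/// Hypothetical parameters for creating a probe.
pub struct ProbeCreate {
    /// Name and description, as for other Omicron resources.
    pub name: String,
    pub description: String,
    /// The sled whose zone will back the probe.
    pub sled: Uuid,
    /// IP pool from which the probe's ephemeral external address is drawn.
    pub ip_pool: Option<String>,
}
```

Creation, listing, and deletion then follow the same lifecycle pattern as instances.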

The idea for probes came from the desire to test the Oxide stack in an environment where the hardware is constructed by interconnected virtual machines. Nested virtualization is undesirable for a number of reasons, so probes plus virtual rack topologies give us the ability to test a significant surface area of the control plane and the underlying systems it manages, such as networking and storage. While the motivation for probes was to test in virtual environments, they're just as applicable to hardware environments. Because they're a first-class Omicron element, the same tests written using probes in virtual environments can run on real racks.

Probes currently require fleet admin privileges. This is enforced in the API handlers.

A 4-gimlet 2-sidecar CI test

A primary goal of building the probes mechanism is to have automated multi-switch, multi-sled tests that run in CI. This PR takes the first step along that path by adding two new CI jobs. The a4x2-prepare job builds and packages omicron for each of the 4 gimlets in the topology, including RSS configuration, individual sled configuration, and faux-mgs configuration for each scrimlet. The omicron packages and their configuration bits are then tarred up and provided as artifacts for the a4x2-deploy job.

The a4x2-deploy job extracts the artifacts from a4x2-prepare into a folder called cargo-bay, which contains a top-level folder for each virtual gimlet holding that gimlet's omicron configuration and deployment archives. The a4x2 falcon topology knows to look for these folders and mounts each one into the corresponding gimlet via P9fs. Each gimlet's cargo-bay folder also contains an initialization script that installs omicron in the VM using omicron-install; falcon runs this script automatically (this behavior lives in the code for the topology itself rather than being hard coded elsewhere). Once the topology has been launched, the control plane is on its way toward coming up.
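
As a rough, hypothetical sketch of that layout (folder and script names other than cargo-bay are assumptions, and this is not the actual deploy code), a sanity check over the cargo bay before launching the topology might look like this:

```rust
// Hypothetical sanity check of the cargo-bay layout described above; names
// like "init.sh" are placeholders, not the actual file names used by a4x2.
use std::path::Path;

fn check_cargo_bay(cargo_bay: &Path, gimlets: &[&str]) -> anyhow::Result<()> {
    for g in gimlets {
        // Each gimlet's folder is mounted into its VM by falcon via p9fs.
        let dir = cargo_bay.join(g);
        anyhow::ensure!(dir.is_dir(), "missing cargo-bay entry for {}", g);
        // The initialization script (run automatically by falcon) installs
        // omicron inside the VM via omicron-install.
        let init = dir.join("init.sh");
        anyhow::ensure!(init.is_file(), "missing init script for {}", g);
    }
    Ok(())
}
```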

The way we verify that the virtual rack is functional is through probes. This PR adds a new testing program called commtest in the end-to-end-tests directory. It takes the address of the Oxide API as a parameter and waits for the /ping endpoint to become responsive. Once that happens, a probe is launched on each sled in the topology and a basic ICMP connectivity test is run against each probe, checking that packet loss stays within a configurable threshold. If the test passes, the CI job passes.
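
A minimal sketch of that flow is below; the helper functions (wait_for_ping, launch_probe, icmp_loss_percent) are hypothetical stand-ins rather than the actual commtest code or Oxide client API.

```rust
// Sketch of the commtest flow described above. The helpers are hypothetical
// stand-ins, not the actual end-to-end-tests API.
use std::net::IpAddr;
use std::time::Duration;
use uuid::Uuid;

const MAX_LOSS_PERCENT: f64 = 1.0; // loss threshold; configurable in practice

fn wait_for_ping(api_addr: &str, timeout: Duration) -> anyhow::Result<()> {
    // Poll the Oxide API /ping endpoint until it responds or `timeout` elapses.
    unimplemented!()
}

fn launch_probe(api_addr: &str, sled: Uuid) -> anyhow::Result<IpAddr> {
    // Create a probe on `sled` and return its ephemeral external address.
    unimplemented!()
}

fn icmp_loss_percent(target: IpAddr, count: u32) -> anyhow::Result<f64> {
    // Send `count` ICMP echo requests and return the percentage lost.
    unimplemented!()
}

fn run_commtest(api_addr: &str, sleds: &[Uuid]) -> anyhow::Result<()> {
    // Wait for the control plane to answer on /ping.
    wait_for_ping(api_addr, Duration::from_secs(600))?;

    // Launch one probe per sled, collecting each probe's external IP.
    let probe_ips: Vec<IpAddr> = sleds
        .iter()
        .map(|sled| launch_probe(api_addr, *sled))
        .collect::<anyhow::Result<_>>()?;

    // Ping each probe; fail the test if loss exceeds the threshold.
    for ip in &probe_ips {
        let loss = icmp_loss_percent(*ip, 30)?;
        anyhow::ensure!(
            loss <= MAX_LOSS_PERCENT,
            "probe {}: {}% ICMP loss exceeds {}%",
            ip, loss, MAX_LOSS_PERCENT
        );
    }
    Ok(())
}
```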

In addition to the virtual rack, the test topology contains a pair of routers running the Arista network stack that the sidecars connect to. The rack and these routers exchange routes over BGP. A third router, the customer edge (ce), connects to both Arista routers and runs Linux+FRR. It advertises a default route to both Arista routers, which propagates via BGP to the Oxide rack routers; similarly, the IP pool address block the Oxide rack advertises propagates to the customer edge router via the Arista routers over BGP.

The customer edge VM also runs an iptables configuration that decouples the testing network from the host network on the lab machine the entire topology runs on. All that needs to be done on the lab machine to connect to the rack is to create a route to the IP pool block the rack is using, with the IP address of the customer edge machine's external interface as the nexthop. This is done in the a4x2-deploy job.

Depends on

@rcgoodfellow force-pushed the commprobe branch 2 times, most recently from 3726743 to 774e1ed on December 1, 2023 02:51
@rcgoodfellow force-pushed the commprobe branch 3 times, most recently from 563a152 to 23b96d9 on December 13, 2023 09:17
@rcgoodfellow added the networking label Dec 14, 2023
@rcgoodfellow self-assigned this Dec 22, 2023
@rcgoodfellow force-pushed the commprobe branch 2 times, most recently from 0fee0f5 to aecff37 on December 22, 2023 05:35
@jordanhendricks self-requested a review December 22, 2023 19:24
@rcgoodfellow force-pushed the commprobe branch 17 times, most recently from fe3fe2c to 0e047a6 on January 4, 2024 03:02
@rcgoodfellow (Contributor, Author) commented:

There are still a few TODOs in here to take care of, but I think this is generally ready for review.

@rcgoodfellow marked this pull request as ready for review January 6, 2024 07:06
@rcgoodfellow force-pushed the commprobe branch 2 times, most recently from 134e768 to 884bf18 on February 27, 2024 04:44
@ahl (Contributor) commented Feb 27, 2024

Probes could be useful as a tool to narrow down problems for operators. If an end user reports a connectivity problem with a particular instance, an operator could launch a probe on the same sled, drawing an external address from the same IP pool, to help determine whether the connectivity issue is at the instance level or the sled level, or launch several probes to see if there is a broader connectivity issue.

That being said, the primary use for probes right now is to ensure working end-to-end networking for PRs in CI.

In that case, my suggestion would be to keep it out of the documented API / CLI / etc. What do you think?

@rcgoodfellow (Contributor, Author) commented:

> In that case, my suggestion would be to keep it out of the documented API / CLI / etc. What do you think?

I've moved the probes API endpoints to an "experimental" API in 4f58a4a.

@sunshowers (Contributor) left a comment

(Just a comment on experimental API endpoints.)

Thank you for taking this on! I've been wanting something like this for a while, and I really like this solution (my own half-baked solution was to use sets of tags, but this is easier to read, I think).

Review threads (resolved): nexus/src/external_api/http_entrypoints.rs (two threads)
@FelixMcFelix (Contributor) left a comment

Thanks Ry; from what I've looked over, this is a very nice addition. I haven't yet had the chance to stand this up and test it locally, but I'm hoping to do so shortly. I've left some questions throughout.

On 'what' we can test with this, I guess this gives us really good mileage for any traffic carried by a VPC (Instance<->Probe and Probe<->Rack-external). If we wanted to ask more targeted questions about the underlay itself, would we create probes and zlogin?

Review threads (all resolved):
dev-tools/omdb/src/bin/omdb/db.rs
end-to-end-tests/Cargo.toml
end-to-end-tests/src/helpers/icmp.rs (two threads)
nexus/blueprint-execution/src/resource_allocation.rs
nexus/db-queries/src/db/datastore/probe.rs (two threads)
nexus/db-queries/src/db/queries/network_interface.rs
nexus/src/app/sagas/switch_port_settings_apply.rs
nexus/types/src/external_api/params.rs
@FelixMcFelix (Contributor) left a comment

Thanks for handling the nits/papercuts etc., looks good!

@rcgoodfellow force-pushed the commprobe branch 4 times, most recently from e7d3a2c to fbf34dd on March 9, 2024 04:24
@rcgoodfellow (Contributor, Author) commented:

Unfortunately, I need to disable the a4x2 CI jobs in this initial commit. The zones and services within VMs come up too unreliably and would be a major source of friction for day-to-day development. We'll continue to push to make things more reliable within VMs.

The primary issue appears to be SMF service startup failures within zones. Sometimes it's CRDB, sometimes it's more basic things like ndp. These service startup failures seem to correspond to periods of heavy I/O. I did try running zpool trim as a precursor to a4x2 CI runs, and that sped things up significantly; unfortunately, several runs with the benefit of trim show that reliability issues remain. I see these issues much less often on my local workstation, so it does appear to be related to system performance in some way.

The CI tests and machinery remain in place within this PR, but the buildomat enable property is set to false.

@rcgoodfellow merged commit 65cbb82 into main Mar 9, 2024
23 checks passed
@rcgoodfellow deleted the commprobe branch March 9, 2024 19:00