
Add on-demand communication probes. #4585

Merged: 3 commits from commprobe into main on Mar 9, 2024
Conversation


@rcgoodfellow (Contributor) commented Nov 30, 2023

This PR adds the following.

Probes

A new first-class Omicron element called a probe is introduced. A probe is similar to an instance but is underpinned by a zone instead of an HVM. They are managed much like instances in terms of lifecycle. They have network interfaces on a VPC that are externally reachable via ephemeral IP addresses. They also have network interfaces on the underlay network. The primary function of probes in this initial PR is for network testing. They come with daemons like thundermuffin pre-installed as SMF services to facilitate communications testing both through boundary services and within the rack. Probes may become more general over time to facilitate more kinds of testing beyond networking.
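
For a sense of the shape, here is a minimal sketch of what a probe-creation request might look like; the type and field names below are illustrative assumptions, not the actual Omicron types added by this PR.

```rust
// Illustrative sketch only: `ProbeCreate` and its fields are assumptions for
// exposition, not the actual Omicron API types introduced by this PR.
use uuid::Uuid;

/// Hypothetical parameters for creating a probe.
pub struct ProbeCreate {
    /// Name and description, as for other Omicron resources.
    pub name: String,
    pub description: String,
    /// The sled whose zone will back the probe.
    pub sled: Uuid,
    /// IP pool from which the probe's ephemeral external address is drawn.
    pub ip_pool: Option<String>,
}
```

Creation, listing, and deletion then follow the same lifecycle pattern as instances.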

The idea for probes came from the desire to test the Oxide stack in an environment where the hardware is constructed by interconnected virtual machines. Nested virtualization is undesirable for a number of reasons, so probes plus virtual rack topologies give us the ability to test a significant surface area of the control plane and the underlying systems it manages, such as networking and storage. While the motivation for probes was to test in virtual environments, they're just as applicable to hardware environments. Because they're a first-class Omicron element, the same tests written using probes in virtual environments can run on real racks.

Probes currently require fleet admin privileges. This is enforced in the API handlers.

A 4-gimlet 2-sidecar CI test

A primary goal of building the probes mechanism is to have automated multi-switch, multi-sled tests that run in CI. This PR takes the first step along that path by adding two new CI jobs. The a4x2-prepare job builds and packages omicron for each of the 4 gimlets in the topology, including RSS configuration, individual sled configuration, and faux-mgs configuration for each scrimlet. The omicron packages and their configuration bits are then tarred up and provided as artifacts for the a4x2-deploy job.

The a4x2-deploy job extracts the artifacts from a4x2-prepare into a folder called cargo-bay, which contains a top-level folder for each virtual gimlet holding that gimlet's omicron configuration and deployment archives. The a4x2 falcon topology knows to look for these folders and mounts each one into the corresponding gimlet via P9fs. Each gimlet's cargo-bay folder also contains an initialization script that installs omicron in the VM using omicron-install; falcon runs this script automatically (this behavior lives in the code for the topology itself rather than being hard coded elsewhere). Once the topology has been launched, the control plane is on its way toward coming up.
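
As a rough, hypothetical sketch of that layout (folder and script names other than cargo-bay are assumptions, and this is not the actual deploy code), a sanity check over the cargo bay before launching the topology might look like this:

```rust
// Hypothetical sanity check of the cargo-bay layout described above; names
// like "init.sh" are placeholders, not the actual file names used by a4x2.
use std::path::Path;

fn check_cargo_bay(cargo_bay: &Path, gimlets: &[&str]) -> anyhow::Result<()> {
    for g in gimlets {
        // Each gimlet's folder is mounted into its VM by falcon via p9fs.
        let dir = cargo_bay.join(g);
        anyhow::ensure!(dir.is_dir(), "missing cargo-bay entry for {}", g);
        // The initialization script (run automatically by falcon) installs
        // omicron inside the VM via omicron-install.
        let init = dir.join("init.sh");
        anyhow::ensure!(init.is_file(), "missing init script for {}", g);
    }
    Ok(())
}
```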

The way we verify that the virtual rack is functional is through probes. This PR adds a new testing program called commtest in the end-to-end-tests directory. It takes the address of the Oxide API as a parameter and waits for the /ping endpoint to become responsive. Once that happens, a probe is launched on each sled in the topology and a basic ICMP connectivity test is run against each probe, checking that packet loss stays within a configurable threshold. If the test passes, the CI job passes.
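
A minimal sketch of that flow is below; the helper functions (wait_for_ping, launch_probe, icmp_loss_percent) are hypothetical stand-ins rather than the actual commtest code or Oxide client API.

```rust
// Sketch of the commtest flow described above. The helpers are hypothetical
// stand-ins, not the actual end-to-end-tests API.
use std::net::IpAddr;
use std::time::Duration;
use uuid::Uuid;

const MAX_LOSS_PERCENT: f64 = 1.0; // loss threshold; configurable in practice

fn wait_for_ping(api_addr: &str, timeout: Duration) -> anyhow::Result<()> {
    // Poll the Oxide API /ping endpoint until it responds or `timeout` elapses.
    unimplemented!()
}

fn launch_probe(api_addr: &str, sled: Uuid) -> anyhow::Result<IpAddr> {
    // Create a probe on `sled` and return its ephemeral external address.
    unimplemented!()
}

fn icmp_loss_percent(target: IpAddr, count: u32) -> anyhow::Result<f64> {
    // Send `count` ICMP echo requests and return the percentage lost.
    unimplemented!()
}

fn run_commtest(api_addr: &str, sleds: &[Uuid]) -> anyhow::Result<()> {
    // Wait for the control plane to answer on /ping.
    wait_for_ping(api_addr, Duration::from_secs(600))?;

    // Launch one probe per sled, collecting each probe's external IP.
    let probe_ips: Vec<IpAddr> = sleds
        .iter()
        .map(|sled| launch_probe(api_addr, *sled))
        .collect::<anyhow::Result<_>>()?;

    // Ping each probe; fail the test if loss exceeds the threshold.
    for ip in &probe_ips {
        let loss = icmp_loss_percent(*ip, 30)?;
        anyhow::ensure!(
            loss <= MAX_LOSS_PERCENT,
            "probe {}: {}% ICMP loss exceeds {}%",
            ip, loss, MAX_LOSS_PERCENT
        );
    }
    Ok(())
}
```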

In addition to the virtual rack, the test topology contains a pair of routers running the Arista network stack that the sidecars connect to. The rack and these routers exchange routes over BGP. A third router, the customer edge (ce), connects to both Arista routers and runs Linux+FRR. It advertises a default route to both Arista routers, which propagates via BGP to the Oxide rack routers; similarly, the IP pool address block the Oxide rack advertises propagates to the customer edge router via the Arista routers over BGP.

The customer edge VM also runs an iptables configuration that decouples the testing network from the host network on the lab machine the entire topology runs on. All that needs to be done on the lab machine to connect to the rack is to create a route to the IP pool block the rack is using, with the IP address of the customer edge machine's external interface as the nexthop. This is done in the a4x2-deploy job.

Depends on

@rcgoodfellow force-pushed the commprobe branch 2 times, most recently from 3726743 to 774e1ed on December 1, 2023 02:51
@rcgoodfellow force-pushed the commprobe branch 3 times, most recently from 563a152 to 23b96d9 on December 13, 2023 09:17
@rcgoodfellow added the networking label Dec 14, 2023
@rcgoodfellow self-assigned this Dec 22, 2023
@rcgoodfellow force-pushed the commprobe branch 2 times, most recently from 0fee0f5 to aecff37 on December 22, 2023 05:35
@jordanhendricks self-requested a review December 22, 2023 19:24
@rcgoodfellow force-pushed the commprobe branch 17 times, most recently from fe3fe2c to 0e047a6 on January 4, 2024 03:02
@rcgoodfellow (Contributor, Author) commented:

There are still a few TODOs in here to take care of, but I think this is generally ready for review.

@rcgoodfellow marked this pull request as ready for review January 6, 2024 07:06
@rcgoodfellow force-pushed the commprobe branch 2 times, most recently from 134e768 to 884bf18 on February 27, 2024 04:44
@ahl (Contributor) commented Feb 27, 2024

Probes could be useful as a tool to narrow down problems for operators. If an end user reports a connectivity problem with a particular instance, an operator could launch a probe on the same sled, drawing an external address from the same IP pool, to help determine whether the connectivity issue is at the instance level or the sled level, or launch several probes to see if there is a broader connectivity issue.

That being said, the primary use for probes right now is to ensure working end-to-end networking for PRs in CI.

In that case, my suggestion would be to keep it out of the documented API / CLI / etc. What do you think?

@rcgoodfellow (Contributor, Author) commented:

> In that case, my suggestion would be to keep it out of the documented API / CLI / etc. What do you think?

I've moved the probes API endpoints to an "experimental" API in 4f58a4a.

@sunshowers (Contributor) left a comment

(Just a comment on experimental API endpoints.)

Thank you for taking this on! I've been wanting something like this for a while, and I really like this solution (my own half-baked solution was to use sets of tags, but this is easier to read, I think).

Review threads (resolved): nexus/src/external_api/http_entrypoints.rs (two threads)
@FelixMcFelix (Contributor) left a comment

Thanks Ry; from what I've looked over, this is a very nice addition. I haven't yet had the chance to stand this up and test it locally, but I'm hoping to do so shortly. I've left some questions throughout.

On 'what' we can test with this, I guess this gives us really good mileage for any traffic carried by a VPC (Instance<->Probe and Probe<->Rack-external). If we wanted to ask more targeted questions about the underlay itself, would we create probes and zlogin?

Review threads (all resolved):
dev-tools/omdb/src/bin/omdb/db.rs
end-to-end-tests/Cargo.toml
end-to-end-tests/src/helpers/icmp.rs (two threads)
nexus/blueprint-execution/src/resource_allocation.rs
nexus/db-queries/src/db/datastore/probe.rs (two threads)
nexus/db-queries/src/db/queries/network_interface.rs
nexus/src/app/sagas/switch_port_settings_apply.rs
nexus/types/src/external_api/params.rs
@FelixMcFelix (Contributor) left a comment

Thanks for handling the nits/papercuts etc., looks good!

@rcgoodfellow force-pushed the commprobe branch 4 times, most recently from e7d3a2c to fbf34dd on March 9, 2024 04:24
@rcgoodfellow (Contributor, Author) commented:

Unfortunately, I need to disable the a4x2 CI jobs in this initial commit. The zones and services within VMs come up too unreliably and would be a major source of friction for day-to-day development. We'll continue to push to make things more reliable within VMs.

The primary issue appears to be SMF service startup failures within zones. Sometimes it's CRDB, sometimes it's more basic things like ndp. These service startup failures seem to correspond to periods of heavy I/O. I did try running zpool trim as a precursor to a4x2 CI runs, and that sped things up significantly; unfortunately, several runs with the benefit of trim show that reliability issues remain. I see these issues much less often on my local workstation, so it does appear to be related to system performance in some way.

The CI tests and machinery remain in place within this PR, but the buildomat enable property is set to false.

@rcgoodfellow merged commit 65cbb82 into main Mar 9, 2024
23 checks passed
@rcgoodfellow deleted the commprobe branch March 9, 2024 19:00