-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a source IP allowlist for user-facing services #5686
Conversation
bnaecker
commented
May 2, 2024
- Add database table and updates with an allowlist of IPs. This is currently structured as a table with one row, and an array of INETs as the allowlist.
- Add Nexus API for viewing / updating allowlist. Also does basic sanity checks on the list, most importantly that the source IP the request came from is on the list.
- Add the allowlist to wicket example config file
- Add the allowlist into the rack initialization request, and insert it into the database during Nexus's internal server handling of that request.
- Read allowlist and convert to firewall rules when plumbing any service firewall rules to sled agents.
- Add method for waiting on this plumbing before starting Nexus's external server, to ensure the IP allowlist is set before anything can reach Nexus.
This still needs a bit of work mostly around testing, but I'd like to get some early eyes from folks familiar with the rack init process. |
Alright, I'm working on the wicket updates as well, following #5685 as a guide. Apart from having to use I'm sure I'm not testing things quite correctly though. I followed the very helpful README in |
I'm still going to try to sneak in a small RPW that syncs this specific set of firewall rules only, along with some more tests and work to check that we return a 503 when we fail to plumb the rules down when someone calls the actual Nexus udpate API. |
The screenshot looks great. |
Not on a computer for the rest of the day, but is this something that can wait until the oxpop work lands? There's probably going to be some collisions |
@sunshowers Yeah, I expect some collisions. The changes I've made are quite small, so I should be able to resolve them fairly easily. But 🤞! |
Ok, I'm pretty happy with this at this point. There's still a failing authz test here that I haven't been able to track down. The other failures look like expectorate output comparisons for the most part, which I'll fix up now. I'd also like to run things locally a few times to see how the rules get plumbed down with the new logic and background task. I'll do that now, and hopefully we'll be ready for Dogfood tomorrow. |
I think most of the test failures are pretty harmless changes to expectorated file contents and the like. There's one exception though, which is This test loads some previously-serialized plan that the sled-agent has saved, with the results of RSS and the resulting plan for deploying resources around the rack. A similar file exists on existing customer racks. If the sled-agent reboots, it looks like it will attempt to load an existing plan via the |
Ok, I've verified the background task is running correctly on deployed Nexus's. I spun the whole control plane up on my local (non-Gimlet) developer machine, and we can see the initial plumbing of the rules first:
A bit later, the background task is kicked and runs how I'd expect:
Next, I put together a quick version of the Oxide CLI built from this PR, and used it to update the rules. This shows that partially working:
So a couple things here. First, we can see that the rules are correctly being received and inserted / retrieved in the DB. That's not a surprise, we have tests for that already in the PR. Second, we can see the 503 being returned in the case where propagating the rules to the sled-agents fails. So that's good. But why did it fail? I found the Nexus that handled the request, and we see this at the tail end:
We're failing the check to look up the VPC there. I think that makes sense, since we're using the op-context from the public API request. That actor ( In any case, I activated the RPW manually with
And found the Nexus that ran it. It indeed fetched the right rules and modified those passed to the sled-agent appropriately:
The
That's one of the Nexus zone's OPTE ports, where we can see we've left in place the right ports and protocols. There are a few more sanity checks. First, we correctly prevent someone from cutting off their own access:
I did also find one more bug -- you can't actually set a single IP address in the allowlist! This is because we're using
But if I hack around that by using a
|
Ok 8386fb4 fixed the opcontext problem, updating the allowlist no longer returns a 503. I haven't figured out how to get the allowlist to accept either a IP address or CIDR yet, without also requiring that we use that enum internally everywhere, which sucks. So there are three things left, I think
|
I think the only evolution of this we've done to date were either inside I think I argued that --- a/sled-agent/src/bootstrap/params.rs
+++ b/sled-agent/src/bootstrap/params.rs
@@ -42,9 +42,16 @@ struct UnvalidatedRackInitializeRequest {
external_certificates: Vec<Certificate>,
recovery_silo: RecoverySiloConfig,
rack_network_config: RackNetworkConfig,
+ #[serde(default = "default_allowed_source_ips")]
allowed_source_ips: AllowedSourceIps,
}
+// This field was added after several existing racks went through RSS. RSS plans
+// for such racks should be treated as though they allowed any source IP.
+fn default_allowed_source_ips() -> AllowedSourceIps {
+ AllowedSourceIps::Any
+}
+
/// Configuration for the "rack setup service".
///
/// The Rack Setup Service should be responsible for one-time setup actions This will require an EXPECTORATE overwrite to pass the test (we save the new output), but I think that's fine/correct. |
This fixes the test, and I think is a correct change, but please confirm: --- a/nexus/tests/integration_tests/endpoints.rs
+++ b/nexus/tests/integration_tests/endpoints.rs
@@ -2391,7 +2391,7 @@ pub static VERIFY_ENDPOINTS: Lazy<Vec<VerifyEndpoint>> = Lazy::new(|| {
// User-facing services IP allowlist
VerifyEndpoint {
url: &ALLOWED_SOURCE_IP_URL,
- visibility: Visibility::Protected,
+ visibility: Visibility::Public,
unprivileged_access: UnprivilegedAccess::None,
allowed_methods: vec![
AllowedMethod::Get, The visibility here is about what happens if authenticated but unauthorized user hits the endpoint. The docs on visibility say /// If it's [`Visibility::Public`] (like "/v1/organizations"), unauthorized
/// users can expect to get back a 401 or 403 when they attempt to access
/// it. If it's [`Visibility::Protected`] (like a specific Organization),
/// unauthorized users will get a 404.
I think the intent here is that we shouldn't leak information about the existence of resources (e.g., can I discover project names I shouldn't be able to see by hitting endpoints and getting 403 vs 404), but in this case the endpoint is a system level thing that everyone knows exists, so sending back a 403 seems correct. |
Is this a bug in the CLI and/or OpenAPI spec? You already have a test that omicron/common/src/api/internal/shared.rs Lines 519 to 521 in 8386fb4
This error does look like it's coming from the CLI before it even tries to send the request:
|
@jgallagher Thanks for the help! Deserialization: Ok, that all matches my understanding. I agree we should make a function for returning a default on deserialization, rather than deriving Default for the enum itself. Authz: That's really helpful thanks. It's been ages since I wrote one of these tests, and I was trying to make the endpoint itself hidden. That's not really a thing we support in this case, since it has no child resources of any kind. I was also worried that your fix was a "fix", just changing the tests to pass, but it does make sense with your explanation. I'll incorporate it! Single IP: This is an issue with the generated schema, not the internal type we use. The schema for |
It also looks like the |
Hit some codegen bugs with the new |
Ok, b714c32 should have fixed the codegen issues and pass the deserialization / authz checks. There is still one thing I need to resolve, which is the IP / CIDR handling. The fastest option is to only allow CIDRs for now, so that customers need to write I will either do that, and update the docs accordingly, or figure out how to get the generated schema to allow both. In either case, we also need to handle the fact that the regex appears wrong.... We can't specify IPv6s that aren't in the ULA range, which kinda defeats the purpose of an external allowlist. |
@@ -1356,18 +1409,7 @@ impl JsonSchema for Ipv6Net { | |||
})), | |||
instance_type: Some(schemars::schema::InstanceType::String.into()), | |||
string: Some(Box::new(schemars::schema::StringValidation { | |||
pattern: Some( | |||
// Conforming to unique local addressing scheme, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we confident that no other locations are depending on these being local addresses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. The place we care is when creating VPC prefixes and VPC Subnets, both of which use Ipv6Net::is_unique_local()
to verify that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just hit this in a Nexus integration test that was depending on a local address. (Me hitting it was because of a different bug, but I think Nexus is expecting these to be local in at least some cases.) The specific error I saw was
thread 'integration_tests::rack::test_sled_add' panicked at clients/sled-agent-client/src/lib.rs:451:58:
0:0:0:21::/64: doesn't match pattern "^([fF][dD])[0-9a-fA-F]{2}:(([0-9a-fA-F]{1,4}:){6}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,6}:)([0-9a-fA-F]{1,4})?\/([0-9]|[1-9][0-9]|1[0-1][0-9]|12[0-8])$"
when I (incorrectly) tried to create a sled with that subnet (which is a legal IPv6 addr but not local).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh man, we've just been so lucky
There is some remote possibility that a customer could lock themselves out of the rack if they have somehow misconfigured the allow list. It may not be something for this PR but may worth having a follow-up issue on this (e.g. an omdb command for Oxide technician rescue). |
@askfongjojo I won't say it's impossible, but there are two safeguards to locking themselves out completely. First you can't set the allowlist in such a way that would prevent access from your own IP address. Second, as @jgallagher, Nexus will still be available to Oxide folks through the proxy over the technician port. Wicketd makes this available to us, and since it uses the techport and the underlay network, it is not subject to the OPTE firewall rules we use to implement the allowlist on the external network. |
The universe is trying to tell me something |
Ok, forget it. The IP or CIDR block thing is going to be very annoying with schemars. I'm going to simply require that folks specify only a network -- if you want a single address, that's a |
@ahl @jgallagher would you mind taking one more quick pass through this? Adam, maybe double-check the CLI generation comes out the way you'd like it? |
There will be conflicts with #5685. I need to squash this all anyway, so @sunshowers you're welcome to merge first (if it's ready), and I can address those. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hit one minor hiccup; all looks good.
There's a pile of conflicts here, so I'm going to start squashing and rebasing. Apologies, but we're going to lose the context for some of the PR comments. I've addressed everything so far, so that's not a huge loss. |
9b7be39
to
9c65544
Compare
Rebased on @ahl Could you just make one last version of the new CLI on this, to make sure I didn't mess up the OpenAPI definitions? @sunshowers and / or @rcgoodfellow, I think I addressed the conflicts with your PR correctly, but since types moved around a bit, would you mind double-checking that I've not adversely impacted your work? And @jgallagher, I'd still love one more look on the details that you've already reviewed, if you get a chance in the next day or two. I'm going to take this for one more spin locally, and hopefully we can get this wrapped up soon. Thanks to everyone for their help and input! |
9c65544
to
abfeb98
Compare
Just spun up everything on my PC, and it all still looks good. I can successfully view / update the rules via the modified CLI; the DB repr looks right; Nexus modifies the VPC firewall rules on update and during the background task. Modulo any comments from the reviewers I've tagged, I'm happy with this one finally. |
- Add database table and updates with an allowlist of IPs. This is currently structured as a table with one row, and an array of INETs as the allowlist. Includes a CHECK constraint that the list is not empty -- NULL is used to indicate the lack of an allowlist. - Add Nexus API for viewing / updating allowlist. Also does basic sanity checks on the list, most importantly that the source IP the request came from is _on_ the list. VPC firewall rules are updated after the database has been updated successfully. - Add the allowlist to wicket example config file, and plumb through wicket UI, client, server, and into bootstrap agent. - Add the allowlist into the rack initialization request, and insert it into the database during Nexus's internal server handling of that request. - Read allowlist and convert to firewall rules when plumbing any service firewall rules to sled agents. This works by modifying existing firewall rules for the internal service VPC. The host filters should always be empty here, so this is simple and well-defined. It also lets us keep the right protocol and port filters on the rules. - Add method for waiting on this plumbing _before_ starting Nexus's external server, to ensure the IP allowlist is set before anything can reach Nexus. - Add background task in Nexus for ensuring service VPC rules only. This runs pretty infrequently now (5 minutes), but the allowlist should only be updated very rarely. - Include allowlist default on deserialization in the sled-agent, so that it applies to existing customer installations that've already been RSS'd. - Note: This also relaxes the regular expression we've been using for IPv6 networks. It was previously checking only for ULAs, while we now need it to represent any valid network. Adds tests for the regex too.
abfeb98
to
a25140e
Compare
FYI merged the updated OpenAPI doc into the CLI and it all looks as we discussed oxidecomputer/oxide.rs#658 |
I looked through the new BGP plumbing and things look good. Thanks for the typo fixes ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave the wicket bits a quick look, the merge looks good.
# Allowlist of source IPs that can make requests to user-facing services. | ||
[allowed_source_ips] | ||
# Any external IPs to make requests. This is the default. | ||
allow = "any" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an algebraic bounded lattice similar to the BGP prefix allowlist below (a list of some values, plus "any" as an option). Would it be possible to align the way this is defined with the BGP prefixes below?
https://en.m.wikipedia.org/wiki/Lattice_(order)
The wicket bits aren't blocking for r8 since it's user facing config and not part of the stable API. But it would be nice for the bootstore configs to be aligned. (But again not a huge deal, we can migrate the JSON configs fine. I can also put up a follow-up on Monday.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd prefer to do that in a follow-up, since this has already been a long road. I'd like some time to read through the use of the BGP allow-list type, too. It looks like the semantics might be slightly different, so I'd like to look it over in more detail.
In the spirit of urgency, I'm going to merge this as-is. It's had a lot of eyes and several rounds of testing. We'd like to get a spin on Dogfood, and I'll be around to make another PR if we hit any issues or @jgallagher has any lingering comments from another review. |