-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
blueprint-execution: plumb service firewall rules #5157
Conversation
nexus/blueprint-execution/src/lib.rs
Outdated
@@ -46,6 +49,31 @@ impl From<nexus_db_model::Sled> for Sled { | |||
} | |||
} | |||
|
|||
// A vastly-restricted `Nexus` object that allows us access to some of Nexus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example of constructing a not-Nexus-that-can-still-do-Nexus-things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this direction, just wanna make sure we're all on the same page before pulling the trigger. But thank you for pushing in this direction!
nexus/src/app/capabilities.rs
Outdated
// This is a bit of an historical artifact. We use the "instance | ||
// allocation" op context to create sled-agent clients. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a question for @davepacheco , but should we be doing something different here? Sorta feels like "which opctx we're using" is getting muddied a bit.
I suspect there are cases where the caller invoking sled_client_by_id
, which uses this method, has a different opctx
they could use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An example of this is actually inside nexus_capabilities::FirewallRules
. send_sled_agents_firewall_rules()
creates sled agent clients, which is where we're implicitly using this op ctx. That function is given some other opctx
by its caller, but I wasn't sure that whatever that opctx
was would be able to look up sleds by ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it seems like maybe the OpContext should be an argument to the function that gets the sled agent client? I haven't dug into the history here but I would guess that instance allocation was the first use case for wanting to use a sled agent client and we just kept using it. It doesn't necessarily seem appropriate for all uses of the sled agent client. (It's also possible that there's no better-fitting internal identity for some of the other use cases and that it's not worth creating separate ones. That's supposed to be pretty easy but it also hasn't bought us much thus far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An example of this is actually inside
nexus_capabilities::FirewallRules
.send_sled_agents_firewall_rules()
creates sled agent clients, which is where we're implicitly using this op ctx. That function is given some otheropctx
by its caller, but I wasn't sure that whatever thatopctx
was would be able to look up sleds by ID.
(Sorry I didn't see this before writing my previous comment.) I think it's reasonable for a function to impose requirements on the privileges of the provided OpContext -- I mean that happens in every function that does an authz check. If it's possible that two different contexts might be needed (e.g., a user context and one for Nexus itself to do its thing), I know it's a little awkward to provide two OpContexts, but I think that's reasonable too -- it accurately reflects what's going on. "I'm doing two different things acting as different agents so you have to tell me which one to use for which parts."
I also think it's reasonable to say "the cost of using two different internal identities here isn't worth the benefit" and then just always using the same one (and not taking it as an argument).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'd have to take two different OpContext
s here, if I understand the reason we're using opctx_alloc
in general. We have paths like this:
- The
PUT /v1/vpc-firewall-rules
external endpoint creates an opctx and passes it intonexus.vpc_update_firewall_rules
nexus.vpc_update_firewall_rules
does an authz check and passes the opctx intosend_sled_agents_firewall_rules
send_sled_agents_firewall_rules
passes the opctx intoresolve_firewall_rules_for_sled_agent
resolve_firewall_rules_for_sled_agent
uses the opctx to perform several auth checks
If send_sled_agents_firewall_rules
only took one opctx, I don't think there is a good one to use: the opctx
from the external endpoint wouldn't be able to look up sled IDs, and opctx_alloc
wouldn't be correct to use from within resolve_firewall_rules_for_sled_agent
, right?
I can take a stab at changing sled_client_by_id
to take an explicit opctx then updating all the call sites to either pass in opctx_alloc
(if they have access to it) or a new opctx argument they grow themselves (as in the case of send_sled_agents_firewall_rules
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can take a stab at changing
sled_client_by_id
to take an explicit opctx then updating all the call sites to either pass inopctx_alloc
(if they have access to it) or a new opctx argument they grow themselves (as in the case ofsend_sled_agents_firewall_rules
).
Done in 8f9ca8b. It's more verbose but I think a step in the right direction; it also makes it a little more obvious that we're kinda abusing opctx_alloc
.
@@ -6,12 +6,11 @@ | |||
|
|||
use crate::app::sagas; | |||
use crate::external_api::params; | |||
use nexus_capabilities::FirewallRules; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Comment location only partially related because I see stuff moving from nexus/src/app.rs
into capabilities
)
I know this is nitpicky, but for my own edification:
Suppose I'm adding "app-level logic" to Nexus. Should I add it to nexus/capabilities
? or to nexus/src/app
? What do you think should be the dividing line between the two?
I grok that the capabilities
crate allows "things without an Arc<Nexus>
to access the APIs, so is this crate basically treated like "another flavor of pub
"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suppose I'm adding "app-level logic" to Nexus. Should I add it to nexus/capabilities? or to nexus/src/app? What do you think should be the dividing line between the two?
Good question; I'm not sure! I could see arguments for at least:
- Always add to
nexus/capabilities
; this lets us keep it organized and makes it accessible to sagas (assuming we rework them to use this instead of taking anArc<Nexus>
) and RPWs - Just keep adding to
nexus/src/app
as we have been. If someone needs functionality in an RPW, they can move it tonexus/capabilities
.
I'm slightly partial to the first because it feels a little cleaner and encourages "only use the parts of Nexus
you really need" (since via Base
you only have access to the datastore, you'd have to add methods to get at other things). But the second seems perfectly reasonable and maybe an easier transition.
I grok that the capabilities crate allows "things without an Arc to access the APIs, so is this crate basically treated like "another flavor of pub"?
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this reflects some of my unease with the approach. If we think of this package as defined by "stuff that's shared between otherwise unrelated subsystems in Nexus", it feels weirdly arbitrary what functionality lives here vs. outside this package. But maybe the way we think of it is instead: essentially all Nexus methods get partitioned into traits defined here so that everybody's always looking at some narrow slice of the app-level logic. I still don't love it for the reasons I mention elsewhere but that makes a little more sense to me.
If we did this, in the limit, would it still solve the problem? Or would we wind up with stuff inside this package that depends on other stuff inside this package that it shouldn't know about? I guess we could always create traits for only the limited functionality they need and this should be fine. To that point, could the contents of nexus-capabilities live in the main omicron-nexus
package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we did this, in the limit, would it still solve the problem? Or would we wind up with stuff inside this package that depends on other stuff inside this package that it shouldn't know about? I guess we could always create traits for only the limited functionality they need and this should be fine.
The goal is that any functionality that moves here only has access to data it needs, and that those dependencies are obvious (such as FirewallRules
requiring SledAgent
, since it has to construct sled-agent clients). We could certainly end up with weird cross-dependencies, but I think trying to split it up like this would make that obvious, and possibly painful enough that we'd take steps to avoid it.
To that point, could the contents of nexus-capabilities live in the main omicron-nexus package?
I don't think so; that would force blueprint-execution
(and any other smaller crates that want to use Nexus-lite functionality) to depend on omicron-nexus
, creating a cycle.
nexus/blueprint-execution/src/lib.rs
Outdated
datastore: &'a DataStore, | ||
} | ||
|
||
impl Base for NexusContext<'_> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern is interesting - at first read, I kinda expected that this would be the Nexus object.
This really isn't the Nexus object, nor is it truly a subset, right? It's:
- Taking a chunk of code that used to be part of
struct Nexus
- Moving it out, and copying all fields from
struct Nexus
it used - ... But now it's a totally new object.
I bring this up because in this use-case, you happen to use a single opctx + datastore, but there's a lot more stuff that could be used here by other capabilities, if we keep splitting this stuff up.
For example: app/silo.rs
uses self.external_resolver
, so presumably that field would need to be copied (or fully moved) into the NexusContext
into a "capability to access silos".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really isn't the Nexus object, nor is it truly a subset, right? It's:
- Taking a chunk of code that used to be part of
struct Nexus
- Moving it out, and copying all fields from
struct Nexus
it used- ... But now it's a totally new object.
This is all correct. And in particular, in this case the logger we present is different from the one stored in Nexus
, which means any logs emitted by plumb_service_firewall_rules
(or other nexus-capabilities
methods it calls in turn) will use our RPW logger instead of the Nexus logger.
I bring this up because in this use-case, you happen to use a single opctx + datastore, but there's a lot more stuff that could be used here by other capabilities, if we keep splitting this stuff up.
For example:
app/silo.rs
usesself.external_resolver
, so presumably that field would need to be copied (or fully moved) into theNexusContext
into a "capability to access silos".
All correct, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the extension trait style, and think it makes sense for what we are trying to do here: namely split up the nexus object functionality into usable subsets.
My primary worry is actually related to one of Sean's comments. It would be really bad to use the wrong opctx, and so maybe we should require it to always be passed in rather than making it a mandatory trait method to implement, backed by a copy of some op context.
@@ -65,29 +67,10 @@ pub(crate) async fn deploy_zones( | |||
} | |||
} | |||
|
|||
// This is a modified copy of the functionality from `nexus/src/app/sled.rs`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay, grossness expunged!
Thanks for taking a swing at this! I really like the high order bit of exposing narrower bits of app-level functionality to various subsystems rather than all of That said, I find this approach kind of hard to grok. (It helped when I realized we're essentially using the extension trait pattern. Maybe if we go this route we could make that more explicit somehow?) The naming is part of it and we can obviously talk through that. (I think I do think this approach can address the problem of reduced visibility of functionality between different parts of Nexus. But increasingly, I've been running into the different (but related) problem of wanting bits of Nexus app-level functionality outside of Nexus altogether. I don't think this approach can be used to solve that problem, right? For example, RSS and Nexus could really share more code. They do share stuff like So I wonder about doing the same thing here. More concretely, suppose:
What I prefer about this pattern over the trait-based approach:
There's one other consideration I haven't fully thought through but: what about common functionality that wants to store some data? Take the @jgallagher raised the risk that these smaller packages wind up depending on each other. It's a real risk but I think we've solved that with the existing subpackages by pushing shared functionality into lower-level, more widely-shared crates (like I'm really thinking about this from the assumption that we want a pattern here that we can repeat. I'm curious what others think about these considerations and about these two options. I don't think there's an objective answer. I think both of these approaches could probably work. But I think I like the "independent packages" approach a fair bit better. Thoughts? |
fn sled_client(&self, id: Uuid, address: SocketAddrV6) -> Client { | ||
let log = self.log().new(o!("SledAgent" => id.to_string())); | ||
let dur = std::time::Duration::from_secs(60); | ||
let client = reqwest::ClientBuilder::new() | ||
.connect_timeout(dur) | ||
.timeout(dur) | ||
.build() | ||
.unwrap(); | ||
Client::new_with_client(&format!("http://{address}"), client, log) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we ever really want to expose this? If not, where would we put this functionality? Free functions in this module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we ever really want to expose this?
I think so; this is the function that blueprint-execution
calls, for example, because it already knows the sled-agent address and doesn't need to look it up again.
If not, where would we put this functionality? Free functions in this module?
Yep; private functions in this module would be perfectly reasonable.
opctx: &'a OpContext, | ||
sled_id: Uuid, | ||
) -> lookup::Sled<'a> { | ||
LookupPath::new(opctx, self.datastore()).sled_id(sled_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come we don't use opctx_sled_client()
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two reasons, a bad one and a good one:
- I was trying to move functions over from Nexus with minimal changes, and this is how we had it in Nexus.
- This function is called by some external endpoints with an
opctx
that is notopctx_alloc
.
nexus/src/app/capabilities.rs
Outdated
// This is a bit of an historical artifact. We use the "instance | ||
// allocation" op context to create sled-agent clients. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it seems like maybe the OpContext should be an argument to the function that gets the sled agent client? I haven't dug into the history here but I would guess that instance allocation was the first use case for wanting to use a sled agent client and we just kept using it. It doesn't necessarily seem appropriate for all uses of the sled agent client. (It's also possible that there's no better-fitting internal identity for some of the other use cases and that it's not worth creating separate ones. That's supposed to be pretty easy but it also hasn't bought us much thus far.
@@ -6,12 +6,11 @@ | |||
|
|||
use crate::app::sagas; | |||
use crate::external_api::params; | |||
use nexus_capabilities::FirewallRules; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this reflects some of my unease with the approach. If we think of this package as defined by "stuff that's shared between otherwise unrelated subsystems in Nexus", it feels weirdly arbitrary what functionality lives here vs. outside this package. But maybe the way we think of it is instead: essentially all Nexus methods get partitioned into traits defined here so that everybody's always looking at some narrow slice of the app-level logic. I still don't love it for the reasons I mention elsewhere but that makes a little more sense to me.
If we did this, in the limit, would it still solve the problem? Or would we wind up with stuff inside this package that depends on other stuff inside this package that it shouldn't know about? I guess we could always create traits for only the limited functionality they need and this should be fine. To that point, could the contents of nexus-capabilities live in the main omicron-nexus
package?
(To be crystal clear, I am not 100% sold on the approach I took here, and am taking the position of defending it in this comment without firmly believing it's actually the better option.)
I think we would avoid the most troublesome issues around traits if we don't expect implementers to override the default implementations provided, which I think we don't as they're the entire point of these traits. (Maybe test implementations could override them, but I think that's a lot less of a concern?) rust-analyzer will jump to the trait in both the
I think it can? In the
I think the second and third items here are orthogonal to traits vs concrete types. We could create a
Yes, agreed; if we have methods we want to move to traits that want their own internal storage, every implementer has to provide that storage. I think we could make this pretty painless, but it would always be awkward.
I mostly addressed this above, but if the main concern is how much unrelated stuff lands in |
Porting a comment over from chat for posterity: I tried changing the traits to free functions, and I think I like it marginally more? it's pretty similar but is more obvious, although maybe a bit more verbose too. diff from this branch I think "split it up into distinct crates" is a definite winner. Whether those crates expose extension traits, free fns, or concrete types, I don't have particularly strong feelings on, and honestly it's not all that hard to switch between them (as in it's very mechanical and easy to follow the compiler around; the amount of work will grow as the crates grow over time, of course). |
Unrelated to the discussion about the nexus split design, I tried this PR on madrid, and it doesn't work. The hurdle it's bumping into is that omicron/nexus/db-queries/src/db/datastore/vpc.rs Lines 674 to 682 in 55a0760
This joins the |
…nt in addition to the `service` table
Closing in favor of #5233 |
This is the reworked and caught-up-with-main #5157 (which I'll close momentarily). I believe I addressed all the comments on the PR except those related to its extension trait style and the `nexus-capabilities` crate. This introduces a more focused `nexus-networking` crate that exposes free functions instead of extension traits. I'll run this through madrid before merging, but I believe it should work (absent any mistakes I made merging changes across branches) now that #5202 has landed. Closes #4886
The PR title is a little misleading; only the last commit actually makes blueprint-execution do the right thing here. The bulk of this PR is an experiment / idea on how we can expose some of
Nexus
's methods to background tasks without having to pass down an actualArc<Nexus>
. See thenexus-capabilities
crate for details; the gist is:nexus_capabilities::Foo
trait that provides some subset of Nexus functionality. The majority of this should be moving methods fromNexus
into the trait.Nexus
properties that theFoo
trait needs. I added aBase
trait that provides a logger and the datastore; theSledAgent
trait has another example of a required method.Nexus
, allowing it to also pull in whatever::Foo
traits it needs. In this PR the background-execution task has aNexusContext
that pulls innexus_capabilities::{Base, SledAgent, FirewallRules}
, allowing it to reuse Nexus's methods for creating sled-agent clients and plumbing service firewall rules.I need to actually test that this does what it's supposed to do, but I wanted to go ahead and open it for review since the bulk of the work is really a refactoring that I'd like to get feedback on.