Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

blueprint-execution: plumb service firewall rules #5157

Closed
wants to merge 10 commits into from

Conversation

jgallagher
Copy link
Contributor

The PR title is a little misleading; only the last commit actually makes blueprint-execution do the right thing here. The bulk of this PR is an experiment / idea on how we can expose some of Nexus's methods to background tasks without having to pass down an actual Arc<Nexus>. See the nexus-capabilities crate for details; the gist is:

  1. We define a nexus_capabilities::Foo trait that provides some subset of Nexus functionality. The majority of this should be moving methods from Nexus into the trait.
  2. Provide getter methods that implementors must provide for any Nexus properties that the Foo trait needs. I added a Base trait that provides a logger and the datastore; the SledAgent trait has another example of a required method.
  3. Bikeshed ad nauseam about what the traits should be named, how they should be organized, etc.
  4. In background tasks (and potentially sagas, if we want?), construct some new type that is able to implement the required methods without having all of Nexus, allowing it to also pull in whatever ::Foo traits it needs. In this PR the background-execution task has a NexusContext that pulls in nexus_capabilities::{Base, SledAgent, FirewallRules}, allowing it to reuse Nexus's methods for creating sled-agent clients and plumbing service firewall rules.

I need to actually test that this does what it's supposed to do, but I wanted to go ahead and open it for review since the bulk of the work is really a refactoring that I'd like to get feedback on.

@@ -46,6 +49,31 @@ impl From<nexus_db_model::Sled> for Sled {
}
}

// A vastly-restricted `Nexus` object that allows us access to some of Nexus
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example of constructing a not-Nexus-that-can-still-do-Nexus-things.

Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this direction, just wanna make sure we're all on the same page before pulling the trigger. But thank you for pushing in this direction!

Comment on lines 23 to 24
// This is a bit of an historical artifact. We use the "instance
// allocation" op context to create sled-agent clients.
Copy link
Collaborator

@smklein smklein Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a question for @davepacheco , but should we be doing something different here? Sorta feels like "which opctx we're using" is getting muddied a bit.

I suspect there are cases where the caller invoking sled_client_by_id, which uses this method, has a different opctx they could use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example of this is actually inside nexus_capabilities::FirewallRules. send_sled_agents_firewall_rules() creates sled agent clients, which is where we're implicitly using this op ctx. That function is given some other opctx by its caller, but I wasn't sure that whatever that opctx was would be able to look up sleds by ID.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it seems like maybe the OpContext should be an argument to the function that gets the sled agent client? I haven't dug into the history here but I would guess that instance allocation was the first use case for wanting to use a sled agent client and we just kept using it. It doesn't necessarily seem appropriate for all uses of the sled agent client. (It's also possible that there's no better-fitting internal identity for some of the other use cases and that it's not worth creating separate ones. That's supposed to be pretty easy but it also hasn't bought us much thus far.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example of this is actually inside nexus_capabilities::FirewallRules. send_sled_agents_firewall_rules() creates sled agent clients, which is where we're implicitly using this op ctx. That function is given some other opctx by its caller, but I wasn't sure that whatever that opctx was would be able to look up sleds by ID.

(Sorry I didn't see this before writing my previous comment.) I think it's reasonable for a function to impose requirements on the privileges of the provided OpContext -- I mean that happens in every function that does an authz check. If it's possible that two different contexts might be needed (e.g., a user context and one for Nexus itself to do its thing), I know it's a little awkward to provide two OpContexts, but I think that's reasonable too -- it accurately reflects what's going on. "I'm doing two different things acting as different agents so you have to tell me which one to use for which parts."

I also think it's reasonable to say "the cost of using two different internal identities here isn't worth the benefit" and then just always using the same one (and not taking it as an argument).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd have to take two different OpContexts here, if I understand the reason we're using opctx_alloc in general. We have paths like this:

  • The PUT /v1/vpc-firewall-rules external endpoint creates an opctx and passes it into nexus.vpc_update_firewall_rules
  • nexus.vpc_update_firewall_rules does an authz check and passes the opctx into send_sled_agents_firewall_rules
  • send_sled_agents_firewall_rules passes the opctx into resolve_firewall_rules_for_sled_agent
  • resolve_firewall_rules_for_sled_agent uses the opctx to perform several auth checks

If send_sled_agents_firewall_rules only took one opctx, I don't think there is a good one to use: the opctx from the external endpoint wouldn't be able to look up sled IDs, and opctx_alloc wouldn't be correct to use from within resolve_firewall_rules_for_sled_agent, right?

I can take a stab at changing sled_client_by_id to take an explicit opctx then updating all the call sites to either pass in opctx_alloc (if they have access to it) or a new opctx argument they grow themselves (as in the case of send_sled_agents_firewall_rules).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can take a stab at changing sled_client_by_id to take an explicit opctx then updating all the call sites to either pass in opctx_alloc (if they have access to it) or a new opctx argument they grow themselves (as in the case of send_sled_agents_firewall_rules).

Done in 8f9ca8b. It's more verbose but I think a step in the right direction; it also makes it a little more obvious that we're kinda abusing opctx_alloc.

@@ -6,12 +6,11 @@

use crate::app::sagas;
use crate::external_api::params;
use nexus_capabilities::FirewallRules;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Comment location only partially related because I see stuff moving from nexus/src/app.rs into capabilities)

I know this is nitpicky, but for my own edification:

Suppose I'm adding "app-level logic" to Nexus. Should I add it to nexus/capabilities? or to nexus/src/app? What do you think should be the dividing line between the two?

I grok that the capabilities crate allows "things without an Arc<Nexus> to access the APIs, so is this crate basically treated like "another flavor of pub"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppose I'm adding "app-level logic" to Nexus. Should I add it to nexus/capabilities? or to nexus/src/app? What do you think should be the dividing line between the two?

Good question; I'm not sure! I could see arguments for at least:

  • Always add to nexus/capabilities; this lets us keep it organized and makes it accessible to sagas (assuming we rework them to use this instead of taking an Arc<Nexus>) and RPWs
  • Just keep adding to nexus/src/app as we have been. If someone needs functionality in an RPW, they can move it to nexus/capabilities.

I'm slightly partial to the first because it feels a little cleaner and encourages "only use the parts of Nexus you really need" (since via Base you only have access to the datastore, you'd have to add methods to get at other things). But the second seems perfectly reasonable and maybe an easier transition.

I grok that the capabilities crate allows "things without an Arc to access the APIs, so is this crate basically treated like "another flavor of pub"?

Yes

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this reflects some of my unease with the approach. If we think of this package as defined by "stuff that's shared between otherwise unrelated subsystems in Nexus", it feels weirdly arbitrary what functionality lives here vs. outside this package. But maybe the way we think of it is instead: essentially all Nexus methods get partitioned into traits defined here so that everybody's always looking at some narrow slice of the app-level logic. I still don't love it for the reasons I mention elsewhere but that makes a little more sense to me.

If we did this, in the limit, would it still solve the problem? Or would we wind up with stuff inside this package that depends on other stuff inside this package that it shouldn't know about? I guess we could always create traits for only the limited functionality they need and this should be fine. To that point, could the contents of nexus-capabilities live in the main omicron-nexus package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we did this, in the limit, would it still solve the problem? Or would we wind up with stuff inside this package that depends on other stuff inside this package that it shouldn't know about? I guess we could always create traits for only the limited functionality they need and this should be fine.

The goal is that any functionality that moves here only has access to data it needs, and that those dependencies are obvious (such as FirewallRules requiring SledAgent, since it has to construct sled-agent clients). We could certainly end up with weird cross-dependencies, but I think trying to split it up like this would make that obvious, and possibly painful enough that we'd take steps to avoid it.

To that point, could the contents of nexus-capabilities live in the main omicron-nexus package?

I don't think so; that would force blueprint-execution (and any other smaller crates that want to use Nexus-lite functionality) to depend on omicron-nexus, creating a cycle.

datastore: &'a DataStore,
}

impl Base for NexusContext<'_> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern is interesting - at first read, I kinda expected that this would be the Nexus object.

This really isn't the Nexus object, nor is it truly a subset, right? It's:

  1. Taking a chunk of code that used to be part of struct Nexus
  2. Moving it out, and copying all fields from struct Nexus it used
  3. ... But now it's a totally new object.

I bring this up because in this use-case, you happen to use a single opctx + datastore, but there's a lot more stuff that could be used here by other capabilities, if we keep splitting this stuff up.

For example: app/silo.rs uses self.external_resolver, so presumably that field would need to be copied (or fully moved) into the NexusContext into a "capability to access silos".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really isn't the Nexus object, nor is it truly a subset, right? It's:

  1. Taking a chunk of code that used to be part of struct Nexus
  2. Moving it out, and copying all fields from struct Nexus it used
  3. ... But now it's a totally new object.

This is all correct. And in particular, in this case the logger we present is different from the one stored in Nexus, which means any logs emitted by plumb_service_firewall_rules (or other nexus-capabilities methods it calls in turn) will use our RPW logger instead of the Nexus logger.

I bring this up because in this use-case, you happen to use a single opctx + datastore, but there's a lot more stuff that could be used here by other capabilities, if we keep splitting this stuff up.

For example: app/silo.rs uses self.external_resolver, so presumably that field would need to be copied (or fully moved) into the NexusContext into a "capability to access silos".

All correct, yes.

Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the extension trait style, and think it makes sense for what we are trying to do here: namely split up the nexus object functionality into usable subsets.

My primary worry is actually related to one of Sean's comments. It would be really bad to use the wrong opctx, and so maybe we should require it to always be passed in rather than making it a mandatory trait method to implement, backed by a copy of some op context.

@@ -65,29 +67,10 @@ pub(crate) async fn deploy_zones(
}
}

// This is a modified copy of the functionality from `nexus/src/app/sled.rs`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay, grossness expunged!

@davepacheco
Copy link
Collaborator

Thanks for taking a swing at this!

I really like the high order bit of exposing narrower bits of app-level functionality to various subsystems rather than all of Nexus. I find doing this for the functionality in the SledAgent trait surprisingly compelling for whatever reason, even though it's pretty simple.

That said, I find this approach kind of hard to grok. (It helped when I realized we're essentially using the extension trait pattern. Maybe if we go this route we could make that more explicit somehow?) The naming is part of it and we can obviously talk through that. (I think nexus_shared or nexus_guts would be a better name than nexus_capabilities.) But I also find there's a real cost in legibility when putting functionality behind traits, and I hate doing that when there's one "real" implementation. An example problem I hate is when rust-analyzer can't jump straight to the implementation of a thing. That might be less of an issue in practice here since the interesting functionality is all on the trait itself rather than the impls (although I think nothing enforces that). Right now it's also simpler because at-large code never has a dyn Base or a T: Base type parameter -- it always has a nexus or its own more specific thing and we're just using the trait to namespace stuff. Will that always be true? As we go down this path and start layering this functionality, might we start needing these? (I feel like this sounds like I hate traits, which isn't exactly it. It's just that I find these costs real, and they're one extra bit of complexity in large or new-to-me codebases.)

I do think this approach can address the problem of reduced visibility of functionality between different parts of Nexus. But increasingly, I've been running into the different (but related) problem of wanting bits of Nexus app-level functionality outside of Nexus altogether. I don't think this approach can be used to solve that problem, right? For example, RSS and Nexus could really share more code. They do share stuff like omicron-certificates and omicron-passwords and it'd be nice if RSS eventually used blueprints and the planner to lay out services and allocate IPs, etc. I've also wanted to share app-level logic in omdb. In the past we've cheated a bit and put some app-level stuff into the datastore because omdb could access that. But an alternative would be to build Nexus in terms of smaller, more independent packages. To some extent we've already done this with nexus-db-model, nexus-db-queries, nexus-types, etc. I've really loved this approach as we've used it for pieces of reconfigurator like nexus-inventory, nexus-deployment, nexus-blueprint-execution, etc. (Even though nexus/src/app/background is not a separate package, I think it's similar in some of these ways. Maybe it should be its own package?) The smaller packages are easier to understand, faster to iterate on (because they build a lot faster), and I think it's encouraged more unit-level testing. (I should add that I've still struggled a little bit with exactly where functionality should go. For example, in some work I'm doing I've wanted omdb to essentially construct a PlanningContext, which is decidedly app-level logic and depends on the datastore. Most of the reconfigurator packages deliberately don't know about the datastore so it can't go there. But this seems solvable.)

So I wonder about doing the same thing here. More concretely, suppose:

  • create a new package, call it "nexus-networking" ("app-level logic related to networking")
  • it could have a struct NexusNetworking that's probably just an Arc<DataStore> and maybe a logger
  • for the methods you moved into a trait, hang them off that struct instead
  • during Nexus startup, create a NexusNetworking and store it as self.networking
  • in all the places we call these methods today, the only change would be tos/self.$method/self.networking.$method.
  • for the background task, you give it a separate NexusNetworking. (It could as well be the same one via a reference or Arc but I don't think it's worth it, especially if we want them to use separate loggers.)

What I prefer about this pattern over the trait-based approach:

  • It feels more explicit / less indirect.
  • It keeps code close to related code. (With the trait approach, a lot of unrelated code is next to each other in the same package only because it's all "shared code". If you view this as an extension of the existing "monolithic Nexus" package, it's not worse, but I think there's an opportunity to improve this.)
  • It makes it easy for other components in the system (like RSS and omdb) to share bits of this logic. Is this possible with the trait-based approach? I think it would technically be possible for omdb to pull in nexus-capabilities, but that in doing so, it would wind up pulling in essentially all of the dependencies that Nexus itself has (because it's needed to implement some of those traits, many of which omdb presumably doesn't need).

There's one other consideration I haven't fully thought through but: what about common functionality that wants to store some data? Take the SledAgent trait. There's a TODO in there about wanting to cache sled agent clients. (This is not very important, but I think it makes a lot of sense. Sled Agents are never really going to change IP addresses. Why do a database query each time?) With a little struct encapsulating this functionality, this is easy. With the trait, it seems much more awkward: I think the impl would have to store the data and make it available via a method?

@jgallagher raised the risk that these smaller packages wind up depending on each other. It's a real risk but I think we've solved that with the existing subpackages by pushing shared functionality into lower-level, more widely-shared crates (like nexus-types). There would presumably be some churn as we figure this out, and a little ongoing as functionality becomes shared, but it doesn't seem too bad.

I'm really thinking about this from the assumption that we want a pattern here that we can repeat. I'm curious what others think about these considerations and about these two options. I don't think there's an objective answer. I think both of these approaches could probably work. But I think I like the "independent packages" approach a fair bit better. Thoughts?

Comment on lines +30 to +39
fn sled_client(&self, id: Uuid, address: SocketAddrV6) -> Client {
let log = self.log().new(o!("SledAgent" => id.to_string()));
let dur = std::time::Duration::from_secs(60);
let client = reqwest::ClientBuilder::new()
.connect_timeout(dur)
.timeout(dur)
.build()
.unwrap();
Client::new_with_client(&format!("http://{address}"), client, log)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we ever really want to expose this? If not, where would we put this functionality? Free functions in this module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we ever really want to expose this?

I think so; this is the function that blueprint-execution calls, for example, because it already knows the sled-agent address and doesn't need to look it up again.

If not, where would we put this functionality? Free functions in this module?

Yep; private functions in this module would be perfectly reasonable.

opctx: &'a OpContext,
sled_id: Uuid,
) -> lookup::Sled<'a> {
LookupPath::new(opctx, self.datastore()).sled_id(sled_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come we don't use opctx_sled_client() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two reasons, a bad one and a good one:

  • I was trying to move functions over from Nexus with minimal changes, and this is how we had it in Nexus.
  • This function is called by some external endpoints with an opctx that is not opctx_alloc.

Comment on lines 23 to 24
// This is a bit of an historical artifact. We use the "instance
// allocation" op context to create sled-agent clients.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it seems like maybe the OpContext should be an argument to the function that gets the sled agent client? I haven't dug into the history here but I would guess that instance allocation was the first use case for wanting to use a sled agent client and we just kept using it. It doesn't necessarily seem appropriate for all uses of the sled agent client. (It's also possible that there's no better-fitting internal identity for some of the other use cases and that it's not worth creating separate ones. That's supposed to be pretty easy but it also hasn't bought us much thus far.

@@ -6,12 +6,11 @@

use crate::app::sagas;
use crate::external_api::params;
use nexus_capabilities::FirewallRules;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this reflects some of my unease with the approach. If we think of this package as defined by "stuff that's shared between otherwise unrelated subsystems in Nexus", it feels weirdly arbitrary what functionality lives here vs. outside this package. But maybe the way we think of it is instead: essentially all Nexus methods get partitioned into traits defined here so that everybody's always looking at some narrow slice of the app-level logic. I still don't love it for the reasons I mention elsewhere but that makes a little more sense to me.

If we did this, in the limit, would it still solve the problem? Or would we wind up with stuff inside this package that depends on other stuff inside this package that it shouldn't know about? I guess we could always create traits for only the limited functionality they need and this should be fine. To that point, could the contents of nexus-capabilities live in the main omicron-nexus package?

@jgallagher
Copy link
Contributor Author

(To be crystal clear, I am not 100% sold on the approach I took here, and am taking the position of defending it in this comment without firmly believing it's actually the better option.)

That said, I find this approach kind of hard to grok. (It helped when I realized we're essentially using the extension trait pattern. Maybe if we go this route we could make that more explicit somehow?) The naming is part of it and we can obviously talk through that. (I think nexus_shared or nexus_guts would be a better name than nexus_capabilities.) But I also find there's a real cost in legibility when putting functionality behind traits, and I hate doing that when there's one "real" implementation. An example problem I hate is when rust-analyzer can't jump straight to the implementation of a thing. That might be less of an issue in practice here since the interesting functionality is all on the trait itself rather than the impls (although I think nothing enforces that). Right now it's also simpler because at-large code never has a dyn Base or a T: Base type parameter -- it always has a nexus or its own more specific thing and we're just using the trait to namespace stuff. Will that always be true? As we go down this path and start layering this functionality, might we start needing these? (I feel like this sounds like I hate traits, which isn't exactly it. It's just that I find these costs real, and they're one extra bit of complexity in large or new-to-me codebases.)

I think we would avoid the most troublesome issues around traits if we don't expect implementers to override the default implementations provided, which I think we don't as they're the entire point of these traits. (Maybe test implementations could override them, but I think that's a lot less of a concern?) rust-analyzer will jump to the trait in both the dyn Base and T: Base cases, and if you want to get to the one true implementation, that's exactly the right thing here. It only doesn't work for required methods like datastore(), which I agree is annoying, although FWIW at least in neovim from a trait I can "go to implementions" and get a list of all impls to jump to.

I do think this approach can address the problem of reduced visibility of functionality between different parts of Nexus. But increasingly, I've been running into the different (but related) problem of wanting bits of Nexus app-level functionality outside of Nexus altogether. I don't think this approach can be used to solve that problem, right?

I think it can? In the nexus_capabilities::SledAgent example, anyone who has a Logger and a DataStore can stick those in a type, implement Base and SledAgent, and be able to call all the methods with default impls provided by SledAgent.

So I wonder about doing the same thing here. More concretely, suppose:

  • create a new package, call it "nexus-networking" ("app-level logic related to networking")
  • it could have a struct NexusNetworking that's probably just an Arc<DataStore> and maybe a logger
  • for the methods you moved into a trait, hang them off that struct instead
  • during Nexus startup, create a NexusNetworking and store it as self.networking
  • in all the places we call these methods today, the only change would be tos/self.$method/self.networking.$method.
  • for the background task, you give it a separate NexusNetworking. (It could as well be the same one via a reference or Arc but I don't think it's worth it, especially if we want them to use separate loggers.)

What I prefer about this pattern over the trait-based approach:

  • It feels more explicit / less indirect.
  • It keeps code close to related code. (With the trait approach, a lot of unrelated code is next to each other in the same package only because it's all "shared code". If you view this as an extension of the existing "monolithic Nexus" package, it's not worse, but I think there's an opportunity to improve this.)
  • It makes it easy for other components in the system (like RSS and omdb) to share bits of this logic. Is this possible with the trait-based approach? I think it would technically be possible for omdb to pull in nexus-capabilities, but that in doing so, it would wind up pulling in essentially all of the dependencies that Nexus itself has (because it's needed to implement some of those traits, many of which omdb presumably doesn't need).

I think the second and third items here are orthogonal to traits vs concrete types. We could create a nexus-networking, use the trait based approach here, and get both of those items, right? I'd have the same concerns about smaller packages depending on each other, but again I think that's orthogonal to whether these packages expose extension traits or structs (or even just free functions).

There's one other consideration I haven't fully thought through but: what about common functionality that wants to store some data? Take the SledAgent trait. There's a TODO in there about wanting to cache sled agent clients. (This is not very important, but I think it makes a lot of sense. Sled Agents are never really going to change IP addresses. Why do a database query each time?) With a little struct encapsulating this functionality, this is easy. With the trait, it seems much more awkward: I think the impl would have to store the data and make it available via a method?

Yes, agreed; if we have methods we want to move to traits that want their own internal storage, every implementer has to provide that storage. I think we could make this pretty painless, but it would always be awkward.

I'm really thinking about this from the assumption that we want a pattern here that we can repeat. I'm curious what others think about these considerations and about these two options. I don't think there's an objective answer. I think both of these approaches could probably work. But I think I like the "independent packages" approach a fair bit better. Thoughts?

I mostly addressed this above, but if the main concern is how much unrelated stuff lands in nexus-capabilities, we could split it up into independent packages whether those packages are exposing traits or concrete types.

@jgallagher
Copy link
Contributor Author

Porting a comment over from chat for posterity:

I tried changing the traits to free functions, and I think I like it marginally more? it's pretty similar but is more obvious, although maybe a bit more verbose too. diff from this branch

I think "split it up into distinct crates" is a definite winner. Whether those crates expose extension traits, free fns, or concrete types, I don't have particularly strong feelings on, and honestly it's not all that hard to switch between them (as in it's very mechanical and easy to follow the compiler around; the amount of work will grow as the crates grow over time, of course).

@jgallagher
Copy link
Contributor Author

Unrelated to the discussion about the nexus split design, I tried this PR on madrid, and it doesn't work. The hurdle it's bumping into is that vpc_resolve_to_sleds has this query to map service NICs to sleds:

let service_query = service_network_interface::table
.inner_join(
service::table
.on(service::id.eq(service_network_interface::service_id)),
)
.inner_join(sled::table.on(sled::id.eq(service::sled_id)))
.filter(service_network_interface::vpc_id.eq(vpc_id))
.filter(service_network_interface::time_deleted.is_null())
.select(Sled::as_select());

This joins the service table, which I have not added to the reconfigurator executor because of #4947. To get this working we either need to push on #4947 or add rows in service for new Nexuses.

@jgallagher
Copy link
Contributor Author

Closing in favor of #5233

@jgallagher jgallagher closed this Mar 8, 2024
jgallagher added a commit that referenced this pull request Mar 12, 2024
This is the reworked and caught-up-with-main #5157 (which I'll close
momentarily). I believe I addressed all the comments on the PR except
those related to its extension trait style and the `nexus-capabilities`
crate. This introduces a more focused `nexus-networking` crate that
exposes free functions instead of extension traits.

I'll run this through madrid before merging, but I believe it should
work (absent any mistakes I made merging changes across branches) now
that #5202 has landed.

Closes #4886
@jgallagher jgallagher deleted the john/nexus-capabilities branch April 25, 2024 13:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants