
Support for propolis-based softnpu device, fix multi-switch uplink updates. #4390

Merged: 6 commits merged into main on Nov 4, 2023

Conversation

@rcgoodfellow (Contributor) commented Oct 29, 2023

@rcgoodfellow rcgoodfellow marked this pull request as ready for review November 2, 2023 07:12
@rcgoodfellow rcgoodfellow changed the title Plumbing for running control plane on propolis-based softnpu device Support for propolis-based softnpu device, fix multi-switch uplink updates. Nov 2, 2023
    } else {
        if let Some(zname) = zone {
            if let Err(out) =
                tokio::process::Command::new(PFEXEC)
@rcgoodfellow (Contributor, Author) commented:
I've put this here because services do not come up reliably when Omicron is running in a VM. I'm not sure if this is something we want in the mainline code to deal with early zone service failure or not. AFAICT this is only being called when waiting on zones to come to milestone/single-user. @smklein @andrewjstone @davepacheco.
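The workaround described above amounts to programmatically clearing SMF services stuck in maintenance. As a minimal sketch of the idea (a hypothetical helper, not the PR's actual code), one piece is identifying the affected services from `svcs -H -o state,fmri` output:

```rust
/// Parse `svcs -H -o state,fmri` output and return the FMRIs of services
/// currently in the `maintenance` state. (Hypothetical helper; the PR's
/// actual code shells out via pfexec, and zlogin when targeting a zone.)
fn maintenance_fmris(svcs_output: &str) -> Vec<String> {
    svcs_output
        .lines()
        .filter_map(|line| {
            let mut cols = line.split_whitespace();
            match (cols.next(), cols.next()) {
                (Some("maintenance"), Some(fmri)) => Some(fmri.to_string()),
                _ => None,
            }
        })
        .collect()
}

fn main() {
    let sample = "\
online          svc:/system/filesystem/local:default
maintenance     svc:/oxide/sled-agent:default
online          svc:/network/physical:default";
    // Each FMRI found here would then be cleared with `svcadm clear <fmri>`.
    for fmri in maintenance_fmris(sample) {
        println!("{fmri}");
    }
}
```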

Collaborator commented:
My gut is this is definitely not something we want to be doing under normal operation. If things are going into maintenance, I'd expect we need to figure out why and fix the underlying issues. Without knowing what those are, it's hard to know we're not papering over a more significant issue or one that could come up again later.

@rcgoodfellow (Contributor, Author) commented Nov 3, 2023:
I've put this behind a compilation flag. For this code to be compiled in,

RUSTFLAGS="--cfg svcadm_autoclear"

is required.
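For reference, a `--cfg` flag passed through `RUSTFLAGS` gates code at compile time via `#[cfg(...)]` attributes. A minimal sketch of the pattern (hypothetical function, not the PR's actual code):

```rust
// Only compiled when building with RUSTFLAGS="--cfg svcadm_autoclear".
#[cfg(svcadm_autoclear)]
fn autoclear_enabled() -> bool {
    true
}

// Fallback for normal builds: the workaround is simply absent.
#[cfg(not(svcadm_autoclear))]
fn autoclear_enabled() -> bool {
    false
}

fn main() {
    // In a default build (no --cfg in RUSTFLAGS), this reports false.
    println!("autoclear compiled in: {}", autoclear_enabled());
}
```

Because the flag must be set explicitly at build time, it is hard to enable by accident, which matches the stated goal.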

Collaborator commented:
Do we use that flag anywhere or is that just for development environments where people happen to be seeing this behavior? Are you seeing this problem (services going into maintenance during startup) frequently?

I'm a little worried because to my knowledge we have not seen this before. Might this change cause us to see this problem elsewhere?

@rcgoodfellow (Contributor, Author) commented:
The flag is a one-off that one has to explicitly set for this code to be present. The goal is to make it hard to accidentally enable this.

I'm working in a multi-switch falcon topology that hosts the control plane. This topology is composed of propolis VMs interconnected by simnet links. In this environment I get many services going into maintenance mode every time I launch the control plane. The number and frequency of services going into maintenance on startup is high enough to make it very time-consuming to find and clear them all by hand. With this little workaround, things mostly come up automatically as the issues all appear to be transient in nature.

> I'm a little worried because to my knowledge we have not seen this before. Might this change cause us to see this problem elsewhere?

We started seeing this when I started running the control plane in topologies composed of VMs. Why this happens in VMs, I'm not sure - but I don't think anything here would cause this issue to manifest elsewhere. I think it's a bad interaction at the intersection of the control plane, SMF and running on virtual hardware.

Collaborator commented:
Thanks for that context! That makes sense. I think it's almost always a bug for any SMF service to go into maintenance on our systems. If you haven't already, would you mind filing an issue for that (even if it's not high priority)?

@rcgoodfellow (Contributor, Author) commented:
I filed #4399 on this.

Collaborator commented:
Thanks!

Comment on lines +27 to +30
// TODO: unfortunately this test can no longer be run in the integration test
// suite because it depends on communicating with MGS which is not part
// of the infrastructure available in the integration test context.
#[ignore]
Contributor commented:
Are we unable to use the faux MGS here (the same one we use to simulate MGS for the switch zone with SoftNPU)?

@rcgoodfellow (Contributor, Author) commented:
I'm sure that can be made to work, but it's not a part of the current integration test infrastructure and not something I have the bandwidth to take on at the moment.

Contributor commented:
Totally understandable. It may be something that fits into the scope of work for the next set of RPW PRs; I was just curious whether there were any blockers I wasn't aware of.

Collaborator commented:
For what it's worth, as of (I think it was) #4291, the test suite starts a copy of MGS backed by simulated SPs. (I haven't looked at what this test requires so I'm not sure that will work but it may be an option.)

@rcgoodfellow (Contributor, Author) commented:
Cool. I'll give this a shot when I merge main back into this branch.

@rcgoodfellow rcgoodfellow merged commit ea4da47 into main Nov 4, 2023
21 checks passed
@rcgoodfellow rcgoodfellow deleted the softnpu-propolis branch November 4, 2023 16:38
Successfully merging this pull request may close these issues.

A port settings update can result in the ASIC and switch-zone updates going to different sidecars.
3 participants