
[sled agent] fail requests for non-NTP zones if time is not synchronized #4778

Merged (4 commits), Jan 10, 2024

Conversation

@smklein (Collaborator) commented Jan 8, 2024

Fixes #4776

@smklein (Collaborator, Author) commented Jan 8, 2024

From #4776:

> Note that it'd be easy to implement this in a way where you can't even PUT the current generation config to sled agent when time is not synchronized -- I think it'd be better (and easy) to allow that instead.

... admittedly, this current implementation would return a 503 when called with the current generation, because I'm performing this check before performing any state-modifying changes (like creating datasets for zones that need them).

@davepacheco, if you feel strongly about this, I could move this check into the body of ensure_all_omicron_zones_persistent, and compare the request with the ledger's generation number before checking this timesync stuff.

@davepacheco (Collaborator) commented:

> From #4776:
>
> > Note that it'd be easy to implement this in a way where you can't even PUT the current generation config to sled agent when time is not synchronized -- I think it'd be better (and easy) to allow that instead.
>
> ... admittedly, this current implementation would return a 503 when called with the current generation, because I'm performing this check before performing any state-modifying changes (like creating datasets for zones that need them).
>
> @davepacheco, if you feel strongly about this, I could move this check into the body of ensure_all_omicron_zones_persistent, and compare the request with the ledger's generation number before checking this timesync stuff.

Yeah, is there any reason not to do that? This feels like a policy to be enforced at the service manager, and there's other policy in ensure_all_omicron_zones_persistent() that feels similar (e.g., not downgrading the generation, making sure if it's the same generation that the contents are the same). Admittedly right now the only caller outside the test suite that I see of ensure_all_omicron_zones_persistent is omicron_zones_ensure(), so it doesn't matter much, but if we had other callers of ensure_all_omicron_zones_persistent(), I'd expect we'd want to enforce this on them, too. Plus, this would make it easier to add some tests for this in this PR and we wouldn't have to expose zones_require_timesync().
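
For context, a rough sketch of the kind of generation policy being referred to; the types, field names, and error variants below are illustrative stand-ins, not the actual omicron definitions:

```rust
// Illustrative sketch only: stand-in types, not the real omicron code.
#[derive(PartialEq)]
struct ZonesConfig {
    generation: u64,
    zones: Vec<String>, // stand-in for the real zone descriptions
}

enum PolicyError {
    /// The request is for an older generation than the one already ledgered.
    Outdated { requested: u64, current: u64 },
    /// The request reuses the current generation but with different contents.
    Conflicts(u64),
}

fn check_generation(
    requested: &ZonesConfig,
    ledgered: &ZonesConfig,
) -> Result<(), PolicyError> {
    // Never downgrade to an older generation.
    if requested.generation < ledgered.generation {
        return Err(PolicyError::Outdated {
            requested: requested.generation,
            current: ledgered.generation,
        });
    }
    // Re-sending the current generation is allowed, but only if the contents
    // are identical to what is already ledgered.
    if requested.generation == ledgered.generation && requested != ledgered {
        return Err(PolicyError::Conflicts(requested.generation));
    }
    Ok(())
}
```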

@smklein (Collaborator, Author) commented Jan 9, 2024

My main hesitation was that "if we return a 503 error to the caller, does that mean we have not altered the sled state?"

  • In this PR: We check early, so we avoid changing anything.
  • If I move it into the service manager: datasets will get created, and it's slightly harder to guarantee that nothing else will have changed.

I don't think this is a deal-breaker, but it's a "state-mutation-even-on-error" quirk of this API -- admittedly, a quirk that we'll need to deal with later anyway, since it's always possible that "launching 10 zones, 9/10 work, and the 10th one fails".

@smklein (Collaborator, Author) commented Jan 9, 2024

Related to that: I do think we should separate dataset creation / destruction into a separate API; it seems possible we'd want to remove a deployed service without destroying the underlying dataset. We aren't really doing dataset cleanup on zone removal right now.

@davepacheco (Collaborator) commented:

> My main hesitation was that "if we return a 503 error to the caller, does that mean we have not altered the sled state?"
>
> • In this PR: We check early, so we avoid changing anything.
>
> • If I move it into the service manager: datasets will get created, and it's slightly harder to guarantee that nothing else will have changed.
>
> I don't think this is a deal-breaker, but it's a "state-mutation-even-on-error" quirk of this API -- admittedly, a quirk that we'll need to deal with later anyway, since it's always possible that "launching 10 zones, 9/10 work, and the 10th one fails".

This is a good point. Unfortunately I think we already have this problem because of the policy checks that are already in ensure_all_omicron_zones_persistent().

@smklein (Collaborator, Author) commented Jan 9, 2024

> This is a good point. Unfortunately I think we already have this problem because of the policy checks that are already in ensure_all_omicron_zones_persistent().

The more I think about it, the more I realize this is just something we'll need to deal with in an async world anyway. There isn't really a great way to atomically swap from one set of services to another, so I don't think this is a viable goal regardless.

I'll get an updated version of this PR up shortly.

@smklein (Collaborator, Author) commented Jan 10, 2024

Okay, this PR has been updated:

  • The check now happens within the ServiceManager itself
  • A test has been added

@@ -540,6 +571,16 @@ struct SledAgentInfo {
rack_network_config: Option<RackNetworkConfig>,
}

pub(crate) enum TimeSyncConfig {

@smklein (Collaborator, Author) commented on this diff:

This is slightly more complicated than a bool to assist with testing. I need an option that fails unconditionally to avoid mocking out all the NTP calls.
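
For reference, a minimal sketch of the shape being described here; the variant names are illustrative and may not match the PR exactly:

```rust
// Illustrative sketch; variant names are stand-ins.
pub(crate) enum TimeSyncConfig {
    // Check with the NTP zone and wait for synchronization before allowing
    // zones that need synchronized time.
    Normal,
    // Skip the synchronization check entirely.
    Skip,
    // Report time as never synchronized; lets tests exercise the failure
    // path without mocking out the NTP calls.
    Fail,
}
```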

// NOTE: This imposes a constraint, during initial setup, cold boot,
// etc, that NTP and the internal DNS system it depends on MUST be
// initialized prior to other zones.
if zone_requires_timesync(&zone.zone_type) && !time_is_synchronized

@davepacheco (Collaborator) commented:

The PR looks good but I'm not sure about the check here. (I don't know that it's wrong; I just ran out of time right now to convince myself it's right.) My two concerns are:

  1. This has a lot more callers than ensure_all_omicron_zones_persistent(). But I just looked and the only ones outside of ensure_all_omicron_zones_persistent() are all in load_services(), where it seems fairly clear that this will be okay because we've already waited for timesync at the right spot. So I think this is okay.
  2. If we do fail at this point, we may have already destroyed zones that aren't supposed to be running in the new generation, but we're also rejecting the new generation. I know this falls in the category of "side effects" that we talked about earlier and said we might ignore for now, but this seems a lot worse than the side effects we were talking about before, where we just created an extra dataset. In the other case, the system is still running what it's supposed to be running. In this case, it wouldn't be. Right?

@smklein (Collaborator, Author) replied:

> 1. This has a lot more callers than ensure_all_omicron_zones_persistent(). But I just looked and the only ones outside of ensure_all_omicron_zones_persistent() are all in load_services(), where it seems fairly clear that this will be okay because we've already waited for timesync at the right spot. So I think this is okay.

Yeah, if anything, if we do want this check to prevent zones from truly loading before timesync, putting it on the "common" path seemed better.

> 2. If we do fail at this point, we may have already destroyed zones that aren't supposed to be running in the new generation, but we're also rejecting the new generation. I know this falls in the category of "side effects" that we talked about earlier and said we might ignore for now, but this seems a lot worse than the side effects we were talking about before, where we just created an extra dataset. In the other case, the system is still running what it's supposed to be running. In this case, it wouldn't be. Right?

I somewhat agree; that's why I initially put the check at a much higher level, to bail out early. However, I rationalized this to myself with the following: the state of "we destroyed the zones that should not be here, and we are getting ready to make the new ones" is absolutely a valid intermediate state while performing this operation. If we fail, the endpoint can be called again (or load_services can be re-invoked) and we'll proceed as we should.

So in that sense: is there really value in leaving the old zones up? A new request arrived asking for them to be destroyed, and that would be a valid intermediate state. I'd argue that this situation is actually not that distinct from "You had zones {A, B, C}, you now ask for zones {D, E, F}, and after destroying {A, B, C}, we fail to create {D} for any reason". That's how the code is currently implemented, regardless of this PR, and I think this change is consistent with those guarantees.

Note that if a zone was in both the "old" and "new" config, it'll be left alone.

@davepacheco (Collaborator) replied:

Hmm. I hear what you're saying. I agree these are all problematic and we probably want to move toward a model like we do with RPWs where we validate that the request is logically sound (i.e., that we should be able to apply it), accept it, and then retry indefinitely to make it happen. And that's outside the scope of what we're doing here.

This new case feels a little different to me than the existing ones. For the existing cases where sled agent could fail to create a zone, they seem unlikely, unknowable ahead of time (so not easily avoidable), and either transient (e.g., out of memory) or unrecoverable (e.g., some step went haywire and needs human attention). This seems like a problem but not an urgent one. (Have we seen cases like that outside of development?) On the other hand, the condition we're adding here is likely (we're intending to use it as part of normal operation) so it seems pragmatically worse -- but it's also knowable up front so easy to avoid.

Oh, how about this alternative: keep the check inside this function, but do it first thing, before having made any changes? I think that preserves the upside (keeping it in common code) but without the risk?

@smklein (Collaborator, Author) replied:

> Hmm. I hear what you're saying. I agree these are all problematic and we probably want to move toward a model like we do with RPWs where we validate that the request is logically sound (i.e., that we should be able to apply it), accept it, and then retry indefinitely to make it happen. And that's outside the scope of what we're doing here.

I agree with this direction -- IMO this is basically the approach we'd take to "async-ify" the endpoint -- but I'm less certain we could ever fully validate that all the zones will successfully launch without actively trying to launch them. We can certainly try to catch obviously broken configs, but I think it's a difficult task (and not a necessary one!) to ensure that "once we accept a request to launch a certain set of services, we must guarantee that they'll be launch-able".

For example: Someone could ask for a zone using a dataset on a U.2 disk that gets unplugged! We'll always need to cope with "some set of zones may not be runnable".

> This new case feels a little different to me than the existing ones. For the existing cases where sled agent could fail to create a zone, they seem unlikely, unknowable ahead of time (so not easily avoidable), and either transient (e.g., out of memory) or unrecoverable (e.g., some step went haywire and needs human attention). This seems like a problem but not an urgent one. (Have we seen cases like that outside of development?) On the other hand, the condition we're adding here is likely (we're intending to use it as part of normal operation) so it seems pragmatically worse -- but it's also knowable up front so easy to avoid.

I disagree with the assertion that this is pragmatically worse! I think this is fundamentally a transient condition, like out-of-memory. Yes, it has slightly different behavior, in the sense that "we don't expect this transient failure to re-appear after the initial timesync has completed", but it's basically a transient case for someone making this request at an HTTP level.

Also, I'm not sure it being "knowable ahead-of-time" necessarily forces our hand into making the operation avoid all side-effects. We don't even try to validate most of the zone configuration (in the form of both addresses, free-form strings as "domains", etc), and I think that's okay because whether or not the information can be validated ahead-of-time is only for convenience and error message clarity, and it's not necessary for correctness.

> Oh, how about this alternative: keep the check inside this function, but do it first thing, before having made any changes? I think that preserves the upside (keeping it in common code) but without the risk?

If we did this, we'd still be causing side effects, due to the dataset creation I described in #4778 (comment).

I've moved it regardless, so now the logic is basically (see the sketch after this list):

  • Find zones to {add, remove}
  • Do some checks on zones to be added <-- Timesync check now here
  • Remove zones to be removed
  • Do more checks on zones to be added
  • Add zones to be added
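
Putting those steps together, a pseudocode-level sketch of the new ordering (helper names and zone names are illustrative stand-ins, not the actual ServiceManager code):

```rust
// Illustrative sketch of the reordered flow; helpers and zone names are
// stand-ins, not the actual ServiceManager code.
fn zone_requires_timesync(zone: &str) -> bool {
    // NTP zones, and the internal DNS they depend on, are exempt from the
    // timesync requirement.
    !matches!(zone, "ntp" | "internal_dns")
}

fn ensure_zones(
    existing: &[&str],
    requested: &[&str],
    time_is_synchronized: bool,
) -> Result<(), String> {
    // 1. Find zones to add and remove; zones present in both the old and
    //    new config are left alone.
    let to_remove: Vec<&str> = existing
        .iter()
        .copied()
        .filter(|z| !requested.contains(z))
        .collect();
    let to_add: Vec<&str> = requested
        .iter()
        .copied()
        .filter(|z| !existing.contains(z))
        .collect();

    // 2. Checks on zones to be added -- the timesync check now happens here,
    //    before any state has been modified.
    if !time_is_synchronized && to_add.iter().any(|z| zone_requires_timesync(*z)) {
        return Err("time is not yet synchronized".to_string());
    }

    // 3. Remove zones that are no longer requested.
    for zone in &to_remove {
        println!("removing zone {zone}"); // stand-in for actual zone teardown
    }

    // 4. Remaining checks on zones to be added (datasets, etc.), then add them.
    for zone in &to_add {
        println!("adding zone {zone}"); // stand-in for dataset checks + zone launch
    }
    Ok(())
}
```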

@davepacheco (Collaborator) left a review comment:

Sounds good!

@smklein merged commit 42c0c8a into main on Jan 10, 2024; 20 checks passed.
@smklein deleted the ntp branch on January 10, 2024 at 19:16.
Successfully merging this pull request may close the following issue: "Sled Agent should fail requests to configure non-NTP zones if time is not synchronized".