-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[sled agent] fail requests for non-NTP zones if time is not synchronized #4778
Conversation
From #4776 :
... admittedly, this current implementation would return a 503 when called with the current generation, because I'm performing this check before performing any state-modifying changes (like creating datasets for zones that need them). @davepacheco , if you feel strongly about this, I could move this check into the body of |
Yeah, is there any reason not to do that? This feels like a policy to be enforced at the service manager, and there's other policy in |
My main hesitation was that "if we return a 503 error to the caller, does that mean we have not altered the sled state?"
I don't think this is a deal-breaker, but it's a "state-mutation-even-on-error" quirk of this API -- admittedly, a quirk that we'll need to deal with later anyway, since it's always possible that "launching 10 zones, 9/10 work, and the 10th one fails". |
Related to that: I do think we should separate dataset creation / destruction into a separate API; it seems possible we'd want to remove a deployed service without destroying the underlying dataset. We aren't really doing dataset cleanup on zone removal right now. |
This is a good point. Unfortunately I think we already have this problem because of the policy checks that are already in |
The more I think about it, the more I realize this is just something we'll need to deal with in an async world anyway. There isn't really a great way to atomically swap from one set of services to another, so I don't think this is a viable goal regardless. I'll get an updated version of this PR up shortly. |
Okay, this PR has been updated:
|
@@ -540,6 +571,16 @@ struct SledAgentInfo { | |||
rack_network_config: Option<RackNetworkConfig>, | |||
} | |||
|
|||
pub(crate) enum TimeSyncConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is slightly more complicated than a bool to assist with testing. I need an option that fails unconditionally to avoid mocking out all the NTP calls.
sled-agent/src/services.rs
Outdated
// NOTE: This imposes a constraint, during initial setup, cold boot, | ||
// etc, that NTP and the internal DNS system it depends on MUST be | ||
// initialized prior to other zones. | ||
if zone_requires_timesync(&zone.zone_type) && !time_is_synchronized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR looks good but I'm not sure about the check here. (I don't know that it's wrong; I just ran out of time right now to convince myself it's right.) My two concerns are:
- This has a lot more callers than
ensure_all_omicron_zones_persistent()
. But I just looked and the only ones outside ofensure_all_omicron_zones_persistent()
are all inload_services()
, where it seems fairly clear that this will be okay because we've already waited for timesync at the right spot. So I think this is okay. - If we do fail at this point, we may have already destroyed zones that aren't supposed to be running in the new generation, but we're also rejecting the new generation. I know this falls in the category of "side effects" that we talked about earlier and said we might ignore for now, but this seems a lot worse than the side effects we were talking about before, where we just created an extra dataset. In the other case, the system is still running what it's supposed to be running. In this case, it wouldn't be. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This has a lot more callers than
ensure_all_omicron_zones_persistent()
. But I just looked and the only ones outside ofensure_all_omicron_zones_persistent()
are all inload_services()
, where it seems fairly clear that this will be okay because we've already waited for timesync at the right spot. So I think this is okay.
Yeah, if anything, if we do want this check to prevent zones from truly loading before timesync, putting it on the "common" path seemed better.
- If we do fail at this point, we may have already destroyed zones that aren't supposed to be running in the new generation, but we're also rejecting the new generation. I know this falls in the category of "side effects" that we talked about earlier and said we might ignore for now, but this seems a lot worse than the side effects we were talking about before, where we just created an extra dataset. In the other case, the system is still running what it's supposed to be running. In this case, it wouldn't be. Right?
I somewhat agree, that's why I initially put the check in a much higher-level location to bail out early. However, I rationalized this to myself with the following: the state of "we destroyed the zones that should not be here, and we are getting ready to make the new ones" is absolutely a valid intermediate state while performing this operation. If we fail, the endpoint can be called again (or load_services
can be re-invoked) and we'll proceed as we should.
So in that sense: is there really value in leaving the old zones up? A new request arrived asking for them to be destroyed, and that would be a valid intermediate state. I'd argue that this situation is actually not that distinct from "You had zones {A, B, C}, you now ask for zones {D, E, F}, and after destroying {A, B, C}, we fail to create {D} for any reason". That's how the code is currently implemented, regardless of this PR, and I think this change is consistent with those guarantees.
Note that if a zone was in both the "old" and "new" config, it'll be left alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I hear what you're saying. I agree these are all problematic and we probably want to move toward a model like we do with RPWs where we validate that the request is logically sound (i.e., that we should be able to apply it), accept it, and then retry indefinitely to make it happen. And that's outside the scope of what we're doing here.
This new case feels a little different to me than the existing ones. For the existing cases where sled agent could fail to create a zone, they seem unlikely, unknowable ahead of time (so not easily avoidable), and either transient (e.g., out of memory) or unrecoverable (e.g., some step went haywire and needs human attention). This seems like a problem but not an urgent one. (Have we seen cases like that outside of development?) On the other hand, the condition we're adding here is likely (we're intending to use it as part of normal operation) so it seems pragmatically worse -- but it's also knowable up front so easy to avoid.
Oh, how about this alternative: keep the check inside this function, but do it first thing, before having made any changes? I think that preserves the upside (keeping it in common code) but without the risk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I hear what you're saying. I agree these are all problematic and we probably want to move toward a model like we do with RPWs where we validate that the request is logically sound (i.e., that we should be able to apply it), accept it, and then retry indefinitely to make it happen. And that's outside the scope of what we're doing here.
I agree with this direction -- IMO this is basically the approach we'd take to "async-ify" the endpoint -- but I'm less certain we could ever fully validate that all the zones will successfully launch without actively trying to launch them. We can certainly try to catch obviously broken configs, but I think it's a difficult task (and not a necessary one!) to ensure that "once we accept a request to launch a certain set of services, we must guarantee that they'll be launch-able".
For example: Someone could ask for a zone using a dataset on a U.2 disk that gets unplugged! We'll always need to cope with "some set of zones may not be runnable".
This new case feels a little different to me than the existing ones. For the existing cases where sled agent could fail to create a zone, they seem unlikely, unknowable ahead of time (so not easily avoidable), and either transient (e.g., out of memory) or unrecoverable (e.g., some step went haywire and needs human attention). This seems like a problem but not an urgent one. (Have we seen cases like that outside of development?) On the other hand, the condition we're adding here is likely (we're intending to use it as part of normal operation) so it seems pragmatically worse -- but it's also knowable up front so easy to avoid.
I disagree with the assertion that this is pragmatically worse! I think this is fundamentally a transient condition, like out-of-memory. Yes, it has slightly different behavior, in the sense that "we don't expect this transient failure to re-appear after the initial timesync has completed", but it's basically a transient case for someone making this request at an HTTP level.
Also, I'm not sure it being "knowable ahead-of-time" necessarily forces our hand into making the operation avoid all side-effects. We don't even try to validate most of the zone configuration (in the form of both addresses, free-form strings as "domains", etc), and I think that's okay because whether or not the information can be validated ahead-of-time is only for convenience and error message clarity, and it's not necessary for correctness.
Oh, how about this alternative: keep the check inside this function, but do it first thing, before having made any changes? I think that preserves the upside (keeping it in common code) but without the risk?
If we did this, we'd still be causing side-effects, due to the dataset generation I described in #4778 (comment) .
I've moved it regardless, so now the logic is basically:
- Find zones to {add, remove}
- Do some checks on zones to be added <-- Timesync check now here
- Remove zones to be removed
- Do more checks on zones to be added
- Add zones to be added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
Fixes #4776