
[sled agent] fail requests for non-NTP zones if time is not synchronized #4778

Merged (4 commits), Jan 10, 2024

Conversation

@smklein (Collaborator) commented Jan 8, 2024

Fixes #4776

@smklein (Collaborator, Author) commented Jan 8, 2024

From #4776:

> Note that it'd be easy to implement this in a way where you can't even PUT the current generation config to sled agent when time is not synchronized -- I think it'd be better (and easy) to allow that instead.

... admittedly, this current implementation would return a 503 when called with the current generation, because I'm performing this check before performing any state-modifying changes (like creating datasets for zones that need them).

@davepacheco, if you feel strongly about this, I could move this check into the body of ensure_all_omicron_zones_persistent, and compare the request with the ledger's generation number before checking this timesync stuff.

@davepacheco (Collaborator) commented:

> From #4776:
>
> > Note that it'd be easy to implement this in a way where you can't even PUT the current generation config to sled agent when time is not synchronized -- I think it'd be better (and easy) to allow that instead.
>
> ... admittedly, this current implementation would return a 503 when called with the current generation, because I'm performing this check before performing any state-modifying changes (like creating datasets for zones that need them).
>
> @davepacheco, if you feel strongly about this, I could move this check into the body of ensure_all_omicron_zones_persistent, and compare the request with the ledger's generation number before checking this timesync stuff.

Yeah, is there any reason not to do that? This feels like a policy to be enforced at the service manager, and there's other policy in ensure_all_omicron_zones_persistent() that feels similar (e.g., not downgrading the generation, making sure if it's the same generation that the contents are the same). Admittedly right now the only caller outside the test suite that I see of ensure_all_omicron_zones_persistent is omicron_zones_ensure(), so it doesn't matter much, but if we had other callers of ensure_all_omicron_zones_persistent(), I'd expect we'd want to enforce this on them, too. Plus, this would make it easier to add some tests for this in this PR and we wouldn't have to expose zones_require_timesync().
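
For context, a rough sketch of the kind of generation policy being referred to; the types, field names, and error variants below are illustrative stand-ins, not the actual omicron definitions:

```rust
// Illustrative sketch only: stand-in types, not the real omicron code.
#[derive(PartialEq)]
struct ZonesConfig {
    generation: u64,
    zones: Vec<String>, // stand-in for the real zone descriptions
}

enum PolicyError {
    /// The request is for an older generation than the one already ledgered.
    Outdated { requested: u64, current: u64 },
    /// The request reuses the current generation but with different contents.
    Conflicts(u64),
}

fn check_generation(
    requested: &ZonesConfig,
    ledgered: &ZonesConfig,
) -> Result<(), PolicyError> {
    // Never downgrade to an older generation.
    if requested.generation < ledgered.generation {
        return Err(PolicyError::Outdated {
            requested: requested.generation,
            current: ledgered.generation,
        });
    }
    // Re-sending the current generation is allowed, but only if the contents
    // are identical to what is already ledgered.
    if requested.generation == ledgered.generation && requested != ledgered {
        return Err(PolicyError::Conflicts(requested.generation));
    }
    Ok(())
}
```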

@smklein (Collaborator, Author) commented Jan 9, 2024

My main hesitation was that "if we return a 503 error to the caller, does that mean we have not altered the sled state?"

  • In this PR: We check early, so we avoid changing anything.
  • If I move it into the service manager: datasets will get created, and it's slightly harder to guarantee that nothing else will have changed.

I don't think this is a deal-breaker, but it's a "state-mutation-even-on-error" quirk of this API -- admittedly, a quirk that we'll need to deal with later anyway, since it's always possible that "launching 10 zones, 9/10 work, and the 10th one fails".

@smklein (Collaborator, Author) commented Jan 9, 2024

Related to that: I do think we should separate dataset creation / destruction into a separate API; it seems possible we'd want to remove a deployed service without destroying the underlying dataset. We aren't really doing dataset cleanup on zone removal right now.

@davepacheco (Collaborator) commented:

> My main hesitation was that "if we return a 503 error to the caller, does that mean we have not altered the sled state?"
>
> • In this PR: We check early, so we avoid changing anything.
>
> • If I move it into the service manager: datasets will get created, and it's slightly harder to guarantee that nothing else will have changed.
>
> I don't think this is a deal-breaker, but it's a "state-mutation-even-on-error" quirk of this API -- admittedly, a quirk that we'll need to deal with later anyway, since it's always possible that "launching 10 zones, 9/10 work, and the 10th one fails".

This is a good point. Unfortunately I think we already have this problem because of the policy checks that are already in ensure_all_omicron_zones_persistent().

@smklein (Collaborator, Author) commented Jan 9, 2024

> This is a good point. Unfortunately I think we already have this problem because of the policy checks that are already in ensure_all_omicron_zones_persistent().

The more I think about it, the more I realize this is just something we'll need to deal with in an async world anyway. There isn't really a great way to atomically swap from one set of services to another, so I don't think this is a viable goal regardless.

I'll get an updated version of this PR up shortly.

@smklein (Collaborator, Author) commented Jan 10, 2024

Okay, this PR has been updated:

  • The check now happens within the ServiceManager itself
  • A test has been added

@@ -540,6 +571,16 @@ struct SledAgentInfo {
rack_network_config: Option<RackNetworkConfig>,
}

pub(crate) enum TimeSyncConfig {

@smklein (Collaborator, Author) commented on this diff:

This is slightly more complicated than a bool to assist with testing. I need an option that fails unconditionally to avoid mocking out all the NTP calls.
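
For reference, a minimal sketch of the shape being described here; the variant names are illustrative and may not match the PR exactly:

```rust
// Illustrative sketch; variant names are stand-ins.
pub(crate) enum TimeSyncConfig {
    // Check with the NTP zone and wait for synchronization before allowing
    // zones that need synchronized time.
    Normal,
    // Skip the synchronization check entirely.
    Skip,
    // Report time as never synchronized; lets tests exercise the failure
    // path without mocking out the NTP calls.
    Fail,
}
```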

// NOTE: This imposes a constraint, during initial setup, cold boot,
// etc, that NTP and the internal DNS system it depends on MUST be
// initialized prior to other zones.
if zone_requires_timesync(&zone.zone_type) && !time_is_synchronized

@davepacheco (Collaborator) commented:

The PR looks good but I'm not sure about the check here. (I don't know that it's wrong; I just ran out of time right now to convince myself it's right.) My two concerns are:

  1. This has a lot more callers than ensure_all_omicron_zones_persistent(). But I just looked and the only ones outside of ensure_all_omicron_zones_persistent() are all in load_services(), where it seems fairly clear that this will be okay because we've already waited for timesync at the right spot. So I think this is okay.
  2. If we do fail at this point, we may have already destroyed zones that aren't supposed to be running in the new generation, but we're also rejecting the new generation. I know this falls in the category of "side effects" that we talked about earlier and said we might ignore for now, but this seems a lot worse than the side effects we were talking about before, where we just created an extra dataset. In the other case, the system is still running what it's supposed to be running. In this case, it wouldn't be. Right?

@smklein (Collaborator, Author) replied:

> 1. This has a lot more callers than ensure_all_omicron_zones_persistent(). But I just looked and the only ones outside of ensure_all_omicron_zones_persistent() are all in load_services(), where it seems fairly clear that this will be okay because we've already waited for timesync at the right spot. So I think this is okay.

Yeah, if anything, if we do want this check to prevent zones from truly loading before timesync, putting it on the "common" path seemed better.

> 2. If we do fail at this point, we may have already destroyed zones that aren't supposed to be running in the new generation, but we're also rejecting the new generation. I know this falls in the category of "side effects" that we talked about earlier and said we might ignore for now, but this seems a lot worse than the side effects we were talking about before, where we just created an extra dataset. In the other case, the system is still running what it's supposed to be running. In this case, it wouldn't be. Right?

I somewhat agree; that's why I initially put the check at a much higher level, to bail out early. However, I rationalized this to myself with the following: the state of "we destroyed the zones that should not be here, and we are getting ready to make the new ones" is absolutely a valid intermediate state while performing this operation. If we fail, the endpoint can be called again (or load_services can be re-invoked) and we'll proceed as we should.

So in that sense: is there really value in leaving the old zones up? A new request arrived asking for them to be destroyed, and that would be a valid intermediate state. I'd argue that this situation is actually not that distinct from "You had zones {A, B, C}, you now ask for zones {D, E, F}, and after destroying {A, B, C}, we fail to create {D} for any reason". That's how the code is currently implemented, regardless of this PR, and I think this change is consistent with those guarantees.

Note that if a zone was in both the "old" and "new" config, it'll be left alone.

@davepacheco (Collaborator) replied:

Hmm. I hear what you're saying. I agree these are all problematic and we probably want to move toward a model like we do with RPWs where we validate that the request is logically sound (i.e., that we should be able to apply it), accept it, and then retry indefinitely to make it happen. And that's outside the scope of what we're doing here.

This new case feels a little different to me than the existing ones. For the existing cases where sled agent could fail to create a zone, they seem unlikely, unknowable ahead of time (so not easily avoidable), and either transient (e.g., out of memory) or unrecoverable (e.g., some step went haywire and needs human attention). This seems like a problem but not an urgent one. (Have we seen cases like that outside of development?) On the other hand, the condition we're adding here is likely (we're intending to use it as part of normal operation) so it seems pragmatically worse -- but it's also knowable up front so easy to avoid.

Oh, how about this alternative: keep the check inside this function, but do it first thing, before having made any changes? I think that preserves the upside (keeping it in common code) but without the risk?

@smklein (Collaborator, Author) replied:

> Hmm. I hear what you're saying. I agree these are all problematic and we probably want to move toward a model like we do with RPWs where we validate that the request is logically sound (i.e., that we should be able to apply it), accept it, and then retry indefinitely to make it happen. And that's outside the scope of what we're doing here.

I agree with this direction -- IMO this is basically the approach we'd take to "async-ify" the endpoint -- but I'm less certain we could ever fully validate that all the zones will successfully launch without actively trying to launch them. We can certainly try to catch obviously broken configs, but I think it's a difficult task (and not a necessary one!) to ensure that "once we accept a request to launch a certain set of services, we must guarantee that they'll be launch-able".

For example: Someone could ask for a zone using a dataset on a U.2 disk that gets unplugged! We'll always need to cope with "some set of zones may not be runnable".

> This new case feels a little different to me than the existing ones. For the existing cases where sled agent could fail to create a zone, they seem unlikely, unknowable ahead of time (so not easily avoidable), and either transient (e.g., out of memory) or unrecoverable (e.g., some step went haywire and needs human attention). This seems like a problem but not an urgent one. (Have we seen cases like that outside of development?) On the other hand, the condition we're adding here is likely (we're intending to use it as part of normal operation) so it seems pragmatically worse -- but it's also knowable up front so easy to avoid.

I disagree with the assertion that this is pragmatically worse! I think this is fundamentally a transient condition, like out-of-memory. Yes, it has slightly different behavior, in the sense that "we don't expect this transient failure to re-appear after the initial timesync has completed", but it's basically a transient case for someone making this request at an HTTP level.

Also, I'm not sure it being "knowable ahead-of-time" necessarily forces our hand into making the operation avoid all side-effects. We don't even try to validate most of the zone configuration (in the form of both addresses, free-form strings as "domains", etc), and I think that's okay because whether or not the information can be validated ahead-of-time is only for convenience and error message clarity, and it's not necessary for correctness.

> Oh, how about this alternative: keep the check inside this function, but do it first thing, before having made any changes? I think that preserves the upside (keeping it in common code) but without the risk?

If we did this, we'd still be causing side effects, due to the dataset creation I described in #4778 (comment).

I've moved it regardless, so now the logic is basically (see the sketch after this list):

  • Find zones to {add, remove}
  • Do some checks on zones to be added <-- Timesync check now here
  • Remove zones to be removed
  • Do more checks on zones to be added
  • Add zones to be added
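
Putting those steps together, a pseudocode-level sketch of the new ordering (helper names and zone names are illustrative stand-ins, not the actual ServiceManager code):

```rust
// Illustrative sketch of the reordered flow; helpers and zone names are
// stand-ins, not the actual ServiceManager code.
fn zone_requires_timesync(zone: &str) -> bool {
    // NTP zones, and the internal DNS they depend on, are exempt from the
    // timesync requirement.
    !matches!(zone, "ntp" | "internal_dns")
}

fn ensure_zones(
    existing: &[&str],
    requested: &[&str],
    time_is_synchronized: bool,
) -> Result<(), String> {
    // 1. Find zones to add and remove; zones present in both the old and
    //    new config are left alone.
    let to_remove: Vec<&str> = existing
        .iter()
        .copied()
        .filter(|z| !requested.contains(z))
        .collect();
    let to_add: Vec<&str> = requested
        .iter()
        .copied()
        .filter(|z| !existing.contains(z))
        .collect();

    // 2. Checks on zones to be added -- the timesync check now happens here,
    //    before any state has been modified.
    if !time_is_synchronized && to_add.iter().any(|z| zone_requires_timesync(*z)) {
        return Err("time is not yet synchronized".to_string());
    }

    // 3. Remove zones that are no longer requested.
    for zone in &to_remove {
        println!("removing zone {zone}"); // stand-in for actual zone teardown
    }

    // 4. Remaining checks on zones to be added (datasets, etc.), then add them.
    for zone in &to_add {
        println!("adding zone {zone}"); // stand-in for dataset checks + zone launch
    }
    Ok(())
}
```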

@davepacheco (Collaborator) left a review comment:

Sounds good!

@smklein merged commit 42c0c8a into main on Jan 10, 2024; 20 checks passed.
@smklein deleted the ntp branch on January 10, 2024 at 19:16.
Successfully merging this pull request may close the following issue: "Sled Agent should fail requests to configure non-NTP zones if time is not synchronized".