[nexus] add RPW for proactively pulling instance state from sled-agents #5611

Merged
merged 73 commits from eliza/instance-watch-rpw into main on May 10, 2024

Conversation

@hawkw hawkw (Member) commented Apr 24, 2024

Currently, instance states are communicated by the sled-agents of the
sleds on which an instance is running, which call into a Nexus HTTP API.
This is a "push"-shaped communication pattern, where the monitored
resources (the sled-agents) act as clients of the monitor (Nexus) to
publish their state to it. While push-shaped communications are
generally more efficient than pull-shaped communications (as they only
require communication when a monitored resource actively changes state),
they have a substantial disadvantage: if the sled-agent is gone, stuck,
or otherwise failed, Nexus will never become aware of this, and will
continue happily chugging along assuming the instances are in the last
state reported by their sled-agents.

In order to allow Nexus to determine when a sled-agent or VMM has
failed, this branch introduces a background task ("instance_watcher")
that periodically attempts to pull the states of a sled-agent's managed
instances. This way, if calls to sled agents fail, the Nexus process is
made aware that a sled-agent it expected to publish the state of its
monitored instances may no longer be doing so, and we can (eventually)
take corrective action.
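
For a sense of the shape of this pull-based pattern, here is a minimal sketch; the types and function names are hypothetical stand-ins, not the actual instance_watcher implementation:

use std::time::Duration;

// Hypothetical stand-ins for the real Nexus/sled-agent client types; this
// only sketches the pull-shaped check loop described above.
struct SledAgentClient;

#[derive(Debug)]
enum CheckOutcome {
    Updated(String), // sled-agent answered; record the reported state
    Unreachable,     // sled-agent could not be reached
}

impl SledAgentClient {
    async fn instance_get_state(&self, _instance_id: &str) -> Result<String, ()> {
        // In the real task this would be an HTTP call to the sled-agent.
        Ok("running".to_string())
    }
}

async fn check_instance(client: &SledAgentClient, instance_id: &str) -> CheckOutcome {
    match client.instance_get_state(instance_id).await {
        Ok(state) => CheckOutcome::Updated(state),
        // Unlike the push model, a failed call is itself a signal: Nexus now
        // knows this sled-agent may no longer be reporting instance state.
        Err(_) => CheckOutcome::Unreachable,
    }
}

#[tokio::main]
async fn main() {
    let client = SledAgentClient;
    let mut interval = tokio::time::interval(Duration::from_secs(30));
    loop {
        interval.tick().await;
        let outcome = check_instance(&client, "some-instance").await;
        println!("check outcome: {outcome:?}");
    }
}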

Presently, the instance_watcher task does not perform corrective
action, such as restarting failed instances or advancing instances to
the "failed" state in CRDB, as affirmatively detecting failures will
require other work. For example, if a sled-agent is unreachable when
checking on a particular instance, the instance itself might be fine,
and we shouldn't decide that it's gone unless we also can't reach the
propolis-server process directly. Furthermore, there's the split-brain
problem to consider: the Nexus process performing the check might be on
one side of a network partition...and it might not be on the same side
of that partition as the majority of the rack, so we may not want to
consider an unreachable sled-agent "failed" unless all the Nexus
replicas can't talk to it. And, there are probably even more
considerations I haven't thought of yet. I'm planning on starting an RFD
to propose a comprehensive design for instance health checking.

In the meantime, though, the instance_watcher task just emits a bunch
of Oximeter metrics describing the results of instance checks. These
metrics capture:

  • Cases where the instance check failed in a way that might indicate
    that something is wrong with the instance or its sled-agent or sled
  • Cases where the instance check was unsuccessful due to an internal
    error, such as a client-side request error or an error updating the
    instance's database records
  • Cases where the instance check was successful, and what state updates
    occurred as a result of the check:
    • Whether the instance state in the instances table was updated as a
      result of the state returned by sled-agent
    • Whether the VMM state in the vmms table was updated as a result of
      the state returned by sled-agent
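
Conceptually, the outcomes above fall into three buckets; the enum below is purely illustrative (the PR records these as cumulative Oximeter counters, not as a Rust enum):

// Illustrative classification of a single check's result.
enum CheckResult {
    /// The check completed; records whether the instance and/or VMM database
    /// rows were updated as a result of the state returned by sled-agent.
    Completed { instance_updated: bool, vmm_updated: bool },
    /// The check failed in a way that might indicate something is wrong with
    /// the instance, its sled-agent, or its sled.
    Failed { reason: String },
    /// The check could not be completed due to an internal error, such as a
    /// client-side request error or a database update error.
    Incomplete { reason: String },
}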

@smklein smklein (Collaborator) left a comment

Great start -- this is a first batch of comments, and I can come back and do a deeper review in a moment

Comment on lines 532 to 533
/// maximum number of retries to attempt before considering a sled-agent
/// dead.

Collaborator:

I'm being nitpicky, but I think this distinction matters - I don't think we're considering that the sled-agent is necessarily dead here, but rather, it sounds like we're making some decision that the instance should be marked failed? That's a slightly smaller logical jump.

Seems plausible to me that we could bump into this case with e.g. a network partition as well, in which case the sled may be alive.

Collaborator:

Also, to be clear: I like the idea of drawing inferences from client requests to the sleds being unresponsive, but I think we may want to consider this pattern more holistically. There are a lot of ways to "check if the sled is alive or not" (e.g., sending queries, checking-in with MGS, checking-in with the PSC), but I'm not sure this is sufficient to make the call about the whole sled.

hawkw (Member, Author):

Yeah, that makes sense. Right now, we're not doing anything with this information besides logging a warning, but I think that the eventual end goal would be to use this as a signal that we should look in other places to figure out what's going on. E.g. I was also thinking that if the sled-agent isn't responding, Nexus may also want to try talking to the VMMs directly to see if they're interested in talking to it --- it might just be the sled-agent that's wedged. And, before we make determinations about the whole sled, we should absolutely look elsewhere as well.

Comment on lines 65 to 67
let rsp = backoff::retry_notify(
backoff,
|| async {

Collaborator:

I kinda agree with @andrewjstone's comments here -- I'm not sure that we should be retrying in this background task.

I don't think that this task -- instance monitoring -- is ultimately responsible for deciding whether or not a sled is alive. It is, however, responsible for updating the state of an instance if the sled rebooted. In this case, if we see:

  • Sled policy: not expunged
  • Sled agent: not responding

... and we wait... later, we should see:

  • Sled policy: not expunged
  • Sled agent: responding! and I don't know about your instance.

I don't think this requires an inner retry loop, nor a "max retries" check.

As an alternative, if we expunge the sled, we'll see:

  • Sled policy: expunged
  • Sled agent: <we shouldn't talk to it>

We'll need to sort through #4872 to decide what to do here, but at least, we'll know that the sled being out-to-lunch is an explicitly marked operation, not something we need to infer
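
A sketch of how that decision table might read in code; SledPolicy, AgentResponse, and the resulting actions here are hypothetical stand-ins, purely to make the cases above concrete:

// Hypothetical types; not the actual omicron sled-policy or client types.
enum SledPolicy { InService, Expunged }
enum AgentResponse { Ok, KnowsNothingAboutInstance, NoResponse }
enum Action { UpdateState, MarkInstanceFailed, WaitAndRecheck, DontTalkToIt }

fn decide(policy: SledPolicy, response: AgentResponse) -> Action {
    match (policy, response) {
        // Expungement is an explicitly marked operation; no inference needed.
        (SledPolicy::Expunged, _) => Action::DontTalkToIt,
        // The agent answered and knows the instance: just record the state.
        (SledPolicy::InService, AgentResponse::Ok) => Action::UpdateState,
        // The agent answered but doesn't know the instance (e.g. the sled
        // rebooted): the instance is gone.
        (SledPolicy::InService, AgentResponse::KnowsNothingAboutInstance) => {
            Action::MarkInstanceFailed
        }
        // No response: no inner retry loop; just wait for the next activation.
        (SledPolicy::InService, AgentResponse::NoResponse) => Action::WaitAndRecheck,
    }
}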

@hawkw hawkw (Member, Author) commented Apr 25, 2024

Okay, regarding this and your comments from #5611 (comment) and #5611 (comment), here's what I'm currently thinking I'm going to do as part of this PR:

  • remove all the retry logic
  • add a new table of "sled instance check failures" to the database. For now, I'm imagining that this would be shaped sort of like (sled_id, instance_id, id_of_the_nexus_performing_the_health_check, time_updated, failure_count). We might also want to add a notion of failure kinds to this as well.
  • if we ask a sled-agent about an instance we believe it has, and the sled-agent responds with an HTTP error that proactively indicates that it does not know about that instance, mark that instance as Failed (as I believe this would indicate that the sled has been rebooted and the instance is gone [1])
  • for all other errors (the ones that currently get retried), such as TCP connections refused/timing out, etc, increment the failure count table for that (sled_id, instance_id, our_nexus_id) tuple
  • for now, all we do is increment that error count. Eventually, the error count might be exposed to an operator and/or used to drive automated decisions (e.g. if the failure count for a particular instance is > n, but the sled-agent has responded successfully for other instances, restart the instance... if the failure count for all instances on a sled is > n, power cycle the sled, etc. [2])

What do you think?

Footnotes

  1. Potentially, if we wanted to be maximally distrustful of the sled-agent's understanding of the state of the world, we could also try pinging the actual propolis-server process for that instance, to see if it still exists. If the sled-agent doesn't believe it exists, it shouldn't, but unfortunately, as you probably already know, distributed systems...

  2. This is more or less what I was using the max-retries limit for in the current implementation.
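
A rough sketch of the handling proposed above, with the strawman failure-count table modeled as an in-memory map; everything here (names, shapes, behavior) is hypothetical:

use std::collections::HashMap;
use uuid::Uuid;

// Hypothetical classification of a failed check.
enum CheckError {
    /// The sled-agent answered with an error affirmatively indicating it does
    /// not know this instance -- mark the instance as Failed.
    InstanceUnknown,
    /// Anything else: connection refused, timeout, etc.
    Unreachable,
}

/// Strawman stand-in for the proposed "sled instance check failures" table,
/// keyed by (sled_id, instance_id, nexus_id).
type FailureCounts = HashMap<(Uuid, Uuid, Uuid), u64>;

fn handle_check_error(
    counts: &mut FailureCounts,
    sled_id: Uuid,
    instance_id: Uuid,
    nexus_id: Uuid,
    error: CheckError,
) {
    match error {
        CheckError::InstanceUnknown => {
            // The real implementation would move the instance to Failed in
            // CRDB; here we just note it.
            println!("instance {instance_id} marked Failed");
        }
        CheckError::Unreachable => {
            // For now, just count it; later the count might be surfaced to an
            // operator or used to drive automated decisions.
            *counts.entry((sled_id, instance_id, nexus_id)).or_insert(0) += 1;
        }
    }
}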

Collaborator:

I like this idea, I think my main feedback regards that "failure table". This doesn't actually even need to block us -- if you want to do a simpler version that "just logs an error", I think that is an okay place to start. I like your suggestion that an explicit response from a Sled Agent HTTP server saying "the instance is gone" should be enough info for us to mark the instance as failed - if we get a response, we have confirmation of state.

I bring all this up because "how we track+present no-response failures" is thorny. It's cool, we should be working on it, but it's big, and probably needs some design considerations, and I don't want any of that to stop this PR from moving forward.

That being said...

  • Almost all of the sled agent endpoints are hit by Nexus at one point or another -- why would we treat the instance monitoring call as the only one where we check for HTTP errors? If we went down the approach of "HTTP errors are worthy of recording", shouldn't we be doing this for all our endpoints accessed by Nexus here?
  • Is a counter sufficient information to diagnose what has happened? The moment we update failure_count / time_updated, we have no idea how big the window of failures was. Did we see one transient failure a second ago? or has this sled been dead for hours?
  • Alternatively, if we had a "row-per-error" type of structure, or something like a window, how would we do garbage collection to prevent the table from growing forever?
  • ... and, given that these are events, would it make sense to store this info in a column-based database -- like clickhouse -- instead of cockroachdb? (I'm not sure)

Collaborator:

I've a couple of random thoughts here:

> if we ask a sled-agent about an instance we believe it has, and the sled-agent responds with an HTTP error that proactively indicates that it does not know about that instance, mark that instance as Failed

What happens if we fetch the sled and instance IDs, the instance migrates, and then we make the request to the original sled about it?

> we might also want to add a notion of failure kinds to this as well.

I think this is really good to add early. It would help debugging a lot if we could distinguish connection timeouts from refusals, for example.

> for all other errors (the ones that currently get retried), such as TCP connections refused/timing out, etc, increment the failure count table for that (sled_id, instance_id, our_nexus_id) tuple

Is the history of the failure behavior useful to track? If I understand this proposal, there is a single record with the cumulative failure count, rather than one record for each failure. The latter may be very useful in the long-run for making more complex diagnoses.

hawkw (Member, Author):

It does seem like emitting both these failures and failures from clients to sled-agent endpoints elsewhere in Nexus as Oximeter metrics is potentially a good idea. The table of failure counts was just sort of a strawman proposal so that we're doing something with the information here...I'm also happy to just log them and punt on doing anything else in this PR!

Collaborator:

If you would like to spit these out as a timeseries, it might be helpful to look at the oximeter self-stats module, which reports statistics about its own collections. A very similar kind of reporting could be done here, I think.

hawkw (Member, Author):

A quick update on this: I'm currently working on emitting these failure counts as Oximeter metrics. I'll probably want @bnaecker's feedback on how I'm doing that once I push it up for review --- I can ping you then!

hawkw added a commit that referenced this pull request Apr 25, 2024
hawkw added a commit that referenced this pull request Apr 25, 2024
@hawkw hawkw force-pushed the eliza/instance-watch-rpw branch from a924d8d to a679561 on April 25, 2024 19:47
hawkw added a commit that referenced this pull request Apr 25, 2024
@hawkw hawkw force-pushed the eliza/instance-watch-rpw branch from a679561 to 9ea71c5 on April 25, 2024 22:14
hawkw added a commit that referenced this pull request Apr 26, 2024
Currently, the instance state management code in nexus is implemented as
methods on the `Nexus` type. This is problematic, as these methods can
only be called in HTTP endpoints, and not from background tasks or
sagas. For example, as part of #5611, I needed to call the
`Nexus::notify_instance_updated` method from a background task, which
isn't possible with the current factoring. Therefore, this PR moves that
method and its dependencies, as well as all the methods in the
`instance_network` module into free functions that can be called without
a `Nexus` reference.

The `Nexus` methods are left in place in order to avoid having to update
all the callers as part of this PR, but they now just call into the free
functions. The free functions take a few more arguments that are
provided by the `&self` reference in the `Nexus` methods, such as the
internal DNS `Resolver` and `opctx_alloc`, so the methods pass these in
from `&self`, making them a bit more convenient to call when a `Nexus`
reference _is_ available. For both of those reasons, I thought keeping
them around was the right thing to do.

I had briefly considered moving this code to a new `nexus-instances`
crate, but it seemed a bit iffy to me as `notify_instance_updated` also
unregisters an Oximeter producer, and I didn't love the idea of moving
Oximeter-related code to a `nexus-instances` crate --- perhaps it also
ought to go in a `nexus-oximeter` crate, or something, but that felt
like a rabbit hole that I was afraid I might just end up going down
forever, so I'll save it for later.

If there are any functional changes as part of this PR, I've made a very
bad mistake; this _should_ be a pure refactor...
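
A minimal sketch of that factoring pattern; the types and parameters below are illustrative, not the actual omicron signatures:

// Illustrative stand-ins for the real types.
struct DataStore;
struct Resolver;
struct OpContext;

struct Nexus {
    datastore: DataStore,
    internal_resolver: Resolver,
    opctx_alloc: OpContext,
}

/// Free function: callable from HTTP endpoints, sagas, and background tasks
/// alike, because it takes its dependencies explicitly rather than via &Nexus.
pub fn notify_instance_updated(
    _datastore: &DataStore,
    _resolver: &Resolver,
    _opctx: &OpContext,
    instance_id: &str,
) {
    println!("updating state for instance {instance_id}");
}

impl Nexus {
    /// Thin method kept for existing callers; it just forwards the pieces of
    /// `&self` that the free function needs.
    pub fn notify_instance_updated(&self, instance_id: &str) {
        notify_instance_updated(
            &self.datastore,
            &self.internal_resolver,
            &self.opctx_alloc,
            instance_id,
        );
    }
}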

@hawkw hawkw (Member, Author) commented Apr 26, 2024

Hmm, upon further consideration, I think this change actually does depend on the instance-update saga that Greg proposed being implemented first. The reason for this is that I realized that in the current code, where instance states are only pushed from sled-agents to Nexus, the sled agent only calls cpapi_instances_put on one Nexus replica, essentially giving it a lock on the instance row for that state update. On the other hand, in a design where states are pulled by Nexuses, we might end up in a situation where multiple Nexus instances race to update the instance in CRDB, possibly with different states, so I think we need to introduce Greg's "instance updater lock" before we update instance states from data pulled by Nexus.

As an aside, that does also highlight another reason pulling instance states is, in the long run, the right thing. In a push-based model, it's possible that a sled-agent might push an instance state update to a Nexus replica that crashes before it gets to write the state update. If that happens, that state update is basically lost forever, because we only ever send it to one Nexus. That could also be solved by adding retries to the Nexus client in sled-agent, [1] provided that we would retry with a different Nexus replica, but pulling instance states has other advantages, so it's probably the better long term solution. Edit: oh, there is a retry loop, it's just in sled-agent's InstanceRunner::publish_state_to_nexus, and not in the client implementation. So, that's kind of a moot point, at least.

Footnotes

  1. and it's possible that they're already there, since the client implementation is hidden in Progenitor-generated code? I'm still trying to figure that out...

@hawkw hawkw (Member, Author) commented Apr 26, 2024

> Hmm, upon further consideration, I think this change actually does depend on the instance-update saga that Greg proposed being implemented first. The reason for this is that I realized that in the current code, where instance states are only pushed from sled-agents to Nexus, the sled agent only calls cpapi_instances_put on one Nexus replica, essentially giving it a lock on the instance row for that state update. On the other hand, in a design where states are pulled by Nexuses, we might end up in a situation where multiple Nexus instances race to update the instance in CRDB, possibly with different states, so I think we need to introduce Greg's "instance updater lock" before we update instance states from data pulled by Nexus.

Or, perhaps the instance's generation number already guards against this? I'm not totally sure, I'll have to think a bit more about it...
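
For reference, a minimal sketch of the generation-number guard in question; this is only an illustration of the optimistic-concurrency idea, not the actual omicron datastore code:

// An update only takes effect if it carries a newer generation than the
// stored row, so two Nexus replicas racing to write different states cannot
// clobber a newer state with an older one.
#[derive(Debug)]
struct VmmRow {
    state: String,
    generation: u64,
}

struct StateUpdate {
    state: String,
    generation: u64,
}

/// Apply `update` only if it is newer than the stored row; returns whether
/// the row was actually modified.
fn try_apply(row: &mut VmmRow, update: StateUpdate) -> bool {
    if update.generation > row.generation {
        row.state = update.state;
        row.generation = update.generation;
        true
    } else {
        // Another writer already recorded this or a newer state; drop the
        // stale update.
        false
    }
}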

@hawkw hawkw force-pushed the eliza/instance-watch-rpw branch from 384a5f3 to c3694fc on May 8, 2024 22:28

@bnaecker bnaecker (Collaborator) left a comment

I'm still responding to your main question, but GH decided I can't post single comments anymore, only reviews!


// All requests fired off! While we wait for them to come back,
// let's prune old instances.
let pruned = self.metrics.lock().unwrap().prune();

Collaborator:

IIUC, here we prune instances based on whether they've been marked touched by the tasks we spawn above. Does that mean we're racing them? For example, suppose we have the following:

  • We spawn the checking task, it marks the instance touched.
  • The prune call swaps that with false so it's marked "untouched"
  • On the next pass, we spawn the checking task, but before it runs...
  • We get here, pruning that still-untouched instance.

I think the actual data we report to ClickHouse will be handled correctly, in either case, but it does mean there will appear to be many restarts of that cumulative timeseries. Logically, it seems like we should be using the database to tell us which instances to report data for: if we have it in our list from the database, keep it, otherwise prune it. That might not play well with the pagination you have here, but I think it could be made to work and avoids the race above.

hawkw (Member, Author):

After thinking about this for a minute, I don't actually believe this is racy. The checking task itself does not actually mutate the metrics; it just returns a value to this task, which updates the metric and sets the touched flag. This is the only task that mutates the metrics, and there can only be one of these tasks executing at any given time, as activating the background task borrows it mutably. And we won't activate the background task again until all in-flight checks have completed, because this task waits for all tasks spawned on the JoinSet to complete before this future completes. There would be a potential race if the checking tasks were responsible for actually updating the metrics, but they're not.

I think there is an issue here, though it's a much simpler one: we should call prune after all tasks spawned on the JoinSet have completed, rather than before.
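
A minimal sketch of that ordering, using a tokio JoinSet and a stand-in for the metrics map (everything here is illustrative, not the actual instance_watcher code):

use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::task::JoinSet;

#[tokio::main]
async fn main() {
    // Stand-in for the real per-instance metrics: instance name -> "touched" flag.
    let metrics: Arc<Mutex<HashMap<String, bool>>> = Arc::new(Mutex::new(HashMap::from([
        ("instance-a".to_string(), false),
        ("instance-b".to_string(), false),
    ])));

    // Spawn one check per instance. Each check only returns a result; it
    // never touches the metrics itself.
    let mut checks = JoinSet::new();
    let names: Vec<String> = metrics.lock().unwrap().keys().cloned().collect();
    for name in names {
        checks.spawn(async move { name });
    }

    // This task is the sole writer of the metrics, so there is no race with
    // the checks; it just waits for every one of them to come back.
    while let Some(result) = checks.join_next().await {
        if let Ok(checked) = result {
            metrics.lock().unwrap().insert(checked, true); // mark touched
        }
    }

    // Prune only after all checks have completed, so an in-flight check can't
    // be mistaken for a stale instance; keep touched entries and reset the
    // flag for the next activation.
    metrics.lock().unwrap().retain(|_, touched| std::mem::replace(touched, false));
}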

Collaborator:

> the checking task itself does not actually mutate the metrics

Ah, I didn't grok that, thanks.

> we should call prune after all tasks spawned on the JoinSet have completed

I think that makes sense. It seems logically equivalent to pruning any instances that are not returned from the database, right? If so, then that seems good to me.

@bnaecker bnaecker (Collaborator) left a comment

Ok, I'm very confused what's going on with GH's UI here. This was supposed to be a one-off comment answering your question. It appears I can only add it as part of a review now. Let me know if anything is unclear!

Comment on lines 562 to 592
/// The number of successful checks for a single instance, VMM, and sled agent.
#[derive(Clone, Debug, Metric)]
struct Check {
/// The string representation of the instance's state as understood by
/// the VMM. If the check failed, this will generally be "failed".
state: String,
/// Why the instance was marked as being in this state.
///
/// If an instance was marked as "failed" due to a check failure, this
/// will be a string representation of the failure reason. Otherwise, if
/// the check was successful, this will be "success". Note that this may
/// be "success" even if the instance's state is "failed", which
/// indicates that we successfully queried the instance's state from the
/// sled-agent, and the *sled-agent* reported that the instance has
/// failed --- which is distinct from the instance watcher marking an
/// instance as failed due to a failed check.
reason: String,
/// The number of checks that returned this state for this reason.
datum: Cumulative<u64>,
}

/// The number of unsuccessful checks for an instance and sled agent pair.
#[derive(Clone, Debug, Metric)]
struct IncompleteCheck {
/// The reason why the check was unsuccessful.
///
/// This is generated from the [`Incomplete`] enum's `Display` implementation.
reason: String,
/// The number of failed checks for this instance and sled agent.
datum: Cumulative<u64>,
}

Collaborator:

This is a good question, and though I've been thinking for a while, I'm very sorry to say I don't have a good answer for you :/

In general, I'd probably make this decision based on how one wants to query the data. If one wants to easily break out or group by a field, they should probably be part of the same timeseries. If that's rare, then I'd probably put these mostly-unrelated metrics (instance-state vs instance-state-fetch-failures) into two different timeseries.

My only hesitation with grouping them is that it sort of complicates the meaning of the data. IIUC, then the reason field is non-empty iff the state is "unknown", i.e., we failed the check. So even the successful samples carry this now-empty field. To be clear, I'm not worried about resource usage here, just confusion about what the field means, and when it's valid. Maybe changing that field name to failure_reason is enough to mitigate that confusion, though. And this is a pretty non-specific worry -- I think it should mechanically work fine to do that, even if one wants to separately query for things like known states or failed checks. In either case, we can answer the same kinds of questions:

  • what fraction of time was the instance in some state (say, stopped or running), over some period of time?
  • how often did the instance change states?
  • was the instance rebooted in this period?
  • why did Nexus fail to determine the instance's state?

So I'd just say, choose what feels most natural to you at this point! I wish I had more concrete advice for you, but I guess we'll have to develop that together as we use it more :)

@bnaecker bnaecker (Collaborator):

Thanks for all your patience on the back-and-forth on this PR! I've one other question, which I feel does not block integrating this, just something I'm curious about. How does the fact that each Nexus instance runs these checks impact the resulting metrics?

@hawkw hawkw (Member, Author) commented May 10, 2024

@bnaecker

> Thanks for all your patience on the back-and-forth on this PR! I've one other question, which I feel does not block integrating this, just something I'm curious about. How does the fact that each Nexus instance runs these checks impact the resulting metrics?

Yeah, that's why I asked you whether Oximeter automatically included producer metadata or not --- I was, once again, suffering from Prometheus brain damage (prometheus always adds the IP and port of the scrape target, as well as metadata about how it was discovered). Commit 08e01ad adds the UUID of the Nexus process to the metric target. We could also add the rack UUID if you think that's worth including now as well?
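
For illustration, a target shaped roughly like the one described; the struct and field names below are assumptions, not the actual schema added in that commit:

use oximeter::Target;
use uuid::Uuid;

/// Hypothetical target for the instance-watcher check metrics.
#[derive(Clone, Debug, Target)]
struct InstanceChecker {
    /// UUID of the Nexus process performing the checks, so samples from
    /// different Nexus replicas form distinct timeseries.
    nexus_id: Uuid,
    /// Rack UUID, included in case it is useful to group or filter by later.
    rack_id: Uuid,
}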

@bnaecker bnaecker (Collaborator):

@hawkw My current stance is that more fields are usually better :) That might change in the future, but they can always be ignored in the query and storing them is inexpensive. So if you think the rack / cluster ID is something you might want to know, then go for it!

@hawkw hawkw (Member, Author) commented May 10, 2024

> @hawkw My current stance is that more fields are usually better :) That might change in the future, but they can always be ignored in the query and storing them is inexpensive. So if you think the rack / cluster ID is something you might want to know, then go for it!

Given that it sounds like metric schema changes are painful, I'll throw it in just in case we need it! :)

@bnaecker bnaecker (Collaborator):

I wouldn't call them painful, so much as impossible :) I hope to change that in the coming months.

@hawkw hawkw merged commit cbb8dbf into main May 10, 2024
22 checks passed
@hawkw hawkw deleted the eliza/instance-watch-rpw branch May 10, 2024 21:54