[nexus] add RPW for proactively pulling instance state from sled-agents (#5611)

Currently, instance states are communicated by the sled-agents of the sleds on which an instance is running calling into a Nexus HTTP API. This is a "push"-shaped communication pattern, where the monitored resources (the sled-agents) act as clients of the monitor (Nexus) to publish their state to it. While push-shaped communications are generally more efficient than pull-shaped communications (as they only require communication when a monitored resource actively changes state), they have a substantial disadvantage: if the sled-agent is gone, stuck, or otherwise failed, Nexus will never become aware of this, and will continue happily chugging along assuming the instances are in the last state reported by their sled-agents.

In order to allow Nexus to determine when a sled-agent or VMM has failed, this branch introduces a background task ("instance_watcher") that periodically attempts to pull the states of a sled-agent's managed instances. This way, if calls to sled agents fail, the Nexus process is made aware that a sled-agent it expected to publish the state of its monitored instances may no longer be doing so, and we can (eventually) take corrective action.

Presently, the `instance_watcher` task does not perform corrective action, such as restarting failed instances or advancing instances to the "failed" state in CRDB, as affirmatively detecting failures will require other work. For example, if a sled-agent is unreachable when checking on a particular instance, the instance itself might be fine, and we shouldn't decide that it's gone unless we also can't reach the `propolis-server` process directly. Furthermore, there's the split-brain problem to consider: the Nexus process performing the check might be on one side of a network partition...and it might not be on the same side of that partition as the majority of the rack, so we may not want to consider an unreachable sled-agent "failed" unless *all* the Nexus replicas can't talk to it.

And, there are probably even more considerations I haven't thought of yet. I'm planning on starting an RFD to propose a comprehensive design for instance health checking.

In the meantime, though, the `instance_watcher` task just emits a bunch of Oximeter metrics describing the results of instance checks. These metrics capture:

- Cases where the instance check *failed* in a way that might indicate that something is wrong with the instance or its sled-agent or sled
- Cases where the instance check was *unsuccessful* due to an internal error, such as a client-side request error or an error updating the instance's database records
- Cases where the instance check was successful, and what state updates occurred as a result of the check:
  - Whether the instance state in the `instances` table was updated as a result of the state returned by sled-agent
  - Whether the VMM state in the `vmms` table was updated as a result of the state returned by sled-agent
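
The three metric categories above can be illustrated with a minimal, self-contained sketch. This is not the code from this branch: `PullResult`, `CheckOutcome`, `check_instance`, and `tally` are hypothetical names, the "database" is an in-memory `HashMap`, and the instance and VMM updates are collapsed into a single `state_updated` flag. It only shows how a pull-based check could be classified and counted without taking any corrective action.

```rust
use std::collections::HashMap;

/// Hypothetical result of asking a sled-agent for one instance's state.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PullResult {
    /// The sled-agent answered with the instance's current state.
    State(String),
    /// The sled-agent could not be reached or reported an error.
    Unreachable,
    /// The check could not be completed because of an error on the
    /// checker's own side (for example, a client-side request error).
    ClientError,
}

/// Hypothetical outcome categories mirroring the list in the message above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CheckOutcome {
    /// The check failed in a way that may point at the instance or its sled.
    Failed,
    /// The check was unsuccessful due to an internal error.
    Unsuccessful,
    /// The check succeeded; records whether the stored state changed.
    Success { state_updated: bool },
}

/// Classify one pull result and update the recorded state if it changed.
/// A failed or unsuccessful check deliberately leaves `known` untouched,
/// since the sketch (like the task described above) only observes.
pub fn check_instance(
    known: &mut HashMap<u32, String>,
    id: u32,
    result: PullResult,
) -> CheckOutcome {
    match result {
        PullResult::Unreachable => CheckOutcome::Failed,
        PullResult::ClientError => CheckOutcome::Unsuccessful,
        PullResult::State(new_state) => {
            let state_updated = known.get(&id) != Some(&new_state);
            if state_updated {
                known.insert(id, new_state);
            }
            CheckOutcome::Success { state_updated }
        }
    }
}

/// Count outcomes per category, standing in for emitting metrics.
pub fn tally(outcomes: &[CheckOutcome]) -> HashMap<&'static str, u64> {
    let mut counts = HashMap::new();
    for outcome in outcomes {
        let key = match outcome {
            CheckOutcome::Failed => "failed",
            CheckOutcome::Unsuccessful => "unsuccessful",
            CheckOutcome::Success { state_updated: true } => "success_updated",
            CheckOutcome::Success { state_updated: false } => "success_unchanged",
        };
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
}
```

Keeping `Failed` and `Unsuccessful` as separate variants reflects the distinction drawn above: the former may say something about the monitored sled, while the latter only says the checker itself had a problem.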
Showing 31 changed files with 1,313 additions and 34 deletions.