Kubernetes-friendly health checking #6023
Labels: Focus:Supervisor, Type: Feature
Background
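In Kubernetes, liveness probes determine when a container should be restarted, and readiness probes determine when a container is available to serve traffic. These checks can be performed in three ways: by executing a command inside the container, by opening a TCP socket, or by sending an HTTP GET request. A command probe succeeds only on exit code 0, and an HTTP probe succeeds only on a status code from 200 through 399.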
Current situation
Habitat’s built-in health-checking mechanism currently only reports status through the supervisor REST API at /services/<name>/<group>/health, in a format that the probe requests can’t use. The API currently always returns a 200 response with the health status encoded in the JSON response.
Proposed Solutions
Either of these could solve the problem on their own, but I'd very much like both to be implemented.
Compatible API endpoints
Add healthz endpoints to the REST API that mirror the standard health endpoints, but return 200 for healthy services and 500 otherwise. It would be beneficial to provide a health endpoint for the supervisor itself at /healthz, and per-service endpoints at /services/<name>/<group>/healthz. The downside to this technique is that both the supervisor and Kubernetes perform checks periodically. If the supervisor checks health every 30 seconds, then it doesn't make sense for Kubernetes to check the REST API more often; but if the two periods are offset the wrong way, up to a minute could pass between a service problem, the supervisor health check that detects it, and the Kubernetes readiness probe that reports it.
API endpoint parameters
Add GET parameters to the health endpoints that control the return code for each status. Prior art: HashiCorp Vault's health endpoint (https://www.vaultproject.io/api/system/health.html). This may be the simplest solution, with the best effort-to-payoff ratio.
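To make the two endpoint-based proposals concrete, here is a minimal Python sketch of the status-to-code mapping they imply. It is not supervisor code. The status names, the choice to treat a warning as unhealthy by default, and the `<status>code` query parameters (modeled loosely on Vault's `standbycode`-style parameters) are all assumptions made for illustration.

```python
from urllib.parse import parse_qs

# Assumed defaults for a healthz-style endpoint: 200 when healthy, 500 otherwise.
# Treating "warning" as unhealthy is a choice made for this sketch.
DEFAULT_CODES = {"ok": 200, "warning": 500, "critical": 500, "unknown": 500}


def probe_status_code(health, query=""):
    """Return the HTTP status code to send for a health result.

    `query` is the raw query string of the request. Hypothetical parameters
    named `<status>code` (for example `warningcode=200`) override the default
    code for that status; invalid overrides are ignored.
    """
    codes = dict(DEFAULT_CODES)
    for name, values in parse_qs(query).items():
        if not name.endswith("code"):
            continue
        status = name[: -len("code")]
        if status not in codes:
            continue
        try:
            override = int(values[-1])
        except ValueError:
            continue
        if 100 <= override <= 599:
            codes[status] = override
    # Health values this sketch does not recognize are treated like "unknown".
    return codes.get(health.lower(), codes["unknown"])


def kubernetes_http_probe_passes(status_code):
    """Kubernetes counts any status from 200 through 399 as a successful probe."""
    return 200 <= status_code < 400
```

With no query string this behaves like the proposed healthz endpoints. A request carrying `warningcode=200` would let a readiness probe keep routing traffic to a service that reports a warning.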
Direct check execution
The second path is a little more drastic: add the ability to disable the supervisor’s periodic health checks and provide a CLI interface to directly run a service’s health-check hook, à la hab pkg exec. This method allows Kubernetes to take over all of the scheduling responsibility and avoid the issue above.
It might even be possible for the Habitat Operator to configure either of these probes automatically, provided that services with no health check hook return successful statuses.
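For the direct-execution option, the translation from a hook result to a probe result could look roughly like the Python sketch below. It assumes the hook follows Habitat's documented exit codes (0 OK, 1 warning, 2 critical, 3 unknown). The function name, the timeout, and the `warning_is_healthy` switch are hypothetical, and the command that actually runs the hook is left to the caller because no such CLI exists yet.

```python
import subprocess

# Exit codes a Habitat health-check hook is documented to return.
HOOK_STATUSES = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}


def run_health_check(command, timeout_seconds=5.0, warning_is_healthy=True):
    """Run a health-check command once and return (status, probe_exit_code).

    `command` is an argument list, for example the path to a hook script.
    The second element is what an exec probe should exit with: 0 means the
    probe passes, any other value means it fails.
    """
    try:
        result = subprocess.run(command, timeout=timeout_seconds, check=False)
        status = HOOK_STATUSES.get(result.returncode, "unknown")
    except (OSError, subprocess.TimeoutExpired):
        # A hook that cannot be started or does not finish in time is "unknown".
        status = "unknown"
    healthy = status == "ok" or (status == "warning" and warning_is_healthy)
    return status, 0 if healthy else 1
```

A wrapper script would call `run_health_check` with the hook command and exit with the returned code, so the check runs only when Kubernetes asks for it.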