feat(health): add check endpoint and loop control #2575
base: master
On Kubernetes, it makes more sense to use a liveness probe than the health server loop (i.e. only have one loop). This patch introduces a flag to disable the health server loop, and a new /check/ endpoint for such probes.
When the connection is saturated, health checks can take a long time and therefore fail if the timeout is too short. Gradually increasing the timeout, as done in the health server loop, is not all that useful because the upper bound on the timeout is ultimately what you are willing to tolerate before declaring the connection unhealthy. So a static probe with a long timeout and a failure count, as implemented in Kubernetes, will be more stable (i.e. less flopping), especially if more than one sequential failure is allowed.
The above argument aside, the two health/probe loops also do not work well together because they can get out of phase. Kubernetes probes usually must be used to sequence containers in a pod.
Signed-off-by: Jean-Francois Roy <[email protected]>
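To make the failure-count argument concrete, here is a minimal Go sketch of the consecutive-failure rule that a Kubernetes liveness probe applies through its failureThreshold setting. The names failureTracker and firstUnhealthy are illustrative only; they are not part of gluetun or Kubernetes.

```go
package main

// failureTracker mimics the consecutive-failure rule of a liveness probe:
// the target is declared unhealthy only after `threshold` failures in a row,
// and any success resets the count. (Hypothetical type, for illustration.)
type failureTracker struct {
	threshold   int // consecutive failures required before declaring unhealthy
	consecutive int // failures seen since the last success
}

// observe records one probe result and reports whether the threshold
// has now been reached.
func (t *failureTracker) observe(ok bool) bool {
	if ok {
		t.consecutive = 0
		return false
	}
	t.consecutive++
	return t.consecutive >= t.threshold
}

// firstUnhealthy replays a sequence of probe results and returns the index
// of the probe at which the target is first declared unhealthy, or -1 if
// that never happens.
func firstUnhealthy(threshold int, results []bool) int {
	t := failureTracker{threshold: threshold}
	for i, ok := range results {
		if t.observe(ok) {
			return i
		}
	}
	return -1
}
```

For an alternating sequence such as fail, ok, fail, ok, fail, a threshold of 1 declares the target unhealthy on the very first probe, while a threshold of 3 never does, which is the stability gain described above.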
My intuition agrees, but I've seen this fail when 1000s of TCP connections are going through the VPN and the link is fairly loaded (80%+ of expected throughput).
The liveness endpoint uses the same code as the health loop on purpose, to avoid code duplication and maintain as much of the behavior of the loop as possible.
As I wrote in the commit message, when using a K8S liveness probe, the gluetun health loop and the K8S probe loop sort of interfere with each other -- they can easily get out of phase. So before this patch, you can end up in a situation where the gluetun loop fails and sets the status error, the K8S liveness probe comes in and samples the error (thus failing the probe), the gluetun loop then succeeds and then fails again, the K8S probe comes in and samples an error again, etc. This can lead to K8S considering the container failed, even though the health server is only flopping because the connection is loaded.
Because the gluetun loop has an adaptive timeout, it's not possible to prevent the 2 loops from going out of phase. Even if you were to change the gluetun loop to have a fixed period and match it to the K8S probe, they would still eventually get out of phase.
Disabling the gluetun health loop only makes sense when using a K8S liveness probe. When using gluetun from just Docker, it is indeed a bad idea to disable it. Perhaps the environment variable could be named something more specific ("disable for kubernetes", "disable I know what I am doing", "disable footguns are fun", etc). Alternatively, perhaps using the liveness endpoint could itself disable the gluetun loop.
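As a rough illustration of the single-loop idea, the sketch below shows an HTTP handler that runs the health check on each request, so the prober sees a fresh result instead of sampling a status written by a separate background loop. This is not the code from this patch: checkFunc, newCheckHandler, probeOnce, the stub checks, and the choice of 500 for a failed check are all assumptions made for the example.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"net/http/httptest"
)

// checkFunc stands in for whatever health check the server runs
// (hypothetical signature, for illustration only).
type checkFunc func(ctx context.Context) error

// newCheckHandler returns a handler that runs check on every request rather
// than reporting a status cached by a background loop.
func newCheckHandler(check checkFunc) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// r.Context() is canceled if the client goes away, so a check that
		// honors its context stops early when the prober gives up.
		if err := check(r.Context()); err != nil {
			// Assumed status code for an unhealthy result.
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}

// probeOnce sends a single in-process GET /check/ to the handler and returns
// the HTTP status code, standing in for one liveness probe.
func probeOnce(h http.Handler) int {
	req := httptest.NewRequest(http.MethodGet, "/check/", nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

// Stub checks used to exercise the handler.
func alwaysHealthy(context.Context) error { return nil }
func alwaysFailing(context.Context) error { return errors.New("vpn unreachable") }
```

With this shape, each probe result depends only on the check that the probe itself triggered, so there is no second loop for it to drift out of phase with; the failure count and timeout are then left entirely to the prober.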
Ah yes, very good point. I will fix the code. I don't think cancellation is likely, but it's just better to handle it.
That's an interesting alternative, but it feels better, overall, not to have 2 loops.
Yeah, and I did try that before writing this patch, but I always got into the unstable regime I described above.