Description
Copy of #3422. @neufeldtech changed jobs and doesn't have time to shepherd his original PR. We are still experiencing this issue in production and would like to get this change merged if possible.
The Problem
Intermittent 503s from the /ambassador/v0/check_alive endpoint even though gunicorn/diagd is healthy.
This change helps avoid 503s on /ambassador/v0/check_alive and similar endpoints. Currently, anything that probes this health check is subject to a race condition: gunicorn closes idle TCP connections after 2s and does not send a FIN or RST to Envoy. Envoy, believing the TCP connection is still active, attempts to route the check_alive request to gunicorn but receives a RST, because the connection is no longer valid according to gunicorn. The health check fails even though gunicorn is 'up'. This false-positive failure is especially problematic when the health check is used to detect Ambassador health from a CDN or other downstream load balancer.
See the Wireshark image of gunicorn's default 2s TCP idle timeout and the implications of this low default. Our CDN probes the backend at intervals between 500ms and 2.5s. When two HTTP probes to this pod are more than 2s apart, the race condition above is triggered and Envoy reports receiving a 503 from gunicorn, indicating a 'failed' health check.
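To make the timing concrete, here is a minimal sketch of which probes land in the danger window. It assumes the 2s idle timeout corresponds to gunicorn's `keepalive` setting (whose documented default is 2 seconds). The helper `probes_at_risk`, the sample gaps, and the 5s alternative are hypothetical illustrations, not the values or code used in this PR.

```python
# Illustrative only: models the race described above, not Ambassador's code.
GUNICORN_DEFAULT_KEEPALIVE_S = 2.0  # gunicorn's default `keepalive`, in seconds


def probes_at_risk(probe_gaps_s, keepalive_s=GUNICORN_DEFAULT_KEEPALIVE_S):
    """Return the probe gaps long enough for gunicorn to have closed the
    idle connection before Envoy reuses it (hypothetical helper)."""
    return [gap for gap in probe_gaps_s if gap >= keepalive_s]


# Gaps between consecutive CDN probes, in seconds. The 0.5s-2.5s range comes
# from the description; the individual values are made up.
gaps = [0.5, 1.2, 2.5, 0.8, 2.1]

at_risk_default = probes_at_risk(gaps)      # with the 2s default
at_risk_raised = probes_at_risk(gaps, 5.0)  # hypothetical larger keepalive

print(at_risk_default)
print(at_risk_raised)
```

With the default, any gap at or beyond 2s can hit a connection gunicorn has already dropped. A keepalive longer than the longest probe interval leaves no such gaps in this simplified model.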
Related Issues
N/A
Testing
I manually tested and captured this in production some time ago, but have since upgraded back to mainline Ambassador and am no longer running this fork. Would love to get this change merged into Ambassador to fix my 503 problem 😬 .
Checklist
I made sure to update CHANGELOG.md.
Remember, the CHANGELOG needs to mention:
This is unlikely to impact how Ambassador performs at scale.
Remember, things that might have an impact at scale include:
My change is adequately tested.
Remember when considering testing:
I updated DEVELOPING.md with any special dev tricks I had to use to work on this code efficiently.