Keepalive gunicorn #4008

jasons42 · 2022-01-12T19:40:09Z

Description

Copy of #3422. @neufeldtech changed jobs and doesn't have time to shepherd his original PR. We are still experiencing this issue in production and would like to get this change merged if possible.

The Problem

Intermittent 503s from the /ambassador/v0/check_alive endpoint even though gunicorn/diagd is healthy.

This change helps avoid 503s on the /ambassador/v0/check_alive and similar endpoints. Currently, clients that probe this health check are subject to a race condition: gunicorn closes idle TCP connections after 2s and does not send a FIN or RST to Envoy. Envoy, believing the TCP connection is still active, attempts to route the check_alive request to gunicorn but receives a RST, because the connection is no longer valid according to gunicorn. The health check then fails even though gunicorn is 'up'. This false failure is especially problematic when the health check is used to detect Ambassador health from a CDN or other downstream load balancer.
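For reference, gunicorn's `keepalive` setting (default 2 seconds) controls how long a worker waits for the next request on a keep-alive connection. A minimal sketch of a gunicorn config file that raises it to 300s, matching the commit title in this PR. The bind address, worker class, and thread count are hypothetical placeholders, and diagd may pass its gunicorn options differently than through a config file:

```python
# gunicorn.conf.py (illustrative sketch, not the actual diagd configuration)

# Hypothetical listen address; diagd's real bind address may differ.
bind = "127.0.0.1:8004"

# Assumption: a worker type that supports keep-alive connections.
# gunicorn's sync worker does not keep connections open between requests.
worker_class = "gthread"
threads = 4  # hypothetical value

# Seconds to wait for the next request on an idle keep-alive connection.
# gunicorn's default is 2; this PR proposes 300.
keepalive = 300
```

Such a file would be loaded with `gunicorn -c gunicorn.conf.py <module>:<app>`. The same setting is available on the command line as `--keep-alive 300`.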

See the Wireshark image of the default 2s TCP idle timeout from gunicorn and the implications of this low default. Our CDN probes the backend at intervals between 500ms and 2.5s. When the HTTP probes to this pod are more than 2s apart, the race condition above is triggered and Envoy reports receiving a 503 from gunicorn, indicating a 'failed' health check.
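The timing condition can be illustrated with a small sketch. It is a deliberately simplified model, not a reproduction of Envoy or gunicorn behavior: it assumes a single reused upstream connection and treats any probe that arrives after the idle timeout as a failed check. The function name and the sample gaps are hypothetical.

```python
from typing import List


def probes_after_idle_close(gaps_s: List[float], keepalive_s: float) -> List[int]:
    """Return the indices of probes that arrive after the idle timeout.

    gaps_s[i] is the idle time, in seconds, on the reused connection
    before probe i is sent. A probe is counted as failing when that gap
    exceeds the backend's keep-alive timeout.
    """
    return [i for i, gap in enumerate(gaps_s) if gap > keepalive_s]


# Hypothetical probe gaps within the 500ms to 2.5s range described above.
gaps = [0.5, 1.2, 2.5, 0.8, 2.1]

print(probes_after_idle_close(gaps, keepalive_s=2))    # [2, 4]
print(probes_after_idle_close(gaps, keepalive_s=300))  # []
```

With the 2s default, every gap longer than 2s lands in the failure window. With a 300s timeout, none of the probe gaps in this range can exceed it.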

Related Issues

N/A

Testing

I manually tested and captured this in production some time ago, but have since upgraded back to mainline Ambassador and am no longer running this fork. Would love to get this change merged into Ambassador to fix my 503 problem 😬 .

Checklist

I made sure to update CHANGELOG.md.

Remember, the CHANGELOG needs to mention:
- Any new features
- Any changes to our included version of Envoy
- Any non-backward-compatible changes
- Any deprecations

This is unlikely to impact how Ambassador performs at scale.

Remember, things that might have an impact at scale include:
- Any significant changes in memory use that might require adjusting the memory limits
- Any significant changes in CPU use that might require adjusting the CPU limits
- Anything that might change how many replicas users should use
- Changes that impact data-plane latency/scalability

My change is adequately tested.

Remember when considering testing:
- Your change needs to be specifically covered by tests.
  - Tests need to cover all the states where your change is relevant: for example, if you add a behavior that can be enabled or disabled, you'll need tests that cover the enabled case and tests that cover the disabled case. It's not sufficient just to test with the behavior enabled.
- You also need to make sure that the entire area being changed has adequate test coverage.
  - If existing tests don't actually cover the entire area being changed, add tests.
  - This applies even for aspects of the area that you're not changing – check the test coverage, and improve it if needed!
- We should lean on the bulk of code being covered by unit tests, but...
- ... an end-to-end test should cover the integration points

I updated DEVELOPING.md with any special dev tricks I had to use to work on this code efficiently.

Signed-off-by: Jason Smith <[email protected]>

neufeldtech · 2023-11-28T20:19:21Z

@jasons42 any update here?

jasons42 added 2 commits January 14, 2022 12:51

fix: set diagd gunicorn keepalive to 300s

cc905d1

Signed-off-by: Jason Smith <[email protected]>

docs: update relase notes

0b190fd

Signed-off-by: Jason Smith <[email protected]>

jasons42 force-pushed the keepalive_gunicorn branch from f875bf2 to 0b190fd Compare January 14, 2022 18:51

neufeldtech approved these changes Jan 17, 2022

kflynn mentioned this pull request Jan 19, 2022

fix: set diagd gunicorn keepalive to 300s #3422

Closed

4 tasks

This was referenced Feb 24, 2022

CI for #4008 #4142

Open

CI for #3422 #3832

Closed
