Keepalive gunicorn #4008

Open. 2 commits to be merged into base: master

jasons42

Description

Copy of #3422. @neufeldtech changed jobs and doesn't have time to shepherd his original PR. We are still experiencing this issue in production and would like to get this change merged if possible.

The Problem

Intermittent 503s from the /ambassador/v0/check_alive endpoint even though gunicorn/diagd is healthy.

This change helps avoid 503s on /ambassador/v0/check_alive and similar endpoints. Currently, anything that probes this health check is subject to a race condition: gunicorn closes idle TCP connections after 2s and does not send a FIN or RST to Envoy. Envoy, believing the TCP connection is still active, attempts to route the check_alive request to gunicorn but receives a RST, because gunicorn no longer considers the connection valid. The result is a failing health check even though gunicorn is 'up'. This false-positive failure mode is especially problematic when the health check is used to detect Ambassador health from a CDN or other downstream load balancer.
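For context, a minimal sketch of the gunicorn setting involved, assuming gunicorn is configured through a Python config file. How Ambassador actually launches gunicorn for diagd may differ, and the value below is illustrative only, not the one chosen in this PR. `keepalive` is gunicorn's documented setting for how long to wait for the next request on a keep-alive connection (default 2 seconds); gunicorn's sync worker ignores it.

```python
# gunicorn.conf.py (hypothetical file name and values, for illustration only)

# Longest gap between health probes described in this PR, in seconds.
MAX_PROBE_GAP_SECONDS = 2.5

# Seconds gunicorn waits for another request on an idle keep-alive
# connection before closing it. The default is 2; 60 is an assumed value
# chosen only because it is comfortably above the longest probe gap.
keepalive = 60
```

The idea is simply that the idle timeout should be longer than the longest expected gap between requests on a reused connection.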

See the Wireshark image of gunicorn's default 2s TCP idle timeout and the implications of this low default. Our CDN probes the backend at intervals of between 500ms and 2.5s. When the HTTP probes to this pod are more than 2s apart, the race condition above is triggered and Envoy reports receiving a 503 from gunicorn, indicating a 'failed' health check.
image
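The timing argument can be sketched with a small simulation. This is a simplified model, not a reproduction of the capture: the probe timestamps are made up, `gaps_at_risk` is a hypothetical helper, and a gap longer than the idle timeout is treated as "at risk" rather than as a guaranteed 503.

```python
def gaps_at_risk(probe_times, idle_timeout=2.0):
    """Return (previous, current) probe time pairs whose gap exceeds idle_timeout.

    probe_times: ascending probe timestamps in seconds (made-up sample data).
    idle_timeout: seconds an idle connection is kept open (2.0 is the default
    described in this PR).
    """
    return [
        (prev, cur)
        for prev, cur in zip(probe_times, probe_times[1:])
        if cur - prev > idle_timeout
    ]


# Probes spaced between 0.5s and 2.5s apart, as described above.
sample_probes = [0.0, 0.5, 3.0, 4.0, 6.5]

print(gaps_at_risk(sample_probes))        # [(0.5, 3.0), (4.0, 6.5)]
print(gaps_at_risk(sample_probes, 60.0))  # []
```

With the 2s default, every 2.5s gap lands after the connection has already been closed on the gunicorn side. With a longer idle timeout, none of these gaps do.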

Related Issues

N/A

Testing

I manually tested this and captured it in production some time ago, but I have since upgraded back to mainline Ambassador and am no longer running this fork. I would love to get this change merged into Ambassador to fix my 503 problem 😬.

Checklist

  • I made sure to update CHANGELOG.md.

    Remember, the CHANGELOG needs to mention:

    • Any new features
    • Any changes to our included version of Envoy
    • Any non-backward-compatible changes
    • Any deprecations
  • This is unlikely to impact how Ambassador performs at scale.

    Remember, things that might have an impact at scale include:

    • Any significant changes in memory use that might require adjusting the memory limits
    • Any significant changes in CPU use that might require adjusting the CPU limits
    • Anything that might change how many replicas users should use
    • Changes that impact data-plane latency/scalability
  • My change is adequately tested.

    Remember when considering testing:

    • Your change needs to be specifically covered by tests.
      • Tests need to cover all the states where your change is relevant: for example, if you add a behavior that can be enabled or disabled, you'll need tests that cover the enabled case and tests that cover the disabled case. It's not sufficient just to test with the behavior enabled.
    • You also need to make sure that the entire area being changed has adequate test coverage.
      • If existing tests don't actually cover the entire area being changed, add tests.
      • This applies even for aspects of the area that you're not changing – check the test coverage, and improve it if needed!
    • We should lean on the bulk of code being covered by unit tests, but...
    • ... an end-to-end test should cover the integration points
  • I updated DEVELOPING.md with any special dev tricks I had to use to work on this code efficiently.

This was referenced Feb 24, 2022
@neufeldtech
Contributor

@jasons42 any update here?
