Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Requests sent to terminating pods #15211

Open
dspeck1 opened this issue May 15, 2024 · 12 comments
Open

Requests sent to terminating pods #15211

dspeck1 opened this issue May 15, 2024 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/needs-user-input Issues which are waiting on a response from the reporter

Comments

@dspeck1
Copy link

dspeck1 commented May 15, 2024

What version of Knative?

1.14.0

Expected Behavior

Being able to send groups of 200 requests and knative service those requests.

The requests not to be scheduled to terminating pods.

Actual Behavior

Sending in groups of 200 requests to knative. The processing takes 5 minutes on knative to run. All the pods will finish with 200 return code. When a second groups of 200 requests is sent in and pods are terminating many of the requests will return 502 bad gateway errors. The requests are getting scheduled to pods that are terminating.

Steps to Reproduce the Problem

Watch for pods to terminate and send in requests. kourier is the ingress and using the knative autoscaler.

@dspeck1 dspeck1 added the kind/bug Categorizes issue or PR as related to a bug. label May 15, 2024
@skonto
Copy link
Contributor

skonto commented May 16, 2024

Hi @dspeck1 could you pls provide more info on how to reproduce this eg. ksvc definition, env setup. There was a similar issue in the past but it was not reproducible, status was unclear.

/triage needs-user-input

@knative-prow knative-prow bot added the triage/needs-user-input Issues which are waiting on a response from the reporter label May 16, 2024
@dspeck1
Copy link
Author

dspeck1 commented May 22, 2024

Hi @skonto. I posted testing code here The app folder has the knative service, the tester folder sends simultaneous requests, and the knative operator config is here To replicate the issue. Send a job with 200 requests, watch for the pods to start to terminate then send job-2 and observe 502 bad gateway errors in the response. It does not happen every time. I have also noticed it does not happen if the pod runs for a short time (10 seconds, 30 seconds). It occurs on long requests like 5 minutes.

The below error is from the queue proxy when this happens. We see the same behavior on Google Cloud GKE and on an on-premise Kubernetes Cluster.

logger: "queueproxy"
message: "error reverse proxying request; sockstat: sockets: used 8
TCP: inuse 3 orphan 11 tw 12 alloc 183 mem 63
UDP: inuse 0 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
"
stacktrace: "knative.dev/pkg/network.ErrorHandler.func1
	knative.dev/[email protected]/network/error_handler.go:33
net/http/httputil.(*ReverseProxy).ServeHTTP
	net/http/httputil/reverseproxy.go:472
knative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP
	knative.dev/serving/pkg/queue/request_metric.go:199
knative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3.2
	knative.dev/serving/pkg/queue/handler.go:65
knative.dev/serving/pkg/queue.(*Breaker).Maybe
	knative.dev/serving/pkg/queue/breaker.go:155
knative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3
	knative.dev/serving/pkg/queue/handler.go:63
net/http.HandlerFunc.ServeHTTP
	net/http/server.go:2166
knative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4
	knative.dev/serving/pkg/queue/forwarded_shim.go:54
net/http.HandlerFunc.ServeHTTP
	net/http/server.go:2166
knative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4
	knative.dev/serving/pkg/http/handler/timeout.go:118"
timestamp: "2024-05-22T19:42:56.16014057Z"

Below are similar issues I have found:
This one mentions a lack of a graceful shutdown for the user-container for in flight requests.
Timeout issue on long requests

Thanks for your help! Please let me know anything else you need.

@dspeck1
Copy link
Author

dspeck1 commented May 28, 2024

Here is another related issue. #9355

@dspeck1
Copy link
Author

dspeck1 commented Jun 4, 2024

@skonto - checking in.

@skonto
Copy link
Contributor

skonto commented Jun 5, 2024

Hi @dspeck1 would you mind checking if this is related to the graceful shutdown in Python #12865 (comment)?

@dspeck1
Copy link
Author

dspeck1 commented Jun 17, 2024

Thanks! I was away on vacation and am back now. Testing python container with dumb-init for SIGTERM handling and will let you know.

@dspeck1
Copy link
Author

dspeck1 commented Jun 17, 2024

We are still receiving 502 bad gateway messages with dumb-init.

@ashrafguitoni
Copy link

ashrafguitoni commented Jul 24, 2024

Hi @dspeck1 would you mind checking if this is related to the graceful shutdown in Python #12865 (comment)?

I think that's unrelated... The dumb-init usage allows knative pods to terminate gracefully (otherwise they stay stuck Terminating for the duration of timeoutSeconds as specified in the service specs).

Our team has been having the same issue as @dspeck1 for a long time. We tried so many things including using nginx unit (which seemed to reduce the frequency of the errors, for some reason), but it still occurs. We get the error sometimes when services are scaled to zero and we send a few requests.

Edit: Actually, it seems like more similar issues have been popping up, and one of them mentions a simple potential fix (don't know if it would work for our cases), so I'll try it, and maybe you could as well: #15352 (comment)

Edit 2: setting responseStartTimeoutSeconds did not work, I'm still getting the same error logs in queue proxy...

@skonto
Copy link
Contributor

skonto commented Oct 8, 2024

@dspeck1 hi, could you verify that no activator pods are being scaled up and down, or failed, preempted, during your test (there are hpa resources installed by default by the operator), so requests might be affected by activator killing the request due to termination grace period (you set TBC=-1 so activator is always on the path). The setting there needs to be increased if max-revision-timeout-seconds is increased.

@skonto
Copy link
Contributor

skonto commented Nov 26, 2024

@dspeck1 gentle ping.

@ashrafguitoni
Copy link

@skonto I set terminationGracePeriodSeconds: 36000 on the activator deployment, and timeoutSeconds: 7200 on the knative service, yet I'm still getting the 502 errors (with the same error message in the queue proxy that OP shared) within a few minutes of triggering requests. What would you advise I look into next?

@skonto
Copy link
Contributor

skonto commented Dec 25, 2024

Hi @ashrafguitoni I am off. I did a quick check, I will take a look again when back. Assuming you face the same issue as @dspeck1.

Could you verify that you don't suffer from no proper signal handling from the app (user container) as I describe here? If you don't do a graceful shutdown of your server in python or any other language, handling SIGTERM, draining the connections, and then you just run the app as PID=1, you will get an EOF (502 bad gateway) error as QP will not wait. @dspeck1's code seems to handle signals (due to CMD exec gunicorn) but does not wait for the response to finish (no draining):

terminal a:
docker run -p 8080:8080 -e PORT=8080 -e SLEEP_TIME=90 skonto/ktesting
[2024-12-25 21:31:36 +0000] [1] [INFO] Starting gunicorn 23.0.0
[2024-12-25 21:31:36 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2024-12-25 21:31:36 +0000] [1] [INFO] Using worker: gthread
[2024-12-25 21:31:36 +0000] [9] [INFO] Booting worker with pid: 9
[2024-12-25 21:33:21 +0000] [1] [INFO] Handling signal: term
[2024-12-25 21:33:51 +0000] [1] [INFO] Shutting down: Master

terminal b:
curl -v http://127.0.0.1:8080
*   Trying 127.0.0.1:8080...
* Connected to 127.0.0.1 (127.0.0.1) port 8080
> GET / HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/8.6.0
> Accept: */*
> 
* Empty reply from server
* Closing connection
curl: (52) Empty reply from server

terminal c:
kill 11140

When I run that Python app on K8s with SLEEP_TIME=90 and kill the pod fast enough I see (empty error maps to 502):

< HTTP/1.1 502 Bad Gateway
< content-length: 4
< content-type: text/plain; charset=utf-8
< date: Wed, 25 Dec 2024 22:20:33 GMT
< x-content-type-options: nosniff
< x-envoy-upstream-service-time: 70738
< server: envoy
< 
EOF
* Connection #0 to host 192.168.39.174 left intact


log:

{"severity":"ERROR","timestamp":"2024-12-25T22:20:33.705190831Z","logger":"queueproxy","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 7\nTCP: inuse 0 orphan 0 tw 2 alloc 339 mem 28\nUDP: inuse 0 mem 3\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"6a27004","knative.dev/key":"default/autoscale-go-00001","knative.dev/pod":"autoscale-go-00001-deployment-5cc77db79-vlqs6","error":"EOF","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:472\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:76\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2171\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2171\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}

This is not the case if I use the example in here.

In order to fix the code you just need to add a graceful timeout (note that setting to zero/infinite as for --timeout will not work), default is only 30s:

CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 --graceful-timeout 120 app:app

Note here that the above is not proper, docker complains

 - JSONArgsRecommended: JSON arguments recommended for CMD to prevent unintended behavior related to OS signals (line 20)

Now requests are handled as expected.

< HTTP/1.1 200 OK
< content-length: 28
< content-type: text/html; charset=utf-8
< date: Wed, 25 Dec 2024 22:43:36 GMT
< server: envoy
< x-envoy-upstream-service-time: 90104
< 
Hello World! Sleep Time 90 
* Connection #0 to host 192.168.39.174 left intact

In any case you have a different problem could you provide logs from the QP, activator side, kourier, envoy gateway etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/needs-user-input Issues which are waiting on a response from the reporter
Projects
None yet
Development

No branches or pull requests

3 participants