Unexpected probe failures due to transient denied connections #305
Comments
@steveteahan - On one of the nodes with an impacted pod, can you please check the SDK logs and see if you notice this line around the time of the event - SDK logs location -
|
@jayanthvn I'll have to find some time to reproduce the issue again. It may not be for a few days. I didn't get a chance to capture those logs originally, but I'll make sure to run the capture script on the next one. |
@jayanthvn I haven't had as much time to reproduce in our development environment as I had hoped. Is this issue something that you also had a chance to reproduce at all? I'm concerned that this bug could prevent the usage of |
@steveteahan |
I have not tried the latest version. Would that bug only present itself in scenarios where there is >1 |
What happened:
We have an application that is failing readiness and liveness probes because the traffic is being denied by the NetPol agent. We've seen this across multiple versions, including `v1.1.0-eksbuild.1` and `v1.1.2-eksbuild.1`.
I was able to see that the network traffic was being denied in `/var/log/aws-routed-eni/network-policy-agent.log`. After some period of time, the traffic is accepted again and the application recovers.
What stuck out to me is that there are multiple `PolicyEndpoint`s created. Our NP looks something like:
Think of use cases where all pods need to reach a core service. This results in multiple PEs:
I tested that by changing `namespaceSelector: {}` to a rule like `ipBlock.cidr: 0.0.0.0/0`, the multiple PEs are removed and a single PE is created, since the `Pod`s in the cluster no longer need to be enumerated in `.spec.ingress`. We haven't seen a single probe failure in the week since changing the configuration to remove the multiple PEs, compared to literally hundreds of failures over the preceding couple of weeks.
It's also worth noting that this is an intermittent issue. The pattern we see is that the probes fail, the container is restarted, and then the service recovers. We'll see this anywhere from 1-5 times a day. Interestingly, we see this issue on a few of our clusters with ~2000 pods but a relatively low pod churn rate. We never see container restarts on our cluster with ~3000 pods and a higher churn rate due to heavy usage of `CronJob`s. I can see the `Received a new reconcile request` log line happening far more frequently in `/var/log/aws-routed-eni/network-policy-agent.log` on the cluster that's not experiencing this issue. Any potential bug may still be occurring on that cluster, but the next reconciliation happens faster than the time it takes for the liveness probes to fail (~30s).
Attach logs
Logs were sent.
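The ~30s figure mentioned above is consistent with the probe settings listed in the reproduction steps (`period=5s`, `#failure=6`). A minimal Python sketch of that arithmetic follows. The names `ProbeSettings`, `seconds_until_restart`, and `denial_is_masked` are hypothetical helpers for illustration, and period multiplied by failure threshold is only an approximation of when the kubelet would restart the container (it ignores the probe timeout and where in the probe cycle the denial starts).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeSettings:
    """Subset of probe settings taken from the report (hypothetical type)."""
    period_s: float          # period=5s in the report
    failure_threshold: int   # #failure=6 in the report


def seconds_until_restart(probe: ProbeSettings) -> float:
    """Approximate time of continuous failure before the liveness probe trips.

    Assumption: roughly period * failureThreshold; the exact value also
    depends on the probe timeout and on probe phase.
    """
    return probe.period_s * probe.failure_threshold


def denial_is_masked(denial_duration_s: float, probe: ProbeSettings) -> bool:
    """True if a transient denial ends before enough probes fail in a row."""
    return denial_duration_s < seconds_until_restart(probe)


probe = ProbeSettings(period_s=5, failure_threshold=6)
window = seconds_until_restart(probe)
print(f"approximate failure window: {window:.0f}s")
print("10s denial masked:", denial_is_masked(10, probe))
print("45s denial masked:", denial_is_masked(45, probe))
```

Under this approximation, a denial that is cleared by a subsequent reconciliation within the window would not cause a restart, which fits the hypothesis that the high-churn cluster hides the same behaviour.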
What you expected to happen:
Liveness / readiness probe traffic is not denied.
How to reproduce it (as minimally and precisely as possible):
http-get http://some-endpoint delay=0s timeout=3s period=5s #success=1 #failure=6
`NetworkPolicy` using `namespaceSelector: {}` for ingress rules
Anything else we need to know?:
Environment:
Kubernetes version (use `kubectl version`):
CNI version: `v1.18.2-eksbuild.1`
Network Policy Agent version: `v1.1.2-eksbuild.1`
OS (e.g. `cat /etc/os-release`):
Kernel (e.g. `uname -a`):
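For readers comparing the two ingress rule shapes discussed in this report, here is a minimal Python sketch that builds both as `networking.k8s.io/v1` `NetworkPolicy` manifests. It is not the policy from the affected cluster: the policy name, the namespace, the empty `podSelector`, and the `build_policy` helper are all assumptions made for illustration.

```python
import json


def build_policy(name: str, namespace: str, ingress_from: dict) -> dict:
    """Build a NetworkPolicy manifest with a single ingress peer.

    All metadata values passed in here are hypothetical placeholders.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {},           # assumption: applies to every pod in the namespace
            "policyTypes": ["Ingress"],
            "ingress": [{"from": [ingress_from]}],
        },
    }


# Variant described as problematic in the report: an empty namespaceSelector
# matches pods in all namespaces.
all_namespaces = build_policy(
    "allow-ingress-example", "core-service", {"namespaceSelector": {}}
)

# Variant described as the workaround: a single CIDR block covering all IPv4.
any_ipv4 = build_policy(
    "allow-ingress-example", "core-service", {"ipBlock": {"cidr": "0.0.0.0/0"}}
)

print(json.dumps(all_namespaces, indent=2))
print(json.dumps(any_ipv4, indent=2))
```

The two variants are not equivalent in scope: `namespaceSelector: {}` admits traffic from pods in any namespace, while `ipBlock` with `0.0.0.0/0` admits any IPv4 source, including addresses outside the cluster. Whether that broader rule is acceptable depends on the environment.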