Unexpected probe failures due to transient denied connections #305

Open
steveteahan opened this issue Sep 4, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@steveteahan

steveteahan commented Sep 4, 2024

What happened:

We have an application that is failing readiness and liveness probes because the probe traffic is being denied by the Network Policy agent. We've seen this across multiple agent versions, including v1.1.0-eksbuild.1 and v1.1.2-eksbuild.1.

I was able to see that the network traffic was being denied in /var/log/aws-routed-eni/network-policy-agent.log. After some period of time, the traffic is accepted again and the application recovers.

What stuck out to me is that there are multiple PolicyEndpoints created. Our NP looks something like:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  labels:
    app.kubernetes.io/name: <app>
  name: <app>
spec:
  ingress:
  - from:
    - namespaceSelector: {}
...

Think of use cases where all pods need to reach a core service. This results in multiple PEs:

% kubectl -n <namespace> get policyendpoint <pe-name-0> -o json | jq '.spec.ingress | length'
879

% kubectl -n <namespace> get policyendpoint <pe-name-1> -o json | jq '.spec.ingress | length'
684

% kubectl -n <namespace> get policyendpoint <pe-name-2> -o json | jq '.spec.ingress | length'
293
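
The per-PE counts above can also be collected in one pass. This is a minimal sketch, assuming the output of `kubectl -n <namespace> get policyendpoint -o json` has been saved to a file and follows the usual Kubernetes list shape (`items[].metadata.name`). Only the `.spec.ingress` field is confirmed by the commands above; the helper names `ingress_counts` and `counts_from_file` are hypothetical.

```python
import json


def ingress_counts(policyendpoint_list):
    """Return {PolicyEndpoint name: number of .spec.ingress entries}."""
    counts = {}
    for item in policyendpoint_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unnamed>")
        # A missing or null ingress list is counted as zero entries.
        ingress = item.get("spec", {}).get("ingress") or []
        counts[name] = len(ingress)
    return counts


def counts_from_file(path):
    """Load a saved `kubectl get policyendpoint -o json` dump and count entries."""
    with open(path) as handle:
        return ingress_counts(json.load(handle))
```

More than one key in the returned mapping for a single NetworkPolicy would correspond to the multi-PE situation described here.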

I confirmed that changing namespaceSelector: {} to a rule like ipBlock.cidr: 0.0.0.0/0 removes the multiple PEs and creates a single PE, since the Pods in the cluster no longer need to be enumerated in .spec.ingress. We haven't seen a single probe failure in the week since changing the configuration to remove the multiple PEs, compared to literally hundreds of failures over the preceding couple of weeks.
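
For reference, the workaround can be sketched as a manifest generator. This is an illustration under stated assumptions, not the exact policy used here: the `podSelector` is assumed to be empty because the original spec is elided in the issue, and the function name and its `name`/`namespace` arguments are hypothetical. Note that `ipBlock` with `0.0.0.0/0` matches any IPv4 source, not only in-cluster pods, so it is broader than `namespaceSelector: {}`.

```python
import json


def allow_all_ipv4_policy(name, namespace):
    """Build a NetworkPolicy that admits ingress via ipBlock rather than namespaceSelector."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Assumption: select every pod in the namespace; the real
            # selector is not shown in the issue.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # A single CIDR rule, so no per-pod IPs need to be enumerated.
            "ingress": [{"from": [{"ipBlock": {"cidr": "0.0.0.0/0"}}]}],
        },
    }


def to_json(policy):
    """Serialize the manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(policy, indent=2)
```

The serialized output can be written to a file and applied with `kubectl apply -f`.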

It's also worth noting that this is an intermittent issue. The pattern we see is that the probes fail, the container is restarted, and then the service recovers. We see this anywhere from 1-5 times a day. Interestingly, we see this issue on a few of our clusters with ~2000 pods but a relatively low pod churn rate. We never see container restarts on our cluster with ~3000 pods, which has a higher churn rate due to heavy usage of CronJobs. The "Received a new reconcile request" log line appears far more frequently in /var/log/aws-routed-eni/network-policy-agent.log on the cluster that's not experiencing this issue. The bug may therefore still be occurring on that cluster, with the next reconciliation arriving before the liveness probes reach their failure threshold (~30s).
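
One way to check the reconciliation-frequency hypothesis is to compare the gaps between reconcile requests against the probe failure window. The sketch below assumes the timestamps of the reconcile log lines have already been extracted as epoch seconds; the log's timestamp format is not shown in this issue, so that parsing step is deliberately left out. The function names are hypothetical and the 30-second default comes from the estimate above.

```python
def max_reconcile_gap(timestamps):
    """Largest gap in seconds between consecutive reconcile timestamps, or None."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return None
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def gaps_exceeding(timestamps, window_seconds=30.0):
    """Return (start, end) pairs where no reconcile happened for longer than the window."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > window_seconds
    ]
```

If the hypothesis holds, the affected clusters would show gaps longer than the window around the restart times, while the high-churn cluster would show few or none.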

Attach logs

Logs were sent.

What you expected to happen:

Liveness / readiness probe traffic is not denied.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster with more than 1000-2000 pods so that multiple PolicyEndpoints are created
  2. Configure an application with liveness probes, something like http-get http://some-endpoint delay=0s timeout=3s period=5s #success=1 #failure=6
  3. Configure a NetworkPolicy using namespaceSelector: {} for ingress rules
  4. Allow the application to run for some number of hours (again, we see this 0-3 times per day)
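
The ~30s figure mentioned earlier follows from the probe settings in step 2. As a rough sketch (the function names are hypothetical, and the bounds are an approximation that ignores kubelet scheduling jitter):

```python
def approx_failure_window(period_seconds, failure_threshold):
    """Approximate time of sustained failure before the kubelet acts on the probe."""
    return period_seconds * failure_threshold


def failure_window_bounds(period_seconds, failure_threshold, timeout_seconds):
    """Rough lower and upper bounds on how long traffic must be denied.

    Lower bound: the first failed probe fires right as the denial starts.
    Upper bound: the denial starts just after a successful probe and the
    final probe only fails once its timeout expires.
    """
    lower = (failure_threshold - 1) * period_seconds
    upper = failure_threshold * period_seconds + timeout_seconds
    return lower, upper
```

With period=5s, #failure=6, and timeout=3s, the approximation gives 30 seconds, with bounds of roughly 25 to 33 seconds.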

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
% kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.11-eks-db838b0
  • CNI Version: v1.18.2-eksbuild.1
  • Network Policy Agent Version: v1.1.2-eksbuild.1
  • OS (e.g: cat /etc/os-release):
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
  • Kernel (e.g. uname -a):
$ uname -a
Linux eks-prod-dove-c-0fa30f4f6ba5af8be 5.10.219-208.866.amzn2.x86_64 #1 SMP Tue Jun 18 14:00:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
@steveteahan steveteahan added the bug Something isn't working label Sep 4, 2024
@jayanthvn
Contributor

jayanthvn commented Sep 12, 2024

@steveteahan - On one of the nodes with an impacted pod, can you please check the SDK logs and see whether this line appears around the time of the event?

SDK logs location - /var/log/aws-routed-eni/ebpf-sdk.log

error: ":"unable to update map: invalid argument"}
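
A small helper for this check might look like the following. It is a sketch only: it does plain substring matching and makes no assumptions about the log's structure beyond the message text quoted above. The function names are hypothetical.

```python
def find_marker_lines(lines, marker="unable to update map: invalid argument"):
    """Return (line_number, text) for every line that contains the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if marker in line
    ]


def scan_log(path="/var/log/aws-routed-eni/ebpf-sdk.log"):
    """Scan the SDK log on a node for the map-update error."""
    with open(path, errors="replace") as handle:
        return find_marker_lines(handle)
```

Running `scan_log()` on an affected node and comparing the matching lines with the probe failure times would show whether the error coincides with the denied traffic.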

@steveteahan
Author

@jayanthvn I'll have to find some time to reproduce the issue again. It may not be for a few days. I didn't get a chance to capture those logs originally, but I'll make sure to run the capture script on the next one.

@steveteahan
Author

@jayanthvn I haven't had as much time to reproduce in our development environment as I had hoped. Is this issue something that you also had a chance to reproduce at all? I'm concerned that this bug could prevent the usage of NetworkPolicy on foundational services that have many pods connecting to them.

@jaydeokar
Contributor

jaydeokar commented Oct 2, 2024

@steveteahan
There is a bug that we fixed in the latest v1.1.3 release where IPs could get garbage collected while the SDK was making an update, resulting in traffic getting blocked. Have you tried the latest version to see if you run into the same issue?

@steveteahan
Author

> @steveteahan There is a bug that we fixed in the latest v1.1.3 release where IPs could get garbage collected while the SDK was making an update, resulting in traffic getting blocked. Have you tried the latest version to see if you run into the same issue?

I have not tried the latest version. Would that bug only present itself in scenarios where there is >1 PolicyEndpoint? We still have not experienced this issue since I changed the NetworkPolicy rule such that there is only a single PolicyEndpoint.
