
Network denies despite allow-all policy (strict mode) #288

Open
creinheimer opened this issue Jul 19, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@creinheimer

Hello,

For several weeks we've been working on implementing network policies using the AWS solution, but we've encountered various challenges along the way. Initially, we discovered that using the standard enforcement mode could lead to network instability. As a result, we decided to use the so-called strict mode.

In this thread #271 (comment), @achevuru suggested that we could create an allow-all policy for each namespace and that the only side effect would be the deny-all window during the first seconds of a newly launched pod. We then created an allow-all policy in all namespaces and enabled the ANNOTATE_POD_IP flag to allow faster network policy evaluation.

Now we have a new issue: pods in namespaces with an allow-all network policy are still experiencing network denies. This isn't limited to the initial startup period. It's happening long after pods have been running, sometimes hours later.

This behaviour is causing various problems, including pod crashes. In some cases, even the pods' internal health checks are being denied, triggering unnecessary restarts.

Can you provide any insight into why this might be happening? Am I missing something?

More info:

Deny logs from pods to control-plane (screenshot)

Deny logs from pods to pods in the same namespace (screenshot)

These are just a few of them. We had approx. 200 denies over the last 15 minutes.

NetworkPolicy allow-all
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  finalizers:
  - networking.k8s.aws/resources
  name: allow-all
  namespace: kube-prometheus-stack
spec:
  egress:
  - {}
  ingress:
  - {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
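
For anyone trying to reproduce this, the policy and the pod IP annotation can be checked with something like the following (a sketch; <pod-name> is a placeholder, and vpc.amazonaws.com/pod-ips is, to my knowledge, the annotation key that ANNOTATE_POD_IP writes):

# confirm the allow-all policy is present in the namespace
kubectl -n kube-prometheus-stack get networkpolicy allow-all -o yaml

# check that a pod carries the IP annotation written by ANNOTATE_POD_IP
kubectl -n kube-prometheus-stack get pod <pod-name> -o yaml | grep 'vpc.amazonaws.com/pod-ips'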

Environment:

  • Kubernetes: v1.29.4-eks-036c24b
  • CNI Version: v1.18.1
  • Network Policy Agent Version: v1.1.2
  • OS: Amazon Linux 2
  • Kernel: 5.10.215-203.850.amzn2.aarch64

Our AWS-CNI uses the default helm-chart with the following variables:

AWS-CNI configuration
env:
  ENABLE_PREFIX_DELEGATION: "true"
  AWS_VPC_K8S_PLUGIN_LOG_FILE: stderr
  AWS_VPC_K8S_PLUGIN_LOG_LEVEL: DEBUG
  AWS_VPC_K8S_CNI_LOG_FILE: stdout
  AWS_VPC_K8S_CNI_LOGLEVEL: DEBUG
  NETWORK_POLICY_ENFORCING_MODE: strict
  ANNOTATE_POD_IP: "true"
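
A quick way to confirm these variables actually land on the aws-node DaemonSet (a sketch; it assumes the default DaemonSet name in kube-system):

# dump the aws-node container's environment as applied to the DaemonSet
kubectl -n kube-system get daemonset aws-node \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="aws-node")].env}'

# or simply grep the rendered description for the two relevant variables
kubectl -n kube-system describe daemonset aws-node | grep -E 'NETWORK_POLICY_ENFORCING_MODE|ANNOTATE_POD_IP'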

Note:

@jayanthvn this is a follow-up to #73 (comment).

@creinheimer creinheimer added the bug Something isn't working label Jul 19, 2024
@achevuru
Contributor

@creinheimer If I understood the issue accurately, you have pods that are only configured with an allow-all policy, but they're still denying all traffic? If yes, is this specific to a few pods, or do you observe this behavior across all pods in your cluster?

So, the issue with standard mode you referenced above is tied to flows that start during the first few seconds after a new pod launches, I assume? ANNOTATE_POD_IP should help bring the NP reconciliation latency down to under 1s in standard mode.

@creinheimer
Author

Hi @achevuru,

I mentioned the other issues to give you some context. ANNOTATE_POD_IP is already configured.

If I understood the issue accurately, you have pods that are only configured with an allow-all policy, but they're still denying all traffic? If yes, is this specific to a few pods, or do you observe this behavior across all pods in your cluster?

Yes. That happens sporadically on different pods even though we have an allow-all rule in all namespaces.

  • Please note that we are using enforcement mode strict.
  • You can check the last 24 hours of logs here (filtered by Verdict=DENY); a sketch of how to pull those deny events straight from a node follows below.
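
For reference, the deny events can be pulled from an affected node roughly like this (a sketch; it assumes the agent writes to the default /var/log/aws-routed-eni/network-policy-agent.log path):

# on an affected node (e.g. via SSM), inspect the most recent deny verdicts
sudo grep DENY /var/log/aws-routed-eni/network-policy-agent.log | tail -n 50

# rough count of deny events in the current log file
sudo grep -c DENY /var/log/aws-routed-eni/network-policy-agent.log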

I would suggest we focus on understanding why denials occur sporadically (sometimes hours after pods have been running) despite having an allow-all rule applied to all namespaces.

@pelzerim

pelzerim commented Aug 9, 2024

Hi, we are experiencing a similar issue with STRICT mode + ANNOTATE_POD_IP. We also have an allow-all policy.

Pods can start and then are unable to connect to any host. They end up in a crash loop (due to timeouts in the app) and never recover. Only removing the pods manually resolves this issue.

We moved to strict mode as we were experiencing dropped connections with workloads shortly after pod start.

These are the network-policy-agent.log logs (attached as network-policy-agent.log). The pod name is workload-dxl6g. I've also attached the aws-eks-na-cli outputs.

We can easily reproduce this.

[edit]
Some more information: we have extreme pod churn (pod lifetime 5-10 seconds) and this issue affects roughly 25% of pods. We had to move away from strict mode and are now going with ANNOTATE_POD_IP plus an init container that literally watches for "Successfully attached.*$${POD_NAME}" in the agent's logs.

I am happy to supply any debugging information to help resolve this.
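
For reference, a trimmed-down sketch of what such an init container can look like (illustrative only: it assumes the agent log sits at the default /var/log/aws-routed-eni/network-policy-agent.log on the node and is exposed via a hostPath mount; the image and names are placeholders, not necessarily the exact manifest used here):

# excerpt from the workload's pod spec (sketch)
spec:
  volumes:
    - name: npa-log
      hostPath:
        path: /var/log/aws-routed-eni/network-policy-agent.log
        type: File
  initContainers:
    - name: wait-for-network-policy
      image: busybox:1.36              # placeholder image
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      command:
        - sh
        - -c
        - |
          # block until the node agent reports it attached the eBPF probes for this pod
          until grep -q "Successfully attached.*${POD_NAME}" /host/network-policy-agent.log; do
            sleep 1
          done
      volumeMounts:
        - name: npa-log
          mountPath: /host/network-policy-agent.log
          readOnly: true

The grep loop is the only essential piece; everything else is standard pod spec plumbing.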

@anshulpatel25

Hello @pelzerim,

We are seeing the same behaviour, as our use case also involves a short pod lifecycle of 10-15 seconds.

The init container workaround that you currently have, is it 100% effective, or are you still observing issues after that workaround?

Thanks!

@pelzerim

The init container workaround that you currently have, is it 100% effective, or are you still observing issues after that workaround?

Hey @anshulpatel25, the init container workaround only works for standard mode. We've determined that it's not actually the log line that does the magic, but the minimum wait time of 1 second.

Unrelated to that issue, strict mode currently seems to be incompatible with high pod churn (see my previous comment).

@617m4rc

617m4rc commented Sep 24, 2024

Hi @pelzerim, can you shed some light on what the implementation of this init container looks like? Does it just wait? So far we have tried a static wait time in the regular container, but this does not seem to affect the problem.

@Pavani-Panakanti
Contributor

@creinheimer We have a couple of fixes that went in with the latest release, v1.1.3. Can you try the latest image and let us know if you are still seeing the issue? https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.18.5
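
For anyone checking, the images actually running can be listed with something like this (a sketch; assumes the default aws-node DaemonSet name):

# list the CNI and node agent images in the aws-node DaemonSet
kubectl -n kube-system get daemonset aws-node \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{" -> "}{.image}{"\n"}{end}'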
