
Network denies despite allow-all policy (strict mode) #288

Open
creinheimer opened this issue Jul 19, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@creinheimer

Hello,

For several weeks we've been working on implementing network policies using the AWS solution, but we've encountered various challenges along the way. Initially, we discovered that using the standard enforcement mode could lead to network instability. As a result, we decided to use the so-called strict mode.

In this thread #271 (comment), @achevuru suggested that we could create an allow-all policy for each namespace and that the only side effect would be the deny-all window during the first seconds of a newly launched pod. We then created an allow-all policy in all namespaces and enabled the ANNOTATE_POD_IP flag to allow faster network policy evaluation.

Now we have a new issue: pods in namespaces with an allow-all network policy are still experiencing network denies. This isn't limited to the initial startup period. It's happening long after pods have been running, sometimes hours later.

This behaviour is causing various problems, including pod crashes. In some cases, even the pods' internal health checks are being denied, triggering unnecessary restarts.

Can you provide any insight into why this might be happening? Am I missing something?

More info:

Deny logs from pods to control-plane (screenshot)

Deny logs from pods to pods in the same namespace (screenshot)

These are just a few of them. We had approx. 200 denies over the last 15 minutes.

NetworkPolicy allow-all
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  finalizers:
  - networking.k8s.aws/resources
  name: allow-all
  namespace: kube-prometheus-stack
spec:
  egress:
  - {}
  ingress:
  - {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
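
For anyone trying to reproduce this, the policy and the pod IP annotation can be checked with something like the following (a sketch; <pod-name> is a placeholder, and vpc.amazonaws.com/pod-ips is, to my knowledge, the annotation key that ANNOTATE_POD_IP writes):

# confirm the allow-all policy is present in the namespace
kubectl -n kube-prometheus-stack get networkpolicy allow-all -o yaml

# check that a pod carries the IP annotation written by ANNOTATE_POD_IP
kubectl -n kube-prometheus-stack get pod <pod-name> -o yaml | grep 'vpc.amazonaws.com/pod-ips'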

Environment:

  • Kubernetes: v1.29.4-eks-036c24b
  • CNI Version: v1.18.1
  • Network Policy Agent Version: v1.1.2
  • OS: Amazon Linux 2
  • Kernel: 5.10.215-203.850.amzn2.aarch64

Our AWS-CNI uses the default helm-chart with the following variables:

AWS-CNI configuration
env:
  ENABLE_PREFIX_DELEGATION: "true"
  AWS_VPC_K8S_PLUGIN_LOG_FILE: stderr
  AWS_VPC_K8S_PLUGIN_LOG_LEVEL: DEBUG
  AWS_VPC_K8S_CNI_LOG_FILE: stdout
  AWS_VPC_K8S_CNI_LOGLEVEL: DEBUG
  NETWORK_POLICY_ENFORCING_MODE: strict
  ANNOTATE_POD_IP: "true"
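
A quick way to confirm these variables actually land on the aws-node DaemonSet (a sketch; it assumes the default DaemonSet name in kube-system):

# dump the aws-node container's environment as applied to the DaemonSet
kubectl -n kube-system get daemonset aws-node \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="aws-node")].env}'

# or simply grep the rendered description for the two relevant variables
kubectl -n kube-system describe daemonset aws-node | grep -E 'NETWORK_POLICY_ENFORCING_MODE|ANNOTATE_POD_IP'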

Note:

@jayanthvn this is a follow-up to #73 (comment).

@creinheimer creinheimer added the bug Something isn't working label Jul 19, 2024
@achevuru
Contributor

@creinheimer If I understood the issue accurately, you have pods that are only configured with an allow-all policy, but they're still denying all traffic? If yes, is this specific to a few pods, or do you observe this behavior across all pods in your cluster?

So, the issue with standard mode you referenced above is tied to flows that start during the first few seconds after a new pod launches, I assume? ANNOTATE_POD_IP should help bring the NP reconciliation latency down to under 1s in standard mode.

@creinheimer
Author

Hi @achevuru,

I mentioned the other issues to give you some context. ANNOTATE_POD_IP is already configured.

If I understood the issue accurately, you have pods that are only configured with an allow-all policy, but they're still denying all traffic? If yes, is this specific to a few pods, or do you observe this behavior across all pods in your cluster?

Yes. That happens sporadically on different pods even though we have an allow-all rule in all namespaces.

  • Please note that we are using enforcement mode strict.
  • You can check the last 24 hours of logs here (filtered by Verdict=DENY); a sketch of how to pull those deny events straight from a node follows below.
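
For reference, the deny events can be pulled from an affected node roughly like this (a sketch; it assumes the agent writes to the default /var/log/aws-routed-eni/network-policy-agent.log path):

# on an affected node (e.g. via SSM), inspect the most recent deny verdicts
sudo grep DENY /var/log/aws-routed-eni/network-policy-agent.log | tail -n 50

# rough count of deny events in the current log file
sudo grep -c DENY /var/log/aws-routed-eni/network-policy-agent.log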

I would suggest we focus on understanding why denials occur sporadically (sometimes hours after pods have been running) despite having an allow-all rule applied to all namespaces.

@pelzerim

pelzerim commented Aug 9, 2024

Hi, we are experiencing a similar issue with STRICT mode + ANNOTATE_POD_IP. We also have an allow-all policy.

Pods can start and then are unable to connect to any host. They end up in a crash loop (due to timeouts in the app) and never recover. Only removing the pods manually resolves this issue.

We moved to strict mode as we were experiencing dropped connections with workloads shortly after pod start.

These are the network-policy-agent.log logs (attached as network-policy-agent.log). The pod name is workload-dxl6g. I've also attached the aws-eks-na-cli outputs.

We can easily reproduce this.

[edit]
Some more information: we have extreme pod churn (pod lifetime 5-10 seconds) and this issue affects roughly 25% of pods. We had to move away from strict mode and are now going with ANNOTATE_POD_IP plus an init container that literally watches for "Successfully attached.*$${POD_NAME}" in the agent's logs.

I am happy to supply any debugging information to help resolve this.
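
For reference, a trimmed-down sketch of what such an init container can look like (illustrative only: it assumes the agent log sits at the default /var/log/aws-routed-eni/network-policy-agent.log on the node and is exposed via a hostPath mount; the image and names are placeholders, not necessarily the exact manifest used here):

# excerpt from the workload's pod spec (sketch)
spec:
  volumes:
    - name: npa-log
      hostPath:
        path: /var/log/aws-routed-eni/network-policy-agent.log
        type: File
  initContainers:
    - name: wait-for-network-policy
      image: busybox:1.36              # placeholder image
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      command:
        - sh
        - -c
        - |
          # block until the node agent reports it attached the eBPF probes for this pod
          until grep -q "Successfully attached.*${POD_NAME}" /host/network-policy-agent.log; do
            sleep 1
          done
      volumeMounts:
        - name: npa-log
          mountPath: /host/network-policy-agent.log
          readOnly: true

The grep loop is the only essential piece; everything else is standard pod spec plumbing.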

@anshulpatel25

Hello @pelzerim,

We are seeing the same behaviour, as our use case also involves a short pod lifecycle of 10-15 seconds.

The init container workaround that you currently have, is it 100% effective, or are you still observing issues after that workaround?

Thanks!

@pelzerim

The init container workaround that you currently have, is it 100% effective, or are you still observing issues after that workaround?

Hey @anshulpatel25, the init container workaround only works for standard mode. We've determined that it's not actually the log line that does the magic, but the minimum wait time of 1 second.

Unrelated to that issue, strict mode currently seems to be incompatible with high pod churn (see my previous comment).

@617m4rc

617m4rc commented Sep 24, 2024

Hi @pelzerim, can you shed some light on what the implementation of this init container looks like? Does it just wait? So far we have tried a static wait time in the regular container, but this does not seem to affect the problem.

@Pavani-Panakanti
Contributor

@creinheimer We have a couple of fixes that went in with the latest release, v1.1.3. Can you try the latest image and let us know if you are still seeing the issue? https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.18.5
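
For anyone checking, the images actually running can be listed with something like this (a sketch; assumes the default aws-node DaemonSet name):

# list the CNI and node agent images in the aws-node DaemonSet
kubectl -n kube-system get daemonset aws-node \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{" -> "}{.image}{"\n"}{end}'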
