Race condition causes quickly opened connections to fail #186

dave-powell-kensho · 2024-01-22T21:18:27Z

What happened:

After enabling network policy support we observed applications which opened connections early in their lifecycle would become hung. It appears that the process had established a connection successfully, and were stuck in a read syscall indefinitely.

# ss -nitp
State                         Recv-Q                         Send-Q                                                 Local Address:Port                                                      Peer Address:Port                         Process                         
ESTAB                         0                              0                                                       xx.xx.xx.xx:44670                                                   xx.xx.xx.xx:443                           xx
         cubic wscale:13,7 rto:204 rtt:2.217/0.927 ato:40 mss:1388 pmtu:9001 rcvmss:1448 advmss:8949 cwnd:10 bytes_sent:783 bytes_acked:784 bytes_received:4735 segs_out:6 segs_in:7 data_segs_out:3 data_segs_in:4 send 50085701bps lastsnd:904904 lastrcv:904904 lastack:904900 pacing_rate 100160104bps delivery_rate 11131824bps delivered:4 app_limited busy:8ms rcv_space:56587 rcv_ssthresh:56587 minrtt:1.339 snd_wnd:65536

The process becomes stuck while in a read syscall

# cat /proc/8/syscall 
0 0x3 0x56505f69dcc3 0x5 0x0 0x0 0x0 0x7ffe65dd4f38 0x7f94ae42b07d

This occurred across multiple disparate deployments with the common feature being early outbound connections. When debugging the affected pods, we found that we were able to open outbound connections without issue. Our theory is that the application is opening connections early in the pod lifecycle before the agent gets going, and once the network policy agent does its work, the connection is affected. In these cases we had no egress filtering network policies applied to the pods, but did have ingress filters.

Attach logs

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):
Create a pod that immediately attempts to download a large enough file to last several seconds. The request ends up hanging, but executing the same request on the same pod after some period of initialization succeeds.

Anything else we need to know?:
Possibly related to #144 ?

Environment:

Kubernetes version (use kubectl version): 1.26
CNI Version: 1.15.3
Network Policy Agent Version: 1.0.5 and 1.0.8rc1
OS (e.g: cat /etc/os-release): AL2
Kernel (e.g. uname -a): 5.10.192-183.736.amzn2.x86_64 Network Policy Agent - Initial commit #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

The text was updated successfully, but these errors were encountered:

ndrafahl · 2024-02-05T15:29:35Z

Hey @dave-powell-kensho - did you end up finding a resolution to your problem?

We are seeing somewhat similar results in our environment after enabling the network policy enforcement via the AWS CNI addon, that we were not seeing prior to enforcing the network policies. I don't think though that's it's because they're occurring too quickly after the pod comes up.

We are seeing that connections appear to be successful initially, but are timing out due to a "Read Timeout" on the client's side. We know the initial connection is successful because there are actions that are being taken on the server side, and a retry of the same action basically gives us a response of "you already did this".

In other cases, we're seeing that connections that have no timeout enforced basically stay open indefinitely, and we have to forcefully reboot the pods (namely, connections to an RDS instance).

dave-powell-kensho · 2024-02-05T15:36:07Z

@ndrafahl We were not able to find a resolution that left the netpol agent running, and rolled back/unwound the netpol agent changes. We have not experienced these issues since the rollback ~2 weeks ago.

ndrafahl · 2024-02-05T15:39:08Z

Did you basically go through these steps, for your rollback?:

Deleted all of your ingress network policies in the cluster
Set enable-network-policy-controller to false in ConfigMap amazon-vpc-cni (kube-system NS).
Set the 'enableNetworkPolicy' parameter to false. This will disable the agents on the nodes.

Out of curiosity, did you also try updating the addon to 1.16.x? That's been one suggestion that has been made to us, but we haven't yet taken that path. Right now we're trying to figure out which direction to take.

Sorry - to add one additional question, did you guys do the same thing in other environments without any sort of seen issue there?

dave-powell-kensho · 2024-02-05T19:09:31Z

Yes, those are the steps we took, though we also removed the netpol agent container from the daemonset at the end.

I'm not aware of that version - I had seen similar issues with requests from the developers to try 1.0.8rc1, which we did upgrading to (from 1.0.5) with the same results.

We were able to replicate this issue in multiple environments. We have left this addon enabled in our lowest environment so that we're able to test any potential fixes quickly.

dave-powell-kensho · 2024-02-05T19:12:01Z

cc @jayanthvn We've been sitting on this issue for a couple weeks now and would really appreciate some eyes from the maintainers.

ndrafahl · 2024-02-05T20:26:45Z

Did you find that you also needed to remove the node agent from the daemonset as well, after those steps, to get your issue to be resolved?

I tested the steps in a lower environment, and sure enough that container is still running on the pod even though the addon is set to not enforce network policies any longer.

jayanthvn · 2024-02-05T22:02:14Z

@dave-powell-kensho - Sorry somehow lost track of this. This is expected behavior if the connection is established prior to policy reconciliation against the new pod. Please see this - #189 (comment)

dave-powell-kensho · 2024-02-06T20:43:03Z

@ndrafahl We removed the node agent from the pod's container list, yes, though we self-manage the aws-node deployment config, so I cannot advise on helm charts and the like.

@jayanthvn Thank you for the update, we'll be looking forward to the release of the strict mode feature. Is there any issue or other location we can track to know when it is released?

ndrafahl · 2024-02-06T21:04:37Z

@dave-powell-kensho Thanks for the info, appreciate you responding. 👍

jdn5126 · 2024-02-16T17:03:41Z

@dave-powell-kensho you can track the progress of #209 and its release

ariary · 2024-04-04T08:51:24Z

@jdn5126 @jayanthvn I'm not sure the Strict Mode solved the issue.

In fact, the issue is describing especially a bug in the standard potion of the Strict Mode which is (still) blocking some traffic.

Last tests with v1.17.1-eksbuild.1 and standard: short-lived connections are still blocked (while explicitly allowed by network policies + after some times pod is able to perform same connection without any issue)

dave-powell-kensho added the bug Something isn't working label Jan 22, 2024

ndrafahl mentioned this issue Feb 5, 2024

Random Client Read Timeouts to RDS After Enabling Network Policies via AWS CNI #204

Closed

jdn5126 added the strict mode Issues blocked on strict mode implementation label Feb 16, 2024

Mohilpalav mentioned this issue Mar 19, 2024

Network policy blocks established connections to RDS #236

Open

younsl mentioned this issue Apr 11, 2024

Intermittent connection reset and delay running time #245

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race condition causes quickly opened connections to fail #186

Race condition causes quickly opened connections to fail #186

dave-powell-kensho commented Jan 22, 2024

ndrafahl commented Feb 5, 2024

dave-powell-kensho commented Feb 5, 2024

ndrafahl commented Feb 5, 2024 •

edited

Loading

dave-powell-kensho commented Feb 5, 2024

dave-powell-kensho commented Feb 5, 2024

ndrafahl commented Feb 5, 2024

jayanthvn commented Feb 5, 2024

dave-powell-kensho commented Feb 6, 2024

ndrafahl commented Feb 6, 2024

jdn5126 commented Feb 16, 2024

ariary commented Apr 4, 2024

Race condition causes quickly opened connections to fail #186

Race condition causes quickly opened connections to fail #186

Comments

dave-powell-kensho commented Jan 22, 2024

ndrafahl commented Feb 5, 2024

dave-powell-kensho commented Feb 5, 2024

ndrafahl commented Feb 5, 2024 • edited Loading

dave-powell-kensho commented Feb 5, 2024

dave-powell-kensho commented Feb 5, 2024

ndrafahl commented Feb 5, 2024

jayanthvn commented Feb 5, 2024

dave-powell-kensho commented Feb 6, 2024

ndrafahl commented Feb 6, 2024

jdn5126 commented Feb 16, 2024

ariary commented Apr 4, 2024

ndrafahl commented Feb 5, 2024 •

edited

Loading