[Support]: General issue of missing packets #332
Thanks Xavier for the detailed report. We'll start looking into it. A couple of quick questions:

1. Can you share the instance sizes you are using?
2. When you say communication is with an "external host", is that host also an EC2 instance? Same AZ?
3. Usually, when something like this happens with no indication of error from the ENA stack, it could be caused by packet drops in the network outside the host (routers, gateways), which is something ENA would not know about. To confirm or eliminate this theory, we suggest running the test between two EC2 instances (ideally in the same Cluster Placement Group) to see whether the issue is on the host or the network side.

On Dec 13, 2024, at 4:46 AM, Xavier R. Guérin ***@***.***> wrote:
Preliminary Actions
I have searched the existing issues and didn't find a duplicate.
I have followed the official AWS troubleshooting documentation.
I have followed the driver readme and best practices.
Driver Type
Linux kernel driver for Elastic Network Adapter (ENA)
Driver Tag/Commit
DPDK 23.11
Custom Code
No
OS Platform and Distribution
Ubuntu Noble
Support request
This is a follow-up on issues #235 and #286, as I cannot re-open either of them. I believe there is a more general issue of missing packets when using the DPDK driver for ENA.
Background
I am using a custom TCP stack on top of the ENA DPDK driver;
The correctness of that stack is irrelevant, as the problem concerns L1/L2;
When establishing connections to external hosts, two failure scenarios may occur:
a. The entire connection may disappear (ref #286, "[Support]: sudden disappearance of user-space TCP streams")
b. Streams of packets, of varying length, are lost (ref #235, "DPDK ENA PMD Silently Drops Packets on Rx")
Observations
The ENA port never reports RX overruns, misses, or any sort of errors
Our queue utilization never goes beyond 20% (we use the max RX queue depth of 8192):
a. More precisely, we ask to read 8192 buffers and never get more than 2000 back
b. Multiple consecutive reads may return as much, but never more
We notice missing packets even at very small loads (20 TCP connections, 2 MB/s of bandwidth)
Beyond the odd missing packet, we experience waves of large dropped packet streams
These issues appear on all instance types we tested: c5*, c6i*, m6i*, and c7i
We have not yet tested on metal instances
Throwing more queues at the problem does not help (tested on c6in.8xlarge)
A sketch of the read-and-check pattern behind these observations follows this list.
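For concreteness, the pattern behind observations (a) and (b) looks roughly like the sketch below. This is a minimal, hypothetical reconstruction, not the actual stack: it assumes DPDK 23.11 with the port and queue already configured and started, and `poll_rx_queue` is an illustrative name.

```c
#include <inttypes.h>
#include <stdio.h>

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_RING_DEPTH 8192 /* the max ENA RX queue depth we configure */

/* Poll one RX queue with a burst size equal to the ring depth and sample
 * the basic port counters. In the runs described above, rte_eth_rx_burst()
 * never returns more than ~2000 mbufs per call, and the error counters
 * stay at zero even while packets are going missing. */
static void poll_rx_queue(uint16_t port_id, uint16_t queue_id)
{
    static struct rte_mbuf *bufs[RX_RING_DEPTH]; /* static: 64 KiB is large for the stack */
    struct rte_eth_stats stats;

    for (;;) {
        uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, bufs, RX_RING_DEPTH);

        for (uint16_t i = 0; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]); /* the real stack hands these to TCP instead */

        /* Sampled every iteration here for brevity; a real loop would
         * check periodically. */
        if (rte_eth_stats_get(port_id, &stats) == 0 &&
            (stats.imissed != 0 || stats.ierrors != 0 || stats.rx_nombuf != 0))
            printf("imissed=%" PRIu64 " ierrors=%" PRIu64 " rx_nombuf=%" PRIu64 "\n",
                   stats.imissed, stats.ierrors, stats.rx_nombuf);
    }
}
```

(rte_eth_xstats_get() exposes additional driver-specific counters that would be worth checking as well.)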
It is interesting to note that, for identical configurations, the kernel driver never loses a single packet (as per tcpdump, assuming it captures packets pre-reassembly). Once upon a time I was able to use traffic mirroring to verify the streams, but no more, as no Nitro instance is supported.
Case analysis
In one case, we ran 250 connections to multiple external hosts over 8 hours on a single port, with 7 queues assigned to TCP traffic (c6in instance). The results of the run are below:
[Image: 20241212.-.BinanceUS.packet.losses.jpg]
In that picture, we show:
on top, the per-minute bandwidth of the instance
on the bottom, the per-5-minute throughput in packets/s
the vertical lines mark the instances of large lost packet streams
What you can see immediately is that, except for the large number of logical connections, the bandwidth used and the PPS throughput are very reasonable. You can also see that the lost streams do not coincide with any peak (bytes/s or packets/s). Also, those lost streams are very large: in the one at 01:50, we lost packets on 105 connections, for a total of 3 MB.
I'm running out of ideas, as I can't use traffic mirroring to check what actually arrives on the wire. I have yet to test on metal instances and to benchmark the driver between two internal hosts to see if I can reproduce locally (a rough sketch of such a test harness follows). Any help/suggestion would be appreciated.
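A minimal RX-only counter along these lines could serve for that two-instance test: it counts what the application sees and compares it with the port's hardware counters. This is a hypothetical sketch (DPDK 23.11 APIs, port 0, one queue, illustrative constants), not the harness actually used.

```c
#include <inttypes.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define NUM_MBUFS 16383   /* mempools size best at 2^n - 1 */
#define RX_DESC   8192    /* same max ring depth as in the failing setup */
#define TX_DESC   1024    /* unused here, but many PMDs expect a TX queue */
#define BURST     512

static volatile sig_atomic_t quit;
static void on_int(int sig) { (void)sig; quit = 1; }

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    struct rte_mempool *pool = rte_pktmbuf_pool_create("rx_pool",
        NUM_MBUFS, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (pool == NULL)
        rte_exit(EXIT_FAILURE, "mbuf pool creation failed\n");

    uint16_t port = 0;
    struct rte_eth_conf conf = {0};
    if (rte_eth_dev_configure(port, 1, 1, &conf) < 0 ||
        rte_eth_rx_queue_setup(port, 0, RX_DESC,
                               rte_eth_dev_socket_id(port), NULL, pool) < 0 ||
        rte_eth_tx_queue_setup(port, 0, TX_DESC,
                               rte_eth_dev_socket_id(port), NULL) < 0 ||
        rte_eth_dev_start(port) < 0)
        rte_exit(EXIT_FAILURE, "port setup failed\n");

    signal(SIGINT, on_int);

    uint64_t sw_rx = 0;
    struct rte_mbuf *bufs[BURST];
    while (!quit) {
        uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST);
        sw_rx += n;
        for (uint16_t i = 0; i < n; i++)
            rte_pktmbuf_free(bufs[i]);
    }

    /* Compare the software count with the hardware counters: a gap between
     * the sender's count and ipackets points at the network; a gap between
     * ipackets and sw_rx points at the host/driver. */
    struct rte_eth_stats st;
    rte_eth_stats_get(port, &st);
    printf("sw_rx=%" PRIu64 " ipackets=%" PRIu64
           " imissed=%" PRIu64 " ierrors=%" PRIu64 "\n",
           sw_rx, st.ipackets, st.imissed, st.ierrors);

    rte_eth_dev_stop(port);
    rte_eth_dev_close(port);
    return 0;
}
```

Run on the receiving instance (e.g. with EAL options like `-l 0-1 -a <ENA PCI address>`) while the peer transmits a known packet count.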
Contact Details
No response
Thanks @nafeabshara.
I'm happy to leave the program that exposes the issue running for a while if that helps. I can let you know the instance ID and the ENI ID.
Currently …

"External", as we need to go through the internet gateway and we don't connect over our own private subnets. AFAIK, all of the remote peers are hosted on AWS. One on … Please note that I'm seeing those issues regardless of where the instance is located. I have a test instance in …

I will try that next. Remember that no such behavior is seen when running with kernel sockets with identical configurations, at least AFAICT using a local tcpdump (which may or may not reflect the actual wire traffic).

EDIT 1: running on a …

EDIT 2: I can reproduce, on demand, both the dropped-TCP-stream issue and the dropped-packet issue between 2 …

@nafeabshara I'd be happy to share the protocol with you guys if you want to reproduce the issues. It uses the userspace TCP stack mentioned above (OSS), with a server on one side (using a single queue) and a client on the other side (using multiple queues, bonded together, to accommodate the TX buffer requirements of the client); a rough sketch of that setup appears below.
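For reference, the client-side port setup described here might look roughly like this sketch (hypothetical function name and sizes, DPDK 23.11; not the actual OSS stack's code):

```c
#include <rte_ethdev.h>
#include <rte_mempool.h>

/* Configure one port with a single RX queue and several TX queues; the
 * application spreads its transmit load across the TX queues ("bonded"
 * only in that application-level sense). Returns 0 on success. */
static int setup_client_port(uint16_t port, uint16_t nb_tx_queues,
                             struct rte_mempool *pool)
{
    struct rte_eth_conf conf = {0};
    int sock = rte_eth_dev_socket_id(port);

    if (rte_eth_dev_configure(port, 1, nb_tx_queues, &conf) < 0)
        return -1;
    if (rte_eth_rx_queue_setup(port, 0, 8192, sock, NULL, pool) < 0)
        return -1;
    for (uint16_t q = 0; q < nb_tx_queues; q++)
        if (rte_eth_tx_queue_setup(port, q, 1024, sock, NULL) < 0)
            return -1;
    return rte_eth_dev_start(port);
}
```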
xguerin Hello,

All the best,