
[Support]: General issue of missing packets #332

Open · 3 tasks done
xguerin opened this issue Dec 13, 2024 · 3 comments
Labels
DPDK driver · support (Ask a question or request support) · triage (Determine the priority and severity)

Comments

xguerin commented Dec 13, 2024

Preliminary Actions

Driver Type

Linux kernel driver for Elastic Network Adapter (ENA)

Driver Tag/Commit

DPDK 23.11

Custom Code

No

OS Platform and Distribution

Ubuntu Noble

Support request

This is a follow-up to issues #235 and #286, as I cannot re-open either of them. I believe there is a more general issue of missing packets when using the DPDK driver for ENA.

Background

  1. I am using a custom TCP stack on top of the ENA DPDK driver;
  2. The correctness of that stack is irrelevant as the problem concerns L1/L2;
  3. When establishing connections to external hosts, two erroneous scenarios may happen:
    a. Entire connections may disappear (ref. [Support]: sudden disappearance of user-space TCP streams #286)
    b. Streams of packets, of varying length, are lost (ref. DPDK ENA PMD Silently Drops Packets on Rx #235)

Observations

  1. The ENA port never reports RX overruns, misses, or any sort of errors (see the sketch after this list)
  2. Our queue utilization never goes beyond 20% (we use the max RX queue depth of 8192)
    a. More precisely, we ask to read 8192 buffers in a single burst and never get more than 2000 back
    b. Multiple consecutive reads may each return that many, but never more
  3. We notice missing packets even at very small loads (20 TCP connections, 2 MB/s bandwidth)
  4. Beyond the odd missing packet, we experience waves of large dropped packet streams
  5. These issues appear on all instance types we tested: c5*, c6i*, m6i*, and c7i
  6. We have not yet tested on metal instances
  7. Throwing more queues at the problem does not help (tested on a c6in.8xlarge)
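
For context, here is a minimal sketch (not our actual code) of how observations 1 and 2 above are gathered, assuming a standard DPDK 23.11 application where the port is already configured and started; the burst size, queue id, and function names are illustrative:

```c
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_BURST 8192 /* matches the max RX queue depth mentioned above */

/* Observation 2: ask for up to 8192 buffers in one burst; in our runs the
 * driver never returns more than ~2000 per call. The caller provides an
 * array of RX_BURST mbuf pointers. */
static uint16_t poll_rx(uint16_t port_id, uint16_t queue_id,
                        struct rte_mbuf **bufs)
{
    return rte_eth_rx_burst(port_id, queue_id, bufs, RX_BURST);
}

/* Observation 1: dump the generic and extended counters where the ENA PMD
 * would report drops (imissed, ierrors, rx_nombuf, ...) if it noticed any. */
static void dump_drop_counters(uint16_t port_id)
{
    struct rte_eth_stats stats;

    if (rte_eth_stats_get(port_id, &stats) == 0)
        printf("imissed=%" PRIu64 " ierrors=%" PRIu64 " rx_nombuf=%" PRIu64 "\n",
               stats.imissed, stats.ierrors, stats.rx_nombuf);

    /* Query the number of extended stats, then fetch names and values. */
    int n = rte_eth_xstats_get(port_id, NULL, 0);
    if (n <= 0)
        return;

    struct rte_eth_xstat *xstats = calloc(n, sizeof(*xstats));
    struct rte_eth_xstat_name *names = calloc(n, sizeof(*names));

    if (xstats != NULL && names != NULL &&
        rte_eth_xstats_get(port_id, xstats, n) == n &&
        rte_eth_xstats_get_names(port_id, names, n) == n) {
        for (int i = 0; i < n; i++)
            if (xstats[i].value != 0)
                printf("%s: %" PRIu64 "\n", names[i].name, xstats[i].value);
    }

    free(xstats);
    free(names);
}
```

In our runs, none of the drop-related counters printed by this kind of check ever move, which is what observation 1 refers to.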

It is interesting to note that, for identical configurations, the kernel driver never loses a single packet (as per tcpdump, assuming it captures packets pre-reassembly). I was once able to use traffic mirroring to verify the streams, but no longer, as no Nitro instance is supported.

Case analysis

In one case, we ran 250 connections to multiple external hosts over 8 hours on a single port, with 7 queues assigned to TCP traffic (c6in instance). The results of the run are below:

[Figure: 20241212 - BinanceUS packet losses]

In that picture, we show:

  1. on top, the per-minute bandwidth of the instance
  2. on the bottom, the per-5-minute throughput in packets/s
  3. the vertical lines, which mark the instances of large streams of lost packets

What is immediately visible is that, aside from the large number of logical connections, the bandwidth and PPS throughput in use are very reasonable. You can also see that the lost streams do not coincide with any peak in bytes/s or packets/s. Those lost streams are also very large: in the one at 01:50, we lost packets on 105 connections for a total of 3 MB.

I'm running out of ideas, as I can't use traffic mirroring to check what actually arrives on the wire. I have yet to test on metal instances and to benchmark the driver between two internal hosts to see if I can reproduce locally. Any help or suggestions would be appreciated.

Contact Details

No response

xguerin added the support (Ask a question or request support) and triage (Determine the priority and severity) labels on Dec 13, 2024

nafeabshara commented Dec 13, 2024 via email


xguerin commented Dec 13, 2024

Thanks @nafeabshara.

We’ll start looking into it

I'm happy to let the program exposing the issue running for a while if that helps. I can let you know the instance ID and the ENI ID.

  1. Can you share the instance sizes you are using?

Currently c6in.2xlarge. I'm going to try with a c6in.metal shortly.

  2. When you say communication is with an “external host”, is that host also an EC2 instance? Same AZ?

"External" in the sense that we need to go through the internet gateway, and we don't connect over our own private subnets. AFAIK, all of the remote peers are hosted on AWS. One is in us-east-1a, where one of our instances is; the others are in apne.

Please note that I'm seeing these issues regardless of where the instance is located. I have a test instance in eu-west-3 that shows the same problems as the one in us-east-1.

  3. Usually, when something like this happens with no indication of error from the ENA stack, it could be caused by packet drops in the network outside the host (routers, gateways), which is something ENA would not know about. To confirm or eliminate this theory, we suggest running the test between two EC2 instances (ideally in the same Cluster Placement Group) to see whether the issue is on the host or the network side.

I will try that next. Remember that no such behavior is seen when running with kernel sockets with identical configurations, at least AFAICT using a local tcpdump (which may or may not reflect the actual wire traffic).

EDIT 1: Running on a c6in.metal does not make a difference. We still observe spurious, large streams of lost packets with the same configurations we run on the c6in.2xlarge.

EDIT 2: I can reproduce, on demand, both the dropped TCP stream issue and the dropped packet issue between 2 c6in.2xlarge instances connected on the same private network, in the same AZ, using their private IPs.

@nafeabshara I'd be happy to share the protocol with you if you want to reproduce the issues. It uses the userspace TCP stack mentioned above (OSS), with a server on one side (using a single queue) and a client on the other side (using multiple queues, bonded together, to accommodate the TX buffer requirements of the client); a rough sketch of that kind of multi-queue setup follows.
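
For reference, a hedged sketch of a basic multi-queue port setup of the kind the client uses (not the bonding configuration itself, and not our exact code), assuming DPDK 23.11 defaults; the queue count, the 8192-descriptor RX rings, and the TX ring depth are illustrative:

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define NB_QUEUES   7      /* illustrative: 7 queues dedicated to TCP traffic */
#define NB_RX_DESC  8192   /* max RX ring depth mentioned earlier */
#define NB_TX_DESC  1024   /* illustrative TX ring depth */

/* Configure a port with NB_QUEUES RX/TX queue pairs and start it. */
static int setup_port(uint16_t port_id, struct rte_mempool *mp)
{
    struct rte_eth_conf conf = { 0 };
    int ret;

    ret = rte_eth_dev_configure(port_id, NB_QUEUES, NB_QUEUES, &conf);
    if (ret != 0)
        return ret;

    for (uint16_t q = 0; q < NB_QUEUES; q++) {
        /* RX queue backed by the shared mbuf pool, default RX config. */
        ret = rte_eth_rx_queue_setup(port_id, q, NB_RX_DESC,
                                     rte_eth_dev_socket_id(port_id),
                                     NULL, mp);
        if (ret != 0)
            return ret;

        /* TX queue with default TX config. */
        ret = rte_eth_tx_queue_setup(port_id, q, NB_TX_DESC,
                                     rte_eth_dev_socket_id(port_id),
                                     NULL);
        if (ret != 0)
            return ret;
    }

    return rte_eth_dev_start(port_id);
}
```

The server side is the same, only with a single queue pair instead of NB_QUEUES.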

@shaibran (Contributor)

@xguerin Hello,

  1. Please contact me directly via email [email protected] to further investigate the issue you described.
  2. Could you share the details of a specific incident so we can review the EC2 logs? We need the instance IDs, the region where they were launched, and the timeframe during which the test was conducted (a timeframe of approximately 24 hours is sufficient).
  3. Could you please share the DPDK version you are using and whether you changed the ENA PMD in any way (e.g., applied backports)?

All the best,
Shai
