
[Support]: General issue of missing packets #332

Open · 3 tasks done
xguerin opened this issue Dec 13, 2024 · 3 comments
Labels
DPDK driver · support (Ask a question or request support) · triage (Determine the priority and severity)

Comments

xguerin commented Dec 13, 2024

Preliminary Actions

Driver Type

Linux kernel driver for Elastic Network Adapter (ENA)

Driver Tag/Commit

DPDK 23.11

Custom Code

No

OS Platform and Distribution

Ubuntu Noble

Support request

This is a follow-up to issues #235 and #286, as I cannot re-open either of them. I believe there is a more general issue of missing packets when using the DPDK driver for ENA.

Background

  1. I am using a custom TCP stack on top of the ENA DPDK driver;
  2. The correctness of that stack is irrelevant as the problem concerns L1/L2;
  3. When establishing connections to external hosts, two erroneous scenarios may happen:
    a. Entire connections may disappear (ref. [Support]: sudden disappearance of user-space TCP streams #286)
    b. Streams of packets, of varying length, are lost (ref. DPDK ENA PMD Silently Drops Packets on Rx #235)

Observations

  1. The ENA port never reports RX overruns, misses, or any sort of errors (see the sketch after this list)
  2. Our queue utilization never goes beyond 20% (we use the max RX queue depth of 8192)
    a. More precisely, we ask to read 8192 buffers in a single burst and never get more than 2000 back
    b. Multiple consecutive reads may each return that many, but never more
  3. We notice missing packets even at very small loads (20 TCP connections, 2 MB/s bandwidth)
  4. Beyond the odd missing packet, we experience waves of large dropped packet streams
  5. These issues appear on all instance types we tested: c5*, c6i*, m6i*, and c7i
  6. We have not yet tested on metal instances
  7. Throwing more queues at the problem does not help (tested on a c6in.8xlarge)
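
For context, here is a minimal sketch (not our actual code) of how observations 1 and 2 above are gathered, assuming a standard DPDK 23.11 application where the port is already configured and started; the burst size, queue id, and function names are illustrative:

```c
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_BURST 8192 /* matches the max RX queue depth mentioned above */

/* Observation 2: ask for up to 8192 buffers in one burst; in our runs the
 * driver never returns more than ~2000 per call. The caller provides an
 * array of RX_BURST mbuf pointers. */
static uint16_t poll_rx(uint16_t port_id, uint16_t queue_id,
                        struct rte_mbuf **bufs)
{
    return rte_eth_rx_burst(port_id, queue_id, bufs, RX_BURST);
}

/* Observation 1: dump the generic and extended counters where the ENA PMD
 * would report drops (imissed, ierrors, rx_nombuf, ...) if it noticed any. */
static void dump_drop_counters(uint16_t port_id)
{
    struct rte_eth_stats stats;

    if (rte_eth_stats_get(port_id, &stats) == 0)
        printf("imissed=%" PRIu64 " ierrors=%" PRIu64 " rx_nombuf=%" PRIu64 "\n",
               stats.imissed, stats.ierrors, stats.rx_nombuf);

    /* Query the number of extended stats, then fetch names and values. */
    int n = rte_eth_xstats_get(port_id, NULL, 0);
    if (n <= 0)
        return;

    struct rte_eth_xstat *xstats = calloc(n, sizeof(*xstats));
    struct rte_eth_xstat_name *names = calloc(n, sizeof(*names));

    if (xstats != NULL && names != NULL &&
        rte_eth_xstats_get(port_id, xstats, n) == n &&
        rte_eth_xstats_get_names(port_id, names, n) == n) {
        for (int i = 0; i < n; i++)
            if (xstats[i].value != 0)
                printf("%s: %" PRIu64 "\n", names[i].name, xstats[i].value);
    }

    free(xstats);
    free(names);
}
```

In our runs, none of the drop-related counters printed by this kind of check ever move, which is what observation 1 refers to.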

It is interesting to note that, for identical configurations, the kernel driver never loses a single packet (as per tcpdump, assuming it captures packets pre-reassembly). I was once able to use traffic mirroring to verify the streams, but no longer, as no Nitro instance is supported.

Case analysis

In one case, we ran 250 connections to multiple external hosts over 8 hours on a single port, with 7 queues assigned to TCP traffic (c6in instance). The results of the run are below:

[Figure: 20241212 - BinanceUS packet losses]

In that picture, we show:

  1. on top, the per-minute bandwidth of the instance
  2. on the bottom, the per-5-minute throughput in packets/s
  3. the vertical lines, which mark the instances of large streams of lost packets

What is immediately visible is that, aside from the large number of logical connections, the bandwidth and PPS throughput in use are very reasonable. You can also see that the lost streams do not coincide with any peak in bytes/s or packets/s. Those lost streams are also very large: in the one at 01:50, we lost packets on 105 connections for a total of 3 MB.

I'm running out of ideas, as I can't use traffic mirroring to check what actually arrives on the wire. I have yet to test on metal instances and to benchmark the driver between two internal hosts to see if I can reproduce locally. Any help or suggestions would be appreciated.

Contact Details

No response

xguerin added the support (Ask a question or request support) and triage (Determine the priority and severity) labels on Dec 13, 2024

nafeabshara commented Dec 13, 2024 via email


xguerin commented Dec 13, 2024

Thanks @nafeabshara.

We’ll start looking into it

I'm happy to let the program exposing the issue running for a while if that helps. I can let you know the instance ID and the ENI ID.

  1. Can you share the instance sizes you are using?

Currently c6in.2xlarge. I'm going to try with a c6in.metal shortly.

  2. When you say communication is with an “external host”, is that host also an EC2 instance? Same AZ?

"External" in the sense that we need to go through the internet gateway, and we don't connect over our own private subnets. AFAIK, all of the remote peers are hosted on AWS. One is in us-east-1a, where one of our instances is; the others are in apne.

Please note that I'm seeing these issues regardless of where the instance is located. I have a test instance in eu-west-3 that shows the same problems as the one in us-east-1.

  3. Usually, when something like this happens with no indication of error from the ENA stack, it could be caused by packet drops in the network outside the host (routers, gateways), which is something ENA would not know about. To confirm or eliminate this theory, we suggest running the test between two EC2 instances (ideally in the same Cluster Placement Group) to see whether the issue is on the host or the network side.

I will try that next. Remember that no such behavior is seen when running with kernel sockets with identical configurations, at least AFAICT using a local tcpdump (which may or may not reflect the actual wire traffic).

EDIT 1: Running on a c6in.metal does not make a difference. We still observe spurious, large streams of lost packets with the same configurations we run on the c6in.2xlarge.

EDIT 2: I can reproduce, on demand, both the dropped TCP stream issue and the dropped packet issue between 2 c6in.2xlarge instances connected on the same private network, in the same AZ, using their private IPs.

@nafeabshara I'd be happy to share the protocol with you if you want to reproduce the issues. It uses the userspace TCP stack mentioned above (OSS), with a server on one side (using a single queue) and a client on the other side (using multiple queues, bonded together, to accommodate the TX buffer requirements of the client); a rough sketch of that kind of multi-queue setup follows.
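
For reference, a hedged sketch of a basic multi-queue port setup of the kind the client uses (not the bonding configuration itself, and not our exact code), assuming DPDK 23.11 defaults; the queue count, the 8192-descriptor RX rings, and the TX ring depth are illustrative:

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define NB_QUEUES   7      /* illustrative: 7 queues dedicated to TCP traffic */
#define NB_RX_DESC  8192   /* max RX ring depth mentioned earlier */
#define NB_TX_DESC  1024   /* illustrative TX ring depth */

/* Configure a port with NB_QUEUES RX/TX queue pairs and start it. */
static int setup_port(uint16_t port_id, struct rte_mempool *mp)
{
    struct rte_eth_conf conf = { 0 };
    int ret;

    ret = rte_eth_dev_configure(port_id, NB_QUEUES, NB_QUEUES, &conf);
    if (ret != 0)
        return ret;

    for (uint16_t q = 0; q < NB_QUEUES; q++) {
        /* RX queue backed by the shared mbuf pool, default RX config. */
        ret = rte_eth_rx_queue_setup(port_id, q, NB_RX_DESC,
                                     rte_eth_dev_socket_id(port_id),
                                     NULL, mp);
        if (ret != 0)
            return ret;

        /* TX queue with default TX config. */
        ret = rte_eth_tx_queue_setup(port_id, q, NB_TX_DESC,
                                     rte_eth_dev_socket_id(port_id),
                                     NULL);
        if (ret != 0)
            return ret;
    }

    return rte_eth_dev_start(port_id);
}
```

The server side is the same, only with a single queue pair instead of NB_QUEUES.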

@shaibran (Contributor)

@xguerin Hello,

  1. Please contact me directly via email [email protected] to further investigate the issue you described.
  2. Could you share the details of a specific incident so we can review the EC2 logs? We need the instance IDs, the region where they were launched, and the timeframe during which the test was conducted (a timeframe of approximately 24 hours is sufficient).
  3. Could you please share the DPDK version you are using and whether you changed the ENA PMD in any way (e.g., applied backports)?

All the best,
Shai
