Switch to lwip checksum #3 / Discussion of general throughput improvements #6998
Comments
Thanks!
What's the code size footprint change with this? In many scenarios code size is more limiting than raw performance.
lwip_standard_chksum1: 96 bytes, 36 instructions. So, a 56-byte increase from checksum 2. You may be correct; in my particular use case raw bandwidth is more important, and that isn't the case for everyone. Also, if we're low on space, another idea is to migrate BearSSL's AES crypto to use the copy already present in the ROM.
Thanks for sharing. It seems variant 2 is a clear and obvious improvement over variant 1, while variant 3 has a higher code footprint. I assume licensing-wise this is compatible and free/libre open source? I agree that there may be many opportunities elsewhere to regain code size. LWIP2 has a couple of compile-time flags (with/without IPv6, large/small MTU, large footprint + features over small code), so it might be an idea to select variant 2 vs. variant 3 depending on that compile option.
The three implementations are part of lwIP already; you just select which one to use in the config file. You could do much better if Xtensa had a carry/overflow bit, but I couldn't find one.
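For reference, a minimal sketch of that selection, assuming a stock lwIP build where core/inet_chksum.c picks up this options header:

```cpp
/*
 * lwipopts.h (sketch) -- LWIP_CHKSUM_ALGORITHM selects one of the three
 * lwip_standard_chksum() variants built into core/inet_chksum.c:
 *   1: byte-at-a-time accumulation
 *   2: 16-bit loads (lwIP's default)
 *   3: unrolled 32-bit loads
 */
#define LWIP_CHKSUM_ALGORITHM 3
```

Since the esp8266/Arduino core ships prebuilt lwIP2 libraries, in practice this would be set when those libraries are rebuilt, not from the sketch.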
Yes, thanks @rsaxvc for noticing this!
This change is already merged in the lwip2 repo but not yet imported here (soon, probably with #6887). But yes @dirkmueller, it is a good idea to enable this with the "w/ features" option for lwip2.
How would you feel instead about keying it off "Higher Bandwidth" vs "Lower Memory"? Speaking of things we might only want sometimes, memcpy performance on the ESP8266 is somewhat faster (about 30% less time per copy) for large aligned buffers than for unaligned ones. There may be some opportunity to adjust ETH_PAD_SIZE so that the IP packets are 32-bit aligned inside their buffers, making them easier to access at the cost of some overhead per packet.
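A minimal sketch of what that would look like, assuming the ESP8266 glue layer actually honors ETH_PAD_SIZE (which would need verifying):

```cpp
/*
 * lwipopts.h (sketch) -- whether the ESP8266 glue layer honors
 * ETH_PAD_SIZE is an assumption here, not a given.
 *
 * The Ethernet header is 14 bytes, so the IP header that follows it
 * normally starts on a 2-byte boundary. Reserving 2 pad bytes in front
 * of the frame pushes the IP header (and most of the payload) onto a
 * 4-byte boundary, allowing aligned 32-bit loads in memcpy and the
 * checksum code.
 */
#define ETH_PAD_SIZE 2
```

The driver then has to strip the pad (e.g. with pbuf_remove_header()) before handing the frame to the MAC, which is the per-packet overhead mentioned above.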
Sorry to get off topic, but I noticed that each packet received from the ESP layers in ethernet_input gets copied into a buffer from esp2glue_alloc_for_recv(). Is this strictly necessary? It seems like we should be able to do zero-copy RX with LWIP_SUPPORT_CUSTOM_PBUF, then have lwIP free the ESP pbuf once the packet is no longer referenced.
Back then (during lwip2 development) the only fail-proof way was to copy the payload (the goal was to be able to make a reliable TCP echo tester). Moving the pbuf and its data out of the FW-allocated buffers as soon as possible was the only way I found, because under heavy load a nasty FW error message appears and everything falls into custard after that. This message can be seen when using lwip-1.4 and sustained network transfers. It could indeed be tested again with a custom pbuf (using a pointer). But it may be worthless if lwip-1.4 is still showing the same weakness (= saturated buffers).
I believe I have LWIP_SUPPORT_CUSTOM_PBUF working, but my application only exercises UDP receive; perhaps it will fail as soon as I try TCP loopback, or as soon as something drops a packet during TCP loopback. With UDP I assume lwIP 2 frees the custom pbuf as soon as the packet is handled, but with TCP we may need to wait for the stream to be reassembled.
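For illustration, zero-copy RX with a custom pbuf could look roughly like the sketch below. This is not the core's glue code; esp_frame_release() and the fw_handle field are hypothetical stand-ins for however the Espressif firmware hands buffers to the glue layer, and LWIP_SUPPORT_CUSTOM_PBUF must be enabled in the lwIP build.

```cpp
#include <stdlib.h>
#include "lwip/pbuf.h"

/* Hypothetical firmware call that returns an RX buffer to the SDK. */
extern "C" void esp_frame_release(void *fw_handle);

struct esp_rx_pbuf {
  struct pbuf_custom pc;   // must stay first so the pbuf* cast below works
  void *fw_handle;         // hypothetical handle to the firmware's buffer
};

/* Invoked by lwIP when the last reference to the pbuf is dropped:
 * right after the recv callback for UDP, but possibly much later for
 * TCP, once the segment has left the reassembly queues. */
static void esp_rx_pbuf_free(struct pbuf *p) {
  struct esp_rx_pbuf *e = (struct esp_rx_pbuf *)p;
  esp_frame_release(e->fw_handle);   // give the buffer back to the firmware
  free(e);
}

/* Wrap a firmware RX buffer in a custom pbuf instead of copying it. */
static struct pbuf *wrap_rx_frame(void *fw_handle, void *payload, u16_t len) {
  struct esp_rx_pbuf *e = (struct esp_rx_pbuf *)malloc(sizeof(*e));
  if (e == NULL) return NULL;
  e->fw_handle = fw_handle;
  e->pc.custom_free_function = esp_rx_pbuf_free;
  // PBUF_REF: lwIP uses the payload in place, no copy happens here.
  return pbuf_alloced_custom(PBUF_RAW, len, PBUF_REF, &e->pc, payload, len);
}
```

The wrapped pbuf would then go to netif->input() as usual; the copy into memory from esp2glue_alloc_for_recv() disappears, at the price of each firmware buffer staying pinned until lwIP drops its last reference, which is exactly the TCP reassembly concern above.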
I think we need numbers on the benefits.
I agree; simple benchmarks like checksum time can be misleading if you don't know how much of the overall time checksumming takes. When I started looking into this, I had assumed ESP8266 throughput was CPU bound, and I hadn't figured out how to disable UDP checksum checking on the transmitter. But now I think there's some other bottleneck in the system. With UDP RX and a short loop() function printing stats every second, I can get about 20 Mbps RX with or without optimizations (just receiving packets, no application usage of them). However, there's more CPU time available for my application code while doing so with the optimizations applied. UDP RX before: ~1.15 megabits per CPU%. Edit: these numbers include a few other optimization attempts not yet discussed. I need to figure out which are responsible for the largest speedup.
Here's what I'm using to benchmark UDP RX: https://github.com/rsaxvc/esp_lwip_benchmarks
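The pattern described above boils down to something like the following (a minimal sketch, not the code from the linked repo; SSID, password, and port are placeholders):

```cpp
/*
 * Receive UDP as fast as possible and print stats once a second.
 * Placeholder credentials and port; payload contents are ignored.
 */
#include <ESP8266WiFi.h>
#include <WiFiUdp.h>

static WiFiUDP udp;
static uint32_t bytesRx = 0, packetsRx = 0, lastReport = 0;

void setup() {
  Serial.begin(115200);
  WiFi.mode(WIFI_STA);
  WiFi.begin("your-ssid", "your-password");
  while (WiFi.status() != WL_CONNECTED) delay(100);
  Serial.println(WiFi.localIP());
  udp.begin(5001);                       // arbitrary benchmark port
}

void loop() {
  int len;
  static uint8_t buf[1472];              // max UDP payload for a 1500-byte MTU
  while ((len = udp.parsePacket()) > 0) {
    udp.read(buf, sizeof(buf));          // drain the packet; contents unused
    bytesRx += len;
    packetsRx++;
  }
  uint32_t now = millis();
  if (now - lastReport >= 1000) {
    Serial.printf("%u pkt/s, %u kbit/s\n", packetsRx, bytesRx * 8 / 1000);
    bytesRx = packetsRx = 0;
    lastReport = now;
  }
}
```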
Basic Infos
Platform
Settings in IDE
Problem Description
Benchmarking lwip_standard_chksum's implementations for a 1450-byte packet with the CPU at 160 MHz shows:
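A harness along these lines can time a single checksum pass (an illustrative sketch only; it assumes the core's bundled lwIP headers expose inet_chksum() and that the CPU frequency is set to 160 MHz in the IDE):

```cpp
/*
 * Illustrative timing harness for one checksum pass over a 1450-byte
 * buffer. Assumes lwip/inet_chksum.h is reachable from the sketch and
 * that the CPU runs at 160 MHz, so cycles / 160 = microseconds.
 */
#include <Arduino.h>
#include <lwip/inet_chksum.h>

static uint8_t payload[1450];

void setup() {
  Serial.begin(115200);
  for (size_t i = 0; i < sizeof(payload); i++) payload[i] = i;  // dummy data

  uint32_t start = ESP.getCycleCount();
  uint16_t sum = inet_chksum(payload, sizeof(payload));
  uint32_t cycles = ESP.getCycleCount() - start;

  Serial.printf("chksum=0x%04x, %u cycles (~%u us at 160 MHz)\n",
                sum, cycles, cycles / 160);
}

void loop() {}
```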
MCVE Sketch
Debug Messages