Switch to lwip checksum #3 / Discussion of general throughput improvements #6998

rsaxvc · 2020-01-08T02:04:45Z

Basic Infos

This issue complies with the issue POLICY doc.
I have read the documentation at readthedocs and the issue is not addressed there.
I have tested that the issue is present in current master branch (aka latest git).
I have searched the issue tracker for a similar issue.
If there is a stack dump, I have decoded it.
I have filled out all fields below.

Platform

Hardware: [ESP-12]
Core Version: [8242d72]
Development Env: [Arduino IDE]
Operating System: [Windows]

Settings in IDE

Module: [Nodemcu]
Flash Mode: [unknown]
Flash Size: [4MB]
lwip Variant: [v2 Higher Bandwidth]
Reset Method: [unknown]
Flash Frequency: [40Mhz]
CPU Frequency: [160MHz]
Upload Using: [SERIAL]
Upload Speed: [115200] (serial upload only)

Problem Description

Benchmarking lwip_standard_chksum's implementations for a 1450 byte packet with CPU at 160MHz show:

46us - implementation 1
32us - implementation 2
19us - implementation 3

MCVE Sketch

#include "lwip/opt.h"

#include "lwip/inet_chksum.h"
#include "lwip/def.h"
#include "lwip/ip_addr.h"

#include <string.h>

/**
 * lwip checksum
 *
 * @param dataptr points to start of data to be summed at any boundary
 * @param len length of data to be summed
 * @return host order (!) lwip checksum (non-inverted Internet sum)
 *
 * @note accumulator size limits summable length to 64k
 * @note host endianess is irrelevant (p3 RFC1071)
 */
u16_t
lwip_standard_chksum1(const void *dataptr, int len)
{
  u32_t acc;
  u16_t src;
  const u8_t *octetptr;

  acc = 0;
  /* dataptr may be at odd or even addresses */
  octetptr = (const u8_t *)dataptr;
  while (len > 1) {
    /* declare first octet as most significant
       thus assume network order, ignoring host order */
    src = (*octetptr) << 8;
    octetptr++;
    /* declare second octet as least significant */
    src |= (*octetptr);
    octetptr++;
    acc += src;
    len -= 2;
  }
  if (len > 0) {
    /* accumulate remaining octet */
    src = (*octetptr) << 8;
    acc += src;
  }
  /* add deferred carry bits */
  acc = (acc >> 16) + (acc & 0x0000ffffUL);
  if ((acc & 0xffff0000UL) != 0) {
    acc = (acc >> 16) + (acc & 0x0000ffffUL);
  }
  /* This maybe a little confusing: reorder sum using lwip_htons()
     instead of lwip_ntohs() since it has a little less call overhead.
     The caller must invert bits for Internet sum ! */
  return lwip_htons((u16_t)acc);
}

/*
 * Curt McDowell
 * Broadcom Corp.
 * [email protected]
 *
 * IP checksum two bytes at a time with support for
 * unaligned buffer.
 * Works for len up to and including 0x20000.
 * by Curt McDowell, Broadcom Corp. 12/08/2005
 *
 * @param dataptr points to start of data to be summed at any boundary
 * @param len length of data to be summed
 * @return host order (!) lwip checksum (non-inverted Internet sum)
 */
u16_t
lwip_standard_chksum2(const void *dataptr, int len)
{
  const u8_t *pb = (const u8_t *)dataptr;
  const u16_t *ps;
  u16_t t = 0;
  u32_t sum = 0;
  int odd = ((mem_ptr_t)pb & 1);

  /* Get aligned to u16_t */
  if (odd && len > 0) {
    ((u8_t *)&t)[1] = *pb++;
    len--;
  }

  /* Add the bulk of the data */
  ps = (const u16_t *)(const void *)pb;
  while (len > 1) {
    sum += *ps++;
    len -= 2;
  }

  /* Consume left-over byte, if any */
  if (len > 0) {
    ((u8_t *)&t)[0] = *(const u8_t *)ps;
  }

  /* Add end bytes */
  sum += t;

  /* Fold 32-bit sum to 16 bits
     calling this twice is probably faster than if statements... */
  sum = FOLD_U32T(sum);
  sum = FOLD_U32T(sum);

  /* Swap if alignment was odd */
  if (odd) {
    sum = SWAP_BYTES_IN_WORD(sum);
  }

  return (u16_t)sum;
}

/**
 * An optimized checksum routine. Basically, it uses loop-unrolling on
 * the checksum loop, treating the head and tail bytes specially, whereas
 * the inner loop acts on 8 bytes at a time.
 *
 * @arg start of buffer to be checksummed. May be an odd byte address.
 * @len number of bytes in the buffer to be checksummed.
 * @return host order (!) lwip checksum (non-inverted Internet sum)
 *
 * by Curt McDowell, Broadcom Corp. December 8th, 2005
 */
u16_t
lwip_standard_chksum3(const void *dataptr, int len)
{
  const u8_t *pb = (const u8_t *)dataptr;
  const u16_t *ps;
  u16_t t = 0;
  const u32_t *pl;
  u32_t sum = 0, tmp;
  /* starts at odd byte address? */
  int odd = ((mem_ptr_t)pb & 1);

  if (odd && len > 0) {
    ((u8_t *)&t)[1] = *pb++;
    len--;
  }

  ps = (const u16_t *)(const void *)pb;

  if (((mem_ptr_t)ps & 3) && len > 1) {
    sum += *ps++;
    len -= 2;
  }

  pl = (const u32_t *)(const void *)ps;

  while (len > 7)  {
    tmp = sum + *pl++;          /* ping */
    if (tmp < sum) {
      tmp++;                    /* add back carry */
    }

    sum = tmp + *pl++;          /* pong */
    if (sum < tmp) {
      sum++;                    /* add back carry */
    }

    len -= 8;
  }

  /* make room in upper bits */
  sum = FOLD_U32T(sum);

  ps = (const u16_t *)pl;

  /* 16-bit aligned word remaining? */
  while (len > 1) {
    sum += *ps++;
    len -= 2;
  }

  /* dangling tail byte remaining? */
  if (len > 0) {                /* include odd byte */
    ((u8_t *)&t)[0] = *(const u8_t *)ps;
  }

  sum += t;                     /* add end bytes */

  /* Fold 32-bit sum to 16 bits
     calling this twice is probably faster than if statements... */
  sum = FOLD_U32T(sum);
  sum = FOLD_U32T(sum);

  if (odd) {
    sum = SWAP_BYTES_IN_WORD(sum);
  }

  return (u16_t)sum;
}

static unsigned char packet[1450];

void setup() {
  // put your setup code here, to run once:
  Serial.begin(115200);
  for( unsigned i = 0; i < sizeof packet; ++i )
      packet[i] = rand(); 
}

void loop() {
u32_t time0 = micros();
u16_t res1 = lwip_standard_chksum1(packet, sizeof packet);
u32_t time1 = micros();
u16_t res2 = lwip_standard_chksum2(packet, sizeof packet);
u32_t time2 = micros();
u16_t res3 = lwip_standard_chksum3(packet, sizeof packet);
u32_t time3 = micros();
Serial.print( time1-time0 );Serial.print(",");Serial.print(time2-time1);Serial.print(",");Serial.println(time3-time2);
}

Debug Messages

46,33,19 #160MHz CPU
92,64,37 #80 MHz CPU

The text was updated successfully, but these errors were encountered:

d-a-v · 2020-01-08T10:18:22Z

Thanks !
Let me suggest either

make a PR in lwip2 repository
(as a patch in patches/ directory)
send a patch to upstream lwIP
both
I can make 1) with credits if you do 2)

dirkmueller · 2020-01-14T22:46:45Z

What's the code size footprint change with this? In many scenarios code size is more limiting than raw performance

rsaxvc · 2020-01-16T05:20:16Z

lwip_standard_chksum1: 96 Bytes, 36 Instructions
lwip_standard_chksum2: 104 Bytes, 38 Instructions
lwip_standard_chksum3: 160 Bytes, 57 Instructions

So, a 56 Byte increase from checksum 2. You may be correct, in my particular use-case raw bandwidth is more important, and that isn't the case for everyone.

Also, if we're low on space, another idea is to migrate bearSSL's AES crypto into the ROM, where there's already a copy present.

dirkmueller · 2020-01-16T06:56:59Z

Thanks for sharing. It seems Variant 2 is a clear and obvious improvement over variant 1, while variant 3 has a higher code footprint. I assume licensing wise this is compatible and free/libre open source?

I agree that there may be many opportunities elsewhere to regain code size LWIP2 has a couple if compile time.flags (with/without ipv6, large/small mtu, large footprint+features over small code), so it might be an idea to select variant 2 vs 3 depending on that compile option.

rsaxvc · 2020-01-16T13:04:43Z

I assume licensing wise this is compatible and free/libre open source?

The three implementations are part of lwIP already, you just select which one to use in the config file.

You could do much better if xtensa had a carry/overflow bit, but I couldn't find one.

d-a-v · 2020-01-16T14:26:54Z

The three implementations are part of lwIP already

Yes, thanks @rsaxvc for noticing this !

so it might be an idea to select variant 2 vs 3 depending on that compile option

This change is already merged in lwip2 repo but not yet imported to here (but soon, probably with #6887) but yes @dirkmueller it is a good idea to enable this with the "w/ features" options for lwip2
but not with the "w/o features" versions which is available to try and keep a lower flash footprint.

rsaxvc · 2020-01-17T03:14:13Z

it is a good idea to enable this with the "w/ features" options for lwip2

How would you feel instead about keying it off "Higher Bandwidth" vs "Lower Memory"?

Speaking of things we might only want sometimes, memcpy performance on esp8266 is somewhat faster(about 30% less time per copy) for large aligned buffers rather than unaligned buffers.

There may be some opportunity to adjust ETH_PAD_SIZE so that the IP packets are 32-bit aligned inside of their buffers to make them easier to access, at the cost of some overhead per packet.

rsaxvc · 2020-01-18T03:41:13Z

Sorry to get off topic, I noticed that each received packet from the ESP layers in ethernet_input gets buffered into memory from esp2glue_alloc_for_recv(). Is this strictly necessary? It seems like we should be able to do zero-copy-rx with LWIP_SUPPORT_CUSTOM_PBUF, then have lwIP free the ESP pbuf once the packet is no longer referenced.

d-a-v · 2020-01-18T22:26:26Z

Is this strictly necessary?

pbufs arrive to us in the lwip-1.4 format. They need to be transformed into lwip-v2 format (check the main readme in lwip2 repo). Currently pbuf structure and payload are copied but you seem to ask if the new pbuf could use a pointer to already allocated data in the primary receive buffer.

Back then (during lwip2 dev) the only fail-proof way was to copy the payload (the goal was to be able to make a reliable tcp echo tester). Moving pbuf and its data out from allocated buffers by FW as soon as possible was the only way I found, because on heavy load a nasty FW error message appears and everything falls into custard following that. This message can be seen when using lwip-1.4 and sustained network transfers.

It could indeed be tested again with a custom pbuf (using a pointer). But maybe worthless if lwip-1.4 is still showing the same weakness (= saturated buffers: LmacRxBlk:1 <= that one made me mad and go to the lwip2 process)

rsaxvc · 2020-01-18T23:57:08Z

I believe I have LWIP_SUPPORT_CUSTOM_PBUF working, but my application only exercises UDP receive, perhaps it will fail as soon as I try TCP loopback.

Or as soon as something drops a packet during TCP loopback. With UDP I assume lwIP 2 frees the custom pbuf as soon as the packet is handled. But with TCP we may need to wait to reassemble the stream.

d-a-v · 2020-01-19T00:38:44Z

How would you feel instead about keying it off "Higher Bandwidth" vs "Lower Memory"?

I think we need numbers about benefits.
Effective flash occupation for the 3 algorithms might not be the real concern as @devyte noted privately. On the other hand speed measures are to be done. Please go ahead with them if you can, I will try too (as soon as real life allows me to).
I know no way to achieve high bandwidth testing with UDP. With TCP I use echo testers like in #6979.

rsaxvc · 2020-01-19T08:11:51Z

I agree, simple benchmarks like checksum time can be misleading, if you don't know how much overall time checksumming takes.

When I started looking into this, I had assumed ESP8266 throughput was CPU bound. And I hadn't figured out how to disable UDP checksum checking from the transmitter.

But now I think there's some other bottleneck in the system. With UDP RX and a short loop() function to print stats every second, I can get about 20mbps RX with or without optimizations(just receiving packets, no application usage of them). However, there's more CPU time available for my application code while doing so with optimizations applied.

UDP RX Before: ~1.15 megabits per CPU%
UDP RX After Optimizations: ~1.25 megabits per CPU%
UDP RX After Optimizations With Checksum Disabled on TX: ~1.6 megabits per CPU%

Edit: these numbers include a few other optimization attempts not yet discussed. I need to figure out which are responsible for the largest speedup.

rsaxvc · 2020-01-19T19:36:00Z

Here's what I'm using to benchmark UDP/RX: https://github.com/rsaxvc/esp_lwip_benchmarks

d-a-v added this to the 3.0.0 milestone Jan 8, 2020

d-a-v self-assigned this Jan 8, 2020

rsaxvc changed the title ~~Switch to lwip checksum #3~~ Switch to lwip checksum #3 / Discussion of general throughput improvements Jan 18, 2020

d-a-v mentioned this issue Jan 23, 2020

lwip2 updates: no more git sub-sub-module deps, faster checksum, backlog limitation and other fixes #6887

Merged

d-a-v modified the milestones: 3.0.0, 2.7.0 Jan 23, 2020

d-a-v added the fixed-by-PR label Jan 24, 2020

d-a-v closed this as completed in #6887 Feb 18, 2020

ibigbug mentioned this issue Sep 29, 2023

choose chksum algo eycorsican/netstack-lwip#8

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch to lwip checksum #3 / Discussion of general throughput improvements #6998

Switch to lwip checksum #3 / Discussion of general throughput improvements #6998

rsaxvc commented Jan 8, 2020

d-a-v commented Jan 8, 2020

dirkmueller commented Jan 14, 2020

rsaxvc commented Jan 16, 2020

dirkmueller commented Jan 16, 2020 •

edited

Loading

rsaxvc commented Jan 16, 2020

d-a-v commented Jan 16, 2020

rsaxvc commented Jan 17, 2020

rsaxvc commented Jan 18, 2020

d-a-v commented Jan 18, 2020

rsaxvc commented Jan 18, 2020

d-a-v commented Jan 19, 2020

rsaxvc commented Jan 19, 2020 •

edited

Loading

rsaxvc commented Jan 19, 2020

Switch to lwip checksum #3 / Discussion of general throughput improvements #6998

Switch to lwip checksum #3 / Discussion of general throughput improvements #6998

Comments

rsaxvc commented Jan 8, 2020

Basic Infos

Platform

Settings in IDE

Problem Description

MCVE Sketch

Debug Messages

d-a-v commented Jan 8, 2020

dirkmueller commented Jan 14, 2020

rsaxvc commented Jan 16, 2020

dirkmueller commented Jan 16, 2020 • edited Loading

rsaxvc commented Jan 16, 2020

d-a-v commented Jan 16, 2020

rsaxvc commented Jan 17, 2020

rsaxvc commented Jan 18, 2020

d-a-v commented Jan 18, 2020

rsaxvc commented Jan 18, 2020

d-a-v commented Jan 19, 2020

rsaxvc commented Jan 19, 2020 • edited Loading

rsaxvc commented Jan 19, 2020

dirkmueller commented Jan 16, 2020 •

edited

Loading

rsaxvc commented Jan 19, 2020 •

edited

Loading