Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to lwip checksum #3 / Discussion of general throughput improvements #6998

Closed
6 tasks done
rsaxvc opened this issue Jan 8, 2020 · 13 comments · Fixed by #6887
Closed
6 tasks done

Switch to lwip checksum #3 / Discussion of general throughput improvements #6998

rsaxvc opened this issue Jan 8, 2020 · 13 comments · Fixed by #6887
Assignees
Milestone

Comments

@rsaxvc
Copy link
Contributor

rsaxvc commented Jan 8, 2020

Basic Infos

  • This issue complies with the issue POLICY doc.
  • I have read the documentation at readthedocs and the issue is not addressed there.
  • I have tested that the issue is present in current master branch (aka latest git).
  • I have searched the issue tracker for a similar issue.
  • If there is a stack dump, I have decoded it.
  • I have filled out all fields below.

Platform

  • Hardware: [ESP-12]
  • Core Version: [8242d72]
  • Development Env: [Arduino IDE]
  • Operating System: [Windows]

Settings in IDE

  • Module: [Nodemcu]
  • Flash Mode: [unknown]
  • Flash Size: [4MB]
  • lwip Variant: [v2 Higher Bandwidth]
  • Reset Method: [unknown]
  • Flash Frequency: [40Mhz]
  • CPU Frequency: [160MHz]
  • Upload Using: [SERIAL]
  • Upload Speed: [115200] (serial upload only)

Problem Description

Benchmarking lwip_standard_chksum's implementations for a 1450 byte packet with CPU at 160MHz show:

  • 46us - implementation 1
  • 32us - implementation 2
  • 19us - implementation 3

MCVE Sketch

#include "lwip/opt.h"

#include "lwip/inet_chksum.h"
#include "lwip/def.h"
#include "lwip/ip_addr.h"

#include <string.h>

/**
 * lwip checksum
 *
 * @param dataptr points to start of data to be summed at any boundary
 * @param len length of data to be summed
 * @return host order (!) lwip checksum (non-inverted Internet sum)
 *
 * @note accumulator size limits summable length to 64k
 * @note host endianess is irrelevant (p3 RFC1071)
 */
u16_t
lwip_standard_chksum1(const void *dataptr, int len)
{
  u32_t acc;
  u16_t src;
  const u8_t *octetptr;

  acc = 0;
  /* dataptr may be at odd or even addresses */
  octetptr = (const u8_t *)dataptr;
  while (len > 1) {
    /* declare first octet as most significant
       thus assume network order, ignoring host order */
    src = (*octetptr) << 8;
    octetptr++;
    /* declare second octet as least significant */
    src |= (*octetptr);
    octetptr++;
    acc += src;
    len -= 2;
  }
  if (len > 0) {
    /* accumulate remaining octet */
    src = (*octetptr) << 8;
    acc += src;
  }
  /* add deferred carry bits */
  acc = (acc >> 16) + (acc & 0x0000ffffUL);
  if ((acc & 0xffff0000UL) != 0) {
    acc = (acc >> 16) + (acc & 0x0000ffffUL);
  }
  /* This maybe a little confusing: reorder sum using lwip_htons()
     instead of lwip_ntohs() since it has a little less call overhead.
     The caller must invert bits for Internet sum ! */
  return lwip_htons((u16_t)acc);
}

/*
 * Curt McDowell
 * Broadcom Corp.
 * [email protected]
 *
 * IP checksum two bytes at a time with support for
 * unaligned buffer.
 * Works for len up to and including 0x20000.
 * by Curt McDowell, Broadcom Corp. 12/08/2005
 *
 * @param dataptr points to start of data to be summed at any boundary
 * @param len length of data to be summed
 * @return host order (!) lwip checksum (non-inverted Internet sum)
 */
u16_t
lwip_standard_chksum2(const void *dataptr, int len)
{
  const u8_t *pb = (const u8_t *)dataptr;
  const u16_t *ps;
  u16_t t = 0;
  u32_t sum = 0;
  int odd = ((mem_ptr_t)pb & 1);

  /* Get aligned to u16_t */
  if (odd && len > 0) {
    ((u8_t *)&t)[1] = *pb++;
    len--;
  }

  /* Add the bulk of the data */
  ps = (const u16_t *)(const void *)pb;
  while (len > 1) {
    sum += *ps++;
    len -= 2;
  }

  /* Consume left-over byte, if any */
  if (len > 0) {
    ((u8_t *)&t)[0] = *(const u8_t *)ps;
  }

  /* Add end bytes */
  sum += t;

  /* Fold 32-bit sum to 16 bits
     calling this twice is probably faster than if statements... */
  sum = FOLD_U32T(sum);
  sum = FOLD_U32T(sum);

  /* Swap if alignment was odd */
  if (odd) {
    sum = SWAP_BYTES_IN_WORD(sum);
  }

  return (u16_t)sum;
}

/**
 * An optimized checksum routine. Basically, it uses loop-unrolling on
 * the checksum loop, treating the head and tail bytes specially, whereas
 * the inner loop acts on 8 bytes at a time.
 *
 * @arg start of buffer to be checksummed. May be an odd byte address.
 * @len number of bytes in the buffer to be checksummed.
 * @return host order (!) lwip checksum (non-inverted Internet sum)
 *
 * by Curt McDowell, Broadcom Corp. December 8th, 2005
 */
u16_t
lwip_standard_chksum3(const void *dataptr, int len)
{
  const u8_t *pb = (const u8_t *)dataptr;
  const u16_t *ps;
  u16_t t = 0;
  const u32_t *pl;
  u32_t sum = 0, tmp;
  /* starts at odd byte address? */
  int odd = ((mem_ptr_t)pb & 1);

  if (odd && len > 0) {
    ((u8_t *)&t)[1] = *pb++;
    len--;
  }

  ps = (const u16_t *)(const void *)pb;

  if (((mem_ptr_t)ps & 3) && len > 1) {
    sum += *ps++;
    len -= 2;
  }

  pl = (const u32_t *)(const void *)ps;

  while (len > 7)  {
    tmp = sum + *pl++;          /* ping */
    if (tmp < sum) {
      tmp++;                    /* add back carry */
    }

    sum = tmp + *pl++;          /* pong */
    if (sum < tmp) {
      sum++;                    /* add back carry */
    }

    len -= 8;
  }

  /* make room in upper bits */
  sum = FOLD_U32T(sum);

  ps = (const u16_t *)pl;

  /* 16-bit aligned word remaining? */
  while (len > 1) {
    sum += *ps++;
    len -= 2;
  }

  /* dangling tail byte remaining? */
  if (len > 0) {                /* include odd byte */
    ((u8_t *)&t)[0] = *(const u8_t *)ps;
  }

  sum += t;                     /* add end bytes */

  /* Fold 32-bit sum to 16 bits
     calling this twice is probably faster than if statements... */
  sum = FOLD_U32T(sum);
  sum = FOLD_U32T(sum);

  if (odd) {
    sum = SWAP_BYTES_IN_WORD(sum);
  }

  return (u16_t)sum;
}

static unsigned char packet[1450];

void setup() {
  // put your setup code here, to run once:
  Serial.begin(115200);
  for( unsigned i = 0; i < sizeof packet; ++i )
      packet[i] = rand(); 
}

void loop() {
u32_t time0 = micros();
u16_t res1 = lwip_standard_chksum1(packet, sizeof packet);
u32_t time1 = micros();
u16_t res2 = lwip_standard_chksum2(packet, sizeof packet);
u32_t time2 = micros();
u16_t res3 = lwip_standard_chksum3(packet, sizeof packet);
u32_t time3 = micros();
Serial.print( time1-time0 );Serial.print(",");Serial.print(time2-time1);Serial.print(",");Serial.println(time3-time2);
}

Debug Messages

46,33,19 #160MHz CPU
92,64,37 #80 MHz CPU
@d-a-v
Copy link
Collaborator

d-a-v commented Jan 8, 2020

Thanks !
Let me suggest either

  1. make a PR in lwip2 repository
    (as a patch in patches/ directory)
  2. send a patch to upstream lwIP
  3. both
  4. I can make 1) with credits if you do 2)

@d-a-v d-a-v added this to the 3.0.0 milestone Jan 8, 2020
@d-a-v d-a-v self-assigned this Jan 8, 2020
@dirkmueller
Copy link
Contributor

What's the code size footprint change with this? In many scenarios code size is more limiting than raw performance

@rsaxvc
Copy link
Contributor Author

rsaxvc commented Jan 16, 2020

lwip_standard_chksum1: 96 Bytes, 36 Instructions
lwip_standard_chksum2: 104 Bytes, 38 Instructions
lwip_standard_chksum3: 160 Bytes, 57 Instructions

So, a 56 Byte increase from checksum 2. You may be correct, in my particular use-case raw bandwidth is more important, and that isn't the case for everyone.

Also, if we're low on space, another idea is to migrate bearSSL's AES crypto into the ROM, where there's already a copy present.

@dirkmueller
Copy link
Contributor

dirkmueller commented Jan 16, 2020

Thanks for sharing. It seems Variant 2 is a clear and obvious improvement over variant 1, while variant 3 has a higher code footprint. I assume licensing wise this is compatible and free/libre open source?

I agree that there may be many opportunities elsewhere to regain code size LWIP2 has a couple if compile time.flags (with/without ipv6, large/small mtu, large footprint+features over small code), so it might be an idea to select variant 2 vs 3 depending on that compile option.

@rsaxvc
Copy link
Contributor Author

rsaxvc commented Jan 16, 2020

I assume licensing wise this is compatible and free/libre open source?

The three implementations are part of lwIP already, you just select which one to use in the config file.

You could do much better if xtensa had a carry/overflow bit, but I couldn't find one.

@d-a-v
Copy link
Collaborator

d-a-v commented Jan 16, 2020

The three implementations are part of lwIP already

Yes, thanks @rsaxvc for noticing this !

so it might be an idea to select variant 2 vs 3 depending on that compile option

This change is already merged in lwip2 repo but not yet imported to here (but soon, probably with #6887) but yes @dirkmueller it is a good idea to enable this with the "w/ features" options for lwip2
but not with the "w/o features" versions which is available to try and keep a lower flash footprint.

@rsaxvc
Copy link
Contributor Author

rsaxvc commented Jan 17, 2020

it is a good idea to enable this with the "w/ features" options for lwip2

How would you feel instead about keying it off "Higher Bandwidth" vs "Lower Memory"?

Speaking of things we might only want sometimes, memcpy performance on esp8266 is somewhat faster(about 30% less time per copy) for large aligned buffers rather than unaligned buffers.

There may be some opportunity to adjust ETH_PAD_SIZE so that the IP packets are 32-bit aligned inside of their buffers to make them easier to access, at the cost of some overhead per packet.

@rsaxvc
Copy link
Contributor Author

rsaxvc commented Jan 18, 2020

Sorry to get off topic, I noticed that each received packet from the ESP layers in ethernet_input gets buffered into memory from esp2glue_alloc_for_recv(). Is this strictly necessary? It seems like we should be able to do zero-copy-rx with LWIP_SUPPORT_CUSTOM_PBUF, then have lwIP free the ESP pbuf once the packet is no longer referenced.

@rsaxvc rsaxvc changed the title Switch to lwip checksum #3 Switch to lwip checksum #3 / Discussion of general throughput improvements Jan 18, 2020
@d-a-v
Copy link
Collaborator

d-a-v commented Jan 18, 2020

Is this strictly necessary?

pbufs arrive to us in the lwip-1.4 format. They need to be transformed into lwip-v2 format (check the main readme in lwip2 repo). Currently pbuf structure and payload are copied but you seem to ask if the new pbuf could use a pointer to already allocated data in the primary receive buffer.

Back then (during lwip2 dev) the only fail-proof way was to copy the payload (the goal was to be able to make a reliable tcp echo tester). Moving pbuf and its data out from allocated buffers by FW as soon as possible was the only way I found, because on heavy load a nasty FW error message appears and everything falls into custard following that. This message can be seen when using lwip-1.4 and sustained network transfers.

It could indeed be tested again with a custom pbuf (using a pointer). But maybe worthless if lwip-1.4 is still showing the same weakness (= saturated buffers: LmacRxBlk:1 <= that one made me mad and go to the lwip2 process)

@rsaxvc
Copy link
Contributor Author

rsaxvc commented Jan 18, 2020

I believe I have LWIP_SUPPORT_CUSTOM_PBUF working, but my application only exercises UDP receive, perhaps it will fail as soon as I try TCP loopback.

Or as soon as something drops a packet during TCP loopback. With UDP I assume lwIP 2 frees the custom pbuf as soon as the packet is handled. But with TCP we may need to wait to reassemble the stream.

@d-a-v
Copy link
Collaborator

d-a-v commented Jan 19, 2020

How would you feel instead about keying it off "Higher Bandwidth" vs "Lower Memory"?

I think we need numbers about benefits.
Effective flash occupation for the 3 algorithms might not be the real concern as @devyte noted privately. On the other hand speed measures are to be done. Please go ahead with them if you can, I will try too (as soon as real life allows me to).
I know no way to achieve high bandwidth testing with UDP. With TCP I use echo testers like in #6979.

@rsaxvc
Copy link
Contributor Author

rsaxvc commented Jan 19, 2020

I agree, simple benchmarks like checksum time can be misleading, if you don't know how much overall time checksumming takes.

When I started looking into this, I had assumed ESP8266 throughput was CPU bound. And I hadn't figured out how to disable UDP checksum checking from the transmitter.

But now I think there's some other bottleneck in the system. With UDP RX and a short loop() function to print stats every second, I can get about 20mbps RX with or without optimizations(just receiving packets, no application usage of them). However, there's more CPU time available for my application code while doing so with optimizations applied.

UDP RX Before: ~1.15 megabits per CPU%
UDP RX After Optimizations: ~1.25 megabits per CPU%
UDP RX After Optimizations With Checksum Disabled on TX: ~1.6 megabits per CPU%

Edit: these numbers include a few other optimization attempts not yet discussed. I need to figure out which are responsible for the largest speedup.

@rsaxvc
Copy link
Contributor Author

rsaxvc commented Jan 19, 2020

Here's what I'm using to benchmark UDP/RX: https://github.com/rsaxvc/esp_lwip_benchmarks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants