Fix: qCRC performance uplift #1708

ALTracer · 2024-01-02T16:55:14Z

Detailed description

On the scale between "fix" and "feature" this is closer to "fix".
The existing problem is suboptimal performance of (gdb) compare-sections because of extra ADIv5 MEM-AP setup overhead.
This PR solves that by increasing the stack buffer inside generic_crc32() from 128 to 1024 bytes.

There is an intentionally separate commit which adds DEBUG_INFO logs displaying "KiB/s" metric to BMDA, and copies that (ms and KiB/s) to the implementation for STM32 CRC IP enabled platforms. I avoid this calculation for small sections/regions (abundant in some firmware projects) for integer precision and overhead reasons.

On my quest to increasing runtime performance such that BMP-compatibles are more usable (Quality of Life for developers) against targets with Big Flash and with RTT, following (and testing) the SWD improvements, I decided to use some sort of benchmark internal to BMD (to not have to wire up and look at an LA or MSO). Uploads to reprogrammable flash bottleneck in said flash erase/write times, BMDA communication bottlenecks in either USB FS link, libusb or host syscalls, and there's no dedicated iperf3/speedtest between BMDA and BMPremote (using [un]hexify). But the GDB RSP qCRC packet it uses to check for section mismatch is pretty much ideal because a) the packets sent over RSP are pretty small, for just addresses and checksum values; b) STM32 CRC handles most of the calculations; c) the only remaining part is ADIv5 SWD (or JTAG) acceleration driver and gpio-bitbanging driver.

Dragonmux highlighted that ADIv5 MEM-AP has a feature of Target Address Register autoincrement (for up to 10 bit, so 1024-byte long and aligned regions) configured in CSW and checked in driver methods. I tried increasing said buffer to 4096 bytes first, on blackpill-f411ce as the development kit with bigger SRAM and CPU speed, and also increasing BMPremote v3 packet lengths from 1024 to 4096 bytes as well. This did not display additional runtime benefit, rather, it may not fly on smaller adapters.

The change 128->1024 is tested by me on both blackpill-f411ce and swlink adapters with ENABLE_DEBUG=1, and these logs indicate approximately +15% (gdb) compare-sections speed boost both on no_delay and 1 delay TAP frequency settings (compared to BMF built without the stack buffer change).

SWD no_delay: len = 163252: 160 KiB/s,        996 ms
SWD  1 delay: len = 163252: 105 KiB/s,        1505 ms
SWD  2 delay: len = 163252: 64 KiB/s, 2489 ms
before (128). After (1024):
SWD no_delay: len = 163252: 183 KiB/s,        867 ms
SWD  1 delay: len = 163252: 115 KiB/s,        1380 ms
SWD  2 delay: len = 163252: 72 KiB/s, 2210 ms

I reason it may be useful, and somewhat not expensive, because there are a few more places in BMF with stack buffers of similar size (memory-map.xml generator and samx5x nvm page code come to mind), which are not run simultaneously with this. Pinning the buffer to static storage duration in .bss, even if it guarantees allocation at link-time, is worse, because it is not used during >90% of the BMF runtime/lifetime. Repurposing the gdb packet buffer is another idea, but it may be more difficult in execution (or same for target flash buffers). I cannot in good conscience make it a dynamic buffer while BMF relies on newlib's -lnosys provided optimistic _sbrk -- it does not allow catching failed mallocs or stack/heap overflows.

Value of 128 looks somewhat arbitrary to me -- I couldn't find arguments supporting it in git history commit messages, or GH issues and PRs. It was introduced a long time ago in PR #108 and then copied in 7279089 in 2016. Value of 4096 (0x1000) was introduced in PR #922 in e58b7d6 for BMDA along with other changes catering to H7 target usecase.

PR should not have significant impact on Flash size.

Your checklist for this pull request

I've read the Code of Conduct
I've read the guidelines for contributing to this repository
It builds for hardware native (make PROBE_HOST=native)
It builds as BMDA (make PROBE_HOST=hosted)
I've tested it to the best of my ability
My commit messages provide a useful short description of what the commits do

Closing issues

dragonmux

This is looking very promising! We've some tweaks to request on the debug output to make it a little more understandable - to put context on it - but otherwise this all looks good to go.

src/crc32.c

* This log line allowed to indirectly measure `qCRC` performance on PC_HOSTED * Use `__func__` likewise

ALTracer · 2024-01-03T20:04:36Z

Updated log format code, DEBUG_* macro usage, retested BMDA and BMF+debug.
Backpropagated the suggested/refactoring changes to the rest of crc32.c translation unit as separate atomic commits.
Only the last commit is important (affects logic) for release firmware and can be cherry-picked mostly anywhere. The rest will stay on the development/main branch going forward.
Rebased to latest main (past PR1711) to see CI size-diff reports.

src/crc32.c

* qCRC is a way to benchmark SWD throughput from inside BMF * Update logs to additionally print each section address and length * For sufficiently long sections also compute KiB/s

* ADIv5 will run in autoincrement mode for up to 10-bit boundary (1024 octets)

dragonmux

LGTM, and this all seems straight-forward and reasonable. Merging - thank you for the contribution!

dragonmux added Enhancement General project improvement GDB Issue/PR related to GDB labels Jan 2, 2024

dragonmux added this to the v2.0 release milestone Jan 2, 2024

dragonmux requested changes Jan 3, 2024

View reviewed changes

src/crc32.c Outdated Show resolved Hide resolved

src/crc32.c Outdated Show resolved Hide resolved

src/crc32.c Show resolved Hide resolved

ALTracer force-pushed the qCRC-perf branch 3 times, most recently from 902d061 to 1cd869d Compare January 3, 2024 19:56

ALTracer added 3 commits January 3, 2024 23:03

crc32: Update logs to use __func__

ddebe0a

crc32: Rename implementation variants

a3ffc6c

crc32: Update loglevel of timing report, copy feature to stm32 impl

5554b9b

* This log line allowed to indirectly measure `qCRC` performance on PC_HOSTED * Use `__func__` likewise

ALTracer force-pushed the qCRC-perf branch from 1cd869d to 62d36b3 Compare January 3, 2024 20:03

dragonmux requested changes Jan 4, 2024

View reviewed changes

src/crc32.c Outdated Show resolved Hide resolved

ALTracer added 2 commits January 4, 2024 19:12

crc32: Enable debug logs of data rate in both impls

884ca4e

* qCRC is a way to benchmark SWD throughput from inside BMF * Update logs to additionally print each section address and length * For sufficiently long sections also compute KiB/s

crc32: Increase local buffer size in STM32 impl to 1 KiB

f1bf2f2

* ADIv5 will run in autoincrement mode for up to 10-bit boundary (1024 octets)

ALTracer force-pushed the qCRC-perf branch from 62d36b3 to f1bf2f2 Compare January 4, 2024 16:12

dragonmux approved these changes Jan 4, 2024

View reviewed changes

dragonmux merged commit e4d1c3f into blackmagic-debug:main Jan 4, 2024
6 checks passed

ALTracer deleted the qCRC-perf branch January 5, 2024 13:43

ALTracer mentioned this pull request Mar 28, 2024

Fix: Replace attribute alias in CRC32 #1713

Merged

6 tasks

ALTracer mentioned this pull request Nov 10, 2024

crc32: Buffer up log output to avoid prefix-chunking, fix func #1998

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: qCRC performance uplift #1708

Fix: qCRC performance uplift #1708

ALTracer commented Jan 2, 2024

dragonmux left a comment

ALTracer commented Jan 3, 2024

dragonmux left a comment

Fix: qCRC performance uplift #1708

Fix: qCRC performance uplift #1708

Conversation

ALTracer commented Jan 2, 2024

Detailed description

Your checklist for this pull request

Closing issues

dragonmux left a comment

Choose a reason for hiding this comment

ALTracer commented Jan 3, 2024

dragonmux left a comment

Choose a reason for hiding this comment