Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: qCRC performance uplift #1708

Merged
merged 5 commits into from
Jan 4, 2024
Merged

Conversation

ALTracer
Copy link
Contributor

@ALTracer ALTracer commented Jan 2, 2024

Detailed description

  • On the scale between "fix" and "feature" this is closer to "fix".
  • The existing problem is suboptimal performance of (gdb) compare-sections because of extra ADIv5 MEM-AP setup overhead.
  • This PR solves that by increasing the stack buffer inside generic_crc32() from 128 to 1024 bytes.

There is an intentionally separate commit which adds DEBUG_INFO logs displaying "KiB/s" metric to BMDA, and copies that (ms and KiB/s) to the implementation for STM32 CRC IP enabled platforms. I avoid this calculation for small sections/regions (abundant in some firmware projects) for integer precision and overhead reasons.

On my quest to increasing runtime performance such that BMP-compatibles are more usable (Quality of Life for developers) against targets with Big Flash and with RTT, following (and testing) the SWD improvements, I decided to use some sort of benchmark internal to BMD (to not have to wire up and look at an LA or MSO). Uploads to reprogrammable flash bottleneck in said flash erase/write times, BMDA communication bottlenecks in either USB FS link, libusb or host syscalls, and there's no dedicated iperf3/speedtest between BMDA and BMPremote (using [un]hexify). But the GDB RSP qCRC packet it uses to check for section mismatch is pretty much ideal because a) the packets sent over RSP are pretty small, for just addresses and checksum values; b) STM32 CRC handles most of the calculations; c) the only remaining part is ADIv5 SWD (or JTAG) acceleration driver and gpio-bitbanging driver.

Dragonmux highlighted that ADIv5 MEM-AP has a feature of Target Address Register autoincrement (for up to 10 bit, so 1024-byte long and aligned regions) configured in CSW and checked in driver methods. I tried increasing said buffer to 4096 bytes first, on blackpill-f411ce as the development kit with bigger SRAM and CPU speed, and also increasing BMPremote v3 packet lengths from 1024 to 4096 bytes as well. This did not display additional runtime benefit, rather, it may not fly on smaller adapters.

The change 128->1024 is tested by me on both blackpill-f411ce and swlink adapters with ENABLE_DEBUG=1, and these logs indicate approximately +15% (gdb) compare-sections speed boost both on no_delay and 1 delay TAP frequency settings (compared to BMF built without the stack buffer change).

SWD no_delay: len = 163252: 160 KiB/s,        996 ms
SWD  1 delay: len = 163252: 105 KiB/s,        1505 ms
SWD  2 delay: len = 163252: 64 KiB/s, 2489 ms
before (128). After (1024):
SWD no_delay: len = 163252: 183 KiB/s,        867 ms
SWD  1 delay: len = 163252: 115 KiB/s,        1380 ms
SWD  2 delay: len = 163252: 72 KiB/s, 2210 ms

I reason it may be useful, and somewhat not expensive, because there are a few more places in BMF with stack buffers of similar size (memory-map.xml generator and samx5x nvm page code come to mind), which are not run simultaneously with this. Pinning the buffer to static storage duration in .bss, even if it guarantees allocation at link-time, is worse, because it is not used during >90% of the BMF runtime/lifetime. Repurposing the gdb packet buffer is another idea, but it may be more difficult in execution (or same for target flash buffers). I cannot in good conscience make it a dynamic buffer while BMF relies on newlib's -lnosys provided optimistic _sbrk -- it does not allow catching failed mallocs or stack/heap overflows.

Value of 128 looks somewhat arbitrary to me -- I couldn't find arguments supporting it in git history commit messages, or GH issues and PRs. It was introduced a long time ago in PR #108 and then copied in 7279089 in 2016. Value of 4096 (0x1000) was introduced in PR #922 in e58b7d6 for BMDA along with other changes catering to H7 target usecase.

PR should not have significant impact on Flash size.

Your checklist for this pull request

  • I've read the Code of Conduct
  • I've read the guidelines for contributing to this repository
  • It builds for hardware native (make PROBE_HOST=native)
  • It builds as BMDA (make PROBE_HOST=hosted)
  • I've tested it to the best of my ability
  • My commit messages provide a useful short description of what the commits do

Closing issues

@dragonmux dragonmux added Enhancement General project improvement GDB Issue/PR related to GDB labels Jan 2, 2024
@dragonmux dragonmux added this to the v2.0 release milestone Jan 2, 2024
Copy link
Member

@dragonmux dragonmux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking very promising! We've some tweaks to request on the debug output to make it a little more understandable - to put context on it - but otherwise this all looks good to go.

src/crc32.c Outdated Show resolved Hide resolved
src/crc32.c Outdated Show resolved Hide resolved
src/crc32.c Show resolved Hide resolved
@ALTracer ALTracer force-pushed the qCRC-perf branch 3 times, most recently from 902d061 to 1cd869d Compare January 3, 2024 19:56
* This log line allowed to indirectly measure `qCRC` performance on PC_HOSTED
* Use `__func__` likewise
@ALTracer
Copy link
Contributor Author

ALTracer commented Jan 3, 2024

Updated log format code, DEBUG_* macro usage, retested BMDA and BMF+debug.
Backpropagated the suggested/refactoring changes to the rest of crc32.c translation unit as separate atomic commits.
Only the last commit is important (affects logic) for release firmware and can be cherry-picked mostly anywhere. The rest will stay on the development/main branch going forward.
Rebased to latest main (past PR1711) to see CI size-diff reports.

src/crc32.c Outdated Show resolved Hide resolved
* qCRC is a way to benchmark SWD throughput from inside BMF
* Update logs to additionally print each section address and length
* For sufficiently long sections also compute KiB/s
* ADIv5 will run in autoincrement mode for up to 10-bit boundary (1024 octets)
Copy link
Member

@dragonmux dragonmux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and this all seems straight-forward and reasonable. Merging - thank you for the contribution!

@dragonmux dragonmux merged commit e4d1c3f into blackmagic-debug:main Jan 4, 2024
6 checks passed
@ALTracer ALTracer deleted the qCRC-perf branch January 5, 2024 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement General project improvement GDB Issue/PR related to GDB
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants