Fix: qCRC performance uplift #1708
Conversation
This is looking very promising! We've some tweaks to request on the debug output to make it a little more understandable - to put context on it - but otherwise this all looks good to go.
Force-pushed from 902d061 to 1cd869d
- This log line allows indirectly measuring `qCRC` performance on PC_HOSTED
- Use `__func__` likewise
Updated log format code and `DEBUG_*` macro usage, retested BMDA and BMF+debug.
- `qCRC` is a way to benchmark SWD throughput from inside BMF
- Update logs to additionally print each section's address and length
- For sufficiently long sections, also compute KiB/s (see the sketch after this list)
- ADIv5 will run in autoincrement mode for up to a 10-bit boundary (1024 octets)
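As a rough picture of what the logging commit measures, here is a minimal sketch of deriving a KiB/s figure with pure integer math while skipping small sections. Everything here (the helper name, the 4096-byte threshold, `printf()` standing in for BMF's `DEBUG_INFO()` macro) is an assumption for illustration, not the PR's actual code.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch; names and thresholds are illustrative assumptions. */
static void log_crc_throughput(const uint32_t addr, const uint32_t length, const uint32_t elapsed_ms)
{
	printf("CRC of 0x%08" PRIx32 " + %" PRIu32 " bytes took %" PRIu32 " ms", addr, length, elapsed_ms);
	/* Skip the rate for small sections: with small byte and millisecond
	 * counts the integer division below loses precision, and the fixed
	 * per-request overhead dominates the measurement anyway. */
	if (length >= 4096U && elapsed_ms != 0U)
		printf(" (%" PRIu32 " KiB/s)", (length / 1024U) * 1000U / elapsed_ms);
	printf("\n");
}
```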
LGTM, and this all seems straight-forward and reasonable. Merging - thank you for the contribution!
Detailed description
This PR increases the stack buffer in `generic_crc32()` from 128 to 1024 bytes, which speeds up `(gdb) compare-sections`: with the smaller buffer, each section is read from the target in many short chunks, and every chunk pays extra ADIv5 MEM-AP setup overhead.

There is an intentionally separate commit which adds DEBUG_INFO logs displaying a "KiB/s" metric to BMDA, and copies that (ms and KiB/s) to the implementation for STM32 CRC IP enabled platforms. I avoid this calculation for small sections/regions (abundant in some firmware projects) for integer precision and overhead reasons.
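To see where the buffer size matters, here is a simplified sketch of the kind of loop `generic_crc32()` implements: target memory is pulled through a fixed stack buffer chunk by chunk, so the chunk size sets how often the transfer setup cost is paid. The prototypes and names below are stand-ins under assumption, not BMF's real APIs.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC_CHUNK_SIZE 1024U /* was 128; larger chunks amortise per-read setup */

/* Stand-in prototypes; the real BMF target/CRC APIs differ. */
extern int target_mem_read(void *dest, uint32_t src, size_t len);
extern uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len);

/* Read the target memory through a stack buffer in CRC_CHUNK_SIZE pieces,
 * folding each piece into the running CRC. */
uint32_t generic_crc32_sketch(uint32_t base, size_t len)
{
	uint8_t buffer[CRC_CHUNK_SIZE];
	uint32_t crc = 0xffffffffU;
	while (len) {
		const size_t chunk = len < sizeof(buffer) ? len : sizeof(buffer);
		if (target_mem_read(buffer, base, chunk) != 0)
			return 0U; /* read failure; the real code reports an error */
		crc = crc32_update(crc, buffer, chunk);
		base += chunk;
		len -= chunk;
	}
	return crc;
}
```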
On my quest to increase runtime performance so that BMP-compatibles are more usable (quality of life for developers) against targets with big Flash and with RTT, following (and testing) the SWD improvements, I decided to use some sort of benchmark internal to BMD (to avoid having to wire up and look at an LA or MSO). Uploads to reprogrammable flash bottleneck on said flash's erase/write times, BMDA communication bottlenecks on either the USB FS link, libusb or host syscalls, and there is no dedicated iperf3/speedtest between BMDA and BMPremote (using [un]hexify). But the GDB RSP `qCRC` packet GDB uses to check for section mismatch is pretty much ideal, because a) the packets sent over RSP are pretty small, carrying just addresses and checksum values; b) the STM32 CRC unit handles most of the calculation; c) the only remaining part is the ADIv5 SWD (or JTAG) acceleration driver and the gpio-bitbanging driver.

Dragonmux highlighted that the ADIv5 MEM-AP has a Target Address Register autoincrement feature (for up to a 10-bit boundary, i.e. 1024-byte-long, aligned regions; see the boundary sketch below), configured in CSW and checked in the driver methods. I tried increasing said buffer to 4096 bytes first, on `blackpill-f411ce` as the development kit with the bigger SRAM and faster CPU, and also increasing the BMPremote v3 packet lengths from 1024 to 4096 bytes. This did not show any additional runtime benefit; rather, it may not fly on smaller adapters.

The 128->1024 change is tested by me on both `blackpill-f411ce` and `swlink` adapters with ENABLE_DEBUG=1, and these logs indicate an approximately +15% `(gdb) compare-sections` speed boost on both the `no_delay` and `1 delay` TAP frequency settings (compared to BMF built without the stack buffer change).

I reason it may be useful, and not that expensive, because there are a few more places in BMF with stack buffers of similar size (the memory-map.xml generator and the samx5x NVM page code come to mind), which are not run simultaneously with this one. Pinning the buffer to `static` storage duration in .bss, even though it guarantees allocation at link time, is worse, because the buffer goes unused during >90% of the BMF runtime/lifetime. Repurposing the GDB packet buffer is another idea, but it may be more difficult in execution (likewise for the target flash buffers). I cannot in good conscience make it a dynamic buffer while BMF relies on the optimistic `_sbrk` provided by newlib's `-lnosys`: it does not allow catching failed mallocs or stack/heap overflows.

The value of 128 looks somewhat arbitrary to me: I couldn't find arguments supporting it in git history commit messages, or in GH issues and PRs. It was introduced a long time ago in PR #108 and then copied in 7279089 in 2016. The value of 4096 (0x1000) was introduced in PR #922, in e58b7d6, for BMDA, along with other changes catering to the H7 target usecase.
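As a concrete illustration of the 10-bit constraint mentioned above: the MEM-AP TAR autoincrement is only guaranteed within a 1024-byte window, so a driver-side transfer has to be clipped at each such boundary. A minimal sketch, with a hypothetical helper name:

```c
#include <stddef.h>
#include <stdint.h>

/* TAR autoincrement is only guaranteed within a 10-bit (1024-byte) window,
 * so a long transfer gets split at each 1024-byte boundary.
 * Hypothetical helper for illustration, not BMF's actual code. */
static size_t adiv5_clip_to_boundary(const uint32_t addr, const size_t remaining)
{
	const size_t bytes_to_boundary = 1024U - (addr & 1023U);
	return remaining < bytes_to_boundary ? remaining : bytes_to_boundary;
}
```

Under that constraint, a 128-byte buffer forces at least eight setup-bearing requests per 1024-byte window, while a 1024-byte buffer can cover a whole window per request, which matches the direction of the ~15% speedup observed above.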
The PR should not have a significant impact on Flash size.
Your checklist for this pull request
- Tested with `make PROBE_HOST=native`
- Tested with `make PROBE_HOST=hosted`

Closing issues