Add NE10 to enable more neon optimization for libopus #518
Unlike on x86, where libopus provides inline assembly, most of the performance optimizations for ARM Neon are implemented in the NE10 library. We need to build it separately for optimal performance on arm64 targets.
This needs more work, as the NE10 library is very outdated and cannot be compiled with the current ct-ng toolchain. I also suspect that the low performance of libopus is actually due to compiler/env bugs in the libopus build, as the currently generated artifacts are still slow.
https://github.com/xiph/opus/blob/7db26934e4156597cb0586bb4d2e44dccdde1a59/src/opus_decoder.c#L37
#if defined(__GNUC__) && (__GNUC__ >= 2) && !defined(__OPTIMIZE__) && !defined(OPUS_WILL_BE_SLOW)
# pragma message "You appear to be compiling without optimization, if so opus will be very slow."
#endif
// macos/arm64
// linux/arm64
// linux/amd64
libopus AVX2 auto-detection is also broken, but only in the Linux portable builds.
// windows & macos amd64/x86
This is because we set the CFLAGS environment variable, and the libopus autotools build does not add optimization flags if CFLAGS is not empty. We can work around this by resetting CFLAGS to empty during this stage.
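A minimal sketch of that workaround, assuming the configure step can run in a subshell. `fake_configure` is a hypothetical stand-in for `./configure` that only reports what it sees. The sketch unsets CFLAGS rather than setting it to an empty string, because the Autoconf manual describes the default `-g -O2` as applying when CFLAGS was not already set. Whether libopus adds its own flags the same way is an assumption here.

```shell
# Hypothetical stand-in for ./configure: reports whether CFLAGS is visible to it.
fake_configure() {
    if [ "${CFLAGS+set}" = set ]; then
        echo "CFLAGS is set to: '$CFLAGS'"
    else
        echo "CFLAGS is unset"
    fi
}

# Flags exported earlier in the build script (illustrative values).
export CFLAGS="-fPIC -DPIC"

# Run the configure step in a subshell with CFLAGS removed from the environment.
inner=$( unset CFLAGS; fake_configure )

# Outside the subshell the original value is untouched, so later stages still see it.
outer=$(fake_configure)

echo "$inner"
echo "$outer"
```

Scoping the change to a subshell also keeps the override from affecting stages that run afterwards.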
Makes sense. Maybe it's worth sending a PR upstream to xiph/opus. So does NE10 really offer any performance improvements?
It does, according to preliminary tests, though the gain is not huge (around 10% on M4 Max).
But ARM doesn't seem to be actively maintaining it anymore, so we'll have to fork and fix it if we really want to use it.
Concrete numbers with NE10:
without NE10:
So about 15% performance gain by using NE10. Let me see if I can make it work with ct-ng toolchain. In the worst case we just put |
Performance of NE10 on RK3588 built with crosstool-ng:
with NE10:
without NE10:
That is roughly an 11.8% performance boost.
I also noticed that some of our CFLAGS are leaking, which causes the optimization flags to be passed to ffmpeg incorrectly:
This happened even before this PR, and I think libopus could be the source, but the strange thing is that it still occurs after this PR, where CFLAGS should be reset after the compilation. Fortunately, our portable arm64 builds still run on earlier armv8 CPUs (like RK3399), and I have not seen them crash from executing unimplemented instructions. We may have to investigate this in the future.
Line 4068 in 2f3b874
-mcpu=generic is the default option for ffmpeg.
The warning happened during ffmpeg's own lto phase, and only |
builder/scripts.d/50-libopus.sh
Outdated
else
    export CFLAGS="-O3 -fPIC -DPIC"
fi

./configure "${myconf[@]}"
It seems that optimizations being completely disabled only happens on macOS builds? I don't see the !defined(__OPTIMIZE__) warning on crosstool-ng builds.
Opus has this in its macOS CI:
- name: Configure
  run: CFLAGS="-mavx -mfma -mavx2 -O2 -ffast-math" ./configure --enable-float-approx ${{ matrix.config.buildconfig }}
So maybe this will be enough for both platforms:
# Override previously set -O(n) option and the CC's default optimization options.
CFLAGS="$CFLAGS -O3" ./configure "${myconf[@]}"
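The override above relies on GCC's documented behavior that when several `-O` options are given, the last one is effective. A small sketch of that rule follows. `effective_opt_level` is a hypothetical helper written for illustration, not part of the build scripts, and the starting CFLAGS value is an assumption.

```shell
# Hypothetical helper: print the last -O<level> option found in a flag string.
effective_opt_level() {
    level=""
    # Intentionally unquoted so the flag string is split on whitespace.
    for flag in $1; do
        case "$flag" in
            -O*) level="$flag" ;;
        esac
    done
    echo "${level:-none}"
}

# Illustrative starting value; the real CFLAGS come from the builder scripts.
CFLAGS="-O0 -fPIC -DPIC"

before=$(effective_opt_level "$CFLAGS")
after=$(effective_opt_level "$CFLAGS -O3")

echo "before: $before, after: $after"
```

Appending `-O3` rather than replacing CFLAGS keeps flags such as `-fPIC` in place while still changing the optimization level.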
https://github.com/xiph/opus/pull/362.patch This should fix the AVX2 intrinsics auto detection issue in GCC 14+:
Should we include that patch here or in a separate PR?
For the sake of bisection, I can handle it in a separate PR after this is merged.
ffbuild_dockerstage() {
    to_df "RUN --mount=src=${SELF},dst=/stage.sh --mount=src=patches/libopus,dst=/patches run_stage /stage.sh"
}
ffbuild_dockerbuild() {
    ...
    for patch in /patches/*.patch; do
        echo "Applying $patch"
        patch -p1 < "$patch"
    done
    ...
}
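One detail of the patch loop worth noting: with default shell globbing, an empty patch directory leaves the pattern `/patches/*.patch` unexpanded, so the loop body would run once on a file that does not exist. A hedged sketch of a guard follows. `list_patches` is a hypothetical helper and the temporary directory stands in for the mounted `/patches`.

```shell
# Hypothetical helper: list the patch files in a directory, or nothing if none exist.
list_patches() {
    for patch in "$1"/*.patch; do
        # Without this check an unmatched glob is passed through literally.
        [ -e "$patch" ] || continue
        echo "$patch"
    done
}

# Temporary directory standing in for the mounted /patches directory.
dir=$(mktemp -d)

# No patches yet: the helper prints nothing.
empty_result=$(list_patches "$dir")

# Two placeholder patch files (names are illustrative).
touch "$dir/0001-a.patch" "$dir/0002-b.patch"
count=$(list_patches "$dir" | wc -l | tr -d ' ')

echo "patches found: $count"
rm -rf "$dir"
```

In the build script, the same `[ -e "$patch" ] || continue` line could sit at the top of the existing loop before `patch -p1` runs.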
macOS builder won't run the |
Then we will have to use |
Co-authored-by: Nyanmisaka <[email protected]>
Unlike on x86, where libopus provides inline assembly, most of the performance optimizations for ARM Neon are implemented in the NE10 library. We need to build it separately for optimal performance on arm64 targets.
The performance improved significantly on M4 Max using test command:
Before:
After:
It is more than 5 times as fast. This would be even more useful on CPU-constrained devices such as RK3588-based boards.