Add NE10 to enable more neon optimization for libopus #518

Merged: 8 commits, Dec 22, 2024

@gnattu
Member

gnattu commented Dec 20, 2024

Unlike on x86, where libopus provides inline assembly, most of the performance optimizations for ARM Neon are implemented in the NE10 library. We need to build it separately for optimal performance on arm64 targets.

Performance improved significantly on an M4 Max with this test command:

./ffmpeg -f lavfi -i "anoisesrc=d=1000" -c:a libopus -b:a 128k -benchmark -f null -

Before:

[out#0/null @ 0x6000004a8300] video:0KiB audio:14175KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:16:40.00 bitrate=N/A speed=61.7x
bench: utime=16.488s stime=0.590s rtime=16.202s
bench: maxrss=12926976KiB

After:

[out#0/null @ 0x6000035e8000] video:0KiB audio:14171KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:16:40.00 bitrate=N/A speed= 343x    
bench: utime=3.277s stime=0.492s rtime=2.916s
bench: maxrss=11796480KiB

That is more than 5 times as fast (61.7x to 343x). This would be even more useful on CPU-constrained devices such as RK3588-based boards.
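As a rough way to check the ratio between two such runs, the sketch below pulls the `speed=` value out of ffmpeg's final progress line and divides the two. The helper names `extract_speed` and `speedup` are made up for this illustration and are not part of ffmpeg or the build scripts.

```shell
#!/bin/sh
# Extract the numeric value of "speed=" from an ffmpeg progress line.
# The optional spaces after "=" cover ffmpeg's padded output ("speed= 343x").
extract_speed() {
    printf '%s\n' "$1" | sed -n 's/.*speed= *\([0-9.][0-9.]*\)x.*/\1/p'
}

# Print how many times faster the second run is than the first.
speedup() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.2f\n", after / before }'
}

# The two progress lines quoted above (before and after the change).
before_line='size=N/A time=00:16:40.00 bitrate=N/A speed=61.7x'
after_line='size=N/A time=00:16:40.00 bitrate=N/A speed= 343x    '

speedup "$(extract_speed "$before_line")" "$(extract_speed "$after_line")"
```

The same two helpers can be pointed at any pair of `-benchmark` runs, as long as both encode the same input duration.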

@gnattu requested a review from a team December 20, 2024 07:22
@gnattu marked this pull request as draft December 20, 2024 08:35
@gnattu
Member Author

gnattu commented Dec 20, 2024

This would need more work, as the NE10 library is very outdated and cannot be compiled with the current ct-ng toolchain.

I also suspect the low performance of libopus is actually due to compiler/env bugs in libopus, as the artifacts currently generated are still slow.

@nyanmisaka
Member

This would need more work, as the NE10 library is very outdated and cannot be compiled with the current ct-ng toolchain.

I also suspect the low performance of libopus is actually due to compiler/env bugs in libopus, as the artifacts currently generated are still slow.

https://github.com/xiph/opus/blob/7db26934e4156597cb0586bb4d2e44dccdde1a59/src/opus_decoder.c#L37

#if defined(__GNUC__) && (__GNUC__ >= 2) && !defined(__OPTIMIZE__) && !defined(OPUS_WILL_BE_SLOW)
# pragma message "You appear to be compiling without optimization, if so opus will be very slow."
#endif
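One way to see whether a given set of flags would trip this check is to ask the compiler which macros it predefines. The sketch below assumes a GCC- or clang-compatible compiler reachable as `cc` (or via `CC`); `has_optimize_macro` is a hypothetical helper name, not something from libopus or the build scripts.

```shell
#!/bin/sh
# Return success if the compiler predefines __OPTIMIZE__ for the given flags.
# CC defaults to "cc"; the flags argument is intentionally word-split.
has_optimize_macro() {
    # -dM -E dumps the predefined macros; /dev/null acts as an empty C source.
    ${CC:-cc} $1 -dM -E -x c /dev/null 2>/dev/null | grep -q '__OPTIMIZE__'
}

for flags in "-O0" "-O2" "-O3 -fPIC"; do
    if has_optimize_macro "$flags"; then
        echo "$flags: __OPTIMIZE__ defined"
    else
        # Also reached when no compiler is installed at all.
        echo "$flags: __OPTIMIZE__ not defined"
    fi
done
```

When the macro is missing on a GNU-compatible compiler and `OPUS_WILL_BE_SLOW` is not defined, the `#pragma message` above is emitted.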

// macos/arm64

2024-12-20T07:37:15.7634410Z configure:
2024-12-20T07:37:15.7637110Z ------------------------------------------------------------------------
2024-12-20T07:37:15.7639140Z   opus unknown:  Automatic configuration OK.
2024-12-20T07:37:15.7639450Z 
2024-12-20T07:37:15.7639630Z     Compiler support:
2024-12-20T07:37:15.7640110Z 
2024-12-20T07:37:15.7640340Z       C99 var arrays: ................ yes
2024-12-20T07:37:15.7741050Z       C99 lrintf: .................... yes
2024-12-20T07:37:15.7841850Z       Use alloca: .................... no (using var arrays)
2024-12-20T07:37:15.7847330Z 
2024-12-20T07:37:15.7854220Z     General configuration:
2024-12-20T07:37:15.7854430Z 
2024-12-20T07:37:15.7854660Z       Floating point support: ........ yes
2024-12-20T07:37:15.7854990Z       Fast float approximations: ..... yes
2024-12-20T07:37:15.7855330Z       Fixed point debugging: ......... no
2024-12-20T07:37:15.7855790Z       Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
2024-12-20T07:37:15.7856240Z       External Assembly Optimizations: 
2024-12-20T07:37:15.7856650Z       Intrinsics Optimizations: ...... ARM (NEON) (NE10) (NEON Aarch64) (DOTPROD) (DOTPROD Aarch64)
2024-12-20T07:37:15.7857030Z       Run-time CPU detection: ........ no
2024-12-20T07:37:15.7857370Z       Custom modes: .................. no
2024-12-20T07:37:15.7857750Z       Assertion checking: ............ no
2024-12-20T07:37:15.7858290Z       Hardening: ..................... yes
2024-12-20T07:37:15.7858610Z       Fuzzing: ....................... no
2024-12-20T07:37:15.7858830Z       Check ASM: ..................... no
2024-12-20T07:37:15.7859470Z 
2024-12-20T07:37:15.7859820Z       API documentation: ............. yes
2024-12-20T07:37:15.7860070Z       Extra programs: ................ no
2024-12-20T07:37:15.7860350Z ------------------------------------------------------------------------

...

2024-12-20T07:37:20.9538410Z src/opus_decoder.c:37:10: warning: You appear to be compiling without optimization, if so opus will be very slow. [-W#pragma-messages]
2024-12-20T07:37:20.9558960Z # pragma message "You appear to be compiling without optimization, if so opus will be very slow."

// linux/arm64

2024-12-20T06:03:00.7064137Z #78 31.51 configure:
2024-12-20T06:03:00.7064615Z #78 31.51 ------------------------------------------------------------------------
2024-12-20T06:03:00.7065210Z #78 31.52   opus unknown:  Automatic configuration OK.
2024-12-20T06:03:00.7065648Z #78 31.52 
2024-12-20T06:03:00.7066173Z #78 31.52     Compiler support:
2024-12-20T06:03:00.7066506Z #78 31.52 
2024-12-20T06:03:00.7066803Z #78 31.52       C99 var arrays: ................ yes
2024-12-20T06:03:00.7067265Z #78 31.52       C99 lrintf: .................... yes
2024-12-20T06:03:00.7067932Z #78 31.52       Use alloca: .................... no (using var arrays)
2024-12-20T06:03:00.7068413Z #78 31.52 
2024-12-20T06:03:00.7068695Z #78 31.52     General configuration:
2024-12-20T06:03:00.7069041Z #78 31.52 
2024-12-20T06:03:00.7069355Z #78 31.52       Floating point support: ........ yes
2024-12-20T06:03:00.7069830Z #78 31.52       Fast float approximations: ..... yes
2024-12-20T06:03:00.7070309Z #78 31.52       Fixed point debugging: ......... no
2024-12-20T06:03:00.7071059Z #78 31.52       Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
2024-12-20T06:03:00.7071774Z #78 31.52       External Assembly Optimizations: 
2024-12-20T06:03:00.7072428Z #78 31.52       Intrinsics Optimizations: ...... ARM (NEON) (NEON Aarch64) (DOTPROD)
2024-12-20T06:03:00.7073160Z #78 31.52       Run-time CPU detection: ........ ARM (DOTPROD Intrinsics)
2024-12-20T06:03:00.7073727Z #78 31.52       Custom modes: .................. no
2024-12-20T06:03:00.7074190Z #78 31.52       Assertion checking: ............ no
2024-12-20T06:03:00.7074652Z #78 31.52       Hardening: ..................... yes
2024-12-20T06:03:00.7075098Z #78 31.52       Fuzzing: ....................... no
2024-12-20T06:03:00.7075521Z #78 31.52       Check ASM: ..................... no
2024-12-20T06:03:00.7075917Z #78 31.52 
2024-12-20T06:03:00.7076231Z #78 31.52       API documentation: ............. yes
2024-12-20T06:03:00.7076676Z #78 31.52       Extra programs: ................ no
2024-12-20T06:03:00.7077357Z #78 31.52 ------------------------------------------------------------------------

// linux/amd64

2024-12-20T06:09:43.2431650Z #73 32.88 configure:
2024-12-20T06:09:43.2432054Z #73 32.88 ------------------------------------------------------------------------
2024-12-20T06:09:43.2432609Z #73 32.88   opus unknown:  Automatic configuration OK.
2024-12-20T06:09:43.2433042Z #73 32.88 
2024-12-20T06:09:43.2433319Z #73 32.88     Compiler support:
2024-12-20T06:09:43.2433661Z #73 32.88 
2024-12-20T06:09:43.2433944Z #73 32.88       C99 var arrays: ................ yes
2024-12-20T06:09:43.2434376Z #73 32.88       C99 lrintf: .................... yes
2024-12-20T06:09:43.2434860Z #73 32.88       Use alloca: .................... no (using var arrays)
2024-12-20T06:09:43.2435324Z #73 32.88 
2024-12-20T06:09:43.2435605Z #73 32.88     General configuration:
2024-12-20T06:09:43.2435959Z #73 32.88 
2024-12-20T06:09:43.2436254Z #73 32.88       Floating point support: ........ yes
2024-12-20T06:09:43.2436727Z #73 32.88       Fast float approximations: ..... yes
2024-12-20T06:09:43.2437190Z #73 32.88       Fixed point debugging: ......... no
2024-12-20T06:09:43.2437884Z #73 32.88       Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
2024-12-20T06:09:43.2438606Z #73 32.88       External Assembly Optimizations: 
2024-12-20T06:09:43.2439160Z #73 32.88       Intrinsics Optimizations: ...... x86 SSE SSE2 SSE4.1
2024-12-20T06:09:43.2439706Z #73 32.88       Run-time CPU detection: ........ x86 SSE4.1
2024-12-20T06:09:43.2440178Z #73 32.88       Custom modes: .................. no
2024-12-20T06:09:43.2440619Z #73 32.88       Assertion checking: ............ no
2024-12-20T06:09:43.2441060Z #73 32.88       Hardening: ..................... yes
2024-12-20T06:09:43.2441494Z #73 32.88       Fuzzing: ....................... no
2024-12-20T06:09:43.2441921Z #73 32.88       Check ASM: ..................... no
2024-12-20T06:09:43.2442300Z #73 32.88 
2024-12-20T06:09:43.2442609Z #73 32.88       API documentation: ............. yes
2024-12-20T06:09:43.2443056Z #73 32.88       Extra programs: ................ no
2024-12-20T06:09:43.2443561Z #73 32.89 ------------------------------------------------------------------------

@nyanmisaka
Member

nyanmisaka commented Dec 20, 2024

libopus AVX2 auto-detection is also broken, but only in Linux portable builds.

// windows & macos amd64/x86

2024-12-20T05:06:15.3974650Z       Intrinsics Optimizations: ...... x86 SSE SSE2 SSE4.1 AVX2
2024-12-20T05:06:15.3975037Z       Run-time CPU detection: ........ x86 SSE4.1 AVX2

@gnattu
Member Author

gnattu commented Dec 20, 2024

This is because we set the CFLAGS environment variable, and libopus's autotools configure script does not add optimization flags when CFLAGS is not empty. We can work around this by resetting CFLAGS to empty during this stage.
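The effect can be illustrated with a simplified, hypothetical model of that default-flag logic. This is not the actual libopus configure code; `effective_cflags` and the `-g -O2` default are assumptions made for the sketch.

```shell
#!/bin/sh
# Hypothetical model of a configure script's default-flag handling.
# NOT the real libopus configure logic; the "-g -O2" default is assumed.
effective_cflags() {
    user_cflags="$1"
    if [ -z "$user_cflags" ]; then
        # Nothing supplied by the caller: fall back to optimized defaults.
        printf '%s\n' "-g -O2"
    else
        # Caller supplied flags: use them verbatim, adding no -O level.
        printf '%s\n' "$user_cflags"
    fi
}

# Empty CFLAGS gets the optimized defaults.
effective_cflags ""
# Non-empty CFLAGS without an -O level stays unoptimized.
effective_cflags "-fPIC -DPIC"
```

Under this model there are two ways out: clear CFLAGS before running configure, or make sure the exported CFLAGS already contains an explicit `-O` level.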

@nyanmisaka
Member

This is because we set the CFLAGS environment variable, and libopus's autotools configure script does not add optimization flags when CFLAGS is not empty. We can work around this by resetting CFLAGS to empty during this stage.

Makes sense. Maybe it's worth sending a PR upstream to xiph/opus.

So does NE10 really offer any performance improvements?

@gnattu
Member Author

gnattu commented Dec 20, 2024

It does, according to preliminary tests, but the gain is not that large (around 10% on M4 Max).

@nyanmisaka
Member

It does, according to preliminary tests, but the gain is not that large (around 10% on M4 Max).

But ARM doesn't seem to be actively maintaining it anymore, so we'll have to fork and fix it if we really want to use it.

@gnattu
Member Author

gnattu commented Dec 20, 2024

Concrete numbers with -O3 (higher than the default -O2):

with NE10:

[out#0/null @ 0x600000bfc300] video:0KiB audio:14172KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:16:40.00 bitrate=N/A speed= 385x    
bench: utime=2.929s stime=0.450s rtime=2.600s
bench: maxrss=12664832KiB

without NE10:

[out#0/null @ 0x600003c10000] video:0KiB audio:14175KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:16:40.00 bitrate=N/A speed= 335x    
bench: utime=3.345s stime=0.482s rtime=2.985s
bench: maxrss=11452416KiB

So NE10 gives about a 15% performance gain. Let me see if I can make it work with the ct-ng toolchain. In the worst case we just use -O3 and call it a day.

@gnattu
Member Author

gnattu commented Dec 20, 2024

Performance of NE10 on RK3588 built with crosstool-ng:

with NE10:

[out#0/null @ 0x559f9f9ed0] video:0KiB audio:14172KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:16:40.00 bitrate=N/A speed= 151x    
bench: utime=9.945s stime=0.862s rtime=6.639s
bench: maxrss=16964KiB

without NE10:

[out#0/null @ 0x55bd42ded0] video:0KiB audio:14175KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:16:40.00 bitrate=N/A speed= 135x    
bench: utime=11.357s stime=1.059s rtime=7.402s
bench: maxrss=16816KiB

That is roughly an 11.8% performance boost.
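The percentages in this thread can be reproduced from the reported `speed=` multipliers. The sketch below uses a hypothetical `percent_gain` helper; because ffmpeg rounds the speed it prints, the result may differ from the quoted figures by a tenth of a percent or so.

```shell
#!/bin/sh
# Percentage gain of the faster run over the slower one,
# computed from ffmpeg's "speed=" multipliers.
percent_gain() {
    awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.1f\n", (fast / slow - 1) * 100 }'
}

# M4 Max at -O3: 385x with NE10 versus 335x without.
percent_gain 385 335
# RK3588 built with crosstool-ng: 151x with NE10 versus 135x without.
percent_gain 151 135
```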

@gnattu
Member Author

gnattu commented Dec 20, 2024

I also noticed that some of our CFLAGS leak, which causes optimization flags to be passed to ffmpeg incorrectly:

2024-12-20T16:10:19.7308030Z lto1: warning: switch '-mcpu=generic' conflicts with '-march=armv8.2-a+dotprod' switch and resulted in options '+lse+dotprod+rdma+crc' being added

This happened even before this PR, and I think it could be libopus, but the strange thing is that it still occurs after this PR, where CFLAGS should be reset after the compilation.

Luckily, our portable arm64 builds still run on earlier armv8 CPUs (like RK3399), and I have not seen them crash from executing unimplemented instructions. We may have to investigate this in the future.
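For anyone investigating on a specific board, one quick check is whether the kernel reports the dot product extension at all. The sketch below assumes a Linux arm64 system, where `/proc/cpuinfo` has a `Features` line and the extension appears as `asimddp`; `features_include` is a hypothetical helper name.

```shell
#!/bin/sh
# Return success if a space-separated feature list contains the given feature.
features_include() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read the first "Features" line on Linux arm64; stays empty on other systems.
cpu_features=$(sed -n 's/^Features[[:space:]]*:[[:space:]]*//p' /proc/cpuinfo 2>/dev/null | head -n 1)

if features_include "$cpu_features" asimddp; then
    echo "dot product instructions reported by the kernel"
else
    echo "dot product instructions not reported"
fi
```

A build that executes dot product instructions unconditionally would be expected to fail on CPUs where the second message appears.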

@nyanmisaka
Member

nyanmisaka commented Dec 20, 2024

cpu="generic"

-mcpu=generic is the default option for ffmpeg.

The warning happened during ffmpeg's own lto phase, and only svt-av1 has checked -march=armv8.2-a+dotprod, so this should be related to it.

else
export CFLAGS="-O3 -fPIC -DPIC"
fi

./configure "${myconf[@]}"
Member

It seems that optimizations are completely disabled only on macOS builds? I don't see the !defined(__OPTIMIZE__) warning on crosstool-ng builds.

Opus has this in its macOS CI:

      - name: Configure
        run: CFLAGS="-mavx -mfma -mavx2 -O2 -ffast-math" ./configure --enable-float-approx ${{ matrix.config.buildconfig }}

So maybe this will be enough for both platforms:

    # Override previously set -O(n) option and the CC's default optimization options.
    CFLAGS="$CFLAGS -O3" ./configure "${myconf[@]}"
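Appending works because GCC and clang honor the last `-O` option on the command line. The sketch below mimics that rule for a flag list; `last_opt_level` is a hypothetical helper and `example_cflags` is an assumed value, not the builder's actual CFLAGS.

```shell
#!/bin/sh
# Print the last -O option among the arguments, which is the one
# GCC and clang apply when several are given.
last_opt_level() {
    level=""
    for flag in "$@"; do
        case "$flag" in
            -O*) level="$flag" ;;
        esac
    done
    printf '%s\n' "$level"
}

# Assumed example value for illustration only.
example_cflags="-O2 -fPIC -DPIC"

# Word splitting of $example_cflags is intentional here.
last_opt_level $example_cflags -O3
```

If the exported CFLAGS carries no `-O` option at all, the appended `-O3` is simply the only one present.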

@nyanmisaka
Copy link
Member

https://github.com/xiph/opus/pull/362.patch

This should fix the AVX2 intrinsics auto detection issue in GCC 14+:

2024-12-20T17:53:17.2632984Z #63 24.47 checking if compiler supports AVX2 intrinsics with -mavx -mfma -mavx2... no
2024-12-20T17:53:17.2633946Z #63 25.45 checking How to get X86 CPU Info... configure: WARNING: Compiler does not support AVX2 intrinsics

@gnattu
Member Author

gnattu commented Dec 21, 2024

https://github.com/xiph/opus/pull/362.patch

This should fix the AVX2 intrinsics auto detection issue in GCC 14+:

2024-12-20T17:53:17.2632984Z #63 24.47 checking if compiler supports AVX2 intrinsics with -mavx -mfma -mavx2... no
2024-12-20T17:53:17.2633946Z #63 25.45 checking How to get X86 CPU Info... configure: WARNING: Compiler does not support AVX2 intrinsics

Should we include that patch here or in a separate PR?

@nyanmisaka
Member

https://github.com/xiph/opus/pull/362.patch
This should fix the AVX2 intrinsics auto detection issue in GCC 14+:

2024-12-20T17:53:17.2632984Z #63 24.47 checking if compiler supports AVX2 intrinsics with -mavx -mfma -mavx2... no
2024-12-20T17:53:17.2633946Z #63 25.45 checking How to get X86 CPU Info... configure: WARNING: Compiler does not support AVX2 intrinsics

Should we include that patch here or in a separate PR?

For the sake of bisection, I can handle it in a separate PR after this is merged.

ffbuild_dockerstage() {
    to_df "RUN --mount=src=${SELF},dst=/stage.sh --mount=src=patches/libopus,dst=/patches run_stage /stage.sh"
}

ffbuild_dockerbuild() {
...

    for patch in /patches/*.patch; do
        echo "Applying $patch"
        patch -p1 < "$patch"
    done

...
}

@gnattu
Member Author

gnattu commented Dec 21, 2024

For the sake of bisection, I can handle it in a separate PR after this is merged.

ffbuild_dockerstage() {
    to_df "RUN --mount=src=${SELF},dst=/stage.sh --mount=src=patches/libopus,dst=/patches run_stage /stage.sh"
}

ffbuild_dockerbuild() {
...

    for patch in /patches/*.patch; do
        echo "Applying $patch"
        patch -p1 < "$patch"
    done

...
}

The macOS builder doesn't run ffbuild_dockerstage, so we need to handle the case where there are no patches. This bug does not affect the clang toolchain under macOS anyway.
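A minimal sketch of a patch loop that tolerates a missing or empty patch directory is shown below. In POSIX sh an unmatched glob is left as a literal string, so the loop body has to skip paths that do not exist. `apply_patches` is a hypothetical helper, not the code that was merged.

```shell
#!/bin/sh
# Apply every *.patch file in a directory, doing nothing when the
# directory is missing or holds no patches.
apply_patches() {
    patch_dir="$1"
    for patch_file in "$patch_dir"/*.patch; do
        # With no match the glob stays literal, so skip non-existent paths.
        [ -e "$patch_file" ] || continue
        echo "Applying $patch_file"
        patch -p1 < "$patch_file"
    done
}
```

Calling `apply_patches /patches` inside ffbuild_dockerbuild would then be a no-op on builders that never mount the patch directory.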

@nyanmisaka
Member

Then we will have to use wget -q -O - https://github.com/xiph/opus/commit/9ec11c1.patch | git apply in ffbuild_dockerbuild

@gnattu marked this pull request as ready for review December 21, 2024 12:10
builder/scripts.d/50-libopus.sh (outdated, resolved)
builder/scripts.d/50-libopus.sh (outdated, resolved)
Co-authored-by: Nyanmisaka <[email protected]>
@gnattu merged commit a7d64d5 into jellyfin Dec 22, 2024
27 checks passed
@gnattu deleted the ne10-opus branch December 22, 2024 15:28