Add NE10 to enable more neon optimization for libopus #518

gnattu · 2024-12-20T07:22:30Z

Unlike on x86, where libopus provides inline assembly, most of the performance optimizations for ARM Neon are implemented in the NE10 library. We need to build it separately for optimal performance on arm64 targets.

Performance improved significantly on an M4 Max with the following test command:

./ffmpeg -f lavfi -i "anoisesrc=d=1000" -c:a libopus -b:a 128k -benchmark -f null -

Before:

[out#0/null @ 0x6000004a8300] video:0KiB audio:14175KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:16:40.00 bitrate=N/A speed=61.7x
bench: utime=16.488s stime=0.590s rtime=16.202s
bench: maxrss=12926976KiB

After:

[out#0/null @ 0x6000035e8000] video:0KiB audio:14171KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:16:40.00 bitrate=N/A speed= 343x    
bench: utime=3.277s stime=0.492s rtime=2.916s
bench: maxrss=11796480KiB

That is more than 5 times as fast (61.7x to 343x). This would be even more useful on CPU-constrained devices such as RK3588-based boards.

gnattu · 2024-12-20T08:37:07Z

This would need more work, as the NE10 library is very outdated and cannot be compiled with the current ct-ng toolchain.

I also suspect that the low performance of libopus is actually due to compiler/environment bugs in libopus, as the currently generated artifacts are still slow.

nyanmisaka · 2024-12-20T09:54:16Z

https://github.com/xiph/opus/blob/7db26934e4156597cb0586bb4d2e44dccdde1a59/src/opus_decoder.c#L37

#if defined(__GNUC__) && (__GNUC__ >= 2) && !defined(__OPTIMIZE__) && !defined(OPUS_WILL_BE_SLOW)
# pragma message "You appear to be compiling without optimization, if so opus will be very slow."
#endif

// macos/arm64

2024-12-20T07:37:15.7634410Z configure:
2024-12-20T07:37:15.7637110Z ------------------------------------------------------------------------
2024-12-20T07:37:15.7639140Z   opus unknown:  Automatic configuration OK.
2024-12-20T07:37:15.7639450Z 
2024-12-20T07:37:15.7639630Z     Compiler support:
2024-12-20T07:37:15.7640110Z 
2024-12-20T07:37:15.7640340Z       C99 var arrays: ................ yes
2024-12-20T07:37:15.7741050Z       C99 lrintf: .................... yes
2024-12-20T07:37:15.7841850Z       Use alloca: .................... no (using var arrays)
2024-12-20T07:37:15.7847330Z 
2024-12-20T07:37:15.7854220Z     General configuration:
2024-12-20T07:37:15.7854430Z 
2024-12-20T07:37:15.7854660Z       Floating point support: ........ yes
2024-12-20T07:37:15.7854990Z       Fast float approximations: ..... yes
2024-12-20T07:37:15.7855330Z       Fixed point debugging: ......... no
2024-12-20T07:37:15.7855790Z       Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
2024-12-20T07:37:15.7856240Z       External Assembly Optimizations: 
2024-12-20T07:37:15.7856650Z       Intrinsics Optimizations: ...... ARM (NEON) (NE10) (NEON Aarch64) (DOTPROD) (DOTPROD Aarch64)
2024-12-20T07:37:15.7857030Z       Run-time CPU detection: ........ no
2024-12-20T07:37:15.7857370Z       Custom modes: .................. no
2024-12-20T07:37:15.7857750Z       Assertion checking: ............ no
2024-12-20T07:37:15.7858290Z       Hardening: ..................... yes
2024-12-20T07:37:15.7858610Z       Fuzzing: ....................... no
2024-12-20T07:37:15.7858830Z       Check ASM: ..................... no
2024-12-20T07:37:15.7859470Z 
2024-12-20T07:37:15.7859820Z       API documentation: ............. yes
2024-12-20T07:37:15.7860070Z       Extra programs: ................ no
2024-12-20T07:37:15.7860350Z ------------------------------------------------------------------------

...

2024-12-20T07:37:20.9538410Z src/opus_decoder.c:37:10: warning: You appear to be compiling without optimization, if so opus will be very slow. [-W#pragma-messages]
2024-12-20T07:37:20.9558960Z # pragma message "You appear to be compiling without optimization, if so opus will be very slow."

// linux/arm64

2024-12-20T06:03:00.7064137Z #78 31.51 configure:
2024-12-20T06:03:00.7064615Z #78 31.51 ------------------------------------------------------------------------
2024-12-20T06:03:00.7065210Z #78 31.52   opus unknown:  Automatic configuration OK.
2024-12-20T06:03:00.7065648Z #78 31.52 
2024-12-20T06:03:00.7066173Z #78 31.52     Compiler support:
2024-12-20T06:03:00.7066506Z #78 31.52 
2024-12-20T06:03:00.7066803Z #78 31.52       C99 var arrays: ................ yes
2024-12-20T06:03:00.7067265Z #78 31.52       C99 lrintf: .................... yes
2024-12-20T06:03:00.7067932Z #78 31.52       Use alloca: .................... no (using var arrays)
2024-12-20T06:03:00.7068413Z #78 31.52 
2024-12-20T06:03:00.7068695Z #78 31.52     General configuration:
2024-12-20T06:03:00.7069041Z #78 31.52 
2024-12-20T06:03:00.7069355Z #78 31.52       Floating point support: ........ yes
2024-12-20T06:03:00.7069830Z #78 31.52       Fast float approximations: ..... yes
2024-12-20T06:03:00.7070309Z #78 31.52       Fixed point debugging: ......... no
2024-12-20T06:03:00.7071059Z #78 31.52       Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
2024-12-20T06:03:00.7071774Z #78 31.52       External Assembly Optimizations: 
2024-12-20T06:03:00.7072428Z #78 31.52       Intrinsics Optimizations: ...... ARM (NEON) (NEON Aarch64) (DOTPROD)
2024-12-20T06:03:00.7073160Z #78 31.52       Run-time CPU detection: ........ ARM (DOTPROD Intrinsics)
2024-12-20T06:03:00.7073727Z #78 31.52       Custom modes: .................. no
2024-12-20T06:03:00.7074190Z #78 31.52       Assertion checking: ............ no
2024-12-20T06:03:00.7074652Z #78 31.52       Hardening: ..................... yes
2024-12-20T06:03:00.7075098Z #78 31.52       Fuzzing: ....................... no
2024-12-20T06:03:00.7075521Z #78 31.52       Check ASM: ..................... no
2024-12-20T06:03:00.7075917Z #78 31.52 
2024-12-20T06:03:00.7076231Z #78 31.52       API documentation: ............. yes
2024-12-20T06:03:00.7076676Z #78 31.52       Extra programs: ................ no
2024-12-20T06:03:00.7077357Z #78 31.52 ------------------------------------------------------------------------

// linux/amd64

2024-12-20T06:09:43.2431650Z #73 32.88 configure:
2024-12-20T06:09:43.2432054Z #73 32.88 ------------------------------------------------------------------------
2024-12-20T06:09:43.2432609Z #73 32.88   opus unknown:  Automatic configuration OK.
2024-12-20T06:09:43.2433042Z #73 32.88 
2024-12-20T06:09:43.2433319Z #73 32.88     Compiler support:
2024-12-20T06:09:43.2433661Z #73 32.88 
2024-12-20T06:09:43.2433944Z #73 32.88       C99 var arrays: ................ yes
2024-12-20T06:09:43.2434376Z #73 32.88       C99 lrintf: .................... yes
2024-12-20T06:09:43.2434860Z #73 32.88       Use alloca: .................... no (using var arrays)
2024-12-20T06:09:43.2435324Z #73 32.88 
2024-12-20T06:09:43.2435605Z #73 32.88     General configuration:
2024-12-20T06:09:43.2435959Z #73 32.88 
2024-12-20T06:09:43.2436254Z #73 32.88       Floating point support: ........ yes
2024-12-20T06:09:43.2436727Z #73 32.88       Fast float approximations: ..... yes
2024-12-20T06:09:43.2437190Z #73 32.88       Fixed point debugging: ......... no
2024-12-20T06:09:43.2437884Z #73 32.88       Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
2024-12-20T06:09:43.2438606Z #73 32.88       External Assembly Optimizations: 
2024-12-20T06:09:43.2439160Z #73 32.88       Intrinsics Optimizations: ...... x86 SSE SSE2 SSE4.1
2024-12-20T06:09:43.2439706Z #73 32.88       Run-time CPU detection: ........ x86 SSE4.1
2024-12-20T06:09:43.2440178Z #73 32.88       Custom modes: .................. no
2024-12-20T06:09:43.2440619Z #73 32.88       Assertion checking: ............ no
2024-12-20T06:09:43.2441060Z #73 32.88       Hardening: ..................... yes
2024-12-20T06:09:43.2441494Z #73 32.88       Fuzzing: ....................... no
2024-12-20T06:09:43.2441921Z #73 32.88       Check ASM: ..................... no
2024-12-20T06:09:43.2442300Z #73 32.88 
2024-12-20T06:09:43.2442609Z #73 32.88       API documentation: ............. yes
2024-12-20T06:09:43.2443056Z #73 32.88       Extra programs: ................ no
2024-12-20T06:09:43.2443561Z #73 32.89 ------------------------------------------------------------------------

nyanmisaka · 2024-12-20T10:06:47Z

libopus AVX2 auto-detection is also broken, but only in the Linux portable builds.

// windows & macos amd64/x86

2024-12-20T05:06:15.3974650Z       Intrinsics Optimizations: ...... x86 SSE SSE2 SSE4.1 AVX2
2024-12-20T05:06:15.3975037Z       Run-time CPU detection: ........ x86 SSE4.1 AVX2

gnattu · 2024-12-20T10:13:31Z

This is because we set the CFLAGS environment variable, and the libopus autotools configure script does not add optimization flags when CFLAGS is not empty. We can work around this by resetting CFLAGS to empty during this stage.
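To illustrate the effect, here is a minimal sketch that models it with a stand-in function rather than the real libopus configure script. The name fake_configure and the -O2 default are assumptions made for the demo, and the sketch treats an unset and an empty CFLAGS the same way, which may not match the real script exactly.

```shell
# Hypothetical stand-in for a configure script that only picks its own
# optimization level when the caller has not provided CFLAGS.
fake_configure() {
    if [ -z "${CFLAGS:-}" ]; then
        echo "-O2"        # assumed default chosen by the script
    else
        echo "$CFLAGS"    # caller's flags are used as-is, no -O is added
    fi
}

export CFLAGS="-fPIC"

# CFLAGS inherited from the environment: no optimization level is added.
inherited=$(fake_configure)

# CFLAGS cleared inside a subshell: the default applies, and the
# caller's CFLAGS is left untouched for later build stages.
reset=$(unset CFLAGS; fake_configure)

echo "inherited: $inherited"
echo "reset:     $reset"
```

Clearing the variable inside a subshell keeps the change local to that one stage, so later stages still see the original flags.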

nyanmisaka · 2024-12-20T10:18:28Z

Makes sense. Maybe it's worth sending a PR upstream to xiph/opus.

So does NE10 really offer any performance improvements?

gnattu · 2024-12-20T10:22:29Z

It does according to preliminary tests, though the gain is not that large (around 10% on M4 Max).

nyanmisaka · 2024-12-20T10:34:06Z

But ARM doesn't seem to be actively maintaining it anymore, so we'll have to fork and fix it if we really want to use it.

gnattu · 2024-12-20T12:03:35Z

Concrete numbers with -O3 (higher than the default -O2):

with NE10:

[out#0/null @ 0x600000bfc300] video:0KiB audio:14172KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:16:40.00 bitrate=N/A speed= 385x    
bench: utime=2.929s stime=0.450s rtime=2.600s
bench: maxrss=12664832KiB

without NE10:

[out#0/null @ 0x600003c10000] video:0KiB audio:14175KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:16:40.00 bitrate=N/A speed= 335x    
bench: utime=3.345s stime=0.482s rtime=2.985s
bench: maxrss=11452416KiB

So NE10 gives about a 15% performance gain (335x to 385x). Let me see if I can make it work with the ct-ng toolchain. In the worst case we just use -O3 and call it a day.
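For reference, the gain can be computed directly from the two speed= readings reported by ffmpeg. This is a minimal sketch assuming awk is available; speed_gain_pct is a hypothetical helper name and not part of the builder scripts.

```shell
# Hypothetical helper: percentage gain between two ffmpeg "speed=" readings.
# $1 = baseline speed, $2 = new speed (both without the trailing "x").
speed_gain_pct() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new / base - 1) * 100 }'
}

# M4 Max with -O3: 335x without NE10, 385x with NE10.
speed_gain_pct 335 385   # prints 14.9
```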

gnattu · 2024-12-20T17:12:08Z

Performance of NE10 on RK3588 built with crosstool-ng:

with NE10:

[out#0/null @ 0x559f9f9ed0] video:0KiB audio:14172KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:16:40.00 bitrate=N/A speed= 151x    
bench: utime=9.945s stime=0.862s rtime=6.639s
bench: maxrss=16964KiB

without NE10:

[out#0/null @ 0x55bd42ded0] video:0KiB audio:14175KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:16:40.00 bitrate=N/A speed= 135x    
bench: utime=11.357s stime=1.059s rtime=7.402s
bench: maxrss=16816KiB

That is roughly an 11.8% performance boost (135x to 151x).

gnattu · 2024-12-20T17:21:17Z

I also noticed that some of our CFLAGS leak, which causes optimization flags to be passed to ffmpeg incorrectly:

2024-12-20T16:10:19.7308030Z lto1: warning: switch '-mcpu=generic' conflicts with '-march=armv8.2-a+dotprod' switch and resulted in options '+lse+dotprod+rdma+crc' being added

This happened even before this PR, and I think libopus could be the source, but strangely it still occurs after this PR, where CFLAGS should be reset after the compilation.

Luckily, our portable arm64 builds still run on earlier ARMv8 CPUs (like the RK3399), and I have not seen them crash from executing unimplemented instructions. We may have to investigate this in the future.

nyanmisaka · 2024-12-20T17:49:47Z

jellyfin-ffmpeg/configure

Line 4068 in 2f3b874

cpu="generic"

-mcpu=generic is the default option for ffmpeg.

The warning occurred during ffmpeg's own LTO phase, and only svt-av1 checks for -march=armv8.2-a+dotprod, so it should be related to svt-av1.

nyanmisaka · 2024-12-20T18:59:04Z

builder/scripts.d/50-libopus.sh

+    else
+        export CFLAGS="-O3 -fPIC -DPIC"
+    fi
+
    ./configure "${myconf[@]}"


It seems that optimizations being completely disabled only happens on macOS builds? I don't see the !defined(__OPTIMIZE__) warning on crosstool-ng builds.

Opus has this in its macOS CI:

- name: Configure
  run: CFLAGS="-mavx -mfma -mavx2 -O2 -ffast-math" ./configure --enable-float-approx ${{ matrix.config.buildconfig }}

So maybe this will be enough for both platforms:

# Override previously set -O(n) option and the CC's default optimization options.
CFLAGS="$CFLAGS -O3" ./configure "${myconf[@]}"
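Appending -O3 is enough because GCC and Clang honor the last -O option on the command line. A minimal sketch of that rule follows; effective_opt_level is a hypothetical helper for illustration only and does not query a real compiler.

```shell
# Hypothetical helper: report which -O option would take effect for a
# given flag string, following the "last -O option wins" rule.
effective_opt_level() {
    level=""
    # Intentional word splitting of $1 into individual flags.
    for flag in $1; do
        case "$flag" in
            -O*) level="$flag" ;;   # a later -O replaces an earlier one
        esac
    done
    echo "${level:-none}"
}

# Flags set earlier in the build, with -O3 appended at the end.
effective_opt_level "-O2 -fPIC -DPIC -O3"   # prints -O3
```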

nyanmisaka · 2024-12-20T19:00:09Z

https://github.com/xiph/opus/pull/362.patch

This should fix the AVX2 intrinsics auto detection issue in GCC 14+:

2024-12-20T17:53:17.2632984Z #63 24.47 checking if compiler supports AVX2 intrinsics with -mavx -mfma -mavx2... no
2024-12-20T17:53:17.2633946Z #63 25.45 checking How to get X86 CPU Info... configure: WARNING: Compiler does not support AVX2 intrinsics

gnattu · 2024-12-21T09:00:01Z

Should we include that patch here or in a separate PR?

nyanmisaka · 2024-12-21T09:39:47Z

For the sake of bisection, I can handle it in a separate PR after this is merged.

ffbuild_dockerstage() {
    to_df "RUN --mount=src=${SELF},dst=/stage.sh --mount=src=patches/libopus,dst=/patches run_stage /stage.sh"
}

ffbuild_dockerbuild() {
...

    for patch in /patches/*.patch; do
        echo "Applying $patch"
        patch -p1 < "$patch"
    done

...
}

gnattu · 2024-12-21T09:42:20Z

The macOS builder won't run ffbuild_dockerstage, so we need to handle the case where there are no patches. This bug does not affect the clang toolchain under macOS anyway.
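One way to tolerate a missing or empty patch directory is to skip glob results that are not regular files. This is only a sketch: apply_patches is a hypothetical wrapper name, and the directory argument is an assumption rather than what the builder script actually does.

```shell
# Hypothetical wrapper: apply every *.patch in a directory, doing nothing
# when the directory is missing or contains no patches.
apply_patches() {
    # $1 = directory that may hold *.patch files
    for patch in "$1"/*.patch; do
        # When nothing matches, the glob stays literal, so skip non-files.
        [ -f "$patch" ] || continue
        echo "Applying $patch"
        patch -p1 < "$patch"
    done
}

# On a builder where /patches is not mounted this is a no-op.
apply_patches /patches
```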

nyanmisaka · 2024-12-21T10:02:06Z

Then we will have to use the following in ffbuild_dockerbuild:

wget -q -O - https://github.com/xiph/opus/commit/9ec11c1.patch | git apply

Add NE10 to enable more neon optimization for libopus

25f1cc2

gnattu requested a review from a team December 20, 2024 07:22

builder: explict set traget arch for libne10

191e8f6

gnattu marked this pull request as draft December 20, 2024 08:35

Shadowghost approved these changes Dec 20, 2024

gnattu added 4 commits December 20, 2024 21:58

builder: set explicit optimization CFLAGS for opus

ef6fd85

builder: use custom fork of libNE10

80bf279

builder: use fPIC for Linux libopus

3756ca7

Fix indent

453273e

nyanmisaka reviewed Dec 20, 2024

builder: just append optimization flags

cde1084

gnattu marked this pull request as ready for review December 21, 2024 12:10

nyanmisaka reviewed Dec 21, 2024

Fix indent

3a595cd

Co-authored-by: Nyanmisaka

nyanmisaka approved these changes Dec 22, 2024

gnattu merged commit a7d64d5 into jellyfin Dec 22, 2024
27 checks passed

gnattu deleted the ne10-opus branch December 22, 2024 15:28
