Weird znver4 performance hit compared to x86-64-v4 #359

danog · 2024-09-28T19:02:44Z

From https://gitlab.archlinux.org/archlinux/packaging/packages/php/-/merge_requests/3: as the benchmarks there show, the new znver4 repos actually perform worse than the x86-64-v4 repos (both out of the box with packages from the repo, and when self-building php with or without LTO).

This seems quite strange to me, as I've looked through GCC's source code, specifically the flag selection logic for the various arches, and I've verified znver4 is a strict superset of x86-64-v4:

x86-64-v4:

PTA_64BIT | PTA_MMX | PTA_SSE
  | PTA_SSE2 | PTA_FXSR
  | PTA_CX16 | PTA_POPCNT | PTA_SSE3 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_SSSE3
  | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2 | PTA_F16C | PTA_FMA | PTA_LZCNT
  | PTA_MOVBE | PTA_XSAVE
  | PTA_AVX512F | PTA_AVX512BW | PTA_AVX512CD | PTA_AVX512DQ | PTA_AVX512VL

znver4:

PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2
  | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
  | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2
  | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT
  | PTA_FSGSBASE | PTA_RDRND | PTA_MOVBE | PTA_MWAITX | PTA_ADX | PTA_RDSEED
  | PTA_CLZERO | PTA_CLFLUSHOPT | PTA_XSAVEC | PTA_XSAVES | PTA_SHA | PTA_LZCNT
  | PTA_POPCNT| PTA_CLWB | PTA_RDPID
  | PTA_WBNOINVD | PTA_VAES | PTA_VPCLMULQDQ
  | PTA_PKU | PTA_ZNVER3 | PTA_AVX512F | PTA_AVX512DQ
  | PTA_AVX512IFMA | PTA_AVX512CD | PTA_AVX512BW | PTA_AVX512VL
  | PTA_AVX512BF16 | PTA_AVX512VBMI | PTA_AVX512VBMI2 | PTA_GFNI
  | PTA_AVX512VNNI | PTA_AVX512BITALG | PTA_AVX512VPOPCNTDQ | PTA_EVEX512
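The superset claim can be checked mechanically. Below is a minimal Python sketch, assuming the two flag lists are copied verbatim from above; the names `parse_flags`, `X86_64_V4` and `ZNVER4` are illustrative and not part of GCC. It compares ISA flags only and says nothing about tuning or scheduling choices.

```python
import re

# Flag lists copied verbatim from the issue text above.
X86_64_V4 = """
PTA_64BIT | PTA_MMX | PTA_SSE
  | PTA_SSE2 | PTA_FXSR
  | PTA_CX16 | PTA_POPCNT | PTA_SSE3 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_SSSE3
  | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2 | PTA_F16C | PTA_FMA | PTA_LZCNT
  | PTA_MOVBE | PTA_XSAVE
  | PTA_AVX512F | PTA_AVX512BW | PTA_AVX512CD | PTA_AVX512DQ | PTA_AVX512VL
"""

ZNVER4 = """
PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2
  | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
  | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2
  | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT
  | PTA_FSGSBASE | PTA_RDRND | PTA_MOVBE | PTA_MWAITX | PTA_ADX | PTA_RDSEED
  | PTA_CLZERO | PTA_CLFLUSHOPT | PTA_XSAVEC | PTA_XSAVES | PTA_SHA | PTA_LZCNT
  | PTA_POPCNT| PTA_CLWB | PTA_RDPID
  | PTA_WBNOINVD | PTA_VAES | PTA_VPCLMULQDQ
  | PTA_PKU | PTA_ZNVER3 | PTA_AVX512F | PTA_AVX512DQ
  | PTA_AVX512IFMA | PTA_AVX512CD | PTA_AVX512BW | PTA_AVX512VL
  | PTA_AVX512BF16 | PTA_AVX512VBMI | PTA_AVX512VBMI2 | PTA_GFNI
  | PTA_AVX512VNNI | PTA_AVX512BITALG | PTA_AVX512VPOPCNTDQ | PTA_EVEX512
"""


def parse_flags(expr):
    """Return the set of PTA_* identifiers in a '|'-separated expression.

    Every token is treated as opaque, so PTA_ZNVER3 is kept as a single
    name rather than expanded.
    """
    return set(re.findall(r"PTA_\w+", expr))


v4 = parse_flags(X86_64_V4)
zn4 = parse_flags(ZNVER4)

missing = v4 - zn4  # flags x86-64-v4 enables that znver4 lacks
extra = zn4 - v4    # flags only znver4 enables

print("missing from znver4:", sorted(missing))
print("znver4-only flags:", sorted(extra))
print("strict superset:", v4 < zn4)
```

With the lists as quoted, `missing` comes out empty, which matches the claim that every x86-64-v4 flag is also enabled for znver4.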

The same goes for the processor info flags:

{"x86-64-v4", PROCESSOR_K8, CPU_GENERIC, PTA_X86_64_V4 | PTA_NO_TUNE, 0, P_NONE}

{"znver4", PROCESSOR_ZNVER4, CPU_ZNVER4, PTA_ZNVER4, M_CPU_SUBTYPE (AMDFAM19H_ZNVER4), P_PROC_AVX512F}

So I can't explain the weird performance hit of znver4...

Note that all tests were fully automated using Docker. The exact same Dockerfile was used for every run, switching out just the architecture in makepkg.conf and in the repos (and re-installing all packages after doing that).

ptr1337 · 2024-09-28T19:39:11Z

Hi,

Thanks for benchmarking this. I'll also check this locally. Are you using the default config provided by Cachy?
Also, which CPU do you have?

I can only retest on a 9950X currently.

I also checked the compiled binary with bin-cpuflags-x86:

znver4:

bin-cpuflags-x86 /usr/bin/php
Format: Elf
Architecture: X86_64
Features: INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512_VBMI AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE 
Warning: CPUID usage detected. The program can switch instruction sets in runtime.

v4:

Format: Elf
Architecture: X86_64
Features: INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE 
Warning: CPUID usage detected. The program can switch instruction sets in runtime.

According to bin-cpuflags-x86, AVX512_VBMI is additionally used in the znver4 build. I'm not sure, though, whether it shows all the flags that are in use.
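The difference between the two reports can be confirmed with a set comparison. This is a minimal Python sketch, assuming the two `Features:` lines are pasted verbatim from the output above; `feature_set` and the constant names are illustrative.

```python
# "Features:" lines copied verbatim from the bin-cpuflags-x86 output above.
ZNVER4_FEATURES = (
    "INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512_VBMI "
    "AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU "
    "FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE"
)
V4_FEATURES = (
    "INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 "
    "AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU "
    "FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE"
)


def feature_set(line):
    """Split a whitespace-separated feature list into a set of names."""
    return set(line.split())


only_znver4 = feature_set(ZNVER4_FEATURES) - feature_set(V4_FEATURES)
only_v4 = feature_set(V4_FEATURES) - feature_set(ZNVER4_FEATURES)

print("only in znver4 build:", sorted(only_znver4))
print("only in v4 build:", sorted(only_v4))
```

For the two lines quoted here, the only difference is AVX512_VBMI on the znver4 side. This compares what the tool reports, not necessarily every instruction set extension the binary uses.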

vnepogodin · 2024-09-28T19:53:26Z

Well, what matters for us is whether LTO really introduces a regression with our php PKGBUILD.

The znver4 vs v4 difference may be within the margin of error.

danog · 2024-09-28T19:58:38Z

Sure, LTO is the real regression, and the margin between znver4 and v4 is small, but it is still significant (and reproducible).
I'll publish the scripts and config used for the benchmarks in the coming days. In the meantime, to answer the CPU question: I tested on a Ryzen 9 7950X.

ptr1337 · 2024-09-29T11:28:03Z

206cdf0

I've verified the LTO regression as well and disabled LTO for now, as Arch Linux does.

danog · 2024-10-02T09:46:39Z

@ptr1337 I've published the set of scripts used to make the benchmarks: https://github.com/nicelocal/microarch-benchmarks
