Prototype XNNPack gemm compiler. #7569

copybara-service[bot]
Prototype XNNPack gemm compiler.

Our existing GEMM templates are becoming unmaintainable and are preventing us from quickly adding support for new types and quantization schemes. They are also too restrictive in the shapes of the generated GEMMs. This new system generates assembly directly, so the tile shape is limited only by the number of SIMD registers.
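To make the register limit concrete, here is a rough sketch of how a tile shape could be checked against the register file. The function names and the register accounting (one accumulator per output vector, one register per B vector, one for the broadcast A value) are assumptions for illustration, not the allocation the generator actually uses. The register counts are architectural facts: AVX-512 has 32 zmm registers with 16 f32 lanes each, and aarch64 NEON has 32 vector registers with 4 f32 lanes each.

```python
def simd_registers_needed(mr: int, nr: int, lanes: int) -> int:
    """Estimate the SIMD registers used by an mr x nr f32 tile (hypothetical accounting)."""
    if nr % lanes != 0:
        raise ValueError("nr must be a multiple of the vector width in lanes")
    n_vectors = nr // lanes        # vectors needed to cover one row of the tile
    accumulators = mr * n_vectors  # one accumulator per output vector
    b_vectors = n_vectors          # one row of B (weights) held in registers
    a_broadcast = 1                # one register for the broadcast A element
    return accumulators + b_vectors + a_broadcast


def tile_fits(mr: int, nr: int, lanes: int, num_simd_registers: int) -> bool:
    """True if the tile's estimated register use fits in the register file."""
    return simd_registers_needed(mr, nr, lanes) <= num_simd_registers


# AVX-512: 32 zmm registers, 16 f32 lanes per register.
print(tile_fits(5, 64, 16, 32))
print(tile_fits(8, 64, 16, 32))
```

Under this accounting a 5x64 tile fits on AVX-512 and an 8x64 tile does not; a real generator may reserve additional registers for constants or temporaries.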

Arch: the CPU architecture, for example x64 or aarch64.
Isa: the SIMD instruction set extension, for example neondot or avx512f.

Each microkernel has an arch and an isa associated with it. All shared scalar code belongs to the arch, and isa-specific SIMD code belongs to the isa. Isas can inherit from each other. For example, stores are common between avx512f and avx512vnni, and between neonfma and neondot. This eliminates a lot of code duplication.

Only the inner loops (and sometimes the outer loops) vary between GEMM microkernels on the same architecture. Most of the rest of the code is identical. Therefore, this system is modular, with each ISA inheriting from the preceding one, and only small snippets of assembly are required to add a new ISA.
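The arch/isa split described above can be sketched as a small class hierarchy. The class and method names here are hypothetical and are not the ones used in this change, and the emitted instructions are a single illustrative step rather than a working microkernel. The point is only that the derived isa overrides the inner loop and inherits everything else.

```python
class X64:
    """Arch level (hypothetical): scalar code shared by every x64 microkernel."""

    def header(self, name):
        return [f"{name}:"]

    def footer(self):
        return ["  ret"]

    def inner_loop(self):
        raise NotImplementedError

    def store(self):
        raise NotImplementedError

    def generate(self, name):
        lines = self.header(name) + self.inner_loop() + self.store() + self.footer()
        return "\n".join(lines)


class Avx512F(X64):
    """Isa level: f32 inner loop and stores."""

    def inner_loop(self):
        return [
            "  vbroadcastss zmm2, dword ptr [rsi]",
            "  vfmadd231ps zmm0, zmm2, zmm1",
        ]

    def store(self):
        return ["  vmovups zmmword ptr [rdx], zmm0"]


class Avx512Vnni(Avx512F):
    """Derived isa: only the inner loop changes; the store is inherited."""

    def inner_loop(self):
        return [
            "  vpbroadcastd zmm2, dword ptr [rsi]",
            "  vpdpbusd zmm0, zmm2, zmm1",
        ]


print(Avx512Vnni().generate("example_kernel"))
```

Adding another isa in this sketch means writing one more subclass with its own `inner_loop`, which mirrors the "small snippets of assembly" claim.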

Data types and ISAs supported in the initial prototype:
F32: neonfma and avx512f
QD8-F32-QC8W: neondot and avx512vnni

Support for aarch32 will be added in a future change. I do not plan to support 32-bit x86: it is no longer a relevant target, and it has only 8 general-purpose registers and 8 SIMD registers. With so few registers, data would have to be repeatedly pushed to and popped from the stack, adding a lot of complexity to the templates for little gain.

The generated assembly only compiles on Linux. However, only the function headers, footers and calling conventions differ between Windows and Linux. The actual assembly is identical. I manually modified the generated assembly and tested it with MSVC for both aarch64 and x64. Support for Windows will be added in a future version.
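As an illustration of the calling-convention difference on x64, the sketch below maps an integer or pointer argument index to its location at function entry under the System V AMD64 ABI (Linux) and the Microsoft x64 convention (Windows). The register orders and stack offsets are from the two ABIs; the helper name `argument_location` and the idea of routing this through a lookup table are assumptions, not how this change handles it.

```python
# Integer/pointer argument registers, in order, for each x64 calling convention.
INTEGER_ARG_REGISTERS = {
    "sysv": ["rdi", "rsi", "rdx", "rcx", "r8", "r9"],  # Linux (System V AMD64)
    "win64": ["rcx", "rdx", "r8", "r9"],               # Windows (Microsoft x64)
}

# Bytes the caller reserves above the return address before stack arguments.
# Windows reserves 32 bytes of shadow space; System V reserves none.
SHADOW_SPACE = {"sysv": 0, "win64": 32}


def argument_location(abi: str, index: int) -> str:
    """Location of the index-th integer argument at function entry (hypothetical helper)."""
    regs = INTEGER_ARG_REGISTERS[abi]
    if index < len(regs):
        return regs[index]
    # 8 bytes for the return address, then shadow space, then 8 bytes per stack argument.
    offset = 8 + SHADOW_SPACE[abi] + 8 * (index - len(regs))
    return f"qword ptr [rsp + {offset}]"


for abi in ("sysv", "win64"):
    print(abi, [argument_location(abi, i) for i in range(7)])
```

Windows also treats xmm6 through xmm15 as callee-saved, so a Windows prologue and epilogue would need to save and restore any of those registers the kernel body uses.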

Intel syntax is used since it is portable between Linux and Windows and it is less crazy than AT&T syntax.

copybara-service bot force-pushed the test_702691549 branch 2 times, most recently from 57eacac to c96c59f (December 6, 2024 16:12)
PiperOrigin-RevId: 702691549