Prototype XNNPACK GEMM compiler.
Our existing GEMM templates are becoming unmaintainable and are preventing us from quickly adding support for new types and quantization schemes. They are also too restrictive in the shapes of the generated GEMMs. This new system generates assembly, and the shape is limited only by the number of SIMD registers.
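To make the register limit concrete, here is a rough sketch of how one might check whether an MR x NR F32 tile fits in the SIMD register file. The budget formula (accumulators, plus one row of B vectors, plus one register for the broadcast A element) and all names are assumptions for illustration, not the allocation scheme used by this PR. The register counts are architectural: 32 vector registers on aarch64 NEON and 32 on x64 with AVX-512.

```python
# Hypothetical register-budget check; names and formula are illustrative only.
SIMD_REGISTERS = {"aarch64": 32, "x64_avx512": 32}  # architectural register counts
LANES_F32 = {"aarch64": 4, "x64_avx512": 16}        # f32 lanes per 128-bit / 512-bit register


def registers_needed(mr: int, nr: int, lanes: int) -> int:
    """Assumed budget: MR x NR accumulators, one row of B vectors, one A broadcast."""
    if nr % lanes != 0:
        raise ValueError("NR must be a multiple of the vector width")
    n_vectors = nr // lanes          # vectors needed to cover one row of the tile
    accumulators = mr * n_vectors    # one accumulator vector per row per column block
    b_loads = n_vectors              # registers holding the current row of B
    a_broadcast = 1                  # register holding the broadcast A element
    return accumulators + b_loads + a_broadcast


def tile_fits(mr: int, nr: int, target: str) -> bool:
    """Return True if the tile fits under the assumed budget for the target."""
    return registers_needed(mr, nr, LANES_F32[target]) <= SIMD_REGISTERS[target]


if __name__ == "__main__":
    for mr, nr, target in [(5, 32, "x64_avx512"), (16, 32, "x64_avx512"), (6, 8, "aarch64")]:
        print(mr, nr, target, tile_fits(mr, nr, target))
```

Under these assumptions, any shape that passes the check could be generated, rather than only the shapes a hand-written template happens to cover.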
Arch: the target architecture, for example x64 or aarch64
Isa: the SIMD instruction set, for example neondot or avx512f
Each microkernel has an arch and an isa associated with it. All shared scalar code belongs to the arch, and isa-specific SIMD code belongs to the isa. Isas can inherit from each other. For example, stores are common between avx512f and avx512vnni, and between neonfma and neondot. This eliminates a lot of code duplication.
Only the inner loops (and sometimes the outer loops) vary between GEMM microkernels on the same architecture; most of the rest of the code is identical. The system is therefore modular: each ISA inherits from the preceding one, and only small snippets of assembly are required to add a new ISA.
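The inheritance idea can be sketched roughly as follows. The class and method names are hypothetical and the register choices are arbitrary, so this is not the generator's actual code. It only shows how a derived ISA might override the inner loop while reusing the stores and the scalar prologue and epilogue.

```python
# Hypothetical sketch of arch/isa inheritance; names and registers are illustrative.
class Aarch64:
    """Arch level: scalar code shared by every microkernel on the architecture."""

    def prologue(self):
        return ["// save callee-saved registers (placeholder)"]

    def epilogue(self):
        return ["ret"]

    def inner_loop(self):
        raise NotImplementedError("provided by the isa")

    def store(self):
        raise NotImplementedError("provided by the isa")

    def generate(self):
        lines = self.prologue() + self.inner_loop() + self.store() + self.epilogue()
        return "\n".join(lines)


class NeonFma(Aarch64):
    """Isa level: F32 multiply-accumulate and vector stores."""

    def inner_loop(self):
        return ["fmla v16.4s, v0.4s, v1.s[0]"]

    def store(self):
        return ["str q16, [x6]"]


class NeonDot(NeonFma):
    """Derived isa: only the inner loop changes; the store is inherited."""

    def inner_loop(self):
        return ["sdot v16.4s, v0.16b, v1.4b[0]"]


if __name__ == "__main__":
    print(NeonDot().generate())
```

In this sketch, adding a new ISA means writing one small subclass with a few lines of assembly. Everything else comes from the classes it inherits from.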
Architectures supported in the initial prototype:
F32: neonfma and avx512f
QD8-F32-QC8W: neondot and avx512vnni
Support for aarch32 will be added in a future change. I do not plan on supporting 32-bit x86, since it is no longer a relevant architecture and it has only 8 general-purpose registers and 8 SIMD registers. With so few registers, data would have to be repeatedly pushed to and popped from the stack, adding a lot of complexity to the templates for little gain.
The generated assembly currently compiles only on Linux. However, only the function headers, footers, and calling conventions differ between Windows and Linux; the actual assembly is identical. I manually modified the generated assembly and tested it with MSVC for both aarch64 and x64. Support for Windows will be added in a future version.
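As an illustration of the calling-convention difference on x64, the sketch below maps an integer argument index to its location under the System V AMD64 ABI (Linux) and the Microsoft x64 convention (Windows). The helper name and table are hypothetical and not part of this PR. It shows why a generator would mainly need a platform-specific prologue that moves arguments into the registers the kernel body expects.

```python
# Hypothetical helper; the register lists follow the published x64 calling conventions.
INTEGER_ARG_REGISTERS = {
    "linux": ["rdi", "rsi", "rdx", "rcx", "r8", "r9"],  # System V AMD64 ABI
    "windows": ["rcx", "rdx", "r8", "r9"],              # Microsoft x64 convention
}


def argument_location(index: int, platform: str) -> str:
    """Return the register holding integer argument `index`, or 'stack'."""
    registers = INTEGER_ARG_REGISTERS[platform]
    if index < len(registers):
        return registers[index]
    return "stack"  # remaining arguments are passed on the stack


if __name__ == "__main__":
    for i in range(7):
        print(i, argument_location(i, "linux"), argument_location(i, "windows"))
```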
Intel syntax is used since it is portable between Linux and Windows and it is less crazy than AT&T syntax.