Rebase/cherry-pick xsmm-main on top of triton-lang/triton-cpu/main #48

Open: wants to merge 402 commits into base `xsmm-main` from `rebase-on-main-24-Dec-16`

Conversation

adam-smnk commented:

No description provided.

peterbell10 and others added 30 commits November 21, 2024 16:12
#### Commits in this PR
1. [CI] Fix cache not saving
    
    Re-using the output of the cache restore step was recommended by the
    `actions/cache` docs, but it doesn't work here because we actually start
    from a clean cache when we run save, so there is no output available to
    read.

    This is one of the annoyances of testing in the PR while main is a
    different environment.
2. Bump macOS timeout
We also exercise this in scale_dot, where we enable support for warps of
arbitrary shape (previously we only allowed `[num_warps, 1]`).

With this infra in place, it should be rather easy to move from the
legacy layouts to using LLs to represent all of our layouts.

One concern is the amount of recomputation that happens when calling
methods like `getSizePerThread`, where we recompute the result on every
call. There may be an optimisation opportunity here in caching the results
of these functions.

We choose the IR representation of an LL via its canonical form + a
`repOrder` for several reasons:
- It's generally more compact
- It's easier to CSE, so it's easier to see when two layouts are in fact
  the same.
- A technical reason: the `toLinearLayout` function returns a tensor
  with dimensions `dim0, ..., dim<rank-1>`, in other words, it "forgets"
  the repetition order. Without the repetition order, we cannot recover
  the tile size of the argument. In particular, we cannot recover
  `getSizePerThread`. There is an argument to be made about whether
  `getSizePerThread` is useful on its own, or whether `getElemsPerThread` is
  the genuinely useful abstraction here, but for now we keep both for
  backwards compatibility.
Currently you can manually call a workflow dispatch, but it won't
actually run the tests because the variable enable_integration isn't
set.
…ra-kernel perf tooling (triton-lang#5119)

This PR introduces the `Proton Dialect` to enable intra-kernel profiling
and tooling for Triton. As a third-party dialect, it serves as a
building block for creating third-party perf tools (e.g., profilers,
analysis, modeling) for Triton compiler developers in a compiler-centric
way, such as an intra-kernel latency profiler for understanding software
pipelining, warp specialization, and fine-grained CTA orchestration
(e.g., CUDA core, tensor core, TMA). Future development will integrate
this dialect with the existing Proton backend profiling infrastructure
to make it a powerful and general perf tool utility. As a first step,
this PR adds some basic boilerplate code and mechanics, plus the
`proton.record` op for the `Proton Dialect`.

---------

Co-authored-by: Yuanwei Fang <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
…ang#5208)

After this PR, `MemDesc` will be a type only in the TritonGPU dialect,
as will the `TensorOrMemDesc` interface.
The pass was reordering scf.if operations without checking the extra
dependencies coming from the region.
For now, just prevent this case, although this part of the code might
still be fragile.
If you build using the `CMakeLists.txt` and not `setup.py` and you build
in `Release` then you get

```
/__w/triton/triton/third_party/amd/lib/TritonAMDGPUTransforms/MfmaGroup.cpp: In function ‘std::pair<mlir::Type, mlir::Type> mlir::TypesFromMfmaId(MLIRContext*, MfmaTypeId)’:
Warning: /__w/triton/triton/third_party/amd/lib/TritonAMDGPUTransforms/MfmaGroup.cpp:240:1: warning: control reaches end of non-void function [-Wreturn-type]
```
While working on some higher dimension tensor kernels, I noticed poor
performance due to the fact that layouts wouldn't propagate to local
loads. Since we do allow layout folding with local store and local
alloc, this seems like a bit of an oversight.

The change gives a 40% speed improvement on certain kernels for NVIDIA
GPUs.

This also removes asserts in lowering for higher dimensional kernels. As
far as I can tell, those restrictions aren't required in practice.


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] I have added tests.
- [x] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices)
…riton-lang#5139)

Adding a shortcut case for fp8 MFMA to dot operand layout conversion
that avoids using shared memory, to speed up FP8 attention kernels.
…#5189 (triton-lang#5200)

We simplify the implementation of `getElemsPerThread` and strengthen the
preconditions of `getRepForOperand`.

More generally, we should try to minimise the calls to `isAmpere` and
`isHopper` throughout the codebase. I'll do a pass fixing many of these
once we land LLs for `ldmatrix` and Hopper.
TMA fences require CUDA toolchain 12.3 or greater, but current gating
does not check the CUDA toolchain version. This causes
`test_experimental_tma.py` to fail when run with older CUDA toolchains.

## Before
With cuda-12.0:
```
55 failed, 9 passed in 18.11s
```

With cuda-12.4:
```
64 passed in 11.99s
```

## After
With cuda-12.0:
```
9 passed, 55 skipped in 4.26s
```

With cuda-12.4:
```
64 passed in 11.96s
```
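
For reference, the gating idea can be sketched as a version-gated skip (an illustrative pytest pattern, not the PR's exact code; the `cuda_toolchain_version` helper and probing the version via `ptxas --version` are assumptions):

```python
import re
import subprocess

import pytest


def cuda_toolchain_version():
    """Best-effort probe of the CUDA toolchain version via `ptxas --version`."""
    try:
        out = subprocess.run(["ptxas", "--version"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        return (0, 0)
    m = re.search(r"release (\d+)\.(\d+)", out)
    return (int(m.group(1)), int(m.group(2))) if m else (0, 0)


# Skip TMA tests when the toolchain predates 12.3, instead of letting them fail.
requires_tma = pytest.mark.skipif(
    cuda_toolchain_version() < (12, 3),
    reason="TMA fences require CUDA toolchain 12.3 or greater",
)
```
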
If you build with `-DTRITON_BUILD_UT=OFF` on Mac you will get something
like

```
-- Looking for histedit.h
CMake Error at /opt/homebrew/Cellar/cmake/3.30.5/share/cmake/Modules/CheckIncludeFile.cmake:90 (try_compile):
  Unknown extension ".c" for file
-- Looking for histedit.h - not found

    /Users/runner/work/triton/triton/triton-build/CMakeFiles/CMakeScratch/TryCompile-QA06d6/CheckIncludeFile.c

  try_compile() works only for enabled languages.  Currently these are:

    CXX

  See project() command to enable other languages.
Call Stack (most recent call first):
  llvm-bd9145c8-macos-arm64/lib/cmake/llvm/FindLibEdit.cmake:28 (check_include_file)
  llvm-bd9145c8-macos-arm64/lib/cmake/llvm/LLVMConfig.cmake:177 (find_package)
  llvm-bd9145c8-macos-arm64/lib/cmake/mlir/MLIRConfig.cmake:10 (find_package)
```

because `C` isn't an enabled project language.
)

This PR disables inlining of print-related functions, which speeds up
compilation of test_scan_layouts dramatically.

---------

Co-authored-by: Lei Zhang <[email protected]>
triton-lang#5153 fixed
the issue, but we missed enabling one of the disabled
cases.
This is causing some performance regression. I'll investigate and reland
it.
Reverts triton-lang#5219
…ion (FP8)" (triton-lang#5240)

It is causing a performance regression; revert until it can be
investigated.
Reverts triton-lang#5139
If a kernel is launched on a thread which has not initialized a CUDA
context (as can happen in the linked issue), it will throw an error. A
simple fix is to call `cudaFree(0)` to establish a device context.

Fixes triton-lang#3729
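
For illustration, the workaround boils down to the classic `cudaFree(0)` idiom. A minimal Python sketch of the same idea via `ctypes` (the library name is an assumption, and the real fix lives in Triton's CUDA runtime glue rather than user code):

```python
import ctypes

def ensure_cuda_context():
    # Assumption: the CUDA runtime library is discoverable as libcudart.so.
    cudart = ctypes.CDLL("libcudart.so")
    # cudaFree(nullptr) is a no-op that lazily creates a context on the
    # calling thread, so subsequent driver calls do not fail.
    err = cudart.cudaFree(ctypes.c_void_p(0))
    if err != 0:
        raise RuntimeError(f"cudaFree(0) returned error code {err}")
```
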
…#5234)

Currently the llvm path changes every time the pin updates, which makes
it annoying to use the included tools. For example, I use the tablegen
language server, but currently need to update my editor config every time
the llvm pin changes.

This adds a stable symlink which for me is
`~/.triton/llvm/llvm-macos-x64`. This will always point to the most
recent version of llvm used to build triton.

As a bonus this also refactors the symlink update code which was
copy-pasted a few times.
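
As a rough sketch of the symlink-update logic (paths and names are illustrative, not the exact helper in this PR):

```python
import os
from pathlib import Path

def update_stable_symlink(versioned_dir: Path, stable_link: Path) -> None:
    """Point e.g. ~/.triton/llvm/llvm-macos-x64 at the latest pinned LLVM dir."""
    tmp = stable_link.with_name(stable_link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    tmp.symlink_to(versioned_dir, target_is_directory=True)
    os.replace(tmp, stable_link)  # atomic swap: the stable path is never dangling
```
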
Don't pipeline the dot accumulator in the default heuristic.
Finer-grained control will allow the user to decide.
…ang#5249)

recommit of triton-lang#5219

While working on some higher dimension tensor kernels, I noticed poor
performance due to the fact that layouts wouldn't propagate to local
loads. Since we do allow layout folding with local store and local
alloc, this seems like a bit of an oversight.

The change gives a 40% speed improvement on certain kernels for NVIDIA
GPUs.

This also removes asserts in lowering for higher dimensional kernels. As
far as I can tell, those restrictions aren't required in practice.


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] I have added tests.
- [x] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices)

<!---
The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.
-->

# New contributor declaration
- [ ] I am not making a trivial change, such as fixing a typo in a
comment.

- [ ] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [ ] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

Co-authored-by: Matthew Brookhart <[email protected]>
Upstreaming some of our Windows-related changes, assuming that there is
interest in this (triton-lang#5094 (comment))
and hoping that it will not make this CMake file much more difficult to
support.

---------

Signed-off-by: Anatoly Myachev <[email protected]>
This commit unifies the names of header guards in third_party/amd.
Since StreamPipelineV2 has been the default for a while, this
commit promotes StreamPipelineV2 to the general
StreamPipeline by removing the 'v2' suffix.
Also cleans up some includes clang thinks are unused.
int3 and others added 14 commits December 6, 2024 17:34
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
* Fix isSigned in PrintOp

* Add float16 support for print

* Support float16 printing for old compilers
* Add pass to decompose matmul to FMA operations.

Signed-off-by: Ilya Enkovich <[email protected]>

* Use block pointers and padding in 03-matrix-multiplication-cpu.py.

* Fix review comments.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
* Improve AMX lowering to minimize loads and stores.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support bfloat16 in CPU matmul tutorials.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
…aidml#30)

* Matrix multiplication tutorial block pointer variant (plaidml#1)

Adds a `USE_BLOCK_POINTER` flag to the matmul_kernel so we can get IR for pointers-to-tensors instead of tensors-of-pointers.
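
A rough sketch of the two addressing styles inside the kernel (illustrative only; the flag name comes from this change, everything else is simplified relative to the tutorial):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_inner(a_ptr, b_ptr, M, N, K,
                 stride_am, stride_ak, stride_bk, stride_bn,
                 BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
                 USE_BLOCK_POINTER: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    if USE_BLOCK_POINTER:
        # pointer-to-tensor form: one logical block pointer per operand tile
        a_bp = tl.make_block_ptr(a_ptr, (M, K), (stride_am, stride_ak),
                                 (pid_m * BLOCK_M, 0), (BLOCK_M, BLOCK_K), order=(1, 0))
        b_bp = tl.make_block_ptr(b_ptr, (K, N), (stride_bk, stride_bn),
                                 (0, pid_n * BLOCK_N), (BLOCK_K, BLOCK_N), order=(1, 0))
        for _ in range(0, K, BLOCK_K):
            acc += tl.dot(tl.load(a_bp, boundary_check=(0, 1)),
                          tl.load(b_bp, boundary_check=(0, 1)))
            a_bp = tl.advance(a_bp, (0, BLOCK_K))
            b_bp = tl.advance(b_bp, (BLOCK_K, 0))
    else:
        # tensor-of-pointers form: explicit per-element address arithmetic
        offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
        offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
        offs_k = tl.arange(0, BLOCK_K)
        a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
        b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
        for _ in range(0, K, BLOCK_K):
            acc += tl.dot(tl.load(a_ptrs), tl.load(b_ptrs))
            a_ptrs += BLOCK_K * stride_ak
            b_ptrs += BLOCK_K * stride_bk
    # ... store acc (omitted in this sketch)
```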

* Vector to XSMM (plaidml#3)

Implements a lowering pass from vector to XSMM microkernels.
libxsmm is added as an external dependency together with general MLIR infrastructure for handling XSMM code generation and runtime execution.
The XSMM lowering is optional and can be enabled at the JIT step by the environment variable TRITON_CPU_XSMM=1.

libxsmm is built as a shared library and linked with the XSMM-related libraries. These are also added to the Python infrastructure.
Additionally, general MLIR utilities are imported to allow analysis, code generation and microkernel execution.
Initially, a simple pattern mapping vector contraction to an XSMM kernel is added.

* [XSMM] Place contraction accumulation buffer outside reduction loop (plaidml#5)

Contraction lowering now moves accumulation buffer outside of a reduction loop when possible.

This reduces data movement between memory and registers needed to accommodate mixed memref and vector abstractions.

* Triton to XSMM (plaidml#4)

Adds lowering pass from triton to XSMM microkernels.
XSMM utility APIs are generalized to work on opaque operations
representing contractions.

A simple pattern mapping tt.dot to XSMM kernel is added.
The runtime lowering to XSMM is now controlled by two separate flags:
- TRITON_CPU_VECTOR_XSMM=1 to lower from vector as before
- TRITON_CPU_TRITON_XSMM=1 to lower from triton ops
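
Usage-wise, either lowering is opt-in and selected via the environment, e.g. (the script path below is just an example workload):

```python
import os
import subprocess

env = dict(os.environ)
env["TRITON_CPU_TRITON_XSMM"] = "1"    # lower tt.dot ops directly to XSMM kernels
# env["TRITON_CPU_VECTOR_XSMM"] = "1"  # or: lower from the vector dialect instead

subprocess.run(
    ["python", "python/tutorials/03-matrix-multiplication-cpu.py"],
    env=env, check=True,
)
```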

* Lift -triton-raise-block-pointer pass from intel-xpu-backend-for-triton (plaidml#7)

* Lift -triton-raise-block-pointer pass from intel-xpu-backend-for-triton

Code was in turn taken from triton-shared (though does not use the tts
dialect).

* [triton][XSMM] Hoist accumulation buffer (plaidml#8)

Ports hoisting from Vector to XSMM pass to Triton lowering.
Dot lowering now moves accumulation buffer outside of a reduction loop
when possible.

* Bump libxsmm (plaidml#9)

Updates libxsmm version.
Brings support for vnni sw pipeline.

* Enable bfloat16 support (plaidml#10)

Extends XSMM code generation to allow for mixed precision computations to match triton requirements for <bf16 x bf16 -> f32> contraction. Data type selection is added as a global variable to the matmul tutorial.

BF16 can suffer from some inaccuracies compared to the PyTorch baseline. However, the difference appears to be the same between native triton-cpu and the XSMM lowering - no mismatch on SPR.
The matmul tutorial is aligned more closely with the main branch.
V2 backend benchmarking is disabled due to its instability.
Default tile sizes are increased to improve general performance.

* Dynamic shape/stride/offset support by way of memref.extract_strided_metadata (plaidml#11)

* Matmul tutorial - K padding and dynamic K block size (plaidml#12)

Adds two new optional flags to the matmul tutorial:
- K dim padding - pads input matrices into multiple
  of chosen BLOCK_SIZE_K
- dynamic K blocking - overrides set BLOCK_SIZE_K
  and adjusts it based on the input K dimension;
  input is padded if needed

The main motivation is to allow testing with larger reduction
dimension blocks without the kernel losing support for various sizes.
Padding is required to meet Triton's requirement for power-of-2 sizes
(a host-side sketch of the padding step is included below).
Dynamic blocking can be used to decrease the reduction dimension range or
eliminate it completely.

Allowing the kernel to work on larger K blocks is also helpful for a
future rewriting of GEMM into BRGEMM, to ensure a larger batch dimension.
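A minimal host-side sketch of the K-padding idea (PyTorch; names are illustrative rather than the tutorial's actual flags):

```python
import torch

def pad_k(a: torch.Tensor, b: torch.Tensor, block_k: int):
    """a: (M, K), b: (K, N) -> tensors with K zero-padded to a multiple of block_k."""
    K = a.shape[1]
    pad = (-K) % block_k
    if pad == 0:
        return a, b
    a_pad = torch.nn.functional.pad(a, (0, pad))        # pad columns of A
    b_pad = torch.nn.functional.pad(b, (0, 0, 0, pad))  # pad rows of B
    return a_pad, b_pad
```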

* Matmul tutorial - cache padding (plaidml#14)

Adds extra optional padding that can be used to ensure that the input
matrices' strides are not powers of two, to improve cache behavior
(a small allocation sketch follows below).

Currently, it is most useful with DYNAMIC_K_BLOCK enabled.
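A small sketch of the allocation trick (illustrative, not the tutorial's exact code): over-allocate the row dimension so the leading stride is not a power of two, then use the logical shape as a view into the wider buffer.

```python
import torch

def alloc_cache_padded(rows: int, cols: int, extra: int = 16, dtype=torch.float32):
    # stride(0) becomes cols + extra, avoiding power-of-two strides (and their
    # cache-set conflicts) while keeping the logical (rows, cols) view.
    buf = torch.empty(rows, cols + extra, dtype=dtype)
    return buf[:, :cols]
```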

* Lower GEMM to BRGEMM kernel (plaidml#13)

Extends contraction lowering to XSMM by rewriting plain GEMM into
a BRGEMM kernel when possible.

The rewrite improves performance for larger K block sizes thanks to
extra reduction-dim tiling. Use of the BRGEMM kernel also enables online
VNNI packing for BF16.

* Matmul tutorial - external preprocessing (plaidml#15)

Adds an optional flag to move matmul input preprocessing
outside of the benchmarked kernel.
This option makes it possible to exclude preprocessing overhead from
performance measurements.

* Fix up useless num_threads arg


* Utility libxsmm Python extension (plaidml#17)

Adds a Python wrapper for a parallelized in-place copy function using libxsmm and OpenMP.
It is intended to be used for an efficient tensor padding implementation.

The libxsmm paths have to be specified through env variables:
  - XSMM_ROOT_DIR - path to the libxsmm root dir with headers
  - XSMM_LIB_DIR - path to the libxsmm.so location

The libxsmm .so also has to be available at runtime, e.g., exposed through LD_LIBRARY_PATH.
The XSMM Python module can be built and installed using the command:
  pip install -e ./third_party/cpu/python/

* Reduction loop GEMM to XSMM BRGEMM (plaidml#18)

Adds an experimental rewrite collapsing a reduction loop over GEMM into a BRGEMM ukernel.

The pattern matches the hand-written kernel using block pointers and is not compatible with IR generated by Triton pointer raising. Direct lowering to XSMM allows bypassing Triton's load restriction when the K dimension is not a power of two.
The pattern is quite brittle but functional for the matmul tutorial example.

The rewriting is disabled by default and can be enabled with an environment variable:
  TRITON_CPU_LOOP_BRGEMM_XSMM=1

* Matmul tutorial - pad weights only (plaidml#19)

Adds an option to apply padding only to matrix B.

This allows exploring potential speedups by limiting padding to the
weights, which is a reasonably common strategy in, e.g., ML inference.
Full padding still has to occur when the K dimension is padded, to avoid
dimension mismatches and/or to meet the power-of-two size requirement.

* Add script for running configs (plaidml#20)

Also makes it so that xsmm_py is only imported when needed

* added thread mgmt

* Update configs to be as indicated on slides (plaidml#22)

also changes --datatype-bf16 to --datatype bf16

* Fix typo in config parameter (plaidml#23)

* added a set of prepare and build scripts

* added zen5 patch

* BF8 support (plaidml#24)

BF8 support

* update XSMM

* don't run torch/torch.compile for bf8 as it is at 1 GFLOP

* revert because of plaidml#27

* Add run all benchmark script + use internal libxsmm (plaidml#27)

* go up to 4096

---------

Co-authored-by: Adam Siemieniuk <[email protected]>
Co-authored-by: Alexander Heinecke <[email protected]>
Co-authored-by: Renato Golin <[email protected]>
@adam-smnk (Author)

TBD in terms of our update strategy - most likely it will have to be messy due to the rebase of the cpu/main branch.

However, the overall rebase was clean apart from a few minor conflicts. The largest change was rejecting any changes to the python/tutorial/03-cpu.py tutorial in favor of keeping our version.
After an initial quick look at the benchmark numbers, the general performance trends seem similar to earlier measurements. I'll still need to have a closer look.

@rolfmorel

There is one we probably shouldn't need: dc8e445 - it was only there to fix up triton-lang/triton-cpu/main, as their commit at the time broke torch. I expect that when they rebased, this was fixed. This was the only PR/commit where I did anything like this.

@adam-smnk adam-smnk force-pushed the rebase-on-main-24-Dec-16 branch from d5e8283 to ce0336b on December 16, 2024 15:42
@adam-smnk (Author)

> There is one we probably shouldn't need: dc8e445 - it was only there to fix up triton-lang/triton-cpu/main, as their commit at the time broke torch. I expect that when they rebased, this was fixed. This was the only PR/commit where I did anything like this.

Works fine without this fix and its follow-up fix. Removed the two PRs.
