[Triton] Inlining the VC intrinsic in the SIMT kernel #658
Comments
Disassembly of the SPIRV IR from the DPCPP example:
The SPIRV dialect we are working on:
Could you please explain what the exact use case is for the "might have to use" scenario?
Also, a benchmark would be convincing, e.g. a SYCL example emulating the real use case that shows the benefit of using invoke_SIMD from SIMT code. I am not sure how large the invoke_SIMD overhead is, and whether it will make this type of mixing less appealing.
It is hard to achieve the best performance with the SPIRV JointMatrixMatmul. We may need to use the DPAS explicitly in the IR for pre-op and post-op fusion in GEMM.
I have concerns about the overhead of mixing the two paradigms.
Yeah. Based on the SYCL example, the SIMT-SIMD calling convention does not perform as well as expected.
It is not optimized for now, but I think it could be optimized at link phase by replacing the register function call with an inline function call and doing some link-phase optimization. The SIMT-SIMD convention is a good mechanism for aligning our SIMT paradigm and SIMD paradigm (e.g., calling an XeTLA micro-kernel inside the Triton kernel).
Yes, we will enable this one. As a first step we would like to hide this within the XeTile dialect; then we may need an additional pass in the integration code (say, on the Triton side) to merge multiple invoke_SIMD calls into one. In the future, when you say "performance not as expected", please report exactly how much you expect and how much it currently is, and make the benchmark as close to the real case as possible. For this round, we will build micro-benchmarks to track the XeTile level - load/store/dpas on shapes with the sizes 8, 16, 24, 32, and 64 - and that will give us a good understanding of what the overhead is.
I have tried the patches for supporting this. We can close this issue once they are upstreamed.
Background
The Triton kernel is generated as a SIMT-major SPIRV kernel, because some components have to be used with the SIMT paradigm (e.g., the Intel math library only provides a SIMT version).
But for some minor portion of the kernel, we might have to use the SIMD paradigm for performance or functionality; for example, we may want to use the VC intrinsics in the Triton kernel.
We are working on enabling the SIMT->SIMD calling convention in the Triton kernel:
https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_invoke_simd.asciidoc
By doing so, we can codegen SIMD paradigm code for parts of the kernel.
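As a rough illustration, here is a minimal sketch of how a SIMT kernel hands a sub-group's data to a SIMD function via the invoke_simd extension, assuming a DPC++ compiler that supports it. The kernel, the vector length `VL`, and the `scale_simd` callee are hypothetical placeholders rather than actual Triton-generated code; the ESIMD callee stands in for the VC-intrinsic-level code mentioned above.

```cpp
#include <sycl/sycl.hpp>
#include <sycl/ext/intel/esimd.hpp>
#include <sycl/ext/oneapi/experimental/invoke_simd.hpp>

namespace esimd = sycl::ext::intel::esimd;
namespace seoe = sycl::ext::oneapi::experimental;

constexpr int VL = 16; // sub-group size == SIMD vector length (assumed)

// SIMD (ESIMD) callee: compiled in the SIMD paradigm and lowered to VC
// intrinsics; __regcall selects the register calling convention.
[[intel::device_indirectly_callable]] SYCL_EXTERNAL esimd::simd<float, VL> __regcall
scale_simd(esimd::simd<float, VL> x, float factor) SYCL_ESIMD_FUNCTION {
  return x * factor;
}

int main() {
  sycl::queue q;
  constexpr size_t N = 4 * VL;
  float *data = sycl::malloc_shared<float>(N, q);
  for (size_t i = 0; i < N; ++i)
    data[i] = float(i);

  // SIMT kernel: each work-item supplies one lane; invoke_simd gathers the
  // sub-group's lanes into a simd<float, VL>, calls scale_simd once per
  // sub-group, and scatters the per-lane results back.
  q.parallel_for(
       sycl::nd_range<1>{N, VL},
       [=](sycl::nd_item<1> it) [[sycl::reqd_sub_group_size(VL)]] {
         sycl::sub_group sg = it.get_sub_group();
         size_t gid = it.get_global_id(0);
         float res = seoe::invoke_simd(sg, scale_simd, data[gid],
                                       seoe::uniform{2.0f});
         data[gid] = res;
       })
      .wait();

  sycl::free(data, q);
  return 0;
}
```

Note that invoke_simd references scale_simd indirectly, which is why function-pointer support (SPV_INTEL_function_pointers) shows up in the requirements below; depending on the toolchain, additional compiler options may be required to allow function pointers in device code.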
The requirements
We referred to the SPIRV generated by DPCPP, which mixes SIMT and SIMD. We need to refer to the SPIRV kernel function directly through a function pointer:
https://github.com/intel/llvm/blob/sycl/sycl/doc/design/spirv-extensions/SPV_INTEL_function_pointers.asciidoc
We need the SPIRV dialect to support this.