Bridge matmul performance gap, peano vs ukernel #433
I've been looking into this for a while now, and I wanted to give an update on my findings.

**The Current Status**

First, a comparison between the direct code generation kernel and the μkernel. The kernel being generated is the most naïve version possible:

```
for (i = 0; i < 16; i++) { // loop 0
  for (j = 0; j < 16; j++) { // loop 1
    for (k = 0; k < 8; k++) { // loop 2
      a = A[k, i];
      b = B[j, k];
      c = C[j, i];
      r = mma(a, b, c);
      C[j, i] = r;
    }
  }
}
```

The μkernel has the following structure:

```
for (i = 0; i < 16; i+=2) { // loop 0
  for (j = 0; j < 16; j+=2) { // loop 1
    c0 = C[i, j];
    c1 = C[i, j+1];
    c2 = C[i+1, j];
    c3 = C[i+1, j+1];
    a0 = A[i, 0];
    a1 = A[i+1, 0];
    b0 = B[0, j];
    b1 = B[0, j+1];
    c0 = mma(a0, b0, c0);
    c1 = mma(a0, b1, c1);
    c2 = mma(a1, b0, c2);
    c3 = mma(a1, b1, c3);
    for (k = 1; k < 8; k++) { // loop 2
      a0 = A[i, k];
      a1 = A[i+1, k];
      b0 = B[k, j];
      b1 = B[k, j+1];
      c0 = mma(a0, b0, c0);
      c1 = mma(a0, b1, c1);
      c2 = mma(a1, b0, c2);
      c3 = mma(a1, b1, c3);
    }
    C[i, j] = c0;
    C[i, j+1] = c1;
    C[i+1, j] = c2;
    C[i+1, j+1] = c3;
  }
}
```

That means 3 optimizations have been performed on the basic loops:

1. Loop 0 and loop 1 have been unrolled & jammed by a factor of 2; this increases the number of independent MMA ops in the innermost loop, improving instruction packing.
2. For loop 2, the innermost loop, the first iteration has been peeled off as a preamble; my assumption is that this makes pipelining easier.
3. Another, subtler optimization is that the load/store of the accumulator has been hoisted out of the reduction loop.

When Peano gets the directly generated kernel, it completely unrolls loop 2 and unrolls loop 1 by a factor of 2. That is, you end up with 16 MMA ops in the innermost loop, which are independent in pairs.

Edit: From experiments with Peano, matching this kernel structure should get us up to 75% of the performance of the μkernel, up from the current 20-33%.
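For reference, here is a rough sketch of the structure that description implies: loop 2 fully unrolled and loop 1 unrolled by 2, giving 16 MMA ops in the innermost block that are independent in pairs. This is my reading of the behaviour, not actual Peano output; it is written as scalar C with a plain multiply-add standing in for the MMA intrinsic, and the names, dimensions, and the register promotion of the accumulators are all assumptions made for illustration:

```c
enum { M = 16, N = 16, K = 8 };

/* Scalar stand-in for the vector MMA intrinsic; illustrative only. */
static inline float mma(float a, float b, float c) { return a * b + c; }

void peano_like_structure(const float A[M][K], const float B[K][N], float C[M][N]) {
  for (int i = 0; i < M; i++) {        /* loop 0 */
    for (int j = 0; j < N; j += 2) {   /* loop 1, unrolled by 2 */
      /* Assumption: the C accesses get promoted to registers after unrolling. */
      float c0 = C[i][j];
      float c1 = C[i][j + 1];
      /* Loop 2 fully unrolled: 16 MMA ops, independent in pairs
       * (the two ops at the same k are independent; consecutive k chain on c0/c1). */
      c0 = mma(A[i][0], B[0][j], c0);  c1 = mma(A[i][0], B[0][j + 1], c1);
      c0 = mma(A[i][1], B[1][j], c0);  c1 = mma(A[i][1], B[1][j + 1], c1);
      c0 = mma(A[i][2], B[2][j], c0);  c1 = mma(A[i][2], B[2][j + 1], c1);
      c0 = mma(A[i][3], B[3][j], c0);  c1 = mma(A[i][3], B[3][j + 1], c1);
      c0 = mma(A[i][4], B[4][j], c0);  c1 = mma(A[i][4], B[4][j + 1], c1);
      c0 = mma(A[i][5], B[5][j], c0);  c1 = mma(A[i][5], B[5][j + 1], c1);
      c0 = mma(A[i][6], B[6][j], c0);  c1 = mma(A[i][6], B[6][j + 1], c1);
      c0 = mma(A[i][7], B[7][j], c0);  c1 = mma(A[i][7], B[7][j + 1], c1);
      C[i][j] = c0;
      C[i][j + 1] = c1;
    }
  }
}
```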
**Moving Forward**

An obvious way to try to improve performance would be to replicate these optimizations: unroll & jam loops 0 and 1, peel the first iteration of loop 2, and hoist the accumulator load/store out of the reduction loop.

There are two issues, coming from two different fronts, blocking the first optimization (unroll & jam):
```mlir
%0 = scf.for %i = %c0 to %c16 step %c1 iter_args(%a0 = %aa) -> (tensor<16x16x4x4xf32>) {
  %1 = scf.for %j = %c0 to %c16 step %c1 iter_args(%a1 = %a0) -> (tensor<16x16x4x4xf32>) {
    %2 = scf.for %k = %c0 to %c8 step %c1 iter_args(%a2 = %a1) -> (tensor<16x16x4x4xf32>) {
      // ...
```
The two options are, therefore, either try to implement unroll & jam for ...

**Pending Issue?**

Optimizing redundant transfer reads/writes is implemented and can be trivially used, but it has to happen after the load/stores have been bufferized and we have transfer read/write ops. At the loop insertion level, this hasn't happened yet, so this optimization would have to happen either right after bufferization but before getting rid of the loops, or it needs to be implemented for tensor extract/insert slice ops. I'm going to preemptively start by extending unroll & jam on ...
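To make the "redundant transfer read/write" point concrete, here is a scalar before/after analogy of the accumulator load/store being hoisted out of the reduction loop. The real optimization would act on `vector.transfer_read`/`vector.transfer_write` ops after bufferization (or on `tensor.extract_slice`/`tensor.insert_slice` at the tensor level), so treat this purely as a sketch of the idea:

```c
/* Scalar analogy only: the accumulator C[i][j] is redundantly re-read and
 * re-written on every step of the reduction loop. */
void matmul_redundant_rw(float C[16][16], const float A[16][8], const float B[8][16]) {
  for (int i = 0; i < 16; i++)
    for (int j = 0; j < 16; j++)
      for (int k = 0; k < 8; k++)
        C[i][j] = C[i][j] + A[i][k] * B[k][j];   /* load + store of C on every k */
}

/* After hoisting: a single read before the reduction loop and a single
 * write after it, mirroring what the μkernel does with c0..c3. */
void matmul_hoisted_rw(float C[16][16], const float A[16][8], const float B[8][16]) {
  for (int i = 0; i < 16; i++)
    for (int j = 0; j < 16; j++) {
      float acc = C[i][j];
      for (int k = 0; k < 8; k++)
        acc += A[i][k] * B[k][j];
      C[i][j] = acc;
    }
}
```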
That would explain quite a lot... but then again, unroll & jam is not fully implemented in ...
This is an issue to track progress; feel free to close it if there is a more descriptive issue for what needs to be done. CC: @jsetoain, @jamestcl-amd
Currently there is a 4x performance gap. See #415 for the experiment to reproduce this 4x estimate. There has been discussion in a private thread on what techniques chess uses, and it sounds like far and away the #1 performance optimization is the use of a "prepare_for_pipelining" pragma.
So we probably need software pipelining, probably in iree-amd-aie, but it could be anywhere in the stack. In iree-amd-aie (and in MLIR generally), any transformation involving loops is easier to do than it is once the code has been lowered to LLVM IR.
Maybe we need to unroll the loop(s) inserted in this pass, and then reschedule the operations in the resulting block: move the operations which load data into registers up, so that they come before the matmuls and the matmuls can run concurrently with the loads.
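Below is a hypothetical sketch of that "loads ahead of the matmuls" rescheduling, written as plain C. The names (`vec_t`, `load_a`, `load_b`, `mma`) are placeholders, not real aievec or iree-amd-aie APIs, and this particular shape does the prefetching without unrolling, which is one of the variants questioned below:

```c
/* Hypothetical sketch: issue the next iteration's loads before the current
 * MMA so the scheduler can overlap loads with compute. Placeholder names. */
typedef float vec_t;                       /* stand-in for the vector/tile type */

vec_t load_a(const vec_t *A, int i, int k);
vec_t load_b(const vec_t *B, int k, int j);
vec_t mma(vec_t a, vec_t b, vec_t c);

vec_t reduce_pipelined(const vec_t *A, const vec_t *B, vec_t c, int i, int j, int K) {
  /* Prologue: issue the loads for the first iteration up front. */
  vec_t a_cur = load_a(A, i, 0);
  vec_t b_cur = load_b(B, 0, j);
  for (int k = 0; k + 1 < K; k++) {
    /* Loads for iteration k+1 are issued before the MMA for iteration k. */
    vec_t a_next = load_a(A, i, k + 1);
    vec_t b_next = load_b(B, k + 1, j);
    c = mma(a_cur, b_cur, c);
    a_cur = a_next;
    b_cur = b_next;
  }
  /* Epilogue: the last MMA, with no further loads. */
  return mma(a_cur, b_cur, c);
}
```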
Questions: how much unrolling do we want, is this the right approach, do we want to do prefetching without unrolling, etc.
At some point we will want to be able to benchmark this directly from the mlir-aie level, starting with aievec matmuls.