
Bridge matmul performance gap, peano vs ukernel #433

Open
newling opened this issue Jun 17, 2024 · 3 comments

newling commented Jun 17, 2024

This is an issue to track progress; feel free to close it if there is a more descriptive issue covering what needs to be done. CC: @jsetoain, @jamestcl-amd

Currently there is a 4x performance gap. See #415 for the experiment that reproduces this 4x estimate. There has been discussion in a private thread about what techniques chess uses, and it sounds like far and away the most important optimization is the use of a "prepare_for_pipelining" pragma.

So we probably need software pipelining, most likely in iree-amd-aie, but it could happen anywhere in the stack. In iree-amd-aie (and in MLIR generally), any transformation involving loops is easier to do than after the code has been lowered to LLVM IR.

Maybe we need to unroll the loop(s) inserted in this pass, and then reschedule the operations in the resulting block (move operations which load data into registers up, ahead of the matmuls, so that the matmuls can run concurrently).

Questions: how much unrolling do we want? Is this the right approach? Do we want to do prefetching without unrolling? Etc.
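On the prefetching question, here is a rough hypothetical sketch (not the output of any existing pass; loadA/loadB/loadC/storeC are placeholders I am using for illustration) of what prefetching without unrolling could look like on the reduction loop of N MMA ops: the loads for iteration k+1 are issued before the matmul of iteration k, so loads and MMAs can overlap.

        c = loadC();                      // accumulator kept in registers
        a = loadA(0); b = loadB(0);       // prologue: operands for iteration 0
        for (k = 0; k < N - 1; k++) {
            a_next = loadA(k + 1);        // issue next iteration's loads early
            b_next = loadB(k + 1);
            c = mma(a, b, c);             // compute current iteration
            a = a_next;
            b = b_next;
        }
        c = mma(a, b, c);                 // epilogue: last iteration, no prefetch
        storeC(c);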

At some point we will want to be able to benchmark this directly from the mlir-aie level, starting with aievec matmuls.

jsetoain commented Jul 10, 2024

I've been looking into this for a while now, and I wanted to give an update on my findings.

The Current Status

First, a comparison between the direct code generation kernel and the μkernel. The kernel being generated is the most naïve version possible:

        for (i = 0; i < 16; i++) {     // loop 0
          for (j = 0; j < 16; j++) {   // loop 1
            for (k = 0; k < 8; k++) {  // loop 2
                a = A[k, i];
                b = B[j, k];
                c = C[j, i];
                r = mma(a, b, c);
                C[j, i] = r;
            }
          }
        }

The μkernel has the following structure:

        for (i = 0; i < 16; i+=2) {     // loop 0
          for (j = 0; j < 16; j+=2) {   // loop 1
            c0 = C[i, j];
            c1 = C[i, j+1];
            c2 = C[i+1, j];
            c3 = C[i+1, j+1];
            a0 = A[i, 0];
            a1 = A[i+1, 0];
            b0 = B[0, j];
            b1 = B[0, j+1];
            c0 = mma(a0, b0, c0);
            c1 = mma(a0, b1, c1);
            c2 = mma(a1, b0, c2);
            c3 = mma(a1, b1, c3);            
            for (k = 1; k < 8; k++) {  // loop 2
              a0 = A[i, k];
              a1 = A[i+1, k];
              b0 = B[k, j];
              b1 = B[k, j+1];
              c0 = mma(a0, b0, c0);
              c1 = mma(a0, b1, c1);
              c2 = mma(a1, b0, c2);
              c3 = mma(a1, b1, c3);
            }
            C[i, j] = c0;
            C[i, j+1] = c1;
            C[i+1, j] = c2;
            C[i+1, j+1] = c3;
          }
        }

That means 3 optimizations have been performed on the basic loops. Loop 0 and loop 1 have been unrolled & jammed by a factor of 2; this increases the number of independent MMA ops in the innermost loop, improving instruction packing. For loop 2, the innermost loop, the first iteration has been peeled off as a preamble, and my assumption is that this makes pipelining easier. Another subtle optimization is that the load/store of the accumulator has been hoisted out of the reduction loop.

When Peano gets the directly generated kernel, it completely unrolls loop 2, and unrolls loop 1 by a factor of 2. That is, you end up with 16 MMA ops in the innermost loop, which are independent in pairs.

Edit: From experiments with Peano, matching this kernel structure should get us up to 75% of the performance of the μkernel, up from the current 20-33%.

Moving Forward

An obvious way to try to improve performance would be to replicate these optimizations:

  1. Unroll & Jam the parallel loops (0 & 1) by a factor of 2
  2. Peel a preamble out of the reduction loop (2)
  3. Eliminate redundant load/stores on the accumulators

There are two issues, on two different fronts, blocking the first optimization:

  1. The recently added functionality to do unroll & jam on scf::ForOp, unlike the equivalent available for affine::AffineForOp, does not support loops that return values, but linalg::tileLinalgOp generates loops like:
  %0 = scf.for %i = %c0 to %c16 step %c1 iter_args(%a0 = %aa) -> (tensor<16x16x4x4xf32>) {
    %1 = scf.for %j = %c0 to %c16 step %c1 iter_args(%a1 = %a0) -> (tensor<16x16x4x4xf32>) {
      %2 = scf.for %k = %c0 to %c8 step %c1 iter_args(%a2 = %a1) -> (tensor<16x16x4x4xf32>) {
  2. Although linalg::tileLinalgOp accepts a parameter indicating which type of loop to use for tiling, it's only implemented for scf::ForOp and scf::ParallelOp, not for affine::AffineForOp.

The two options are, therefore, to either implement unroll & jam for scf::ForOp loops that return a value, or to implement linalg::tileLinalgOp with affine::AffineForOp. Of the two, although we could get extra advantages from using affine loops, the fact that it is not already implemented upstream gives me pause. On the other hand, unroll & jam for scf::ForOp has only been implemented for a couple of weeks, so I can see why it doesn't yet match the functionality of the affine version, even though I see no reason it couldn't.
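For concreteness, a hand-written sketch (not actual pass output; @body is a placeholder standing in for the tiled matmul body, i.e. the reduction loop plus the slice manipulation) of what unroll & jam by 2 of loop 0 would need to produce when the loops yield a tensor: the outer step doubles, the two copies of the body are jammed into the single inner loop, and the accumulator tensor is threaded serially through the copies.

  %0 = scf.for %i = %c0 to %c16 step %c2 iter_args(%a0 = %aa) -> (tensor<16x16x4x4xf32>) {
    %i1 = arith.addi %i, %c1 : index
    %1 = scf.for %j = %c0 to %c16 step %c1 iter_args(%a1 = %a0) -> (tensor<16x16x4x4xf32>) {
      // Two copies of the original body, jammed into one inner loop; the second
      // copy consumes the accumulator tensor produced by the first.
      %r0 = func.call @body(%i, %j, %a1) : (index, index, tensor<16x16x4x4xf32>) -> tensor<16x16x4x4xf32>
      %r1 = func.call @body(%i1, %j, %r0) : (index, index, tensor<16x16x4x4xf32>) -> tensor<16x16x4x4xf32>
      scf.yield %r1 : tensor<16x16x4x4xf32>
    }
    scf.yield %1 : tensor<16x16x4x4xf32>
  }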

Pending Issue?

Optimizing redundant transfer reads/writes is already implemented and can be trivially used, but it has to happen after the loads/stores have been bufferized and we have transfer read/write ops. At the point where the loops are inserted, this hasn't happened yet, so this optimization would either have to run right after bufferization but before the loops are lowered away, or it needs to be implemented for tensor extract/insert slice ops.
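For reference, a simplified before/after sketch of what that hoisting buys us once we have vector transfer ops (shapes and the @mma callee are invented for illustration; upstream this is roughly what linalg::hoistRedundantVectorTransfers does, if I'm not mistaken):

  // Before: the accumulator tile is re-read and re-written on every reduction iteration.
  scf.for %k = %c0 to %c8 step %c1 {
    %acc = vector.transfer_read %C[%i, %j], %cst : memref<64x64xf32>, vector<4x4xf32>
    %a = vector.transfer_read %A[%i, %k], %cst : memref<64x32xf32>, vector<4x8xf32>
    %b = vector.transfer_read %B[%k, %j], %cst : memref<32x64xf32>, vector<8x4xf32>
    %r = func.call @mma(%a, %b, %acc) : (vector<4x8xf32>, vector<8x4xf32>, vector<4x4xf32>) -> vector<4x4xf32>
    vector.transfer_write %r, %C[%i, %j] : vector<4x4xf32>, memref<64x64xf32>
  }

  // After hoisting: the transfer_read/transfer_write of %C move out of the loop
  // and the accumulator vector is carried through iter_args instead.
  %init = vector.transfer_read %C[%i, %j], %cst : memref<64x64xf32>, vector<4x4xf32>
  %res = scf.for %k = %c0 to %c8 step %c1 iter_args(%acc = %init) -> (vector<4x4xf32>) {
    %a = vector.transfer_read %A[%i, %k], %cst : memref<64x32xf32>, vector<4x8xf32>
    %b = vector.transfer_read %B[%k, %j], %cst : memref<32x64xf32>, vector<8x4xf32>
    %r = func.call @mma(%a, %b, %acc) : (vector<4x8xf32>, vector<8x4xf32>, vector<4x4xf32>) -> vector<4x4xf32>
    scf.yield %r : vector<4x4xf32>
  }
  vector.transfer_write %res, %C[%i, %j] : vector<4x4xf32>, memref<64x64xf32>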

I'm going to preemptively start by extending unroll & jam on scf::ForOp, but thoughts, suggestions, and other ideas pointing in a different direction are welcome.

makslevental commented

> using affine loops, the fact that it is not already implemented upstream gives me pause.

affine dialect is generally considered "abandonware" so that's the reason (I believe).

jsetoain commented

> > using affine loops, the fact that it is not already implemented upstream gives me pause.
>
> affine dialect is generally considered "abandonware" so that's the reason (I believe).

That would explain quite a lot... but then again, unroll & jam is not fully implemented in scf, and the patch with the partial implementation only landed on June 19th (iirc). I'd expect these basic transformations to be fully implemented upstream. No time like the present, I guess 😂
