
Bridge matmul performance gap, peano vs ukernel #433

Open
newling opened this issue Jun 17, 2024 · 3 comments

newling commented Jun 17, 2024

This is an issue to track progress; feel free to close it if there is a more descriptive issue covering what needs to be done. CC: @jsetoain, @jamestcl-amd

Currently there is a 4x performance gap. See #415 for the experiment that reproduces this 4x estimate. There has been discussion in a private thread about what techniques chess uses, and it sounds like far and away the most important optimization is the use of a "prepare_for_pipelining" pragma.

So we probably need software pipelining, most likely in iree-amd-aie, but it could happen anywhere in the stack. In iree-amd-aie (and in MLIR generally), any transformation involving loops is easier to do than after the code has been lowered to LLVM IR.

Maybe we need to unroll the loop(s) inserted in this pass, and then reschedule the operations in the resulting block (move operations which load data into registers up, ahead of the matmuls, so that the matmuls can run concurrently).

Questions: how much unrolling do we want? Is this the right approach? Do we want to do prefetching without unrolling? Etc.
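On the prefetching question, here is a rough hypothetical sketch (not the output of any existing pass; loadA/loadB/loadC/storeC are placeholders I am using for illustration) of what prefetching without unrolling could look like on the reduction loop of N MMA ops: the loads for iteration k+1 are issued before the matmul of iteration k, so loads and MMAs can overlap.

        c = loadC();                      // accumulator kept in registers
        a = loadA(0); b = loadB(0);       // prologue: operands for iteration 0
        for (k = 0; k < N - 1; k++) {
            a_next = loadA(k + 1);        // issue next iteration's loads early
            b_next = loadB(k + 1);
            c = mma(a, b, c);             // compute current iteration
            a = a_next;
            b = b_next;
        }
        c = mma(a, b, c);                 // epilogue: last iteration, no prefetch
        storeC(c);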

At some point we will want to be able to benchmark this directly from the mlir-aie level, starting with aievec matmuls.

jsetoain commented Jul 10, 2024

I've been looking into this for a while now, and I wanted to give an update on my findings.

The Current Status

First, a comparison between the direct code generation kernel and the μkernel. The kernel being generated is the most naïve version possible:

        for (i = 0; i < 16; i++) {     // loop 0
          for (j = 0; j < 16; j++) {   // loop 1
            for (k = 0; k < 8; k++) {  // loop 2
                a = A[k, i];
                b = B[j, k];
                c = C[j, i];
                r = mma(a, b, c);
                C[j, i] = r;
            }
          }
        }

The μkernel has the following structure:

        for (i = 0; i < 16; i+=2) {     // loop 0
          for (j = 0; j < 16; j+=2) {   // loop 1
            c0 = C[i, j];
            c1 = C[i, j+1];
            c2 = C[i+1, j];
            c3 = C[i+1, j+1];
            a0 = A[i, 0];
            a1 = A[i+1, 0];
            b0 = B[0, j];
            b1 = B[0, j+1];
            c0 = mma(a0, b0, c0);
            c1 = mma(a0, b1, c1);
            c2 = mma(a1, b0, c2);
            c3 = mma(a1, b1, c3);            
            for (k = 1; k < 8; k++) {  // loop 2
              a0 = A[i, k];
              a1 = A[i+1, k];
              b0 = B[k, j];
              b1 = B[k, j+1];
              c0 = mma(a0, b0, c0);
              c1 = mma(a0, b1, c1);
              c2 = mma(a1, b0, c2);
              c3 = mma(a1, b1, c3);
            }
            C[i, j] = c0;
            C[i, j+1] = c1;
            C[i+1, j] = c2;
            C[i+1, j+1] = c3;
          }
        }

That means 3 optimizations have been performed on the basic loops. Loop 0 and loop 1 have been unrolled & jammed by a factor of 2; this increases the number of independent MMA ops in the innermost loop, improving instruction packing. For loop 2, the innermost loop, the first iteration has been peeled off as a preamble, and my assumption is that this makes pipelining easier. Another subtle optimization is that the load/store of the accumulator has been hoisted out of the reduction loop.

When Peano gets the directly generated kernel, it completely unrolls loop 2, and unrolls loop 1 by a factor of 2. That is, you end up with 16 MMA ops in the innermost loop, which are independent in pairs.

Edit: From experiments with Peano, matching this kernel structure should get us up to 75% of the performance of the μkernel, up from the current 20-33%.

Moving Forward

An obvious way to try to improve performance would be to replicate these optimizations:

  1. Unroll & Jam the parallel loops (0 & 1) by a factor of 2
  2. Peel a preamble out of the reduction loop (2)
  3. Eliminate redundant load/stores on the accumulators

There are two issues, on two different fronts, blocking the first optimization:

  1. The recently added functionality to do unroll & jam on scf::ForOp, unlike the equivalent available for affine::AffineForOp, does not support loops that return values, but linalg::tileLinalgOp generates loops like:
  %0 = scf.for %i = %c0 to %c16 step %c1 iter_args(%a0 = %aa) -> (tensor<16x16x4x4xf32>) {
    %1 = scf.for %j = %c0 to %c16 step %c1 iter_args(%a1 = %a0) -> (tensor<16x16x4x4xf32>) {
      %2 = scf.for %k = %c0 to %c8 step %c1 iter_args(%a2 = %a1) -> (tensor<16x16x4x4xf32>) {
  2. Although linalg::tileLinalgOp accepts a parameter indicating which type of loop to use for tiling, it's only implemented for scf::ForOp and scf::ParallelOp, not for affine::AffineForOp.

The two options are, therefore, to either implement unroll & jam for scf::ForOp loops that return a value, or to implement linalg::tileLinalgOp with affine::AffineForOp. Of the two, although we could get extra advantages from using affine loops, the fact that it is not already implemented upstream gives me pause. On the other hand, unroll & jam for scf::ForOp has only been implemented for a couple of weeks, so I can see why it doesn't yet match the functionality of the affine version, even though I see no reason it couldn't.
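For concreteness, a hand-written sketch (not actual pass output; @body is a placeholder standing in for the tiled matmul body, i.e. the reduction loop plus the slice manipulation) of what unroll & jam by 2 of loop 0 would need to produce when the loops yield a tensor: the outer step doubles, the two copies of the body are jammed into the single inner loop, and the accumulator tensor is threaded serially through the copies.

  %0 = scf.for %i = %c0 to %c16 step %c2 iter_args(%a0 = %aa) -> (tensor<16x16x4x4xf32>) {
    %i1 = arith.addi %i, %c1 : index
    %1 = scf.for %j = %c0 to %c16 step %c1 iter_args(%a1 = %a0) -> (tensor<16x16x4x4xf32>) {
      // Two copies of the original body, jammed into one inner loop; the second
      // copy consumes the accumulator tensor produced by the first.
      %r0 = func.call @body(%i, %j, %a1) : (index, index, tensor<16x16x4x4xf32>) -> tensor<16x16x4x4xf32>
      %r1 = func.call @body(%i1, %j, %r0) : (index, index, tensor<16x16x4x4xf32>) -> tensor<16x16x4x4xf32>
      scf.yield %r1 : tensor<16x16x4x4xf32>
    }
    scf.yield %1 : tensor<16x16x4x4xf32>
  }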

Pending Issue?

Optimizing redundant transfer reads/writes is already implemented and can be trivially used, but it has to happen after the loads/stores have been bufferized and we have transfer read/write ops. At the point where the loops are inserted, this hasn't happened yet, so this optimization would either have to run right after bufferization but before the loops are lowered away, or it needs to be implemented for tensor extract/insert slice ops.
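For reference, a simplified before/after sketch of what that hoisting buys us once we have vector transfer ops (shapes and the @mma callee are invented for illustration; upstream this is roughly what linalg::hoistRedundantVectorTransfers does, if I'm not mistaken):

  // Before: the accumulator tile is re-read and re-written on every reduction iteration.
  scf.for %k = %c0 to %c8 step %c1 {
    %acc = vector.transfer_read %C[%i, %j], %cst : memref<64x64xf32>, vector<4x4xf32>
    %a = vector.transfer_read %A[%i, %k], %cst : memref<64x32xf32>, vector<4x8xf32>
    %b = vector.transfer_read %B[%k, %j], %cst : memref<32x64xf32>, vector<8x4xf32>
    %r = func.call @mma(%a, %b, %acc) : (vector<4x8xf32>, vector<8x4xf32>, vector<4x4xf32>) -> vector<4x4xf32>
    vector.transfer_write %r, %C[%i, %j] : vector<4x4xf32>, memref<64x64xf32>
  }

  // After hoisting: the transfer_read/transfer_write of %C move out of the loop
  // and the accumulator vector is carried through iter_args instead.
  %init = vector.transfer_read %C[%i, %j], %cst : memref<64x64xf32>, vector<4x4xf32>
  %res = scf.for %k = %c0 to %c8 step %c1 iter_args(%acc = %init) -> (vector<4x4xf32>) {
    %a = vector.transfer_read %A[%i, %k], %cst : memref<64x32xf32>, vector<4x8xf32>
    %b = vector.transfer_read %B[%k, %j], %cst : memref<32x64xf32>, vector<8x4xf32>
    %r = func.call @mma(%a, %b, %acc) : (vector<4x8xf32>, vector<8x4xf32>, vector<4x4xf32>) -> vector<4x4xf32>
    scf.yield %r : vector<4x4xf32>
  }
  vector.transfer_write %res, %C[%i, %j] : vector<4x4xf32>, memref<64x64xf32>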

I'm going to preemptively start by extending unroll & jam on scf::ForOp, but thoughts, suggestions, and other ideas pointing in a different direction are welcome.

makslevental commented

> using affine loops, the fact that it is not already implemented upstream gives me pause.

affine dialect is generally considered "abandonware" so that's the reason (I believe).

jsetoain commented

> > using affine loops, the fact that it is not already implemented upstream gives me pause.
>
> affine dialect is generally considered "abandonware" so that's the reason (I believe).

That would explain quite a lot... but then again, unroll & jam is not fully implemented in scf, and the patch with the partial implementation only landed on June 19th (iirc). I'd expect these basic transformations to be fully implemented upstream. No time like the present, I guess 😂
