[Tiling] Packing for convolution #756

Open
newling opened this issue Sep 9, 2024 · 10 comments

@newling
Contributor

newling commented Sep 9, 2024

Use packing between L2 and L1 for convolution.

Using upstream MLIR packing, I get the following:

func.func @conv_2d_nhwc_hwcf_dispatch_0_conv_2d_nhwc_hwcf_2x12x12x64x3x3x32_i32() attributes {translation_info = #translation} {
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %c3 = arith.constant 3 : index
  %c0_i32 = arith.constant 0 : i32
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan layout(#pipeline_layout) set(0) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<2x14x14x32xi32>>
  %1 = hal.interface.binding.subspan layout(#pipeline_layout) set(0) binding(1) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<3x3x32x64xi32>>
  %2 = hal.interface.binding.subspan layout(#pipeline_layout) set(0) binding(2) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<2x12x12x64xi32>>
  %3 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0, 0], sizes = [2, 14, 14, 32], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<2x14x14x32xi32>> -> tensor<2x14x14x32xi32>
  %4 = flow.dispatch.tensor.load %1, offsets = [0, 0, 0, 0], sizes = [3, 3, 32, 64], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<3x3x32x64xi32>> -> tensor<3x3x32x64xi32>
  %5 = flow.dispatch.tensor.load %2, offsets = [0, 0, 0, 0], sizes = [2, 12, 12, 64], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<2x12x12x64xi32>> -> tensor<2x12x12x64xi32>
  %6 = scf.forall (%arg0, %arg1, %arg2) = (0, 0, 0) to (12, 12, 64) step (4, 4, 4) shared_outs(%arg3 = %5) -> (tensor<2x12x12x64xi32>) {
    %extracted_slice = tensor.extract_slice %3[0, %arg0, %arg1, 0] [2, 6, 6, 32] [1, 1, 1, 1] : tensor<2x14x14x32xi32> to tensor<2x6x6x32xi32>
    %extracted_slice_0 = tensor.extract_slice %4[0, 0, 0, %arg2] [3, 3, 32, 4] [1, 1, 1, 1] : tensor<3x3x32x64xi32> to tensor<3x3x32x4xi32>
    %extracted_slice_1 = tensor.extract_slice %arg3[0, %arg0, %arg1, %arg2] [2, 4, 4, 4] [1, 1, 1, 1] : tensor<2x12x12x64xi32> to tensor<2x4x4x4xi32>
    %7 = bufferization.alloc_tensor() : tensor<2x6x6x32xi32>
    %alloc = memref.alloc() : memref<2x6x6x32xi32, 1 : i32>
    %8 = bufferization.to_tensor %alloc restrict writable : memref<2x6x6x32xi32, 1 : i32>
    %9 = linalg.copy ins(%extracted_slice : tensor<2x6x6x32xi32>) outs(%8 : tensor<2x6x6x32xi32>) -> tensor<2x6x6x32xi32>
    %10 = bufferization.alloc_tensor() : tensor<3x3x32x4xi32>
    %alloc_2 = memref.alloc() : memref<3x3x32x4xi32, 1 : i32>
    %11 = bufferization.to_tensor %alloc_2 restrict writable : memref<3x3x32x4xi32, 1 : i32>
    %12 = linalg.copy ins(%extracted_slice_0 : tensor<3x3x32x4xi32>) outs(%11 : tensor<3x3x32x4xi32>) -> tensor<3x3x32x4xi32>
    %13 = bufferization.alloc_tensor() : tensor<2x4x4x4xi32>
    %alloc_3 = memref.alloc() : memref<2x4x4x4xi32, 1 : i32>
    %14 = bufferization.to_tensor %alloc_3 restrict writable : memref<2x4x4x4xi32, 1 : i32>
    %15 = scf.forall (%arg4, %arg5, %arg6, %arg7) = (0, 0, 0, 0) to (2, 4, 4, 4) step (1, 1, 4, 4) shared_outs(%arg8 = %14) -> (tensor<2x4x4x4xi32>) {
      %extracted_slice_4 = tensor.extract_slice %9[%arg4, %arg5, %arg6, 0] [1, 3, 6, 32] [1, 1, 1, 1] : tensor<2x6x6x32xi32> to tensor<1x3x6x32xi32>
      %extracted_slice_5 = tensor.extract_slice %12[0, 0, 0, %arg7] [3, 3, 32, 4] [1, 1, 1, 1] : tensor<3x3x32x4xi32> to tensor<3x3x32x4xi32>
      %extracted_slice_6 = tensor.extract_slice %arg8[%arg4, %arg5, %arg6, %arg7] [1, 1, 4, 4] [1, 1, 1, 1] : tensor<2x4x4x4xi32> to tensor<1x1x4x4xi32>
      %alloc_7 = memref.alloc() : memref<1x3x4x6x8xi32, 2 : i32>
      %17 = bufferization.to_tensor %alloc_7 restrict writable : memref<1x3x4x6x8xi32, 2 : i32>
      %pack = tensor.pack %extracted_slice_4 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [3] inner_tiles = [8] into %17 : tensor<1x3x6x32xi32> -> tensor<1x3x4x6x8xi32>
      %alloc_8 = memref.alloc() : memref<3x3x4x1x8x4xi32, 2 : i32>
      %18 = bufferization.to_tensor %alloc_8 restrict writable : memref<3x3x4x1x8x4xi32, 2 : i32>
      %pack_9 = tensor.pack %extracted_slice_5 outer_dims_perm = [0, 1, 2, 3] inner_dims_pos = [2, 3] inner_tiles = [8, 4] into %18 : tensor<3x3x32x4xi32> -> tensor<3x3x4x1x8x4xi32>
      %alloc_10 = memref.alloc() : memref<1x1x4x1x4xi32, 2 : i32>
      %19 = bufferization.to_tensor %alloc_10 restrict writable : memref<1x1x4x1x4xi32, 2 : i32>
      %20 = linalg.fill ins(%c0_i32 : i32) outs(%19 : tensor<1x1x4x1x4xi32>) -> tensor<1x1x4x1x4xi32>
      %21 = scf.for %arg9 = %c0 to %c3 step %c1 iter_args(%arg10 = %20) -> (tensor<1x1x4x1x4xi32>) {
        %22 = scf.for %arg11 = %c0 to %c3 step %c1 iter_args(%arg12 = %arg10) -> (tensor<1x1x4x1x4xi32>) {
          %23 = scf.for %arg13 = %c0 to %c4 step %c1 iter_args(%arg14 = %arg12) -> (tensor<1x1x4x1x4xi32>) {
            %extracted_slice_11 = tensor.extract_slice %pack[0, %arg9, %arg13, %arg11, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1] : tensor<1x3x4x6x8xi32> to tensor<1x1x1x4x8xi32>
            %extracted_slice_12 = tensor.extract_slice %pack_9[%arg9, %arg11, %arg13, 0, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] : tensor<3x3x4x1x8x4xi32> to tensor<1x1x1x1x8x4xi32>
            %24 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel", "parallel", "parallel", "reduction", "reduction", "reduction", "parallel", "reduction"]} ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>) outs(%arg14 : tensor<1x1x4x1x4xi32>) attrs =  {lowering_config = #config, packing_config = #packingConfig} {
            ^bb0(%in: i32, %in_13: i32, %out: i32):
              %25 = arith.muli %in, %in_13 : i32
              %26 = arith.addi %out, %25 : i32
              linalg.yield %26 : i32
            } -> tensor<1x1x4x1x4xi32>
            scf.yield %24 : tensor<1x1x4x1x4xi32>
          }
          scf.yield %23 : tensor<1x1x4x1x4xi32>
        }
        scf.yield %22 : tensor<1x1x4x1x4xi32>
      }
      %unpack = tensor.unpack %21 inner_dims_pos = [3] inner_tiles = [4] into %extracted_slice_6 : tensor<1x1x4x1x4xi32> -> tensor<1x1x4x4xi32>
      memref.dealloc %alloc_7 : memref<1x3x4x6x8xi32, 2 : i32>
      memref.dealloc %alloc_8 : memref<3x3x4x1x8x4xi32, 2 : i32>
      memref.dealloc %alloc_10 : memref<1x1x4x1x4xi32, 2 : i32>
      scf.forall.in_parallel {
        tensor.parallel_insert_slice %unpack into %arg8[%arg4, %arg5, %arg6, %arg7] [1, 1, 4, 4] [1, 1, 1, 1] : tensor<1x1x4x4xi32> into tensor<2x4x4x4xi32>
      }
    } {mapping = [#gpu.thread<y>, #gpu.thread<x>, #gpu.thread<z>, #gpu.thread<linear_dim_0>]}
    %16 = linalg.copy ins(%15 : tensor<2x4x4x4xi32>) outs(%extracted_slice_1 : tensor<2x4x4x4xi32>) -> tensor<2x4x4x4xi32>
    memref.dealloc %alloc : memref<2x6x6x32xi32, 1 : i32>
    memref.dealloc %alloc_2 : memref<3x3x32x4xi32, 1 : i32>
    memref.dealloc %alloc_3 : memref<2x4x4x4xi32, 1 : i32>
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %16 into %arg3[0, %arg0, %arg1, %arg2] [2, 4, 4, 4] [1, 1, 1, 1] : tensor<2x4x4x4xi32> into tensor<2x12x12x64xi32>
    }
  } {mapping = [#gpu.block<y>, #gpu.block<x>, #gpu.block<z>]}
  flow.dispatch.tensor.store %6, %2, offsets = [0, 0, 0, 0], sizes = [2, 12, 12, 64], strides = [1, 1, 1, 1] : tensor<2x12x12x64xi32> -> !flow.dispatch.tensor<writeonly:tensor<2x12x12x64xi32>>
  return
}

So I currently think we don't need any modification to upstream MLIR.

I'll post a PR with my packing config later. It currently causes other issues to crop up (in AIR: out of tile memory; in objectFifo: a different crash).

@yzhang93
Contributor

yzhang93 commented Sep 9, 2024

I think the main concern is whether the generated data layout can be vectorized.

And the linalg.generic's operands seem to be problematic to me.
ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>) outs(%arg14 : tensor<1x1x4x1x4xi32>)

@newling
Contributor Author

newling commented Sep 9, 2024

> I think the main concern is whether the generated data layout can be vectorized.

?

> And the linalg.generic's operands seem to be problematic to me. ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>) outs(%arg14 : tensor<1x1x4x1x4xi32>)

It looks ok to me; can you explain your concern? I think it's just a matmul with m=n=4, k=8 on contiguous slices of L1 allocations.
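For concreteness, a minimal sketch of the plain matmul I believe this should canonicalize to (illustrative only; the %lhs, %rhs, %acc names and the collapsed tensor types are hypothetical, not compiler output):

// Sketch only: a 4x8 * 8x4 i32 matmul accumulating into a 4x4 tile.
%mm = linalg.matmul ins(%lhs, %rhs : tensor<4x8xi32>, tensor<8x4xi32>)
                    outs(%acc : tensor<4x4xi32>) -> tensor<4x4xi32>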

@yzhang93
Contributor

yzhang93 commented Sep 9, 2024

I didn't see the indexing maps for the linalg.generic operands, so I'm not sure what has been done. Is there an implicit collapse of dimensions for tensor<1x1x1x1x8x4xi32>?

@newling
Contributor Author

newling commented Sep 9, 2024

Oops, here are the maps:

#map = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1 + d4, d6, d2 + d5, d8)>
#map1 = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d4, d5, d6, d3, d8, d7)>
#map2 = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d2, d3, d7)>

As an aside, note that %extracted_slice_11 and %extracted_slice_12 are contiguous slices, which is exactly what our motivation for packing was in the first place. So that's good.

Now let's look more closely at the linalg.generic (I'm going to try to convince you that it's just a matmul...).

Copying the inner loop from before in a more readable way:

%extracted_slice_11 = tensor.extract_slice 
                                    %pack[0, %arg9, %arg13, %arg11, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1] : 
                                    tensor<1x3x4x6x8xi32> to tensor<1x1x1x4x8xi32>
%extracted_slice_12 = tensor.extract_slice 
                                    %pack_9[%arg9, %arg11, %arg13, 0, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] : 
                                    tensor<3x3x4x1x8x4xi32> to tensor<1x1x1x1x8x4xi32>
%24 = linalg.generic {indexing_maps = [#map, #map1, #map2], 
         iterator_types = ["p", "p", "p", "p", "r", "r", "r", "p", "r"]} 
         ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>)
         outs(%arg14 : tensor<1x1x4x1x4xi32>) 
         attrs =  {lowering_config = #config, packing_config = #packingconfig} {
^bb0(%in: i32, %in_13: i32, %out: i32):
  %25 = arith.muli %in, %in_13 : i32
  %26 = arith.addi %out, %25 : i32
  linalg.yield %26 : i32
} -> tensor<1x1x4x1x4xi32>

What are the loop counts for each of the dimensions d0 through d8? Matching the dimensions to the tensors, we see:

d0: 1 (first dimension of %arg14 is size 1)
d1: 1 (second dimension of %arg14 is size 1)
d2: 4 (third dimension of %arg14 is size 4)
d3: 1 (fourth dimension of %arg14 is size 1)
d4: 1 (first dimension of %extracted_slice_12)
d5: 1 (second dimension of %extracted_slice_12)
d6: 1 (third dimension of %extracted_slice_12)
d7: 4 (final dimension of %arg14)
d8: 8 (final dimension of %extracted_slice_11)

What's interesting here is that d4 and d5 have a loop count of one. So #map is actually trivial, because it is effectively just

#map = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d6, d2, d8)>
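Dropping all of the unit-extent dimensions (d0, d1, d3, d4, d5, d6) and writing m = d2, n = d7, k = d8, the three maps reduce to the following (a sketch by hand, not compiler output; the #lhsMap/#rhsMap/#accMap names are made up):

// Sketch: unit dims folded away, with m = d2, n = d7, k = d8.
#lhsMap = affine_map<(m, n, k) -> (m, k)>   // was (d0, d1, d6, d2, d8)
#rhsMap = affine_map<(m, n, k) -> (k, n)>   // was (d4, d5, d6, d3, d8, d7)
#accMap = affine_map<(m, n, k) -> (m, n)>   // was (d0, d1, d2, d3, d7)

which is exactly a matmul contraction.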

@newling
Contributor Author

newling commented Sep 9, 2024

So I think the one remaining problem for vectorization is to get the compiler to canonicalize this linalg.generic and then recognise that it's just a matmul.

@newling
Contributor Author

newling commented Sep 10, 2024

> So I think the one remaining problem for vectorization is to get the compiler to canonicalize this linalg.generic and then recognise that it's just a matmul.

Update on this: I have a WIP pass which eliminates the singleton dimensions so that after the pass the linalg.generic is clearly a matmul, but vectorization now introduces a broadcast before the vector.contract. Investigating...

@MaheshRavishankar
Collaborator

I think I understand what is happening here. Can you post the method you are using to drop the unit dimensions? There is an upstream method that allows you to drop unit dimensions and also control which dimensions are dropped. If you are using this op

%24 = linalg.generic {indexing_maps = [#map, #map1, #map2], 
         iterator_types = ["p", "p", "p", "p", "r", "r", "r", "p", "r"]} 
         ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>)
         outs(%arg14 : tensor<1x1x4x1x4xi32>) 
         attrs =  {lowering_config = #config, packing_config = #packingconfig} {
^bb0(%in: i32, %in_13: i32, %out: i32):
  %25 = arith.muli %in, %in_13 : i32
  %26 = arith.addi %out, %25 : i32
  linalg.yield %26 : i32
} -> tensor<1x1x4x1x4xi32>

dropping the inner unit dimension of tensor<1x1x4x1x4xi32> is probably causing the issue. You should be able to control which dimensions you drop. But before that, why is the result not tensor<1x1x1x4x4xi32>?

@MaheshRavishankar
Collaborator

Or if you post the IR after vectorization, that will give some clues.

@yzhang93
Contributor

> So I think the one remaining problem for vectorization is to get the compiler to canonicalize this linalg.generic and then recognise that it's just a matmul.
>
> Update on this: I have a WIP pass which eliminates the singleton dimensions so that after the pass the linalg.generic is clearly a matmul, but vectorization now introduces a broadcast before the vector.contract. Investigating...

I think it previously had a compilation issue when lowered to vector.broadcast, but it's good to check whether aievec can handle vector.broadcast now.

@newling
Contributor Author

newling commented Sep 11, 2024

Just noticed your comment @MaheshRavishankar; linalg-fold-unit-extent-dims works perfectly, thank you!

The pass I've written is basically the same as linalg-fold-unit-extent-dims but uses tensor.extract_slice instead of tensor.expand_shape, and for some reason comprehensive bufferization fails with the extract_slice approach.
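For reference, a rough sketch of the two rank-reduction styles on the LHS operand (illustrative SSA names, shapes, and reassociation; as I understand it, upstream folds inputs with tensor.collapse_shape and rebuilds results with tensor.expand_shape, whereas my pass uses rank-reducing slices):

// Reshape-based folding (upstream linalg-fold-unit-extent-dims), sketch only:
%collapsed = tensor.collapse_shape %lhs [[0, 1, 2, 3], [4]]
    : tensor<1x1x1x4x8xi32> into tensor<4x8xi32>
// Slice-based folding (the tensor.extract_slice flavour described above), sketch only:
%sliced = tensor.extract_slice %lhs[0, 0, 0, 0, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1]
    : tensor<1x1x1x4x8xi32> to tensor<4x8xi32>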

Removing all unit dimensions is exactly what I want. It isn't enough to just remove the reduction dimensions, because then the broadcasts don't get eliminated (vector.contract verifies that all dimensions appear in either the LHS or the RHS).
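To illustrate, a hedged sketch (hand-written, assuming the m=n=4, k=8 shapes from above; %lhs/%rhs/%acc are hypothetical) of the contraction we want to end up with once every unit dimension is gone, so that each dimension appears on the LHS or the RHS and no broadcast is needed:

// Sketch only: the vector.contract expected after all unit dims are folded.
%res = vector.contract {
         indexing_maps = [affine_map<(m, n, k) -> (m, k)>,
                          affine_map<(m, n, k) -> (k, n)>,
                          affine_map<(m, n, k) -> (m, n)>],
         iterator_types = ["parallel", "parallel", "reduction"],
         kind = #vector.kind<add>}
       %lhs, %rhs, %acc : vector<4x8xi32>, vector<8x4xi32> into vector<4x4xi32>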
