[Tiling] Packing for convolution #756

Open
newling opened this issue Sep 9, 2024 · 10 comments

@newling
Contributor

newling commented Sep 9, 2024

Use packing between L2 and L1 for convolution.

Using upstream MLIR packing, I get the following:

func.func @conv_2d_nhwc_hwcf_dispatch_0_conv_2d_nhwc_hwcf_2x12x12x64x3x3x32_i32() attributes {translation_info = #translation} {
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %c3 = arith.constant 3 : index
  %c0_i32 = arith.constant 0 : i32
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan layout(#pipeline_layout) set(0) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<2x14x14x32xi32>>
  %1 = hal.interface.binding.subspan layout(#pipeline_layout) set(0) binding(1) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<3x3x32x64xi32>>
  %2 = hal.interface.binding.subspan layout(#pipeline_layout) set(0) binding(2) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<2x12x12x64xi32>>
  %3 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0, 0], sizes = [2, 14, 14, 32], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<2x14x14x32xi32>> -> tensor<2x14x14x32xi32>
  %4 = flow.dispatch.tensor.load %1, offsets = [0, 0, 0, 0], sizes = [3, 3, 32, 64], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<3x3x32x64xi32>> -> tensor<3x3x32x64xi32>
  %5 = flow.dispatch.tensor.load %2, offsets = [0, 0, 0, 0], sizes = [2, 12, 12, 64], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<2x12x12x64xi32>> -> tensor<2x12x12x64xi32>
  %6 = scf.forall (%arg0, %arg1, %arg2) = (0, 0, 0) to (12, 12, 64) step (4, 4, 4) shared_outs(%arg3 = %5) -> (tensor<2x12x12x64xi32>) {
    %extracted_slice = tensor.extract_slice %3[0, %arg0, %arg1, 0] [2, 6, 6, 32] [1, 1, 1, 1] : tensor<2x14x14x32xi32> to tensor<2x6x6x32xi32>
    %extracted_slice_0 = tensor.extract_slice %4[0, 0, 0, %arg2] [3, 3, 32, 4] [1, 1, 1, 1] : tensor<3x3x32x64xi32> to tensor<3x3x32x4xi32>
    %extracted_slice_1 = tensor.extract_slice %arg3[0, %arg0, %arg1, %arg2] [2, 4, 4, 4] [1, 1, 1, 1] : tensor<2x12x12x64xi32> to tensor<2x4x4x4xi32>
    %7 = bufferization.alloc_tensor() : tensor<2x6x6x32xi32>
    %alloc = memref.alloc() : memref<2x6x6x32xi32, 1 : i32>
    %8 = bufferization.to_tensor %alloc restrict writable : memref<2x6x6x32xi32, 1 : i32>
    %9 = linalg.copy ins(%extracted_slice : tensor<2x6x6x32xi32>) outs(%8 : tensor<2x6x6x32xi32>) -> tensor<2x6x6x32xi32>
    %10 = bufferization.alloc_tensor() : tensor<3x3x32x4xi32>
    %alloc_2 = memref.alloc() : memref<3x3x32x4xi32, 1 : i32>
    %11 = bufferization.to_tensor %alloc_2 restrict writable : memref<3x3x32x4xi32, 1 : i32>
    %12 = linalg.copy ins(%extracted_slice_0 : tensor<3x3x32x4xi32>) outs(%11 : tensor<3x3x32x4xi32>) -> tensor<3x3x32x4xi32>
    %13 = bufferization.alloc_tensor() : tensor<2x4x4x4xi32>
    %alloc_3 = memref.alloc() : memref<2x4x4x4xi32, 1 : i32>
    %14 = bufferization.to_tensor %alloc_3 restrict writable : memref<2x4x4x4xi32, 1 : i32>
    %15 = scf.forall (%arg4, %arg5, %arg6, %arg7) = (0, 0, 0, 0) to (2, 4, 4, 4) step (1, 1, 4, 4) shared_outs(%arg8 = %14) -> (tensor<2x4x4x4xi32>) {
      %extracted_slice_4 = tensor.extract_slice %9[%arg4, %arg5, %arg6, 0] [1, 3, 6, 32] [1, 1, 1, 1] : tensor<2x6x6x32xi32> to tensor<1x3x6x32xi32>
      %extracted_slice_5 = tensor.extract_slice %12[0, 0, 0, %arg7] [3, 3, 32, 4] [1, 1, 1, 1] : tensor<3x3x32x4xi32> to tensor<3x3x32x4xi32>
      %extracted_slice_6 = tensor.extract_slice %arg8[%arg4, %arg5, %arg6, %arg7] [1, 1, 4, 4] [1, 1, 1, 1] : tensor<2x4x4x4xi32> to tensor<1x1x4x4xi32>
      %alloc_7 = memref.alloc() : memref<1x3x4x6x8xi32, 2 : i32>
      %17 = bufferization.to_tensor %alloc_7 restrict writable : memref<1x3x4x6x8xi32, 2 : i32>
      %pack = tensor.pack %extracted_slice_4 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [3] inner_tiles = [8] into %17 : tensor<1x3x6x32xi32> -> tensor<1x3x4x6x8xi32>
      %alloc_8 = memref.alloc() : memref<3x3x4x1x8x4xi32, 2 : i32>
      %18 = bufferization.to_tensor %alloc_8 restrict writable : memref<3x3x4x1x8x4xi32, 2 : i32>
      %pack_9 = tensor.pack %extracted_slice_5 outer_dims_perm = [0, 1, 2, 3] inner_dims_pos = [2, 3] inner_tiles = [8, 4] into %18 : tensor<3x3x32x4xi32> -> tensor<3x3x4x1x8x4xi32>
      %alloc_10 = memref.alloc() : memref<1x1x4x1x4xi32, 2 : i32>
      %19 = bufferization.to_tensor %alloc_10 restrict writable : memref<1x1x4x1x4xi32, 2 : i32>
      %20 = linalg.fill ins(%c0_i32 : i32) outs(%19 : tensor<1x1x4x1x4xi32>) -> tensor<1x1x4x1x4xi32>
      %21 = scf.for %arg9 = %c0 to %c3 step %c1 iter_args(%arg10 = %20) -> (tensor<1x1x4x1x4xi32>) {
        %22 = scf.for %arg11 = %c0 to %c3 step %c1 iter_args(%arg12 = %arg10) -> (tensor<1x1x4x1x4xi32>) {
          %23 = scf.for %arg13 = %c0 to %c4 step %c1 iter_args(%arg14 = %arg12) -> (tensor<1x1x4x1x4xi32>) {
            %extracted_slice_11 = tensor.extract_slice %pack[0, %arg9, %arg13, %arg11, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1] : tensor<1x3x4x6x8xi32> to tensor<1x1x1x4x8xi32>
            %extracted_slice_12 = tensor.extract_slice %pack_9[%arg9, %arg11, %arg13, 0, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] : tensor<3x3x4x1x8x4xi32> to tensor<1x1x1x1x8x4xi32>
            %24 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel", "parallel", "parallel", "reduction", "reduction", "reduction", "parallel", "reduction"]} ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>) outs(%arg14 : tensor<1x1x4x1x4xi32>) attrs =  {lowering_config = #config, packing_config = #packingConfig} {
            ^bb0(%in: i32, %in_13: i32, %out: i32):
              %25 = arith.muli %in, %in_13 : i32
              %26 = arith.addi %out, %25 : i32
              linalg.yield %26 : i32
            } -> tensor<1x1x4x1x4xi32>
            scf.yield %24 : tensor<1x1x4x1x4xi32>
          }
          scf.yield %23 : tensor<1x1x4x1x4xi32>
        }
        scf.yield %22 : tensor<1x1x4x1x4xi32>
      }
      %unpack = tensor.unpack %21 inner_dims_pos = [3] inner_tiles = [4] into %extracted_slice_6 : tensor<1x1x4x1x4xi32> -> tensor<1x1x4x4xi32>
      memref.dealloc %alloc_7 : memref<1x3x4x6x8xi32, 2 : i32>
      memref.dealloc %alloc_8 : memref<3x3x4x1x8x4xi32, 2 : i32>
      memref.dealloc %alloc_10 : memref<1x1x4x1x4xi32, 2 : i32>
      scf.forall.in_parallel {
        tensor.parallel_insert_slice %unpack into %arg8[%arg4, %arg5, %arg6, %arg7] [1, 1, 4, 4] [1, 1, 1, 1] : tensor<1x1x4x4xi32> into tensor<2x4x4x4xi32>
      }
    } {mapping = [#gpu.thread<y>, #gpu.thread<x>, #gpu.thread<z>, #gpu.thread<linear_dim_0>]}
    %16 = linalg.copy ins(%15 : tensor<2x4x4x4xi32>) outs(%extracted_slice_1 : tensor<2x4x4x4xi32>) -> tensor<2x4x4x4xi32>
    memref.dealloc %alloc : memref<2x6x6x32xi32, 1 : i32>
    memref.dealloc %alloc_2 : memref<3x3x32x4xi32, 1 : i32>
    memref.dealloc %alloc_3 : memref<2x4x4x4xi32, 1 : i32>
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %16 into %arg3[0, %arg0, %arg1, %arg2] [2, 4, 4, 4] [1, 1, 1, 1] : tensor<2x4x4x4xi32> into tensor<2x12x12x64xi32>
    }
  } {mapping = [#gpu.block<y>, #gpu.block<x>, #gpu.block<z>]}
  flow.dispatch.tensor.store %6, %2, offsets = [0, 0, 0, 0], sizes = [2, 12, 12, 64], strides = [1, 1, 1, 1] : tensor<2x12x12x64xi32> -> !flow.dispatch.tensor<writeonly:tensor<2x12x12x64xi32>>
  return
}

So I currently think we don't need any modification to upstream MLIR.

I'll post a PR with my packing config later. It currently causes other issues to crop up (in AIR: out of tile memory; in objectFifo: a different crash).

@yzhang93
Contributor

yzhang93 commented Sep 9, 2024

I think the main concern is whether the generated data layout can be vectorized.

And the linalg.generic's operands seem to be problematic to me.
ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>) outs(%arg14 : tensor<1x1x4x1x4xi32>)

@newling
Contributor Author

newling commented Sep 9, 2024

> I think the main concern is whether the generated data layout can be vectorized.

?

> And the linalg.generic's operands seem to be problematic to me. ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>) outs(%arg14 : tensor<1x1x4x1x4xi32>)

It looks ok to me; can you explain your concern? I think it's just a matmul with m=n=4, k=8 on contiguous slices of L1 allocations.
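For concreteness, a minimal sketch of the plain matmul I believe this should canonicalize to (illustrative only; the %lhs, %rhs, %acc names and the collapsed tensor types are hypothetical, not compiler output):

// Sketch only: a 4x8 * 8x4 i32 matmul accumulating into a 4x4 tile.
%mm = linalg.matmul ins(%lhs, %rhs : tensor<4x8xi32>, tensor<8x4xi32>)
                    outs(%acc : tensor<4x4xi32>) -> tensor<4x4xi32>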

@yzhang93
Contributor

yzhang93 commented Sep 9, 2024

I didn't see the indexing maps for the linalg.generic operands, so I'm not sure what has been done. Is there an implicit collapse of dimensions for tensor<1x1x1x1x8x4xi32>?

@newling
Contributor Author

newling commented Sep 9, 2024

Oops, here are the maps:

#map = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1 + d4, d6, d2 + d5, d8)>
#map1 = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d4, d5, d6, d3, d8, d7)>
#map2 = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d2, d3, d7)>

As an aside, note that %extracted_slice_11 and %extracted_slice_12 are contiguous slices, which is exactly what our motivation for packing was in the first place. So that's good.

Now let's look more closely at the linalg.generic (I'm going to try to convince you that it's just a matmul...).

Copying the inner loop from before in a more readable way:

%extracted_slice_11 = tensor.extract_slice 
                                    %pack[0, %arg9, %arg13, %arg11, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1] : 
                                    tensor<1x3x4x6x8xi32> to tensor<1x1x1x4x8xi32>
%extracted_slice_12 = tensor.extract_slice 
                                    %pack_9[%arg9, %arg11, %arg13, 0, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] : 
                                    tensor<3x3x4x1x8x4xi32> to tensor<1x1x1x1x8x4xi32>
%24 = linalg.generic {indexing_maps = [#map, #map1, #map2], 
         iterator_types = ["p", "p", "p", "p", "r", "r", "r", "p", "r"]} 
         ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>)
         outs(%arg14 : tensor<1x1x4x1x4xi32>) 
         attrs =  {lowering_config = #config, packing_config = #packingconfig} {
^bb0(%in: i32, %in_13: i32, %out: i32):
  %25 = arith.muli %in, %in_13 : i32
  %26 = arith.addi %out, %25 : i32
  linalg.yield %26 : i32
} -> tensor<1x1x4x1x4xi32>

What are the loop counts for each of the dimensions d0 through d8? Matching the dimensions to the tensors, we see:

d0: 1 (first dimension of %arg14 is size 1)
d1: 1 (second dimension of %arg14 is size 1)
d2: 4 (third dimension of %arg14 is size 4)
d3: 1 (fourth dimension of %arg14 is size 1)
d4: 1 (first dimension of %extracted_slice_12)
d5: 1 (second dimension of %extracted_slice_12)
d6: 1 (third dimension of %extracted_slice_12)
d7: 4 (final dimension of %arg14)
d8: 8 (final dimension of %extracted_slice_11)

What's interesting here is that d4 and d5 have a loop count of one. So #map is actually trivial, because it is effectively just

#map = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d6, d2, d8)>
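Dropping all of the unit-extent dimensions (d0, d1, d3, d4, d5, d6) and writing m = d2, n = d7, k = d8, the three maps reduce to the following (a sketch by hand, not compiler output; the #lhsMap/#rhsMap/#accMap names are made up):

// Sketch: unit dims folded away, with m = d2, n = d7, k = d8.
#lhsMap = affine_map<(m, n, k) -> (m, k)>   // was (d0, d1, d6, d2, d8)
#rhsMap = affine_map<(m, n, k) -> (k, n)>   // was (d4, d5, d6, d3, d8, d7)
#accMap = affine_map<(m, n, k) -> (m, n)>   // was (d0, d1, d2, d3, d7)

which is exactly a matmul contraction.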

@newling
Contributor Author

newling commented Sep 9, 2024

So I think the one remaining problem for vectorization is to get the compiler to canonicalize this linalg.generic and then recognise that it's just a matmul.

@newling
Contributor Author

newling commented Sep 10, 2024

> So I think the one remaining problem for vectorization is to get the compiler to canonicalize this linalg.generic and then recognise that it's just a matmul.

Update on this: I have a WIP pass which eliminates the singleton dimensions so that after the pass the linalg.generic is clearly a matmul, but vectorization now introduces a broadcast before the vector.contract. Investigating...

@MaheshRavishankar
Collaborator

I think I understand what is happening here. Can you post the method you are using to drop the unit dimensions? There is an upstream method that allows you to drop unit dimensions and also control which dimensions are dropped. If you are using this op

%24 = linalg.generic {indexing_maps = [#map, #map1, #map2], 
         iterator_types = ["p", "p", "p", "p", "r", "r", "r", "p", "r"]} 
         ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>)
         outs(%arg14 : tensor<1x1x4x1x4xi32>) 
         attrs =  {lowering_config = #config, packing_config = #packingconfig} {
^bb0(%in: i32, %in_13: i32, %out: i32):
  %25 = arith.muli %in, %in_13 : i32
  %26 = arith.addi %out, %25 : i32
  linalg.yield %26 : i32
} -> tensor<1x1x4x1x4xi32>

dropping the inner unit dimension of tensor<1x1x4x1x4xi32> is probably causing the issue. You should be able to control which dimensions you drop. But before that, why is the result not tensor<1x1x1x4x4xi32>?

@MaheshRavishankar
Collaborator

Or if you post the IR after vectorization, that will give some clues.

@yzhang93
Contributor

> So I think the one remaining problem for vectorization is to get the compiler to canonicalize this linalg.generic and then recognise that it's just a matmul.
>
> Update on this: I have a WIP pass which eliminates the singleton dimensions so that after the pass the linalg.generic is clearly a matmul, but vectorization now introduces a broadcast before the vector.contract. Investigating...

I think it previously had a compilation issue when lowered to vector.broadcast, but it's good to check whether aievec can handle vector.broadcast now.

@newling
Contributor Author

newling commented Sep 11, 2024

Just noticed your comment @MaheshRavishankar; linalg-fold-unit-extent-dims works perfectly, thank you!

The pass I've written is basically the same as linalg-fold-unit-extent-dims but uses tensor.extract_slice instead of tensor.expand_shape, and for some reason comprehensive bufferization fails with the extract_slice approach.
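For reference, a rough sketch of the two rank-reduction styles on the LHS operand (illustrative SSA names, shapes, and reassociation; as I understand it, upstream folds inputs with tensor.collapse_shape and rebuilds results with tensor.expand_shape, whereas my pass uses rank-reducing slices):

// Reshape-based folding (upstream linalg-fold-unit-extent-dims), sketch only:
%collapsed = tensor.collapse_shape %lhs [[0, 1, 2, 3], [4]]
    : tensor<1x1x1x4x8xi32> into tensor<4x8xi32>
// Slice-based folding (the tensor.extract_slice flavour described above), sketch only:
%sliced = tensor.extract_slice %lhs[0, 0, 0, 0, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1]
    : tensor<1x1x1x4x8xi32> to tensor<4x8xi32>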

Removing all unit dimensions is exactly what I want. It isn't enough to just remove the reduction dimensions, because then the broadcasts don't get eliminated (vector.contract verifies that all dimensions appear in either the LHS or the RHS).
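To illustrate, a hedged sketch (hand-written, assuming the m=n=4, k=8 shapes from above; %lhs/%rhs/%acc are hypothetical) of the contraction we want to end up with once every unit dimension is gone, so that each dimension appears on the LHS or the RHS and no broadcast is needed:

// Sketch only: the vector.contract expected after all unit dims are folded.
%res = vector.contract {
         indexing_maps = [affine_map<(m, n, k) -> (m, k)>,
                          affine_map<(m, n, k) -> (k, n)>,
                          affine_map<(m, n, k) -> (m, n)>],
         iterator_types = ["parallel", "parallel", "reduction"],
         kind = #vector.kind<add>}
       %lhs, %rhs, %acc : vector<4x8xi32>, vector<8x4xi32> into vector<4x4xi32>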
