Split L2 input and output objectFifos for memTile/shimTile distribution #903

yzhang93 · 2024-11-13T23:32:25Z

This is the first PR needed to enable a 4x4 AIE array. The L2 objectFifos are split so that they can be distributed across multiple memTiles and shimTiles, which makes more channels available. The shim/mem tile reassignment will be addressed in a separate PR.

Also note that this PR doesn't change or combine the previous pass that splits the third input (elementwise) for connection reuse with matmul ops. What's unclear to me is what the matmul-elementwise path will look like for 4x4 AIE cores. If the existing logic is kept, we'll have to split the elementwise op into 16 objectFifos, which will be a big challenge given the number of shimTile channels. I'd rather leave it as a separate thing for now until we have a clear path forward.

@jtuyls I've simplified the functions a bit based on your initial version. Feel free to make modifications/push new commits.

newling

I think I understand the basic idea of the pass; thanks for the clear comments and code. I'll try to take another look later today to better understand the details.

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/test/split_logicalobjfifos.mlir

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/Passes.td

...gins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIESplitLogicalObjFifosForConnectionReuse.cpp

newling · 2024-11-15T20:13:01Z

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp

+    return op.emitOpError() << "expected objectFifo shape larger than 2";
+  }
+  size_t splitDim;
+  if (memrefShape[0] != 1) {


Is it intentional to not split on both dimensions if neither of them is one? If this is run on an array with 1 row or 1 column, it will fail; maybe don't fail?
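To make the question concrete, here is a minimal sketch of the selection logic as it can be inferred from the diff fragments in this review. The helper name `pickSplitDim` is hypothetical, and the sentinel return value is modeled on the `splitDim == memrefShape.size()` check quoted elsewhere in this thread; the actual pass may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the pass's actual code): choose which of the two
// outer dimensions of an L2 objectFifo shape to split on.
// Returns memrefShape.size() as a "nothing to split" sentinel when both
// outer dims are 1.
size_t pickSplitDim(const std::vector<int64_t> &memrefShape) {
  // Assumed precondition: the two outer dims must exist.
  assert(memrefShape.size() >= 2 && "need two outer dims to choose from");
  if (memrefShape[0] != 1) return 0;  // dim 0 wins whenever it is not 1
  if (memrefShape[1] != 1) return 1;
  return memrefShape.size();
}
```

Under this sketch a shape such as 4x4x32x32 is split on dim 0 only, which is the behavior the question above is about.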

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/test/split_logicalobjfifos.mlir

newling · 2024-11-15T20:35:46Z

It would be nice to include a motivation for this in the description. I imagine it is to split tensors equally across all memory tiles, rather than putting them entirely on one? If this is the case, are you assuming nrows = ncols in choosing splitFactor? If the array is 4x4 and you have

 %alloc_0 = memref.alloc() : memref<4x1x32x32xi32, 1 : i32>
 %alloc_1 = memref.alloc() : memref<1x4x32x32xi32, 1 : i32>

I guess it works nicely. But if the array is 4x8 (8 cols) and the allocs are 4x1x... and 1x8x..., it's not optimal?

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp

yzhang93 · 2024-11-18T20:43:03Z

> It would be nice to include a motivation for this in the description. I imagine it is to split tensors equally across all memory tiles, rather than putting them entirely on one? If this is the case, are you assuming nrows = ncols in choosing splitFactor? If the array is 4x4 and you have
>
>  %alloc_0 = memref.alloc() : memref<4x1x32x32xi32, 1 : i32>
>  %alloc_1 = memref.alloc() : memref<1x4x32x32xi32, 1 : i32>
>
> I guess it works nicely. But if the array is 4x8 (8 cols) and the allocs are 4x1x... and 1x8x..., it's not optimal?

Yes, currently it aims for balanced use of rows and columns (e.g., 2x2/4x4), and I added basic support for that. It should also work for unbalanced use of columns/rows, such as A: 4x1x... and B: 1x2x..., which will split A into 4 separate objectFifos and B into 2 objectFifos, but we can't control how these are distributed in this pass. I think the tiling strategy and the tile assignment strategy should be responsible for the distribution part.
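As a rough illustration of the counts mentioned in this reply, the sketch below computes how many objectFifos result when the first outer dim that is not 1 is split fully. The function name `numSplitObjectFifos` is hypothetical, and the rule (split factor equal to the extent of the chosen dim) is taken from the diff fragment `splitFactor = memrefShape[splitDim]`, so treat it as an approximation of the pass rather than its implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of objectFifos produced when the first outer
// dim that is not 1 is split fully (split factor == extent of that dim).
int64_t numSplitObjectFifos(const std::vector<int64_t> &shape) {
  assert(shape.size() >= 2 && "need two outer dims");
  if (shape[0] != 1) return shape[0];
  if (shape[1] != 1) return shape[1];
  return 1;  // both outer dims are 1: the objectFifo is left as is
}
```

For A: 4x1x32x32 this gives 4 and for B: 1x2x32x32 it gives 2, matching the numbers above. Where those pieces end up is outside the scope of this sketch, as it is outside the scope of the pass.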

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp

jtuyls · 2024-11-18T20:44:47Z

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp

+  FailureOr<bool> l2DmaTransposed = isL2DmaTransposed(op, isL2Target);
+  if (failed(l2DmaTransposed)) return failure();


Why is this needed?

It's because the shape of the objectFifo and the target sizes of the DmaCpyNdOp can differ if we select the transpose-on-target option in the ConvertToDma pass. The splitDim we get earlier in this function is only the split dim of the objectFifo, not the split dim for the target sizes in the DmaCpyNdOp.

For example, in

 %3 = amdaie.dma_cpy_nd(%0[0, 0, 0, 0] [1, 32, 2, 32] [2048, 32, 1024, 1], %1[0, %2] [32, 64] [128, 1]) : (!amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)

the splitDim for the objectFifo is index 1, while the splitDimTarget should be index 2.
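One way to see why the two indices differ is to compare strides. The sketch below is not how the pass computes the target split dim (the review calls that logic hardcoded); it is an illustrative alternative that assumes the objectFifo memref is row-major and looks for the DMA dimension with the same stride as the objectFifo's split dim. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major (contiguous) strides for a static shape.
std::vector<int64_t> rowMajorStrides(const std::vector<int64_t> &shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  for (int i = static_cast<int>(shape.size()) - 2; i >= 0; --i)
    strides[i] = strides[i + 1] * shape[i + 1];
  return strides;
}

// Hypothetical helper: find the DMA access dimension that walks the same
// memory stride as `splitDim` of the objectFifo shape.
// Returns dmaStrides.size() if no DMA dimension has that stride.
size_t findSplitDimInDma(const std::vector<int64_t> &fifoShape,
                         size_t splitDim,
                         const std::vector<int64_t> &dmaStrides) {
  assert(splitDim < fifoShape.size() && "splitDim out of range");
  const int64_t wanted = rowMajorStrides(fifoShape)[splitDim];
  for (size_t i = 0; i < dmaStrides.size(); ++i)
    if (dmaStrides[i] == wanted) return i;  // first match wins
  return dmaStrides.size();
}
```

With the objectFifo shape 1x2x32x32 (row-major strides 2048, 1024, 32, 1) and the DMA strides [2048, 32, 1024, 1] from the example, split dim 1 has stride 1024, which sits at index 2 of the DMA access pattern.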

jtuyls · 2024-11-18T20:50:22Z

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp

+
+  // Both outer dims are 1, no need to split, return success.
+  if (splitDim == memrefShape.size()) return success();
+  int64_t splitFactor = memrefShape[splitDim];


The splitting factor should really be derived from the number of columns soon, as we might create larger L2 buffers that need to be split by a smaller factor ([1, 16, 32, 32] split into 4 x [1, 4, 32, 32], for example). Might be ok to leave for future work though.
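A small sketch of the kind of split described here, where the factor is smaller than the extent of the split dim. The struct `SplitPiece` and the function `splitAlongDim` are hypothetical names for illustration; the sketch only computes shapes and offsets and does not model the rewriting of DMA ops.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One piece of a split buffer (hypothetical type, for illustration only).
struct SplitPiece {
  std::vector<int64_t> shape;  // shape of this piece
  int64_t offset;              // start index along the split dimension
};

// Split `shape` along `splitDim` into `splitFactor` equal pieces.
// Returns an empty vector if the extent is not divisible by the factor.
std::vector<SplitPiece> splitAlongDim(const std::vector<int64_t> &shape,
                                      size_t splitDim, int64_t splitFactor) {
  std::vector<SplitPiece> pieces;
  if (splitDim >= shape.size() || splitFactor <= 0 ||
      shape[splitDim] % splitFactor != 0)
    return pieces;
  const int64_t chunk = shape[splitDim] / splitFactor;
  for (int64_t i = 0; i < splitFactor; ++i) {
    SplitPiece piece{shape, i * chunk};
    piece.shape[splitDim] = chunk;  // shrink only the split dimension
    pieces.push_back(piece);
  }
  return pieces;
}
```

Calling `splitAlongDim({1, 16, 32, 32}, 1, 4)` yields four pieces of shape [1, 4, 32, 32] at offsets 0, 4, 8 and 12 along dim 1.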

This PR made some necessary changes in `LogicalObjFifoSplittingUtils` for matmul-elementwise ops as preparation for another splitting usage in #903. Specifically, the utility that checks whether the L3->L2 DMA is transposed on the L2 side was rewritten.

Co-authored-by: James Newling <[email protected]>

jtuyls · 2024-11-20T00:52:41Z

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp

+}
+
+/// Utility to get the split dimension and factor from a L3->L2 dma op.
+LogicalResult getSplitDimAndFactorFromDma(AMDAIE::DmaCpyNdOp op,


This getSplitDim... function and the other one above are quite hardcoded, but as discussed offline I agree it's ok for now to get an initial 4x4 working. We will start generalizing this soon in follow-ups.

newling · 2024-11-20T15:32:00Z

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp

+/// Split L2 space input and output logical objectFifos.
+LogicalResult splitLogicalObjectFifo(IRRewriter &rewriter,
+                                     AMDAIE::LogicalObjectFifoFromMemrefOp op,
+                                     int64_t &splitDim, int64_t &splitFactor) {


Suggested change

-                                     int64_t &splitDim, int64_t &splitFactor) {
+                                     int64_t splitDim, int64_t splitFactor) {

newling

> Yes, currently it aims for balanced use of rows and columns (e.g., 2x2/4x4), and I added basic support for that

It would be great if you could add a check for this in the code. My only remaining requests are:

- document the assumed relationship between the first 2 dims of tensors and nrows, ncols
- add tests/safeguards for the cases nrows != ncols

newling · 2024-11-20T15:38:25Z

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/Passes.td

+  let summary = "Pass to split L2 buffers to distribute on multiple shimTiles and memTiles.";
+  let description = [{
+    Splitting L2 input and output logical objectFifos and their user dma operations
+    by a factor of the number of AIE columns being used, so that the logical objectFifos


> by a factor of the number of AIE columns being used

This isn't the case when the number of columns is different to the number of rows.

No, but there's still a relationship with the number of columns. Suppose rows: 4, cols: 2 and a different input being broadcasted to each row, then you likely still want to split it in 2 (number of columns), instead of 4.

Exactly, that makes sense. Currently though it'd get split 4 ways afaict (I know the plan is to revisit this which is fine by me).

> Currently though it'd get split 4 ways afaict

Yes, indeed, and that's a 'valid' splitting, however, it will likely fail because of not enough channels. 'valid' meaning that it's not intrinsically incorrect, but currently we might not be able to find the channel resources to generate code for it.
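The difference between the two behaviors in this exchange can be sketched as two split-factor policies. `currentSplitFactor` follows the diff fragment `splitFactor = memrefShape[splitDim]`; `columnBoundSplitFactor` is a purely hypothetical policy (greatest common divisor of the extent and the number of columns) chosen here only to illustrate "split by the number of columns", not something the PR implements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Greatest common divisor via Euclid's algorithm.
int64_t gcd64(int64_t a, int64_t b) {
  while (b != 0) {
    int64_t t = a % b;
    a = b;
    b = t;
  }
  return a;
}

// Policy matching the diff fragment: split fully along the chosen dim.
int64_t currentSplitFactor(const std::vector<int64_t> &shape, size_t splitDim) {
  assert(splitDim < shape.size() && "splitDim out of range");
  return shape[splitDim];
}

// Hypothetical policy: never split into more pieces than there are columns,
// while keeping the pieces equally sized.
int64_t columnBoundSplitFactor(const std::vector<int64_t> &shape,
                               size_t splitDim, int64_t numCols) {
  assert(splitDim < shape.size() && "splitDim out of range");
  assert(numCols > 0 && "need at least one column");
  return gcd64(shape[splitDim], numCols);
}
```

For the rows: 4, cols: 2 case with an input of shape 4x1x32x32, the first policy splits 4 ways and the second splits 2 ways.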

newling · 2024-11-20T15:39:43Z

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/Passes.td

+    by a factor of the number of AIE columns being used, so that the logical objectFifos
+    can be distributed on multiple shimTiles/memTiles.
+
+    For example, A matrix is distributed on a 2x2 AIE array, with L2 buffer size


Suggested change

-    For example, A matrix is distributed on a 2x2 AIE array, with L2 buffer size
+    For example, for the case of a matmul C = A @ B, the A matrix is distributed on a 2x2 AIE array, with L2 buffer size

Please mention the assumed relationship between the first 2 dimensions of tensors in L2 and nb_cols and nb_rows

newling · 2024-11-20T16:23:32Z

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp

+
+  // Create `splitFactor` number of doubly stride ops.
+  rewriter.setInsertionPoint(op);
+  for (int i = 0; i < splitFactor; ++i) {


Not blocking this PR on this question, but trying to understand how we can generalize this.

It seems like @yzhang93 you are assuming that the sizes of the DMA are exactly the same as the shape of the objectFifo's memref. This needn't be the case, right? They can be completely different? (@jtuyls)

Yeah, that's not necessarily the case, and it soon won't be valid when we cache larger buffers on L2, but I think it's ok to assume that for now to start off. We should make sure, though, to fail if the assumptions don't hold.
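A guard of the kind suggested here could look roughly like the sketch below. It assumes that "same size" should tolerate a transposed access pattern, so it checks that the DMA sizes are a permutation of the objectFifo shape; that interpretation and the name `dmaSizesMatchFifoShape` are assumptions for illustration, not taken from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical guard: true if the DMA sizes are a permutation of the
// objectFifo memref shape. A caller would bail out when this returns false.
// Arguments are taken by value because they are sorted locally.
bool dmaSizesMatchFifoShape(std::vector<int64_t> dmaSizes,
                            std::vector<int64_t> fifoShape) {
  std::sort(dmaSizes.begin(), dmaSizes.end());
  std::sort(fifoShape.begin(), fifoShape.end());
  return dmaSizes == fifoShape;
}
```

For the transposed example earlier in the thread, sizes [1, 32, 2, 32] against shape 1x2x32x32 pass this check, while a DMA that only touches part of a larger cached buffer would not.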

yzhang93 · 2024-11-20T19:57:33Z

> > Yes, currently it aims for balanced use of rows and columns (e.g., 2x2/4x4), and I added basic support for that
>
> It would be great if you could add a check for this in the code. My only remaining requests are:
>
> - document the assumed relationship between the first 2 dims of tensors and nrows, ncols
> - add tests/safeguards for the cases nrows != ncols

I've added more comments on the current assumptions. I'm not adding more tests specifically for nrows != ncols; I'll leave that for the follow-ups, when the split factor depends only on ncols rather than being hardcoded as it is right now.

yzhang93 requested review from MaheshRavishankar, nirvedhmeshram, Abhishek-Varma and jtuyls as code owners November 13, 2024 23:32

yzhang93 force-pushed the split_inputs_outputs branch 2 times, most recently from efd6123 to 29067ce Compare November 14, 2024 01:21

newling requested changes Nov 14, 2024

newling requested changes Nov 15, 2024

jtuyls reviewed Nov 18, 2024

jtuyls requested changes Nov 18, 2024

yzhang93 mentioned this pull request Nov 18, 2024

Refactor LogicalObjFifoSplittingUtils #910

Merged

yzhang93 force-pushed the split_inputs_outputs branch from c7386ad to 0635b6a Compare November 19, 2024 06:36

jtuyls approved these changes Nov 20, 2024

yzhang93 force-pushed the split_inputs_outputs branch from 0635b6a to bca6998 Compare November 20, 2024 01:27

newling reviewed Nov 20, 2024

newling requested changes Nov 20, 2024

yzhang93 and others added 5 commits November 20, 2024 11:48

Split L2 input and output objectFifos for memTile/shimTile distribution

15f2fca

More comments and tests

8bb7689

Address comments

e215605

Add separate functions to get split dim and factor

fdfbddb

More comments

3675537

yzhang93 force-pushed the split_inputs_outputs branch from a91c568 to 3675537 Compare November 20, 2024 19:49

newling approved these changes Nov 20, 2024

Merge branch 'main' into split_inputs_outputs

fd59486

yzhang93 enabled auto-merge (squash) November 20, 2024 22:19

Merge branch 'main' into split_inputs_outputs

04b64ec

yzhang93 merged commit d7bf670 into nod-ai:main Nov 20, 2024
5 checks passed
