-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split L2 input and output objectFifos for memTile/shimTile distribution #903
Conversation
efd6123
to
29067ce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand the basic idea of the pass, thanks for the clear comments and code. I'd try to take another look later today to better understand the details.
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/test/split_logicalobjfifos.mlir
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp
Show resolved
Hide resolved
...gins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIESplitLogicalObjFifosForConnectionReuse.cpp
Outdated
Show resolved
Hide resolved
...gins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIESplitLogicalObjFifosForConnectionReuse.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp
Outdated
Show resolved
Hide resolved
return op.emitOpError() << "expected objectFifo shape larger than 2"; | ||
} | ||
size_t splitDim; | ||
if (memrefShape[0] != 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it intentional to not split on both dimensions if they are both not one? If this is run on array with 1 row or 1 column, it will fail, maybe don't fail?
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/test/split_logicalobjfifos.mlir
Outdated
Show resolved
Hide resolved
It would be nice to have include a motivation for this in the description. I imagine it is to split tensors equally across all memory tiles, rather than putting them entirely on one? If this is the case, are you assuming nrows = ncols in choosing splitFactor? If the array is 4x4 and you have
I guess it works nicely. But is the array is 4x8 (8 cols) and the allocs are 4x1x... and 1x8x... it's not optimal? |
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp
Outdated
Show resolved
Hide resolved
Yes, currently it aims for balance use of number of rows and columns (i.g., 2x2/4x4) and added basic support for that. Also it should work for unbalanced use of columns/rows such as A: 4x1x... B: 1x2x...., which will split A into 4 separate objectfifos and B into 2 objectfifos, but we can't control how to distribute these in this pass. I think it's the tiling strategy and the tile assignment strategy that should responsible for the distribution part. |
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELogicalObjFifoSplittingUtils.cpp
Outdated
Show resolved
Hide resolved
FailureOr<bool> l2DmaTransposed = isL2DmaTransposed(op, isL2Target); | ||
if (failed(l2DmaTransposed)) return failure(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's because the shape of objectFifo and the target size of DmaCpNdOp can be different if we select transpose on target option within ConvertToDma
pass. The splitDim we get earlier in this function is only the split dim in objectFifo, but not the split dim for the target size in DmaCpNdOp.
For example, in this example %3 = amdaie.dma_cpy_nd(%0[0, 0, 0, 0] [1, 32, 2, 32] [2048, 32, 1024, 1], %1[0, %2] [32, 64] [128, 1]) : (!amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
,
the splitDim for objectfifo is index 1 while the splitDimTarget should be index 2.
|
||
// Both outer dims are 1, no need to split, return success. | ||
if (splitDim == memrefShape.size()) return success(); | ||
int64_t splitFactor = memrefShape[splitDim]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The splitting factor should really be derived from the number of columns soon as we might create larger L2 buffers that need to be split in a smaller factor ([1, 16, 32, 32]
, split into 4 x [1, 4, 32, 32]
for example). Might be ok to leave for future work though.
c7386ad
to
0635b6a
Compare
This PR made some necessary changes in `LogicalObjFifoSplittingUtils` for matmul-elementwise ops as preparation for another splitting usage in #903. Specially, the utility to check if the L3->L2 dma is transposed on the L2 side is rewritten. --------- Co-authored-by: James Newling <[email protected]>
} | ||
|
||
/// Utility to get the split dimension and factor from a L3->L2 dma op. | ||
LogicalResult getSplitDimAndFactorFromDma(AMDAIE::DmaCpyNdOp op, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This getSplitDim...
function and the other one above are quite hardcoded, but as discussed offline I agree it's ok for now to get an initial 4x4 working. We will start generalizing this soon in follow-ups.
0635b6a
to
bca6998
Compare
/// Split L2 space input and output logical objectFifos. | ||
LogicalResult splitLogicalObjectFifo(IRRewriter &rewriter, | ||
AMDAIE::LogicalObjectFifoFromMemrefOp op, | ||
int64_t &splitDim, int64_t &splitFactor) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int64_t &splitDim, int64_t &splitFactor) { | |
int64_t splitDim, int64_t splitFactor) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, currently it aims for balance use of number of rows and columns (i.g., 2x2/4x4) and added basic support for that
It would be great if you can add a check on this in the code. My only remaining requests are
- document the assumed relationship between first 2 dims of tensors and nrows, nols
- add tests/safeguards for the cases nrows != ncols
let summary = "Pass to split L2 buffers to distribute on multiple shimTiles and memTiles."; | ||
let description = [{ | ||
Splitting L2 input and output logical objectFifos and their user dma operations | ||
by a factor of the number of AIE columns being used, so that the logical objectFifos |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by a factor of the number of AIE columns being used
This isn't the case when the number of columns is different to the number of rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but there's still a relationship with the number of columns. Suppose rows: 4, cols: 2
and a different input being broadcasted to each row, then you likely still want to split it in 2
(number of columns), instead of 4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, that makes sense. Currently though it'd get split 4 ways afaict (I know the plan is to revisit this which is fine by me).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently though it'd get split 4 ways afaict
Yes, indeed, and that's a 'valid' splitting, however, it will likely fail because of not enough channels. 'valid' meaning that it's not intrinsically incorrect, but currently we might not be able to find the channel resources to generate code for it.
by a factor of the number of AIE columns being used, so that the logical objectFifos | ||
can be distributed on multiple shimTiles/memTiles. | ||
|
||
For example, A matrix is distributed on a 2x2 AIE array, with L2 buffer size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, A matrix is distributed on a 2x2 AIE array, with L2 buffer size | |
For example, for the case of a matmul C = A @ B, the A matrix is distributed on a 2x2 AIE array, with L2 buffer size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please mention the assumed relationship between the first 2 dimensions of tensors in L2 and nb_cols
and nb_rows
|
||
// Create `splitFactor` number of doubly stride ops. | ||
rewriter.setInsertionPoint(op); | ||
for (int i = 0; i < splitFactor; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking this PR on this question, but trying to understand how we can generalize this.
It seems like @yzhang93 you are is assuming that the size
of the dma is exactly the same as the shape of the memref of the objectfifo. This needn't be the case, right, they can be completely different? ( @jtuyls )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's not necessarily the case and soon won't be valid when we cache larger buffers on L2, but I think it's ok for now to assume that to start off. We should make sure though to fail if the assumptions don't hold.
a91c568
to
3675537
Compare
I've added more comments for the current assumptions. I'm not adding more tests specially for nrows != ncols, will leave it in the follow ups when the split factor is only depend on the ncols not the way it's hardcoded right now. |
This is the first PR needed to enable 4x4 AIE array. The L2 objectFifos are split to distribute on multiple memTiles and shimTiles for more channel usage. The shim/mem tile reassignment will be addressed in a separate PR.
Also note this PR doesn't change/combine the previous pass that splitting the third input (elementwise) for connection reuse with mamtul ops. What unclear to me is how matmul-elementwise path will be like for 4x4 AIE cores. If the existing logic is kept, we'll have to split the elementwise op to 16 objectFifos which will be a big challenge given the number of shimTile channels . I'd rather leave it as a separate thing for now before we have a clear path to move forward.
@jtuyls I've simplified the functions a bit based on your initial version. Feel free to make modifications/push new commits.