Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split L2 input and output objectFifos for memTile/shimTile distribution #903

Merged
merged 7 commits into from
Nov 20, 2024

Conversation

yzhang93
Copy link
Contributor

@yzhang93 yzhang93 commented Nov 13, 2024

This is the first PR needed to enable 4x4 AIE array. The L2 objectFifos are split to distribute on multiple memTiles and shimTiles for more channel usage. The shim/mem tile reassignment will be addressed in a separate PR.

Also note this PR doesn't change/combine the previous pass that splitting the third input (elementwise) for connection reuse with mamtul ops. What unclear to me is how matmul-elementwise path will be like for 4x4 AIE cores. If the existing logic is kept, we'll have to split the elementwise op to 16 objectFifos which will be a big challenge given the number of shimTile channels . I'd rather leave it as a separate thing for now before we have a clear path to move forward.

@jtuyls I've simplified the functions a bit based on your initial version. Feel free to make modifications/push new commits.

Copy link
Contributor

@newling newling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand the basic idea of the pass, thanks for the clear comments and code. I'd try to take another look later today to better understand the details.

return op.emitOpError() << "expected objectFifo shape larger than 2";
}
size_t splitDim;
if (memrefShape[0] != 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intentional to not split on both dimensions if they are both not one? If this is run on array with 1 row or 1 column, it will fail, maybe don't fail?

@newling
Copy link
Contributor

newling commented Nov 15, 2024

It would be nice to have include a motivation for this in the description. I imagine it is to split tensors equally across all memory tiles, rather than putting them entirely on one? If this is the case, are you assuming nrows = ncols in choosing splitFactor? If the array is 4x4 and you have

 %alloc_0 = memref.alloc() : memref<4x1x32x32xi32, 1 : i32>
 %alloc_1 = memref.alloc() : memref<1x4x32x32xi32, 1 : i32>

I guess it works nicely. But is the array is 4x8 (8 cols) and the allocs are 4x1x... and 1x8x... it's not optimal?

@yzhang93
Copy link
Contributor Author

It would be nice to have include a motivation for this in the description. I imagine it is to split tensors equally across all memory tiles, rather than putting them entirely on one? If this is the case, are you assuming nrows = ncols in choosing splitFactor? If the array is 4x4 and you have

 %alloc_0 = memref.alloc() : memref<4x1x32x32xi32, 1 : i32>
 %alloc_1 = memref.alloc() : memref<1x4x32x32xi32, 1 : i32>

I guess it works nicely. But is the array is 4x8 (8 cols) and the allocs are 4x1x... and 1x8x... it's not optimal?

Yes, currently it aims for balance use of number of rows and columns (i.g., 2x2/4x4) and added basic support for that. Also it should work for unbalanced use of columns/rows such as A: 4x1x... B: 1x2x...., which will split A into 4 separate objectfifos and B into 2 objectfifos, but we can't control how to distribute these in this pass. I think it's the tiling strategy and the tile assignment strategy that should responsible for the distribution part.

Comment on lines 709 to 710
FailureOr<bool> l2DmaTransposed = isL2DmaTransposed(op, isL2Target);
if (failed(l2DmaTransposed)) return failure();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because the shape of objectFifo and the target size of DmaCpNdOp can be different if we select transpose on target option within ConvertToDma pass. The splitDim we get earlier in this function is only the split dim in objectFifo, but not the split dim for the target size in DmaCpNdOp.

For example, in this example %3 = amdaie.dma_cpy_nd(%0[0, 0, 0, 0] [1, 32, 2, 32] [2048, 32, 1024, 1], %1[0, %2] [32, 64] [128, 1]) : (!amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>),

the splitDim for objectfifo is index 1 while the splitDimTarget should be index 2.


// Both outer dims are 1, no need to split, return success.
if (splitDim == memrefShape.size()) return success();
int64_t splitFactor = memrefShape[splitDim];
Copy link
Collaborator

@jtuyls jtuyls Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The splitting factor should really be derived from the number of columns soon as we might create larger L2 buffers that need to be split in a smaller factor ([1, 16, 32, 32], split into 4 x [1, 4, 32, 32] for example). Might be ok to leave for future work though.

yzhang93 added a commit that referenced this pull request Nov 20, 2024
This PR made some necessary changes in `LogicalObjFifoSplittingUtils`
for matmul-elementwise ops as preparation for another splitting usage in
#903. Specially, the utility
to check if the L3->L2 dma is transposed on the L2 side is rewritten.

---------

Co-authored-by: James Newling <[email protected]>
}

/// Utility to get the split dimension and factor from a L3->L2 dma op.
LogicalResult getSplitDimAndFactorFromDma(AMDAIE::DmaCpyNdOp op,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This getSplitDim... function and the other one above are quite hardcoded, but as discussed offline I agree it's ok for now to get an initial 4x4 working. We will start generalizing this soon in follow-ups.

/// Split L2 space input and output logical objectFifos.
LogicalResult splitLogicalObjectFifo(IRRewriter &rewriter,
AMDAIE::LogicalObjectFifoFromMemrefOp op,
int64_t &splitDim, int64_t &splitFactor) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int64_t &splitDim, int64_t &splitFactor) {
int64_t splitDim, int64_t splitFactor) {

Copy link
Contributor

@newling newling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, currently it aims for balance use of number of rows and columns (i.g., 2x2/4x4) and added basic support for that

It would be great if you can add a check on this in the code. My only remaining requests are

  1. document the assumed relationship between first 2 dims of tensors and nrows, nols
  2. add tests/safeguards for the cases nrows != ncols

let summary = "Pass to split L2 buffers to distribute on multiple shimTiles and memTiles.";
let description = [{
Splitting L2 input and output logical objectFifos and their user dma operations
by a factor of the number of AIE columns being used, so that the logical objectFifos
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by a factor of the number of AIE columns being used

This isn't the case when the number of columns is different to the number of rows.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but there's still a relationship with the number of columns. Suppose rows: 4, cols: 2 and a different input being broadcasted to each row, then you likely still want to split it in 2 (number of columns), instead of 4.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, that makes sense. Currently though it'd get split 4 ways afaict (I know the plan is to revisit this which is fine by me).

Copy link
Collaborator

@jtuyls jtuyls Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently though it'd get split 4 ways afaict

Yes, indeed, and that's a 'valid' splitting, however, it will likely fail because of not enough channels. 'valid' meaning that it's not intrinsically incorrect, but currently we might not be able to find the channel resources to generate code for it.

by a factor of the number of AIE columns being used, so that the logical objectFifos
can be distributed on multiple shimTiles/memTiles.

For example, A matrix is distributed on a 2x2 AIE array, with L2 buffer size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For example, A matrix is distributed on a 2x2 AIE array, with L2 buffer size
For example, for the case of a matmul C = A @ B, the A matrix is distributed on a 2x2 AIE array, with L2 buffer size

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention the assumed relationship between the first 2 dimensions of tensors in L2 and nb_cols and nb_rows


// Create `splitFactor` number of doubly stride ops.
rewriter.setInsertionPoint(op);
for (int i = 0; i < splitFactor; ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking this PR on this question, but trying to understand how we can generalize this.

It seems like @yzhang93 you are is assuming that the size of the dma is exactly the same as the shape of the memref of the objectfifo. This needn't be the case, right, they can be completely different? ( @jtuyls )

Copy link
Collaborator

@jtuyls jtuyls Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's not necessarily the case and soon won't be valid when we cache larger buffers on L2, but I think it's ok for now to assume that to start off. We should make sure though to fail if the assumptions don't hold.

@yzhang93
Copy link
Contributor Author

Yes, currently it aims for balance use of number of rows and columns (i.g., 2x2/4x4) and added basic support for that

It would be great if you can add a check on this in the code. My only remaining requests are

  1. document the assumed relationship between first 2 dims of tensors and nrows, nols
  2. add tests/safeguards for the cases nrows != ncols

I've added more comments for the current assumptions. I'm not adding more tests specially for nrows != ncols, will leave it in the follow ups when the split factor is only depend on the ncols not the way it's hardcoded right now.

@yzhang93 yzhang93 enabled auto-merge (squash) November 20, 2024 22:19
@yzhang93 yzhang93 merged commit d7bf670 into nod-ai:main Nov 20, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants