Add sharding support in ttnn backend #541
Conversation
Force-pushed 74636f4 to ad2d754
Thanks for making this change @jnie-TT, right at the moment when we need it! Minor comments left.
lib/Dialect/TT/IR/TTOpsTypes.cpp
Outdated
@@ -604,14 +611,24 @@ LayoutAttr LayoutAttr::withElementType(::mlir::MLIRContext *context,
                                        Type elementType) {
   return LayoutAttr::get(
       context, getLinear(), getOobVal(), getGrid(),
-      buildMemRef(context, getShardShape(), elementType, getMemorySpace()));
+      buildMemRef(context, getShardShape(), elementType, getMemorySpace()),
+      getMemlayout());
Nit: getMemLayout would be aligned with the rest of the attrs.
lib/Dialect/TTNN/IR/TTNNOps.cpp
Outdated
@@ -27,6 +30,26 @@ ::mlir::LogicalResult mlir::tt::ttnn::ToMemoryConfigOp::verify() {
   if (not outputLayout) {
     return emitOpError("Output tensor type missing layout attribute");
   }

+  // This will always be false for now until the compiler optimizer supports it
Could you provide a bit more information on why? OK, the Optimizer does not generate it today (row/height sharding), but would something break if it did?
It wouldn't break anything as long as the generated memory layout is correct. This is a check in the verifier to make sure that the compiler is generating the correct memory layout. Currently it checks that if the memory layout is sharded it must be block sharded, and that the shard shape is divisible by the tile shape (which is asserted in ttnn).
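For reference, a rough standalone sketch of that rule; the enum cases and names below are illustrative stand-ins, not the actual tt-mlir verifier code.

#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Illustrative stand-in for the TensorMemoryLayout attribute cases; the
// actual definitions live in the TT dialect TableGen.
enum class TensorMemoryLayout { UndefLayout, Interleaved, HeightSharded, WidthSharded, BlockSharded };

// Sketch of the verifier check: if the output layout is sharded it must be
// block sharded, and every shard dimension must be a multiple of the
// corresponding tile dimension (ttnn asserts on misaligned shards).
// Returns an error message on failure, std::nullopt on success.
std::optional<std::string>
checkShardedLayout(TensorMemoryLayout memLayout,
                   const std::vector<int64_t> &shardShape,
                   const std::vector<int64_t> &tileShape) {
  const bool isSharded = memLayout == TensorMemoryLayout::HeightSharded ||
                         memLayout == TensorMemoryLayout::WidthSharded ||
                         memLayout == TensorMemoryLayout::BlockSharded;
  if (!isSharded)
    return std::nullopt; // Nothing to verify for interleaved/undef layouts.
  if (memLayout != TensorMemoryLayout::BlockSharded)
    return "only block sharding is supported for now";
  for (size_t i = 0; i < shardShape.size() && i < tileShape.size(); ++i)
    if (shardShape[i] % tileShape[i] != 0)
      return "shard shape must be divisible by the tile shape";
  return std::nullopt;
}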
Then should we make the verifier run this check only if (outputLayout.getMemlayout() == ::mlir::tt::TensorMemoryLayout::BlockSharded) and not throw an error for the others? Or did you mean it as a forcing function, so that when height sharding is generated someone comes here and updates the code with checks for the other types?
runtime/lib/ttnn/program.cpp
Outdated
::ttnn::Tensor result;
if (isOnHost(inputTensor)) {
  result =
      updateLayoutAndDataType(inputTensor, targetDataTypeTTNN, false, true);
Could you name the constant params via comments, at least true /* paramName */?
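For example (the boolean parameter names below are made up purely to illustrate the style, since the real signature of updateLayoutAndDataType is not shown in this diff):

// Hypothetical parameter names, only to illustrate the suggested commenting style.
result = updateLayoutAndDataType(inputTensor, targetDataTypeTTNN,
                                 /*shouldTilize=*/false, /*shouldUntilize=*/true);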
Just as a side note, this is waiting on compiler support as well: ideally we want to determine whether to tilize based on the tile shape. However, it seems that currently the tile shape is always 1x1.
// CHECK: %[[C:.*]] = "ttnn.empty"[[C:.*]]
%0 = tensor.empty() : tensor<256x512xf32>
// CHECK: %[[C:.*]] = "ttnn.relu"[[C:.*]]
%1 = "ttir.relu"(%arg0, %0) <{operandSegmentSizes = array<i32: 1, 1>, operand_constraints = [#any_device, #any_device]}> : (tensor<256x512xf32>, tensor<256x512xf32>) -> tensor<256x512xf32>
Where do we check that the layout is indeed sharded?
Currently it's inferred by the runtime; later the runtime will just check the TensorMemoryLayout. But since the TensorMemoryLayout is always UnDef for now, the runtime checks the shard shape and memory space: if the shard shape is divisible by the tile shape and the memory space is L1, then the runtime will shard the tensor implicitly. This is a hacky workaround until the compiler can generate the correct memory layouts.
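As a rough sketch of that implicit decision (the names are hypothetical; the real runtime inspects the ::ttnn::Tensor and its layout, which are not reproduced here):

#include <cstdint>
#include <vector>

// Hypothetical stand-in for the tensor's memory space.
enum class MemorySpace { System, DeviceDRAM, DeviceL1 };

// Shard implicitly only when the tensor lives in L1 and its shard shape is
// tile-aligned; otherwise fall back to an interleaved memory config.
bool shouldShardImplicitly(MemorySpace memorySpace,
                           const std::vector<int64_t> &shardShape,
                           const std::vector<int64_t> &tileShape) {
  if (memorySpace != MemorySpace::DeviceL1)
    return false; // Only L1 tensors are candidates for implicit sharding.
  for (size_t i = 0; i < shardShape.size() && i < tileShape.size(); ++i)
    if (shardShape[i] % tileShape[i] != 0)
      return false; // Shard shape is not divisible by the tile shape.
  return true; // Build a block-sharded memory config for this tensor.
}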
We can run multicore in DRAM interleaved mode as well?
With the current default block sharding mode, will we perform any tensor deallocation, or will tensors remain until the end of execution? Is there an open issue to add compiler support for tensor alloc/dealloc?
That's correct. In fact, we try to run it on the whole compute_with_storage_grid for unary ops.
The tensors are by default stored until the end of execution (end of submit, when tensorPool goes out of scope). I agree that we should perform dynamic alloc/dealloc once the tensors are not needed anymore (maybe add a dealloc op on the compiler side), or else we could run out of L1 space if we have a chain of ops. I don't know if there's an active effort for supporting this at the moment; I can create an issue to track it so we don't forget. Update: Created issue #553.
auto [isLayoutChange, isGridChange, isFormatChange, isMemorySpaceChange,
      isMemoryLayoutChange] = op.compoundComponents();
Please create a struct encompassing all these changes instead of expanding the tuple.
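For illustration, such a struct could look like the sketch below; the field names are taken from the destructured bindings quoted above, and the struct name is hypothetical.

// Hypothetical struct replacing the std::tuple<bool, ...> return value so
// that call sites read named fields instead of positional tuple elements.
struct ToLayoutComponents {
  bool isLayoutChange = false;
  bool isGridChange = false;
  bool isFormatChange = false;
  bool isMemorySpaceChange = false;
  bool isMemoryLayoutChange = false;
};

// Call sites would then become, for example:
//   ToLayoutComponents components = op.compoundComponents();
//   if (components.isMemoryLayoutChange) { ... }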
lib/Dialect/TTIR/IR/TTIROps.cpp
Outdated
@@ -32,7 +32,7 @@ ::mlir::LogicalResult mlir::tt::ttir::ToLayoutOp::verify() {
   return success();
 }

-std::tuple<bool, bool, bool, bool>
+std::tuple<bool, bool, bool, bool, bool>
See the comment below and change this function to return the created struct.
@@ -90,7 +90,7 @@ def TTIR_ToLayoutOp : TTIR_Op<"to_layout", [DestinationStyleOpInterface, TTIROpI
     // return {OperandConstraint::Any, OperandConstraint::Any};
   }
   // Returns a tuple of booleans indicating if the op changes layout, grid, format, or memory space.
Update the comment; it should mention the new memory layout component as well.
lib/Dialect/TT/IR/TTOpsTypes.cpp
Outdated
@@ -573,6 +573,13 @@ mlir::Type LayoutAttr::getScalarElementType() const {
   return elementType;
 }

+bool LayoutAttr::isSharded() const {
I think we need to call this hasShardedTensorMemoryLayout, because isSharded has a very different meaning for the direct to metal path.
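A sketch of what the renamed helper could look like, assuming it simply classifies the attribute's memory layout (the accessor name is an assumption here):

// Hypothetical implementation: true only for the sharded TensorMemoryLayout
// cases introduced in this PR.
bool LayoutAttr::hasShardedTensorMemoryLayout() const {
  auto memLayout = getMemLayout();
  return memLayout == TensorMemoryLayout::HeightSharded ||
         memLayout == TensorMemoryLayout::WidthSharded ||
         memLayout == TensorMemoryLayout::BlockSharded;
}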
def TT_WidthSharded : I32EnumAttrCase<"WidthSharded", 4, "width_sharded">;
def TT_BlockSharded : I32EnumAttrCase<"BlockSharded", 5, "block_sharded">;

def TT_TensorMemoryLayout : I32EnumAttr<"TensorMemoryLayout", "TT TensorMemoryLayout",
@rpavlovicTT / @nobradovictt for the direct to metal path we will not be using this enum at all; in fact we should always assert that it's UndefLayout. It just seemed like more work / more complicated to specialize the LayoutAttr for TTNN in some way, especially since the optimizer will be written against the TTIR dialect and this is one of the main optimizations. I wonder if there's another way we could solve this issue though; it's definitely going to lead to confusion. Perhaps at least we should name this enum TTNNTensorMemoryLayout despite it living in the TT dialect.
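For example, the direct to metal pipeline could guard this with something along these lines (the placement and accessor names are assumptions, not code from this PR):

// Hypothetical guard in the direct-to-metal lowering: the TTNN-oriented
// memory layout attribute must stay undefined on this path.
assert(layoutAttr.getMemLayout() == TensorMemoryLayout::UndefLayout &&
       "TensorMemoryLayout is TTNN-specific and must be UndefLayout in the "
       "direct to metal path");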
The truly "proper" way to solve this would be to make LayoutAttr an interface, where we can instantiate either a TTNN-based layout or a direct-to-metal-based layout; it's just a decent-size refactor for a single enum. It's one of those things, though, where we could keep making exceptions and end up in a situation where we need the refactor anyway.
Maybe we can decide on a proper refactor for this when we introduce direct to metal backend support in the Optimizer. In the meantime, as you suggested, we can assert in the DMetal pipeline that this is UndefLayout; I am even fine with renaming it to TTNNTensorMemoryLayout for now.
I agree, we can go forward with renaming the attribute. However, I'd definitely refactor it as soon as we find dev cycles for it. We should strive not to leave too much debt behind.
Created issue #596 to track this
@@ -72,6 +72,26 @@ def TT_MemorySpace : I32EnumAttr<"MemorySpace", "TT MemorySpace",
   let cppNamespace = "::mlir::tt";
 }

+def TT_UndefLayout : I32EnumAttrCase<"UndefLayout", 0, "undef_layout">; // For host tensors
Maybe we should call this None to match the flatbuffer enum definition. Also, the comment "For host tensors" isn't strictly true; it will also always be None/Undef for the direct to metal path.
Force-pushed 7dfffc6 to 8b46052
Force-pushed 67a1a30 to d7e10a1
Force-pushed d7e10a1 to 04612f9
Related to #450 as this enables multi-core runs if sharding is feasible.
Related to #518 as this may rely on the compiler to generate legal memory layouts in the future.
Added sharding support in the ttnn backend and bootstrapped sharding attrs in the compiler. Currently, since the compiler cannot dynamically generate legal memory layouts, the runtime will infer the memory layout for tensors: if a tensor resides in L1 and its shard shape is divisible by the tile shape, it creates a sharded memory config; otherwise it uses interleaved as before.
Added a couple of sharding tests.