
Add sharding support in ttnn backend #541

Merged 6 commits into main on Sep 3, 2024

Conversation

@jnie-TT (Contributor) commented Aug 29, 2024

Related to #450 as this enables multi-core runs if sharding is feasible.

Related to #518 as this may rely on the compiler to generate legal memory layouts in the future.

Added sharding support in the ttnn backend and bootstrapped sharding attrs in the compiler. Since the compiler cannot yet dynamically generate legal memory layouts, the runtime will infer the memory layout for tensors - if a tensor resides in L1 and its shard shape is divisible by the tile shape, it creates a sharded memory config, else it uses interleaved as before.

Added a couple of sharding tests.
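For illustration, here is a minimal, self-contained sketch of the inference rule described above (the types and helper names are hypothetical stand-ins, not the actual runtime API):

#include <array>
#include <cstdint>

// Hypothetical stand-ins for the runtime's real types.
enum class MemorySpace { DeviceDRAM, DeviceL1, System };
enum class TensorMemoryLayout { Interleaved, BlockSharded };

struct MemoryConfigSketch {
  TensorMemoryLayout layout;
  std::array<std::uint32_t, 2> shardShape; // only meaningful when sharded
};

constexpr std::array<std::uint32_t, 2> kTileShape = {32, 32};

// If the tensor lives in L1 and its shard shape is divisible by the tile
// shape, build a block-sharded config; otherwise fall back to interleaved.
MemoryConfigSketch inferMemoryConfig(MemorySpace space,
                                     std::array<std::uint32_t, 2> shardShape) {
  bool divisible = (shardShape[0] % kTileShape[0] == 0) &&
                   (shardShape[1] % kTileShape[1] == 0);
  if (space == MemorySpace::DeviceL1 && divisible) {
    return {TensorMemoryLayout::BlockSharded, shardShape};
  }
  return {TensorMemoryLayout::Interleaved, {0, 0}};
}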

@nobradovictt (Contributor)

Thanks for making this change @jnie-TT, right at the moment when we need it! Minor comments left.

@@ -604,14 +611,24 @@ LayoutAttr LayoutAttr::withElementType(::mlir::MLIRContext *context,
                                        Type elementType) {
   return LayoutAttr::get(
       context, getLinear(), getOobVal(), getGrid(),
-      buildMemRef(context, getShardShape(), elementType, getMemorySpace()));
+      buildMemRef(context, getShardShape(), elementType, getMemorySpace()),
+      getMemlayout());
Contributor

Nit: getMemLayout would be aligned with the rest of the attrs.

@@ -27,6 +30,26 @@ ::mlir::LogicalResult mlir::tt::ttnn::ToMemoryConfigOp::verify() {
if (not outputLayout) {
return emitOpError("Output tensor type missing layout attribute");
}

// This will always be false for now until the compiler optimizer supports it
Contributor

Could you provide a bit more information on why? OK, the Optimizer does not generate it today (row/height sharding), but would something break if it did?

Contributor Author

It wouldn't break anything as long as the generated memory layout is correct. This is a pass in the verifier to make sure that the compiler is generating a correct memory layout. Currently it checks that if the memory layout is sharded it must be block sharded, and that the shard shape is divisible by the tile shape (which is asserted in ttnn).

Contributor

Then should we make the verifier run this check only if (outputLayout.getMemlayout() == ::mlir::tt::TensorMemoryLayout::BlockSharded) and not throw an error for the others? Or did you mean it as a forcing function, so that when height sharding is generated we come here and update the code with checks for the other types?
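For reference, a self-contained sketch of the constraint under discussion, using plain C++ stand-ins rather than the actual verifier code or its types (the enum cases here mirror the ones discussed in this PR):

#include <array>
#include <cstdint>
#include <optional>
#include <string>

enum class TensorMemoryLayout { None, Interleaved, HeightSharded, WidthSharded, BlockSharded };

// Returns an error message if the layout violates the constraints described
// above: any sharded layout must (for now) be block sharded, and the shard
// shape must be divisible by the tile shape.
std::optional<std::string>
checkShardedLayout(TensorMemoryLayout layout,
                   std::array<std::uint32_t, 2> shardShape,
                   std::array<std::uint32_t, 2> tileShape = {32, 32}) {
  bool sharded = layout == TensorMemoryLayout::HeightSharded ||
                 layout == TensorMemoryLayout::WidthSharded ||
                 layout == TensorMemoryLayout::BlockSharded;
  if (!sharded) {
    return std::nullopt; // nothing to check for interleaved/undef layouts
  }
  if (layout != TensorMemoryLayout::BlockSharded) {
    return "only block-sharded tensor memory layouts are supported for now";
  }
  if (shardShape[0] % tileShape[0] != 0 || shardShape[1] % tileShape[1] != 0) {
    return "shard shape must be divisible by the tile shape";
  }
  return std::nullopt;
}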

::ttnn::Tensor result;
if (isOnHost(inputTensor)) {
result =
updateLayoutAndDataType(inputTensor, targetDataTypeTTNN, false, true);
Contributor

Could you name the const params via comments, at least true /* paramName */?
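For instance, applied to the call above (the parameter names here are hypothetical - the real signature may use different names):

// Hypothetical parameter names, shown only to illustrate the convention.
result = updateLayoutAndDataType(inputTensor, targetDataTypeTTNN,
                                 /*shouldTilize=*/false, /*shouldUntilize=*/true);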

Contributor Author

Just as a side note, this is waiting on compiler support as well - ideally we want to determine whether to tilize based on the tile shape. However, it seems that currently the tile shape is always 1x1.

runtime/lib/ttnn/program.cpp (comment thread resolved)
// CHECK: %[[C:.*]] = "ttnn.empty"[[C:.*]]
%0 = tensor.empty() : tensor<256x512xf32>
// CHECK: %[[C:.*]] = "ttnn.relu"[[C:.*]]
%1 = "ttir.relu"(%arg0, %0) <{operandSegmentSizes = array<i32: 1, 1>, operand_constraints = [#any_device, #any_device]}> : (tensor<256x512xf32>, tensor<256x512xf32>) -> tensor<256x512xf32>
Contributor

Where do we check that layout is indeed sharded?

Contributor Author

Currently it's inferred by the runtime; later the runtime will just check the TensorMemoryLayout. But since the TensorMemoryLayout is always UnDef for now, the runtime checks the shard shape and memory space - if the shard shape is divisible by the tile shape and the memory space is L1, then the runtime shards the tensor implicitly. This is a hacky workaround until the compiler can generate the correct memory layouts.

@nobradovictt (Contributor)

Can we run multi-core in DRAM interleaved mode as well?

@nobradovictt (Contributor)

With the current default block sharding mode, will we perform any tensor deallocation, or will tensors remain until the end of execution? Is there an open issue to provide compiler support for tensor alloc/dealloc?

@jnie-TT (Contributor Author) commented Aug 29, 2024

> Can we run multi-core in DRAM interleaved mode as well?

That's correct. In fact, we try to run it on the whole compute_with_storage_grid for unary ops.

@jnie-TT (Contributor Author) commented Aug 29, 2024

> With the current default block sharding mode, will we perform any tensor deallocation, or will tensors remain until the end of execution? Is there an open issue to provide compiler support for tensor alloc/dealloc?

The tensors are by default stored until the end of execution (end of submit, when the tensorPool goes out of scope). I agree that we should perform dynamic alloc/dealloc once tensors are no longer needed (maybe add a dealloc op on the compiler side), or else we could run out of L1 space if we have a chain of ops. I don't know if there's an active effort to support this at the moment - I can create an issue to track it so we don't forget.

Update: Created issue #553

Comment on lines 367 to 368
auto [isLayoutChange, isGridChange, isFormatChange, isMemorySpaceChange,
isMemoryLayoutChange] = op.compoundComponents();
Contributor

Please create a struct encompassing all these changes instead of expanding a tuple.
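One possible shape for such a struct - names are illustrative only, not the code that eventually landed:

// Hypothetical struct bundling the compound-change flags instead of a tuple.
struct CompoundComponents {
  bool isLayoutChange = false;
  bool isGridChange = false;
  bool isFormatChange = false;
  bool isMemorySpaceChange = false;
  bool isMemoryLayoutChange = false;
};

// compoundComponents() would then return this struct by value, e.g.:
//   CompoundComponents components = op.compoundComponents();
//   if (components.isMemoryLayoutChange) { ... }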

@@ -32,7 +32,7 @@ ::mlir::LogicalResult mlir::tt::ttir::ToLayoutOp::verify() {
return success();
}

-std::tuple<bool, bool, bool, bool>
+std::tuple<bool, bool, bool, bool, bool>
Contributor

See the comment below and change this function to return the created struct.

@@ -90,7 +90,7 @@ def TTIR_ToLayoutOp : TTIR_Op<"to_layout", [DestinationStyleOpInterface, TTIROpI
// return {OperandConstraint::Any, OperandConstraint::Any};
}
// Returns a tuple of booleans indicating if the op changes layout, grid, format, or memory space.
Contributor

Update the comment (it should also mention the new memory layout component).

@@ -573,6 +573,13 @@ mlir::Type LayoutAttr::getScalarElementType() const {
return elementType;
}

bool LayoutAttr::isSharded() const {
Contributor

I think we need to call this hasShardedTensorMemoryLayout, because isSharded has a very different meaning for the direct-to-metal path.
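A minimal sketch of what the renamed helper could look like; the accessor name (getMemLayout) and the HeightSharded case are assumptions here, since only the WidthSharded and BlockSharded cases appear in the snippet below:

// Sketch only: rename of isSharded() that returns true for any of the
// sharded TensorMemoryLayout cases.
bool LayoutAttr::hasShardedTensorMemoryLayout() const {
  ::mlir::tt::TensorMemoryLayout layout = getMemLayout();
  return layout == ::mlir::tt::TensorMemoryLayout::HeightSharded ||
         layout == ::mlir::tt::TensorMemoryLayout::WidthSharded ||
         layout == ::mlir::tt::TensorMemoryLayout::BlockSharded;
}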

def TT_WidthSharded : I32EnumAttrCase<"WidthSharded", 4, "width_sharded">;
def TT_BlockSharded : I32EnumAttrCase<"BlockSharded", 5, "block_sharded">;

def TT_TensorMemoryLayout : I32EnumAttr<"TensorMemoryLayout", "TT TensorMemoryLayout",
Contributor

@rpavlovicTT / @nobradovictt for the direct-to-metal path we will not be using this enum at all; in fact, we should always assert that it's UndefLayout. It just seemed like more work / more complicated to specialize the LayoutAttr for TTNN in some way, especially since the optimizer will be written against the TTIR dialect and this is one of its main optimizations. I wonder if there's another way we could solve this issue though; it's definitely going to lead to confusion. Perhaps at least we should name this enum TTNNTensorMemoryLayout despite it living in the TT dialect.

Contributor

The truly "proper" way to solve this would be to make LayoutAttr an interface, where we can instantiate either a TTNN-based layout or a direct-to-metal-based layout; it's just a decent-sized refactor for a single enum. It's one of those things, though, where we could keep making exceptions and end up in a situation where we need the refactor anyway.

Contributor

Maybe we can decide on the proper refactor for this when we introduce D Metal backend support in the Optimizer. In the meantime, as you suggested, we can assert in the D Metal pipeline that this is UndefLayout; I'm even fine with renaming it to TTNNTensorMemoryLayout for now.

Contributor

I agree, we can go forward with renaming the attribute. However, I'd definitely refactor it as soon as we find dev cycles for it. We should strive not to leave too much debt behind.

Contributor Author

Created issue #596 to track this

@@ -72,6 +72,26 @@ def TT_MemorySpace : I32EnumAttr<"MemorySpace", "TT MemorySpace",
let cppNamespace = "::mlir::tt";
}

def TT_UndefLayout : I32EnumAttrCase<"UndefLayout", 0, "undef_layout">; // For host tensors
Contributor

Maybe we should call this None to match the flatbuffer enum definition. Also, the comment "For host tensors" isn't strictly true; it'll also always be None/Undef for the direct-to-metal path.

@jnie-TT force-pushed the jnie/ttnn_sharding_rebased branch 4 times, most recently from 7dfffc6 to 8b46052, on September 3, 2024 at 02:15
@jnie-TT force-pushed the jnie/ttnn_sharding_rebased branch 2 times, most recently from 67a1a30 to d7e10a1, on September 3, 2024 at 15:38
@jnie-TT force-pushed the jnie/ttnn_sharding_rebased branch from d7e10a1 to 04612f9 on September 3, 2024 at 16:10
@jnie-TT merged commit c75811b into main on Sep 3, 2024
13 checks passed