#15061: Implement multi-device tensor distribution APIs in terms of C++ ttnn tensors #15886

Merged: 18 commits into main on Dec 17, 2024

Conversation

@omilyutin-tt (Contributor) commented on Dec 11, 2024:

Ticket

#15755

Problem description

Multi-device tensor distribution currently works through distributed.py, which relies on PyTorch libraries to perform sharding / concatenation.

What's changed

  • Add xtensor to ttnn.
  • Lower facilities from tt-train down to ttnn: in particular, the chunk and concatenate functions, along with some conversion utilities and the relevant tests.
  • Add a distributed_tensor.hpp header with the multi-device distribution APIs (a rough usage sketch follows below).

In follow-up PRs:

  • Support bf4 / bf8 and other formats in from_vector / to_vector and other overloads.
  • Support outputting a tilized tensor.
  • Migrate functionality in pytensor.cpp to the new APIs.
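To make the list above concrete, here is a rough sketch of how a host tensor is built with the new facilities, pieced together from the test snippets quoted in the review threads below. GetTensorSpec is a test helper, and the exact signatures are approximations rather than the final API.

#include <vector>

// Builds a row-major host tensor from a flat float vector. Illustrative only:
// from_vector and the spec helpers follow the usage shown in the unit tests.
Tensor make_host_tensor(int num_rows, int num_cols) {
    std::vector<float> data(static_cast<size_t>(num_rows) * num_cols * 3, 0.0f);
    return from_vector(
        data, GetTensorSpec(ttnn::SimpleShape{1, num_rows, num_cols, 3}, DataType::FLOAT32));
}

A TensorToMesh mapper (for example, the shard_tensor_2d_to_mesh_mapper factory exercised in the tests) would then distribute such a host tensor onto a device mesh.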


@cfjchu (Collaborator) left a comment:

Great work, we're getting close to API parity between the Python and C++ sides for multi-device!

Resolved review threads on: ttnn/cpp/ttnn/tensor/xtensor/partition.cpp, partition.hpp, conversion_utils.cpp, and conversion_utils.hpp.

using ::tt::tt_metal::Tensor;

// copypaste from deprecated tensor pybinds ttnn
tt::tt_metal::OwnedBuffer create_owned_buffer(const std::vector<float>& data, DataType data_type) {
A Member commented:

Is data raw data or already partially prepared data? E.g., is it supposed to be tiled?

@omilyutin-tt (author) replied:

It isn't raw data, and it is in row-major layout. Does it make sense to allow providing a buffer in a non-row-major layout? Same goes for to_vector: I am assuming the primary use case is getting the data back in a row-major view?

@ayerofieiev-tt (Member) commented on Dec 12, 2024:

> Same goes for to_vector: I am assuming the primary use case is getting the data back in a row-major view?

Yes.

> Does it make sense to allow providing a buffer in a non-row-major layout?

If you take data that was already prepared but is still on host, e.g. getting data from another tensor or data that was serialized.

@omilyutin-tt (author) replied:

OK. We can provide this as part of Tensor(Data data, const TensorSpec&), as per the comment below?


template <typename T>
Tensor create_owned_tensor(
std::vector<T> data, const ttnn::Shape& shape, tt::tt_metal::DataType data_type, tt::tt_metal::Layout layout) {
A Member commented:

No ttnn::Shape, please. Is there anything preventing us from making this ttnn::SimpleShape?

@ayerofieiev-tt (Member) commented on Dec 12, 2024:

It must be Tensor(vector raw_data, const TensorSpec&). Raw data in this case means that some transformation must happen to align raw_data with the requested spec.

We can additionally have Tensor(Data data, const TensorSpec&) for cases where we want to create a Tensor from data which was already processed and the user knows it is aligned with the spec. Data in this case is just a type that provides a semantic distinction between raw and prepared vectors. Maybe it's a HostBuffer.
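A declaration-only sketch of the constructor pair proposed in this comment; the Data type (possibly a HostBuffer) is a placeholder from the discussion, and none of these overloads exist in the PR as written.

#include <vector>

class TensorSpec;  // the spec type referenced throughout this PR
class Data;        // placeholder for the "prepared" payload (maybe a HostBuffer)

class Tensor {
public:
    // Raw row-major data: the constructor performs whatever transformation is
    // needed to align the data with the requested spec.
    Tensor(std::vector<float> raw_data, const TensorSpec& spec);

    // Data that was already processed and is known by the caller to be aligned
    // with the spec; no further transformation is applied.
    Tensor(Data data, const TensorSpec& spec);
};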

@omilyutin-tt (author) replied:

> Is there anything preventing us from making this ttnn::SimpleShape?

No reason - I had this change locally :)

+1 to TensorSpec. I'm adding some documentation to clarify that the input has to match the tensor volume and that we assume row-major layout.

For the second use case, where we have prepared / pre-processed data, we can do this in a follow-up; it seems like a somewhat separate effort.
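As a rough illustration of the documented contract mentioned above (wording and exact signature are approximations, not text from the PR):

// Creates a host Tensor from `buffer`, interpreted in row-major order.
// buffer.size() must equal the volume implied by `spec`, and T must match the
// data type declared in `spec`.
template <typename T>
Tensor from_vector(const std::vector<T>& buffer, const TensorSpec& spec);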

@omilyutin-tt force-pushed the omilyutin/distributed-cpp branch 3 times, most recently from cc9e599 to eb71e44 on December 12, 2024 at 20:52.
@omilyutin-tt marked this pull request as ready for review on December 13, 2024 at 03:37.
Tensor input_tensor =
from_vector(test_data, GetTensorSpec(ttnn::SimpleShape{1, num_rows, num_cols, 3}, DataType::FLOAT32));

auto mapper = api::shard_tensor_2d_to_mesh_mapper(
A Member commented:

What is this api:: namespace?

@omilyutin-tt (author) replied:

Yeah, agreed - I removed it. In general, we use this pattern frequently, which I don't quite understand:

namespace ttnn::distributed::api {
// Declare stuff...
}

namespace ttnn::distributed {
using namespace api;
}

I can see how this might be useful for navigating very large headers, but then wouldn't it be better to simply split them up? Other thoughts?

using ::ttnn::experimental::xtensor::to_vector;

const std::vector<ttnn::SimpleShape>& GetShapesForTest() {
static auto* shapes = new std::vector<ttnn::SimpleShape>{
A Contributor commented:

You don't need to use new here.

@omilyutin-tt (author) replied:

We don't - it's just muscle memory to never use static storage for types with a non-trivial destructor :)

A Member commented:

This should be corrected; there is really no need for new here.

@omilyutin-tt (author) replied:

Hmm, wait - what is the issue exactly? The pattern is just so that we don't have a static variable that runs a non-trivial destructor. See https://google.github.io/styleguide/cppguide.html#Static_and_Global_Variables

Here, of course, it does not matter and I can just return a copy. No strong opinion, but I was wondering where you are coming from :)
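For context, a standalone sketch of the two alternatives being discussed (illustrative only, not code from this PR):

#include <string>
#include <vector>

// Option 1: function-local static pointer. The vector is intentionally leaked,
// so no non-trivial destructor runs at program exit (the pattern the Google
// C++ style guide recommends for objects with static storage duration).
const std::vector<std::string>& shapes_leaky() {
    static auto* shapes = new std::vector<std::string>{"a", "b"};
    return *shapes;
}

// Option 2: return by value. Simpler, at the cost of a copy on each call;
// perfectly fine for small test helpers like the one above.
std::vector<std::string> shapes_copy() {
    return {"a", "b"};
}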

@dmakoviichuk-tt (Contributor) left a comment:

Overall looks good to me, just some code-style-related comments.

@omilyutin-tt force-pushed the omilyutin/distributed-cpp branch from c8cc96d to 3bc5157 on December 13, 2024 at 23:07.
@cfjchu (Collaborator) left a comment:

🚀 Looks great! Just kick off the model regressions so we can see whether your change had any impact on those serialized tensors, and open issues for the follow-up items in the PR.

// library for efficient host-side operations.

// Splits a tensor into chunks along the specified dimension.
std::vector<tt::tt_metal::Tensor> chunk(const tt::tt_metal::Tensor& tensor, int num_chunks, int dim = 0);
A Collaborator commented:

@sminakov-tt FYI, since we talked about this yesterday: this is one of the primitives used to build the user-facing multi-device sharding (this adds C++ parity now).

ttnn::SimpleShape shape{128, 128};

auto input = arange<TypeParam>(0, shape.volume(), 1);
// TODO: Support this.
A Member commented:

What's the issue with this?

@omilyutin-tt (author) replied:

I don't want to encourage on-host tilization, so ideally from_vector should accept a device as well... There are just too many problems with tilization right now, so I'd rather get this in and work on tilization / performance in a follow-up.

ttnn::distributed::api::close_mesh_device(m_mesh_device);
ttnn::distributed::close_mesh_device(m_mesh_device);
A Member commented:

❤️

namespace ttnn::distributed {

// Mapper interface that distributes a host tensor onto a multi-device configuration.
class TensorToMesh {
@ayerofieiev-tt (Member) commented on Dec 14, 2024:

I would love to hop on a call to try to find better names. Something feels very off about the current names of the classes and methods, and it is not clear at first glance how this is meant to be used.

@omilyutin-tt (author) replied:

These come from the Python world, but yeah, not ideal. Let's do a follow-up with naming and possible API unification? I think we can do better in general.

A Member commented:

OK.

// Elements in the buffer will be stored in row-major order. The type of the elements has to match that of the
// `Tensor`.
template <typename T>
std::vector<T> to_vector() const;
A Member commented:

I assume this is only currently supported for a subset of cases. Again, AFAIK tile layout is not handled, block-float formats are not supported, and sharding (including logical sharding) is not supported. Right?

@omilyutin-tt (author) replied:

Added a clarifying comment. Does a tilized std::vector make sense, though? For bf4 / bf8, my plan was to just convert them to float and return a row-major slice.

@ayerofieiev-tt (Member) commented on Dec 16, 2024:

Thanks! Yes, tilized output does not make sense; returning floats is right. But it is not handled yet, if I understood correctly.
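A small usage sketch of the to_vector accessor under discussion; the block-float handling described in the comment reflects the plan above, not behavior implemented in this PR.

// Reads host tensor data back as a flat, row-major vector. The element type
// must match the tensor's data type; for bf4 / bf8 the plan discussed above is
// to upcast to float rather than return the packed representation.
std::vector<float> read_back_row_major(const Tensor& host_tensor) {
    return host_tensor.to_vector<float>();
}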

std::vector<xt::xarray<T>> chunk(const xt::xarray<T>& tensor, int num_chunks, int dim = 0);

// Concatenates a list of tensors along the specified dimension.
tt::tt_metal::Tensor concatenate(const std::vector<tt::tt_metal::Tensor>& tensors, int dim = 0);
A Member commented:

Consider concat or join.

A Member commented:

The two other APIs operate on xt::xarray, but this one operates on a Tensor. This seems a bit off to me.

A Member commented:

Maybe (just maybe) this is another Tensor constructor.

@omilyutin-tt (author) replied:

> The two other APIs operate on xt::xarray, but this one operates on a Tensor.

Yeah, likely we will move tt-train off the one that relies on xt::xarray. I'm not too worried for now, as both of these rely on the same implementation.

I'm not sure about the constructor; this feels more like a utility function (which should also sit next to chunk, IMO). Big +1 that we need better organization, though.

And sure, concat sounds good. No strong opinion here.
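For reference, a minimal sketch of how the two primitives quoted above compose, assuming a host tensor as input (illustrative only, based on the declarations shown in this file):

// Splits a host tensor into one piece per device along dim 0, then stitches
// the pieces back together; a round trip like this should reproduce the
// original tensor.
tt::tt_metal::Tensor split_and_rejoin(const tt::tt_metal::Tensor& host_tensor, int num_devices) {
    std::vector<tt::tt_metal::Tensor> pieces = chunk(host_tensor, num_devices, /*dim=*/0);
    return concatenate(pieces, /*dim=*/0);
}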

A Member commented:

(Thought) I imagine a newcomer looking at ttnn/tensor/xtensor and getting confused. xtensor is not like boost; it is not a well-known library/type name. Instead of calling this folder xtensor, I'd rather focus on the goal it helps to achieve. But I also understand that its goal seems to be to bridge Tensor with xtensor... Just thinking out loud.

@omilyutin-tt (author) replied:

The goal is to sandbox the xtensor stuff :) I hope the namespace makes it clear that it is a less stable API.

@omilyutin-tt force-pushed the omilyutin/distributed-cpp branch from a910a0d to 981a34f on December 16, 2024 at 16:43.
@omilyutin-tt force-pushed the omilyutin/distributed-cpp branch from 981a34f to a955137 on December 16, 2024 at 18:23.
@omilyutin-tt merged commit 60f2d28 into main on December 17, 2024. 11 checks passed.
@omilyutin-tt deleted the omilyutin/distributed-cpp branch on December 17, 2024 at 02:43.