#15061: Implement multi-device tensor distribution APIs in terms of C++ ttnn tensors #15886

Merged: 18 commits into main on Dec 17, 2024

Conversation

@omilyutin-tt (Contributor) commented on Dec 11, 2024:

Ticket

#15755

Problem description

Multi-device tensor distribution currently works through distributed.py, which relies on PyTorch libraries to perform sharding / concatenation.

What's changed

  • Add xtensor to ttnn.
  • Lower facilities from tt-train down to ttnn: in particular, the chunk and concatenate functions, along with some conversion utilities and the relevant tests.
  • Add a distributed_tensor.hpp header with the multi-device distribution APIs (a rough usage sketch follows below).

In follow-up PRs:

  • Support bf4 / bf8 and other formats in from_vector / to_vector and other overloads.
  • Support outputting a tilized tensor.
  • Migrate functionality in pytensor.cpp to the new APIs.
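To make the list above concrete, here is a rough sketch of how a host tensor is built with the new facilities, pieced together from the test snippets quoted in the review threads below. GetTensorSpec is a test helper, and the exact signatures are approximations rather than the final API.

#include <vector>

// Builds a row-major host tensor from a flat float vector. Illustrative only:
// from_vector and the spec helpers follow the usage shown in the unit tests.
Tensor make_host_tensor(int num_rows, int num_cols) {
    std::vector<float> data(static_cast<size_t>(num_rows) * num_cols * 3, 0.0f);
    return from_vector(
        data, GetTensorSpec(ttnn::SimpleShape{1, num_rows, num_cols, 3}, DataType::FLOAT32));
}

A TensorToMesh mapper (for example, the shard_tensor_2d_to_mesh_mapper factory exercised in the tests) would then distribute such a host tensor onto a device mesh.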


@cfjchu (Collaborator) left a comment:

Great work, we're getting close to API parity between the Python and C++ sides for multi-device!

Resolved review threads on: ttnn/cpp/ttnn/tensor/xtensor/partition.cpp, partition.hpp, conversion_utils.cpp, and conversion_utils.hpp.

using ::tt::tt_metal::Tensor;

// copypaste from deprecated tensor pybinds ttnn
tt::tt_metal::OwnedBuffer create_owned_buffer(const std::vector<float>& data, DataType data_type) {
A Member commented:

Is data raw data or already partially prepared data? E.g., is it supposed to be tiled?

@omilyutin-tt (author) replied:

It isn't raw data, and it is in row-major layout. Does it make sense to allow providing a buffer in a non-row-major layout? Same goes for to_vector: I am assuming the primary use case is getting the data back in a row-major view?

@ayerofieiev-tt (Member) commented on Dec 12, 2024:

> Same goes for to_vector: I am assuming the primary use case is getting the data back in a row-major view?

Yes.

> Does it make sense to allow providing a buffer in a non-row-major layout?

If you take data that was already prepared but is still on host, e.g. getting data from another tensor or data that was serialized.

@omilyutin-tt (author) replied:

OK. We can provide this as part of Tensor(Data data, const TensorSpec&), as per the comment below?


template <typename T>
Tensor create_owned_tensor(
std::vector<T> data, const ttnn::Shape& shape, tt::tt_metal::DataType data_type, tt::tt_metal::Layout layout) {
A Member commented:

No ttnn::Shape, please. Is there anything preventing us from making this ttnn::SimpleShape?

@ayerofieiev-tt (Member) commented on Dec 12, 2024:

It must be Tensor(vector raw_data, const TensorSpec&). Raw data in this case means that some transformation must happen to align raw_data with the requested spec.

We can additionally have Tensor(Data data, const TensorSpec&) for cases where we want to create a Tensor from data which was already processed and the user knows it is aligned with the spec. Data in this case is just a type that provides a semantic distinction between raw and prepared vectors. Maybe it's a HostBuffer.
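A declaration-only sketch of the constructor pair proposed in this comment; the Data type (possibly a HostBuffer) is a placeholder from the discussion, and none of these overloads exist in the PR as written.

#include <vector>

class TensorSpec;  // the spec type referenced throughout this PR
class Data;        // placeholder for the "prepared" payload (maybe a HostBuffer)

class Tensor {
public:
    // Raw row-major data: the constructor performs whatever transformation is
    // needed to align the data with the requested spec.
    Tensor(std::vector<float> raw_data, const TensorSpec& spec);

    // Data that was already processed and is known by the caller to be aligned
    // with the spec; no further transformation is applied.
    Tensor(Data data, const TensorSpec& spec);
};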

@omilyutin-tt (author) replied:

> Is there anything preventing us from making this ttnn::SimpleShape?

No reason - I had this change locally :)

+1 to TensorSpec. I'm adding some documentation to clarify that the input has to match the tensor volume and that we assume row-major layout.

For the second use case, where we have prepared / pre-processed data, we can do this in a follow-up; it seems like a somewhat separate effort.
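As a rough illustration of the documented contract mentioned above (wording and exact signature are approximations, not text from the PR):

// Creates a host Tensor from `buffer`, interpreted in row-major order.
// buffer.size() must equal the volume implied by `spec`, and T must match the
// data type declared in `spec`.
template <typename T>
Tensor from_vector(const std::vector<T>& buffer, const TensorSpec& spec);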

@omilyutin-tt force-pushed the omilyutin/distributed-cpp branch 3 times, most recently from cc9e599 to eb71e44 on December 12, 2024 at 20:52.
@omilyutin-tt marked this pull request as ready for review on December 13, 2024 at 03:37.
Tensor input_tensor =
from_vector(test_data, GetTensorSpec(ttnn::SimpleShape{1, num_rows, num_cols, 3}, DataType::FLOAT32));

auto mapper = api::shard_tensor_2d_to_mesh_mapper(
A Member commented:

What is this api:: namespace?

@omilyutin-tt (author) replied:

Yeah, agreed - I removed it. In general, we use this pattern frequently, which I don't quite understand:

namespace ttnn::distributed::api {
// Declare stuff...
}

namespace ttnn::distributed {
using namespace api;
}

I can see how this might be useful for navigating very large headers, but then wouldn't it be better to simply split them up? Other thoughts?

using ::ttnn::experimental::xtensor::to_vector;

const std::vector<ttnn::SimpleShape>& GetShapesForTest() {
static auto* shapes = new std::vector<ttnn::SimpleShape>{
A Contributor commented:

You don't need to use new here.

@omilyutin-tt (author) replied:

We don't - it's just muscle memory to never use static storage for types with a non-trivial destructor :)

A Member commented:

This should be corrected; there is really no need for new here.

@omilyutin-tt (author) replied:

Hmm, wait - what is the issue exactly? The pattern is just so that we don't have a static variable that runs a non-trivial destructor. See https://google.github.io/styleguide/cppguide.html#Static_and_Global_Variables

Here, of course, it does not matter and I can just return a copy. No strong opinion, but I was wondering where you are coming from :)
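For context, a standalone sketch of the two alternatives being discussed (illustrative only, not code from this PR):

#include <string>
#include <vector>

// Option 1: function-local static pointer. The vector is intentionally leaked,
// so no non-trivial destructor runs at program exit (the pattern the Google
// C++ style guide recommends for objects with static storage duration).
const std::vector<std::string>& shapes_leaky() {
    static auto* shapes = new std::vector<std::string>{"a", "b"};
    return *shapes;
}

// Option 2: return by value. Simpler, at the cost of a copy on each call;
// perfectly fine for small test helpers like the one above.
std::vector<std::string> shapes_copy() {
    return {"a", "b"};
}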

@dmakoviichuk-tt (Contributor) left a comment:

Overall looks good to me, just some code-style-related comments.

@omilyutin-tt force-pushed the omilyutin/distributed-cpp branch from c8cc96d to 3bc5157 on December 13, 2024 at 23:07.
@cfjchu (Collaborator) left a comment:

🚀 Looks great! Just kick off the model regressions so we can see whether your change had any impact on those serialized tensors, and open issues for the follow-up items in the PR.

// library for efficient host-side operations.

// Splits a tensor into chunks along the specified dimension.
std::vector<tt::tt_metal::Tensor> chunk(const tt::tt_metal::Tensor& tensor, int num_chunks, int dim = 0);
A Collaborator commented:

@sminakov-tt FYI, since we talked about this yesterday: this is one of the primitives used to build the user-facing multi-device sharding (this adds C++ parity now).

ttnn::SimpleShape shape{128, 128};

auto input = arange<TypeParam>(0, shape.volume(), 1);
// TODO: Support this.
A Member commented:

What's the issue with this?

@omilyutin-tt (author) replied:

I don't want to encourage on-host tilization, so ideally from_vector should accept a device as well... There are just too many problems with tilization right now, so I'd rather get this in and work on tilization / performance in a follow-up.

ttnn::distributed::api::close_mesh_device(m_mesh_device);
ttnn::distributed::close_mesh_device(m_mesh_device);
A Member commented:

❤️

namespace ttnn::distributed {

// Mapper interface that distributes a host tensor onto a multi-device configuration.
class TensorToMesh {
@ayerofieiev-tt (Member) commented on Dec 14, 2024:

I would love to hop on a call to try to find better names. Something feels very off about the current names of the classes and methods, and it is not clear at first glance how this is meant to be used.

@omilyutin-tt (author) replied:

These come from the Python world, but yeah, not ideal. Let's do a follow-up with naming and possible API unification? I think we can do better in general.

A Member commented:

OK.

// Elements in the buffer will be stored in row-major order. The type of the elements has to match that of the
// `Tensor`.
template <typename T>
std::vector<T> to_vector() const;
A Member commented:

I assume this is only currently supported for a subset of cases. Again, AFAIK tile layout is not handled, block-float formats are not supported, and sharding (including logical sharding) is not supported. Right?

@omilyutin-tt (author) replied:

Added a clarifying comment. Does a tilized std::vector make sense, though? For bf4 / bf8, my plan was to just convert them to float and return a row-major slice.

@ayerofieiev-tt (Member) commented on Dec 16, 2024:

Thanks! Yes, tilized output does not make sense; returning floats is right. But it is not handled yet, if I understood correctly.
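A small usage sketch of the to_vector accessor under discussion; the block-float handling described in the comment reflects the plan above, not behavior implemented in this PR.

// Reads host tensor data back as a flat, row-major vector. The element type
// must match the tensor's data type; for bf4 / bf8 the plan discussed above is
// to upcast to float rather than return the packed representation.
std::vector<float> read_back_row_major(const Tensor& host_tensor) {
    return host_tensor.to_vector<float>();
}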

std::vector<xt::xarray<T>> chunk(const xt::xarray<T>& tensor, int num_chunks, int dim = 0);

// Concatenates a list of tensors along the specified dimension.
tt::tt_metal::Tensor concatenate(const std::vector<tt::tt_metal::Tensor>& tensors, int dim = 0);
A Member commented:

Consider concat or join.

A Member commented:

The two other APIs operate on xt::xarray, but this one operates on a Tensor. This seems a bit off to me.

A Member commented:

Maybe (just maybe) this is another Tensor constructor.

@omilyutin-tt (author) replied:

> The two other APIs operate on xt::xarray, but this one operates on a Tensor.

Yeah, likely we will move tt-train off the one that relies on xt::xarray. I'm not too worried for now, as both of these rely on the same implementation.

I'm not sure about the constructor; this feels more like a utility function (which should also sit next to chunk, IMO). Big +1 that we need better organization, though.

And sure, concat sounds good. No strong opinion here.
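For reference, a minimal sketch of how the two primitives quoted above compose, assuming a host tensor as input (illustrative only, based on the declarations shown in this file):

// Splits a host tensor into one piece per device along dim 0, then stitches
// the pieces back together; a round trip like this should reproduce the
// original tensor.
tt::tt_metal::Tensor split_and_rejoin(const tt::tt_metal::Tensor& host_tensor, int num_devices) {
    std::vector<tt::tt_metal::Tensor> pieces = chunk(host_tensor, num_devices, /*dim=*/0);
    return concatenate(pieces, /*dim=*/0);
}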

A Member commented:

(Thought) I imagine a newcomer looking at ttnn/tensor/xtensor and getting confused. xtensor is not like boost; it is not a well-known library/type name. Instead of calling this folder xtensor, I'd rather focus on the goal it helps to achieve. But I also understand that its goal seems to be to bridge Tensor with xtensor... Just thinking out loud.

@omilyutin-tt (author) replied:

The goal is to sandbox the xtensor stuff :) I hope the namespace makes it clear that it is a less stable API.

@omilyutin-tt force-pushed the omilyutin/distributed-cpp branch from a910a0d to 981a34f on December 16, 2024 at 16:43.
@omilyutin-tt force-pushed the omilyutin/distributed-cpp branch from 981a34f to a955137 on December 16, 2024 at 18:23.
@omilyutin-tt merged commit 60f2d28 into main on December 17, 2024. 11 checks passed.
@omilyutin-tt deleted the omilyutin/distributed-cpp branch on December 17, 2024 at 02:43.