#0: Add WriteShard and ReadShard MeshBuffer APIs and resolve MeshBuffer dealloc issues #16960

tt-asaigal · 2025-01-22T02:21:10Z

Ticket

No ticket.

Problem description

MeshBuffer deallocation on destruction is currently a no-op on main: destroying any MeshBuffer issues a deallocate to address 0.
Basic IO functionality for MeshBuffer needs to be added.

What's changed

Resolve deallocation issue: Allocate the single-device backing buffer when the MeshBuffer is created, and store it in the MeshBuffer object. The backing buffer is deleted, and automatically deallocated at the correct address, when the MeshBuffer is destroyed.
Add WriteShard and ReadShard APIs. Multi-device sharding and replication builds on top of this (to be added in a separate PR).
Add tests.
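
The deallocation fix above amounts to RAII ownership of the backing allocation. Here is a minimal sketch of that idea using stand-in types. `ToyAllocator`, `ToyBuffer` and `ToyMeshBuffer` are hypothetical names for illustration, not the tt-metal classes, and the use of `std::unique_ptr` is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <set>

// Hypothetical stand-in for a single-device allocator: tracks live addresses.
struct ToyAllocator {
    std::set<uint64_t> live;
    uint64_t next = 0x1000;
    uint64_t allocate(uint64_t size) {
        uint64_t addr = next;
        next += size;
        live.insert(addr);
        return addr;
    }
    void deallocate(uint64_t addr) { live.erase(addr); }
};

// Hypothetical backing buffer: allocates on construction and frees
// its own address (not address 0) on destruction.
class ToyBuffer {
public:
    ToyBuffer(ToyAllocator& alloc, uint64_t size) : alloc_(alloc), address_(alloc.allocate(size)) {}
    ~ToyBuffer() { alloc_.deallocate(address_); }
    ToyBuffer(const ToyBuffer&) = delete;
    ToyBuffer& operator=(const ToyBuffer&) = delete;
    uint64_t address() const { return address_; }

private:
    ToyAllocator& alloc_;
    uint64_t address_;
};

// Hypothetical mesh buffer that owns its backing buffer for its whole lifetime.
class ToyMeshBuffer {
public:
    ToyMeshBuffer(ToyAllocator& alloc, uint64_t size) : backing_(std::make_unique<ToyBuffer>(alloc, size)) {}
    uint64_t address() const { return backing_->address(); }

private:
    std::unique_ptr<ToyBuffer> backing_;
};

// Returns true if the address is released when the mesh buffer goes out of scope.
bool backing_buffer_freed_on_destruction() {
    ToyAllocator alloc;
    uint64_t addr = 0;
    {
        ToyMeshBuffer mesh_buffer(alloc, 2048);
        addr = mesh_buffer.address();
        assert(alloc.live.count(addr) == 1);  // allocated while the owner is alive
    }
    return alloc.live.count(addr) == 0;  // freed at the recorded address
}
```

Because the owner stores the buffer object itself, the address it frees is always the one it allocated, which is the property the deallocation test in this PR checks.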

Checklist

Post commit CI passes
Blackhole Post commit (if applicable)
Model regression CI testing passes (if applicable)
Device performance regression CI testing passes (if applicable)
(For models and ops writers) Full new models tests passes
New/Existing tests provide coverage for changes

tt_metal/distributed/mesh_command_queue.cpp

tt_metal/distributed/mesh_buffer.hpp

tt_metal/distributed/mesh_buffer.cpp

cfjchu · 2025-01-22T02:39:22Z

tests/tt_metal/distributed/test_mesh_buffer.cpp


 namespace tt::tt_metal::distributed::test {
 namespace {

 using MeshBufferTest = T3000MultiDeviceFixture;

+class DeviceLocalShardedBufferTestConfig {


what's the plan with DeviceLocalBufferConfig?

This is a struct used for testing only. It generates device-local sharding parameters from a user-provided config, making it easier for us to test single-device sharding functionality.
DeviceLocalBufferConfig is a core user-facing struct exposed in the mesh_buffer.hpp header.

cfjchu · 2025-01-22T02:42:52Z

tt_metal/distributed/distributed.hpp

+template <typename DType>
+void WriteShard(
+    MeshCommandQueue& mesh_cq,
+    std::shared_ptr<MeshBuffer>& mesh_buffer,
+    std::vector<DType>& src,
+    const Coordinate& coord,
+    bool blocking = false) {
+    mesh_cq.enqueue_write_shard(mesh_buffer, src.data(), coord, blocking);
+}
+
+template <typename DType>
+void ReadShard(
+    MeshCommandQueue& mesh_cq,
+    std::vector<DType>& dst,
+    std::shared_ptr<MeshBuffer>& mesh_buffer,
+    const Coordinate& coord,
+    bool blocking = true) {
+    auto shard = mesh_buffer->get_device_buffer(coord);
+    dst.resize(shard->page_size() * shard->num_pages() / sizeof(DType));
+    mesh_cq.enqueue_read_shard(dst.data(), mesh_buffer, coord, blocking);
+}
+


void ReadShard(Buffer& buffer, std::vector<DType>& host_buffer, const uint32_t& core_id);

We should converge on the interface for eventual unification with single-device

I agree, my preference is to modify the tt-metal variants, since:

They currently don't work with Fast Dispatch, so they'll need a command queue in the argument list.

They pass in a 1D core_id which doesn't make sense for a 2D grid.

I think this should be a separate effort, since it requires dedicated dispatch changes to properly support.
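
For readers following the interface discussion, here is a rough sketch of how a WriteShard/ReadShard pair of this shape round-trips data for one device coordinate. All `Mock*` types and functions are hypothetical stand-ins backed by host memory. They are not the tt-metal `MeshBuffer` or `MeshCommandQueue`, and the size check in `MockWriteShard` is an addition of this sketch rather than something the diff does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <utility>
#include <vector>

// Hypothetical 2D device coordinate: (row, col).
using MockCoordinate = std::pair<std::size_t, std::size_t>;

// Hypothetical mesh buffer: one host-side byte array per device coordinate.
struct MockMeshBuffer {
    std::size_t shard_size_bytes;
    std::map<MockCoordinate, std::vector<std::uint8_t>> shards;
};

// Hypothetical command queue that copies synchronously.
struct MockMeshCommandQueue {
    void enqueue_write_shard(MockMeshBuffer& buf, const void* src, const MockCoordinate& coord) {
        auto& shard = buf.shards[coord];
        shard.resize(buf.shard_size_bytes);
        std::memcpy(shard.data(), src, buf.shard_size_bytes);
    }
    void enqueue_read_shard(void* dst, const MockMeshBuffer& buf, const MockCoordinate& coord) {
        std::memcpy(dst, buf.shards.at(coord).data(), buf.shard_size_bytes);
    }
};

template <typename DType>
void MockWriteShard(
    MockMeshCommandQueue& cq, MockMeshBuffer& buf, const std::vector<DType>& src, const MockCoordinate& coord) {
    // Added in this sketch: reject host vectors that do not match the shard size.
    assert(src.size() * sizeof(DType) == buf.shard_size_bytes);
    cq.enqueue_write_shard(buf, src.data(), coord);
}

template <typename DType>
void MockReadShard(
    MockMeshCommandQueue& cq, std::vector<DType>& dst, const MockMeshBuffer& buf, const MockCoordinate& coord) {
    // Same sizing rule as the diff: shard bytes divided by the element size.
    dst.resize(buf.shard_size_bytes / sizeof(DType));
    cq.enqueue_read_shard(dst.data(), buf, coord);
}

// Writes four values to the shard at (0, 1), reads them back, and compares.
bool shard_round_trip_matches() {
    MockMeshCommandQueue cq;
    MockMeshBuffer buf{4 * sizeof(std::uint32_t), {}};
    std::vector<std::uint32_t> src = {1, 2, 3, 4};
    std::vector<std::uint32_t> dst;
    MockWriteShard(cq, buf, src, MockCoordinate{0, 1});
    MockReadShard(cq, dst, buf, MockCoordinate{0, 1});
    return dst == src;
}
```

Addressing the shard by a 2D coordinate rather than a 1D id is the part of the shape that the thread above argues the single-device variants should eventually adopt.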

tt-asaigal · 2025-01-22T05:44:38Z

Passing Post Commit: https://github.com/tenstorrent/tt-metal/actions/runs/12900817938
T3K (same state as Main): https://github.com/tenstorrent/tt-metal/actions/runs/12901481894

omilyutin-tt

The APIs look nice - thank you for following up!

omilyutin-tt · 2025-01-22T06:40:19Z

tt_metal/distributed/mesh_buffer.hpp

    std::vector<std::vector<std::shared_ptr<Buffer>>> buffers_;
+    // Buffer owned by the MeshBuffer. Responsible for interfacing with the
+    // single device allocator.


Add "not set if the MeshBuffer is externally owned"

omilyutin-tt · 2025-01-22T06:45:15Z

tt_metal/distributed/mesh_command_queue.cpp

+
+void MeshCommandQueue::enqueue_read_shard(
+    void* host_data, std::shared_ptr<MeshBuffer>& mesh_buffer, const Coordinate& coord, bool blocking) {
+    TT_FATAL(blocking, "Only blocking reads are currently supported from MeshBuffer shards.");


blocking here doesn't do anything - is this intentional?

Yes, this is intentional for now. Once we support non-blocking reads, this parameter will take effect. Until then, the API is in place and we assert if the user requests a non-blocking read.
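
A small sketch of the pattern described in this reply: the flag is already part of the signature, but only the blocking path exists, so any other request is rejected loudly instead of being silently ignored. `read_shard_sketch` is a hypothetical function, and throwing `std::invalid_argument` is only a stand-in for the `TT_FATAL` check in the diff.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical reader: `blocking` is accepted for forward compatibility,
// but only blocking reads are implemented.
void read_shard_sketch(std::vector<int>& dst, const std::vector<int>& shard, bool blocking = true) {
    if (!blocking) {
        throw std::invalid_argument("Only blocking reads are currently supported.");
    }
    dst = shard;  // stand-in for the device-to-host copy
}

// Returns true if a non-blocking request is rejected rather than ignored.
bool non_blocking_read_is_rejected() {
    std::vector<int> dst;
    const std::vector<int> shard = {7, 8, 9};
    read_shard_sketch(dst, shard);  // default blocking read succeeds
    assert(dst == shard);
    try {
        read_shard_sketch(dst, shard, /*blocking=*/false);
    } catch (const std::invalid_argument&) {
        return true;
    }
    return false;
}
```

Keeping the parameter in place means callers will not need a signature change once a non-blocking path is added.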

omilyutin-tt · 2025-01-22T06:53:32Z

tests/tt_metal/distributed/test_mesh_buffer.cpp

    }
-    EXPECT_FALSE(allocator->allocated_buffers.contains(buffer.get()));


Does the old test pass as it is? Just wondering. By the way, I tried testing with Buffer::is_allocated(), but tt::tt_metal::detail::DeallocateBuffer(backing_buffer_ptr); didn't reset the variable that tracks the allocation state. I think you'll have better luck with the new revision. Can you try:

    std::shared_ptr<Buffer> buffer;
    {
        auto replicated_buffer = MeshBuffer::create(buffer_config, device_local_config, mesh_device_.get());
        buffer = replicated_buffer->get_device_buffer(Coordinate{0, 0});
        EXPECT_TRUE(buffer->is_allocated());
    }
    EXPECT_FALSE(buffer->is_allocated());

The original test does not pass as is. The previous MeshBuffer implementation stored the backing buffer as the shard for device (0, 0), so the following gave an accurate picture of the backing buffer's state:

    buffer = replicated_buffer->get_device_buffer(Coordinate{0, 0});
    allocator = buffer->allocator();
    EXPECT_TRUE(allocator->allocated_buffers.contains(buffer.get()));

This is not the case anymore, since the backing buffer is not accessible to the user in any form.

The test you've described will not pass either, since the individual shards returned by get_device_buffer and stored in a temporary variable do not get deallocated when a MeshBuffer goes out of scope.
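
One way to picture the lifetime point above, assuming the shards are ordinary `std::shared_ptr`-managed objects: a handle copied out of the owner keeps the shard alive after the owner is gone. `ToyShard` and `ToyMeshBuffer` are hypothetical stand-ins, not the tt-metal types, and the real shards may track allocation state differently.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical shard that records whether it is still alive.
struct ToyShard {
    explicit ToyShard(bool* flag) : allocated(flag) { *allocated = true; }
    ~ToyShard() { *allocated = false; }
    bool* allocated;
};

// Hypothetical owner that hands out shared ownership of its shards.
struct ToyMeshBuffer {
    std::vector<std::shared_ptr<ToyShard>> shards;
    std::shared_ptr<ToyShard> get_device_buffer(std::size_t i) const { return shards.at(i); }
};

// Returns true if the shard outlives the mesh buffer while an external handle exists,
// and is destroyed only when that handle is dropped.
bool shard_outlives_owner() {
    bool allocated = false;
    std::shared_ptr<ToyShard> held;
    {
        ToyMeshBuffer mesh_buffer;
        mesh_buffer.shards.push_back(std::make_shared<ToyShard>(&allocated));
        held = mesh_buffer.get_device_buffer(0);
    }  // mesh_buffer is destroyed here, but `held` still references the shard
    const bool alive_after_owner = allocated;
    held.reset();  // last reference dropped, so the shard destructor runs now
    return alive_after_owner && !allocated;
}
```

Under that assumption, a check made through a handle held outside the scope says nothing about whether the owner released its own backing allocation.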

omilyutin-tt · 2025-01-22T06:53:59Z

tests/tt_metal/distributed/test_mesh_buffer.cpp

@@ -112,5 +157,95 @@ TEST_F(MeshBufferTest, GetDeviceBuffer) {
    EXPECT_NO_THROW(replicated_buffer->get_device_buffer(Coordinate{1, 3}));
 }

+TEST_F(MeshBufferTest, TestInterleavedShardsReadWrite) {


MeshBufferTest already has "Test" in it, drop Test prefix here and below

omilyutin-tt · 2025-01-22T06:58:53Z

tests/tt_metal/distributed/test_mesh_buffer.cpp

+            std::shared_ptr<MeshBuffer> buf =
+                MeshBuffer::create(global_buffer_config, per_device_buffer_config, mesh_device_.get());
+
+            std::vector<uint32_t> src_vec = create_constant_vector_of_bfloat16(num_random_tiles * single_tile_size, i);


Can you create a std::vector<bfloat16> and write it using the WriteShard API? Or rely on a simpler dtype like float or int? The same vector can then be compared with the output:

EXPECT_THAT(dst_vec, Pointwise(Eq(), src_vec))

... without a loop.

Done; the tests no longer use for-loop comparisons.

omilyutin-tt · 2025-01-22T07:00:27Z

tests/tt_metal/distributed/test_mesh_buffer.cpp

+    uint32_t seed = tt::parse_env("TT_METAL_SEED", 0);
+    uint32_t single_tile_size = ::tt::tt_metal::detail::TileSize(DataFormat::Float16_b);
+
+    for (auto buffer_type : {BufferType::L1, BufferType::DRAM}) {


If we want to do this, this should be a parameterized TEST_P.

omilyutin-tt · 2025-01-22T07:01:29Z

tests/tt_metal/distributed/test_mesh_buffer.cpp

+                        std::vector<uint32_t> dst_vec = {};
+                        ReadShard(mesh_device_->mesh_command_queue(), dst_vec, buf, Coordinate(logical_y, logical_x));
+                        for (int j = 0; j < dst_vec.size(); j++) {
+                            EXPECT_EQ(dst_vec[j], j);


Same here for comparing vectors without a loop

omilyutin-tt · 2025-01-22T07:01:48Z

tests/tt_metal/distributed/test_mesh_buffer.cpp

+    CoreCoord core_grid_size = mesh_device_->compute_with_storage_grid_size();
+    std::vector<std::array<uint32_t, 2>> num_pages_per_core_vec = {{1, 1}, {3, 137}, {67, 4}, {7, 11}, {2, 2}};
+    std::vector<std::array<uint32_t, 2>> page_shapes = {{1, 1024}, {1, 2048}, {1, 4}, {32, 32}, {1, 120}};
+    std::vector<TensorMemoryLayout> shard_strategies = {
+        TensorMemoryLayout::HEIGHT_SHARDED, TensorMemoryLayout::WIDTH_SHARDED, TensorMemoryLayout::BLOCK_SHARDED};
+
+    for (const auto shard_strategy : shard_strategies) {
+        for (const auto& num_pages_per_core : num_pages_per_core_vec) {
+            for (const auto& page_shape : page_shapes) {
+                DeviceLocalShardedBufferTestConfig test_config(


Same here for using a parameterized test suite.

omilyutin-tt · 2025-01-22T07:03:22Z

tests/tt_metal/distributed/test_mesh_buffer.cpp


 namespace tt::tt_metal::distributed::test {
 namespace {

 using MeshBufferTest = T3000MultiDeviceFixture;

+class DeviceLocalShardedBufferTestConfig {
+public:


A class with only public data members and methods is a struct; let's do that and remove the constructor? You can use aggregate initialization with this syntax:

    DeviceLocalShardedBufferTestConfig config{
        .num_pages_per_core = ...,
        .num_cores = ...,
        // etc.
    };
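
A compilable sketch of that suggestion, assuming C++20 designated initializers are available. `ShardedTestConfigSketch` is a hypothetical struct: `num_pages_per_core` appears in the diff and `num_cores` in the comment above, while `page_shape`, `element_size` and `page_size()` are illustrative additions.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Test-only config as a plain aggregate: no user-declared constructor,
// all data members public.
struct ShardedTestConfigSketch {
    std::array<uint32_t, 2> num_pages_per_core;
    std::array<uint32_t, 2> num_cores;
    std::array<uint32_t, 2> page_shape;
    uint32_t element_size = 1;  // default used when the initializer omits it

    // Derived values can stay as member functions without breaking aggregate status.
    uint32_t page_size() const { return page_shape[0] * page_shape[1] * element_size; }
};

uint32_t example_page_size() {
    // Designated initializers must name members in declaration order.
    ShardedTestConfigSketch config{
        .num_pages_per_core = {3, 137},
        .num_cores = {2, 2},
        .page_shape = {32, 32},
        .element_size = 4,
    };
    return config.page_size();  // 32 * 32 * 4
}
```

Naming each field at the call site also makes the nested test loops easier to read than a positional constructor call.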

omilyutin-tt · 2025-01-22T07:03:30Z

tests/tt_metal/distributed/test_mesh_buffer.cpp


 namespace tt::tt_metal::distributed::test {
 namespace {

 using MeshBufferTest = T3000MultiDeviceFixture;

+class DeviceLocalShardedBufferTestConfig {
+public:
+    std::array<uint32_t, 2> num_pages_per_core;



abhullar-tt

overall lgtm once other feedback is addressed

tt-asaigal requested review from cfjchu, aliuTT, omilyutin-tt, abhullar-tt, pgkeller, tt-aho, tt-dma and ubcheema as code owners January 22, 2025 02:21

cfjchu approved these changes Jan 22, 2025


tt-asaigal force-pushed the asaigal/mesh_buffer_io branch 2 times, most recently from f9508de to 42f0a14 Compare January 22, 2025 03:49

omilyutin-tt requested changes Jan 22, 2025


tt-asaigal force-pushed the asaigal/mesh_buffer_io branch from 42f0a14 to 17ceb33 Compare January 22, 2025 18:02

#0: Add WriteShard and ReadShard MeshBuffer APIs and resolve MeshBuff…

befeaca

…er dealloc issues - Add tests for reading and writing shards with Interleaved and Sharded configs - Add test for deallocation, verifying addresses

tt-asaigal force-pushed the asaigal/mesh_buffer_io branch from 17ceb33 to befeaca Compare January 22, 2025 18:44

abhullar-tt approved these changes Jan 22, 2025
