Fix bug when calling CreateDevice in a loop on TG #16260

aliuTT · 2024-12-23T04:32:42Z

Ticket

Problem description

There is a bug when calling CreateDevice over multiple chips on TG.

set_internal_routing_info_for_ethernet_cores was writing to all chips to enable the erisc FW. This breaks when you open only one set of mmio devices: the next time you open the other mmio devices, their enable bit is already true before the FW is loaded. This led to random, non-deterministic (ND), timing-sensitive hangs.

What's changed

Split the cluster's enable-config write so that it is applied per mmio grouping of devices instead of to every chip at once.
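The per-group behaviour can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyCluster`, `routing_enabled`, `set_internal_routing` and `example_usage` are hypothetical stand-ins rather than the real `tt::Cluster` code, and `chip_id_t` is assumed to be an integer type. Only the shape of the loop (an empty target list means "all mmio groups") follows the diff in this PR.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

using chip_id_t = int;  // assumption: stand-in for the real typedef

// Hypothetical toy model; it does not touch hardware or the real tt::Cluster.
struct ToyCluster {
    // mmio chip id -> every chip associated with that mmio chip
    std::map<chip_id_t, std::vector<chip_id_t>> devices_grouped_by_assoc_mmio_device;
    // chips whose (toy) routing enable bit is currently set
    std::set<chip_id_t> routing_enabled;

    // An empty target list means "apply to every mmio group".
    void set_internal_routing(bool enable, const std::vector<chip_id_t>& target_mmio_devices = {}) {
        std::vector<chip_id_t> mmio_devices = target_mmio_devices;
        if (mmio_devices.empty()) {
            for (const auto& entry : devices_grouped_by_assoc_mmio_device) {
                mmio_devices.push_back(entry.first);
            }
        }
        for (chip_id_t mmio_id : mmio_devices) {
            for (chip_id_t chip : devices_grouped_by_assoc_mmio_device.at(mmio_id)) {
                if (enable) {
                    routing_enabled.insert(chip);
                } else {
                    routing_enabled.erase(chip);
                }
            }
        }
    }
};

// Builds two mmio groups and enables routing for the first group only.
inline ToyCluster example_usage() {
    ToyCluster cluster;
    cluster.devices_grouped_by_assoc_mmio_device = {{0, {0, 4, 5}}, {1, {1, 6, 7}}};
    cluster.set_internal_routing(true, {0});
    // Chips behind mmio device 1 are untouched, so their bit is not set early.
    assert(cluster.routing_enabled.count(6) == 0);
    return cluster;
}
```

In this model, opening the second group later starts from a cleared enable bit, which is the property the problem description says was missing.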

Checklist

Post commit CI passes
Blackhole Post commit (if applicable)
Model regression CI testing passes (if applicable)
Device performance regression CI testing passes (if applicable)
(For models and ops writers) Full new models tests passes
New/Existing tests provide coverage for changes

cfjchu

I don't quite understand what the actual bug was. Why does calling in a loop result in an issue?

cfjchu · 2024-12-23T06:01:30Z

tt_metal/llrt/tt_cluster.hpp

@@ -182,7 +182,7 @@ class Cluster {
    //       set_internal_routing_info_for_ethernet_cores(false);
    //       CloseDevice(0)
    //       CloseDevice(1)
-    void set_internal_routing_info_for_ethernet_cores(bool enable_internal_routing) const;
+    void set_internal_routing_info_for_ethernet_cores(bool enable_internal_routing, std::vector<chip_id_t> target_mmio_devices = {}) const;


I'm modifying it in the function 🐒, oops. I'll make the change.

set_internal_routing_info_for_ethernet_cores was writing to all chips to enable the erisc FW. This breaks when you open only one set of mmio devices: the next time you open the other mmio devices, their enable bit is already true before the FW is loaded. This led to random, non-deterministic (ND), timing-sensitive hangs.

tt-aho · 2024-12-23T17:45:03Z

tt_metal/llrt/tt_cluster.cpp

+    std::vector<chip_id_t> mmio_devices = target_mmio_devices;
+    if (mmio_devices.size() == 0) {
+        for (const auto &[assoc_mmio_device, devices] : this->devices_grouped_by_assoc_mmio_device_) {
+            mmio_devices.emplace_back(assoc_mmio_device);
+        }
+    }
+    for (const auto &mmio_chip_id : mmio_devices) {
+        for (const auto &chip_id : this->devices_grouped_by_assoc_mmio_device_.at(mmio_chip_id)) {


Can we reserve capacity ahead of time for the non_mmio_devices vector, and also for the mmio_devices vector when target_mmio_devices is empty?
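A minimal sketch of what that reservation could look like, assuming the grouping is a map from mmio chip id to its associated chips. The helper names `collect_mmio_devices` and `collect_non_mmio_devices` are hypothetical, and treating "non-mmio devices" as every chip in a group other than the mmio chip itself is an assumption, since that code is not shown in the hunk above.

```cpp
#include <cstddef>
#include <map>
#include <vector>

using chip_id_t = int;  // assumption: stand-in for the real typedef
using DeviceGroups = std::map<chip_id_t, std::vector<chip_id_t>>;

// Hypothetical helper: returns the explicit targets, or every map key when
// no targets were given. The number of keys is known, so reserve once.
inline std::vector<chip_id_t> collect_mmio_devices(
    const DeviceGroups& grouped, const std::vector<chip_id_t>& target_mmio_devices) {
    if (!target_mmio_devices.empty()) {
        return target_mmio_devices;
    }
    std::vector<chip_id_t> mmio_devices;
    mmio_devices.reserve(grouped.size());
    for (const auto& entry : grouped) {
        mmio_devices.push_back(entry.first);
    }
    return mmio_devices;
}

// Hypothetical helper: gathers the chips in each selected group except the
// mmio chip itself. The reserved size is an upper bound (it also counts the
// mmio chips), so it may slightly over-allocate.
inline std::vector<chip_id_t> collect_non_mmio_devices(
    const DeviceGroups& grouped, const std::vector<chip_id_t>& mmio_devices) {
    std::size_t upper_bound = 0;
    for (chip_id_t mmio_id : mmio_devices) {
        upper_bound += grouped.at(mmio_id).size();
    }
    std::vector<chip_id_t> non_mmio_devices;
    non_mmio_devices.reserve(upper_bound);
    for (chip_id_t mmio_id : mmio_devices) {
        for (chip_id_t chip : grouped.at(mmio_id)) {
            if (chip != mmio_id) {
                non_mmio_devices.push_back(chip);
            }
        }
    }
    return non_mmio_devices;
}
```

Reserving avoids repeated reallocation while the vectors grow; with only a handful of chips the saving is likely small, so this is mostly a tidiness change.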

tt-aho · 2024-12-23T17:46:07Z

tt_metal/impl/device/device_pool.cpp

@@ -206,20 +206,31 @@ void DevicePool::initialize(

    // Never skip for TG Cluster
    bool skip = not tt::Cluster::instance().is_galaxy_cluster();
+    std::vector<chip_id_t> target_mmio_ids;


We could reserve by device_ids.size() ahead of time? It may slightly over-allocate, but this vector is discarded afterwards, right?
Edit: Reserving by tt::Cluster::instance().number_of_pci_devices() might be better?

cfjchu · 2024-12-23T17:56:49Z

tt_metal/llrt/tt_cluster.cpp

@@ -973,18 +973,21 @@ std::tuple<tt_cxy_pair, tt_cxy_pair> Cluster::get_eth_tunnel_core(
 }

 // TODO: ALLAN Can change to write one bit
-void Cluster::set_internal_routing_info_for_ethernet_cores(bool enable_internal_routing) const {
+void Cluster::set_internal_routing_info_for_ethernet_cores(bool enable_internal_routing, const std::vector<chip_id_t> &target_mmio_devices) const {


We call this function in a number of different spots for both init and close. Do we need to survey whether those call sites need updating?

Is it simpler to just read back the enable bit before writing?
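For illustration only, the read-back alternative raised in this comment could look roughly like the sketch below. `ToyRegisters` and `set_enable_if_changed` are hypothetical names on an in-memory stand-in for the per-chip enable bit. This is not the approach the PR takes, and the sketch makes no claim that it would resolve the hang.

```cpp
#include <cstdint>
#include <unordered_map>

using chip_id_t = int;  // assumption: stand-in for the real typedef

// Hypothetical in-memory stand-in for a per-chip routing enable register.
struct ToyRegisters {
    std::unordered_map<chip_id_t, std::uint32_t> routing_enable;
    int writes = 0;  // counts writes so the effect of the guard is visible

    std::uint32_t read(chip_id_t chip) const {
        auto it = routing_enable.find(chip);
        return it == routing_enable.end() ? 0u : it->second;
    }

    void write(chip_id_t chip, std::uint32_t value) {
        routing_enable[chip] = value;
        ++writes;
    }
};

// Reads the current bit first and writes only when the value would change.
// Returns true if a write was issued.
inline bool set_enable_if_changed(ToyRegisters& regs, chip_id_t chip, bool enable) {
    const std::uint32_t desired = enable ? 1u : 0u;
    if (regs.read(chip) == desired) {
        return false;
    }
    regs.write(chip, desired);
    return true;
}
```

The trade-off is an extra read per chip in exchange for skipping redundant writes, whereas the merged change instead restricts which chips are written at all.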

aliuTT requested review from abhullar-tt, pgkeller, tt-aho, tt-dma, tt-asaigal, ubcheema, davorchap and cfjchu as code owners December 23, 2024 04:32

aliuTT force-pushed the aliu/fix-cluster-bug branch from d8af64a to bb659e2 Compare December 23, 2024 04:32

cfjchu reviewed Dec 23, 2024

Fix bug when calling CreateDevice in a loop on TG

51aeefb

ubcheema approved these changes Dec 23, 2024

aliuTT force-pushed the aliu/fix-cluster-bug branch from bb659e2 to 51aeefb Compare December 23, 2024 17:37

tt-aho approved these changes Dec 23, 2024

cfjchu approved these changes Dec 23, 2024

aliuTT merged commit 8a683ef into main Jan 7, 2025
181 of 184 checks passed

aliuTT deleted the aliu/fix-cluster-bug branch January 7, 2025 17:57
