-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug when calling CreateDevice in a loop on TG #16260
Conversation
d8af64a
to
bb659e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand what the actual bug was. Why does calling in a loop result in an issue?
tt_metal/llrt/tt_cluster.hpp
Outdated
@@ -182,7 +182,7 @@ class Cluster { | |||
// set_internal_routing_info_for_ethernet_cores(false); | |||
// CloseDevice(0) | |||
// CloseDevice(1) | |||
void set_internal_routing_info_for_ethernet_cores(bool enable_internal_routing) const; | |||
void set_internal_routing_info_for_ethernet_cores(bool enable_internal_routing, std::vector<chip_id_t> target_mmio_devices = {}) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const ref
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm modifying it in the function 🐒 , oops. I'll make change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set_internal_routing_info_for_ethernet_cores
was writing to all chips to enable the erisc FW. But this is bugged when you open one set of mmio devices, because the next time you open the other mmio devices, the enable bit is already true before loading FW. Led to random ND timing sensitive hangs.
bb659e2
to
51aeefb
Compare
std::vector<chip_id_t> mmio_devices = target_mmio_devices; | ||
if (mmio_devices.size() == 0) { | ||
for (const auto &[assoc_mmio_device, devices] : this->devices_grouped_by_assoc_mmio_device_) { | ||
mmio_devices.emplace_back(assoc_mmio_device); | ||
} | ||
} | ||
for (const auto &mmio_chip_id : mmio_devices) { | ||
for (const auto &chip_id : this->devices_grouped_by_assoc_mmio_device_.at(mmio_chip_id)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we reserve sizes for non_mmio_devices vector and also mmio_devices vector if target_mmio_devices is empty ahead of time?
@@ -206,20 +206,31 @@ void DevicePool::initialize( | |||
|
|||
// Never skip for TG Cluster | |||
bool skip = not tt::Cluster::instance().is_galaxy_cluster(); | |||
std::vector<chip_id_t> target_mmio_ids; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could reserve by device_ids.size() ahead of time? May slightly over-allocate but this vector is discarded after right?
Edit: Reserving by tt::Cluster::instance().number_of_pci_devices()
might be better?
@@ -973,18 +973,21 @@ std::tuple<tt_cxy_pair, tt_cxy_pair> Cluster::get_eth_tunnel_core( | |||
} | |||
|
|||
// TODO: ALLAN Can change to write one bit | |||
void Cluster::set_internal_routing_info_for_ethernet_cores(bool enable_internal_routing) const { | |||
void Cluster::set_internal_routing_info_for_ethernet_cores(bool enable_internal_routing, const std::vector<chip_id_t> &target_mmio_devices) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we call this function in a number of different spots for both init and close. Do we need to survey whether those callsites need update?
Is it simpler to just read back the enable bit before writing ?
Ticket
Link to Github Issue
Problem description
Bug when calling CreateDevice over multiple chips in TG.
set_internal_routing_info_for_ethernet_cores
was writing to all chips to enable the erisc FW. But this is bugged when you open one set of mmio devices, because the next time you open the other mmio devices, the enable bit is already true before loading FW. Led to random ND timing sensitive hangs.What's changed
Split cluster write enable config to be per mmio grouping of devices.
Checklist