Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug when calling CreateDevice in a loop on TG #16260

Merged
merged 1 commit into from
Jan 7, 2025
Merged

Conversation

aliuTT
Copy link
Contributor

@aliuTT aliuTT commented Dec 23, 2024

Ticket

Link to Github Issue

Problem description

Bug when calling CreateDevice over multiple chips in TG.

set_internal_routing_info_for_ethernet_cores was writing to all chips to enable the erisc FW. But this is bugged when you open one set of mmio devices, because the next time you open the other mmio devices, the enable bit is already true before loading FW. Led to random ND timing sensitive hangs.

What's changed

Split cluster write enable config to be per mmio grouping of devices.

Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • (For models and ops writers) Full new models tests passes
  • New/Existing tests provide coverage for changes

Copy link
Collaborator

@cfjchu cfjchu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand what the actual bug was. Why does calling in a loop result in an issue?

@@ -182,7 +182,7 @@ class Cluster {
// set_internal_routing_info_for_ethernet_cores(false);
// CloseDevice(0)
// CloseDevice(1)
void set_internal_routing_info_for_ethernet_cores(bool enable_internal_routing) const;
void set_internal_routing_info_for_ethernet_cores(bool enable_internal_routing, std::vector<chip_id_t> target_mmio_devices = {}) const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const ref

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm modifying it in the function 🐒 , oops. I'll make change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_internal_routing_info_for_ethernet_cores was writing to all chips to enable the erisc FW. But this is bugged when you open one set of mmio devices, because the next time you open the other mmio devices, the enable bit is already true before loading FW. Led to random ND timing sensitive hangs.

@aliuTT aliuTT force-pushed the aliu/fix-cluster-bug branch from bb659e2 to 51aeefb Compare December 23, 2024 17:37
Comment on lines +983 to +990
std::vector<chip_id_t> mmio_devices = target_mmio_devices;
if (mmio_devices.size() == 0) {
for (const auto &[assoc_mmio_device, devices] : this->devices_grouped_by_assoc_mmio_device_) {
mmio_devices.emplace_back(assoc_mmio_device);
}
}
for (const auto &mmio_chip_id : mmio_devices) {
for (const auto &chip_id : this->devices_grouped_by_assoc_mmio_device_.at(mmio_chip_id)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reserve sizes for non_mmio_devices vector and also mmio_devices vector if target_mmio_devices is empty ahead of time?

@@ -206,20 +206,31 @@ void DevicePool::initialize(

// Never skip for TG Cluster
bool skip = not tt::Cluster::instance().is_galaxy_cluster();
std::vector<chip_id_t> target_mmio_ids;
Copy link
Contributor

@tt-aho tt-aho Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could reserve by device_ids.size() ahead of time? May slightly over-allocate but this vector is discarded after right?
Edit: Reserving by tt::Cluster::instance().number_of_pci_devices() might be better?

@@ -973,18 +973,21 @@ std::tuple<tt_cxy_pair, tt_cxy_pair> Cluster::get_eth_tunnel_core(
}

// TODO: ALLAN Can change to write one bit
void Cluster::set_internal_routing_info_for_ethernet_cores(bool enable_internal_routing) const {
void Cluster::set_internal_routing_info_for_ethernet_cores(bool enable_internal_routing, const std::vector<chip_id_t> &target_mmio_devices) const {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we call this function in a number of different spots for both init and close. Do we need to survey whether those callsites need update?

Is it simpler to just read back the enable bit before writing ?

@aliuTT aliuTT merged commit 8a683ef into main Jan 7, 2025
181 of 184 checks passed
@aliuTT aliuTT deleted the aliu/fix-cluster-bug branch January 7, 2025 17:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants