Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#0: API Unification for Device and MeshDevice #16570

Merged
merged 7 commits into from
Jan 15, 2025

Conversation

cfjchu
Copy link
Collaborator

@cfjchu cfjchu commented Jan 9, 2025

Ticket

None

Problem description

The motivation is to provide API unification between Device and MeshDevice as we push virtualization over devices in a mesh down to tt-metal. This is pre-requisite work needed for TT-Distributed.

At a high-level, the changes in this PR provides API unification between Device/MeshDevice by simply implementing all required methods captured in IDevice. This means APIs interfacing with an IDevice* can do so via a MeshDevice through runtime polymorphism.

For APIs that can't sensibly be called through MeshDevice, we simply throw an exception (for now). Unfortunately because we have such a large surface area of APIs exposed through the IDevice interface, we are currently required to implement all of these methods. However, there is a separate effort to help minimize this surface area which is out of scope for this PR.

What's changed

  • Explicitly make MeshDevice use the interface exposed by IDevice
  • Move inf/nan/eps methods out of IDevice and into HAL

Checklist

@cfjchu cfjchu force-pushed the jchu/mesh-device-interface-v0 branch from 915c8f9 to e736d72 Compare January 11, 2025 09:48
// Get the physical device IDs mapped to a MeshDevice
std::vector<chip_id_t> get_mapped_physical_device_ids(const MeshDeviceConfig& config) const;
std::vector<chip_id_t> request_available_devices(const MeshDeviceConfig& config) const;
void register_mesh_device(const std::shared_ptr<MeshDevice>& mesh_device, const std::vector<IDevice*>& devices);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you share the story behind this method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I am not clear on the role of this entity vs MeshDevice

It is rare to see register without unregister, or not returning a guard.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little funny, I agree. These methods are copied over so my intention is not to fix this in this PR. The story behind this method is that whenever a user instantiates a MeshDevice, the intention was to bind to a set of physical devices. Those physical devices should then be "reserved" for the lifecycle of the MeshDevice.

Let's look at an example:

A user instantiates a MeshDevice(1x1) mapping to physical device-id = 0. Internally, we "register" this with the SystemMesh. SystemMesh now marks physical id = 0 as "UNAVAILABLE". Next, if a user instantiates another MeshDevice(1x1), the registration process searches for an "available" physical device-id satisfying (1x1) and provides device_id=1 as an available candidate physical device to bind to. In effect the registration serves as a form of bookkeeping for in-use devices.

@cfjchu cfjchu force-pushed the jchu/mesh-device-interface-v0 branch 4 times, most recently from b246da3 to 85beacf Compare January 14, 2025 05:05
@cfjchu cfjchu force-pushed the jchu/mesh-device-interface-v0 branch 2 times, most recently from 0514da5 to 45100c1 Compare January 14, 2025 21:55
@cfjchu
Copy link
Collaborator Author

cfjchu commented Jan 15, 2025

The current state of the IDevice interface requires more revision. I've synced offline with @ayerofieiev-tt for the APIs that currently throw exceptions where its implementation is not currently sensible. This is the current set of methods identified:

https://docs.google.com/spreadsheets/d/1hvJte8q1D-5C9HLXQuSd_01pLGwIZJlO-Dth4ZHIT_g/edit?usp=sharing

The plan is to either remove APIs entirely from the IDevice interface or provide explicit MeshDevice implementations. This will be work that future PRs will address.

Copy link
Member

@ayerofieiev-tt ayerofieiev-tt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, Joseph!

To the rest. We recognize that MeshDevice is not yet compliant with IDevice requirements. Definitely far from ideal. We discussed alternatives. If instead we first change IDevice to be the way we need it to be, it will completely block the work on the mesh. In a different circumstances, that would be my way to go because we likely can't propagate MeshDevice any further anyway.

The priority during the review was to ensure that we have a clear plan going forward. This plan includes removing some methods from IDevice as well as providing impl to others in MeshDevice. Should take 4-8 weeks.

@cfjchu cfjchu force-pushed the jchu/mesh-device-interface-v0 branch from 45100c1 to e77d03d Compare January 15, 2025 02:52
@cfjchu cfjchu merged commit 2c4a641 into main Jan 15, 2025
197 of 199 checks passed
@cfjchu cfjchu deleted the jchu/mesh-device-interface-v0 branch January 15, 2025 06:31
@avoraTT avoraTT restored the jchu/mesh-device-interface-v0 branch January 15, 2025 12:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.