-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#0: API Unification for Device and MeshDevice #16570
Conversation
b2c2351
to
d9cd112
Compare
915c8f9
to
e736d72
Compare
// Get the physical device IDs mapped to a MeshDevice | ||
std::vector<chip_id_t> get_mapped_physical_device_ids(const MeshDeviceConfig& config) const; | ||
std::vector<chip_id_t> request_available_devices(const MeshDeviceConfig& config) const; | ||
void register_mesh_device(const std::shared_ptr<MeshDevice>& mesh_device, const std::vector<IDevice*>& devices); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you share the story behind this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I am not clear on the role of this entity vs MeshDevice
It is rare to see register without unregister, or not returning a guard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little funny, I agree. These methods are copied over so my intention is not to fix this in this PR. The story behind this method is that whenever a user instantiates a MeshDevice, the intention was to bind to a set of physical devices. Those physical devices should then be "reserved" for the lifecycle of the MeshDevice.
Let's look at an example:
A user instantiates a MeshDevice(1x1) mapping to physical device-id = 0. Internally, we "register" this with the SystemMesh. SystemMesh now marks physical id = 0 as "UNAVAILABLE". Next, if a user instantiates another MeshDevice(1x1), the registration process searches for an "available" physical device-id satisfying (1x1) and provides device_id=1 as an available candidate physical device to bind to. In effect the registration serves as a form of bookkeeping for in-use devices.
b246da3
to
85beacf
Compare
0514da5
to
45100c1
Compare
tests/tt_eager/python_api_testing/unit_testing/misc/test_eps.py
Outdated
Show resolved
Hide resolved
The current state of the IDevice interface requires more revision. I've synced offline with @ayerofieiev-tt for the APIs that currently throw exceptions where its implementation is not currently sensible. This is the current set of methods identified: https://docs.google.com/spreadsheets/d/1hvJte8q1D-5C9HLXQuSd_01pLGwIZJlO-Dth4ZHIT_g/edit?usp=sharing The plan is to either remove APIs entirely from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, Joseph!
To the rest. We recognize that MeshDevice is not yet compliant with IDevice requirements. Definitely far from ideal. We discussed alternatives. If instead we first change IDevice to be the way we need it to be, it will completely block the work on the mesh. In a different circumstances, that would be my way to go because we likely can't propagate MeshDevice any further anyway.
The priority during the review was to ensure that we have a clear plan going forward. This plan includes removing some methods from IDevice as well as providing impl to others in MeshDevice. Should take 4-8 weeks.
45100c1
to
e77d03d
Compare
Ticket
None
Problem description
The motivation is to provide API unification between Device and MeshDevice as we push virtualization over devices in a mesh down to tt-metal. This is pre-requisite work needed for TT-Distributed.
At a high-level, the changes in this PR provides API unification between Device/MeshDevice by simply implementing all required methods captured in IDevice. This means APIs interfacing with an IDevice* can do so via a MeshDevice through runtime polymorphism.
For APIs that can't sensibly be called through MeshDevice, we simply throw an exception (for now). Unfortunately because we have such a large surface area of APIs exposed through the IDevice interface, we are currently required to implement all of these methods. However, there is a separate effort to help minimize this surface area which is out of scope for this PR.
What's changed
Checklist