Introduce ManagedDeviceMesh to integrate DeviceMesh with TorchFT #56
Summary: ManagedDeviceMesh allows users to manipulate DeviceMesh with TorchFT ManagedProcessGroup. ghstack-source-id: 9b2bdf3aa301a643726c8d8fb43f385bb022ba96 Pull Request resolved: #56
LGTM! Though I don't know a ton about DeviceMesh internals.
It would be nice to add an FSDP integration test (in manager_integ_tests.py), but we can do that as a follow-up.
torchft/process_group_test.py
Outdated
# real mesh but is virtually added to the mesh via ManagedDeviceMesh.
device_mesh = ft_init_device_mesh(
    device_type="cpu",
    mesh_shape=(2, 4),
Is this value used at all? I assume it doesn't really matter what it's set to?
The replicate dimension of that shape is not going to be valid, but the other dimensions are valid and will be used.
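To make the reply above concrete, here is a minimal sketch in plain Python (no torch dependency) of how a requested mesh shape might be split into the dimensions that back a real DeviceMesh and the replicate dimension that is only added virtually. The helper `split_replicate_dim` and its `replicate_dim` parameter are hypothetical names chosen for illustration; this is not confirmed to be how the PR implements it.

```python
from typing import Tuple


def split_replicate_dim(
    mesh_shape: Tuple[int, ...], replicate_dim: int
) -> Tuple[int, Tuple[int, ...]]:
    """Hypothetical helper: separate the replicate dimension from the rest.

    Returns (requested_replicate_size, real_mesh_shape). Under the assumption
    discussed above, the requested replicate size is not trusted (the live
    replica count would come from the manager), while the remaining
    dimensions are used as given.
    """
    if not 0 <= replicate_dim < len(mesh_shape):
        raise ValueError(f"replicate_dim {replicate_dim} is out of range")
    shape = list(mesh_shape)
    requested = shape.pop(replicate_dim)
    return requested, tuple(shape)


# Assumes the replicate dimension is dimension 0 of the (2, 4) shape.
requested, real_shape = split_replicate_dim((2, 4), replicate_dim=0)
print(requested, real_shape)
```

In this model only `real_shape` would be handed to a real mesh constructor; the `2` is a placeholder that the fault-tolerant layer is free to ignore.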
        raise NotImplementedError


class _FlattenDeviceMesh(DeviceMesh):
Oh nice! Does this solve the issue with flattening in FSDP, or does it just throw an error for now?
It should work for the case where we flatten the device mesh to compute the global loss average, but it will not work for the data loader. I think we need to customize the data loader anyway.
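As an illustration of the loss-averaging case mentioned above, the following plain-Python sketch models what a reduction over a flattened (replicate x shard) mesh would compute: every rank contributes its local loss once to a single average. The function name `global_loss_average` and the sample values are hypothetical; a real run would use a collective over the flattened mesh's process group rather than a Python list.

```python
from typing import List


def global_loss_average(losses: List[List[float]]) -> float:
    """Average per-rank losses over a flattened (replicate x shard) grid.

    losses[r][s] is the local loss on replica r, shard s. Flattening the two
    dimensions means each rank is counted exactly once in one average.
    """
    flat = [loss for replica in losses for loss in replica]
    if not flat:
        raise ValueError("no participating ranks")
    return sum(flat) / len(flat)


# Two replicas with four shards each (hypothetical values).
losses = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
print(global_loss_average(losses))  # 4.5
```

Because the average is taken over whichever ranks are present, the same function still gives a sensible result if a replica drops out and only one row remains.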
torchft/process_group_test.py
Outdated
self.assertEqual(replicate_mesh.get_group(), replicate_group)
flatten_mesh = device_mesh._flatten("dp")
manager.num_participants.return_value = 1
self.assertEqual(flatten_mesh.size(), 4)
Should this be equal to world_size?
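One possible reading of the assertion in the test snippet, offered as an assumption rather than as the PR's confirmed behavior: the flattened size tracks the number of live replica groups instead of a static world size, so a real mesh of 4 ranks with 1 participant reports 4. The function `flattened_size` below is a hypothetical model of that arithmetic only.

```python
def flattened_size(real_mesh_size: int, num_participants: int) -> int:
    """Hypothetical model: ranks in the real mesh times live replica groups."""
    if real_mesh_size < 1:
        raise ValueError("real_mesh_size must be at least 1")
    if num_participants < 0:
        raise ValueError("num_participants must be non-negative")
    return real_mesh_size * num_participants


# Mirrors the values in the test snippet: 4 real ranks, 1 participant.
print(flattened_size(4, 1))  # 4
# With both replica groups participating the size would match 2 * 4.
print(flattened_size(4, 2))  # 8
```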
Stack from ghstack (oldest at bottom):
ManagedDeviceMesh allows users to manipulate DeviceMesh with TorchFT ManagedProcessGroup. This currently works with a simple HSDP case, but the actual integration and e2e tests are likely to expose more issues, e.g., checkpointing.