Introduce ManagedDeviceMesh to integrate DeviceMesh with TorchFT #56

Merged (27 commits) on Jan 10, 2025

Contributor

@fegin fegin commented Dec 24, 2024

Stack from ghstack (oldest at bottom):

ManagedDeviceMesh allows users to manipulate DeviceMesh with the TorchFT ManagedProcessGroup. This currently works for a simple HSDP case, but the actual integration and e2e tests are likely to expose more issues, e.g., with checkpointing.

fegin added a commit that referenced this pull request Dec 24, 2024
Summary:
ManagedDeviceMesh allows users to manipulate DeviceMesh with the TorchFT ManagedProcessGroup.

ghstack-source-id: 9b2bdf3aa301a643726c8d8fb43f385bb022ba96
Pull Request resolved: #56
@facebook-github-bot facebook-github-bot added the CLA Signed label Dec 24, 2024
@fegin fegin requested a review from d4l3k December 24, 2024 00:15
fegin added a commit that referenced this pull request Dec 24, 2024
ghstack-source-id: b1ed52b20adff13f2389aa554f20e150e6e375b8
Member

@d4l3k d4l3k left a comment

LGTM! Though I don't know a ton about DeviceMesh internals.

It would be nice to add FSDP integration tests (in manager_integ_tests.py), but we can do that as a follow-up.

# real mesh but is virtually added to the mesh via ManagedDeviceMesh.
device_mesh = ft_init_device_mesh(
    device_type="cpu",
    mesh_shape=(2, 4),
Member

Is this value used at all? I assume it doesn't really matter what it's set to?

Contributor Author

The replicate part is not going to be valid, but the other parts are valid and will be used.
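
One way to read the reply above, as a minimal sketch rather than the PR's actual code: the entry of `mesh_shape` at the replicate dimension acts as a placeholder, and only the remaining entries describe the real mesh. The helper name `split_replicate_dim`, the dimension names, and `replicate_dim=0` are assumptions made for illustration.

```python
from typing import Tuple


def split_replicate_dim(
    mesh_shape: Tuple[int, ...],
    mesh_dim_names: Tuple[str, ...],
    replicate_dim: int,
) -> Tuple[Tuple[int, ...], Tuple[str, ...]]:
    """Hypothetical helper: drop the managed (replicate) dimension.

    Returns the shape and names of the dimensions that would back a real mesh.
    """
    if len(mesh_shape) != len(mesh_dim_names):
        raise ValueError("mesh_shape and mesh_dim_names must have the same length")
    if not 0 <= replicate_dim < len(mesh_shape):
        raise ValueError("replicate_dim is out of range")
    # Keep every entry except the one at the replicate dimension.
    real_shape = tuple(s for i, s in enumerate(mesh_shape) if i != replicate_dim)
    real_names = tuple(n for i, n in enumerate(mesh_dim_names) if i != replicate_dim)
    return real_shape, real_names


real_shape, real_names = split_replicate_dim(
    (2, 4), ("dp_replicate", "dp_shard"), replicate_dim=0
)
print(real_shape, real_names)  # (4,) ('dp_shard',)
```

In this reading, the `2` in `mesh_shape=(2, 4)` is never used to build a process group; only the `4` is.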

raise NotImplementedError


class _FlattenDeviceMesh(DeviceMesh):
Member

Oh nice! Does this solve the issue with flattening in FSDP, or does it just throw an error for now?

Contributor Author

It should work for the case where we flatten the device mesh to compute the global loss average, but it will not work for the data loader. I think we need to customize the data loader anyway.

self.assertEqual(replicate_mesh.get_group(), replicate_group)
flatten_mesh = device_mesh._flatten("dp")
manager.num_participants.return_value = 1
self.assertEqual(flatten_mesh.size(), 4)

Should this be equal to world_size?
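
A possible reading of the assertion in question, sketched under assumptions and not taken from the PR's implementation: if the flattened mesh reports the number of live replicas multiplied by the ranks in the real (non-managed) mesh, then mocking `num_participants` to return 1 with a real shard dimension of 4 gives 4. The function `flattened_size` and the model behind it are hypothetical; only `manager.num_participants` and the numbers come from the test excerpt.

```python
from math import prod
from typing import Tuple
from unittest.mock import MagicMock


def flattened_size(manager, real_mesh_shape: Tuple[int, ...]) -> int:
    """Hypothetical model: live replicas times ranks in the real mesh."""
    return manager.num_participants() * prod(real_mesh_shape)


manager = MagicMock()

# One live replica, as in the test excerpt.
manager.num_participants.return_value = 1
size_one = flattened_size(manager, (4,))

# Both replicas alive.
manager.num_participants.return_value = 2
size_two = flattened_size(manager, (4,))

print(size_one, size_two)  # 4 8
```

Under this model the flattened size matches the static 2 x 4 = 8 only while every replica participates, so it tracks the current quorum rather than a fixed world size.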

fegin added a commit that referenced this pull request Jan 7, 2025
ghstack-source-id: 888be370d2f8e81fbe0a9a29a9a99a4e6404cab8
fegin added a commit that referenced this pull request Jan 7, 2025
ghstack-source-id: faf2012b63df387807ddac7e9dc30af634abc3c5
fegin added a commit that referenced this pull request Jan 7, 2025
ghstack-source-id: f62593b8d08554895f7a00a9a140417b0e22c55c
fegin added 2 commits January 7, 2025 14:23
fegin added a commit that referenced this pull request Jan 7, 2025
ghstack-source-id: 35ed91c90e242aa9f114a7b25f58097d312e274d
fegin added 2 commits January 7, 2025 15:59
fegin added a commit that referenced this pull request Jan 7, 2025
ghstack-source-id: 2bde839287b6dff46d8ac54c8851fb86abe6b788
fegin added 2 commits January 7, 2025 16:14
fegin added a commit that referenced this pull request Jan 8, 2025
ghstack-source-id: 321d2f2f5ff2cf9bc16622623b2d80eb95db33cf
fegin added 2 commits January 7, 2025 19:49
fegin added a commit that referenced this pull request Jan 8, 2025
ghstack-source-id: 62ae91be4f8137745b4afa42824b3c97b270a54c
fegin added a commit that referenced this pull request Jan 9, 2025
ghstack-source-id: efc1bc7da656ca45419cc0fac1747f1ebc9ef23f
fegin added a commit that referenced this pull request Jan 9, 2025
ghstack-source-id: 00d30a48f09b9afe1525babbd5c1968ac1c66b16
fegin added a commit that referenced this pull request Jan 9, 2025
ghstack-source-id: ace0838d729c7ecdd3720fb9037185c83d9a289a
fegin added a commit that referenced this pull request Jan 9, 2025
ghstack-source-id: e7afbf22208758883e0cc3abacab4f6deb60c41b
fegin added a commit that referenced this pull request Jan 9, 2025
ghstack-source-id: a9349a1096ab8bf2f9e2b231add1bfc395291b16
fegin added a commit that referenced this pull request Jan 9, 2025
ghstack-source-id: 43af564a9fb3b830eb2a20c0b24a067cbc030005
fegin added 2 commits January 10, 2025 10:47
fegin added a commit that referenced this pull request Jan 10, 2025
ghstack-source-id: 60d693e38aafdd180cb85c346eba8f71a8477f71
fegin added a commit that referenced this pull request Jan 10, 2025
ghstack-source-id: 77c4ac0e166fd10d71e54c5f155b2c73ffe53a0a
fegin added a commit that referenced this pull request Jan 10, 2025
ghstack-source-id: da9dfabc229106b3473f0d86c882880ff5b8cc56
fegin added a commit that referenced this pull request Jan 10, 2025
ghstack-source-id: 43387c55947963203513e5277886fef59a2e306b
fegin added a commit that referenced this pull request Jan 10, 2025
ghstack-source-id: 98b4f2b9f5371e9c27ceba3a7239740ecf65881c
fegin added a commit that referenced this pull request Jan 10, 2025
ghstack-source-id: c42ae2205624402ccfe99fb87c847f6ffb7a1703
@d4l3k d4l3k changed the base branch from gh/fegin/1/base to main January 10, 2025 20:42
@d4l3k d4l3k merged commit b617bd2 into main Jan 10, 2025
6 checks passed
@d4l3k d4l3k deleted the gh/fegin/1/head branch January 10, 2025 21:06
4 participants