
[LoweringContext] Support an optimized parameter mapping for SPMD #8460

Merged: 1 commit into pytorch:master on Dec 7, 2024

Conversation

rpsilva-aws (Contributor)
Currently, the existing parameter mapping for the lowering context is not well suited for SPMD. For large models, it causes a large synchronous bottleneck when transferring all device data to the host. The bottleneck comes from the ReplicateShardedData computation that gathers and reassembles each piece of sharded data across multiple devices. This is by design, since the mapping is expected to collect all parameters regardless of their allocation.

In this PR, we introduce a new mapping that does not invoke the sharded replication, but instead uses references to the device data. This is generally sufficient and preferred in most cases, where the user only wants to access the valid parameters (those for which tensor_parameter_id does not return -1; a return value of -1 marks a 'fake' parameter).
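For illustration, a minimal sketch of how the two mappings might be consumed from Python. The `torch_xla._XLAC.lowering.LoweringContext` entry point follows existing tests, and the `device_parameter_id_tensor_mapping` name is an assumption based on this description rather than a verbatim reproduction of the merged API:

```python
# Hedged sketch: the constructor path and the device_parameter_id_tensor_mapping
# name are assumptions based on the PR description and existing tests.
import torch
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.randn(4, 4, device=device)
b = torch.randn(4, 4, device=device)
out = a @ b

ctx = torch_xla._XLAC.lowering.LoweringContext()  # assumed entry point
ctx.build([out])  # lower the graph rooted at `out`

# Existing mapping: materializes every parameter on the host, which triggers
# a ReplicateShardedData gather per sharded parameter under SPMD.
host_mapping = ctx.parameter_id_tensor_mapping()

# Optimized mapping (name assumed): keeps references to the device data and
# skips the sharded replication, covering only the valid parameters, i.e.
# those for which ctx.tensor_parameter_id(...) would not return -1.
device_mapping = ctx.device_parameter_id_tensor_mapping()

print(len(host_mapping), len(device_mapping))
```

Since the optimized mapping never forces a gather, the host-side cost stays proportional to the number of parameter handles rather than the full sharded data volume.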

@rpsilva-aws (Contributor, Author)

Re-opened from #8453, cleaned up the merge commit.

@tengyifei tengyifei self-requested a review December 5, 2024 23:51
@tengyifei tengyifei added the tpuci label Dec 5, 2024
@tengyifei tengyifei marked this pull request as ready for review December 5, 2024 23:52
@rpsilva-aws rpsilva-aws force-pushed the rpsilva_lc_mapping_v3 branch from 8fd7ac7 to 9858577 on December 5, 2024 23:53
@tengyifei tengyifei merged commit 5d11f66 into pytorch:master Dec 7, 2024
12 checks passed
@rpsilva-aws rpsilva-aws deleted the rpsilva_lc_mapping_v3 branch December 9, 2024 19:03