
Include token as part of the input/output tuple in all-gather and reduce-scatter #7338

Closed
wants to merge 4 commits into from

Conversation


@jeffhataws jeffhataws commented Nov 28, 2023

(Replaced by #7677)

This is a follow-up change to #5740 that includes the token as part of the input/output tuple in all-gather and reduce-scatter. The change cleans up the interface to match all-reduce and also enables use of the XLA token type. Per discussions in the previous PR, we will open an RFC discussion on openxla-discuss.
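As a conceptual illustration of the change (this is a hedged sketch in plain Python, not the real XLA API: `all_gather`, `reduce_scatter`, and `Token` here are hypothetical stand-ins), threading a token through each collective's inputs and outputs creates a data dependency that pins the execution order of CC ops, the same way the all-reduce token already does:

```python
# Conceptual sketch (assumption: NOT the actual XLA implementation).
# Each collective consumes a token and produces a new one; a later
# collective that consumes the new token is therefore ordered after it.
class Token:
    def __init__(self, history=()):
        self.history = tuple(history)

def all_gather(tensors, token):
    # Hypothetical op: returns (outputs, new_token). The new token
    # depends on this op, so consumers of it run after this op.
    return [t * 2 for t in tensors], Token(token.history + ("all-gather",))

def reduce_scatter(tensors, token):
    return [t / 2 for t in tensors], Token(token.history + ("reduce-scatter",))

t = Token()
xs, t = all_gather([1, 2], t)
ys, t = reduce_scatter(xs, t)
# t.history records the order the token chain enforces.
```

The key point is only the shape of the interface: token in, token out, matching all-reduce.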

…tter

Imported from GitHub PR openxla#5740

This PR adds tuple input support to all-gather and reduce-scatter. This is a revival of part of tensorflow/tensorflow#58377 and to be used in conjunction with pytorch/xla#5624 .

In FSDP, different layers' weights need to be all-gathered/reduce-scattered during training. If some layers are small, multiple layers' weights can be aggregated for more efficient data transfer (the same concept as bucket_cap_mb in DDP). With the existing all-gather and reduce-scatter in PyTorch/XLA, you would have to do the bucketing and decomposing outside of the operation. This PR enables multiple different tensors to be all-gathered/reduce-scattered while keeping the original tensor shapes, so that bucketing and decomposing optimizations can happen inside the operation.
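The bucketing/decomposing idea can be sketched as follows (an illustrative NumPy toy, not the XLA implementation; `fake_all_gather` is a hypothetical stand-in that simulates the collective locally for two replicas):

```python
# Illustrative sketch (assumption: NOT the real collective) of coalescing
# several small tensors into one bucket so a single transfer moves them all.
import numpy as np

def bucket(tensors):
    """Flatten and concatenate tensors into one 1-D bucket."""
    shapes = [t.shape for t in tensors]
    flat = np.concatenate([t.ravel() for t in tensors])
    return flat, shapes

def unbucket(flat, shapes):
    """Split a 1-D bucket back into tensors of the original shapes."""
    out, offset = [], 0
    for s in shapes:
        n = int(np.prod(s))
        out.append(flat[offset:offset + n].reshape(s))
        offset += n
    return out

def fake_all_gather(flat, world_size=2):
    # Stand-in for the real collective: every replica contributes the
    # same data here, so the result is world_size stacked copies.
    return np.concatenate([flat] * world_size)

# Two small "layer weights" coalesced into one transfer.
w1 = np.arange(4.0).reshape(2, 2)
w2 = np.arange(3.0)
flat, shapes = bucket([w1, w2])
gathered = fake_all_gather(flat)
# Decompose each replica's contribution back to the original shapes.
per_replica = [unbucket(chunk, shapes) for chunk in np.split(gathered, 2)]
```

With tuple support in the op itself, the `bucket`/`unbucket` steps move inside the collective, so callers pass the original tensors directly.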

The original PR had token support, like the token used for all-reduce, to ensure ordering between CC ops. That will be a separate PR if needed.

Copybara import of the project:

--
7ea1159 by Junmin Hao <[email protected]>:

Add Tuple input and token support to all-gather and reduce-scatter.

Committer: Junmin Hao <[email protected]>

--
cdb873e by Junmin Hao <[email protected]>:

lint fix

--
aad3521 by Jeffrey Huynh <[email protected]>:

Fix hlo_verifier_test failure due to changed expectation

--
32e8145 by Jeffrey Huynh <[email protected]>:

Separate the token change out into a separate PR with RFC.

--
b301c2a by Jeffrey Huynh <[email protected]>:

Change *WithToken tests to *WithTuple

--
5890278 by Jeffrey Huynh <[email protected]>:

Fix missing parenthesis

Merging this change closes openxla#5740

COPYBARA_INTEGRATE_REVIEW=openxla#5740 from jeffhataws:ag_rs_coalesce_revived 14e09f0
PiperOrigin-RevId: 573976449
@github-actions github-actions bot added the kokoro:force-run Forces CI to rerun label Nov 28, 2023
@ddunl ddunl added kokoro:force-run Forces CI to rerun and removed kokoro:force-run Forces CI to rerun labels Nov 28, 2023
@kamaljeeti kamaljeeti requested a review from ezhulenev November 29, 2023 07:08
@ddunl (Member) commented Nov 29, 2023

I think if you rebase the CI will run as expected, sorry about that

@kamaljeeti (Contributor) commented:

Hi @jeffhataws, Can you please check @ddunl comments? Thanks.

@jeffhataws (Contributor, Author) commented:

> Hi @jeffhataws, Can you please check @ddunl comments? Thanks.

Rebased in another branch, so switching to #7677.

@jeffhataws (Contributor, Author) commented:

Switching to #7677

@jeffhataws jeffhataws closed this Dec 11, 2023