Use gloo as part of DeviceGPU's process group backend #3509

Merged on Aug 5, 2024 (14 commits)

@snarayan21 commented Jul 31, 2024

What does this PR do?

Following PyTorch's default, this PR uses gloo for CPU and nccl for GPU as the process group backend for DeviceGPU.

Older versions of torch / composer do not support checkpoint load/save with the gloo + nccl multi-backend, so this change is restricted to torch >=2.3.0.

Successful daily test (partial): https://github.com/mosaicml/composer/actions/runs/10221614957
Another successful daily test (full): https://github.com/mosaicml/composer/actions/runs/10254386328

What issue(s) does this change relate to?

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@snarayan21 marked this pull request as draft July 31, 2024 19:59
@snarayan21 marked this pull request as ready for review August 5, 2024 14:31
Review comment on composer/devices/device_gpu.py (outdated, resolved)
@snarayan21 enabled auto-merge (squash) August 5, 2024 17:03
@snarayan21 disabled auto-merge August 5, 2024 17:03
@snarayan21 closed this Aug 5, 2024
@snarayan21 reopened this Aug 5, 2024
@snarayan21 commented:

Debugging daily test failures

@snarayan21 commented:

tfw you put or instead of and....

@snarayan21 merged commit cccc8a7 into dev Aug 5, 2024
35 of 54 checks passed
@snarayan21 deleted the saaketh/gloo_default branch August 5, 2024 20:16
snarayan21 added a commit that referenced this pull request Aug 6, 2024
snarayan21 added a commit that referenced this pull request Aug 6, 2024
3 participants