torch.distributed.breakpoint(rank=1) hangs because of --local-ranks-filter 0 #652

Open
weifengpy opened this issue on Oct 25, 2024 · 0 comments
Labels: documentation

weifengpy (Contributor) commented on Oct 25, 2024

I was debugging rank 1 with torch.distributed.breakpoint(rank=1), but it always hangs. It turns out to be caused by --local-ranks-filter 0 in run_llama_train.sh (presumably the filter drops rank 1's console output, so the pdb prompt never shows up). Not sure if we want to remind people that these two things don't work well together.

I have to debug rank 1 (instead of rank 0) because dim-0 sharding can be uneven, and only rank 1 and above have padding.
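
(For context, here is a minimal standalone sketch, not from the issue, of the uneven dim-0 sharding: with 5 rows on 2 GPUs, rank 0's shard has 3 rows while rank 1's has only 2 and gets padded by FSDP2. The sizes are made up for illustration, and it assumes a recent PyTorch where torch.distributed.tensor is public; run with torchrun --nproc-per-node 2 uneven_shard_demo.py.)

# Hypothetical demo (not from the issue): shows uneven dim-0 sharding across 2 ranks.
# With 5 rows over 2 ranks, rank 0 holds 3 rows and rank 1 holds 2, so only rank 1's
# shard needs padding (which FSDP2 adds internally).
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

def main():
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    mesh = init_device_mesh("cuda", (2,))  # also initializes the default process group
    full = torch.arange(5 * 4, dtype=torch.float32).reshape(5, 4)
    sharded = distribute_tensor(full, mesh, [Shard(0)])  # shard rows across the 2 ranks
    print(f"rank {dist.get_rank()}: local shard shape = {tuple(sharded.to_local().shape)}")

if __name__ == "__main__":
    main()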

repro:

diff --git a/train.py b/train.py
index 7945949..e2843b2 100644
--- a/train.py
+++ b/train.py
@@ -64,6 +64,8 @@ def main(job_config: JobConfig):
     device = torch.device(f"cuda:{int(os.environ['LOCAL_RANK'])}")
     torch.cuda.set_device(device)
     utils.init_distributed(job_config)
+
+    torch.distributed.breakpoint(rank=1)
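
A possible workaround (my assumption, not verified here): include rank 1 in the log filter so the pdb prompt is actually shown, e.g. pass --local-ranks-filter 0,1 to torchrun instead of --local-ranks-filter 0, or, if run_llama_train.sh forwards a LOG_RANK variable to that flag, launch with LOG_RANK=0,1 ./run_llama_train.sh.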
tianyu-l added the bug label on Oct 25, 2024
tianyu-l added the documentation label and removed the bug label on Nov 22, 2024