torch.distributed.breakpoint(rank=1) hangs because of --local-ranks-filter 0 #652

Open
weifengpy opened this issue on Oct 25, 2024 · 0 comments
Labels: documentation

weifengpy (Contributor) commented on Oct 25, 2024

I was debugging rank 1 with torch.distributed.breakpoint(rank=1), but it always hangs. It turns out to be caused by --local-ranks-filter 0 in run_llama_train.sh (presumably the filter drops rank 1's console output, so the pdb prompt never shows up). Not sure if we want to remind people that these two things don't work well together.

I have to debug rank 1 (instead of rank 0) because dim-0 sharding can be uneven, and only rank 1 and above have padding.
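
(For context, here is a minimal standalone sketch, not from the issue, of the uneven dim-0 sharding: with 5 rows on 2 GPUs, rank 0's shard has 3 rows while rank 1's has only 2 and gets padded by FSDP2. The sizes are made up for illustration, and it assumes a recent PyTorch where torch.distributed.tensor is public; run with torchrun --nproc-per-node 2 uneven_shard_demo.py.)

# Hypothetical demo (not from the issue): shows uneven dim-0 sharding across 2 ranks.
# With 5 rows over 2 ranks, rank 0 holds 3 rows and rank 1 holds 2, so only rank 1's
# shard needs padding (which FSDP2 adds internally).
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

def main():
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    mesh = init_device_mesh("cuda", (2,))  # also initializes the default process group
    full = torch.arange(5 * 4, dtype=torch.float32).reshape(5, 4)
    sharded = distribute_tensor(full, mesh, [Shard(0)])  # shard rows across the 2 ranks
    print(f"rank {dist.get_rank()}: local shard shape = {tuple(sharded.to_local().shape)}")

if __name__ == "__main__":
    main()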

repro:

diff --git a/train.py b/train.py
index 7945949..e2843b2 100644
--- a/train.py
+++ b/train.py
@@ -64,6 +64,8 @@ def main(job_config: JobConfig):
     device = torch.device(f"cuda:{int(os.environ['LOCAL_RANK'])}")
     torch.cuda.set_device(device)
     utils.init_distributed(job_config)
+
+    torch.distributed.breakpoint(rank=1)
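
A possible workaround (my assumption, not verified here): include rank 1 in the log filter so the pdb prompt is actually shown, e.g. pass --local-ranks-filter 0,1 to torchrun instead of --local-ranks-filter 0, or, if run_llama_train.sh forwards a LOG_RANK variable to that flag, launch with LOG_RANK=0,1 ./run_llama_train.sh.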
tianyu-l added the bug label on Oct 25, 2024
tianyu-l added the documentation label and removed the bug label on Nov 22, 2024