dist.init_process_group stuck #313
Can you verify that the service works well in your environment?
We will create a headless service and use that service name to run the training job. It seems that service discovery is broken.
Thanks for replying. How can I verify that?
Could you please provide more explanation? I'm not familiar with the terms... sorry about that!
I didn't create any service. So I should create a headless service first, then deploy the PyTorchJob YAML file, right? Or install some add-on like CoreDNS on the k8s cluster?
No, pytorch-operator will create such a headless service for you. But pytorch-operator does not guarantee that the service works well in your k8s cluster.
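For anyone debugging the same thing: a quick way to check service discovery is to resolve the name the operator injects as MASTER_ADDR from inside a worker pod. A minimal sketch (the fallback hostname below is only illustrative; use your own job's master name):

```python
# Minimal DNS check, intended to be run from inside a worker pod.
# Assumes the operator injects MASTER_ADDR; the fallback name is illustrative.
import os
import socket

master_addr = os.environ.get("MASTER_ADDR", "demo2-224-pre-master-0")
try:
    addrs = socket.getaddrinfo(master_addr, None)
    print("resolved", master_addr, "->", sorted({a[4][0] for a in addrs}))
except socket.gaierror as err:
    print("DNS lookup failed for", master_addr, ":", err)
```

If the lookup fails, the problem is cluster DNS / the headless service rather than the training code.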
Thank you very much for the hint. When I deployed a job, it said:
I checked demo2-224-pre-master-0, and it was indeed not working properly. Then I created a headless service
and added
and deployed again. When checking pytorchjob-headless-service, there is still no pod attached to it. Where did I go wrong? Thanks!!
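One way to see whether any pod is attached to the service is to inspect its Endpoints object. A rough sketch with the kubernetes Python client, reusing the service name mentioned above and assuming the default namespace:

```python
# Check whether the headless Service has any endpoints, i.e. whether its
# selector matches any running pods. The service name and namespace below
# are the ones mentioned above and may differ in your cluster.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when run inside a pod
v1 = client.CoreV1Api()
ep = v1.read_namespaced_endpoints(name="pytorchjob-headless-service", namespace="default")
if not (ep.subsets or []):
    print("No pods attached (empty Endpoints); check that the Service selector matches the pod labels.")
else:
    for subset in ep.subsets:
        for addr in subset.addresses or []:
            print("endpoint:", addr.ip, addr.target_ref.name if addr.target_ref else "")
```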
Do you have WeChat, or are you in Slack?
Just sent a friend request. Thank you very much for your time!!
Hi, I'm trying to start distributed training with a Kubeflow PyTorchJob. However, the
dist.init_process_group(backend="nccl", init_method='tcp://'+args.master_addr+':'+args.master_port, world_size=args.world_size, rank=args.rank)
doesn't work. If I use os.environ["MASTER_PORT"] as args.master_addr, it says
If I use the pod IP (os.environ["RequestedIP"], which I think is the IP of the master pod), it just hangs there for a long time.
Do you know what I should use? Thanks!
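For what it's worth, the common pattern with a PyTorchJob is to rely on the environment variables the operator injects into every replica (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK) and to use the env:// init method instead of building a tcp:// URL by hand. A minimal sketch, assuming those variables are present:

```python
# Minimal sketch: initialise the process group from the environment variables
# that pytorch-operator injects (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK).
import os

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    init_method="env://",
    world_size=int(os.environ["WORLD_SIZE"]),
    rank=int(os.environ["RANK"]),
)
print("rank", dist.get_rank(), "of", dist.get_world_size(), "initialised")
```

With env://, MASTER_ADDR and MASTER_PORT are read directly from the environment, so there is no need to pass them through command-line arguments.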