dist.init_process_group stuck #313

ravenj73 · 2020-12-22T11:14:08Z

Hi, I'm trying to start distributed training with a Kubeflow PyTorchJob. However, the call

dist.init_process_group(backend="nccl", init_method='tcp://'+args.master_addr+':'+args.master_port, world_size=args.world_size, rank=args.rank)

doesn't work. If I use os.environ["MASTER_PORT"] as args.master_addr, it fails with:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/admin/aihub/pytorchjobDemo2.py", line 115, in main_worker
    dist.init_process_group(backend="nccl", init_method='tcp://'+args.master_addr+':'+args.master_port, world_size=args.world_size, rank=args.rank)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
ValueError: host not found: Name or service not known

If I pass the pod IP instead (os.environ["RequestedIP"], which I think is the IP of the master pod), it just hangs there for a long time.

Do you know what I should use? Thanks!
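
For reference, a minimal sketch of assembling the same init_process_group arguments from environment variables rather than command-line arguments. It assumes the operator injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK into each pod (only MASTER_PORT is mentioned explicitly above); the helper names rendezvous_from_env and init_from_env are hypothetical.

```python
import os


def rendezvous_from_env(env=None):
    """Build keyword arguments for dist.init_process_group from environment variables.

    Assumption: MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK are set in the pod.
    """
    env = os.environ if env is None else env
    required = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    return {
        "backend": "nccl",
        # Same tcp:// form as in the call above, built from the environment.
        "init_method": f"tcp://{env['MASTER_ADDR']}:{env['MASTER_PORT']}",
        # Environment values are strings; world_size and rank must be integers.
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
    }


def init_from_env():
    """Initialize the process group (torch is imported lazily, only when called)."""
    import torch.distributed as dist

    dist.init_process_group(**rendezvous_from_env())
```

Printing rendezvous_from_env() inside each pod before calling init_from_env() shows which address and port every rank is about to use.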


gaocegege · 2020-12-22T11:15:38Z

Can you verify that the service works well in your environment?

gaocegege · 2020-12-22T11:16:08Z

We will create a headless service and use that service name to run the training job. Seems that the service discovery is broken.

ravenj73 · 2020-12-22T11:18:48Z

> Can you verify that the service works well in your environment?

Thanks for replying. How can I verify that?

> We will create a headless service and use that service name to run the training job. Seems that the service discovery is broken.

Could you please provide more explanation? I'm not familiar with the terms... sorry about that!
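
One possible way to check this from inside a worker pod, sketched with the Python standard library only. The function names can_resolve and can_connect are illustrative and not part of any Kubeflow or PyTorch API.

```python
import socket


def can_resolve(host):
    """Return True if the host name resolves to at least one address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        # The same kind of lookup failure that surfaces as "Name or service not known".
        return False


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, timeouts and lookup failures.
        return False
```

Called with the master address and port the job uses, a False from can_resolve would be consistent with the ValueError in the traceback, while a name that resolves but fails can_connect would be consistent with the hang. Note that can_connect can only succeed once the master process is already listening on that port.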

gaocegege · 2020-12-22T11:23:11Z

https://kubernetes.io/docs/concepts/services-networking/service/#headless-services

ravenj73 · 2020-12-22T11:48:31Z

I didn't create any service. So I should create a headless service first, then deploy the pytorchjob yaml file, right? Or install some add-on like CoreDNS on the k8s cluster?

gaocegege · 2020-12-22T12:00:22Z

No, pytorch-operator will create such a headless service for you. But pytorch-operator does not guarantee that the service works well in your k8s cluster.

ravenj73 · 2020-12-23T06:08:52Z

Thank you very much for the hint. When I deployed a job, it says

Events:
  Type     Reason                          Age              From               Message
  ----     ------                          ----             ----               -------
  Warning  SettedPodTemplateRestartPolicy  6s (x2 over 6s)  PytorchController  Restart policy in pod template will be overwritten by restart policy in replica spec
  Normal   SuccessfulCreatePod             6s               PytorchController  Created pod: demo2-224-pre-master-0
  Normal   SuccessfulCreateService         6s               PytorchController  Created service: demo2-224-pre-master-0
  Normal   SuccessfulCreatePod             6s               PytorchController  Created pod: demo2-224-pre-worker-0

I checked demo2-224-pre-master-0, and it is indeed not working properly.

Then I created a headless service:

apiVersion: v1
kind: Service
metadata:
  name: pytorchjob-headless-service
spec:
  clusterIP: None
  selector:
    app: pytorchjob-headless-service-selector
  ports:
    - protocol: TCP
      port: 23456
      targetPort: 23456

and added the selector label to the Master pod template of the PyTorchJob:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  labels:
    jobtype: pytorchjob
    platform: k8s
    run: demo2-224-pre
  name: demo2-224-pre
  namespace: ailabs
spec:
  pytorchReplicaSpecs:
    Master:
      backoffLimit: 0
      replicas: 1
      template:
        metadata:
          annotations:
            ...
            sidecar.istio.io/inject: "false"
          labels:
            ...
            app: pytorchjob-headless-service-selector
        ...

and deployed again. When I check pytorchjob-headless-service, there is still no pod attached to it. What did I do wrong?

Thanks!!
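
As background for the question above: a Service with an equality-based selector picks up Pods in its own namespace whose labels contain every key/value pair in the selector. The sketch below mimics that rule in plain Python. It is not the Kubernetes implementation, and the function name service_selects is illustrative. The Service manifest above has no namespace field while the PyTorchJob is in ailabs, so whether the two share a namespace depends on how the Service was applied.

```python
def service_selects(service_ns, selector, pod_ns, pod_labels):
    """Return True if a Service with `selector` in `service_ns` would select the pod.

    Simplified model: same namespace, and every selector key/value pair
    is present in the pod's labels.
    """
    if not selector:
        # A Service without a selector does not pick up pods automatically.
        return False
    if service_ns != pod_ns:
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())
```

For example, with the selector app: pytorchjob-headless-service-selector, a pod carrying that label in ailabs is selected only when the Service is also in ailabs.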

gaocegege · 2020-12-23T06:47:40Z

Do you have WeChat or are you in Slack?

ravenj73 · 2020-12-23T06:49:43Z

Just sent a friend request. Thank you very much for your time!!
