
Can PytorchJob skip or cancel the init container? #352

Open
SeibertronSS opened this issue Sep 15, 2021 · 2 comments

Comments

@SeibertronSS

Hello, dear developers. I ran into a question while using PyTorchJob: can PyTorchJob skip or cancel the init container?

@johnugeorge
Member

You might see a couple of restarts in the worker pods until the master pod is up. I don't see any other problem, but I haven't tested it.
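
For reference, the init container that the operator injects into each worker pod is essentially a DNS poll for the master Service, roughly like the loop below (the `<job-name>` placeholder and the exact image/command are illustrative and can differ between operator versions):

until nslookup <job-name>-master-0; do echo waiting for master; sleep 2; done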

@Shuai-Xie

Shuai-Xie commented Sep 16, 2021

Hi @johnugeorge, I ran into the same problem as @SeibertronSS.

I want to speed up PyTorchJob training so that it reaches a training speed comparable to running on bare metal.

What I did:

  • turned on hostNetwork for each pod
  • assigned 4 GPUs to each pod and launched 4 processes inside it (the GPU machines 48 and 49 each have 4 GPUs)

Ideally, the training would start as if I ran the commands below on 48 and 49:

# 48
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="10.252.192.48" --master_port=22222 mnist_ddp_launch.py
# 49
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="10.252.192.48" --master_port=22222 mnist_ddp_launch.py

However, in my experiments the PyTorchJob master reaches Running quickly while the worker stays stuck in Init:0/1.

$ kubectl get pods -o wide
NAME                 READY   STATUS     RESTARTS   AGE     IP              NODE
mnist-ddp-master-0   1/1     Running    0          2m48s   10.252.192.48   gpu-10-252-192-48
mnist-ddp-worker-0   0/1     Init:0/1   0          2m48s   10.252.192.49   gpu-10-252-192-49

$ kubectl describe pod mnist-ddp-worker-0
...
Status:       Pending
IP:           10.252.192.49
IPs:
  IP:           10.252.192.49
Controlled By:  PyTorchJob/mnist-ddp
Init Containers:
  init-pytorch:
    ...
    Command:
      sh
      -c
      until nslookup mnist-ddp-master-0; do echo waiting for master; sleep 2; done; 
    State:          Running		# loops forever; the pod never gets past the init-pytorch container
      Started:      Mon, 13 Sep 2021 16:22:12 +0800
    Ready:          False
    Restart Count:  0
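
One way to narrow this down (an untested sketch using the names from the output above) would be to run the same lookup by hand from inside the stuck init container and check which resolver the pod is actually using:

$ kubectl exec mnist-ddp-worker-0 -c init-pytorch -- nslookup mnist-ddp-master-0
$ kubectl exec mnist-ddp-worker-0 -c init-pytorch -- cat /etc/resolv.conf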

I just want to know whether there is a way to use PyTorchJob as if it were running on bare metal.

Thanks very much.


Here is the YAML file I use to start the PyTorchJob.

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "mnist-ddp"
  namespace: "default"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: shuaix/pytorch-dist-mnist:1.0
              imagePullPolicy: IfNotPresent
              command:
                [
                  "python",
                  "-m",
                  "torch.distributed.launch",          	# launch 4 processes in a pod
                  "--nproc_per_node=4",       
                  "--nnodes=2",
                  "--node_rank=0",                   	# node rank 0
                  "--master_addr=10.252.192.48",      	# master IP -> host network IP
                  "mnist_ddp.py",
                ]
              resources:
                limits:
                  nvidia.com/gpu: 4						# assign 4 gpus for each pod
          hostIPC: true
          hostNetwork: true                          	# turn on host Network
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: kubernetes.io/hostname
                        operator: In
                        values:
                          - gpu-10-252-192-48   		# assign pod to node 48
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: shuaix/pytorch-dist-mnist:1.0
              imagePullPolicy: IfNotPresent
              command:
                [
                  "python",
                  "-m",
                  "torch.distributed.launch",
                  "--nproc_per_node=4",
                  "--nnodes=2",
                  "--node_rank=1",               		# node rank 1
                  "--master_addr=10.252.192.48",		# master IP -> host network IP
                  "mnist_ddp.py",
                ]
              resources:
                limits:
                  nvidia.com/gpu: 4						# assign 4 gpus for each pod
          hostIPC: true
          hostNetwork: true                          	# turn on host Network
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: kubernetes.io/hostname
                        operator: In
                        values:
                          - gpu-10-252-192-49			# assign pod to node 49

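One thing worth checking (an assumption on my side, not something I have verified here): with hostNetwork: true, a pod whose dnsPolicy is left at the default falls back to the node's resolver instead of cluster DNS, so the init container's nslookup of the mnist-ddp-master-0 Service name may never resolve. If that is the cause, a minimal sketch of the change would be to add dnsPolicy: ClusterFirstWithHostNet next to hostNetwork in both the Master and Worker pod templates:

          hostIPC: true
          hostNetwork: true                          	# turn on hostNetwork
          dnsPolicy: ClusterFirstWithHostNet         	# keep using cluster DNS while on the host network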