Can PyTorchJob skip or cancel the init container? #352
Comments
You might see a couple of restarts in worker pods until the master pod is up. I don't see any other problem. I haven't tested it.
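A minimal sketch of what that would look like, assuming the init container is skipped entirely (my illustration, not a maintainer-confirmed recipe): the Worker replica would need restartPolicy: OnFailure so that worker pods which exit before the master is reachable are retried instead of failing the job.

    pytorchReplicaSpecs:
      Worker:
        replicas: 1
        restartPolicy: OnFailure   # retry worker pods that crash before the master is up
        template:
          spec:
            containers:
              - name: pytorch
                image: shuaix/pytorch-dist-mnist:1.0   # image taken from the YAML posted below
                command: ["python", "-m", "torch.distributed.launch", "--nproc_per_node=4",
                          "--nnodes=2", "--node_rank=1", "--master_addr=10.252.192.48", "mnist_ddp.py"]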
Hi @johnugeorge, I also ran into the same problem as @SeibertronSS. I want to speed up PyTorchJob training so that it matches the training speed I get on bare metal. Here is what I do:
Ideally, training would start as if I ran the commands below on nodes 48 and 49:

# on node 48
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="10.252.192.48" --master_port=22222 mnist_ddp_launch.py
# on node 49
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="10.252.192.48" --master_port=22222 mnist_ddp_launch.py

However, the experiment shows that the PyTorchJob master is running while the worker never gets past its init container:

$ kubectl get pods -o wide
NAME                 READY   STATUS     RESTARTS   AGE     IP              NODE
mnist-ddp-master-0   1/1     Running    0          2m48s   10.252.192.48   gpu-10-252-192-48
mnist-ddp-worker-0   0/1     Init:0/1   0          2m48s   10.252.192.49   gpu-10-252-192-49
$ kubectl describe pod mnist-ddp-worker-0
...
Status:          Pending
IP:              10.252.192.49
IPs:
  IP:            10.252.192.49
Controlled By:   PyTorchJob/mnist-ddp
Init Containers:
  init-pytorch:
    ...
    Command:
      sh
      -c
      until nslookup mnist-ddp-master-0; do echo waiting for master; sleep 2; done;
    State:          Running   # always sleeping; never gets past the init-pytorch container
      Started:      Mon, 13 Sep 2021 16:22:12 +0800
    Ready:          False
    Restart Count:  0

I just want to know whether there is a way to use PyTorchJob like on bare metal. Thanks very much. Here is the YAML file I used to start the PyTorchJob (see also the note on hostNetwork and DNS after it):

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
name: "mnist-ddp"
namespace: "default"
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: Never
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- name: pytorch
image: shuaix/pytorch-dist-mnist:1.0
imagePullPolicy: IfNotPresent
command:
[
"python",
"-m",
"torch.distributed.launch", # launch 4 processes in a pod
"--nproc_per_node=4",
"--nnodes=2",
"--node_rank=0", # node rank 0
"--master_addr=10.252.192.48", # master IP -> host network IP
"mnist_ddp.py",
]
resources:
limits:
nvidia.com/gpu: 4 # assign 4 gpus for each pod
hostIPC: true
hostNetwork: true # turn on host Network
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- gpu-10-252-192-48 # assgin pod to 48
Worker:
replicas: 1
restartPolicy: Never
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- name: pytorch
image: shuaix/pytorch-dist-mnist:1.0
imagePullPolicy: IfNotPresent
command:
[
"python",
"-m",
"torch.distributed.launch",
"--nproc_per_node=4",
"--nnodes=2",
"--node_rank=1", # node rank 1
"--master_addr=10.252.192.48", # master IP -> host network IP
"mnist_ddp.py",
]
resources:
limits:
nvidia.com/gpu: 4 # assign 4 gpus for each pod
hostIPC: true
hostNetwork: true # turn on host Network
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- gpu-10-252-192-49 # assgin pod to 49 |
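A note on the stuck init container (my own reading, not confirmed in this thread): when hostNetwork is true and no dnsPolicy is set, Kubernetes falls back to the node's resolvers, so the in-cluster Service name mnist-ddp-master-0 that the init container tries to nslookup may never resolve. A minimal sketch of the pod-template change that keeps host networking but still uses cluster DNS:

        spec:
          hostIPC: true
          hostNetwork: true                    # keep host networking for bare-metal-like performance
          dnsPolicy: ClusterFirstWithHostNet   # use cluster DNS so mnist-ddp-master-0 resolves inside the init container
          # containers, resources, and affinity stay exactly as in the spec above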
Hello,
Dear developers, I ran into a question when using PyTorchJob: can PyTorchJob skip or cancel the init container?
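For context, the init container that the operator injects into worker pods looks roughly like this (reconstructed from the kubectl describe output above; the image name is a placeholder, since it is set by the operator and not shown in this thread):

    initContainers:
      - name: init-pytorch
        image: <operator-configured-image>   # placeholder; the actual image is chosen by the operator
        command:
          - sh
          - -c
          - until nslookup mnist-ddp-master-0; do echo waiting for master; sleep 2; done;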