This repository has been archived by the owner on Sep 19, 2022. It is now read-only.

Pytorch version may have an effect on the training reproduction #355

Open
Shuai-Xie opened this issue Sep 21, 2021 · 4 comments

Comments

@Shuai-Xie

I am trying to figure out why Bare Metal (BM) and PytorchJob (PJ) produce different training results in #354 (comment).

I have now found that PyTorch 1.8.0 and 1.9.0 produce different training results on both BM and PJ.

Experiment settings

  • Two V100 GPU machines 48/49, each with 4 cards, for 8 GPUs in total.
  • DDP training of ResNet-18 on the MNIST dataset with batchsize=256 and epochs=1
  • Random seed set to 1
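
For context, a fixed seed by itself does not always make GPU training deterministic, because cuDNN autotuning and some CUDA kernels can still vary between runs. Below is a minimal sketch of a seeding helper under that assumption. The name `seed_everything` is hypothetical and is not taken from the linked repository, which may already do something similar. In a DDP job it would be called once in every process before the model and data loaders are created.

```python
import random

try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def seed_everything(seed: int = 1) -> None:
    """Seed Python and, if available, PyTorch RNGs (hypothetical helper)."""
    random.seed(seed)
    if torch is None:
        return
    torch.manual_seed(seed)            # CPU generator
    torch.cuda.manual_seed_all(seed)   # all visible GPUs; ignored without CUDA
    # Ask cuDNN for deterministic kernels and disable autotuning, which may
    # otherwise pick different algorithms from run to run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


if __name__ == "__main__":
    seed_everything(1)
    print(random.random())
```

Even with these settings, results are not guaranteed to match across different PyTorch or CUDA versions, so this only narrows down the possible sources of variation.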

BM

# torch             1.8.0+cu111
# torchvision       0.9.0+cu111
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.2320
Train Epoch: 0 [20/30]  loss=0.8108
Test Epoch: 0 [0/40]    acc=33.5938
Test Epoch: 0 [10/40]   acc=35.5469
Test Epoch: 0 [20/40]   acc=34.7098
Test Epoch: 0 [30/40]   acc=35.0302
Test Epoch: 0, acc=35.7200
test acc: 35.72, best acc: 35.72
training seconds: 19.506625175476074
best_acc: 35.72

# torch             1.9.0+cu111
# torchvision       0.10.0+cu111
Train Epoch: 0 [0/30]   loss=2.5137
Train Epoch: 0 [10/30]  loss=2.4295
Train Epoch: 0 [20/30]  loss=0.9048
Test Epoch: 0 [0/40]    acc=63.2812
Test Epoch: 0 [10/40]   acc=64.9858
Test Epoch: 0 [20/40]   acc=63.8021
Test Epoch: 0 [30/40]   acc=63.9365
Test Epoch: 0, acc=64.1200
test acc: 64.12, best acc: 64.12
training seconds: 18.64181399345398
best_acc: 64.12

PJ

I built Docker images from different versions of the PyTorch base image.

# FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.5132
Train Epoch: 0 [20/30]  loss=0.7198
Test Epoch: 0 [0/40]    acc=38.2812
Test Epoch: 0 [10/40]   acc=40.9091
Test Epoch: 0 [20/40]   acc=39.8996
Test Epoch: 0 [30/40]   acc=40.4738
Test Epoch: 0, acc=40.9600
test acc: 40.96, best acc: 40.96
training seconds: 20.630347967147827
best_acc: 40.96

# FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
Train Epoch: 0 [0/30]   loss=2.5137
Train Epoch: 0 [10/30]  loss=2.3939
Train Epoch: 0 [20/30]  loss=0.6989
Test Epoch: 0 [0/40]    acc=67.5781
Test Epoch: 0 [10/40]   acc=69.2827
Test Epoch: 0 [20/40]   acc=68.4152
Test Epoch: 0 [30/40]   acc=67.8805
Test Epoch: 0, acc=67.9700
test acc: 67.97, best acc: 67.97
training seconds: 26.458710193634033
best_acc: 67.97
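
One way to read the logs above is to find the first logged step at which two runs disagree. The sketch below parses the `Train Epoch` lines with a regular expression and reports that step. The function names `parse_losses` and `first_divergence` and the tolerance are assumptions made for illustration; the log excerpts are copied from the 1.8.x runs above.

```python
import re

# Matches lines such as "Train Epoch: 0 [10/30]  loss=2.2320"
LOSS_RE = re.compile(r"Train Epoch: (\d+) \[(\d+)/(\d+)\]\s+loss=([\d.]+)")


def parse_losses(log: str) -> dict:
    """Map (epoch, step) -> loss for every 'Train Epoch' line in a log."""
    return {
        (int(m.group(1)), int(m.group(2))): float(m.group(4))
        for m in LOSS_RE.finditer(log)
    }


def first_divergence(log_a: str, log_b: str, tol: float = 1e-4):
    """Return the first (epoch, step) whose losses differ by more than tol."""
    a, b = parse_losses(log_a), parse_losses(log_b)
    for key in sorted(a.keys() & b.keys()):
        if abs(a[key] - b[key]) > tol:
            return key
    return None  # no logged step differs


bm_180 = """
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.2320
Train Epoch: 0 [20/30]  loss=0.8108
"""

pj_181 = """
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.5132
Train Epoch: 0 [20/30]  loss=0.7198
"""

print(first_divergence(bm_180, pj_181))  # (0, 10)
```

For these two runs the first logged loss is identical and the runs differ by step 10, which suggests (but does not prove) that initialization and the first batch match and the difference appears during training. Only every tenth step is logged, so the true first differing step could be anywhere from 1 to 10.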

Please let me know if there is a mistake in my code. I've posted it here: https://github.com/Shuai-Xie/mnist-pytorchjob-example.

@gaocegege
Member

PyTorch 1.9.0 introduces elastic distributed training, but it is not stable yet. You may want to wait until 1.9.1 is released and try again.

@zw0610
Member

zw0610 commented Sep 21, 2021

The version in the title refers to PyTorch, not PyTorchJob. Let's pin it to 1.8.0 and see how the difference is introduced.
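
Pinning is easier to verify if each process checks the installed version at startup. Note that the logs above compare bare metal on 1.8.0 with an image tagged 1.8.1. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `matches_pin` are hypothetical.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned: str) -> bool:
    """Compare versions, ignoring a local suffix such as '+cu111'."""
    if installed is None:
        return False
    return installed.split("+", 1)[0] == pinned


if __name__ == "__main__":
    found = installed_version("torch")
    print("torch:", found, "| matches 1.8.0:", matches_pin(found, "1.8.0"))
```

Running this on both the bare-metal hosts and inside the container would confirm whether the two environments really use the same PyTorch release.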

@Shuai-Xie
Author

> The version in the title refers to PyTorch, not PyTorchJob. Let's pin it to 1.8.0 and see how the difference is introduced.

Oh yes, sorry for the mistake. I'll change it right now.

Shuai-Xie changed the title from "PytorchJob version has an effect on the training reproduction" to "Pytorch version may have an effect on the training reproduction" on Sep 21, 2021
@Shuai-Xie
Author

Shuai-Xie commented Sep 21, 2021

Thanks for your kind replies, @zw0610 and @gaocegege.

I'll pin PyTorch to 1.8.0 in the following experiments and hope to resolve this problem soon with your help.

By the way, the example https://github.com/kubeflow/pytorch-operator/blob/master/examples/mnist/Dockerfile uses pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime as its base image.

Many thanks.
