Training on distributed machine is slow. Using 8 Nvidia V100. #28
Comments
I think distributed fine-tuning is not possible currently. As you can see, it is labelled as "out-of-date". I think you are better off trying to fine-tune with the PyTorch implementation from PyTorch-Transformers.
What command did you use to make it work? I am using AWS.
I am assuming my GPUs are not enough to train this 345M model even with batch_size 1. Would using more GPUs help me, or is multi-GPU training just not possible?
Are you talking about the PyTorch version? I was able to train the 345M version on a single V100.
No, I am using the code in this repository, which uses TensorFlow instead of PyTorch. Does the PyTorch version work better than the TensorFlow version? Also, which PyTorch version are you talking about? (Any link would be helpful.)
Yes, I agree. I also tried to run the 345M model using train.py from this very repository, which also uses TensorFlow. It successfully ran on a single V100, but I ran into problems, which I have explained in #53 and #52. It would be helpful if you went through those two issues too. Thank you.
I can recommend checking out Hugging Face Transformers. When I was working with GPT-2 it was PyTorch-only, but they have since extended the repository to TensorFlow as well. There should be examples of people doing exactly what you are trying to do. Best of luck!
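As a rough illustration of that suggestion, here is a minimal fine-tuning sketch with the Hugging Face library (PyTorch backend). The corpus file name, learning rate, and single optimisation step are placeholder assumptions, not anything from this thread or this repository:

```python
# Minimal sketch (not this repo's code): fine-tuning gpt2-medium with
# Hugging Face Transformers on a single GPU. "corpus.txt", the learning
# rate, and the single training step are illustrative placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").cuda()

text = open("corpus.txt").read()  # hypothetical training text
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
input_ids = enc["input_ids"].cuda()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

# Passing labels=input_ids makes the model compute the language-modelling loss.
loss = model(input_ids=input_ids, labels=input_ids).loss
loss.backward()
optimizer.step()
```

In practice you would wrap this in a proper data loader and training loop (or use the library's Trainer), but the core call pattern is the same.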
Hi @shamiul94! It's written in my notes that I used this command:
Also, as others said, a V100 can fit the gpt2-medium model with batch size 1.
Hi, @dimeldo! Huge thanks for your input!
I can't quite remember. I think the 345M one. I can't remember if multi-GPU worked out alright in the end or not. Good luck in your research!
I'm using an AWS p3dn.24xlarge to train my data on 8 Nvidia V100 GPUs, but training seems slower than on 1 GPU.
This is the config in train-horovod.py:

That's the output; as you can see, it takes a long time for each step. Trying to increase the batch size results in OOM.
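For context, a data-parallel Horovod script like train-horovod.py generally follows the standard Horovod + TensorFlow 1.x pattern sketched below. This is an illustrative outline under that assumption, not the repository's actual code; the placeholder loss and the learning-rate scaling heuristic are my own additions:

```python
# Sketch of the usual Horovod + TensorFlow 1.x data-parallel setup.
# `my_model_loss` is a placeholder standing in for the GPT-2 LM loss.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each worker process to a single GPU, so 8 processes drive 8 V100s.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder loss for illustration only.
w = tf.Variable(0.0)
my_model_loss = tf.square(w - 1.0)

# Each worker trains on its own batch, so the effective batch size is
# (per-GPU batch) * hvd.size(); a common heuristic is to scale the
# learning rate by the number of workers.
opt = tf.train.AdamOptimizer(1e-4 * hvd.size())
opt = hvd.DistributedOptimizer(opt)  # all-reduces gradients across GPUs each step
train_op = opt.minimize(my_model_loss)

# Broadcast initial weights from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    for _ in range(10):
        sess.run(train_op)
```

Two things follow from this pattern: each GPU holds a full copy of the model plus its own batch, so adding GPUs does not increase the memory available to any single replica (which is why raising the batch size still OOMs), and every step waits on an all-reduce of the gradients, so if that communication is slow the per-step time can end up longer than on a single GPU.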