What am I doing wrong? Is it impossible to train the 345M model on multiple GPUs, or are my GPUs simply not big enough? If that's the case, what GPU size and how many GPUs would work?
Is this the right process?
I am using an ml.p3.8xlarge instance on AWS with 4x V100 GPUs (16 GB each). I am trying to run train-horovod.py to train the 345M model on these 4 GPUs. I am running this command -
I am using batch_size == 1. Still, I am getting this error repeatedly:
```
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1,1023,50257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node strided_slice_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[Mean/_5215]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[1,1023,50257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node strided_slice_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

...
...

tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1,1023,50257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node strided_slice_1 (defined at /home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
........
........
```
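For reference, the report_tensor_allocations_upon_oom hint in the traceback refers to the TF 1.x RunOptions proto, which can be passed to Session.run. A minimal, self-contained sketch of enabling it (the toy graph below is only a stand-in for the script's own session and training ops):

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Toy graph standing in for the real training graph in train-horovod.py.
x = tf.placeholder(tf.float32, shape=[None, 4])
loss = tf.reduce_mean(tf.square(x))

# Per the hint in the OOM message: dump the live allocations when OOM happens.
opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

with tf.Session() as sess:
    print(sess.run(loss,
                   feed_dict={x: np.ones((2, 4), np.float32)},
                   options=opts))
```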
Running the nvidia-smi command, this was the state of the GPUs just before the error occurred. If anyone has solved this problem, or has faced it and can share their experience, it would help a great deal. Thanks.
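One note that may frame the question: Horovod is data-parallel, so each of the 4 GPUs holds a full copy of the 345M model and its optimizer state; adding GPUs raises throughput, not per-GPU memory. A rough back-of-the-envelope on the numbers in the OOM message (my own estimates, not figures from the repo):

```python
# Back-of-the-envelope memory math; all figures are rough assumptions.
GB = 2 ** 30

# The tensor in the OOM message: shape [1, 1023, 50257], float32 (4 bytes).
logits_gb = 1 * 1023 * 50257 * 4 / GB   # ~0.19 GB for one logits tensor

# 345M float32 parameters trained with Adam: weights + gradients
# + two moment buffers, i.e. roughly 4 copies of the parameters.
adam_gb = 345e6 * 4 * 4 / GB            # ~5.1 GB before any activations

print(f"logits: {logits_gb:.2f} GB, params + Adam state: {adam_gb:.2f} GB")
```

On a 16 GB card that leaves limited headroom once per-layer activations at sequence length 1023 are added, which would be consistent with hitting OOM even at batch_size == 1.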