What am I doing wrong? Is it impossible to train the 345M model on multiple GPUs, or are my GPUs simply not big enough? If that's the case, what GPU size and how many GPUs would work?
Is this the right process?
I am using an ml.p3.8xlarge instance on AWS with 4x V100 GPUs (16 GB each). I am trying to run train-horovod.py to train the 345M model on these 4 GPUs. I am running this command -
I am using batch_size == 1. Still, I am getting this error repeatedly:
```
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1,1023,50257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node strided_slice_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[Mean/_5215]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[1,1023,50257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node strided_slice_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

...
...

tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1,1023,50257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node strided_slice_1 (defined at /home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
........
........
```
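For reference, the report_tensor_allocations_upon_oom hint in the traceback refers to the TF 1.x RunOptions proto, which can be passed to Session.run. A minimal, self-contained sketch of enabling it (the toy graph below is only a stand-in for the script's own session and training ops):

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Toy graph standing in for the real training graph in train-horovod.py.
x = tf.placeholder(tf.float32, shape=[None, 4])
loss = tf.reduce_mean(tf.square(x))

# Per the hint in the OOM message: dump the live allocations when OOM happens.
opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

with tf.Session() as sess:
    print(sess.run(loss,
                   feed_dict={x: np.ones((2, 4), np.float32)},
                   options=opts))
```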
Running the nvidia-smi command, this was the state of the GPUs just before the error occurred. If anyone has solved this problem, or has faced it and can share their experience, it would help a great deal. Thanks.
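One note that may frame the question: Horovod is data-parallel, so each of the 4 GPUs holds a full copy of the 345M model and its optimizer state; adding GPUs raises throughput, not per-GPU memory. A rough back-of-the-envelope on the numbers in the OOM message (my own estimates, not figures from the repo):

```python
# Back-of-the-envelope memory math; all figures are rough assumptions.
GB = 2 ** 30

# The tensor in the OOM message: shape [1, 1023, 50257], float32 (4 bytes).
logits_gb = 1 * 1023 * 50257 * 4 / GB   # ~0.19 GB for one logits tensor

# 345M float32 parameters trained with Adam: weights + gradients
# + two moment buffers, i.e. roughly 4 copies of the parameters.
adam_gb = 345e6 * 4 * 4 / GB            # ~5.1 GB before any activations

print(f"logits: {logits_gb:.2f} GB, params + Adam state: {adam_gb:.2f} GB")
```

On a 16 GB card that leaves limited headroom once per-layer activations at sequence length 1023 are added, which would be consistent with hitting OOM even at batch_size == 1.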