This repository has been archived by the owner on Oct 31, 2022. It is now read-only.
Hi,
I am getting an error when trying to train the 345M model on the GPU. If I use the CPU it trains fine, albeit very slowly. I am using an Nvidia GTX 1070 and have CUDA and cuDNN installed.
interactive_conditional_samples.py and generate_unconditional_samples.py work fine on the GPU, so I know the GPU itself is working; I only hit the OOM when trying to train.
I tried the "--optimizer sgd" flag with the default batch_size of 1:
python train.py --dataset data.npz --model_name 345M --optimizer sgd
Error (truncated):
2021-01-31 21:52:28.744480: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory
trying to allocate 4.00MiB (rounded to 4194304). Current allocation summary follows.
2021-01-31 21:52:28.744901: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256): Total Chunks: 276, Chunks
in use: 266. 69.0KiB allocated for chunks. 66.5KiB in use in bin. 1.0KiB client-requested in use in bin.
2021-01-31 21:52:28.745756: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (512): Total Chunks: 0, Chunks in
use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-01-31 21:52:28.746117: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1024): Total Chunks: 3, Chunks in
use: 1. 4.0KiB allocated for chunks. 1.3KiB in use in bin. 1.0KiB client-requested in use in bin.
...
2021-01-31 21:52:28.966844: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 6.64GiB
2021-01-31 21:52:28.966890: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 71357135
36 memory_limit_: 7135713690 available bytes: 154 curr_region_allocation_bytes_: 8589934592
2021-01-31 21:52:28.966950: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit: 7135713690
InUse: 7134272000
MaxInUse: 7134274048
NumAllocs: 3078
MaxAllocSize: 268435456
2021-01-31 21:52:28.967091: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ***************************************
2021-01-31 21:52:28.967156: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cwise_ops_common.cc:82 :
Resource exhausted: OOM when allocating tensor with shape[1,16,1024,1024] and type bool on /job:localhost/replica:0/task:0
/device:GPU:0 by allocator GPU_0_bfc
Any idea on how to resolve this?
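For context, the allocator stats in the log already explain the failure: the limit is about 6.65 GiB and nearly all of it is in use, so even the 4 MiB request cannot be satisfied. A quick back-of-the-envelope check, using only the numbers from the log above (the "attention mask" reading of the [1, 16, 1024, 1024] bool tensor is my interpretation, since 345M uses 16 heads and a 1024-token context):

```python
# Figures copied from the BFC allocator stats in the log above.
limit_bytes = 7_135_713_690      # Limit: total GPU memory TF may use (~6.65 GiB)
in_use_bytes = 7_134_272_000     # InUse: memory already allocated
request_bytes = 4 * 1024 * 1024  # the 4.00 MiB allocation that failed

headroom = limit_bytes - in_use_bytes
print(f"headroom: {headroom / 2**20:.2f} MiB")       # ~1.37 MiB left
print(f"request fits: {request_bytes <= headroom}")  # False -> OOM

# The tensor named in the OOM message: shape [1, 16, 1024, 1024], dtype bool
# (1 byte per element in TensorFlow) -- likely an attention mask per head.
mask_bytes = 1 * 16 * 1024 * 1024 * 1
print(f"mask tensor: {mask_bytes / 2**20:.0f} MiB")  # 16 MiB
```

So the failed 4 MiB chunk is just the allocation that happened to tip it over; the model and optimizer state have already consumed essentially the whole 8 GB card.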
Thank you for your reply. I am trying to get my hands on a 3080 or 3090 for this very reason, and your screenshot and message just confirmed I need the 3090!
I honestly couldn't recommend getting a 3090 just for training/fine-tuning 345M GPT-2. 117M is definitely good enough for every use case (for me, anyway) if your 1070 can handle that. I only use it to train when I'm at work. ;)
I agree. It won't solely be for training; I just wanted to justify it for my gaming needs ;) That said, it's almost impossible to find a 3080/3090 for a good price, so I'm waiting it out. I do appreciate everyone's input. I did train the 117M model and it does seem to give good results. There are also the Google Colab notebooks that everyone has been using to train, since they give you free access to an Nvidia T4.