CUDA out of memory error: while allocating the memory #16

Open
jagadeesh09 opened this issue Mar 15, 2018 · 7 comments

@jagadeesh09

Hi

I am working on a Tesla K40 machine with a 12 GB GPU, and I keep running into this error. If I calculate the memory the VGG model needs for the batch size set in dataset.py, it is far less than the available GPU memory. What could be the reason, and how can I overcome it?
I hit the error after initializing the model and also while calling cuda().

THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCCachingHostAllocator.cpp line=258 error=2 : out of memory
Traceback (most recent call last):
  File "finetune.py", line 272, in <module>
    fine_tuner.train(epoches = 20)
  File "finetune.py", line 163, in train
    self.train_epoch(optimizer)
  File "finetune.py", line 182, in train_epoch
    for batch, label in self.train_data_loader:
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 281, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 301, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 81, in _worker_manager_loop
    batch = pin_memory_batch(batch)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 148, in pin_memory_batch
    return [pin_memory_batch(sample) for sample in batch]
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 142, in pin_memory_batch
    return batch.pin_memory()
  File "/usr/local/lib/python2.7/dist-packages/torch/tensor.py", line 92, in pin_memory
    return type(self)().set_(storage.pin_memory()).view_as(self)
  File "/usr/local/lib/python2.7/dist-packages/torch/storage.py", line 87, in pin_memory
    return type(self)(self.size(), allocator=allocator).copy_(self)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/THCCachingHostAllocator.cpp:258
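
For reference, a rough parameter-memory estimate for VGG16 (the calculation alluded to in the question) can be done as below; torchvision's vgg16 is used here only as a stand-in for whatever VGG variant the repo actually loads:

from torchvision import models

# Count the parameters of a stock VGG16 and convert to bytes (float32 = 4 bytes each).
model = models.vgg16(pretrained=False)
num_params = sum(p.numel() for p in model.parameters())
print("parameters: %.0fM, ~%.2f GB as float32" % (num_params / 1e6, num_params * 4 / 1024 ** 3))
# ~138M parameters, roughly 0.5 GB: far below 12 GB, so the weights alone do not
# explain the out-of-memory error; activations, gradients, and pinned host buffers
# account for the rest.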

@guangzhili

@jagadeesh09 Model parameters are not the only thing occupying GPU memory; activations, gradients, and optimizer state take space too. Reducing batch_size to 16 or smaller should help.
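
For anyone unsure where that knob lives, here is a minimal DataLoader sketch with the smaller batch size; the stand-in tensors and the exact arguments are assumptions, since the real loaders are built inside the repo's dataset.py:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in the repo the datasets are constructed in dataset.py.
dummy = TensorDataset(torch.randn(64, 3, 224, 224), torch.zeros(64, dtype=torch.long))

# Lowering batch_size (e.g. from 32 to 16) shrinks the per-batch activation and
# pinned-buffer memory, which is usually what runs out rather than the weights.
loader = DataLoader(dummy, batch_size=16, shuffle=True, num_workers=4, pin_memory=True)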

@buttercutter

@guangzhili

When I change the two batch_size values in dataset.py from 32 to 16, I get the following error. Why?

[phung@archlinux pytorch-pruning]$ python finetune.py --train
/usr/lib/python3.7/site-packages/torchvision/transforms/transforms.py:187: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
  warnings.warn("The use of the transforms.Scale transform is deprecated, " +
/usr/lib/python3.7/site-packages/torchvision/transforms/transforms.py:562: UserWarning: The use of the transforms.RandomSizedCrop transform is deprecated, please use transforms.RandomResizedCrop instead.
  warnings.warn("The use of the transforms.RandomSizedCrop transform is deprecated, " +
Epoch:  0
Accuracy:  0.3398
Epoch:  1
Accuracy:  0.8265
Epoch:  2
Accuracy:  0.6071
Epoch:  3
Accuracy:  0.63
Epoch:  4
Accuracy:  0.5951
Epoch:  5
Accuracy:  0.5837
Epoch:  6
Accuracy:  0.5537
Epoch:  7
Accuracy:  0.5672
Epoch:  8
Accuracy:  0.506
Epoch:  9
Accuracy:  0.5962
Epoch:  10
Accuracy:  0.6039
Epoch:  11
Accuracy:  0.5436
Epoch:  12
Accuracy:  0.6215
Epoch:  13
Accuracy:  0.5622
Epoch:  14
Accuracy:  0.5872
Epoch:  15
Accuracy:  0.5969
Epoch:  16
Accuracy:  0.5741
Epoch:  17
Accuracy:  0.5725
Epoch:  18
Accuracy:  0.6213
Epoch:  19
Accuracy:  0.6483
Finished fine tuning.
[phung@archlinux pytorch-pruning]$ python finetune.py --prune
/usr/lib/python3.7/site-packages/torchvision/transforms/transforms.py:187: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
  warnings.warn("The use of the transforms.Scale transform is deprecated, " +
/usr/lib/python3.7/site-packages/torchvision/transforms/transforms.py:562: UserWarning: The use of the transforms.RandomSizedCrop transform is deprecated, please use transforms.RandomResizedCrop instead.
  warnings.warn("The use of the transforms.RandomSizedCrop transform is deprecated, " +
Accuracy:  0.6483
Number of prunning iterations to reduce 67% filters 5
Ranking filters.. 
Traceback (most recent call last):
  File "finetune.py", line 270, in <module>
    fine_tuner.prune()
  File "finetune.py", line 217, in prune
    prune_targets = self.get_candidates_to_prune(num_filters_to_prune_per_iteration)
  File "finetune.py", line 186, in get_candidates_to_prune
    self.prunner.normalize_ranks_per_layer()
  File "finetune.py", line 101, in normalize_ranks_per_layer
    v = v / np.sqrt(torch.sum(v * v))
  File "/usr/lib/python3.7/site-packages/torch/tensor.py", line 432, in __array__
    return self.numpy()
TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
[phung@archlinux pytorch-pruning]$ 

@buttercutter

I have already installed the latest PyTorch, which was supposed to have fixed this tensor.cpu() problem.

I am using https://www.archlinux.org/packages/community/x86_64/python-pytorch-cuda/

So what is actually still triggering this tensor.cpu() issue?

@nguyenbh1507

v = v / np.sqrt(torch.sum(v * v))

Replace np.sqrt(torch.sum(v * v)) with v.norm().

It worked for me. I think np.sqrt() implicitly converts the tensor to a NumPy array, which only works for CPU tensors, not CUDA ones.
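
A minimal sketch of that change, with a stand-in tensor in place of the actual per-layer rank tensor from the pruner:

import torch

# Stand-in for a per-layer rank tensor; lives on the GPU when one is available.
v = torch.rand(64, device="cuda" if torch.cuda.is_available() else "cpu")

# Before: np.sqrt() calls Tensor.__array__(), which tries .numpy() and fails for CUDA tensors.
# v = v / np.sqrt(torch.sum(v * v))

# After: stay in torch so the computation runs on whatever device v lives on.
v = v / torch.sqrt(torch.sum(v * v))   # or equivalently: v = v / v.norm()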

@nguyenbh1507

I observed that the out-of-memory error still occurs even after changing the batch size to 16. The first pruning round was fine, but the second was not. I think we should delete the previous, now unused, model on the GPU to free memory before allocating the new one.
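
A sketch of that idea between pruning rounds; the stand-in VGG16 plays the role of whatever model the previous iteration left behind, and note that torch.cuda.empty_cache() only releases cached blocks that are no longer referenced anywhere:

import gc
import torch
from torchvision import models

model = models.vgg16()            # stand-in for the model left over from the previous round
if torch.cuda.is_available():
    model = model.cuda()

# Drop the last Python reference, then hand the cached blocks back to the CUDA
# allocator so the next allocation does not compete with the stale model's memory.
del model
gc.collect()
torch.cuda.empty_cache()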

@ChaoLi977

I met a similar issue and solved it by setting pin_memory=False.
https://discuss.pytorch.org/t/using-pined-memory-causes-out-of-memory-error-even-though-batch-size-is-set-to-low-values/30602
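
The setting lives on the DataLoader; a minimal sketch of turning it off is below, with a stand-in dataset in place of whatever dataset.py actually builds:

import torch
from torch.utils.data import DataLoader, TensorDataset

dummy = TensorDataset(torch.randn(64, 3, 224, 224), torch.zeros(64, dtype=torch.long))

# pin_memory=True stages every batch in page-locked host memory, and the original
# traceback fails inside exactly that step (THCCachingHostAllocator). Turning it off
# trades some host-to-device copy speed for lower memory pressure.
loader = DataLoader(dummy, batch_size=16, shuffle=True, num_workers=4, pin_memory=False)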

@akbarali2019

I met a similar issue and solved it by setting pin_memory=False. https://discuss.pytorch.org/t/using-pined-memory-causes-out-of-memory-error-even-though-batch-size-is-set-to-low-values/30602

Could you clarify where pin_memory is set? How can I change it to False?
