Memory problems when batch classifying large directories #490

Closed · 1 task done
davidwhealey opened this issue Apr 30, 2024 · 11 comments

Labels: bug (Something isn't working), enhancement (New feature or request)
davidwhealey commented Apr 30, 2024

Search before asking

  • I have searched the Pytorch-Wildlife issues and found no similar bug report.

Description

I have a directory of about 17,000 camera trap images, probably an average of a handful of detections per image. When I try to run the batch megadetector on that directory from within a notebook, at around halfway through the batch, the machine runs out of memory (32GB).

If the high memory usage is unavoidable, a nice option would be the ability to run the detector on lists of images rather than directories; that way, large directories could be broken up more easily.
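
For illustration, something like the sketch below is what I have in mind. detect_batch is just a hypothetical placeholder for whatever batch detection call is actually used, and the directory name and chunk size are made up:

import json
from pathlib import Path

def detect_batch(image_paths):
    # Hypothetical placeholder: run whatever batch detection API you normally use
    # on this list of image paths and return its results.
    raise NotImplementedError

image_dir = Path("camera_trap_images")    # made-up directory name
all_paths = sorted(image_dir.rglob("*.jpg"))

chunk_size = 1000                         # small enough that one chunk fits in RAM
for start in range(0, len(all_paths), chunk_size):
    chunk = all_paths[start:start + chunk_size]
    results = detect_batch(chunk)         # run detection on this chunk only
    out_file = image_dir / f"detections_{start:06d}.json"
    with open(out_file, "w") as f:
        json.dump(results, f, default=str)  # persist results, then drop the references
    del results                           # free this chunk's memory before the next one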

Thanks for everything!

Use case

Large directories of images to be detected

davidwhealey added the enhancement (New feature or request) label Apr 30, 2024
zhmiao (Collaborator) commented May 2, 2024

Hello @davidwhealey, thank you so much for reporting this. We also noticed this issue on our end and already have a solution for it. We are working on integrating it into the codebase and will give you an update as soon as the new inference function is released!

zhmiao added the bug (Something isn't working) label May 2, 2024
zhmiao added a commit that referenced this issue May 8, 2024
Fixing batch detection memory issue #490
zhmiao (Collaborator) commented May 8, 2024

Hello @davidwhealey, we just pushed a new version with a fix for the batch detection memory issue. Could you try updating the package and see if it fixes your issue?

JaimyvS commented May 31, 2024

Hi @zhmiao,

Not sure if this is the same issue, but still wanted to chime in. I'm currently running 1.0.2.14 which seems to be the latest version. But I'm also running into a memory issue. I'm running batch detect on a folder of 2000 images of between 300 and 1500 KB each.

Here's the log:

13%|██████████▎ | 8/63 [38:33<4:25:05, 289.19s/it]
Traceback (most recent call last):
File "batch_detect.py", line 25, in
results = detection_model.batch_image_detection(loader)
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/PytorchWildlife/models/detection/yolov5/base_detector.py", line 136, in batch_image_detection
for batch_index, (imgs, paths, sizes) in enumerate(dataloader):
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in next
data = self._next_data()
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
raise exception
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 34, in _pin_memory_loop
data = pin_memory(data)
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in pin_memory
return [pin_memory(sample) for sample in data]
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in
return [pin_memory(sample) for sample in data]
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 50, in pin_memory
return data.pin_memory()
RuntimeError: cuda runtime error (2) : out of memory at ../aten/src/THC/THCCachingHostAllocator.cpp:280

zhmiao (Collaborator) commented Jun 5, 2024

Hello @JaimyvS, I am sorry for the late reply! We will take a look and see if we can reproduce the memory issue on our side. Your dataset is not very big, so it may have been caused by another package issue. We have an idea but need to do some testing to confirm. We will get back to you as soon as we have the results!

JaimyvS commented Jun 8, 2024

Thanks. Even with some small datasets I've been having issues. I've been getting the error:
THCudaCheck FAIL file=../aten/src/THC/THCCachingHostAllocator.cpp line=280 error=2 : out of memory
which seems the same as above, but not exactly: with this one, the inference process keeps running and then crashes after a while with the error I gave above. Hope you find something! If you need more info, I'd be happy to help.
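
For what it's worth, I can print PyTorch's allocator stats around the batch loop to see how memory grows; these are generic PyTorch calls, nothing PytorchWildlife-specific:

import torch

def log_gpu_memory(tag=""):
    # Generic PyTorch diagnostics: how much CUDA memory this process currently holds.
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**2  # live tensors, in MiB
        reserved = torch.cuda.memory_reserved() / 1024**2    # memory held by the caching allocator
        print(f"{tag}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")

log_gpu_memory("before batch")  # call this before and after each batch in the detection loop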

zhmiao (Collaborator) commented Jun 10, 2024

@JaimyvS, so this whole thing might be a numpy issue. Here are some references: #390 and jacobgil/pytorch-pruning#16

We previously had this issue with our batch loading functions, and now we realize it happens in this for loop:

for i, pred in enumerate(predictions):

If you could also help us get rid of this numpy issue, it would be greatly appreciated! Otherwise, we will try fixing it on our end as well. Thank you so much!
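
A common fix for this pattern, sketched here with generic PyTorch/numpy calls rather than the final change, is to detach each prediction to a CPU numpy copy inside the loop, so the results accumulated across batches no longer keep CUDA tensors alive:

import torch

results = []
for i, pred in enumerate(predictions):
    # Assuming pred is (or wraps) a CUDA tensor from the detector's forward pass:
    # detach it from the graph, move it to the CPU, and copy it into a plain numpy
    # array so the original GPU tensor can be freed once the batch is done.
    if torch.is_tensor(pred):
        pred = pred.detach().cpu().numpy().copy()
    results.append(pred)

del predictions               # drop the batch's tensor references
torch.cuda.empty_cache()      # optional: hand cached blocks back to the driver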

JaimyvS commented Jun 12, 2024

@zhmiao I'm not 100% sure what you'd like me to do. I've looked at the references, but the first seems to have been fixed by an update on your part. For the second one, I tried running with pin_memory=False, but this didn't work.

However, when running with a batch size of 16 instead of 32, it seems to run. Which is weird, because I've already run a ton of detections with a batch size of 32 in the past. I sometimes have a feeling that it might be due to the Microsoft Surface Book 3 that I'm running Windows Subsystem for Linux on: because the laptop's screen is detachable, it sometimes doesn't recognize the GPU in the base, and the system also throttles the GPU when it isn't connected to mains power. But I'm not sure how to test or fix this.
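
For completeness, both of those knobs are set when the DataLoader is built; roughly what I'm running now looks like this (the dataset object and the extra arguments depend on the PytorchWildlife version, so treat it as a sketch):

from torch.utils.data import DataLoader

# dataset is whatever image dataset gets passed to batch_image_detection in this
# version of PytorchWildlife; it is shown here only to illustrate the two knobs.
loader = DataLoader(
    dataset,
    batch_size=16,      # halving from 32 is what made it run for me
    shuffle=False,
    pin_memory=False,   # tried per the second reference; it didn't help on its own
    num_workers=2,
)
results = detection_model.batch_image_detection(loader)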

zhmiao (Collaborator) commented Jun 14, 2024

@JaimyvS, oh this is interesting! Does your Surface Book 3 have an NVIDIA GPU? From the spec page, the only NVIDIA GPU on the Surface Book 3 has just 6 GB of GPU memory, which is probably relatively small for a batch size of 32. You mentioned that you have successfully run a batch size of 32 in the past; were you using PytorchWildlife at that time, or MegaDetector v5? There might also be a difference in model sizes. But I think the WSL issue you mentioned is also possible.

JaimyvS commented Jun 15, 2024

@zhmiao Yeah, it has an NVIDIA GeForce GTX 1660 Ti with 6 GB of memory. I have definitely used the new PytorchWildlife library, as well as the old MegaDetector library, with a batch size of 32, but only recently ran into memory issues, maybe starting around version 1.0.2.12.
But if nothing really changed in the last few minor versions, it might just be due to my system, and in that case I'll just keep using a batch size of 16 until I have better hardware.

zhmiao (Collaborator) commented Jun 18, 2024

Hello @JaimyvS, sorry for the late response. We are in a two-week conference run and haven't had time to fully get back to this issue. I think we did make some changes in 1.0.2.14, but not in 1.0.2.12. If you had the issue before 1.0.2.14 but did not have it before 1.0.2.12, then I think it is not a code issue. But we will still see if we can reproduce the out-of-memory error on our end.

zhmiao (Collaborator) commented Oct 7, 2024

There has been no further response since June, so we are closing this issue. If the issue still persists, please feel free to reopen it! Thank you so much for your participation!

zhmiao closed this as completed Oct 7, 2024
lucas-a-meyer pushed a commit to lucas-a-meyer/CameraTraps that referenced this issue Jan 3, 2025