Memory problems when batch classifying large directories #490

Closed · 1 task done
davidwhealey opened this issue Apr 30, 2024 · 11 comments

Labels: bug (Something isn't working), enhancement (New feature or request)
davidwhealey commented Apr 30, 2024

Search before asking

  • I have searched the Pytorch-Wildlife issues and found no similar bug report.

Description

I have a directory of about 17,000 camera trap images, probably an average of a handful of detections per image. When I try to run the batch megadetector on that directory from within a notebook, at around halfway through the batch, the machine runs out of memory (32GB).

If the high memory usage is unavoidable, a nice option would be the ability to run the detector on lists of images rather than directories; that way, large directories could be broken up more easily.
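
For illustration, something like the sketch below is what I have in mind. detect_batch is just a hypothetical placeholder for whatever batch detection call is actually used, and the directory name and chunk size are made up:

import json
from pathlib import Path

def detect_batch(image_paths):
    # Hypothetical placeholder: run whatever batch detection API you normally use
    # on this list of image paths and return its results.
    raise NotImplementedError

image_dir = Path("camera_trap_images")    # made-up directory name
all_paths = sorted(image_dir.rglob("*.jpg"))

chunk_size = 1000                         # small enough that one chunk fits in RAM
for start in range(0, len(all_paths), chunk_size):
    chunk = all_paths[start:start + chunk_size]
    results = detect_batch(chunk)         # run detection on this chunk only
    out_file = image_dir / f"detections_{start:06d}.json"
    with open(out_file, "w") as f:
        json.dump(results, f, default=str)  # persist results, then drop the references
    del results                           # free this chunk's memory before the next one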

Thanks for everything!

Use case

Large directories of images to be detected

davidwhealey added the enhancement (New feature or request) label Apr 30, 2024
zhmiao (Collaborator) commented May 2, 2024

Hello @davidwhealey, thank you so much for reporting this. We also noticed this issue on our end and already have a solution for it. We are working on integrating it into the codebase and will give you an update as soon as the new inference function is released!

zhmiao added the bug (Something isn't working) label May 2, 2024
zhmiao added a commit that referenced this issue May 8, 2024
Fixing batch detection memory issue #490
zhmiao (Collaborator) commented May 8, 2024

Hello @davidwhealey, we just pushed a new version with a fix for the batch detection memory issue. Could you try updating the package and see if it fixes your issue?

JaimyvS commented May 31, 2024

Hi @zhmiao,

Not sure if this is the same issue, but still wanted to chime in. I'm currently running 1.0.2.14 which seems to be the latest version. But I'm also running into a memory issue. I'm running batch detect on a folder of 2000 images of between 300 and 1500 KB each.

Here's the log:

13%|██████████▎ | 8/63 [38:33<4:25:05, 289.19s/it]
Traceback (most recent call last):
File "batch_detect.py", line 25, in
results = detection_model.batch_image_detection(loader)
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/PytorchWildlife/models/detection/yolov5/base_detector.py", line 136, in batch_image_detection
for batch_index, (imgs, paths, sizes) in enumerate(dataloader):
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in next
data = self._next_data()
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
raise exception
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 34, in _pin_memory_loop
data = pin_memory(data)
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in pin_memory
return [pin_memory(sample) for sample in data]
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in
return [pin_memory(sample) for sample in data]
File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 50, in pin_memory
return data.pin_memory()
RuntimeError: cuda runtime error (2) : out of memory at ../aten/src/THC/THCCachingHostAllocator.cpp:280

zhmiao (Collaborator) commented Jun 5, 2024

Hello @JaimyvS, I am sorry for the late reply! We will take a look and see if we can reproduce the memory issue on our side. Your dataset is not very big, so it may have been caused by another package issue. We have an idea but need to do some testing to confirm. We will get back to you as soon as we have the results!

JaimyvS commented Jun 8, 2024

Thanks. Even with some small datasets I've been having issues. I've been getting the error:
THCudaCheck FAIL file=../aten/src/THC/THCCachingHostAllocator.cpp line=280 error=2 : out of memory
which seems the same as above, but not exactly: with this one, the inference process keeps running and then crashes after a while with the error I gave above. Hope you find something! If you need more info, I'd be happy to help.
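
For what it's worth, I can print PyTorch's allocator stats around the batch loop to see how memory grows; these are generic PyTorch calls, nothing PytorchWildlife-specific:

import torch

def log_gpu_memory(tag=""):
    # Generic PyTorch diagnostics: how much CUDA memory this process currently holds.
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**2  # live tensors, in MiB
        reserved = torch.cuda.memory_reserved() / 1024**2    # memory held by the caching allocator
        print(f"{tag}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")

log_gpu_memory("before batch")  # call this before and after each batch in the detection loop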

zhmiao (Collaborator) commented Jun 10, 2024

@JaimyvS, so this whole thing might be a numpy issue. Here are some references: #390 and jacobgil/pytorch-pruning#16

We previously had this issue with our batch loading functions, and now we realize it happens in this for loop:

for i, pred in enumerate(predictions):

If you could also help us get rid of this numpy issue, it would be greatly appreciated! Otherwise, we will try fixing it on our end as well. Thank you so much!
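
A common fix for this pattern, sketched here with generic PyTorch/numpy calls rather than the final change, is to detach each prediction to a CPU numpy copy inside the loop, so the results accumulated across batches no longer keep CUDA tensors alive:

import torch

results = []
for i, pred in enumerate(predictions):
    # Assuming pred is (or wraps) a CUDA tensor from the detector's forward pass:
    # detach it from the graph, move it to the CPU, and copy it into a plain numpy
    # array so the original GPU tensor can be freed once the batch is done.
    if torch.is_tensor(pred):
        pred = pred.detach().cpu().numpy().copy()
    results.append(pred)

del predictions               # drop the batch's tensor references
torch.cuda.empty_cache()      # optional: hand cached blocks back to the driver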

JaimyvS commented Jun 12, 2024

@zhmiao I'm not 100% sure what you'd like me to do. I've looked at the references, but the first seems to have been fixed by an update on your part. For the second one, I tried running with pin_memory=False, but this didn't work.

However, when running with a batch size of 16 instead of 32, it seems to run. Which is weird, because I've already run a ton of detections with a batch size of 32 in the past. I sometimes have a feeling that it might be due to the Microsoft Surface Book 3 that I'm running Windows Subsystem for Linux on: because the laptop's screen is detachable, it sometimes doesn't recognize the GPU in the base, and the system also throttles the GPU when it isn't connected to mains power. But I'm not sure how to test or fix this.
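
For completeness, both of those knobs are set when the DataLoader is built; roughly what I'm running now looks like this (the dataset object and the extra arguments depend on the PytorchWildlife version, so treat it as a sketch):

from torch.utils.data import DataLoader

# dataset is whatever image dataset gets passed to batch_image_detection in this
# version of PytorchWildlife; it is shown here only to illustrate the two knobs.
loader = DataLoader(
    dataset,
    batch_size=16,      # halving from 32 is what made it run for me
    shuffle=False,
    pin_memory=False,   # tried per the second reference; it didn't help on its own
    num_workers=2,
)
results = detection_model.batch_image_detection(loader)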

zhmiao (Collaborator) commented Jun 14, 2024

@JaimyvS, oh this is interesting! Does your Surface Book 3 have an NVIDIA GPU? From the spec page, the only NVIDIA GPU on the Surface Book 3 has just 6 GB of GPU memory, which is probably relatively small for a batch size of 32. You mentioned that you have successfully run a batch size of 32 in the past; were you using PytorchWildlife at that time, or MegaDetector v5? There might also be a difference in model sizes. But I think the WSL issue you mentioned is also possible.

JaimyvS commented Jun 15, 2024

@zhmiao Yeah, it has an NVIDIA GeForce GTX 1660 Ti with 6 GB of memory. I have definitely used the new PytorchWildlife library, as well as the old MegaDetector library, with a batch size of 32, but only recently ran into memory issues, maybe starting around version 1.0.2.12.
But if nothing really changed in the last few minor versions, it might just be due to my system, and in that case I'll just keep using a batch size of 16 until I have better hardware.

zhmiao (Collaborator) commented Jun 18, 2024

Hello @JaimyvS, sorry for the late response. We are in a two-week conference run and haven't had time to fully get back to this issue. I think we did make some changes in 1.0.2.14, but not in 1.0.2.12. If you had the issue before 1.0.2.14 but did not have it before 1.0.2.12, then I think it is not a code issue. But we will still see if we can reproduce the out-of-memory error on our end.

zhmiao (Collaborator) commented Oct 7, 2024

There has been no further response since June, so we are closing this issue. If the issue still persists, please feel free to reopen it! Thank you so much for your participation!

zhmiao closed this as completed Oct 7, 2024
lucas-a-meyer pushed a commit to lucas-a-meyer/CameraTraps that referenced this issue Jan 3, 2025