potential bug: reference leak when torch is loaded and a function errors #666

Open
ncullen93 opened this issue May 26, 2024 · 9 comments

@ncullen93
Member

When torch is loaded and a function errors, a reference leak is reported when the IPython console is exited. This does not seem to happen in a normal Python console, and only happens when torch is loaded. Very weird. I've seen this a few other times but could never reproduce it, so I'm unsure whether it's the same issue or whether there's a general reference leak.

import ants
import torch
img = ants.image_read(ants.get_data('r16'))
img.crop_indices([0,0],[10,1000])

Then exit() and you get this:

nanobind: leaked 1 instances!
 - leaked instance 0x13362a208 of type "AntsImageF2"
nanobind: leaked 1 types!
 - leaked type "ants.lib.AntsImageF2"
nanobind: leaked 1 functions!
 - leaked function "cropImage"
nanobind: this is likely caused by a reference counting issue in the binding code.
@ncullen93
Member Author

ncullen93 commented May 26, 2024

Seems to be something with pytorch and third-party modules generally - e.g. python/cpython#98253. Probably not a big issue but still worrying.

See also nanobind FAQ on the issue https://nanobind.readthedocs.io/en/latest/faq.html#why-am-i-getting-errors-about-leaked-functions-and-types

@dipterix

dipterix commented Nov 8, 2024

I got the same issue, and memory keeps building up until the program is shut down. I found a previous issue, #117, and it seems that its example code also leaks memory in 0.5.4.

antsRegistration -d 2 -r [0x13fc74ce8,0x13fc74d08,1] -m mattes[0x13fc74ce8,0x13fc74d08,1,32,regular,0.2] -t Affine[0.25] -c 2100x1200x1200x0 -s 3x2x1x0 -f 4x2x2x1 -x [NA,NA] -m mattes[0x13fc74ce8,0x13fc74d08,1,32] -t SyN[0.200000,3.000000,0.000000] -c [40x20x0,1e-7,8] -s 2x1x0 -f 4x2x1 -u 1 -z 1 -o [/var/folders/bs/n0q8wqv931g89ppshhgp2m2m0000gn/T//RtmpbpYhz2/filea3bb284caaea,0x13fc74ca8,0x13fc74cc8] -x [NA,NA] --float 1 --write-composite-transform 0 -v 1
nanobind: leaked 9 instances!
 - leaked instance 0x13fc74a08 of type "ants.lib.AntsImageF2"
 - leaked instance 0x13fc749c8 of type "ants.lib.AntsTransformF22"
 - leaked instance 0x13fc74be8 of type "ants.lib.AntsImageF2"
 - leaked instance 0x13fc747a8 of type "ants.lib.AntsImageF2"
 - leaked instance 0x13fc74828 of type "ants.lib.AntsImageF2"
 - leaked instance 0x13fc74ae8 of type "ants.lib.AntsImageF2"
 - leaked instance 0x13fc74a68 of type "ants.lib.AntsTransformF22"
 - leaked instance 0x13fc747c8 of type "ants.lib.AntsImageF2"
 - leaked instance 0x13dd27828 of type "ants.lib.AntsTransformF33"
nanobind: leaked 3 types!
 - leaked type "ants.lib.AntsImageF2"
 - leaked type "ants.lib.AntsTransformF33"
 - leaked type "ants.lib.AntsTransformF22"
nanobind: this is likely caused by a reference counting issue in the binding code.
>>> ants.__version__
'0.5.4'

@cookpa
Member

cookpa commented Nov 8, 2024

> Got the same issue and memory keeps building up until shutting down the program. Found a previous issue #117, and it seems that their example code also has memory leaks in 0.5.4.

Can you please post a reproducible example?

@cookpa
Member

cookpa commented Nov 8, 2024

There was also a report of something similar in #678 - the user appears to have deleted their comments but it was something about running registrations in a loop. I tried reporting memory usage in a loop but it appeared to increase a lot then only very slightly after the first iteration. Hard to tell if it was a small leak or just background optimization processes. Or maybe Python was not reporting memory correctly.

Will definitely investigate further if there's an example, ideally using built-in ANTsPy data or, failing that, some other public data.
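
The kind of measurement described above (memory jumping on the first iteration, then creeping up only slightly) could be sketched roughly as follows. This is a minimal sketch, not the script actually used in this thread: the helper names `peak_rss_mib`, `sample_memory`, and `growth_after_warmup` are hypothetical, and it assumes a Unix system where the standard-library `resource` module is available.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def sample_memory(fn, iterations, every=1):
    """Call fn() repeatedly and record peak RSS every `every` iterations."""
    samples = []
    for i in range(iterations):
        fn()
        if i % every == 0:
            samples.append(peak_rss_mib())
    return samples


def growth_after_warmup(samples, warmup=1):
    """Average change per sample once the first `warmup` samples are dropped."""
    steady = samples[warmup:]
    if len(steady) < 2:
        return 0.0
    return (steady[-1] - steady[0]) / (len(steady) - 1)
```

Passing a callable that wraps the operation under suspicion (for example a registration or `ants.image_similarity` call on `ants.get_data('r16')`) and dropping the first sample separates one-time allocations from steady growth. Because this tracks peak rather than current usage, it can only show growth, never memory being returned.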

@cookpa
Member

cookpa commented Nov 13, 2024

I can't reproduce the torch warnings on my Intel Mac:

Python 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 15:55:29) [Clang 17.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ants
>>> import torch
>>> img = ants.image_read(ants.get_data('r16'))
>>> img.crop_indices([0,0],[10,1000])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniforge/base/envs/antspy_torch/lib/python3.12/site-packages/ants/decorators.py", line 7, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Caskroom/miniforge/base/envs/antspy_torch/lib/python3.12/site-packages/ants/ops/crop_image.py", line 110, in crop_indices
    itkimage = libfn(image.pointer, image.pointer, 1, 2, lowerind, upperind)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: /Users/runner/work/ANTsPy/ANTsPy/itksource/Modules/Core/Common/src/itkDataObject.cxx:367:
Requested region is (at least partially) outside the largest possible region.
>>> exit()

torch == 2.2.2
antspyx == 0.5.4

@cookpa
Member

cookpa commented Nov 13, 2024

Leaving this thread for the torch issue. For general memory leak problems, please see #733.

@dipterix

For some reason, when I run in bare Python in a terminal, it doesn't complain about the memory leaks. However, the memory leak still persists. For example, the following code keeps using more memory: Activity Monitor shows 7 GB of RAM usage after a few hundred iterations. However, the diff of the memory table shows nothing, even though the growing usage is a sign of a memory leak.

My suspicion about the nanobind warning is that we use a Python IDE, and the IDE keeps a weak ref to the Python objects. If an object is not cleared on exit, nanobind raises the warning. I'll try to capture a reproducible example when I get back.

import ants
import numpy as np
from pympler.tracker import SummaryTracker

img = np.random.random((400, 400))
img_ants = ants.image_clone(ants.from_numpy(img), pixeltype='float')

tracker = SummaryTracker()

for i in range(5000):
  ants.image_similarity(img_ants, img_ants, metric_type='Correlation')
  if i % 100 == 0:
    # expect no change in memory table, but increased memory usage
    # print_diff() prints the table itself and returns None
    tracker.print_diff()
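
The idea that a lingering reference keeps an object alive after an error can be illustrated in plain Python. The sketch below is a generic illustration only and is not confirmed as the cause of the nanobind warning: `Image` and `crop` are hypothetical stand-ins for a wrapped C++ object and a failing binding call, and it relies on a strong reference (a stored traceback, which interactive shells typically keep for post-mortem debugging). It does not explain why the warning would appear only when torch is loaded.

```python
import gc
import traceback
import weakref


class Image:
    """Hypothetical stand-in for a wrapped C++ object."""


def crop(image):
    # Mimics a binding call that raises while `image` is a local variable.
    raise RuntimeError("requested region is outside the image")


def failing_call():
    image = Image()
    probe = weakref.ref(image)  # lets us observe when `image` is freed
    try:
        crop(image)
    except RuntimeError as err:
        # Keep the traceback, much as an interactive shell keeps the last
        # exception around for post-mortem debugging.
        stored_tb = err.__traceback__
    return probe, stored_tb


def demo():
    probe, stored_tb = failing_call()
    # The traceback references the frames, and the frames reference `image`.
    alive_while_stored = probe() is not None
    traceback.clear_frames(stored_tb)  # drop the frames' local variables
    del stored_tb
    gc.collect()
    alive_after_clear = probe() is not None
    return alive_while_stored, alive_after_clear
```

Here `demo()` reports that the object survives while the traceback is held and is released once the frames are cleared. If something similar happened at interpreter exit, an object referenced from a stored traceback might not be destroyed before nanobind runs its leak check.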

@dipterix

> Leaving this thread for the torch issue. For general memory leak problems, please see #733.

Oh, I just saw your message. Thanks for the reference. I'll dig more into this (indeed we need a reproducible example, but this is tricky).

@cookpa
Member

cookpa commented Nov 13, 2024

Thanks @dipterix, this tracks with what I found previously: Python doesn't think it's using the memory, so I think there are leftover C++ objects.
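
One way to check the "Python doesn't think it's using the memory" half of this is the standard-library `tracemalloc` module, which only sees allocations made through Python's own allocators. The helper below is a minimal sketch with a hypothetical name, `python_heap_growth`. If it reports almost no growth while a process-level tool such as Activity Monitor shows steady growth, that would be consistent with (though not proof of) memory being held on the C++ side.

```python
import tracemalloc


def python_heap_growth(fn, iterations):
    """Bytes of Python-tracked memory still allocated after calling fn() repeatedly."""
    tracemalloc.start()
    try:
        fn()  # warm-up call so one-time caches are not counted
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            fn()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before
```

A Python-level leak (for example, appending results to a global list) shows up as growth roughly proportional to `iterations`, whereas a call that frees everything it allocates in Python stays near zero.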
