Replies: 7 comments 1 reply
-
This is true, and the memory use of repeatedly opening and closing the file:

```python
import gc
import psutil
import uproot

this_process = psutil.Process()

def memory_diff(task):
    gc.disable()
    gc.collect()
    start_memory = this_process.memory_full_info().uss
    task()
    gc.collect()
    stop_memory = this_process.memory_full_info().uss
    gc.enable()
    return stop_memory - start_memory

def task():
    with uproot.open(
        {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
    ) as tree:
        pass

for _ in range(200):
    print(f"{memory_diff(task) * 1e-6:.3f} MB")
```

reports per-iteration differences close to zero.
Change the `task` to

```python
def task():
    tree = uproot.open(
        {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
    )
```

so that it leaks file handles, and it's much the same.
We'd eventually run out of file handles this way, but apparently not memory (on the MB scale). Now

```python
def task():
    lazy = uproot.dask(
        {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
    )
```

grows the memory use on every call. This is a problem. (Also, it's noticeably slower, though there might be good reasons for that.) Using Pympler:

```python
>>> import gc
>>> import pympler.tracker
>>> import uproot
>>>
>>> summary_tracker = pympler.tracker.SummaryTracker()
>>>
>>> # run it once to get past the necessary first-time things (filling uproot.classes, etc.)
>>> lazy = uproot.dask(
...     {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
... )
>>> del lazy
>>> gc.collect()
0
>>> # run print_diff enough times to get to the quiescent state
>>> summary_tracker.print_diff()
...
>>> summary_tracker.print_diff()
  types |   # objects |   total size
======= | =========== | ============
>>>
>>> # what does an Uproot Dask array bring in?
>>> lazy = uproot.dask(
...     {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
... )
>>> summary_tracker.print_diff()
types | # objects | total size
================================================= | =========== | ============
dict | 72059 | 11.54 MB
bytes | 3 | 5.66 MB
list | 24007 | 1.73 MB
numpy.int64 | 37491 | 1.14 MB
uproot.source.cursor.Cursor | 21000 | 984.38 KB
numpy.ndarray | 4501 | 492.30 KB
str | 4886 | 436.10 KB
uproot.models.TObject.Model_TObject | 5999 | 281.20 KB
tuple | 3385 | 147.00 KB
uproot.models.TObjArray.Model_TObjArray | 3000 | 140.62 KB
uproot.models.TNamed.Model_TNamed | 2999 | 140.58 KB
frozenset | 1 | 128.21 KB
int | 3492 | 95.51 KB
awkward._nplikes.typetracer.TypeTracerArray | 1878 | 88.03 KB
uproot.models.TTree.Model_ROOT_3a3a_TIOFeatures | 1500 | 70.31 KB
>>>
>>> # what goes away when we delete it?
>>> del lazy
>>> gc.collect()
14
>>> gc.collect()
0
>>> summary_tracker.print_diff()
types | # objects | total size
============================================== | =========== | ============
code | 0 | 37 B
aiohttp.helpers.TimerContext | -1 | -48 B
awkward.contents.recordarray.RecordArray | -1 | -48 B
dask.highlevelgraph.HighLevelGraph | -1 | -48 B
dask_awkward.utils.LazyInputsDict | -1 | -48 B
dask.blockwise.BlockwiseDepDict | -1 | -48 B
awkward.highlevel.Array | -1 | -48 B
dask_awkward.layers.layers.AwkwardInputLayer | -1 | -48 B
dask_awkward.lib.core.Array | -1 | -48 B
asyncio.trsock.TransportSocket | -2 | -80 B
ssl.SSLObject | -2 | -96 B
aiohttp.streams.StreamReader | -2 | -96 B
asyncio.sslproto.SSLProtocol | -2 | -96 B
asyncio.sslproto._SSLPipe | -2 | -96 B
bytearray | -2 | -112 B
```

Hardly anything goes away when `lazy` is deleted.

This TTree has 1499 TBranches. So having approximately that many TIOFeatures, TypeTracerArray, twice that many Model_TNamed, Model_TObjArray (TBranch and TLeaf), and four times as many Model_TObject make sense. There are only 3 `bytes` objects, but they comprise 5.66 MB. I don't know, offhand, what they could be, but I think they're more likely Uproot than Dask. There are a lot of big dicts, which is not too surprising, and I can't say offhand whether I expect more in Uproot or more in Dask.

The one major problem is that almost none of this is released when `lazy` is deleted. Who gets a reference to it and doesn't let go? It might be possible to find out with `gc.get_referrers`.
-
Okay, setting up to follow this object with `gc.get_referrers`:

```python
>>> import uproot
>>> import gc
>>> lazy = uproot.dask(
...     {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
... )
>>> type(lazy)
<class 'dask_awkward.lib.core.Array'>
>>> type(lazy._meta)
<class 'awkward.highlevel.Array'>
```

I'll be looking at lists, and one reference will be the list I'm using to look at it, so I make that a special class that's easier to ignore in a print-out of type names.

```python
>>> class IgnoreMeList(list):
...     pass
...
>>> def show(follow):
...     print("\n".join(f"{i:2d} {type(x).__module__}.{type(x).__name__}" for i, x in enumerate(follow)))
...
```

In the Pympler output, we saw that the TypeTracerArrays were not deleted when `lazy` was, so I'll start from one of those:

```python
>>> follow = IgnoreMeList([lazy._meta.layout.content("Muon_pt").content.data])
>>> show(follow)
 0 awkward._nplikes.typetracer.TypeTracerArray
```

Now I'll just walk along the graph of its referrers, ignoring the `IgnoreMeList`:

```python
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> show(follow)
 0 __main__.IgnoreMeList
 1 builtins.dict
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[1]))
>>> show(follow)
 0 __main__.IgnoreMeList
 1 awkward.contents.numpyarray.NumpyArray
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[1]))
>>> show(follow)
 0 __main__.IgnoreMeList
 1 builtins.dict
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[1]))
>>> show(follow)
 0 __main__.IgnoreMeList
 1 awkward.contents.listoffsetarray.ListOffsetArray
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[1]))
>>> show(follow)
 0 __main__.IgnoreMeList
 1 builtins.list
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[1]))
>>> show(follow)
 0 builtins.dict
 1 __main__.IgnoreMeList
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> show(follow)
 0 awkward.contents.recordarray.RecordArray
 1 __main__.IgnoreMeList
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> show(follow)
 0 builtins.dict
 1 __main__.IgnoreMeList
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> show(follow)
 0 awkward.highlevel.Array
 1 __main__.IgnoreMeList
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> show(follow)
 0 builtins.dict
 1 __main__.IgnoreMeList
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> show(follow)
 0 dask_awkward.lib.core.Array
 1 __main__.IgnoreMeList
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> show(follow)
 0 __main__.IgnoreMeList
 1 builtins.dict
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[1]))
>>> show(follow)
 0 builtins.function
 1 builtins.dict
 2 __main__.IgnoreMeList
 3 builtins.module
```

Okay! The dict and the module are just the interactive session's own machinery:

```python
>>> follow[3]
<module '__main__' (built-in)>
>>> follow[1].keys()
dict_keys(['use_main_ns', 'namespace', 'matches'])
>>> follow[1]["use_main_ns"]
1
>>> follow[1]["matches"]
['follow']
>>> type(follow[1]["namespace"])
<class 'dict'>
>>> follow[1]["namespace"].keys()
dict_keys(['__name__', '__doc__', '__package__', '__loader__', '__spec__', '__annotations__', '__builtins__', 'uproot', 'gc', 'lazy', 'IgnoreMeList', 'show', 'follow'])
```

So what about the function?

```python
>>> follow[0]
<function show at 0x77a70b5889d0>
```

Nope. All of this is either Python's infrastructure or the infrastructure I set up in this session.

So what gives? I didn't see any other referrers along the way. Who's holding a reference to this object? If nobody is, why isn't it deleted (why is it not negative in the Pympler list) when `lazy` is?

Does anyone have any ideas?
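As a sanity check of the technique itself, `gc.get_referrers` does find a deliberately hidden reference in a toy setup (standard library only; all names here are illustrative, not from the Uproot case):

```python
import gc

class IgnoreMeList(list):
    """A wrapper list that's easy to filter out of referrer listings."""
    pass

obj = {"payload": 123}
hidden_holder = [obj]          # a reference we pretend not to know about

follow = IgnoreMeList([obj])
# filter out our own viewing apparatus, as in the session above
referrers = [r for r in gc.get_referrers(follow[0]) if not isinstance(r, IgnoreMeList)]

# the hidden holder shows up among the referrers
print(any(r is hidden_holder for r in referrers))   # prints: True
```

So if a live reference to the TypeTracerArray existed as an ordinary Python object, this walk should have found it.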
-
Even more to the point, following the advice of https://stackoverflow.com/a/28406001/1623645:

```python
>>> import uproot
>>> import gc
>>> lazy = uproot.dask(
...     {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
... )
>>> class IgnoreMeList(list):
...     pass
...
>>> def show(follow):
...     print("\n".join(f"{i:2d} {type(x).__module__}.{type(x).__name__}" for i, x in enumerate(follow)))
...
>>> follow = IgnoreMeList([lazy._meta.layout.content("Muon_pt").content.data])
>>> del lazy
>>> gc.collect()
17
>>> gc.collect()
0
>>> show(follow)
 0 awkward._nplikes.typetracer.TypeTracerArray
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> gc.collect()
0
>>> show(follow)
 0 __main__.IgnoreMeList
```

The TypeTracerArray goes away. I don't know why this disagrees with Pympler (and the fact that 30 MB of USS doesn't go away).
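A third, independent measurement could help arbitrate: `tracemalloc` (standard library) records allocations at the Python allocator level rather than by walking live objects, so it can attribute memory that `gc`-based tools lose track of. A generic sketch with a stand-in allocation, not the Uproot case itself:

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

data = [bytes(10_000) for _ in range(100)]   # stand-in for a suspect ~1 MB allocation

after = tracemalloc.take_snapshot()
stats = after.compare_to(before, "lineno")
largest = max(stats, key=lambda s: s.size_diff)

# the biggest positive difference points back at the allocating line
print(largest.size_diff > 900_000)   # prints: True
tracemalloc.stop()
```

Running the `uproot.dask` loop under this would say which source line retains the missing tens of MB, even if no Python-level referrer is visible.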
-
It looks to me like this might be in Dask. Add the following to the loop body:

```python
import gc
import dask.base

dask.base.function_cache.clear()
gc.collect()
```

I notice that the total memory usage remains fairly stable.
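That behavior is consistent with a module-level cache holding the only remaining reference: `del` plus `gc.collect()` cannot free what a global dict still points to. A minimal stand-in (a plain dict in place of `dask.base.function_cache`; the names are illustrative):

```python
import gc

function_cache = {}   # stand-in for dask.base.function_cache, which is a dict

def build():
    payload = bytes(1_000_000)          # pretend this is the expensive object
    function_cache["token"] = payload   # the cache quietly keeps a reference
    return payload

lazy = build()
del lazy
gc.collect()

# the payload is still alive, reachable only through the cache
print("token" in function_cache, len(function_cache["token"]))   # prints: True 1000000

function_cache.clear()   # only after clearing can the memory be returned
gc.collect()
```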
-
If it's referenced in `dask.base.function_cache`, clearing that cache should change what Pympler sees:

```python
>>> import gc
>>> import pympler.tracker
>>> import dask.base
>>> import uproot
>>>
>>> summary_tracker = pympler.tracker.SummaryTracker()
>>>
>>> lazy = uproot.dask(
...     {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
... )
>>> del lazy
>>> gc.collect()
3
>>> gc.collect()
0
>>> summary_tracker.print_diff()
# ... several times ... #
>>> summary_tracker.print_diff()
  types |   # objects |   total size
======= | =========== | ============
>>> lazy = uproot.dask(
...     {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
... )
>>> summary_tracker.print_diff()
types | # objects | total size
============================================= | =========== | ============
dict | 72060 | 11.54 MB
bytes | 3 | 5.66 MB
list | 24007 | 1.73 MB
numpy.int64 | 37491 | 1.14 MB
uproot.source.cursor.Cursor | 21000 | 984.38 KB
numpy.ndarray | 4501 | 492.30 KB
str | 4886 | 436.10 KB
uproot.models.TObject.Model_TObject | 5999 | 281.20 KB
tuple | 3385 | 147.00 KB
uproot.models.TObjArray.Model_TObjArray | 3000 | 140.62 KB
uproot.models.TNamed.Model_TNamed | 2999 | 140.58 KB
frozenset | 1 | 128.21 KB
int | 3490 | 95.45 KB
awkward._nplikes.typetracer.TypeTracerArray | 1878 | 88.03 KB
uproot.models.TAtt.Model_TAttFill_v2 | 1500 | 70.31 KB
>>> del lazy
>>> dask.base.function_cache.clear()  # clearing Dask's function cache
>>> gc.collect()
221296
>>> gc.collect()
0
>>> summary_tracker.print_diff()
types | # objects | total size
============================================== | =========== | ============
code | 0 | 37 B
awkward.highlevel.Array | -1 | -48 B
dask.blockwise.BlockwiseDepDict | -1 | -48 B
aiohttp.helpers.TimerContext | -1 | -48 B
awkward.contents.recordarray.RecordArray | -1 | -48 B
dask_awkward.layers.layers.AwkwardInputLayer | -1 | -48 B
dask.highlevelgraph.HighLevelGraph | -1 | -48 B
dask_awkward.lib.core.Array | -1 | -48 B
dask_awkward.utils.LazyInputsDict | -1 | -48 B
asyncio.trsock.TransportSocket | -2 | -80 B
asyncio.sslproto._SSLPipe | -2 | -96 B
fsspec.caching.BytesCache | -2 | -96 B
ssl.SSLObject | -2 | -96 B
uproot._dask.TrivialFormMappingInfo | -2 | -96 B
uproot.models.TTree.Model_TTree_v20 | -2 | -96 B
```

It's still the case that creating `lazy` brings in tens of MB of objects that Pympler never reports as freed. Oh, but the total USS memory usage does go down:

```python
import gc
import psutil
import dask.base
import uproot

this_process = psutil.Process()

def memory_diff(task):
    gc.disable()
    gc.collect()
    start_memory = this_process.memory_full_info().uss
    task()
    gc.collect()
    stop_memory = this_process.memory_full_info().uss
    gc.enable()
    return stop_memory - start_memory

def task():
    lazy = uproot.dask(
        {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
    )
    del lazy
    dask.base.function_cache.clear()

for _ in range(200):
    print(f"{memory_diff(task) * 1e-6:.3f} MB")
```

results in per-iteration differences near zero, whereas removing the `dask.base.function_cache.clear()` call brings back the steady growth.

So that is what's holding all of the memory. It must be some connection that Python doesn't see; maybe it goes through a reference in an extension module? (Maybe it goes through a NumPy object array? numpy/numpy#6581) Since this is a Dask feature, what do we want to do about it? @lgray, would it be sufficient to have Coffea clear the Dask function cache?
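If Coffea does go that route, one possible shape for it is a context manager that guarantees the cache gets cleared even if the wrapped work raises (a hypothetical helper, not an existing Coffea API; `cache` would be `dask.base.function_cache` in practice, shown here with a stand-in dict):

```python
import contextlib
import gc

@contextlib.contextmanager
def clearing(cache):
    """Run a block, then clear the given cache and collect garbage."""
    try:
        yield cache
    finally:
        cache.clear()
        gc.collect()

# usage with a stand-in dict:
stand_in = {}
with clearing(stand_in) as c:
    c["token"] = bytes(1000)   # simulate Dask populating its function cache

print(len(stand_in))   # prints: 0
```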
-
That'll certainly fix it for coffea. One thing I noticed in `function_cache` is that it's holding the uncompressed data as raw `bytes`.

That's probably why we don't see a connection to it through `gc.get_referrers`: what the cache retains is a serialized copy, not ordinary Python references to the original objects.
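That would explain the referrer dead end: once an object is serialized into the cache, the original types no longer exist as Python objects at all, only opaque `bytes` do. A toy illustration (hypothetical `Payload` class and a plain dict standing in for the cache):

```python
import gc
import pickle

class Payload:
    """Stand-in for the objects that appear to leak."""
    def __init__(self):
        self.data = list(range(1000))

cache = {}
obj = Payload()
cache["token"] = pickle.dumps(obj)   # the cache keeps bytes, not the object
del obj
gc.collect()

# no Payload instance survives, so a referrer walk from one can lead nowhere,
# yet the memory lives on inside the bytes object
print(any(isinstance(o, Payload) for o in gc.get_objects()))   # prints: False
print(type(cache["token"]).__name__)                           # prints: bytes
```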
-
That's it then! That's why it costs memory, but can't be seen as objects of the expected types.

So, in the end, the recommendation for everyone is to check their Dask function cache. I'll convert this into a Discussion as a way for others to find this conclusion.

Actually, it could (possibly) be fixed in Uproot by replacing the TTree metadata data structure with a streamlined data structure containing only that which is necessary to fetch arrays. Mostly, that's the basket seek, byte, and entry arrays:

```python
>>> import sys
>>> import uproot
>>> tree = uproot.open(
...     {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
... )
>>> minimal = {}
>>> for k, v in tree.items():
...     minimal[k, "seek"] = v.member("fBasketSeek")[:v.num_baskets]
...     minimal[k, "bytes"] = v.member("fBasketBytes")[:v.num_baskets]
...     minimal[k, "entry"] = v.member("fBasketEntry")[:v.num_baskets + 1]
...
>>> sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in minimal.items()) / 1024**2
0.7204971313476562
```

i.e. something like 0.7 MiB for this file, but larger if it had more baskets. It's likely that I'm forgetting some other essential metadata, which would bring this figure up.
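A back-of-envelope cross-check of that figure: with 1499 branches and three small arrays per branch, an assumed ballpark of ~160 bytes of key/value overhead per entry (an assumption for illustration, not a measurement) already lands in the right range:

```python
# rough arithmetic: branches x entries-per-branch x assumed bytes-per-entry
n_branches = 1499
entries_per_branch = 3            # "seek", "bytes", "entry"
approx_bytes_per_entry = 160      # assumption: tuple key + small array container

estimate_mib = n_branches * entries_per_branch * approx_bytes_per_entry / 1024**2
print(f"{estimate_mib:.2f} MiB")  # prints: 0.69 MiB, close to the measured 0.72 MiB
```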
-
reproducer:
This particular instance leaks ~30 MB per open. This adds up very quickly if you need to extract the form of hundreds of files in a remote process, as evident from scikit-hep/coffea#1007, where this bug manifested pretty nastily.