Correct way to specify lazy_cache in JEC factory? #467
-
Currently I apply JEC like this (`df` is the input NanoEvents dataset):

```python
jets = jec_factory.build(jets, lazy_cache=df.caches[0])
```

but this leads to very high memory usage.
-
The cachetools package (which is a transitive dependency of coffea) provides several options. Here is one example, using the awkward `nbytes` feature:

```python
from cachetools import LRUCache

cache = LRUCache(1_000_000, lambda a: a.nbytes)
jets = jec_factory.build(jets, lazy_cache=cache)
```

Similarly, such caches can also be used when instantiating NanoEvents:

```python
cache = LRUCache(int(1e8), lambda a: a.nbytes)
factory = NanoEventsFactory.from_root(
    filename,
    runtime_cache=cache,
    entry_stop=10000,
    metadata={"dataset": "SomeDataset"},
)
events = factory.events()
```
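To see why this bounds memory: `LRUCache(maxsize, getsizeof)` tracks the total "size" of cached items as the sum of `getsizeof(value)` over all entries, and evicts least-recently-used entries whenever that sum would exceed `maxsize`. A minimal sketch with plain `bytes` values (using `len` as the size function in place of `a.nbytes`, so it runs without awkward installed):

```python
from cachetools import LRUCache

# Cap the cache at 1 MB total; each entry's cost is its length in bytes.
cache = LRUCache(maxsize=1_000_000, getsizeof=len)

# Insert five 400 kB payloads; the cache evicts the oldest entries
# so that the running total (cache.currsize) never exceeds maxsize.
for i in range(5):
    cache[i] = b"\x00" * 400_000

print(len(cache), cache.currsize)  # only two 400 kB entries fit under 1 MB
```

With `lambda a: a.nbytes` as the size function, the same mechanism caps the total bytes of awkward arrays materialized by the lazy reader, trading recomputation for bounded memory.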
-
Okay, so I investigated this a bit more, and I came to the conclusion that the memory consumption is "real", i.e. the memory usage does not blow up uncontrollably. I was able to fit all the work into the workers' memory by reducing the chunk size from 100k to 10k events. To give some anecdotal numbers: JEC uncertainties evaluated on 10k-event chunks used ~4 GB of memory on each worker. (This is for a VBF Hµµ sample where each event contains 2 high-pT jets and possibly some soft jet activity.) The behavior was in fact similar whether I use […]. @nsmith- would you still recommend using […]?