Correct way to specify lazy_cache in JEC factory? #467
-
Currently I apply JEC like this (`df` is the input NanoEvents dataset):

```python
jets = jec_factory.build(jets, lazy_cache=df.caches[0])
```

but this leads to very high memory usage.
-
The cachetools package (which is a transitive dependency of coffea) provides several options. Here is one example, using the awkward `nbytes` feature:

```python
from cachetools import LRUCache

cache = LRUCache(1_000_000, lambda a: a.nbytes)
jets = jec_factory.build(jets, lazy_cache=cache)
```

Similarly, such caches can also be used when instantiating NanoEvents:

```python
cache = LRUCache(int(1e8), lambda a: a.nbytes)
factory = NanoEventsFactory.from_root(
    filename,
    runtime_cache=cache,
    entry_stop=10000,
    metadata={"dataset": "SomeDataset"},
)
events = factory.events()
```
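To see why this bounds memory: `LRUCache(maxsize, getsizeof)` tracks the total "size" of cached items as the sum of `getsizeof(value)` over all entries, and evicts least-recently-used entries whenever that sum would exceed `maxsize`. A minimal sketch with plain `bytes` values (using `len` as the size function in place of `a.nbytes`, so it runs without awkward installed):

```python
from cachetools import LRUCache

# Cap the cache at 1 MB total; each entry's cost is its length in bytes.
cache = LRUCache(maxsize=1_000_000, getsizeof=len)

# Insert five 400 kB payloads; the cache evicts the oldest entries
# so that the running total (cache.currsize) never exceeds maxsize.
for i in range(5):
    cache[i] = b"\x00" * 400_000

print(len(cache), cache.currsize)  # only two 400 kB entries fit under 1 MB
```

With `lambda a: a.nbytes` as the size function, the same mechanism caps the total bytes of awkward arrays materialized by the lazy reader, trading recomputation for bounded memory.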
-
Okay, so I investigated this a bit more, and I came to the conclusion that the memory consumption is "real", i.e. the memory usage does not blow up uncontrollably. I was able to fit all the work into the workers' memory by reducing the chunk size from 100k to 10k events. To give some anecdotal numbers: JEC uncertainties evaluated on 10k-event chunks used ~4 GB of memory on each worker. (This is for a VBF Hµµ sample where each event contains 2 high-pT jets and possibly some soft jet activity.) The behavior was in fact similar whether I use […]. @nsmith- would you still recommend using […]?