Sensitivity of step_size parameter #926
Replies: 1 comment 1 reply
Dear Jim, uprooters et al.,

I am running an uproot program which reads in .root trees, conducts an analysis, and writes trees into output .root files. The program uses `utils.make_chunk_events` from `UprootFramework`. Each of the ntuples being read contains a varying number of .root files, which amount to varying sizes per ntuple (from ~0.5 GB to ~1.5 TB, corresponding to ~1 to ~350 files per ntuple).
Recently, it has occurred to me that the program is extremely sensitive to the `step_size` parameter, both in terms of calculation speed and memory usage. For example, when running with the nominal 1.5 GB (as given in the documentation), the chunks (see above) become very large and the memory usage of the program increases dramatically (~O(20 GB)). Conversely, reducing the `step_size` to 50 MB (75 MB and 100 MB give similar results) reduces the chunk size and requires much less memory during the run. Reducing `step_size` too far, to about 2 MB, makes the chunks extremely small and the total running time quite long. Moreover, the memory usage with a ~2 MB `step_size` is not kept low during a long run: it grows to about 6 GB after roughly an hour.
It seems there is a non-linear dependence of running time and memory usage on `step_size`, and it is hard to understand what the optimal `step_size` is for a given input-ntuple size. Similarly, it seems that the memory usage keeps growing during a run, and it is not clear whether it ever reaches a steady state. To that effect, I thought setting `parallel=False` might help the program be more memory-conservative, but it is unclear whether it does; that is hard to test for a run that is anticipated to take many hours to complete.
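For what it is worth, the only cheap check I can think of is printing the peak resident memory after each chunk in a small standalone test, roughly as sketched below (again with placeholder names, and with plain `uproot.iterate` standing in for the framework wrapper):

```python
import resource

import uproot

files = ["ntuple_part1.root:nominal"]  # placeholder for one small test ntuple

for i, chunk in enumerate(uproot.iterate(files, ["jet_pt"], step_size="50 MB")):
    # Peak resident set size so far (KiB on Linux); if it keeps climbing from
    # chunk to chunk, something is holding on to previously read chunks.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"chunk {i}: {len(chunk)} entries, peak RSS ~ {peak_kib / 1024:.0f} MiB")
```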
Could someone kindly provide some information about these issues? Having read the documentation, it is very unclear how to tune `step_size` correctly (and whether `parallel=True/False` has anything to do with restricting memory consumption). Given that this program is meant to run over a very large number of datasets on a batch system, it would be very helpful to understand, and hence optimise, before committing a lot of computing resources.

Many thanks in advance.
Roy