Replies: 1 comment 1 reply
-
Hi @maxgalli,

Perhaps this does not answer your question directly, but I'm doing something similar in my setup (saving outputs directly from the processor). To create unique names for the output files I use the unique identifiers of the Dask tasks in which the processing is running:

```python
from dask.distributed import get_worker

# find the task currently executing on this worker and take the
# last 32 characters of its key as a unique name for the output file
for key, task in get_worker().tasks.items():
    if task.state == "executing":
        name = key[-32:]
```

The resulting output filenames are these 32-character task-key suffixes, and the files are stored in a directory named after the input dataset.
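For context, here is roughly how that name then gets used when writing out a chunk. The helper, the dataset name, and the output directory below are placeholders for whatever your processor actually uses, and depending on your awkward/coffea versions you may need to repack the array before writing:

```python
import os
import awkward as ak

# Hypothetical helper: write the chunk that is currently being processed
# to <outdir>/<dataset>/<task-hash>.parquet. All names here are placeholders.
def dump_chunk(events, dataset, name, outdir="output"):
    path = os.path.join(outdir, dataset)
    os.makedirs(path, exist_ok=True)
    ak.to_parquet(events, os.path.join(path, f"{name}.parquet"))
```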
I agree that accessing the name of the original file would make sense too, but I think it would not be enough when a file is split into multiple chunks: you would also need to somehow extract a chunk number.

Cheers,
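P.S. If your coffea version exposes the input file name and the entry range of the current chunk through `events.metadata`, that combination would already give a unique per-chunk name. The exact metadata keys below are an assumption on my part, so check what your Runner actually fills in:

```python
import os

# Hedged sketch: derive a unique per-chunk output name from the input
# file name plus the chunk's entry range. The "filename", "entrystart"
# and "entrystop" keys are assumptions; inspect events.metadata to confirm.
def chunk_output_name(events, suffix="_output.parquet"):
    meta = events.metadata
    stem = os.path.splitext(os.path.basename(meta["filename"]))[0]
    return f"{stem}_{meta['entrystart']}_{meta['entrystop']}{suffix}"
```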
-
Hi, I have a question regarding a use case that I wasn't able to find in the examples.
Let's say that, when running a general workflow, I'm interested in dumping the entire `NanoEventsArray` objects into parquet files rather than filling histograms. Since I'm working with many files (and many branches), I was thinking of dumping one output file for each input file directly from inside the `process` function, instead of returning the `NanoEventsArray` objects inside an accumulator, to avoid a huge increase in memory usage when the files are large. This snippet should clarify a bit what I mean:
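Something along these lines, where the processor name, the selection, and the output file name are just placeholders (in practice the array may need to be repacked before `ak.to_parquet` will accept it):

```python
import awkward as ak
from coffea import processor

class MyProcessor(processor.ProcessorABC):
    # Placeholder processor that writes each chunk straight to parquet
    # instead of returning the events through the accumulator.

    def process(self, events):
        # ... event selection on the NanoEventsArray would go here ...
        # dump the (skimmed) chunk to a parquet file with some unique name
        ak.to_parquet(events, "some_unique_name.parquet")
        # return only a small summary so the accumulator stays tiny
        return {"entries": len(events)}

    def postprocess(self, accumulator):
        return accumulator
```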
I have a couple of questions about this procedure, the main one being: is there a way to access the name of the input file from inside the `process` function? In this way I could, for instance, name the output file as `output_name = input_name + "_output"`.
Thank you,
Massimiliano