Replies: 1 comment 1 reply
-
Hi @maxgalli,

Perhaps this does not answer your question directly, but I'm doing something similar in my setup (saving outputs directly from the processor). To create unique names for the output files I use the unique identifiers of the Dask tasks in which the processing is running:

```python
from dask.distributed import get_worker

# find the task currently executing on this worker and take the
# last 32 characters of its key as a unique name for the output file
for key, task in get_worker().tasks.items():
    if task.state == "executing":
        name = key[-32:]
```

The resulting output filenames are these 32-character task-key suffixes, and the files are stored in a directory named after the input dataset.
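For context, here is roughly how that name then gets used when writing out a chunk. The helper, the dataset name, and the output directory below are placeholders for whatever your processor actually uses, and depending on your awkward/coffea versions you may need to repack the array before writing:

```python
import os
import awkward as ak

# Hypothetical helper: write the chunk that is currently being processed
# to <outdir>/<dataset>/<task-hash>.parquet. All names here are placeholders.
def dump_chunk(events, dataset, name, outdir="output"):
    path = os.path.join(outdir, dataset)
    os.makedirs(path, exist_ok=True)
    ak.to_parquet(events, os.path.join(path, f"{name}.parquet"))
```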
I agree that accessing the name of the original file would make sense too, but I think it would not be enough when a file is split into multiple chunks: you would also need to somehow extract a chunk number.

Cheers,
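P.S. If your coffea version exposes the input file name and the entry range of the current chunk through `events.metadata`, that combination would already give a unique per-chunk name. The exact metadata keys below are an assumption on my part, so check what your Runner actually fills in:

```python
import os

# Hedged sketch: derive a unique per-chunk output name from the input
# file name plus the chunk's entry range. The "filename", "entrystart"
# and "entrystop" keys are assumptions; inspect events.metadata to confirm.
def chunk_output_name(events, suffix="_output.parquet"):
    meta = events.metadata
    stem = os.path.splitext(os.path.basename(meta["filename"]))[0]
    return f"{stem}_{meta['entrystart']}_{meta['entrystop']}{suffix}"
```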
-
Hi, I have a question regarding a use case that I wasn't able to find in the examples.
Let's say that, when running a general workflow, I'm interested in dumping the entire `NanoEventsArray` objects into parquet files rather than filling histograms. Since I'm working with many files (and many branches), I was thinking of dumping one output file for each input file directly from inside the `process` function, instead of returning the `NanoEventsArray` objects inside an accumulator, to avoid a huge increase in memory usage when the files are large. This snippet should clarify a bit what I mean:
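Something along these lines, where the processor name, the selection, and the output file name are just placeholders (in practice the array may need to be repacked before `ak.to_parquet` will accept it):

```python
import awkward as ak
from coffea import processor

class MyProcessor(processor.ProcessorABC):
    # Placeholder processor that writes each chunk straight to parquet
    # instead of returning the events through the accumulator.

    def process(self, events):
        # ... event selection on the NanoEventsArray would go here ...
        # dump the (skimmed) chunk to a parquet file with some unique name
        ak.to_parquet(events, "some_unique_name.parquet")
        # return only a small summary so the accumulator stays tiny
        return {"entries": len(events)}

    def postprocess(self, accumulator):
        return accumulator
```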
I have a couple of questions about this procedure, the main one being: is there a way to access the name of the input file from inside the `process` function? In this way I could, for instance, name the output file as `output_name = input_name + "_output"`.
Thank you,
Massimiliano