
Issue with reading files through coffea-casa xcache for large-scale analysis #420

Open
oshadura opened this issue Dec 12, 2024 · 10 comments

@oshadura
Member

oshadura commented Dec 12, 2024

A user reported an issue with running their analysis on coffea-casa with a large number of data samples:

OSError: File did not vector_read properly: [ERROR] Operation expired

The reproducer is https://github.com/sihyunjeon/test_coffea-casa

What it does is:

  1. It reads the sample ROOT file names from a JSON file;
  2. Substitutes the "xcache" prefix for the full xrootd link (sketched at the end of this description);
  3. Runs the "Processor", which just takes AK8 jets and dumps their mass.

When it runs on all files given in the JSON (~500 files), it fails with the error message you see at the very end of the ipynb file (the vector-read error).

If you uncomment "# break # FIXME" in In[4], it will run on only 3 files, and the notebook then runs without issues.

The problem is that xcache has limited capacity (understandably), which then causes issues for a full-scale analysis.

If I run on N different physics processes and they are cached, the (N+1)-th physics process crashes with the vector-read problem (and I think this is mainly due to a connection issue). If I try to make the (N+1)-th physics process work by processing only that one, then some other physics process that previously worked stops working and hits the vector-read problem, and so on...
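
For reference, the xcache substitution in step 2 boils down to rewriting each xrootd URL so that its /store/... path is served through the local cache. A minimal sketch of that pattern (the JSON layout, file names, and xcache endpoint below are illustrative assumptions, not copied from the reproducer):

    import json

    # Assumed layout: {"dataset_name": ["root://some-redirector//store/.../file.root", ...]}
    with open("samples.json") as f:
        samples = json.load(f)

    XCACHE_PREFIX = "root://xcache/"  # assumed coffea-casa cache endpoint

    fileset = {}
    for dataset_name, urls in samples.items():
        # Keep only the /store/... part of each URL and serve it through xcache,
        # using the {path: treename} mapping expected by coffea's dataset_tools.
        files = {XCACHE_PREFIX + url[url.index("/store/"):]: "Events" for url in urls}
        fileset[dataset_name] = {"files": files}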
@oshadura
Member Author

cc: @sihyunjeon

@oshadura
Member Author

@sihyunjeon Can you try running your test again? I tried your reproducer and it works for me.

@sihyunjeon

Well... the problem is that once it's CACHED and works well, somebody else cannot reproduce it. One has to just dump a large dataset in one go and observe. :( I'll see what else I can find.

@oshadura
Member Author

Ok, I got it :)

How big should the dataset be?

@sihyunjeon

Roughly the JSON file size I gave as an example was not working, if you look at the ipynb file (but now it seems like it does). I'll randomly pick large datasets, see if things don't work, and then report back.

Maybe a question I can ask is: is there a safer way to execute the ipynb lines?

    dataset_runnable, dataset_updated = preprocess(
        dataset,
        align_clusters=False,
        step_size=500_000,
        files_per_batch=1,
        skip_bad_files=True,
        save_form=False,
    )
    to_compute = apply_to_fileset(
        Processor(True),
        max_chunks(dataset_runnable, 10000),
        schemaclass=BaseSchema,
    )

I took some parameters rather randomly; I'm not sure if there is a better setting to avoid such problems.

@oshadura
Member Author

> Roughly the JSON file size I gave as an example was not working, if you look at the ipynb file (but now it seems like it does). I'll randomly pick large datasets, see if things don't work, and then report back.

Many thanks for all your efforts!

> Maybe a question I can ask is: is there a safer way to execute the ipynb lines?
>
>     dataset_runnable, dataset_updated = preprocess(
>         dataset,
>         align_clusters=False,
>         step_size=500_000,
>         files_per_batch=1,
>         skip_bad_files=True,
>         save_form=False,
>     )
>     to_compute = apply_to_fileset(
>         Processor(True),
>         max_chunks(dataset_runnable, 10000),
>         schemaclass=BaseSchema,
>     )
>
> I took some parameters rather randomly; I'm not sure if there is a better setting to avoid such problems.

@ikrommyd do you have a suggestion about the best parameter settings for NanoAODs for preprocessing?

@ikrommyd

ikrommyd commented Dec 13, 2024

> Maybe a question I can ask is: is there a safer way to execute the ipynb lines?
>
> I took some parameters rather randomly; I'm not sure if there is a better setting to avoid such problems.
>
> @ikrommyd do you have a suggestion about the best parameter settings for NanoAODs for preprocessing?

For an actual analysis, step_size=100_000 or 200_000 is typically good. 500_000 is also fine, and I use it often because it makes the computation faster; it may only be a problem if other "things" in your computation are using a lot of memory. I don't see the point of max_chunks(dataset_runnable, 10000) here, though. It is supposed to "Modify the input dataset so that only the first 'maxchunks' chunks of each file will be processed", so it will keep the first 10000 chunks of each file, which is definitely going to be the whole file.
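
In other words, a minimal end-to-end sketch of the safer settings (imports included; `dataset` and `Processor` are the ones from the notebook, and the exact step_size is only the suggestion above, not a guarantee against the xcache timeouts):

    import dask
    from coffea.dataset_tools import preprocess, apply_to_fileset
    from coffea.nanoevents import BaseSchema

    # Smaller steps mean smaller vector reads per partition; max_chunks is dropped
    # entirely, since keeping "the first 10000 chunks" of each file is a no-op.
    dataset_runnable, dataset_updated = preprocess(
        dataset,
        align_clusters=False,
        step_size=200_000,
        files_per_batch=1,
        skip_bad_files=True,
        save_form=False,
    )
    to_compute = apply_to_fileset(
        Processor(True),
        dataset_runnable,
        schemaclass=BaseSchema,
    )
    (out,) = dask.compute(to_compute)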

@sihyunjeon

sihyunjeon commented Dec 13, 2024

@ikrommyd hey, thanks. Actually, can you clarify what step_size and max_chunks mean? Is step_size like the number of events, and a "chunk" one of the pieces a dataset is split into?

Like 100K events split into 10 chunks -> 10K events per chunk, and if max_chunks(dataset_runnable, 1) is given, will it run on only the first 10K events and finish?

@ikrommyd

ikrommyd commented Dec 13, 2024

> @ikrommyd hey, thanks. Actually, can you clarify what step_size and max_chunks mean? Is step_size like the number of events, and a "chunk" one of the pieces a dataset is split into?
>
> Like 100K events split into 10 chunks -> 10K events per chunk, and if max_chunks(dataset_runnable, 1) is given, will it run on only the first 10K events and finish?

Yeah, so step_size will split each file of your fileset into chunks that are processed as separate partitions of roughly step_size events; documentation is in the docs, and the source is here: https://github.com/scikit-hep/coffea/blob/587f288e98a368ede2fb92468f737bdb31a1545f/src/coffea/dataset_tools/preprocess.py#L253

max_chunks will edit the fileset and keep only the first N chunks of each file; source here: https://github.com/scikit-hep/coffea/blob/587f288e98a368ede2fb92468f737bdb31a1545f/src/coffea/dataset_tools/manipulations.py#L12. It is useful for doing a quicker computation and avoiding processing the whole thing. There is also a max_files function, which keeps only the first N files of each dataset.
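
So with step_size=10_000, a 100k-event file becomes roughly 10 chunks, and max_chunks / max_files just slice that structure before anything is processed. A small illustrative sketch (dataset_runnable is the output of preprocess from the earlier snippet):

    from coffea.dataset_tools import max_chunks, max_files

    # Keep only the first 2 files of each dataset...
    quick_test = max_files(dataset_runnable, 2)
    # ...and only the first chunk (roughly step_size events) of each of those files,
    # which is handy for a quick test before the full-scale run.
    quick_test = max_chunks(quick_test, 1)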

@sihyunjeon

sihyunjeon commented Jan 10, 2025

Hi @oshadura

https://github.com/sihyunjeon/test_coffea-casa/blob/main/2025_01_10/Untitled1.ipynb

I've put another round of tests there. It gives me errors, every time with slightly different messages (probably depending on which sample gets read first).

OSError: File did not vector_read properly: [ERROR] Server responded with an error: [3005] Unable to readv /store/mc/Run3Summer22EENanoAODv12/DYto2L-4Jets_MLL-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_postEE_v6_ext1-v2/40000/5dcbbdb4-8eb7-41a5-a076-4f3e50954a8e.root; no route to host

is one example; the other one in the notebook is

OSError: File did not vector_read properly: [ERROR] Server responded with an error: [3005] Unable to readv /store/mc/Run3Summer22EENanoAODv12/DYto2L-4Jets_MLL-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_postEE_v6_ext1-v2/2520000/0b377ebc-89c0-487f-995a-a344255c11e3.root; too many levels of symbolic links

I checked with xrdcp that those files do exist "somewhere", but something breaks when trying to read them.
I removed the client part in the notebook.
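
(For reference, a single-file read through xcache can also be tested directly with uproot along these lines; the xcache-prefixed path format and the branch name here are assumptions, with the path taken from the first error above:)

    import uproot

    # One of the files that failed with "no route to host" above, read through xcache;
    # a small entry_stop keeps the test read cheap.
    path = ("root://xcache//store/mc/Run3Summer22EENanoAODv12/"
            "DYto2L-4Jets_MLL-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/NANOAODSIM/"
            "130X_mcRun3_2022_realistic_postEE_v6_ext1-v2/40000/"
            "5dcbbdb4-8eb7-41a5-a076-4f3e50954a8e.root")

    with uproot.open(path) as f:
        print(f["Events"].arrays(["FatJet_mass"], entry_stop=1_000))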
