
Issue with reading files through coffea-casa xcache for large-scale analysis #420

Open
oshadura opened this issue Dec 12, 2024 · 10 comments

@oshadura
Member

oshadura commented Dec 12, 2024

A user reported an issue with running their analysis on coffea-casa with a large number of data samples:

OSError: File did not vector_read properly: [ERROR] Operation expired

The reproducer is https://github.com/sihyunjeon/test_coffea-casa

What it does is:

  1. It reads the sample ROOT file names from a JSON file;
  2. Substitutes the "xcache" prefix for the full xrootd link (sketched at the end of this description);
  3. Runs the "Processor", which just takes AK8 jets and dumps their mass.

When it runs on all files given in the JSON (~500 files), it fails with the error message you see at the very end of the ipynb file (the vector-read error).

If you uncomment "# break # FIXME" in In[4], it will run on only 3 files, and the notebook then runs without issues.

The problem is that xcache has limited capacity (understandably), which then causes issues for a full-scale analysis.

If I run on N different physics processes and they are cached, the (N+1)-th physics process crashes with the vector-read problem (and I think this is mainly due to a connection issue). If I try to make the (N+1)-th physics process work by processing only that one, then some other physics process that previously worked stops working and hits the vector-read problem, and so on...
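
For reference, the xcache substitution in step 2 boils down to rewriting each xrootd URL so that its /store/... path is served through the local cache. A minimal sketch of that pattern (the JSON layout, file names, and xcache endpoint below are illustrative assumptions, not copied from the reproducer):

    import json

    # Assumed layout: {"dataset_name": ["root://some-redirector//store/.../file.root", ...]}
    with open("samples.json") as f:
        samples = json.load(f)

    XCACHE_PREFIX = "root://xcache/"  # assumed coffea-casa cache endpoint

    fileset = {}
    for dataset_name, urls in samples.items():
        # Keep only the /store/... part of each URL and serve it through xcache,
        # using the {path: treename} mapping expected by coffea's dataset_tools.
        files = {XCACHE_PREFIX + url[url.index("/store/"):]: "Events" for url in urls}
        fileset[dataset_name] = {"files": files}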
@oshadura
Member Author

cc: @sihyunjeon

@oshadura
Member Author

@sihyunjeon Can you try running your test again? I tried your reproducer and it works for me.

@sihyunjeon

Well... the problem is that once it's CACHED and works well, somebody else cannot reproduce it. One has to just dump a large dataset in one go and observe. :( I'll see what else I can find.

@oshadura
Member Author

Ok, I got it :)

How big should the dataset be?

@sihyunjeon

Roughly the JSON file size I gave as an example was not working, if you look at the ipynb file (but now it seems like it does). I'll randomly pick large datasets, see if things don't work, and then report back.

Maybe a question I can ask is: is there a safer way to execute the ipynb lines?

    dataset_runnable, dataset_updated = preprocess(
        dataset,
        align_clusters=False,
        step_size=500_000,
        files_per_batch=1,
        skip_bad_files=True,
        save_form=False,
    )
    to_compute = apply_to_fileset(
        Processor(True),
        max_chunks(dataset_runnable, 10000),
        schemaclass=BaseSchema,
    )

I took some parameters rather randomly; I'm not sure if there is a better setting to avoid such problems.

@oshadura
Member Author

> Roughly the JSON file size I gave as an example was not working, if you look at the ipynb file (but now it seems like it does). I'll randomly pick large datasets, see if things don't work, and then report back.

Many thanks for all your efforts!

> Maybe a question I can ask is: is there a safer way to execute the ipynb lines?
>
>     dataset_runnable, dataset_updated = preprocess(
>         dataset,
>         align_clusters=False,
>         step_size=500_000,
>         files_per_batch=1,
>         skip_bad_files=True,
>         save_form=False,
>     )
>     to_compute = apply_to_fileset(
>         Processor(True),
>         max_chunks(dataset_runnable, 10000),
>         schemaclass=BaseSchema,
>     )
>
> I took some parameters rather randomly; I'm not sure if there is a better setting to avoid such problems.

@ikrommyd do you have a suggestion about the best parameter settings for NanoAODs for preprocessing?

@ikrommyd

ikrommyd commented Dec 13, 2024

> Maybe a question I can ask is: is there a safer way to execute the ipynb lines?
>
> I took some parameters rather randomly; I'm not sure if there is a better setting to avoid such problems.
>
> @ikrommyd do you have a suggestion about the best parameter settings for NanoAODs for preprocessing?

For an actual analysis, step_size=100_000 or 200_000 is typically good. 500_000 is also fine, and I use it often because it makes the computation faster; it may only be a problem if other "things" in your computation are using a lot of memory. I don't see the point of max_chunks(dataset_runnable, 10000) here, though. It is supposed to "Modify the input dataset so that only the first 'maxchunks' chunks of each file will be processed", so it will keep the first 10000 chunks of each file, which is definitely going to be the whole file.
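
In other words, a minimal end-to-end sketch of the safer settings (imports included; `dataset` and `Processor` are the ones from the notebook, and the exact step_size is only the suggestion above, not a guarantee against the xcache timeouts):

    import dask
    from coffea.dataset_tools import preprocess, apply_to_fileset
    from coffea.nanoevents import BaseSchema

    # Smaller steps mean smaller vector reads per partition; max_chunks is dropped
    # entirely, since keeping "the first 10000 chunks" of each file is a no-op.
    dataset_runnable, dataset_updated = preprocess(
        dataset,
        align_clusters=False,
        step_size=200_000,
        files_per_batch=1,
        skip_bad_files=True,
        save_form=False,
    )
    to_compute = apply_to_fileset(
        Processor(True),
        dataset_runnable,
        schemaclass=BaseSchema,
    )
    (out,) = dask.compute(to_compute)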

@sihyunjeon

sihyunjeon commented Dec 13, 2024

@ikrommyd hey, thanks. Actually, can you clarify what step_size and max_chunks mean? Is step_size like the number of events, and a "chunk" one of the pieces a dataset is split into?

Like 100K events split into 10 chunks -> 10K events per chunk, and if max_chunks(dataset_runnable, 1) is given, will it run on only the first 10K events and finish?

@ikrommyd

ikrommyd commented Dec 13, 2024

> @ikrommyd hey, thanks. Actually, can you clarify what step_size and max_chunks mean? Is step_size like the number of events, and a "chunk" one of the pieces a dataset is split into?
>
> Like 100K events split into 10 chunks -> 10K events per chunk, and if max_chunks(dataset_runnable, 1) is given, will it run on only the first 10K events and finish?

Yeah, so step_size will split each file of your fileset into chunks that are processed as separate partitions of roughly step_size events; documentation is in the docs, and the source is here: https://github.com/scikit-hep/coffea/blob/587f288e98a368ede2fb92468f737bdb31a1545f/src/coffea/dataset_tools/preprocess.py#L253

max_chunks will edit the fileset and keep only the first N chunks of each file; source here: https://github.com/scikit-hep/coffea/blob/587f288e98a368ede2fb92468f737bdb31a1545f/src/coffea/dataset_tools/manipulations.py#L12. It is useful for doing a quicker computation and avoiding processing the whole thing. There is also a max_files function, which keeps only the first N files of each dataset.
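
So with step_size=10_000, a 100k-event file becomes roughly 10 chunks, and max_chunks / max_files just slice that structure before anything is processed. A small illustrative sketch (dataset_runnable is the output of preprocess from the earlier snippet):

    from coffea.dataset_tools import max_chunks, max_files

    # Keep only the first 2 files of each dataset...
    quick_test = max_files(dataset_runnable, 2)
    # ...and only the first chunk (roughly step_size events) of each of those files,
    # which is handy for a quick test before the full-scale run.
    quick_test = max_chunks(quick_test, 1)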

@sihyunjeon

sihyunjeon commented Jan 10, 2025

Hi @oshadura

https://github.com/sihyunjeon/test_coffea-casa/blob/main/2025_01_10/Untitled1.ipynb

I've put another round of tests there. It gives me errors, every time with slightly different messages (probably depending on which sample gets read first).

OSError: File did not vector_read properly: [ERROR] Server responded with an error: [3005] Unable to readv /store/mc/Run3Summer22EENanoAODv12/DYto2L-4Jets_MLL-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_postEE_v6_ext1-v2/40000/5dcbbdb4-8eb7-41a5-a076-4f3e50954a8e.root; no route to host

is one example; the other one in the notebook is

OSError: File did not vector_read properly: [ERROR] Server responded with an error: [3005] Unable to readv /store/mc/Run3Summer22EENanoAODv12/DYto2L-4Jets_MLL-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_postEE_v6_ext1-v2/2520000/0b377ebc-89c0-487f-995a-a344255c11e3.root; too many levels of symbolic links

I checked with xrdcp that those files do exist "somewhere", but something breaks when trying to read them.
I removed the client part in the notebook.
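
(For reference, a single-file read through xcache can also be tested directly with uproot along these lines; the xcache-prefixed path format and the branch name here are assumptions, with the path taken from the first error above:)

    import uproot

    # One of the files that failed with "no route to host" above, read through xcache;
    # a small entry_stop keeps the test read cheap.
    path = ("root://xcache//store/mc/Run3Summer22EENanoAODv12/"
            "DYto2L-4Jets_MLL-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/NANOAODSIM/"
            "130X_mcRun3_2022_realistic_postEE_v6_ext1-v2/40000/"
            "5dcbbdb4-8eb7-41a5-a076-4f3e50954a8e.root")

    with uproot.open(path) as f:
        print(f["Events"].arrays(["FatJet_mass"], entry_stop=1_000))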
