Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make it possible to read only a subset of available collections in readers #504

Open
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

tmadlener
Copy link
Collaborator

@tmadlener tmadlener commented Oct 11, 2023

BEGINRELEASENOTES

  • Add a collsToRead argument to readEntry and readNextEntry for the ROOTFrameReader to allow limiting the collections that are actually read.
  • This requires a bump in the minimum version of ROOT for RNTuple support to 6.32 as the previous RNTuple implementation apparently cannot handle switching between reading different sets of fields.

ENDRELEASENOTES

This is an (early) proposal for introducing this functionality which would in the end address #499

  • Interface design: Do we want to have the possibility to change this frame-by-frame?
    • Alternatively would have to introduce a configuration method that takes a category name and a list of collections.
  • Should we only pass on the limited collection id table?
    • Interplay with Frame and assumptions there
  • Add tests and checks to avoid breaking things when collection names are passed that are not even available from the data
    • The way the check is implemented only names that exist in the frame will be checked. Non-existent names in collsToRead will be silently ignored.
  • Same capabilities for RNTuple reader
  • Documentation (once the rest has settled)
  • Ensure that files that have been read with a limited set of collections and are then written again, can be read without issues.
  • Needs Require C++20 and update to C++20 #698 as it also starts using std::ranges.
  • Needs Make sure RNTupleReader builds with ROOT > 6.34 #719 to pass on dev3 stacks
  • Needs Refactor the read tests partially #724
  • Includes [ci] Switch to LCG_106b for sanitizer workflows #730
  • Check whether this can be used to ignore collections with unsupported schema evolutions.
  • python bindings

@tmadlener tmadlener linked an issue Oct 11, 2023 that may be closed by this pull request
@tmadlener
Copy link
Collaborator Author

Revived this as this is likely to become an issue for moving EDM4hep forwards otherwise. I have settled on making it possible to pass the collections that should be made available on a frame-by-frame basis as that is the most flexible and is also rather simple to implement.

I still have to make sure that the idTable that is available from the constructed Frames is properly cleaned up and also add some tests that passing in unavailable collection names doesn't break the whole thing.

@tmadlener tmadlener force-pushed the select-colls-root branch 5 times, most recently from 7cc113d to 8b532ac Compare January 9, 2025 13:30
src/RNTupleReader.cc Outdated Show resolved Hide resolved
@tmadlener
Copy link
Collaborator Author

It looks like we don't strictly need to trim down the collection id table, as it is by now really only used for the lookups between hashes and collection names. I will keep it as is and leave trimming that down for now as an optimization that can also be done later.

@tmadlener tmadlener changed the title [WIP] Make it possible to read only a subset of available collections in ROOTFrameReader Make it possible to read only a subset of available collections in readers Jan 20, 2025
@tmadlener tmadlener force-pushed the select-colls-root branch 2 times, most recently from 8d37566 to 2a04ed8 Compare January 20, 2025 11:58
@tmadlener tmadlener force-pushed the select-colls-root branch 3 times, most recently from ae216f5 to c57a4e8 Compare January 30, 2025 13:28
@tmadlener
Copy link
Collaborator Author

It looks like the RNTuple implementation up to (and including) ROOT 6.30 cannot really deal with switching between different sets of Fields when reading. It's possible to read full frames and frames with only a subset of collections separately, but it is not possible to first read full frames and then frames with only a subset of collections with the same reader. I have decided to bump the minimum ROOT version to 6.32 for RNTuple support to make things work for now.

@tmadlener
Copy link
Collaborator Author

I have done some (manual) checks with EDM4hep and schema changes that lead to breaks when reading back collections, that are handled gracefully if the affected collections are ignored. I have not yet found a really good way to add automated tests for that in podio directly. They would be nice to have, but I am not sure if it is worth the effort now. I would almost be in favor of merging this rather soon, after some review, and then adding them later, resp. checking that via the EDM4hep build & tests we also do in CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow to limit the collections that are read
2 participants