Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: use fast uproot.num_entries method instead of TTree.num_entries #1197

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

pfackeldey
Copy link
Contributor

@pfackeldey pfackeldey commented Oct 23, 2024

uproot.num_entries is a fast path to get the number of events. It should always be faster than TTree.num_entries. This may become noticeable for preprocessing very large filesets.

The nanoevents factory should use uproot.num_entries aswell (at some point), in this PR I just removed the overhead of calculating it twice.

@ikrommyd
Copy link
Collaborator

@pfackeldey I was wondering because I find preprocessing slow (many minutes for large datasets), did you try it and have seen speed improvement?

@pfackeldey
Copy link
Contributor Author

pfackeldey commented Oct 23, 2024

@ikrommyd I have not measured it, because I don't have an analysis setup to do so. If you can, I'd be happy to see the difference!
I'm not sure how much one can gain here (I didn't profile the preprocess method), but calculating the number of entries should be always faster with uproot.num_entries compared than TTree.num_entries.
(I noticed slow preprocessing aswell when running the AGC - actually I might be able to measure the performance gain with the 200GBps challenge?)

@pfackeldey pfackeldey changed the title use fast uproot.num_entries method instead of TTree.num_entries perf: use fast uproot.num_entries method instead of TTree.num_entries Oct 23, 2024
@lgray
Copy link
Collaborator

lgray commented Nov 14, 2024

So the reason I used TTree rather than uproot.num_entries was to avoid multiple file opens, which can be the bottleneck in cases using xrootd or any network filesystem.

Since we're already opening the file and deserializing the TTree metadata to get the base form and greatest common basket offsets anyway on line 69 of preprocess. IIUC, this also deserializes the num_entries info for the tree. It makes very little sense to make a request to open the file again so that we can access num_entries faster without having the deserialize the full tree.

I'd be happy to see a benchmark that shows this change is faster though!

@lgray
Copy link
Collaborator

lgray commented Nov 14, 2024

That being said - if you want to make a version of this that's optimized for when all you're doing is just getting the number of entries (i.e. save_form=False and align_clusters=False, if either is true then you need to deserialize a bunch more data anyway) that uses uproot.num_entries then that would make a fair amount of sense. You'd likely need to call a different function from preprocess, or you can alter get_steps to have the optimal file-opening patterns depending on the options passed.

@pfackeldey
Copy link
Contributor Author

Hi @lgray,
that makes sense to me. I didn't see a visible improvement in a test based on the AGC.
Feel free to close this PR :)

@lgray
Copy link
Collaborator

lgray commented Nov 14, 2024

I'll leave it open if you want to try to the one optimized pathway I mentioned. If you don't then I'll close it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants