Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xml2json extremely slow #1882

Closed
1 task done
cburgard opened this issue Jun 10, 2022 · 13 comments
Closed
1 task done

xml2json extremely slow #1882

cburgard opened this issue Jun 10, 2022 · 13 comments
Labels
bug Something isn't working needs-triage Needs a maintainer to categorize and assign

Comments

@cburgard
Copy link

Summary

When trying to convert some XMLs to pyhf JSON, I found that the runtime was unacceptably slow (order of hours).

OS / Environment

NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling
ANSI_COLOR="38;2;23;147;209"
HOME_URL="https://archlinux.org/"
DOCUMENTATION_URL="https://wiki.archlinux.org/"
SUPPORT_URL="https://bbs.archlinux.org/"
BUG_REPORT_URL="https://bugs.archlinux.org/"
LOGO=archlinux-logo

Steps to Reproduce

The XMLs are ATLAS internal, but I'm happy to share them with ATLAS members of the dev team on request.

File Upload (optional)

No response

Expected Results

I was hoping it would finish relatively fast.

Actual Results

It took hours.

pyhf Version

pyhf, version 0.6.3
uproot 4.2.2

Code of Conduct

  • I agree to follow the Code of Conduct
@cburgard cburgard added bug Something isn't working needs-triage Needs a maintainer to categorize and assign labels Jun 10, 2022
@matthewfeickert
Copy link
Member

Thanks for the report @cburgard. Can you please share the files as a private GitHub Gist with the dev team?

@cburgard
Copy link
Author

The files are too large for a gist, but I've shared them with you privately on CERNbox. Please distribute them to the team.

@matthewfeickert
Copy link
Member

matthewfeickert commented Jun 10, 2022

@cburgard Please first verify that you have read through the translations docs and tell us how the workspaces were created. Then please tell us the exact commands that you are using.

It is important to understand how you created them, because the files you've provided as is are not reproducible given that you have hard paths throughout them like /home/cburgard/Physics/statistics/pyfittools/hww-xml/histograms.root.

@cburgard
Copy link
Author

cburgard commented Jun 11, 2022

The workspaces were created with SFramwork, specifically with this function: https://gitlab.cern.ch/atlas-caf/CAFCore/-/blob/b1d40c29b0b4635c28d563db1483317d4a8b25d5/SFramework/Root/TSModelFactory.cxx#L176

The exact command I was using is

pyhf xml2json hww-xml/HWWRun2GGF.xml

@matthewfeickert
Copy link
Member

matthewfeickert commented Jun 14, 2022

The workspaces were created with SFramwork, specifically with this function: https://gitlab.cern.ch/atlas-caf/CAFCore/-/blob/b1d40c29b0b4635c28d563db1483317d4a8b25d5/SFramework/Root/TSModelFactory.cxx#L176

@cburgard Okay, well as you'll note we have no translation recipe for SFramework in the docs, so we can't make any statements of support for SFramework until we have one (maybe you can suggest one that we can iterate on and add to the docs?). If SFramework is only capable of hardcoding full paths into the XML then to do any debugging we would need the original files. Though I'm assuming that this is just a choice that was enabled somewhere in SFramework? Can you produce the workspaces with relative paths? Otherwise this would mean the workspaces would be unusable anywhere but on the machine where they were originally produced.

@cburgard
Copy link
Author

SFramework doesn't really use these XMLs, they are just produced as a debugging tool.
SFramework mostly relies on the C++ interface of HistFactory.

If you'd be aware of a simple way to instruct the C++ implementation of HistFactory to output xmls with relative rather than absolute paths, I'd be happy to reproduce the xmls with that setting, otherwise, sed will have to suffice I fear.

@matthewfeickert
Copy link
Member

matthewfeickert commented Jun 17, 2022

If you'd be aware of a simple way to instruct the C++ implementation of HistFactory to output xmls with relative rather than absolute paths, I'd be happy to reproduce the xmls with that setting, otherwise, sed will have to suffice I fear.

Thanks for the info as this helps move things forward. Manually changing paths isn't a reproducible workflow, and so can't be considered seriously. 😬

Perhaps we he's back from holiday we can ask @kratsg if he has any insight onto how HistFitter is able to do things unless @longjon929 is able to comment on that sooner.

@matthewfeickert
Copy link
Member

matthewfeickert commented Jul 1, 2022

@cburgard great news (I hope)! @kratsg is adding a feature in PR #1909 that is able to automatically deal with hardcoded absolute paths and when he does the conversion on an old 2011 macbook running macOS 10.11 (his current MBP is in for repairs 😢) he gets the following

$ time pyhf xml2json --hide-progress -v $PWD:/home/cburgard/Physics/statistics/pyfittools/ --output-file test.json hww-xml/HWWRun2GGF.xml

real	1m7.212s
user	1m6.320s
sys	0m1.180s

so hopefully you'll get things even faster than that on your local machine.

@longjon929
Copy link

Hi @matthewfeickert , sorry for being slow and I imagine this isn't very useful. It looks like HistFitter writes out the XML by hand making a top level file from the fitconfig here and then each channel is looped over and dumps a file here. I imagine HF was developed with this structure in mind from HistFactory, which makes it straightforward to write out the xml.

@alexander-held
Copy link
Member

PrintXML allows using relative paths too. The downside of using relative paths there is that you (currently) need to call pyhf xml2json from the same place where the relative paths originate from, which is not necessarily the place where the top-level xml is located.

@kratsg
Copy link
Contributor

kratsg commented Jul 1, 2022

PrintXML allows using relative paths too. The downside of using relative paths there is that you (currently) need to call pyhf xml2json from the same place where the relative paths originate from, which is not necessarily the place where the top-level xml is located.

This will be better with #1909 but you can already use --basedir to change the base of the relative paths, by prefixing them, so it doesn't necessarily need to be in the same place.

@matthewfeickert
Copy link
Member

Thanks to some fantastic work by @kratsg, PR #1909 is in master now. So I believe that this should be resolved if people install from the development HEAD. I haven't actually taken the time to look at this particular Issue with PR #1909 in yet, but I can try to later this week.

@kratsg kratsg closed this as completed Sep 8, 2022
@matthewfeickert
Copy link
Member

matthewfeickert commented Sep 22, 2022

A final note on this, as I write up the release notes in PR #1705:

$ ls
hww-xml
$ time pyhf xml2json \
    --hide-progress \
    --mount $PWD:/home/cburgard/Physics/statistics/pyfittools/ \
    --output-file workspace.json \
    hww-xml/HWWRun2GGF.xml
pyhf xml2json --hide-progress --mount  --output-file workspace.json   14.71s user 2.33s system 119% cpu 14.269 total

👍 Great work by @kratsg! 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs-triage Needs a maintainer to categorize and assign
Projects
None yet
Development

No branches or pull requests

5 participants