-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add a schema for serialization #105
Conversation
d20a3c4
to
1c0064d
Compare
formats like HDF5 may limit the size of attributes to 64K. | ||
* Floating point variations could be introduced in storages, as the storage | ||
format uses a stable but different representation. | ||
* Axis `name` is only part of the metadata, and is not standardized. This is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be a very welcome addition, at least in HS3 we will rely on the name of variables in order to match data and function evaluation inputs (both standards, in principle, don't need to agree, but it would be great of course if they could).
How would Hist handle this? Just as before, taking the name from the label?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would store the names in the metadata
slot. So it would look for a name
field in the metadata dict. Same way __dict__
and boost-histogram works now. :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remind me what is lacking in support from boost-histogram? The metadata field of the axis was originally just intended to be a name. @henryiii pushed for it to be a general Python type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metadata stores the name, title, all ROOT metadata, etc. There are no requirements on it, except that it be a string-keyed dictionary. It would be nice to have an optional "name" slot which was guaranteed to be a unique string that can be used to refer to the the axis if present. One of the common themes at PyHEP.dev was a desire to avoid ever writing axis=<int>
, something which Awkward and xarray allow you to do. Hist already has this, but unless it was added elsewhere, it's probably easier to leave it unstandardized and just hope people put "name" into the metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this nearly leads to a slightly different discussion, but metadata can indeed be helpful. I would disect it into two parts:
- name identifier: it seems in general beneficial to use names to access data (i.e. pandas vs numpy). I also see though why bh may choose to not go this path (as before, it's more "numpy" than "pandas" AFAIU). Still, talking also about xarray etc, the general accesability by name may becomes more common.
- metadata: it seems that data is desired to have in many places metadata. "Lower" and "upper" seem to me a perfect example of that, label obviously. Nothing that every library (i.e. uproot) needs to know about, but that could be useful to carry around.
To go back specifically to BH and serialization:
- Is an "axis-access-by-name" implementation for the library a possibility (given xarray etc) or out-of-scope (I can see why it may is)?
- One could in the serialization standard only add a "name" attribute that will be autoconverted to and from "metadata" (if available). That's practical but not nice IMHO and could, worst case, just be another conversion layer (i.e. to make it HS3 compatible). Not great, not bad?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hist will look for "name": ...
in the metadata, and if it's present, that will be the .name
attribute on the axis, and you'll get access-by-name. Same with "label": ...
. When uproot is loading attributes from a ROOT file, it renames labels and names to match (all other metadata goes in as-is, like line widths and colors and such).
We could bake it into the standard instead, and then hist's job gets easier, while boost-histogram's job gets harder. Boost-histogram would have to have some way to handle names - currently this packing/unpacking is in Hist, so it's a little harder. That also could be done, so it's really a case of picking what's best for everyone else.
@HDembinski, what you do think? Rendered version is at https://uhi--105.org.readthedocs.build/en/105/serialization.html. Main focus currently is on names, etc. @aryamanjeendgar is going to start working on the HDF5 Python backend based on this proposal soon. |
* Serialization followed by deserialisation may cause axis changes. Axis types | ||
may change to an equivalent but less performant axis, growth status will be | ||
lost, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For an interchangable standard that is acceptable, but boost-histogram should always have also its own data format were no information is lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still have pickling, which is both stable and lossless. It's just not interchangeable. ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments regarding the names.
f43624d
to
cede081
Compare
Signed-off-by: Henry Schreiner <[email protected]> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
Signed-off-by: Henry Schreiner <[email protected]>
Signed-off-by: Henry Schreiner <[email protected]>
Signed-off-by: Henry Schreiner <[email protected]> Update ci.yml
Signed-off-by: Henry Schreiner <[email protected]>
4ac8fda
to
c620a5e
Compare
Signed-off-by: Henry Schreiner <[email protected]>
Signed-off-by: Henry Schreiner <[email protected]>
Signed-off-by: Henry Schreiner <[email protected]>
Signed-off-by: Henry Schreiner <[email protected]>
Signed-off-by: Henry Schreiner <[email protected]>
Signed-off-by: Henry Schreiner <[email protected]>
c620a5e
to
84272b6
Compare
Starting a schema for serialization as discussed at PyHEP dot dev.