Add LGDO format conversion utilities #4

Closed
gipert opened this issue Sep 22, 2022 · 5 comments · Fixed by #30

@gipert
Member

gipert commented Sep 22, 2022

We should implement a method for each LGDO to convert underlying data to third-party formats like NumPy, Pandas, AwkwardArray. I'm thinking about something like:

lgdo_obj.convert(fmt="pandas.DataFrame", copy=False)

Where fmt could take pandas.DataFrame, numpy.ndarray, awkward.Array.
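A minimal sketch of how such a dispatch could look, assuming a hypothetical `ToyArray` class that stands in for an LGDO wrapping a NumPy array in an `nda` attribute. The class name, the `"data"` column name and the error handling are illustrative assumptions, not the actual implementation:

```python
import numpy as np
import pandas as pd


class ToyArray:
    """Hypothetical stand-in for an LGDO that wraps a NumPy array."""

    def __init__(self, nda):
        self.nda = np.asarray(nda)

    def convert(self, fmt="numpy.ndarray", copy=False):
        """Return the underlying data in the requested third-party format."""
        if fmt == "numpy.ndarray":
            return self.nda.copy() if copy else self.nda
        if fmt == "pandas.DataFrame":
            return pd.DataFrame({"data": self.nda}, copy=copy)
        raise ValueError(f"unsupported format: {fmt!r}")


obj = ToyArray([1.0, 2.0, 3.0])
arr = obj.convert(fmt="numpy.ndarray")               # no copy requested
df = obj.convert(fmt="pandas.DataFrame", copy=True)  # explicit copy
```

Keeping the format name as a string means the third-party packages only need to be importable when the corresponding branch is actually used, if the imports are moved inside the branches.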

This way, we would store the conversion code along with the LGDO implementation and make it easier to jump between data representations (like in load_nda(), load_pd(), build_tcm(), the DataLoader, etc).

We need of course to make a distinction between copy and zero-copy conversions.
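To illustrate the distinction with plain NumPy (a sketch, independent of any LGDO class): a zero-copy result shares its buffer with the source, so later writes to the source show through, while a copy does not.

```python
import numpy as np

data = np.arange(5)

view = np.asarray(data)           # zero-copy: same underlying buffer
dup = np.array(data, copy=True)   # explicit copy: independent buffer

shares_view = np.shares_memory(data, view)
shares_dup = np.shares_memory(data, dup)

data[0] = 99  # visible through the view, not through the copy
```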

@gipert gipert added the enhancement New feature or request label Sep 22, 2022
@gipert
Member Author

gipert commented Oct 3, 2022

I propose then to deprecate load_nda() (and load_pd()) in favor of:

store.read_object("obj", "file.lh5").convert(fmt="numpy.ndarray")

Which would return the same.
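One way the deprecation could be staged is a thin wrapper that warns and forwards to the new call. This is only a sketch: the simplified `load_nda(name, lh5_file)` signature and the `read_and_convert` helper are hypothetical placeholders, not pygama's actual API.

```python
import warnings


def read_and_convert(name, lh5_file):
    """Hypothetical placeholder for store.read_object(...).convert(...)."""
    return f"{name} from {lh5_file}"


def load_nda(name, lh5_file):
    """Deprecated wrapper kept for backward compatibility (simplified signature)."""
    warnings.warn(
        "load_nda() is deprecated, use read_object(...).convert(fmt='numpy.ndarray')",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this wrapper
    )
    return read_and_convert(name, lh5_file)


# Record the warning so it can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = load_nda("obj", "file.lh5")
```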

This new convert() function should also handle units at some point. With numpy.ndarray, we could just use Pint's NumPy support and that should work. With pandas.DataFrame, we could use pint-pandas, but I'm not sure whether the package is fully functional.

@gipert gipert self-assigned this Jan 12, 2023
@gipert gipert transferred this issue from legend-exp/pygama May 23, 2023
@gipert gipert added this to the v2 milestone Oct 25, 2023
@gipert gipert removed their assignment Oct 25, 2023
@MoritzNeuberger
Contributor

I am confused about how the return type annotation would work in this case. Can you have a single function with multiple types of output depending on the input parameters?

@gipert
Member Author

gipert commented Oct 30, 2023

Yes, it would look like this:

def convert(...) -> pandas.DataFrame | numpy.ndarray | ...:
    pass
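If a type checker should be able to narrow the return type from the value of the format argument, one option (not discussed in this thread, so treat it as a suggestion) is `typing.overload` combined with `Literal`. A minimal sketch with a hypothetical free function that takes the data explicitly:

```python
from typing import Literal, overload

import numpy as np
import pandas as pd


@overload
def convert(data: np.ndarray, fmt: Literal["numpy.ndarray"]) -> np.ndarray: ...
@overload
def convert(data: np.ndarray, fmt: Literal["pandas.DataFrame"]) -> pd.DataFrame: ...


def convert(data, fmt):
    """Runtime implementation; the overloads above only inform type checkers."""
    if fmt == "numpy.ndarray":
        return data
    if fmt == "pandas.DataFrame":
        return pd.DataFrame({"data": data})
    raise ValueError(f"unsupported format: {fmt!r}")


as_nda = convert(np.arange(3), "numpy.ndarray")
as_df = convert(np.arange(3), "pandas.DataFrame")
```

The plain union annotation remains valid; the overloads only spare callers an `isinstance` check or a cast when the format is a literal string.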

@MoritzNeuberger
Contributor

MoritzNeuberger commented Nov 2, 2023

Over the last few days, I have been playing around with implementing this feature. For the most part, it is straightforward, although a few questions arose:

VectorOfVectors:

  • To convert it to a numpy.ndarray, I currently first convert it to an ArrayOfEqualSizedArrays (aoesa) with to_aoesa and then call that object's convert function. However, to_aoesa allocates the nda with np.empty, so when preserve_dtype is set to True the previously empty entries are left holding uninitialized, arbitrary values. I assume that setting preserve_dtype to False, in which case these entries would be set to nan, is not preferable.
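The trade-off described above can be reproduced with plain NumPy. The sketch below pads a ragged array stored as flattened data plus cumulative lengths; the sample values are made up, and it is not the actual `to_aoesa` code:

```python
import numpy as np

# Hypothetical ragged data in the flattened + cumulative-length layout
flattened_data = np.array([1, 2, 3, 4, 5, 6], dtype="int32")
cumulative_length = np.array([2, 3, 6])  # row lengths: 2, 1, 3

starts = np.concatenate(([0], cumulative_length[:-1]))
lengths = cumulative_length - starts
n_rows, max_len = len(lengths), lengths.max()

# Padding with NaN requires a floating-point dtype, so int32 is not preserved
padded = np.full((n_rows, max_len), np.nan, dtype="float64")
for i, (start, length) in enumerate(zip(starts, lengths)):
    padded[i, :length] = flattened_data[start : start + length]

# np.empty((n_rows, max_len), dtype="int32") would keep the dtype but leave
# the padding slots uninitialized, which is the problem described above.
```

A third possibility, if the dtype must be kept, would be an explicit fill value chosen by the caller, at the cost of making padding indistinguishable from real data with that value.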

Struct/Table:

  • The conversion to numpy.ndarray is not straightforward either. For now, I solved it by returning a dict that holds the keys and the values of the Struct/Table in two separate numpy arrays. What would be a better way to implement this?
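One candidate for tables whose columns are flat, equal-length arrays is a NumPy structured array, which keeps the column names and per-column dtypes in a single ndarray. A sketch with made-up column names; it would not cover a general Struct with nested or unequal-length members:

```python
import numpy as np

# Hypothetical table: column name -> equal-length NumPy array
columns = {
    "energy": np.array([1.5, 2.5, 3.5]),
    "channel": np.array([3, 7, 3], dtype="int16"),
}

dtype = [(name, col.dtype) for name, col in columns.items()]
n_rows = len(next(iter(columns.values())))

# Filling a structured array copies each column, so this is not zero-copy
structured = np.empty(n_rows, dtype=dtype)
for name, col in columns.items():
    structured[name] = col
```

Columns are then accessible as `structured["energy"]` and rows as `structured[1]`.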

WaveformTable/encoded data:

  • Does it need convert?

copy:

  • I have implemented this option wherever possible. That is, always for pandas.DataFrame and when necessary for numpy.ndarrays. AFAIK awkward arrays usually do not copy?

ToDos:

  • Figure out how to implement units with pint. Is this possible with awkward arrays?
  • Write tests

@MoritzNeuberger
Contributor

I think it would be easier to see in code. I will prepare a PR with the status as it is at the moment.

MoritzNeuberger added a commit to MoritzNeuberger/legend-pydataobj that referenced this issue Nov 2, 2023
…utilities

The idea is to add a `convert` function to each LGDO datatype that converts the underlying data to a third-party datatype.
These are `pandas.DataFrame`, `numpy.ndarray` and `awkward.Array`.
Additionally, you have the option to control whether `convert` copies data or not.

At the moment, these issues are still open:

[ ] How to use `to_aoesa` to convert VectorOfVectors to `numpy.ndarray`?
[ ] How to implement the conversion of structures/tables to `numpy.ndarray`?
[ ] How to implement the `convert` function for WaveformTable and encoded data?
[ ] Find out how to implement units with pint. Is it possible for awkward arrays?
[ ] Write many, many tests.
@gipert gipert linked a pull request Nov 3, 2023 that will close this issue
5 tasks