Update vov description for flattened_data other than array<1> #8

Closed
wants to merge 2 commits into from

jasondet

No description provided.

@jasondet jasondet requested review from oschulz and gipert January 14, 2024 22:32
@jasondet
Author

@oschulz @gipert I've been wanting for a long time to update the vov spec to allow for flattened data other than array<1>. Please review this suggested change where we simply specify that it can be any array-like datatype.

@@ -73,16 +73,16 @@ Flat ``n``-dimensional arrays are stored as ``n``-dimensional HDF5 datasets.

A vector of vectors of unqual sizes is stored as an HDF5 group that contains two datasets:

-* A 1-dimensional dataset `flattened_data` that stores the concatenation of all vectors into a single vector.
-* A 1-dimensional dataset `cumulative_length` that stores the cumulative sum of the length of all vectors.
+* An array-like dataset `flattened_data` that stores the concatenation of all vectors into a single vector. Can be `*array<n>{...}`, `table{...}`, etc.
Contributor

The flattened data should always be a one-dimensional vector, I think. If we want to support vectors of multi-dimensional arrays of non-equal size, we need an additional dataset that provides the size of dimensions 2 to n of each member array (in Julia, in ArraysOfArrays.VectorOfArrays, we use an additional vector kernel_size for this).
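
As a rough illustration of the extra dataset described in this comment, the following Python sketch stores two-dimensional members of differing shape in a 1-dimensional `flattened_data`. The name `kernel_size` is borrowed from the Julia type mentioned above. The functions `encode` and `decode`, the row-major layout, and the choice to count flat elements in `cumulative_length` are assumptions made for this example, not part of the spec.

```python
from itertools import accumulate


def encode(matrices):
    """Flatten non-empty 2-D members (lists of equal-length rows) of differing shape."""
    # 1-dimensional concatenation of every element of every member
    flattened_data = [x for m in matrices for row in m for x in row]
    # cumulative number of flat elements after each member (assumed convention)
    cumulative_length = list(accumulate(len(m) * len(m[0]) for m in matrices))
    # hypothetical extra dataset: size of dimension 2 of each member
    kernel_size = [len(m[0]) for m in matrices]
    return flattened_data, cumulative_length, kernel_size


def decode(flattened_data, cumulative_length, kernel_size):
    """Rebuild the members from the three datasets produced by encode()."""
    members = []
    start = 0
    for stop, width in zip(cumulative_length, kernel_size):
        chunk = flattened_data[start:stop]
        # cut the flat chunk back into rows of the recorded width
        members.append([chunk[i:i + width] for i in range(0, len(chunk), width)])
        start = stop
    return members


matrices = [
    [[1, 2, 3], [4, 5, 6]],  # shape (2, 3)
    [[7, 8], [9, 10]],       # shape (2, 2)
]
flattened_data, cumulative_length, kernel_size = encode(matrices)
restored = decode(flattened_data, cumulative_length, kernel_size)
```

Without `kernel_size`, the two members could not be told apart from two flat vectors of lengths 6 and 4, which is why the flat data alone is not enough for multi-dimensional members.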

@gipert gipert self-requested a review March 6, 2024 15:03
@gipert
Member

gipert commented Mar 6, 2024

@jasondet what was the use case you had in mind, again?

I'm not sure your proposed changes include this, but I would like to have an additional dimension in VectorOfVectors, to store, in the event tier, data for each SiPM pulse in a waveform:

[ # event dimension
  [ # channel dimension
    [1, 2, 3, 4], # pulse in channel dimension
    [5, 6, 7, 8, 9]
  ],
  [
    [1, 2, 3, 4],
    [5, 6, 7, 8, 9],
    [0]
  ],
]

I also believe, like @oschulz says, that this should not be represented by a structure of nested LGDOs, but should rather be a single LGDO. Would also make it easier to read in as an Awkward Array in legend-pydataobj (we could then rewrite VectorOfVectors to wrap it, as I wanted to do).

@oschulz do you have a proposal in mind?

@gipert
Member

gipert commented Mar 6, 2024

For reference, this is how Awkward folks recommend writing arrays to disk: https://awkward-array.org/doc/main/reference/generated/ak.to_buffers.html#ak-to-buffers

@oschulz
Contributor

oschulz commented Mar 7, 2024

> I'm not sure your proposed changes include this, but I would like to have an additional dimension in VectorOfVectors, to store, in the event tier, data for each SiPM pulse in a waveform

So do you want vectors-of-vectors-of-vectors (this we can do already), or a vector of two-dimensional arrays of varying size (this we don't cover yet)?

@gipert
Member

gipert commented Mar 7, 2024

vectors-of-vectors-of-vectors. Do we really support this already? How? I'm confused

@oschulz
Contributor

oschulz commented Mar 7, 2024

> vectors-of-vectors-of-vectors. Do we really support this already? How? I'm confused

Yes, we do - the Julia event-tier files use it (multiple hits in multiple LAr-channels in multiple events). It worked almost out of the box (had to do one small bugfix).

The LH5 datatype is simply "array<1>{array<1>{array<1>{real}}}" and flattened_data is a vector-of-vectors with datatype "array<1>{array<1>{real}}".

So our vectors-of-vectors are just "naturally" nestable.
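
To make the nesting concrete, here is a minimal Python sketch that encodes the events/channels/pulses example from earlier in this thread. The dataset names and datatype strings are the ones quoted above. The helper `flatten_once` and the use of plain dicts in place of HDF5 groups are assumptions made for illustration only.

```python
from itertools import accumulate


def flatten_once(vectors):
    """Split a list of vectors into (flattened_data, cumulative_length)."""
    flattened_data = [item for vec in vectors for item in vec]
    cumulative_length = list(accumulate(len(vec) for vec in vectors))
    return flattened_data, cumulative_length


# events -> channels -> pulses, as in the example earlier in this thread
events = [
    [[1, 2, 3, 4], [5, 6, 7, 8, 9]],
    [[1, 2, 3, 4], [5, 6, 7, 8, 9], [0]],
]

# Outer level: the flattened content is itself a vector of vectors.
channels, outer_cumulative_length = flatten_once(events)
# Inner level: the flattened content is a plain vector of numbers.
pulses, inner_cumulative_length = flatten_once(channels)

# Plain dicts stand in for HDF5 groups here (an assumption of this sketch).
layout = {
    "datatype": "array<1>{array<1>{array<1>{real}}}",
    "cumulative_length": outer_cumulative_length,
    "flattened_data": {
        "datatype": "array<1>{array<1>{real}}",
        "cumulative_length": inner_cumulative_length,
        "flattened_data": pulses,
    },
}
```

Applying the same flattening step once per level is all that is needed: the outer `cumulative_length` counts channels per event, and the inner one counts pulse values per channel.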

@oschulz
Contributor

oschulz commented Mar 8, 2024

To clarify this a bit more: A vector-of-vectors-of-vectors, in this scheme (and I think this is natural), is a vector-of-(vectors-of-vectors). And so it can simply be constructed using a vector-of-vectors as its flattened content. In Julia, we use ArraysOfArrays.VectorOfArrays as the in-memory representation, which supports this kind of nesting out of the box, without having a single special line of code for it (this is why I think it's the natural approach: with a generically written vector-of-vectors it "just works").
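
The same idea can be sketched in Python. The class below is a hypothetical, simplified stand-in written for this illustration; it is not the legend-pydataobj `VectorOfVectors` nor a port of the Julia type. It only assumes that `flattened_data` supports `len` and contiguous slicing, so another instance of the class can serve as the flattened content.

```python
class VectorOfVectors:
    """Hypothetical minimal vector-of-vectors over any sliceable flattened_data."""

    def __init__(self, flattened_data, cumulative_length):
        self.flattened_data = flattened_data
        self.cumulative_length = list(cumulative_length)

    def __len__(self):
        return len(self.cumulative_length)

    def _start_of(self, i):
        # position in flattened_data where member i begins
        return self.cumulative_length[i - 1] if i > 0 else 0

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, step = index.indices(len(self))
            if step != 1:
                raise ValueError("only contiguous slices are supported in this sketch")
            stop = max(stop, start)
            begin, end = self._start_of(start), self._start_of(stop)
            # rebase the cumulative lengths so the slice starts at zero
            lengths = [c - begin for c in self.cumulative_length[start:stop]]
            return VectorOfVectors(self.flattened_data[begin:end], lengths)
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("member index out of range")
        return self.flattened_data[self._start_of(index):self.cumulative_length[index]]


def to_list(obj):
    """Display helper: convert nested VectorOfVectors into nested lists."""
    if isinstance(obj, VectorOfVectors):
        return [to_list(obj[i]) for i in range(len(obj))]
    return obj


# The inner level holds pulses per channel; the outer level holds channels per event.
inner = VectorOfVectors(
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    [4, 9, 13, 18, 19],
)
outer = VectorOfVectors(inner, [2, 5])
second_event = to_list(outer[1])
```

Indexing `outer` slices `inner`, which returns another `VectorOfVectors`, so the third level works without nesting-specific code in `__getitem__`; only the `to_list` display helper checks for the class.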

@gipert
Member

gipert commented Mar 8, 2024

> Yes, we do

Well, it's actually not documented: https://legend-exp.github.io/legend-data-format-specs/dev/hdf5/#Vector-of-vectors.

Anyway, I will put this legend-pydataobj development on my to-do list. We need to switch to Awkward arrays for this.

@oschulz
Contributor

oschulz commented Mar 8, 2024

> Well, it's actually not documented

True, the example there was only for vector-of-vector-of-reals. I've added how nesting works to the documentation now.

@gipert
Member

gipert commented Mar 8, 2024

Nice thanks!

@oschulz
Contributor

oschulz commented Mar 8, 2024

@jasondet should we close this for now, then?

@gipert gipert closed this Jul 15, 2024
@jasondet
Author

yeah should be okay thx oli


3 participants