can VLenArray support 2D arrays #199
Comments
Hi @sofroniewn, I haven't thought about this deeply, but I imagine the codec could be modified to be aware of the expected number of dimensions in each array, and then to encode the lengths of all dimensions. Currently the encode method encodes the data by interleaving the array lengths and the array buffers into a single contiguous buffer. So for 2D arrays you would have to store 2 ints, then data, then 2 ints, then data, etc. That would be relatively straightforward if all the arrays are 2D. It might get a bit messier if you had a mix of arrays with different numbers of dimensions. You may find it easier to flatten the arrays, keep track of the array shapes separately, and reshape on reading :-)
Btw, I don't mean this to sound discouraging; I'm happy to help further if you think it's worth exploring changes to the codec.
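For reference, the flatten-and-reshape workaround mentioned above could look roughly like the sketch below. It is only an illustration: plain nested Python lists stand in for NumPy arrays, and the helper names `flatten_2d` and `unflatten_2d` are hypothetical, not part of numcodecs or Zarr.

```python
from typing import List, Sequence, Tuple


def flatten_2d(rows: Sequence[Sequence[int]]) -> Tuple[List[int], Tuple[int, int]]:
    """Flatten a rectangular N x D nested list; return (flat values, shape)."""
    n = len(rows)
    d = len(rows[0]) if n else 0
    if any(len(row) != d for row in rows):
        raise ValueError("all rows must have the same length")
    flat = [value for row in rows for value in row]
    return flat, (n, d)


def unflatten_2d(flat: Sequence[int], shape: Tuple[int, int]) -> List[List[int]]:
    """Rebuild the N x D nested list from the flat values and the stored shape."""
    n, d = shape
    if len(flat) != n * d:
        raise ValueError("flat data does not match the stored shape")
    return [list(flat[i * d:(i + 1) * d]) for i in range(n)]


# A ragged collection: every item is N x D with D fixed (here 2) and N varying.
items = [
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8]],
]

# Store the flat 1-D data (which a 1-D variable-length codec can handle)
# and keep the shapes in a separate list.
flat_items = []
shapes = []
for item in items:
    flat, shape = flatten_2d(item)
    flat_items.append(flat)
    shapes.append(shape)

# On reading, reshape each flat item using its stored shape.
restored = [unflatten_2d(flat, shape) for flat, shape in zip(flat_items, shapes)]
```

The cost of this approach is that the shapes have to be stored and kept in sync somewhere outside the codec, which is the bookkeeping a codec-level solution would avoid.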
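@alimanfoo thanks for thinking about this. If you could do it in the codec with guaranteed 2D arrays then I'd strongly prefer that compared to having to do the flattening and reshaping on my end. It will make my apis much simpler and more consistent: sometimes I just have normal arrays, sometimes I have these ragged arrays, and the code will look much more similar if the codec can handle it.
One proposal might be to go all the way to the full general case and support a mix of arrays with different numbers of dimensions too, where we interleave everything into a single contiguous buffer in which the first number is the number of dimensions, say `D`, then the next `D` numbers are the lengths of each of the `D` dimensions, and then comes the flattened array.
The change to the existing codec for 1-D arrays would be that an additional `1` would appear at the beginning of every block. For my all-2-D arrays there would be an additional `2` at the beginning of every block, but this scheme would support the fully general case of mixing.
What are your thoughts? I'm new to the concepts in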
On Fri, 6 Sep 2019 at 15:49, Nicholas Sofroniew ***@***.***> wrote:
@alimanfoo <https://github.com/alimanfoo> thanks for thinking about this.
If you could do it in the codec with guaranteed 2D arrays then I'd
strongly prefer that compared to having to do the flattening and reshaping
on my end. It will make my apis much simpler and more consistent -
sometimes I just have normal arrays, sometimes I have these ragged arrays
and the code will look much more similar if the codec can handle it.
One proposal might be to go all the way to the full general case and
support a mix of arrays with different numbers of dimensions too, where we
interleave everything into a single contiguous buffer in which the first
number is the number of dimensions, say D, then the next D numbers are
the lengths of each of the D dimensions, and then comes the flattened
array.
The change to the existing codec for 1-D arrays would be that an additional 1
would appear at the beginning of every block. For my all-2-D arrays there
would be an additional 2 at the beginning of every block, but this scheme
would support the fully general case of mixing.
This approach sounds fine to me, we'd only need a single byte to store the
number of dimensions, then 4 bytes for each dimension to store the lengths.
I might be inclined to code this up as a separate codec, rather than adapt
the existing VLenArray codec, just because we would not have to worry about
any data migration issues.
E.g., create a new codec class called something like VLenNDArray, with
codec ID "vlen-ndarray".
Within Zarr we could then add a convenience to use dtype="ndarray:T", which
is a shorthand for dtype=object, object_codec=numcodecs.VLenNDArray(T).
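To make the proposed block layout concrete, here is a rough sketch of a single block header: one byte for the number of dimensions, then four bytes per dimension length, followed by the flattened data. The function names `pack_block` and `unpack_header` are hypothetical, and the little-endian unsigned encoding is an assumption for illustration, not something decided in this thread.

```python
import struct


def pack_block(shape, payload: bytes) -> bytes:
    """Prefix the flattened array bytes with ndim (1 byte) and the shape (4 bytes each)."""
    ndim = len(shape)
    if ndim > 255:
        raise ValueError("ndim must fit in a single byte")
    # "<" means little-endian with no padding (an assumption of this sketch);
    # "B" is one unsigned byte, "I" is one unsigned 4-byte integer.
    header = struct.pack("<B%dI" % ndim, ndim, *shape)
    return header + payload


def unpack_header(buf: bytes, offset: int = 0):
    """Read ndim and the shape; return (shape, offset of the first data byte)."""
    (ndim,) = struct.unpack_from("<B", buf, offset)
    offset += 1
    shape = struct.unpack_from("<%dI" % ndim, buf, offset)
    offset += 4 * ndim
    return shape, offset


# A 2 x 3 array of int32 values, already flattened to bytes.
payload = struct.pack("<6i", 0, 1, 2, 3, 4, 5)
block = pack_block((2, 3), payload)
shape, data_start = unpack_header(block)
```

For a 2-D array the header costs 9 bytes per block (1 + 2 * 4), compared with 4 bytes for a single length in the 1-D case.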
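A new codec class VLenNDArray with convenience shorthands makes sense. I'm happy to give this a try myself - though as I said I'm new to the codebase, so any additional tips before I get started would be great if that's ok with you.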
Cool, thank you. I'd start by copy-pasting the VLenArray codec class.
During encoding, the codec makes two passes over the input, one to collect
the lengths, the second to write the data. So the first pass would need to
be modified to also collect the number of dimensions for each array, and to
store lengths for multiple dimensions. The second pass would then need to
be modified to include writing out the number of dimensions and lengths of
all dimensions.
During decoding, there are then some modifications to read back the number
of dimensions and dimension lengths.
HTH, give me a shout if any questions about this or how to set up a dev
environment to test locally.
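The two-pass idea described above could be sketched in pure Python as follows. This is not the numcodecs implementation: each array is represented here as a `(shape, flat_values)` pair, the elements are assumed to be int32, and the names `encode` and `decode` are placeholders for illustration.

```python
import struct
from math import prod


def encode(arrays) -> bytes:
    """Encode a list of (shape, flat_values) pairs into one contiguous buffer."""
    # Pass 1: collect ndim and dimension lengths to compute the total size.
    total = 0
    for shape, values in arrays:
        if prod(shape) != len(values):
            raise ValueError("shape does not match the number of values")
        total += 1 + 4 * len(shape) + 4 * len(values)

    out = bytearray(total)

    # Pass 2: write ndim, the dimension lengths, then the flattened data.
    offset = 0
    for shape, values in arrays:
        ndim = len(shape)
        struct.pack_into("<B%dI" % ndim, out, offset, ndim, *shape)
        offset += 1 + 4 * ndim
        struct.pack_into("<%di" % len(values), out, offset, *values)
        offset += 4 * len(values)
    return bytes(out)


def decode(buf: bytes):
    """Read back (shape, flat_values) pairs until the buffer is exhausted."""
    arrays = []
    offset = 0
    while offset < len(buf):
        ndim = buf[offset]
        offset += 1
        shape = struct.unpack_from("<%dI" % ndim, buf, offset)
        offset += 4 * ndim
        count = prod(shape)
        values = list(struct.unpack_from("<%di" % count, buf, offset))
        offset += 4 * count
        arrays.append((shape, values))
    return arrays


# A mix of 2-D and 1-D arrays, as in the fully general case.
data = [
    ((2, 2), [1, 2, 3, 4]),
    ((3,), [5, 6, 7]),
    ((1, 3), [8, 9, 10]),
]
encoded = encode(data)
decoded = decode(encoded)
```

Computing the total size in the first pass lets the second pass write into a single preallocated buffer instead of growing it block by block.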
On Fri, 6 Sep 2019 at 16:29, Nicholas Sofroniew ***@***.***> wrote:
A new codec class VLenNDArray with convenience shorthands makes sense.
I'm happy to give this a try myself - though as I said I'm new to the
codebase, so any additional tips before I get started would be great if
that's ok with you.
@alimanfoo Could you take a look at @sofroniewn's work (#200)? He has been pinging people, but there seems to be no response from the Zarr developers, which would be a shame given his hard work. I'm interested in this functionality to store 2-channel audio recordings of varying length per recording.
Hi @NumesSanguis, sorry for the radio silence on this one. I've taken a look at the PR and it seems good; I've added a few small comments.
Right now I think the `VLenArray` only supports 1-D arrays. What would it take to extend support to 2-D arrays too? I have a list of 2D numpy arrays that I'd like to save as a zarr file using something like
where `foo` is my list of 2D numpy arrays (they are all NxD where D is fixed across the list but N is variable), but I currently get an error message that ends with
If this sounds like a bad idea I can think of workarounds where I flatten the arrays ahead of time, keep track of that `D` number, and reshape them on reading, but I thought I'd ask! Thanks!!