-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some questions about implementing custom Codec classes #402
Comments
I am suspecting, though, that you mean you have metadata that is different for each chunk. That totally doesn't fit into the zarr/numcodecs model. Kerchunk/referenceFS has been thinking about stacking some of the chunk processing directly in the storage layer, where is work chunkwise, rather than in the zarr machinery to allow such a thing, even though it duplicates some code and logic. The initial use case is for simple offset/scale, but where the values to use depend on the specific chunk. |
Hi @martindurant thanks for the answers! We do indeed have per-chunk metadata for the quantization. Sounds like this will be a fun challenge for me later on when I try and get kerchunk/zarr to process these data. We did think about initializing the codec having done the analysis to get the scale and offset, which would work well for decode. I assume kerchunk/zarr can't initialise a filter chain per-chunk? That's something we could do inside Astropy to make this work while still presenting a Codec interface which might be a better base to build from? |
No. There is both the idea I described above in which kerchunk could store extra metadata per chunk and use those at load time, but also you may be interested by the discussion here. I describe passing a "context" to decode, so you know where in the overall array you are decoding. That would mean you could store the full list of arguments in the single codec instance, much closer to the current model. |
I am going to drop generalising the problem for a moment as I am pretty sure you are familiar enough with FITS to let me get away with it. The medium-term goal of the work we are doing is to get to the point where I can kerchunk a tile compressed image FITS file, which is where each tile of the image is stored in a vararr in one of a few possible columns (can change per-tile/row) and some per-chunk metadata for the decoding can be stored in other columns (scale & offset etc). I can imagine a situation where if you could set parameters for a codec chain for each row of the data in the heap, you could have the column offsets for that row be inputs to the first codec, which could split the columns in the row, then determine how to decode the data (i.e. if it needs dequantizing) and then read the other per-tile metadata out of the columns and then pass that through as inputs to the other codecs later in the chain. I can't see how you can maintain a per-tile kerchunked approach to these data without having the ability to set some metadata per-chunk. If you passed the whole table and a pointer to the heap you could decode the whole image, but I was hoping to be able to have a resulting array chunked up so you can just read and decode a single tile at a time. |
Exactly right. So we need to do something, whether that is keeping a list of parameters in the global metadata, or work in referenceFS to perform specific actions on a chunk in the storage layer. I am leaning towards the latter. An interesting similar situation arose recently with WCS parameters varying per image of a massive cube (I think you saw this in another thread, not sure where now), and that the set of WCS parameters could also be saved in the same parquet structure as the offset/size information. In this case, the codec doesn't need it, but the xarray flexible index mechanism does. Kerchunk/referenceFS can do all of this, but it feels just a little like we're circumventing zarr's new extension mechanism, unless the whole of the kerchunk framework itself becomes an extension. |
You mentioned parquet, which I don't have a lot of experience with, I was hoping that I would be able to put the referenceFS data in my existing asdf files, via a json representation, is the parquet thing a second option to json or something else? Mainly for curiosity, why do you think the approach should be adding this stuff to referenceFS? Does zarr not have per-chunk metadata at all? For things not related to the reading of the bytes? |
@Cadair, correct. There's nothing at the moment in the spec, though as @martindurant mentions, it's certainly conceivable to add a mechanism either via an extension or as a convention. |
Interesting, as @martindurant points out there's use cases for per-chunk coordinate information in sunpy/ndcube#222 and I obviously have per-tile compression metadata for FITS tiled image compression. |
It could in principle be put into JSON too. Not quite sure how it would live inside ASDF, would be interested to see what you are thinking. The reason for thinking parquet, is that we have an increasing number of cases where the number of chunks is large, and JSON encoding becomes a bottleneck for opening datasets at some point. That would be exacerbated if we store more information per chunk.
It's something that we can do right away, before zarr is ready to pass chunk details to the codec (which would still need storing all the parameters in the global metadata) or there is some kind of transformer. One could argue that implementing this in referenceFS would be the equivalent of the first such transformer; but storing the parameters alongside the offset/size information of a reference means implementing the decode in the same place makes sense to me. |
See also preffs which already stores references in parquet, and has the extra feature of concatenating byte ranges for keys that have more than one entry. That would open up another alternative closer to what some of the current codecs do: storing the chunk specific parameter in the chunk itself as a header/frame. Of course, we can't change the original chunk/tile, but we could store extra bytes for concatenation and so cheat. |
I haven't looked at it, but if you can serialise a ReferenceFS to JSON I was going to take that and give it to asdf to save in it's yaml metadata, then take it back out and de-serialise the ReferenceFS? Obviously, I haven't tried this, or even looked into it yet, so maybe I am missing something?
Sounds good, I am not sure when I am likely to get to this phase of the project (got some parental leave coming up in the new year 👶 and trying to get the first phase stuff done before), but is there an issue in fsspec tracking it somewhere I can keep an eye on / come back to when I get there? |
I don't see that this gets you anything beyond directly storing the JSON, which is what we do now. |
It means I can put it in my ASDF file we are already making and distributing to our users. |
You might also find interesting that ASDF (containing data!) is another future target for kerchunk. It is a fairly sensible format and not too dissimilar to zarr. I think astropy ought to be able to read from zarr, would not take too much to implement, but an alternative might be to pretend that zarrs are actually ASDF, which can already be loaded. I'm not sure how much work that would be. |
We (DKIST) have a weird situation where we are shipping FITS files and then we make an ASDF file which is a metadata-only file which let's you load a combined array (currently one chunk per-FITS) and have a WCS for the whole array etc. I have currently cooked up a not particularly efficient Dask way of doing this, but I am looking for kerchunk and zarr to replace it. My thinking is that if I can serialise the ReferenceFS to the ASDF then I can use that to make a zarr array and return that to my users on load. I feel like we are driving this a little off topic, |
Wonder if a good conclusion to this discussion would be adding some documentation here to make it a bit clearer how people can plug their own compressors in |
Hello 👋
Myself and @astrofrog are currently re-implementing how astropy handles the compression of tiled image data in it's FITS library. We have implemented the compression algorithms in the form of
Codec
classes (without introducing a required dependency on numcodecs). The main motivation for this is to allow projects like kerchunk and zarr etc to use these codecs in fun ways in the future.We have run into a few things which aren't clear to us as we have been learning how to make use of the Codec API:
numcodecs API states that "Inputs to and outputs from these methods may be any Python object exporting a contiguous buffer via the new-style Python protocol." What objects are acceptable here? I see some codecs implemented in numcodecs returning bytes and some returning numpy arrays from their decode methods. These objects don't strictly adhere to the buffer API (although obviously contain a contiguous buffer). How are the return types used, and what practical restrictions are there on them? We are currently returning a
memoryview
object from ourdecode
methods, but would it be a compatibility issue if we returned numpy arrays instead?The signature of the
Code.__init__
isn't specified as part of the abc. I assume from this line that it would be fine for all the settings to be keyword-only arguments to the constructor?Finally, and most unclear, is that for our FITS quantization methods, the
encode
method generates metadata (offset and scale) which are needed by thedecode
method to successfully round-trip through the quantization. What's the best way to handle this? Can we safely mutate the instance attributes of the Codec? Will the config of the codec be serialised to the metadata (by zarr or other things) after encoding in a way which makes this reliable? Are there any other examples of this kind of pattern we could learn from?The text was updated successfully, but these errors were encountered: