Passing multiple kerchunk sideload files to `open_mfdataset`, not possible with intake #135
I wonder, does it work to phrase the URL as: ? By the way, xarray typically still has to do a certain amount of work in such a case, so you might want to use `kerchunk.combine.MultiZarrToZarr` to create a single reference set across all the inputs, so that you don't need `open_mfdataset` at all.
> even providing
That was the initial goal, but right now `MultiZarrToZarr` only supports regular chunking between files. The data has many dimensions and most are not chunked regularly. There isn't a way around it as far as I know? PS: Almost gave up on using kerchunk, and the
It ought not to be too complex to fold this into intake-xarray. We do try to stay close to what xarray itself offers, so one could argue that if `open_mfdataset` accepts a list of URLs or paths, it should allow for a list of storage_options-per-path too, and then everyone gets this kind of workflow, not just intake users.
We require ZEP003 in zarr. Please see the discussion and this draft implementation: zarr-developers/zarr-python#1483
I just came across this issue as I was searching for an option to merge two datasets originating from two kerchunk reference datasets with different chunk sizes. I tested the workflow with:

```python
import xarray as xr

xr.open_mfdataset(
    ['reference://::ref1.json', 'reference://::ref2.json'],
    engine='zarr',
    storage_options={'remote_protocol': 's3', 'remote_options': {'anon': True}},
)
```

and it also works with intake:

```yaml
sources:
  some_dataset:
    driver: zarr
    args:
      urlpath:
        - reference://::ref1.json
        - reference://::ref2.json
      storage_options:
        remote_protocol: s3
        remote_options:
          anon: true
```
Here is a working example:

```python
import intake

cat = intake.open_catalog("https://github.com/ISSI-CONSTRAIN/isccp/raw/main/catalog.yaml")
cat['ISCCP_BASIC_HGH'].to_dask()
```
Standard intake plugins seem to support glob (`*`) or list `urlpath`s to consume multiple files with `open_mfdataset`. This approach isn't suitable for the `intake_xarray.xzarr.ZarrSource` plugin, since it expects `urlpath: "reference://"` and uses `storage_options::fo` to load the sideload file. Ideally the catalog `fo` should be able to accept glob paths?

More details:
Having many netCDF files with variable dimensions, we hit the "irregular chunk size between files" issue trying to use kerchunk. So instead of combining the netCDF files into a single sideload JSON file, we created a sideload .json for each netCDF file and let xarray take care of the merge. For our datasets this was good enough, and it made working with several months of remote data possible.
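For readers unfamiliar with the sideload files being discussed, here is a minimal sketch of what one per-file reference JSON looks like (the kerchunk version-1 reference format; the bucket, file, variable, offset, and length values are hypothetical):

```python
import json

# Keys are zarr store paths; values are either inline metadata strings
# or [url, offset, length] triples pointing into the original netCDF file.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/.zarray": json.dumps({
            "chunks": [10, 10],
            "compressor": None,
            "dtype": "<f4",
            "fill_value": None,
            "filters": None,
            "order": "C",
            "shape": [10, 10],
            "zarr_format": 2,
        }),
        # one chunk: read 400 bytes at offset 20000 of the remote file
        "temp/0.0": ["s3://some-bucket/data_2021-01.nc", 20000, 400],
    },
}

# one such JSON per source netCDF file
with open("ref1.json", "w") as f:
    json.dump(refs, f)
```

Opening `reference://::ref1.json` with `engine='zarr'` then makes fsspec resolve each chunk read against the original remote file.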
Using xarray `open_mfdataset` directly, it was possible to use multiple JSONs. It would have been nice to get rid of this code and use an intake catalog.
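Since each netCDF file gets its own reference JSON, the URL list for `open_mfdataset` can be built from a glob over the local sideload files. A small sketch (the `reference_urls` helper is hypothetical; it assumes fsspec's `reference://::<path>` URL form used elsewhere in this thread):

```python
from glob import glob

def reference_urls(pattern):
    """Expand local sideload JSONs into fsspec reference:// URLs."""
    # "reference://::<path>" tells fsspec's reference filesystem which
    # sideload file to load; one URL per kerchunked netCDF file.
    return [f"reference://::{path}" for path in sorted(glob(pattern))]

# usage sketch (not run here):
# urls = reference_urls("refs/*.json")
# ds = xr.open_mfdataset(urls, engine="zarr",
#                        storage_options={"remote_protocol": "s3"})
```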