HyCOM Public Zarr #163

Open
norlandrhagen opened this issue Oct 16, 2024 · 20 comments

Comments

@norlandrhagen
Contributor

I'm at the pangeo showcase talk. Shane Elipot has a massive public ocean model Zarr output on the AWS public data program. I think it's split into 12 separate Zarr stores.

https://github.com/selipot/hycom-oceantrack?tab=readme-ov-file

Wondering if LEAP folks would find this useful? @jbusecke

@jbusecke
Contributor

Ohhhh that looks really cool! @dhruvbalwada might be interested in this? I guess this will work, but might not be as fast as on gcs. Wondering if we should have a badge for the 'cloud'? But either way, this would be dope to link in.

@dhruvbalwada

Would be great to link to this!

@norlandrhagen
Contributor Author

Great! @dhruvbalwada if you have some background on this dataset, do you have any interest in doing a bit of exploring on which of these Zarr stores would be useful? Seems like there are Zarr stores per variable as well as lagrangian vs eulerian versions.

@dhruvbalwada

@norlandrhagen I think all of these will potentially be useful. (This dataset is very complementary to the LLC4320 data that was made available through Pangeo and has been used by many.)

Is the discussion here just about providing a link to these datasets? Or is this something that will cost LEAP, so that we have some resource constraint?

@jbusecke
Contributor

@dhruvbalwada the former. It will be very beneficial to get an idea of how to present these stores in the catalog in a meaningful way.

@dhruvbalwada

Happy to help with that, let me know what you would like me to actually do.

@norlandrhagen
Contributor Author

Awesome! Thanks for the expertise @dhruvbalwada.

I think a good start would be to see if you can access / catalog these Zarr stores.

I think the data is here, but I haven't explored it yet.

Also might be some clues here.

The data producer / speaker, Shane Elipot, seems super nice and was eager to have people using his data. I bet you/we could reach out to him with questions.

I think ideally we have a table of Zarr stores we want to add to the catalog + some metadata.

ex:

|------------------------------|----------------------------------------------|
| dataset_name_variable        | zarr store link                              |
|------------------------------|----------------------------------------------|
| lagrangian_HYCOM_u_component | s3://../../lagrangian_HYCOM_u_component.zarr |
| lagrangian_HYCOM_v_component | s3://../../lagrangian_HYCOM_v_component.zarr |
|------------------------------|----------------------------------------------|

@jbusecke
Contributor

Just played around with the data a bit, and wanted to note some points:

  • It seems we need to provide anon=True. I am not quite sure how to provide this as a kwarg to xarray. This works:
import s3fs
import xarray as xr

# Anonymous (unsigned) access to the public bucket
fs = s3fs.S3FileSystem(anon=True)
mapper = fs.get_mapper("s3://hycom-global-drifters/lagrangian/global_hycom_0m_step_1.zarr")
xr.open_dataset(mapper, engine='zarr')

but this doesn't:

xr.open_dataset("s3://hycom-global-drifters/lagrangian/global_hycom_0m_step_1.zarr", engine='zarr')

We might need a way for the catalog to add custom kwargs to the snippet due to this!

  • This dataset has a lot of different 'steps'. I have no clue if we could potentially virtually concatenate these?
'hycom-global-drifters/lagrangian/',
 'hycom-global-drifters/lagrangian/global_hycom_0m_step_1.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_0m_step_10.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_0m_step_11.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_0m_step_2.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_0m_step_3.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_0m_step_4.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_0m_step_5.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_0m_step_6.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_0m_step_7.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_0m_step_8.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_0m_step_9.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_15m_step_1.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_15m_step_10.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_15m_step_11.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_15m_step_2.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_15m_step_3.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_15m_step_4.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_15m_step_5.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_15m_step_6.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_15m_step_7.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_15m_step_8.zarr',
 'hycom-global-drifters/lagrangian/global_hycom_15m_step_9.zarr'
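
On the anon=True point above: as far as I know, xarray's zarr backend accepts a storage_options dict and forwards it to fsspec/s3fs, so the plain URL form may work once anonymous access is requested that way. A minimal sketch, assuming a reasonably recent xarray with s3fs installed; the helper names anon_open_kwargs and open_public_store are made up for illustration and are not part of any catalog API:

```python
def anon_open_kwargs():
    """Keyword arguments for opening a Zarr store on a public S3 bucket.

    storage_options is handed to fsspec/s3fs; anon=True requests unsigned access.
    """
    return {"engine": "zarr", "chunks": {}, "storage_options": {"anon": True}}


def open_public_store(url):
    """Open a public Zarr store lazily (hypothetical helper, needs network access)."""
    import xarray as xr  # imported here so the kwargs helper works without xarray

    return xr.open_dataset(url, **anon_open_kwargs())


# Intended use (not executed here because it needs network access):
# ds = open_public_store(
#     "s3://hycom-global-drifters/lagrangian/global_hycom_0m_step_1.zarr"
# )
```

If this holds up, the catalog snippet would only need to carry the extra storage_options entry rather than a separate s3fs mapper.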

@norlandrhagen
Contributor Author

> This dataset has a lot of different 'steps'. I have no clue if we could potentially virtually concatenate these?

This seems like a cool use case! Maybe we open up an issue in virtualizarr. It seems possible to merge the virtual zarrs.
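
One thing to watch before concatenating: the listing above is in lexicographic order (step_1, step_10, step_11, step_2, ...), so the stores would need a numeric sort first. A small sketch, assuming the depth and step number can be parsed from the store name exactly as it appears in the listing; the helper name depth_and_step is hypothetical:

```python
import re

# Matches names like "global_hycom_15m_step_10.zarr" from the bucket listing
STORE_RE = re.compile(r"global_hycom_(\d+)m_step_(\d+)\.zarr$")


def depth_and_step(path):
    """Return (depth_in_m, step) parsed from a store path (hypothetical helper)."""
    match = STORE_RE.search(path)
    if match is None:
        raise ValueError(f"unexpected store name: {path}")
    return int(match.group(1)), int(match.group(2))


# A few entries copied from the listing, in the order the listing returns them
listing = [
    "hycom-global-drifters/lagrangian/global_hycom_0m_step_1.zarr",
    "hycom-global-drifters/lagrangian/global_hycom_0m_step_10.zarr",
    "hycom-global-drifters/lagrangian/global_hycom_0m_step_11.zarr",
    "hycom-global-drifters/lagrangian/global_hycom_0m_step_2.zarr",
    "hycom-global-drifters/lagrangian/global_hycom_15m_step_1.zarr",
]

# Sort by depth first, then by numeric step
ordered = sorted(listing, key=depth_and_step)
steps = [depth_and_step(p) for p in ordered]
```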

@norlandrhagen
Contributor Author

Bit of an update here. I'm working towards a VirtualiZarr Zarr reader (zarr-developers/VirtualiZarr#262). This should allow us to combine a bunch of the existing Zarr stores here into a single virtual Zarr store.

@TomNicholas
Member

Do these steps have alignable dimensions though? If not then you're in DataTree territory...
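
A rough way to check this before reaching for DataTree is to compare the dimension sizes of the stores, ignoring the dimension you plan to concatenate along. This is only a sketch: xarray's real alignment also compares coordinate values, not just sizes, and the function name and the example sizes below are illustrative rather than read from the bucket.

```python
def alignable_except(sizes_per_store, concat_dim):
    """True if all stores agree on every dimension size except concat_dim.

    sizes_per_store: list of dicts such as dict(ds.sizes), one per store.
    """

    def other_dims(sizes):
        # Drop the dimension we intend to concatenate along
        return {name: n for name, n in sizes.items() if name != concat_dim}

    reference = other_dims(sizes_per_store[0])
    return all(other_dims(sizes) == reference for sizes in sizes_per_store[1:])


# Illustrative sizes: same grid, different number of time steps
regular = {"time": 720, "Depth": 2, "Y": 7055, "X": 9000}
last = {"time": 839, "Depth": 2, "Y": 7055, "X": 9000}
# Illustrative sizes for a store on a different grid
other_grid = {"time": 720, "Depth": 1, "Y": 10, "X": 10}

same_grid = alignable_except([regular, regular, last], concat_dim="time")
different_grid = alignable_except([regular, other_grid], concat_dim="time")
```

If the check fails, the stores cannot share one set of dimensions and a tree of separate datasets is the more natural container.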

@norlandrhagen
Contributor Author

> Do these steps have alignable dimensions though? If not then you're in DataTree territory...

Totally, which would be very cool to have in virtualizarr as well!

On some initial digging through the Eulerian stores, it seems like the step section of the path corresponds to time. In the README:

The data correspond to field variables at 8759 hourly time steps from 2014-01-01T01:00:00 to 2014-12-31T23:00:00. These data are split in 12 zarr stores containing 720 hourly steps or 60 days of data, for stores number 1 to 11, and a 12th store with 839 steps to complete the year. The data are further split between velocity data and sea surface height data for a total of 24 stores. The date range and steps of each of the two sets of 12 stores are:

import s3fs
import xarray as xr

# Anonymous access; chunks={} opens lazily with the on-disk chunking
fs = s3fs.S3FileSystem(anon=True)
ds1 = xr.open_zarr(fs.get_mapper("s3://hycom-global-drifters/eulerian/hycom12-1-rechunked-corr.zarr"), chunks={})
ds2 = xr.open_zarr(fs.get_mapper("s3://hycom-global-drifters/eulerian/hycom12-2-rechunked-corr.zarr"), chunks={})

# Concatenate the first two stores along the shared time dimension
ds_concat = xr.concat([ds1,ds2], dim="time")
<xarray.Dataset> Size: 1TB
Dimensions:    (time: 1440, Depth: 2, Y: 7055, X: 9000)
Coordinates:
  * Depth      (Depth) float32 8B 0.0 15.0
    Latitude   (Y, X) float32 254MB -86.0 -86.0 -86.0 -86.0 ... 47.1 47.07 47.04
    Longitude  (Y, X) float32 254MB 74.16 74.19 74.22 ... 74.14 74.14 74.14
  * X          (X) int32 36kB 1 2 3 4 5 6 7 ... 8995 8996 8997 8998 8999 9000
  * Y          (Y) int32 28kB 1 2 3 4 5 6 7 ... 7050 7051 7052 7053 7054 7055
  * time       (time) datetime64[ns] 12kB 2014-01-01T01:00:00 ... 2014-03-02
Data variables:
    u          (time, Depth, Y, X) float32 731GB dask.array<chunksize=(720, 1, 1, 9000), meta=np.ndarray>
    v          (time, Depth, Y, X) float32 731GB dask.array<chunksize=(720, 1, 1, 9000), meta=np.ndarray>
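
The hourly step counts in the README quote and the concatenated repr above can be cross-checked with a few lines of arithmetic. The sketch assumes the stores are contiguous in time with no gaps or overlaps (an assumption, not something confirmed here); store_ranges is a hypothetical helper.

```python
from datetime import datetime, timedelta

START = datetime(2014, 1, 1, 1)   # first time stamp quoted in the README
STEPS = [720] * 11 + [839]        # hourly steps per store, per the README


def store_ranges(start=START, steps=STEPS):
    """Return (first, last) time stamps per store, assuming contiguous hourly data."""
    ranges = []
    first = start
    for n in steps:
        last = first + timedelta(hours=n - 1)
        ranges.append((first, last))
        first = last + timedelta(hours=1)
    return ranges


ranges = store_ranges()
total_steps = sum(STEPS)

# Size of one variable in the two-store concat:
# time * Depth * Y * X * 4 bytes (float32)
bytes_per_var = 1440 * 2 * 7055 * 9000 * 4
```

Under these assumptions each 720-step store covers 30 days, and the two-store concatenation above covers 60 days, ending at 2014-03-02T00:00 as in the repr. The last store ends at 2014-12-31T23:00, and one variable of the two-store concat comes to about 731 GB, matching the sizes shown.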

@TomNicholas
Member

I'm sorry what? There are 24 separate zarr stores, just to hold different timesteps and different variables??? 🤦‍♂️

@TomNicholas
Member

I mean at least you can use your new zarr virtual reader to combine them all into one sane icechunk store @norlandrhagen 😆

@jbusecke
Contributor

> I'm sorry what? There are 24 separate zarr stores, just to hold different timesteps and different variables??? 🤦‍♂️

Let's not be too judgy, I bet this was one hell of a lift to produce and get into zarr (yeah I realize me advocating for not being judgy is rich 🤣 - overworked data manager hat off). But I agree that it is very nice that we can now combine this and make it more usable!

Is each step a different release date for a bunch of floats (in which case we probably have to concatenate along a new dimension), or are these literally just split along a common time dimension?
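
One way to tell the two cases apart, sketched under the assumption that the first and last time stamp of each store are available (for example from ds.time[0] and ds.time[-1]); the function name and the example dates are illustrative only:

```python
from datetime import datetime


def split_along_common_time(ranges):
    """True if every store starts strictly after the previous store ends.

    ranges: list of (first_time, last_time) tuples, one per store, in store order.
    """
    return all(
        prev_last < next_first
        for (_, prev_last), (next_first, _) in zip(ranges, ranges[1:])
    )


# Consecutive windows: pieces of a single time axis
consecutive = [
    (datetime(2014, 1, 1, 1), datetime(2014, 1, 31, 0)),
    (datetime(2014, 1, 31, 1), datetime(2014, 3, 2, 0)),
]
# Overlapping windows: e.g. separate releases that each run for a while
overlapping = [
    (datetime(2014, 1, 1), datetime(2014, 3, 1)),
    (datetime(2014, 1, 31), datetime(2014, 3, 31)),
]

common_axis = split_along_common_time(consecutive)
needs_new_dim = not split_along_common_time(overlapping)
```

In the first case xr.concat(datasets, dim="time") is the natural choice; in the second, passing a new dimension name to xr.concat (say dim="release", a made-up name) stacks the stores instead.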

@jbusecke
Contributor

Just requested support for xr.open_datatree in the LEAP catalog here.

@TomNicholas
Member

> Let's not be too judgy, I bet this was one hell of a lift to produce and get into zarr

Yes you're right - it's easy for me to say after being steeped in Zarr for the last year! 😅 Also yes even getting something into a slightly wonky Zarr store can be a huge amount of work.

I'm just a bit concerned by the anti-pattern/misunderstanding this implies, of treating each Zarr store like a single chunked netCDF4 file, instead of treating one Zarr store as representing thousands of related netCDF files.

> Is each step a different release date for a bunch of floats (in which case we probably have to concatenate along a new dimension)

Does this dataset contain individual drifter timeseries? Looks like it's post-processed onto a regular grid?

@norlandrhagen
Contributor Author

> > I'm sorry what? There are 24 separate zarr stores, just to hold different timesteps and different variables??? 🤦‍♂️

> Let's not be too judgy, I bet this was one hell of a lift to produce and get into zarr (yeah I realize me advocating for not being judgy is rich 🤣 - overworked data manager hat off). But I agree that it is very nice that we can now combine this and make it more usable!

> Is each step a different release date for a bunch of floats (in which case we probably have to concatenate along a new dimension), or are these literally just split along a common time dimension?

Maybe once/if my Zarr V2 VirtualiZarr reader is in, we can share a reference that combines all the Zarr stores back with the data provider.

@jbusecke
Contributor

> I'm just a bit concerned by the anti-pattern/misunderstanding this implies, of treating each Zarr store like a single chunked netCDF4 file, instead of treating one Zarr store as representing thousands of related netCDF files.

100% aligned here! I think this might need some deeper understanding of the data.

I think the original data is actual float timeseries (and the Eulerian stuff is an aggregation of stats!)

@norlandrhagen
Contributor Author

FWIW, the speaker / data producer Shane Elipot seemed very eager for people to use this dataset. I think he would be happy to chat about any design choices etc.
