Write script to convert .nat EUMETSAT files to Zarr intermediate #14

Closed
6 of 9 tasks
JackKelly opened this issue Oct 5, 2021 · 3 comments · Fixed by #21 or #23
Assignees
Labels
enhancement New feature or request

Comments

@JackKelly
Member

JackKelly commented Oct 5, 2021

Features of this script

  • Takes command-line arguments for directory of .nat files; and target directory for the Zarr.
  • The script should be able to append newly downloaded data to an existing Zarr store, so we can incrementally grow the Zarr store whenever we download new .nat data. When the script starts, it checks through all the .nat files (recursively) and through the existing Zarr, and converts only data which is present in the .nat files but absent from the Zarr. I think you can append to Zarr stores using something like xr.Dataset.to_zarr(mode='a', append_dim='time'). Definitely have a look at the xarray docs on appending to Zarr. It's possible that appending to Zarr only works correctly if data is appended in order, but I'm not certain! (Zarr's fragility when it comes to appending data might be one strong argument for swapping to GeoTIFF or individual NetCDF files per EUMETSAT timestep, instead of Zarr... But let's try to get Zarr to work because it does seem to enable the fastest reads.)
  • Save to Zarr as int16, using only 10 bits per pixel per channel, i.e., re-scale each channel to [0, 1023] and save in np.int16 dtype. This results in really good compression (better than using float16), and is probably more precise (see the raw benchmark results here). I benchmarked a bunch of compression algorithms; compressor = numcodecs.Blosc(cname="zstd", clevel=5) was the best setting I found. If we want to be really ambitious we could try compressing with a lossless, modern image compression algorithm like AVIF or WebP. Some more notes about these options are in Benchmark candidate intermediate file formats for EUMETSAT data #13. But, for now, zstd is probably fine.
  • Save EUMETSAT metadata into the Zarr stores? (Maybe this isn't very important given that we currently have no plans to use the EUMETSAT metadata!)
  • Discard any images with NaNs.
  • Optionally: Spatially reproject data in data conversion script #15
  • Optionally save only a geographical subset of the data (perhaps with some handy human-readable shortcuts like "UK").
  • Use all the CPU cores.
  • Each Zarr chunk should probably be at least 500 kBytes on disk. Any smaller and it becomes really inefficient to load small files! We probably want 1 chunk per timestep (so we can efficiently read any combination of timesteps). Or maybe 1 chunk for a small number of timesteps (4?). One chunk could hold all satellite channels, given that we usually use all satellite channels.
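
A rough sketch of three of the bullets above: skipping timesteps that are already in the Zarr, discarding images with NaNs, and packing each channel into 10 bits stored as np.int16. The function names and the per-channel min/max inputs are assumptions made up for illustration, not anything that exists in satip. Reading the .nat files and writing the Zarr are deliberately left out.

```python
import numpy as np

N_BITS = 10
MAX_VALUE = 2**N_BITS - 1  # 1023, the top of the [0, 1023] range


def select_new_timesteps(nat_times, zarr_times):
    """Timesteps present in the .nat files but absent from the Zarr.

    Returned oldest first, in case appending only works in order.
    (Hypothetical helper: how the timesteps are read is not shown.)
    """
    return sorted(set(nat_times) - set(zarr_times))


def has_nans(image):
    """True if any pixel in any channel is NaN, so the image can be discarded."""
    return bool(np.isnan(image).any())


def rescale_to_10_bits(image, channel_min, channel_max):
    """Rescale a float image of shape (channel, y, x) to [0, 1023] as int16.

    channel_min and channel_max are per-channel bounds chosen by the
    caller (an assumption: the issue does not say how to pick them).
    Values outside the bounds are clipped.
    """
    lo = np.asarray(channel_min, dtype=np.float64)[:, None, None]
    hi = np.asarray(channel_max, dtype=np.float64)[:, None, None]
    scaled = np.clip((image - lo) / (hi - lo), 0.0, 1.0) * MAX_VALUE
    return np.round(scaled).astype(np.int16)


# Toy example: one channel, 2 x 2 pixels.
image = np.array([[[0.0, 50.0], [100.0, 25.0]]])
if not has_nans(image):
    packed = rescale_to_10_bits(image, channel_min=[0.0], channel_max=[100.0])
    print(packed.dtype, packed.min(), packed.max())
```

The int16 array returned here is what would be handed to the Zarr writer together with the zstd compressor mentioned above.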

Related:

@JackKelly JackKelly added the enhancement New feature or request label Oct 5, 2021
@JackKelly
Member Author

@jacobbieker and @peterdudfield what do you guys think about keeping the script to convert EUMETSAT native files to an intermediate file format in satip (instead of in nowcasting_dataset)? I don't have any strong feelings. A few advantages of keeping this script in satip:

  • nowcasting_dataset doesn't have to be dependent on satip or satpy
  • Other users might want to convert .nat files to an easier-to-use intermediate format. These users might have no interest in nowcasting_dataset.

@jacobbieker
Member

Yeah, I agree, keeping as much of the satellite-specific stuff in Satip makes sense. I generally like keeping the packages as separate as possible.

@flowirtz flowirtz moved this to Todo in Nowcasting Oct 15, 2021
@jacobbieker jacobbieker self-assigned this Oct 21, 2021
@JackKelly JackKelly changed the title from "Write script to convert .nat EUMETSAT files to intermediate file format" to "Write script to convert .nat EUMETSAT files to Zarr intermediate" Oct 22, 2021
@JackKelly
Member Author

JackKelly commented Oct 25, 2021

On the topic of compression, it might be worth checking out zfp. I haven't tried it! Definitely not that important though! zstd is probably fine!
