NetCDF and N-Dimensional Variable Support #26

Merged: 28 commits into main on Feb 12, 2024

Conversation

@jpswinski (Member) commented on Feb 6, 2024

Summary

This PR provides a number of improvements to h5coro:

  • support for n-dimensional variables and hyperslicing
  • added parameters to the xarray backend for specifying which variables to read and which coordinates to assign (see the sketch below)
  • support for v2 and v3 attribute messages (enables h5coro to work with some netcdf files)
  • added experimental support for multiprocessing
  • fixed symbol table processing (see issue #27: Cannot read dataset: 'H5Coro' object has no attribute 'offsetsize')

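As a quick illustration of the new backend parameters, here is a minimal sketch. The engine name, group path, and especially the keyword names variables= and coordinates= are placeholders standing in for the parameters added in this PR, not a confirmed API.

```python
import xarray as xr

# Hedged sketch: open one group of an ICESat-2 granule through the h5coro
# xarray backend, reading only a subset of variables and assigning coordinates.
# The keyword names below are illustrative placeholders, not the exact API.
ds = xr.open_dataset(
    "ATL03_example.h5",                  # hypothetical local granule
    engine="h5coro",                     # backend entry point (assumed name)
    group="/gt1l/heights",               # group to read
    variables=["h_ph", "delta_time"],    # variables to read (placeholder keyword)
    coordinates=["delta_time"],          # coordinates to assign (placeholder keyword)
)
```
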
Multi-Processing Mode

Experimental support has been added for running h5coro in multi-process mode, as opposed to the default multi-threaded mode. When multiprocessing is enabled (by passing multiProcess=True to the constructor of the h5coro object), the chunk data for each variable is decoded in a separate process instead of a separate thread. On systems with multiple cores this makes a significant difference in performance: on an 8-core system, opening an xarray on the ICESat-2 heights group took ~200 seconds in multi-threaded mode and 30 seconds in multi-process mode.
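
For reference, a minimal usage sketch. Only the multiProcess=True flag comes from this PR; the class, driver, and readDatasets call are assumptions based on typical h5coro usage.

```python
from h5coro import h5coro, s3driver

# Construct the reader with the experimental multi-process mode enabled.
# The resource path and driver are placeholders; multiProcess=True is the
# new constructor flag described in this PR.
h5obj = h5coro.H5Coro(
    "my-bucket/ATL03_example.h5",   # hypothetical S3 object
    s3driver.S3Driver,              # driver class (assumed name)
    multiProcess=True,              # decode each variable's chunks in its own process
)

# Reading several variables at once is where multiple processes pay off;
# readDatasets and its arguments are assumed from typical h5coro usage.
promise = h5obj.readDatasets(
    ["/gt1l/heights/h_ph", "/gt1l/heights/delta_time"],
    block=True,
)
```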

There are a few caveats:

  • When opening an xarray with h5coro as the backend and multiprocessing enabled, we get shared memory leaks. We don't see these leaks when h5coro is used directly, so further investigation is needed to determine why they happen with xarray and not with direct use.
  • Using multiple processes adds a small amount of overhead (e.g. forking, inter-process communication), so if only a single variable is being opened, or if the system has only a single core, multiprocessing will slightly decrease performance.
  • More testing is needed overall, as this is a very new feature.

Column Conversions

The xarray backend supports running every value in a column through a conversion function, but profiling showed that this step took an extremely long time. It is used to convert the ICESat-2 delta times into datetimes; in this PR it has been defaulted to off.
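
For context, the conversion in question can be expressed as a single vectorized operation rather than a per-value call. The epoch constant and helper name below are illustrative (ICESat-2 delta times are seconds relative to the ATLAS SDP epoch, nominally 2018-01-01), not the backend's actual code.

```python
import numpy as np

# Illustrative, vectorized delta_time -> datetime64 conversion. The epoch here
# is the nominal ATLAS SDP epoch and ignores leap seconds; the real conversion
# used by the backend may differ.
ATLAS_SDP_EPOCH = np.datetime64("2018-01-01T00:00:00")

def delta_time_to_datetime(delta_time):
    # delta_time: array-like of float seconds since the ATLAS SDP epoch
    nanoseconds = (np.asarray(delta_time, dtype="float64") * 1e9).astype("timedelta64[ns]")
    return ATLAS_SDP_EPOCH + nanoseconds
```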

@jpswinski requested a review from rwegener2 on February 6, 2024 at 22:40
@rwegener2 (Collaborator) commented:

This is awesome @jpswinski! I played around a bit with reading 2D files, trying out an ATL23 product, an ATL20 product, and a netcdf of ocean temperature. Both ATL products read well. In both of those products the coordinate variables didn't read in, but that was because the coordinates were listed in a group above the group being read (the coordinates had the paths ../latitude and ../longitude). This can be seen in the plot below: instead of 'latitude' and 'longitude' as coordinates we get the xarray defaults (depth_dfw_0 and depth_dfw_1):
[screenshot of the plot described above]
I don't think it's an issue for this PR, but I still wanted to note it.

With the netcdf file I did get an error FatalError: failed to read expected bytes for dataspace message: 8 != 4, which I assume is related to having an attribute of some type that isn't currently supported.

This is an absolutely amazing update!! 🎉

@jpswinski merged commit 6a92cf1 into main on Feb 12, 2024
@jpswinski deleted the netcdf-2 branch on February 12, 2024 at 15:25