
Create an xarray backend reader for h5coro #21

Merged
merged 29 commits into from
Dec 18, 2023

Conversation

rwegener2
Collaborator

@rwegener2 rwegener2 commented Dec 13, 2023

Goal

The goal of this PR is to create a basic xarray backend reader for h5coro on ATL03 data. The reader opens all /BEAM/heights groups, as well as some additional groups. This allows a user to read an HDF5 file with h5coro and get the data directly as an xarray Dataset object. Tasks out of scope of this PR are:

  • testing and optimizing the parallelization
  • checking to see if this code runs on a variety of 1D HDF5 datasets (thus far testing has been confined to an ATL03 file)

This PR closes #3.

How it was done

A new backends folder was added with an xarray_h5coro.py file. The file formatting follows the xarray docs. A class with an open_dataset method was defined. The method uses h5coro to read the data bytes and convert them into a series of Python dictionaries that xarray can use to create a Dataset.

The goal was to mirror the way xarray already reads HDF5 data: one group at a time. When a group is read (using open_dataset with the group argument), xarray loads all the data variables in that group, the attributes of those variables, and the attributes of the user-specified group. Because the current h5coro library only reads 1D arrays, the backend also drops any data variable with more than one dimension (for ATL data that seems to be signal_conf_ph).
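For reference, the wiring described above can be sketched as a minimal BackendEntrypoint subclass. This is an illustrative sketch, not the PR's exact code: the class name and the stubbed dictionaries are invented, and the real reader fills those dictionaries from h5coro's byte reads of the requested group.

```python
import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint


class H5CoroBackendEntrypoint(BackendEntrypoint):
    """Minimal sketch of an xarray backend, following the xarray docs."""

    def open_dataset(self, filename_or_obj, *, drop_variables=None,
                     group=None, **kwargs):
        # In the real reader these dictionaries are built from h5coro reads
        # of the requested group; they are stubbed here for illustration.
        data_vars = {"h_ph": ("delta_time", np.zeros(3), {"units": "m"})}
        coords = {"delta_time": np.arange(3)}
        return xr.Dataset(data_vars=data_vars, coords=coords)

    def guess_can_open(self, filename_or_obj):
        # Naive heuristic for the sketch: claim HDF5 files by extension.
        return str(filename_or_obj).endswith(".h5")
```

Registering a class like this under the h5coro engine name (via the xarray backend entry-point mechanism in the package metadata) is what lets `xr.open_dataset(..., engine='h5coro')` find it.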

[UPDATE] A minor note: xarray does not accept more dimension names than the number of dimensions in a variable. So while most ATL03 products list delta_time, lat_ph, and lon_ph as coordinates, variables are listed with only delta_time as a coordinate in the xarray Dataset. This again mirrors the default behavior of xarray HDF reads.

How it can be tested

  1. Install xarray in your h5coro dev environment
  2. Run the code below

Code

import xarray as xr
import earthaccess

s3url_atl03 = 'nsidc-cumulus-prod-protected/ATLAS/ATL03/006/2019/11/30/' \
                'ATL03_20191130112041_09860505_006_01.h5'
auth = earthaccess.login()

ds = xr.open_dataset(s3url_atl03, engine='h5coro', group='/gt1l/heights', creds=auth)
ds

Expected result:
(screenshot from 2023-12-13 showing the resulting xarray Dataset)

Tasks remaining

In my opinion should be complete before merging

  • parse delta_time from integer to datetime - some suggestions for how to do this are in the comments of h5coro as an xarray backend #3. @jpswinski would you be willing to work on this one?
  • Groups other than /BEAM/heights/ don't work right now. Test functionality again once listGroup returns nested dictionary #19 is addressed, since I think the errors may have to do with hardcoding of attribute paths (I'm willing to do this once the issue is addressed). Follow-up conversations will be needed.

Nice-to-haves before merging:

  • read local files - it looks like h5coro.H5Coro requires a credentials argument, and it can't be None. @jpswinski do you think it is possible to make credentials optional so a user reading local files doesn't have to provide s3 credentials? (I'm willing to look at this again after the credentials update)
  • The INFO log level doesn't seem to stop the warnings. @jpswinski any thoughts?
  • investigate: why are delta time attributes getting dropped?
  • add setup instructions (project.toml?) to readme

Depends on: #19 (nice-to-have improvements would come from #20)

@rwegener2
Collaborator Author

rwegener2 commented Dec 14, 2023

Thanks for the fixes on #19 and #20 @jpswinski! They were super helpful!

Some comments from the end of my day:

delta_time

The good news is that the code is running more smoothly with the updates. The mystery about where the delta_time attributes were going was resolved somewhere in #19/#20 (marked as complete above). The bad news is that now xarray doesn't like delta_time's coordinate information. The gist of it, as I understand it, is that delta_time lists lat_ph and lon_ph as coordinates in the HDF file. Because lat_ph and lon_ph are not dimension coordinates (which are different from plain dimensions; see here), neither one can be used as a coordinate dimension for delta_time. delta_time is the only dimension that can do that.

I pushed a hardcoded fix in the most recent commit because I'm not sure how to generalize a solution. It's clear to me how to fix the code for this case but I'm not sure how to write it so that the code would find this sort of conflict in any dataset.
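To illustrate the distinction with toy data (shapes and values invented, not real ATL03 contents): lat_ph and lon_ph can ride along as non-dimension coordinates over the delta_time dimension, but only delta_time itself can serve as the dimension coordinate that indexes the axis.

```python
import numpy as np
import xarray as xr

n = 5
ds = xr.Dataset(
    data_vars={"h_ph": ("delta_time", np.random.rand(n))},
    coords={
        "delta_time": np.arange(n),                   # dimension coordinate (indexes the axis)
        "lat_ph": ("delta_time", np.random.rand(n)),  # non-dimension coordinate
        "lon_ph": ("delta_time", np.random.rand(n)),  # non-dimension coordinate
    },
)
# lat_ph and lon_ph are coordinates but not dimensions, so they
# cannot be used as a dimension of h_ph or delta_time:
assert ds.h_ph.dims == ("delta_time",)
assert set(ds.coords) == {"delta_time", "lat_ph", "lon_ph"}
```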

warnings

I'm not sure if INFO logging is now working, or if all the pink from earlier was fixed by one of the last PRs, but there are far fewer warnings now!

extraneous oddities found while running the code on more groups

I started trying the code out on some new groups (not just gt1l/heights). It was a smidge chaos 😨 but not too bad 😁. There is probably lots of cleaning we could do in here. For example:

  • I think that groups can be identified by checking for a lack of __metadata__. It seems that groups, unlike data variables, lack a __metadata__ key in their attributes dictionaries. It makes me wonder if groups are distinct from data variables in the HDF data model, or if they are considered the same thing. If they are different, perhaps a future nice-to-have would be an .is_group attribute, or something similar, for a more readable way to check.
  • ds_surf_type doesn't return any data, but is 1D
promise = h5obj.readDatasets(['/ds_surf_type'], block=True, enableAttributes=False)
view = H5View(promise)
print(view['ds_surf_type']) # doesn't return any data
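A sketch of the heuristic in the first bullet above; the helper name is invented, not part of h5coro's API, and the attribute dictionaries are toy stand-ins for the observed structure.

```python
def is_group(attrs: dict) -> bool:
    # Observation from the bullet above: data variables carry a
    # '__metadata__' key in their attributes dictionary, groups do not.
    return "__metadata__" not in attrs


# Toy attribute dictionaries mimicking the observed structure:
print(is_group({"Description": "a group"}))               # True
print(is_group({"__metadata__": {}, "units": "meters"}))  # False
```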

I stopped for the day when I hit another odd one on start_gpsweek of the ancillary_data group.

These all seem like nice things to fix eventually, but they don't seem the most pressing. I think the most important takeaway is that this PR merge will work specifically for the BEAM/heights groups, and perhaps future PRs can expand that to other groups.

Summary

My next two days are busier than today, so I'll have less time to dig into problems. The most pressing thing in my opinion is the delta_time piece. @jpswinski let me know what you think a good next step is. If you want to try to push for a less hardcoded fix this week, let me know and we can find a time for a video call again so I can explain what I've learned. If you don't think it's vital to get in this week, that's of course fine with me.

@rwegener2
Collaborator Author

rwegener2 commented Dec 14, 2023

So I know I said I was stepping back for the day, but I had an idea this morning about the delta_time piece and it seems to work! I wasn't able to fully test it because just a few minutes ago Earthdata services seem to have stopped working (🤔😳). I'll check back in a bit and try this out if things look to be back online. [update: Everything works smoothly with the latest commit.]

In the most recent commit I added a loop that goes through and makes sure that all of the coordinates listed are also dimensions. If one isn't, the variable is used as a coordinate instead. This very well may fail in other circumstances, but it should work well for at least ATL03, and we can confront other types of HDF files as they arise. What we do instead would depend on why other circumstances are failing.
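One plausible reading of that loop, sketched with invented names (not the PR's exact code): compare each variable's listed coordinates against its dimensions and keep only the coordinates that can legally index the variable.

```python
def reconcile_coords(listed_coords, dims):
    """Drop listed coordinates that are not also dimensions of the variable,
    since xarray only accepts dimension names the variable actually has.
    Illustrative sketch of the reconciliation described above."""
    return [c for c in listed_coords if c in dims]


# For an ATL03 heights variable with a single delta_time dimension:
print(reconcile_coords(["delta_time", "lat_ph", "lon_ph"], ("delta_time",)))
# ['delta_time']
```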

Ok that's all (for real this time). Happy coding!

@jpswinski
Member

@rwegener2 - I just added code that converts the delta_times to numpy datetimes, and I also updated the example notebook so that it shows the delta_time column as a datetime.

I am not sure exactly how we will want it in the end because I was not able to find anything in the data itself that would allow me to perform this translation automatically and generically. I used the ICESat-2, GEDI, and GOES data for comparisons. Between them they have different names and attributes and epochs for their time variable.

As a result, in order to support this conversion, I created the ability to pass in a "column conversion" dictionary. The dictionary has the column name for the key, and a function that accepts a numpy array for the value. So when the xarray Dataset is being built, if a column has an entry in the conversion dictionary, then the corresponding conversion function is called on it.

I then created a datasets/icesat2.py module in h5coro that has a function that converts delta times to datetimes.
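A sketch of how such a conversion dictionary might look, assuming the ATLAS standard-data-product epoch of 2018-01-01 and ignoring GPS-to-UTC leap-second handling; the names and signatures here are illustrative, not h5coro's exact API.

```python
import numpy as np

# Assumed epoch for the sketch: the ATLAS SDP epoch (2018-01-01T00:00:00).
ATLAS_SDP_EPOCH = np.datetime64("2018-01-01T00:00:00")


def delta_time_to_datetime(delta_time: np.ndarray) -> np.ndarray:
    """Convert seconds-since-epoch floats to numpy datetime64[ns]."""
    return ATLAS_SDP_EPOCH + (delta_time * 1e9).astype("timedelta64[ns]")


# Column name -> function applied to that column while the Dataset is built.
column_conversions = {"delta_time": delta_time_to_datetime}


def apply_conversions(columns: dict, conversions: dict) -> dict:
    """Apply each registered conversion to its column; pass others through."""
    return {name: conversions[name](arr) if name in conversions else arr
            for name, arr in columns.items()}
```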

Stepping back, I think this functionality is needed, but I am not sure where it should live. I wonder if the datasets/icesat2.py functionality shouldn't be in a package like icepyx. h5coro could support supplying conversion functions in a generic way, and then higher level packages that are geared more towards a specific mission or dataset could provide the mission specific conversions.

Thoughts?

I am happy to keep it in h5coro for now as we figure out how best to organize it all.

@rwegener2
Collaborator Author

rwegener2 commented Dec 15, 2023

Thanks for working on it! That does all sound like a pickle.

After a bit of reading, it seems the xarray default is to assume that the datetime is encoded using CF conventions. Looking through the source code, it seems like this function is where that happens. When I thought about this in August it definitely made me wonder what it would take to use some of xarray's existing machinery to read these files. It seems relevant here, and also in the case of trying to find a generalized solution for the delta_time dimension coordinate.
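As a toy illustration of that existing machinery (the units string below is invented, not how ATL03 files actually encode delta_time): when a variable carries a CF-style units attribute, xr.decode_cf performs the datetime conversion with no mission-specific code.

```python
import numpy as np
import xarray as xr

# Raw seconds-since-epoch values with a CF-convention units attribute:
raw = xr.Dataset(
    coords={"delta_time": ("delta_time", np.array([0.0, 60.0, 120.0]),
                           {"units": "seconds since 2018-01-01 00:00:00"})}
)
decoded = xr.decode_cf(raw)
print(decoded["delta_time"].dtype)  # a numpy datetime64 dtype
```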

Back to what we should do right now, though: it sounds to me like you've found a good solution for the moment. I think it's fine to merge this and come back to it, because it fits the goal of the PR, which to me is "allow h5coro to read IS2 data into xarray". It looks like continuing to generalize the reader will be an ongoing process.

[Follow up comment: The pangeo discourse is often incredibly helpful and full of very xarray-knowledgeable people. At some point we could post in there looking for advice for how to utilize some existing xarray parsing functionality in our reader.]

@rwegener2 rwegener2 marked this pull request as ready for review December 15, 2023 15:00
@rwegener2 rwegener2 requested a review from jpswinski December 15, 2023 15:03
@rwegener2
Collaborator Author

I just marked this as ready for review. If you see anything that needs to be improved @jpswinski feel free to request changes. If not I think we could merge!

@jpswinski
Member

extraneous oddities found while running the code on more groups

I started trying the code out on some new groups (not just gt1l/heights). It was a smidge chaos 😨 but not too bad 😁. There is probably lots of cleaning we could do in here. For example: ....

I stopped for the day when I hit another odd one on start_gpsweek of the ancillary_data group.

I've been diving into this and there were some gotchas in my code for handling groups/attributes/variables. As you mentioned previously, I wasn't handling groups as their own thing, I was treating them as variables. Unfortunately, this didn't always work. So I've cleaned this up in one of my branches and will be merging it into this branch soon. Hopefully that makes things a lot more robust. For instance, I am now able to list the root directory (/) as well as the ancillary_data directory with and without attributes. Before I wasn't able to do that as both of those paths have a mix of variables and groups.

@rwegener2
Collaborator Author

I've been diving into this and there were some gotchas in my code for handling groups/attributes/variables.

Nice! @jpswinski if you comment when you have that merged I can take another look at loading groups

@jpswinski jpswinski merged commit 81612b3 into main Dec 18, 2023
@rwegener2 rwegener2 deleted the backends branch December 18, 2023 14:28