Create an xarray backend reader for h5coro #21
Conversation
Thanks for the fixes on #19 and #20 @jpswinski! They were super helpful! Some comments from the end of my day:

delta_time

The good news is that the code is running more smoothly with the updates. The mystery about where the delta_time attributes were going was resolved somewhere in #19/#20 (marked as complete above). The bad news is that now xarray doesn't like delta_time's coordinate information. The gist of it, as I understand it, is that delta_time lists lat_ph and lon_ph as coordinates in the HDF file. Because lat_ph and lon_ph are not dimension coordinates (different from a dimension; see here), neither one can be used as a dimension coordinate for delta_time; delta_time itself is the only one that can. I pushed a hardcoded fix in the most recent commit because I'm not sure how to generalize a solution. It's clear to me how to fix the code for this case, but I'm not sure how to write it so that the code would find this sort of conflict in any dataset.

warnings

I'm not sure if extraneous …

oddities found while running the code on more groups

I started trying the code out on some new groups (not just …). I stopped for the day when I hit another odd one on …. These all seem like nice things to fix eventually, but they don't seem the most pressing. I think the most important takeaway is knowing what this PR merge will specifically work for.

Summary

My next two days are busier than today, so I'll have less time to dig into problems. The most pressing thing, in my opinion, is the delta_time piece. @jpswinski let me know what you think a good next step is. If you want to try to push for a less hardcoded fix this week, let me know and we can find a time for a video call again so I can explain what I've learned. If you don't think it's vital to get it in this week, that's of course fine with me.
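For anyone following along, here is a minimal illustration of the dimension-coordinate distinction described above (ATL03 variable names, made-up data): in xarray, only a coordinate whose name matches a dimension is a dimension coordinate, so lat_ph and lon_ph can only ride along the delta_time dimension as non-dimension coordinates.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={"h_ph": ("delta_time", np.random.rand(5))},
    coords={
        "delta_time": np.arange(5.0),                 # dimension coordinate
        "lat_ph": ("delta_time", np.random.rand(5)),  # non-dimension coordinate
        "lon_ph": ("delta_time", np.random.rand(5)),  # non-dimension coordinate
    },
)
# lat_ph cannot serve as a dimension coordinate here, because its name
# does not match any dimension of the variables it describes.
```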
So I know I said I was stepping back for the day, but I had an idea this morning about the delta_time piece and it …

In the most recent commit I added a loop that goes through and makes sure that all of the coordinates listed are also dimensions; if a listed coordinate isn't also a dimension, it isn't used as a coordinate (a sketch of the idea is below). This very well may fail in other circumstances, but it should work well for at least ATL03, and we can confront other types of HDF files as they arise. What we do instead would depend on why other circumstances are failing.

Ok, that's all (for real this time). Happy coding!
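A rough sketch of that check (names and structure are illustrative, not the actual commit):

```python
def filter_coordinates(listed_coords, dims):
    """Keep only the listed coordinates that are also dimensions."""
    return [name for name in listed_coords if name in dims]

# For an ATL03 photon-rate variable:
filter_coordinates(["delta_time", "lat_ph", "lon_ph"], dims=("delta_time",))
# -> ["delta_time"]; lat_ph and lon_ph are left as plain data variables
```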
@rwegener2 - I just added code that converts the delta_times to numpy datetimes, and I also updated the example notebook so that it shows the delta_time column as a datetime.

I am not sure exactly how we will want it in the end, because I was not able to find anything in the data itself that would allow me to perform this translation automatically and generically. I used the ICESat-2, GEDI, and GOES data for comparisons. Between them they have different names and attributes and epochs for their time variable. As a result, in order to support this conversion, I created the ability to pass in a "column conversion" dictionary. The dictionary has the column name for the key, and a function that accepts a numpy array for the value. So when the xarray is being built, if there is a column that has an entry in the conversion dictionary, then the corresponding conversion function is called on it. I then created a datasets/icesat2.py module in h5coro that has a function that converts delta times to datetimes.

Stepping back, I think this functionality is needed, but I am not sure where it should live. I wonder if the datasets/icesat2.py functionality shouldn't be in a package like …

Thoughts? I am happy to keep it in h5coro for now as we figure out how best to organize it all.
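To make the column-conversion idea concrete, here is a hedged sketch. The epoch below is the ATLAS SDP epoch (2018-01-01) used by ICESat-2 delta_time; the actual function in datasets/icesat2.py may differ, for example in how it handles the GPS/UTC leap-second offset.

```python
import numpy as np

# ICESat-2 delta_time is seconds since the ATLAS SDP epoch (2018-01-01).
# NOTE: this simple version ignores the GPS/UTC leap-second offset.
ATLAS_SDP_EPOCH = np.datetime64("2018-01-01T00:00:00", "ns")

def delta_time_to_datetime(delta_time: np.ndarray) -> np.ndarray:
    """Convert float seconds since the ATLAS SDP epoch to datetime64[ns]."""
    nanoseconds = (delta_time * 1e9).astype("int64")
    return ATLAS_SDP_EPOCH + nanoseconds.astype("timedelta64[ns]")

# Column-conversion dictionary: column name -> function taking a numpy array.
column_conversions = {"delta_time": delta_time_to_datetime}
```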
Thanks for working on it! That does all sound like a pickle. After a bit of reading, it seems the xarray default is to assume that the datetime is encoded using CF conventions. Looking through the source code, it seems like this function is where that happens. When I thought about this in August, it definitely made me wonder what it would take to use some of xarray's existing machinery to read these files. It seems relevant here, and also in the case of trying to find a generalized solution for the delta_time dimension coordinate.

Back to what we should do right now, though: it sounds to me like you've found a good solution for the moment. I think it's fine to merge this and come back to it, because it fits the goal of the PR, which to me is "allow h5coro to read IS2 data into xarray". It looks like continuing to generalize the reader will be an ongoing process.

[Follow-up comment: The Pangeo discourse is often incredibly helpful and full of very xarray-knowledgeable people. At some point we could post there looking for advice on how to utilize some existing xarray parsing functionality in our reader.]
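For reference, this is that CF machinery in action: when a time variable carries a CF-style units attribute, xarray decodes it to datetimes automatically. The data here are illustrative; ATL03's delta_time lacks such an attribute, which is why the conversion dictionary above is needed.

```python
import numpy as np
import xarray as xr

raw = xr.Dataset(
    {"photon_height": ("time", np.array([1.0, 2.0, 3.0]))},
    coords={
        "time": (
            "time",
            np.array([0.0, 1.5, 3.0]),
            {"units": "seconds since 2018-01-01 00:00:00"},  # CF-style encoding
        )
    },
)
decoded = xr.decode_cf(raw)
print(decoded["time"].values)  # -> datetime64[ns] values
```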
I just marked this as ready for review. If you see anything that needs to be improved, @jpswinski, feel free to request changes. If not, I think we could merge!
I've been diving into this, and there were some gotchas in my code for handling groups/attributes/variables. As you mentioned previously, I wasn't handling groups as their own thing; I was treating them as variables. Unfortunately, this didn't always work. So I've cleaned this up in one of my branches and will be merging it into this branch soon. Hopefully that makes things a lot more robust. For instance, I am now able to list the root directory (/) as well as the ancillary_data directory, with and without attributes. Before, I wasn't able to do that, as both of those paths have a mix of variables and groups.
Nice! @jpswinski, if you comment when you have that merged, I can take another look at loading groups.
Goal
The goal of this PR is to create a basic xarray backend reader for h5coro on ATL03 data. The reader opens all `/BEAM/heights` groups, as well as some additional groups. This allows a user to read an HDF5 file with h5coro and get the bytes directly as an xarray Dataset object. Tasks out of scope of this PR are: …

This PR closes #3.
How it was done
A new `backends` folder was added with an `xarray_h5coro.py` file. The file formatting follows the xarray docs. A class with an `open_dataset` method was defined. The method uses h5coro to read the data bytes and convert them into a series of Python dictionaries that xarray can use to create a `Dataset`.

The goal was to mirror the way xarray already reads HDF5 data: one group at a time. When a group is read (using `open_dataset` with the `group` argument), xarray loads all the data variables in that group, the attributes of those variables, and the attributes of the user-specified groups. Because the current h5coro library only reads 1D arrays, the backend also drops any data variable with more than one dimension (for ATL data that seems to be `signal_conf_ph`).

[UPDATE] A minor note: xarray does not accept more dimension names than the number of dimensions in a variable. So while most ATL03 products list delta_time, lat_ph, and lon_ph as coordinates, variables are only listed as having delta_time for a coordinate in the xarray Dataset. This again mirrors the default behavior of xarray HDF reads.
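As a rough outline (not the actual file), the backend follows the subclassing pattern from the xarray backend docs. `read_group_with_h5coro` below is a hypothetical stand-in for the h5coro calls, and hardcoding `delta_time` as the dimension mirrors the current state of the PR:

```python
import xarray as xr
from xarray.backends import BackendEntrypoint

class H5CoroBackendEntrypoint(BackendEntrypoint):
    def open_dataset(self, filename_or_obj, *, drop_variables=None, group="/"):
        # Hypothetical helper: returns {name: ndarray} plus attribute dicts
        # for the variables and for the group itself.
        variables, var_attrs, group_attrs = read_group_with_h5coro(
            filename_or_obj, group
        )
        data_vars = {
            name: ("delta_time", arr, var_attrs.get(name, {}))
            for name, arr in variables.items()
            if arr.ndim == 1  # h5coro currently reads 1D arrays only
        }
        return xr.Dataset(data_vars, attrs=group_attrs)
```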
How it can be tested
Code
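A plausible invocation (the engine name "h5coro" and the file/group paths are assumptions for illustration, not the PR's exact test code):

```python
import xarray as xr

ds = xr.open_dataset(
    "ATL03_example.h5",   # hypothetical local ATL03 granule
    engine="h5coro",      # assumed engine name for the new backend
    group="/gt1l/heights",
)
print(ds)
```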
Expected result:
Tasks remaining
In my opinion, these should be complete before merging:

- Groups that aren't `/BEAM/heights/` don't work right now. Follow-up conversations will be needed.
- Test functionality again once "listGroup returns nested dictionary" (#19) is addressed, since I think the errors may have to do with hardcoding of attribute paths. (I'm willing to do this once the issue is addressed.)

Nice-to-haves before merging:

- `h5coro.H5Coro` requires a `credentials` argument, and it can't be None. @jpswinski do you think it is possible to make `credentials` optional, so a user reading local files doesn't have to provide s3 credentials? (I'm willing to look at this again after the credentials update.) A sketch of this idea follows the list below.
- Setting the `INFO` log level doesn't seem to stop the warnings. @jpswinski any thoughts?
- … (`project.toml`?) to readme

Depends on: #19 (nice-to-have improvements would come from #20)
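One possible shape for the optional-credentials change (hypothetical; the real `h5coro.H5Coro` signature may differ):

```python
class H5Coro:
    def __init__(self, resource, driver, credentials=None):
        # Only set up S3 auth when credentials are provided, so reading
        # local files works without them.
        self.resource = resource
        self.driver = driver
        self.credentials = credentials
```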