returning 'nan' when too many variables requested? #620

Closed
derekpickell opened this issue Oct 14, 2024 · 6 comments

@derekpickell

Hi there,

I'm playing around with a basic read of locally downloaded .h5 files:

import icepyx as ipx

# file_list: paths to the locally downloaded .h5 files
reader = ipx.Read(data_source=file_list)
reader.vars.append(beam_list=['gt2l', 'gt2r'], var_list=['h_li', "latitude", "longitude", "h_li_sigma", "sigma_geo_h", "bsnow_h", "cloud_flg_asr", "atl06_quality_summary"])
ds = reader.load()

It seems when I add just one more variable to the list, e.g., 'h_rms_misfit', the number of 'nans' in the returned 'ds' xarray increases for no apparent reason, sometimes for all variables.

icepyx v1.3.0

Thank you!

@JessicaS11
Member

Hello @derekpickell! I just wanted to acknowledge I saw this post (thanks for reaching out!) and am wondering why this is unexpected behavior? It could be that adding h_rms_misfit is increasing one of the dataset dimensions, which would tend to increase the number of nans as Xarray pads out the data to take this new shape.

A few questions that will make it easier for me to diagnose if there's an issue:

  • Does the behavior seem to be tied to the h_rms_misfit variable specifically, or any number of variables >8?
  • How are you counting the number of nans?
  • What data product are you using?
  • Can you share either the search you used to download the granules or some of the granule IDs where you're noticing this happening?
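To illustrate the padding effect described above, here is a minimal sketch using synthetic values. It is not icepyx's actual merge logic; the coordinate name `delta_time` and all numbers are assumptions chosen for illustration. It only shows that an outer-join merge in Xarray fills positions with NaN wherever one variable lacks a coordinate value that another has.

```python
import xarray as xr

# Synthetic stand-ins: two variables defined on partly different coordinates.
a = xr.Dataset(
    {"h_li": ("delta_time", [1.0, 2.0, 3.0])},
    coords={"delta_time": [0, 1, 2]},
)
b = xr.Dataset(
    {"h_rms_misfit": ("delta_time", [0.1, 0.2])},
    coords={"delta_time": [2, 3]},
)

# An outer join keeps the union of coordinate values and pads gaps with NaN.
merged = xr.merge([a, b], join="outer")

print(merged.sizes["delta_time"])                 # 4
print(int(merged["h_li"].isnull().sum()))         # 1
print(int(merged["h_rms_misfit"].isnull().sum())) # 2
```

Neither input contains a NaN, yet the merged dataset does, because each variable is stretched to the combined coordinate.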

@derekpickell
Author

Hi @JessicaS11,

Thank you for the response! To answer your questions:

  • as far as I can tell, it’s any number of variables > 8. I experimented with cloud and blowing snow flags as well.
  • To count NaNs, I check each field with np.nanmax() and np.nanmin(). Sometimes no number is returned, which indicates the field is all NaN. I suspect this isn’t a padding issue, since the same data are returned when I request fewer variables.
  • I am using ATL06, manually downloaded from the NSIDC data access tool using a 30 km bounding box centered near Summit, Greenland, from 2018 to present.
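As a side note on counting, np.nanmax() and np.nanmin() only reveal the all-NaN case. A minimal sketch of a direct count follows; the helper name `nan_summary` and the sample values are hypothetical and not part of icepyx or the data in question.

```python
import warnings

import numpy as np


def nan_summary(values):
    """Return (nan_count, total, all_nan) for a numeric array-like."""
    arr = np.asarray(values, dtype=float)
    nan_count = int(np.isnan(arr).sum())
    return nan_count, arr.size, nan_count == arr.size


partly_missing = [1.5, np.nan, 3.0]          # synthetic example values
fully_missing = [np.nan, np.nan, np.nan]     # synthetic example values

print(nan_summary(partly_missing))  # (1, 3, False)
print(nan_summary(fully_missing))   # (3, 3, True)

# On an all-NaN input, np.nanmax returns nan and emits a RuntimeWarning,
# which is the "no number is returned" symptom described above.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    all_nan_max = np.nanmax(fully_missing)
print(all_nan_max)  # nan
```

Counting directly makes it possible to compare the number of NaNs per variable between a read with fewer variables and a read with more.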

@JessicaS11
Member

Thanks for these answers. I've dug in a bit more and now suspect that the issue is not the number of variables you're playing with, but which variables. The note on which ones you've experimented with was a clue. h_rms_misfit, bsnow_h, and cloud_flg_asr are all more deeply nested variables than (for instance) h_li (if you look at the variable paths, they have either geophysical or fit_statistics after the land_ice_segments layer). If you look at the resulting dataset for a single file after reading in two versus three of the above specific variables, the coordinates attached to the variable are different. What's happening behind the scenes is that icepyx essentially does all of the individual group reads with xarray and then tries to cleverly merge the per-group DataArrays together into one dataset. As you've noted, this doesn't always work! Handling (generically) the multiple layers of nesting is an ongoing challenge in icepyx, so thanks for reporting this case we missed.
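To make the nesting difference concrete, here is a small sketch that groups variable paths by their parent group. The full paths are assumptions written out from the description above for a single beam (gt2l); the helper `group_by_parent` is hypothetical and is not how icepyx organizes its reads.

```python
from collections import defaultdict

# Assumed ATL06 paths for one beam, following the nesting described above.
VARIABLE_PATHS = [
    "gt2l/land_ice_segments/h_li",
    "gt2l/land_ice_segments/geophysical/bsnow_h",
    "gt2l/land_ice_segments/geophysical/cloud_flg_asr",
    "gt2l/land_ice_segments/fit_statistics/h_rms_misfit",
]


def group_by_parent(paths):
    """Map each parent group path to the variable names stored directly in it."""
    groups = defaultdict(list)
    for path in paths:
        parent, _, name = path.rpartition("/")
        groups[parent].append(name)
    return dict(groups)


groups = group_by_parent(VARIABLE_PATHS)
for parent, names in groups.items():
    depth = parent.count("/") + 1  # number of group levels above the variable
    print(f"{parent} (depth {depth}): {names}")
```

Each distinct parent corresponds to a separate group read, so requesting variables from geophysical or fit_statistics adds groups that then have to be merged with the shallower land_ice_segments variables.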

I think I've isolated where in the code the issue is happening (lines 816-822 or so in the read module, so could also be in one of the functions called therein), but I haven't yet figured out what the solution might be (any suggestions welcome!). I'll continue to work on resolving this as time allows, but any assistance would be greatly appreciated.

@JessicaS11
Member

Hello @derekpickell! I have good news and bad news. The good news is that the bug I identified, where not all dimensions were being applied to the more deeply nested variables of interest, is fixed via #623. The bad news is that I don't think this was actually the problem you noted.

When I dug in further, I found a granule that only has nan values for some variables. However, it seems like only bsnow_h fits into this category, not cloud_flg_asr or h_rms_misfit. If I'm not mistaken, in some situations the blowing snow algorithm is unable to confidently quantify blowing snow, which would result in no blowing snow values. @mikala-nsidc (ICESat-2 support specialist at NSIDC) or @tsutterley (one of the ATL06 product leads), can you confirm that in some cases no bsnow_h (and thus all nans) is expected behavior for ATL06 granules?

@derekpickell
Author

@JessicaS11 wow amazing thank you. It looks like everything 'makes sense' with the data I am looking at: few nans here and there, but no large gaps where I wouldn't expect them.

@JessicaS11
Member

@derekpickell Excellent! I'm going to close this issue as resolved, but feel free to comment again if need be. Would you be able/willing to do a PR review for #623?
