-
-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: Add test for checking physical limits and zeroes in NWP data #… #340
base: main
Are you sure you want to change the base?
Conversation
for more information, see https://pre-commit.ci
@peterdudfield are there any other suggestion for this PR? |
Thanks so much, ive put a few comments, but then i think it should be ready |
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
hey @peterdudfield , apologies for the delay in solving this |
@AUdaltsova (and maybe @Sukh-P) do you mind looking at this? |
ocf_datapipes/load/nwp/nwp.py
Outdated
def check_if_zeros(self, nwp: Union[xr.DataArray, xr.Dataset]): | ||
"""Checks if the NWP data contains zeros""" | ||
if isinstance(nwp, xr.DataArray): | ||
if (nwp.values == 0).any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it will not be performed lazily (as in, this will load the whole dataArray into memory to check the values), which we really want to avoid in this place because at this point the arrays we operate on are often massive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made some changes to accommodate lazy loading. Have leveraged Dask arrays, as its often used with xarray for the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good way to go about it! There is a danger that some of our data might not fit anyway, but since it can be turned on and off that's fine.
I was wondering if it's worth exploring implementing this check downstream, somewhere after spacial and temporal crop and before normalisation, so that it operates on samples instead? And then maybe skip ones with too many zeroes/nans/out of physical bounds values and give a userWarning/log info of how many were skipped as a proxy for understanding how much of the data is corrupted. Thoughts @Sukh-P @peterdudfield?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that idea, less chance of a chance of running into memory issues by loading a chunk, I guess the only draw back would be doing processing on some data you are going to chuck anyway but in this case that processing gets it down to a more manageable size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AUdaltsova I'm trying to understand if this is acceptable w.r.t the scope of this PR? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@glitch401 might be! @peterdudfield happy to merge this then? :)
@@ -90,7 +90,7 @@ def stack_np_examples_into_batch(dict_list: Sequence[NumpyBatch]) -> NumpyBatch: | |||
|
|||
nwp_batch[nwp_source] = nwp_source_batch | |||
|
|||
batch[BatchKey.nwp] = nwp_batch | |||
batch[BatchKey.nwp] = check_for_nans(nwp_batch) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be wrong, but I was under the impression that we allow for nans currently to be present in batches, which then get filled with zeroes during training? @peterdudfield is this a gsp thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move this to when the NWP gets opened? And have an option to check it or not?
I think that would make it safer and clearer whats going on.
We could have a different issue that checks for nans in the batches, but we need to think how we turn that on and off .e.tc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@glitch401 would you mind moving this to when the nwp is opened? with an option to do this or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean, when data element NWP is opened?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like below, in the load stage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha, will append changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was hoping this would be removed, and it would be mvoed to below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is still open
"VIS008": (0, 1000), # Visible channel | ||
"WV_062": (0, 1000), # Water vapor channel | ||
"WV_073": (0, 1000), # Water vapor channel | ||
} | ||
logger.info(f"Using {provider.lower()}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very much just a suggestion, but it would be nice to have some control over which variables receive the checks. Intuitively, that should probably be possible by just passing a list of keys to be checked instead of True
to check_for_zeroes
/check_physical_limits
Thank you so much for doing this! Really really appreciate that. The only important note that I have is the laziness thing. I've left some nitpicky suggestions, but please do treat them more as thoughts than actual requests @peterdudfield re: breaking something, I'll not be affected by any updates, I'm very much locked into my version :) Thanks for asking! |
ocf_datapipes/load/nwp/nwp.py
Outdated
"lcc": (0, 100), # Low cloud cover, % | ||
"tcc": (0, 100), # Total cloud cover, % | ||
"sde": (0, 1000), # Snowfall depth, meters | ||
"sr": (0, 10), # Surface roughness, meters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is getting right into OCFs data model here but @devsjc has helped me understand that we have an internal naming convention that deviates slightly from some NWP providers e.g. sr
actually maps to dsrp
for us, not surface roughness. Obviously there would not be a way to know that as a contributor, so apologies for that. @peterdudfield FYI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Can supply keys and then pull corresponding ranges out of consts maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right @Sukh-P , @peterdudfield did help me understand the conventions
@peterdudfield @AUdaltsova how does it look for now? |
for more information, see https://pre-commit.ci
any updates @peterdudfield |
…o check for NaNs when NWP is loaded
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated changes
@@ -40,49 +40,13 @@ def _single_batch_sample(fill_value): | |||
return sample | |||
|
|||
|
|||
def _single_batch_sample_nan(fill_value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed all the functions related to checking nans
@peterdudfield any updates? |
Pull Request
Description
This pull request addresses issues identified in #335 and #337 by implementing checks for zeros and physical limits in NWP data processing. The changes ensure that the OpenNWP class correctly raises a ValueError when encountering NWP data arrays containing zeros (addressing #335) and when NWP data values are outside specified physical limits (addressing #337). These enhancements are crucial for maintaining data integrity and reliability in our processing pipeline.
Fixes #337 , #335
How Has This Been Tested?
The modifications have been validated through comprehensive unit tests. Specifically, tests were added to verify that a
ValueError
is raised both when zeros are present in the data array and when data values fall outside of physical limits. These tests were conducted using sample Zarr datasets designed to mimic real-world scenarios where such issues might arise.A sanity check was performed by visually inspecting the processed data to ensure that the new checks effectively identify and handle data with zeros and data outside physical limits.
Checklist: