Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Add test for checking physical limits and zeroes in NWP data #… #340

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

glitch401
Copy link

@glitch401 glitch401 commented Jul 3, 2024

Pull Request

Description

This pull request addresses issues identified in #335 and #337 by implementing checks for zeros and physical limits in NWP data processing. The changes ensure that the OpenNWP class correctly raises a ValueError when encountering NWP data arrays containing zeros (addressing #335) and when NWP data values are outside specified physical limits (addressing #337). These enhancements are crucial for maintaining data integrity and reliability in our processing pipeline.

Fixes #337 , #335

How Has This Been Tested?

The modifications have been validated through comprehensive unit tests. Specifically, tests were added to verify that a ValueError is raised both when zeros are present in the data array and when data values fall outside of physical limits. These tests were conducted using sample Zarr datasets designed to mimic real-world scenarios where such issues might arise.

  • Yes

A sanity check was performed by visually inspecting the processed data to ensure that the new checks effectively identify and handle data with zeros and data outside physical limits.

  • Yes

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

@glitch401 glitch401 marked this pull request as ready for review July 4, 2024 13:59
@glitch401 glitch401 marked this pull request as draft July 4, 2024 13:59
@glitch401
Copy link
Author

@peterdudfield are there any other suggestion for this PR?

@peterdudfield
Copy link
Contributor

@peterdudfield are there any other suggestion for this PR?

Thanks so much, ive put a few comments, but then i think it should be ready

ocf_datapipes/load/nwp/nwp.py Outdated Show resolved Hide resolved
ocf_datapipes/load/nwp/nwp.py Show resolved Hide resolved
ocf_datapipes/load/nwp/nwp.py Show resolved Hide resolved
@glitch401 glitch401 marked this pull request as ready for review July 5, 2024 11:20
@glitch401
Copy link
Author

hey @peterdudfield , apologies for the delay in solving this
but I've made certain changes to solve #336

.gitignore Outdated Show resolved Hide resolved
@peterdudfield
Copy link
Contributor

@AUdaltsova (and maybe @Sukh-P) do you mind looking at this?
I think this code is ready to merge, but I know you 2 are using the code, so I dont want to break something as you running stuff

def check_if_zeros(self, nwp: Union[xr.DataArray, xr.Dataset]):
"""Checks if the NWP data contains zeros"""
if isinstance(nwp, xr.DataArray):
if (nwp.values == 0).any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it will not be performed lazily (as in, this will load the whole dataArray into memory to check the values), which we really want to avoid in this place because at this point the arrays we operate on are often massive

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made some changes to accommodate lazy loading. Have leveraged Dask arrays, as its often used with xarray for the same

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good way to go about it! There is a danger that some of our data might not fit anyway, but since it can be turned on and off that's fine.

I was wondering if it's worth exploring implementing this check downstream, somewhere after spacial and temporal crop and before normalisation, so that it operates on samples instead? And then maybe skip ones with too many zeroes/nans/out of physical bounds values and give a userWarning/log info of how many were skipped as a proxy for understanding how much of the data is corrupted. Thoughts @Sukh-P @peterdudfield?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea, less chance of a chance of running into memory issues by loading a chunk, I guess the only draw back would be doing processing on some data you are going to chuck anyway but in this case that processing gets it down to a more manageable size

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AUdaltsova I'm trying to understand if this is acceptable w.r.t the scope of this PR? 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glitch401 might be! @peterdudfield happy to merge this then? :)

@@ -90,7 +90,7 @@ def stack_np_examples_into_batch(dict_list: Sequence[NumpyBatch]) -> NumpyBatch:

nwp_batch[nwp_source] = nwp_source_batch

batch[BatchKey.nwp] = nwp_batch
batch[BatchKey.nwp] = check_for_nans(nwp_batch)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong, but I was under the impression that we allow for nans currently to be present in batches, which then get filled with zeroes during training? @peterdudfield is this a gsp thing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this to when the NWP gets opened? And have an option to check it or not?
I think that would make it safer and clearer whats going on.

We could have a different issue that checks for nans in the batches, but we need to think how we turn that on and off .e.tc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glitch401 would you mind moving this to when the nwp is opened? with an option to do this or not.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean, when data element NWP is opened?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like below, in the load stage

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha, will append changes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was hoping this would be removed, and it would be mvoed to below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is still open

"VIS008": (0, 1000), # Visible channel
"WV_062": (0, 1000), # Water vapor channel
"WV_073": (0, 1000), # Water vapor channel
}
logger.info(f"Using {provider.lower()}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very much just a suggestion, but it would be nice to have some control over which variables receive the checks. Intuitively, that should probably be possible by just passing a list of keys to be checked instead of True to check_for_zeroes/check_physical_limits

@AUdaltsova
Copy link
Contributor

AUdaltsova commented Jul 12, 2024

Thank you so much for doing this! Really really appreciate that. The only important note that I have is the laziness thing. I've left some nitpicky suggestions, but please do treat them more as thoughts than actual requests

@peterdudfield re: breaking something, I'll not be affected by any updates, I'm very much locked into my version :) Thanks for asking!

"lcc": (0, 100), # Low cloud cover, %
"tcc": (0, 100), # Total cloud cover, %
"sde": (0, 1000), # Snowfall depth, meters
"sr": (0, 10), # Surface roughness, meters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is getting right into OCFs data model here but @devsjc has helped me understand that we have an internal naming convention that deviates slightly from some NWP providers e.g. sr actually maps to dsrp for us, not surface roughness. Obviously there would not be a way to know that as a contributor, so apologies for that. @peterdudfield FYI

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Can supply keys and then pull corresponding ranges out of consts maybe?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right @Sukh-P , @peterdudfield did help me understand the conventions

@glitch401
Copy link
Author

@peterdudfield @AUdaltsova how does it look for now?

@glitch401
Copy link
Author

any updates @peterdudfield

Copy link
Author

@glitch401 glitch401 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated changes

@@ -40,49 +40,13 @@ def _single_batch_sample(fill_value):
return sample


def _single_batch_sample_nan(fill_value):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed all the functions related to checking nans

ocf_datapipes/load/nwp/nwp.py Show resolved Hide resolved
@glitch401
Copy link
Author

@peterdudfield any updates?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Check physical limits of nwp data when data is loaded
4 participants