Check for zeros in the data #159

Open
peterdudfield opened this issue Jun 27, 2024 · 11 comments
Labels
good first issue Good for newcomers

Comments

@peterdudfield
Contributor

Detailed Description

It would be great to have a check in place that looks for zeros in the data. A large number of them normally indicates an error.

Context

  • good to catch data problems early and fail hard

Possible Implementation

  • add a check: if zeros make up more than 20% of the data, raise an error
  • @devsjc can you suggest a place in the code for this check?
@peterdudfield added the good first issue label Jun 27, 2024
@devsjc
Collaborator

devsjc commented Jul 1, 2024

Yes, I would add it to the filter here:

def _dataQualityFilter(ds: xr.Dataset) -> bool:
    """Filter out data that is not of sufficient quality."""
    if ds == xr.Dataset():
        return False
    # Carry out a basic data quality check
    for data_var in ds.data_vars:
        if ds[f"{data_var}"].isnull().any():
            log.warn(
                event=f"Dataset has NaNs in variable {data_var}",
                initTime=str(ds.coords["init_time"].values[0])[:16],
                variable=data_var,
            )
    return True

@GAMinsect

Hi, I'm someone who is just starting in the field of open source development.
I would like to know if I can contribute to this, since it's a good first issue.
Thank you in advance.

@peterdudfield
Contributor Author

> Hi, I'm someone who is just starting in the field of open source development. I would like to know if I can contribute to this, since it's a good first issue. Thank you in advance.

You can definitely contribute. Are you familiar with Python and xarray?

@GAMinsect

I'm familiar with Python and common libraries like pandas, but not with xarray.

@peterdudfield
Contributor Author

Thanks @GAMinsect, that's good. You might need to learn a bit of xarray. You're welcome to give it a go.

@GAMinsect

After 2 weeks of working on it in my free time, here's my implementation.
I formatted the code following your coding style.

def _dataQualityFilter(ds: xr.Dataset) -> bool:
    """Filter out data that is not of sufficient quality."""
    if ds == xr.Dataset():
        return False

    zeroCount = 0
    elementCount = 0
    # Carry out a basic data quality check
    for data_var in ds.data_vars:
        if ds[f"{data_var}"].isnull().any():
            log.warn(
                event=f"Dataset has NaNs in variable {data_var}",
                initTime=str(ds.coords["init_time"].values[0])[:16],
                variable=data_var,
            )

        data = ds[data_var].data
        elementCount += data.size
        zeroCount += (data == 0).sum()

    if zeroCount / elementCount > 0.2:
        raise ValueError("In your dataset more than 20% of your data are 0's")

    return True
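
For comparison, here is a minimal sketch of a per-variable version of the same idea, so that one all-zero variable cannot be masked by the others when the counts are pooled. It is not the project's implementation: the names `zero_fractions`, `check_zeros` and `MAX_ZERO_FRACTION` are hypothetical, and the 20% default is simply the figure proposed in this issue. It assumes each variable can be converted with `np.asarray`; NaNs are counted as non-zero.

```python
from collections.abc import Mapping
from typing import Any

import numpy as np

# Hypothetical default, taken from the 20% figure proposed in this issue.
MAX_ZERO_FRACTION = 0.2


def zero_fractions(data_vars: Mapping[str, Any]) -> dict[str, float]:
    """Return the fraction of exactly-zero values for each variable."""
    fractions: dict[str, float] = {}
    for name, values in data_vars.items():
        arr = np.asarray(values)
        if arr.size == 0:
            # An empty variable has no zeros; this also avoids dividing by zero.
            fractions[str(name)] = 0.0
        else:
            fractions[str(name)] = float(np.count_nonzero(arr == 0)) / arr.size
    return fractions


def check_zeros(
    data_vars: Mapping[str, Any], max_fraction: float = MAX_ZERO_FRACTION
) -> None:
    """Raise ValueError if any variable exceeds the allowed fraction of zeros."""
    bad = {n: f for n, f in zero_fractions(data_vars).items() if f > max_fraction}
    if bad:
        details = ", ".join(f"{n}: {f:.0%}" for n, f in sorted(bad.items()))
        raise ValueError(f"More than {max_fraction:.0%} zeros in: {details}")
```

With an xarray dataset, `check_zeros(ds.data_vars)` should work, since `data_vars` is a mapping of names to `DataArray` objects. Note that `np.asarray` loads each variable into memory, which may matter for large or lazily loaded datasets.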
 

@GAMinsect

@peterdudfield is the code fine?

@peterdudfield
Contributor Author

I'll let @devsjc review, if that's OK.

@GAMinsect

@devsjc how's the review going?

@devsjc
Collaborator

devsjc commented Dec 6, 2024

Hi @GAMinsect, thanks for your work looking into this, and apologies for the slow response.

I've just merged in a quite comprehensive restructuring of the project in order to improve the speed of the application, which has resulted in the _dataQualityFilter function being removed, in favour of a new - in progress - PostProcess API that I'm in the process of finalising. When complete, I suspect this will be where this check will end up.

The code looks good though, so your work wasn't wasted - it will still be used in the logic, just not in the place I originally suggested!

@GAMinsect

@devsjc Thank you for the reply (and sorry for my late one). I guess this issue can be closed now, right?

3 participants