Research Python tools for validating data #47

Open
Tracked by #48
JackKelly opened this issue Mar 6, 2024 · 8 comments
Labels
good first issue Good for newcomers

Comments

@JackKelly
Member

JackKelly commented Mar 6, 2024

We'd like the code to automatically check that batches are of the correct shape, with the correct coordinates, and with the appropriate number of NaNs, zeros, etc. Historically, we have used pydantic. But we'd like to do some research into alternative tools like attrs.

We decided this would be a good idea in OCF's internal "Data Engineering Big Ideas" meeting on 5th March 2024.

e.g. mypy / pydantic / attrs

Related:

JackKelly added the "good first issue" label Mar 6, 2024
JackKelly transferred this issue from openclimatefix/ocf_datapipes Mar 6, 2024
@bikramb98

@JackKelly Happy to take this on and get back to you with my findings. I couldn't access the Google Docs linked above. If they are meant to be public, could you please update their access settings?

@JackKelly
Member Author

That's very kind, thank you! The Google Docs aren't super-relevant, TBH.

Basically, what we'd like is a way to automatically check that our xarray objects have:

  • the correct shape
  • the correct coordinates
  • the appropriate number of NaNs (which may vary depending on context)
  • the appropriate number of zeros (which may vary depending on context. For example, solar irradiance should be zero at night!)
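The four checks listed above could be sketched roughly as follows. This is only an illustration, assuming the batch is an `xarray.DataArray` of floats; the function name `validate_batch`, its parameters, and the example thresholds are hypothetical and not taken from any existing OCF code.

```python
import numpy as np
import xarray as xr


def validate_batch(da, expected_sizes, required_coords,
                   max_nan_fraction=0.0, max_zero_fraction=1.0):
    """Return a list of problems found in `da` (an empty list means valid).

    Hypothetical helper: the limits are passed in by the caller because
    the acceptable number of NaNs and zeros depends on context.
    """
    problems = []

    # Shape: compare the length of each named dimension with the expectation.
    actual_sizes = dict(da.sizes)
    if actual_sizes != expected_sizes:
        problems.append(f"shape: expected {expected_sizes}, got {actual_sizes}")

    # Coordinates: every required coordinate name must be present.
    missing = [name for name in required_coords if name not in da.coords]
    if missing:
        problems.append(f"coords: missing {missing}")

    # NaNs and zeros: compare observed fractions against the given limits.
    values = np.asarray(da.values, dtype=float)
    if values.size:
        nan_fraction = float(np.isnan(values).mean())
        zero_fraction = float((values == 0).mean())
        if nan_fraction > max_nan_fraction:
            problems.append(
                f"NaN fraction {nan_fraction:.2f} exceeds limit {max_nan_fraction}"
            )
        if zero_fraction > max_zero_fraction:
            problems.append(
                f"zero fraction {zero_fraction:.2f} exceeds limit {max_zero_fraction}"
            )

    return problems


# Made-up irradiance data: zeros at "night", one missing reading.
irradiance = xr.DataArray(
    np.array([[0.0, 0.0, 120.5], [0.0, np.nan, 98.0]]),
    dims=("site", "time"),
    coords={"site": ["a", "b"], "time": [0, 1, 2]},
)

problems = validate_batch(
    irradiance,
    expected_sizes={"site": 2, "time": 3},
    required_coords=["site", "time"],
    max_nan_fraction=0.0,   # no NaNs tolerated in this context
    max_zero_fraction=0.6,  # many zeros are fine, e.g. at night
)
print(problems)
```

Returning a list of problems rather than raising on the first failure is a design choice for this sketch: it lets a caller report every issue with a batch at once.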

I'm not super-involved in this work, TBH. I just wrote up these GitHub issues.

@noobjam

noobjam commented Mar 18, 2024

Hello!
I'd also like to help on this.

@JackKelly
Member Author

Sounds good, thank you! Maybe what I'd suggest is starting a shared Google Doc with brief notes about the various data validation frameworks out there! Thanks so much! The idea is that multiple people can collaborate on the notes doc. Or, if you'd prefer, we could use the GitHub wiki in this project. Or a markdown file in this project.

@Mahak-Agrawal-304

Hello @JackKelly, I agree with your suggestion of sharing a doc with brief notes so that multiple contributors can collaborate on it.
I would really like to implement this idea

@Mahak-Agrawal-304

To begin with, I have started a very rough document. Since it's my first time contributing, I would appreciate any constructive feedback!

@JackKelly
Member Author

That doc looks perfect, thank you! I think you're on the right track: list the various options, and give a brief summary of each option. Ideally, it'd be great to include a short code example showing how to define a data validation schema. Thanks so much! This is super-helpful!
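For reference, "defining a data validation schema" might look something like the sketch below. It deliberately uses only the standard library's `dataclasses`, so it is library-neutral; attrs and pydantic offer richer versions of the same declarative idea. The class name `BatchSchema` and all of its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchSchema:
    """Hypothetical declarative description of a valid batch."""

    dims: tuple                      # expected dimension names, in order
    shape: tuple                     # expected length of each dimension
    max_nan_fraction: float = 0.0    # context-dependent NaN tolerance
    max_zero_fraction: float = 1.0   # context-dependent zero tolerance

    def __post_init__(self):
        # Validate the schema itself, so mistakes surface when it is defined.
        if len(self.dims) != len(self.shape):
            raise ValueError("dims and shape must have the same length")
        for name in ("max_nan_fraction", "max_zero_fraction"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")

    def check_layout(self, dims, shape):
        """Return a list of mismatches between an array's layout and the schema."""
        errors = []
        if tuple(dims) != self.dims:
            errors.append(f"dims: expected {self.dims}, got {tuple(dims)}")
        if tuple(shape) != self.shape:
            errors.append(f"shape: expected {self.shape}, got {tuple(shape)}")
        return errors


# Example schema for a made-up irradiance batch.
schema = BatchSchema(dims=("time", "site"), shape=(3, 2), max_zero_fraction=0.6)

# An array with the right dimension names but the wrong number of sites.
errors = schema.check_layout(dims=("time", "site"), shape=(3, 4))
print(errors)
```

With an xarray object `da`, the call would presumably be `schema.check_layout(da.dims, da.shape)`; the NaN and zero limits stored on the schema would then be applied by a separate check.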

@Mahak-Agrawal-304

Thank you for your feedback! I'll keep in mind to add code snippets henceforth. I will implement it right now

4 participants