Research Python tools for validating data #47
Comments
@JackKelly Happy to take this on and get back to you with my findings. I couldn't access the Google Docs linked above. If they are meant to be public, could you please update their access settings?
That's very kind, thank you! The Google Docs aren't super-relevant, TBH. Basically, what we'd like is a way to automatically check that our xarray objects have:
I'm not super-involved in this work, TBH. I just wrote up these GitHub issues.
Hello!
Sounds good, thank you! Maybe what I'd suggest is starting a shared Google Doc with brief notes about the various data validation frameworks out there! Thanks so much! The idea is that multiple people can collaborate on the notes doc. Or, if you'd prefer, we could use the GitHub wiki in this project. Or a markdown file in this project.
Hello @JackKelly, I agree with your suggestion of sharing a doc with brief notes so that multiple contributors can collaborate on it.
To begin with, I have started a very rough document. Since it's my first time contributing, I'd appreciate any constructive feedback!
That doc looks perfect, thank you! I think you're on the right track: list the various options, and give a brief summary of each option. Ideally, it'd be great to include a short code example showing how to define a data validation schema. Thanks so much! This is super-helpful!
Thank you for your feedback! I'll keep in mind to add code snippets from now on, and I'll start adding them right away.
We'd like the code to automatically check that batches are of the correct shape, with the correct coordinates, and with the appropriate number of NaNs, zeros, etc. Historically, we have used pydantic. But we'd like to do some research into alternative tools like attrs.
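To make the kind of check described above concrete, here is a minimal sketch in plain xarray and NumPy, with no validation framework involved. It is only an illustration: the function name `validate_batch`, its parameters, the thresholds, and the dimension names `example` and `time` are all hypothetical, not something this project defines.

```python
import numpy as np
import xarray as xr


def validate_batch(
    batch: xr.DataArray,
    expected_sizes: dict[str, int],
    required_coords: tuple[str, ...],
    max_nan_fraction: float = 0.0,
    max_zero_fraction: float = 1.0,
) -> list[str]:
    """Return a list of problems found; an empty list means the batch passed.

    Hypothetical helper for illustration only.
    """
    problems = []

    # Shape: compare the length of each named dimension with the expected length.
    actual_sizes = dict(batch.sizes)
    if actual_sizes != expected_sizes:
        problems.append(f"sizes {actual_sizes} != expected {expected_sizes}")

    # Coordinates: every required coordinate must be present on the array.
    for name in required_coords:
        if name not in batch.coords:
            problems.append(f"missing coordinate {name!r}")

    # NaNs and zeros: compare the fraction of such values with the allowed maximum.
    values = batch.values
    nan_fraction = float(np.isnan(values).mean())
    if nan_fraction > max_nan_fraction:
        problems.append(f"NaN fraction {nan_fraction:.3f} > {max_nan_fraction}")
    zero_fraction = float((values == 0).mean())
    if zero_fraction > max_zero_fraction:
        problems.append(f"zero fraction {zero_fraction:.3f} > {max_zero_fraction}")

    return problems


# A batch that satisfies the (assumed) expectations.
good = xr.DataArray(
    np.ones((2, 3)),
    dims=("example", "time"),
    coords={"time": [0, 1, 2]},
)

# A batch with one NaN and no "time" coordinate.
bad_values = np.ones((2, 3))
bad_values[0, 0] = np.nan
bad = xr.DataArray(bad_values, dims=("example", "time"))

expected = {"example": 2, "time": 3}
good_problems = validate_batch(good, expected, ("time",))
bad_problems = validate_batch(bad, expected, ("time",))
```

Returning a list of problems rather than raising on the first failure is a design choice made for this sketch, so that a single pass reports everything wrong with a batch. A framework such as pydantic or attrs would typically express the same rules declaratively on a class instead.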
We decided this would be a good idea in OCF's internal "Data Engineering Big Ideas" meeting on 5th March 2024.
e.g. mypy / pydantic / attrs
Related: