Meta Issue: Open Data PVnet #5

Open
7 tasks
peterdudfield opened this issue Nov 25, 2024 · 18 comments

Comments

@peterdudfield
Contributor

peterdudfield commented Nov 25, 2024

The idea is to make sure PVnet is accessible and usable for open-source users and contributors.

The current problem is that much of the NWP data is private.

For other context, we are moving over from ocf-datapipes to ocf-data-sampler, so I would vote we try to use ocf-data-sampler at all points.

Here's a rough list of the tasks that need doing.

If all these steps are complete, then it will be ready to use for different countries and different geographies

@jcamier
Collaborator

jcamier commented Nov 27, 2024

🌞 Open Source Solar Forecasting Project – Volunteers Needed! 🌞

We're building an open-source solar forecasting pipeline using publicly available data to predict solar generation at the national level, starting with the UK. Tasks include identifying gridded NWP datasets (preferably in Zarr format), creating pipelines for batching data, setting up APIs for PVlive solar generation and capacity data, defining training/testing splits, and benchmarking against existing OCF results. Roles range from data engineers and machine learning enthusiasts to software developers with Python expertise. If you're passionate about renewable energy and open-source collaboration, join us in advancing solar forecasting solutions for global impact! 🌍✨ #opensource #renewableenergy #solarforecasting

@peterdudfield changed the title from "Open Source PVnet" to "Open Data PVnet" on Nov 27, 2024
@jcamier
Collaborator

jcamier commented Nov 28, 2024

@peterdudfield would this work for open source data to use?
https://www.metoffice.gov.uk/services/data/met-office-data-for-reuse/discovery

@peterdudfield
Contributor Author

@jcamier that's probably one option. There are lots of other options, like the free ECMWF variables, GFS, and ICON - https://huggingface.co/datasets/openclimatefix/dwd-icon-eu

@jcamier
Collaborator

jcamier commented Nov 29, 2024

@peterdudfield @Sukh-P what do you guys think of this readme I am creating to help with on-boarding to this volunteer group project?
getting_started.md

@jcamier
Collaborator

jcamier commented Nov 29, 2024

Here is a preview of it...

Solar Forecasting Volunteer Onboarding

Welcome to the Solar Forecasting project! This document will introduce you to the key concepts and knowledge needed to contribute effectively.

Table of Contents

  1. Introduction to Solar Forecasting
  2. What is NWP Data?
  3. Understanding Zarr Format
  4. Target Data: What is UK PVlive?
  5. Basics of Machine Learning for Solar Forecasting
  6. APIs and Data Retrieval
  7. Data Pipelines for Solar Forecasting
  8. Benchmarks and Comparisons
  9. Geographical Adaptability
  10. Key Tools and Technologies
  11. Common Terminology
  12. Expected Knowledge and Skills
  13. How This Project Fits into Renewable Energy

Introduction to Solar Forecasting

Solar forecasting is the process of predicting the amount of solar energy that will be generated over a specific period. Understanding this helps optimize renewable energy systems and integrate them with the grid.


What is NWP Data?

Numerical Weather Prediction (NWP) uses mathematical models of the atmosphere and oceans to forecast weather. NWP data describes predicted atmospheric conditions such as temperature, pressure, wind speed, humidity, precipitation type and amount, cloud cover, and sometimes even surface conditions and air quality. Several of these, cloud cover in particular, are crucial for solar forecasting.

https://en.wikipedia.org/wiki/Numerical_weather_prediction


Understanding Zarr Format

Zarr is a relatively new, cloud-based data format designed to improve access to N-dimensional arrays. It provides an effective way to store large N-dimensional data in the cloud, with access facilitated through predefined chunks. Zarr can be viewed as the cloud-based counterpart to HDF5/NetCDF files, as it follows a similar data model. However, unlike NetCDF or HDF5, which store data in a single file, Zarr organizes data as a directory containing compressed binary files for chunks of data, alongside metadata stored in external JSON files.

The semantic mapping from the NetCDF Data Model to the Zarr Data Model is as follows:

| NetCDF Data Model | Zarr V2 Data Model |
|---|---|
| File | Store |
| Group | Group |
| Variable | Array |
| Attribute | User Attribute |
| Dimension | Not supported as a native feature |

A Zarr array can be stored in any storage system that supports a key/value interface. In this system:

- A key is an ASCII string.
- A value is an arbitrary sequence of bytes.
- Supported operations include:
  - Read: Retrieve the sequence of bytes associated with a key.
  - Write: Set the sequence of bytes associated with a key.
  - Delete: Remove a key/value pair.

Currently, Zarr V2 is the stable version, while Zarr V3 is considered experimental.
https://wiki.earthdata.nasa.gov/display/ESO/Zarr+Format
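
To make the key/value interface above more concrete, here is a minimal sketch in plain Python. It does not use the real `zarr` library: `DictStore` and `read_array` are hypothetical names invented for illustration, the `.zarray` metadata is trimmed to a few fields (a real Zarr V2 store needs more, such as `zarr_format` and `compressor`), and the chunks are left uncompressed.

```python
import json


class DictStore:
    """Toy in-memory key/value store: ASCII string keys, raw byte values."""

    def __init__(self):
        self._data = {}

    def read(self, key: str) -> bytes:
        # Read: retrieve the bytes associated with a key.
        return self._data[key]

    def write(self, key: str, value: bytes) -> None:
        # Write: set the bytes associated with a key.
        key.encode("ascii")  # raises UnicodeEncodeError for non-ASCII keys
        self._data[key] = bytes(value)

    def delete(self, key: str) -> None:
        # Delete: remove a key/value pair.
        del self._data[key]

    def keys(self):
        return sorted(self._data)


def read_array(store: DictStore, name: str) -> list:
    """Reassemble a 1-D uint8 array from its metadata and chunk keys."""
    meta = json.loads(store.read(f"{name}/.zarray"))
    length, chunk_len = meta["shape"][0], meta["chunks"][0]
    n_chunks = -(-length // chunk_len)  # ceiling division
    values = []
    for i in range(n_chunks):
        values.extend(store.read(f"{name}/{i}"))  # bytes iterate as ints
    return values[:length]


store = DictStore()
# Array metadata lives in JSON under a ".zarray" key (fields trimmed here).
metadata = {"shape": [4], "chunks": [2], "dtype": "|u1"}
store.write("temperature/.zarray", json.dumps(metadata).encode("ascii"))
# Each chunk is stored under a key named after its chunk index.
store.write("temperature/0", bytes([10, 11]))
store.write("temperature/1", bytes([12, 13]))

print(store.keys())
print(read_array(store, "temperature"))
```

Because every chunk is its own key, a reader only needs to fetch the chunks that overlap the slice it wants, which is what makes the layout work well on object storage.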


Target Data: What is UK PVlive?

UK PVlive provides national solar generation data, accessible via API. This data serves as a "ground truth" for training and evaluating solar forecasting models.


Basics of Machine Learning for Solar Forecasting

Discover key ML concepts such as data splitting, feature engineering, and model evaluation, all tailored to the solar forecasting domain.
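
As a rough sketch of the data-splitting idea, assuming a chronological split where earlier years are used for training and one later year is held out for testing. The function name `split_by_year` and the sample timestamps are made up for illustration and are not part of any OCF library.

```python
from datetime import datetime, timedelta


def split_by_year(timestamps, test_year):
    """Chronological split: years before test_year train, test_year tests."""
    train = [t for t in timestamps if t.year < test_year]
    test = [t for t in timestamps if t.year == test_year]
    return train, test


# Hypothetical sample times: one every 30 days, starting January 2021.
start = datetime(2021, 1, 1)
timestamps = [start + timedelta(days=30 * i) for i in range(36)]

train, test = split_by_year(timestamps, test_year=2023)
print(len(train), "training samples;", len(test), "test samples")
```

Splitting by time rather than at random keeps the test period strictly after the training period, which avoids leaking near-identical neighbouring weather samples across the two sets.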


APIs and Data Retrieval

Learn how to use APIs to fetch solar generation data and capacity information, critical for building datasets.


Data Pipelines for Solar Forecasting

Explore how pipelines prepare and batch data for machine learning models, making training and testing efficient.


Benchmarks and Comparisons

Understand the importance of benchmarking and how our models compare to existing solutions.


Geographical Adaptability

This project currently focuses on the UK, but it isn't limited to it and will be expanded to other regions in the future. Learn how it can be adapted to other regions and data sources.


Key Tools and Technologies

Familiarize yourself with tools like Python, pandas, and open-source libraries like ocf-data-sampler.


Common Terminology

Learn the meanings of key terms like Grid Supply Point (GSP), solar irradiance, and capacity factors.
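
As an illustration of one of these terms, a capacity factor is commonly taken as average generation over a period divided by installed capacity. The sketch below assumes that definition; the function name and all the numbers are invented for the example and are not real PVlive figures.

```python
def capacity_factor(generation_mw, capacity_mw):
    """Mean generation over the period as a fraction of installed capacity."""
    if capacity_mw <= 0:
        raise ValueError("capacity_mw must be positive")
    if not generation_mw:
        raise ValueError("generation_mw must not be empty")
    mean_generation = sum(generation_mw) / len(generation_mw)
    return mean_generation / capacity_mw


# Hypothetical readings in MW against a hypothetical 12,000 MW fleet.
readings = [0.0, 3000.0, 6000.0, 3000.0]
print(capacity_factor(readings, capacity_mw=12000.0))  # 0.25
```

Dividing by capacity gives a value between 0 and 1 that stays comparable as the installed fleet grows, which is one reason to fetch capacity data alongside generation.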


Expected Knowledge and Skills

An overview of the skills contributors should have or be willing to learn, such as Python programming and data analysis.


How This Project Fits into Renewable Energy

Understand the broader impact of this work and its contribution to a sustainable future.


Thank you for joining us on this journey to advance solar forecasting and renewable energy solutions!

@peterdudfield
Contributor Author

Something like this would be really great.

We might be able to put that in as the Github project home page. Let me try to make it now and give you access @jcamier

@peterdudfield
Contributor Author

Yeah, it's possible to add a large readme to a project - https://github.com/orgs/openclimatefix/projects/36/views/1

@jcamier
Collaborator

jcamier commented Nov 29, 2024

@peterdudfield how do you want me to push a PR for the onboarding markdown? Can you add items to the project, one of which is this markdown file? I could then create a branch and push up a PR for it. I would like to make it as easy as possible to give volunteers context about what we are doing, get them up to speed quickly, and answer a lot of the questions they may have, so they can contribute effectively sooner.

I am used to using Jira boards with epics, themes, stories and then using github and/or gitlab to create branches tied to the stories etc. I have not used Github projects before. Or do you want me to create a PR directly to PVNet for this? I am assuming this is a bit of an additional project at this point that we want to run in parallel to PVNet and then merge into it at a later point once it proves to improve or expand the core PVNet model?

@peterdudfield
Contributor Author

Hi @jcamier, I've invited you to the OCF GitHub, and then I should be able to give you write access to the project. This means you can then add the markdown file for the project.

I would prefer we try GitHub Projects rather than Jira etc., as then it's very close to the GitHub issues.

Yeah, it's an interesting discussion of where we put code for this. We have tried to keep PVNet mainly for ML work, so one idea could be to have a separate repo for "Open Data PVnet".

I would expect that in stage 1 PVNet does not change much; it's more about collecting the right data and training the model. After that, we can definitely try new features in PVNet.

@jcamier
Collaborator

jcamier commented Nov 30, 2024

@peterdudfield I agree. Maybe we create a separate repo, a clone/fork of PVNet, which we call Open Data PVnet? We can work on this in parallel with OCF's own work on PVNet and, depending on the progress made, you can decide at a later point whether to merge it back to main, or just the portions of the work that you find useful. Ideally this allows for experimentation with ideas and innovations you might have wanted to try but didn't have the resources for.

Also, should we start working with ocf-data-sampler and improving/modifying this (if needed) to better handle the open data we will be procuring/working with? As you know, data is the air that AI models breathe, and the better it is, the better the models can ultimately perform. Data acquisition and curation can be our first few quarters of focus in Q1 & Q2 2025 of the volunteers...? All the GitHub project issues (stories, epics) could be created with this type of work in mind. I could then recruit more data engineer volunteers to start with.

Also, I was reading that since GitHub Projects doesn't have themes, epics, stories, and bugs like Jira does, organizations sometimes just create labels for these on each issue. Is this a practice you currently use to organize your issues? I find these labels very useful for organizing, planning, and properly allocating teams/developers on larger projects.

I would like to support you in making OCF's PVNet a best-in-class foundational model for the world to use. I am a Lean Six Sigma Black Belt, have Scrum Master training, and am a technical lead at my current job, and I would ideally like to have some structure to better organize the large group(s) of volunteers for this particular project. The better we organize and onboard the volunteers, the faster we can develop quality code. On many open source projects I have worked with, even recently, I have lost volunteers because they got lost in what to do, as things were not clear to them. However, I want to work with whatever workstream you have and add structure wherever you are comfortable for me to do so.

@peterdudfield
Contributor Author

Thanks @jcamier

Yeah, that's a good one to discuss: whether we make a fork of PVNet for this project. Probably a good idea.

For GitHub Projects, yeah, we can have labels etc. I'm trying to give you write access to the project, so you can edit as appropriate.

@Sukh-P
Member

Sukh-P commented Dec 4, 2024

I have had a go at revising this slightly:

🌞 Open Source Solar Forecasting Project – Volunteers Needed! 🌞

We're building an open-source solar forecasting pipeline using publicly available data to predict solar generation at the national level, starting with the UK. Tasks include identifying gridded Numerical Weather Prediction datasets, downloading this NWP data and transforming it into the preferred Zarr format, acquiring solar generation target data through APIs such as PVlive's solar generation and capacity API, creating pipelines for batching data, ML model experimentation and benchmarking against existing OCF results. Roles range from data engineers and machine learning enthusiasts to software developers with Python expertise. If you're passionate about renewable energy and open-source collaboration, join us in advancing solar forecasting solutions for global impact! 🌍✨ #opensource #renewableenergy #solarforecasting

@Sukh-P
Member

Sukh-P commented Dec 4, 2024

On the task list above I feel for this one:

Identify open source gridded NWP that is already in zarr format. Need to make sure it has enough variables for solar forecasts and enough years (>2) of data. Note that OCF publish satellite data already, which could be used. We try to use at least 1 year for training and 1 year for testing. Better to have more like 5 years of training, but let's see what we can do.

Finding gridded NWP data that is already in zarr format may be tricky. I think it is more likely to be in other formats such as GRIB and would need to be converted into zarr. We could leverage tools such as OCF's nwp-consumer for work like this.

@Sukh-P
Member

Sukh-P commented Dec 4, 2024

Also, should we start working with ocf-data-sampler and improving/modifying this (if needed) to better handle the open data we will be procuring/working with? As you know, data is the air that AI models breathe, and the better it is, the better the models can ultimately perform. Data acquisition and curation can be our first few quarters of focus in Q1 & Q2 2025 of the volunteers...?

@jcamier Yes, I think the key focus in the beginning is going to be the data engineering aspect: finding the appropriate open NWP data sources, then either downloading these into the preferred zarr format or having the right tools to stream this data at a good pace. After that will come using ocf-data-sampler to create samples for ML models from this data, which I imagine will be less work than the first task of getting the data in the right places, in the right formats, with the right tools to work with it.

@peterdudfield
Contributor Author

🌞 Open Source Solar Forecasting Project – Volunteers Needed! 🌞

We're building an open-source solar forecasting pipeline using publicly available data to predict solar generation at the national level, starting with the UK. Tasks include identifying gridded Numerical Weather Prediction datasets, downloading this NWP data and transforming it into the preferred Zarr format, acquiring solar generation target data through APIs such as PVlive's solar generation and capacity API, creating pipelines for batching data and ML model experimentation.

We want to start in the UK, in order to benchmark against OCF results, and then expand to lots of other countries.

Roles range from data engineers and machine learning enthusiasts to software developers with Python expertise. If you're passionate about renewable energy and open-source collaboration, join us in advancing solar forecasting solutions for global impact! 🌍✨ #opensource #renewableenergy #solarforecasting

@peterdudfield changed the title from "Open Data PVnet" to "Meta Issue: Open Data PVnet" on Dec 4, 2024
@peterdudfield transferred this issue from openclimatefix/PVNet on Dec 4, 2024
@jcamier
Collaborator

jcamier commented Dec 18, 2024

@peterdudfield after thinking it over some more, I propose we don't include the architecture overview image in our readme for the time being. Maybe after we have a few more passes at it, we can include it then? However, I am including it here as an artifact we can reference later.
[Image: architecture overview]

And here is the miro board link as well: https://miro.com/app/board/uXjVL2Ugbq8=/

@peterdudfield
Contributor Author

Is it worth putting this on the readme @jcamier ?

@jcamier
Collaborator

jcamier commented Jan 16, 2025

@peterdudfield this is your call. It is not the most professional-looking artifact (good enough for internal purposes though 😄), so it's up to you if you think this would be helpful. I was envisioning sharing this with the open source team during a weekly standup call we could have in the future...
