Merge pull request #23 from openclimatefix/issue/make-testset

Basic Evaluation

peterdudfield authored Dec 19, 2023
2 parents 3dee386 + b1013f4 commit 53e21d2
Showing 23 changed files with 3,529 additions and 33 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/pytest.yaml
@@ -11,4 +11,5 @@ jobs:
# pytest-cov looks at this folder
pytest_cov_dir: "quartz_solar_forecast"
os_list: '["ubuntu-latest"]'
python-version: "['3.10','3.11']"
extra_commands: echo "HF_TOKEN=${{ vars.HF_TOKEN }}" > .env
37 changes: 35 additions & 2 deletions README.md
@@ -4,16 +4,17 @@
<!-- ALL-CONTRIBUTORS-BADGE:END -->

The aim of the project is to build an open source PV forecast that is free and easy to use.
The forecast provides the expected generation in `kW` for 0 to 48 hours ahead for a single PV site.

Open Climate Fix also provides a commercial PV forecast; please get in touch at [email protected]

The current model uses GFS or ICON NWPs to predict the solar generation at a site.


```python
from quartz_solar_forecast.forecast import run_forecast
from quartz_solar_forecast.pydantic_models import PVSite

# make a pv site object
site = PVSite(latitude=51.75, longitude=-1.25, capacity_kwp=1.25)

# run model, uses ICON NWP data by default
@@ -50,6 +51,38 @@ The 9 NWP variables, from the Open-Meteo documentation, are mentioned above with their appropriate units.
- The model is trained on [UK MetOffice](https://www.metoffice.gov.uk/services/data/met-office-weather-datahub) NWPs, but when running inference we use [GFS](https://www.ncei.noaa.gov/products/weather-climate-models/global-forecast) data from [Open-meteo](https://open-meteo.com/). The differences between GFS and UK MetOffice data could lead to some odd behaviours.
- It looks like the GFS data on Open-Meteo is only available for free for the last 3 months.

## Evaluation

To evaluate the model we use the [UK PV](https://huggingface.co/datasets/openclimatefix/uk_pv) dataset and the [ICON NWP](https://huggingface.co/datasets/openclimatefix/dwd-icon-eu) dataset.
All the data is publicly available, and the evaluation script can be run with the following command:

```bash
python scripts/run_evaluation.py
```

The test dataset we used is defined in `quartz_solar_forecast/dataset/testset.csv`.
It contains 50 PV sites, each with 50 unique timestamps. The data is from 2021.
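
For reference, the test set is a plain CSV with `pv_id` and `timestamp` columns (written by `make_test_set.py`), so it can be inspected directly. A minimal sketch, assuming it is run from the repository root:

```python
import pandas as pd

# load the test set written by quartz_solar_forecast/dataset/make_test_set.py
test_set = pd.read_csv(
    "quartz_solar_forecast/dataset/testset.csv", parse_dates=["timestamp"]
)

# one row per (pv_id, timestamp) pair: 50 sites with 50 timestamps each
print(test_set["pv_id"].nunique(), "sites,", len(test_set), "rows")
```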

The results of the evaluation are as follows.
The overall MAE is 0.1906 kW across all forecast horizons.

| Horizon (hours) | MAE [kW]      | MAE [%] |
|-----------------|---------------|---------|
| 0               | 0.202 ± 0.03  | 6.2     |
| 1               | 0.211 ± 0.03  | 6.4     |
| 2               | 0.216 ± 0.03  | 6.5     |
| 3 - 4           | 0.211 ± 0.02  | 6.3     |
| 5 - 8           | 0.191 ± 0.01  | 6.0     |
| 9 - 16          | 0.161 ± 0.01  | 5.0     |
| 17 - 24         | 0.173 ± 0.01  | 5.3     |
| 24 - 48         | 0.201 ± 0.01  | 6.1     |


Notes:
- The MAE in % is the MAE divided by the capacity of the PV site (one way to compute this is sketched below). We acknowledge there are a number of different ways to do this.
- It is slightly surprising that the 0-hour and the 24-48 hour forecast horizons have a similar MAE.
This may be because the model is trained expecting live PV data, but currently this project provides no live PV data.
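
One way to compute the capacity-normalised MAE, as a minimal sketch with illustrative numbers (not taken from the evaluation above):

```python
import numpy as np

# illustrative values only, not taken from the evaluation
y_true = np.array([1.0, 1.2, 0.9])  # observed generation in kW
y_pred = np.array([0.8, 1.3, 1.0])  # forecast generation in kW
capacity_kwp = 3.0                  # hypothetical site capacity

mae_kw = np.mean(np.abs(y_pred - y_true))
mae_pct = 100 * mae_kw / capacity_kwp  # normalise by the site's capacity
print(f"MAE = {mae_kw:.3f} kW = {mae_pct:.1f}% of capacity")
```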

## Abbreviations

- NWP: Numerical Weather Predictions
9 changes: 7 additions & 2 deletions quartz_solar_forecast/data.py
@@ -91,6 +91,12 @@ def get_nwp(site: PVSite, ts: datetime, nwp_source: str = "icon") -> xr.Dataset:
        }
    )
    df = df.set_index("time")
    data_xr = format_nwp_data(df, nwp_source, site)

    return data_xr


def format_nwp_data(df: pd.DataFrame, nwp_source: str, site: PVSite):
    """Format an NWP DataFrame (time index, one column per variable) into an xarray object."""
    data_xr = xr.DataArray(
        data=df.values,
        dims=["step", "variable"],
@@ -103,11 +109,10 @@ def get_nwp(site: PVSite, ts: datetime, nwp_source: str = "icon") -> xr.Dataset:
    data_xr = data_xr.assign_coords(
        {"x": [site.longitude], "y": [site.latitude], "time": [df.index[0]]}
    )

    return data_xr


def make_pv_data(site: PVSite, ts: pd.Timestamp) -> xr.Dataset:
"""
Make fake PV data for the site
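
For orientation, a hedged sketch of calling the new `format_nwp_data` helper directly, based only on the fragment shown above (the middle of the function is elided in the diff, so the exact return type may differ; the example weather-variable names are illustrative):

```python
import pandas as pd

from quartz_solar_forecast.data import format_nwp_data
from quartz_solar_forecast.pydantic_models import PVSite

site = PVSite(latitude=51.75, longitude=-1.25, capacity_kwp=1.25)

# hourly NWP steps indexed by time, one column per weather variable
# (the variable names here are illustrative, not the model's real inputs)
df = pd.DataFrame(
    {"t": [10.0, 11.0], "dswrf": [250.0, 300.0]},
    index=pd.to_datetime(["2021-06-01 00:00", "2021-06-01 01:00"]),
)
df.index.name = "time"

# builds an array with dims ("step", "variable"), attaching the site's
# longitude/latitude and the first timestamp as coordinates
data_xr = format_nwp_data(df, "icon", site)
print(data_xr)
```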
133 changes: 133 additions & 0 deletions quartz_solar_forecast/dataset/make_test_set.py
@@ -0,0 +1,133 @@
"""
Make a random test set
This takes a random subset of times and for various pv ids and makes a test set
There is an option to odmit timestamps that don't exsits in the ICON dataset:
https://huggingface.co/datasets/openclimatefix/dwd-icon-eu/tree/main/data
"""
import os
from typing import Optional

import numpy as np
import pandas as pd

from quartz_solar_forecast.eval.utils import make_hf_filename
from huggingface_hub import HfFileSystem

test_start_date = pd.Timestamp("2021-01-01")
test_end_date = pd.Timestamp("2022-01-01")

# These PV IDs have been chosen from the entire training set.
pv_ids = [
    9531, 7174, 6872, 7386, 13607, 6330, 26841, 6665, 4045, 26846,
    6494, 7834, 3543, 7093, 3864, 8412, 3454, 9765, 10585, 26942,
    7721, 26804, 7551, 26861, 7568, 7338, 7410, 6967, 16480, 7241,
    7593, 7557, 7757, 3094, 6800, 26905, 5512, 26840, 7595, 5803,
    26876, 7846, 26786, 7580, 6629, 16477, 3489, 26796, 12761, 26903,
]

np.random.seed(42)


def make_test_set(
    output_file_name: Optional[str] = None,
    number_of_samples_per_system: int = 50,
    check_hf_files: bool = False,
):
    """
    Make a test set of random times and pv ids

    :param output_file_name: the name of the file to write the test set to
    :param number_of_samples_per_system: the number of samples to take per pv id
    :param check_hf_files: if True, keep only timestamps whose ICON files exist on Hugging Face
    """

    if output_file_name is None:
        # get the folder where this file is
        output_file_name = os.path.dirname(os.path.abspath(__file__)) + "/testset.csv"

    ts = pd.date_range(start=test_start_date, end=test_end_date, freq="15min")

    # check that the files are in HF for ICON
    if check_hf_files:
        ts = filter_timestamps_if_hf_files_exists(ts)

    test_set = []
    for pv_id in pv_ids:
        ts_choice = ts[np.random.choice(len(ts), size=number_of_samples_per_system, replace=False)]
        test_set.append(pd.DataFrame({"pv_id": pv_id, "timestamp": ts_choice}))
    test_set = pd.concat(test_set)
    test_set.to_csv(output_file_name, index=False)

    return test_set


def filter_timestamps_if_hf_files_exists(timestamps_full: pd.DatetimeIndex):
    """
    Keep only the timestamps for which the Hugging Face files exist

    We are checking if the timestamps, rounded down to the nearest 6 hours,
    exist in
    https://huggingface.co/datasets/openclimatefix/dwd-icon-eu/tree/main/data
    """
    timestamps = []
    fs = HfFileSystem()
    # print(fs.ls("datasets/openclimatefix/dwd-icon-eu/data/2022/4/11/", detail=False))
    for timestamp in timestamps_full:
        timestamp_floor = timestamp.floor("6H")
        _, huggingface_file = make_hf_filename(timestamp_floor)
        # drop the first 14 characters (a fixed path prefix) before checking existence
        huggingface_file = huggingface_file[14:]

        if fs.exists(huggingface_file):
            timestamps.append(timestamp)
        else:
            print(f"Skipping {timestamp} because {huggingface_file} does not exist")

    timestamps = pd.DatetimeIndex(timestamps)
    return timestamps


# To run the script, uncomment the following line and run this file
# make_test_set(check_hf_files=True)
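
For example, a hedged usage sketch of the function above (the output filename and sample count are illustrative):

```python
# write a smaller test set to a custom location, skipping the HF existence check
df = make_test_set(output_file_name="small_testset.csv", number_of_samples_per_system=10)
print(df.head())
```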