Feat/Example notebook for the regression models (#2039)
* feat: example notebook for the regression models and new dataset (energy consumption and weather in Zurich, between 2015 and 2022)

* fix: tests, some dataset widths were missing

* feat: updated changelog

* fix: to keep the API uniform, Zurich energy consumption and weather was split into two datasets. Energy consumption was added to the darts repo

* fix: changed the way datasets are loaded, added an illustration for multi_models=True

* fix: tweaked notebook

* feat: grouped datasets and their widths into a single variable to improve readability

* Apply suggestions from code review

Co-authored-by: Dennis Bader <[email protected]>

* fix: simplified API to load the EnergyConsumptionZurich dataset, updated notebook accordingly

* fix: remove the obsolete dataset from the tests

* blabla

* update dataset

* update notebook p1

* update regression model notebook

* notebook last fixes

* fix: typo

* add regression model example test to merge workflow

---------

Co-authored-by: Dennis Bader <[email protected]>
madtoinou and dennisbader authored Nov 11, 2023
1 parent da049e5 commit 09300d9
Showing 11 changed files with 1,269 additions and 36 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/merge.yml
@@ -87,7 +87,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
example-name: [00-quickstart.ipynb, 01-multi-time-series-and-covariates.ipynb, 02-data-processing.ipynb, 03-FFT-examples.ipynb, 04-RNN-examples.ipynb, 05-TCN-examples.ipynb, 06-Transformer-examples.ipynb, 07-NBEATS-examples.ipynb, 08-DeepAR-examples.ipynb, 09-DeepTCN-examples.ipynb, 10-Kalman-filter-examples.ipynb, 11-GP-filter-examples.ipynb, 12-Dynamic-Time-Warping-example.ipynb, 13-TFT-examples.ipynb, 15-static-covariates.ipynb, 16-hierarchical-reconciliation.ipynb, 18-TiDE-examples.ipynb, 19-EnsembleModel-examples.ipynb]
example-name: [00-quickstart.ipynb, 01-multi-time-series-and-covariates.ipynb, 02-data-processing.ipynb, 03-FFT-examples.ipynb, 04-RNN-examples.ipynb, 05-TCN-examples.ipynb, 06-Transformer-examples.ipynb, 07-NBEATS-examples.ipynb, 08-DeepAR-examples.ipynb, 09-DeepTCN-examples.ipynb, 10-Kalman-filter-examples.ipynb, 11-GP-filter-examples.ipynb, 12-Dynamic-Time-Warping-example.ipynb, 13-TFT-examples.ipynb, 15-static-covariates.ipynb, 16-hierarchical-reconciliation.ipynb, 18-TiDE-examples.ipynb, 19-EnsembleModel-examples.ipynb, 20-RegressionModel-examples.ipynb]
steps:
- name: "1. Clone repository"
uses: actions/checkout@v2
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -16,11 +16,13 @@ but cannot always guarantee backwards compatibility. Changes that may **break co
- Added callback `darts.utils.callbacks.TFMProgressBar` to customize at which model stages to display the progress bar. [#2020](https://github.com/unit8co/darts/pull/2020) by [Dennis Bader](https://github.com/dennisbader).
- Improvements to documentation:
- Adapted the example notebooks to properly apply data transformers and avoid look-ahead bias. [#2020](https://github.com/unit8co/darts/pull/2020) by [Samriddhi Singh](https://github.com/SimTheGreat).
- New example notebook for the `RegressionModels` explaining features such as (component-specific) lags, `output_chunk_length` in relation to `multi_models`, multivariate support, and more. [#2039](https://github.com/unit8co/darts/pull/2039) by [Antoine Madrona](https://github.com/madtoinou).
- Improvements to Regression Models:
- `XGBModel` now leverages XGBoost's native Quantile Regression support that was released in version 2.0.0 for improved probabilistic forecasts. [#2051](https://github.com/unit8co/darts/pull/2051) by [Dennis Bader](https://github.com/dennisbader).
- Other improvements:
- Added support for time index time zone conversion with parameter `tz` before generating/computing holidays and datetime attributes. Support was added to all Time Axis Encoders (standalone encoders and forecasting models' `add_encoders`), time series generation utils functions `holidays_timeseries()` and `datetime_attribute_timeseries()`, and `TimeSeries` methods `add_datetime_attribute()` and `add_holidays()`. [#2054](https://github.com/unit8co/darts/pull/2054) by [Dennis Bader](https://github.com/dennisbader).
- Added optional keyword arguments dict `kwargs` to `ExponentialSmoothing` that will be passed to the constructor of the underlying `statsmodels.tsa.holtwinters.ExponentialSmoothing` model. [#2059](https://github.com/unit8co/darts/pull/2059) by [Antoine Madrona](https://github.com/madtoinou).
- Added new dataset `ElectricityConsumptionZurichDataset`: The dataset contains the electricity consumption of households in Zurich, Switzerland from 2015-2022 on different grid levels. We also added weather measurements for Zurich which can be used as covariates for modelling. [#2039](https://github.com/unit8co/darts/pull/2039) by [Antoine Madrona](https://github.com/madtoinou) and [Dennis Bader](https://github.com/dennisbader).

**Fixed**
- Fixed a bug when calling optimized `historical_forecasts()` for a `RegressionModel` trained with unequal component-specific lags. [#2040](https://github.com/unit8co/darts/pull/2040) by [Antoine Madrona](https://github.com/madtoinou).
111 changes: 110 additions & 1 deletion darts/datasets/__init__.py
@@ -5,8 +5,9 @@
A few popular time series datasets
"""

import os
from pathlib import Path
from typing import List
from typing import List, Literal, Optional

import numpy as np
import pandas as pd
@@ -813,3 +814,111 @@ def _to_multi_series(self, series: pd.DataFrame) -> List[TimeSeries]:
Load the WeatherDataset dataset as a list of univariate timeseries, one for weather indicator.
"""
return [TimeSeries.from_series(series[label]) for label in series]


class ElectricityConsumptionZurichDataset(DatasetLoaderCSV):
    """
    Electricity Consumption of households & SMEs (low voltage) and businesses & services (medium voltage) in the
    city of Zurich [1]_, with values recorded every 15 minutes.

    The electricity consumption is combined with weather measurements recorded by three different
    stations in the city of Zurich with an hourly frequency [2]_. The missing time stamps are filled with NaN.
    The original weather data is recorded every hour. Before adding the features to the electricity consumption,
    the data is resampled to a 15-minute frequency, and missing values are interpolated.

    To simplify the dataset, the measurements from the Zch_Schimmelstrasse and Zch_Rosengartenstrasse weather
    stations are discarded to keep only the data recorded in the Zch_Stampfenbachstrasse station.

    Both dataset sources are updated continuously, but this dataset only retains values between 2015 and 2022.
    The time index was converted from CET time zone to UTC.

    Components Descriptions:

    * Value_NE5 : Households & SMEs electricity consumption (low voltage, grid level 7) in kWh
    * Value_NE7 : Business and services electricity consumption (medium voltage, grid level 5) in kWh
    * Hr [%Hr] : Relative humidity
    * RainDur [min] : Duration of precipitation (divided by 4 for conversion from hourly to quarter-hourly records)
    * T [°C] : Temperature
    * WD [°] : Wind direction
    * WVv [m/s] : Wind vector speed
    * p [hPa] : Air pressure
    * WVs [m/s] : Wind scalar speed
    * StrGlo [W/m2] : Global solar irradiation

    Note: before 2018, the scalar speeds were calculated from the 30 minutes vector data.

    References
    ----------
    .. [1] https://data.stadt-zuerich.ch/dataset/ewz_stromabgabe_netzebenen_stadt_zuerich
    .. [2] https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte
    """

    def __init__(self):
        def pre_process_dataset(dataset_path):
            """Restrict the time axis and add the weather data"""
            df = pd.read_csv(dataset_path, index_col=0)
            # convert time index
            df.index = (
                pd.DatetimeIndex(df.index, tz="CET").tz_convert("UTC").tz_localize(None)
            )
            # extract pre-determined period
            df = df.loc[
                (pd.Timestamp("2015-01-01") <= df.index)
                & (df.index <= pd.Timestamp("2022-12-31"))
            ]
            # download and preprocess the weather information
            df_weather = self._download_weather_data()
            # add weather data as additional features
            df = pd.concat([df, df_weather], axis=1)
            # interpolate weather data
            df = df.interpolate()
            # rain duration is given in minutes per hour -> divide by 4 to convert
            # hourly records to quarter-hourly records
            df["RainDur [min]"] = df["RainDur [min]"] / 4

            # round Electricity cols to 4 decimals, other columns to 2 decimals
            cols_precise = ["Value_NE5", "Value_NE7"]
            df = df.round(
                decimals={col: (4 if col in cols_precise else 2) for col in df.columns}
            )

            # export the dataset
            df.index.name = "Timestamp"
            df.to_csv(self._get_path_dataset())

        # hash value for dataset with weather data
        super().__init__(
            metadata=DatasetLoaderMetadata(
                "zurich_electricity_consumption.csv",
                uri=(
                    "https://data.stadt-zuerich.ch/dataset/"
                    "ewz_stromabgabe_netzebenen_stadt_zuerich/"
                    "download/ewz_stromabgabe_netzebenen_stadt_zuerich.csv"
                ),
                hash="c2fea1a0974611ff1c276abcc1d34619",
                header_time="Timestamp",
                freq="15min",
                pre_process_csv_fn=pre_process_dataset,
            )
        )

    @staticmethod
    def _download_weather_data():
        """Concatenate the yearly csv files into a single dataframe and reshape it"""
        # download the csv from the url
        base_url = "https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte/download/"
        filenames = [f"ugz_ogd_meteo_h1_{year}.csv" for year in range(2015, 2023)]
        df = pd.concat([pd.read_csv(base_url + fname) for fname in filenames])
        # retain only one weather station
        df = df.loc[df["Standort"] == "Zch_Stampfenbachstrasse"]
        # pivot the df to get all measurements as columns
        df["param_name"] = df["Parameter"] + " [" + df["Einheit"] + "]"
        df = df.pivot(index="Datum", columns="param_name", values="Wert")
        # convert time index from CET to UTC and extract the required time range
        df.index = (
            pd.DatetimeIndex(df.index, tz="CET").tz_convert("UTC").tz_localize(None)
        )
        df = df.loc[
            (pd.Timestamp("2015-01-01") <= df.index)
            & (df.index <= pd.Timestamp("2022-12-31"))
        ]
        return df
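
The pre-processing above places hourly weather readings on the 15-minute electricity grid, interpolates the gaps, and divides `RainDur [min]` by 4. The arithmetic can be sketched in plain Python as follows. This is only an illustration: the helper `interpolate_gaps` and the sample values are hypothetical, and the dataset itself relies on `pd.concat` and `DataFrame.interpolate` rather than this hand-written loop.

```python
def interpolate_gaps(values):
    """Linearly fill None gaps that lie between two known values."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Hypothetical hourly rain durations (minutes of rain per hour) at 00:00, 01:00, 02:00.
hourly_rain_minutes = [0.0, 60.0, 20.0]

# Place the hourly readings on a 15-minute grid; the slots in between are missing.
quarter_hourly = []
for value in hourly_rain_minutes[:-1]:
    quarter_hourly.extend([value, None, None, None])
quarter_hourly.append(hourly_rain_minutes[-1])

# Fill the gaps, then rescale: a per-hour duration is spread over four 15-minute slots.
rain_per_quarter = [round(v / 4, 2) for v in interpolate_gaps(quarter_hourly)]
print(rain_per_quarter)
```

After the division no slot exceeds 15 minutes, which is the longest it can rain within a quarter of an hour.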
18 changes: 15 additions & 3 deletions darts/datasets/dataset_loaders.py
@@ -31,8 +31,10 @@ class DatasetLoaderMetadata:
format_time: Optional[str] = None
# used to indicate the freq when we already know it
freq: Optional[str] = None
# a custom function to handling non-csv based datasets
# a custom function handling non-csv based datasets
pre_process_zipped_csv_fn: Optional[Callable] = None
# a custom function handling csv based datasets
pre_process_csv_fn: Optional[Callable] = None
# multivariate
multivariate: Optional[bool] = None

@@ -49,7 +51,9 @@ class DatasetLoader(ABC):

_DEFAULT_DIRECTORY = Path(os.path.join(Path.home(), Path(".darts/datasets/")))

def __init__(self, metadata: DatasetLoaderMetadata, root_path: Path = None):
def __init__(
self, metadata: DatasetLoaderMetadata, root_path: Optional[Path] = None
):
self._metadata: DatasetLoaderMetadata = metadata
if root_path is None:
self._root_path: Path = DatasetLoader._DEFAULT_DIRECTORY
@@ -131,7 +135,13 @@ def _download_dataset(self):
"Could not download the dataset. Reason:" + e.__repr__()
) from None

if self._metadata.pre_process_csv_fn is not None:
self._metadata.pre_process_csv_fn(self._get_path_dataset())

def _download_zip_dataset(self):
if self._metadata.pre_process_csv_fn:
logger.warning("Loading a ZIP file does not use the pre_process_csv_fn")

os.makedirs(self._root_path, exist_ok=True)
try:
request = requests.get(self._metadata.uri)
@@ -186,7 +196,9 @@ def _format_time_column(self, df):


class DatasetLoaderCSV(DatasetLoader):
def __init__(self, metadata: DatasetLoaderMetadata, root_path: Path = None):
def __init__(
self, metadata: DatasetLoaderMetadata, root_path: Optional[Path] = None
):
super().__init__(metadata, root_path)

def _load_from_disk(
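
The change to `_download_dataset` runs an optional `pre_process_csv_fn` on the file right after it has been written to disk. A minimal, standard-library sketch of that hook pattern is given below. `LoaderMetadata`, `download`, and `keep_2015_onwards` are hypothetical stand-ins invented for this illustration (no network access, no hash check); they are not the darts classes.

```python
import csv
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class LoaderMetadata:
    """Hypothetical, reduced stand-in for DatasetLoaderMetadata."""

    name: str
    # optional hook that rewrites the downloaded CSV in place
    pre_process_csv_fn: Optional[Callable[[Path], None]] = None


def download(metadata: LoaderMetadata, root: Path, raw_text: str) -> Path:
    """Write the 'downloaded' text to disk, then run the optional hook on the file."""
    path = root / metadata.name
    path.write_text(raw_text)
    if metadata.pre_process_csv_fn is not None:
        metadata.pre_process_csv_fn(path)
    return path


def keep_2015_onwards(path: Path) -> None:
    """Example hook: rewrite the CSV in place, dropping rows dated before 2015."""
    with path.open(newline="") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    kept = [row for row in body if row[0] >= "2015-01-01"]
    with path.open("w", newline="") as f:
        csv.writer(f).writerows([header, *kept])


raw = "Timestamp,Value\n2014-12-31,1.0\n2015-01-01,2.0\n"
with tempfile.TemporaryDirectory() as tmp:
    meta = LoaderMetadata("demo.csv", pre_process_csv_fn=keep_2015_onwards)
    result = download(meta, Path(tmp), raw).read_text().splitlines()
print(result)
```

Because the hook receives only the file path, the loader does not need to know what the pre-processing does; it only has to call the hook before the file is read back.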
62 changes: 31 additions & 31 deletions darts/tests/datasets/test_dataset_loaders.py
@@ -10,6 +10,7 @@
AirPassengersDataset,
AusBeerDataset,
AustralianTourismDataset,
ElectricityConsumptionZurichDataset,
ElectricityDataset,
EnergyDataset,
ETTh1Dataset,
@@ -40,37 +41,36 @@
DatasetLoadingException,
)

datasets = [
AirPassengersDataset,
AusBeerDataset,
AustralianTourismDataset,
EnergyDataset,
HeartRateDataset,
IceCreamHeaterDataset,
MonthlyMilkDataset,
SunspotsDataset,
TaylorDataset,
TemperatureDataset,
USGasolineDataset,
WineDataset,
WoolyDataset,
GasRateCO2Dataset,
MonthlyMilkIncompleteDataset,
ETTh1Dataset,
ETTh2Dataset,
ETTm1Dataset,
ETTm2Dataset,
ElectricityDataset,
UberTLCDataset,
ILINetDataset,
ExchangeRateDataset,
TrafficDataset,
WeatherDataset,
]

_DEFAULT_PATH_TEST = _DEFAULT_PATH + "/tests"

width_datasets = [1, 1, 96, 28, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 7, 7, 7, 7, 370, 262]
datasets_with_width = [
(AirPassengersDataset, 1),
(AusBeerDataset, 1),
(AustralianTourismDataset, 96),
(EnergyDataset, 28),
(HeartRateDataset, 1),
(IceCreamHeaterDataset, 2),
(MonthlyMilkDataset, 1),
(SunspotsDataset, 1),
(TaylorDataset, 1),
(TemperatureDataset, 1),
(USGasolineDataset, 1),
(WineDataset, 1),
(WoolyDataset, 1),
(GasRateCO2Dataset, 2),
(MonthlyMilkIncompleteDataset, 1),
(ETTh1Dataset, 7),
(ETTh2Dataset, 7),
(ETTm1Dataset, 7),
(ETTm2Dataset, 7),
(ElectricityDataset, 370),
(UberTLCDataset, 262),
(ILINetDataset, 11),
(ExchangeRateDataset, 8),
(TrafficDataset, 862),
(WeatherDataset, 21),
(ElectricityConsumptionZurichDataset, 10),
]

wrong_hash_dataset = DatasetLoaderCSV(
metadata=DatasetLoaderMetadata(
@@ -135,9 +135,9 @@ def tmp_dir_dataset():

class TestDatasetLoader:
@pytest.mark.slow
@pytest.mark.parametrize("dataset_config", zip(width_datasets, datasets))
@pytest.mark.parametrize("dataset_config", datasets_with_width)
def test_ok_dataset(self, dataset_config, tmp_dir_dataset):
width, dataset_cls = dataset_config
dataset_cls, width = dataset_config
dataset = dataset_cls()
assert dataset._DEFAULT_DIRECTORY == tmp_dir_dataset
ts: TimeSeries = dataset.load()
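
The removed lists held 25 dataset classes but only 21 widths, and `zip` stops at the shorter input, so the last four datasets were never parametrized. The short sketch below reproduces that effect with hypothetical string stand-ins for the dataset classes and shows how pairing each dataset with its width avoids it.

```python
# Hypothetical stand-ins for the dataset classes and their expected widths.
datasets = ["AirPassengers", "AusBeer", "ILINet", "Traffic"]
widths = [1, 1]  # the widths for the last two datasets were never added

# zip() stops at the shorter input, so two datasets silently drop out of the test matrix.
paired_by_zip = list(zip(widths, datasets))

# Keeping each dataset next to its width makes a missing entry impossible to overlook.
datasets_with_width = [
    ("AirPassengers", 1),
    ("AusBeer", 1),
    ("ILINet", 11),
    ("Traffic", 862),
]
print(len(paired_by_zip), len(datasets_with_width))
```

With the paired list, forgetting a width produces a malformed tuple that fails loudly when it is unpacked, instead of a shorter test matrix.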
10 changes: 10 additions & 0 deletions docs/source/examples.rst
@@ -76,6 +76,16 @@ with Darts using the Optuna library for hyperparameter optimization.

examples/17-hyperparameter-optimization.ipynb

Regression Models
=================

Regression models example notebook:

.. toctree::
:maxdepth: 1

examples/20-RegressionModel-examples.ipynb


Fast Fourier Transform
======================
1,100 changes: 1,100 additions & 0 deletions examples/20-RegressionModel-examples.ipynb

Large diffs are not rendered by default.

Binary file added examples/static/images/multi_model_ocl2.png
Binary file added examples/static/images/regression_model_train.png
Binary file added examples/static/images/single_model_ocl2.png
Binary file added examples/static/images/single_model_ocl3.png
