From 313af0eaae1b3f8d093f39268d2757e388097982 Mon Sep 17 00:00:00 2001
From: GTimothee <39728445+GTimothee@users.noreply.github.com>
Date: Tue, 3 Oct 2023 10:46:22 +0200
Subject: [PATCH 1/5] Update 03_demand_forecasting.qmd with the base of a tutorial

---
 docs/tutorials/03_demand_forecasting.qmd | 87 +++++++++++++++++++++++-
 1 file changed, 86 insertions(+), 1 deletion(-)

diff --git a/docs/tutorials/03_demand_forecasting.qmd b/docs/tutorials/03_demand_forecasting.qmd
index e1d8dc53..ff455e4f 100644
--- a/docs/tutorials/03_demand_forecasting.qmd
+++ b/docs/tutorials/03_demand_forecasting.qmd
@@ -7,4 +7,89 @@ number-sections: true
 number-depth: 2
 ---

-Coming soon... \ No newline at end of file
+Timetk makes it very fast to generate features from the timestamps in your data. This applied tutorial showcases how easy time series forecasting is with `pytimetk`, in particular the use of:

- `tk.augment_timeseries_signature()`: Add 29 time series features to a DataFrame.

Load the following packages before proceeding with this tutorial.

```{python}
import pytimetk as tk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
```

The first thing we want to do is to load a sample dataset. We will use the m4_daily dataset for this tutorial. This dataset can be used for forecasting a `value` for each `id`, given a daily timestamp.

```{python}
dset = tk.load_dataset('m4_daily', parse_dates = ['date'])
dset.head()
```

In the following cell, we create a regular monthly time index, so that each id has one value per month (`freq = "MS"` aggregates the daily values to month start). If an id has no data for a particular month, the value is set to NaN.
```{python}
value_df = dset \
    .groupby(['id']) \
    .summarize_by_time(
        date_column = 'date',
        value_column = 'value',
        freq = "MS",
        agg_func = 'sum',
        wide_format = False
    )

# For this example we use only years 2014 and 2015, so that we have just enough data for creating a train set and a test set.
value_df = value_df[value_df.date.dt.year.isin([2014, 2015])]
value_df.head()
```

Let us now reshape the DataFrame so that it has one column per id. This DataFrame represents our target variable y. We call it Y as it is in fact a stack of target variables, one for each id.

```{python}
Y = value_df.set_index(['id', 'date']).sort_index().unstack(['id'])
Y[('value','D10')].plot() # plots the target variable for id D10
```

Now that we have our target, we want to produce the features that will help us predict it. For now we only have the timestamp, and we would like to extract useful information from it so that we can train a regressor on this data. Useful information would include, for example, the year or the quarter at prediction time. This is where `tk.augment_timeseries_signature` comes in handy: it generates 29 useful time series features from our timestamps.

```{python}
X = value_df.drop_duplicates(subset=['date']).drop(columns=['id', 'value']).augment_timeseries_signature(date_column = 'date').drop(columns=['date'])

# Some of the created features are categorical. We cannot use them directly to fit a regressor, so we turn them into dummy variables:
X = pd.get_dummies(X, columns=['date_year', 'date_year_iso', 'date_quarteryear', 'date_month_lbl', 'date_wday_lbl', 'date_am_pm'], drop_first=True)

# inspect the resulting features
X.head()
```

The preprocessing work is done; we can now train a basic regression model. We choose a random forest regressor here, as it is pretty good at handling both numerical and categorical data.
```{python}
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

Y = Y.fillna(method='ffill').fillna(method='bfill')
X['date_yweek'] = X['date_yweek'].astype(int)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, shuffle=False, train_size=.5)

model = RandomForestRegressor()
model = model.fit(X_train, Y_train)
```

We can inspect how the training went by looking at the predictions on the train set. We can see that the random forest worked pretty well and was able to get close to the ground truth.

```{python}
import matplotlib.pyplot as plt
Y_train[('value','D10')].plot(label='ground truth')
pd.DataFrame(model.predict(X_train), index=Y_train.index, columns=Y_train.columns)[('value','D10')].plot(label='predictions')
plt.legend()
plt.show()
```

We can now see how well it performs on the test set:

```{python}
import matplotlib.pyplot as plt
Y_test[('value','D10')].plot(label='ground truth')
pd.DataFrame(model.predict(X_test), index=Y_test.index, columns=Y_test.columns)[('value','D10')].plot(label='predictions')
plt.legend()
plt.show()
```

In this tutorial, we showed how the `tk.augment_timeseries_signature()` function can be used to effortlessly extract useful features from the date index. The features proved to be relevant: we can use them directly to train a regression model and get correct results for the task of time series forecasting. Note that we only used a few months of data for the training set and did not tune any hyperparameter of the random forest, which shows how meaningful the produced time features are.
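Beyond eyeballing the train and test plots, the quality of the fit can be quantified with a metric such as `sklearn.metrics.r2_score`, which also handles multi-output targets like our `Y`. A minimal sketch on made-up arrays (the numbers below are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy ground truth and predictions for two series (columns), mimicking the
# multi-output setup where Y has one column per id.
y_true = np.array([[3.0, 10.0], [4.0, 12.0], [5.0, 14.0], [6.0, 16.0]])
y_pred = np.array([[2.8, 10.5], [4.1, 11.5], [5.2, 14.2], [5.9, 16.3]])

# By default the score is averaged uniformly over the output columns.
score = r2_score(y_true, y_pred)
print(score)
```

A score close to 1 means the predictions track the ground truth closely, while a score at or below 0 means the model does no better than predicting the mean.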
From 73e7b7e19ae2c27ebc3762fb9fef163c869f2bea Mon Sep 17 00:00:00 2001
From: GTimothee <39728445+GTimothee@users.noreply.github.com>
Date: Thu, 5 Oct 2023 00:11:56 +0200
Subject: [PATCH 2/5] Update 03_demand_forecasting.qmd

- Using walmart dataset
- Added usage of plot_timeseries

---
 docs/tutorials/03_demand_forecasting.qmd | 194 +++++++++++++++++------
 1 file changed, 147 insertions(+), 47 deletions(-)

diff --git a/docs/tutorials/03_demand_forecasting.qmd b/docs/tutorials/03_demand_forecasting.qmd
index ff455e4f..9b63b4ea 100644
--- a/docs/tutorials/03_demand_forecasting.qmd
+++ b/docs/tutorials/03_demand_forecasting.qmd
@@ -10,86 +10,186 @@ number-depth: 2

Timetk makes it very fast to generate features from the timestamps in your data. This applied tutorial showcases how easy time series forecasting is with `pytimetk`, in particular the use of:

 - `tk.augment_timeseries_signature()`: Add 29 time series features to a DataFrame.
+- `tk.plot_timeseries()`: Plots a time series.

Load the following packages before proceeding with this tutorial.

+# Data preprocessing
+
```{python}
-import pytimetk as tk
 import pandas as pd
+import pytimetk as tk
 import matplotlib.pyplot as plt
+
 from sklearn.ensemble import RandomForestRegressor
 from sklearn.model_selection import train_test_split
+from sklearn.metrics import r2_score
+from sklearn.preprocessing import RobustScaler
+from sklearn.feature_selection import SelectKBest, mutual_info_regression
```

-The first thing we want to do is to load a sample dataset. We will use the m4_daily dataset for this tutorial. This dataset can be used for forecasting a `value` for each `id`, given a daily timestamp.
+The first thing we want to do is to load a sample dataset.
```{python}
# We start by loading the dataset
# You can get more insights about the dataset by following this link: https://business-science.github.io/timetk/reference/walmart_sales_weekly.html
dset = tk.load_dataset('walmart_sales_weekly', parse_dates = ['Date'])

# We also remove the markdown columns, as they are not very useful for the sake of the tutorial
dset = dset.drop(columns=[
    'id',     # This column can be removed as it is equivalent to 'Dept'
    'Store',  # This column has only one possible value
    'Type',   # This column has only one possible value
    'Size',   # This column has only one possible value
    'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5'])

dset.head()
```

We can plot the values of one Dept to get an idea of what the data looks like, using the `plot_timeseries` method.

```{python}
sales_df = dset
fig = sales_df[sales_df['Dept']==1].plot_timeseries(
    date_column='Date',
    value_column='Weekly_Sales',
    facet_ncol = 1,
    x_axis_date_labels = "%Y",
    engine = 'plotly')
fig
```

Let us now reshape the DataFrame so that it has one column per Dept. This DataFrame represents our target variable y.
We call it Y as it is in fact a stack of target variables, one for each Dept.

```{python}
Y = sales_df[['Dept', 'Date', 'Weekly_Sales']].set_index(['Dept', 'Date']).sort_index().unstack(['Dept'])
Y.head()
```

Now that we have our target, we want to produce the features that will help us predict it.

```{python}
X = sales_df.drop_duplicates(subset=['Date']).drop(columns=['Dept', 'Date', 'Weekly_Sales'])
X.head()
```

The preprocessing work is done; we can now train a basic regression model. We choose a random forest regressor here, as it is pretty good at handling both numerical and categorical data.
```{python} -from sklearn.ensemble import RandomForestRegressor -from sklearn.model_selection import train_test_split +def dummify(X_time, date_col = 'Date'): + X_time = pd.get_dummies(X_time, columns=[ + f'{date_col}_year', + f'{date_col}_year_iso', + f'{date_col}_quarteryear', + f'{date_col}_month_lbl', + f'{date_col}_wday_lbl', + f'{date_col}_am_pm'], drop_first=True) + return X_time + +date_col = 'Date' +X_time = sales_df[['Date']].drop_duplicates(subset=[date_col]).augment_timeseries_signature(date_column = date_col).drop(columns=[date_col]) +X_time = dummify(X_time, date_col='Date') +X_time +``` -Y = Y.fillna(method='ffill').fillna(method='bfill') -X['date_yweek'] = X['date_yweek'].astype(int) -X_train, X_test, Y_train, Y_test = train_test_split(X, Y, shuffle=False, train_size=.5) +# Modeling -model = RandomForestRegressor() -model = model.fit(X_train, Y_train) +```{python} +def train(X, Y, k=None): + """ Trains a RandomForests model on the input data. """ + + Y = Y.fillna(method='ffill').fillna(method='bfill') + X_train, X_test, Y_train, Y_test = train_test_split(X, Y, shuffle=False, train_size=.5) + + # scale numerical features + features_to_scale = [ c for c in ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment'] if c in X.columns] + if len(features_to_scale): + scaler = RobustScaler() + X_train[features_to_scale] = scaler.fit_transform(X_train[features_to_scale]) + X_test[features_to_scale] = scaler.transform(X_test[features_to_scale]) + + # select best features to remove noise + if k is not None: + selector = SelectKBest(mutual_info_regression, k=k) + X_train = selector.fit_transform(X_train, Y_train.iloc[:,1]) + X_test = selector.transform(X_test) + + # train the model + model = RandomForestRegressor(random_state=123456, n_estimators=300) + model = model.fit(X_train, Y_train) + preds_train = model.predict(X_train) + + # test the model + preds_test = model.predict(X_test) + print(f'R2 score: {r2_score(Y_test, preds_test)}') + + return Y_train, Y_test, 
preds_train, preds_test ``` -We can inspect how the training went, by looking at the predictions on the train set. We can see that the random forests worked pretty well and was able to get close to the ground truth. ```{python} -import matplotlib.pyplot as plt -Y_train[('value','D10')].plot(label='ground truth') -pd.DataFrame(model.predict(X_train), index=Y_train.index, columns=Y_train.columns)[('value','D10')].plot(label='predictions') -plt.legend() -plt.show() +def plot_result(dept_idx, Y, Y_train, Y_test, preds_train, preds_test): + """ Plots the predictions for a given Department. """ + import numpy as np + + data = pd.DataFrame({ + 'Weekly_Sales': pd.concat([ + Y.iloc[:, dept_idx], + pd.Series(preds_train[:,dept_idx], index=Y_train.index), + pd.Series(preds_test[:,dept_idx], index=Y_test.index)]) + }) + data['Labels'] = "" + data['Labels'].iloc[:len(Y)] = 'Ground truth' + data['Labels'].iloc[len(Y):len(Y)+len(Y_train)] = 'Predictions on train set' + data['Labels'].iloc[len(Y)+len(Y_train):] = 'Predictions on test set' + + fig = data.reset_index().plot_timeseries( + date_column='Date', + value_column='Weekly_Sales', + color_column='Labels', + facet_ncol = 1, + smooth=False, + x_axis_date_labels = "%Y", + engine = 'plotly') + fig.show() ``` -We can now see how well it performs on the test set: ```{python} -import matplotlib.pyplot as plt -Y_test[('value','D10')].plot(label='ground truth') -pd.DataFrame(model.predict(X_test), index=Y_test.index, columns=Y_test.columns)[('value','D10')].plot(label='predictions') -plt.legend() -plt.show() +Y_train, Y_test, preds_train, preds_test = train(X, Y) +plot_result(1, Y, Y_train, Y_test, preds_train, preds_test) ``` +R2 score: -0.16 + +```{python} +Y_train, Y_test, preds_train, preds_test = train(X_time, Y) +plot_result(1, Y, Y_train, Y_test, preds_train, preds_test) +``` +R2 score: 0.23 + +``` +from sklearn.feature_selection import mutual_info_regression + +def compute_MI(X, Y): + Y = 
Y.fillna(method='ffill').fillna(method='bfill') + X_train, X_test, Y_train, Y_test = train_test_split(X, Y, shuffle=False, train_size=.5) + features_to_scale = [ c for c in ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment'] if c in X.columns] + scaler = RobustScaler() + X_train[features_to_scale] = scaler.fit_transform(X_train[features_to_scale]) + + print(pd.DataFrame({'feature': X_train.columns, 'MI': mutual_info_regression(X_train, Y_train.iloc[:,0])}).sort_values(by='MI', ascending=False)) + +compute_MI(X, Y) +``` + + feature MI +1 Temperature 0.277781 +4 Unemployment 0.213034 +2 Fuel_Price 0.080510 +0 IsHoliday 0.004462 +3 CPI 0.003706 + +``` +X_concat = pd.concat([X_time, X[['Temperature', 'Unemployment']]], axis=1) +Y_train, Y_test, preds_train, preds_test = train(X_concat, Y, k=30) +plot_result(1, Y, Y_train, Y_test, preds_train, preds_test) +``` + +R2 score: 0.239 + +# Conclusion In this tutorial, we showed how the `tk.augment_timeseries_signature()` function can be used to effortlessly extract useful features from the date index. The features have proved to be relevant, as we can directly use them to train a regression model and get correct results for the task of time series forecasting. Note that we only used some months of data for the training set and did not change any hyperparameter of the random forests. That shows how meaningful the time features produced are. 
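The mutual information ranking used in `compute_MI` above can be illustrated on synthetic data: a feature that drives the target gets a high score, while an independent one scores near zero. A sketch with toy arrays (the variable names and values are made up for the illustration, not the Walmart features):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)   # feature that drives the target
noise = rng.normal(size=n)         # feature unrelated to the target
y = 3.0 * informative + 0.1 * rng.normal(size=n)

X_toy = np.column_stack([informative, noise])
mi = mutual_info_regression(X_toy, y, random_state=0)
print(mi)  # the first score should be far larger than the second
```

This is the same criterion that ranks Temperature and Unemployment above the other dataset features: the estimator measures how much knowing a feature reduces uncertainty about the target, without assuming a linear relationship.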
From b40740aeb8fe74a9fb195c33d08007a31cf44405 Mon Sep 17 00:00:00 2001 From: GTimothee <39728445+GTimothee@users.noreply.github.com> Date: Thu, 5 Oct 2023 11:00:27 +0200 Subject: [PATCH 3/5] Update 03_demand_forecasting.qmd added the forecasting section, using future_frame --- docs/tutorials/03_demand_forecasting.qmd | 50 ++++++++++++++++++++++++ 1 file changed, 50 insertions(+) diff --git a/docs/tutorials/03_demand_forecasting.qmd b/docs/tutorials/03_demand_forecasting.qmd index 9b63b4ea..9408ae20 100644 --- a/docs/tutorials/03_demand_forecasting.qmd +++ b/docs/tutorials/03_demand_forecasting.qmd @@ -190,6 +190,56 @@ plot_result(1, Y, Y_train, Y_test, preds_train, preds_test) R2 score: 0.239 +# Forecasting + +```{python} +def train_final(X, Y): + """ Trains a RandomForests model on the input data. """ + + Y = Y.fillna(method='ffill').fillna(method='bfill') + X_train, Y_train = X, Y + model = RandomForestRegressor(random_state=123456, n_estimators=300) + model = model.fit(X_train, Y_train) + + return model + + +date_col= 'Date' + +X_time_future = sales_df.drop_duplicates(subset=['Date'])[['Date']] +X_time_future = X_time_future.future_frame( + date_column = date_col, + length_out = 60 +).augment_timeseries_signature(date_column = date_col).drop(columns=[date_col]) +X_time_future = dummify(X_time_future, date_col='Date') + +X_time_future.head() +``` + +```{python} +Y_future = sales_df[['Dept', 'Date', 'Weekly_Sales']] +Y_future = Y_future.groupby('Dept').future_frame( + date_column = date_col, + length_out = 60 +) +Y_future.head() +``` + +```{python} +model = train_final(X_time_future.iloc[:len(Y)], Y) +predictions = model.predict(X_time_future.iloc[len(Y):]) +``` + +```{python} +Y_future.loc[Y_future.Date > Y.index[-1], 'Weekly_Sales'] = predictions.T.ravel() +Y_future['Label'] = 'History' +Y_future.loc[Y_future.Date > Y.index[-1], 'Label'] = 'Predictions' +``` + +```{python} +Y_future.query("Dept == 1").plot_timeseries(date_column='Date', 
value_column='Weekly_Sales', color_column='Label', smooth=False)
```

# Conclusion

In this tutorial, we showed how the `tk.augment_timeseries_signature()` function can be used to effortlessly extract useful features from the date index. The features proved to be relevant, as we can use them directly to train a regression model and get correct results for the task of time series forecasting. Note that we only used some months of data for the training set and did not change any hyperparameter of the random forest. That shows how meaningful the produced time features are.

From a01ad8b24fbfe00dde2cd53658e89710a26ac5b0 Mon Sep 17 00:00:00 2001
From: GTimothee <39728445+GTimothee@users.noreply.github.com>
Date: Thu, 5 Oct 2023 11:46:32 +0200
Subject: [PATCH 4/5] Update 03_demand_forecasting.qmd

added doc'

---
 docs/tutorials/03_demand_forecasting.qmd | 108 ++++++++++++++++++-----
 1 file changed, 86 insertions(+), 22 deletions(-)

diff --git a/docs/tutorials/03_demand_forecasting.qmd b/docs/tutorials/03_demand_forecasting.qmd
index 9408ae20..eb5d42cd 100644
--- a/docs/tutorials/03_demand_forecasting.qmd
+++ b/docs/tutorials/03_demand_forecasting.qmd
@@ -7,15 +7,14 @@ number-sections: true
 number-depth: 2
 ---

-Timetk makes it very fast to generate features from the timestamps in your data. This applied tutorial showcases how easy time series forecasting is with `pytimetk`, in particular the use of:
+Timetk makes it very easy to generate features from the time column of your data. This tutorial showcases how simple it is to perform time series forecasting with `pytimetk`. The specific methods we will be using are:

 - `tk.augment_timeseries_signature()`: Add 29 time series features to a DataFrame.
-- `tk.plot_timeseries()`: Plots a time series.
+- `tk.plot_timeseries()`: Creates time series plots using different plotting engines such as Plotnine, Matplotlib, and Plotly.
+- `tk.future_frame()`: Extend a DataFrame or GroupBy object with future dates.
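Conceptually, `tk.future_frame()` appends future timestamps after the last observed date of a series. A rough sketch of the same idea in plain pandas, on made-up weekly dates (this is not pytimetk's implementation, just the intuition behind it):

```python
import pandas as pd

# Made-up historical weekly dates (Fridays), shaped like a 'Date' column.
dates = pd.DataFrame({'Date': pd.date_range('2012-09-07', periods=3, freq='W-FRI')})

# Extend by 4 future weekly periods, similar in spirit to future_frame(length_out=4).
last = dates['Date'].iloc[-1]
future = pd.DataFrame({'Date': pd.date_range(last, periods=5, freq='W-FRI')[1:]})
extended = pd.concat([dates, future], ignore_index=True)
print(extended)  # 7 rows: 3 historical + 4 future Fridays
```

The pytimetk method does this per group as well (e.g. one extension per `Dept` when called on a GroupBy object), which is what we will rely on in the forecasting section.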
Load the following packages before proceeding with this tutorial.

```{python}
import pandas as pd
import pytimetk as tk
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import SelectKBest, mutual_info_regression
```

The tutorial is divided into three parts. We will first have a look at the Walmart dataset and perform some preprocessing. Secondly, we will create models based on different sets of features, and see how useful the time features can be. Finally, we will solve the task of time series forecasting, using only the features from `augment_timeseries_signature`, in order to predict future sales.

# Preprocessing the dataset

The first thing we want to do is to load the dataset. It is a subset of the Walmart sales prediction Kaggle competition. You can get more insights about the dataset by following this link: [walmart_sales_weekly](https://business-science.github.io/timetk/reference/walmart_sales_weekly.html). The most important thing to know about the dataset is that you are provided with some features, like the fuel price or whether the week contains holidays, and you are expected to predict the weekly sales column for 7 different departments of a given store. Of course, you also have the date for each week, and that is what we can leverage to create additional features.

Let us start by loading the dataset and cleaning it. Note that we also remove the markdown columns, as they are not very useful for the tutorial.
```{python}
# We start by loading the dataset
dset = tk.load_dataset('walmart_sales_weekly', parse_dates = ['Date'])

dset = dset.drop(columns=[
    'id',     # This column can be removed as it is equivalent to 'Dept'
    'Store',  # This column has only one possible value
    'Type',   # This column has only one possible value
    'Size',   # This column has only one possible value
    'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5'])

dset.head()
```

We can plot the values of one department to get an idea of what the data looks like, using the `plot_timeseries` method:

```{python}
sales_df = dset
fig = sales_df[sales_df['Dept']==1].plot_timeseries(
    date_column='Date',
    value_column='Weekly_Sales',
    facet_ncol = 1,
    x_axis_date_labels = "%Y",
    engine = 'plotly')
fig
```

Let us now reshape the DataFrame so that it has one column per Dept. This DataFrame represents our target variable y. We call it Y as it is in fact a matrix: a stack of target variables, one for each Dept.

```{python}
Y = sales_df[['Dept', 'Date', 'Weekly_Sales']].set_index(['Dept', 'Date']).sort_index().unstack(['Dept'])
Y.head()
```

Now that we have our target, we want to produce the features that will help us predict it. We will create two sets of features, to show the differences between the time features and the original features provided with the dataset.
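The `set_index(...).unstack(...)` pivot used to build `Y` above can be seen on a tiny made-up frame: long format in, one column per `Dept` out.

```python
import pandas as pd

# Toy long-format frame shaped like sales_df[['Dept', 'Date', 'Weekly_Sales']].
long_df = pd.DataFrame({
    'Dept': [1, 1, 2, 2],
    'Date': pd.to_datetime(['2010-02-05', '2010-02-12'] * 2),
    'Weekly_Sales': [100.0, 110.0, 200.0, 210.0],
})

# Same pivot as the tutorial's Y: rows indexed by Date, one column per Dept.
wide = long_df.set_index(['Dept', 'Date']).sort_index().unstack(['Dept'])
print(wide.shape)  # (2, 2): two dates, two departments
```

Each column of the resulting frame is one department's target series, which is exactly the multi-output shape the regressor will be trained on.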
`X` contains the features originally in the dataset:

```{python}
X = sales_df.drop_duplicates(subset=['Date']).drop(columns=['Dept', 'Date', 'Weekly_Sales'])
X.head()
```

`X_time` contains the time features. To build it, we first apply the `augment_timeseries_signature` method on the `Date` column. Then, we create dummy variables from the categorical features that have been created, so that they can be fed into a machine learning algorithm.

```{python}
def dummify(X_time: pd.DataFrame, date_col = 'Date'):
    """ Creates dummy variables from the categorical date features that have been created. """

    X_time = pd.get_dummies(X_time, columns=[
        f'{date_col}_year',
        f'{date_col}_year_iso',
        f'{date_col}_quarteryear',
        f'{date_col}_month_lbl',
        f'{date_col}_wday_lbl',
        f'{date_col}_am_pm'], drop_first=True)
    return X_time

date_col = 'Date'
X_time = sales_df[['Date']].drop_duplicates(subset=[date_col]).augment_timeseries_signature(date_column = date_col).drop(columns=[date_col])
X_time = dummify(X_time, date_col=date_col)
X_time
```

Let us explain a little bit more what happened here. We select only the Date column with:

```{python}
sales_df[['Date']]
```

We then drop the duplicates, as the Date column contains each date 7 times (once per Dept). The snippet below is a fragment of the chained call above, not a standalone statement:

```
.drop_duplicates(subset=[date_col])
```

We can now augment the data using `tk.augment_timeseries_signature`, and drop the original Date column (again, a fragment of the chain above):

```
.augment_timeseries_signature(date_column = date_col).drop(columns=[date_col])
```

# Modeling

So far, we defined our target variables `Y`, and two different sets of features: `X` and `X_time`. We can now train a sales forecasting model. For this tutorial we will be using `RandomForestRegressor`, as it is a simple yet powerful model that can handle multiple types of data. We build a train function that takes the features and the targets as input, and is composed of several steps:

1.
We divide the data into a train set and a test set. `train_size` is the fraction of the data that you want to keep for the train set; the rest will be used for the test set.
2. We scale the numerical features, if any, so that the model learns better. The `RobustScaler` allows for better performance, as it scales the data using statistics that are robust to outliers.
3. We added a `k` option so that we can select the k best features of our dataset using mutual information, hence reducing the noise from irrelevant features.
4. We train and test the random forest model, measuring its performance with the R2 score function.

The resulting training function is as follows:

```{python}
def train(X, Y, k=None):
    """ Trains a RandomForestRegressor model on the input data. """

    Y = Y.fillna(method='ffill').fillna(method='bfill')
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, shuffle=False, train_size=.5)

    # scale numerical features
    features_to_scale = [c for c in ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment'] if c in X.columns]
    if len(features_to_scale):
        scaler = RobustScaler()
        X_train[features_to_scale] = scaler.fit_transform(X_train[features_to_scale])
        X_test[features_to_scale] = scaler.transform(X_test[features_to_scale])

    # select the k best features to remove noise
    if k is not None:
        selector = SelectKBest(mutual_info_regression, k=k)
        X_train = selector.fit_transform(X_train, Y_train.iloc[:,1])
        X_test = selector.transform(X_test)

    # train the model
    model = RandomForestRegressor(random_state=123456, n_estimators=300)
    model = model.fit(X_train, Y_train)
    preds_train = model.predict(X_train)

    # test the model
    preds_test = model.predict(X_test)
    print(f'R2 score: {r2_score(Y_test, preds_test)}')

    return Y_train, Y_test, preds_train, preds_test # returns data useful for the plot_result function below
```

In addition, we define a plot function based on `tk.plot_timeseries` that will enable us to compare the ground truth data to the model's predictions, for a given department.

```{python}
def plot_result(dept_idx, Y, Y_train, Y_test, preds_train, preds_test):
    """ Plots the predictions for a given department. """

    data = pd.DataFrame({
        'Weekly_Sales': pd.concat([
            Y.iloc[:, dept_idx],
            pd.Series(preds_train[:,dept_idx], index=Y_train.index),
            pd.Series(preds_test[:,dept_idx], index=Y_test.index)])
    })
    data['Labels'] = ""
    data['Labels'].iloc[:len(Y)] = 'Ground truth'
    data['Labels'].iloc[len(Y):len(Y)+len(Y_train)] = 'Predictions on train set'
    data['Labels'].iloc[len(Y)+len(Y_train):] = 'Predictions on test set'

    fig = data.reset_index().plot_timeseries(
        date_column='Date',
        value_column='Weekly_Sales',
        color_column='Labels',
        facet_ncol = 1,
        smooth=False,
        x_axis_date_labels = "%Y",
        engine = 'plotly')
    fig.show()
```

If we train using the `X` matrix of features (the original features of the dataset), we get a poor result, with an R2 score of -0.16.
```{python}
Y_train, Y_test, preds_train, preds_test = train(X, Y)
plot_result(1, Y, Y_train, Y_test, preds_train, preds_test) # We inspect the predictions for one department
```

Computing the mutual information score on these features, we realize that only two features are really useful. But using them alone does not improve the results.

```
from sklearn.feature_selection import mutual_info_regression

def compute_MI(X, Y):
    Y = Y.fillna(method='ffill').fillna(method='bfill')
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, shuffle=False, train_size=.5)
    features_to_scale = [c for c in ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment'] if c in X.columns]
    scaler = RobustScaler()
    X_train[features_to_scale] = scaler.fit_transform(X_train[features_to_scale])

    print(pd.DataFrame({'feature': X_train.columns, 'MI': mutual_info_regression(X_train, Y_train.iloc[:,0])}).sort_values(by='MI', ascending=False))

compute_MI(X, Y)
```

Output:

```
        feature        MI
1   Temperature  0.277781
4  Unemployment  0.213034
2    Fuel_Price  0.080510
0     IsHoliday  0.004462
3           CPI  0.003706
```

Now if we create a model using our date-based features, the results are a lot better, with an R2 score of 0.23!

```{python}
Y_train, Y_test, preds_train, preds_test = train(X_time, Y)
plot_result(1, Y, Y_train, Y_test, preds_train, preds_test)
```

Concatenating all the features together, we get a little bit of improvement, but not a significant one (R2 score: 0.239).

```
X_concat = pd.concat([X_time, X[['Temperature', 'Unemployment']]], axis=1)
Y_train, Y_test, preds_train, preds_test = train(X_concat, Y, k=30)
plot_result(1, Y, Y_train, Y_test, preds_train, preds_test)
```

This section showed the relevance of the time features in time series forecasting. Now let us build our final forecasting engine, using only the time-based features.

# Forecasting

We want to use all of our data to create the final model. To that end, we will use a very simplified version of our previous training function:

```{python}
def train_final(X, Y):
    """ Trains a random forest model on the input data.
    """

    Y = Y.fillna(method='ffill').fillna(method='bfill')
    X_train, Y_train = X, Y
    model = RandomForestRegressor(random_state=123456, n_estimators=300)
    model = model.fit(X_train, Y_train)

    return model
```

In order to perform the forecasting, we need time-based features for dates that are not in our dataset. Let us use the `tk.future_frame` method to add future dates to our dataset, then apply the `augment_timeseries_signature` method on the resulting DataFrame, hence creating the time-based features for both past and future dates:

```{python}
date_col = 'Date'

X_time_future = sales_df.drop_duplicates(subset=['Date'])[['Date']]
X_time_future = X_time_future.future_frame(
    date_column = date_col,
    length_out = 60
).augment_timeseries_signature(date_column = date_col).drop(columns=[date_col])
X_time_future = dummify(X_time_future, date_col='Date')

X_time_future.head()
```

In the same way, we extend our target DataFrame by 60 weeks. As we don't know the future sales, the additional rows are filled with NaNs; we will replace these NaNs with our predictions later in the tutorial.

```{python}
Y_future = sales_df[['Dept', 'Date', 'Weekly_Sales']]
Y_future = Y_future.groupby('Dept').future_frame(
    date_column = date_col,
    length_out = 60
)
Y_future.head()
```

We train the model and store its predictions on the future dates:

```{python}
model = train_final(X_time_future.iloc[:len(Y)], Y)
predictions = model.predict(X_time_future.iloc[len(Y):])
```

We store the predictions in the `Y_future` DataFrame, and tag the prediction entries with the label `Predictions`. Original entries are tagged with the label `History`.

```{python}
Y_future.loc[Y_future.Date > Y.index[-1], 'Weekly_Sales'] = predictions.T.ravel()
Y_future['Label'] = 'History'
Y_future.loc[Y_future.Date > Y.index[-1], 'Label'] = 'Predictions'
```

We can now plot the result very easily using the `plot_timeseries` method.
Note how you can easily select a subset of the data (the data of a given department) using the `query` method: + ```{python} Y_future.query("Dept == 1").plot_timeseries(date_column='Date', value_column='Weekly_Sales', color_column='Label', smooth=False) ``` # Conclusion -In this tutorial, we showed how the `tk.augment_timeseries_signature()` function can be used to effortlessly extract useful features from the date index. The features have proved to be relevant, as we can directly use them to train a regression model and get correct results for the task of time series forecasting. Note that we only used some months of data for the training set and did not change any hyperparameter of the random forests. That shows how meaningful the time features produced are. +In this tutorial, we showed how the `tk.augment_timeseries_signature()` function can be used to effortlessly extract useful features from the date index and perform forecasting. From 5d265d7cb67d60b2bb632eb46831ce9ec2b5c4a5 Mon Sep 17 00:00:00 2001 From: GTimothee <39728445+GTimothee@users.noreply.github.com> Date: Thu, 5 Oct 2023 11:47:04 +0200 Subject: [PATCH 5/5] Update 03_demand_forecasting.qmd --- docs/tutorials/03_demand_forecasting.qmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/tutorials/03_demand_forecasting.qmd b/docs/tutorials/03_demand_forecasting.qmd index eb5d42cd..0bf43ef8 100644 --- a/docs/tutorials/03_demand_forecasting.qmd +++ b/docs/tutorials/03_demand_forecasting.qmd @@ -198,7 +198,7 @@ plot_result(1, Y, Y_train, Y_test, preds_train, preds_test) # We inspect the pr Computing the mutual information score on these features, we realize that only two features are really useful. But using them alone does not improve the results. 
-```
+```{python}
 from sklearn.feature_selection import mutual_info_regression
 
 def compute_MI(X, Y):
     Y = Y.fillna(method='ffill').fillna(method='bfill')
     X_train, X_test, Y_train, Y_test = train_test_split(X, Y, shuffle=False, train_size=.5)
     features_to_scale = [c for c in ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment'] if c in X.columns]
     scaler = RobustScaler()
     X_train[features_to_scale] = scaler.fit_transform(X_train[features_to_scale])
 
     print(pd.DataFrame({'feature': X_train.columns, 'MI': mutual_info_regression(X_train, Y_train.iloc[:,0])}).sort_values(by='MI', ascending=False))
 
 compute_MI(X, Y)
 ```
 
 Output:
 
 ```
         feature        MI
 1   Temperature  0.277781
 4  Unemployment  0.213034
 2    Fuel_Price  0.080510
 0     IsHoliday  0.004462
 3           CPI  0.003706
 ```
 
-```
+```{python}
 X_concat = pd.concat([X_time, X[['Temperature', 'Unemployment']]], axis=1)
 Y_train, Y_test, preds_train, preds_test = train(X_concat, Y, k=30)
 plot_result(1, Y, Y_train, Y_test, preds_train, preds_test)
 ```