diff --git a/docs/_freeze/reference/plot_timeseries/execute-results/html.json b/docs/_freeze/reference/plot_timeseries/execute-results/html.json index 136d4153..f3d95be6 100644 --- a/docs/_freeze/reference/plot_timeseries/execute-results/html.json +++ b/docs/_freeze/reference/plot_timeseries/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "eed1a351dc0d2276c7b29abcc3300f18", + "hash": "2047520e00892e131a8075aed550bf4d", "result": { - "markdown": "---\ntitle: plot_timeseries\n---\n\n\n\n`plot_timeseries(data, date_column, value_column, color_column=None, color_palette=None, facet_ncol=1, facet_nrow=None, facet_scales='free_y', facet_dir='h', line_color='#2c3e50', line_size=None, line_type='solid', line_alpha=1.0, y_intercept=None, y_intercept_color='#2c3e50', x_intercept=None, x_intercept_color='#2c3e50', smooth=True, smooth_color='#3366FF', smooth_frac=0.2, smooth_size=1.0, smooth_alpha=1.0, legend_show=True, title='Time Series Plot', x_lab='', y_lab='', color_lab='Legend', x_axis_date_labels='%b %Y', base_size=11, width=None, height=None, engine='plotly', plotly_dropdown=False, plotly_dropdown_x=0, plotly_dropdown_y=1)`\n\nCreates time series plots using different plotting engines such as Plotnine, \nMatplotlib, and Plotly.\n\n## Parameters\n\n| Name | Type | Description | Default |\n|----------------------|----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|\n| `data` | pd.DataFrame or 
pd.core.groupby.generic.DataFrameGroupBy | The input data for the plot. It can be either a Pandas DataFrame or a Pandas DataFrameGroupBy object. | _required_ |\n| `date_column` | str | The name of the column in the DataFrame that contains the dates for the time series data. | _required_ |\n| `value_column` | str or list | The `value_column` parameter is used to specify the name of the column in the DataFrame that contains the values for the time series data. This column will be plotted on the y-axis of the time series plot. LONG-FORMAT PLOTTING: If the `value_column` parameter is a string, it will be treated as a single column name. To plot multiple time series, group the DataFrame first using pd.DataFrame.groupby(). WIDE-FORMAT PLOTTING: If the `value_column` parameter is a list, it will be plotted as multiple time series (wide-format). | _required_ |\n| `color_column` | str | The `color_column` parameter is an optional parameter that specifies the column in the DataFrame that will be used to assign colors to the different time series. If this parameter is not provided, all time series will have the same color. LONG-FORMAT PLOTTING: The `color_column` parameter is a single column name. WIDE-FORMAT PLOTTING: The `color_column` parameter must be the same list as the `value_column` parameter to color the different time series when performing wide-format plotting. | `None` |\n| `color_palette` | list | The `color_palette` parameter is used to specify the colors to be used for the different time series. It accepts a list of color codes or names. If the `color_column` parameter is not provided, the `tk.palette_timetk()` color palette will be used. | `None` |\n| `facet_ncol` | int | The `facet_ncol` parameter determines the number of columns in the facet grid. It specifies how many subplots will be arranged horizontally in the plot. | `1` |\n| `facet_nrow` | int | The `facet_nrow` parameter determines the number of rows in the facet grid. 
It specifies how many subplots will be arranged vertically in the grid. | `None` |\n| `facet_scales` | str | The `facet_scales` parameter determines the scaling of the axes in the faceted plots. It can take the following values: - \"free_y\": The y-axis scale will be free for each facet, but the x-axis scale will be fixed for all facets. This is the default value. - \"free_x\": The x-axis scale will be free for each facet, but the y-axis scale will be fixed for all facets. - \"free\": Both the x-axis and y-axis scales will be free for each facet (subplot). | `'free_y'` |\n| `facet_dir` | str | The `facet_dir` parameter determines the direction in which the facets (subplots) are arranged. It can take two possible values: - \"h\": The facets will be arranged horizontally (in rows). This is the default value. - \"v\": The facets will be arranged vertically (in columns). | `'h'` |\n| `line_color` | str | The `line_color` parameter is used to specify the color of the lines in the time series plot. It accepts a string value representing a color code or name. The default value is \"#2c3e50\", which corresponds to a dark blue color. | `'#2c3e50'` |\n| `line_size` | float | The `line_size` parameter is used to specify the size of the lines in the time series plot. It determines the thickness of the lines. | `None` |\n| `line_type` | str | The `line_type` parameter is used to specify the type of line to be used in the time series plot. | `'solid'` |\n| `line_alpha` | float | The `line_alpha` parameter controls the transparency of the lines in the time series plot. It accepts a value between 0 and 1, where 0 means completely transparent (invisible) and 1 means completely opaque (solid). | `1.0` |\n| `y_intercept` | float | The `y_intercept` parameter is used to add a horizontal line to the plot at a specific y-value. It can be set to a numeric value to specify the y-value of the intercept. 
If set to `None` (default), no y-intercept line will be added to the plot | `None` |\n| `y_intercept_color` | str | The `y_intercept_color` parameter is used to specify the color of the y-intercept line in the plot. It accepts a string value representing a color code or name. The default value is \"#2c3e50\", which corresponds to a dark blue color. You can change this value. | `'#2c3e50'` |\n| `x_intercept` | str | The `x_intercept` parameter is used to add a vertical line at a specific x-axis value on the plot. It is used to highlight a specific point or event in the time series data. - By default, it is set to `None`, which means no vertical line will be added. - You can use a date string to specify the x-axis value of the intercept. For example, \"2020-01-01\" would add a vertical line at the beginning of the year 2020. | `None` |\n| `x_intercept_color` | str | The `x_intercept_color` parameter is used to specify the color of the vertical line that represents the x-intercept in the plot. By default, it is set to \"#2c3e50\", which is a dark blue color. You can change this value to any valid color code. | `'#2c3e50'` |\n| `smooth` | bool | The `smooth` parameter is a boolean indicating whether or not to apply smoothing to the time series data. If set to True, the time series will be smoothed using the lowess algorithm. The default value is True. | `True` |\n| `smooth_color` | str | The `smooth_color` parameter is used to specify the color of the smoothed line in the time series plot. It accepts a string value representing a color code or name. The default value is `#3366FF`, which corresponds to a shade of blue. You can change this value to any valid color code. | `'#3366FF'` |\n| `smooth_frac` | float | The `smooth_frac` parameter is used to control the fraction of data points used for smoothing the time series. It determines the degree of smoothing applied to the data. 
A larger value of `smooth_frac` will result in more smoothing, while a smaller value will result in less smoothing. The default value is 0.2. | `0.2` |\n| `smooth_size` | float | The `smooth_size` parameter is used to specify the size of the line used to plot the smoothed values in the time series plot. It is a numeric value that controls the thickness of the line. A larger value will result in a thicker line, while a smaller value will result in a thinner line. | `1.0` |\n| `smooth_alpha` | float | The `smooth_alpha` parameter controls the transparency of the smoothed line in the plot. It accepts a value between 0 and 1, where 0 means completely transparent and 1 means completely opaque. | `1.0` |\n| `legend_show` | bool | The `legend_show` parameter is a boolean indicating whether or not to show the legend in the plot. If set to True, the legend will be displayed. The default value is True. | `True` |\n| `title` | str | The title of the plot. | `'Time Series Plot'` |\n| `x_lab` | str | The `x_lab` parameter is used to specify the label for the x-axis in the plot. It is a string that represents the label text. | `''` |\n| `y_lab` | str | The `y_lab` parameter is used to specify the label for the y-axis in the plot. It is a string that represents the label for the y-axis. | `''` |\n| `color_lab` | str | The `color_lab` parameter is used to specify the label for the legend or color scale in the plot. It is used to provide a description of the colors used in the plot, typically when a color column is specified. | `'Legend'` |\n| `x_axis_date_labels` | str | The `x_axis_date_labels` parameter is used to specify the format of the date labels on the x-axis of the plot. It accepts a string representing the format of the date labels. For example, \"%b %Y\" would display the month abbreviation and year (e.g., Jan 2020). | `'%b %Y'` |\n| `base_size` | float | The `base_size` parameter is used to set the base font size for the plot. 
It determines the size of the text elements such as axis labels, titles, and legends. | `11` |\n| `width` | int | The `width` parameter is used to specify the width of the plot. It determines the horizontal size of the plot in pixels. | `None` |\n| `height` | int | The `height` parameter is used to specify the height of the plot in pixels. It determines the vertical size of the plot when it is rendered. | `None` |\n| `engine` | str | The `engine` parameter specifies the plotting library to use for creating the time series plot. It can take one of the following values: - \"plotly\" (interactive): Use the plotly library to create the plot. This is the default value. - \"plotnine\" (static): Use the plotnine library to create the plot. - \"matplotlib\" (static): Use the matplotlib library to create the plot. | `'plotly'` |\n| `plotly_dropdown` | bool | For analyzing many plots. When set to True and groups are provided, the function switches from faceting to create a dropdown menu to switch between different groups. Default: `False`. | `False` |\n| `plotly_dropdown_x` | float | The x-axis location of the dropdown. Default: 0. | `0` |\n| `plotly_dropdown_y` | float | The y-axis location of the dropdown. Default: 1. | `1` |\n\n## Returns\n\n| Type | Description |\n|------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| The function `plot_timeseries` returns a plot object, depending on the | specified `engine` parameter. - If `engine` is set to 'plotnine' or 'matplotlib', the function returns a plot object that can be further customized or displayed. - If `engine` is set to 'plotly', the function returns a plotly figure object. 
|\n\n## Examples\n\n\n::: {.cell execution_count=1}\n``` {.python .cell-code}\nimport pytimetk as tk\n\ndf = tk.load_dataset('m4_monthly', parse_dates = ['date'])\n\n# Plotly Object: Single Time Series\nfig = (\n df\n .query('id == \"M750\"')\n .plot_timeseries(\n 'date', 'value', \n facet_ncol = 1,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n::: {.cell execution_count=2}\n``` {.python .cell-code}\n# Plotly Object: Grouped Time Series (Facets)\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value', \n facet_ncol = 2,\n facet_scales = \"free_y\",\n smooth_frac = 0.2,\n smooth_size = 2.0,\n y_intercept = None,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n width = 600,\n height = 500,\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n::: {.cell execution_count=3}\n``` {.python .cell-code}\n# Plotly Object: Grouped Time Series (Plotly Dropdown)\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value', \n facet_scales = \"free_y\",\n smooth_frac = 0.2,\n smooth_size = 2.0,\n y_intercept = None,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n width = 600,\n height = 500,\n plotly_dropdown = True, # Plotly Dropdown\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n::: {.cell execution_count=4}\n``` {.python .cell-code}\n# Plotly Object: Color Column\nfig = (\n df\n .plot_timeseries(\n 'date', 'value', \n color_column = 'id',\n smooth = False,\n y_intercept = 0,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n::: {.cell execution_count=5}\n``` {.python .cell-code}\n# Plotnine Object: Single Time Series\nfig = (\n df\n .query('id == \"M1\"')\n .plot_timeseries(\n 'date', 'value', \n x_axis_date_labels = \"%Y\",\n engine = 'plotnine'\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display}\n![](plot_timeseries_files/figure-html/cell-6-output-1.png){}\n:::\n\n::: {.cell-output .cell-output-display execution_count=5}\n```\n
\n```\n:::\n:::\n\n\n::: {.cell execution_count=6}\n``` {.python .cell-code}\n# Plotnine Object: Grouped Time Series\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value',\n facet_ncol = 2,\n facet_scales = \"free\",\n line_size = 0.35,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine'\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display}\n![](plot_timeseries_files/figure-html/cell-7-output-1.png){}\n:::\n\n::: {.cell-output .cell-output-display execution_count=6}\n```\n
\n```\n:::\n:::\n\n\n::: {.cell execution_count=7}\n``` {.python .cell-code}\n# Plotnine Object: Color Column\nfig = (\n df\n .plot_timeseries(\n 'date', 'value', \n color_column = 'id',\n smooth = False,\n y_intercept = 0,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine',\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display}\n![](plot_timeseries_files/figure-html/cell-8-output-1.png){}\n:::\n\n::: {.cell-output .cell-output-display execution_count=7}\n```\n
\n```\n:::\n:::\n\n\n::: {.cell execution_count=8}\n``` {.python .cell-code}\n# Matplotlib object (same as plotnine, but converted to matplotlib object)\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value', \n color_column = 'id',\n facet_ncol = 2,\n x_axis_date_labels = \"%Y\",\n engine = 'matplotlib',\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display execution_count=8}\n![](plot_timeseries_files/figure-html/cell-9-output-1.png){}\n:::\n:::\n\n\n# Wide-Format Plotting\n\n# Imports\nimport pandas as pd\nimport numpy as np\nimport pytimetk as tk\n\n# Set a random seed for reproducibility\nnp.random.seed(42) \n\n# Create a date range\ndates = pd.date_range(start=\"2020-01-01\", periods=100, freq=\"D\")\n\n# Generate random sales data and compute expenses and profit\nsales = np.random.uniform(1000, 5000, len(dates))\nexpenses = sales * np.random.uniform(0.5, 0.8, len(dates))\nprofit = sales - expenses\n\n# Create the DataFrame\ndf = pd.DataFrame({\n 'date': dates,\n 'sales': sales,\n 'expenses': expenses,\n 'profit': profit\n})\n\n(\n df\n .plot_timeseries(\n date_column = 'date', \n value_column = ['sales', 'expenses', 'profit'],\n color_column = ['sales', 'expenses', 'profit'], \n smooth = True,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n plotly_dropdown = True, # Plotly Dropdown\n )\n)\n\n", + "markdown": "---\ntitle: plot_timeseries\n---\n\n\n\n`plot_timeseries(data, date_column, value_column, color_column=None, color_palette=None, facet_ncol=1, facet_nrow=None, facet_scales='free_y', facet_dir='h', line_color='#2c3e50', line_size=None, line_type='solid', line_alpha=1.0, y_intercept=None, y_intercept_color='#2c3e50', x_intercept=None, x_intercept_color='#2c3e50', smooth=True, smooth_color='#3366FF', smooth_frac=0.2, smooth_size=1.0, smooth_alpha=1.0, legend_show=True, title='Time Series Plot', x_lab='', y_lab='', color_lab='Legend', x_axis_date_labels='%b %Y', base_size=11, width=None, height=None, engine='plotly', 
plotly_dropdown=False, plotly_dropdown_x=0, plotly_dropdown_y=1)`\n\nCreates time series plots using different plotting engines such as Plotnine, \nMatplotlib, and Plotly.\n\n## Parameters\n\n| Name | Type | Description | Default |\n|----------------------|----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|\n| `data` | pd.DataFrame or pd.core.groupby.generic.DataFrameGroupBy | The input data for the plot. It can be either a Pandas DataFrame or a Pandas DataFrameGroupBy object. | _required_ |\n| `date_column` | str | The name of the column in the DataFrame that contains the dates for the time series data. | _required_ |\n| `value_column` | str or list | The `value_column` parameter is used to specify the name of the column in the DataFrame that contains the values for the time series data. This column will be plotted on the y-axis of the time series plot. LONG-FORMAT PLOTTING: If the `value_column` parameter is a string, it will be treated as a single column name. To plot multiple time series, group the DataFrame first using pd.DataFrame.groupby(). WIDE-FORMAT PLOTTING: If the `value_column` parameter is a list, it will plotted as multiple time series (wide-format). | _required_ |\n| `color_column` | str | The `color_column` parameter is an optional parameter that specifies the column in the DataFrame that will be used to assign colors to the different time series. 
If this parameter is not provided, all time series will have the same color. LONG-FORMAT PLOTTING: The `color_column` parameter is a single column name. WIDE-FORMAT PLOTTING: The `color_column` parameter must be the same list as the `value_column` parameter to color the different time series when performing wide-format plotting. | `None` |\n| `color_palette` | list | The `color_palette` parameter is used to specify the colors to be used for the different time series. It accepts a list of color codes or names. If the `color_column` parameter is not provided, the `tk.palette_timetk()` color palette will be used. | `None` |\n| `facet_ncol` | int | The `facet_ncol` parameter determines the number of columns in the facet grid. It specifies how many subplots will be arranged horizontally in the plot. | `1` |\n| `facet_nrow` | int | The `facet_nrow` parameter determines the number of rows in the facet grid. It specifies how many subplots will be arranged vertically in the grid. | `None` |\n| `facet_scales` | str | The `facet_scales` parameter determines the scaling of the axes in the faceted plots. It can take the following values: - \"free_y\": The y-axis scale will be free for each facet, but the x-axis scale will be fixed for all facets. This is the default value. - \"free_x\": The x-axis scale will be free for each facet, but the y-axis scale will be fixed for all facets. - \"free\": Both the x-axis and y-axis scales will be free for each facet (subplot). | `'free_y'` |\n| `facet_dir` | str | The `facet_dir` parameter determines the direction in which the facets (subplots) are arranged. It can take two possible values: - \"h\": The facets will be arranged horizontally (in rows). This is the default value. - \"v\": The facets will be arranged vertically (in columns). | `'h'` |\n| `line_color` | str | The `line_color` parameter is used to specify the color of the lines in the time series plot. It accepts a string value representing a color code or name. 
The default value is \"#2c3e50\", which corresponds to a dark blue color. | `'#2c3e50'` |\n| `line_size` | float | The `line_size` parameter is used to specify the size of the lines in the time series plot. It determines the thickness of the lines. | `None` |\n| `line_type` | str | The `line_type` parameter is used to specify the type of line to be used in the time series plot. | `'solid'` |\n| `line_alpha` | float | The `line_alpha` parameter controls the transparency of the lines in the time series plot. It accepts a value between 0 and 1, where 0 means completely transparent (invisible) and 1 means completely opaque (solid). | `1.0` |\n| `y_intercept` | float | The `y_intercept` parameter is used to add a horizontal line to the plot at a specific y-value. It can be set to a numeric value to specify the y-value of the intercept. If set to `None` (default), no y-intercept line will be added to the plot | `None` |\n| `y_intercept_color` | str | The `y_intercept_color` parameter is used to specify the color of the y-intercept line in the plot. It accepts a string value representing a color code or name. The default value is \"#2c3e50\", which corresponds to a dark blue color. You can change this value. | `'#2c3e50'` |\n| `x_intercept` | str | The `x_intercept` parameter is used to add a vertical line at a specific x-axis value on the plot. It is used to highlight a specific point or event in the time series data. - By default, it is set to `None`, which means no vertical line will be added. - You can use a date string to specify the x-axis value of the intercept. For example, \"2020-01-01\" would add a vertical line at the beginning of the year 2020. | `None` |\n| `x_intercept_color` | str | The `x_intercept_color` parameter is used to specify the color of the vertical line that represents the x-intercept in the plot. By default, it is set to \"#2c3e50\", which is a dark blue color. You can change this value to any valid color code. 
| `'#2c3e50'` |\n| `smooth` | bool | The `smooth` parameter is a boolean indicating whether or not to apply smoothing to the time series data. If set to True, the time series will be smoothed using the lowess algorithm. The default value is True. | `True` |\n| `smooth_color` | str | The `smooth_color` parameter is used to specify the color of the smoothed line in the time series plot. It accepts a string value representing a color code or name. The default value is `#3366FF`, which corresponds to a shade of blue. You can change this value to any valid color code. | `'#3366FF'` |\n| `smooth_frac` | float | The `smooth_frac` parameter is used to control the fraction of data points used for smoothing the time series. It determines the degree of smoothing applied to the data. A larger value of `smooth_frac` will result in more smoothing, while a smaller value will result in less smoothing. The default value is 0.2. | `0.2` |\n| `smooth_size` | float | The `smooth_size` parameter is used to specify the size of the line used to plot the smoothed values in the time series plot. It is a numeric value that controls the thickness of the line. A larger value will result in a thicker line, while a smaller value will result in a thinner line. | `1.0` |\n| `smooth_alpha` | float | The `smooth_alpha` parameter controls the transparency of the smoothed line in the plot. It accepts a value between 0 and 1, where 0 means completely transparent and 1 means completely opaque. | `1.0` |\n| `legend_show` | bool | The `legend_show` parameter is a boolean indicating whether or not to show the legend in the plot. If set to True, the legend will be displayed. The default value is True. | `True` |\n| `title` | str | The title of the plot. | `'Time Series Plot'` |\n| `x_lab` | str | The `x_lab` parameter is used to specify the label for the x-axis in the plot. It is a string that represents the label text. 
| `''` |\n| `y_lab` | str | The `y_lab` parameter is used to specify the label for the y-axis in the plot. It is a string that represents the label for the y-axis. | `''` |\n| `color_lab` | str | The `color_lab` parameter is used to specify the label for the legend or color scale in the plot. It is used to provide a description of the colors used in the plot, typically when a color column is specified. | `'Legend'` |\n| `x_axis_date_labels` | str | The `x_axis_date_labels` parameter is used to specify the format of the date labels on the x-axis of the plot. It accepts a string representing the format of the date labels. For example, \"%b %Y\" would display the month abbreviation and year (e.g., Jan 2020). | `'%b %Y'` |\n| `base_size` | float | The `base_size` parameter is used to set the base font size for the plot. It determines the size of the text elements such as axis labels, titles, and legends. | `11` |\n| `width` | int | The `width` parameter is used to specify the width of the plot. It determines the horizontal size of the plot in pixels. | `None` |\n| `height` | int | The `height` parameter is used to specify the height of the plot in pixels. It determines the vertical size of the plot when it is rendered. | `None` |\n| `engine` | str | The `engine` parameter specifies the plotting library to use for creating the time series plot. It can take one of the following values: - \"plotly\" (interactive): Use the plotly library to create the plot. This is the default value. - \"plotnine\" (static): Use the plotnine library to create the plot. - \"matplotlib\" (static): Use the matplotlib library to create the plot. | `'plotly'` |\n| `plotly_dropdown` | bool | For analyzing many plots. When set to True and groups are provided, the function switches from faceting to create a dropdown menu to switch between different groups. Default: `False`. | `False` |\n| `plotly_dropdown_x` | float | The x-axis location of the dropdown. Default: 0. 
| `0` |\n| `plotly_dropdown_y` | float | The y-axis location of the dropdown. Default: 1. | `1` |\n\n## Returns\n\n| Type | Description |\n|------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| The function `plot_timeseries` returns a plot object, depending on the | specified `engine` parameter. - If `engine` is set to 'plotnine' or 'matplotlib', the function returns a plot object that can be further customized or displayed. - If `engine` is set to 'plotly', the function returns a plotly figure object. |\n\n## Examples\n\n\n::: {.cell execution_count=1}\n``` {.python .cell-code}\nimport pytimetk as tk\n\ndf = tk.load_dataset('m4_monthly', parse_dates = ['date'])\n\n# Plotly Object: Single Time Series\nfig = (\n df\n .query('id == \"M750\"')\n .plot_timeseries(\n 'date', 'value', \n facet_ncol = 1,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n::: {.cell execution_count=2}\n``` {.python .cell-code}\n# Plotly Object: Grouped Time Series (Facets)\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value', \n facet_ncol = 2,\n facet_scales = \"free_y\",\n smooth_frac = 0.2,\n smooth_size = 2.0,\n y_intercept = None,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n width = 600,\n height = 500,\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n::: {.cell execution_count=3}\n``` {.python .cell-code}\n# Plotly Object: Grouped Time Series (Plotly Dropdown)\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value', \n facet_scales = \"free_y\",\n smooth_frac = 0.2,\n smooth_size = 2.0,\n y_intercept = None,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n width = 600,\n height = 500,\n plotly_dropdown = True, # Plotly Dropdown\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n::: {.cell execution_count=4}\n``` {.python .cell-code}\n# Plotly Object: Color Column\nfig = (\n df\n .plot_timeseries(\n 'date', 'value', \n color_column = 'id',\n smooth = False,\n y_intercept = 0,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n::: {.cell execution_count=5}\n``` {.python .cell-code}\n# Plotnine Object: Single Time Series\nfig = (\n df\n .query('id == \"M1\"')\n .plot_timeseries(\n 'date', 'value', \n x_axis_date_labels = \"%Y\",\n engine = 'plotnine'\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display}\n![](plot_timeseries_files/figure-html/cell-6-output-1.png){}\n:::\n\n::: {.cell-output .cell-output-display execution_count=5}\n```\n
\n```\n:::\n:::\n\n\n::: {.cell execution_count=6}\n``` {.python .cell-code}\n# Plotnine Object: Grouped Time Series\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value',\n facet_ncol = 2,\n facet_scales = \"free\",\n line_size = 0.35,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine'\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display}\n![](plot_timeseries_files/figure-html/cell-7-output-1.png){}\n:::\n\n::: {.cell-output .cell-output-display execution_count=6}\n```\n
\n```\n:::\n:::\n\n\n::: {.cell execution_count=7}\n``` {.python .cell-code}\n# Plotnine Object: Color Column\nfig = (\n df\n .plot_timeseries(\n 'date', 'value', \n color_column = 'id',\n smooth = False,\n y_intercept = 0,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine',\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display}\n![](plot_timeseries_files/figure-html/cell-8-output-1.png){}\n:::\n\n::: {.cell-output .cell-output-display execution_count=7}\n```\n
\n```\n:::\n:::\n\n\n::: {.cell execution_count=8}\n``` {.python .cell-code}\n# Matplotlib object (same as plotnine, but converted to matplotlib object)\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value', \n color_column = 'id',\n facet_ncol = 2,\n x_axis_date_labels = \"%Y\",\n engine = 'matplotlib',\n )\n)\nfig\n```\n\n::: {.cell-output .cell-output-display execution_count=8}\n![](plot_timeseries_files/figure-html/cell-9-output-1.png){}\n:::\n:::\n\n\n::: {.cell execution_count=9}\n``` {.python .cell-code}\n# Wide-Format Plotting\n\n# Imports\nimport pandas as pd\nimport numpy as np\nimport pytimetk as tk\n\n# Set a random seed for reproducibility\nnp.random.seed(42) \n\n# Create a date range\ndates = pd.date_range(start=\"2020-01-01\", periods=100, freq=\"D\")\n\n# Generate random sales data and compute expenses and profit\nsales = np.random.uniform(1000, 5000, len(dates))\nexpenses = sales * np.random.uniform(0.5, 0.8, len(dates))\nprofit = sales - expenses\n\n# Create the DataFrame\ndf = pd.DataFrame({\n 'date': dates,\n 'sales': sales,\n 'expenses': expenses,\n 'profit': profit\n})\n\n(\n df\n .plot_timeseries(\n date_column = 'date', \n value_column = ['sales', 'expenses', 'profit'],\n color_column = ['sales', 'expenses', 'profit'], \n smooth = True,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n plotly_dropdown = True, # Plotly Dropdown\n )\n)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n", "supporting": [ "plot_timeseries_files/figure-html" ], diff --git a/docs/_site/404.html b/docs/_site/404.html index 25c6c31f..7ee4da80 100644 --- a/docs/_site/404.html +++ b/docs/_site/404.html @@ -204,12 +204,6 @@ Anomaly Detection - - @@ -257,6 +251,12 @@ Correlation Funnel + + diff --git a/docs/_site/changelog-news.html b/docs/_site/changelog-news.html index 55bcd52f..6c37f158 100644 --- a/docs/_site/changelog-news.html +++ b/docs/_site/changelog-news.html @@ -204,12 +204,6 @@ Anomaly Detection - - @@ -257,6 +251,12 @@ Correlation Funnel + + diff --git a/docs/_site/contributing.html b/docs/_site/contributing.html index 6aeb4a25..8d16c92a 100644 --- a/docs/_site/contributing.html +++ b/docs/_site/contributing.html @@ -238,12 +238,6 @@ Anomaly Detection - - @@ -291,6 +285,12 @@ Correlation Funnel + + diff --git a/docs/_site/getting-started/01_installation.html b/docs/_site/getting-started/01_installation.html index 735e0a2e..8ffb3ece 100644 --- a/docs/_site/getting-started/01_installation.html +++ b/docs/_site/getting-started/01_installation.html @@ -239,12 +239,6 @@ Anomaly Detection - - @@ -292,6 +286,12 @@ Correlation Funnel + + diff --git a/docs/_site/getting-started/02_quick_start.html b/docs/_site/getting-started/02_quick_start.html index 08e7059c..5d22f0f8 100644 --- a/docs/_site/getting-started/02_quick_start.html +++ b/docs/_site/getting-started/02_quick_start.html @@ -259,12 +259,6 @@ Anomaly Detection - - @@ -312,6 +306,12 @@ Correlation Funnel + + diff --git a/docs/_site/guides/01_visualization.html b/docs/_site/guides/01_visualization.html index b4f96b8d..add4a4ae 100644 --- a/docs/_site/guides/01_visualization.html +++ b/docs/_site/guides/01_visualization.html @@ -265,12 +265,6 @@ Anomaly Detection - - @@ -318,6 +312,12 @@ Correlation Funnel + + diff --git a/docs/_site/guides/02_timetk_concepts.html b/docs/_site/guides/02_timetk_concepts.html index 35871600..a9dd458b 100644 --- a/docs/_site/guides/02_timetk_concepts.html +++ 
b/docs/_site/guides/02_timetk_concepts.html @@ -244,12 +244,6 @@ Anomaly Detection - - @@ -297,6 +291,12 @@ Correlation Funnel + + diff --git a/docs/_site/guides/03_pandas_frequency.html b/docs/_site/guides/03_pandas_frequency.html index 2fd286cf..f54ada07 100644 --- a/docs/_site/guides/03_pandas_frequency.html +++ b/docs/_site/guides/03_pandas_frequency.html @@ -243,12 +243,6 @@ Anomaly Detection - - @@ -296,6 +290,12 @@ Correlation Funnel + + diff --git a/docs/_site/guides/04_wrangling.html b/docs/_site/guides/04_wrangling.html index 7c4f1874..03648f66 100644 --- a/docs/_site/guides/04_wrangling.html +++ b/docs/_site/guides/04_wrangling.html @@ -244,12 +244,6 @@ Anomaly Detection - - @@ -297,6 +291,12 @@ Correlation Funnel + + diff --git a/docs/_site/guides/05_augmenting.html b/docs/_site/guides/05_augmenting.html index 366772f5..6284da04 100644 --- a/docs/_site/guides/05_augmenting.html +++ b/docs/_site/guides/05_augmenting.html @@ -260,12 +260,6 @@ Anomaly Detection - - @@ -313,6 +307,12 @@ Correlation Funnel + + diff --git a/docs/_site/guides/06_anomalize.html b/docs/_site/guides/06_anomalize.html index af8c7136..0652e8ff 100644 --- a/docs/_site/guides/06_anomalize.html +++ b/docs/_site/guides/06_anomalize.html @@ -63,7 +63,7 @@ - + @@ -259,12 +259,6 @@ Anomaly Detection - - @@ -312,6 +306,12 @@ Correlation Funnel + + @@ -1249,8 +1249,8 @@

2 More Coming Soo diff --git a/docs/_site/index.html b/docs/_site/index.html index f0b5c7e0..45321d61 100644 --- a/docs/_site/index.html +++ b/docs/_site/index.html @@ -263,12 +263,6 @@ Anomaly Detection - - @@ -316,6 +310,12 @@ Correlation Funnel + + diff --git a/docs/_site/performance/01_speed_comparisons.html b/docs/_site/performance/01_speed_comparisons.html index 3fa05677..3eba7eac 100644 --- a/docs/_site/performance/01_speed_comparisons.html +++ b/docs/_site/performance/01_speed_comparisons.html @@ -63,7 +63,7 @@ - + @@ -245,12 +245,6 @@ Anomaly Detection - - @@ -298,6 +292,12 @@ Correlation Funnel + + @@ -1358,8 +1358,8 @@

7 More Coming Soo @@ -692,9 +686,9 @@

Examples

fig
-
+
+ diff --git a/docs/_site/reference/progress_apply.html b/docs/_site/reference/progress_apply.html index e0a2294d..59861257 100644 --- a/docs/_site/reference/progress_apply.html +++ b/docs/_site/reference/progress_apply.html @@ -236,12 +236,6 @@ Anomaly Detection - - @@ -289,6 +283,12 @@ Correlation Funnel + + diff --git a/docs/_site/reference/summarize_by_time.html b/docs/_site/reference/summarize_by_time.html index e5ff5341..99b47699 100644 --- a/docs/_site/reference/summarize_by_time.html +++ b/docs/_site/reference/summarize_by_time.html @@ -235,12 +235,6 @@ Anomaly Detection - - @@ -288,6 +282,12 @@ Correlation Funnel + + diff --git a/docs/_site/reference/theme_timetk.html b/docs/_site/reference/theme_timetk.html index 55e997d7..04261140 100644 --- a/docs/_site/reference/theme_timetk.html +++ b/docs/_site/reference/theme_timetk.html @@ -232,12 +232,6 @@ Anomaly Detection - - @@ -285,6 +279,12 @@ Correlation Funnel + + diff --git a/docs/_site/reference/time_scale_template.html b/docs/_site/reference/time_scale_template.html index 1cebf816..66f24444 100644 --- a/docs/_site/reference/time_scale_template.html +++ b/docs/_site/reference/time_scale_template.html @@ -235,12 +235,6 @@ Anomaly Detection - - @@ -288,6 +282,12 @@ Correlation Funnel + + diff --git a/docs/_site/reference/timeseries_unit_frequency_table.html b/docs/_site/reference/timeseries_unit_frequency_table.html index 92ba2578..96626808 100644 --- a/docs/_site/reference/timeseries_unit_frequency_table.html +++ b/docs/_site/reference/timeseries_unit_frequency_table.html @@ -235,12 +235,6 @@ Anomaly Detection - - @@ -288,6 +282,12 @@ Correlation Funnel + + diff --git a/docs/_site/reference/transform_columns.html b/docs/_site/reference/transform_columns.html index 1c378911..b581699a 100644 --- a/docs/_site/reference/transform_columns.html +++ b/docs/_site/reference/transform_columns.html @@ -198,12 +198,6 @@ Anomaly Detection - - @@ -251,6 +245,12 @@ Correlation Funnel + + diff --git 
a/docs/_site/reference/ts_features.html b/docs/_site/reference/ts_features.html index 30b9d5fc..1b0ab141 100644 --- a/docs/_site/reference/ts_features.html +++ b/docs/_site/reference/ts_features.html @@ -236,12 +236,6 @@ Anomaly Detection - - @@ -289,6 +283,12 @@ Correlation Funnel + + diff --git a/docs/_site/reference/ts_summary.html b/docs/_site/reference/ts_summary.html index ddeafa77..6ad82c19 100644 --- a/docs/_site/reference/ts_summary.html +++ b/docs/_site/reference/ts_summary.html @@ -236,12 +236,6 @@ Anomaly Detection - - @@ -289,6 +283,12 @@ Correlation Funnel + + diff --git a/docs/_site/reference/week_of_month.html b/docs/_site/reference/week_of_month.html index 6e576d2b..d5e0ef6c 100644 --- a/docs/_site/reference/week_of_month.html +++ b/docs/_site/reference/week_of_month.html @@ -232,12 +232,6 @@ Anomaly Detection - - @@ -285,6 +279,12 @@ Correlation Funnel + + diff --git a/docs/_site/search.json b/docs/_site/search.json index 7949ee54..ada72ad6 100644 --- a/docs/_site/search.json +++ b/docs/_site/search.json @@ -1225,242 +1225,235 @@ "text": "5.1 Chande Momentum Oscillator (CMO) augment_cmo()\n\nPolars is 3.3X faster than Pandas\nSpeed improvement of Polars (vs Pandas) increases with number of CMO periods\n\n\n\n\n\n\n\nPolarsPandas\n\n\n\n\nCode\n%%timeit -n 25\n\ndf = (\n stocks_daily_df\n .groupby('symbol')\n .augment_cmo(\n date_column = 'date', \n value_column = 'adjusted', \n periods = (5,30),\n engine = 'polars', \n )\n)\n\n# 94.4 ms ± 3.24 ms per loop (mean ± std. dev. of 7 runs, 25 loops each)\n\n\n\n\n\n\nCode\n%%timeit -n 25\n\ndf = (\n stocks_daily_df\n .groupby('symbol')\n .augment_cmo(\n date_column = 'date', \n value_column = 'adjusted', \n periods = (5,30),\n engine = 'pandas', \n )\n)\n\n# 73.3 ms ± 3.29 ms per loop (mean ± std. dev. 
of 7 runs, 25 loops each)" }, { - "objectID": "tutorials/01_sales_crm.html", - "href": "tutorials/01_sales_crm.html", - "title": "Sales Analysis", + "objectID": "tutorials/02_finance.html", + "href": "tutorials/02_finance.html", + "title": "Finance Analysis", "section": "", - "text": "In this tutorial, we will use pytimetk and its powerful functions to perform a time series analysis on a dataset representing bike sales. Our goal is to understand the patterns in the data and forecast future sales. You will:" - }, - { - "objectID": "tutorials/01_sales_crm.html#load-packages.", - "href": "tutorials/01_sales_crm.html#load-packages.", - "title": "Sales Analysis", - "section": "1.1 Load Packages.", - "text": "1.1 Load Packages.\nIf you do not have pytimetk installed, you can install by using\npip install pytimetk\nor for the latest features and functionality, you can install the development version.\npip install git+https://github.com/business-science/pytimetk.git\n\n\nCode\nimport pytimetk as tk\nimport pandas as pd\nimport numpy as np\n\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.model_selection import train_test_split" + "text": "Timetk is designed to work with any time series domain. Arguably the most important is Finance. This tutorial showcases how you can perform Financial Investment and Stock Analysis at scale with pytimetk. 
This applied tutorial covers financial analysis with:\nLoad the following packages before proceeding with this tutorial.\nCode\nimport pytimetk as tk\nimport pandas as pd\nimport numpy as np" }, { - "objectID": "tutorials/01_sales_crm.html#load-inspect-dataset", - "href": "tutorials/01_sales_crm.html#load-inspect-dataset", - "title": "Sales Analysis", - "section": "1.2 Load & inspect dataset", - "text": "1.2 Load & inspect dataset\nTo kick off our analysis, we’ll begin by importing essential libraries and accessing the ‘bike_sales’ dataset available within pytimetk’s suite of built-in datasets.\nThe Bike Sales dataset exemplifies what one might find in a CRM (Customer Relationship Management) system. CRM systems are pivotal for businesses, offering vital insights by tracking sales throughout the entire sales funnel. Such datasets are rich with transaction-level data, encompassing elements like order numbers, individual order lines, customer details, product information, and specific transaction data.\nTransactional data, such as this, inherently holds the essential components for time series analysis:\n\nTime Stamps\nAssociated Values\nDistinct Groups or Categories\n\nGiven these attributes, the Bike Sales dataset emerges as an ideal candidate for analysis using pytimetk." + "objectID": "tutorials/02_finance.html#application-moving-averages-10-day-and-50-day", + "href": "tutorials/02_finance.html#application-moving-averages-10-day-and-50-day", + "title": "Finance Analysis", + "section": "3.1 Application: Moving Averages, 10-Day and 50-Day", + "text": "3.1 Application: Moving Averages, 10-Day and 50-Day\nThis code template can be used to make and visualize the 10-day and 50-Day moving average of a group of stock symbols. 
Click to expand the code.\n\nPlotlyPlotnine\n\n\n\n\nCode\n# Add 2 moving averages (10-day and 50-Day)\nsma_df = stocks_df[['symbol', 'date', 'adjusted']] \\\n .groupby('symbol') \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'adjusted',\n window = [10, 50],\n window_func = ['mean'],\n center = False,\n threads = 1, # Change to -1 to use all available cores\n )\n\n# Visualize \n(sma_df \n\n # zoom in on dates\n .query('date >= \"2023-01-01\"') \n\n # Convert to long format\n .melt(\n id_vars = ['symbol', 'date'],\n value_vars = [\"adjusted\", \"adjusted_rolling_mean_win_10\", \"adjusted_rolling_mean_win_50\"]\n ) \n\n # Group on symbol and visualize\n .groupby(\"symbol\") \n .plot_timeseries(\n date_column = 'date',\n value_column = 'value',\n color_column = 'variable',\n smooth = False, \n facet_ncol = 2,\n width = 900,\n height = 700,\n engine = \"plotly\"\n )\n)\n\n\n\n\n\n\n \n\n\n\n\n\n\nCode\n# Add 2 moving averages (10-day and 50-Day)\nsma_df = stocks_df[['symbol', 'date', 'adjusted']] \\\n .groupby('symbol') \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'adjusted',\n window = [10, 50],\n window_func = ['mean'],\n center = False,\n threads = 1, # Change to -1 to use all available cores\n )\n\n# Visualize \n(sma_df \n\n # zoom in on dates\n .query('date >= \"2023-01-01\"') \n\n # Convert to long format\n .melt(\n id_vars = ['symbol', 'date'],\n value_vars = [\"adjusted\", \"adjusted_rolling_mean_win_10\", \"adjusted_rolling_mean_win_50\"]\n ) \n\n # Group on symbol and visualize\n .groupby(\"symbol\") \n .plot_timeseries(\n date_column = 'date',\n value_column = 'value',\n color_column = 'variable',\n smooth = False, \n facet_ncol = 2,\n width = 900,\n height = 700,\n engine = \"plotnine\"\n )\n)\n\n\n\n\n\n\n\n\n<Figure Size: (900 x 700)>" }, { - "objectID": "tutorials/01_sales_crm.html#initial-inspection-with-tk.glimpse", - "href": "tutorials/01_sales_crm.html#initial-inspection-with-tk.glimpse", - "title": "Sales 
Analysis", - "section": "2.1 Initial Inspection with tk.glimpse", - "text": "2.1 Initial Inspection with tk.glimpse\nTo get a preliminary understanding of our data, let’s utilize the tk.glimpse() function from pytimetk. This will provide us with a snapshot of the available fields, their respective data types, and a sneak peek into the data entries.\n\n\nCode\ndf = tk.datasets.load_dataset('bike_sales_sample')\ndf['order_date'] = pd.to_datetime(df['order_date'])\n\ndf.glimpse()\n\n\n<class 'pandas.core.frame.DataFrame'>: 2466 rows of 13 columns\norder_id: int64 [1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, ...\norder_line: int64 [1, 2, 1, 2, 1, 2, 3, 4, 5, 1, 1, 2, ...\norder_date: datetime64[ns] [Timestamp('2011-01-07 00:00:00'), Ti ...\nquantity: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...\nprice: int64 [6070, 5970, 2770, 5970, 10660, 3200, ...\ntotal_price: int64 [6070, 5970, 2770, 5970, 10660, 3200, ...\nmodel: object ['Jekyll Carbon 2', 'Trigger Carbon 2 ...\ncategory_1: object ['Mountain', 'Mountain', 'Mountain', ...\ncategory_2: object ['Over Mountain', 'Over Mountain', 'T ...\nframe_material: object ['Carbon', 'Carbon', 'Aluminum', 'Car ...\nbikeshop_name: object ['Ithaca Mountain Climbers', 'Ithaca ...\ncity: object ['Ithaca', 'Ithaca', 'Kansas City', ' ...\nstate: object ['NY', 'NY', 'KS', 'KS', 'KY', 'KY', ..." + "objectID": "tutorials/02_finance.html#application-bollinger-bands", + "href": "tutorials/02_finance.html#application-bollinger-bands", + "title": "Finance Analysis", + "section": "3.2 Application: Bollinger Bands", + "text": "3.2 Application: Bollinger Bands\nBollinger Bands are a volatility indicator commonly used in financial trading. 
They consist of three lines:\n\nThe middle band, which is a simple moving average (usually over 20 periods).\nThe upper band, calculated as the middle band plus k times the standard deviation of the price (typically, k=2).\nThe lower band, calculated as the middle band minus k times the standard deviation of the price.\n\nHere’s how you can calculate and plot Bollinger Bands with pytimetk using this code template (click to expand):\n\nPlotlyPlotnine\n\n\n\n\nCode\n# Bollinger Bands\nbollinger_df = stocks_df[['symbol', 'date', 'adjusted']] \\\n .groupby('symbol') \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'adjusted',\n window = 20,\n window_func = ['mean', 'std'],\n center = False\n ) \\\n .assign(\n upper_band = lambda x: x['adjusted_rolling_mean_win_20'] + 2*x['adjusted_rolling_std_win_20'],\n lower_band = lambda x: x['adjusted_rolling_mean_win_20'] - 2*x['adjusted_rolling_std_win_20']\n )\n\n\n# Visualize\n(bollinger_df\n\n # zoom in on dates\n .query('date >= \"2023-01-01\"') \n\n # Convert to long format\n .melt(\n id_vars = ['symbol', 'date'],\n value_vars = [\"adjusted\", \"adjusted_rolling_mean_win_20\", \"upper_band\", \"lower_band\"]\n ) \n\n # Group on symbol and visualize\n .groupby(\"symbol\") \n .plot_timeseries(\n date_column = 'date',\n value_column = 'value',\n color_column = 'variable',\n # Adjust colors for Bollinger Bands\n color_palette =[\"#2C3E50\", \"#E31A1C\", '#18BC9C', '#18BC9C'],\n smooth = False, \n facet_ncol = 2,\n width = 900,\n height = 700,\n engine = \"plotly\" \n )\n)\n\n\n\n\n\n\n \n\n\n\n\n\n\nCode\n# Bollinger Bands\nbollinger_df = stocks_df[['symbol', 'date', 'adjusted']] \\\n .groupby('symbol') \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'adjusted',\n window = 20,\n window_func = ['mean', 'std'],\n center = False\n ) \\\n .assign(\n upper_band = lambda x: x['adjusted_rolling_mean_win_20'] + 2*x['adjusted_rolling_std_win_20'],\n lower_band = lambda x: x['adjusted_rolling_mean_win_20'] - 
2*x['adjusted_rolling_std_win_20']\n )\n\n\n# Visualize\n(bollinger_df\n\n # zoom in on dates\n .query('date >= \"2023-01-01\"') \n\n # Convert to long format\n .melt(\n id_vars = ['symbol', 'date'],\n value_vars = [\"adjusted\", \"adjusted_rolling_mean_win_20\", \"upper_band\", \"lower_band\"]\n ) \n\n # Group on symbol and visualize\n .groupby(\"symbol\") \n .plot_timeseries(\n date_column = 'date',\n value_column = 'value',\n color_column = 'variable',\n # Adjust colors for Bollinger Bands\n color_palette =[\"#2C3E50\", \"#E31A1C\", '#18BC9C', '#18BC9C'],\n smooth = False, \n facet_ncol = 2,\n width = 900,\n height = 700,\n engine = \"plotnine\"\n )\n)\n\n\n\n\n\n\n\n\n<Figure Size: (900 x 700)>" }, { - "objectID": "tutorials/01_sales_crm.html#data-exploration-with-tk.summarize_by_time", - "href": "tutorials/01_sales_crm.html#data-exploration-with-tk.summarize_by_time", - "title": "Sales Analysis", - "section": "2.2 Data Exploration with tk.summarize_by_time", - "text": "2.2 Data Exploration with tk.summarize_by_time\nCRM data is often bustling with activity, reflecting the myriad of transactions happening daily. Due to this high volume, the data can sometimes seem overwhelming or noisy. To derive meaningful insights, it’s essential to aggregate this data over specific time intervals. This is where tk.summarize_by_time() comes into play.\nThe tk.summarize_by_time() function offers a streamlined approach to time-based data aggregation. By defining a desired frequency and an aggregation method, this function seamlessly organizes your data. The beauty of it is its versatility; from a broad array of built-in aggregation methods and frequencies to the flexibility of integrating a custom function, it caters to a range of requirements.\n\n\n\n\n\n\nGetting to know tk.summarize_by_time()\n\n\n\n\n\nCurious about the various options it provides?\n\nClick here to see our Data Wrangling Guide\nUse help(tk.summarize_by_time) to review additional helpful documentation. 
And explore the plethora of possibilities!\n\n\n\n\n\nGetting Weekly Totals\nWe can quickly get totals by week with summarize_byt_time.\n\n\nCode\nweekly_totals = df.summarize_by_time(\n date_column = 'order_date',\n value_column = 'total_price',\n agg_func = ['sum'],\n freq = 'W'\n)\n\nweekly_totals.head(10)\n\n\n\n\n\n\n\n\n\norder_date\ntotal_price_sum\n\n\n\n\n0\n2011-01-09\n12040\n\n\n1\n2011-01-16\n151460\n\n\n2\n2011-01-23\n143850\n\n\n3\n2011-01-30\n175665\n\n\n4\n2011-02-06\n105210\n\n\n5\n2011-02-13\n250390\n\n\n6\n2011-02-20\n410595\n\n\n7\n2011-02-27\n254045\n\n\n8\n2011-03-06\n308420\n\n\n9\n2011-03-13\n45450\n\n\n\n\n\n\n\n\n\nGet Weekly Totals by Group (Category 2)\nTo better understand your data, you might want to add groups to this summary. We can include a groupby before the summarize_by_time and then aggregate our data.\n\n\nCode\n sales_by_week = df \\\n .groupby('category_2') \\\n .summarize_by_time(\n date_column = 'order_date',\n value_column = 'total_price',\n agg_func = ['sum'],\n freq = 'W'\n )\n\nsales_by_week.head(10)\n\n\n\n\n\n\n\n\n\ncategory_2\norder_date\ntotal_price_sum\n\n\n\n\n0\nCross Country Race\n2011-01-16\n61750\n\n\n1\nCross Country Race\n2011-01-23\n25050\n\n\n2\nCross Country Race\n2011-01-30\n56860\n\n\n3\nCross Country Race\n2011-02-06\n8740\n\n\n4\nCross Country Race\n2011-02-13\n78070\n\n\n5\nCross Country Race\n2011-02-20\n115010\n\n\n6\nCross Country Race\n2011-02-27\n64290\n\n\n7\nCross Country Race\n2011-03-06\n95070\n\n\n8\nCross Country Race\n2011-03-13\n3200\n\n\n9\nCross Country Race\n2011-03-20\n21170\n\n\n\n\n\n\n\n\n\nLong vs Wide Format\nThis long format can make it a little hard to compare the different group values visually, so instead of long-format you might want to pivot wide to view the data.\n\n\nCode\nsales_by_week_wide = df \\\n .groupby('category_2') \\\n .summarize_by_time(\n date_column = 'order_date',\n value_column = 'total_price',\n agg_func = ['sum'],\n freq = 'W',\n wide_format = True\n 
)\n\nsales_by_week_wide.head(10)\n\n\n\n\n\n\n\n\n\norder_date\ntotal_price_sum_Cross Country Race\ntotal_price_sum_Cyclocross\ntotal_price_sum_Elite Road\ntotal_price_sum_Endurance Road\ntotal_price_sum_Fat Bike\ntotal_price_sum_Over Mountain\ntotal_price_sum_Sport\ntotal_price_sum_Trail\ntotal_price_sum_Triathalon\n\n\n\n\n0\n2011-01-09\n0.0\n0.0\n0.0\n0.0\n0.0\n12040.0\n0.0\n0.0\n0.0\n\n\n1\n2011-01-16\n61750.0\n1960.0\n49540.0\n11110.0\n0.0\n9170.0\n4030.0\n7450.0\n6450.0\n\n\n2\n2011-01-23\n25050.0\n3500.0\n51330.0\n47930.0\n0.0\n3840.0\n0.0\n0.0\n12200.0\n\n\n3\n2011-01-30\n56860.0\n2450.0\n43895.0\n24160.0\n0.0\n10880.0\n3720.0\n26700.0\n7000.0\n\n\n4\n2011-02-06\n8740.0\n7000.0\n35640.0\n22680.0\n3730.0\n14270.0\n980.0\n10220.0\n1950.0\n\n\n5\n2011-02-13\n78070.0\n0.0\n83780.0\n24820.0\n2130.0\n17160.0\n6810.0\n17120.0\n20500.0\n\n\n6\n2011-02-20\n115010.0\n7910.0\n79770.0\n27650.0\n26100.0\n37830.0\n10925.0\n96250.0\n9150.0\n\n\n7\n2011-02-27\n64290.0\n6650.0\n86900.0\n31900.0\n5860.0\n22070.0\n6165.0\n16410.0\n13800.0\n\n\n8\n2011-03-06\n95070.0\n2450.0\n31990.0\n47660.0\n5860.0\n82060.0\n9340.0\n26790.0\n7200.0\n\n\n9\n2011-03-13\n3200.0\n4200.0\n23110.0\n7260.0\n0.0\n5970.0\n1710.0\n0.0\n0.0\n\n\n\n\n\n\n\nYou can now observe the total sales for each product side by side. This streamlined view facilitates easy comparison between product sales." 
+ "objectID": "tutorials/02_finance.html#returns-analysis-by-time", + "href": "tutorials/02_finance.html#returns-analysis-by-time", + "title": "Finance Analysis", + "section": "4.1 Returns Analysis By Time", + "text": "4.1 Returns Analysis By Time\n\n\n\n\n\n\nReturns are NOT static (so analyze them by time)\n\n\n\n\n\n\nWe can use rolling window calculations with tk.augment_rolling() to compute many rolling features at scale such as rolling mean, std, range (spread).\nWe can expand our tk.augment_rolling_apply() rolling calculations to Rolling Correlation and Rolling Regression (to make comparisons over time)\n\n\n\n\n\nApplication: Descriptive Statistic Analysis\nMany traders compute descriptive statistics like mean, median, mode, skewness, kurtosis, and standard deviation to understand the central tendency, spread, and shape of the return distribution.\n\n\nStep 1: Returns\nUse this code to get the pct_change() in wide format. Click expand to get the code.\n\n\nCode\nreturns_wide_df = stocks_df[['symbol', 'date', 'adjusted']] \\\n .pivot(index = 'date', columns = 'symbol', values = 'adjusted') \\\n .pct_change() \\\n .reset_index() \\\n 
[1:]\n\nreturns_wide_df\n\n\n\n\n\n\n\n\nsymbol\ndate\nAAPL\nAMZN\nGOOG\nMETA\nNFLX\nNVDA\n\n\n\n\n1\n2013-01-03\n-0.012622\n0.004547\n0.000581\n-0.008214\n0.049777\n0.000786\n\n\n2\n2013-01-04\n-0.027854\n0.002592\n0.019760\n0.035650\n-0.006315\n0.032993\n\n\n3\n2013-01-07\n-0.005883\n0.035925\n-0.004363\n0.022949\n0.033549\n-0.028897\n\n\n4\n2013-01-08\n0.002691\n-0.007748\n-0.001974\n-0.012237\n-0.020565\n-0.021926\n\n\n5\n2013-01-09\n-0.015629\n-0.000113\n0.006573\n0.052650\n-0.012865\n-0.022418\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n2694\n2023-09-15\n-0.004154\n-0.029920\n-0.004964\n-0.036603\n-0.008864\n-0.036879\n\n\n2695\n2023-09-18\n0.016913\n-0.002920\n0.004772\n0.007459\n-0.006399\n0.001503\n\n\n2696\n2023-09-19\n0.006181\n-0.016788\n-0.000936\n0.008329\n0.004564\n-0.010144\n\n\n2697\n2023-09-20\n-0.019992\n-0.017002\n-0.030541\n-0.017701\n-0.024987\n-0.029435\n\n\n2698\n2023-09-21\n-0.008889\n-0.044053\n-0.023999\n-0.013148\n-0.005566\n-0.028931\n\n\n\n\n2698 rows × 7 columns\n\n\n\n\n\nStep 2: Descriptive Stats\nUse this code to get standard statistics with the describe() method. Click expand to get the code.\n\n\nCode\nreturns_wide_df.describe()\n\n\n\n\n\n\n\n\nsymbol\nAAPL\nAMZN\nGOOG\nMETA\nNFLX\nNVDA\n\n\n\n\ncount\n2698.000000\n2698.000000\n2698.000000\n2698.000000\n2698.000000\n2698.000000\n\n\nmean\n0.001030\n0.001068\n0.000885\n0.001170\n0.001689\n0.002229\n\n\nstd\n0.018036\n0.020621\n0.017267\n0.024291\n0.029683\n0.028320\n\n\nmin\n-0.128647\n-0.140494\n-0.111008\n-0.263901\n-0.351166\n-0.187559\n\n\n25%\n-0.007410\n-0.008635\n-0.006900\n-0.009610\n-0.012071\n-0.010938\n\n\n50%\n0.000892\n0.001050\n0.000700\n0.001051\n0.000544\n0.001918\n\n\n75%\n0.010324\n0.011363\n0.009053\n0.012580\n0.014678\n0.015202\n\n\nmax\n0.119808\n0.141311\n0.160524\n0.296115\n0.422235\n0.298067\n\n\n\n\n\n\n\n\n\nStep 3: Correlation\nAnd run a correlation with corr(). 
Click expand to get the code.\n\n\nCode\ncorr_table_df = returns_wide_df.drop('date', axis=1).corr()\ncorr_table_df\n\n\n\n\n\n\n\n\nsymbol\nAAPL\nAMZN\nGOOG\nMETA\nNFLX\nNVDA\n\n\nsymbol\n\n\n\n\n\n\n\n\n\n\nAAPL\n1.000000\n0.497906\n0.566452\n0.479787\n0.321694\n0.526508\n\n\nAMZN\n0.497906\n1.000000\n0.628103\n0.544481\n0.475078\n0.490234\n\n\nGOOG\n0.566452\n0.628103\n1.000000\n0.595728\n0.428470\n0.531382\n\n\nMETA\n0.479787\n0.544481\n0.595728\n1.000000\n0.407417\n0.450586\n\n\nNFLX\n0.321694\n0.475078\n0.428470\n0.407417\n1.000000\n0.380153\n\n\nNVDA\n0.526508\n0.490234\n0.531382\n0.450586\n0.380153\n1.000000\n\n\n\n\n\n\n\n\nThe problem is that the stock market is constantly changing. And these descriptive statistics aren’t representative of the most recent fluctuations. This is where pytimetk comes into play with rolling descriptive statistics.\n\n\n\nApplication: 90-Day Rolling Descriptive Statistics Analysis with tk.augment_rolling()\nLet’s compute and visualize the 90-day rolling statistics.\n\n\n\n\n\n\nGetting More Info: tk.augment_rolling()\n\n\n\n\n\n\nClick here to see our Augmenting Guide\nUse help(tk.augment_rolling) to review additional helpful documentation.\n\n\n\n\n\nStep 1: Long Format Pt.1\nUse this code to get the date melt() into long format. 
Click expand to get the code.\n\n\nCode\nreturns_long_df = returns_wide_df \\\n .melt(id_vars='date', value_name='returns') \n\nreturns_long_df\n\n\n\n\n\n\n\n\n\ndate\nsymbol\nreturns\n\n\n\n\n0\n2013-01-03\nAAPL\n-0.012622\n\n\n1\n2013-01-04\nAAPL\n-0.027854\n\n\n2\n2013-01-07\nAAPL\n-0.005883\n\n\n3\n2013-01-08\nAAPL\n0.002691\n\n\n4\n2013-01-09\nAAPL\n-0.015629\n\n\n...\n...\n...\n...\n\n\n16183\n2023-09-15\nNVDA\n-0.036879\n\n\n16184\n2023-09-18\nNVDA\n0.001503\n\n\n16185\n2023-09-19\nNVDA\n-0.010144\n\n\n16186\n2023-09-20\nNVDA\n-0.029435\n\n\n16187\n2023-09-21\nNVDA\n-0.028931\n\n\n\n\n16188 rows × 3 columns\n\n\n\n\n\nStep 2: Augment Rolling Statistic\nLet’s add multiple columns of rolling statistics. Click to expand the code.\n\n\nCode\nrolling_stats_df = returns_long_df \\\n .groupby('symbol') \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'returns',\n window = [90],\n window_func = [\n 'mean', \n 'std', \n 'min',\n ('q25', lambda x: np.quantile(x, 0.25)),\n 'median',\n ('q75', lambda x: np.quantile(x, 0.75)),\n 'max'\n ],\n threads = 1 # Change to -1 to use all threads\n ) \\\n 
.dropna()\n\nrolling_stats_df\n\n\n\n\n\n\n\n\n\n\n\n\ndate\nsymbol\nreturns\nreturns_rolling_mean_win_90\nreturns_rolling_std_win_90\nreturns_rolling_min_win_90\nreturns_rolling_q25_win_90\nreturns_rolling_median_win_90\nreturns_rolling_q75_win_90\nreturns_rolling_max_win_90\n\n\n\n\n89\n2013-05-13\nAAPL\n0.003908\n-0.001702\n0.022233\n-0.123558\n-0.010533\n-0.001776\n0.012187\n0.041509\n\n\n90\n2013-05-14\nAAPL\n-0.023926\n-0.001827\n0.022327\n-0.123558\n-0.010533\n-0.001776\n0.012187\n0.041509\n\n\n91\n2013-05-15\nAAPL\n-0.033817\n-0.001894\n0.022414\n-0.123558\n-0.010533\n-0.001776\n0.012187\n0.041509\n\n\n92\n2013-05-16\nAAPL\n0.013361\n-0.001680\n0.022467\n-0.123558\n-0.010533\n-0.001360\n0.013120\n0.041509\n\n\n93\n2013-05-17\nAAPL\n-0.003037\n-0.001743\n0.022462\n-0.123558\n-0.010533\n-0.001776\n0.013120\n0.041509\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n16183\n2023-09-15\nNVDA\n-0.036879\n0.005159\n0.036070\n-0.056767\n-0.012587\n-0.000457\n0.018480\n0.243696\n\n\n16184\n2023-09-18\nNVDA\n0.001503\n0.005396\n0.035974\n-0.056767\n-0.011117\n0.000177\n0.018480\n0.243696\n\n\n16185\n2023-09-19\nNVDA\n-0.010144\n0.005162\n0.036006\n-0.056767\n-0.011117\n-0.000457\n0.018480\n0.243696\n\n\n16186\n2023-09-20\nNVDA\n-0.029435\n0.004953\n0.036153\n-0.056767\n-0.012587\n-0.000457\n0.018480\n0.243696\n\n\n16187\n2023-09-21\nNVDA\n-0.028931\n0.004724\n0.036303\n-0.056767\n-0.013166\n-0.000457\n0.018480\n0.243696\n\n\n\n\n15654 rows × 10 columns\n\n\n\n\n\nStep 3: Long Format Pt.2\nFinally, we can .melt() each of the rolling statistics for a Long Format Analysis. 
Click to expand the code.\n\n\nCode\nrolling_stats_long_df = rolling_stats_df \\\n .melt(\n id_vars = [\"symbol\", \"date\"],\n var_name = \"statistic_type\"\n )\n\nrolling_stats_long_df\n\n\n\n\n\n\n\n\n\nsymbol\ndate\nstatistic_type\nvalue\n\n\n\n\n0\nAAPL\n2013-05-13\nreturns\n0.003908\n\n\n1\nAAPL\n2013-05-14\nreturns\n-0.023926\n\n\n2\nAAPL\n2013-05-15\nreturns\n-0.033817\n\n\n3\nAAPL\n2013-05-16\nreturns\n0.013361\n\n\n4\nAAPL\n2013-05-17\nreturns\n-0.003037\n\n\n...\n...\n...\n...\n...\n\n\n125227\nNVDA\n2023-09-15\nreturns_rolling_max_win_90\n0.243696\n\n\n125228\nNVDA\n2023-09-18\nreturns_rolling_max_win_90\n0.243696\n\n\n125229\nNVDA\n2023-09-19\nreturns_rolling_max_win_90\n0.243696\n\n\n125230\nNVDA\n2023-09-20\nreturns_rolling_max_win_90\n0.243696\n\n\n125231\nNVDA\n2023-09-21\nreturns_rolling_max_win_90\n0.243696\n\n\n\n\n125232 rows × 4 columns\n\n\n\nWith the data formatted properly we can evaluate the 90-Day Rolling Statistics using .plot_timeseries().\n\nPlotlyPlotnine\n\n\n\n\nCode\nrolling_stats_long_df \\\n .groupby(['symbol', 'statistic_type']) \\\n .plot_timeseries(\n date_column = 'date',\n value_column = 'value',\n facet_ncol = 6,\n width = 1500,\n height = 1000,\n title = \"90-Day Rolling Statistics\"\n )\n\n\n\n \n\n\n\n\n\n\nCode\nrolling_stats_long_df \\\n .groupby(['symbol', 'statistic_type']) \\\n .plot_timeseries(\n date_column = 'date',\n value_column = 'value',\n facet_ncol = 6,\n facet_dir = 'v',\n width = 1500,\n height = 1000,\n title = \"90-Day Rolling Statistics\",\n engine = \"plotnine\"\n )\n\n\n\n\n\n<Figure Size: (1500 x 1000)>" }, { - "objectID": "tutorials/01_sales_crm.html#visualize-your-time-series-data-with-tk.plot_timeseries", - "href": "tutorials/01_sales_crm.html#visualize-your-time-series-data-with-tk.plot_timeseries", - "title": "Sales Analysis", - "section": "2.3 Visualize your time series data with tk.plot_timeseries", - "text": "2.3 Visualize your time series data with tk.plot_timeseries\nYou can now visualize 
the summarized data to gain a clearer insight into the prevailing trends.\n\nPlotlyPlotnine\n\n\n\n\nCode\nsales_by_week \\\n .groupby('category_2') \\\n .plot_timeseries(\n date_column = 'order_date', \n value_column = 'total_price_sum',\n title = 'Bike Sales by Category',\n facet_ncol = 2,\n facet_scales = \"free\",\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 1000,\n height = 800,\n y_lab = 'Total Sales', \n engine = 'plotly'\n )\n\n\n\n \n\n\n\n\n\n\nCode\nsales_by_week \\\n .groupby('category_2') \\\n .plot_timeseries(\n date_column = 'order_date', \n value_column = 'total_price_sum',\n title = 'Bike Sales by Category',\n facet_ncol = 2,\n facet_scales = \"free\",\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 1000,\n height = 800,\n y_lab = 'Total Sales', \n engine = 'plotnine'\n )\n\n\n\n\n\n<Figure Size: (1000 x 800)>\n\n\n\n\n\nThe graph showcases a pronounced uptick in sales for most of the different bike products during the summer. It’s a natural trend, aligning with our understanding that people gravitate towards biking during the balmy summer days. Conversely, as the chill of winter sets in at the year’s start and end, we observe a corresponding dip in sales.\nIt’s worth highlighting the elegance of the plot_timeseries function. Beyond just plotting raw data, it introduces a smoother, accentuating underlying trends and making them more discernible. This enhancement ensures we can effortlessly capture and comprehend the cyclical nature of bike sales throughout the year." + "objectID": "tutorials/02_finance.html#about-rolling-correlation", + "href": "tutorials/02_finance.html#about-rolling-correlation", + "title": "Finance Analysis", + "section": "5.1 About: Rolling Correlation", + "text": "5.1 About: Rolling Correlation\nRolling correlation calculates the correlation between two time series over a rolling window of a specified size, moving one period at a time. 
In stock analysis, this is often used to assess:\n\nDiversification: Helps in identifying how different stocks move in relation to each other, aiding in the creation of a diversified portfolio.\nMarket Dependency: Measures how a particular stock or sector is correlated with a broader market index.\nRisk Management: Helps in identifying changes in correlation structures over time which is crucial for risk assessment and management.\n\nFor example, if the rolling correlation between two stocks starts increasing, it might suggest that they are being influenced by similar factors or market conditions." }, { - "objectID": "tutorials/01_sales_crm.html#making-irregular-data-regular-with-tk.pad_by_time", - "href": "tutorials/01_sales_crm.html#making-irregular-data-regular-with-tk.pad_by_time", - "title": "Sales Analysis", - "section": "3.1 Making irregular data regular with tk.pad_by_time", - "text": "3.1 Making irregular data regular with tk.pad_by_time\nKicking off our journey, we’ll utilize pytimetk’s tk.pad_by_time() function. For this, grouping by the ‘category_1’ variable is recommended. Moreover, it’s prudent to establish a definitive end date. This ensures that all groups are equipped with training data up to the most recent date, accommodating scenarios where certain categories might have seen no sales in the final training week. 
By doing so, we create a representative observation for every group, capturing the nuances of each category’s sales pattern.\n\n\nCode\nsales_padded = sales_by_week \\\n .groupby('category_2') \\\n .pad_by_time(\n date_column = 'order_date',\n freq = 'W',\n end_date = sales_by_week.order_date.max()\n )\nsales_padded\n\n\n\n\n\n\n\n\n\ncategory_2\norder_date\ntotal_price_sum\n\n\n\n\n0\nCross Country Race\n2011-01-09\nNaN\n\n\n1\nCross Country Race\n2011-01-16\n61750.0\n\n\n2\nCross Country Race\n2011-01-23\n25050.0\n\n\n3\nCross Country Race\n2011-01-30\n56860.0\n\n\n4\nCross Country Race\n2011-02-06\n8740.0\n\n\n...\n...\n...\n...\n\n\n463\nTriathalon\n2011-12-04\n3200.0\n\n\n464\nTriathalon\n2011-12-11\n28350.0\n\n\n465\nTriathalon\n2011-12-18\n2700.0\n\n\n466\nTriathalon\n2011-12-25\n3900.0\n\n\n467\nTriathalon\n2012-01-01\nNaN\n\n\n\n\n468 rows × 3 columns" + "objectID": "tutorials/02_finance.html#application-rolling-correlation", + "href": "tutorials/02_finance.html#application-rolling-correlation", + "title": "Finance Analysis", + "section": "5.2 Application: Rolling Correlation", + "text": "5.2 Application: Rolling Correlation\nLet’s revisit the returns wide and long format. 
We can combine these two using the merge() method.\n\nStep 1: Create the return_combinations_long_df\nPerform data wrangling to get the pairwise combinations in long format:\n\nWe first .merge() to join the long returns with the wide returns by date.\nWe then .melt() to get the wide data into long format.\n\n\n\nCode\nreturn_combinations_long_df = returns_long_df \\\n .merge(returns_wide_df, how='left', on = 'date') \\\n .melt(\n id_vars = ['date', 'symbol', 'returns'],\n var_name = \"comp\",\n value_name = \"returns_comp\"\n )\nreturn_combinations_long_df\n\n\n\n\n\n\n\n\n\ndate\nsymbol\nreturns\ncomp\nreturns_comp\n\n\n\n\n0\n2013-01-03\nAAPL\n-0.012622\nAAPL\n-0.012622\n\n\n1\n2013-01-04\nAAPL\n-0.027854\nAAPL\n-0.027854\n\n\n2\n2013-01-07\nAAPL\n-0.005883\nAAPL\n-0.005883\n\n\n3\n2013-01-08\nAAPL\n0.002691\nAAPL\n0.002691\n\n\n4\n2013-01-09\nAAPL\n-0.015629\nAAPL\n-0.015629\n\n\n...\n...\n...\n...\n...\n...\n\n\n97123\n2023-09-15\nNVDA\n-0.036879\nNVDA\n-0.036879\n\n\n97124\n2023-09-18\nNVDA\n0.001503\nNVDA\n0.001503\n\n\n97125\n2023-09-19\nNVDA\n-0.010144\nNVDA\n-0.010144\n\n\n97126\n2023-09-20\nNVDA\n-0.029435\nNVDA\n-0.029435\n\n\n97127\n2023-09-21\nNVDA\n-0.028931\nNVDA\n-0.028931\n\n\n\n\n97128 rows × 5 columns\n\n\n\n\n\nStep 2: Add Rolling Correlations with tk.augment_rolling_apply()\nNext, let’s add rolling correlations.\n\nWe first .groupby() on the combination of our target assets “symbol” and our comparison asset “comp”.\nThen we use a different function, tk.augment_rolling_apply().\n\n\n\n\n\n\n\ntk.augment_rolling() vs tk.augment_rolling_apply()\n\n\n\n\n\n\nFor the vast majority of operations, tk.augment_rolling() will suffice. It’s used on a single column where there is a simple rolling transformation applied to only the value_column.\nFor more complex cases where other columns beyond a value_column are needed (e.g. 
rolling correlations, rolling regressions), tk.augment_rolling_apply() comes to the rescue.\ntk.augment_rolling_apply() exposes the group’s columns as a DataFrame to the window function, thus allowing for multi-column analysis.\n\n\n\n\n\n\n\n\n\n\ntk.augment_rolling_apply() has no value_column\n\n\n\n\n\nThis is because the rolling apply passes a DataFrame containing all columns to the custom function. The custom function is then responsible for handling the columns internally. This is how you can select multiple columns to work with.\n\n\n\n\n\nCode\nreturn_corr_df = return_combinations_long_df \\\n .groupby([\"symbol\", \"comp\"]) \\\n .augment_rolling_apply(\n date_column = \"date\",\n window = 90,\n window_func=[('corr', lambda x: x['returns'].corr(x['returns_comp']))],\n threads = 1, # Change to -1 to use all available cores\n )\n\nreturn_corr_df\n\n\n\n\n\n\n\n\n\n\n\n\ndate\nsymbol\nreturns\ncomp\nreturns_comp\nrolling_corr_win_90\n\n\n\n\n0\n2013-01-03\nAAPL\n-0.012622\nAAPL\n-0.012622\nNaN\n\n\n1\n2013-01-04\nAAPL\n-0.027854\nAAPL\n-0.027854\nNaN\n\n\n2\n2013-01-07\nAAPL\n-0.005883\nAAPL\n-0.005883\nNaN\n\n\n3\n2013-01-08\nAAPL\n0.002691\nAAPL\n0.002691\nNaN\n\n\n4\n2013-01-09\nAAPL\n-0.015629\nAAPL\n-0.015629\nNaN\n\n\n...\n...\n...\n...\n...\n...\n...\n\n\n97123\n2023-09-15\nNVDA\n-0.036879\nNVDA\n-0.036879\n1.0\n\n\n97124\n2023-09-18\nNVDA\n0.001503\nNVDA\n0.001503\n1.0\n\n\n97125\n2023-09-19\nNVDA\n-0.010144\nNVDA\n-0.010144\n1.0\n\n\n97126\n2023-09-20\nNVDA\n-0.029435\nNVDA\n-0.029435\n1.0\n\n\n97127\n2023-09-21\nNVDA\n-0.028931\nNVDA\n-0.028931\n1.0\n\n\n\n\n97128 rows × 6 columns\n\n\n\n\n\nStep 3: Visualize the Rolling Correlation\nWe can use tk.plot_timeseries() to visualize the 90-day rolling correlation.
It’s interesting to see that stock combinations such as AAPL | AMZN returns have a high positive correlation of 0.80, but this relationship was much lower (0.25) before 2015.\n\nThe blue smoother can help us detect trends\nThe y_intercept is useful in this case to draw lines at -1, 0, and 1\n\n\nPlotlyPlotnine\n\n\n\n\nCode\nreturn_corr_df \\\n .dropna() \\\n .groupby(['symbol', 'comp']) \\\n .plot_timeseries(\n date_column = \"date\",\n value_column = \"rolling_corr_win_90\",\n facet_ncol = 6,\n y_intercept = [-1,0,1],\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 1500,\n height = 1000,\n title = \"90-Day Rolling Correlation\",\n engine = \"plotly\"\n )\n\n\n\n \n\n\n\n\n\n\nCode\nreturn_corr_df \\\n .dropna() \\\n .groupby(['symbol', 'comp']) \\\n .plot_timeseries(\n date_column = \"date\",\n value_column = \"rolling_corr_win_90\",\n facet_ncol = 6,\n y_intercept = [-1,0,1],\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 1500,\n height = 1000,\n title = \"90-Day Rolling Correlation\",\n engine = \"plotnine\"\n )\n\n\n\n\n\n<Figure Size: (1500 x 1000)>\n\n\n\n\n\nFor comparison, we can examine the corr_table_df from the Descriptive Statistics Analysis:\n\nNotice that the values tend not to match the most recent trends\nFor example, AAPL | AMZN is correlated at 0.49 over the entire time period.
But more recently this correlation has dropped to 0.17 in the 90-Day Rolling Correlation chart.\n\n\n\nCode\ncorr_table_df\n\n\n\n\n\n\n\n\nsymbol\nAAPL\nAMZN\nGOOG\nMETA\nNFLX\nNVDA\n\n\nsymbol\n\n\n\n\n\n\n\n\n\n\nAAPL\n1.000000\n0.497906\n0.566452\n0.479787\n0.321694\n0.526508\n\n\nAMZN\n0.497906\n1.000000\n0.628103\n0.544481\n0.475078\n0.490234\n\n\nGOOG\n0.566452\n0.628103\n1.000000\n0.595728\n0.428470\n0.531382\n\n\nMETA\n0.479787\n0.544481\n0.595728\n1.000000\n0.407417\n0.450586\n\n\nNFLX\n0.321694\n0.475078\n0.428470\n0.407417\n1.000000\n0.380153\n\n\nNVDA\n0.526508\n0.490234\n0.531382\n0.450586\n0.380153\n1.000000" }, { - "objectID": "tutorials/01_sales_crm.html#making-future-dates-easier-with-tk.future_frame", - "href": "tutorials/01_sales_crm.html#making-future-dates-easier-with-tk.future_frame", - "title": "Sales Analysis", - "section": "3.2 Making Future Dates Easier with tk.future_frame", - "text": "3.2 Making Future Dates Easier with tk.future_frame\nMoving on, let’s set up the future frame, which will serve as our test dataset. To achieve this, employ the tk.future_frame() method. 
This function allows for the specification of a grouping column and a forecast horizon.\nUpon invoking tk.future_frame(), you’ll observe that placeholders (null values) are added for each group, extending 12 weeks into the future.\n\n\nCode\ndf_with_futureframe = sales_padded \\\n .groupby('category_2') \\\n .future_frame(\n date_column = 'order_date',\n length_out = 12\n )\ndf_with_futureframe\n\n\n\n\n\n\n\n\n\n\n\n\ncategory_2\norder_date\ntotal_price_sum\n\n\n\n\n0\nCross Country Race\n2011-01-09\nNaN\n\n\n1\nCross Country Race\n2011-01-16\n61750.0\n\n\n2\nCross Country Race\n2011-01-23\n25050.0\n\n\n3\nCross Country Race\n2011-01-30\n56860.0\n\n\n4\nCross Country Race\n2011-02-06\n8740.0\n\n\n...\n...\n...\n...\n\n\n571\nTriathalon\n2012-02-26\nNaN\n\n\n572\nTriathalon\n2012-03-04\nNaN\n\n\n573\nTriathalon\n2012-03-11\nNaN\n\n\n574\nTriathalon\n2012-03-18\nNaN\n\n\n575\nTriathalon\n2012-03-25\nNaN\n\n\n\n\n576 rows × 3 columns" + "objectID": "tutorials/02_finance.html#about-rolling-regression", + "href": "tutorials/02_finance.html#about-rolling-regression", + "title": "Finance Analysis", + "section": "5.3 About: Rolling Regression", + "text": "5.3 About: Rolling Regression\nRolling regression involves running regression analyses over rolling windows of data points to assess the relationship between a dependent and one or more independent variables. In the context of stock analysis, it can be used to:\n\nBeta Estimation: It can be used to estimate the beta of a stock (a measure of market risk) against a market index over different time periods. A higher beta indicates higher market-related risk.\nMarket Timing: It can be useful in identifying changing relationships between stocks and market indicators, helping traders to adjust their positions accordingly.\nHedge Ratio Determination: It helps in determining the appropriate hedge ratios for pairs trading or other hedging strategies." 
}, { - "objectID": "tutorials/01_sales_crm.html#lag-values-with-tk.augment_lags", - "href": "tutorials/01_sales_crm.html#lag-values-with-tk.augment_lags", - "title": "Sales Analysis", - "section": "3.3 Lag Values with tk.augment_lags", - "text": "3.3 Lag Values with tk.augment_lags\nCrafting features from time series data can be intricate, but thanks to the suite of feature engineering tools in pytimetk, the process is streamlined and intuitive.\nIn this guide, we’ll focus on the basics: introducing a few lag variables and incorporating some date-related features.\nFirstly, let’s dive into creating lag features.\nGiven our forecasting objective of a 12-week horizon, to ensure we have lag data available for every future point, we should utilize a lag of 12 or more. The beauty of the toolkit is that it supports the addition of multiple lags simultaneously.\nLag features play a pivotal role in machine learning for time series. Often, recent data offers valuable insights into future trends. To capture this recency effect, it’s crucial to integrate lag values. 
For this purpose, tk.augment_lags() comes in handy.\n\n\nCode\ndf_with_lags = df_with_futureframe \\\n .groupby('category_2') \\\n .augment_lags(\n date_column = 'order_date',\n value_column = 'total_price_sum',\n lags = [12,24]\n\n )\ndf_with_lags.head(25)\n\n\n\n\n\n\n\n\n\ncategory_2\norder_date\ntotal_price_sum\ntotal_price_sum_lag_12\ntotal_price_sum_lag_24\n\n\n\n\n0\nCross Country Race\n2011-01-09\nNaN\nNaN\nNaN\n\n\n1\nCross Country Race\n2011-01-16\n61750.0\nNaN\nNaN\n\n\n2\nCross Country Race\n2011-01-23\n25050.0\nNaN\nNaN\n\n\n3\nCross Country Race\n2011-01-30\n56860.0\nNaN\nNaN\n\n\n4\nCross Country Race\n2011-02-06\n8740.0\nNaN\nNaN\n\n\n5\nCross Country Race\n2011-02-13\n78070.0\nNaN\nNaN\n\n\n6\nCross Country Race\n2011-02-20\n115010.0\nNaN\nNaN\n\n\n7\nCross Country Race\n2011-02-27\n64290.0\nNaN\nNaN\n\n\n8\nCross Country Race\n2011-03-06\n95070.0\nNaN\nNaN\n\n\n9\nCross Country Race\n2011-03-13\n3200.0\nNaN\nNaN\n\n\n10\nCross Country Race\n2011-03-20\n21170.0\nNaN\nNaN\n\n\n11\nCross Country Race\n2011-03-27\n28990.0\nNaN\nNaN\n\n\n12\nCross Country Race\n2011-04-03\n51860.0\nNaN\nNaN\n\n\n13\nCross Country Race\n2011-04-10\n85910.0\n61750.0\nNaN\n\n\n14\nCross Country Race\n2011-04-17\n138230.0\n25050.0\nNaN\n\n\n15\nCross Country Race\n2011-04-24\n138350.0\n56860.0\nNaN\n\n\n16\nCross Country Race\n2011-05-01\n136090.0\n8740.0\nNaN\n\n\n17\nCross Country Race\n2011-05-08\n32110.0\n78070.0\nNaN\n\n\n18\nCross Country Race\n2011-05-15\n139010.0\n115010.0\nNaN\n\n\n19\nCross Country Race\n2011-05-22\n2060.0\n64290.0\nNaN\n\n\n20\nCross Country Race\n2011-05-29\n26130.0\n95070.0\nNaN\n\n\n21\nCross Country Race\n2011-06-05\n30360.0\n3200.0\nNaN\n\n\n22\nCross Country Race\n2011-06-12\n88280.0\n21170.0\nNaN\n\n\n23\nCross Country Race\n2011-06-19\n109470.0\n28990.0\nNaN\n\n\n24\nCross Country Race\n2011-06-26\n107280.0\n51860.0\nNaN\n\n\n\n\n\n\n\nObserve that lag values of 12 and 24 introduce missing entries at the dataset’s outset. 
This occurs because there isn’t available data from 12 or 24 weeks prior. To address these gaps, you can adopt one of two strategies:\n\nDiscard the Affected Rows: This is a recommended approach if your dataset is sufficiently large. Removing a few initial rows might not significantly impact the training process.\nBackfill Missing Values: In situations with limited data, you might consider backfilling these nulls using the first available values from lag 12 and 24. However, the appropriateness of this technique hinges on your specific context and objectives.\n\nFor the scope of this tutorial, we’ll opt to remove these rows. However, it’s worth pointing out that our dataset is quite small with limited historical data, so this might impact our model.\n\n\nCode\nlag_columns = [col for col in df_with_lags.columns if 'lag' in col]\ndf_no_nas = df_with_lags \\\n .dropna(subset=lag_columns, inplace=False)\n\ndf_no_nas.head()\n\n\n\n\n\n\n\n\n\ncategory_2\norder_date\ntotal_price_sum\ntotal_price_sum_lag_12\ntotal_price_sum_lag_24\n\n\n\n\n25\nCross Country Race\n2011-07-03\n56430.0\n85910.0\n61750.0\n\n\n26\nCross Country Race\n2011-07-10\n62320.0\n138230.0\n25050.0\n\n\n27\nCross Country Race\n2011-07-17\n141620.0\n138350.0\n56860.0\n\n\n28\nCross Country Race\n2011-07-24\n75720.0\n136090.0\n8740.0\n\n\n29\nCross Country Race\n2011-07-31\n21240.0\n32110.0\n78070.0" + "objectID": "tutorials/02_finance.html#application-90-day-rolling-regression", + "href": "tutorials/02_finance.html#application-90-day-rolling-regression", + "title": "Finance Analysis", + "section": "5.4 Application: 90-Day Rolling Regression", + "text": "5.4 Application: 90-Day Rolling Regression\n\n\n\n\n\n\nThis Application Requires Scikit Learn\n\n\n\n\n\nWe need to make a regression function that returns the Slope and Intercept. Scikit Learn has an easy-to-use modeling interface. 
You may need to pip install scikit-learn to use this applied tutorial.\n\n\n\n\nStep 1: Get Market Returns\nFor our purposes, we assume the market is the average returns of the 6 technology stocks.\n\nWe calculate an equal-weight portfolio as the “market returns”.\nThen we merge the market returns into the returns long data.\n\n\n\nCode\n# Assume Market Returns = Equal Weight Portfolio\nmarket_returns_df = returns_wide_df \\\n .set_index(\"date\") \\\n .assign(returns_market = lambda df: df.sum(axis = 1) * (1 / df.shape[1])) \\\n .reset_index() \\\n [['date', 'returns_market']]\n\n# Merge with returns long\nreturns_long_market_df = returns_long_df \\\n .merge(market_returns_df, how='left', on='date')\n\nreturns_long_market_df\n\n\n\n\n\n\n\n\n\ndate\nsymbol\nreturns\nreturns_market\n\n\n\n\n0\n2013-01-03\nAAPL\n-0.012622\n0.005809\n\n\n1\n2013-01-04\nAAPL\n-0.027854\n0.009471\n\n\n2\n2013-01-07\nAAPL\n-0.005883\n0.008880\n\n\n3\n2013-01-08\nAAPL\n0.002691\n-0.010293\n\n\n4\n2013-01-09\nAAPL\n-0.015629\n0.001366\n\n\n...\n...\n...\n...\n...\n\n\n16183\n2023-09-15\nNVDA\n-0.036879\n-0.020231\n\n\n16184\n2023-09-18\nNVDA\n0.001503\n0.003555\n\n\n16185\n2023-09-19\nNVDA\n-0.010144\n-0.001466\n\n\n16186\n2023-09-20\nNVDA\n-0.029435\n-0.023276\n\n\n16187\n2023-09-21\nNVDA\n-0.028931\n-0.020764\n\n\n\n\n16188 rows × 4 columns\n\n\n\n\n\nStep 2: Run a Rolling Regression\nNext, run the following code to perform a rolling regression:\n\nUse a custom regression function that will return the slope and intercept as a pandas series.\nRun the rolling regression with tk.augment_rolling_apply().\n\n\n\nCode\ndef regression(df):\n \n # External functions must \n from sklearn.linear_model import LinearRegression\n\n model = LinearRegression()\n X = df[['returns_market']] # Extract X values (independent variables)\n y = df['returns'] # Extract y values (dependent variable)\n model.fit(X, y)\n ret = pd.Series([model.intercept_, model.coef_[0]], index=['Intercept', 'Slope'])\n \n return 
ret # Return intercept and slope as a Series\n\nreturn_regression_df = returns_long_market_df \\\n .groupby('symbol') \\\n .augment_rolling_apply(\n date_column = \"date\",\n window = 90,\n window_func = [('regression', regression)],\n threads = 1, # Change to -1 to use all available cores \n ) \\\n .dropna()\n\nreturn_regression_df\n\n\n\n\n\n\n\n\n\n\n\n\ndate\nsymbol\nreturns\nreturns_market\nrolling_regression_win_90\n\n\n\n\n89\n2013-05-13\nAAPL\n0.003908\n0.007082\nIntercept -0.001844 Slope 0.061629 dt...\n\n\n90\n2013-05-14\nAAPL\n-0.023926\n0.007583\nIntercept -0.001959 Slope 0.056540 dt...\n\n\n91\n2013-05-15\nAAPL\n-0.033817\n0.005381\nIntercept -0.002036 Slope 0.062330 dt...\n\n\n92\n2013-05-16\nAAPL\n0.013361\n-0.009586\nIntercept -0.001789 Slope 0.052348 dt...\n\n\n93\n2013-05-17\nAAPL\n-0.003037\n0.009005\nIntercept -0.001871 Slope 0.055661 dt...\n\n\n...\n...\n...\n...\n...\n...\n\n\n16183\n2023-09-15\nNVDA\n-0.036879\n-0.020231\nIntercept 0.000100 Slope 1.805479 dt...\n\n\n16184\n2023-09-18\nNVDA\n0.001503\n0.003555\nIntercept 0.000207 Slope 1.800813 dt...\n\n\n16185\n2023-09-19\nNVDA\n-0.010144\n-0.001466\nIntercept 0.000301 Slope 1.817878 dt...\n\n\n16186\n2023-09-20\nNVDA\n-0.029435\n-0.023276\nIntercept 0.000845 Slope 1.825818 dt...\n\n\n16187\n2023-09-21\nNVDA\n-0.028931\n-0.020764\nIntercept 0.000901 Slope 1.818710 dt...\n\n\n\n\n15654 rows × 5 columns\n\n\n\n\n\nStep 3: Extract the Slope Coefficient (Beta)\nThis is more of a hack than anything to extract the beta (slope) of the rolling regression.\n\n\nCode\nintercept_slope_df = pd.concat(return_regression_df['rolling_regression_win_90'].to_list(), axis=1).T \n\nintercept_slope_df.index = return_regression_df.index\n\nreturn_beta_df = pd.concat([return_regression_df, intercept_slope_df], axis=1)\n\nreturn_beta_df\n\n\n\n\n\n\n\n\n\ndate\nsymbol\nreturns\nreturns_market\nrolling_regression_win_90\nIntercept\nSlope\n\n\n\n\n89\n2013-05-13\nAAPL\n0.003908\n0.007082\nIntercept -0.001844 Slope 
0.061629 dt...\n-0.001844\n0.061629\n\n\n90\n2013-05-14\nAAPL\n-0.023926\n0.007583\nIntercept -0.001959 Slope 0.056540 dt...\n-0.001959\n0.056540\n\n\n91\n2013-05-15\nAAPL\n-0.033817\n0.005381\nIntercept -0.002036 Slope 0.062330 dt...\n-0.002036\n0.062330\n\n\n92\n2013-05-16\nAAPL\n0.013361\n-0.009586\nIntercept -0.001789 Slope 0.052348 dt...\n-0.001789\n0.052348\n\n\n93\n2013-05-17\nAAPL\n-0.003037\n0.009005\nIntercept -0.001871 Slope 0.055661 dt...\n-0.001871\n0.055661\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n16183\n2023-09-15\nNVDA\n-0.036879\n-0.020231\nIntercept 0.000100 Slope 1.805479 dt...\n0.000100\n1.805479\n\n\n16184\n2023-09-18\nNVDA\n0.001503\n0.003555\nIntercept 0.000207 Slope 1.800813 dt...\n0.000207\n1.800813\n\n\n16185\n2023-09-19\nNVDA\n-0.010144\n-0.001466\nIntercept 0.000301 Slope 1.817878 dt...\n0.000301\n1.817878\n\n\n16186\n2023-09-20\nNVDA\n-0.029435\n-0.023276\nIntercept 0.000845 Slope 1.825818 dt...\n0.000845\n1.825818\n\n\n16187\n2023-09-21\nNVDA\n-0.028931\n-0.020764\nIntercept 0.000901 Slope 1.818710 dt...\n0.000901\n1.818710\n\n\n\n\n15654 rows × 7 columns\n\n\n\n\n\nStep 4: Visualize the Rolling Beta\n\nPlotlyPlotnine\n\n\n\n\nCode\nreturn_beta_df \\\n .groupby('symbol') \\\n .plot_timeseries(\n date_column = \"date\",\n value_column = \"Slope\",\n facet_ncol = 2,\n facet_scales = \"free_x\",\n y_intercept = [0, 3],\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 800,\n height = 600,\n title = \"90-Day Rolling Regression\",\n engine = \"plotly\",\n )\n\n\n\n \n\n\n\n\n\n\nCode\nreturn_beta_df \\\n .groupby('symbol') \\\n .plot_timeseries(\n date_column = \"date\",\n value_column = \"Slope\",\n facet_ncol = 2,\n facet_scales = \"free_x\",\n y_intercept = [0, 3],\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 800,\n height = 600,\n title = \"90-Day Rolling Regression\",\n engine = \"plotnine\",\n )\n\n\n\n\n\n<Figure Size: (800 x 600)>" }, { - "objectID": 
"tutorials/01_sales_crm.html#date-features-with-tk.augment_timeseries_signature", - "href": "tutorials/01_sales_crm.html#date-features-with-tk.augment_timeseries_signature", - "title": "Sales Analysis", - "section": "3.4 Date Features with tk.augment_timeseries_signature", - "text": "3.4 Date Features with tk.augment_timeseries_signature\nNow, let’s enrich our dataset with date-related features.\nWith the function tk.augment_timeseries_signature(), you can effortlessly append 29 date attributes to a timestamp. Given that our dataset captures weekly intervals, certain attributes like ‘hour’ may not be pertinent. Thus, it’s prudent to refine our columns, retaining only those that truly matter to our analysis.\n\n\nCode\ndf_with_datefeatures = df_no_nas \\\n .augment_timeseries_signature(date_column='order_date')\n\ndf_with_datefeatures.head(10)\n\n\n\n\n\n\n\n\n\ncategory_2\norder_date\ntotal_price_sum\ntotal_price_sum_lag_12\ntotal_price_sum_lag_24\norder_date_index_num\norder_date_year\norder_date_year_iso\norder_date_yearstart\norder_date_yearend\n...\norder_date_mday\norder_date_qday\norder_date_yday\norder_date_weekend\norder_date_hour\norder_date_minute\norder_date_second\norder_date_msecond\norder_date_nsecond\norder_date_am_pm\n\n\n\n\n25\nCross Country Race\n2011-07-03\n56430.0\n85910.0\n61750.0\n1309651200\n2011\n2011\n0\n0\n...\n3\n3\n184\n1\n0\n0\n0\n0\n0\nam\n\n\n26\nCross Country Race\n2011-07-10\n62320.0\n138230.0\n25050.0\n1310256000\n2011\n2011\n0\n0\n...\n10\n10\n191\n1\n0\n0\n0\n0\n0\nam\n\n\n27\nCross Country Race\n2011-07-17\n141620.0\n138350.0\n56860.0\n1310860800\n2011\n2011\n0\n0\n...\n17\n17\n198\n1\n0\n0\n0\n0\n0\nam\n\n\n28\nCross Country Race\n2011-07-24\n75720.0\n136090.0\n8740.0\n1311465600\n2011\n2011\n0\n0\n...\n24\n24\n205\n1\n0\n0\n0\n0\n0\nam\n\n\n29\nCross Country Race\n2011-07-31\n21240.0\n32110.0\n78070.0\n1312070400\n2011\n2011\n0\n0\n...\n31\n31\n212\n1\n0\n0\n0\n0\n0\nam\n\n\n30\nCross Country 
Race\n2011-08-07\n11620.0\n139010.0\n115010.0\n1312675200\n2011\n2011\n0\n0\n...\n7\n38\n219\n1\n0\n0\n0\n0\n0\nam\n\n\n31\nCross Country Race\n2011-08-14\n9730.0\n2060.0\n64290.0\n1313280000\n2011\n2011\n0\n0\n...\n14\n45\n226\n1\n0\n0\n0\n0\n0\nam\n\n\n32\nCross Country Race\n2011-08-21\n22780.0\n26130.0\n95070.0\n1313884800\n2011\n2011\n0\n0\n...\n21\n52\n233\n1\n0\n0\n0\n0\n0\nam\n\n\n33\nCross Country Race\n2011-08-28\n53680.0\n30360.0\n3200.0\n1314489600\n2011\n2011\n0\n0\n...\n28\n59\n240\n1\n0\n0\n0\n0\n0\nam\n\n\n34\nCross Country Race\n2011-09-04\n38360.0\n88280.0\n21170.0\n1315094400\n2011\n2011\n0\n0\n...\n4\n66\n247\n1\n0\n0\n0\n0\n0\nam\n\n\n\n\n10 rows × 34 columns\n\n\n\nWe can quickly get a sense of what features were just created using tk.glimpse.\n\n\nCode\ndf_with_datefeatures.glimpse()\n\n\n<class 'pandas.core.frame.DataFrame'>: 341 rows of 34 columns\ncategory_2: object ['Cross Country Race', 'Cros ...\norder_date: datetime64[ns] [Timestamp('2011-07-03 00:00 ...\ntotal_price_sum: float64 [56430.0, 62320.0, 141620.0, ...\ntotal_price_sum_lag_12: float64 [85910.0, 138230.0, 138350.0 ...\ntotal_price_sum_lag_24: float64 [61750.0, 25050.0, 56860.0, ...\norder_date_index_num: int64 [1309651200, 1310256000, 131 ...\norder_date_year: int64 [2011, 2011, 2011, 2011, 201 ...\norder_date_year_iso: UInt32 [2011, 2011, 2011, 2011, 201 ...\norder_date_yearstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_yearend: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_leapyear: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_half: int64 [2, 2, 2, 2, 2, 2, 2, 2, 2, ...\norder_date_quarter: int64 [3, 3, 3, 3, 3, 3, 3, 3, 3, ...\norder_date_quarteryear: object ['2011Q3', '2011Q3', '2011Q3 ...\norder_date_quarterstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_quarterend: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_month: int64 [7, 7, 7, 7, 7, 8, 8, 8, 8, ...\norder_date_month_lbl: object ['July', 'July', 'July', 'Ju ...\norder_date_monthstart: 
uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_monthend: uint8 [0, 0, 0, 0, 1, 0, 0, 0, 0, ...\norder_date_yweek: UInt32 [26, 27, 28, 29, 30, 31, 32, ...\norder_date_mweek: int64 [1, 2, 3, 4, 5, 1, 2, 3, 4, ...\norder_date_wday: int64 [7, 7, 7, 7, 7, 7, 7, 7, 7, ...\norder_date_wday_lbl: object ['Sunday', 'Sunday', 'Sunday ...\norder_date_mday: int64 [3, 10, 17, 24, 31, 7, 14, 2 ...\norder_date_qday: int64 [3, 10, 17, 24, 31, 38, 45, ...\norder_date_yday: int64 [184, 191, 198, 205, 212, 21 ...\norder_date_weekend: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1, ...\norder_date_hour: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_minute: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_second: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_msecond: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_nsecond: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_am_pm: object ['am', 'am', 'am', 'am', 'am ...\n\n\nLet’s subset to just a few of the relevant date features. Let’s use tk.glimpse again.\n\n\nCode\ndf_with_datefeatures_narrom = df_with_datefeatures[[\n 'order_date', \n 'category_2', \n 'total_price_sum',\n 'total_price_sum_lag_12',\n 'total_price_sum_lag_24',\n 'order_date_year', \n 'order_date_half', \n 'order_date_quarter', \n 'order_date_month',\n 'order_date_yweek'\n]]\n\ndf_with_datefeatures_narrom.glimpse()\n\n\n<class 'pandas.core.frame.DataFrame'>: 341 rows of 10 columns\norder_date: datetime64[ns] [Timestamp('2011-07-03 00:00: ...\ncategory_2: object ['Cross Country Race', 'Cross ...\ntotal_price_sum: float64 [56430.0, 62320.0, 141620.0, ...\ntotal_price_sum_lag_12: float64 [85910.0, 138230.0, 138350.0, ...\ntotal_price_sum_lag_24: float64 [61750.0, 25050.0, 56860.0, 8 ...\norder_date_year: int64 [2011, 2011, 2011, 2011, 2011 ...\norder_date_half: int64 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2 ...\norder_date_quarter: int64 [3, 3, 3, 3, 3, 3, 3, 3, 3, 3 ...\norder_date_month: int64 [7, 7, 7, 7, 7, 8, 8, 8, 8, 9 ...\norder_date_yweek: UInt32 [26, 27, 28, 29, 30, 31, 32, 
...\n\n\n\nOne-Hot Encoding\nThe final phase in our feature engineering journey is one-hot encoding our categorical variables. While certain machine learning models like CatBoost can natively handle categorical data, many cannot. Enter one-hot encoding, a technique that transforms each category within a column into its separate column, marking its presence with a ‘1’ or absence with a ‘0’.\nFor this transformation, the handy pd.get_dummies() function from pandas comes to the rescue.\n\n\nCode\ndf_encoded = pd.get_dummies(df_with_datefeatures_narrom, columns=['category_2'])\n\ndf_encoded.glimpse()\n\n\n<class 'pandas.core.frame.DataFrame'>: 341 rows of 18 columns\norder_date: datetime64[ns] [Timestamp('2011-07-03 ...\ntotal_price_sum: float64 [56430.0, 62320.0, 141 ...\ntotal_price_sum_lag_12: float64 [85910.0, 138230.0, 13 ...\ntotal_price_sum_lag_24: float64 [61750.0, 25050.0, 568 ...\norder_date_year: int64 [2011, 2011, 2011, 201 ...\norder_date_half: int64 [2, 2, 2, 2, 2, 2, 2, ...\norder_date_quarter: int64 [3, 3, 3, 3, 3, 3, 3, ...\norder_date_month: int64 [7, 7, 7, 7, 7, 8, 8, ...\norder_date_yweek: UInt32 [26, 27, 28, 29, 30, 3 ...\ncategory_2_Cross Country Race: uint8 [1, 1, 1, 1, 1, 1, 1, ...\ncategory_2_Cyclocross: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Elite Road: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Endurance Road: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Fat Bike: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Over Mountain: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Sport: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Trail: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Triathalon: uint8 [0, 0, 0, 0, 0, 0, 0, ...\n\n\n\n\nTraining and Future Feature Sets\nPytimetk offers an extensive array of feature engineering tools and augmentation functions, giving you a broad spectrum of possibilities. 
However, for the purposes of this tutorial, let’s shift our focus to modeling.\nLet’s proceed by segmenting our dataframe into training and future sets.\n\n\nCode\nfuture = df_encoded[df_encoded.total_price_sum.isnull()]\ntrain = df_encoded[df_encoded.total_price_sum.notnull()]\n\n\nLet’s focus on the columns essential for training. You’ll observe that we’ve excluded the ‘order_date’ column. This is because numerous machine learning models struggle with date data types. This is precisely why we utilized the tk.augment_timeseries_signature earlier—to transform date features into a format that’s compatible with ML models.\nWe can quickly see what features we have available with tk.glimpse().\n\n\nCode\ntrain.glimpse()\n\n\n<class 'pandas.core.frame.DataFrame'>: 233 rows of 18 columns\norder_date: datetime64[ns] [Timestamp('2011-07-03 ...\ntotal_price_sum: float64 [56430.0, 62320.0, 141 ...\ntotal_price_sum_lag_12: float64 [85910.0, 138230.0, 13 ...\ntotal_price_sum_lag_24: float64 [61750.0, 25050.0, 568 ...\norder_date_year: int64 [2011, 2011, 2011, 201 ...\norder_date_half: int64 [2, 2, 2, 2, 2, 2, 2, ...\norder_date_quarter: int64 [3, 3, 3, 3, 3, 3, 3, ...\norder_date_month: int64 [7, 7, 7, 7, 7, 8, 8, ...\norder_date_yweek: UInt32 [26, 27, 28, 29, 30, 3 ...\ncategory_2_Cross Country Race: uint8 [1, 1, 1, 1, 1, 1, 1, ...\ncategory_2_Cyclocross: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Elite Road: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Endurance Road: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Fat Bike: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Over Mountain: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Sport: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Trail: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Triathalon: uint8 [0, 0, 0, 0, 0, 0, 0, ..." 
+ "objectID": "tutorials/03_demand_forecasting.html", + "href": "tutorials/03_demand_forecasting.html", + "title": "Demand Forecasting", + "section": "", + "text": "Timetk enables you to generate features from the time column of your data very easily. This tutorial showcases how easy it is to perform time series forecasting with pytimetk. The specific methods we will be using are:" }, { - "objectID": "tutorials/01_sales_crm.html#scikit-learn-model", - "href": "tutorials/01_sales_crm.html#scikit-learn-model", - "title": "Sales Analysis", - "section": "3.5 Scikit Learn Model", - "text": "3.5 Scikit Learn Model\nNow for some machine learning.\n\nFitting a Random Forest Regressor\nLet’s create a RandomForestRegressor to predict future sales patterns.\n\ntrain_columns = [ 'total_price_sum_lag_12',\n 'total_price_sum_lag_24', 'order_date_year', 'order_date_half',\n 'order_date_quarter', 'order_date_month', 'order_date_yweek','category_2_Cross Country Race', 'category_2_Cyclocross',\n 'category_2_Elite Road', 'category_2_Endurance Road',\n 'category_2_Fat Bike', 'category_2_Over Mountain', 'category_2_Sport',\n 'category_2_Trail', 'category_2_Triathalon']\nX = train[train_columns]\ny = train[['total_price_sum']]\n\nmodel = RandomForestRegressor(random_state=123)\nmodel = model.fit(X, y)\n\n\n\nPrediction\nWe now have a fitted model, and can use this to predict sales from our future frame.\n\n\nCode\npredicted_values = model.predict(future[train_columns])\nfuture['y_pred'] = predicted_values\n\nfuture.head(10)\n\n\n\n\n\n\n\n\n\norder_date\ntotal_price_sum\ntotal_price_sum_lag_12\ntotal_price_sum_lag_24\norder_date_year\norder_date_half\norder_date_quarter\norder_date_month\norder_date_yweek\ncategory_2_Cross Country Race\ncategory_2_Cyclocross\ncategory_2_Elite Road\ncategory_2_Endurance Road\ncategory_2_Fat Bike\ncategory_2_Over 
Mountain\ncategory_2_Sport\ncategory_2_Trail\ncategory_2_Triathalon\ny_pred\n\n\n\n\n468\n2012-01-08\nNaN\n51820.0\n75720.0\n2012\n1\n1\n1\n1\n1\n0\n0\n0\n0\n0\n0\n0\n0\n59462.00\n\n\n469\n2012-01-15\nNaN\n62940.0\n21240.0\n2012\n1\n1\n1\n2\n1\n0\n0\n0\n0\n0\n0\n0\n0\n59149.45\n\n\n470\n2012-01-22\nNaN\n9060.0\n11620.0\n2012\n1\n1\n1\n3\n1\n0\n0\n0\n0\n0\n0\n0\n0\n20458.40\n\n\n471\n2012-01-29\nNaN\n15980.0\n9730.0\n2012\n1\n1\n1\n4\n1\n0\n0\n0\n0\n0\n0\n0\n0\n31914.00\n\n\n472\n2012-02-05\nNaN\n59180.0\n22780.0\n2012\n1\n1\n2\n5\n1\n0\n0\n0\n0\n0\n0\n0\n0\n59128.95\n\n\n473\n2012-02-12\nNaN\n132550.0\n53680.0\n2012\n1\n1\n2\n6\n1\n0\n0\n0\n0\n0\n0\n0\n0\n76397.50\n\n\n474\n2012-02-19\nNaN\n68430.0\n38360.0\n2012\n1\n1\n2\n7\n1\n0\n0\n0\n0\n0\n0\n0\n0\n63497.80\n\n\n475\n2012-02-26\nNaN\n29470.0\n90290.0\n2012\n1\n1\n2\n8\n1\n0\n0\n0\n0\n0\n0\n0\n0\n57332.00\n\n\n476\n2012-03-04\nNaN\n71080.0\n7380.0\n2012\n1\n1\n3\n9\n1\n0\n0\n0\n0\n0\n0\n0\n0\n60981.30\n\n\n477\n2012-03-11\nNaN\n9800.0\n0.0\n2012\n1\n1\n3\n10\n1\n0\n0\n0\n0\n0\n0\n0\n0\n18738.15\n\n\n\n\n\n\n\n\n\nCleaning Up\nNow let us do a little cleanup. For ease in plotting later, let’s add a column to track the actuals vs. 
the predicted values.\n\n\nCode\ntrain['type'] = 'actuals'\nfuture['type'] = 'prediction'\n\nfull_df = pd.concat([train, future])\n\nfull_df.head(10)\n\n\n\n\n\n\n\n\n\norder_date\ntotal_price_sum\ntotal_price_sum_lag_12\ntotal_price_sum_lag_24\norder_date_year\norder_date_half\norder_date_quarter\norder_date_month\norder_date_yweek\ncategory_2_Cross Country Race\ncategory_2_Cyclocross\ncategory_2_Elite Road\ncategory_2_Endurance Road\ncategory_2_Fat Bike\ncategory_2_Over Mountain\ncategory_2_Sport\ncategory_2_Trail\ncategory_2_Triathalon\ntype\ny_pred\n\n\n\n\n25\n2011-07-03\n56430.0\n85910.0\n61750.0\n2011\n2\n3\n7\n26\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n26\n2011-07-10\n62320.0\n138230.0\n25050.0\n2011\n2\n3\n7\n27\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n27\n2011-07-17\n141620.0\n138350.0\n56860.0\n2011\n2\n3\n7\n28\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n28\n2011-07-24\n75720.0\n136090.0\n8740.0\n2011\n2\n3\n7\n29\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n29\n2011-07-31\n21240.0\n32110.0\n78070.0\n2011\n2\n3\n7\n30\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n30\n2011-08-07\n11620.0\n139010.0\n115010.0\n2011\n2\n3\n8\n31\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n31\n2011-08-14\n9730.0\n2060.0\n64290.0\n2011\n2\n3\n8\n32\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n32\n2011-08-21\n22780.0\n26130.0\n95070.0\n2011\n2\n3\n8\n33\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n33\n2011-08-28\n53680.0\n30360.0\n3200.0\n2011\n2\n3\n8\n34\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n34\n2011-09-04\n38360.0\n88280.0\n21170.0\n2011\n2\n3\n9\n35\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n\n\n\n\n\nYou can get the grouping category back from the one-hot encoding for easier plotting. 
For simplicity, we will search for any column with ‘category’ in its name.\n\n\nCode\n# Extract dummy columns\ndummy_cols = [col for col in full_df.columns if 'category' in col.lower() ]\nfull_df_reverted = full_df.copy()\n\n# Convert dummy columns back to categorical column\nfull_df_reverted['category'] = full_df_reverted[dummy_cols].idxmax(axis=1).str.replace(\"A_\", \"\")\n\n# Drop dummy columns\nfull_df_reverted = full_df_reverted.drop(columns=dummy_cols)\n\nfull_df_reverted.head(10)\n\n\n\n\n\n\n\n\n\norder_date\ntotal_price_sum\ntotal_price_sum_lag_12\ntotal_price_sum_lag_24\norder_date_year\norder_date_half\norder_date_quarter\norder_date_month\norder_date_yweek\ntype\ny_pred\ncategory\n\n\n\n\n25\n2011-07-03\n56430.0\n85910.0\n61750.0\n2011\n2\n3\n7\n26\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n26\n2011-07-10\n62320.0\n138230.0\n25050.0\n2011\n2\n3\n7\n27\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n27\n2011-07-17\n141620.0\n138350.0\n56860.0\n2011\n2\n3\n7\n28\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n28\n2011-07-24\n75720.0\n136090.0\n8740.0\n2011\n2\n3\n7\n29\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n29\n2011-07-31\n21240.0\n32110.0\n78070.0\n2011\n2\n3\n7\n30\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n30\n2011-08-07\n11620.0\n139010.0\n115010.0\n2011\n2\n3\n8\n31\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n31\n2011-08-14\n9730.0\n2060.0\n64290.0\n2011\n2\n3\n8\n32\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n32\n2011-08-21\n22780.0\n26130.0\n95070.0\n2011\n2\n3\n8\n33\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n33\n2011-08-28\n53680.0\n30360.0\n3200.0\n2011\n2\n3\n8\n34\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n34\n2011-09-04\n38360.0\n88280.0\n21170.0\n2011\n2\n3\n9\n35\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n\n\n\n\n\n\n\nPre-Visualization Wrangling\nBefore we proceed to visualization, let’s streamline our dataset by aligning our predicted values with the actuals. 
This approach will simplify the plotting process. Given that our DataFrame columns are already labeled as ‘actuals’ and ‘predictions’, a brief conditional check will allow us to consolidate the necessary values.\n\n\nCode\nfull_df_reverted['total_price_sum'] = np.where(full_df_reverted.type =='actuals', full_df_reverted.total_price_sum, full_df_reverted.y_pred)\n\nfull_df_reverted.head(10)\n\n\n\n\n\n\n\n\n\norder_date\ntotal_price_sum\ntotal_price_sum_lag_12\ntotal_price_sum_lag_24\norder_date_year\norder_date_half\norder_date_quarter\norder_date_month\norder_date_yweek\ntype\ny_pred\ncategory\n\n\n\n\n25\n2011-07-03\n56430.0\n85910.0\n61750.0\n2011\n2\n3\n7\n26\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n26\n2011-07-10\n62320.0\n138230.0\n25050.0\n2011\n2\n3\n7\n27\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n27\n2011-07-17\n141620.0\n138350.0\n56860.0\n2011\n2\n3\n7\n28\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n28\n2011-07-24\n75720.0\n136090.0\n8740.0\n2011\n2\n3\n7\n29\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n29\n2011-07-31\n21240.0\n32110.0\n78070.0\n2011\n2\n3\n7\n30\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n30\n2011-08-07\n11620.0\n139010.0\n115010.0\n2011\n2\n3\n8\n31\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n31\n2011-08-14\n9730.0\n2060.0\n64290.0\n2011\n2\n3\n8\n32\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n32\n2011-08-21\n22780.0\n26130.0\n95070.0\n2011\n2\n3\n8\n33\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n33\n2011-08-28\n53680.0\n30360.0\n3200.0\n2011\n2\n3\n8\n34\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n34\n2011-09-04\n38360.0\n88280.0\n21170.0\n2011\n2\n3\n9\n35\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n\n\n\n\n\n\n\nVisualize the Forecast\nLet’s again use tk.plot_timeseries() to visually inspect the forecasts.\n\nPlotlyPlotnine\n\n\n\n\nCode\nfull_df_reverted \\\n .groupby('category') \\\n .plot_timeseries(\n date_column = 'order_date',\n value_column = 'total_price_sum',\n 
color_column = 'type',\n smooth = False,\n smooth_alpha = 0,\n facet_ncol = 2,\n facet_scales = \"free\",\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 800,\n height = 600,\n engine = 'plotly'\n )\n\n\n\n \n\n\n\n\n\n\nCode\nfull_df_reverted \\\n .groupby('category') \\\n .plot_timeseries(\n date_column = 'order_date',\n value_column = 'total_price_sum',\n color_column = 'type',\n smooth = False,\n smooth_alpha = 0,\n facet_ncol = 2, \n facet_scales = \"free\",\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 1000,\n height = 800,\n engine = 'plotnine'\n )\n\n\n\n\n\n<Figure Size: (1000 x 800)>\n\n\n\n\n\nUpon examining the graph, our models look alright given the length of time for training. Important points:\n\nFor effective time series forecasting, having multiple years of data is pivotal. This provides the model ample opportunities to recognize and adapt to seasonal variations.\nGiven our dataset spanned less than a year, the model lacked the depth of historical context to discern such patterns.\nAlthough our feature engineering was kept basic to introduce various pytimetk capabilities, there’s room for enhancement.\nFor a more refined analysis, consider experimenting with different machine learning models and diving deeper into feature engineering.\nPytimetk’s tk.augment_fourier() might assist in discerning seasonal trends, but with the dataset’s limited historical scope, capturing intricate patterns could remain a challenge." 
+ "objectID": "tutorials/03_demand_forecasting.html#load-packages", + "href": "tutorials/03_demand_forecasting.html#load-packages", + "title": "Demand Forecasting", + "section": "1.1 Load Packages", + "text": "1.1 Load Packages\nLoad the following packages before proceeding with this tutorial.\n\n\nCode\nimport pandas as pd\nimport numpy as np\nimport pytimetk as tk\n\nfrom sklearn.ensemble import RandomForestRegressor\n\n\nThe tutorial is divided into three parts: We will first have a look at the Walmart dataset and perform some preprocessing. Secondly, we will create models based on different features, and see how the time features can be useful. Finally, we will solve the task of time series forecasting, using the features from augment_timeseries_signature, augment_lags, and augment_rolling, to predict future sales." }, { - "objectID": "tutorials/05_clustering.html", - "href": "tutorials/05_clustering.html", - "title": "Clustering", - "section": "", - "text": "Coming soon…\n\n1 More Coming Soon…\nWe are in the early stages of development. But it’s obvious the potential for pytimetk now in Python. 🐍\n\nPlease ⭐ us on GitHub (it takes 2-seconds and means a lot).\nTo make requests, please see our Project Roadmap GH Issue #2. You can make requests there.\nWant to contribute? See our contributing guide here." + "objectID": "tutorials/03_demand_forecasting.html#load-inspect-dataset", + "href": "tutorials/03_demand_forecasting.html#load-inspect-dataset", + "title": "Demand Forecasting", + "section": "1.2 Load & Inspect dataset", + "text": "1.2 Load & Inspect dataset\nThe first thing we want to do is to load the dataset. It is a subset of the Walmart sales prediction Kaggle competition. You can get more insights about the dataset by following this link: walmart_sales_weekly. 
The most important thing to know about the dataset is that you are provided with some features, like the fuel price or whether the week contains holidays, and you are expected to predict the weekly sales column for 7 different departments of a given store. Of course, you also have the date for each week, and that is what we can leverage to create additional features.\nLet us start by loading the dataset and cleaning it. Note that we also removed some columns due to * duplication of data * 0 variance * No future data available in current dataset.\n\n\nCode\n# We start by loading the dataset\n# /walmart_sales_weekly.html\ndset = tk.load_dataset('walmart_sales_weekly', parse_dates = ['Date'])\n\ndset = dset.drop(columns=[\n 'id', # This column can be removed as it is equivalent to 'Dept'\n 'Store', # This column has only one possible value\n 'Type', # This column has only one possible value\n 'Size', # This column has only one possible value\n 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5',\n 'IsHoliday', 'Temperature', 'Fuel_Price', 'CPI',\n 'Unemployment'])\n\ndset.head()\n\n\n\n\n\n\n\n\n\nDept\nDate\nWeekly_Sales\n\n\n\n\n0\n1\n2010-02-05\n24924.50\n\n\n1\n1\n2010-02-12\n46039.49\n\n\n2\n1\n2010-02-19\n41595.55\n\n\n3\n1\n2010-02-26\n19403.54\n\n\n4\n1\n2010-03-05\n21827.90\n\n\n\n\n\n\n\nWe can plot the values of each department to get an idea of what the data looks like. 
Using the plot_timeseries method with a groupby allows us to create multiple plots by group.\n\n\n\n\n\n\nGetting More Info: tk.plot_timeseries()\n\n\n\n\n\n\nClick here to see our Data Visualization Guide\nUse help(tk.plot_timeseries) to review additional helpful documentation.\n\n\n\n\n\nPlotlyPlotnine\n\n\n\n\nCode\nsales_df = dset\nfig = sales_df.groupby('Dept').plot_timeseries(\n date_column='Date',\n value_column='Weekly_Sales',\n facet_ncol = 2,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly')\nfig\n\n\n\n \n\n\n\n\n\n\nCode\nfig = sales_df.groupby('Dept').plot_timeseries(\n date_column='Date',\n value_column='Weekly_Sales',\n facet_ncol = 2,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine')\nfig\n\n\n\n\n\n<Figure Size: (700 x 500)>" }, { - "objectID": "tutorials/04_anomaly_detection.html", - "href": "tutorials/04_anomaly_detection.html", - "title": "Anomaly Detection in Website Traffic", - "section": "", - "text": "Anomalize: Breakdown, identify, and clean anomalies in 1 easy step\nAnomalies, often called outliers, are data points that deviate significantly from the general trend or pattern in the data. In the context of time series, they can appear as sudden spikes, drops, or any abrupt change in a sequence of values.\nAnomaly detection for time series is a technique used to identify unusual patterns that do not conform to expected behavior. It is especially relevant for sequential data (like stock prices, sensor data, sales data, etc.) where the temporal aspect is crucial. Anomalies can identify important events or be the cause of noise that can hinder forecasting performance." 
+ "objectID": "tutorials/03_demand_forecasting.html#making-future-dates-easier-with-tk.future_frame", + "href": "tutorials/03_demand_forecasting.html#making-future-dates-easier-with-tk.future_frame", + "title": "Demand Forecasting", + "section": "2.1 Making Future Dates Easier with tk.future_frame", + "text": "2.1 Making Future Dates Easier with tk.future_frame\nWhen building machine learning models, we need to set up our dataframe to hold information about the future. This is the dataframe that will get passed to our model.predict() call. This is made easy with tk.future_frame().\n\n\n\n\n\n\nGetting to know tk.future_frame()\n\n\n\n\n\nCurious about the various options it provides?\n\nClick here to see our Data Wrangling Guide\nUse help(tk.future_frame) to review additional helpful documentation. And explore the plethora of possibilities!\n\n\n\n\nNotice this function adds 5 weeks to our dataset for each department and fills in weekly sales with nulls. Previously our max date was 2012-10-26.\n\n\nCode\nprint(sales_df.groupby('Dept').Date.max())\n\n\nDept\n1 2012-10-26\n3 2012-10-26\n8 2012-10-26\n13 2012-10-26\n38 2012-10-26\n93 2012-10-26\n95 2012-10-26\nName: Date, dtype: datetime64[ns]\n\n\nAfter applying our future frame, we can now see values 5 weeks in the future, and our dataframe has been extended to 2012-11-30 for all groups.\n\n\nCode\nsales_df_with_futureframe = sales_df \\\n .groupby('Dept') \\\n .future_frame(\n date_column = 'Date',\n length_out = 5\n )\n\n\n\n\n\n\n\nCode\nsales_df_with_futureframe.groupby('Dept').Date.max()\n\n\nDept\n1 2012-11-30\n3 2012-11-30\n8 2012-11-30\n13 2012-11-30\n38 2012-11-30\n93 2012-11-30\n95 2012-11-30\nName: Date, dtype: datetime64[ns]" }, { - "objectID": "tutorials/04_anomaly_detection.html#anomalize-breakdown-identify-and-clean-in-1-easy-step", - "href": "tutorials/04_anomaly_detection.html#anomalize-breakdown-identify-and-clean-in-1-easy-step", - "title": "Anomaly Detection in Website Traffic", - "section": "2.1 
Anomalize: breakdown, identify, and clean in 1 easy step", - "text": "2.1 Anomalize: breakdown, identify, and clean in 1 easy step\nThe anomalize() function is a feature rich tool for performing anomaly detection. Anomalize is group-aware, so we can use this as part of a normal pandas groupby chain. In one easy step:\n\nWe breakdown (decompose) the time series\nAnalyze it’s remainder (residuals) for spikes (anomalies)\nClean the anomalies if desired\n\n\n\nCode\nanomalize_df = df \\\n .groupby('Page', sort = False) \\\n .anomalize(\n date_column = \"date\", \n value_column = \"value\", \n )\n\nanomalize_df.glimpse()\n\n\n\n\n\n<class 'pandas.core.frame.DataFrame'>: 5500 rows of 13 columns\nPage: object ['Death_of_Freddie_Gray_en.wikiped ...\ndate: datetime64[ns] [Timestamp('2015-07-01 00:00:00'), ...\nobserved: int64 [791, 704, 903, 732, 558, 504, 543 ...\nseasonal: float64 [206.78723511550484, 4.04332698700 ...\nseasadj: float64 [584.2127648844952, 699.9566730129 ...\ntrend: float64 [729.0301895900458, 726.0497757616 ...\nremainder: float64 [-144.8174247055506, -26.093102748 ...\nanomaly: object ['No', 'No', 'No', 'No', 'No', 'No ...\nanomaly_score: float64 [266.9421236324138, 148.2178016755 ...\nanomaly_direction: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nrecomposed_l1: float64 [266.05095141435606, 60.3266294574 ...\nrecomposed_l2: float64 [1849.8332958504716, 1644.10897389 ...\nobserved_clean: float64 [791.0, 704.0, 903.0, 732.0, 558.0 ...\n\n\n\n\n\n\n\n\nThe anomalize() function returns:\n\n\n\n\n\n\nThe original grouping and datetime columns.\nThe seasonal decomposition: observed, seasonal, seasadj, trend, and remainder. The objective is to remove trend and seasonality such that the remainder is stationary and representative of normal variation and anomalous variations.\nAnomaly identification and scoring: anomaly, anomaly_score, anomaly_direction. 
These identify the anomaly decision (Yes/No), score the anomaly as a distance from the centerline, and label the direction (-1 (down), zero (not anomalous), +1 (up)).\nRecomposition: recomposed_l1 and recomposed_l2. Think of these as the lower and upper bands. Any observed data that is below l1 or above l2 is anomalous.\nCleaned data: observed_clean. Cleaned data is automatically provided, which has the outliers replaced with data that is within the recomposed l1/l2 boundaries. With that said, you should always first seek to understand why data is being considered anomalous before simply removing outliers and using the cleaned data.\n\n\n\n\nThe most important aspect is that this data is ready to be visualized, inspected, and modifications can then be made to address any tweaks you would like to make." + "objectID": "tutorials/03_demand_forecasting.html#date-features-with-tk.augment_timeseries_signature", + "href": "tutorials/03_demand_forecasting.html#date-features-with-tk.augment_timeseries_signature", + "title": "Demand Forecasting", + "section": "2.2 Date Features with tk.augment_timeseries_signature", + "text": "2.2 Date Features with tk.augment_timeseries_signature\nMachine Learning models generally cannot process raw date objects directly. Moreover, they lack an inherent understanding of the passage of time. This means that, without specific features, a model can’t differentiate between a January observation and a June one. To bridge this gap, the tk.augment_timeseries_signature function is invaluable. 
It generates 29 distinct date-oriented features suitable for model inputs.\n\n\n\n\n\n\nGetting More Info: tk.augment_timeseries_signature(),tk.augment_lags(), tk.augment_rolling()\n\n\n\n\n\n\nClick here to see our Adding Features (Augmenting)\nUse help(tk.augment_timeseries_signature) help(tk.augment_lags) help(tk.augment_rolling) to review additional helpful documentation.\n\n\n\n\n\nIt’s crucial, however, to align these features with the granularity of your dataset. Given the weekly granularity of the Walmart dataset, any date attributes finer than ‘week’ should be excluded for relevance and efficiency.\n\n\nCode\nsales_df_dates = sales_df_with_futureframe.augment_timeseries_signature(date_column = 'Date')\nsales_df_dates.head(10)\n\n\n\n\n\n\n\n\n\nDept\nDate\nWeekly_Sales\nDate_index_num\nDate_year\nDate_year_iso\nDate_yearstart\nDate_yearend\nDate_leapyear\nDate_half\n...\nDate_mday\nDate_qday\nDate_yday\nDate_weekend\nDate_hour\nDate_minute\nDate_second\nDate_msecond\nDate_nsecond\nDate_am_pm\n\n\n\n\n0\n1\n2010-02-05\n24924.50\n1265328000\n2010\n2010\n0\n0\n0\n1\n...\n5\n36\n36\n0\n0\n0\n0\n0\n0\nam\n\n\n1\n1\n2010-02-12\n46039.49\n1265932800\n2010\n2010\n0\n0\n0\n1\n...\n12\n43\n43\n0\n0\n0\n0\n0\n0\nam\n\n\n2\n1\n2010-02-19\n41595.55\n1266537600\n2010\n2010\n0\n0\n0\n1\n...\n19\n50\n50\n0\n0\n0\n0\n0\n0\nam\n\n\n3\n1\n2010-02-26\n19403.54\n1267142400\n2010\n2010\n0\n0\n0\n1\n...\n26\n57\n57\n0\n0\n0\n0\n0\n0\nam\n\n\n4\n1\n2010-03-05\n21827.90\n1267747200\n2010\n2010\n0\n0\n0\n1\n...\n5\n64\n64\n0\n0\n0\n0\n0\n0\nam\n\n\n5\n1\n2010-03-12\n21043.39\n1268352000\n2010\n2010\n0\n0\n0\n1\n...\n12\n71\n71\n0\n0\n0\n0\n0\n0\nam\n\n\n6\n1\n2010-03-19\n22136.64\n1268956800\n2010\n2010\n0\n0\n0\n1\n...\n19\n78\n78\n0\n0\n0\n0\n0\n0\nam\n\n\n7\n1\n2010-03-26\n26229.21\n1269561600\n2010\n2010\n0\n0\n0\n1\n...\n26\n85\n85\n0\n0\n0\n0\n0\n0\nam\n\n\n8\n1\n2010-04-02\n57258.43\n1270166400\n2010\n2010\n0\n0\n0\n1\n...\n2\n2\n92\n0\n0\n0\n0\n0\n0\nam\n\n\n9\n1\n2010-04-09
\n42960.91\n1270771200\n2010\n2010\n0\n0\n0\n1\n...\n9\n9\n99\n0\n0\n0\n0\n0\n0\nam\n\n\n\n\n10 rows × 32 columns\n\n\n\nUpon reviewing the generated features, it’s evident that certain attributes don’t align with the granularity of our dataset. For optimal results, features exhibiting no variance—like “Date_hour” due to the weekly nature of our data—should be omitted. We also spot redundant features, such as “Date_Month” and “Date_month_lbl”; both convey month information, albeit in different formats. To enhance clarity and computational efficiency, we’ll refine our dataset to include only the most relevant columns.\nAdditionally, we’ve eliminated certain categorical columns, which, although compatible with models like LightGBM and Catboost, demand extra processing for many tree-based ML models. While 1-hot encoding is a popular method for managing categorical data, it’s not typically recommended for date attributes. Instead, leveraging numeric date features directly, combined with the integration of Fourier features, can effectively capture cyclical patterns.\n\n\nCode\nsales_df_dates.glimpse()\n\n\n<class 'pandas.core.frame.DataFrame'>: 1036 rows of 32 columns\nDept: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\nDate: datetime64[ns] [Timestamp('2010-02-05 00:00:00'), ...\nWeekly_Sales: float64 [24924.5, 46039.49, 41595.55, 1940 ...\nDate_index_num: int64 [1265328000, 1265932800, 126653760 ...\nDate_year: int64 [2010, 2010, 2010, 2010, 2010, 201 ...\nDate_year_iso: UInt32 [2010, 2010, 2010, 2010, 2010, 201 ...\nDate_yearstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_yearend: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_leapyear: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_half: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\nDate_quarter: int64 [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, ...\nDate_quarteryear: object ['2010Q1', '2010Q1', '2010Q1', '20 ...\nDate_quarterstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_quarterend: uint8 [0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, ...\nDate_month: int64 [2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, ...\nDate_month_lbl: object ['February', 'February', 'February ...\nDate_monthstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_monthend: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_yweek: UInt32 [5, 6, 7, 8, 9, 10, 11, 12, 13, 14 ...\nDate_mweek: int64 [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, ...\nDate_wday: int64 [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...\nDate_wday_lbl: object ['Friday', 'Friday', 'Friday', 'Fr ...\nDate_mday: int64 [5, 12, 19, 26, 5, 12, 19, 26, 2, ...\nDate_qday: int64 [36, 43, 50, 57, 64, 71, 78, 85, 2 ...\nDate_yday: int64 [36, 43, 50, 57, 64, 71, 78, 85, 9 ...\nDate_weekend: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_hour: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_minute: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_second: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_msecond: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_nsecond: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_am_pm: object ['am', 'am', 'am', 'am', 'am', 'am ...\n\n\n\n\nCode\nsales_df_dates = sales_df_dates[[\n 'Date'\n ,'Dept'\n , 'Weekly_Sales'\n , 'Date_year'\n , 'Date_month'\n , 'Date_yweek'\n , 'Date_mweek' \n ]]\nsales_df_dates.tail(10)\n\n\n\n\n\n\n\n\n\nDate\nDept\nWeekly_Sales\nDate_year\nDate_month\nDate_yweek\nDate_mweek\n\n\n\n\n1026\n2012-11-02\n93\nNaN\n2012\n11\n44\n1\n\n\n1027\n2012-11-09\n93\nNaN\n2012\n11\n45\n2\n\n\n1028\n2012-11-16\n93\nNaN\n2012\n11\n46\n3\n\n\n1029\n2012-11-23\n93\nNaN\n2012\n11\n47\n4\n\n\n1030\n2012-11-30\n93\nNaN\n2012\n11\n48\n5\n\n\n1031\n2012-11-02\n95\nNaN\n2012\n11\n44\n1\n\n\n1032\n2012-11-09\n95\nNaN\n2012\n11\n45\n2\n\n\n1033\n2012-11-16\n95\nNaN\n2012\n11\n46\n3\n\n\n1034\n2012-11-23\n95\nNaN\n2012\n11\n47\n4\n\n\n1035\n2012-11-30\n95\nNaN\n2012\n11\n48\n5" }, { - "objectID": "tutorials/04_anomaly_detection.html#visualization-1-seasonal-decomposition-plot", - "href": 
"tutorials/04_anomaly_detection.html#visualization-1-seasonal-decomposition-plot", - "title": "Anomaly Detection in Website Traffic", - "section": "2.2 Visualization 1: Seasonal Decomposition Plot", - "text": "2.2 Visualization 1: Seasonal Decomposition Plot\nThe first step in my normal process is to analyze the seasonal decomposition. I want to see what the remainders look like, and make sure that the trend and seasonality are being removed such that the remainder is centered around zero.\n\n\n\n\n\n\nWhat to do when the remainders have trend or seasonality?\n\n\n\n\n\nWe’ll cover how to tweak the nobs of anomalize() in the next section aptly named “How to tweak the nobs on anomalize”.\n\n\n\n\nPlotlyPlotnine\n\n\n\n\nCode\nanomalize_df \\\n .groupby(\"Page\") \\\n .plot_anomalies_decomp(\n date_column = \"date\", \n width = 1800,\n height = 1000,\n engine = 'plotly'\n )\n\n\n\n \n\n\n\n\n\n\nCode\nanomalize_df \\\n .groupby(\"Page\") \\\n .plot_anomalies_decomp(\n date_column = \"date\", \n width = 1800,\n height = 1000,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine'\n )\n\n\n\n\n\n<Figure Size: (1800 x 1000)>" + "objectID": "tutorials/03_demand_forecasting.html#lag-features-with-tk.augment_lags", + "href": "tutorials/03_demand_forecasting.html#lag-features-with-tk.augment_lags", + "title": "Demand Forecasting", + "section": "2.3 Lag Features with tk.augment_lags", + "text": "2.3 Lag Features with tk.augment_lags\nAs previously noted, it’s important to recognize that machine learning models lack inherent awareness of time, a vital consideration in time series modeling. Furthermore, these models operate under the assumption that each row is independent, meaning that the information from last month’s weekly sales is not inherently integrated into the prediction of next month’s sales target. To address this limitation, we incorporate additional features, such as lags, into the models to capture temporal dependencies. 
You can easily achieve this by employing the tk.augment_lags function.\n\n\nCode\ndf_with_lags = sales_df_dates \\\n .groupby('Dept') \\\n .augment_lags(\n date_column = 'Date',\n value_column = 'Weekly_Sales',\n lags = [5,6,7,8,9]\n )\ndf_with_lags.head(5)\n\n\n\n\n\n\n\n\n\nDate\nDept\nWeekly_Sales\nDate_year\nDate_month\nDate_yweek\nDate_mweek\nWeekly_Sales_lag_5\nWeekly_Sales_lag_6\nWeekly_Sales_lag_7\nWeekly_Sales_lag_8\nWeekly_Sales_lag_9\n\n\n\n\n0\n2010-02-05\n1\n24924.50\n2010\n2\n5\n1\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n1\n2010-02-12\n1\n46039.49\n2010\n2\n6\n2\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n2\n2010-02-19\n1\n41595.55\n2010\n2\n7\n3\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n3\n2010-02-26\n1\n19403.54\n2010\n2\n8\n4\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n4\n2010-03-05\n1\n21827.90\n2010\n3\n9\n1\nNaN\nNaN\nNaN\nNaN\nNaN" }, { - "objectID": "tutorials/04_anomaly_detection.html#visualization-2-anomaly-detection-plot", - "href": "tutorials/04_anomaly_detection.html#visualization-2-anomaly-detection-plot", - "title": "Anomaly Detection in Website Traffic", - "section": "2.3 Visualization 2: Anomaly Detection Plot", - "text": "2.3 Visualization 2: Anomaly Detection Plot\nOnce I’m satisfied with the remainders, my next step is to visualize the anomalies. 
Here I’m looking to see if I need to grow or shrink the remainder l1 and l2 bands, which classify anomalies.\n\nPlotlyPlotnine\n\n\n\n\nCode\nanomalize_df \\\n .groupby(\"Page\") \\\n .plot_anomalies(\n date_column = \"date\", \n facet_ncol = 2, \n width = 1000,\n height = 1000,\n )\n\n\n\n \n\n\n\n\n\n\nCode\nanomalize_df \\\n .groupby(\"Page\") \\\n .plot_anomalies(\n date_column = \"date\", \n facet_ncol = 2, \n width = 1000,\n height = 1000,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine'\n )\n\n\n\n\n\n<Figure Size: (1000 x 1000)>" + "objectID": "tutorials/03_demand_forecasting.html#rolling-lag-features-with-tk.augment_rolling", + "href": "tutorials/03_demand_forecasting.html#rolling-lag-features-with-tk.augment_rolling", + "title": "Demand Forecasting", + "section": "2.4 Rolling Lag Features with tk.augment_rolling", + "text": "2.4 Rolling Lag Features with tk.augment_rolling\nAnother pivotal aspect of time series analysis involves the utilization of rolling lags. These operations facilitate computations within a moving time window, enabling the use of functions such as “mean” and “std” on these rolling windows. This can be achieved by invoking the tk.augment_rolling() function on grouped time series data. To execute this, we will initially gather all columns containing ‘lag’ in their names. We then apply this function to the lag values, as opposed to the weekly sales, since we lack future weekly sales data. 
By applying these functions to the lag values, we ensure the prevention of data leakage and maintain the adaptability of our method to unforeseen future data.\n\n\nCode\nlag_columns = [col for col in df_with_lags.columns if 'lag' in col]\n\ndf_with_rolling = df_with_lags \\\n .groupby('Dept') \\\n .augment_rolling(\n date_column = 'Date',\n value_column = lag_columns,\n window = 4,\n window_func = 'mean',\n threads = 1 # Change to -1 to use all available cores\n ) \ndf_with_rolling[df_with_rolling.Dept ==1].head(10)\n\n\n\n\n\n\n\n\n\n\n\n\nDate\nDept\nWeekly_Sales\nDate_year\nDate_month\nDate_yweek\nDate_mweek\nWeekly_Sales_lag_5\nWeekly_Sales_lag_6\nWeekly_Sales_lag_7\nWeekly_Sales_lag_8\nWeekly_Sales_lag_9\nWeekly_Sales_lag_5_rolling_mean_win_4\nWeekly_Sales_lag_6_rolling_mean_win_4\nWeekly_Sales_lag_7_rolling_mean_win_4\nWeekly_Sales_lag_8_rolling_mean_win_4\nWeekly_Sales_lag_9_rolling_mean_win_4\n\n\n\n\n0\n2010-02-05\n1\n24924.50\n2010\n2\n5\n1\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n0\n2010-02-05\n1\n24924.50\n2010\n2\n5\n1\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n0\n2010-02-05\n1\n24924.50\n2010\n2\n5\n1\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n0\n2010-02-05\n1\n24924.50\n2010\n2\n5\n1\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n0\n2010-02-05\n1\n24924.50\n2010\n2\n5\n1\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n1\n2010-02-12\n1\n46039.49\n2010\n2\n6\n2\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n1\n2010-02-12\n1\n46039.49\n2010\n2\n6\n2\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n1\n2010-02-12\n1\n46039.49\n2010\n2\n6\n2\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n1\n2010-02-12\n1\n46039.49\n2010\n2\n6\n2\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n1\n2010-02-12\n1\n46039.49\n2010\n2\n6\n2\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n\n\n\n\n\nNotice when we add lag values to our dataframe, this creates several NA values. 
This is because when using lags, there will be some data that is not available early in our dataset. As a result, NA values are introduced.\nTo simplify and clean up the process, we will remove these rows entirely since we already extracted some meaningful information from them (i.e., lags and rolling lags).\n\n\nCode\nall_lag_columns = [col for col in df_with_rolling.columns if 'lag' in col]\n\ndf_no_nas = df_with_rolling \\\n .dropna(subset=all_lag_columns, inplace=False)\n\ndf_no_nas.head()\n\n\n\n\n\n\n\n\n\nDate\nDept\nWeekly_Sales\nDate_year\nDate_month\nDate_yweek\nDate_mweek\nWeekly_Sales_lag_5\nWeekly_Sales_lag_6\nWeekly_Sales_lag_7\nWeekly_Sales_lag_8\nWeekly_Sales_lag_9\nWeekly_Sales_lag_5_rolling_mean_win_4\nWeekly_Sales_lag_6_rolling_mean_win_4\nWeekly_Sales_lag_7_rolling_mean_win_4\nWeekly_Sales_lag_8_rolling_mean_win_4\nWeekly_Sales_lag_9_rolling_mean_win_4\n\n\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.9\n19403.54\n22809.285\n21102.8675\n25967.595\n32216.62\n32990.77\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.9\n19403.54\n22809.285\n21102.8675\n25967.595\n32216.62\n32990.77\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.9\n19403.54\n22809.285\n21102.8675\n25967.595\n32216.62\n32990.77\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.9\n19403.54\n22809.285\n21102.8675\n25967.595\n32216.62\n32990.77\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.9\n19403.54\n22809.285\n21102.8675\n25967.595\n32216.62\n32990.77\n\n\n\n\n\n\n\nWe can call tk.glimpse() again to quickly see what features we still have available.\n\n\nCode\ndf_no_nas.glimpse()\n\n\n<class 'pandas.core.frame.DataFrame'>: 4760 rows of 17 columns\nDate: datetime64[ns] [Timestamp('20 ...\nDept: int64 [1, 1, 1, 1, 1 ...\nWeekly_Sales: float64 [16555.11, 165 ...\nDate_year: int64 [2010, 2010, 2 
...\nDate_month: int64 [4, 4, 4, 4, 4 ...\nDate_yweek: UInt32 [17, 17, 17, 1 ...\nDate_mweek: int64 [5, 5, 5, 5, 5 ...\nWeekly_Sales_lag_5: float64 [26229.21, 262 ...\nWeekly_Sales_lag_6: float64 [22136.64, 221 ...\nWeekly_Sales_lag_7: float64 [21043.39, 210 ...\nWeekly_Sales_lag_8: float64 [21827.9, 2182 ...\nWeekly_Sales_lag_9: float64 [19403.54, 194 ...\nWeekly_Sales_lag_5_rolling_mean_win_4: float64 [22809.285, 22 ...\nWeekly_Sales_lag_6_rolling_mean_win_4: float64 [21102.8675, 2 ...\nWeekly_Sales_lag_7_rolling_mean_win_4: float64 [25967.595, 25 ...\nWeekly_Sales_lag_8_rolling_mean_win_4: float64 [32216.6200000 ...\nWeekly_Sales_lag_9_rolling_mean_win_4: float64 [32990.7700000 ..." }, { - "objectID": "tutorials/04_anomaly_detection.html#visualization-3-anomalies-cleaned-plot", - "href": "tutorials/04_anomaly_detection.html#visualization-3-anomalies-cleaned-plot", - "title": "Anomaly Detection in Website Traffic", - "section": "2.4 Visualization 3: Anomalies Cleaned Plot", - "text": "2.4 Visualization 3: Anomalies Cleaned Plot\nThere are pros and cons to cleaning anomalies. I’ll leave that discussion for another time. 
But, should you be interested in seeing what your data looks like cleaned (with outliers removed), this plot will help you compare before and after.\n\nPlotlyPlotnine\n\n\n\n\nCode\nanomalize_df \\\n .groupby(\"Page\") \\\n .plot_anomalies_cleaned(\n date_column = \"date\", \n facet_ncol = 2, \n width = 1000,\n height = 1000,\n engine = \"plotly\"\n )\n\n\n\n \n\n\n\n\n\n\nCode\nanomalize_df \\\n .groupby(\"Page\") \\\n .plot_anomalies_cleaned(\n date_column = \"date\", \n facet_ncol = 2, \n width = 1000,\n height = 1000,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine'\n )\n\n\n\n\n\n<Figure Size: (1000 x 1000)>" + "objectID": "tutorials/03_demand_forecasting.html#training-and-future-sets", + "href": "tutorials/03_demand_forecasting.html#training-and-future-sets", + "title": "Demand Forecasting", + "section": "2.5 Training and Future Sets", + "text": "2.5 Training and Future Sets\nNow that we have our training set built, we can start to train our regressor. To do so, let’s first do some model cleanup.\nSplit our data into train and future sets.\n\n\nCode\nfuture = df_no_nas[df_no_nas.Weekly_Sales.isnull()]\ntrain = df_no_nas[df_no_nas.Weekly_Sales.notnull()]" }, { - "objectID": "guides/04_wrangling.html", - "href": "guides/04_wrangling.html", - "title": "Data Wrangling", - "section": "", - "text": "This section will cover data wrangling for timeseries using pytimetk. We’ll show examples for the following functions:" + "objectID": "tutorials/03_demand_forecasting.html#model-with-regressor", + "href": "tutorials/03_demand_forecasting.html#model-with-regressor", + "title": "Demand Forecasting", + "section": "2.6 Model with regressor", + "text": "2.6 Model with regressor\nWe still have a datetime column in our training data. We will need to remove it before passing the data to our regressor. 
Let’s subset our column to just the features we want to use for modeling.\n\n\nCode\ntrain_columns = [ \n 'Dept'\n , 'Date_year'\n , 'Date_month'\n , 'Date_yweek'\n , 'Date_mweek'\n , 'Weekly_Sales_lag_5'\n , 'Weekly_Sales_lag_6'\n , 'Weekly_Sales_lag_7'\n , 'Weekly_Sales_lag_8'\n , 'Weekly_Sales_lag_5_rolling_mean_win_4'\n , 'Weekly_Sales_lag_6_rolling_mean_win_4'\n , 'Weekly_Sales_lag_7_rolling_mean_win_4'\n , 'Weekly_Sales_lag_8_rolling_mean_win_4'\n ]\n\nX = train[train_columns]\ny = train[['Weekly_Sales']]\n\nmodel = RandomForestRegressor(random_state=123)\nmodel = model.fit(X, y)\n\n\nNow that we have a trained model, we can pass in our future frame to predict weekly sales.\n\n\nCode\npredicted_values = model.predict(future[train_columns])\nfuture['y_pred'] = predicted_values\n\nfuture.head(10)\n\n\n\n\n\n\n\n\n\nDate\nDept\nWeekly_Sales\nDate_year\nDate_month\nDate_yweek\nDate_mweek\nWeekly_Sales_lag_5\nWeekly_Sales_lag_6\nWeekly_Sales_lag_7\nWeekly_Sales_lag_8\nWeekly_Sales_lag_9\nWeekly_Sales_lag_5_rolling_mean_win_4\nWeekly_Sales_lag_6_rolling_mean_win_4\nWeekly_Sales_lag_7_rolling_mean_win_4\nWeekly_Sales_lag_8_rolling_mean_win_4\nWeekly_Sales_lag_9_rolling_mean_win_4\ny_pred\n\n\n\n\n1001\n2012-11-02\n1\nNaN\n2012\n11\n44\n1\n18947.81\n19251.50\n19616.22\n18322.37\n16680.24\n19034.475\n18467.5825\n17726.3075\n17154.9275\n16604.3150\n26627.7378\n\n\n1001\n2012-11-02\n1\nNaN\n2012\n11\n44\n1\n18947.81\n19251.50\n19616.22\n18322.37\n16680.24\n19034.475\n18467.5825\n17726.3075\n17154.9275\n16604.3150\n26627.7378\n\n\n1001\n2012-11-02\n1\nNaN\n2012\n11\n44\n1\n18947.81\n19251.50\n19616.22\n18322.37\n16680.24\n19034.475\n18467.5825\n17726.3075\n17154.9275\n16604.3150\n26627.7378\n\n\n1001\n2012-11-02\n1\nNaN\n2012\n11\n44\n1\n18947.81\n19251.50\n19616.22\n18322.37\n16680.24\n19034.475\n18467.5825\n17726.3075\n17154.9275\n16604.3150\n26627.7378\n\n\n1001\n2012-11-02\n1\nNaN\n2012\n11\n44\n1\n18947.81\n19251.50\n19616.22\n18322.37\n16680.24\n19034.475\n18467.582
5\n17726.3075\n17154.9275\n16604.3150\n26627.7378\n\n\n1002\n2012-11-09\n1\nNaN\n2012\n11\n45\n2\n21904.47\n18947.81\n19251.50\n19616.22\n18322.37\n19930.000\n19034.4750\n18467.5825\n17726.3075\n17154.9275\n20959.0553\n\n\n1002\n2012-11-09\n1\nNaN\n2012\n11\n45\n2\n21904.47\n18947.81\n19251.50\n19616.22\n18322.37\n19930.000\n19034.4750\n18467.5825\n17726.3075\n17154.9275\n20959.0553\n\n\n1002\n2012-11-09\n1\nNaN\n2012\n11\n45\n2\n21904.47\n18947.81\n19251.50\n19616.22\n18322.37\n19930.000\n19034.4750\n18467.5825\n17726.3075\n17154.9275\n20959.0553\n\n\n1002\n2012-11-09\n1\nNaN\n2012\n11\n45\n2\n21904.47\n18947.81\n19251.50\n19616.22\n18322.37\n19930.000\n19034.4750\n18467.5825\n17726.3075\n17154.9275\n20959.0553\n\n\n1002\n2012-11-09\n1\nNaN\n2012\n11\n45\n2\n21904.47\n18947.81\n19251.50\n19616.22\n18322.37\n19930.000\n19034.4750\n18467.5825\n17726.3075\n17154.9275\n20959.0553\n\n\n\n\n\n\n\nLet’s create a label to split up our actuals from our prediction dataset before recombining.\n\n\nCode\ntrain['type'] = 'actuals'\nfuture['type'] = 'prediction'\n\nfull_df = pd.concat([train, 
future])\n\nfull_df.head(10)\n\n\n\n\n\n\n\n\n\nDate\nDept\nWeekly_Sales\nDate_year\nDate_month\nDate_yweek\nDate_mweek\nWeekly_Sales_lag_5\nWeekly_Sales_lag_6\nWeekly_Sales_lag_7\nWeekly_Sales_lag_8\nWeekly_Sales_lag_9\nWeekly_Sales_lag_5_rolling_mean_win_4\nWeekly_Sales_lag_6_rolling_mean_win_4\nWeekly_Sales_lag_7_rolling_mean_win_4\nWeekly_Sales_lag_8_rolling_mean_win_4\nWeekly_Sales_lag_9_rolling_mean_win_4\ntype\ny_pred\n\n\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.90\n19403.54\n22809.2850\n21102.8675\n25967.5950\n32216.620\n32990.77\nactuals\nNaN\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.90\n19403.54\n22809.2850\n21102.8675\n25967.5950\n32216.620\n32990.77\nactuals\nNaN\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.90\n19403.54\n22809.2850\n21102.8675\n25967.5950\n32216.620\n32990.77\nactuals\nNaN\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.90\n19403.54\n22809.2850\n21102.8675\n25967.5950\n32216.620\n32990.77\nactuals\nNaN\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.90\n19403.54\n22809.2850\n21102.8675\n25967.5950\n32216.620\n32990.77\nactuals\nNaN\n\n\n13\n2010-05-07\n1\n17413.94\n2010\n5\n18\n1\n57258.43\n26229.21\n22136.64\n21043.39\n21827.90\n31666.9175\n22809.2850\n21102.8675\n25967.595\n32216.62\nactuals\nNaN\n\n\n13\n2010-05-07\n1\n17413.94\n2010\n5\n18\n1\n57258.43\n26229.21\n22136.64\n21043.39\n21827.90\n31666.9175\n22809.2850\n21102.8675\n25967.595\n32216.62\nactuals\nNaN\n\n\n13\n2010-05-07\n1\n17413.94\n2010\n5\n18\n1\n57258.43\n26229.21\n22136.64\n21043.39\n21827.90\n31666.9175\n22809.2850\n21102.8675\n25967.595\n32216.62\nactuals\nNaN\n\n\n13\n2010-05-07\n1\n17413.94\n2010\n5\n18\n1\n57258.43\n26229.21\n22136.64\n21043.39\n21827.90\n31666.9175\n22809.2850\n21102.8675\n25967.595\n32216.62\nactuals\nNaN\n\n\n13\n2010-05-07\n1\n17413.94\n201
0\n5\n18\n1\n57258.43\n26229.21\n22136.64\n21043.39\n21827.90\n31666.9175\n22809.2850\n21102.8675\n25967.595\n32216.62\nactuals\nNaN" }, { - "objectID": "guides/04_wrangling.html#basic-example", - "href": "guides/04_wrangling.html#basic-example", - "title": "Data Wrangling", - "section": "1.1 Basic Example", - "text": "1.1 Basic Example\nThe m4_daily dataset has a daily frequency. Say we are interested in forecasting at the weekly level. We can use summarize_by_time() to aggregate to a weekly level.\n\n\nCode\n# summarize by time: daily to weekly\nsummarized_df = m4_daily_df \\\n .summarize_by_time(\n date_column = 'date',\n value_column = 'value',\n freq = 'W',\n agg_func = 'sum'\n )\n\nprint(summarized_df.head())\nprint('\\nLength of the full dataset:', len(summarized_df))\n\n\n date value\n0 1978-06-25 27328.12\n1 1978-07-02 63621.88\n2 1978-07-09 63334.38\n3 1978-07-16 63737.51\n4 1978-07-23 64718.76\n\nLength of the full dataset: 1977\n\n\nThe data has now been aggregated at the weekly level. Notice we now have 1977 rows, compared to the full dataset, which had 9743 rows." 
+ "objectID": "tutorials/03_demand_forecasting.html#pre-visualization-clean-up", + "href": "tutorials/03_demand_forecasting.html#pre-visualization-clean-up", + "title": "Demand Forecasting", + "section": "2.7 Pre-Visualization Clean-up", + "text": "2.7 Pre-Visualization Clean-up\n\n\nCode\nfull_df['Weekly_Sales'] = np.where(full_df.type =='actuals', full_df.Weekly_Sales, full_df.y_pred)" }, { - "objectID": "guides/04_wrangling.html#additional-aggregate-functions", - "href": "guides/04_wrangling.html#additional-aggregate-functions", - "title": "Data Wrangling", - "section": "1.2 Additional Aggregate Functions", - "text": "1.2 Additional Aggregate Functions\nsummarize_by_time() can take additional aggregate functions in the agg_func argument.\n\n\nCode\n# summarize by time with additional aggregate functions\nsummarized_multiple_agg_df = m4_daily_df \\\n .summarize_by_time(\n date_column = 'date',\n value_column = 'value',\n freq = 'W',\n agg_func = ['sum', 'min', 'max']\n )\n\nsummarized_multiple_agg_df.head()\n\n\n\n\n\n\n\n\n\ndate\nvalue_sum\nvalue_min\nvalue_max\n\n\n\n\n0\n1978-06-25\n27328.12\n9103.12\n9115.62\n\n\n1\n1978-07-02\n63621.88\n9046.88\n9115.62\n\n\n2\n1978-07-09\n63334.38\n9028.12\n9096.88\n\n\n3\n1978-07-16\n63737.51\n9075.00\n9146.88\n\n\n4\n1978-07-23\n64718.76\n9171.88\n9315.62" + "objectID": "tutorials/03_demand_forecasting.html#plot-predictions", + "href": "tutorials/03_demand_forecasting.html#plot-predictions", + "title": "Demand Forecasting", + "section": "2.8 Plot Predictions", + "text": "2.8 Plot Predictions\n\nPlotlyPlotnine\n\n\n\n\nCode\nfull_df \\\n .groupby('Dept') \\\n .plot_timeseries(\n date_column = 'Date',\n value_column = 'Weekly_Sales',\n color_column = 'type',\n smooth = False,\n smooth_alpha = 0,\n facet_ncol = 2,\n facet_scales = \"free\",\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 800,\n height = 600,\n engine = 'plotly'\n )\n\n\n\n \n\n\n\n\n\n\nCode\nfull_df \\\n .groupby('Dept') \\\n 
.plot_timeseries(\n date_column = 'Date',\n value_column = 'Weekly_Sales',\n color_column = 'type',\n smooth = False,\n smooth_alpha = 0,\n facet_ncol = 2,\n facet_scales = \"free\",\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 800,\n height = 600,\n engine = 'plotnine'\n )\n\n\n\n\n\n<Figure Size: (800 x 600)>\n\n\n\n\n\nOur weekly sales forecasts exhibit a noticeable alignment with historical trends, indicating that our models are effectively capturing essential data signals. It’s worth noting that with some additional feature engineering, we have the potential to further enhance the model’s performance.\nHere are some additional techniques that can be explored to elevate its performance:\n\nExperiment with the incorporation of various lags using the versatile tk.augment_lags() function.\nEnhance the model’s capabilities by introducing additional rolling calculations through tk.augment_rolling().\nConsider incorporating cyclic features by utilizing tk.augment_fourier().\nTry different models and build a robust cross-validation strategy for model selection.\n\nThese strategies hold promise for refining the model’s accuracy and predictive power" }, { - "objectID": "guides/04_wrangling.html#summarize-by-time-with-grouped-time-series", - "href": "guides/04_wrangling.html#summarize-by-time-with-grouped-time-series", - "title": "Data Wrangling", - "section": "1.3 Summarize by Time with Grouped Time Series", - "text": "1.3 Summarize by Time with Grouped Time Series\nsummarize_by_time() also works with groups.\n\n\nCode\n# summarize by time with groups and additional aggregate functions\ngrouped_summarized_df = (\n m4_daily_df\n .groupby('id')\n .summarize_by_time(\n date_column = 'date',\n value_column = 'value',\n freq = 'W',\n agg_func = [\n 'sum',\n 'min',\n ('q25', lambda x: np.quantile(x, 0.25)),\n 'median',\n ('q75', lambda x: np.quantile(x, 0.75)),\n 'max'\n ],\n 
)\n)\n\ngrouped_summarized_df.head()\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue_sum\nvalue_min\nvalue_q25\nvalue_median\nvalue_q75\nvalue_max\n\n\n\n\n0\nD10\n2014-07-06\n8247.2\n2048.7\n2048.85\n2061.15\n2074.10\n2076.2\n\n\n1\nD10\n2014-07-13\n14040.8\n1978.8\n2003.95\n2007.40\n2013.80\n2019.1\n\n\n2\nD10\n2014-07-20\n13867.6\n1943.0\n1955.30\n1988.30\n2005.60\n2014.5\n\n\n3\nD10\n2014-07-27\n13266.3\n1876.0\n1887.15\n1891.00\n1895.85\n1933.3\n\n\n4\nD10\n2014-08-03\n13471.2\n1886.2\n1914.60\n1920.00\n1939.55\n1956.7" + "objectID": "tutorials/06_correlationfunnel.html", + "href": "tutorials/06_correlationfunnel.html", + "title": "Correlation Funnel", + "section": "", + "text": "We will demonstrate how to use Correlation Funnel to analyze Expedia Hotel Bookings and identify which features correlate with a customer making a booking through their website:\n\n\n\nCorrelation Funnel" }, { - "objectID": "guides/04_wrangling.html#basic-example-1", - "href": "guides/04_wrangling.html#basic-example-1", - "title": "Data Wrangling", - "section": "2.1 Basic Example", - "text": "2.1 Basic Example\nWe’ll continue with our use of the m4_daily_df dataset. Recall we’ve already aggregated at the weekly level (summarized_df). Let’s check out the last week in the summarized_df:\n\n\nCode\n# last week in dataset\nsummarized_df \\\n .sort_values(by = 'date', ascending = True) \\\n .iloc[: -1] \\\n .tail(1)\n\n\n\n\n\n\n\n\n\ndate\nvalue\n\n\n\n\n1975\n2016-05-01\n17959.8\n\n\n\n\n\n\n\n\n\n\n\n\n\niloc()\n\n\n\n\n\niloc[: -1] is used to filter out the last row and keep only dates that are the start of the week.\n\n\n\nWe can see that the last week is the week of 2016-05-01. Now say we wanted to forecast the next 8 weeks. 
We can extend the dataset beyond the week of 2016-05-01:\n\n\nCode\n# extend dataset by 8 weeks\nsummarized_extended_df = summarized_df \\\n .future_frame(\n date_column = 'date',\n length_out = 8\n )\n\nsummarized_extended_df\n\n\n\n\n\n\n\n\n\ndate\nvalue\n\n\n\n\n0\n1978-06-25\n27328.12\n\n\n1\n1978-07-02\n63621.88\n\n\n2\n1978-07-09\n63334.38\n\n\n3\n1978-07-16\n63737.51\n\n\n4\n1978-07-23\n64718.76\n\n\n...\n...\n...\n\n\n1980\n2016-06-05\nNaN\n\n\n1981\n2016-06-12\nNaN\n\n\n1982\n2016-06-19\nNaN\n\n\n1983\n2016-06-26\nNaN\n\n\n1984\n2016-07-03\nNaN\n\n\n\n\n1985 rows × 2 columns\n\n\n\nTo get only the future data, we can filter the dataset for rows where value is missing (np.nan).\n\n\nCode\n# get only future data\nsummarized_extended_df \\\n .query('value.isna()')\n\n\n\n\n\n\n\n\n\ndate\nvalue\n\n\n\n\n1977\n2016-05-15\nNaN\n\n\n1978\n2016-05-22\nNaN\n\n\n1979\n2016-05-29\nNaN\n\n\n1980\n2016-06-05\nNaN\n\n\n1981\n2016-06-12\nNaN\n\n\n1982\n2016-06-19\nNaN\n\n\n1983\n2016-06-26\nNaN\n\n\n1984\n2016-07-03\nNaN" + "objectID": "tutorials/06_correlationfunnel.html#setup", + "href": "tutorials/06_correlationfunnel.html#setup", + "title": "Correlation Funnel", + "section": "3.1 Setup", + "text": "3.1 Setup\nTo set up, import the following packages and load the expedia_df dataset (the Expedia Hotel Time Series Dataset).\n\n# Libraries\nimport pandas as pd \nimport pytimetk as tk\n\n# Data\nexpedia_df = tk.load_dataset(\"expedia\", parse_dates = ['date_time'])\nexpedia_df.glimpse()\n\n<class 'pandas.core.frame.DataFrame'>: 100000 rows of 24 columns\ndate_time: datetime64[ns] [Timestamp('2013-07-25 17: ...\nsite_name: int64 [2, 2, 2, 2, 2, 37, 2, 2, ...\nposa_continent: int64 [3, 3, 3, 3, 3, 1, 3, 3, 3 ...\nuser_location_country: int64 [66, 66, 66, 66, 66, 69, 6 ...\nuser_location_region: int64 [174, 174, 174, 220, 351, ...\nuser_location_city: int64 [35675, 31320, 16292, 1760 ...\norig_destination_distance: float64 [0.1203, 108.2251, 763.142 ...\nuser_id: int64 [44735, 794319, 
761732, 69 ...\nis_mobile: int64 [0, 0, 1, 0, 0, 0, 0, 0, 0 ...\nis_package: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0 ...\nchannel: int64 [9, 3, 1, 9, 1, 9, 9, 9, 9 ...\nsrch_ci: object ['2013-07-26', '2014-11-27 ...\nsrch_co: object ['2013-07-27', '2014-11-29 ...\nsrch_adults_cnt: int64 [1, 2, 2, 2, 2, 2, 2, 2, 2 ...\nsrch_children_cnt: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0 ...\nsrch_rm_cnt: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1 ...\nsrch_destination_id: int64 [5465, 11620, 23808, 40658 ...\nsrch_destination_type_id: int64 [3, 1, 6, 5, 1, 6, 1, 5, 6 ...\nis_booking: int64 [1, 0, 0, 0, 0, 0, 0, 0, 0 ...\ncnt: int64 [1, 2, 3, 1, 2, 7, 1, 1, 1 ...\nhotel_continent: int64 [2, 2, 2, 2, 2, 6, 4, 2, 4 ...\nhotel_country: int64 [50, 50, 50, 50, 50, 204, ...\nhotel_market: int64 [1230, 369, 1144, 930, 637 ...\nhotel_cluster: int64 [47, 83, 93, 48, 33, 15, 9 ..." }, { - "objectID": "guides/04_wrangling.html#future-frame-with-grouped-time-series", - "href": "guides/04_wrangling.html#future-frame-with-grouped-time-series", - "title": "Data Wrangling", - "section": "2.2 Future Frame with Grouped Time Series", - "text": "2.2 Future Frame with Grouped Time Series\nfuture_frame() also works for grouped time series. 
We can see an example using our grouped summarized dataset (grouped_summarized_df) from earlier:\n\n\nCode\n# future frame with grouped time series\ngrouped_summarized_df[['id', 'date', 'value_sum']] \\\n .groupby('id') \\\n .future_frame(\n date_column = 'date',\n length_out = 8\n ) \\\n .query('value_sum.isna()') # filtering to return only the future data\n\n\n\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue_sum\n\n\n\n\n1395\nD10\n2016-05-15\nNaN\n\n\n1396\nD10\n2016-05-22\nNaN\n\n\n1397\nD10\n2016-05-29\nNaN\n\n\n1398\nD10\n2016-06-05\nNaN\n\n\n1399\nD10\n2016-06-12\nNaN\n\n\n1400\nD10\n2016-06-19\nNaN\n\n\n1401\nD10\n2016-06-26\nNaN\n\n\n1402\nD10\n2016-07-03\nNaN\n\n\n1403\nD160\n2011-07-10\nNaN\n\n\n1404\nD160\n2011-07-17\nNaN\n\n\n1405\nD160\n2011-07-24\nNaN\n\n\n1406\nD160\n2011-07-31\nNaN\n\n\n1407\nD160\n2011-08-07\nNaN\n\n\n1408\nD160\n2011-08-14\nNaN\n\n\n1409\nD160\n2011-08-21\nNaN\n\n\n1410\nD160\n2011-08-28\nNaN\n\n\n1411\nD410\n1980-05-11\nNaN\n\n\n1412\nD410\n1980-05-18\nNaN\n\n\n1413\nD410\n1980-05-25\nNaN\n\n\n1414\nD410\n1980-06-01\nNaN\n\n\n1415\nD410\n1980-06-08\nNaN\n\n\n1416\nD410\n1980-06-15\nNaN\n\n\n1417\nD410\n1980-06-22\nNaN\n\n\n1418\nD410\n1980-06-29\nNaN\n\n\n1419\nD500\n2012-09-30\nNaN\n\n\n1420\nD500\n2012-10-07\nNaN\n\n\n1421\nD500\n2012-10-14\nNaN\n\n\n1422\nD500\n2012-10-21\nNaN\n\n\n1423\nD500\n2012-10-28\nNaN\n\n\n1424\nD500\n2012-11-04\nNaN\n\n\n1425\nD500\n2012-11-11\nNaN\n\n\n1426\nD500\n2012-11-18\nNaN" + "objectID": "tutorials/06_correlationfunnel.html#data-preparation", + "href": "tutorials/06_correlationfunnel.html#data-preparation", + "title": "Correlation Funnel", + "section": "3.2 Data Preparation", + "text": "3.2 Data Preparation\nTo prepare the dataset, we will apply the following steps:\n\nAdd time series features based on the date_time timestamp column.\nDrop any zero-variance features.\nDrop additional columns that are not an acceptable data type (i.e. 
not numeric, categorical, or string) or contain missing values\nConvert numeric columns that start with “hotel_” that are actually categorical “ID” columns to string\n\n\nexpedia_ts_features_df = expedia_df \\\n .augment_timeseries_signature('date_time') \\\n .drop_zero_variance() \\\n .drop(columns=['date_time', 'orig_destination_distance', 'srch_ci', 'srch_co']) \\\n .transform_columns(\n columns = [r\"hotel_.*\"],\n transform_func = lambda x: x.astype(str)\n )\n \nexpedia_ts_features_df.glimpse()\n\n<class 'pandas.core.frame.DataFrame'>: 100000 rows of 46 columns\nsite_name: int64 [2, 2, 2, 2, 2, 37, 2, 2, 2 ...\nposa_continent: int64 [3, 3, 3, 3, 3, 1, 3, 3, 3, ...\nuser_location_country: int64 [66, 66, 66, 66, 66, 69, 66 ...\nuser_location_region: int64 [174, 174, 174, 220, 351, 7 ...\nuser_location_city: int64 [35675, 31320, 16292, 17605 ...\nuser_id: int64 [44735, 794319, 761732, 696 ...\nis_mobile: int64 [0, 0, 1, 0, 0, 0, 0, 0, 0, ...\nis_package: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nchannel: int64 [9, 3, 1, 9, 1, 9, 9, 9, 9, ...\nsrch_adults_cnt: int64 [1, 2, 2, 2, 2, 2, 2, 2, 2, ...\nsrch_children_cnt: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nsrch_rm_cnt: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1, ...\nsrch_destination_id: int64 [5465, 11620, 23808, 40658, ...\nsrch_destination_type_id: int64 [3, 1, 6, 5, 1, 6, 1, 5, 6, ...\nis_booking: int64 [1, 0, 0, 0, 0, 0, 0, 0, 0, ...\ncnt: int64 [1, 2, 3, 1, 2, 7, 1, 1, 1, ...\nhotel_continent: object ['2', '2', '2', '2', '2', ' ...\nhotel_country: object ['50', '50', '50', '50', '5 ...\nhotel_market: object ['1230', '369', '1144', '93 ...\nhotel_cluster: object ['47', '83', '93', '48', '3 ...\ndate_time_index_num: int64 [1374773055, 1414939784, 14 ...\ndate_time_year: int64 [2013, 2014, 2014, 2014, 20 ...\ndate_time_year_iso: UInt32 [2013, 2014, 2014, 2014, 20 ...\ndate_time_yearstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\ndate_time_yearend: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\ndate_time_half: int64 [2, 2, 1, 1, 2, 2, 
1, 2, 1, ...\ndate_time_quarter: int64 [3, 4, 2, 1, 3, 4, 1, 3, 2, ...\ndate_time_quarteryear: object ['2013Q3', '2014Q4', '2014Q ...\ndate_time_quarterstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\ndate_time_quarterend: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\ndate_time_month: int64 [7, 11, 5, 2, 8, 12, 3, 9, ...\ndate_time_month_lbl: object ['July', 'November', 'May', ...\ndate_time_monthstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\ndate_time_monthend: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\ndate_time_yweek: UInt32 [30, 44, 21, 9, 33, 50, 12, ...\ndate_time_mweek: int64 [4, 1, 4, 4, 2, 2, 3, 3, 2, ...\ndate_time_wday: int64 [4, 7, 4, 3, 3, 2, 2, 1, 4, ...\ndate_time_wday_lbl: object ['Thursday', 'Sunday', 'Thu ...\ndate_time_mday: int64 [25, 2, 22, 26, 13, 9, 18, ...\ndate_time_qday: int64 [25, 33, 52, 57, 44, 70, 77 ...\ndate_time_yday: int64 [206, 306, 142, 57, 225, 34 ...\ndate_time_weekend: int64 [0, 1, 0, 0, 0, 0, 0, 0, 0, ...\ndate_time_hour: int64 [17, 14, 12, 14, 11, 7, 21, ...\ndate_time_minute: int64 [24, 49, 50, 1, 15, 21, 40, ...\ndate_time_second: int64 [15, 44, 53, 2, 40, 31, 29, ...\ndate_time_am_pm: object ['pm', 'pm', 'am', 'pm', 'a ..." }, { - "objectID": "guides/04_wrangling.html#basic-example-2", - "href": "guides/04_wrangling.html#basic-example-2", - "title": "Data Wrangling", - "section": "3.1 Basic Example", - "text": "3.1 Basic Example\nLet’s start with a basic example to see how pad_by_time() works. 
We’ll create some sample data with missing timestamps:\n\n\nCode\n# libraries\nimport pytimetk as tk\nimport pandas as pd\nimport numpy as np\n\n# sample quarterly data with missing timestamp for Q3\ndates = pd.to_datetime([\"2021-01-01\", \"2021-04-01\", \"2021-10-01\"])\nvalue = range(len(dates))\n\ndf = pd.DataFrame({\n 'date': dates,\n 'value': range(len(dates))\n})\n\ndf\n\n\n\n\n\n\n\n\n\ndate\nvalue\n\n\n\n\n0\n2021-01-01\n0\n\n\n1\n2021-04-01\n1\n\n\n2\n2021-10-01\n2\n\n\n\n\n\n\n\nNow we can use pad_by_time() to fill in the missing timestamp:\n\n\nCode\n# pad by time\ndf \\\n .pad_by_time(\n date_column = 'date',\n freq = 'QS' # specifying quarter start frequency\n )\n\n\n\n\n\n\n\n\n\ndate\nvalue\n\n\n\n\n0\n2021-01-01\n0.0\n\n\n1\n2021-04-01\n1.0\n\n\n2\n2021-07-01\nNaN\n\n\n3\n2021-10-01\n2.0\n\n\n\n\n\n\n\nWe can also specify shorter time frequency:\n\n\nCode\n# pad by time with shorter frequency\ndf \\\n .pad_by_time(\n date_column = 'date',\n freq = 'MS' # specifying month start frequency\n ) \\\n .assign(value = lambda x: x['value'].fillna(0)) # replace NaN with 0\n\n\n\n\n\n\n\n\n\ndate\nvalue\n\n\n\n\n0\n2021-01-01\n0.0\n\n\n1\n2021-02-01\n0.0\n\n\n2\n2021-03-01\n0.0\n\n\n3\n2021-04-01\n1.0\n\n\n4\n2021-05-01\n0.0\n\n\n5\n2021-06-01\n0.0\n\n\n6\n2021-07-01\n0.0\n\n\n7\n2021-08-01\n0.0\n\n\n8\n2021-09-01\n0.0\n\n\n9\n2021-10-01\n2.0" + "objectID": "tutorials/06_correlationfunnel.html#step-correlation-funnel-workflow", + "href": "tutorials/06_correlationfunnel.html#step-correlation-funnel-workflow", + "title": "Correlation Funnel", + "section": "3.3 3-Step Correlation Funnel Workflow", + "text": "3.3 3-Step Correlation Funnel Workflow\nNext, we will perform the Correlation Funnel workflow to explore the Expedia Hotel Time Series dataset. 
There are 3 steps:\n\nBinarize: Convert the data to binary 0/1\nCorrelate: Detect relationships between the binary features and one of the columns (called the target)\nVisualize the Correlation Funnel: Plotting allows us to assess the top features and their relationship to the target.\n\n\nStep 1: Binarize\nUse binarize() to convert the raw data to binary 0/1. Binarization happens as follows:\n\nNumeric Data: Numeric data is Quantile Binned using the pd.qcut() function. The default is 4 bins, which bins numeric data into a maximum of 4 discrete bins. Fewer bins can be returned if there is insufficient data for 4 bins. The number of bins is controlled with the n_bins parameter.\nCategorical / String Data: Categorical data is first processed to determine the most frequent categories. Categories that are sparse are lumped into an “OTHER” category. The lumping can be controlled with the thresh_infreq.\n\n\nexpedia_ts_binarized_df = expedia_ts_features_df.binarize(thresh_infreq = 0.05)\n\nexpedia_ts_binarized_df.glimpse()\n\n<class 'pandas.core.frame.DataFrame'>: 100000 rows of 155 columns\nsite_name__2.0_15.0: uint8 [1, 1 ...\nsite_name__15.0_53.0: uint8 [0, 0 ...\nuser_location_country__0.0_66.0: uint8 [1, 1 ...\nuser_location_country__66.0_71.0: uint8 [0, 0 ...\nuser_location_country__71.0_239.0: uint8 [0, 0 ...\nuser_location_region__0.0_174.0: uint8 [1, 1 ...\nuser_location_region__174.0_314.0: uint8 [0, 0 ...\nuser_location_region__314.0_385.0: uint8 [0, 0 ...\nuser_location_region__385.0_1021.0: uint8 [0, 0 ...\nuser_location_city__0.0_13087.0: uint8 [0, 0 ...\nuser_location_city__13087.0_27655.0: uint8 [0, 0 ...\nuser_location_city__27655.0_42563.0: uint8 [1, 1 ...\nuser_location_city__42563.0_56507.0: uint8 [0, 0 ...\nuser_id__13.0_299759.8: uint8 [1, 0 ...\nuser_id__299759.8_605161.5: uint8 [0, 0 ...\nuser_id__605161.5_911811.5: uint8 [0, 1 ...\nuser_id__911811.5_1198780.0: uint8 [0, 0 ...\nchannel__0.0_2.0: uint8 [0, 0 ...\nchannel__2.0_9.0: uint8 [1, 1 
...\nchannel__9.0_10.0: uint8 [0, 0 ...\nsrch_adults_cnt__0.0_2.0: uint8 [1, 1 ...\nsrch_adults_cnt__2.0_9.0: uint8 [0, 0 ...\nsrch_children_cnt__0.0_9.0: uint8 [1, 1 ...\nsrch_rm_cnt__0.0_1.0: uint8 [1, 1 ...\nsrch_rm_cnt__1.0_8.0: uint8 [0, 0 ...\nsrch_destination_id__1.0_8267.0: uint8 [1, 0 ...\nsrch_destination_id__8267.0_9147.0: uint8 [0, 0 ...\nsrch_destination_id__9147.0_18998.0: uint8 [0, 1 ...\nsrch_destination_id__18998.0_65104.0: uint8 [0, 0 ...\nsrch_destination_type_id__1.0_5.0: uint8 [1, 1 ...\nsrch_destination_type_id__5.0_9.0: uint8 [0, 0 ...\ncnt__1.0_2.0: uint8 [1, 1 ...\ncnt__2.0_72.0: uint8 [0, 0 ...\ndate_time_index_num__1357516842.0_1382867237.5: uint8 [1, 0 ...\ndate_time_index_num__1382867237.5_1401387689.0: uint8 [0, 0 ...\ndate_time_index_num__1401387689.0_1410981206.0: uint8 [0, 0 ...\ndate_time_index_num__1410981206.0_1420070302.0: uint8 [0, 1 ...\ndate_time_month__1.0_5.0: uint8 [0, 0 ...\ndate_time_month__5.0_7.0: uint8 [1, 0 ...\ndate_time_month__7.0_10.0: uint8 [0, 0 ...\ndate_time_month__10.0_12.0: uint8 [0, 1 ...\ndate_time_yweek__1.0_17.0: uint8 [0, 0 ...\ndate_time_yweek__17.0_30.0: uint8 [1, 0 ...\ndate_time_yweek__30.0_41.0: uint8 [0, 0 ...\ndate_time_yweek__41.0_52.0: uint8 [0, 1 ...\ndate_time_mday__1.0_8.0: uint8 [0, 1 ...\ndate_time_mday__8.0_16.0: uint8 [0, 0 ...\ndate_time_mday__16.0_23.0: uint8 [0, 0 ...\ndate_time_mday__23.0_31.0: uint8 [1, 0 ...\ndate_time_qday__1.0_24.0: uint8 [0, 0 ...\ndate_time_qday__24.0_48.0: uint8 [1, 1 ...\ndate_time_qday__48.0_70.0: uint8 [0, 0 ...\ndate_time_qday__70.0_92.0: uint8 [0, 0 ...\ndate_time_yday__1.0_121.0: uint8 [0, 0 ...\ndate_time_yday__121.0_209.0: uint8 [1, 0 ...\ndate_time_yday__209.0_286.0: uint8 [0, 0 ...\ndate_time_yday__286.0_365.0: uint8 [0, 1 ...\ndate_time_hour__0.0_10.0: uint8 [0, 0 ...\ndate_time_hour__10.0_14.0: uint8 [0, 1 ...\ndate_time_hour__14.0_18.0: uint8 [1, 0 ...\ndate_time_hour__18.0_23.0: uint8 [0, 0 ...\ndate_time_minute__0.0_15.0: uint8 [0, 0 
...\ndate_time_minute__15.0_30.0: uint8 [1, 0 ...\ndate_time_minute__30.0_45.0: uint8 [0, 0 ...\ndate_time_minute__45.0_59.0: uint8 [0, 1 ...\ndate_time_second__0.0_15.0: uint8 [1, 0 ...\ndate_time_second__15.0_30.0: uint8 [0, 0 ...\ndate_time_second__30.0_45.0: uint8 [0, 1 ...\ndate_time_second__45.0_59.0: uint8 [0, 0 ...\nposa_continent__1: uint8 [0, 0 ...\nposa_continent__2: uint8 [0, 0 ...\nposa_continent__3: uint8 [1, 1 ...\nposa_continent__-OTHER: uint8 [0, 0 ...\nis_mobile__0: uint8 [1, 1 ...\nis_mobile__1: uint8 [0, 0 ...\nis_package__0: uint8 [1, 1 ...\nis_package__1: uint8 [0, 0 ...\nis_booking__0: uint8 [0, 1 ...\nis_booking__1: uint8 [1, 0 ...\nhotel_continent__-OTHER: uint8 [0, 0 ...\nhotel_continent__2: uint8 [1, 1 ...\nhotel_continent__3: uint8 [0, 0 ...\nhotel_continent__4: uint8 [0, 0 ...\nhotel_continent__6: uint8 [0, 0 ...\nhotel_country__-OTHER: uint8 [0, 0 ...\nhotel_country__50: uint8 [1, 1 ...\nhotel_country__8: uint8 [0, 0 ...\nhotel_market__-OTHER: uint8 [1, 1 ...\nhotel_cluster__-OTHER: uint8 [1, 1 ...\ndate_time_year__2013: uint8 [1, 0 ...\ndate_time_year__2014: uint8 [0, 1 ...\ndate_time_year_iso__2013: uint8 [1, 0 ...\ndate_time_year_iso__2014: uint8 [0, 1 ...\ndate_time_year_iso__-OTHER: uint8 [0, 0 ...\ndate_time_yearstart__0: uint8 [1, 1 ...\ndate_time_yearstart__-OTHER: uint8 [0, 0 ...\ndate_time_yearend__0: uint8 [1, 1 ...\ndate_time_yearend__-OTHER: uint8 [0, 0 ...\ndate_time_half__1: uint8 [0, 0 ...\ndate_time_half__2: uint8 [1, 1 ...\ndate_time_quarter__1: uint8 [0, 0 ...\ndate_time_quarter__2: uint8 [0, 0 ...\ndate_time_quarter__3: uint8 [1, 0 ...\ndate_time_quarter__4: uint8 [0, 1 ...\ndate_time_quarteryear__2013Q1: uint8 [0, 0 ...\ndate_time_quarteryear__2013Q2: uint8 [0, 0 ...\ndate_time_quarteryear__2013Q3: uint8 [1, 0 ...\ndate_time_quarteryear__2013Q4: uint8 [0, 0 ...\ndate_time_quarteryear__2014Q1: uint8 [0, 0 ...\ndate_time_quarteryear__2014Q2: uint8 [0, 0 ...\ndate_time_quarteryear__2014Q3: uint8 [0, 0 
...\ndate_time_quarteryear__2014Q4: uint8 [0, 1 ...\ndate_time_quarterstart__0: uint8 [1, 1 ...\ndate_time_quarterstart__-OTHER: uint8 [0, 0 ...\ndate_time_quarterend__0: uint8 [1, 1 ...\ndate_time_quarterend__-OTHER: uint8 [0, 0 ...\ndate_time_month_lbl__April: uint8 [0, 0 ...\ndate_time_month_lbl__August: uint8 [0, 0 ...\ndate_time_month_lbl__December: uint8 [0, 0 ...\ndate_time_month_lbl__February: uint8 [0, 0 ...\ndate_time_month_lbl__January: uint8 [0, 0 ...\ndate_time_month_lbl__July: uint8 [1, 0 ...\ndate_time_month_lbl__June: uint8 [0, 0 ...\ndate_time_month_lbl__March: uint8 [0, 0 ...\ndate_time_month_lbl__May: uint8 [0, 0 ...\ndate_time_month_lbl__November: uint8 [0, 1 ...\ndate_time_month_lbl__October: uint8 [0, 0 ...\ndate_time_month_lbl__September: uint8 [0, 0 ...\ndate_time_monthstart__0: uint8 [1, 1 ...\ndate_time_monthstart__-OTHER: uint8 [0, 0 ...\ndate_time_monthend__0: uint8 [1, 1 ...\ndate_time_monthend__-OTHER: uint8 [0, 0 ...\ndate_time_mweek__1: uint8 [0, 1 ...\ndate_time_mweek__2: uint8 [0, 0 ...\ndate_time_mweek__3: uint8 [0, 0 ...\ndate_time_mweek__4: uint8 [1, 0 ...\ndate_time_mweek__5: uint8 [0, 0 ...\ndate_time_wday__1: uint8 [0, 0 ...\ndate_time_wday__2: uint8 [0, 0 ...\ndate_time_wday__3: uint8 [0, 0 ...\ndate_time_wday__4: uint8 [1, 0 ...\ndate_time_wday__5: uint8 [0, 0 ...\ndate_time_wday__6: uint8 [0, 0 ...\ndate_time_wday__7: uint8 [0, 1 ...\ndate_time_wday_lbl__Friday: uint8 [0, 0 ...\ndate_time_wday_lbl__Monday: uint8 [0, 0 ...\ndate_time_wday_lbl__Saturday: uint8 [0, 0 ...\ndate_time_wday_lbl__Sunday: uint8 [0, 1 ...\ndate_time_wday_lbl__Thursday: uint8 [1, 0 ...\ndate_time_wday_lbl__Tuesday: uint8 [0, 0 ...\ndate_time_wday_lbl__Wednesday: uint8 [0, 0 ...\ndate_time_weekend__0: uint8 [1, 0 ...\ndate_time_weekend__1: uint8 [0, 1 ...\ndate_time_am_pm__am: uint8 [0, 0 ...\ndate_time_am_pm__pm: uint8 [1, 1 ...\n\n\n\n\nStep 2: Correlate the data\nNext, we use correlate() to calculate strength of the relationship. 
The main parameter is target, which should be selected based on the business goal.\nIn this case, we can create a business goal to understand what relates to a website visit count greater than 2. We will select the column: is_booking__1 as the target. This is because we want to know what relates to a hotel room booking via the website search data.\nThis returns a 3 column data frame containing:\n\nfeature: The name of the features\nbin: The bin that corresponds to a bin inside the features\ncorrelation: The strength of the relationship (0 to 1) and the direction of the relationship (+/-)\n\n\nexpedia_ts_correlate_df = expedia_ts_binarized_df.correlate('is_booking__1')\n\nexpedia_ts_correlate_df\n\n\n\n\n\n\n\n\nfeature\nbin\ncorrelation\n\n\n\n\n77\nis_booking\n0\n-1.000000\n\n\n78\nis_booking\n1\n1.000000\n\n\n32\ncnt\n2.0_72.0\n-0.099372\n\n\n31\ncnt\n1.0_2.0\n0.099372\n\n\n75\nis_package\n0\n0.075930\n\n\n...\n...\n...\n...\n\n\n131\ndate_time_monthend\n-OTHER\n0.000182\n\n\n108\ndate_time_quarteryear\n2014Q1\n-0.000041\n\n\n22\nsrch_children_cnt\n0.0_9.0\nNaN\n\n\n87\nhotel_market\n-OTHER\nNaN\n\n\n88\nhotel_cluster\n-OTHER\nNaN\n\n\n\n\n155 rows × 3 columns\n\n\n\n\n\nStep 3: Plot the Correlation funnel\nIt’s in this step where we can visualize review the correlations and determine which features relate to the target, the strength of the relationship (magnitude between 0 and 1), and the direction of the relationship (+/-).\n\nexpedia_ts_correlate_df.plot_correlation_funnel(\n engine = 'plotly',\n height = 800\n)" }, { - "objectID": "guides/04_wrangling.html#pad-by-time-with-grouped-time-series", - "href": "guides/04_wrangling.html#pad-by-time-with-grouped-time-series", - "title": "Data Wrangling", - "section": "3.2 Pad by Time with Grouped Time Series", - "text": "3.2 Pad by Time with Grouped Time Series\npad_by_time() can also be used with grouped time series. 
Let’s use the stocks_daily dataset to showcase an example:\n\n\nCode\n# load dataset\nstocks_df = tk.load_dataset('stocks_daily', parse_dates = ['date'])\n\n# pad by time\nstocks_df \\\n .groupby('symbol') \\\n .pad_by_time(\n date_column = 'date',\n freq = 'D'\n ) \\\n .assign(id = lambda x: x['symbol'].ffill())\n\n\n\n\n\n\n\n\n\nsymbol\ndate\nopen\nhigh\nlow\nclose\nvolume\nadjusted\nid\n\n\n\n\n0\nAAPL\n2013-01-02\n19.779285\n19.821428\n19.343929\n19.608213\n560518000.0\n16.791180\nAAPL\n\n\n1\nAAPL\n2013-01-03\n19.567142\n19.631071\n19.321428\n19.360714\n352965200.0\n16.579241\nAAPL\n\n\n2\nAAPL\n2013-01-04\n19.177500\n19.236786\n18.779642\n18.821428\n594333600.0\n16.117437\nAAPL\n\n\n3\nAAPL\n2013-01-05\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nAAPL\n\n\n4\nAAPL\n2013-01-06\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nAAPL\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n23485\nNVDA\n2023-09-17\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNVDA\n\n\n23486\nNVDA\n2023-09-18\n427.480011\n442.420013\n420.000000\n439.660004\n50027100.0\n439.660004\nNVDA\n\n\n23487\nNVDA\n2023-09-19\n438.329987\n439.660004\n430.019989\n435.200012\n37306400.0\n435.200012\nNVDA\n\n\n23488\nNVDA\n2023-09-20\n436.000000\n439.029999\n422.230011\n422.390015\n36710800.0\n422.390015\nNVDA\n\n\n23489\nNVDA\n2023-09-21\n415.829987\n421.000000\n409.799988\n410.170013\n44893000.0\n410.170013\nNVDA\n\n\n\n\n23490 rows × 9 columns\n\n\n\nTo replace NaN with 0 in a dataframe with multiple columns:\n\n\nCode\nfrom functools import partial\n\n# columns to replace NaN with 0\ncols_to_fill = ['open', 'high', 'low', 'close', 'volume', 'adjusted']\n\n# define a function to fillna\ndef fill_na_col(df, col):\n return df[col].fillna(0)\n\n# pad by time and replace NaN with 0\nstocks_df \\\n .groupby('symbol') \\\n .pad_by_time(\n date_column = 'date',\n freq = 'D'\n ) \\\n .assign(id = lambda x: x['symbol'].ffill()) \\\n .assign(**{col: partial(fill_na_col, col=col) for col in 
cols_to_fill})\n\n\n\n\n\n\n\n\n\nsymbol\ndate\nopen\nhigh\nlow\nclose\nvolume\nadjusted\nid\n\n\n\n\n0\nAAPL\n2013-01-02\n19.779285\n19.821428\n19.343929\n19.608213\n560518000.0\n16.791180\nAAPL\n\n\n1\nAAPL\n2013-01-03\n19.567142\n19.631071\n19.321428\n19.360714\n352965200.0\n16.579241\nAAPL\n\n\n2\nAAPL\n2013-01-04\n19.177500\n19.236786\n18.779642\n18.821428\n594333600.0\n16.117437\nAAPL\n\n\n3\nAAPL\n2013-01-05\n0.000000\n0.000000\n0.000000\n0.000000\n0.0\n0.000000\nAAPL\n\n\n4\nAAPL\n2013-01-06\n0.000000\n0.000000\n0.000000\n0.000000\n0.0\n0.000000\nAAPL\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n23485\nNVDA\n2023-09-17\n0.000000\n0.000000\n0.000000\n0.000000\n0.0\n0.000000\nNVDA\n\n\n23486\nNVDA\n2023-09-18\n427.480011\n442.420013\n420.000000\n439.660004\n50027100.0\n439.660004\nNVDA\n\n\n23487\nNVDA\n2023-09-19\n438.329987\n439.660004\n430.019989\n435.200012\n37306400.0\n435.200012\nNVDA\n\n\n23488\nNVDA\n2023-09-20\n436.000000\n439.029999\n422.230011\n422.390015\n36710800.0\n422.390015\nNVDA\n\n\n23489\nNVDA\n2023-09-21\n415.829987\n421.000000\n409.799988\n410.170013\n44893000.0\n410.170013\nNVDA\n\n\n\n\n23490 rows × 9 columns" + "objectID": "guides/05_augmenting.html", + "href": "guides/05_augmenting.html", + "title": "Adding Features (Augmenting)", + "section": "", + "text": "This section will cover the augment set of functions, used to add many additional time series features to a dataset. 
We’ll cover how to use the following set of functions" }, { - "objectID": "guides/07_timeseries_crossvalidation.html", - "href": "guides/07_timeseries_crossvalidation.html", - "title": "Time Series Cross Validation", - "section": "", - "text": "In this tutorial, you’ll learn how to use the TimeSeriesCV and TimeSeriesCVSplitter classes from pytimetk for time series cross-validation, using the walmart_sales_df dataset as an example, which contains 7 time series groups.\n\nIn Part 1, we’ll start with exploring the data and move on to creating and visualizing time-based cross-validation splits. This will prepare you for the next section with Scikit Learn.\nIn Part 2, we’ll implement time series cross-validation with Scikit-Learn, engineer features, train a random forest model, and visualize the results in Python. By following this process, you can ensure a robust evaluation of your time series models and gain insights into their predictive performance." + "objectID": "guides/05_augmenting.html#basic-examples", + "href": "guides/05_augmenting.html#basic-examples", + "title": "Adding Features (Augmenting)", + "section": "1.1 Basic Examples", + "text": "1.1 Basic Examples\nAdd 1 or more lags / leads to a dataset:\n\n\nCode\n# import libraries\nimport pytimetk as tk\nimport pandas as pd\nimport numpy as np\nimport random\n\n# create sample data\ndates = pd.date_range(start = '2023-09-18', end = '2023-09-24')\nvalues = [random.randint(10, 50) for _ in range(7)]\n\ndf = pd.DataFrame({\n 'date': dates,\n 'value': values\n})\n\ndf\n\n\n\n\n\n\n\n\n\ndate\nvalue\n\n\n\n\n0\n2023-09-18\n25\n\n\n1\n2023-09-19\n50\n\n\n2\n2023-09-20\n49\n\n\n3\n2023-09-21\n45\n\n\n4\n2023-09-22\n48\n\n\n5\n2023-09-23\n18\n\n\n6\n2023-09-24\n18\n\n\n\n\n\n\n\nCreate lag / lead of 3 days:\n\nLagLead\n\n\n\n\nCode\n# augment lag\ndf \\\n .augment_lags(\n date_column = 'date',\n value_column = 'value',\n lags = 3\n 
)\n\n\n\n\n\n\n\n\n\ndate\nvalue\nvalue_lag_3\n\n\n\n\n0\n2023-09-18\n25\nNaN\n\n\n1\n2023-09-19\n50\nNaN\n\n\n2\n2023-09-20\n49\nNaN\n\n\n3\n2023-09-21\n45\n25.0\n\n\n4\n2023-09-22\n48\n50.0\n\n\n5\n2023-09-23\n18\n49.0\n\n\n6\n2023-09-24\n18\n45.0\n\n\n\n\n\n\n\n\n\n\n\nCode\n# augment leads\ndf \\\n .augment_leads(\n date_column = 'date',\n value_column = 'value',\n leads = 3\n )\n\n\n\n\n\n\n\n\n\ndate\nvalue\nvalue_lead_3\n\n\n\n\n0\n2023-09-18\n25\n45.0\n\n\n1\n2023-09-19\n50\n48.0\n\n\n2\n2023-09-20\n49\n18.0\n\n\n3\n2023-09-21\n45\n18.0\n\n\n4\n2023-09-22\n48\nNaN\n\n\n5\n2023-09-23\n18\nNaN\n\n\n6\n2023-09-24\n18\nNaN\n\n\n\n\n\n\n\n\n\n\nWe can create multiple lag / lead values for a single time series:\n\nLagLead\n\n\n\n\nCode\n# multiple lagged values for a single time series\ndf \\\n .augment_lags(\n date_column = 'date',\n value_column = 'value',\n lags = (1, 3)\n )\n\n\n\n\n\n\n\n\n\ndate\nvalue\nvalue_lag_1\nvalue_lag_2\nvalue_lag_3\n\n\n\n\n0\n2023-09-18\n25\nNaN\nNaN\nNaN\n\n\n1\n2023-09-19\n50\n25.0\nNaN\nNaN\n\n\n2\n2023-09-20\n49\n50.0\n25.0\nNaN\n\n\n3\n2023-09-21\n45\n49.0\n50.0\n25.0\n\n\n4\n2023-09-22\n48\n45.0\n49.0\n50.0\n\n\n5\n2023-09-23\n18\n48.0\n45.0\n49.0\n\n\n6\n2023-09-24\n18\n18.0\n48.0\n45.0\n\n\n\n\n\n\n\n\n\n\n\nCode\n# multiple leads values for a single time series\ndf \\\n .augment_leads(\n date_column = 'date',\n value_column = 'value',\n leads = (1, 3)\n )\n\n\n\n\n\n\n\n\n\ndate\nvalue\nvalue_lead_1\nvalue_lead_2\nvalue_lead_3\n\n\n\n\n0\n2023-09-18\n25\n50.0\n49.0\n45.0\n\n\n1\n2023-09-19\n50\n49.0\n45.0\n48.0\n\n\n2\n2023-09-20\n49\n45.0\n48.0\n18.0\n\n\n3\n2023-09-21\n45\n48.0\n18.0\n18.0\n\n\n4\n2023-09-22\n48\n18.0\n18.0\nNaN\n\n\n5\n2023-09-23\n18\n18.0\nNaN\nNaN\n\n\n6\n2023-09-24\n18\nNaN\nNaN\nNaN" }, { - "objectID": "guides/07_timeseries_crossvalidation.html#step-1-load-and-explore-the-data", - "href": "guides/07_timeseries_crossvalidation.html#step-1-load-and-explore-the-data", - "title": "Time Series Cross 
Validation", - "section": "2.1 Step 1: Load and Explore the Data", - "text": "2.1 Step 1: Load and Explore the Data\nFirst, let’s load the Walmart sales dataset and explore its structure:\n\n# libraries\nimport pytimetk as tk\nimport pandas as pd\nimport numpy as np\n\n# Import Data\nwalmart_sales_df = tk.load_dataset('walmart_sales_weekly')\n\nwalmart_sales_df['Date'] = pd.to_datetime(walmart_sales_df['Date'])\n\nwalmart_sales_df = walmart_sales_df[['id', 'Date', 'Weekly_Sales']]\n\nwalmart_sales_df.glimpse()\n\n<class 'pandas.core.frame.DataFrame'>: 1001 rows of 3 columns\nid: object ['1_1', '1_1', '1_1', '1_1', '1_1', '1_ ...\nDate: datetime64[ns] [Timestamp('2010-02-05 00:00:00'), Time ...\nWeekly_Sales: float64 [24924.5, 46039.49, 41595.55, 19403.54, ..." + "objectID": "guides/05_augmenting.html#augment-lags-leads-for-grouped-time-series", + "href": "guides/05_augmenting.html#augment-lags-leads-for-grouped-time-series", + "title": "Adding Features (Augmenting)", + "section": "1.2 Augment Lags / Leads For Grouped Time Series", + "text": "1.2 Augment Lags / Leads For Grouped Time Series\naugment_lags() and augment_leads() also work for grouped time series data. 
Let’s use the m4_daily_df dataset to showcase examples:\n\n\nCode\n# load m4_daily_df\nm4_daily_df = tk.load_dataset('m4_daily', parse_dates = ['date'])\n\n\n\nLagLead\n\n\n\n\nCode\n# augment lags for grouped time series\nm4_daily_df \\\n .groupby(\"id\") \\\n .augment_lags(\n date_column = 'date',\n value_column = 'value',\n lags = (1, 7)\n )\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue\nvalue_lag_1\nvalue_lag_2\nvalue_lag_3\nvalue_lag_4\nvalue_lag_5\nvalue_lag_6\nvalue_lag_7\n\n\n\n\n0\nD10\n2014-07-03\n2076.2\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n1\nD10\n2014-07-04\n2073.4\n2076.2\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n2\nD10\n2014-07-05\n2048.7\n2073.4\n2076.2\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n3\nD10\n2014-07-06\n2048.9\n2048.7\n2073.4\n2076.2\nNaN\nNaN\nNaN\nNaN\n\n\n4\nD10\n2014-07-07\n2006.4\n2048.9\n2048.7\n2073.4\n2076.2\nNaN\nNaN\nNaN\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n9738\nD500\n2012-09-19\n9418.8\n9431.9\n9437.7\n9474.6\n9359.2\n9286.9\n9265.4\n9091.4\n\n\n9739\nD500\n2012-09-20\n9365.7\n9418.8\n9431.9\n9437.7\n9474.6\n9359.2\n9286.9\n9265.4\n\n\n9740\nD500\n2012-09-21\n9445.9\n9365.7\n9418.8\n9431.9\n9437.7\n9474.6\n9359.2\n9286.9\n\n\n9741\nD500\n2012-09-22\n9497.9\n9445.9\n9365.7\n9418.8\n9431.9\n9437.7\n9474.6\n9359.2\n\n\n9742\nD500\n2012-09-23\n9545.3\n9497.9\n9445.9\n9365.7\n9418.8\n9431.9\n9437.7\n9474.6\n\n\n\n\n9743 rows × 10 columns\n\n\n\n\n\n\n\nCode\n# augment leads for grouped time series\nm4_daily_df \\\n .groupby(\"id\") \\\n .augment_leads(\n date_column = 'date',\n value_column = 'value',\n leads = (1, 7)\n 
)\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue\nvalue_lead_1\nvalue_lead_2\nvalue_lead_3\nvalue_lead_4\nvalue_lead_5\nvalue_lead_6\nvalue_lead_7\n\n\n\n\n0\nD10\n2014-07-03\n2076.2\n2073.4\n2048.7\n2048.9\n2006.4\n2017.6\n2019.1\n2007.4\n\n\n1\nD10\n2014-07-04\n2073.4\n2048.7\n2048.9\n2006.4\n2017.6\n2019.1\n2007.4\n2010.0\n\n\n2\nD10\n2014-07-05\n2048.7\n2048.9\n2006.4\n2017.6\n2019.1\n2007.4\n2010.0\n2001.5\n\n\n3\nD10\n2014-07-06\n2048.9\n2006.4\n2017.6\n2019.1\n2007.4\n2010.0\n2001.5\n1978.8\n\n\n4\nD10\n2014-07-07\n2006.4\n2017.6\n2019.1\n2007.4\n2010.0\n2001.5\n1978.8\n1988.3\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n9738\nD500\n2012-09-19\n9418.8\n9365.7\n9445.9\n9497.9\n9545.3\nNaN\nNaN\nNaN\n\n\n9739\nD500\n2012-09-20\n9365.7\n9445.9\n9497.9\n9545.3\nNaN\nNaN\nNaN\nNaN\n\n\n9740\nD500\n2012-09-21\n9445.9\n9497.9\n9545.3\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n9741\nD500\n2012-09-22\n9497.9\n9545.3\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n9742\nD500\n2012-09-23\n9545.3\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n\n\n9743 rows × 10 columns" }, { - "objectID": "guides/07_timeseries_crossvalidation.html#step-2-visualize-the-time-series-data", - "href": "guides/07_timeseries_crossvalidation.html#step-2-visualize-the-time-series-data", - "title": "Time Series Cross Validation", - "section": "2.2 Step 2: Visualize the Time Series Data", - "text": "2.2 Step 2: Visualize the Time Series Data\nWe can visualize the weekly sales data for different store IDs using the plot_timeseries method from pytimetk:\n\nwalmart_sales_df \\\n .groupby('id') \\\n .plot_timeseries(\n \"Date\", \"Weekly_Sales\",\n plotly_dropdown = True,\n )\n\n\n \n\n\nThis will generate an interactive time series plot, allowing you to explore sales data for different stores using a dropdown." 
+ "objectID": "guides/05_augmenting.html#basic-examples-1", + "href": "guides/05_augmenting.html#basic-examples-1", + "title": "Adding Features (Augmenting)", + "section": "2.1 Basic Examples", + "text": "2.1 Basic Examples\nWe’ll continue with the use of our sample df created earlier:\n\n\nCode\n# window = 3 days, window function = mean\ndf \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'value',\n window = 3,\n window_func = 'mean'\n )\n\n\n\n\n\n\n\n\n\ndate\nvalue\nvalue_rolling_mean_win_3\n\n\n\n\n0\n2023-09-18\n25\nNaN\n\n\n1\n2023-09-19\n50\nNaN\n\n\n2\n2023-09-20\n49\n41.333333\n\n\n3\n2023-09-21\n45\n48.000000\n\n\n4\n2023-09-22\n48\n47.333333\n\n\n5\n2023-09-23\n18\n37.000000\n\n\n6\n2023-09-24\n18\n28.000000\n\n\n\n\n\n\n\nIt is important to understand how the center parameter in augment_rolling() works.\n\n\n\n\n\n\ncenter\n\n\n\n\n\nWhen set to True, the rolling window will be centered, meaning that the result is assigned to the row at the center of the window. When set to False (default), the rolling window will not be centered, meaning that the result is assigned to the row at the end of the window.\n\n\n\nLet’s see an example:\n\nAugment Rolling: Center = TrueAugment Rolling: Center = False\n\n\n\n\nCode\n# augment rolling: center = true\ndf \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'value',\n window = 3,\n window_func = 'mean',\n center = True\n )\n\n\n\n\n\n\n\n\n\ndate\nvalue\nvalue_rolling_mean_win_3\n\n\n\n\n0\n2023-09-18\n25\nNaN\n\n\n1\n2023-09-19\n50\n41.333333\n\n\n2\n2023-09-20\n49\n48.000000\n\n\n3\n2023-09-21\n45\n47.333333\n\n\n4\n2023-09-22\n48\n37.000000\n\n\n5\n2023-09-23\n18\n28.000000\n\n\n6\n2023-09-24\n18\nNaN\n\n\n\n\n\n\n\nNote that we are using a 3 day rolling window and applying a mean to value. In simpler terms, value_rolling_mean_win_3 is a 3 day rolling average of value with center set to True. 
Thus the function starts computing the mean from 2023-09-19.\n\n\n\n\nCode\n# augment rolling: center = false\ndf \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'value',\n window = 3,\n window_func = 'mean',\n center = False\n )\n\n\n\n\n\n\n\n\n\ndate\nvalue\nvalue_rolling_mean_win_3\n\n\n\n\n0\n2023-09-18\n25\nNaN\n\n\n1\n2023-09-19\n50\nNaN\n\n\n2\n2023-09-20\n49\n41.333333\n\n\n3\n2023-09-21\n45\n48.000000\n\n\n4\n2023-09-22\n48\n47.333333\n\n\n5\n2023-09-23\n18\n37.000000\n\n\n6\n2023-09-24\n18\n28.000000\n\n\n\n\n\n\n\nNote that we are using a 3 day rolling window and applying a mean to value. In simpler terms, value_rolling_mean_win_3 is a 3 day rolling average of value with center set to False. Thus the function starts computing the mean from 2023-09-20. NaN is returned as value_rolling_mean_win_3 for 2023-09-18 and 2023-09-19, since fewer than three observations are available to apply the 3 day rolling average." }, { - "objectID": "guides/07_timeseries_crossvalidation.html#step-3-set-up-timeseriescv-for-cross-validation", - "href": "guides/07_timeseries_crossvalidation.html#step-3-set-up-timeseriescv-for-cross-validation", - "title": "Time Series Cross Validation", - "section": "2.3 Step 3: Set Up TimeSeriesCV for Cross-Validation", - "text": "2.3 Step 3: Set Up TimeSeriesCV for Cross-Validation\nNow, let’s set up a time-based cross-validation scheme using TimeSeriesCV:\n\nfrom pytimetk.crossvalidation import TimeSeriesCV\n\n# Define parameters for TimeSeriesCV\ntscv = TimeSeriesCV(\n frequency=\"weeks\",\n train_size=52, # Use 52 weeks for training\n forecast_horizon=12, # Forecast 12 weeks ahead\n gap=0, # No gap between training and forecast sets\n stride=4, # Move forward by 4 weeks after each split\n window=\"rolling\", # Use a rolling window\n mode=\"backward\" # Generate splits from end to start\n)\n\n# Glimpse the cross-validation splits\ntscv.glimpse(\n walmart_sales_df['Weekly_Sales'], \n 
time_series=walmart_sales_df['Date']\n)\n\nSplit Number: 1\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-08-05 00:00:00 to 2012-07-27 00:00:00\nForecast Period: 2012-08-03 00:00:00 to 2012-10-19 00:00:00\n\nSplit Number: 2\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-07-08 00:00:00 to 2012-06-29 00:00:00\nForecast Period: 2012-07-06 00:00:00 to 2012-09-21 00:00:00\n\nSplit Number: 3\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-06-10 00:00:00 to 2012-06-01 00:00:00\nForecast Period: 2012-06-08 00:00:00 to 2012-08-24 00:00:00\n\nSplit Number: 4\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-05-13 00:00:00 to 2012-05-04 00:00:00\nForecast Period: 2012-05-11 00:00:00 to 2012-07-27 00:00:00\n\nSplit Number: 5\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-04-15 00:00:00 to 2012-04-06 00:00:00\nForecast Period: 2012-04-13 00:00:00 to 2012-06-29 00:00:00\n\nSplit Number: 6\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-03-18 00:00:00 to 2012-03-09 00:00:00\nForecast Period: 2012-03-16 00:00:00 to 2012-06-01 00:00:00\n\nSplit Number: 7\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-02-18 00:00:00 to 2012-02-10 00:00:00\nForecast Period: 2012-02-17 00:00:00 to 2012-05-04 00:00:00\n\nSplit Number: 8\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-01-21 00:00:00 to 2012-01-13 00:00:00\nForecast Period: 2012-01-20 00:00:00 to 2012-04-06 00:00:00\n\nSplit Number: 9\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-12-24 00:00:00 to 2011-12-16 00:00:00\nForecast Period: 2011-12-23 00:00:00 to 2012-03-09 00:00:00\n\nSplit Number: 10\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-11-26 00:00:00 to 2011-11-18 00:00:00\nForecast Period: 2011-11-25 00:00:00 to 2012-02-10 00:00:00\n\nSplit Number: 11\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-10-29 00:00:00 to 2011-10-21 00:00:00\nForecast 
Period: 2011-10-28 00:00:00 to 2012-01-13 00:00:00\n\nSplit Number: 12\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-10-01 00:00:00 to 2011-09-23 00:00:00\nForecast Period: 2011-09-30 00:00:00 to 2011-12-16 00:00:00\n\nSplit Number: 13\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-09-03 00:00:00 to 2011-08-26 00:00:00\nForecast Period: 2011-09-02 00:00:00 to 2011-11-18 00:00:00\n\nSplit Number: 14\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-08-06 00:00:00 to 2011-07-29 00:00:00\nForecast Period: 2011-08-05 00:00:00 to 2011-10-21 00:00:00\n\nSplit Number: 15\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-07-09 00:00:00 to 2011-07-01 00:00:00\nForecast Period: 2011-07-08 00:00:00 to 2011-09-23 00:00:00\n\nSplit Number: 16\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-06-11 00:00:00 to 2011-06-03 00:00:00\nForecast Period: 2011-06-10 00:00:00 to 2011-08-26 00:00:00\n\nSplit Number: 17\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-05-14 00:00:00 to 2011-05-06 00:00:00\nForecast Period: 2011-05-13 00:00:00 to 2011-07-29 00:00:00\n\nSplit Number: 18\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-04-16 00:00:00 to 2011-04-08 00:00:00\nForecast Period: 2011-04-15 00:00:00 to 2011-07-01 00:00:00\n\nSplit Number: 19\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-03-19 00:00:00 to 2011-03-11 00:00:00\nForecast Period: 2011-03-18 00:00:00 to 2011-06-03 00:00:00\n\nSplit Number: 20\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-02-19 00:00:00 to 2011-02-11 00:00:00\nForecast Period: 2011-02-18 00:00:00 to 2011-05-06 00:00:00\n\n\n\nThe glimpse method provides a summary of each cross-validation fold, including the start and end dates of the training and forecast periods." 
+ "objectID": "guides/05_augmenting.html#augment-rolling-with-multiple-windows-and-window-functions", + "href": "guides/05_augmenting.html#augment-rolling-with-multiple-windows-and-window-functions", + "title": "Adding Features (Augmenting)", + "section": "2.2 Augment Rolling with Multiple Windows and Window Functions", + "text": "2.2 Augment Rolling with Multiple Windows and Window Functions\nMultiple window functions can be passed to the window and window_func parameters:\n\n\nCode\n# augment rolling: window of 2 & 7 days, window_func of mean and standard deviation\nm4_daily_df \\\n .query('id == \"D10\"') \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'value',\n window = [2,7],\n window_func = ['mean', ('std', lambda x: x.std())]\n )\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue\nvalue_rolling_mean_win_2\nvalue_rolling_std_win_2\nvalue_rolling_mean_win_7\nvalue_rolling_std_win_7\n\n\n\n\n0\nD10\n2014-07-03\n2076.2\nNaN\nNaN\nNaN\nNaN\n\n\n1\nD10\n2014-07-04\n2073.4\n2074.80\n1.40\n2074.800000\n1.400000\n\n\n2\nD10\n2014-07-05\n2048.7\n2061.05\n12.35\n2066.100000\n12.356645\n\n\n3\nD10\n2014-07-06\n2048.9\n2048.80\n0.10\n2061.800000\n13.037830\n\n\n4\nD10\n2014-07-07\n2006.4\n2027.65\n21.25\n2050.720000\n25.041038\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n669\nD10\n2016-05-02\n2630.7\n2615.85\n14.85\n2579.471429\n28.868159\n\n\n670\nD10\n2016-05-03\n2649.3\n2640.00\n9.30\n2594.800000\n33.081631\n\n\n671\nD10\n2016-05-04\n2631.8\n2640.55\n8.75\n2601.371429\n35.145563\n\n\n672\nD10\n2016-05-05\n2622.5\n2627.15\n4.65\n2607.457143\n34.584508\n\n\n673\nD10\n2016-05-06\n2620.1\n2621.30\n1.20\n2618.328571\n22.923270\n\n\n\n\n674 rows × 7 columns" }, { - "objectID": "guides/07_timeseries_crossvalidation.html#step-4-plot-the-cross-validation-splits", - "href": "guides/07_timeseries_crossvalidation.html#step-4-plot-the-cross-validation-splits", - "title": "Time Series Cross Validation", - "section": "2.4 Step 4: Plot the Cross-Validation Splits", - "text": 
"2.4 Step 4: Plot the Cross-Validation Splits\nYou can visualize how the data is split for training and testing:\n\n# Plot the cross-validation splits\ntscv.plot(\n walmart_sales_df['Weekly_Sales'], \n time_series=walmart_sales_df['Date']\n)\n\n\n \n\n\nThis plot will show each fold, illustrating which weeks are used for training and which weeks are used for forecasting." + "objectID": "guides/05_augmenting.html#augment-rolling-with-grouped-time-series", + "href": "guides/05_augmenting.html#augment-rolling-with-grouped-time-series", + "title": "Adding Features (Augmenting)", + "section": "2.3 Augment Rolling with Grouped Time Series", + "text": "2.3 Augment Rolling with Grouped Time Series\nagument_rolling can be used on grouped time series data:\n\n\nCode\n## augment rolling on grouped time series: window of 2 & 7 days, window_func of mean and standard deviation\nm4_daily_df \\\n .groupby('id') \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'value',\n window = [2,7],\n window_func = ['mean', ('std', lambda x: x.std())]\n )\n\n\n\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue\nvalue_rolling_mean_win_2\nvalue_rolling_std_win_2\nvalue_rolling_mean_win_7\nvalue_rolling_std_win_7\n\n\n\n\n0\nD10\n2014-07-03\n2076.2\nNaN\nNaN\nNaN\nNaN\n\n\n1\nD10\n2014-07-04\n2073.4\n2074.80\n1.40\n2074.800000\n1.400000\n\n\n2\nD10\n2014-07-05\n2048.7\n2061.05\n12.35\n2066.100000\n12.356645\n\n\n3\nD10\n2014-07-06\n2048.9\n2048.80\n0.10\n2061.800000\n13.037830\n\n\n4\nD10\n2014-07-07\n2006.4\n2027.65\n21.25\n2050.720000\n25.041038\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n9738\nD500\n2012-09-19\n9418.8\n9425.35\n6.55\n9382.071429\n74.335988\n\n\n9739\nD500\n2012-09-20\n9365.7\n9392.25\n26.55\n9396.400000\n58.431303\n\n\n9740\nD500\n2012-09-21\n9445.9\n9405.80\n40.10\n9419.114286\n39.184451\n\n\n9741\nD500\n2012-09-22\n9497.9\n9471.90\n26.00\n9438.928571\n38.945336\n\n\n9742\nD500\n2012-09-23\n9545.3\n9521.60\n23.70\n9449.028571\n53.379416\n\n\n\n\n9743 rows × 7 columns" }, 
{ - "objectID": "guides/07_timeseries_crossvalidation.html#step-1-setting-up-the-timeseriescvsplitter", - "href": "guides/07_timeseries_crossvalidation.html#step-1-setting-up-the-timeseriescvsplitter", - "title": "Time Series Cross Validation", - "section": "3.1 Step 1: Setting Up the TimeSeriesCVSplitter", - "text": "3.1 Step 1: Setting Up the TimeSeriesCVSplitter\nThe TimeSeriesCVSplitter helps us divide our dataset into training and forecast sets in a rolling window fashion. Here’s how we configure it:\n\nfrom pytimetk.crossvalidation import TimeSeriesCVSplitter\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.model_selection import cross_val_score\n\n# Set up TimeSeriesCVSplitter\ncv_splitter = TimeSeriesCVSplitter(\n time_series=walmart_sales_df['Date'],\n frequency=\"weeks\",\n train_size=52*2,\n forecast_horizon=12,\n gap=0,\n stride=4,\n window=\"rolling\",\n mode=\"backward\",\n split_limit = 5\n)\n\n# Visualize the TSCV Strategy\ncv_splitter.splitter.plot(walmart_sales_df['Weekly_Sales'], walmart_sales_df['Date'])\n\n\n \n\n\nThe TimeSeriesCVSplitter creates multiple splits of the time series data, allowing us to validate the model across different periods. By visualizing the cross-validation strategy, we can see how the training and forecast sets are structured." 
+ "objectID": "guides/05_augmenting.html#basic-example", + "href": "guides/05_augmenting.html#basic-example", + "title": "Adding Features (Augmenting)", + "section": "3.1 Basic Example", + "text": "3.1 Basic Example\nWe’ll showcase an example using the m4_daily_df dataset by generating 29 additional features from the date column:\n\n\nCode\n# augment time series signature\nm4_daily_df \\\n .query('id == \"D10\"') \\\n .augment_timeseries_signature(\n date_column = 'date'\n ) \\\n .head()\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue\ndate_index_num\ndate_year\ndate_year_iso\ndate_yearstart\ndate_yearend\ndate_leapyear\ndate_half\n...\ndate_mday\ndate_qday\ndate_yday\ndate_weekend\ndate_hour\ndate_minute\ndate_second\ndate_msecond\ndate_nsecond\ndate_am_pm\n\n\n\n\n0\nD10\n2014-07-03\n2076.2\n1404345600\n2014\n2014\n0\n0\n0\n2\n...\n3\n3\n184\n0\n0\n0\n0\n0\n0\nam\n\n\n1\nD10\n2014-07-04\n2073.4\n1404432000\n2014\n2014\n0\n0\n0\n2\n...\n4\n4\n185\n0\n0\n0\n0\n0\n0\nam\n\n\n2\nD10\n2014-07-05\n2048.7\n1404518400\n2014\n2014\n0\n0\n0\n2\n...\n5\n5\n186\n0\n0\n0\n0\n0\n0\nam\n\n\n3\nD10\n2014-07-06\n2048.9\n1404604800\n2014\n2014\n0\n0\n0\n2\n...\n6\n6\n187\n1\n0\n0\n0\n0\n0\nam\n\n\n4\nD10\n2014-07-07\n2006.4\n1404691200\n2014\n2014\n0\n0\n0\n2\n...\n7\n7\n188\n0\n0\n0\n0\n0\n0\nam\n\n\n\n\n5 rows × 32 columns" }, { - "objectID": "guides/07_timeseries_crossvalidation.html#step-2-feature-engineering-for-time-series-data", - "href": "guides/07_timeseries_crossvalidation.html#step-2-feature-engineering-for-time-series-data", - "title": "Time Series Cross Validation", - "section": "3.2 Step 2: Feature Engineering for Time Series Data", - "text": "3.2 Step 2: Feature Engineering for Time Series Data\nEffective feature engineering can significantly impact the performance of a time series model. 
Using pytimetk, we extract a variety of features from the Date column.\n\nGenerating Time Series Features\nWe use get_timeseries_signature to generate useful features, such as year, quarter, month, and day-of-week indicators.\n\n# Prepare data for modeling\n\n# Extract time series features from the 'Date' column\nX_time_features = tk.get_timeseries_signature(walmart_sales_df['Date'])\n\n# Select features to dummy encode\nfeatures_to_dummy = ['Date_quarteryear', 'Date_month_lbl', 'Date_wday_lbl', 'Date_am_pm']\n\n# Dummy encode the selected features\nX_time_dummies = pd.get_dummies(X_time_features[features_to_dummy], drop_first=True)\n\n# Dummy encode the 'id' column\nX_id_dummies = pd.get_dummies(walmart_sales_df['id'], prefix='store')\n\n# Combine the time series features, dummy-encoded features, and the 'id' dummies\nX = pd.concat([X_time_features, X_time_dummies, X_id_dummies], axis=1)\n\n# Drop the original categorical columns that were dummy encoded\nX = X.drop(columns=features_to_dummy).drop('Date', axis=1)\n\n# Set the target variable\ny = walmart_sales_df['Weekly_Sales'].values" + "objectID": "guides/05_augmenting.html#basic-example-1", + "href": "guides/05_augmenting.html#basic-example-1", + "title": "Adding Features (Augmenting)", + "section": "4.1 Basic Example", + "text": "4.1 Basic Example\nWe’ll showcase an example using some sample data:\n\n\nCode\n# create sample data\ndates = pd.date_range(start = '2022-12-25', end = '2023-01-05')\n\ndf = pd.DataFrame({'date': dates})\n\n# augment time series signature: USA\ndf \\\n .augment_holiday_signature(\n date_column = 'date',\n country_name = 'UnitedStates'\n )\n\n\n\n\n\n\n\n\n\ndate\nis_holiday\nbefore_holiday\nafter_holiday\nholiday_name\n\n\n\n\n0\n2022-12-25\n1\n1\n0\nChristmas Day\n\n\n1\n2022-12-26\n1\n0\n1\nChristmas Day 
(Observed)\n\n\n2\n2022-12-27\n0\n0\n1\nNaN\n\n\n3\n2022-12-28\n0\n0\n0\nNaN\n\n\n4\n2022-12-29\n0\n0\n0\nNaN\n\n\n5\n2022-12-30\n0\n0\n0\nNaN\n\n\n6\n2022-12-31\n0\n1\n0\nNaN\n\n\n7\n2023-01-01\n1\n1\n0\nNew Year's Day\n\n\n8\n2023-01-02\n1\n0\n1\nNew Year's Day (Observed)\n\n\n9\n2023-01-03\n0\n0\n1\nNaN\n\n\n10\n2023-01-04\n0\n0\n0\nNaN\n\n\n11\n2023-01-05\n0\n0\n0\nNaN" }, { - "objectID": "guides/07_timeseries_crossvalidation.html#step-3-model-training-and-evaluation-with-random-forest", - "href": "guides/07_timeseries_crossvalidation.html#step-3-model-training-and-evaluation-with-random-forest", - "title": "Time Series Cross Validation", - "section": "3.3 Step 3: Model Training and Evaluation with Random Forest", - "text": "3.3 Step 3: Model Training and Evaluation with Random Forest\nFor this example, we use RandomForestRegressor from scikit-learn to model the time series data. A random forest is a robust, ensemble-based model that can handle a wide range of regression tasks.\n\n# Initialize the RandomForestRegressor model\nmodel = RandomForestRegressor(\n n_estimators=100, # Number of trees in the forest\n max_depth=None, # Maximum depth of the trees (None means nodes are expanded until all leaves are pure)\n random_state=42 # Set a random state for reproducibility\n)\n\n# Evaluate the model using cross-validation scores\nscores = cross_val_score(model, X, y, cv=cv_splitter, scoring='neg_mean_squared_error')\n\n# Print cross-validation scores\nprint(\"Cross-Validation Scores (Negative MSE):\", scores)\n\nCross-Validation Scores (Negative MSE): [-23761708.80112538 -23107644.58461143 -21728878.18790144\n -25113860.93913386 -86192034.48953015]" + "objectID": "guides/05_augmenting.html#basic-example-2", + "href": "guides/05_augmenting.html#basic-example-2", + "title": "Adding Features (Augmenting)", + "section": "5.1 Basic Example", + "text": "5.1 Basic Example\n\n\nCode\n# augment fourier with 7 periods and max order of 1\n#m4_daily_df \\\n# .query('id == 
\"D10\"') \\\n# .augment_fourier(\n# date_column = 'date',\n# value_column = 'value',\n# num_periods = 7,\n# max_order = 1\n# ) \\\n# .head(20)\n\n\nNotice the additional value_fourier_1_1 to value_fourier_1_7 colums that have been added to the data." }, { - "objectID": "guides/07_timeseries_crossvalidation.html#step-4-visualizing-the-forecast", - "href": "guides/07_timeseries_crossvalidation.html#step-4-visualizing-the-forecast", - "title": "Time Series Cross Validation", - "section": "3.4 Step 4: Visualizing the Forecast", - "text": "3.4 Step 4: Visualizing the Forecast\nVisualization is crucial to understand how well the model predicts future values. We collect the actual and predicted values for each fold and combine them for easy plotting.\n\n# Lists to store the combined data\ncombined_data = []\n\n# Iterate through each fold and collect the data\nfor i, (train_index, test_index) in enumerate(cv_splitter.split(X, y), start=1):\n # Get the training and forecast data from the original DataFrame\n train_df = walmart_sales_df.iloc[train_index].copy()\n test_df = walmart_sales_df.iloc[test_index].copy()\n \n # Fit the model on the training data\n model.fit(X.iloc[train_index], y[train_index])\n \n # Predict on the test set\n y_pred = model.predict(X.iloc[test_index])\n \n # Add the actual and predicted values\n train_df['Actual'] = y[train_index]\n train_df['Predicted'] = None # No predictions for training data\n train_df['Fold'] = i # Indicate the current fold\n \n test_df['Actual'] = y[test_index]\n test_df['Predicted'] = y_pred # Predictions for the test data\n test_df['Fold'] = i # Indicate the current fold\n \n # Append both the training and forecast DataFrames to the combined data list\n combined_data.extend([train_df, test_df])\n\n# Combine all the data into a single DataFrame\nfull_forecast_df = pd.concat(combined_data, ignore_index=True)\n\nfull_forecast_df = full_forecast_df[['id', 'Date', 'Actual', 'Predicted', 
'Fold']]\n\nfull_forecast_df.glimpse()\n\n<class 'pandas.core.frame.DataFrame'>: 4060 rows of 5 columns\nid: object ['1_1', '1_1', '1_1', '1_1', '1_1', '1_1', ...\nDate: datetime64[ns] [Timestamp('2010-08-06 00:00:00'), Timesta ...\nActual: float64 [17508.41, 15536.4, 15740.13, 15793.87, 16 ...\nPredicted: float64 [nan, nan, nan, nan, nan, nan, nan, nan, n ...\nFold: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\n\n\n\nPreparing Data for Visualization\nTo make the data easier to plot, we use pd.melt() to transform the Actual and Predicted columns into a long format.\n\n# Melt the Actual and Predicted columns\nmelted_df = pd.melt(\n full_forecast_df,\n id_vars=['id', 'Date', 'Fold'], # Columns to keep\n value_vars=['Actual', 'Predicted'], # Columns to melt\n var_name='Type', # Name for the new column indicating 'Actual' or 'Predicted'\n value_name='Value' # Name for the new column with the values\n)\n\nmelted_df[\"unique_id\"] = \"ID_\" + melted_df['id'] + \"-Fold_\" + melted_df[\"Fold\"].astype(str)\n\nmelted_df.glimpse()\n\n<class 'pandas.core.frame.DataFrame'>: 8120 rows of 6 columns\nid: object ['1_1', '1_1', '1_1', '1_1', '1_1', '1_1', ...\nDate: datetime64[ns] [Timestamp('2010-08-06 00:00:00'), Timesta ...\nFold: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\nType: object ['Actual', 'Actual', 'Actual', 'Actual', ' ...\nValue: float64 [17508.41, 15536.4, 15740.13, 15793.87, 16 ...\nunique_id: object ['ID_1_1-Fold_1', 'ID_1_1-Fold_1', 'ID_1_1 ...\n\n\n\n\nPlotting the Forecasts\nFinally, we use plot_timeseries() to visualize the forecasts, comparing the actual and predicted values for each fold.\n\nmelted_df \\\n .groupby('unique_id') \\\n .plot_timeseries(\n \"Date\", \"Value\",\n color_column = \"Type\",\n smooth=False, \n plotly_dropdown=True\n )" + "objectID": "guides/05_augmenting.html#augment-fourier-with-grouped-time-series", + "href": "guides/05_augmenting.html#augment-fourier-with-grouped-time-series", + "title": "Adding Features 
(Augmenting)", + "section": "5.2 Augment Fourier with Grouped Time Series", + "text": "5.2 Augment Fourier with Grouped Time Series\naugment_fourier also works with grouped time series:\n\n\nCode\n# augment fourier with grouped time series\nm4_daily_df \\\n .groupby('id') \\\n .augment_fourier(\n date_column = 'date',\n value_column = 'value',\n num_periods = 7,\n max_order = 1\n ) \\\n .head(20)\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue\nvalue_fourier_1_1\nvalue_fourier_1_2\nvalue_fourier_1_3\nvalue_fourier_1_4\nvalue_fourier_1_5\nvalue_fourier_1_6\nvalue_fourier_1_7\n\n\n\n\n0\nD10\n2014-07-03\n2076.2\n0.394510\n-0.725024\n0.937927\n-0.998682\n0.897435\n-0.650609\n0.298243\n\n\n1\nD10\n2014-07-04\n2073.4\n-0.980653\n0.383931\n0.830342\n-0.709015\n-0.552759\n0.925423\n0.190450\n\n\n2\nD10\n2014-07-05\n2048.7\n0.011484\n0.022967\n0.034446\n0.045921\n0.057390\n0.068852\n0.080304\n\n\n3\nD10\n2014-07-06\n2048.9\n0.975899\n-0.425928\n-0.790004\n0.770723\n0.453624\n-0.968706\n-0.030835\n\n\n4\nD10\n2014-07-07\n2006.4\n-0.415510\n0.755886\n-0.959581\n0.989762\n-0.840972\n0.540115\n-0.141593\n\n\n5\nD10\n2014-07-08\n2017.6\n-0.803876\n-0.956286\n-0.333715\n0.559301\n0.999055\n0.629169\n-0.250600\n\n\n6\nD10\n2014-07-09\n2019.1\n0.748318\n0.992779\n0.568784\n-0.238184\n-0.884778\n-0.935635\n-0.356511\n\n\n7\nD10\n2014-07-10\n2007.4\n0.494070\n-0.859111\n0.999790\n-0.879368\n0.529294\n-0.040992\n-0.458015\n\n\n8\nD10\n2014-07-11\n2010.0\n-0.952864\n0.578192\n0.602021\n-0.943494\n-0.029515\n0.961404\n-0.553858\n\n\n9\nD10\n2014-07-12\n2001.5\n-0.099581\n-0.198171\n-0.294792\n-0.388482\n-0.478310\n-0.563384\n-0.642856\n\n\n10\nD10\n2014-07-13\n1978.8\n0.994091\n-0.215816\n-0.947238\n0.421459\n0.855740\n-0.607239\n-0.723909\n\n\n11\nD10\n2014-07-14\n1988.3\n-0.311977\n0.592812\n-0.814472\n0.954831\n-0.999879\n0.945118\n-0.796015\n\n\n12\nD10\n2014-07-15\n2000.7\n-0.864932\n-0.868201\n-0.006551\n0.861625\n0.871433\n0.013101\n-0.858282\n\n\n13\nD10\n2014-07-16\n2010.5\n0.670062\n0.9947
81\n0.806801\n0.203005\n-0.505418\n-0.953354\n-0.909941\n\n\n14\nD10\n2014-07-17\n2014.5\n0.587524\n-0.950856\n0.951356\n-0.588831\n0.001617\n0.586214\n-0.950354\n\n\n15\nD10\n2014-07-18\n1962.6\n-0.913299\n0.743956\n0.307286\n-0.994265\n0.502625\n0.584837\n-0.979022\n\n\n16\nD10\n2014-07-19\n1948.0\n-0.209415\n-0.409542\n-0.591509\n-0.747244\n-0.869842\n-0.953865\n-0.995589\n\n\n17\nD10\n2014-07-20\n1943.0\n0.999997\n0.004934\n-0.999973\n-0.009867\n0.999924\n0.014800\n-0.999851\n\n\n18\nD10\n2014-07-21\n1933.3\n-0.204588\n0.400521\n-0.579511\n0.733985\n-0.857409\n0.944561\n-0.991756\n\n\n19\nD10\n2014-07-22\n1891.0\n-0.915297\n-0.737326\n0.321336\n0.996182\n0.481148\n-0.608588\n-0.971403" }, { "objectID": "guides/06_anomalize.html", @@ -1610,235 +1603,242 @@ "text": "2.2 Plotting Groups\nNext, let’s move on to a dataset with time series groups, m4_monthly, which is a sample of 4 time series from the M4 competition that are sampled at a monthly frequency.\n\n\nCode\n# Import a Time Series Data Set\nm4_monthly = tk.load_dataset(\"m4_monthly\", parse_dates = ['date'])\nm4_monthly\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue\n\n\n\n\n0\nM1\n1976-06-01\n8000\n\n\n1\nM1\n1976-07-01\n8350\n\n\n2\nM1\n1976-08-01\n8570\n\n\n3\nM1\n1976-09-01\n7700\n\n\n4\nM1\n1976-10-01\n7080\n\n\n...\n...\n...\n...\n\n\n1569\nM1000\n2015-02-01\n880\n\n\n1570\nM1000\n2015-03-01\n800\n\n\n1571\nM1000\n2015-04-01\n1140\n\n\n1572\nM1000\n2015-05-01\n970\n\n\n1573\nM1000\n2015-06-01\n1430\n\n\n\n\n1574 rows × 3 columns\n\n\n\nVisualizing grouped data is as simple as grouping the data set with groupby() before run it into the plot_timeseries() function. There are 2 methods:\n\nFacets\nPlotly Dropdown\n\n\nFacets (Subgroups on one plot)\nThis is great to see all time series in one plot. 
Here are the key points:\n\nGroups can be added using the pandas groupby().\nThese groups are then converted into facets.\nUsing facet_ncol = 2 returns a 2-column faceted plot.\nSetting facet_scales = \"free\" allows the x and y-axes of each plot to scale independently of the other plots.\n\n\n\nCode\nm4_monthly.groupby('id').plot_timeseries(\n 'date', 'value', \n facet_ncol = 2, \n facet_scales = \"free\"\n)\n\n\n\n \n\n\n\n\nPlotly Dropdown\nSometimes you have many groups and would prefer to see one plot per group. This can be accomplished with plotly_dropdown. You can adjust the x and y position as follows:\n\n\nCode\nm4_monthly.groupby('id').plot_timeseries(\n 'date', 'value', \n plotly_dropdown=True,\n plotly_dropdown_x=0,\n plotly_dropdown_y=1\n)\n\n\n\n \n\n\nThe groups can also be vizualized in the same plot using color_column paramenter. Let’s come back to taylor_30_min dataframe.\n\n\nCode\n# load data\ntaylor_30_min = tk.load_dataset(\"taylor_30_min\", parse_dates = ['date'])\n\n# extract the month using pandas\ntaylor_30_min['month'] = pd.to_datetime(taylor_30_min['date']).dt.month\n\n# plot groups\ntaylor_30_min.plot_timeseries(\n 'date', 'value', \n color_column = 'month'\n)" }, { - "objectID": "guides/05_augmenting.html", - "href": "guides/05_augmenting.html", - "title": "Adding Features (Augmenting)", + "objectID": "guides/04_wrangling.html", + "href": "guides/04_wrangling.html", + "title": "Data Wrangling", "section": "", - "text": "This section will cover the augment set of functions, use to add many additional time series features to a dataset. We’ll cover how to use the following set of functions" + "text": "This section will cover data wrangling for timeseries using pytimetk. 
We’ll show examples for the following functions:" }, { - "objectID": "guides/05_augmenting.html#basic-examples", - "href": "guides/05_augmenting.html#basic-examples", - "title": "Adding Features (Augmenting)", - "section": "1.1 Basic Examples", - "text": "1.1 Basic Examples\nAdd 1 or more lags / leads to a dataset:\n\n\nCode\n# import libraries\nimport pytimetk as tk\nimport pandas as pd\nimport numpy as np\nimport random\n\n# create sample data\ndates = pd.date_range(start = '2023-09-18', end = '2023-09-24')\nvalues = [random.randint(10, 50) for _ in range(7)]\n\ndf = pd.DataFrame({\n 'date': dates,\n 'value': values\n})\n\ndf\n\n\n\n\n\n\n\n\n\ndate\nvalue\n\n\n\n\n0\n2023-09-18\n25\n\n\n1\n2023-09-19\n50\n\n\n2\n2023-09-20\n49\n\n\n3\n2023-09-21\n45\n\n\n4\n2023-09-22\n48\n\n\n5\n2023-09-23\n18\n\n\n6\n2023-09-24\n18\n\n\n\n\n\n\n\nCreate lag / lead of 3 days:\n\nLagLead\n\n\n\n\nCode\n# augment lag\ndf \\\n .augment_lags(\n date_column = 'date',\n value_column = 'value',\n lags = 3\n )\n\n\n\n\n\n\n\n\n\ndate\nvalue\nvalue_lag_3\n\n\n\n\n0\n2023-09-18\n25\nNaN\n\n\n1\n2023-09-19\n50\nNaN\n\n\n2\n2023-09-20\n49\nNaN\n\n\n3\n2023-09-21\n45\n25.0\n\n\n4\n2023-09-22\n48\n50.0\n\n\n5\n2023-09-23\n18\n49.0\n\n\n6\n2023-09-24\n18\n45.0\n\n\n\n\n\n\n\n\n\n\n\nCode\n# augment leads\ndf \\\n .augment_leads(\n date_column = 'date',\n value_column = 'value',\n leads = 3\n )\n\n\n\n\n\n\n\n\n\ndate\nvalue\nvalue_lead_3\n\n\n\n\n0\n2023-09-18\n25\n45.0\n\n\n1\n2023-09-19\n50\n48.0\n\n\n2\n2023-09-20\n49\n18.0\n\n\n3\n2023-09-21\n45\n18.0\n\n\n4\n2023-09-22\n48\nNaN\n\n\n5\n2023-09-23\n18\nNaN\n\n\n6\n2023-09-24\n18\nNaN\n\n\n\n\n\n\n\n\n\n\nWe can create multiple lag / lead values for a single time series:\n\nLagLead\n\n\n\n\nCode\n# multiple lagged values for a single time series\ndf \\\n .augment_lags(\n date_column = 'date',\n value_column = 'value',\n lags = (1, 3)\n 
)\n\n\n\n\n\n\n\n\n\ndate\nvalue\nvalue_lag_1\nvalue_lag_2\nvalue_lag_3\n\n\n\n\n0\n2023-09-18\n25\nNaN\nNaN\nNaN\n\n\n1\n2023-09-19\n50\n25.0\nNaN\nNaN\n\n\n2\n2023-09-20\n49\n50.0\n25.0\nNaN\n\n\n3\n2023-09-21\n45\n49.0\n50.0\n25.0\n\n\n4\n2023-09-22\n48\n45.0\n49.0\n50.0\n\n\n5\n2023-09-23\n18\n48.0\n45.0\n49.0\n\n\n6\n2023-09-24\n18\n18.0\n48.0\n45.0\n\n\n\n\n\n\n\n\n\n\n\nCode\n# multiple leads values for a single time series\ndf \\\n .augment_leads(\n date_column = 'date',\n value_column = 'value',\n leads = (1, 3)\n )\n\n\n\n\n\n\n\n\n\ndate\nvalue\nvalue_lead_1\nvalue_lead_2\nvalue_lead_3\n\n\n\n\n0\n2023-09-18\n25\n50.0\n49.0\n45.0\n\n\n1\n2023-09-19\n50\n49.0\n45.0\n48.0\n\n\n2\n2023-09-20\n49\n45.0\n48.0\n18.0\n\n\n3\n2023-09-21\n45\n48.0\n18.0\n18.0\n\n\n4\n2023-09-22\n48\n18.0\n18.0\nNaN\n\n\n5\n2023-09-23\n18\n18.0\nNaN\nNaN\n\n\n6\n2023-09-24\n18\nNaN\nNaN\nNaN" + "objectID": "guides/04_wrangling.html#basic-example", + "href": "guides/04_wrangling.html#basic-example", + "title": "Data Wrangling", + "section": "1.1 Basic Example", + "text": "1.1 Basic Example\nThe m4_daily dataset has a daily frequency. Say we are interested in forecasting at the weekly level. We can use summarize_by_time() to aggregate to the weekly level:\n\n\nCode\n# summarize by time: daily to weekly\nsummarized_df = m4_daily_df \\\n .summarize_by_time(\n date_column = 'date',\n value_column = 'value',\n freq = 'W',\n agg_func = 'sum'\n )\n\nprint(summarized_df.head())\nprint('\\nLength of the full dataset:', len(summarized_df))\n\n\n date value\n0 1978-06-25 27328.12\n1 1978-07-02 63621.88\n2 1978-07-09 63334.38\n3 1978-07-16 63737.51\n4 1978-07-23 64718.76\n\nLength of the full dataset: 1977\n\n\nThe data has now been aggregated at the weekly level. Notice we now have 1977 rows, compared to the full dataset, which had 9743 rows." 
}, { - "objectID": "guides/05_augmenting.html#augment-lags-leads-for-grouped-time-series", - "href": "guides/05_augmenting.html#augment-lags-leads-for-grouped-time-series", - "title": "Adding Features (Augmenting)", - "section": "1.2 Augment Lags / Leads For Grouped Time Series", - "text": "1.2 Augment Lags / Leads For Grouped Time Series\naugment_lags() and augment_leads() also works for grouped time series data. Lets use the m4_daily_df dataset to showcase examples:\n\n\nCode\n# load m4_daily_df\nm4_daily_df = tk.load_dataset('m4_daily', parse_dates = ['date'])\n\n\n\nLagLead\n\n\n\n\nCode\n# agument lags for grouped time series\nm4_daily_df \\\n .groupby(\"id\") \\\n .augment_lags(\n date_column = 'date',\n value_column = 'value',\n lags = (1, 7)\n )\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue\nvalue_lag_1\nvalue_lag_2\nvalue_lag_3\nvalue_lag_4\nvalue_lag_5\nvalue_lag_6\nvalue_lag_7\n\n\n\n\n0\nD10\n2014-07-03\n2076.2\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n1\nD10\n2014-07-04\n2073.4\n2076.2\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n2\nD10\n2014-07-05\n2048.7\n2073.4\n2076.2\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n3\nD10\n2014-07-06\n2048.9\n2048.7\n2073.4\n2076.2\nNaN\nNaN\nNaN\nNaN\n\n\n4\nD10\n2014-07-07\n2006.4\n2048.9\n2048.7\n2073.4\n2076.2\nNaN\nNaN\nNaN\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n9738\nD500\n2012-09-19\n9418.8\n9431.9\n9437.7\n9474.6\n9359.2\n9286.9\n9265.4\n9091.4\n\n\n9739\nD500\n2012-09-20\n9365.7\n9418.8\n9431.9\n9437.7\n9474.6\n9359.2\n9286.9\n9265.4\n\n\n9740\nD500\n2012-09-21\n9445.9\n9365.7\n9418.8\n9431.9\n9437.7\n9474.6\n9359.2\n9286.9\n\n\n9741\nD500\n2012-09-22\n9497.9\n9445.9\n9365.7\n9418.8\n9431.9\n9437.7\n9474.6\n9359.2\n\n\n9742\nD500\n2012-09-23\n9545.3\n9497.9\n9445.9\n9365.7\n9418.8\n9431.9\n9437.7\n9474.6\n\n\n\n\n9743 rows × 10 columns\n\n\n\n\n\n\n\nCode\n# augment leads for grouped time series\nm4_daily_df \\\n .groupby(\"id\") \\\n .augment_leads(\n date_column = 'date',\n value_column = 'value',\n leads = (1, 7)\n 
)\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue\nvalue_lead_1\nvalue_lead_2\nvalue_lead_3\nvalue_lead_4\nvalue_lead_5\nvalue_lead_6\nvalue_lead_7\n\n\n\n\n0\nD10\n2014-07-03\n2076.2\n2073.4\n2048.7\n2048.9\n2006.4\n2017.6\n2019.1\n2007.4\n\n\n1\nD10\n2014-07-04\n2073.4\n2048.7\n2048.9\n2006.4\n2017.6\n2019.1\n2007.4\n2010.0\n\n\n2\nD10\n2014-07-05\n2048.7\n2048.9\n2006.4\n2017.6\n2019.1\n2007.4\n2010.0\n2001.5\n\n\n3\nD10\n2014-07-06\n2048.9\n2006.4\n2017.6\n2019.1\n2007.4\n2010.0\n2001.5\n1978.8\n\n\n4\nD10\n2014-07-07\n2006.4\n2017.6\n2019.1\n2007.4\n2010.0\n2001.5\n1978.8\n1988.3\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n9738\nD500\n2012-09-19\n9418.8\n9365.7\n9445.9\n9497.9\n9545.3\nNaN\nNaN\nNaN\n\n\n9739\nD500\n2012-09-20\n9365.7\n9445.9\n9497.9\n9545.3\nNaN\nNaN\nNaN\nNaN\n\n\n9740\nD500\n2012-09-21\n9445.9\n9497.9\n9545.3\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n9741\nD500\n2012-09-22\n9497.9\n9545.3\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n9742\nD500\n2012-09-23\n9545.3\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n\n\n9743 rows × 10 columns" + "objectID": "guides/04_wrangling.html#additional-aggregate-functions", + "href": "guides/04_wrangling.html#additional-aggregate-functions", + "title": "Data Wrangling", + "section": "1.2 Additional Aggregate Functions", + "text": "1.2 Additional Aggregate Functions\nsummarize_by_time() can take additional aggregate functions in the agg_func argument.\n\n\nCode\n# summarize by time with additional aggregate functions\nsummarized_multiple_agg_df = m4_daily_df \\\n .summarize_by_time(\n date_column = 'date',\n value_column = 'value',\n freq = 'W',\n agg_func = ['sum', 'min', 'max']\n )\n\nsummarized_multiple_agg_df.head()\n\n\n\n\n\n\n\n\n\ndate\nvalue_sum\nvalue_min\nvalue_max\n\n\n\n\n0\n1978-06-25\n27328.12\n9103.12\n9115.62\n\n\n1\n1978-07-02\n63621.88\n9046.88\n9115.62\n\n\n2\n1978-07-09\n63334.38\n9028.12\n9096.88\n\n\n3\n1978-07-16\n63737.51\n9075.00\n9146.88\n\n\n4\n1978-07-23\n64718.76\n9171.88\n9315.62" }, { - 
"objectID": "guides/05_augmenting.html#basic-examples-1", - "href": "guides/05_augmenting.html#basic-examples-1", - "title": "Adding Features (Augmenting)", - "section": "2.1 Basic Examples", - "text": "2.1 Basic Examples\nWe’ll continue with the use of our sample df created earlier:\n\n\nCode\n# window = 3 days, window function = mean\ndf \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'value',\n window = 3,\n window_func = 'mean'\n )\n\n\n\n\n\n\n\n\n\ndate\nvalue\nvalue_rolling_mean_win_3\n\n\n\n\n0\n2023-09-18\n25\nNaN\n\n\n1\n2023-09-19\n50\nNaN\n\n\n2\n2023-09-20\n49\n41.333333\n\n\n3\n2023-09-21\n45\n48.000000\n\n\n4\n2023-09-22\n48\n47.333333\n\n\n5\n2023-09-23\n18\n37.000000\n\n\n6\n2023-09-24\n18\n28.000000\n\n\n\n\n\n\n\nIt is important to understand how the center parameter in augment_rolling() works.\n\n\n\n\n\n\ncenter\n\n\n\n\n\nWhen set to True (default) the value of the rolling window will be centered, meaning that the value at the center of the window will be used as the result. When set to False (default) the rolling window will not be centered, meaning that the value at the end of the window will be used as the result.\n\n\n\nLets see an example:\n\nAugment Rolling: Center = TrueAugment Rolling: Center = False\n\n\n\n\nCode\n# agument rolling: center = true\ndf \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'value',\n window = 3,\n window_func = 'mean',\n center = True\n )\n\n\n\n\n\n\n\n\n\ndate\nvalue\nvalue_rolling_mean_win_3\n\n\n\n\n0\n2023-09-18\n25\nNaN\n\n\n1\n2023-09-19\n50\n41.333333\n\n\n2\n2023-09-20\n49\n48.000000\n\n\n3\n2023-09-21\n45\n47.333333\n\n\n4\n2023-09-22\n48\n37.000000\n\n\n5\n2023-09-23\n18\n28.000000\n\n\n6\n2023-09-24\n18\nNaN\n\n\n\n\n\n\n\nNote that we are using a 3 day rolling window and applying a mean to value. In simplier terms, value_rolling_mean_win_3 is a 3 day rolling average of value with center set to True. 
Thus the function starts computing the mean from 2023-09-19\n\n\n\n\nCode\n# agument rolling: center = false\ndf \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'value',\n window = 3,\n window_func = 'mean',\n center = False\n )\n\n\n\n\n\n\n\n\n\ndate\nvalue\nvalue_rolling_mean_win_3\n\n\n\n\n0\n2023-09-18\n25\nNaN\n\n\n1\n2023-09-19\n50\nNaN\n\n\n2\n2023-09-20\n49\n41.333333\n\n\n3\n2023-09-21\n45\n48.000000\n\n\n4\n2023-09-22\n48\n47.333333\n\n\n5\n2023-09-23\n18\n37.000000\n\n\n6\n2023-09-24\n18\n28.000000\n\n\n\n\n\n\n\nNote that we are using a 3 day rolling window and applying a mean to value. In simplier terms, value_rolling_mean_win_3 is a 3 day rolling average of value with center set to False. Thus the function starts computing the mean from 2023-09-20. The same value for 2023-19-18 and 2023-09-19 are returned as value_rolling_mean_win_3 since it did not detected the third to apply the 3 day rolling average." + "objectID": "guides/04_wrangling.html#summarize-by-time-with-grouped-time-series", + "href": "guides/04_wrangling.html#summarize-by-time-with-grouped-time-series", + "title": "Data Wrangling", + "section": "1.3 Summarize by Time with Grouped Time Series", + "text": "1.3 Summarize by Time with Grouped Time Series\nsummarize_by_time() also works with groups.\n\n\nCode\n# summarize by time with groups and additional aggregate functions\ngrouped_summarized_df = (\n m4_daily_df\n .groupby('id')\n .summarize_by_time(\n date_column = 'date',\n value_column = 'value',\n freq = 'W',\n agg_func = [\n 'sum',\n 'min',\n ('q25', lambda x: np.quantile(x, 0.25)),\n 'median',\n ('q75', lambda x: np.quantile(x, 0.75)),\n 'max'\n ],\n 
)\n)\n\ngrouped_summarized_df.head()\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue_sum\nvalue_min\nvalue_q25\nvalue_median\nvalue_q75\nvalue_max\n\n\n\n\n0\nD10\n2014-07-06\n8247.2\n2048.7\n2048.85\n2061.15\n2074.10\n2076.2\n\n\n1\nD10\n2014-07-13\n14040.8\n1978.8\n2003.95\n2007.40\n2013.80\n2019.1\n\n\n2\nD10\n2014-07-20\n13867.6\n1943.0\n1955.30\n1988.30\n2005.60\n2014.5\n\n\n3\nD10\n2014-07-27\n13266.3\n1876.0\n1887.15\n1891.00\n1895.85\n1933.3\n\n\n4\nD10\n2014-08-03\n13471.2\n1886.2\n1914.60\n1920.00\n1939.55\n1956.7" }, { - "objectID": "guides/05_augmenting.html#augment-rolling-with-multiple-windows-and-window-functions", - "href": "guides/05_augmenting.html#augment-rolling-with-multiple-windows-and-window-functions", - "title": "Adding Features (Augmenting)", - "section": "2.2 Augment Rolling with Multiple Windows and Window Functions", - "text": "2.2 Augment Rolling with Multiple Windows and Window Functions\nMultiple window functions can be passed to the window and window_func parameters:\n\n\nCode\n# augment rolling: window of 2 & 7 days, window_func of mean and standard deviation\nm4_daily_df \\\n .query('id == \"D10\"') \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'value',\n window = [2,7],\n window_func = ['mean', ('std', lambda x: x.std())]\n 
)\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue\nvalue_rolling_mean_win_2\nvalue_rolling_std_win_2\nvalue_rolling_mean_win_7\nvalue_rolling_std_win_7\n\n\n\n\n0\nD10\n2014-07-03\n2076.2\nNaN\nNaN\nNaN\nNaN\n\n\n1\nD10\n2014-07-04\n2073.4\n2074.80\n1.40\n2074.800000\n1.400000\n\n\n2\nD10\n2014-07-05\n2048.7\n2061.05\n12.35\n2066.100000\n12.356645\n\n\n3\nD10\n2014-07-06\n2048.9\n2048.80\n0.10\n2061.800000\n13.037830\n\n\n4\nD10\n2014-07-07\n2006.4\n2027.65\n21.25\n2050.720000\n25.041038\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n669\nD10\n2016-05-02\n2630.7\n2615.85\n14.85\n2579.471429\n28.868159\n\n\n670\nD10\n2016-05-03\n2649.3\n2640.00\n9.30\n2594.800000\n33.081631\n\n\n671\nD10\n2016-05-04\n2631.8\n2640.55\n8.75\n2601.371429\n35.145563\n\n\n672\nD10\n2016-05-05\n2622.5\n2627.15\n4.65\n2607.457143\n34.584508\n\n\n673\nD10\n2016-05-06\n2620.1\n2621.30\n1.20\n2618.328571\n22.923270\n\n\n\n\n674 rows × 7 columns" + "objectID": "guides/04_wrangling.html#basic-example-1", + "href": "guides/04_wrangling.html#basic-example-1", + "title": "Data Wrangling", + "section": "2.1 Basic Example", + "text": "2.1 Basic Example\nWe’ll continue with our use of the m4_daily_df dataset. Recall we’ve already aggregated at the weekly level (summarized_df). Let’s check out the last week in summarized_df:\n\n\nCode\n# last week in dataset\nsummarized_df \\\n .sort_values(by = 'date', ascending = True) \\\n .iloc[: -1] \\\n .tail(1)\n\n\n\n\n\n\n\n\n\ndate\nvalue\n\n\n\n\n1975\n2016-05-01\n17959.8\n\n\n\n\n\n\n\n\n\n\n\n\n\niloc()\n\n\n\n\n\niloc[: -1] is used to filter out the last row and keep only dates that are the start of the week.\n\n\n\nWe can see that the last week is the week of 2016-05-01. Now say we wanted to forecast the next 8 weeks. 
We can extend the dataset beyond the week of 2016-05-01:\n\n\nCode\n# extend dataset by 8 weeks\nsummarized_extended_df = summarized_df \\\n .future_frame(\n date_column = 'date',\n length_out = 8\n )\n\nsummarized_extended_df\n\n\n\n\n\n\n\n\n\ndate\nvalue\n\n\n\n\n0\n1978-06-25\n27328.12\n\n\n1\n1978-07-02\n63621.88\n\n\n2\n1978-07-09\n63334.38\n\n\n3\n1978-07-16\n63737.51\n\n\n4\n1978-07-23\n64718.76\n\n\n...\n...\n...\n\n\n1980\n2016-06-05\nNaN\n\n\n1981\n2016-06-12\nNaN\n\n\n1982\n2016-06-19\nNaN\n\n\n1983\n2016-06-26\nNaN\n\n\n1984\n2016-07-03\nNaN\n\n\n\n\n1985 rows × 2 columns\n\n\n\nTo get only the future data, we can filter the dataset for rows where value is missing (np.nan).\n\n\nCode\n# get only future data\nsummarized_extended_df \\\n .query('value.isna()')\n\n\n\n\n\n\n\n\n\ndate\nvalue\n\n\n\n\n1977\n2016-05-15\nNaN\n\n\n1978\n2016-05-22\nNaN\n\n\n1979\n2016-05-29\nNaN\n\n\n1980\n2016-06-05\nNaN\n\n\n1981\n2016-06-12\nNaN\n\n\n1982\n2016-06-19\nNaN\n\n\n1983\n2016-06-26\nNaN\n\n\n1984\n2016-07-03\nNaN" }, { - "objectID": "guides/05_augmenting.html#augment-rolling-with-grouped-time-series", - "href": "guides/05_augmenting.html#augment-rolling-with-grouped-time-series", - "title": "Adding Features (Augmenting)", - "section": "2.3 Augment Rolling with Grouped Time Series", - "text": "2.3 Augment Rolling with Grouped Time Series\nagument_rolling can be used on grouped time series data:\n\n\nCode\n## augment rolling on grouped time series: window of 2 & 7 days, window_func of mean and standard deviation\nm4_daily_df \\\n .groupby('id') \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'value',\n window = [2,7],\n window_func = ['mean', ('std', lambda x: x.std())]\n 
)\n\n\n\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue\nvalue_rolling_mean_win_2\nvalue_rolling_std_win_2\nvalue_rolling_mean_win_7\nvalue_rolling_std_win_7\n\n\n\n\n0\nD10\n2014-07-03\n2076.2\nNaN\nNaN\nNaN\nNaN\n\n\n1\nD10\n2014-07-04\n2073.4\n2074.80\n1.40\n2074.800000\n1.400000\n\n\n2\nD10\n2014-07-05\n2048.7\n2061.05\n12.35\n2066.100000\n12.356645\n\n\n3\nD10\n2014-07-06\n2048.9\n2048.80\n0.10\n2061.800000\n13.037830\n\n\n4\nD10\n2014-07-07\n2006.4\n2027.65\n21.25\n2050.720000\n25.041038\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n9738\nD500\n2012-09-19\n9418.8\n9425.35\n6.55\n9382.071429\n74.335988\n\n\n9739\nD500\n2012-09-20\n9365.7\n9392.25\n26.55\n9396.400000\n58.431303\n\n\n9740\nD500\n2012-09-21\n9445.9\n9405.80\n40.10\n9419.114286\n39.184451\n\n\n9741\nD500\n2012-09-22\n9497.9\n9471.90\n26.00\n9438.928571\n38.945336\n\n\n9742\nD500\n2012-09-23\n9545.3\n9521.60\n23.70\n9449.028571\n53.379416\n\n\n\n\n9743 rows × 7 columns" + "objectID": "guides/04_wrangling.html#future-frame-with-grouped-time-series", + "href": "guides/04_wrangling.html#future-frame-with-grouped-time-series", + "title": "Data Wrangling", + "section": "2.2 Future Frame with Grouped Time Series", + "text": "2.2 Future Frame with Grouped Time Series\nfuture_frame() also works for grouped time series. 
We can see an example using our grouped summarized dataset (grouped_summarized_df) from earlier:\n\n\nCode\n# future frame with grouped time series\ngrouped_summarized_df[['id', 'date', 'value_sum']] \\\n .groupby('id') \\\n .future_frame(\n date_column = 'date',\n length_out = 8\n ) \\\n .query('value_sum.isna()') # filtering to return only the future data\n\n\n\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue_sum\n\n\n\n\n1395\nD10\n2016-05-15\nNaN\n\n\n1396\nD10\n2016-05-22\nNaN\n\n\n1397\nD10\n2016-05-29\nNaN\n\n\n1398\nD10\n2016-06-05\nNaN\n\n\n1399\nD10\n2016-06-12\nNaN\n\n\n1400\nD10\n2016-06-19\nNaN\n\n\n1401\nD10\n2016-06-26\nNaN\n\n\n1402\nD10\n2016-07-03\nNaN\n\n\n1403\nD160\n2011-07-10\nNaN\n\n\n1404\nD160\n2011-07-17\nNaN\n\n\n1405\nD160\n2011-07-24\nNaN\n\n\n1406\nD160\n2011-07-31\nNaN\n\n\n1407\nD160\n2011-08-07\nNaN\n\n\n1408\nD160\n2011-08-14\nNaN\n\n\n1409\nD160\n2011-08-21\nNaN\n\n\n1410\nD160\n2011-08-28\nNaN\n\n\n1411\nD410\n1980-05-11\nNaN\n\n\n1412\nD410\n1980-05-18\nNaN\n\n\n1413\nD410\n1980-05-25\nNaN\n\n\n1414\nD410\n1980-06-01\nNaN\n\n\n1415\nD410\n1980-06-08\nNaN\n\n\n1416\nD410\n1980-06-15\nNaN\n\n\n1417\nD410\n1980-06-22\nNaN\n\n\n1418\nD410\n1980-06-29\nNaN\n\n\n1419\nD500\n2012-09-30\nNaN\n\n\n1420\nD500\n2012-10-07\nNaN\n\n\n1421\nD500\n2012-10-14\nNaN\n\n\n1422\nD500\n2012-10-21\nNaN\n\n\n1423\nD500\n2012-10-28\nNaN\n\n\n1424\nD500\n2012-11-04\nNaN\n\n\n1425\nD500\n2012-11-11\nNaN\n\n\n1426\nD500\n2012-11-18\nNaN" }, { - "objectID": "guides/05_augmenting.html#basic-example", - "href": "guides/05_augmenting.html#basic-example", - "title": "Adding Features (Augmenting)", + "objectID": "guides/04_wrangling.html#basic-example-2", + "href": "guides/04_wrangling.html#basic-example-2", + "title": "Data Wrangling", "section": "3.1 Basic Example", - "text": "3.1 Basic Example\nWe’ll showcase an example using the m4_daily_df dataset by generating 29 additional features from the date column:\n\n\nCode\n# augment time series signature\nm4_daily_df \\\n 
.query('id == \"D10\"') \\\n .augment_timeseries_signature(\n date_column = 'date'\n ) \\\n .head()\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue\ndate_index_num\ndate_year\ndate_year_iso\ndate_yearstart\ndate_yearend\ndate_leapyear\ndate_half\n...\ndate_mday\ndate_qday\ndate_yday\ndate_weekend\ndate_hour\ndate_minute\ndate_second\ndate_msecond\ndate_nsecond\ndate_am_pm\n\n\n\n\n0\nD10\n2014-07-03\n2076.2\n1404345600\n2014\n2014\n0\n0\n0\n2\n...\n3\n3\n184\n0\n0\n0\n0\n0\n0\nam\n\n\n1\nD10\n2014-07-04\n2073.4\n1404432000\n2014\n2014\n0\n0\n0\n2\n...\n4\n4\n185\n0\n0\n0\n0\n0\n0\nam\n\n\n2\nD10\n2014-07-05\n2048.7\n1404518400\n2014\n2014\n0\n0\n0\n2\n...\n5\n5\n186\n0\n0\n0\n0\n0\n0\nam\n\n\n3\nD10\n2014-07-06\n2048.9\n1404604800\n2014\n2014\n0\n0\n0\n2\n...\n6\n6\n187\n1\n0\n0\n0\n0\n0\nam\n\n\n4\nD10\n2014-07-07\n2006.4\n1404691200\n2014\n2014\n0\n0\n0\n2\n...\n7\n7\n188\n0\n0\n0\n0\n0\n0\nam\n\n\n\n\n5 rows × 32 columns" + "text": "3.1 Basic Example\nLet’s start with a basic example to see how pad_by_time() works. 
We’ll create some sample data with missing timestamps:\n\n\nCode\n# libraries\nimport pytimetk as tk\nimport pandas as pd\nimport numpy as np\n\n# sample quarterly data with missing timestamp for Q3\ndates = pd.to_datetime([\"2021-01-01\", \"2021-04-01\", \"2021-10-01\"])\nvalue = range(len(dates))\n\ndf = pd.DataFrame({\n 'date': dates,\n 'value': range(len(dates))\n})\n\ndf\n\n\n\n\n\n\n\n\n\ndate\nvalue\n\n\n\n\n0\n2021-01-01\n0\n\n\n1\n2021-04-01\n1\n\n\n2\n2021-10-01\n2\n\n\n\n\n\n\n\nNow we can use pad_by_time() to fill in the missing timestamp:\n\n\nCode\n# pad by time\ndf \\\n .pad_by_time(\n date_column = 'date',\n freq = 'QS' # specifying quarter start frequency\n )\n\n\n\n\n\n\n\n\n\ndate\nvalue\n\n\n\n\n0\n2021-01-01\n0.0\n\n\n1\n2021-04-01\n1.0\n\n\n2\n2021-07-01\nNaN\n\n\n3\n2021-10-01\n2.0\n\n\n\n\n\n\n\nWe can also specify shorter time frequency:\n\n\nCode\n# pad by time with shorter frequency\ndf \\\n .pad_by_time(\n date_column = 'date',\n freq = 'MS' # specifying month start frequency\n ) \\\n .assign(value = lambda x: x['value'].fillna(0)) # replace NaN with 0\n\n\n\n\n\n\n\n\n\ndate\nvalue\n\n\n\n\n0\n2021-01-01\n0.0\n\n\n1\n2021-02-01\n0.0\n\n\n2\n2021-03-01\n0.0\n\n\n3\n2021-04-01\n1.0\n\n\n4\n2021-05-01\n0.0\n\n\n5\n2021-06-01\n0.0\n\n\n6\n2021-07-01\n0.0\n\n\n7\n2021-08-01\n0.0\n\n\n8\n2021-09-01\n0.0\n\n\n9\n2021-10-01\n2.0" }, { - "objectID": "guides/05_augmenting.html#basic-example-1", - "href": "guides/05_augmenting.html#basic-example-1", - "title": "Adding Features (Augmenting)", - "section": "4.1 Basic Example", - "text": "4.1 Basic Example\nWe’ll showcase an example using some sample data:\n\n\nCode\n# create sample data\ndates = pd.date_range(start = '2022-12-25', end = '2023-01-05')\n\ndf = pd.DataFrame({'date': dates})\n\n# augment time series signature: USA\ndf \\\n .augment_holiday_signature(\n date_column = 'date',\n country_name = 'UnitedStates'\n 
)\n\n\n\n\n\n\n\n\n\ndate\nis_holiday\nbefore_holiday\nafter_holiday\nholiday_name\n\n\n\n\n0\n2022-12-25\n1\n1\n0\nChristmas Day\n\n\n1\n2022-12-26\n1\n0\n1\nChristmas Day (Observed)\n\n\n2\n2022-12-27\n0\n0\n1\nNaN\n\n\n3\n2022-12-28\n0\n0\n0\nNaN\n\n\n4\n2022-12-29\n0\n0\n0\nNaN\n\n\n5\n2022-12-30\n0\n0\n0\nNaN\n\n\n6\n2022-12-31\n0\n1\n0\nNaN\n\n\n7\n2023-01-01\n1\n1\n0\nNew Year's Day\n\n\n8\n2023-01-02\n1\n0\n1\nNew Year's Day (Observed)\n\n\n9\n2023-01-03\n0\n0\n1\nNaN\n\n\n10\n2023-01-04\n0\n0\n0\nNaN\n\n\n11\n2023-01-05\n0\n0\n0\nNaN" + "objectID": "guides/04_wrangling.html#pad-by-time-with-grouped-time-series", + "href": "guides/04_wrangling.html#pad-by-time-with-grouped-time-series", + "title": "Data Wrangling", + "section": "3.2 Pad by Time with Grouped Time Series", + "text": "3.2 Pad by Time with Grouped Time Series\npad_by_time() can also be used with grouped time series. Let’s use the stocks_daily dataset to showcase an example:\n\n\nCode\n# load dataset\nstocks_df = tk.load_dataset('stocks_daily', parse_dates = ['date'])\n\n# pad by time\nstocks_df \\\n .groupby('symbol') \\\n .pad_by_time(\n date_column = 'date',\n freq = 'D'\n ) \\\n .assign(id = lambda x: 
x['symbol'].ffill())\n\n\n\n\n\n\n\n\n\nsymbol\ndate\nopen\nhigh\nlow\nclose\nvolume\nadjusted\nid\n\n\n\n\n0\nAAPL\n2013-01-02\n19.779285\n19.821428\n19.343929\n19.608213\n560518000.0\n16.791180\nAAPL\n\n\n1\nAAPL\n2013-01-03\n19.567142\n19.631071\n19.321428\n19.360714\n352965200.0\n16.579241\nAAPL\n\n\n2\nAAPL\n2013-01-04\n19.177500\n19.236786\n18.779642\n18.821428\n594333600.0\n16.117437\nAAPL\n\n\n3\nAAPL\n2013-01-05\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nAAPL\n\n\n4\nAAPL\n2013-01-06\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nAAPL\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n23485\nNVDA\n2023-09-17\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNVDA\n\n\n23486\nNVDA\n2023-09-18\n427.480011\n442.420013\n420.000000\n439.660004\n50027100.0\n439.660004\nNVDA\n\n\n23487\nNVDA\n2023-09-19\n438.329987\n439.660004\n430.019989\n435.200012\n37306400.0\n435.200012\nNVDA\n\n\n23488\nNVDA\n2023-09-20\n436.000000\n439.029999\n422.230011\n422.390015\n36710800.0\n422.390015\nNVDA\n\n\n23489\nNVDA\n2023-09-21\n415.829987\n421.000000\n409.799988\n410.170013\n44893000.0\n410.170013\nNVDA\n\n\n\n\n23490 rows × 9 columns\n\n\n\nTo replace NaN with 0 in a dataframe with multiple columns:\n\n\nCode\nfrom functools import partial\n\n# columns to replace NaN with 0\ncols_to_fill = ['open', 'high', 'low', 'close', 'volume', 'adjusted']\n\n# define a function to fillna\ndef fill_na_col(df, col):\n return df[col].fillna(0)\n\n# pad by time and replace NaN with 0\nstocks_df \\\n .groupby('symbol') \\\n .pad_by_time(\n date_column = 'date',\n freq = 'D'\n ) \\\n .assign(id = lambda x: x['symbol'].ffill()) \\\n .assign(**{col: partial(fill_na_col, col=col) for col in 
cols_to_fill})\n\n\n\n\n\n\n\n\n\nsymbol\ndate\nopen\nhigh\nlow\nclose\nvolume\nadjusted\nid\n\n\n\n\n0\nAAPL\n2013-01-02\n19.779285\n19.821428\n19.343929\n19.608213\n560518000.0\n16.791180\nAAPL\n\n\n1\nAAPL\n2013-01-03\n19.567142\n19.631071\n19.321428\n19.360714\n352965200.0\n16.579241\nAAPL\n\n\n2\nAAPL\n2013-01-04\n19.177500\n19.236786\n18.779642\n18.821428\n594333600.0\n16.117437\nAAPL\n\n\n3\nAAPL\n2013-01-05\n0.000000\n0.000000\n0.000000\n0.000000\n0.0\n0.000000\nAAPL\n\n\n4\nAAPL\n2013-01-06\n0.000000\n0.000000\n0.000000\n0.000000\n0.0\n0.000000\nAAPL\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n23485\nNVDA\n2023-09-17\n0.000000\n0.000000\n0.000000\n0.000000\n0.0\n0.000000\nNVDA\n\n\n23486\nNVDA\n2023-09-18\n427.480011\n442.420013\n420.000000\n439.660004\n50027100.0\n439.660004\nNVDA\n\n\n23487\nNVDA\n2023-09-19\n438.329987\n439.660004\n430.019989\n435.200012\n37306400.0\n435.200012\nNVDA\n\n\n23488\nNVDA\n2023-09-20\n436.000000\n439.029999\n422.230011\n422.390015\n36710800.0\n422.390015\nNVDA\n\n\n23489\nNVDA\n2023-09-21\n415.829987\n421.000000\n409.799988\n410.170013\n44893000.0\n410.170013\nNVDA\n\n\n\n\n23490 rows × 9 columns" }, { - "objectID": "guides/05_augmenting.html#basic-example-2", - "href": "guides/05_augmenting.html#basic-example-2", - "title": "Adding Features (Augmenting)", - "section": "5.1 Basic Example", - "text": "5.1 Basic Example\n\n\nCode\n# augment fourier with 7 periods and max order of 1\n#m4_daily_df \\\n# .query('id == \"D10\"') \\\n# .augment_fourier(\n# date_column = 'date',\n# value_column = 'value',\n# num_periods = 7,\n# max_order = 1\n# ) \\\n# .head(20)\n\n\nNotice the additional value_fourier_1_1 to value_fourier_1_7 colums that have been added to the data." 
+ "objectID": "tutorials/04_anomaly_detection.html", + "href": "tutorials/04_anomaly_detection.html", + "title": "Anomaly Detection in Website Traffic", + "section": "", + "text": "Anomalize: Breakdown, identify, and clean anomalies in 1 easy step\nAnomalies, often called outliers, are data points that deviate significantly from the general trend or pattern in the data. In the context of time series, they can appear as sudden spikes, drops, or any abrupt change in a sequence of values.\nAnomaly detection for time series is a technique used to identify unusual patterns that do not conform to expected behavior. It is especially relevant for sequential data (like stock prices, sensor data, sales data, etc.) where the temporal aspect is crucial. Anomalies can identify important events or be the cause of noise that can hinder forecasting performance." }, { - "objectID": "guides/05_augmenting.html#augment-fourier-with-grouped-time-series", - "href": "guides/05_augmenting.html#augment-fourier-with-grouped-time-series", - "title": "Adding Features (Augmenting)", - "section": "5.2 Augment Fourier with Grouped Time Series", - "text": "5.2 Augment Fourier with Grouped Time Series\naugment_fourier also works with grouped time series:\n\n\nCode\n# augment fourier with grouped time series\nm4_daily_df \\\n .groupby('id') \\\n .augment_fourier(\n date_column = 'date',\n value_column = 'value',\n num_periods = 7,\n max_order = 1\n ) \\\n 
.head(20)\n\n\n\n\n\n\n\n\n\nid\ndate\nvalue\nvalue_fourier_1_1\nvalue_fourier_1_2\nvalue_fourier_1_3\nvalue_fourier_1_4\nvalue_fourier_1_5\nvalue_fourier_1_6\nvalue_fourier_1_7\n\n\n\n\n0\nD10\n2014-07-03\n2076.2\n0.394510\n-0.725024\n0.937927\n-0.998682\n0.897435\n-0.650609\n0.298243\n\n\n1\nD10\n2014-07-04\n2073.4\n-0.980653\n0.383931\n0.830342\n-0.709015\n-0.552759\n0.925423\n0.190450\n\n\n2\nD10\n2014-07-05\n2048.7\n0.011484\n0.022967\n0.034446\n0.045921\n0.057390\n0.068852\n0.080304\n\n\n3\nD10\n2014-07-06\n2048.9\n0.975899\n-0.425928\n-0.790004\n0.770723\n0.453624\n-0.968706\n-0.030835\n\n\n4\nD10\n2014-07-07\n2006.4\n-0.415510\n0.755886\n-0.959581\n0.989762\n-0.840972\n0.540115\n-0.141593\n\n\n5\nD10\n2014-07-08\n2017.6\n-0.803876\n-0.956286\n-0.333715\n0.559301\n0.999055\n0.629169\n-0.250600\n\n\n6\nD10\n2014-07-09\n2019.1\n0.748318\n0.992779\n0.568784\n-0.238184\n-0.884778\n-0.935635\n-0.356511\n\n\n7\nD10\n2014-07-10\n2007.4\n0.494070\n-0.859111\n0.999790\n-0.879368\n0.529294\n-0.040992\n-0.458015\n\n\n8\nD10\n2014-07-11\n2010.0\n-0.952864\n0.578192\n0.602021\n-0.943494\n-0.029515\n0.961404\n-0.553858\n\n\n9\nD10\n2014-07-12\n2001.5\n-0.099581\n-0.198171\n-0.294792\n-0.388482\n-0.478310\n-0.563384\n-0.642856\n\n\n10\nD10\n2014-07-13\n1978.8\n0.994091\n-0.215816\n-0.947238\n0.421459\n0.855740\n-0.607239\n-0.723909\n\n\n11\nD10\n2014-07-14\n1988.3\n-0.311977\n0.592812\n-0.814472\n0.954831\n-0.999879\n0.945118\n-0.796015\n\n\n12\nD10\n2014-07-15\n2000.7\n-0.864932\n-0.868201\n-0.006551\n0.861625\n0.871433\n0.013101\n-0.858282\n\n\n13\nD10\n2014-07-16\n2010.5\n0.670062\n0.994781\n0.806801\n0.203005\n-0.505418\n-0.953354\n-0.909941\n\n\n14\nD10\n2014-07-17\n2014.5\n0.587524\n-0.950856\n0.951356\n-0.588831\n0.001617\n0.586214\n-0.950354\n\n\n15\nD10\n2014-07-18\n1962.6\n-0.913299\n0.743956\n0.307286\n-0.994265\n0.502625\n0.584837\n-0.979022\n\n\n16\nD10\n2014-07-19\n1948.0\n-0.209415\n-0.409542\n-0.591509\n-0.747244\n-0.869842\n-0.953865\n-0.995589\n\n\n17\nD10
\n2014-07-20\n1943.0\n0.999997\n0.004934\n-0.999973\n-0.009867\n0.999924\n0.014800\n-0.999851\n\n\n18\nD10\n2014-07-21\n1933.3\n-0.204588\n0.400521\n-0.579511\n0.733985\n-0.857409\n0.944561\n-0.991756\n\n\n19\nD10\n2014-07-22\n1891.0\n-0.915297\n-0.737326\n0.321336\n0.996182\n0.481148\n-0.608588\n-0.971403" + "objectID": "tutorials/04_anomaly_detection.html#anomalize-breakdown-identify-and-clean-in-1-easy-step", + "href": "tutorials/04_anomaly_detection.html#anomalize-breakdown-identify-and-clean-in-1-easy-step", + "title": "Anomaly Detection in Website Traffic", + "section": "2.1 Anomalize: breakdown, identify, and clean in 1 easy step", + "text": "2.1 Anomalize: breakdown, identify, and clean in 1 easy step\nThe anomalize() function is a feature rich tool for performing anomaly detection. Anomalize is group-aware, so we can use this as part of a normal pandas groupby chain. In one easy step:\n\nWe breakdown (decompose) the time series\nAnalyze it’s remainder (residuals) for spikes (anomalies)\nClean the anomalies if desired\n\n\n\nCode\nanomalize_df = df \\\n .groupby('Page', sort = False) \\\n .anomalize(\n date_column = \"date\", \n value_column = \"value\", \n )\n\nanomalize_df.glimpse()\n\n\n\n\n\n<class 'pandas.core.frame.DataFrame'>: 5500 rows of 13 columns\nPage: object ['Death_of_Freddie_Gray_en.wikiped ...\ndate: datetime64[ns] [Timestamp('2015-07-01 00:00:00'), ...\nobserved: int64 [791, 704, 903, 732, 558, 504, 543 ...\nseasonal: float64 [206.78723511550484, 4.04332698700 ...\nseasadj: float64 [584.2127648844952, 699.9566730129 ...\ntrend: float64 [729.0301895900458, 726.0497757616 ...\nremainder: float64 [-144.8174247055506, -26.093102748 ...\nanomaly: object ['No', 'No', 'No', 'No', 'No', 'No ...\nanomaly_score: float64 [266.9421236324138, 148.2178016755 ...\nanomaly_direction: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nrecomposed_l1: float64 [266.05095141435606, 60.3266294574 ...\nrecomposed_l2: float64 [1849.8332958504716, 1644.10897389 
...\nobserved_clean: float64 [791.0, 704.0, 903.0, 732.0, 558.0 ...\n\n\n\n\n\n\n\n\nThe anomalize() function returns:\n\n\n\n\n\n\nThe original grouping and datetime columns.\nThe seasonal decomposition: observed, seasonal, seasadj, trend, and remainder. The objective is to remove trend and seasonality such that the remainder is stationary and representative of normal variation and anomalous variations.\nAnomaly identification and scoring: anomaly, anomaly_score, anomaly_direction. These identify the anomaly decision (Yes/No), score the anomaly as a distance from the centerline, and label the direction (-1 (down), zero (not anomalous), +1 (up)).\nRecomposition: recomposed_l1 and recomposed_l2. Think of these as the lower and upper bands. Any observed data that is below l1 or above l2 is anomalous.\nCleaned data: observed_clean. Cleaned data is automatically provided, which has the outliers replaced with data that is within the recomposed l1/l2 boundaries. With that said, you should always first seek to understand why data is being considered anomalous before simply removing outliers and using the cleaned data.\n\n\n\n\nThe most important aspect is that this data is ready to be visualized, inspected, and modifications can then be made to address any tweaks you would like to make." 
}, { - "objectID": "tutorials/06_correlationfunnel.html", - "href": "tutorials/06_correlationfunnel.html", - "title": "Correlation Funnel", - "section": "", - "text": "We will demonstrate how Correlation Funnel to analyze Expedia Hotel Bookings and which features correlate to a customer making a booking through their website:\n\n\n\nCorrelation Funnel" + "objectID": "tutorials/04_anomaly_detection.html#visualization-1-seasonal-decomposition-plot", + "href": "tutorials/04_anomaly_detection.html#visualization-1-seasonal-decomposition-plot", + "title": "Anomaly Detection in Website Traffic", + "section": "2.2 Visualization 1: Seasonal Decomposition Plot", + "text": "2.2 Visualization 1: Seasonal Decomposition Plot\nThe first step in my normal process is to analyze the seasonal decomposition. I want to see what the remainders look like, and make sure that the trend and seasonality are being removed such that the remainder is centered around zero.\n\n\n\n\n\n\nWhat to do when the remainders have trend or seasonality?\n\n\n\n\n\nWe’ll cover how to tweak the knobs of anomalize() in the next section aptly named “How to tweak the nobs on anomalize”.\n\n\n\n\nPlotlyPlotnine\n\n\n\n\nCode\nanomalize_df \\\n .groupby(\"Page\") \\\n .plot_anomalies_decomp(\n date_column = \"date\", \n width = 1800,\n height = 1000,\n engine = 'plotly'\n )\n\n\n\n \n\n\n\n\n\n\nCode\nanomalize_df \\\n .groupby(\"Page\") \\\n .plot_anomalies_decomp(\n date_column = \"date\", \n width = 1800,\n height = 1000,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine'\n )\n\n\n\n\n\n<Figure Size: (1800 x 1000)>" }, { - "objectID": "tutorials/06_correlationfunnel.html#setup", - "href": "tutorials/06_correlationfunnel.html#setup", - "title": "Correlation Funnel", - "section": "3.1 Setup", - "text": "3.1 Setup\nTo set up, import the following packages and the expedia_df dataset, Expedia Hotel Time Series Dataset.\n\n# Libraries\nimport pandas as pd \nimport pytimetk as tk\n\n# Data\nexpedia_df = 
tk.load_dataset(\"expedia\", parse_dates = ['date_time'])\nexpedia_df.glimpse()\n\n<class 'pandas.core.frame.DataFrame'>: 100000 rows of 24 columns\ndate_time: datetime64[ns] [Timestamp('2013-07-25 17: ...\nsite_name: int64 [2, 2, 2, 2, 2, 37, 2, 2, ...\nposa_continent: int64 [3, 3, 3, 3, 3, 1, 3, 3, 3 ...\nuser_location_country: int64 [66, 66, 66, 66, 66, 69, 6 ...\nuser_location_region: int64 [174, 174, 174, 220, 351, ...\nuser_location_city: int64 [35675, 31320, 16292, 1760 ...\norig_destination_distance: float64 [0.1203, 108.2251, 763.142 ...\nuser_id: int64 [44735, 794319, 761732, 69 ...\nis_mobile: int64 [0, 0, 1, 0, 0, 0, 0, 0, 0 ...\nis_package: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0 ...\nchannel: int64 [9, 3, 1, 9, 1, 9, 9, 9, 9 ...\nsrch_ci: object ['2013-07-26', '2014-11-27 ...\nsrch_co: object ['2013-07-27', '2014-11-29 ...\nsrch_adults_cnt: int64 [1, 2, 2, 2, 2, 2, 2, 2, 2 ...\nsrch_children_cnt: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0 ...\nsrch_rm_cnt: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1 ...\nsrch_destination_id: int64 [5465, 11620, 23808, 40658 ...\nsrch_destination_type_id: int64 [3, 1, 6, 5, 1, 6, 1, 5, 6 ...\nis_booking: int64 [1, 0, 0, 0, 0, 0, 0, 0, 0 ...\ncnt: int64 [1, 2, 3, 1, 2, 7, 1, 1, 1 ...\nhotel_continent: int64 [2, 2, 2, 2, 2, 6, 4, 2, 4 ...\nhotel_country: int64 [50, 50, 50, 50, 50, 204, ...\nhotel_market: int64 [1230, 369, 1144, 930, 637 ...\nhotel_cluster: int64 [47, 83, 93, 48, 33, 15, 9 ..." + "objectID": "tutorials/04_anomaly_detection.html#visualization-2-anomaly-detection-plot", + "href": "tutorials/04_anomaly_detection.html#visualization-2-anomaly-detection-plot", + "title": "Anomaly Detection in Website Traffic", + "section": "2.3 Visualization 2: Anomaly Detection Plot", + "text": "2.3 Visualization 2: Anomaly Detection Plot\nOnce I’m satisfied with the remainders, my next step is to visualize the anomalies. 
Here I’m looking to see if I need to grow or shrink the remainder l1 and l2 bands, which classify anomalies.\n\nPlotlyPlotnine\n\n\n\n\nCode\nanomalize_df \\\n .groupby(\"Page\") \\\n .plot_anomalies(\n date_column = \"date\", \n facet_ncol = 2, \n width = 1000,\n height = 1000,\n )\n\n\n\n \n\n\n\n\n\n\nCode\nanomalize_df \\\n .groupby(\"Page\") \\\n .plot_anomalies(\n date_column = \"date\", \n facet_ncol = 2, \n width = 1000,\n height = 1000,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine'\n )\n\n\n\n\n\n<Figure Size: (1000 x 1000)>" }, { - "objectID": "tutorials/06_correlationfunnel.html#data-preparation", - "href": "tutorials/06_correlationfunnel.html#data-preparation", - "title": "Correlation Funnel", - "section": "3.2 Data Preparation", - "text": "3.2 Data Preparation\nTo prepare the dataset, we will first perform data preparation:\n\nAdd time series features based on the date_time timestamp column.\nWe will drop any zero variance features\nDrop additional columns that are not an acceptable data type (i.e. 
not numeric, categorical, or string) or contain missing values\nConvert numeric columns that start with “hotel_” that are actually categorical “ID” columns to string\n\n\nexpedia_ts_features_df = expedia_df \\\n .augment_timeseries_signature('date_time') \\\n .drop_zero_variance() \\\n .drop(columns=['date_time', 'orig_destination_distance', 'srch_ci', 'srch_co']) \\\n .transform_columns(\n columns = [r\"hotel_.*\"],\n transform_func = lambda x: x.astype(str)\n )\n \nexpedia_ts_features_df.glimpse()\n\n<class 'pandas.core.frame.DataFrame'>: 100000 rows of 46 columns\nsite_name: int64 [2, 2, 2, 2, 2, 37, 2, 2, 2 ...\nposa_continent: int64 [3, 3, 3, 3, 3, 1, 3, 3, 3, ...\nuser_location_country: int64 [66, 66, 66, 66, 66, 69, 66 ...\nuser_location_region: int64 [174, 174, 174, 220, 351, 7 ...\nuser_location_city: int64 [35675, 31320, 16292, 17605 ...\nuser_id: int64 [44735, 794319, 761732, 696 ...\nis_mobile: int64 [0, 0, 1, 0, 0, 0, 0, 0, 0, ...\nis_package: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nchannel: int64 [9, 3, 1, 9, 1, 9, 9, 9, 9, ...\nsrch_adults_cnt: int64 [1, 2, 2, 2, 2, 2, 2, 2, 2, ...\nsrch_children_cnt: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nsrch_rm_cnt: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1, ...\nsrch_destination_id: int64 [5465, 11620, 23808, 40658, ...\nsrch_destination_type_id: int64 [3, 1, 6, 5, 1, 6, 1, 5, 6, ...\nis_booking: int64 [1, 0, 0, 0, 0, 0, 0, 0, 0, ...\ncnt: int64 [1, 2, 3, 1, 2, 7, 1, 1, 1, ...\nhotel_continent: object ['2', '2', '2', '2', '2', ' ...\nhotel_country: object ['50', '50', '50', '50', '5 ...\nhotel_market: object ['1230', '369', '1144', '93 ...\nhotel_cluster: object ['47', '83', '93', '48', '3 ...\ndate_time_index_num: int64 [1374773055, 1414939784, 14 ...\ndate_time_year: int64 [2013, 2014, 2014, 2014, 20 ...\ndate_time_year_iso: UInt32 [2013, 2014, 2014, 2014, 20 ...\ndate_time_yearstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\ndate_time_yearend: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\ndate_time_half: int64 [2, 2, 1, 1, 2, 2, 
1, 2, 1, ...\ndate_time_quarter: int64 [3, 4, 2, 1, 3, 4, 1, 3, 2, ...\ndate_time_quarteryear: object ['2013Q3', '2014Q4', '2014Q ...\ndate_time_quarterstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\ndate_time_quarterend: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\ndate_time_month: int64 [7, 11, 5, 2, 8, 12, 3, 9, ...\ndate_time_month_lbl: object ['July', 'November', 'May', ...\ndate_time_monthstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\ndate_time_monthend: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\ndate_time_yweek: UInt32 [30, 44, 21, 9, 33, 50, 12, ...\ndate_time_mweek: int64 [4, 1, 4, 4, 2, 2, 3, 3, 2, ...\ndate_time_wday: int64 [4, 7, 4, 3, 3, 2, 2, 1, 4, ...\ndate_time_wday_lbl: object ['Thursday', 'Sunday', 'Thu ...\ndate_time_mday: int64 [25, 2, 22, 26, 13, 9, 18, ...\ndate_time_qday: int64 [25, 33, 52, 57, 44, 70, 77 ...\ndate_time_yday: int64 [206, 306, 142, 57, 225, 34 ...\ndate_time_weekend: int64 [0, 1, 0, 0, 0, 0, 0, 0, 0, ...\ndate_time_hour: int64 [17, 14, 12, 14, 11, 7, 21, ...\ndate_time_minute: int64 [24, 49, 50, 1, 15, 21, 40, ...\ndate_time_second: int64 [15, 44, 53, 2, 40, 31, 29, ...\ndate_time_am_pm: object ['pm', 'pm', 'am', 'pm', 'a ..." + "objectID": "tutorials/04_anomaly_detection.html#visualization-3-anomalies-cleaned-plot", + "href": "tutorials/04_anomaly_detection.html#visualization-3-anomalies-cleaned-plot", + "title": "Anomaly Detection in Website Traffic", + "section": "2.4 Visualization 3: Anomalies Cleaned Plot", + "text": "2.4 Visualization 3: Anomalies Cleaned Plot\nThere are pros and cons to cleaning anomalies. I’ll leave that discussion for another time. 
But, should you be interested in seeing what your data looks like cleaned (with outliers removed), this plot will help you compare before and after.\n\nPlotlyPlotnine\n\n\n\n\nCode\nanomalize_df \\\n .groupby(\"Page\") \\\n .plot_anomalies_cleaned(\n date_column = \"date\", \n facet_ncol = 2, \n width = 1000,\n height = 1000,\n engine = \"plotly\"\n )\n\n\n\n \n\n\n\n\n\n\nCode\nanomalize_df \\\n .groupby(\"Page\") \\\n .plot_anomalies_cleaned(\n date_column = \"date\", \n facet_ncol = 2, \n width = 1000,\n height = 1000,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine'\n )\n\n\n\n\n\n<Figure Size: (1000 x 1000)>" }, { - "objectID": "tutorials/06_correlationfunnel.html#step-correlation-funnel-workflow", - "href": "tutorials/06_correlationfunnel.html#step-correlation-funnel-workflow", - "title": "Correlation Funnel", - "section": "3.3 3-Step Correlation Funnel Workflow", - "text": "3.3 3-Step Correlation Funnel Workflow\nNext, we will perform the Correlation Funnel workflow to explore the Expedia Hotel Time Series dataset. There are 3 steps:\n\nBinarize: Convert the data to binary 0/1\nCorrelate: Detect relationships between the binary features and one of the columns (called the target)\nVisualize the Correlation Funnel: Plotting allows us to assess the top features and their relationship to the target.\n\n\nStep 1: Binarize\nUse binarize() to convert the raw data to binary 0/1. Binarization happens as follows:\n\nNumeric Data: Numeric data is Quantile Binned using the pd.qcut() function. The default is 4 bins, which bins numeric data into a maximum of 4 discrete bins. Fewer bins can be returned if there is insufficient data for 4 bins. The number of bins is controlled with the n_bins parameter.\nCategorical / String Data: Categorical data is first processed to determine the most frequent categories. Categories that are sparse are lumped into an “OTHER” category. 
The lumping can be controlled with the thresh_infreq.\n\n\nexpedia_ts_binarized_df = expedia_ts_features_df.binarize(thresh_infreq = 0.05)\n\nexpedia_ts_binarized_df.glimpse()\n\n<class 'pandas.core.frame.DataFrame'>: 100000 rows of 155 columns\nsite_name__2.0_15.0: uint8 [1, 1 ...\nsite_name__15.0_53.0: uint8 [0, 0 ...\nuser_location_country__0.0_66.0: uint8 [1, 1 ...\nuser_location_country__66.0_71.0: uint8 [0, 0 ...\nuser_location_country__71.0_239.0: uint8 [0, 0 ...\nuser_location_region__0.0_174.0: uint8 [1, 1 ...\nuser_location_region__174.0_314.0: uint8 [0, 0 ...\nuser_location_region__314.0_385.0: uint8 [0, 0 ...\nuser_location_region__385.0_1021.0: uint8 [0, 0 ...\nuser_location_city__0.0_13087.0: uint8 [0, 0 ...\nuser_location_city__13087.0_27655.0: uint8 [0, 0 ...\nuser_location_city__27655.0_42563.0: uint8 [1, 1 ...\nuser_location_city__42563.0_56507.0: uint8 [0, 0 ...\nuser_id__13.0_299759.8: uint8 [1, 0 ...\nuser_id__299759.8_605161.5: uint8 [0, 0 ...\nuser_id__605161.5_911811.5: uint8 [0, 1 ...\nuser_id__911811.5_1198780.0: uint8 [0, 0 ...\nchannel__0.0_2.0: uint8 [0, 0 ...\nchannel__2.0_9.0: uint8 [1, 1 ...\nchannel__9.0_10.0: uint8 [0, 0 ...\nsrch_adults_cnt__0.0_2.0: uint8 [1, 1 ...\nsrch_adults_cnt__2.0_9.0: uint8 [0, 0 ...\nsrch_children_cnt__0.0_9.0: uint8 [1, 1 ...\nsrch_rm_cnt__0.0_1.0: uint8 [1, 1 ...\nsrch_rm_cnt__1.0_8.0: uint8 [0, 0 ...\nsrch_destination_id__1.0_8267.0: uint8 [1, 0 ...\nsrch_destination_id__8267.0_9147.0: uint8 [0, 0 ...\nsrch_destination_id__9147.0_18998.0: uint8 [0, 1 ...\nsrch_destination_id__18998.0_65104.0: uint8 [0, 0 ...\nsrch_destination_type_id__1.0_5.0: uint8 [1, 1 ...\nsrch_destination_type_id__5.0_9.0: uint8 [0, 0 ...\ncnt__1.0_2.0: uint8 [1, 1 ...\ncnt__2.0_72.0: uint8 [0, 0 ...\ndate_time_index_num__1357516842.0_1382867237.5: uint8 [1, 0 ...\ndate_time_index_num__1382867237.5_1401387689.0: uint8 [0, 0 ...\ndate_time_index_num__1401387689.0_1410981206.0: uint8 [0, 0 
...\ndate_time_index_num__1410981206.0_1420070302.0: uint8 [0, 1 ...\ndate_time_month__1.0_5.0: uint8 [0, 0 ...\ndate_time_month__5.0_7.0: uint8 [1, 0 ...\ndate_time_month__7.0_10.0: uint8 [0, 0 ...\ndate_time_month__10.0_12.0: uint8 [0, 1 ...\ndate_time_yweek__1.0_17.0: uint8 [0, 0 ...\ndate_time_yweek__17.0_30.0: uint8 [1, 0 ...\ndate_time_yweek__30.0_41.0: uint8 [0, 0 ...\ndate_time_yweek__41.0_52.0: uint8 [0, 1 ...\ndate_time_mday__1.0_8.0: uint8 [0, 1 ...\ndate_time_mday__8.0_16.0: uint8 [0, 0 ...\ndate_time_mday__16.0_23.0: uint8 [0, 0 ...\ndate_time_mday__23.0_31.0: uint8 [1, 0 ...\ndate_time_qday__1.0_24.0: uint8 [0, 0 ...\ndate_time_qday__24.0_48.0: uint8 [1, 1 ...\ndate_time_qday__48.0_70.0: uint8 [0, 0 ...\ndate_time_qday__70.0_92.0: uint8 [0, 0 ...\ndate_time_yday__1.0_121.0: uint8 [0, 0 ...\ndate_time_yday__121.0_209.0: uint8 [1, 0 ...\ndate_time_yday__209.0_286.0: uint8 [0, 0 ...\ndate_time_yday__286.0_365.0: uint8 [0, 1 ...\ndate_time_hour__0.0_10.0: uint8 [0, 0 ...\ndate_time_hour__10.0_14.0: uint8 [0, 1 ...\ndate_time_hour__14.0_18.0: uint8 [1, 0 ...\ndate_time_hour__18.0_23.0: uint8 [0, 0 ...\ndate_time_minute__0.0_15.0: uint8 [0, 0 ...\ndate_time_minute__15.0_30.0: uint8 [1, 0 ...\ndate_time_minute__30.0_45.0: uint8 [0, 0 ...\ndate_time_minute__45.0_59.0: uint8 [0, 1 ...\ndate_time_second__0.0_15.0: uint8 [1, 0 ...\ndate_time_second__15.0_30.0: uint8 [0, 0 ...\ndate_time_second__30.0_45.0: uint8 [0, 1 ...\ndate_time_second__45.0_59.0: uint8 [0, 0 ...\nposa_continent__1: uint8 [0, 0 ...\nposa_continent__2: uint8 [0, 0 ...\nposa_continent__3: uint8 [1, 1 ...\nposa_continent__-OTHER: uint8 [0, 0 ...\nis_mobile__0: uint8 [1, 1 ...\nis_mobile__1: uint8 [0, 0 ...\nis_package__0: uint8 [1, 1 ...\nis_package__1: uint8 [0, 0 ...\nis_booking__0: uint8 [0, 1 ...\nis_booking__1: uint8 [1, 0 ...\nhotel_continent__-OTHER: uint8 [0, 0 ...\nhotel_continent__2: uint8 [1, 1 ...\nhotel_continent__3: uint8 [0, 0 ...\nhotel_continent__4: uint8 [0, 0 
...\nhotel_continent__6: uint8 [0, 0 ...\nhotel_country__-OTHER: uint8 [0, 0 ...\nhotel_country__50: uint8 [1, 1 ...\nhotel_country__8: uint8 [0, 0 ...\nhotel_market__-OTHER: uint8 [1, 1 ...\nhotel_cluster__-OTHER: uint8 [1, 1 ...\ndate_time_year__2013: uint8 [1, 0 ...\ndate_time_year__2014: uint8 [0, 1 ...\ndate_time_year_iso__2013: uint8 [1, 0 ...\ndate_time_year_iso__2014: uint8 [0, 1 ...\ndate_time_year_iso__-OTHER: uint8 [0, 0 ...\ndate_time_yearstart__0: uint8 [1, 1 ...\ndate_time_yearstart__-OTHER: uint8 [0, 0 ...\ndate_time_yearend__0: uint8 [1, 1 ...\ndate_time_yearend__-OTHER: uint8 [0, 0 ...\ndate_time_half__1: uint8 [0, 0 ...\ndate_time_half__2: uint8 [1, 1 ...\ndate_time_quarter__1: uint8 [0, 0 ...\ndate_time_quarter__2: uint8 [0, 0 ...\ndate_time_quarter__3: uint8 [1, 0 ...\ndate_time_quarter__4: uint8 [0, 1 ...\ndate_time_quarteryear__2013Q1: uint8 [0, 0 ...\ndate_time_quarteryear__2013Q2: uint8 [0, 0 ...\ndate_time_quarteryear__2013Q3: uint8 [1, 0 ...\ndate_time_quarteryear__2013Q4: uint8 [0, 0 ...\ndate_time_quarteryear__2014Q1: uint8 [0, 0 ...\ndate_time_quarteryear__2014Q2: uint8 [0, 0 ...\ndate_time_quarteryear__2014Q3: uint8 [0, 0 ...\ndate_time_quarteryear__2014Q4: uint8 [0, 1 ...\ndate_time_quarterstart__0: uint8 [1, 1 ...\ndate_time_quarterstart__-OTHER: uint8 [0, 0 ...\ndate_time_quarterend__0: uint8 [1, 1 ...\ndate_time_quarterend__-OTHER: uint8 [0, 0 ...\ndate_time_month_lbl__April: uint8 [0, 0 ...\ndate_time_month_lbl__August: uint8 [0, 0 ...\ndate_time_month_lbl__December: uint8 [0, 0 ...\ndate_time_month_lbl__February: uint8 [0, 0 ...\ndate_time_month_lbl__January: uint8 [0, 0 ...\ndate_time_month_lbl__July: uint8 [1, 0 ...\ndate_time_month_lbl__June: uint8 [0, 0 ...\ndate_time_month_lbl__March: uint8 [0, 0 ...\ndate_time_month_lbl__May: uint8 [0, 0 ...\ndate_time_month_lbl__November: uint8 [0, 1 ...\ndate_time_month_lbl__October: uint8 [0, 0 ...\ndate_time_month_lbl__September: uint8 [0, 0 ...\ndate_time_monthstart__0: uint8 [1, 1 
...\ndate_time_monthstart__-OTHER: uint8 [0, 0 ...\ndate_time_monthend__0: uint8 [1, 1 ...\ndate_time_monthend__-OTHER: uint8 [0, 0 ...\ndate_time_mweek__1: uint8 [0, 1 ...\ndate_time_mweek__2: uint8 [0, 0 ...\ndate_time_mweek__3: uint8 [0, 0 ...\ndate_time_mweek__4: uint8 [1, 0 ...\ndate_time_mweek__5: uint8 [0, 0 ...\ndate_time_wday__1: uint8 [0, 0 ...\ndate_time_wday__2: uint8 [0, 0 ...\ndate_time_wday__3: uint8 [0, 0 ...\ndate_time_wday__4: uint8 [1, 0 ...\ndate_time_wday__5: uint8 [0, 0 ...\ndate_time_wday__6: uint8 [0, 0 ...\ndate_time_wday__7: uint8 [0, 1 ...\ndate_time_wday_lbl__Friday: uint8 [0, 0 ...\ndate_time_wday_lbl__Monday: uint8 [0, 0 ...\ndate_time_wday_lbl__Saturday: uint8 [0, 0 ...\ndate_time_wday_lbl__Sunday: uint8 [0, 1 ...\ndate_time_wday_lbl__Thursday: uint8 [1, 0 ...\ndate_time_wday_lbl__Tuesday: uint8 [0, 0 ...\ndate_time_wday_lbl__Wednesday: uint8 [0, 0 ...\ndate_time_weekend__0: uint8 [1, 0 ...\ndate_time_weekend__1: uint8 [0, 1 ...\ndate_time_am_pm__am: uint8 [0, 0 ...\ndate_time_am_pm__pm: uint8 [1, 1 ...\n\n\n\n\nStep 2: Correlate the data\nNext, we use correlate() to calculate strength of the relationship. The main parameter is target, which should be selected based on the business goal.\nIn this case, we can create a business goal to understand what relates to a website visit count greater than 2. We will select the column: is_booking__1 as the target. 
This is because we want to know what relates to a hotel room booking via the website search data.\nThis returns a 3 column data frame containing:\n\nfeature: The name of the features\nbin: The bin that corresponds to a bin inside the features\ncorrelation: The strength of the relationship (0 to 1) and the direction of the relationship (+/-)\n\n\nexpedia_ts_correlate_df = expedia_ts_binarized_df.correlate('is_booking__1')\n\nexpedia_ts_correlate_df\n\n\n\n\n\n\n\n\nfeature\nbin\ncorrelation\n\n\n\n\n77\nis_booking\n0\n-1.000000\n\n\n78\nis_booking\n1\n1.000000\n\n\n32\ncnt\n2.0_72.0\n-0.099372\n\n\n31\ncnt\n1.0_2.0\n0.099372\n\n\n75\nis_package\n0\n0.075930\n\n\n...\n...\n...\n...\n\n\n131\ndate_time_monthend\n-OTHER\n0.000182\n\n\n108\ndate_time_quarteryear\n2014Q1\n-0.000041\n\n\n22\nsrch_children_cnt\n0.0_9.0\nNaN\n\n\n87\nhotel_market\n-OTHER\nNaN\n\n\n88\nhotel_cluster\n-OTHER\nNaN\n\n\n\n\n155 rows × 3 columns\n\n\n\n\n\nStep 3: Plot the Correlation funnel\nIt’s in this step where we can visualize review the correlations and determine which features relate to the target, the strength of the relationship (magnitude between 0 and 1), and the direction of the relationship (+/-).\n\nexpedia_ts_correlate_df.plot_correlation_funnel(\n engine = 'plotly',\n height = 800\n)" + "objectID": "tutorials/05_clustering.html", + "href": "tutorials/05_clustering.html", + "title": "Clustering", + "section": "", + "text": "Coming soon…\n\n1 More Coming Soon…\nWe are in the early stages of development. But it’s obvious the potential for pytimetk now in Python. 🐍\n\nPlease ⭐ us on GitHub (it takes 2-seconds and means a lot).\nTo make requests, please see our Project Roadmap GH Issue #2. You can make requests there.\nWant to contribute? See our contributing guide here." 
}, { - "objectID": "tutorials/03_demand_forecasting.html", - "href": "tutorials/03_demand_forecasting.html", - "title": "Demand Forecasting", + "objectID": "tutorials/01_sales_crm.html", + "href": "tutorials/01_sales_crm.html", + "title": "Sales Analysis", "section": "", - "text": "Timetk enables you to generate features from the time column of your data very easily. This tutorial showcases how easy it is to perform time series forecasting with pytimetk. The specific methods we will be using are:" + "text": "In this tutorial, we will use pytimetk and its powerful functions to perform a time series analysis on a dataset representing bike sales. Our goal is to understand the patterns in the data and forecast future sales. You will:" }, { - "objectID": "tutorials/03_demand_forecasting.html#load-packages", - "href": "tutorials/03_demand_forecasting.html#load-packages", - "title": "Demand Forecasting", - "section": "1.1 Load Packages", - "text": "1.1 Load Packages\nLoad the following packages before proceeding with this tutorial.\n\n\nCode\nimport pandas as pd\nimport numpy as np\nimport pytimetk as tk\n\nfrom sklearn.ensemble import RandomForestRegressor\n\n\nThe tutorial is divided into three parts: We will first have a look at the Walmart dataset and perform some preprocessing. Secondly, we will create models based on different features, and see how the time features can be useful. Finally, we will solve the task of time series forecasting, using the features from augment_timeseries_signature, augment_lags, and augment_rolling, to predict future sales." 
+ "objectID": "tutorials/01_sales_crm.html#load-packages.", + "href": "tutorials/01_sales_crm.html#load-packages.", + "title": "Sales Analysis", + "section": "1.1 Load Packages.", + "text": "1.1 Load Packages.\nIf you do not have pytimetk installed, you can install it using\npip install pytimetk\nor for the latest features and functionality, you can install the development version.\npip install git+https://github.com/business-science/pytimetk.git\n\n\nCode\nimport pytimetk as tk\nimport pandas as pd\nimport numpy as np\n\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.model_selection import train_test_split" }, { - "objectID": "tutorials/03_demand_forecasting.html#load-inspect-dataset", - "href": "tutorials/03_demand_forecasting.html#load-inspect-dataset", - "title": "Demand Forecasting", - "section": "1.2 Load & Inspect dataset", - "text": "1.2 Load & Inspect dataset\nThe first thing we want to do is to load the dataset. It is a subset of the Walmart sales prediction Kaggle competition. You can get more insights about the dataset by following this link: walmart_sales_weekly. The most important thing to know about the dataset is that you are provided with some features like the fuel price or whether the week contains holidays and you are expected to predict the weekly sales column for 7 different departments of a given store. Of course, you also have the date for each week, and that is what we can leverage to create additional features.\nLet us start by loading the dataset and cleaning it. 
Note that we also removed some columns due to * duplication of data * 0 variance * No future data available in current dataset.\n\n\nCode\n# We start by loading the dataset\n# /walmart_sales_weekly.html\ndset = tk.load_dataset('walmart_sales_weekly', parse_dates = ['Date'])\n\ndset = dset.drop(columns=[\n 'id', # This column can be removed as it is equivalent to 'Dept'\n 'Store', # This column has only one possible value\n 'Type', # This column has only one possible value\n 'Size', # This column has only one possible value\n 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5',\n 'IsHoliday', 'Temperature', 'Fuel_Price', 'CPI',\n 'Unemployment'])\n\ndset.head()\n\n\n\n\n\n\n\n\n\nDept\nDate\nWeekly_Sales\n\n\n\n\n0\n1\n2010-02-05\n24924.50\n\n\n1\n1\n2010-02-12\n46039.49\n\n\n2\n1\n2010-02-19\n41595.55\n\n\n3\n1\n2010-02-26\n19403.54\n\n\n4\n1\n2010-03-05\n21827.90\n\n\n\n\n\n\n\nWe can plot the values of each department to get an idea of how the data looks like. Using the plot_timeseries method with a groupby allows us to create multiple plots by group.\n\n\n\n\n\n\nGetting More Info: tk.plot_timeseries()\n\n\n\n\n\n\nClick here to see our Data Visualization Guide\nUse help(tk.plot_timeseries) to review additional helpful documentation.\n\n\n\n\n\nPlotlyPlotnine\n\n\n\n\nCode\nsales_df = dset\nfig = sales_df.groupby('Dept').plot_timeseries(\n date_column='Date',\n value_column='Weekly_Sales',\n facet_ncol = 2,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly')\nfig\n\n\n\n \n\n\n\n\n\n\nCode\nfig = sales_df.groupby('Dept').plot_timeseries(\n date_column='Date',\n value_column='Weekly_Sales',\n facet_ncol = 2,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine')\nfig\n\n\n\n\n\n<Figure Size: (700 x 500)>" + "objectID": "tutorials/01_sales_crm.html#load-inspect-dataset", + "href": "tutorials/01_sales_crm.html#load-inspect-dataset", + "title": "Sales Analysis", + "section": "1.2 Load & inspect dataset", + "text": "1.2 Load & inspect dataset\nTo kick off 
our analysis, we’ll begin by importing essential libraries and accessing the ‘bike_sales’ dataset available within pytimetk’s suite of built-in datasets.\nThe Bike Sales dataset exemplifies what one might find in a CRM (Customer Relationship Management) system. CRM systems are pivotal for businesses, offering vital insights by tracking sales throughout the entire sales funnel. Such datasets are rich with transaction-level data, encompassing elements like order numbers, individual order lines, customer details, product information, and specific transaction data.\nTransactional data, such as this, inherently holds the essential components for time series analysis:\n\nTime Stamps\nAssociated Values\nDistinct Groups or Categories\n\nGiven these attributes, the Bike Sales dataset emerges as an ideal candidate for analysis using pytimetk." }, { - "objectID": "tutorials/03_demand_forecasting.html#making-future-dates-easier-with-tk.future_frame", - "href": "tutorials/03_demand_forecasting.html#making-future-dates-easier-with-tk.future_frame", - "title": "Demand Forecasting", - "section": "2.1 Making Future Dates Easier with tk.future_frame", - "text": "2.1 Making Future Dates Easier with tk.future_frame\nWhen building machine learning models, we need to setup our dataframe to hold information about the future. This is the dataframe that will get passed to our model.predict() call. This is made easy with tk.future_frame().\n\n\n\n\n\n\nGetting to know tk.future_frame()\n\n\n\n\n\nCurious about the various options it provides?\n\nClick here to see our Data Wrangling Guide\nUse help(tk.future_frame) to review additional helpful documentation. And explore the plethora of possibilities!\n\n\n\n\nNotice this function adds 5 weeks to our dateset for each department and fills in weekly sales with nulls. 
Previously our max date was 2012-10-26.\n\n\nCode\nprint(sales_df.groupby('Dept').Date.max())\n\n\nDept\n1 2012-10-26\n3 2012-10-26\n8 2012-10-26\n13 2012-10-26\n38 2012-10-26\n93 2012-10-26\n95 2012-10-26\nName: Date, dtype: datetime64[ns]\n\n\nAfter applying our future frame, we can now see values 5 weeks in the future, and our dataframe has been extended to 2012-11-30 for all groups.\n\n\nCode\nsales_df_with_futureframe = sales_df \\\n .groupby('Dept') \\\n .future_frame(\n date_column = 'Date',\n length_out = 5\n )\n\n\n\n\n\n\n\nCode\nsales_df_with_futureframe.groupby('Dept').Date.max()\n\n\nDept\n1 2012-11-30\n3 2012-11-30\n8 2012-11-30\n13 2012-11-30\n38 2012-11-30\n93 2012-11-30\n95 2012-11-30\nName: Date, dtype: datetime64[ns]" + "objectID": "tutorials/01_sales_crm.html#initial-inspection-with-tk.glimpse", + "href": "tutorials/01_sales_crm.html#initial-inspection-with-tk.glimpse", + "title": "Sales Analysis", + "section": "2.1 Initial Inspection with tk.glimpse", + "text": "2.1 Initial Inspection with tk.glimpse\nTo get a preliminary understanding of our data, let’s utilize the tk.glimpse() function from pytimetk. 
This will provide us with a snapshot of the available fields, their respective data types, and a sneak peek into the data entries.\n\n\nCode\ndf = tk.datasets.load_dataset('bike_sales_sample')\ndf['order_date'] = pd.to_datetime(df['order_date'])\n\ndf.glimpse()\n\n\n<class 'pandas.core.frame.DataFrame'>: 2466 rows of 13 columns\norder_id: int64 [1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, ...\norder_line: int64 [1, 2, 1, 2, 1, 2, 3, 4, 5, 1, 1, 2, ...\norder_date: datetime64[ns] [Timestamp('2011-01-07 00:00:00'), Ti ...\nquantity: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...\nprice: int64 [6070, 5970, 2770, 5970, 10660, 3200, ...\ntotal_price: int64 [6070, 5970, 2770, 5970, 10660, 3200, ...\nmodel: object ['Jekyll Carbon 2', 'Trigger Carbon 2 ...\ncategory_1: object ['Mountain', 'Mountain', 'Mountain', ...\ncategory_2: object ['Over Mountain', 'Over Mountain', 'T ...\nframe_material: object ['Carbon', 'Carbon', 'Aluminum', 'Car ...\nbikeshop_name: object ['Ithaca Mountain Climbers', 'Ithaca ...\ncity: object ['Ithaca', 'Ithaca', 'Kansas City', ' ...\nstate: object ['NY', 'NY', 'KS', 'KS', 'KY', 'KY', ..." }, { - "objectID": "tutorials/03_demand_forecasting.html#date-features-with-tk.augment_timeseries_signature", - "href": "tutorials/03_demand_forecasting.html#date-features-with-tk.augment_timeseries_signature", - "title": "Demand Forecasting", - "section": "2.2 Date Features with tk.augment_timeseries_signature", - "text": "2.2 Date Features with tk.augment_timeseries_signature\nMachine Learning models generally cannot process raw date objects directly. Moreover, they lack an inherent understanding of the passage of time. This means that, without specific features, a model can’t differentiate between a January observation and a June one. To bridge this gap, the tk.augment_timeseries_signature function is invaluable. 
It generates 29 distinct date-oriented features suitable for model inputs.\n\n\n\n\n\n\nGetting More Info: tk.augment_timeseries_signature(),tk.augment_lags(), tk.augment_rolling()\n\n\n\n\n\n\nClick here to see our Adding Features (Augmenting)\nUse help(tk.augment_timeseries_signature) help(tk.augment_lags) help(tk.augment_rolling) to review additional helpful documentation.\n\n\n\n\n\nIt’s crucial, however, to align these features with the granularity of your dataset. Given the weekly granularity of the Walmart dataset, any date attributes finer than ‘week’ should be excluded for relevance and efficiency.\n\n\nCode\nsales_df_dates = sales_df_with_futureframe.augment_timeseries_signature(date_column = 'Date')\nsales_df_dates.head(10)\n\n\n\n\n\n\n\n\n\nDept\nDate\nWeekly_Sales\nDate_index_num\nDate_year\nDate_year_iso\nDate_yearstart\nDate_yearend\nDate_leapyear\nDate_half\n...\nDate_mday\nDate_qday\nDate_yday\nDate_weekend\nDate_hour\nDate_minute\nDate_second\nDate_msecond\nDate_nsecond\nDate_am_pm\n\n\n\n\n0\n1\n2010-02-05\n24924.50\n1265328000\n2010\n2010\n0\n0\n0\n1\n...\n5\n36\n36\n0\n0\n0\n0\n0\n0\nam\n\n\n1\n1\n2010-02-12\n46039.49\n1265932800\n2010\n2010\n0\n0\n0\n1\n...\n12\n43\n43\n0\n0\n0\n0\n0\n0\nam\n\n\n2\n1\n2010-02-19\n41595.55\n1266537600\n2010\n2010\n0\n0\n0\n1\n...\n19\n50\n50\n0\n0\n0\n0\n0\n0\nam\n\n\n3\n1\n2010-02-26\n19403.54\n1267142400\n2010\n2010\n0\n0\n0\n1\n...\n26\n57\n57\n0\n0\n0\n0\n0\n0\nam\n\n\n4\n1\n2010-03-05\n21827.90\n1267747200\n2010\n2010\n0\n0\n0\n1\n...\n5\n64\n64\n0\n0\n0\n0\n0\n0\nam\n\n\n5\n1\n2010-03-12\n21043.39\n1268352000\n2010\n2010\n0\n0\n0\n1\n...\n12\n71\n71\n0\n0\n0\n0\n0\n0\nam\n\n\n6\n1\n2010-03-19\n22136.64\n1268956800\n2010\n2010\n0\n0\n0\n1\n...\n19\n78\n78\n0\n0\n0\n0\n0\n0\nam\n\n\n7\n1\n2010-03-26\n26229.21\n1269561600\n2010\n2010\n0\n0\n0\n1\n...\n26\n85\n85\n0\n0\n0\n0\n0\n0\nam\n\n\n8\n1\n2010-04-02\n57258.43\n1270166400\n2010\n2010\n0\n0\n0\n1\n...\n2\n2\n92\n0\n0\n0\n0\n0\n0\nam\n\n\n9\n1\n2010-04-09
\n42960.91\n1270771200\n2010\n2010\n0\n0\n0\n1\n...\n9\n9\n99\n0\n0\n0\n0\n0\n0\nam\n\n\n\n\n10 rows × 32 columns\n\n\n\nUpon reviewing the generated features, it’s evident that certain attributes don’t align with the granularity of our dataset. For optimal results, features exhibiting no variance—like “Date_hour” due to the weekly nature of our data—should be omitted. We also spot redundant features, such as “Date_Month” and “Date_month_lbl”; both convey month information, albeit in different formats. To enhance clarity and computational efficiency, we’ll refine our dataset to include only the most relevant columns.\nAdditionally, we’ve eliminated certain categorical columns, which, although compatible with models like LightGBM and Catboost, demand extra processing for many tree-based ML models. While 1-hot encoding is a popular method for managing categorical data, it’s not typically recommended for date attributes. Instead, leveraging numeric date features directly, combined with the integration of Fourier features, can effectively capture cyclical patterns.\n\n\nCode\nsales_df_dates.glimpse()\n\n\n<class 'pandas.core.frame.DataFrame'>: 1036 rows of 32 columns\nDept: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\nDate: datetime64[ns] [Timestamp('2010-02-05 00:00:00'), ...\nWeekly_Sales: float64 [24924.5, 46039.49, 41595.55, 1940 ...\nDate_index_num: int64 [1265328000, 1265932800, 126653760 ...\nDate_year: int64 [2010, 2010, 2010, 2010, 2010, 201 ...\nDate_year_iso: UInt32 [2010, 2010, 2010, 2010, 2010, 201 ...\nDate_yearstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_yearend: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_leapyear: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_half: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\nDate_quarter: int64 [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, ...\nDate_quarteryear: object ['2010Q1', '2010Q1', '2010Q1', '20 ...\nDate_quarterstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_quarterend: uint8 [0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, ...\nDate_month: int64 [2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, ...\nDate_month_lbl: object ['February', 'February', 'February ...\nDate_monthstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_monthend: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_yweek: UInt32 [5, 6, 7, 8, 9, 10, 11, 12, 13, 14 ...\nDate_mweek: int64 [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, ...\nDate_wday: int64 [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...\nDate_wday_lbl: object ['Friday', 'Friday', 'Friday', 'Fr ...\nDate_mday: int64 [5, 12, 19, 26, 5, 12, 19, 26, 2, ...\nDate_qday: int64 [36, 43, 50, 57, 64, 71, 78, 85, 2 ...\nDate_yday: int64 [36, 43, 50, 57, 64, 71, 78, 85, 9 ...\nDate_weekend: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_hour: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_minute: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_second: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_msecond: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_nsecond: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\nDate_am_pm: object ['am', 'am', 'am', 'am', 'am', 'am ...\n\n\n\n\nCode\nsales_df_dates = sales_df_dates[[\n 'Date'\n ,'Dept'\n , 'Weekly_Sales'\n , 'Date_year'\n , 'Date_month'\n , 'Date_yweek'\n , 'Date_mweek' \n ]]\nsales_df_dates.tail(10)\n\n\n\n\n\n\n\n\n\nDate\nDept\nWeekly_Sales\nDate_year\nDate_month\nDate_yweek\nDate_mweek\n\n\n\n\n1026\n2012-11-02\n93\nNaN\n2012\n11\n44\n1\n\n\n1027\n2012-11-09\n93\nNaN\n2012\n11\n45\n2\n\n\n1028\n2012-11-16\n93\nNaN\n2012\n11\n46\n3\n\n\n1029\n2012-11-23\n93\nNaN\n2012\n11\n47\n4\n\n\n1030\n2012-11-30\n93\nNaN\n2012\n11\n48\n5\n\n\n1031\n2012-11-02\n95\nNaN\n2012\n11\n44\n1\n\n\n1032\n2012-11-09\n95\nNaN\n2012\n11\n45\n2\n\n\n1033\n2012-11-16\n95\nNaN\n2012\n11\n46\n3\n\n\n1034\n2012-11-23\n95\nNaN\n2012\n11\n47\n4\n\n\n1035\n2012-11-30\n95\nNaN\n2012\n11\n48\n5" + "objectID": "tutorials/01_sales_crm.html#data-exploration-with-tk.summarize_by_time", + "href": 
"tutorials/01_sales_crm.html#data-exploration-with-tk.summarize_by_time", + "title": "Sales Analysis", + "section": "2.2 Data Exploration with tk.summarize_by_time", + "text": "2.2 Data Exploration with tk.summarize_by_time\nCRM data is often bustling with activity, reflecting the myriad of transactions happening daily. Due to this high volume, the data can sometimes seem overwhelming or noisy. To derive meaningful insights, it’s essential to aggregate this data over specific time intervals. This is where tk.summarize_by_time() comes into play.\nThe tk.summarize_by_time() function offers a streamlined approach to time-based data aggregation. By defining a desired frequency and an aggregation method, this function seamlessly organizes your data. The beauty of it is its versatility; from a broad array of built-in aggregation methods and frequencies to the flexibility of integrating a custom function, it caters to a range of requirements.\n\n\n\n\n\n\nGetting to know tk.summarize_by_time()\n\n\n\n\n\nCurious about the various options it provides?\n\nClick here to see our Data Wrangling Guide\nUse help(tk.summarize_by_time) to review additional helpful documentation. 
And explore the plethora of possibilities!\n\n\n\n\n\nGetting Weekly Totals\nWe can quickly get totals by week with summarize_by_time.\n\n\nCode\nweekly_totals = df.summarize_by_time(\n date_column = 'order_date',\n value_column = 'total_price',\n agg_func = ['sum'],\n freq = 'W'\n)\n\nweekly_totals.head(10)\n\n\n\n\n\n\n\n\n\norder_date\ntotal_price_sum\n\n\n\n\n0\n2011-01-09\n12040\n\n\n1\n2011-01-16\n151460\n\n\n2\n2011-01-23\n143850\n\n\n3\n2011-01-30\n175665\n\n\n4\n2011-02-06\n105210\n\n\n5\n2011-02-13\n250390\n\n\n6\n2011-02-20\n410595\n\n\n7\n2011-02-27\n254045\n\n\n8\n2011-03-06\n308420\n\n\n9\n2011-03-13\n45450\n\n\n\n\n\n\n\n\n\nGet Weekly Totals by Group (Category 2)\nTo better understand your data, you might want to add groups to this summary. 
We can include a groupby before the summarize_by_time and then aggregate our data.\n\n\nCode\n sales_by_week = df \\\n .groupby('category_2') \\\n .summarize_by_time(\n date_column = 'order_date',\n value_column = 'total_price',\n agg_func = ['sum'],\n freq = 'W'\n )\n\nsales_by_week.head(10)\n\n\n\n\n\n\n\n\n\ncategory_2\norder_date\ntotal_price_sum\n\n\n\n\n0\nCross Country Race\n2011-01-16\n61750\n\n\n1\nCross Country Race\n2011-01-23\n25050\n\n\n2\nCross Country Race\n2011-01-30\n56860\n\n\n3\nCross Country Race\n2011-02-06\n8740\n\n\n4\nCross Country Race\n2011-02-13\n78070\n\n\n5\nCross Country Race\n2011-02-20\n115010\n\n\n6\nCross Country Race\n2011-02-27\n64290\n\n\n7\nCross Country Race\n2011-03-06\n95070\n\n\n8\nCross Country Race\n2011-03-13\n3200\n\n\n9\nCross Country Race\n2011-03-20\n21170\n\n\n\n\n\n\n\n\n\nLong vs Wide Format\nThis long format can make it a little hard to compare the different group values visually, so instead of long-format you might want to pivot wide to view the data.\n\n\nCode\nsales_by_week_wide = df \\\n .groupby('category_2') \\\n .summarize_by_time(\n date_column = 'order_date',\n value_column = 'total_price',\n agg_func = ['sum'],\n freq = 'W',\n wide_format = True\n )\n\nsales_by_week_wide.head(10)\n\n\n\n\n\n\n\n\n\norder_date\ntotal_price_sum_Cross Country Race\ntotal_price_sum_Cyclocross\ntotal_price_sum_Elite Road\ntotal_price_sum_Endurance Road\ntotal_price_sum_Fat Bike\ntotal_price_sum_Over 
Mountain\ntotal_price_sum_Sport\ntotal_price_sum_Trail\ntotal_price_sum_Triathalon\n\n\n\n\n0\n2011-01-09\n0.0\n0.0\n0.0\n0.0\n0.0\n12040.0\n0.0\n0.0\n0.0\n\n\n1\n2011-01-16\n61750.0\n1960.0\n49540.0\n11110.0\n0.0\n9170.0\n4030.0\n7450.0\n6450.0\n\n\n2\n2011-01-23\n25050.0\n3500.0\n51330.0\n47930.0\n0.0\n3840.0\n0.0\n0.0\n12200.0\n\n\n3\n2011-01-30\n56860.0\n2450.0\n43895.0\n24160.0\n0.0\n10880.0\n3720.0\n26700.0\n7000.0\n\n\n4\n2011-02-06\n8740.0\n7000.0\n35640.0\n22680.0\n3730.0\n14270.0\n980.0\n10220.0\n1950.0\n\n\n5\n2011-02-13\n78070.0\n0.0\n83780.0\n24820.0\n2130.0\n17160.0\n6810.0\n17120.0\n20500.0\n\n\n6\n2011-02-20\n115010.0\n7910.0\n79770.0\n27650.0\n26100.0\n37830.0\n10925.0\n96250.0\n9150.0\n\n\n7\n2011-02-27\n64290.0\n6650.0\n86900.0\n31900.0\n5860.0\n22070.0\n6165.0\n16410.0\n13800.0\n\n\n8\n2011-03-06\n95070.0\n2450.0\n31990.0\n47660.0\n5860.0\n82060.0\n9340.0\n26790.0\n7200.0\n\n\n9\n2011-03-13\n3200.0\n4200.0\n23110.0\n7260.0\n0.0\n5970.0\n1710.0\n0.0\n0.0\n\n\n\n\n\n\n\nYou can now observe the total sales for each product side by side. This streamlined view facilitates easy comparison between product sales." }, { - "objectID": "tutorials/03_demand_forecasting.html#lag-features-with-tk.augment_lags", - "href": "tutorials/03_demand_forecasting.html#lag-features-with-tk.augment_lags", - "title": "Demand Forecasting", - "section": "2.3 Lag Features with tk.augment_lags", - "text": "2.3 Lag Features with tk.augment_lags\nAs previously noted, it’s important to recognize that machine learning models lack inherent awareness of time, a vital consideration in time series modeling. Furthermore, these models operate under the assumption that each row is independent, meaning that the information from last month’s weekly sales is not inherently integrated into the prediction of next month’s sales target. To address this limitation, we incorporate additional features, such as lags, into the models to capture temporal dependencies. 
You can easily achieve this by employing the tk.augment_lags function.\n\n\nCode\ndf_with_lags = sales_df_dates \\\n .groupby('Dept') \\\n .augment_lags(\n date_column = 'Date',\n value_column = 'Weekly_Sales',\n lags = [5,6,7,8,9]\n )\ndf_with_lags.head(5)\n\n\n\n\n\n\n\n\n\nDate\nDept\nWeekly_Sales\nDate_year\nDate_month\nDate_yweek\nDate_mweek\nWeekly_Sales_lag_5\nWeekly_Sales_lag_6\nWeekly_Sales_lag_7\nWeekly_Sales_lag_8\nWeekly_Sales_lag_9\n\n\n\n\n0\n2010-02-05\n1\n24924.50\n2010\n2\n5\n1\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n1\n2010-02-12\n1\n46039.49\n2010\n2\n6\n2\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n2\n2010-02-19\n1\n41595.55\n2010\n2\n7\n3\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n3\n2010-02-26\n1\n19403.54\n2010\n2\n8\n4\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n4\n2010-03-05\n1\n21827.90\n2010\n3\n9\n1\nNaN\nNaN\nNaN\nNaN\nNaN" + "objectID": "tutorials/01_sales_crm.html#visualize-your-time-series-data-with-tk.plot_timeseries", + "href": "tutorials/01_sales_crm.html#visualize-your-time-series-data-with-tk.plot_timeseries", + "title": "Sales Analysis", + "section": "2.3 Visualize your time series data with tk.plot_timeseries", + "text": "2.3 Visualize your time series data with tk.plot_timeseries\nYou can now visualize the summarized data to gain a clearer insight into the prevailing trends.\n\nPlotlyPlotnine\n\n\n\n\nCode\nsales_by_week \\\n .groupby('category_2') \\\n .plot_timeseries(\n date_column = 'order_date', \n value_column = 'total_price_sum',\n title = 'Bike Sales by Category',\n facet_ncol = 2,\n facet_scales = \"free\",\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 1000,\n height = 800,\n y_lab = 'Total Sales', \n engine = 'plotly'\n )\n\n\n\n \n\n\n\n\n\n\nCode\nsales_by_week \\\n .groupby('category_2') \\\n .plot_timeseries(\n date_column = 'order_date', \n value_column = 'total_price_sum',\n title = 'Bike Sales by Category',\n facet_ncol = 2,\n facet_scales = \"free\",\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 1000,\n height = 
800,\n y_lab = 'Total Sales', \n engine = 'plotnine'\n )\n\n\n\n\n\n<Figure Size: (1000 x 800)>\n\n\n\n\n\nThe graph showcases a pronounced uptick in sales for most of the different bike products during the summer. It’s a natural trend, aligning with our understanding that people gravitate towards biking during the balmy summer days. Conversely, as the chill of winter sets in at the year’s start and end, we observe a corresponding dip in sales.\nIt’s worth highlighting the elegance of the plot_timeseries function. Beyond just plotting raw data, it introduces a smoother, accentuating underlying trends and making them more discernible. This enhancement ensures we can effortlessly capture and comprehend the cyclical nature of bike sales throughout the year." }, { - "objectID": "tutorials/03_demand_forecasting.html#rolling-lag-features-with-tk.augment_rolling", - "href": "tutorials/03_demand_forecasting.html#rolling-lag-features-with-tk.augment_rolling", - "title": "Demand Forecasting", - "section": "2.4 Rolling Lag Features with tk.augment_rolling", - "text": "2.4 Rolling Lag Features with tk.augment_rolling\nAnother pivotal aspect of time series analysis involves the utilization of rolling lags. These operations facilitate computations within a moving time window, enabling the use of functions such as “mean” and “std” on these rolling windows. This can be achieved by invoking the tk.augment_rolling() function on grouped time series data. To execute this, we will initially gather all columns containing ‘lag’ in their names. We then apply this function to the lag values, as opposed to the weekly sales, since we lack future weekly sales data. 
By applying these functions to the lag values, we ensure the prevention of data leakage and maintain the adaptability of our method to unforeseen future data.\n\n\nCode\nlag_columns = [col for col in df_with_lags.columns if 'lag' in col]\n\ndf_with_rolling = df_with_lags \\\n .groupby('Dept') \\\n .augment_rolling(\n date_column = 'Date',\n value_column = lag_columns,\n window = 4,\n window_func = 'mean',\n threads = 1 # Change to -1 to use all available cores\n ) \ndf_with_rolling[df_with_rolling.Dept ==1].head(10)\n\n\n\n\n\n\n\n\n\n\n\n\nDate\nDept\nWeekly_Sales\nDate_year\nDate_month\nDate_yweek\nDate_mweek\nWeekly_Sales_lag_5\nWeekly_Sales_lag_6\nWeekly_Sales_lag_7\nWeekly_Sales_lag_8\nWeekly_Sales_lag_9\nWeekly_Sales_lag_5_rolling_mean_win_4\nWeekly_Sales_lag_6_rolling_mean_win_4\nWeekly_Sales_lag_7_rolling_mean_win_4\nWeekly_Sales_lag_8_rolling_mean_win_4\nWeekly_Sales_lag_9_rolling_mean_win_4\n\n\n\n\n0\n2010-02-05\n1\n24924.50\n2010\n2\n5\n1\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n0\n2010-02-05\n1\n24924.50\n2010\n2\n5\n1\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n0\n2010-02-05\n1\n24924.50\n2010\n2\n5\n1\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n0\n2010-02-05\n1\n24924.50\n2010\n2\n5\n1\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n0\n2010-02-05\n1\n24924.50\n2010\n2\n5\n1\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n1\n2010-02-12\n1\n46039.49\n2010\n2\n6\n2\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n1\n2010-02-12\n1\n46039.49\n2010\n2\n6\n2\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n1\n2010-02-12\n1\n46039.49\n2010\n2\n6\n2\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n1\n2010-02-12\n1\n46039.49\n2010\n2\n6\n2\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n1\n2010-02-12\n1\n46039.49\n2010\n2\n6\n2\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n\n\n\n\n\nNotice when we add lag values to our dataframe, this creates several NA values. 
This is because when using lags, there will be some data that is not available early in our dataset.Thus as a result, NA values are introduced.\nTo simplify and clean up the process, we will remove these rows entirely since we already extracted some meaningful information from them (ie. lags, rolling lags).\n\n\nCode\nall_lag_columns = [col for col in df_with_rolling.columns if 'lag' in col]\n\ndf_no_nas = df_with_rolling \\\n .dropna(subset=all_lag_columns, inplace=False)\n\ndf_no_nas.head()\n\n\n\n\n\n\n\n\n\nDate\nDept\nWeekly_Sales\nDate_year\nDate_month\nDate_yweek\nDate_mweek\nWeekly_Sales_lag_5\nWeekly_Sales_lag_6\nWeekly_Sales_lag_7\nWeekly_Sales_lag_8\nWeekly_Sales_lag_9\nWeekly_Sales_lag_5_rolling_mean_win_4\nWeekly_Sales_lag_6_rolling_mean_win_4\nWeekly_Sales_lag_7_rolling_mean_win_4\nWeekly_Sales_lag_8_rolling_mean_win_4\nWeekly_Sales_lag_9_rolling_mean_win_4\n\n\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.9\n19403.54\n22809.285\n21102.8675\n25967.595\n32216.62\n32990.77\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.9\n19403.54\n22809.285\n21102.8675\n25967.595\n32216.62\n32990.77\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.9\n19403.54\n22809.285\n21102.8675\n25967.595\n32216.62\n32990.77\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.9\n19403.54\n22809.285\n21102.8675\n25967.595\n32216.62\n32990.77\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.9\n19403.54\n22809.285\n21102.8675\n25967.595\n32216.62\n32990.77\n\n\n\n\n\n\n\nWe can call tk.glimpse() again to quickly see what features we still have available.\n\n\nCode\ndf_no_nas.glimpse()\n\n\n<class 'pandas.core.frame.DataFrame'>: 4760 rows of 17 columns\nDate: datetime64[ns] [Timestamp('20 ...\nDept: int64 [1, 1, 1, 1, 1 ...\nWeekly_Sales: float64 [16555.11, 165 ...\nDate_year: int64 [2010, 2010, 2 
...\nDate_month: int64 [4, 4, 4, 4, 4 ...\nDate_yweek: UInt32 [17, 17, 17, 1 ...\nDate_mweek: int64 [5, 5, 5, 5, 5 ...\nWeekly_Sales_lag_5: float64 [26229.21, 262 ...\nWeekly_Sales_lag_6: float64 [22136.64, 221 ...\nWeekly_Sales_lag_7: float64 [21043.39, 210 ...\nWeekly_Sales_lag_8: float64 [21827.9, 2182 ...\nWeekly_Sales_lag_9: float64 [19403.54, 194 ...\nWeekly_Sales_lag_5_rolling_mean_win_4: float64 [22809.285, 22 ...\nWeekly_Sales_lag_6_rolling_mean_win_4: float64 [21102.8675, 2 ...\nWeekly_Sales_lag_7_rolling_mean_win_4: float64 [25967.595, 25 ...\nWeekly_Sales_lag_8_rolling_mean_win_4: float64 [32216.6200000 ...\nWeekly_Sales_lag_9_rolling_mean_win_4: float64 [32990.7700000 ..." + "objectID": "tutorials/01_sales_crm.html#making-irregular-data-regular-with-tk.pad_by_time", + "href": "tutorials/01_sales_crm.html#making-irregular-data-regular-with-tk.pad_by_time", + "title": "Sales Analysis", + "section": "3.1 Making irregular data regular with tk.pad_by_time", + "text": "3.1 Making irregular data regular with tk.pad_by_time\nKicking off our journey, we’ll utilize pytimetk’s tk.pad_by_time() function. For this, grouping by the ‘category_2’ variable is recommended. Moreover, it’s prudent to establish a definitive end date. This ensures that all groups are equipped with training data up to the most recent date, accommodating scenarios where certain categories might have seen no sales in the final training week. 
By doing so, we create a representative observation for every group, capturing the nuances of each category’s sales pattern.\n\n\nCode\nsales_padded = sales_by_week \\\n .groupby('category_2') \\\n .pad_by_time(\n date_column = 'order_date',\n freq = 'W',\n end_date = sales_by_week.order_date.max()\n )\nsales_padded\n\n\n\n\n\n\n\n\n\ncategory_2\norder_date\ntotal_price_sum\n\n\n\n\n0\nCross Country Race\n2011-01-09\nNaN\n\n\n1\nCross Country Race\n2011-01-16\n61750.0\n\n\n2\nCross Country Race\n2011-01-23\n25050.0\n\n\n3\nCross Country Race\n2011-01-30\n56860.0\n\n\n4\nCross Country Race\n2011-02-06\n8740.0\n\n\n...\n...\n...\n...\n\n\n463\nTriathalon\n2011-12-04\n3200.0\n\n\n464\nTriathalon\n2011-12-11\n28350.0\n\n\n465\nTriathalon\n2011-12-18\n2700.0\n\n\n466\nTriathalon\n2011-12-25\n3900.0\n\n\n467\nTriathalon\n2012-01-01\nNaN\n\n\n\n\n468 rows × 3 columns" }, { - "objectID": "tutorials/03_demand_forecasting.html#training-and-future-sets", - "href": "tutorials/03_demand_forecasting.html#training-and-future-sets", - "title": "Demand Forecasting", - "section": "2.5 Training and Future Sets", - "text": "2.5 Training and Future Sets\nNow that we have our training set built, we can start to train our regressor. To do so, let’s first do some model cleanup.\nSplit our data in to train and future sets.\n\n\nCode\nfuture = df_no_nas[df_no_nas.Weekly_Sales.isnull()]\ntrain = df_no_nas[df_no_nas.Weekly_Sales.notnull()]" + "objectID": "tutorials/01_sales_crm.html#making-future-dates-easier-with-tk.future_frame", + "href": "tutorials/01_sales_crm.html#making-future-dates-easier-with-tk.future_frame", + "title": "Sales Analysis", + "section": "3.2 Making Future Dates Easier with tk.future_frame", + "text": "3.2 Making Future Dates Easier with tk.future_frame\nMoving on, let’s set up the future frame, which will serve as our test dataset. To achieve this, employ the tk.future_frame() method. 
This function allows for the specification of a grouping column and a forecast horizon.\nUpon invoking tk.future_frame(), you’ll observe that placeholders (null values) are added for each group, extending 12 weeks into the future.\n\n\nCode\ndf_with_futureframe = sales_padded \\\n .groupby('category_2') \\\n .future_frame(\n date_column = 'order_date',\n length_out = 12\n )\ndf_with_futureframe\n\n\n\n\n\n\n\n\n\n\n\n\ncategory_2\norder_date\ntotal_price_sum\n\n\n\n\n0\nCross Country Race\n2011-01-09\nNaN\n\n\n1\nCross Country Race\n2011-01-16\n61750.0\n\n\n2\nCross Country Race\n2011-01-23\n25050.0\n\n\n3\nCross Country Race\n2011-01-30\n56860.0\n\n\n4\nCross Country Race\n2011-02-06\n8740.0\n\n\n...\n...\n...\n...\n\n\n571\nTriathalon\n2012-02-26\nNaN\n\n\n572\nTriathalon\n2012-03-04\nNaN\n\n\n573\nTriathalon\n2012-03-11\nNaN\n\n\n574\nTriathalon\n2012-03-18\nNaN\n\n\n575\nTriathalon\n2012-03-25\nNaN\n\n\n\n\n576 rows × 3 columns" }, { - "objectID": "tutorials/03_demand_forecasting.html#model-with-regressor", - "href": "tutorials/03_demand_forecasting.html#model-with-regressor", - "title": "Demand Forecasting", - "section": "2.6 Model with regressor", - "text": "2.6 Model with regressor\nWe still have a datetime object in our training data. We will need to remove that before passing to our regressor. 
Let’s subset our column to just the features we want to use for modeling.\n\n\nCode\ntrain_columns = [ \n 'Dept'\n , 'Date_year'\n , 'Date_month'\n , 'Date_yweek'\n , 'Date_mweek'\n , 'Weekly_Sales_lag_5'\n , 'Weekly_Sales_lag_6'\n , 'Weekly_Sales_lag_7'\n , 'Weekly_Sales_lag_8'\n , 'Weekly_Sales_lag_5_rolling_mean_win_4'\n , 'Weekly_Sales_lag_6_rolling_mean_win_4'\n , 'Weekly_Sales_lag_7_rolling_mean_win_4'\n , 'Weekly_Sales_lag_8_rolling_mean_win_4'\n ]\n\nX = train[train_columns]\ny = train[['Weekly_Sales']]\n\nmodel = RandomForestRegressor(random_state=123)\nmodel = model.fit(X, y)\n\n\nNow that we have a trained model, we can pass in our future frame to predict weekly sales.\n\n\nCode\npredicted_values = model.predict(future[train_columns])\nfuture['y_pred'] = predicted_values\n\nfuture.head(10)\n\n\n\n\n\n\n\n\n\nDate\nDept\nWeekly_Sales\nDate_year\nDate_month\nDate_yweek\nDate_mweek\nWeekly_Sales_lag_5\nWeekly_Sales_lag_6\nWeekly_Sales_lag_7\nWeekly_Sales_lag_8\nWeekly_Sales_lag_9\nWeekly_Sales_lag_5_rolling_mean_win_4\nWeekly_Sales_lag_6_rolling_mean_win_4\nWeekly_Sales_lag_7_rolling_mean_win_4\nWeekly_Sales_lag_8_rolling_mean_win_4\nWeekly_Sales_lag_9_rolling_mean_win_4\ny_pred\n\n\n\n\n1001\n2012-11-02\n1\nNaN\n2012\n11\n44\n1\n18947.81\n19251.50\n19616.22\n18322.37\n16680.24\n19034.475\n18467.5825\n17726.3075\n17154.9275\n16604.3150\n26627.7378\n\n\n1001\n2012-11-02\n1\nNaN\n2012\n11\n44\n1\n18947.81\n19251.50\n19616.22\n18322.37\n16680.24\n19034.475\n18467.5825\n17726.3075\n17154.9275\n16604.3150\n26627.7378\n\n\n1001\n2012-11-02\n1\nNaN\n2012\n11\n44\n1\n18947.81\n19251.50\n19616.22\n18322.37\n16680.24\n19034.475\n18467.5825\n17726.3075\n17154.9275\n16604.3150\n26627.7378\n\n\n1001\n2012-11-02\n1\nNaN\n2012\n11\n44\n1\n18947.81\n19251.50\n19616.22\n18322.37\n16680.24\n19034.475\n18467.5825\n17726.3075\n17154.9275\n16604.3150\n26627.7378\n\n\n1001\n2012-11-02\n1\nNaN\n2012\n11\n44\n1\n18947.81\n19251.50\n19616.22\n18322.37\n16680.24\n19034.475\n18467.582
5\n17726.3075\n17154.9275\n16604.3150\n26627.7378\n\n\n1002\n2012-11-09\n1\nNaN\n2012\n11\n45\n2\n21904.47\n18947.81\n19251.50\n19616.22\n18322.37\n19930.000\n19034.4750\n18467.5825\n17726.3075\n17154.9275\n20959.0553\n\n\n1002\n2012-11-09\n1\nNaN\n2012\n11\n45\n2\n21904.47\n18947.81\n19251.50\n19616.22\n18322.37\n19930.000\n19034.4750\n18467.5825\n17726.3075\n17154.9275\n20959.0553\n\n\n1002\n2012-11-09\n1\nNaN\n2012\n11\n45\n2\n21904.47\n18947.81\n19251.50\n19616.22\n18322.37\n19930.000\n19034.4750\n18467.5825\n17726.3075\n17154.9275\n20959.0553\n\n\n1002\n2012-11-09\n1\nNaN\n2012\n11\n45\n2\n21904.47\n18947.81\n19251.50\n19616.22\n18322.37\n19930.000\n19034.4750\n18467.5825\n17726.3075\n17154.9275\n20959.0553\n\n\n1002\n2012-11-09\n1\nNaN\n2012\n11\n45\n2\n21904.47\n18947.81\n19251.50\n19616.22\n18322.37\n19930.000\n19034.4750\n18467.5825\n17726.3075\n17154.9275\n20959.0553\n\n\n\n\n\n\n\nLet’s create a label to split up our actuals from our prediction dataset before recombining.\n\n\nCode\ntrain['type'] = 'actuals'\nfuture['type'] = 'prediction'\n\nfull_df = pd.concat([train, 
future])\n\nfull_df.head(10)\n\n\n\n\n\n\n\n\n\nDate\nDept\nWeekly_Sales\nDate_year\nDate_month\nDate_yweek\nDate_mweek\nWeekly_Sales_lag_5\nWeekly_Sales_lag_6\nWeekly_Sales_lag_7\nWeekly_Sales_lag_8\nWeekly_Sales_lag_9\nWeekly_Sales_lag_5_rolling_mean_win_4\nWeekly_Sales_lag_6_rolling_mean_win_4\nWeekly_Sales_lag_7_rolling_mean_win_4\nWeekly_Sales_lag_8_rolling_mean_win_4\nWeekly_Sales_lag_9_rolling_mean_win_4\ntype\ny_pred\n\n\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.90\n19403.54\n22809.2850\n21102.8675\n25967.5950\n32216.620\n32990.77\nactuals\nNaN\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.90\n19403.54\n22809.2850\n21102.8675\n25967.5950\n32216.620\n32990.77\nactuals\nNaN\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.90\n19403.54\n22809.2850\n21102.8675\n25967.5950\n32216.620\n32990.77\nactuals\nNaN\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.90\n19403.54\n22809.2850\n21102.8675\n25967.5950\n32216.620\n32990.77\nactuals\nNaN\n\n\n12\n2010-04-30\n1\n16555.11\n2010\n4\n17\n5\n26229.21\n22136.64\n21043.39\n21827.90\n19403.54\n22809.2850\n21102.8675\n25967.5950\n32216.620\n32990.77\nactuals\nNaN\n\n\n13\n2010-05-07\n1\n17413.94\n2010\n5\n18\n1\n57258.43\n26229.21\n22136.64\n21043.39\n21827.90\n31666.9175\n22809.2850\n21102.8675\n25967.595\n32216.62\nactuals\nNaN\n\n\n13\n2010-05-07\n1\n17413.94\n2010\n5\n18\n1\n57258.43\n26229.21\n22136.64\n21043.39\n21827.90\n31666.9175\n22809.2850\n21102.8675\n25967.595\n32216.62\nactuals\nNaN\n\n\n13\n2010-05-07\n1\n17413.94\n2010\n5\n18\n1\n57258.43\n26229.21\n22136.64\n21043.39\n21827.90\n31666.9175\n22809.2850\n21102.8675\n25967.595\n32216.62\nactuals\nNaN\n\n\n13\n2010-05-07\n1\n17413.94\n2010\n5\n18\n1\n57258.43\n26229.21\n22136.64\n21043.39\n21827.90\n31666.9175\n22809.2850\n21102.8675\n25967.595\n32216.62\nactuals\nNaN\n\n\n13\n2010-05-07\n1\n17413.94\n201
0\n5\n18\n1\n57258.43\n26229.21\n22136.64\n21043.39\n21827.90\n31666.9175\n22809.2850\n21102.8675\n25967.595\n32216.62\nactuals\nNaN" + "objectID": "tutorials/01_sales_crm.html#lag-values-with-tk.augment_lags", + "href": "tutorials/01_sales_crm.html#lag-values-with-tk.augment_lags", + "title": "Sales Analysis", + "section": "3.3 Lag Values with tk.augment_lags", + "text": "3.3 Lag Values with tk.augment_lags\nCrafting features from time series data can be intricate, but thanks to the suite of feature engineering tools in pytimetk, the process is streamlined and intuitive.\nIn this guide, we’ll focus on the basics: introducing a few lag variables and incorporating some date-related features.\nFirstly, let’s dive into creating lag features.\nGiven our forecasting objective of a 12-week horizon, to ensure we have lag data available for every future point, we should utilize a lag of 12 or more. The beauty of the toolkit is that it supports the addition of multiple lags simultaneously.\nLag features play a pivotal role in machine learning for time series. Often, recent data offers valuable insights into future trends. To capture this recency effect, it’s crucial to integrate lag values. 
For this purpose, tk.augment_lags() comes in handy.\n\n\nCode\ndf_with_lags = df_with_futureframe \\\n .groupby('category_2') \\\n .augment_lags(\n date_column = 'order_date',\n value_column = 'total_price_sum',\n lags = [12,24]\n\n )\ndf_with_lags.head(25)\n\n\n\n\n\n\n\n\n\ncategory_2\norder_date\ntotal_price_sum\ntotal_price_sum_lag_12\ntotal_price_sum_lag_24\n\n\n\n\n0\nCross Country Race\n2011-01-09\nNaN\nNaN\nNaN\n\n\n1\nCross Country Race\n2011-01-16\n61750.0\nNaN\nNaN\n\n\n2\nCross Country Race\n2011-01-23\n25050.0\nNaN\nNaN\n\n\n3\nCross Country Race\n2011-01-30\n56860.0\nNaN\nNaN\n\n\n4\nCross Country Race\n2011-02-06\n8740.0\nNaN\nNaN\n\n\n5\nCross Country Race\n2011-02-13\n78070.0\nNaN\nNaN\n\n\n6\nCross Country Race\n2011-02-20\n115010.0\nNaN\nNaN\n\n\n7\nCross Country Race\n2011-02-27\n64290.0\nNaN\nNaN\n\n\n8\nCross Country Race\n2011-03-06\n95070.0\nNaN\nNaN\n\n\n9\nCross Country Race\n2011-03-13\n3200.0\nNaN\nNaN\n\n\n10\nCross Country Race\n2011-03-20\n21170.0\nNaN\nNaN\n\n\n11\nCross Country Race\n2011-03-27\n28990.0\nNaN\nNaN\n\n\n12\nCross Country Race\n2011-04-03\n51860.0\nNaN\nNaN\n\n\n13\nCross Country Race\n2011-04-10\n85910.0\n61750.0\nNaN\n\n\n14\nCross Country Race\n2011-04-17\n138230.0\n25050.0\nNaN\n\n\n15\nCross Country Race\n2011-04-24\n138350.0\n56860.0\nNaN\n\n\n16\nCross Country Race\n2011-05-01\n136090.0\n8740.0\nNaN\n\n\n17\nCross Country Race\n2011-05-08\n32110.0\n78070.0\nNaN\n\n\n18\nCross Country Race\n2011-05-15\n139010.0\n115010.0\nNaN\n\n\n19\nCross Country Race\n2011-05-22\n2060.0\n64290.0\nNaN\n\n\n20\nCross Country Race\n2011-05-29\n26130.0\n95070.0\nNaN\n\n\n21\nCross Country Race\n2011-06-05\n30360.0\n3200.0\nNaN\n\n\n22\nCross Country Race\n2011-06-12\n88280.0\n21170.0\nNaN\n\n\n23\nCross Country Race\n2011-06-19\n109470.0\n28990.0\nNaN\n\n\n24\nCross Country Race\n2011-06-26\n107280.0\n51860.0\nNaN\n\n\n\n\n\n\n\nObserve that lag values of 12 and 24 introduce missing entries at the dataset’s outset. 
This occurs because there isn’t available data from 12 or 24 weeks prior. To address these gaps, you can adopt one of two strategies:\n\nDiscard the Affected Rows: This is a recommended approach if your dataset is sufficiently large. Removing a few initial rows might not significantly impact the training process.\nBackfill Missing Values: In situations with limited data, you might consider backfilling these nulls using the first available values from lag 12 and 24. However, the appropriateness of this technique hinges on your specific context and objectives.\n\nFor the scope of this tutorial, we’ll opt to remove these rows. However, it’s worth pointing out that our dataset is quite small with limited historical data, so this might impact our model.\n\n\nCode\nlag_columns = [col for col in df_with_lags.columns if 'lag' in col]\ndf_no_nas = df_with_lags \\\n .dropna(subset=lag_columns, inplace=False)\n\ndf_no_nas.head()\n\n\n\n\n\n\n\n\n\ncategory_2\norder_date\ntotal_price_sum\ntotal_price_sum_lag_12\ntotal_price_sum_lag_24\n\n\n\n\n25\nCross Country Race\n2011-07-03\n56430.0\n85910.0\n61750.0\n\n\n26\nCross Country Race\n2011-07-10\n62320.0\n138230.0\n25050.0\n\n\n27\nCross Country Race\n2011-07-17\n141620.0\n138350.0\n56860.0\n\n\n28\nCross Country Race\n2011-07-24\n75720.0\n136090.0\n8740.0\n\n\n29\nCross Country Race\n2011-07-31\n21240.0\n32110.0\n78070.0" }, { - "objectID": "tutorials/03_demand_forecasting.html#pre-visualization-clean-up", - "href": "tutorials/03_demand_forecasting.html#pre-visualization-clean-up", - "title": "Demand Forecasting", - "section": "2.7 Pre-Visualization Clean-up", - "text": "2.7 Pre-Visualization Clean-up\n\n\nCode\nfull_df['Weekly_Sales'] = np.where(full_df.type =='actuals', full_df.Weekly_Sales, full_df.y_pred)" + "objectID": "tutorials/01_sales_crm.html#date-features-with-tk.augment_timeseries_signature", + "href": "tutorials/01_sales_crm.html#date-features-with-tk.augment_timeseries_signature", + "title": "Sales Analysis", + 
"section": "3.4 Date Features with tk.augment_timeseries_signature", + "text": "3.4 Date Features with tk.augment_timeseries_signature\nNow, let’s enrich our dataset with date-related features.\nWith the function tk.augment_timeseries_signature(), you can effortlessly append 29 date attributes to a timestamp. Given that our dataset captures weekly intervals, certain attributes like ‘hour’ may not be pertinent. Thus, it’s prudent to refine our columns, retaining only those that truly matter to our analysis.\n\n\nCode\ndf_with_datefeatures = df_no_nas \\\n .augment_timeseries_signature(date_column='order_date')\n\ndf_with_datefeatures.head(10)\n\n\n\n\n\n\n\n\n\ncategory_2\norder_date\ntotal_price_sum\ntotal_price_sum_lag_12\ntotal_price_sum_lag_24\norder_date_index_num\norder_date_year\norder_date_year_iso\norder_date_yearstart\norder_date_yearend\n...\norder_date_mday\norder_date_qday\norder_date_yday\norder_date_weekend\norder_date_hour\norder_date_minute\norder_date_second\norder_date_msecond\norder_date_nsecond\norder_date_am_pm\n\n\n\n\n25\nCross Country Race\n2011-07-03\n56430.0\n85910.0\n61750.0\n1309651200\n2011\n2011\n0\n0\n...\n3\n3\n184\n1\n0\n0\n0\n0\n0\nam\n\n\n26\nCross Country Race\n2011-07-10\n62320.0\n138230.0\n25050.0\n1310256000\n2011\n2011\n0\n0\n...\n10\n10\n191\n1\n0\n0\n0\n0\n0\nam\n\n\n27\nCross Country Race\n2011-07-17\n141620.0\n138350.0\n56860.0\n1310860800\n2011\n2011\n0\n0\n...\n17\n17\n198\n1\n0\n0\n0\n0\n0\nam\n\n\n28\nCross Country Race\n2011-07-24\n75720.0\n136090.0\n8740.0\n1311465600\n2011\n2011\n0\n0\n...\n24\n24\n205\n1\n0\n0\n0\n0\n0\nam\n\n\n29\nCross Country Race\n2011-07-31\n21240.0\n32110.0\n78070.0\n1312070400\n2011\n2011\n0\n0\n...\n31\n31\n212\n1\n0\n0\n0\n0\n0\nam\n\n\n30\nCross Country Race\n2011-08-07\n11620.0\n139010.0\n115010.0\n1312675200\n2011\n2011\n0\n0\n...\n7\n38\n219\n1\n0\n0\n0\n0\n0\nam\n\n\n31\nCross Country 
Race\n2011-08-14\n9730.0\n2060.0\n64290.0\n1313280000\n2011\n2011\n0\n0\n...\n14\n45\n226\n1\n0\n0\n0\n0\n0\nam\n\n\n32\nCross Country Race\n2011-08-21\n22780.0\n26130.0\n95070.0\n1313884800\n2011\n2011\n0\n0\n...\n21\n52\n233\n1\n0\n0\n0\n0\n0\nam\n\n\n33\nCross Country Race\n2011-08-28\n53680.0\n30360.0\n3200.0\n1314489600\n2011\n2011\n0\n0\n...\n28\n59\n240\n1\n0\n0\n0\n0\n0\nam\n\n\n34\nCross Country Race\n2011-09-04\n38360.0\n88280.0\n21170.0\n1315094400\n2011\n2011\n0\n0\n...\n4\n66\n247\n1\n0\n0\n0\n0\n0\nam\n\n\n\n\n10 rows × 34 columns\n\n\n\nWe can quickly get a sense of what features were just created using tk.glimpse.\n\n\nCode\ndf_with_datefeatures.glimpse()\n\n\n<class 'pandas.core.frame.DataFrame'>: 341 rows of 34 columns\ncategory_2: object ['Cross Country Race', 'Cros ...\norder_date: datetime64[ns] [Timestamp('2011-07-03 00:00 ...\ntotal_price_sum: float64 [56430.0, 62320.0, 141620.0, ...\ntotal_price_sum_lag_12: float64 [85910.0, 138230.0, 138350.0 ...\ntotal_price_sum_lag_24: float64 [61750.0, 25050.0, 56860.0, ...\norder_date_index_num: int64 [1309651200, 1310256000, 131 ...\norder_date_year: int64 [2011, 2011, 2011, 2011, 201 ...\norder_date_year_iso: UInt32 [2011, 2011, 2011, 2011, 201 ...\norder_date_yearstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_yearend: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_leapyear: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_half: int64 [2, 2, 2, 2, 2, 2, 2, 2, 2, ...\norder_date_quarter: int64 [3, 3, 3, 3, 3, 3, 3, 3, 3, ...\norder_date_quarteryear: object ['2011Q3', '2011Q3', '2011Q3 ...\norder_date_quarterstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_quarterend: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_month: int64 [7, 7, 7, 7, 7, 8, 8, 8, 8, ...\norder_date_month_lbl: object ['July', 'July', 'July', 'Ju ...\norder_date_monthstart: uint8 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_monthend: uint8 [0, 0, 0, 0, 1, 0, 0, 0, 0, ...\norder_date_yweek: UInt32 [26, 27, 28, 
29, 30, 31, 32, ...\norder_date_mweek: int64 [1, 2, 3, 4, 5, 1, 2, 3, 4, ...\norder_date_wday: int64 [7, 7, 7, 7, 7, 7, 7, 7, 7, ...\norder_date_wday_lbl: object ['Sunday', 'Sunday', 'Sunday ...\norder_date_mday: int64 [3, 10, 17, 24, 31, 7, 14, 2 ...\norder_date_qday: int64 [3, 10, 17, 24, 31, 38, 45, ...\norder_date_yday: int64 [184, 191, 198, 205, 212, 21 ...\norder_date_weekend: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1, ...\norder_date_hour: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_minute: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_second: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_msecond: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_nsecond: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, ...\norder_date_am_pm: object ['am', 'am', 'am', 'am', 'am ...\n\n\nLet’s subset to just a few of the relevant date features. Let’s use tk.glimpse again.\n\n\nCode\ndf_with_datefeatures_narrom = df_with_datefeatures[[\n 'order_date', \n 'category_2', \n 'total_price_sum',\n 'total_price_sum_lag_12',\n 'total_price_sum_lag_24',\n 'order_date_year', \n 'order_date_half', \n 'order_date_quarter', \n 'order_date_month',\n 'order_date_yweek'\n]]\n\ndf_with_datefeatures_narrom.glimpse()\n\n\n<class 'pandas.core.frame.DataFrame'>: 341 rows of 10 columns\norder_date: datetime64[ns] [Timestamp('2011-07-03 00:00: ...\ncategory_2: object ['Cross Country Race', 'Cross ...\ntotal_price_sum: float64 [56430.0, 62320.0, 141620.0, ...\ntotal_price_sum_lag_12: float64 [85910.0, 138230.0, 138350.0, ...\ntotal_price_sum_lag_24: float64 [61750.0, 25050.0, 56860.0, 8 ...\norder_date_year: int64 [2011, 2011, 2011, 2011, 2011 ...\norder_date_half: int64 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2 ...\norder_date_quarter: int64 [3, 3, 3, 3, 3, 3, 3, 3, 3, 3 ...\norder_date_month: int64 [7, 7, 7, 7, 7, 8, 8, 8, 8, 9 ...\norder_date_yweek: UInt32 [26, 27, 28, 29, 30, 31, 32, ...\n\n\n\nOne-Hot Encoding\nThe final phase in our feature engineering journey is one-hot encoding our categorical variables. 
While certain machine learning models like CatBoost can natively handle categorical data, many cannot. Enter one-hot encoding, a technique that transforms each category within a column into its separate column, marking its presence with a ‘1’ or absence with a ‘0’.\nFor this transformation, the handy pd.get_dummies() function from pandas comes to the rescue.\n\n\nCode\ndf_encoded = pd.get_dummies(df_with_datefeatures_narrom, columns=['category_2'])\n\ndf_encoded.glimpse()\n\n\n<class 'pandas.core.frame.DataFrame'>: 341 rows of 18 columns\norder_date: datetime64[ns] [Timestamp('2011-07-03 ...\ntotal_price_sum: float64 [56430.0, 62320.0, 141 ...\ntotal_price_sum_lag_12: float64 [85910.0, 138230.0, 13 ...\ntotal_price_sum_lag_24: float64 [61750.0, 25050.0, 568 ...\norder_date_year: int64 [2011, 2011, 2011, 201 ...\norder_date_half: int64 [2, 2, 2, 2, 2, 2, 2, ...\norder_date_quarter: int64 [3, 3, 3, 3, 3, 3, 3, ...\norder_date_month: int64 [7, 7, 7, 7, 7, 8, 8, ...\norder_date_yweek: UInt32 [26, 27, 28, 29, 30, 3 ...\ncategory_2_Cross Country Race: uint8 [1, 1, 1, 1, 1, 1, 1, ...\ncategory_2_Cyclocross: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Elite Road: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Endurance Road: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Fat Bike: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Over Mountain: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Sport: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Trail: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Triathalon: uint8 [0, 0, 0, 0, 0, 0, 0, ...\n\n\n\n\nTraining and Future Feature Sets\nPytimetk offers an extensive array of feature engineering tools and augmentation functions, giving you a broad spectrum of possibilities. 
However, for the purposes of this tutorial, let’s shift our focus to modeling.\nLet’s proceed by segmenting our dataframe into training and future sets.\n\n\nCode\nfuture = df_encoded[df_encoded.total_price_sum.isnull()]\ntrain = df_encoded[df_encoded.total_price_sum.notnull()]\n\n\nLet’s focus on the columns essential for training. You’ll observe that we’ve excluded the ‘order_date’ column. This is because numerous machine learning models struggle with date data types. This is precisely why we utilized the tk.augment_timeseries_signature earlier—to transform date features into a format that’s compatible with ML models.\nWe can quickly see what features we have available with tk.glimpse().\n\n\nCode\ntrain.glimpse()\n\n\n<class 'pandas.core.frame.DataFrame'>: 233 rows of 18 columns\norder_date: datetime64[ns] [Timestamp('2011-07-03 ...\ntotal_price_sum: float64 [56430.0, 62320.0, 141 ...\ntotal_price_sum_lag_12: float64 [85910.0, 138230.0, 13 ...\ntotal_price_sum_lag_24: float64 [61750.0, 25050.0, 568 ...\norder_date_year: int64 [2011, 2011, 2011, 201 ...\norder_date_half: int64 [2, 2, 2, 2, 2, 2, 2, ...\norder_date_quarter: int64 [3, 3, 3, 3, 3, 3, 3, ...\norder_date_month: int64 [7, 7, 7, 7, 7, 8, 8, ...\norder_date_yweek: UInt32 [26, 27, 28, 29, 30, 3 ...\ncategory_2_Cross Country Race: uint8 [1, 1, 1, 1, 1, 1, 1, ...\ncategory_2_Cyclocross: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Elite Road: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Endurance Road: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Fat Bike: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Over Mountain: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Sport: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Trail: uint8 [0, 0, 0, 0, 0, 0, 0, ...\ncategory_2_Triathalon: uint8 [0, 0, 0, 0, 0, 0, 0, ..." 
}, { - "objectID": "tutorials/03_demand_forecasting.html#plot-predictions", - "href": "tutorials/03_demand_forecasting.html#plot-predictions", - "title": "Demand Forecasting", - "section": "2.8 Plot Predictions", - "text": "2.8 Plot Predictions\n\nPlotlyPlotnine\n\n\n\n\nCode\nfull_df \\\n .groupby('Dept') \\\n .plot_timeseries(\n date_column = 'Date',\n value_column = 'Weekly_Sales',\n color_column = 'type',\n smooth = False,\n smooth_alpha = 0,\n facet_ncol = 2,\n facet_scales = \"free\",\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 800,\n height = 600,\n engine = 'plotly'\n )\n\n\n\n \n\n\n\n\n\n\nCode\nfull_df \\\n .groupby('Dept') \\\n .plot_timeseries(\n date_column = 'Date',\n value_column = 'Weekly_Sales',\n color_column = 'type',\n smooth = False,\n smooth_alpha = 0,\n facet_ncol = 2,\n facet_scales = \"free\",\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 800,\n height = 600,\n engine = 'plotnine'\n )\n\n\n\n\n\n<Figure Size: (800 x 600)>\n\n\n\n\n\nOur weekly sales forecasts exhibit a noticeable alignment with historical trends, indicating that our models are effectively capturing essential data signals. 
It’s worth noting that with some additional feature engineering, we have the potential to further enhance the model’s performance.\nHere are some additional techniques that can be explored to elevate its performance:\n\nExperiment with the incorporation of various lags using the versatile tk.augment_lags() function.\nEnhance the model’s capabilities by introducing additional rolling calculations through tk.augment_rolling().\nConsider incorporating cyclic features by utilizing tk.augment_fourier().\nTry different models and build a robust cross-validation strategy for model selection.\n\nThese strategies hold promise for refining the model’s accuracy and predictive power" + "objectID": "tutorials/01_sales_crm.html#scikit-learn-model", + "href": "tutorials/01_sales_crm.html#scikit-learn-model", + "title": "Sales Analysis", + "section": "3.5 Scikit Learn Model", + "text": "3.5 Scikit Learn Model\nNow for some machine learning.\n\nFitting a Random Forest Regressor\nLet’s create a RandomForestRegressor to predict future sales patterns.\n\ntrain_columns = [ 'total_price_sum_lag_12',\n 'total_price_sum_lag_24', 'order_date_year', 'order_date_half',\n 'order_date_quarter', 'order_date_month', 'order_date_yweek','category_2_Cross Country Race', 'category_2_Cyclocross',\n 'category_2_Elite Road', 'category_2_Endurance Road',\n 'category_2_Fat Bike', 'category_2_Over Mountain', 'category_2_Sport',\n 'category_2_Trail', 'category_2_Triathalon']\nX = train[train_columns]\ny = train[['total_price_sum']]\n\nmodel = RandomForestRegressor(random_state=123)\nmodel = model.fit(X, y)\n\n\n\nPrediction\nWe now have a fitted model, and can use this to predict sales from our future frame.\n\n\nCode\npredicted_values = model.predict(future[train_columns])\nfuture['y_pred'] = 
predicted_values\n\nfuture.head(10)\n\n\n\n\n\n\n\n\n\norder_date\ntotal_price_sum\ntotal_price_sum_lag_12\ntotal_price_sum_lag_24\norder_date_year\norder_date_half\norder_date_quarter\norder_date_month\norder_date_yweek\ncategory_2_Cross Country Race\ncategory_2_Cyclocross\ncategory_2_Elite Road\ncategory_2_Endurance Road\ncategory_2_Fat Bike\ncategory_2_Over Mountain\ncategory_2_Sport\ncategory_2_Trail\ncategory_2_Triathalon\ny_pred\n\n\n\n\n468\n2012-01-08\nNaN\n51820.0\n75720.0\n2012\n1\n1\n1\n1\n1\n0\n0\n0\n0\n0\n0\n0\n0\n59462.00\n\n\n469\n2012-01-15\nNaN\n62940.0\n21240.0\n2012\n1\n1\n1\n2\n1\n0\n0\n0\n0\n0\n0\n0\n0\n59149.45\n\n\n470\n2012-01-22\nNaN\n9060.0\n11620.0\n2012\n1\n1\n1\n3\n1\n0\n0\n0\n0\n0\n0\n0\n0\n20458.40\n\n\n471\n2012-01-29\nNaN\n15980.0\n9730.0\n2012\n1\n1\n1\n4\n1\n0\n0\n0\n0\n0\n0\n0\n0\n31914.00\n\n\n472\n2012-02-05\nNaN\n59180.0\n22780.0\n2012\n1\n1\n2\n5\n1\n0\n0\n0\n0\n0\n0\n0\n0\n59128.95\n\n\n473\n2012-02-12\nNaN\n132550.0\n53680.0\n2012\n1\n1\n2\n6\n1\n0\n0\n0\n0\n0\n0\n0\n0\n76397.50\n\n\n474\n2012-02-19\nNaN\n68430.0\n38360.0\n2012\n1\n1\n2\n7\n1\n0\n0\n0\n0\n0\n0\n0\n0\n63497.80\n\n\n475\n2012-02-26\nNaN\n29470.0\n90290.0\n2012\n1\n1\n2\n8\n1\n0\n0\n0\n0\n0\n0\n0\n0\n57332.00\n\n\n476\n2012-03-04\nNaN\n71080.0\n7380.0\n2012\n1\n1\n3\n9\n1\n0\n0\n0\n0\n0\n0\n0\n0\n60981.30\n\n\n477\n2012-03-11\nNaN\n9800.0\n0.0\n2012\n1\n1\n3\n10\n1\n0\n0\n0\n0\n0\n0\n0\n0\n18738.15\n\n\n\n\n\n\n\n\n\nCleaning Up\nNow let us do a little cleanup. For ease in plotting later, let’s add a column to track the actuals vs. 
the predicted values.\n\n\nCode\ntrain['type'] = 'actuals'\nfuture['type'] = 'prediction'\n\nfull_df = pd.concat([train, future])\n\nfull_df.head(10)\n\n\n\n\n\n\n\n\n\norder_date\ntotal_price_sum\ntotal_price_sum_lag_12\ntotal_price_sum_lag_24\norder_date_year\norder_date_half\norder_date_quarter\norder_date_month\norder_date_yweek\ncategory_2_Cross Country Race\ncategory_2_Cyclocross\ncategory_2_Elite Road\ncategory_2_Endurance Road\ncategory_2_Fat Bike\ncategory_2_Over Mountain\ncategory_2_Sport\ncategory_2_Trail\ncategory_2_Triathalon\ntype\ny_pred\n\n\n\n\n25\n2011-07-03\n56430.0\n85910.0\n61750.0\n2011\n2\n3\n7\n26\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n26\n2011-07-10\n62320.0\n138230.0\n25050.0\n2011\n2\n3\n7\n27\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n27\n2011-07-17\n141620.0\n138350.0\n56860.0\n2011\n2\n3\n7\n28\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n28\n2011-07-24\n75720.0\n136090.0\n8740.0\n2011\n2\n3\n7\n29\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n29\n2011-07-31\n21240.0\n32110.0\n78070.0\n2011\n2\n3\n7\n30\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n30\n2011-08-07\n11620.0\n139010.0\n115010.0\n2011\n2\n3\n8\n31\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n31\n2011-08-14\n9730.0\n2060.0\n64290.0\n2011\n2\n3\n8\n32\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n32\n2011-08-21\n22780.0\n26130.0\n95070.0\n2011\n2\n3\n8\n33\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n33\n2011-08-28\n53680.0\n30360.0\n3200.0\n2011\n2\n3\n8\n34\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n34\n2011-09-04\n38360.0\n88280.0\n21170.0\n2011\n2\n3\n9\n35\n1\n0\n0\n0\n0\n0\n0\n0\n0\nactuals\nNaN\n\n\n\n\n\n\n\nYou can get the grouping category back from the one-hot encoding for easier plotting. 
For simplicity, we will search for any column with ‘category’ in its name.\n\n\nCode\n# Extract dummy columns\ndummy_cols = [col for col in full_df.columns if 'category' in col.lower() ]\nfull_df_reverted = full_df.copy()\n\n# Convert dummy columns back to categorical column\nfull_df_reverted['category'] = full_df_reverted[dummy_cols].idxmax(axis=1).str.replace(\"A_\", \"\")\n\n# Drop dummy columns\nfull_df_reverted = full_df_reverted.drop(columns=dummy_cols)\n\nfull_df_reverted.head(10)\n\n\n\n\n\n\n\n\n\norder_date\ntotal_price_sum\ntotal_price_sum_lag_12\ntotal_price_sum_lag_24\norder_date_year\norder_date_half\norder_date_quarter\norder_date_month\norder_date_yweek\ntype\ny_pred\ncategory\n\n\n\n\n25\n2011-07-03\n56430.0\n85910.0\n61750.0\n2011\n2\n3\n7\n26\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n26\n2011-07-10\n62320.0\n138230.0\n25050.0\n2011\n2\n3\n7\n27\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n27\n2011-07-17\n141620.0\n138350.0\n56860.0\n2011\n2\n3\n7\n28\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n28\n2011-07-24\n75720.0\n136090.0\n8740.0\n2011\n2\n3\n7\n29\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n29\n2011-07-31\n21240.0\n32110.0\n78070.0\n2011\n2\n3\n7\n30\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n30\n2011-08-07\n11620.0\n139010.0\n115010.0\n2011\n2\n3\n8\n31\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n31\n2011-08-14\n9730.0\n2060.0\n64290.0\n2011\n2\n3\n8\n32\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n32\n2011-08-21\n22780.0\n26130.0\n95070.0\n2011\n2\n3\n8\n33\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n33\n2011-08-28\n53680.0\n30360.0\n3200.0\n2011\n2\n3\n8\n34\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n34\n2011-09-04\n38360.0\n88280.0\n21170.0\n2011\n2\n3\n9\n35\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n\n\n\n\n\n\n\nPre-Visualization Wrangling\nBefore we proceed to visualization, let’s streamline our dataset by aligning our predicted values with the actuals. 
This approach will simplify the plotting process. Given that each row is already labeled in the type column as ‘actuals’ or ‘prediction’, a brief conditional check will allow us to consolidate the necessary values.\n\n\nCode\nfull_df_reverted['total_price_sum'] = np.where(full_df_reverted.type =='actuals', full_df_reverted.total_price_sum, full_df_reverted.y_pred)\n\nfull_df_reverted.head(10)\n\n\n\n\n\n\n\n\n\norder_date\ntotal_price_sum\ntotal_price_sum_lag_12\ntotal_price_sum_lag_24\norder_date_year\norder_date_half\norder_date_quarter\norder_date_month\norder_date_yweek\ntype\ny_pred\ncategory\n\n\n\n\n25\n2011-07-03\n56430.0\n85910.0\n61750.0\n2011\n2\n3\n7\n26\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n26\n2011-07-10\n62320.0\n138230.0\n25050.0\n2011\n2\n3\n7\n27\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n27\n2011-07-17\n141620.0\n138350.0\n56860.0\n2011\n2\n3\n7\n28\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n28\n2011-07-24\n75720.0\n136090.0\n8740.0\n2011\n2\n3\n7\n29\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n29\n2011-07-31\n21240.0\n32110.0\n78070.0\n2011\n2\n3\n7\n30\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n30\n2011-08-07\n11620.0\n139010.0\n115010.0\n2011\n2\n3\n8\n31\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n31\n2011-08-14\n9730.0\n2060.0\n64290.0\n2011\n2\n3\n8\n32\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n32\n2011-08-21\n22780.0\n26130.0\n95070.0\n2011\n2\n3\n8\n33\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n33\n2011-08-28\n53680.0\n30360.0\n3200.0\n2011\n2\n3\n8\n34\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n34\n2011-09-04\n38360.0\n88280.0\n21170.0\n2011\n2\n3\n9\n35\nactuals\nNaN\ncategory_2_Cross Country Race\n\n\n\n\n\n\n\n\n\nVisualize the Forecast\nLet’s again use tk.plot_timeseries() to visually inspect the forecasts.\n\nPlotlyPlotnine\n\n\n\n\nCode\nfull_df_reverted \\\n .groupby('category') \\\n .plot_timeseries(\n date_column = 'order_date',\n value_column = 'total_price_sum',\n 
color_column = 'type',\n smooth = False,\n smooth_alpha = 0,\n facet_ncol = 2,\n facet_scales = \"free\",\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 800,\n height = 600,\n engine = 'plotly'\n )\n\n\n\n \n\n\n\n\n\n\nCode\nfull_df_reverted \\\n .groupby('category') \\\n .plot_timeseries(\n date_column = 'order_date',\n value_column = 'total_price_sum',\n color_column = 'type',\n smooth = False,\n smooth_alpha = 0,\n facet_ncol = 2, \n facet_scales = \"free\",\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 1000,\n height = 800,\n engine = 'plotnine'\n )\n\n\n\n\n\n<Figure Size: (1000 x 800)>\n\n\n\n\n\nUpon examining the graph, our models look alright given the length of time for training. Important points:\n\nFor effective time series forecasting, having multiple years of data is pivotal. This provides the model ample opportunities to recognize and adapt to seasonal variations.\nGiven our dataset spanned less than a year, the model lacked the depth of historical context to discern such patterns.\nAlthough our feature engineering was kept basic to introduce various pytimetk capabilities, there’s room for enhancement.\nFor a more refined analysis, consider experimenting with different machine learning models and diving deeper into feature engineering.\nPytimetk’s tk.augment_fourier() might assist in discerning seasonal trends, but with the dataset’s limited historical scope, capturing intricate patterns could remain a challenge." }, { - "objectID": "tutorials/02_finance.html", - "href": "tutorials/02_finance.html", - "title": "Finance Analysis", + "objectID": "tutorials/07_timeseries_crossvalidation.html", + "href": "tutorials/07_timeseries_crossvalidation.html", + "title": "Time Series Cross Validation", "section": "", - "text": "Timetk is designed to work with any time series domain. Arguably the most important is Finance. 
This tutorial showcases how you can perform Financial Investment and Stock Analysis at scale with pytimetk. This applied tutorial covers financial analysis with:\nLoad the following packages before proceeding with this tutorial.\nCode\nimport pytimetk as tk\nimport pandas as pd\nimport numpy as np" + "text": "In this tutorial, you’ll learn how to use the TimeSeriesCV and TimeSeriesCVSplitter classes from pytimetk for time series cross-validation, using the walmart_sales_df dataset as an example, which contains 7 time series groups.\n\nIn Part 1, we’ll start with exploring the data and move on to creating and visualizing time-based cross-validation splits. This will prepare you for the next section with Scikit Learn.\nIn Part 2, we’ll implement time series cross-validation with Scikit-Learn, engineer features, train a random forest model, and visualize the results in Python. By following this process, you can ensure a robust evaluation of your time series models and gain insights into their predictive performance." }, { - "objectID": "tutorials/02_finance.html#application-moving-averages-10-day-and-50-day", - "href": "tutorials/02_finance.html#application-moving-averages-10-day-and-50-day", - "title": "Finance Analysis", - "section": "3.1 Application: Moving Averages, 10-Day and 50-Day", - "text": "3.1 Application: Moving Averages, 10-Day and 50-Day\nThis code template can be used to make and visualize the 10-day and 50-Day moving average of a group of stock symbols. 
Click to expand the code.\n\nPlotlyPlotnine\n\n\n\n\nCode\n# Add 2 moving averages (10-day and 50-Day)\nsma_df = stocks_df[['symbol', 'date', 'adjusted']] \\\n .groupby('symbol') \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'adjusted',\n window = [10, 50],\n window_func = ['mean'],\n center = False,\n threads = 1, # Change to -1 to use all available cores\n )\n\n# Visualize \n(sma_df \n\n # zoom in on dates\n .query('date >= \"2023-01-01\"') \n\n # Convert to long format\n .melt(\n id_vars = ['symbol', 'date'],\n value_vars = [\"adjusted\", \"adjusted_rolling_mean_win_10\", \"adjusted_rolling_mean_win_50\"]\n ) \n\n # Group on symbol and visualize\n .groupby(\"symbol\") \n .plot_timeseries(\n date_column = 'date',\n value_column = 'value',\n color_column = 'variable',\n smooth = False, \n facet_ncol = 2,\n width = 900,\n height = 700,\n engine = \"plotly\"\n )\n)\n\n\n\n\n\n\n \n\n\n\n\n\n\nCode\n# Add 2 moving averages (10-day and 50-Day)\nsma_df = stocks_df[['symbol', 'date', 'adjusted']] \\\n .groupby('symbol') \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'adjusted',\n window = [10, 50],\n window_func = ['mean'],\n center = False,\n threads = 1, # Change to -1 to use all available cores\n )\n\n# Visualize \n(sma_df \n\n # zoom in on dates\n .query('date >= \"2023-01-01\"') \n\n # Convert to long format\n .melt(\n id_vars = ['symbol', 'date'],\n value_vars = [\"adjusted\", \"adjusted_rolling_mean_win_10\", \"adjusted_rolling_mean_win_50\"]\n ) \n\n # Group on symbol and visualize\n .groupby(\"symbol\") \n .plot_timeseries(\n date_column = 'date',\n value_column = 'value',\n color_column = 'variable',\n smooth = False, \n facet_ncol = 2,\n width = 900,\n height = 700,\n engine = \"plotnine\"\n )\n)\n\n\n\n\n\n\n\n\n<Figure Size: (900 x 700)>" + "objectID": "tutorials/07_timeseries_crossvalidation.html#step-1-load-and-explore-the-data", + "href": "tutorials/07_timeseries_crossvalidation.html#step-1-load-and-explore-the-data", + 
"title": "Time Series Cross Validation", + "section": "2.1 Step 1: Load and Explore the Data", + "text": "2.1 Step 1: Load and Explore the Data\nFirst, let’s load the Walmart sales dataset and explore its structure:\n\n# libraries\nimport pytimetk as tk\nimport pandas as pd\nimport numpy as np\n\n# Import Data\nwalmart_sales_df = tk.load_dataset('walmart_sales_weekly')\n\nwalmart_sales_df['Date'] = pd.to_datetime(walmart_sales_df['Date'])\n\nwalmart_sales_df = walmart_sales_df[['id', 'Date', 'Weekly_Sales']]\n\nwalmart_sales_df.glimpse()\n\n<class 'pandas.core.frame.DataFrame'>: 1001 rows of 3 columns\nid: object ['1_1', '1_1', '1_1', '1_1', '1_1', '1_ ...\nDate: datetime64[ns] [Timestamp('2010-02-05 00:00:00'), Time ...\nWeekly_Sales: float64 [24924.5, 46039.49, 41595.55, 19403.54, ..." }, { - "objectID": "tutorials/02_finance.html#application-bollinger-bands", - "href": "tutorials/02_finance.html#application-bollinger-bands", - "title": "Finance Analysis", - "section": "3.2 Application: Bollinger Bands", - "text": "3.2 Application: Bollinger Bands\nBollinger Bands are a volatility indicator commonly used in financial trading. 
They consist of three lines:\n\nThe middle band, which is a simple moving average (usually over 20 periods).\nThe upper band, calculated as the middle band plus k times the standard deviation of the price (typically, k=2).\nThe lower band, calculated as the middle band minus k times the standard deviation of the price.\n\nHere’s how you can calculate and plot Bollinger Bands with pytimetk using this code template (click to expand):\n\nPlotlyPlotnine\n\n\n\n\nCode\n# Bollinger Bands\nbollinger_df = stocks_df[['symbol', 'date', 'adjusted']] \\\n .groupby('symbol') \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'adjusted',\n window = 20,\n window_func = ['mean', 'std'],\n center = False\n ) \\\n .assign(\n upper_band = lambda x: x['adjusted_rolling_mean_win_20'] + 2*x['adjusted_rolling_std_win_20'],\n lower_band = lambda x: x['adjusted_rolling_mean_win_20'] - 2*x['adjusted_rolling_std_win_20']\n )\n\n\n# Visualize\n(bollinger_df\n\n # zoom in on dates\n .query('date >= \"2023-01-01\"') \n\n # Convert to long format\n .melt(\n id_vars = ['symbol', 'date'],\n value_vars = [\"adjusted\", \"adjusted_rolling_mean_win_20\", \"upper_band\", \"lower_band\"]\n ) \n\n # Group on symbol and visualize\n .groupby(\"symbol\") \n .plot_timeseries(\n date_column = 'date',\n value_column = 'value',\n color_column = 'variable',\n # Adjust colors for Bollinger Bands\n color_palette =[\"#2C3E50\", \"#E31A1C\", '#18BC9C', '#18BC9C'],\n smooth = False, \n facet_ncol = 2,\n width = 900,\n height = 700,\n engine = \"plotly\" \n )\n)\n\n\n\n\n\n\n \n\n\n\n\n\n\nCode\n# Bollinger Bands\nbollinger_df = stocks_df[['symbol', 'date', 'adjusted']] \\\n .groupby('symbol') \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'adjusted',\n window = 20,\n window_func = ['mean', 'std'],\n center = False\n ) \\\n .assign(\n upper_band = lambda x: x['adjusted_rolling_mean_win_20'] + 2*x['adjusted_rolling_std_win_20'],\n lower_band = lambda x: x['adjusted_rolling_mean_win_20'] - 
2*x['adjusted_rolling_std_win_20']\n )\n\n\n# Visualize\n(bollinger_df\n\n # zoom in on dates\n .query('date >= \"2023-01-01\"') \n\n # Convert to long format\n .melt(\n id_vars = ['symbol', 'date'],\n value_vars = [\"adjusted\", \"adjusted_rolling_mean_win_20\", \"upper_band\", \"lower_band\"]\n ) \n\n # Group on symbol and visualize\n .groupby(\"symbol\") \n .plot_timeseries(\n date_column = 'date',\n value_column = 'value',\n color_column = 'variable',\n # Adjust colors for Bollinger Bands\n color_palette =[\"#2C3E50\", \"#E31A1C\", '#18BC9C', '#18BC9C'],\n smooth = False, \n facet_ncol = 2,\n width = 900,\n height = 700,\n engine = \"plotnine\"\n )\n)\n\n\n\n\n\n\n\n\n<Figure Size: (900 x 700)>" + "objectID": "tutorials/07_timeseries_crossvalidation.html#step-2-visualize-the-time-series-data", + "href": "tutorials/07_timeseries_crossvalidation.html#step-2-visualize-the-time-series-data", + "title": "Time Series Cross Validation", + "section": "2.2 Step 2: Visualize the Time Series Data", + "text": "2.2 Step 2: Visualize the Time Series Data\nWe can visualize the weekly sales data for different store IDs using the plot_timeseries method from pytimetk:\n\nwalmart_sales_df \\\n .groupby('id') \\\n .plot_timeseries(\n \"Date\", \"Weekly_Sales\",\n plotly_dropdown = True,\n )\n\n\n \n\n\nThis will generate an interactive time series plot, allowing you to explore sales data for different stores using a dropdown." 
}, { - "objectID": "tutorials/02_finance.html#returns-analysis-by-time", - "href": "tutorials/02_finance.html#returns-analysis-by-time", - "title": "Finance Analysis", - "section": "4.1 Returns Analysis By Time", - "text": "4.1 Returns Analysis By Time\n\n\n\n\n\n\nReturns are NOT static (so analyze them by time)\n\n\n\n\n\n\nWe can use rolling window calculations with tk.augment_rolling() to compute many rolling features at scale such as rolling mean, std, range (spread).\nWe can expand our tk.augment_rolling_apply() rolling calculations to Rolling Correlation and Rolling Regression (to make comparisons over time)\n\n\n\n\n\nApplication: Descriptive Statistic Analysis\nMany traders compute descriptive statistics like mean, median, mode, skewness, kurtosis, and standard deviation to understand the central tendency, spread, and shape of the return distribution.\n\n\nStep 1: Returns\nUse this code to get the pct_change() in wide format. Click expand to get the code.\n\n\nCode\nreturns_wide_df = stocks_df[['symbol', 'date', 'adjusted']] \\\n .pivot(index = 'date', columns = 'symbol', values = 'adjusted') \\\n .pct_change() \\\n .reset_index() \\\n 
[1:]\n\nreturns_wide_df\n\n\n\n\n\n\n\n\nsymbol\ndate\nAAPL\nAMZN\nGOOG\nMETA\nNFLX\nNVDA\n\n\n\n\n1\n2013-01-03\n-0.012622\n0.004547\n0.000581\n-0.008214\n0.049777\n0.000786\n\n\n2\n2013-01-04\n-0.027854\n0.002592\n0.019760\n0.035650\n-0.006315\n0.032993\n\n\n3\n2013-01-07\n-0.005883\n0.035925\n-0.004363\n0.022949\n0.033549\n-0.028897\n\n\n4\n2013-01-08\n0.002691\n-0.007748\n-0.001974\n-0.012237\n-0.020565\n-0.021926\n\n\n5\n2013-01-09\n-0.015629\n-0.000113\n0.006573\n0.052650\n-0.012865\n-0.022418\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n2694\n2023-09-15\n-0.004154\n-0.029920\n-0.004964\n-0.036603\n-0.008864\n-0.036879\n\n\n2695\n2023-09-18\n0.016913\n-0.002920\n0.004772\n0.007459\n-0.006399\n0.001503\n\n\n2696\n2023-09-19\n0.006181\n-0.016788\n-0.000936\n0.008329\n0.004564\n-0.010144\n\n\n2697\n2023-09-20\n-0.019992\n-0.017002\n-0.030541\n-0.017701\n-0.024987\n-0.029435\n\n\n2698\n2023-09-21\n-0.008889\n-0.044053\n-0.023999\n-0.013148\n-0.005566\n-0.028931\n\n\n\n\n2698 rows × 7 columns\n\n\n\n\n\nStep 2: Descriptive Stats\nUse this code to get standard statistics with the describe() method. Click expand to get the code.\n\n\nCode\nreturns_wide_df.describe()\n\n\n\n\n\n\n\n\nsymbol\nAAPL\nAMZN\nGOOG\nMETA\nNFLX\nNVDA\n\n\n\n\ncount\n2698.000000\n2698.000000\n2698.000000\n2698.000000\n2698.000000\n2698.000000\n\n\nmean\n0.001030\n0.001068\n0.000885\n0.001170\n0.001689\n0.002229\n\n\nstd\n0.018036\n0.020621\n0.017267\n0.024291\n0.029683\n0.028320\n\n\nmin\n-0.128647\n-0.140494\n-0.111008\n-0.263901\n-0.351166\n-0.187559\n\n\n25%\n-0.007410\n-0.008635\n-0.006900\n-0.009610\n-0.012071\n-0.010938\n\n\n50%\n0.000892\n0.001050\n0.000700\n0.001051\n0.000544\n0.001918\n\n\n75%\n0.010324\n0.011363\n0.009053\n0.012580\n0.014678\n0.015202\n\n\nmax\n0.119808\n0.141311\n0.160524\n0.296115\n0.422235\n0.298067\n\n\n\n\n\n\n\n\n\nStep 3: Correlation\nAnd run a correlation with corr(). 
Click expand to get the code.\n\n\nCode\ncorr_table_df = returns_wide_df.drop('date', axis=1).corr()\ncorr_table_df\n\n\n\n\n\n\n\n\nsymbol\nAAPL\nAMZN\nGOOG\nMETA\nNFLX\nNVDA\n\n\nsymbol\n\n\n\n\n\n\n\n\n\n\nAAPL\n1.000000\n0.497906\n0.566452\n0.479787\n0.321694\n0.526508\n\n\nAMZN\n0.497906\n1.000000\n0.628103\n0.544481\n0.475078\n0.490234\n\n\nGOOG\n0.566452\n0.628103\n1.000000\n0.595728\n0.428470\n0.531382\n\n\nMETA\n0.479787\n0.544481\n0.595728\n1.000000\n0.407417\n0.450586\n\n\nNFLX\n0.321694\n0.475078\n0.428470\n0.407417\n1.000000\n0.380153\n\n\nNVDA\n0.526508\n0.490234\n0.531382\n0.450586\n0.380153\n1.000000\n\n\n\n\n\n\n\n\nThe problem is that the stock market is constantly changing. And these descriptive statistics aren’t representative of the most recent fluctuations. This is where pytimetk comes into play with rolling descriptive statistics.\n\n\n\nApplication: 90-Day Rolling Descriptive Statistics Analysis with tk.augment_rolling()\nLet’s compute and visualize the 90-day rolling statistics.\n\n\n\n\n\n\nGetting More Info: tk.augment_rolling()\n\n\n\n\n\n\nClick here to see our Augmenting Guide\nUse help(tk.augment_rolling) to review additional helpful documentation.\n\n\n\n\n\nStep 1: Long Format Pt.1\nUse this code to get the date melt() into long format. 
Click expand to get the code.\n\n\nCode\nreturns_long_df = returns_wide_df \\\n .melt(id_vars='date', value_name='returns') \n\nreturns_long_df\n\n\n\n\n\n\n\n\n\ndate\nsymbol\nreturns\n\n\n\n\n0\n2013-01-03\nAAPL\n-0.012622\n\n\n1\n2013-01-04\nAAPL\n-0.027854\n\n\n2\n2013-01-07\nAAPL\n-0.005883\n\n\n3\n2013-01-08\nAAPL\n0.002691\n\n\n4\n2013-01-09\nAAPL\n-0.015629\n\n\n...\n...\n...\n...\n\n\n16183\n2023-09-15\nNVDA\n-0.036879\n\n\n16184\n2023-09-18\nNVDA\n0.001503\n\n\n16185\n2023-09-19\nNVDA\n-0.010144\n\n\n16186\n2023-09-20\nNVDA\n-0.029435\n\n\n16187\n2023-09-21\nNVDA\n-0.028931\n\n\n\n\n16188 rows × 3 columns\n\n\n\n\n\nStep 2: Augment Rolling Statistic\nLet’s add multiple columns of rolling statistics. Click to expand the code.\n\n\nCode\nrolling_stats_df = returns_long_df \\\n .groupby('symbol') \\\n .augment_rolling(\n date_column = 'date',\n value_column = 'returns',\n window = [90],\n window_func = [\n 'mean', \n 'std', \n 'min',\n ('q25', lambda x: np.quantile(x, 0.25)),\n 'median',\n ('q75', lambda x: np.quantile(x, 0.75)),\n 'max'\n ],\n threads = 1 # Change to -1 to use all threads\n ) \\\n 
.dropna()\n\nrolling_stats_df\n\n\n\n\n\n\n\n\n\n\n\n\ndate\nsymbol\nreturns\nreturns_rolling_mean_win_90\nreturns_rolling_std_win_90\nreturns_rolling_min_win_90\nreturns_rolling_q25_win_90\nreturns_rolling_median_win_90\nreturns_rolling_q75_win_90\nreturns_rolling_max_win_90\n\n\n\n\n89\n2013-05-13\nAAPL\n0.003908\n-0.001702\n0.022233\n-0.123558\n-0.010533\n-0.001776\n0.012187\n0.041509\n\n\n90\n2013-05-14\nAAPL\n-0.023926\n-0.001827\n0.022327\n-0.123558\n-0.010533\n-0.001776\n0.012187\n0.041509\n\n\n91\n2013-05-15\nAAPL\n-0.033817\n-0.001894\n0.022414\n-0.123558\n-0.010533\n-0.001776\n0.012187\n0.041509\n\n\n92\n2013-05-16\nAAPL\n0.013361\n-0.001680\n0.022467\n-0.123558\n-0.010533\n-0.001360\n0.013120\n0.041509\n\n\n93\n2013-05-17\nAAPL\n-0.003037\n-0.001743\n0.022462\n-0.123558\n-0.010533\n-0.001776\n0.013120\n0.041509\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n16183\n2023-09-15\nNVDA\n-0.036879\n0.005159\n0.036070\n-0.056767\n-0.012587\n-0.000457\n0.018480\n0.243696\n\n\n16184\n2023-09-18\nNVDA\n0.001503\n0.005396\n0.035974\n-0.056767\n-0.011117\n0.000177\n0.018480\n0.243696\n\n\n16185\n2023-09-19\nNVDA\n-0.010144\n0.005162\n0.036006\n-0.056767\n-0.011117\n-0.000457\n0.018480\n0.243696\n\n\n16186\n2023-09-20\nNVDA\n-0.029435\n0.004953\n0.036153\n-0.056767\n-0.012587\n-0.000457\n0.018480\n0.243696\n\n\n16187\n2023-09-21\nNVDA\n-0.028931\n0.004724\n0.036303\n-0.056767\n-0.013166\n-0.000457\n0.018480\n0.243696\n\n\n\n\n15654 rows × 10 columns\n\n\n\n\n\nStep 3: Long Format Pt.2\nFinally, we can .melt() each of the rolling statistics for a Long Format Analysis. 
Click to expand the code.\n\n\nCode\nrolling_stats_long_df = rolling_stats_df \\\n .melt(\n id_vars = [\"symbol\", \"date\"],\n var_name = \"statistic_type\"\n )\n\nrolling_stats_long_df\n\n\n\n\n\n\n\n\n\nsymbol\ndate\nstatistic_type\nvalue\n\n\n\n\n0\nAAPL\n2013-05-13\nreturns\n0.003908\n\n\n1\nAAPL\n2013-05-14\nreturns\n-0.023926\n\n\n2\nAAPL\n2013-05-15\nreturns\n-0.033817\n\n\n3\nAAPL\n2013-05-16\nreturns\n0.013361\n\n\n4\nAAPL\n2013-05-17\nreturns\n-0.003037\n\n\n...\n...\n...\n...\n...\n\n\n125227\nNVDA\n2023-09-15\nreturns_rolling_max_win_90\n0.243696\n\n\n125228\nNVDA\n2023-09-18\nreturns_rolling_max_win_90\n0.243696\n\n\n125229\nNVDA\n2023-09-19\nreturns_rolling_max_win_90\n0.243696\n\n\n125230\nNVDA\n2023-09-20\nreturns_rolling_max_win_90\n0.243696\n\n\n125231\nNVDA\n2023-09-21\nreturns_rolling_max_win_90\n0.243696\n\n\n\n\n125232 rows × 4 columns\n\n\n\nWith the data formatted properly we can evaluate the 90-Day Rolling Statistics using .plot_timeseries().\n\nPlotlyPlotnine\n\n\n\n\nCode\nrolling_stats_long_df \\\n .groupby(['symbol', 'statistic_type']) \\\n .plot_timeseries(\n date_column = 'date',\n value_column = 'value',\n facet_ncol = 6,\n width = 1500,\n height = 1000,\n title = \"90-Day Rolling Statistics\"\n )\n\n\n\n \n\n\n\n\n\n\nCode\nrolling_stats_long_df \\\n .groupby(['symbol', 'statistic_type']) \\\n .plot_timeseries(\n date_column = 'date',\n value_column = 'value',\n facet_ncol = 6,\n facet_dir = 'v',\n width = 1500,\n height = 1000,\n title = \"90-Day Rolling Statistics\",\n engine = \"plotnine\"\n )\n\n\n\n\n\n<Figure Size: (1500 x 1000)>" + "objectID": "tutorials/07_timeseries_crossvalidation.html#step-3-set-up-timeseriescv-for-cross-validation", + "href": "tutorials/07_timeseries_crossvalidation.html#step-3-set-up-timeseriescv-for-cross-validation", + "title": "Time Series Cross Validation", + "section": "2.3 Step 3: Set Up TimeSeriesCV for Cross-Validation", + "text": "2.3 Step 3: Set Up TimeSeriesCV for Cross-Validation\nNow, 
let’s set up a time-based cross-validation scheme using TimeSeriesCV:\n\nfrom pytimetk.crossvalidation import TimeSeriesCV\n\n# Define parameters for TimeSeriesCV\ntscv = TimeSeriesCV(\n frequency=\"weeks\",\n train_size=52, # Use 52 weeks for training\n forecast_horizon=12, # Forecast 12 weeks ahead\n gap=0, # No gap between training and forecast sets\n stride=4, # Move forward by 4 weeks after each split\n window=\"rolling\", # Use a rolling window\n mode=\"backward\" # Generate splits from end to start\n)\n\n# Glimpse the cross-validation splits\ntscv.glimpse(\n walmart_sales_df['Weekly_Sales'], \n time_series=walmart_sales_df['Date']\n)\n\nSplit Number: 1\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-08-05 00:00:00 to 2012-07-27 00:00:00\nForecast Period: 2012-08-03 00:00:00 to 2012-10-19 00:00:00\n\nSplit Number: 2\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-07-08 00:00:00 to 2012-06-29 00:00:00\nForecast Period: 2012-07-06 00:00:00 to 2012-09-21 00:00:00\n\nSplit Number: 3\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-06-10 00:00:00 to 2012-06-01 00:00:00\nForecast Period: 2012-06-08 00:00:00 to 2012-08-24 00:00:00\n\nSplit Number: 4\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-05-13 00:00:00 to 2012-05-04 00:00:00\nForecast Period: 2012-05-11 00:00:00 to 2012-07-27 00:00:00\n\nSplit Number: 5\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-04-15 00:00:00 to 2012-04-06 00:00:00\nForecast Period: 2012-04-13 00:00:00 to 2012-06-29 00:00:00\n\nSplit Number: 6\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-03-18 00:00:00 to 2012-03-09 00:00:00\nForecast Period: 2012-03-16 00:00:00 to 2012-06-01 00:00:00\n\nSplit Number: 7\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2011-02-18 00:00:00 to 2012-02-10 00:00:00\nForecast Period: 2012-02-17 00:00:00 to 2012-05-04 00:00:00\n\nSplit Number: 8\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain 
Period: 2011-01-21 00:00:00 to 2012-01-13 00:00:00\nForecast Period: 2012-01-20 00:00:00 to 2012-04-06 00:00:00\n\nSplit Number: 9\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-12-24 00:00:00 to 2011-12-16 00:00:00\nForecast Period: 2011-12-23 00:00:00 to 2012-03-09 00:00:00\n\nSplit Number: 10\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-11-26 00:00:00 to 2011-11-18 00:00:00\nForecast Period: 2011-11-25 00:00:00 to 2012-02-10 00:00:00\n\nSplit Number: 11\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-10-29 00:00:00 to 2011-10-21 00:00:00\nForecast Period: 2011-10-28 00:00:00 to 2012-01-13 00:00:00\n\nSplit Number: 12\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-10-01 00:00:00 to 2011-09-23 00:00:00\nForecast Period: 2011-09-30 00:00:00 to 2011-12-16 00:00:00\n\nSplit Number: 13\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-09-03 00:00:00 to 2011-08-26 00:00:00\nForecast Period: 2011-09-02 00:00:00 to 2011-11-18 00:00:00\n\nSplit Number: 14\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-08-06 00:00:00 to 2011-07-29 00:00:00\nForecast Period: 2011-08-05 00:00:00 to 2011-10-21 00:00:00\n\nSplit Number: 15\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-07-09 00:00:00 to 2011-07-01 00:00:00\nForecast Period: 2011-07-08 00:00:00 to 2011-09-23 00:00:00\n\nSplit Number: 16\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-06-11 00:00:00 to 2011-06-03 00:00:00\nForecast Period: 2011-06-10 00:00:00 to 2011-08-26 00:00:00\n\nSplit Number: 17\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-05-14 00:00:00 to 2011-05-06 00:00:00\nForecast Period: 2011-05-13 00:00:00 to 2011-07-29 00:00:00\n\nSplit Number: 18\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-04-16 00:00:00 to 2011-04-08 00:00:00\nForecast Period: 2011-04-15 00:00:00 to 2011-07-01 00:00:00\n\nSplit Number: 19\nTrain Shape: (364,), Forecast Shape: 
(84,)\nTrain Period: 2010-03-19 00:00:00 to 2011-03-11 00:00:00\nForecast Period: 2011-03-18 00:00:00 to 2011-06-03 00:00:00\n\nSplit Number: 20\nTrain Shape: (364,), Forecast Shape: (84,)\nTrain Period: 2010-02-19 00:00:00 to 2011-02-11 00:00:00\nForecast Period: 2011-02-18 00:00:00 to 2011-05-06 00:00:00\n\n\n\nThe glimpse method provides a summary of each cross-validation fold, including the start and end dates of the training and forecast periods." }, { - "objectID": "tutorials/02_finance.html#about-rolling-correlation", - "href": "tutorials/02_finance.html#about-rolling-correlation", - "title": "Finance Analysis", - "section": "5.1 About: Rolling Correlation", - "text": "5.1 About: Rolling Correlation\nRolling correlation calculates the correlation between two time series over a rolling window of a specified size, moving one period at a time. In stock analysis, this is often used to assess:\n\nDiversification: Helps in identifying how different stocks move in relation to each other, aiding in the creation of a diversified portfolio.\nMarket Dependency: Measures how a particular stock or sector is correlated with a broader market index.\nRisk Management: Helps in identifying changes in correlation structures over time which is crucial for risk assessment and management.\n\nFor example, if the rolling correlation between two stocks starts increasing, it might suggest that they are being influenced by similar factors or market conditions." 
+ "objectID": "tutorials/07_timeseries_crossvalidation.html#step-4-plot-the-cross-validation-splits", + "href": "tutorials/07_timeseries_crossvalidation.html#step-4-plot-the-cross-validation-splits", + "title": "Time Series Cross Validation", + "section": "2.4 Step 4: Plot the Cross-Validation Splits", + "text": "2.4 Step 4: Plot the Cross-Validation Splits\nYou can visualize how the data is split for training and testing:\n\n# Plot the cross-validation splits\ntscv.plot(\n walmart_sales_df['Weekly_Sales'], \n time_series=walmart_sales_df['Date']\n)\n\n\n \n\n\nThis plot will show each fold, illustrating which weeks are used for training and which weeks are used for forecasting." }, { - "objectID": "tutorials/02_finance.html#application-rolling-correlation", - "href": "tutorials/02_finance.html#application-rolling-correlation", - "title": "Finance Analysis", - "section": "5.2 Application: Rolling Correlation", - "text": "5.2 Application: Rolling Correlation\nLet’s revisit the returns wide and long format. 
We can combine these two using the merge() method.\n\nStep 1: Create the return_combinations_long_df\nPerform data wrangling to get the pairwise combinations in long format:\n\nWe first .merge() to join the long returns with the wide returns by date.\nWe then .melt() to get the wide data into long format.\n\n\n\nCode\nreturn_combinations_long_df = returns_long_df \\\n .merge(returns_wide_df, how='left', on = 'date') \\\n .melt(\n id_vars = ['date', 'symbol', 'returns'],\n var_name = \"comp\",\n value_name = \"returns_comp\"\n )\nreturn_combinations_long_df\n\n\n\n\n\n\n\n\n\ndate\nsymbol\nreturns\ncomp\nreturns_comp\n\n\n\n\n0\n2013-01-03\nAAPL\n-0.012622\nAAPL\n-0.012622\n\n\n1\n2013-01-04\nAAPL\n-0.027854\nAAPL\n-0.027854\n\n\n2\n2013-01-07\nAAPL\n-0.005883\nAAPL\n-0.005883\n\n\n3\n2013-01-08\nAAPL\n0.002691\nAAPL\n0.002691\n\n\n4\n2013-01-09\nAAPL\n-0.015629\nAAPL\n-0.015629\n\n\n...\n...\n...\n...\n...\n...\n\n\n97123\n2023-09-15\nNVDA\n-0.036879\nNVDA\n-0.036879\n\n\n97124\n2023-09-18\nNVDA\n0.001503\nNVDA\n0.001503\n\n\n97125\n2023-09-19\nNVDA\n-0.010144\nNVDA\n-0.010144\n\n\n97126\n2023-09-20\nNVDA\n-0.029435\nNVDA\n-0.029435\n\n\n97127\n2023-09-21\nNVDA\n-0.028931\nNVDA\n-0.028931\n\n\n\n\n97128 rows × 5 columns\n\n\n\n\n\nStep 2: Add Rolling Correlations with tk.augment_rolling_apply()\nNext, let’s add rolling correlations.\n\nWe first .groupby() on the combination of our target assets “symbol” and our comparison asset “comp”.\nThen we use a different function, tk.augment_rolling_apply().\n\n\n\n\n\n\n\ntk.augment_rolling() vs tk.augment_rolling_apply()\n\n\n\n\n\n\nFor the vast majority of operations, tk.augment_rolling() will suffice. It’s used on a single column where there is a simple rolling transformation applied to only the value_column.\nFor more complex cases where other columns beyond a value_column are needed (e.g. 
rolling correlations, rolling regressions), the tk.augment_rolling_apply() comes to the rescue.\ntk.augment_rolling_apply() exposes the group’s columns as a DataFrame to window function, thus allowing for multi-column analysis.\n\n\n\n\n\n\n\n\n\n\ntk.augment_rolling_apply() has no value_column\n\n\n\n\n\nThis is because the rolling apply passes a DataFrame containing all columns to the custom function. The custom function is then responsible for handling the columns internally. This is how you can select multiple columns to work with.\n\n\n\n\n\nCode\nreturn_corr_df = return_combinations_long_df \\\n .groupby([\"symbol\", \"comp\"]) \\\n .augment_rolling_apply(\n date_column = \"date\",\n window = 90,\n window_func=[('corr', lambda x: x['returns'].corr(x['returns_comp']))],\n threads = 1, # Change to -1 to use all available cores\n )\n\nreturn_corr_df\n\n\n\n\n\n\n\n\n\n\n\n\ndate\nsymbol\nreturns\ncomp\nreturns_comp\nrolling_corr_win_90\n\n\n\n\n0\n2013-01-03\nAAPL\n-0.012622\nAAPL\n-0.012622\nNaN\n\n\n1\n2013-01-04\nAAPL\n-0.027854\nAAPL\n-0.027854\nNaN\n\n\n2\n2013-01-07\nAAPL\n-0.005883\nAAPL\n-0.005883\nNaN\n\n\n3\n2013-01-08\nAAPL\n0.002691\nAAPL\n0.002691\nNaN\n\n\n4\n2013-01-09\nAAPL\n-0.015629\nAAPL\n-0.015629\nNaN\n\n\n...\n...\n...\n...\n...\n...\n...\n\n\n97123\n2023-09-15\nNVDA\n-0.036879\nNVDA\n-0.036879\n1.0\n\n\n97124\n2023-09-18\nNVDA\n0.001503\nNVDA\n0.001503\n1.0\n\n\n97125\n2023-09-19\nNVDA\n-0.010144\nNVDA\n-0.010144\n1.0\n\n\n97126\n2023-09-20\nNVDA\n-0.029435\nNVDA\n-0.029435\n1.0\n\n\n97127\n2023-09-21\nNVDA\n-0.028931\nNVDA\n-0.028931\n1.0\n\n\n\n\n97128 rows × 6 columns\n\n\n\n\n\nStep 3: Visualize the Rolling Correlation\nWe can use tk.plot_timeseries() to visualize the 90-day rolling correlation. 
It’s interesting to see that stock combinations such as AAPL | AMZN returns have a high positive correlation of 0.80, but this relationship was much lower 0.25 before 2015.\n\nThe blue smoother can help us detect trends\nThe y_intercept is useful in this case to draw lines at -1, 0, and 1\n\n\nPlotlyPlotnine\n\n\n\n\nCode\nreturn_corr_df \\\n .dropna() \\\n .groupby(['symbol', 'comp']) \\\n .plot_timeseries(\n date_column = \"date\",\n value_column = \"rolling_corr_win_90\",\n facet_ncol = 6,\n y_intercept = [-1,0,1],\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 1500,\n height = 1000,\n title = \"90-Day Rolling Correlation\",\n engine = \"plotly\"\n )\n\n\n\n \n\n\n\n\n\n\nCode\nreturn_corr_df \\\n .dropna() \\\n .groupby(['symbol', 'comp']) \\\n .plot_timeseries(\n date_column = \"date\",\n value_column = \"rolling_corr_win_90\",\n facet_ncol = 6,\n y_intercept = [-1,0,1],\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 1500,\n height = 1000,\n title = \"90-Day Rolling Correlation\",\n engine = \"plotnine\"\n )\n\n\n\n\n\n<Figure Size: (1500 x 1000)>\n\n\n\n\n\nFor comparison, we can examine the corr_table_df from the Descriptive Statistics Analysis:\n\nNotice that the values tend not to match the most recent trends\nFor example APPL | AMZN is correlated at 0.49 over the entire time period. 
But more recently this correlation has dropped to 0.17 in the 90-Day Rolling Correlation chart.\n\n\n\nCode\ncorr_table_df\n\n\n\n\n\n\n\n\nsymbol\nAAPL\nAMZN\nGOOG\nMETA\nNFLX\nNVDA\n\n\nsymbol\n\n\n\n\n\n\n\n\n\n\nAAPL\n1.000000\n0.497906\n0.566452\n0.479787\n0.321694\n0.526508\n\n\nAMZN\n0.497906\n1.000000\n0.628103\n0.544481\n0.475078\n0.490234\n\n\nGOOG\n0.566452\n0.628103\n1.000000\n0.595728\n0.428470\n0.531382\n\n\nMETA\n0.479787\n0.544481\n0.595728\n1.000000\n0.407417\n0.450586\n\n\nNFLX\n0.321694\n0.475078\n0.428470\n0.407417\n1.000000\n0.380153\n\n\nNVDA\n0.526508\n0.490234\n0.531382\n0.450586\n0.380153\n1.000000" + "objectID": "tutorials/07_timeseries_crossvalidation.html#step-1-setting-up-the-timeseriescvsplitter", + "href": "tutorials/07_timeseries_crossvalidation.html#step-1-setting-up-the-timeseriescvsplitter", + "title": "Time Series Cross Validation", + "section": "3.1 Step 1: Setting Up the TimeSeriesCVSplitter", + "text": "3.1 Step 1: Setting Up the TimeSeriesCVSplitter\nThe TimeSeriesCVSplitter helps us divide our dataset into training and forecast sets in a rolling window fashion. Here’s how we configure it:\n\nfrom pytimetk.crossvalidation import TimeSeriesCVSplitter\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.model_selection import cross_val_score\n\n# Set up TimeSeriesCVSplitter\ncv_splitter = TimeSeriesCVSplitter(\n time_series=walmart_sales_df['Date'],\n frequency=\"weeks\",\n train_size=52*2,\n forecast_horizon=12,\n gap=0,\n stride=4,\n window=\"rolling\",\n mode=\"backward\",\n split_limit = 5\n)\n\n# Visualize the TSCV Strategy\ncv_splitter.splitter.plot(walmart_sales_df['Weekly_Sales'], walmart_sales_df['Date'])\n\n\n \n\n\nThe TimeSeriesCVSplitter creates multiple splits of the time series data, allowing us to validate the model across different periods. By visualizing the cross-validation strategy, we can see how the training and forecast sets are structured." 
}, { - "objectID": "tutorials/02_finance.html#about-rolling-regression", - "href": "tutorials/02_finance.html#about-rolling-regression", - "title": "Finance Analysis", - "section": "5.3 About: Rolling Regression", - "text": "5.3 About: Rolling Regression\nRolling regression involves running regression analyses over rolling windows of data points to assess the relationship between a dependent and one or more independent variables. In the context of stock analysis, it can be used to:\n\nBeta Estimation: It can be used to estimate the beta of a stock (a measure of market risk) against a market index over different time periods. A higher beta indicates higher market-related risk.\nMarket Timing: It can be useful in identifying changing relationships between stocks and market indicators, helping traders to adjust their positions accordingly.\nHedge Ratio Determination: It helps in determining the appropriate hedge ratios for pairs trading or other hedging strategies." + "objectID": "tutorials/07_timeseries_crossvalidation.html#step-2-feature-engineering-for-time-series-data", + "href": "tutorials/07_timeseries_crossvalidation.html#step-2-feature-engineering-for-time-series-data", + "title": "Time Series Cross Validation", + "section": "3.2 Step 2: Feature Engineering for Time Series Data", + "text": "3.2 Step 2: Feature Engineering for Time Series Data\nEffective feature engineering can significantly impact the performance of a time series model. 
Using pytimetk, we extract a variety of features from the Date column.\n\nGenerating Time Series Features\nWe use get_timeseries_signature to generate useful features, such as year, quarter, month, and day-of-week indicators.\n\n# Prepare data for modeling\n\n# Extract time series features from the 'Date' column\nX_time_features = tk.get_timeseries_signature(walmart_sales_df['Date'])\n\n# Select features to dummy encode\nfeatures_to_dummy = ['Date_quarteryear', 'Date_month_lbl', 'Date_wday_lbl', 'Date_am_pm']\n\n# Dummy encode the selected features\nX_time_dummies = pd.get_dummies(X_time_features[features_to_dummy], drop_first=True)\n\n# Dummy encode the 'id' column\nX_id_dummies = pd.get_dummies(walmart_sales_df['id'], prefix='store')\n\n# Combine the time series features, dummy-encoded features, and the 'id' dummies\nX = pd.concat([X_time_features, X_time_dummies, X_id_dummies], axis=1)\n\n# Drop the original categorical columns that were dummy encoded\nX = X.drop(columns=features_to_dummy).drop('Date', axis=1)\n\n# Set the target variable\ny = walmart_sales_df['Weekly_Sales'].values" }, { - "objectID": "tutorials/02_finance.html#application-90-day-rolling-regression", - "href": "tutorials/02_finance.html#application-90-day-rolling-regression", - "title": "Finance Analysis", - "section": "5.4 Application: 90-Day Rolling Regression", - "text": "5.4 Application: 90-Day Rolling Regression\n\n\n\n\n\n\nThis Application Requires Scikit Learn\n\n\n\n\n\nWe need to make a regression function that returns the Slope and Intercept. Scikit Learn has an easy-to-use modeling interface. 
You may need to pip install scikit-learn to use this applied tutorial.\n\n\n\n\nStep 1: Get Market Returns\nFor our purposes, we assume the market is the average returns of the 6 technology stocks.\n\nWe calculate an equal-weight portfolio as the “market returns”.\nThen we merge the market returns into the returns long data.\n\n\n\nCode\n# Assume Market Returns = Equal Weight Portfolio\nmarket_returns_df = returns_wide_df \\\n .set_index(\"date\") \\\n .assign(returns_market = lambda df: df.sum(axis = 1) * (1 / df.shape[1])) \\\n .reset_index() \\\n [['date', 'returns_market']]\n\n# Merge with returns long\nreturns_long_market_df = returns_long_df \\\n .merge(market_returns_df, how='left', on='date')\n\nreturns_long_market_df\n\n\n\n\n\n\n\n\n\ndate\nsymbol\nreturns\nreturns_market\n\n\n\n\n0\n2013-01-03\nAAPL\n-0.012622\n0.005809\n\n\n1\n2013-01-04\nAAPL\n-0.027854\n0.009471\n\n\n2\n2013-01-07\nAAPL\n-0.005883\n0.008880\n\n\n3\n2013-01-08\nAAPL\n0.002691\n-0.010293\n\n\n4\n2013-01-09\nAAPL\n-0.015629\n0.001366\n\n\n...\n...\n...\n...\n...\n\n\n16183\n2023-09-15\nNVDA\n-0.036879\n-0.020231\n\n\n16184\n2023-09-18\nNVDA\n0.001503\n0.003555\n\n\n16185\n2023-09-19\nNVDA\n-0.010144\n-0.001466\n\n\n16186\n2023-09-20\nNVDA\n-0.029435\n-0.023276\n\n\n16187\n2023-09-21\nNVDA\n-0.028931\n-0.020764\n\n\n\n\n16188 rows × 4 columns\n\n\n\n\n\nStep 2: Run a Rolling Regression\nNext, run the following code to perform a rolling regression:\n\nUse a custom regression function that will return the slope and intercept as a pandas series.\nRun the rolling regression with tk.augment_rolling_apply().\n\n\n\nCode\ndef regression(df):\n \n # External functions must \n from sklearn.linear_model import LinearRegression\n\n model = LinearRegression()\n X = df[['returns_market']] # Extract X values (independent variables)\n y = df['returns'] # Extract y values (dependent variable)\n model.fit(X, y)\n ret = pd.Series([model.intercept_, model.coef_[0]], index=['Intercept', 'Slope'])\n \n return 
ret # Return intercept and slope as a Series\n\nreturn_regression_df = returns_long_market_df \\\n .groupby('symbol') \\\n .augment_rolling_apply(\n date_column = \"date\",\n window = 90,\n window_func = [('regression', regression)],\n threads = 1, # Change to -1 to use all available cores \n ) \\\n .dropna()\n\nreturn_regression_df\n\n\n\n\n\n\n\n\n\n\n\n\ndate\nsymbol\nreturns\nreturns_market\nrolling_regression_win_90\n\n\n\n\n89\n2013-05-13\nAAPL\n0.003908\n0.007082\nIntercept -0.001844 Slope 0.061629 dt...\n\n\n90\n2013-05-14\nAAPL\n-0.023926\n0.007583\nIntercept -0.001959 Slope 0.056540 dt...\n\n\n91\n2013-05-15\nAAPL\n-0.033817\n0.005381\nIntercept -0.002036 Slope 0.062330 dt...\n\n\n92\n2013-05-16\nAAPL\n0.013361\n-0.009586\nIntercept -0.001789 Slope 0.052348 dt...\n\n\n93\n2013-05-17\nAAPL\n-0.003037\n0.009005\nIntercept -0.001871 Slope 0.055661 dt...\n\n\n...\n...\n...\n...\n...\n...\n\n\n16183\n2023-09-15\nNVDA\n-0.036879\n-0.020231\nIntercept 0.000100 Slope 1.805479 dt...\n\n\n16184\n2023-09-18\nNVDA\n0.001503\n0.003555\nIntercept 0.000207 Slope 1.800813 dt...\n\n\n16185\n2023-09-19\nNVDA\n-0.010144\n-0.001466\nIntercept 0.000301 Slope 1.817878 dt...\n\n\n16186\n2023-09-20\nNVDA\n-0.029435\n-0.023276\nIntercept 0.000845 Slope 1.825818 dt...\n\n\n16187\n2023-09-21\nNVDA\n-0.028931\n-0.020764\nIntercept 0.000901 Slope 1.818710 dt...\n\n\n\n\n15654 rows × 5 columns\n\n\n\n\n\nStep 3: Extract the Slope Coefficient (Beta)\nThis is more of a hack than anything to extract the beta (slope) of the rolling regression.\n\n\nCode\nintercept_slope_df = pd.concat(return_regression_df['rolling_regression_win_90'].to_list(), axis=1).T \n\nintercept_slope_df.index = return_regression_df.index\n\nreturn_beta_df = pd.concat([return_regression_df, intercept_slope_df], axis=1)\n\nreturn_beta_df\n\n\n\n\n\n\n\n\n\ndate\nsymbol\nreturns\nreturns_market\nrolling_regression_win_90\nIntercept\nSlope\n\n\n\n\n89\n2013-05-13\nAAPL\n0.003908\n0.007082\nIntercept -0.001844 Slope 
0.061629 dt...\n-0.001844\n0.061629\n\n\n90\n2013-05-14\nAAPL\n-0.023926\n0.007583\nIntercept -0.001959 Slope 0.056540 dt...\n-0.001959\n0.056540\n\n\n91\n2013-05-15\nAAPL\n-0.033817\n0.005381\nIntercept -0.002036 Slope 0.062330 dt...\n-0.002036\n0.062330\n\n\n92\n2013-05-16\nAAPL\n0.013361\n-0.009586\nIntercept -0.001789 Slope 0.052348 dt...\n-0.001789\n0.052348\n\n\n93\n2013-05-17\nAAPL\n-0.003037\n0.009005\nIntercept -0.001871 Slope 0.055661 dt...\n-0.001871\n0.055661\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n16183\n2023-09-15\nNVDA\n-0.036879\n-0.020231\nIntercept 0.000100 Slope 1.805479 dt...\n0.000100\n1.805479\n\n\n16184\n2023-09-18\nNVDA\n0.001503\n0.003555\nIntercept 0.000207 Slope 1.800813 dt...\n0.000207\n1.800813\n\n\n16185\n2023-09-19\nNVDA\n-0.010144\n-0.001466\nIntercept 0.000301 Slope 1.817878 dt...\n0.000301\n1.817878\n\n\n16186\n2023-09-20\nNVDA\n-0.029435\n-0.023276\nIntercept 0.000845 Slope 1.825818 dt...\n0.000845\n1.825818\n\n\n16187\n2023-09-21\nNVDA\n-0.028931\n-0.020764\nIntercept 0.000901 Slope 1.818710 dt...\n0.000901\n1.818710\n\n\n\n\n15654 rows × 7 columns\n\n\n\n\n\nStep 4: Visualize the Rolling Beta\n\nPlotlyPlotnine\n\n\n\n\nCode\nreturn_beta_df \\\n .groupby('symbol') \\\n .plot_timeseries(\n date_column = \"date\",\n value_column = \"Slope\",\n facet_ncol = 2,\n facet_scales = \"free_x\",\n y_intercept = [0, 3],\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 800,\n height = 600,\n title = \"90-Day Rolling Regression\",\n engine = \"plotly\",\n )\n\n\n\n \n\n\n\n\n\n\nCode\nreturn_beta_df \\\n .groupby('symbol') \\\n .plot_timeseries(\n date_column = \"date\",\n value_column = \"Slope\",\n facet_ncol = 2,\n facet_scales = \"free_x\",\n y_intercept = [0, 3],\n y_intercept_color = tk.palette_timetk()['steel_blue'],\n width = 800,\n height = 600,\n title = \"90-Day Rolling Regression\",\n engine = \"plotnine\",\n )\n\n\n\n\n\n<Figure Size: (800 x 600)>" + "objectID": 
"tutorials/07_timeseries_crossvalidation.html#step-3-model-training-and-evaluation-with-random-forest", + "href": "tutorials/07_timeseries_crossvalidation.html#step-3-model-training-and-evaluation-with-random-forest", + "title": "Time Series Cross Validation", + "section": "3.3 Step 3: Model Training and Evaluation with Random Forest", + "text": "3.3 Step 3: Model Training and Evaluation with Random Forest\nFor this example, we use RandomForestRegressor from scikit-learn to model the time series data. A random forest is a robust, ensemble-based model that can handle a wide range of regression tasks.\n\n# Initialize the RandomForestRegressor model\nmodel = RandomForestRegressor(\n n_estimators=100, # Number of trees in the forest\n max_depth=None, # Maximum depth of the trees (None means nodes are expanded until all leaves are pure)\n random_state=42 # Set a random state for reproducibility\n)\n\n# Evaluate the model using cross-validation scores\nscores = cross_val_score(model, X, y, cv=cv_splitter, scoring='neg_mean_squared_error')\n\n# Print cross-validation scores\nprint(\"Cross-Validation Scores (Negative MSE):\", scores)\n\nCross-Validation Scores (Negative MSE): [-23761708.80112538 -23107644.58461143 -21728878.18790144\n -25113860.93913386 -86192034.48953015]" + }, + { + "objectID": "tutorials/07_timeseries_crossvalidation.html#step-4-visualizing-the-forecast", + "href": "tutorials/07_timeseries_crossvalidation.html#step-4-visualizing-the-forecast", + "title": "Time Series Cross Validation", + "section": "3.4 Step 4: Visualizing the Forecast", + "text": "3.4 Step 4: Visualizing the Forecast\nVisualization is crucial to understand how well the model predicts future values. 
We collect the actual and predicted values for each fold and combine them for easy plotting.\n\n# Lists to store the combined data\ncombined_data = []\n\n# Iterate through each fold and collect the data\nfor i, (train_index, test_index) in enumerate(cv_splitter.split(X, y), start=1):\n # Get the training and forecast data from the original DataFrame\n train_df = walmart_sales_df.iloc[train_index].copy()\n test_df = walmart_sales_df.iloc[test_index].copy()\n \n # Fit the model on the training data\n model.fit(X.iloc[train_index], y[train_index])\n \n # Predict on the test set\n y_pred = model.predict(X.iloc[test_index])\n \n # Add the actual and predicted values\n train_df['Actual'] = y[train_index]\n train_df['Predicted'] = None # No predictions for training data\n train_df['Fold'] = i # Indicate the current fold\n \n test_df['Actual'] = y[test_index]\n test_df['Predicted'] = y_pred # Predictions for the test data\n test_df['Fold'] = i # Indicate the current fold\n \n # Append both the training and forecast DataFrames to the combined data list\n combined_data.extend([train_df, test_df])\n\n# Combine all the data into a single DataFrame\nfull_forecast_df = pd.concat(combined_data, ignore_index=True)\n\nfull_forecast_df = full_forecast_df[['id', 'Date', 'Actual', 'Predicted', 'Fold']]\n\nfull_forecast_df.glimpse()\n\n<class 'pandas.core.frame.DataFrame'>: 4060 rows of 5 columns\nid: object ['1_1', '1_1', '1_1', '1_1', '1_1', '1_1', ...\nDate: datetime64[ns] [Timestamp('2010-08-06 00:00:00'), Timesta ...\nActual: float64 [17508.41, 15536.4, 15740.13, 15793.87, 16 ...\nPredicted: float64 [nan, nan, nan, nan, nan, nan, nan, nan, n ...\nFold: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\n\n\n\nPreparing Data for Visualization\nTo make the data easier to plot, we use pd.melt() to transform the Actual and Predicted columns into a long format.\n\n# Melt the Actual and Predicted columns\nmelted_df = pd.melt(\n full_forecast_df,\n id_vars=['id', 'Date', 'Fold'], # 
Columns to keep\n value_vars=['Actual', 'Predicted'], # Columns to melt\n var_name='Type', # Name for the new column indicating 'Actual' or 'Predicted'\n value_name='Value' # Name for the new column with the values\n)\n\nmelted_df[\"unique_id\"] = \"ID_\" + melted_df['id'] + \"-Fold_\" + melted_df[\"Fold\"].astype(str)\n\nmelted_df.glimpse()\n\n<class 'pandas.core.frame.DataFrame'>: 8120 rows of 6 columns\nid: object ['1_1', '1_1', '1_1', '1_1', '1_1', '1_1', ...\nDate: datetime64[ns] [Timestamp('2010-08-06 00:00:00'), Timesta ...\nFold: int64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\nType: object ['Actual', 'Actual', 'Actual', 'Actual', ' ...\nValue: float64 [17508.41, 15536.4, 15740.13, 15793.87, 16 ...\nunique_id: object ['ID_1_1-Fold_1', 'ID_1_1-Fold_1', 'ID_1_1 ...\n\n\n\n\nPlotting the Forecasts\nFinally, we use plot_timeseries() to visualize the forecasts, comparing the actual and predicted values for each fold.\n\nmelted_df \\\n .groupby('unique_id') \\\n .plot_timeseries(\n \"Date\", \"Value\",\n color_column = \"Type\",\n smooth=False, \n plotly_dropdown=True\n )" }, { "objectID": "changelog-news.html", @@ -1922,7 +1922,7 @@ "href": "reference/plot_timeseries.html#examples", "title": "plot_timeseries", "section": "Examples", - "text": "Examples\n\nimport pytimetk as tk\n\ndf = tk.load_dataset('m4_monthly', parse_dates = ['date'])\n\n# Plotly Object: Single Time Series\nfig = (\n df\n .query('id == \"M750\"')\n .plot_timeseries(\n 'date', 'value', \n facet_ncol = 1,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n )\n)\nfig\n\n\n \n\n\n\n# Plotly Object: Grouped Time Series (Facets)\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value', \n facet_ncol = 2,\n facet_scales = \"free_y\",\n smooth_frac = 0.2,\n smooth_size = 2.0,\n y_intercept = None,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n width = 600,\n height = 500,\n )\n)\nfig\n\n\n \n\n\n\n# Plotly Object: Grouped Time Series (Plotly Dropdown)\nfig = (\n df\n 
.groupby('id')\n .plot_timeseries(\n 'date', 'value', \n facet_scales = \"free_y\",\n smooth_frac = 0.2,\n smooth_size = 2.0,\n y_intercept = None,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n width = 600,\n height = 500,\n plotly_dropdown = True, # Plotly Dropdown\n )\n)\nfig\n\n\n \n\n\n\n# Plotly Object: Color Column\nfig = (\n df\n .plot_timeseries(\n 'date', 'value', \n color_column = 'id',\n smooth = False,\n y_intercept = 0,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n )\n)\nfig\n\n\n \n\n\n\n# Plotnine Object: Single Time Series\nfig = (\n df\n .query('id == \"M1\"')\n .plot_timeseries(\n 'date', 'value', \n x_axis_date_labels = \"%Y\",\n engine = 'plotnine'\n )\n)\nfig\n\n\n\n\n<Figure Size: (700 x 500)>\n\n\n\n# Plotnine Object: Grouped Time Series\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value',\n facet_ncol = 2,\n facet_scales = \"free\",\n line_size = 0.35,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine'\n )\n)\nfig\n\n\n\n\n<Figure Size: (700 x 500)>\n\n\n\n# Plotnine Object: Color Column\nfig = (\n df\n .plot_timeseries(\n 'date', 'value', \n color_column = 'id',\n smooth = False,\n y_intercept = 0,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine',\n )\n)\nfig\n\n\n\n\n<Figure Size: (700 x 500)>\n\n\n\n# Matplotlib object (same as plotnine, but converted to matplotlib object)\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value', \n color_column = 'id',\n facet_ncol = 2,\n x_axis_date_labels = \"%Y\",\n engine = 'matplotlib',\n )\n)\nfig" + "text": "Examples\n\nimport pytimetk as tk\n\ndf = tk.load_dataset('m4_monthly', parse_dates = ['date'])\n\n# Plotly Object: Single Time Series\nfig = (\n df\n .query('id == \"M750\"')\n .plot_timeseries(\n 'date', 'value', \n facet_ncol = 1,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n )\n)\nfig\n\n\n \n\n\n\n# Plotly Object: Grouped Time Series (Facets)\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value', \n facet_ncol = 
2,\n facet_scales = \"free_y\",\n smooth_frac = 0.2,\n smooth_size = 2.0,\n y_intercept = None,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n width = 600,\n height = 500,\n )\n)\nfig\n\n\n \n\n\n\n# Plotly Object: Grouped Time Series (Plotly Dropdown)\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value', \n facet_scales = \"free_y\",\n smooth_frac = 0.2,\n smooth_size = 2.0,\n y_intercept = None,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n width = 600,\n height = 500,\n plotly_dropdown = True, # Plotly Dropdown\n )\n)\nfig\n\n\n \n\n\n\n# Plotly Object: Color Column\nfig = (\n df\n .plot_timeseries(\n 'date', 'value', \n color_column = 'id',\n smooth = False,\n y_intercept = 0,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n )\n)\nfig\n\n\n \n\n\n\n# Plotnine Object: Single Time Series\nfig = (\n df\n .query('id == \"M1\"')\n .plot_timeseries(\n 'date', 'value', \n x_axis_date_labels = \"%Y\",\n engine = 'plotnine'\n )\n)\nfig\n\n\n\n\n<Figure Size: (700 x 500)>\n\n\n\n# Plotnine Object: Grouped Time Series\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value',\n facet_ncol = 2,\n facet_scales = \"free\",\n line_size = 0.35,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine'\n )\n)\nfig\n\n\n\n\n<Figure Size: (700 x 500)>\n\n\n\n# Plotnine Object: Color Column\nfig = (\n df\n .plot_timeseries(\n 'date', 'value', \n color_column = 'id',\n smooth = False,\n y_intercept = 0,\n x_axis_date_labels = \"%Y\",\n engine = 'plotnine',\n )\n)\nfig\n\n\n\n\n<Figure Size: (700 x 500)>\n\n\n\n# Matplotlib object (same as plotnine, but converted to matplotlib object)\nfig = (\n df\n .groupby('id')\n .plot_timeseries(\n 'date', 'value', \n color_column = 'id',\n facet_ncol = 2,\n x_axis_date_labels = \"%Y\",\n engine = 'matplotlib',\n )\n)\nfig\n\n\n\n\n\n# Wide-Format Plotting\n\n# Imports\nimport pandas as pd\nimport numpy as np\nimport pytimetk as tk\n\n# Set a random seed for reproducibility\nnp.random.seed(42) \n\n# 
Create a date range\ndates = pd.date_range(start=\"2020-01-01\", periods=100, freq=\"D\")\n\n# Generate random sales data and compute expenses and profit\nsales = np.random.uniform(1000, 5000, len(dates))\nexpenses = sales * np.random.uniform(0.5, 0.8, len(dates))\nprofit = sales - expenses\n\n# Create the DataFrame\ndf = pd.DataFrame({\n 'date': dates,\n 'sales': sales,\n 'expenses': expenses,\n 'profit': profit\n})\n\n(\n df\n .plot_timeseries(\n date_column = 'date', \n value_column = ['sales', 'expenses', 'profit'],\n color_column = ['sales', 'expenses', 'profit'], \n smooth = True,\n x_axis_date_labels = \"%Y\",\n engine = 'plotly',\n plotly_dropdown = True, # Plotly Dropdown\n )\n)" }, { "objectID": "reference/augment_diffs.html", diff --git a/docs/_site/sitemap.xml b/docs/_site/sitemap.xml index 2348c247..96e1d268 100644 --- a/docs/_site/sitemap.xml +++ b/docs/_site/sitemap.xml @@ -2,354 +2,354 @@ https://business-science.github.io/pytimetk/reference/ts_features.html - 2024-11-06T04:06:27.207Z + 2024-11-06T16:10:34.375Z https://business-science.github.io/pytimetk/reference/floor_date.html - 2024-11-06T04:06:24.296Z + 2024-11-06T16:10:31.288Z https://business-science.github.io/pytimetk/reference/ts_summary.html - 2024-11-06T04:06:21.750Z + 2024-11-06T16:10:28.505Z https://business-science.github.io/pytimetk/reference/timeseries_unit_frequency_table.html - 2024-11-06T04:06:18.937Z + 2024-11-06T16:10:25.270Z https://business-science.github.io/pytimetk/reference/transform_columns.html - 2024-11-06T04:06:17.208Z + 2024-11-06T16:10:23.357Z https://business-science.github.io/pytimetk/reference/augment_timeseries_signature.html - 2024-11-06T04:06:15.752Z + 2024-11-06T16:10:21.475Z https://business-science.github.io/pytimetk/reference/plot_anomalies.html - 2024-11-06T04:06:13.741Z + 2024-11-06T16:10:19.221Z https://business-science.github.io/pytimetk/reference/augment_ewm.html - 2024-11-06T04:06:10.896Z + 2024-11-06T16:10:16.059Z 
https://business-science.github.io/pytimetk/reference/make_future_timeseries.html - 2024-11-06T04:06:09.148Z + 2024-11-06T16:10:14.046Z https://business-science.github.io/pytimetk/reference/augment_roc.html - 2024-11-06T04:06:07.765Z + 2024-11-06T16:10:12.434Z https://business-science.github.io/pytimetk/reference/get_frequency_summary.html - 2024-11-06T04:06:06.111Z + 2024-11-06T16:10:10.579Z https://business-science.github.io/pytimetk/reference/augment_qsmomentum.html - 2024-11-06T04:06:04.379Z + 2024-11-06T16:10:08.659Z https://business-science.github.io/pytimetk/reference/augment_hilbert.html - 2024-11-06T04:06:02.148Z + 2024-11-06T16:10:06.232Z https://business-science.github.io/pytimetk/reference/progress_apply.html - 2024-11-06T04:05:59.291Z + 2024-11-06T16:10:03.082Z https://business-science.github.io/pytimetk/reference/flatten_multiindex_column_names.html - 2024-11-06T04:05:57.647Z + 2024-11-06T16:10:01.264Z https://business-science.github.io/pytimetk/reference/make_weekday_sequence.html - 2024-11-06T04:05:55.699Z + 2024-11-06T16:09:59.214Z https://business-science.github.io/pytimetk/reference/augment_pct_change.html - 2024-11-06T04:05:54.385Z + 2024-11-06T16:09:57.662Z https://business-science.github.io/pytimetk/reference/get_trend_frequency.html - 2024-11-06T04:05:51.553Z + 2024-11-06T16:09:54.327Z https://business-science.github.io/pytimetk/reference/augment_fourier.html - 2024-11-06T04:05:49.970Z + 2024-11-06T16:09:52.524Z https://business-science.github.io/pytimetk/reference/correlate.html - 2024-11-06T04:05:48.009Z + 2024-11-06T16:09:50.397Z https://business-science.github.io/pytimetk/reference/filter_by_time.html - 2024-11-06T04:05:45.857Z + 2024-11-06T16:09:47.936Z https://business-science.github.io/pytimetk/reference/summarize_by_time.html - 2024-11-06T04:05:43.444Z + 2024-11-06T16:09:45.303Z https://business-science.github.io/pytimetk/reference/drop_zero_variance.html - 2024-11-06T04:05:40.163Z + 2024-11-06T16:09:41.871Z 
https://business-science.github.io/pytimetk/reference/index.html - 2024-11-06T04:05:38.350Z + 2024-11-06T16:09:39.841Z https://business-science.github.io/pytimetk/reference/get_pandas_frequency.html - 2024-11-06T04:05:36.219Z + 2024-11-06T16:09:37.616Z https://business-science.github.io/pytimetk/reference/palette_timetk.html - 2024-11-06T04:05:34.753Z + 2024-11-06T16:09:35.969Z https://business-science.github.io/pytimetk/reference/get_date_summary.html - 2024-11-06T04:05:33.247Z + 2024-11-06T16:09:34.255Z https://business-science.github.io/pytimetk/reference/is_holiday.html - 2024-11-06T04:05:31.473Z + 2024-11-06T16:09:32.182Z https://business-science.github.io/pytimetk/reference/get_available_datasets.html - 2024-11-06T04:05:29.878Z + 2024-11-06T16:09:30.492Z https://business-science.github.io/pytimetk/reference/augment_wavelet.html - 2024-11-06T04:05:28.145Z + 2024-11-06T16:09:28.575Z https://business-science.github.io/pytimetk/reference/augment_cmo.html - 2024-11-06T04:05:25.860Z + 2024-11-06T16:09:25.864Z https://business-science.github.io/pytimetk/reference/augment_expanding.html - 2024-11-06T04:05:22.709Z + 2024-11-06T16:09:22.470Z https://business-science.github.io/pytimetk/reference/ceil_date.html - 2024-11-06T04:05:20.598Z + 2024-11-06T16:09:19.990Z https://business-science.github.io/pytimetk/reference/augment_rolling_apply.html - 2024-11-06T04:05:18.669Z + 2024-11-06T16:09:17.486Z https://business-science.github.io/pytimetk/reference/augment_rolling.html - 2024-11-06T04:05:16.091Z + 2024-11-06T16:08:57.152Z https://business-science.github.io/pytimetk/performance/01_speed_comparisons.html - 2024-11-06T04:05:13.479Z + 2024-11-06T16:08:54.493Z - https://business-science.github.io/pytimetk/tutorials/01_sales_crm.html - 2024-11-06T04:05:04.339Z - - - https://business-science.github.io/pytimetk/tutorials/05_clustering.html - 2024-11-06T04:04:56.757Z + https://business-science.github.io/pytimetk/tutorials/02_finance.html + 2024-11-06T16:08:52.117Z - 
https://business-science.github.io/pytimetk/tutorials/04_anomaly_detection.html - 2024-11-06T04:04:54.880Z + https://business-science.github.io/pytimetk/tutorials/03_demand_forecasting.html + 2024-11-06T16:08:38.707Z - https://business-science.github.io/pytimetk/guides/04_wrangling.html - 2024-11-06T04:04:52.761Z + https://business-science.github.io/pytimetk/tutorials/06_correlationfunnel.html + 2024-11-06T16:08:34.359Z - https://business-science.github.io/pytimetk/guides/07_timeseries_crossvalidation.html - 2024-11-06T04:04:46.853Z + https://business-science.github.io/pytimetk/guides/05_augmenting.html + 2024-11-06T16:08:28.943Z https://business-science.github.io/pytimetk/guides/06_anomalize.html - 2024-11-06T04:04:25.699Z + 2024-11-06T16:08:23.684Z https://business-science.github.io/pytimetk/guides/03_pandas_frequency.html - 2024-11-06T04:04:20.180Z + 2024-11-06T16:08:18.409Z https://business-science.github.io/pytimetk/getting-started/01_installation.html - 2024-11-06T04:04:17.628Z + 2024-11-06T16:08:16.067Z https://business-science.github.io/pytimetk/contributing.html - 2024-11-06T04:04:15.441Z + 2024-11-06T16:08:13.354Z https://business-science.github.io/pytimetk/index.html - 2024-11-06T04:04:16.888Z + 2024-11-06T16:08:15.021Z https://business-science.github.io/pytimetk/getting-started/02_quick_start.html - 2024-11-06T04:04:18.984Z + 2024-11-06T16:08:17.393Z https://business-science.github.io/pytimetk/guides/02_timetk_concepts.html - 2024-11-06T04:04:24.521Z + 2024-11-06T16:08:22.503Z https://business-science.github.io/pytimetk/guides/01_visualization.html - 2024-11-06T04:04:26.961Z + 2024-11-06T16:08:25.064Z - https://business-science.github.io/pytimetk/guides/05_augmenting.html - 2024-11-06T04:04:50.441Z + https://business-science.github.io/pytimetk/guides/04_wrangling.html + 2024-11-06T16:08:31.041Z - https://business-science.github.io/pytimetk/tutorials/06_correlationfunnel.html - 2024-11-06T04:04:56.039Z + 
https://business-science.github.io/pytimetk/tutorials/04_anomaly_detection.html + 2024-11-06T16:08:33.258Z - https://business-science.github.io/pytimetk/tutorials/03_demand_forecasting.html - 2024-11-06T04:05:00.096Z + https://business-science.github.io/pytimetk/tutorials/05_clustering.html + 2024-11-06T16:08:35.085Z - https://business-science.github.io/pytimetk/tutorials/02_finance.html - 2024-11-06T04:05:12.252Z + https://business-science.github.io/pytimetk/tutorials/01_sales_crm.html + 2024-11-06T16:08:43.314Z + + + https://business-science.github.io/pytimetk/tutorials/07_timeseries_crossvalidation.html + 2024-11-06T16:08:53.295Z https://business-science.github.io/pytimetk/changelog-news.html - 2024-11-06T04:05:14.548Z + 2024-11-06T16:08:55.634Z https://business-science.github.io/pytimetk/reference/plot_timeseries.html - 2024-11-06T04:05:17.609Z + 2024-11-06T16:09:16.147Z https://business-science.github.io/pytimetk/reference/augment_diffs.html - 2024-11-06T04:05:19.815Z + 2024-11-06T16:09:18.978Z https://business-science.github.io/pytimetk/reference/time_scale_template.html - 2024-11-06T04:05:21.411Z + 2024-11-06T16:09:20.908Z https://business-science.github.io/pytimetk/reference/pad_by_time.html - 2024-11-06T04:05:24.274Z + 2024-11-06T16:09:24.139Z https://business-science.github.io/pytimetk/reference/plot_anomalies_cleaned.html - 2024-11-06T04:05:26.909Z + 2024-11-06T16:09:27.296Z https://business-science.github.io/pytimetk/reference/plot_anomaly_decomp.html - 2024-11-06T04:05:29.147Z + 2024-11-06T16:09:29.711Z https://business-science.github.io/pytimetk/reference/get_seasonal_frequency.html - 2024-11-06T04:05:30.665Z + 2024-11-06T16:09:31.370Z https://business-science.github.io/pytimetk/reference/plot_correlation_funnel.html - 2024-11-06T04:05:32.468Z + 2024-11-06T16:09:33.280Z https://business-science.github.io/pytimetk/reference/glimpse.html - 2024-11-06T04:05:34.014Z + 2024-11-06T16:09:35.140Z 
https://business-science.github.io/pytimetk/reference/augment_ppo.html - 2024-11-06T04:05:35.728Z + 2024-11-06T16:09:37.030Z https://business-science.github.io/pytimetk/reference/augment_bbands.html - 2024-11-06T04:05:37.328Z + 2024-11-06T16:09:38.720Z https://business-science.github.io/pytimetk/reference/apply_by_time.html - 2024-11-06T04:05:39.680Z + 2024-11-06T16:09:41.321Z https://business-science.github.io/pytimetk/reference/plot_anomalies_decomp.html - 2024-11-06T04:05:41.206Z + 2024-11-06T16:09:42.986Z https://business-science.github.io/pytimetk/reference/parallel_apply.html - 2024-11-06T04:05:44.446Z + 2024-11-06T16:09:46.399Z https://business-science.github.io/pytimetk/reference/augment_expanding_apply.html - 2024-11-06T04:05:46.928Z + 2024-11-06T16:09:49.168Z https://business-science.github.io/pytimetk/reference/TimeSeriesCVSplitter.html - 2024-11-06T04:05:48.833Z + 2024-11-06T16:09:51.249Z https://business-science.github.io/pytimetk/reference/week_of_month.html - 2024-11-06T04:05:50.766Z + 2024-11-06T16:09:53.388Z https://business-science.github.io/pytimetk/reference/future_frame.html - 2024-11-06T04:05:53.261Z + 2024-11-06T16:09:56.334Z https://business-science.github.io/pytimetk/reference/get_frequency.html - 2024-11-06T04:05:54.890Z + 2024-11-06T16:09:58.333Z https://business-science.github.io/pytimetk/reference/TimeSeriesCV.html - 2024-11-06T04:05:56.874Z + 2024-11-06T16:10:00.386Z https://business-science.github.io/pytimetk/reference/theme_timetk.html - 2024-11-06T04:05:58.497Z + 2024-11-06T16:10:02.213Z https://business-science.github.io/pytimetk/reference/augment_rsi.html - 2024-11-06T04:06:00.808Z + 2024-11-06T16:10:04.720Z https://business-science.github.io/pytimetk/reference/augment_leads.html - 2024-11-06T04:06:03.533Z + 2024-11-06T16:10:07.704Z https://business-science.github.io/pytimetk/reference/binarize.html - 2024-11-06T04:06:05.313Z + 2024-11-06T16:10:09.727Z https://business-science.github.io/pytimetk/reference/load_dataset.html - 
2024-11-06T04:06:06.686Z + 2024-11-06T16:10:11.234Z https://business-science.github.io/pytimetk/reference/get_diff_summary.html - 2024-11-06T04:06:08.293Z + 2024-11-06T16:10:13.107Z https://business-science.github.io/pytimetk/reference/make_weekend_sequence.html - 2024-11-06T04:06:09.972Z + 2024-11-06T16:10:15.021Z https://business-science.github.io/pytimetk/reference/get_holiday_signature.html - 2024-11-06T04:06:12.344Z + 2024-11-06T16:10:17.764Z https://business-science.github.io/pytimetk/reference/augment_atr.html - 2024-11-06T04:06:14.835Z + 2024-11-06T16:10:20.427Z https://business-science.github.io/pytimetk/reference/augment_macd.html - 2024-11-06T04:06:16.717Z + 2024-11-06T16:10:22.722Z https://business-science.github.io/pytimetk/reference/get_timeseries_signature.html - 2024-11-06T04:06:18.108Z + 2024-11-06T16:10:24.352Z https://business-science.github.io/pytimetk/reference/augment_lags.html - 2024-11-06T04:06:20.329Z + 2024-11-06T16:10:26.844Z https://business-science.github.io/pytimetk/reference/augment_holiday_signature.html - 2024-11-06T04:06:23.464Z + 2024-11-06T16:10:30.401Z https://business-science.github.io/pytimetk/reference/anomalize.html - 2024-11-06T04:06:26.246Z + 2024-11-06T16:10:33.313Z diff --git a/docs/_site/tutorials/01_sales_crm.html b/docs/_site/tutorials/01_sales_crm.html index 7357a844..a346d04f 100644 --- a/docs/_site/tutorials/01_sales_crm.html +++ b/docs/_site/tutorials/01_sales_crm.html @@ -64,7 +64,7 @@ - + @@ -266,12 +266,6 @@ Anomaly Detection - - @@ -319,6 +313,12 @@ Correlation Funnel + + @@ -3470,8 +3470,8 @@

5 More Coming Soo diff --git a/docs/_site/tutorials/07_timeseries_crossvalidation.html b/docs/_site/tutorials/07_timeseries_crossvalidation.html new file mode 100644 index 00000000..1db0a6c9 --- /dev/null +++ b/docs/_site/tutorials/07_timeseries_crossvalidation.html @@ -0,0 +1,1313 @@ + + + + + + + + + +pytimetk - Time Series Cross Validation + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ + +
+ +
+ + +
+ + + +
+ +
+
+

Time Series Cross Validation

+
+ + + +
+ + + + +
+ + +
+ +
+

1 Time-Based Cross-Validation Using TimeSeriesCV and TimeSeriesCVSplitter

+

In this tutorial, you’ll learn how to use the TimeSeriesCV and TimeSeriesCVSplitter classes from pytimetk for time series cross-validation. The example uses the walmart_sales_df dataset, which contains 7 time series groups.

+
+
  1. In Part 1, we’ll start by exploring the data, then create and visualize time-based cross-validation splits. This prepares you for Part 2, which uses Scikit Learn.

+
  2. In Part 2, we’ll implement time series cross-validation with Scikit-Learn, engineer features, train a random forest model, and visualize the results in Python. By following this process, you can ensure a robust evaluation of your time series models and gain insights into their predictive performance.

+
+
+
+

2 Part 1: Getting Started with TimeSeriesCV

+

TimeSeriesCV generates multiple time series splits (or folds) for modeling and resampling. It works with data that contains one or more time series groups.
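
To make the idea of a split concrete, here is a minimal sketch in plain Python of rolling-window folds generated from the end of a weekly calendar backward. It does not call pytimetk and is not how TimeSeriesCV is implemented; the helper name `rolling_backward_splits`, the 80-week calendar, and the parameter values are all assumptions chosen for illustration.

```python
from datetime import date, timedelta


def rolling_backward_splits(dates, train_size, forecast_horizon, stride, gap=0):
    """Yield (train_dates, forecast_dates) pairs, most recent split first.

    Hypothetical helper for illustration only; it is not part of pytimetk.
    """
    dates = sorted(set(dates))
    window = train_size + gap + forecast_horizon  # timestamps covered by one split
    end = len(dates)
    while end - window >= 0:
        start = end - window
        train = dates[start:start + train_size]
        forecast = dates[start + train_size + gap:end]
        yield train, forecast
        end -= stride  # slide the whole window back in time


# Made-up weekly calendar: 80 consecutive Fridays
weeks = [date(2020, 1, 3) + timedelta(weeks=i) for i in range(80)]

splits = list(
    rolling_backward_splits(weeks, train_size=52, forecast_horizon=12, stride=4)
)

for i, (train, forecast) in enumerate(splits, start=1):
    print(
        f"Split {i}: train {train[0]} to {train[-1]}, "
        f"forecast {forecast[0]} to {forecast[-1]}"
    )
```

Each fold keeps the training window strictly before its forecast window, which is the property that separates time-based splits from shuffled k-fold splits. The library's exact window boundaries may differ from this sketch.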

+
+
+
+ +
+
+Using with Scikit Learn +
+
+
+
+
+

If you want a drop-in replacement for Scikit Learn’s TimeSeriesSplit, use TimeSeriesCVSplitter(), discussed next. The splitter uses TimeSeriesCV under the hood.

+
+
+
+
+
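
For reference, the sketch below shows the interface of scikit-learn's own TimeSeriesSplit, which a drop-in replacement needs to match: a `split()` method that yields train and test index arrays. It uses only scikit-learn and NumPy on made-up data; the array `X` and the parameter values are assumptions for illustration, and nothing here calls pytimetk.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 time-ordered observations (made-up data)
X = np.arange(20).reshape(-1, 1)

# Three folds, each testing on 4 observations and training on at most the 8 before them
splitter = TimeSeriesSplit(n_splits=3, test_size=4, max_train_size=8)

# split() yields (train_indices, test_indices) as integer arrays
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in splitter.split(X)
]

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(
        f"Fold {i}: train {train_idx[0]}-{train_idx[-1]}, "
        f"test {test_idx[0]}-{test_idx[-1]}"
    )
```

Any object with this `split()` behavior can be passed as the `cv` argument of scikit-learn utilities such as `cross_val_score`.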

2.1 Step 1: Load and Explore the Data

+

First, let’s load the Walmart sales dataset and explore its structure:

+
+
# libraries
+import pytimetk as tk
+import pandas as pd
+import numpy as np
+
+# Import Data
+walmart_sales_df = tk.load_dataset('walmart_sales_weekly')
+
+walmart_sales_df['Date'] = pd.to_datetime(walmart_sales_df['Date'])
+
+walmart_sales_df = walmart_sales_df[['id', 'Date', 'Weekly_Sales']]
+
+walmart_sales_df.glimpse()
+
+
<class 'pandas.core.frame.DataFrame'>: 1001 rows of 3 columns
+id:            object            ['1_1', '1_1', '1_1', '1_1', '1_1', '1_ ...
+Date:          datetime64[ns]    [Timestamp('2010-02-05 00:00:00'), Time ...
+Weekly_Sales:  float64           [24924.5, 46039.49, 41595.55, 19403.54, ...
+
+
+
+
+

2.2 Step 2: Visualize the Time Series Data

+

We can visualize the weekly sales data for different store IDs using the plot_timeseries method from pytimetk:

+
+
walmart_sales_df \
+    .groupby('id') \
+    .plot_timeseries(
+        "Date", "Weekly_Sales",
+        plotly_dropdown = True,
+    )
+
+ +
+
+
+

This will generate an interactive time series plot, allowing you to explore sales data for different stores using a dropdown.

+
+
+

2.3 Step 3: Set Up TimeSeriesCV for Cross-Validation

+

Now, let’s set up a time-based cross-validation scheme using TimeSeriesCV:

+
+
from pytimetk.crossvalidation import TimeSeriesCV
+
+# Define parameters for TimeSeriesCV
+tscv = TimeSeriesCV(
+    frequency="weeks",
+    train_size=52,          # Use 52 weeks for training
+    forecast_horizon=12,    # Forecast 12 weeks ahead
+    gap=0,                  # No gap between training and forecast sets
+    stride=4,               # Move forward by 4 weeks after each split
+    window="rolling",       # Use a rolling window
+    mode="backward"         # Generate splits from end to start
+)
+
+# Glimpse the cross-validation splits
+tscv.glimpse(
+    walmart_sales_df['Weekly_Sales'], 
+    time_series=walmart_sales_df['Date']
+)
+
+
Split Number: 1
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2011-08-05 00:00:00 to 2012-07-27 00:00:00
+Forecast Period: 2012-08-03 00:00:00 to 2012-10-19 00:00:00
+
+Split Number: 2
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2011-07-08 00:00:00 to 2012-06-29 00:00:00
+Forecast Period: 2012-07-06 00:00:00 to 2012-09-21 00:00:00
+
+Split Number: 3
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2011-06-10 00:00:00 to 2012-06-01 00:00:00
+Forecast Period: 2012-06-08 00:00:00 to 2012-08-24 00:00:00
+
+Split Number: 4
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2011-05-13 00:00:00 to 2012-05-04 00:00:00
+Forecast Period: 2012-05-11 00:00:00 to 2012-07-27 00:00:00
+
+Split Number: 5
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2011-04-15 00:00:00 to 2012-04-06 00:00:00
+Forecast Period: 2012-04-13 00:00:00 to 2012-06-29 00:00:00
+
+Split Number: 6
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2011-03-18 00:00:00 to 2012-03-09 00:00:00
+Forecast Period: 2012-03-16 00:00:00 to 2012-06-01 00:00:00
+
+Split Number: 7
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2011-02-18 00:00:00 to 2012-02-10 00:00:00
+Forecast Period: 2012-02-17 00:00:00 to 2012-05-04 00:00:00
+
+Split Number: 8
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2011-01-21 00:00:00 to 2012-01-13 00:00:00
+Forecast Period: 2012-01-20 00:00:00 to 2012-04-06 00:00:00
+
+Split Number: 9
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2010-12-24 00:00:00 to 2011-12-16 00:00:00
+Forecast Period: 2011-12-23 00:00:00 to 2012-03-09 00:00:00
+
+Split Number: 10
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2010-11-26 00:00:00 to 2011-11-18 00:00:00
+Forecast Period: 2011-11-25 00:00:00 to 2012-02-10 00:00:00
+
+Split Number: 11
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2010-10-29 00:00:00 to 2011-10-21 00:00:00
+Forecast Period: 2011-10-28 00:00:00 to 2012-01-13 00:00:00
+
+Split Number: 12
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2010-10-01 00:00:00 to 2011-09-23 00:00:00
+Forecast Period: 2011-09-30 00:00:00 to 2011-12-16 00:00:00
+
+Split Number: 13
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2010-09-03 00:00:00 to 2011-08-26 00:00:00
+Forecast Period: 2011-09-02 00:00:00 to 2011-11-18 00:00:00
+
+Split Number: 14
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2010-08-06 00:00:00 to 2011-07-29 00:00:00
+Forecast Period: 2011-08-05 00:00:00 to 2011-10-21 00:00:00
+
+Split Number: 15
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2010-07-09 00:00:00 to 2011-07-01 00:00:00
+Forecast Period: 2011-07-08 00:00:00 to 2011-09-23 00:00:00
+
+Split Number: 16
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2010-06-11 00:00:00 to 2011-06-03 00:00:00
+Forecast Period: 2011-06-10 00:00:00 to 2011-08-26 00:00:00
+
+Split Number: 17
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2010-05-14 00:00:00 to 2011-05-06 00:00:00
+Forecast Period: 2011-05-13 00:00:00 to 2011-07-29 00:00:00
+
+Split Number: 18
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2010-04-16 00:00:00 to 2011-04-08 00:00:00
+Forecast Period: 2011-04-15 00:00:00 to 2011-07-01 00:00:00
+
+Split Number: 19
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2010-03-19 00:00:00 to 2011-03-11 00:00:00
+Forecast Period: 2011-03-18 00:00:00 to 2011-06-03 00:00:00
+
+Split Number: 20
+Train Shape: (364,), Forecast Shape: (84,)
+Train Period: 2010-02-19 00:00:00 to 2011-02-11 00:00:00
+Forecast Period: 2011-02-18 00:00:00 to 2011-05-06 00:00:00
+
+
+
+

The glimpse method prints a summary of each cross-validation fold: the shapes of the training and forecast sets, and the start and end dates of the training and forecast periods.
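The dates printed above follow directly from the parameters. As a rough sketch (not pytimetk's implementation), the hypothetical helper `backward_rolling_splits` below rebuilds the period boundaries with the standard library alone. It assumes weekly steps, assumes that `gap` counts whole weeks skipped between the training and forecast sets, and takes the last forecast date shown in the output (2012-10-19) as its anchor.

```python
from datetime import date, timedelta

WEEK = timedelta(weeks=1)

def backward_rolling_splits(last_forecast_date, n_splits, train_size=52,
                            forecast_horizon=12, gap=0, stride=4):
    """Return (train_start, train_end, forecast_start, forecast_end) per split.

    Hypothetical illustration only; not part of pytimetk.
    """
    splits = []
    for k in range(n_splits):
        # Each split sits `stride` weeks earlier than the previous one
        forecast_end = last_forecast_date - k * stride * WEEK
        forecast_start = forecast_end - (forecast_horizon - 1) * WEEK
        # With gap=0 the training set ends one week before the forecast starts
        train_end = forecast_start - (gap + 1) * WEEK
        train_start = train_end - (train_size - 1) * WEEK
        splits.append((train_start, train_end, forecast_start, forecast_end))
    return splits

splits = backward_rolling_splits(date(2012, 10, 19), n_splits=20)
for number, (tr_start, tr_end, fc_start, fc_end) in enumerate(splits[:2], start=1):
    print(f"Split {number}: train {tr_start} to {tr_end}, forecast {fc_start} to {fc_end}")
```

Because the mode is "backward", split 1 covers the most recent data and every later split moves 4 weeks (the stride) further into the past.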

+
+
+

2.4 Step 4: Plot the Cross-Validation Splits

+

You can visualize how the data is split for training and testing:

+
+
# Plot the cross-validation splits
+tscv.plot(
+    walmart_sales_df['Weekly_Sales'], 
+    time_series=walmart_sales_df['Date']
+)
+
+
+
+

This plot will show each fold, illustrating which weeks are used for training and which weeks are used for forecasting.

+
+
+
+

3 Part 2: Using TimeSeriesCVSplitter for Model Evaluation with scikit-learn

+

When evaluating a model’s predictive performance on time series data within the scikit-learn framework, we need to split the data in a way that respects temporal order: every training set must come before the forecast set it is scored on. The TimeSeriesCVSplitter class from the pytimetk library is a custom splitter that handles this.

+
+

3.1 Step 1: Setting Up the TimeSeriesCVSplitter

+

The TimeSeriesCVSplitter helps us divide our dataset into training and forecast sets in a rolling window fashion. Here’s how we configure it:

+
+
from pytimetk.crossvalidation import TimeSeriesCVSplitter
+from sklearn.ensemble import RandomForestRegressor
+from sklearn.model_selection import cross_val_score
+
+# Set up TimeSeriesCVSplitter
+cv_splitter = TimeSeriesCVSplitter(
+    time_series=walmart_sales_df['Date'],
+    frequency="weeks",
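+    train_size=52*2,        # Use 104 weeks (two years) for training
+    forecast_horizon=12,    # Forecast 12 weeks ahead
+    gap=0,                  # No gap between training and forecast sets
+    stride=4,               # Shift the window by 4 weeks between splits
+    window="rolling",       # Use a rolling window
+    mode="backward",        # Generate splits from end to start
+    split_limit=5           # Limit the number of splits to 5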
+)
+
+# Visualize the TSCV Strategy
+cv_splitter.splitter.plot(walmart_sales_df['Weekly_Sales'], walmart_sales_df['Date'])
+
+
+
+

The TimeSeriesCVSplitter creates multiple splits of the time series data, allowing us to validate the model across different periods. By visualizing the cross-validation strategy, we can see how the training and forecast sets are structured.
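scikit-learn's cross-validation helpers accept any object that exposes `split` and `get_n_splits`. To make that contract concrete, here is a minimal standard-library sketch of a backward rolling splitter. `ToyBackwardSplitter` is a hypothetical stand-in written for illustration and is not how pytimetk implements its class; it assumes all series share the same weekly dates and returns plain lists of row positions, whereas real splitters typically return NumPy integer arrays.

```python
from datetime import date, timedelta

class ToyBackwardSplitter:
    """Hypothetical stand-in for a time-aware splitter; not pytimetk's class."""

    def __init__(self, dates, train_size, forecast_horizon, stride, split_limit):
        self.dates = list(dates)                 # one date per row, repeats allowed
        self.train_size = train_size
        self.forecast_horizon = forecast_horizon
        self.stride = stride
        self.split_limit = split_limit

    def _windows(self):
        unique = sorted(set(self.dates))
        windows = []
        end = len(unique)                        # exclusive end of the forecast window
        while len(windows) < self.split_limit:
            test_start = end - self.forecast_horizon
            train_start = test_start - self.train_size
            if train_start < 0:                  # not enough history left
                break
            windows.append((set(unique[train_start:test_start]),
                            set(unique[test_start:end])))
            end -= self.stride                   # step back in time
        return windows

    def get_n_splits(self, X=None, y=None, groups=None):
        return len(self._windows())

    def split(self, X=None, y=None, groups=None):
        # Select rows by date, so every series contributes to each fold
        for train_dates, test_dates in self._windows():
            train_idx = [i for i, d in enumerate(self.dates) if d in train_dates]
            test_idx = [i for i, d in enumerate(self.dates) if d in test_dates]
            yield train_idx, test_idx

# Two toy series sharing the same 30 weekly dates (60 rows in total)
weeks = [date(2012, 1, 6) + timedelta(weeks=w) for w in range(30)]
row_dates = weeks + weeks

splitter = ToyBackwardSplitter(row_dates, train_size=8, forecast_horizon=3,
                               stride=2, split_limit=5)
for train_idx, test_idx in splitter.split():
    print(len(train_idx), len(test_idx))
```

Selecting rows by date rather than by position is what keeps several series aligned: each fold holds the same weeks for every series, so the fold sizes are multiples of the number of series.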

+
+
+

3.2 Step 2: Feature Engineering for Time Series Data

+

Effective feature engineering can significantly impact the performance of a time series model. Using pytimetk, we extract a variety of features from the Date column.

+
+

Generating Time Series Features

+

We use get_timeseries_signature to generate useful features, such as year, quarter, month, and day-of-week indicators.

+
+
# Prepare data for modeling
+
+# Extract time series features from the 'Date' column
+X_time_features = tk.get_timeseries_signature(walmart_sales_df['Date'])
+
+# Select features to dummy encode
+features_to_dummy = ['Date_quarteryear', 'Date_month_lbl', 'Date_wday_lbl', 'Date_am_pm']
+
+# Dummy encode the selected features
+X_time_dummies = pd.get_dummies(X_time_features[features_to_dummy], drop_first=True)
+
+# Dummy encode the 'id' column
+X_id_dummies = pd.get_dummies(walmart_sales_df['id'], prefix='store')
+
+# Combine the time series features, dummy-encoded features, and the 'id' dummies
+X = pd.concat([X_time_features, X_time_dummies, X_id_dummies], axis=1)
+
+# Drop the original categorical columns that were dummy encoded
+X = X.drop(columns=features_to_dummy).drop('Date', axis=1)
+
+# Set the target variable
+y = walmart_sales_df['Weekly_Sales'].values
+
+
+
+
+
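To show what these two steps do conceptually, the sketch below derives a few calendar fields from dates and one-hot encodes a label column using only the standard library. The helpers `calendar_features` and `one_hot` are hypothetical names invented for this illustration; they are far smaller than `get_timeseries_signature` and only mimic the idea behind `pd.get_dummies(..., drop_first=True)`.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def calendar_features(d):
    """Derive a few calendar fields from one date (hypothetical helper)."""
    return {
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "wday_lbl": WEEKDAYS[d.weekday()],
    }

def one_hot(values, prefix, drop_first=True):
    """One-hot encode labels, optionally dropping the first sorted level."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]          # the dropped level is implied by all zeros
    return [{f"{prefix}_{lvl}": int(v == lvl) for lvl in levels} for v in values]

dates = [date(2012, 10, 19), date(2012, 10, 20), date(2012, 10, 21)]
features = [calendar_features(d) for d in dates]
dummies = one_hot([f["wday_lbl"] for f in features], prefix="wday")

for f, dmy in zip(features, dummies):
    print(f, dmy)
```

Dropping the first level removes a redundant column: a row whose remaining indicators are all zero must belong to the dropped level.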

3.3 Step 3: Model Training and Evaluation with Random Forest

+

For this example, we use RandomForestRegressor from scikit-learn to model the time series data. A random forest is a robust, ensemble-based model that can handle a wide range of regression tasks.

+
+
# Initialize the RandomForestRegressor model
+model = RandomForestRegressor(
+    n_estimators=100,      # Number of trees in the forest
+    max_depth=None,        # Maximum depth of the trees (None means nodes are expanded until all leaves are pure)
+    random_state=42        # Set a random state for reproducibility
+)
+
+# Evaluate the model using cross-validation scores
+scores = cross_val_score(model, X, y, cv=cv_splitter, scoring='neg_mean_squared_error')
+
+# Print cross-validation scores
+print("Cross-Validation Scores (Negative MSE):", scores)
+
+
Cross-Validation Scores (Negative MSE): [-23761708.80112538 -23107644.58461143 -21728878.18790144
+ -25113860.93913386 -86192034.48953015]
+
+
+
+
+
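scikit-learn scorers follow a higher-is-better convention, which is why the mean squared error is reported with a negative sign. A small sketch of converting the scores printed above into root mean squared error (RMSE), which is in the same units as Weekly_Sales; the numbers are copied from the output, and the variable names are illustrative only.

```python
import math

# Scores copied from the cross_val_score output above (negative MSE per fold)
neg_mse_scores = [
    -23761708.80112538,
    -23107644.58461143,
    -21728878.18790144,
    -25113860.93913386,
    -86192034.48953015,
]

# Flip the sign to recover the MSE, then take the square root
rmse_per_fold = [math.sqrt(-score) for score in neg_mse_scores]

# 1-based position of the fold with the largest error
worst_fold = max(range(len(rmse_per_fold)), key=rmse_per_fold.__getitem__) + 1

for fold, rmse in enumerate(rmse_per_fold, start=1):
    print(f"Fold {fold}: RMSE = {rmse:,.0f}")
print("Fold with the largest error:", worst_fold)
```

Looking at the folds individually rather than averaging them makes it easy to spot that the fifth score is much worse than the other four.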

3.4 Step 4: Visualizing the Forecast

+

Visualization is crucial for understanding how well the model predicts future values. We collect the actual and predicted values for each fold and combine them for easy plotting.

+
+
# List to store the per-fold DataFrames
+combined_data = []
+
+# Iterate through each fold and collect the data
+for i, (train_index, test_index) in enumerate(cv_splitter.split(X, y), start=1):
+    # Get the training and forecast data from the original DataFrame
+    train_df = walmart_sales_df.iloc[train_index].copy()
+    test_df = walmart_sales_df.iloc[test_index].copy()
+    
+    # Fit the model on the training data
+    model.fit(X.iloc[train_index], y[train_index])
+    
+    # Predict on the test set
+    y_pred = model.predict(X.iloc[test_index])
+    
+    # Add the actual and predicted values
+    train_df['Actual'] = y[train_index]
+    train_df['Predicted'] = None  # No predictions for training data
+    train_df['Fold'] = i  # Indicate the current fold
+    
+    test_df['Actual'] = y[test_index]
+    test_df['Predicted'] = y_pred  # Predictions for the test data
+    test_df['Fold'] = i  # Indicate the current fold
+    
+    # Append both the training and forecast DataFrames to the combined data list
+    combined_data.extend([train_df, test_df])
+
+# Combine all the data into a single DataFrame
+full_forecast_df = pd.concat(combined_data, ignore_index=True)
+
+full_forecast_df = full_forecast_df[['id', 'Date', 'Actual', 'Predicted', 'Fold']]
+
+full_forecast_df.glimpse()
+
+
<class 'pandas.core.frame.DataFrame'>: 4060 rows of 5 columns
+id:         object            ['1_1', '1_1', '1_1', '1_1', '1_1', '1_1', ...
+Date:       datetime64[ns]    [Timestamp('2010-08-06 00:00:00'), Timesta ...
+Actual:     float64           [17508.41, 15536.4, 15740.13, 15793.87, 16 ...
+Predicted:  float64           [nan, nan, nan, nan, nan, nan, nan, nan, n ...
+Fold:       int64             [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+
+
+
+
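The row count reported by glimpse can be sanity-checked with a little arithmetic. This sketch assumes every series has an observation for every week; the number of series is inferred from the earlier "Train Shape: (364,)" output, which came from 52 training weeks.

```python
n_folds = 5             # split_limit
train_weeks = 52 * 2    # train_size
forecast_weeks = 12     # forecast_horizon
n_series = 364 // 52    # inferred from "Train Shape: (364,)" with 52 training weeks

# Each fold stores its training rows and its forecast rows for every series
rows_per_fold = (train_weeks + forecast_weeks) * n_series
total_rows = n_folds * rows_per_fold

print(rows_per_fold, total_rows)
```

The total matches the 4060 rows shown above, which confirms that each fold contributes both its training and its forecast rows.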

Preparing Data for Visualization

+

To make the data easier to plot, we use pd.melt() to transform the Actual and Predicted columns into a long format.

+
+
# Melt the Actual and Predicted columns
+melted_df = pd.melt(
+    full_forecast_df,
+    id_vars=['id', 'Date', 'Fold'],  # Columns to keep
+    value_vars=['Actual', 'Predicted'],  # Columns to melt
+    var_name='Type',  # Name for the new column indicating 'Actual' or 'Predicted'
+    value_name='Value'  # Name for the new column with the values
+)
+
+melted_df["unique_id"] = "ID_" + melted_df['id'] + "-Fold_" + melted_df["Fold"].astype(str)
+
+melted_df.glimpse()
+
+
<class 'pandas.core.frame.DataFrame'>: 8120 rows of 6 columns
+id:         object            ['1_1', '1_1', '1_1', '1_1', '1_1', '1_1', ...
+Date:       datetime64[ns]    [Timestamp('2010-08-06 00:00:00'), Timesta ...
+Fold:       int64             [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+Type:       object            ['Actual', 'Actual', 'Actual', 'Actual', ' ...
+Value:      float64           [17508.41, 15536.4, 15740.13, 15793.87, 16 ...
+unique_id:  object            ['ID_1_1-Fold_1', 'ID_1_1-Fold_1', 'ID_1_1 ...
+
+
+
+
+
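To see what the reshaping does on a small scale, here is a toy example with made-up values (not taken from the Walmart data). Each input row becomes two output rows, one per melted column, which is why the melted frame above has twice as many rows as the frame it was built from.

```python
import pandas as pd

wide = pd.DataFrame({
    "id": ["1_1", "1_1", "1_1"],
    "Date": pd.to_datetime(["2012-10-05", "2012-10-12", "2012-10-19"]),
    "Fold": [1, 1, 1],
    "Actual": [100.0, 110.0, 120.0],       # made-up values
    "Predicted": [None, 108.0, 123.0],     # None marks a training row
})

# Stack the Actual and Predicted columns into Type/Value pairs
long = pd.melt(
    wide,
    id_vars=["id", "Date", "Fold"],
    value_vars=["Actual", "Predicted"],
    var_name="Type",
    value_name="Value",
)

print(long)
```

Training rows keep a missing value in the Predicted slice, so only the forecast period of each fold has a predicted value to plot.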

Plotting the Forecasts

+

Finally, we use plot_timeseries() to visualize the forecasts, comparing the actual and predicted values for each fold.

+
+
melted_df \
+    .groupby('unique_id') \
+    .plot_timeseries(
+        "Date", "Value",
+        color_column = "Type",
+        smooth=False, 
+        plotly_dropdown=True
+    )
+
+
+
+
+
+
+
+

4 Conclusion

+

This guide demonstrated how to implement time series cross-validation, engineer features, train a random forest model, and visualize the results in Python. By following this process, you can ensure a robust evaluation of your time series models and gain insights into their predictive performance. Happy modeling!

+
+
+

5 More Coming Soon…

+

We are in the early stages of development, but the potential for pytimetk in Python is already clear. 🐍

+ + + + \ No newline at end of file diff --git a/docs/reference/plot_timeseries.qmd b/docs/reference/plot_timeseries.qmd index 8555bdb4..189c8ac0 100644 --- a/docs/reference/plot_timeseries.qmd +++ b/docs/reference/plot_timeseries.qmd @@ -231,4 +231,5 @@ df = pd.DataFrame({ engine = 'plotly', plotly_dropdown = True, # Plotly Dropdown ) -) \ No newline at end of file +) +``` \ No newline at end of file diff --git a/docs/guides/07_timeseries_crossvalidation.qmd b/docs/tutorials/07_timeseries_crossvalidation.qmd similarity index 100% rename from docs/guides/07_timeseries_crossvalidation.qmd rename to docs/tutorials/07_timeseries_crossvalidation.qmd diff --git a/src/pytimetk/plot/plot_timeseries.py b/src/pytimetk/plot/plot_timeseries.py index 3cd549bf..9f128473 100644 --- a/src/pytimetk/plot/plot_timeseries.py +++ b/src/pytimetk/plot/plot_timeseries.py @@ -440,6 +440,7 @@ def plot_timeseries( plotly_dropdown = True, # Plotly Dropdown ) ) + ``` ''' # Common checks