From 71c78fc3b353fc72b82ae6bfc2426b42fcc10359 Mon Sep 17 00:00:00 2001 From: Tony Bagnall Date: Wed, 5 Jul 2023 08:25:14 +0100 Subject: [PATCH 01/14] datatypes notebook --- examples/AA_datatypes_and_datasets.ipynb | 18 +- examples/datasets/data_conversions.ipynb | 238 +++++++++++++++++++++++ 2 files changed, 252 insertions(+), 4 deletions(-) create mode 100644 examples/datasets/data_conversions.ipynb diff --git a/examples/AA_datatypes_and_datasets.ipynb b/examples/AA_datatypes_and_datasets.ipynb index 4a82197063..0bbec22f95 100644 --- a/examples/AA_datatypes_and_datasets.ipynb +++ b/examples/AA_datatypes_and_datasets.ipynb @@ -748,13 +748,23 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": " dask_series np.ndarray pd.DataFrame pd.Series xr.DataArray\ndask_series 1 1 1 1 1\nnp.ndarray 1 1 1 1 1\npd.DataFrame 1 1 1 1 1\npd.Series 1 1 1 1 1\nxr.DataArray 1 1 1 1 1", + "text/html": "
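<table>
<thead><tr><th></th><th>dask_series</th><th>np.ndarray</th><th>pd.DataFrame</th><th>pd.Series</th><th>xr.DataArray</th></tr></thead>
<tbody>
<tr><th>dask_series</th><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td></tr>
<tr><th>np.ndarray</th><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td></tr>
<tr><th>pd.DataFrame</th><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td></tr>
<tr><th>pd.Series</th><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td></tr>
<tr><th>xr.DataArray</th><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td></tr>
</tbody>
</table>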
" + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "from aeon.datatypes._convert import _conversions_defined\n", "\n", - "_conversions_defined(scitype=\"Panel\")" + "_conversions_defined(scitype=\"Series\")" ] }, { diff --git a/examples/datasets/data_conversions.ipynb b/examples/datasets/data_conversions.ipynb new file mode 100644 index 0000000000..225eaf9d47 --- /dev/null +++ b/examples/datasets/data_conversions.ipynb @@ -0,0 +1,238 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Data conversions in aeon\n", + "\n", + "We recommend you follow the data storage described in the [data storage notebook](examples/datasets/data_storage.ipynb)\n", + "which can be summarised as follows: Use `pd.Series` or `pd.DataFrame` for forecasting\n", + " and for classification, clustering and regression, use 3D numpy of shape `(n_cases,\n", + " n_channels, n_timepoints)` if your collection of time series are equal length, or a\n", + " list of 2D numpy of length `[n_cases]` if not equal length. All are [data loaders]\n", + " (examples/datasets/data_loading.ipynb) use this format.\n", + "\n", + "However, `aeon` provides a range of converters in the `datatypes` package. These are\n", + "grouped into converters for single series and converters for collections of series" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "# Series Converters\n", + "\n", + "Single time series can be stored in the following data structures\n", + "\n", + "pd.Series: a univariate time series\n", + "pd.DataFrame: a univariate or multivariate time series\n", + "np.ndarray: 2D numpy.ndarray of shape `(n_timepoints, n_channels)`.\n", + "xr.DataArray: a univariate or multivariate time series\n", + "dask_series: Dask DataFrame: a univariate or multivariate time series\n", + "\n", + "NOTE the 2D numpy array representation is not consistent with that used in\n", + "collections. 
This is an unfortunate difference that is a result of legacy design and\n", + "norms in different research fields. We recommend not using numpy arrays with\n", + "forecasting.\n", + "\n", + "Conversion to and from these data structures is fairly straightforward. `aeon` contains\n", + "converters that are part of the legacy code base. There is a wrapper to hide all this\n", + " code, but we also show under the hood. This code is not likely to be maintained." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 8, + "outputs": [ + { + "data": { + "text/plain": "xarray.core.dataarray.DataArray" + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import numpy as np\n", + "\n", + "from aeon.datatypes import convert\n", + "\n", + "numpyarray = np.random.random(size=(100, 1))\n", + "series = convert(numpyarray, from_type=\"np.ndarray\", to_type=\"xr.DataArray\")\n", + "type(series)" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "All the actual converter functions for series are in the following file `aeon.datatypes._series._convert`. We stress,\n", + "this is legacy code. `aeon` thinks it better the user is responsible for getting the\n", + "data into the best format for the estimators." 
+ ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 9, + "outputs": [ + { + "data": { + "text/plain": "pandas.core.frame.DataFrame" + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from aeon.datatypes._series._convert import (\n", + " convert_mvs_to_dask_as_series,\n", + " convert_Mvs_to_xrdatarray_as_Series,\n", + " convert_np_to_MvS_as_Series,\n", + ")\n", + "\n", + "pd_dataframe = convert_np_to_MvS_as_Series(numpyarray)\n", + "type(pd_dataframe)" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 10, + "outputs": [ + { + "data": { + "text/plain": "dask.dataframe.core.DataFrame" + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dask_dataframe = convert_mvs_to_dask_as_series(pd_dataframe)\n", + "type(dask_dataframe)" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 11, + "outputs": [ + { + "data": { + "text/plain": "xarray.core.dataarray.DataArray" + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "xrarray = convert_Mvs_to_xrdatarray_as_Series(pd_dataframe)\n", + "type(xrarray)" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 11, + "outputs": [], + "source": [], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "# Collections Converters\n", + "\n", + "Previously, collections of time series were called panels (a term from econometrics,\n", + "not machine learning), and there are still references to panel. 
Collections can be\n", + "stored as follows\n", + "\n", + "numpy3D: 3D np.array of format (n_instances, n_channels, n_timepoints)\n", + "np-list:\n", + "\n", + "\n", + "MTYPE_REGISTER_PANEL = [\n", + " (\n", + " \"nested_univ\",\n", + " \"Panel\",\n", + " \"pd.DataFrame with one column per channel, pd.Series in cells\",\n", + " ),\n", + " (\n", + " \"numpy3D\",\n", + " \"Panel\",\n", + " \"3D np.array of format (n_instances, n_channels, n_timepoints)\",\n", + " ),\n", + " (\n", + " \"numpyflat\",\n", + " \"Panel\",\n", + " \"2D np.array of format (n_instances, n_columns*n_timepoints)\",\n", + " ),\n", + " (\"pd-multiindex\", \"Panel\", \"pd.DataFrame with multi-index (instances, timepoints)\"),\n", + " (\"pd-wide\", \"Panel\", \"pd.DataFrame in wide format, cols = (instance*timepoints)\"),\n", + " (\n", + " \"pd-long\",\n", + " \"Panel\",\n", + " \"pd.DataFrame in long format, cols = (index, time_index, column)\",\n", + " ),\n", + " (\"df-list\", \"Panel\", \"list of pd.DataFrame\"),\n", + " (\n", + " \"dask_panel\",\n", + " \"Panel\",\n", + " \"dask frame with one instance and one time index, as per dask_to_pd convention\",\n", + " ),\n", + " (\n", + " \"np-list\",\n", + " \"Panel\",\n", + " \"list of n_cases, each case a 2D np.array of shape (n_channels, series_length)\",\n", + " ),\n", + "]\n" + ], + "metadata": { + "collapsed": false + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} From 6817086a3af64f4fdcccbf9339b532118998ebf8 Mon Sep 17 00:00:00 2001 From: Tony Bagnall Date: Wed, 5 Jul 2023 09:33:28 +0100 Subject: [PATCH 02/14] datatypes notebook --- aeon/classification/tests/test_base.py 
| 2 +- aeon/datasets/_dataframe_loaders.py | 2 +- aeon/datatypes/_check.py | 2 +- .../{_panel => _collection}/__init__.py | 12 +- .../{_panel => _collection}/_check.py | 0 .../{_panel => _collection}/_convert.py | 2 +- .../{_panel => _collection}/_examples.py | 0 .../{_panel => _collection}/_registry.py | 9 +- aeon/datatypes/_convert.py | 2 +- aeon/datatypes/_examples.py | 10 +- aeon/datatypes/_hierarchical/_check.py | 2 +- aeon/datatypes/_registry.py | 10 +- aeon/datatypes/tests/test_panel_converters.py | 4 +- .../tests/test_series_to_panel_converters.py | 2 +- aeon/forecasting/base/tests/test_base.py | 2 +- aeon/transformations/collection/segment.py | 2 +- aeon/transformations/collection/tsfresh.py | 2 +- aeon/utils/_testing/estimator_checks.py | 2 +- aeon/utils/validation/panel.py | 4 +- examples/datasets/data_conversions.ipynb | 216 ++++++++++++------ 20 files changed, 183 insertions(+), 104 deletions(-) rename aeon/datatypes/{_panel => _collection}/__init__.py (51%) rename aeon/datatypes/{_panel => _collection}/_check.py (100%) rename aeon/datatypes/{_panel => _collection}/_convert.py (99%) rename aeon/datatypes/{_panel => _collection}/_examples.py (100%) rename aeon/datatypes/{_panel => _collection}/_registry.py (77%) diff --git a/aeon/classification/tests/test_base.py b/aeon/classification/tests/test_base.py index b90bcbccd7..c2fa94cd2e 100644 --- a/aeon/classification/tests/test_base.py +++ b/aeon/classification/tests/test_base.py @@ -9,7 +9,7 @@ from aeon.classification import DummyClassifier from aeon.classification.base import BaseClassifier -from aeon.datatypes._panel._convert import ( +from aeon.datatypes._collection._convert import ( from_nested_to_dflist_adp, from_nested_to_multi_index, ) diff --git a/aeon/datasets/_dataframe_loaders.py b/aeon/datasets/_dataframe_loaders.py index dbd5483620..e83424d6a1 100644 --- a/aeon/datasets/_dataframe_loaders.py +++ b/aeon/datasets/_dataframe_loaders.py @@ -24,7 +24,7 @@ from aeon.datasets._data_generators 
import _convert_tsf_to_hierarchical from aeon.datatypes import MTYPE_LIST_HIERARCHICAL, convert -from aeon.datatypes._panel._convert import from_long_to_nested +from aeon.datatypes._collection._convert import from_long_to_nested DIRNAME = "data" MODULE = os.path.dirname(__file__) diff --git a/aeon/datatypes/_check.py b/aeon/datatypes/_check.py index 3861b640a0..23f600dbec 100644 --- a/aeon/datatypes/_check.py +++ b/aeon/datatypes/_check.py @@ -29,8 +29,8 @@ import numpy as np from aeon.datatypes._alignment import check_dict_Alignment +from aeon.datatypes._collection import check_dict_Panel from aeon.datatypes._hierarchical import check_dict_Hierarchical -from aeon.datatypes._panel import check_dict_Panel from aeon.datatypes._proba import check_dict_Proba from aeon.datatypes._registry import AMBIGUOUS_MTYPES, SCITYPE_LIST, mtype_to_scitype from aeon.datatypes._series import check_dict_Series diff --git a/aeon/datatypes/_panel/__init__.py b/aeon/datatypes/_collection/__init__.py similarity index 51% rename from aeon/datatypes/_panel/__init__.py rename to aeon/datatypes/_collection/__init__.py index bab0affda4..02aa9e0c06 100644 --- a/aeon/datatypes/_panel/__init__.py +++ b/aeon/datatypes/_collection/__init__.py @@ -1,16 +1,16 @@ # -*- coding: utf-8 -*- """Module exports: Panel type checkers, converters and mtype inference.""" -from aeon.datatypes._panel._check import check_dict as check_dict_Panel -from aeon.datatypes._panel._convert import convert_dict as convert_dict_Panel -from aeon.datatypes._panel._examples import example_dict as example_dict_Panel -from aeon.datatypes._panel._examples import ( +from aeon.datatypes._collection._check import check_dict as check_dict_Panel +from aeon.datatypes._collection._convert import convert_dict as convert_dict_Panel +from aeon.datatypes._collection._examples import example_dict as example_dict_Panel +from aeon.datatypes._collection._examples import ( example_dict_lossy as example_dict_lossy_Panel, ) -from 
aeon.datatypes._panel._examples import ( +from aeon.datatypes._collection._examples import ( example_dict_metadata as example_dict_metadata_Panel, ) -from aeon.datatypes._panel._registry import MTYPE_LIST_PANEL, MTYPE_REGISTER_PANEL +from aeon.datatypes._collection._registry import MTYPE_LIST_PANEL, MTYPE_REGISTER_PANEL __all__ = [ "check_dict_Panel", diff --git a/aeon/datatypes/_panel/_check.py b/aeon/datatypes/_collection/_check.py similarity index 100% rename from aeon/datatypes/_panel/_check.py rename to aeon/datatypes/_collection/_check.py diff --git a/aeon/datatypes/_panel/_convert.py b/aeon/datatypes/_collection/_convert.py similarity index 99% rename from aeon/datatypes/_panel/_convert.py rename to aeon/datatypes/_collection/_convert.py index fe3b2f1ee5..4b5d4fdd9a 100644 --- a/aeon/datatypes/_panel/_convert.py +++ b/aeon/datatypes/_collection/_convert.py @@ -34,8 +34,8 @@ "convert_dict", ] +from aeon.datatypes._collection._registry import MTYPE_LIST_PANEL from aeon.datatypes._convert_utils._convert import _extend_conversions -from aeon.datatypes._panel._registry import MTYPE_LIST_PANEL from aeon.utils.validation._dependencies import _check_soft_dependencies # dictionary indexed by triples of types diff --git a/aeon/datatypes/_panel/_examples.py b/aeon/datatypes/_collection/_examples.py similarity index 100% rename from aeon/datatypes/_panel/_examples.py rename to aeon/datatypes/_collection/_examples.py diff --git a/aeon/datatypes/_panel/_registry.py b/aeon/datatypes/_collection/_registry.py similarity index 77% rename from aeon/datatypes/_panel/_registry.py rename to aeon/datatypes/_collection/_registry.py index 94e0f39da6..51f1f6d79b 100644 --- a/aeon/datatypes/_panel/_registry.py +++ b/aeon/datatypes/_collection/_registry.py @@ -1,5 +1,5 @@ # -*- coding: utf-8 -*- -"""Registry of mtypes for Panel scitype. See datatypes._registry for API.""" +"""Registry of mtypes for Collections. 
See datatypes._registry for API.""" import pandas as pd @@ -19,12 +19,12 @@ ( "numpy3D", "Panel", - "3D np.array of format (n_instances, n_channels, n_timepoints)", + "3D np.ndarray of format (n_cases, n_channels, n_timepoints)", ), ( "numpyflat", "Panel", - "2D np.array of format (n_instances, n_columns*n_timepoints)", + "2D np.ndarray of format (n_cases, n_channels*n_timepoints)", ), ("pd-multiindex", "Panel", "pd.DataFrame with multi-index (instances, timepoints)"), ("pd-wide", "Panel", "pd.DataFrame in wide format, cols = (instance*timepoints)"), @@ -42,7 +42,8 @@ ( "np-list", "Panel", - "list of n_cases, each case a 2D np.array of shape (n_channels, series_length)", + "list of length [n_cases], each case a 2D np.ndarray of shape (n_channels, " + "n_timepoints)", ), ] diff --git a/aeon/datatypes/_convert.py b/aeon/datatypes/_convert.py index c06b5d4965..4afa17ce81 100644 --- a/aeon/datatypes/_convert.py +++ b/aeon/datatypes/_convert.py @@ -71,8 +71,8 @@ import pandas as pd from aeon.datatypes._check import mtype as infer_mtype +from aeon.datatypes._collection import convert_dict_Panel from aeon.datatypes._hierarchical import convert_dict_Hierarchical -from aeon.datatypes._panel import convert_dict_Panel from aeon.datatypes._proba import convert_dict_Proba from aeon.datatypes._registry import AMBIGUOUS_MTYPES, mtype_to_scitype from aeon.datatypes._series import convert_dict_Series diff --git a/aeon/datatypes/_examples.py b/aeon/datatypes/_examples.py index be06bc52f9..d860d1f26c 100644 --- a/aeon/datatypes/_examples.py +++ b/aeon/datatypes/_examples.py @@ -23,16 +23,16 @@ ] from aeon.datatypes._alignment import example_dict_Alignment +from aeon.datatypes._collection import ( + example_dict_lossy_Panel, + example_dict_metadata_Panel, + example_dict_Panel, +) from aeon.datatypes._hierarchical import ( example_dict_Hierarchical, example_dict_lossy_Hierarchical, example_dict_metadata_Hierarchical, ) -from aeon.datatypes._panel import ( - example_dict_lossy_Panel, - 
example_dict_metadata_Panel, - example_dict_Panel, -) from aeon.datatypes._proba import ( example_dict_lossy_Proba, example_dict_metadata_Proba, diff --git a/aeon/datatypes/_hierarchical/_check.py b/aeon/datatypes/_hierarchical/_check.py index 1181180dfe..4c29115a2e 100644 --- a/aeon/datatypes/_hierarchical/_check.py +++ b/aeon/datatypes/_hierarchical/_check.py @@ -44,7 +44,7 @@ import numpy as np -from aeon.datatypes._panel._check import check_pdmultiindex_panel +from aeon.datatypes._collection._check import check_pdmultiindex_panel from aeon.utils.validation._dependencies import _check_soft_dependencies diff --git a/aeon/datatypes/_registry.py b/aeon/datatypes/_registry.py index 57fcc2d8cf..d13a63e240 100644 --- a/aeon/datatypes/_registry.py +++ b/aeon/datatypes/_registry.py @@ -43,16 +43,16 @@ MTYPE_LIST_ALIGNMENT, MTYPE_REGISTER_ALIGNMENT, ) +from aeon.datatypes._collection._registry import ( + MTYPE_LIST_PANEL, + MTYPE_REGISTER_PANEL, + MTYPE_SOFT_DEPS_PANEL, +) from aeon.datatypes._hierarchical._registry import ( MTYPE_LIST_HIERARCHICAL, MTYPE_REGISTER_HIERARCHICAL, MTYPE_SOFT_DEPS_HIERARCHICAL, ) -from aeon.datatypes._panel._registry import ( - MTYPE_LIST_PANEL, - MTYPE_REGISTER_PANEL, - MTYPE_SOFT_DEPS_PANEL, -) from aeon.datatypes._proba._registry import MTYPE_LIST_PROBA, MTYPE_REGISTER_PROBA from aeon.datatypes._series._registry import ( MTYPE_LIST_SERIES, diff --git a/aeon/datatypes/tests/test_panel_converters.py b/aeon/datatypes/tests/test_panel_converters.py index adc68997c0..7a0737b368 100644 --- a/aeon/datatypes/tests/test_panel_converters.py +++ b/aeon/datatypes/tests/test_panel_converters.py @@ -6,12 +6,12 @@ from aeon.datasets import make_example_long_table, make_example_multi_index_dataframe from aeon.datatypes._adapter import convert_from_multiindex_to_listdataset -from aeon.datatypes._panel._check import ( +from aeon.datatypes._collection._check import ( are_columns_nested, check_nplist_panel, is_nested_dataframe, ) -from 
aeon.datatypes._panel._convert import ( +from aeon.datatypes._collection._convert import ( from_2d_array_to_nested, from_3d_numpy_to_2d_array, from_3d_numpy_to_multi_index, diff --git a/aeon/datatypes/tests/test_series_to_panel_converters.py b/aeon/datatypes/tests/test_series_to_panel_converters.py index 152d5ea5ec..5181900b82 100644 --- a/aeon/datatypes/tests/test_series_to_panel_converters.py +++ b/aeon/datatypes/tests/test_series_to_panel_converters.py @@ -4,7 +4,7 @@ import numpy as np import pandas as pd -from aeon.datatypes._panel._convert import from_3d_numpy_to_multi_index +from aeon.datatypes._collection._convert import from_3d_numpy_to_multi_index from aeon.datatypes._series_as_panel import ( convert_Panel_to_Series, convert_Series_to_Panel, diff --git a/aeon/forecasting/base/tests/test_base.py b/aeon/forecasting/base/tests/test_base.py index bb1ae83a6a..bd98210fd7 100644 --- a/aeon/forecasting/base/tests/test_base.py +++ b/aeon/forecasting/base/tests/test_base.py @@ -12,7 +12,7 @@ from pandas.testing import assert_series_equal from aeon.datatypes import check_is_mtype, convert -from aeon.datatypes._panel._convert import from_nested_to_multi_index +from aeon.datatypes._collection._convert import from_nested_to_multi_index from aeon.datatypes._utilities import get_cutoff, get_window from aeon.forecasting.arima import ARIMA from aeon.utils._testing.collection import make_3d_test_data, make_nested_dataframe_data diff --git a/aeon/transformations/collection/segment.py b/aeon/transformations/collection/segment.py index a6af5b87b1..6757b3ce73 100644 --- a/aeon/transformations/collection/segment.py +++ b/aeon/transformations/collection/segment.py @@ -6,7 +6,7 @@ import pandas as pd from sklearn.utils import check_random_state -from aeon.datatypes._panel._convert import _concat_nested_arrays, _get_time_index +from aeon.datatypes._collection._convert import _concat_nested_arrays, _get_time_index from aeon.transformations.base import BaseTransformer from 
aeon.utils.validation import check_window_length diff --git a/aeon/transformations/collection/tsfresh.py b/aeon/transformations/collection/tsfresh.py index 21146b4e48..f47a02a2bd 100644 --- a/aeon/transformations/collection/tsfresh.py +++ b/aeon/transformations/collection/tsfresh.py @@ -5,7 +5,7 @@ __author__ = ["AyushmaanSeth", "mloning", "Alwin Wang", "MatthewMiddlehurst"] __all__ = ["TSFreshFeatureExtractor", "TSFreshRelevantFeatureExtractor"] -from aeon.datatypes._panel._convert import from_3d_numpy_to_long +from aeon.datatypes._collection._convert import from_3d_numpy_to_long from aeon.transformations.collection.base import BaseCollectionTransformer from aeon.utils.validation import check_n_jobs from aeon.utils.validation._dependencies import _check_soft_dependencies diff --git a/aeon/utils/_testing/estimator_checks.py b/aeon/utils/_testing/estimator_checks.py index 6b4c5b1b76..6a4021f24d 100644 --- a/aeon/utils/_testing/estimator_checks.py +++ b/aeon/utils/_testing/estimator_checks.py @@ -17,7 +17,7 @@ from aeon.classification.base import BaseClassifier from aeon.classification.early_classification import BaseEarlyClassifier from aeon.clustering.base import BaseClusterer -from aeon.datatypes._panel._check import is_nested_dataframe +from aeon.datatypes._collection._check import is_nested_dataframe from aeon.forecasting.base import BaseForecaster from aeon.regression.base import BaseRegressor from aeon.tests._config import VALID_ESTIMATOR_TYPES diff --git a/aeon/utils/validation/panel.py b/aeon/utils/validation/panel.py index 8799b3d063..71c848e703 100644 --- a/aeon/utils/validation/panel.py +++ b/aeon/utils/validation/panel.py @@ -12,8 +12,8 @@ import pandas as pd from sklearn.utils.validation import check_consistent_length -from aeon.datatypes._panel._check import is_nested_dataframe -from aeon.datatypes._panel._convert import ( +from aeon.datatypes._collection._check import is_nested_dataframe +from aeon.datatypes._collection._convert import ( 
from_3d_numpy_to_nested, from_nested_to_3d_numpy, ) diff --git a/examples/datasets/data_conversions.ipynb b/examples/datasets/data_conversions.ipynb index 225eaf9d47..2276c1a4a3 100644 --- a/examples/datasets/data_conversions.ipynb +++ b/examples/datasets/data_conversions.ipynb @@ -26,20 +26,23 @@ "\n", "Single time series can be stored in the following data structures\n", "\n", - "pd.Series: a univariate time series\n", - "pd.DataFrame: a univariate or multivariate time series\n", - "np.ndarray: 2D numpy.ndarray of shape `(n_timepoints, n_channels)`.\n", - "xr.DataArray: a univariate or multivariate time series\n", - "dask_series: Dask DataFrame: a univariate or multivariate time series\n", + "- \"pd.Series\": Pandas Series storing a univariate time series\n", + "- \"pd.DataFrame\": Pandas DataFrame storing a univariate or multivariate time series\n", + "- \"np.ndarray\": numpy 2d array for series of shape `(n_timepoints, n_channels)`.\n", + "- \"xr.DataArray\": xarray DataArray a for a univariate or multivariate time series\n", + "- \"dask_series\": Dask DataFrame for a univariate or multivariate time series\n", "\n", - "NOTE the 2D numpy array representation is not consistent with that used in\n", + "The above strings are used to internally specify each different data structure. NOTE the\n", + " 2D numpy array representation is not consistent with that used in\n", "collections. This is an unfortunate difference that is a result of legacy design and\n", - "norms in different research fields. We recommend not using numpy arrays with\n", - "forecasting.\n", + "norms in different research fields.\n", "\n", - "Conversion to and from these data structures is fairly straightforward. `aeon` contains\n", - "converters that are part of the legacy code base. There is a wrapper to hide all this\n", - " code, but we also show under the hood. This code is not likely to be maintained." 
+ "Conversion to and from these data structures is fairly straightforward, but we\n", + "provide tools to help. `aeon` contains converters that are wrapped by the method\n", + "`convert`. This method will attempt to convert from one of the five types to another,\n", + " and raise an exception if the conversion is invalid (e.g. if the object is not in\n", + " fact of type \"from_type\"). Note that series estimators will attempt to automatically\n", + " perform this conversion to the specified internal type of that estimator." ], "metadata": { "collapsed": false @@ -47,13 +50,13 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 27, "outputs": [ { "data": { "text/plain": "xarray.core.dataarray.DataArray" }, - "execution_count": 8, + "execution_count": 27, "metadata": {}, "output_type": "execute_result" } @@ -74,9 +77,8 @@ { "cell_type": "markdown", "source": [ - "All the actual converter functions for series are in the following file `aeon.datatypes._series._convert`. We stress,\n", - "this is legacy code. `aeon` thinks it better the user is responsible for getting the\n", - "data into the best format for the estimators." + "the method `convert` wraps actual converter functions in the file `aeon.datatypes\n", + "._series._convert`. 
Some examples below" ], "metadata": { "collapsed": false @@ -84,13 +86,13 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 28, "outputs": [ { "data": { "text/plain": "pandas.core.frame.DataFrame" }, - "execution_count": 9, + "execution_count": 28, "metadata": {}, "output_type": "execute_result" } @@ -111,13 +113,13 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 29, "outputs": [ { "data": { "text/plain": "dask.dataframe.core.DataFrame" }, - "execution_count": 10, + "execution_count": 29, "metadata": {}, "output_type": "execute_result" } @@ -132,13 +134,13 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 30, "outputs": [ { "data": { "text/plain": "xarray.core.dataarray.DataArray" }, - "execution_count": 11, + "execution_count": 30, "metadata": {}, "output_type": "execute_result" } @@ -151,11 +153,82 @@ "collapsed": false } }, + { + "cell_type": "markdown", + "source": [ + "# Collections Converters\n", + "\n", + "Previously, collections of time series were called panels (a term from econometrics,\n", + "not machine learning), and there are still references to panel. 
The main\n", + "data structures for storing collections are as follows\n", + "\n", + "- \"numpy3D\": 3D np.ndarray of format `(n_cases, n_channels, n_timepoints)`\n", + "- \"np-list\": python list of 2D numpy array of length `[n_cases]`, each of shape\n", + "`(n_channels, n_timepoints_i)`\n", + "- \"df-list\": python list of 2D pd.DataFrames of length `[n_cases]`, each a of shape\n", + "`(n_timepoints_i, n_channels)`\n", + "- \"numpyflat\": 2D np.ndarray of format `(n_cases, n_channels*n_timepoints)`\n", + "\n", + "Other supported types which may be useful in forecasting are\n", + "\n", + "- \"nested_univ\": a pd.DataFrame of shape `(n_cases, n_channels)` where each cell is a\n", + " pd.Series of length `(n_timepoints)`\n", + " - \"pd-multiindex\": pd.DataFrame with multi-index `(cases, timepoints)`\n", + " - \"pd-wide\": pd.DataFrame in wide format, `cols = (instance*timepoints)`\n", + " - \"dask_panel\": dask frame with one instance and one time index\n", + "\n", + "As with series, conversion is performed with the method `convert` and auto conversion\n", + " happens in estimator base classes. 
These wrap methods in `aeon.datatypes\n", + "._collection._convert`" + ], + "metadata": { + "collapsed": false + } + }, { "cell_type": "code", - "execution_count": 11, - "outputs": [], - "source": [], + "execution_count": 35, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Type = , type first shape first (3, 100)\n" + ] + } + ], + "source": [ + "# 10 multivariate time series with 3 channels of length 100 in \"numpy3D\" format\n", + "multi = np.random.random(size=(10, 3, 100))\n", + "np_list = convert(multi, from_type=\"numpy3D\", to_type=\"np-list\")\n", + "print(\n", + " f\" Type = {type(np_list)}, type first {type(np_list[0])} shape first \"\n", + " f\"{np_list[0].shape}\"\n", + ")" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 36, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Type = , type first shape first (100, 3)\n" + ] + } + ], + "source": [ + "df_list = convert(multi, from_type=\"numpy3D\", to_type=\"df-list\")\n", + "print(\n", + " f\" Type = {type(df_list)}, type first {type(df_list[0])} shape first \"\n", + " f\"{df_list[0].shape}\"\n", + ")" + ], "metadata": { "collapsed": false } @@ -163,51 +236,56 @@ { "cell_type": "markdown", "source": [ - "# Collections Converters\n", - "\n", - "Previously, collections of time series were called panels (a term from econometrics,\n", - "not machine learning), and there are still references to panel. Collections can be\n", - "stored as follows\n", - "\n", - "numpy3D: 3D np.array of format (n_instances, n_channels, n_timepoints)\n", - "np-list:\n", - "\n", + "Note again the difference in storage convention: series in 2D numpy are stored in `\n", + "(n_channels, n_timepoints)`, whereas with dataframes, they are in shape `\n", + "(n_timepoints, n_channels)`. We know this is confusing, and are thinking about the\n", + "best way of reconciling this distinction. 
See [this issue](https://github\n", + ".com/aeon-toolkit/aeon/issues/537). The actual converter functions are here\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 39, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Type = ,shape (3000, 4)\n" + ] + } + ], + "source": [ + "from aeon.datatypes._collection._convert import (\n", + " from_3d_numpy_to_long,\n", + " from_3d_numpy_to_multi_index,\n", + ")\n", "\n", - "MTYPE_REGISTER_PANEL = [\n", - " (\n", - " \"nested_univ\",\n", - " \"Panel\",\n", - " \"pd.DataFrame with one column per channel, pd.Series in cells\",\n", - " ),\n", - " (\n", - " \"numpy3D\",\n", - " \"Panel\",\n", - " \"3D np.array of format (n_instances, n_channels, n_timepoints)\",\n", - " ),\n", - " (\n", - " \"numpyflat\",\n", - " \"Panel\",\n", - " \"2D np.array of format (n_instances, n_columns*n_timepoints)\",\n", - " ),\n", - " (\"pd-multiindex\", \"Panel\", \"pd.DataFrame with multi-index (instances, timepoints)\"),\n", - " (\"pd-wide\", \"Panel\", \"pd.DataFrame in wide format, cols = (instance*timepoints)\"),\n", - " (\n", - " \"pd-long\",\n", - " \"Panel\",\n", - " \"pd.DataFrame in long format, cols = (index, time_index, column)\",\n", - " ),\n", - " (\"df-list\", \"Panel\", \"list of pd.DataFrame\"),\n", - " (\n", - " \"dask_panel\",\n", - " \"Panel\",\n", - " \"dask frame with one instance and one time index, as per dask_to_pd convention\",\n", - " ),\n", - " (\n", - " \"np-list\",\n", - " \"Panel\",\n", - " \"list of n_cases, each case a 2D np.array of shape (n_channels, series_length)\",\n", - " ),\n", - "]\n" + "long = from_3d_numpy_to_long(multi)\n", + "print(f\" Type = {type(long)},shape {long.shape}\")" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 40, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Type = ,shape (1000, 3)\n" + ] + } + ], + "source": [ + "mi = 
from_3d_numpy_to_multi_index(multi)\n", + "print(f\" Type = {type(mi)},shape {mi.shape}\")" ], "metadata": { "collapsed": false From 8821a6e0e76f18621cc4e2616a27f9fa2ab83b63 Mon Sep 17 00:00:00 2001 From: Tony Bagnall Date: Wed, 5 Jul 2023 10:09:15 +0100 Subject: [PATCH 03/14] revert collection to panel to find circular import --- aeon/classification/tests/test_base.py | 2 +- aeon/datasets/_dataframe_loaders.py | 2 +- aeon/datatypes/_check.py | 2 +- aeon/datatypes/_convert.py | 2 +- aeon/datatypes/_examples.py | 10 ++--- aeon/datatypes/_hierarchical/_check.py | 2 +- .../{_collection => _panel}/__init__.py | 12 +++--- .../{_collection => _panel}/_check.py | 0 .../{_collection => _panel}/_convert.py | 2 +- .../{_collection => _panel}/_examples.py | 0 .../{_collection => _panel}/_registry.py | 0 aeon/datatypes/_registry.py | 10 ++--- aeon/datatypes/tests/test_panel_converters.py | 4 +- .../tests/test_series_to_panel_converters.py | 2 +- aeon/forecasting/base/tests/test_base.py | 2 +- aeon/transformations/collection/segment.py | 2 +- aeon/transformations/collection/tsfresh.py | 2 +- aeon/utils/_testing/estimator_checks.py | 2 +- aeon/utils/validation/panel.py | 4 +- examples/AA_datatypes_and_datasets.ipynb | 39 +++++++++++-------- 20 files changed, 54 insertions(+), 47 deletions(-) rename aeon/datatypes/{_collection => _panel}/__init__.py (51%) rename aeon/datatypes/{_collection => _panel}/_check.py (100%) rename aeon/datatypes/{_collection => _panel}/_convert.py (99%) rename aeon/datatypes/{_collection => _panel}/_examples.py (100%) rename aeon/datatypes/{_collection => _panel}/_registry.py (100%) diff --git a/aeon/classification/tests/test_base.py b/aeon/classification/tests/test_base.py index c2fa94cd2e..b90bcbccd7 100644 --- a/aeon/classification/tests/test_base.py +++ b/aeon/classification/tests/test_base.py @@ -9,7 +9,7 @@ from aeon.classification import DummyClassifier from aeon.classification.base import BaseClassifier -from aeon.datatypes._collection._convert 
import ( +from aeon.datatypes._panel._convert import ( from_nested_to_dflist_adp, from_nested_to_multi_index, ) diff --git a/aeon/datasets/_dataframe_loaders.py b/aeon/datasets/_dataframe_loaders.py index e83424d6a1..dbd5483620 100644 --- a/aeon/datasets/_dataframe_loaders.py +++ b/aeon/datasets/_dataframe_loaders.py @@ -24,7 +24,7 @@ from aeon.datasets._data_generators import _convert_tsf_to_hierarchical from aeon.datatypes import MTYPE_LIST_HIERARCHICAL, convert -from aeon.datatypes._collection._convert import from_long_to_nested +from aeon.datatypes._panel._convert import from_long_to_nested DIRNAME = "data" MODULE = os.path.dirname(__file__) diff --git a/aeon/datatypes/_check.py b/aeon/datatypes/_check.py index 23f600dbec..3861b640a0 100644 --- a/aeon/datatypes/_check.py +++ b/aeon/datatypes/_check.py @@ -29,8 +29,8 @@ import numpy as np from aeon.datatypes._alignment import check_dict_Alignment -from aeon.datatypes._collection import check_dict_Panel from aeon.datatypes._hierarchical import check_dict_Hierarchical +from aeon.datatypes._panel import check_dict_Panel from aeon.datatypes._proba import check_dict_Proba from aeon.datatypes._registry import AMBIGUOUS_MTYPES, SCITYPE_LIST, mtype_to_scitype from aeon.datatypes._series import check_dict_Series diff --git a/aeon/datatypes/_convert.py b/aeon/datatypes/_convert.py index 4afa17ce81..c06b5d4965 100644 --- a/aeon/datatypes/_convert.py +++ b/aeon/datatypes/_convert.py @@ -71,8 +71,8 @@ import pandas as pd from aeon.datatypes._check import mtype as infer_mtype -from aeon.datatypes._collection import convert_dict_Panel from aeon.datatypes._hierarchical import convert_dict_Hierarchical +from aeon.datatypes._panel import convert_dict_Panel from aeon.datatypes._proba import convert_dict_Proba from aeon.datatypes._registry import AMBIGUOUS_MTYPES, mtype_to_scitype from aeon.datatypes._series import convert_dict_Series diff --git a/aeon/datatypes/_examples.py b/aeon/datatypes/_examples.py index 
d860d1f26c..be06bc52f9 100644 --- a/aeon/datatypes/_examples.py +++ b/aeon/datatypes/_examples.py @@ -23,16 +23,16 @@ ] from aeon.datatypes._alignment import example_dict_Alignment -from aeon.datatypes._collection import ( - example_dict_lossy_Panel, - example_dict_metadata_Panel, - example_dict_Panel, -) from aeon.datatypes._hierarchical import ( example_dict_Hierarchical, example_dict_lossy_Hierarchical, example_dict_metadata_Hierarchical, ) +from aeon.datatypes._panel import ( + example_dict_lossy_Panel, + example_dict_metadata_Panel, + example_dict_Panel, +) from aeon.datatypes._proba import ( example_dict_lossy_Proba, example_dict_metadata_Proba, diff --git a/aeon/datatypes/_hierarchical/_check.py b/aeon/datatypes/_hierarchical/_check.py index 4c29115a2e..1181180dfe 100644 --- a/aeon/datatypes/_hierarchical/_check.py +++ b/aeon/datatypes/_hierarchical/_check.py @@ -44,7 +44,7 @@ import numpy as np -from aeon.datatypes._collection._check import check_pdmultiindex_panel +from aeon.datatypes._panel._check import check_pdmultiindex_panel from aeon.utils.validation._dependencies import _check_soft_dependencies diff --git a/aeon/datatypes/_collection/__init__.py b/aeon/datatypes/_panel/__init__.py similarity index 51% rename from aeon/datatypes/_collection/__init__.py rename to aeon/datatypes/_panel/__init__.py index 02aa9e0c06..bab0affda4 100644 --- a/aeon/datatypes/_collection/__init__.py +++ b/aeon/datatypes/_panel/__init__.py @@ -1,16 +1,16 @@ # -*- coding: utf-8 -*- """Module exports: Panel type checkers, converters and mtype inference.""" -from aeon.datatypes._collection._check import check_dict as check_dict_Panel -from aeon.datatypes._collection._convert import convert_dict as convert_dict_Panel -from aeon.datatypes._collection._examples import example_dict as example_dict_Panel -from aeon.datatypes._collection._examples import ( +from aeon.datatypes._panel._check import check_dict as check_dict_Panel +from aeon.datatypes._panel._convert import convert_dict 
as convert_dict_Panel +from aeon.datatypes._panel._examples import example_dict as example_dict_Panel +from aeon.datatypes._panel._examples import ( example_dict_lossy as example_dict_lossy_Panel, ) -from aeon.datatypes._collection._examples import ( +from aeon.datatypes._panel._examples import ( example_dict_metadata as example_dict_metadata_Panel, ) -from aeon.datatypes._collection._registry import MTYPE_LIST_PANEL, MTYPE_REGISTER_PANEL +from aeon.datatypes._panel._registry import MTYPE_LIST_PANEL, MTYPE_REGISTER_PANEL __all__ = [ "check_dict_Panel", diff --git a/aeon/datatypes/_collection/_check.py b/aeon/datatypes/_panel/_check.py similarity index 100% rename from aeon/datatypes/_collection/_check.py rename to aeon/datatypes/_panel/_check.py diff --git a/aeon/datatypes/_collection/_convert.py b/aeon/datatypes/_panel/_convert.py similarity index 99% rename from aeon/datatypes/_collection/_convert.py rename to aeon/datatypes/_panel/_convert.py index 4b5d4fdd9a..fe3b2f1ee5 100644 --- a/aeon/datatypes/_collection/_convert.py +++ b/aeon/datatypes/_panel/_convert.py @@ -34,8 +34,8 @@ "convert_dict", ] -from aeon.datatypes._collection._registry import MTYPE_LIST_PANEL from aeon.datatypes._convert_utils._convert import _extend_conversions +from aeon.datatypes._panel._registry import MTYPE_LIST_PANEL from aeon.utils.validation._dependencies import _check_soft_dependencies # dictionary indexed by triples of types diff --git a/aeon/datatypes/_collection/_examples.py b/aeon/datatypes/_panel/_examples.py similarity index 100% rename from aeon/datatypes/_collection/_examples.py rename to aeon/datatypes/_panel/_examples.py diff --git a/aeon/datatypes/_collection/_registry.py b/aeon/datatypes/_panel/_registry.py similarity index 100% rename from aeon/datatypes/_collection/_registry.py rename to aeon/datatypes/_panel/_registry.py diff --git a/aeon/datatypes/_registry.py b/aeon/datatypes/_registry.py index d13a63e240..57fcc2d8cf 100644 --- a/aeon/datatypes/_registry.py +++ 
b/aeon/datatypes/_registry.py @@ -43,16 +43,16 @@ MTYPE_LIST_ALIGNMENT, MTYPE_REGISTER_ALIGNMENT, ) -from aeon.datatypes._collection._registry import ( - MTYPE_LIST_PANEL, - MTYPE_REGISTER_PANEL, - MTYPE_SOFT_DEPS_PANEL, -) from aeon.datatypes._hierarchical._registry import ( MTYPE_LIST_HIERARCHICAL, MTYPE_REGISTER_HIERARCHICAL, MTYPE_SOFT_DEPS_HIERARCHICAL, ) +from aeon.datatypes._panel._registry import ( + MTYPE_LIST_PANEL, + MTYPE_REGISTER_PANEL, + MTYPE_SOFT_DEPS_PANEL, +) from aeon.datatypes._proba._registry import MTYPE_LIST_PROBA, MTYPE_REGISTER_PROBA from aeon.datatypes._series._registry import ( MTYPE_LIST_SERIES, diff --git a/aeon/datatypes/tests/test_panel_converters.py b/aeon/datatypes/tests/test_panel_converters.py index 7a0737b368..adc68997c0 100644 --- a/aeon/datatypes/tests/test_panel_converters.py +++ b/aeon/datatypes/tests/test_panel_converters.py @@ -6,12 +6,12 @@ from aeon.datasets import make_example_long_table, make_example_multi_index_dataframe from aeon.datatypes._adapter import convert_from_multiindex_to_listdataset -from aeon.datatypes._collection._check import ( +from aeon.datatypes._panel._check import ( are_columns_nested, check_nplist_panel, is_nested_dataframe, ) -from aeon.datatypes._collection._convert import ( +from aeon.datatypes._panel._convert import ( from_2d_array_to_nested, from_3d_numpy_to_2d_array, from_3d_numpy_to_multi_index, diff --git a/aeon/datatypes/tests/test_series_to_panel_converters.py b/aeon/datatypes/tests/test_series_to_panel_converters.py index 5181900b82..152d5ea5ec 100644 --- a/aeon/datatypes/tests/test_series_to_panel_converters.py +++ b/aeon/datatypes/tests/test_series_to_panel_converters.py @@ -4,7 +4,7 @@ import numpy as np import pandas as pd -from aeon.datatypes._collection._convert import from_3d_numpy_to_multi_index +from aeon.datatypes._panel._convert import from_3d_numpy_to_multi_index from aeon.datatypes._series_as_panel import ( convert_Panel_to_Series, convert_Series_to_Panel, diff --git 
a/aeon/forecasting/base/tests/test_base.py b/aeon/forecasting/base/tests/test_base.py index bd98210fd7..bb1ae83a6a 100644 --- a/aeon/forecasting/base/tests/test_base.py +++ b/aeon/forecasting/base/tests/test_base.py @@ -12,7 +12,7 @@ from pandas.testing import assert_series_equal from aeon.datatypes import check_is_mtype, convert -from aeon.datatypes._collection._convert import from_nested_to_multi_index +from aeon.datatypes._panel._convert import from_nested_to_multi_index from aeon.datatypes._utilities import get_cutoff, get_window from aeon.forecasting.arima import ARIMA from aeon.utils._testing.collection import make_3d_test_data, make_nested_dataframe_data diff --git a/aeon/transformations/collection/segment.py b/aeon/transformations/collection/segment.py index 6757b3ce73..a6af5b87b1 100644 --- a/aeon/transformations/collection/segment.py +++ b/aeon/transformations/collection/segment.py @@ -6,7 +6,7 @@ import pandas as pd from sklearn.utils import check_random_state -from aeon.datatypes._collection._convert import _concat_nested_arrays, _get_time_index +from aeon.datatypes._panel._convert import _concat_nested_arrays, _get_time_index from aeon.transformations.base import BaseTransformer from aeon.utils.validation import check_window_length diff --git a/aeon/transformations/collection/tsfresh.py b/aeon/transformations/collection/tsfresh.py index f47a02a2bd..21146b4e48 100644 --- a/aeon/transformations/collection/tsfresh.py +++ b/aeon/transformations/collection/tsfresh.py @@ -5,7 +5,7 @@ __author__ = ["AyushmaanSeth", "mloning", "Alwin Wang", "MatthewMiddlehurst"] __all__ = ["TSFreshFeatureExtractor", "TSFreshRelevantFeatureExtractor"] -from aeon.datatypes._collection._convert import from_3d_numpy_to_long +from aeon.datatypes._panel._convert import from_3d_numpy_to_long from aeon.transformations.collection.base import BaseCollectionTransformer from aeon.utils.validation import check_n_jobs from aeon.utils.validation._dependencies import _check_soft_dependencies 
diff --git a/aeon/utils/_testing/estimator_checks.py b/aeon/utils/_testing/estimator_checks.py index 6a4021f24d..6b4c5b1b76 100644 --- a/aeon/utils/_testing/estimator_checks.py +++ b/aeon/utils/_testing/estimator_checks.py @@ -17,7 +17,7 @@ from aeon.classification.base import BaseClassifier from aeon.classification.early_classification import BaseEarlyClassifier from aeon.clustering.base import BaseClusterer -from aeon.datatypes._collection._check import is_nested_dataframe +from aeon.datatypes._panel._check import is_nested_dataframe from aeon.forecasting.base import BaseForecaster from aeon.regression.base import BaseRegressor from aeon.tests._config import VALID_ESTIMATOR_TYPES diff --git a/aeon/utils/validation/panel.py b/aeon/utils/validation/panel.py index 71c848e703..8799b3d063 100644 --- a/aeon/utils/validation/panel.py +++ b/aeon/utils/validation/panel.py @@ -12,8 +12,8 @@ import pandas as pd from sklearn.utils.validation import check_consistent_length -from aeon.datatypes._collection._check import is_nested_dataframe -from aeon.datatypes._collection._convert import ( +from aeon.datatypes._panel._check import is_nested_dataframe +from aeon.datatypes._panel._convert import ( from_3d_numpy_to_nested, from_nested_to_3d_numpy, ) diff --git a/examples/AA_datatypes_and_datasets.ipynb b/examples/AA_datatypes_and_datasets.ipynb index 0bbec22f95..34ba963c25 100644 --- a/examples/AA_datatypes_and_datasets.ipynb +++ b/examples/AA_datatypes_and_datasets.ipynb @@ -15,7 +15,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -87,9 +87,26 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "metadata": {}, - "outputs": [], + "outputs": [ + { + "ename": "ImportError", + "evalue": "cannot import name 'MTYPE_LIST_SERIES' from partially initialized module 'aeon.datatypes._registry' (most likely due to a circular import) (C:\\Code\\aeon\\aeon\\datatypes\\_registry.py)", + 
"output_type": "error", + "traceback": [ + "\u001B[1;31m---------------------------------------------------------------------------\u001B[0m", + "\u001B[1;31mImportError\u001B[0m Traceback (most recent call last)", + "Cell \u001B[1;32mIn[5], line 2\u001B[0m\n\u001B[0;32m 1\u001B[0m \u001B[38;5;66;03m# import to retrieve examples\u001B[39;00m\n\u001B[1;32m----> 2\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m get_examples\n", + "File \u001B[1;32mC:\\Code\\aeon\\aeon\\datatypes\\__init__.py:6\u001B[0m\n\u001B[0;32m 2\u001B[0m \u001B[38;5;124;03m\"\"\"Module exports: data type definitions, checks, validation, fixtures, converters.\"\"\"\u001B[39;00m\n\u001B[0;32m 4\u001B[0m __author__ \u001B[38;5;241m=\u001B[39m [\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mfkiraly\u001B[39m\u001B[38;5;124m\"\u001B[39m]\n\u001B[1;32m----> 6\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_check\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m (\n\u001B[0;32m 7\u001B[0m check_is_mtype,\n\u001B[0;32m 8\u001B[0m check_is_scitype,\n\u001B[0;32m 9\u001B[0m check_raise,\n\u001B[0;32m 10\u001B[0m mtype,\n\u001B[0;32m 11\u001B[0m scitype,\n\u001B[0;32m 12\u001B[0m )\n\u001B[0;32m 13\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_convert\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m convert, convert_to\n\u001B[0;32m 14\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_examples\u001B[39;00m 
\u001B[38;5;28;01mimport\u001B[39;00m get_examples\n", + "File \u001B[1;32mC:\\Code\\aeon\\aeon\\datatypes\\_check.py:35\u001B[0m\n\u001B[0;32m 33\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_hierarchical\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m check_dict_Hierarchical\n\u001B[0;32m 34\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_proba\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m check_dict_Proba\n\u001B[1;32m---> 35\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_registry\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m AMBIGUOUS_MTYPES, SCITYPE_LIST, mtype_to_scitype\n\u001B[0;32m 36\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_series\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m check_dict_Series\n\u001B[0;32m 37\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_table\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m check_dict_Table\n", + "File \u001B[1;32mC:\\Code\\aeon\\aeon\\datatypes\\_registry.py:57\u001B[0m\n\u001B[0;32m 51\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m 
\u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_hierarchical\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_registry\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m (\n\u001B[0;32m 52\u001B[0m MTYPE_LIST_HIERARCHICAL,\n\u001B[0;32m 53\u001B[0m MTYPE_REGISTER_HIERARCHICAL,\n\u001B[0;32m 54\u001B[0m MTYPE_SOFT_DEPS_HIERARCHICAL,\n\u001B[0;32m 55\u001B[0m )\n\u001B[0;32m 56\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_proba\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_registry\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m MTYPE_LIST_PROBA, MTYPE_REGISTER_PROBA\n\u001B[1;32m---> 57\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_series\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_registry\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m (\n\u001B[0;32m 58\u001B[0m MTYPE_LIST_SERIES,\n\u001B[0;32m 59\u001B[0m MTYPE_REGISTER_SERIES,\n\u001B[0;32m 60\u001B[0m MTYPE_SOFT_DEPS_SERIES,\n\u001B[0;32m 61\u001B[0m )\n\u001B[0;32m 62\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_table\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_registry\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m MTYPE_LIST_TABLE, MTYPE_REGISTER_TABLE\n\u001B[0;32m 64\u001B[0m MTYPE_REGISTER \u001B[38;5;241m=\u001B[39m []\n", + "File \u001B[1;32mC:\\Code\\aeon\\aeon\\datatypes\\_series\\__init__.py:5\u001B[0m\n\u001B[0;32m 2\u001B[0m 
\u001B[38;5;124;03m\"\"\"Module exports: Series type checkers, converters and mtype inference.\"\"\"\u001B[39;00m\n\u001B[0;32m 4\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_series\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_check\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m check_dict \u001B[38;5;28;01mas\u001B[39;00m check_dict_Series\n\u001B[1;32m----> 5\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_series\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_convert\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m convert_dict \u001B[38;5;28;01mas\u001B[39;00m convert_dict_Series\n\u001B[0;32m 6\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_series\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_examples\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m example_dict \u001B[38;5;28;01mas\u001B[39;00m example_dict_Series\n\u001B[0;32m 7\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_series\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_examples\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m (\n\u001B[0;32m 8\u001B[0m example_dict_lossy \u001B[38;5;28;01mas\u001B[39;00m example_dict_lossy_Series,\n\u001B[0;32m 9\u001B[0m )\n", + "File \u001B[1;32mC:\\Code\\aeon\\aeon\\datatypes\\_series\\_convert.py:41\u001B[0m\n\u001B[0;32m 37\u001B[0m 
\u001B[38;5;66;03m##############################################################\u001B[39;00m\n\u001B[0;32m 38\u001B[0m \u001B[38;5;66;03m# methods to convert one machine type to another machine type\u001B[39;00m\n\u001B[0;32m 39\u001B[0m \u001B[38;5;66;03m##############################################################\u001B[39;00m\n\u001B[0;32m 40\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_convert_utils\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_convert\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m _extend_conversions\n\u001B[1;32m---> 41\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mdatatypes\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_registry\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m MTYPE_LIST_SERIES\n\u001B[0;32m 42\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01maeon\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mutils\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mvalidation\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01m_dependencies\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m _check_soft_dependencies\n\u001B[0;32m 44\u001B[0m convert_dict \u001B[38;5;241m=\u001B[39m \u001B[38;5;28mdict\u001B[39m()\n", + "\u001B[1;31mImportError\u001B[0m: cannot import name 'MTYPE_LIST_SERIES' from partially initialized module 'aeon.datatypes._registry' (most likely due to a circular import) (C:\\Code\\aeon\\aeon\\datatypes\\_registry.py)" + ] + } + ], "source": [ "# import to retrieve examples\n", "from aeon.datatypes import get_examples" @@ -748,23 +765,13 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": " dask_series 
np.ndarray pd.DataFrame pd.Series xr.DataArray\ndask_series 1 1 1 1 1\nnp.ndarray 1 1 1 1 1\npd.DataFrame 1 1 1 1 1\npd.Series 1 1 1 1 1\nxr.DataArray 1 1 1 1 1", - "text/html": "
              dask_series  np.ndarray  pd.DataFrame  pd.Series  xr.DataArray
dask_series             1           1             1          1             1
np.ndarray              1           1             1          1             1
pd.DataFrame            1           1             1          1             1
pd.Series               1           1             1          1             1
xr.DataArray            1           1             1          1             1
\n
" - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "from aeon.datatypes._convert import _conversions_defined\n", "\n", - "_conversions_defined(scitype=\"Series\")" + "_conversions_defined(scitype=\"Panel\")" ] }, { From 77f37ca2f915db3b1900d30e8b328557c181ee4a Mon Sep 17 00:00:00 2001 From: Tony Bagnall Date: Wed, 5 Jul 2023 10:18:00 +0100 Subject: [PATCH 04/14] revert notebook to _panel --- examples/datasets/data_conversions.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/examples/datasets/data_conversions.ipynb b/examples/datasets/data_conversions.ipynb index 2276c1a4a3..a228a7affd 100644 --- a/examples/datasets/data_conversions.ipynb +++ b/examples/datasets/data_conversions.ipynb @@ -179,7 +179,7 @@ "\n", "As with series, conversion is performed with the method `convert` and auto conversion\n", " happens in estimator base classes. These wrap methods in `aeon.datatypes\n", - "._collection._convert`" + "._panel._convert`" ], "metadata": { "collapsed": false @@ -259,7 +259,7 @@ } ], "source": [ - "from aeon.datatypes._collection._convert import (\n", + "from aeon.datatypes._panel._convert import (\n", " from_3d_numpy_to_long,\n", " from_3d_numpy_to_multi_index,\n", ")\n", From c24b6ae9bc4c83e218179fd408e6fcbc3f516907 Mon Sep 17 00:00:00 2001 From: chrisholder Date: Wed, 5 Jul 2023 16:44:16 +0100 Subject: [PATCH 05/14] removed isinstance --- aeon/distances/_distance.py | 16 +++- aeon/distances/_erp.py | 95 +++++++++++-------- .../tests/test_numba_distance_parameters.py | 4 +- setup.cfg | 18 ++-- 4 files changed, 79 insertions(+), 54 deletions(-) diff --git a/aeon/distances/_distance.py b/aeon/distances/_distance.py index c2da8f19ff..6674a685b3 100644 --- a/aeon/distances/_distance.py +++ b/aeon/distances/_distance.py @@ -132,7 +132,9 @@ def distance( elif metric == "lcss": return lcss_distance(x, y, kwargs.get("window"), kwargs.get("epsilon", 1.0)) elif metric == "erp": 
- return erp_distance(x, y, kwargs.get("window"), kwargs.get("g", 0.0)) + return erp_distance( + x, y, kwargs.get("window"), kwargs.get("g", 0.0), kwargs.get("g_arr", None) + ) elif metric == "edr": return edr_distance(x, y, kwargs.get("window"), kwargs.get("epsilon")) elif metric == "twe": @@ -243,7 +245,9 @@ def pairwise_distance( x, y, kwargs.get("window"), kwargs.get("epsilon", 1.0) ) elif metric == "erp": - return erp_pairwise_distance(x, y, kwargs.get("window"), kwargs.get("g", 0.0)) + return erp_pairwise_distance( + x, y, kwargs.get("window"), kwargs.get("g", 0.0), kwargs.get("g_arr", None) + ) elif metric == "edr": return edr_pairwise_distance(x, y, kwargs.get("window"), kwargs.get("epsilon")) elif metric == "twe": @@ -374,7 +378,9 @@ def alignment_path( x, y, kwargs.get("window"), kwargs.get("epsilon", 1.0) ) elif metric == "erp": - return erp_alignment_path(x, y, kwargs.get("window"), kwargs.get("g", 0.0)) + return erp_alignment_path( + x, y, kwargs.get("window"), kwargs.get("g", 0.0), kwargs.get("g_arr", None) + ) elif metric == "edr": return edr_alignment_path(x, y, kwargs.get("window"), kwargs.get("epsilon")) elif metric == "twe": @@ -460,7 +466,9 @@ def cost_matrix( elif metric == "lcss": return lcss_cost_matrix(x, y, kwargs.get("window"), kwargs.get("epsilon", 1.0)) elif metric == "erp": - return erp_cost_matrix(x, y, kwargs.get("window"), kwargs.get("g", 0.0)) + return erp_cost_matrix( + x, y, kwargs.get("window"), kwargs.get("g", 0.0), kwargs.get("g_arr", None) + ) elif metric == "edr": return edr_cost_matrix(x, y, kwargs.get("window"), kwargs.get("epsilon")) elif metric == "twe": diff --git a/aeon/distances/_erp.py b/aeon/distances/_erp.py index 56e966ee32..fdf426b6ca 100644 --- a/aeon/distances/_erp.py +++ b/aeon/distances/_erp.py @@ -34,7 +34,8 @@ def erp_distance( x: np.ndarray, y: np.ndarray, window: float = None, - g: Union[float, np.ndarray] = 0.0, + g: float = 0.0, + g_arr: np.ndarray = None, ) -> float: """Compute the ERP distance between 
two time series. @@ -58,10 +59,10 @@ def erp_distance( window: float, defaults=None The window to use for the bounding matrix. If None, no bounding matrix is used. - g: float or np.ndarray of shape (n_channels), defaults=0. - The reference value to penalise gaps. The default is 0. If it is an array - then it must be the length of the number of channels in x and y. If a single - value is provided then that value is used across each channel + g: float. + The reference value to penalise gaps. The default is 0. + g_arr: np.ndarray of shape (n_channels), defaults=None + Numpy array that must be the length of the number of channels in x and y. Returns ------- @@ -91,10 +92,10 @@ def erp_distance( _x = x.reshape((1, x.shape[0])) _y = y.reshape((1, y.shape[0])) bounding_matrix = create_bounding_matrix(_x.shape[1], _y.shape[1], window) - return _erp_distance(_x, _y, bounding_matrix, g) + return _erp_distance(_x, _y, bounding_matrix, g, g_arr) if x.ndim == 2 and y.ndim == 2: bounding_matrix = create_bounding_matrix(x.shape[1], y.shape[1], window) - return _erp_distance(x, y, bounding_matrix, g) + return _erp_distance(x, y, bounding_matrix, g, g_arr) raise ValueError("x and y must be 1D or 2D") @@ -104,6 +105,7 @@ def erp_cost_matrix( y: np.ndarray, window: float = None, g: Union[float, np.ndarray] = 0.0, + g_arr: np.ndarray = None, ) -> np.ndarray: """Compute the ERP cost matrix between two time series. @@ -121,10 +123,10 @@ def erp_cost_matrix( window: float, defaults=None The window to use for the bounding matrix. If None, no bounding matrix is used. - g: float or np.ndarray of shape (n_channels), defaults=0. - The reference value to penalise gaps. The default is 0. If it is an array - then it must be the length of the number of channels in x and y. If a single - value is provided then that value is used across each channel. + g: float. + The reference value to penalise gaps. The default is 0. 
+ g_arr: np.ndarray of shape (n_channels), defaults=None + Numpy array that must be the length of the number of channels in x and y. Returns ------- @@ -158,10 +160,10 @@ def erp_cost_matrix( _x = x.reshape((1, x.shape[0])) _y = y.reshape((1, y.shape[0])) bounding_matrix = create_bounding_matrix(_x.shape[1], _y.shape[1], window) - return _erp_cost_matrix(_x, _y, bounding_matrix, g) + return _erp_cost_matrix(_x, _y, bounding_matrix, g, g_arr) if x.ndim == 2 and y.ndim == 2: bounding_matrix = create_bounding_matrix(x.shape[1], y.shape[1], window) - return _erp_cost_matrix(x, y, bounding_matrix, g) + return _erp_cost_matrix(x, y, bounding_matrix, g, g_arr) raise ValueError("x and y must be 1D or 2D") @@ -170,9 +172,12 @@ def _erp_distance( x: np.ndarray, y: np.ndarray, bounding_matrix: np.ndarray, - g: Union[float, np.ndarray], + g: float, + g_arr: np.ndarray, ) -> float: - return _erp_cost_matrix(x, y, bounding_matrix, g)[x.shape[1] - 1, y.shape[1] - 1] + return _erp_cost_matrix(x, y, bounding_matrix, g, g_arr)[ + x.shape[1] - 1, y.shape[1] - 1 + ] @njit(cache=True, fastmath=True) @@ -180,15 +185,16 @@ def _erp_cost_matrix( x: np.ndarray, y: np.ndarray, bounding_matrix: np.ndarray, - g: Union[float, np.ndarray], + g: float, + g_arr: np.ndarray, ) -> np.ndarray: x_size = x.shape[1] y_size = y.shape[1] cost_matrix = np.zeros((x_size + 1, y_size + 1)) - gx_distance, x_sum = _precompute_g(x, g) - gy_distance, y_sum = _precompute_g(y, g) + gx_distance, x_sum = _precompute_g(x, g, g_arr) + gy_distance, y_sum = _precompute_g(y, g, g_arr) cost_matrix[1:, 0] = x_sum cost_matrix[0, 1:] = y_sum @@ -208,15 +214,15 @@ def _erp_cost_matrix( @njit(cache=True, fastmath=True) def _precompute_g( - x: np.ndarray, g: Union[float, np.ndarray] + x: np.ndarray, g: float, g_array: np.ndarray ) -> Tuple[np.ndarray, float]: gx_distance = np.zeros(x.shape[1]) - if isinstance(g, float): + if g_array is None: g_arr = np.full(x.shape[0], g) else: - if g.shape[0] != x.shape[0]: + if 
g_array.shape[0] != x.shape[0]: raise ValueError("g must be a float or an array with shape (x.shape[0],)") - g_arr = g + g_arr = g_array x_sum = 0 for i in range(x.shape[1]): @@ -231,7 +237,8 @@ def erp_pairwise_distance( X: np.ndarray, y: np.ndarray = None, window: float = None, - g: Union[float, np.ndarray] = 0.0, + g: float = 0.0, + g_arr: np.ndarray = None, ) -> np.ndarray: """Compute the erp pairwise distance between a set of time series. @@ -251,10 +258,10 @@ def erp_pairwise_distance( window: float, default=None The window to use for the bounding matrix. If None, no bounding matrix is used. - g: float or np.ndarray of shape (n_channels), defaults=0 - The reference value to penalise gaps. The default is 0. If it is an array - then it must be the length of the number of channels in x and y. If a single - value is provided then that value is used across each channel. + g: float. + The reference value to penalise gaps. The default is 0. + g_arr: np.ndarray of shape (n_channels), defaults=None + Numpy array that must be the length of the number of channels in x and y. 
Returns ------- @@ -297,18 +304,21 @@ def erp_pairwise_distance( if y is None: # To self if X.ndim == 3: - return _erp_pairwise_distance(X, window, g) + return _erp_pairwise_distance(X, window, g, g_arr) if X.ndim == 2: _X = X.reshape((X.shape[0], 1, X.shape[1])) - return _erp_pairwise_distance(_X, window, g) + return _erp_pairwise_distance(_X, window, g, g_arr) raise ValueError("x and y must be 2D or 3D arrays") _x, _y = reshape_pairwise_to_multiple(X, y) - return _erp_from_multiple_to_multiple_distance(_x, _y, window, g) + return _erp_from_multiple_to_multiple_distance(_x, _y, window, g, g_arr) @njit(cache=True, fastmath=True) def _erp_pairwise_distance( - X: np.ndarray, window: float, g: Union[float, np.ndarray] + X: np.ndarray, + window: float, + g: float, + g_arr: np.ndarray, ) -> np.ndarray: n_instances = X.shape[0] distances = np.zeros((n_instances, n_instances)) @@ -316,7 +326,7 @@ def _erp_pairwise_distance( for i in range(n_instances): for j in range(i + 1, n_instances): - distances[i, j] = _erp_distance(X[i], X[j], bounding_matrix, g) + distances[i, j] = _erp_distance(X[i], X[j], bounding_matrix, g, g_arr) distances[j, i] = distances[i, j] return distances @@ -324,7 +334,11 @@ def _erp_pairwise_distance( @njit(cache=True, fastmath=True) def _erp_from_multiple_to_multiple_distance( - x: np.ndarray, y: np.ndarray, window: float, g: Union[float, np.ndarray] + x: np.ndarray, + y: np.ndarray, + window: float, + g: float, + g_arr: np.ndarray, ) -> np.ndarray: n_instances = x.shape[0] m_instances = y.shape[0] @@ -333,7 +347,7 @@ def _erp_from_multiple_to_multiple_distance( for i in range(n_instances): for j in range(m_instances): - distances[i, j] = _erp_distance(x[i], y[j], bounding_matrix, g) + distances[i, j] = _erp_distance(x[i], y[j], bounding_matrix, g, g_arr) return distances @@ -342,7 +356,8 @@ def erp_alignment_path( x: np.ndarray, y: np.ndarray, window: float = None, - g: Union[float, np.ndarray] = 0.0, + g: float = 0.0, + g_arr: np.ndarray = None, ) 
-> Tuple[List[Tuple[int, int]], float]: """Compute the erp alignment path between two time series. @@ -360,10 +375,10 @@ def erp_alignment_path( window: float, default=None The window to use for the bounding matrix. If None, no bounding matrix is used. - g: float or np.ndarray of shape (n_channels), defaults=0. - The reference value to penalise gaps. The default is 0. If it is an array - then it must be the length of the number of channels in x and y. If a single - value is provided then that value is used across each channel. + g: float. + The reference value to penalise gaps. The default is 0. + g_arr: np.ndarray of shape (n_channels), defaults=None + Numpy array that must be the length of the number of channels in x and y. Returns ------- @@ -390,7 +405,7 @@ def erp_alignment_path( """ bounding_matrix = create_bounding_matrix(x.shape[-1], y.shape[-1], window) cost_matrix = _add_inf_to_out_of_bounds_cost_matrix( - erp_cost_matrix(x, y, window, g), bounding_matrix + erp_cost_matrix(x, y, window, g, g_arr), bounding_matrix ) return ( compute_min_return_path(cost_matrix), diff --git a/aeon/distances/tests/test_numba_distance_parameters.py b/aeon/distances/tests/test_numba_distance_parameters.py index cded18a764..808f50147c 100644 --- a/aeon/distances/tests/test_numba_distance_parameters.py +++ b/aeon/distances/tests/test_numba_distance_parameters.py @@ -33,7 +33,9 @@ def _test_distance_params( curr_results = [] for x, y in test_ts: if g_none: - param_dict["g"] = np.std([x, y], axis=0).sum(axis=1) + param_dict["g_arr"] = np.std([x, y], axis=0).sum(axis=1) + if "g" in param_dict: + del param_dict["g"] results = [] results.append(distance_func(x, y, **param_dict)) results.append(distance(x, y, metric=distance_str, **param_dict)) diff --git a/setup.cfg b/setup.cfg index 66eb4a7dae..2c57f80ab3 100644 --- a/setup.cfg +++ b/setup.cfg @@ -11,15 +11,15 @@ addopts = --ignore build_tools --ignore examples --ignore docs - --doctest-modules - --durations 10 - --timeout 600 - 
--cov aeon - --cov-report xml - --cov-report html - --showlocals - --matrixdesign True - -n auto +; --doctest-modules +; --durations 10 +; --timeout 600 +; --cov aeon +; --cov-report xml +; --cov-report html +; --showlocals +; --matrixdesign True +; -n auto filterwarnings = ignore::UserWarning ignore:numpy.dtype size changed From 6017bc7dba3ad11928df2cf04f91821a81ba0234 Mon Sep 17 00:00:00 2001 From: chrisholder Date: Wed, 5 Jul 2023 17:03:08 +0100 Subject: [PATCH 06/14] fixed the bug --- .../metrics/averaging/_barycenter_averaging.py | 5 +++-- aeon/clustering/tests/test_k_means.py | 16 ++++++++++++++++ 2 files changed, 19 insertions(+), 2 deletions(-) diff --git a/aeon/clustering/metrics/averaging/_barycenter_averaging.py b/aeon/clustering/metrics/averaging/_barycenter_averaging.py index f788db621e..15cca3466f 100644 --- a/aeon/clustering/metrics/averaging/_barycenter_averaging.py +++ b/aeon/clustering/metrics/averaging/_barycenter_averaging.py @@ -110,7 +110,6 @@ def _ba_update( ) -> Tuple[np.ndarray, float]: X_size, X_dims, X_timepoints = X.shape sum = np.zeros(X_timepoints) - alignment = np.zeros((X_dims, X_timepoints)) cost = 0.0 for i in range(X_size): @@ -134,7 +133,9 @@ def _ba_update( curr_ts, center, window, independent, c ) else: - raise ValueError(f"Metric must be a known string, got {metric}") + # When numba version > 0.57 add more informative error with what metric + # was passed. 
+ raise ValueError("Metric parameter invalid") for j, k in curr_alignment: alignment[:, k] += curr_ts[:, j] sum[k] += 1 diff --git a/aeon/clustering/tests/test_k_means.py b/aeon/clustering/tests/test_k_means.py index f730b45e11..cb8f395eb7 100644 --- a/aeon/clustering/tests/test_k_means.py +++ b/aeon/clustering/tests/test_k_means.py @@ -179,3 +179,19 @@ def test_kmeans_dba(): for val in proba: assert np.count_nonzero(val == 1.0) == 1 + +def test_kmeans_bug(): + import numpy as np + from aeon.clustering.k_means import TimeSeriesKMeans + X_train = np.random.random(size=(100, 1, 100)) + + k_means = TimeSeriesKMeans( + n_clusters=13, # Number of desired centers + init_algorithm="forgy", # Center initialisation technique + max_iter=10, # Maximum number of iterations for refinement on training set + metric="dtw", # Distance metric to use + averaging_method="dba", # Averaging technique to use + random_state=1, + ) + + k_means.fit(X_train) \ No newline at end of file From 7c17e8ab885e12051fef91b33926b9ea6510004d Mon Sep 17 00:00:00 2001 From: chrisholder Date: Wed, 5 Jul 2023 17:03:26 +0100 Subject: [PATCH 07/14] setup --- setup.cfg | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/setup.cfg b/setup.cfg index 2c57f80ab3..66eb4a7dae 100644 --- a/setup.cfg +++ b/setup.cfg @@ -11,15 +11,15 @@ addopts = --ignore build_tools --ignore examples --ignore docs -; --doctest-modules -; --durations 10 -; --timeout 600 -; --cov aeon -; --cov-report xml -; --cov-report html -; --showlocals -; --matrixdesign True -; -n auto + --doctest-modules + --durations 10 + --timeout 600 + --cov aeon + --cov-report xml + --cov-report html + --showlocals + --matrixdesign True + -n auto filterwarnings = ignore::UserWarning ignore:numpy.dtype size changed From ad0cf250bbc112b857f8d3990d47bdac5237ad9b Mon Sep 17 00:00:00 2001 From: chrisholder Date: Wed, 5 Jul 2023 17:07:54 +0100 Subject: [PATCH 08/14] removed test --- aeon/clustering/tests/test_k_means.py | 16 
---------------- 1 file changed, 16 deletions(-) diff --git a/aeon/clustering/tests/test_k_means.py b/aeon/clustering/tests/test_k_means.py index cb8f395eb7..f730b45e11 100644 --- a/aeon/clustering/tests/test_k_means.py +++ b/aeon/clustering/tests/test_k_means.py @@ -179,19 +179,3 @@ def test_kmeans_dba(): for val in proba: assert np.count_nonzero(val == 1.0) == 1 - -def test_kmeans_bug(): - import numpy as np - from aeon.clustering.k_means import TimeSeriesKMeans - X_train = np.random.random(size=(100, 1, 100)) - - k_means = TimeSeriesKMeans( - n_clusters=13, # Number of desired centers - init_algorithm="forgy", # Center initialisation technique - max_iter=10, # Maximum number of iterations for refinement on training set - metric="dtw", # Distance metric to use - averaging_method="dba", # Averaging technique to use - random_state=1, - ) - - k_means.fit(X_train) \ No newline at end of file From cfda9b50d3f1686d3a422129de4a59d6f835ea81 Mon Sep 17 00:00:00 2001 From: Tony Bagnall Date: Sun, 9 Jul 2023 22:03:00 +0100 Subject: [PATCH 09/14] new converters --- aeon/datasets/_data_generators.py | 20 ++ aeon/utils/validation/_convert_collection.py | 192 +++++++++++++++++ aeon/utils/validation/collection.py | 193 ++++++++++++++++++ .../utils/validation/tests/test_collection.py | 63 ++++++ 4 files changed, 468 insertions(+) create mode 100644 aeon/utils/validation/_convert_collection.py create mode 100644 aeon/utils/validation/collection.py create mode 100644 aeon/utils/validation/tests/test_collection.py diff --git a/aeon/datasets/_data_generators.py b/aeon/datasets/_data_generators.py index 8ae0d07f5d..697acfce40 100644 --- a/aeon/datasets/_data_generators.py +++ b/aeon/datasets/_data_generators.py @@ -179,6 +179,26 @@ def make_example_long_table(n_cases=50, n_channels=2, n_timepoints=20): return df +def make_example_nested_dataframe(n_instances=10, n_channels=3, n_timepoints=20): + """Generate example nested dataframe, type "nested_univ". 
+ + Parameters + ---------- + n_instances : int + Number of instances. + n_channels : int + Number of columns (series) in multi-indexed DataFrame. + n_timepoints : int + Number of timepoints per instance-column pair. + + Returns + ------- + nested_df : pd.DataFrame. each cell a pd.Series length n_timepoints + + """ + return None + + def make_example_multi_index_dataframe(n_instances=50, n_channels=3, n_timepoints=20): """Generate example multi-index DataFrame. diff --git a/aeon/utils/validation/_convert_collection.py b/aeon/utils/validation/_convert_collection.py new file mode 100644 index 0000000000..ef226dc26f --- /dev/null +++ b/aeon/utils/validation/_convert_collection.py @@ -0,0 +1,192 @@ +# -*- coding: utf-8 -*- +"""Collection data converters.""" +import numpy as np +import pandas as pd + +from aeon.utils.validation.collection import DATA_TYPES + +convert_dict = dict() + + +def convert_identity(obj, store=None): + """Convert identity.""" + return obj + + +# assign identity function to type conversion to self +for x in DATA_TYPES: + convert_dict[(x, x)] = convert_identity + + +def from_numpy3d_to_pd_multiindex(X): + """Convert numpy3D collection to pandas multi-index Panel. 
+ + Parameters + ---------- + X : np.ndarray + 3-dimensional NumPy array (n_instances, n_channels, n_timepoints) + + Returns + ------- + X_mi : pd.DataFrame + The multi-indexed pandas DataFrame + """ + if X.ndim != 3: + msg = " ".join( + [ + "Input should be 3-dimensional NumPy array with shape", + "(n_instances, n_channels, n_timepoints).", + ] + ) + raise TypeError(msg) + + n_instances, n_channels, n_timepoints = X.shape + multi_index = pd.MultiIndex.from_product( + [range(n_instances), range(n_channels), range(n_timepoints)], + names=["instances", "columns", "timepoints"], + ) + + X_mi = pd.DataFrame({"X": X.flatten()}, index=multi_index) + X_mi = X_mi.unstack(level="columns") + X_mi.columns = [f"var_{i}" for i in range(n_channels)] + return X_mi + + +def from_numpy3d_to_nested_univ(X): + """Convert numpy3D collection to nested_univ pd.DataFrame. + + Convert NumPy ndarray with shape (n_instances, n_channels, n_timepoints) + into nested pandas DataFrame (with time series as pandas Series in cells) + + Parameters + ---------- + X : np.ndarray + 3-dimensional NumPy array (n_instances, n_channels, n_timepoints) + + Returns + ------- + df : pd.DataFrame + """ + n_instances, n_channels, n_timepoints = X.shape + array_type = X.dtype + container = pd.Series + column_names = [f"var_{i}" for i in range(n_channels)] + column_list = [] + for j, column in enumerate(column_names): + nested_column = ( + pd.DataFrame(X[:, j, :]) + .apply(lambda x: [container(x, dtype=array_type)], axis=1) + .str[0] + .rename(column) + ) + column_list.append(nested_column) + df = pd.concat(column_list, axis=1) + return df + + +def from_numpy3d_to_np_list(X, store=None): + """Convert 3D np.darray to a list of 2D numpy. 
+ + Converts 3D numpy array (n_instances, n_channels, n_timepoints) to + a 2D list length [n_instances] each of shape (n_channels, n_timepoints) + + Parameters + ---------- + X : np.ndarray + The input array with shape (n_instances, n_channels, n_timepoints) + + Returns + ------- + list : list [n_instances] np.ndarray + A list of np.ndarray + """ + np_list = [] + for arr in X: + np_list.append(arr) + return np_list + + +def from_numpy3d_to_df_list(X, store=None): + """Convert 3D np.darray to a list of dataframes in wide format. + + Converts 3D numpy array (n_instances, n_channels, n_timepoints) to + a 2D list length [n_instances] of pd.DataFrames shape (n_channels, n_timepoints) + + Parameters + ---------- + X : np.ndarray + The input array with shape (n_instances, n_channels, n_timepoints) + + Returns + ------- + df : pd.DataFrame + """ + df_list = [] + for arr in X: + df_list.append(pd.DataFrame(arr)) + return df_list + + +def from_numpy3d_to_pd_wide(X, store=None): + """Convert 3D np.darray to a list of dataframes in wide format. + + Only valid with univariate time series. Converts 3D numpy array (n_instances, 1, + n_timepoints) to a dataframe (n_instances, n_timepoints) + + Parameters + ---------- + X : np.ndarray + The input array with shape (n_instances, 1, n_timepoints) + + Returns + ------- + df : a dataframe (n_instances, n_timepoints) + + Raise + ----- + ValueError if X has n_channels>1 + """ + if X.shape[1] > 1: + raise ValueError( + "Error, numpy3D passed with > 1 channel, cannot convert to " "pd-wide" + ) + return pd.DataFrame(X.squeeze()) + + +def from_numpyflat_to_nested_univ(X): + """Convert np.ndarray to nested_univ format pd.DataFrame with a single column. 
+ + Parameters + ---------- + X : np.ndarray shape (n_cases, n_timepoints) + + Returns + ------- + Xt : pd.DataFrame + DataFrame with a single column of pd.Series + """ + container = pd.Series + n_instances, n_timepoints = X.shape + time_index = np.arange(n_timepoints) + kwargs = {"index": time_index} + + Xt = pd.DataFrame( + pd.Series([container(X[i, :], **kwargs) for i in range(n_instances)]) + ) + return Xt + + +def from_pd_wide_to_nested_univ(X): + """Convert wide pd.DataFrame to nested_univ format pd.DataFrame. + + Parameters + ---------- + X : pd.DataFrame shape (n_cases, n_timepoints) + + Returns + ------- + Xt : pd.DataFrame + Transformed DataFrame with a single column of pd.Series + """ + X = X.to_numpy() + return from_numpyflat_to_nested_univ(X) diff --git a/aeon/utils/validation/collection.py b/aeon/utils/validation/collection.py new file mode 100644 index 0000000000..39fa22d045 --- /dev/null +++ b/aeon/utils/validation/collection.py @@ -0,0 +1,193 @@ +# -*- coding: utf-8 -*- +"""Conversion and checking for collections of time series.""" +import numpy as np +import pandas as pd + +from aeon.datatypes._panel._convert import convert_dict + +DATA_TYPES = [ + "numpy3D", # 3D np.ndarray of format (n_cases, n_channels, n_timepoints) + "np-list", # python list of 2D numpy array of length [n_cases], each of shape ( + # n_channels, n_timepoints_i) + "df-list", # python list of 2D pd.DataFrames of length [n_cases], each a of + # shape (n_timepoints_i, n_channels) + "numpyflat", # 2D np.ndarray of shape (n_cases, n_timepoints) + "pd-wide", # 2D pd.DataFrame of shape (n_cases, n_timepoints) + "nested_univ", # pd.DataFrame (n_cases, n_channels) with each cell a pd.Series, +] +# "pd-multiindex", d.DataFrame with multi-index, +# "dask_panel": not used anywhere + + +def convertX(X, to_type): + """Convert from one of DATA_TYPE to another. + + Parameters + ---------- + X : data structure. 
+ to_type : string, one of DATA_TYPES + + Returns + ------- + Data structure conforming to "to_type" + + Raises + ------ + ValueError if + X pd.ndarray but wrong dimension + X is list but not of np.ndarray or p.DataFrame. + X is a pd.DataFrame on non float primitives. + + Example + ------- + >>> X=convertX(np.zeros(shape=(10, 3, 20)), "np-list") + >>> type(X) + list + """ + input_type = get_type(X) + return convert_dict[(input_type, to_type, "Panel")](X) + + +def get_type(X): + """Get the string identifier associated with different data structures. + + Parameters + ---------- + X : data structure. + + Returns + ------- + input_type : string, one of DATA_TYPES + + Raises + ------ + ValueError if + X pd.ndarray but wrong dimension + X is list but not of np.ndarray or p.DataFrame. + X is a pd.DataFrame on non float primitives. + + Example + ------- + >>> equal_length( np.zeros(shape=(10, 3, 20)), "numpy3D") + True + """ + if isinstance(X, np.ndarray): # “numpy3D” or numpyflat + if X.ndim == 3: + return "numpy3D" + elif X.ndim == 2: + return "numpyflat" + else: + raise ValueError("ERROR np.ndarray must be either 2D or 3D") + elif isinstance(X, list): # np-list or df-list + if isinstance(X[0], np.ndarray): # if one a numpy they must all be 2D numpy + for a in X: + if not (isinstance(a, np.ndarray) and a.ndim == 2): + raise ValueError("ERROR np-list np.ndarray must be either 2D or 3D") + return "np-list" + elif isinstance(X[0], pd.DataFrame): + for a in X: + if not isinstance(a, pd.DataFrame): + raise ValueError("ERROR df-list must only contain pd.DataFrame") + return "df-list" + elif isinstance(X, pd.DataFrame): # Nested univariate, hierachical or pd-wide + if _is_nested_dataframe(X): + return "nested_univ" + if isinstance(X.index, pd.MultiIndex): + return "pd-multiindex" + elif _is_pd_wide(X): + return "pd-wide" + raise ValueError( + "ERROR unknown pd.DataFrame, contains non float values, " + "not hierarchical nor is it nested pd.Series" + ) + # if isinstance(X, 
dask.dataframe.core.DataFrame): + # return "dask_panel" + raise ValueError(f"ERROR unknown input type {type(X)}") + + +def equal_length(X, input_type): + """Test if X contains equal length time series. + + Assumes input_type is a valid type (DATA_TYPES). + + Parameters + ---------- + X : data structure. + input_type : string, one of DATA_TYPES + + Returns + ------- + boolean: True if all series in X are equal length, False otherwise + + Raises + ------ + ValueError if input_type equals "dask_panel" or not in DATA_TYPES. + + Example + ------- + >>> equal_length( np.zeros(shape=(10, 3, 20)), "numpy3D") + True + """ + always_equal = {"numpy3D", "numpyflat", "pd-wide"} + if input_type in always_equal: + return True + if input_type == "np-list": + first = X[0].shape[1] + for i in range(1, len(X)): + if X[i].shape[1] != first: + return False + return True + if input_type == "df-list": + first = X[0].shape[0] + for i in range(1, len(X)): + if X[i].shape[0] != first: + return False + return True + if input_type == "nested_univ": # Nested univariate or hierachical + return _nested_uni_is_equal(X) + if input_type == "pd-multiindex": + # TEMPORARY: WORK OUT HOW TO TEST THESE + return True + # raise ValueError(" Multi index not supported here ") + if input_type == "dask_panel": + raise ValueError(" DASK panel not supported here ") + raise ValueError(f" unknown input type {input_type}") + return False + + +def has_missing(X, input_type): + """Check if X has missing values.""" + # if isinstance(X, np.ndarray): # “numpy3D” or numpyflat + # elif isinstance(X, list): # np-list or df-list + return False + + +def _nested_uni_is_equal(X): + """Check whether series are unequal length.""" + length = X.iloc[0, 0].size + for series in X.iloc[0]: + if series.size != length: + return False + return True + + +def _is_nested_dataframe(X): + """Check if X is nested dataframe.""" + # Otherwise check all entries are pd.Series + if not isinstance(X, pd.DataFrame): + return False + for _, series 
in X.items(): + for cell in series: + if not isinstance(cell, pd.Series): + return False + return True + + +def _is_pd_wide(X): + """Check whether the input nested DataFrame is "pd-wide" type.""" + # only test is if all values are float. This from chatgpt seems stupid + float_cols = X.select_dtypes(include=[np.float]).columns + for col in float_cols: + if not np.issubdtype(X[col].dtype, np.floating): + return False + return True diff --git a/aeon/utils/validation/tests/test_collection.py b/aeon/utils/validation/tests/test_collection.py new file mode 100644 index 0000000000..090f944ec1 --- /dev/null +++ b/aeon/utils/validation/tests/test_collection.py @@ -0,0 +1,63 @@ +#!/usr/bin/env python3 -u +# -*- coding: utf-8 -*- +"""Unit tests for aeon.utils.validation.collection check/convert functions.""" +import numpy as np +import pandas as pd +import pytest + +# from aeon.datasets._data_generators import make_example_multi_index_dataframe +from aeon.utils._testing.tests.test_collection import make_nested_dataframe_data +from aeon.utils.validation.collection import ( # _nested_uni_is_equal,; has_missing, + DATA_TYPES, + _is_nested_dataframe, + convertX, + equal_length, + get_type, +) + +np_list = [] +for _ in range(10): + np_list.append(np.zeros(shape=(20, 2))) +df_list = [] +for _ in range(10): + df_list.append(pd.DataFrame(np.zeros(shape=(20, 2)))) +nested, _ = make_nested_dataframe_data() +# multi = make_example_multi_index_dataframe() + +DATA_EXAMPLES = { + "numpy3D": np.zeros(shape=(10, 3, 20)), + "numpyflat": np.zeros(shape=(10, 20)), + "np-list": np_list, + "df-list": df_list, + "pd-wide": pd.DataFrame(np.zeros(shape=(10, 20))), + "nested_univ": nested, +} +# "pd-multiindex": multi, + + +@pytest.mark.parametrize("data", DATA_TYPES) +def test_equal_length(data): + assert equal_length(DATA_EXAMPLES[data], data) + + +@pytest.mark.parametrize("data", DATA_TYPES) +def test_get_type(data): + assert get_type(DATA_EXAMPLES[data]) == data + + 
+@pytest.mark.parametrize("data", DATA_TYPES) +def test_is_nested_dataframe(data): + if data == "nested_univ": + assert _is_nested_dataframe(DATA_EXAMPLES[data]) + else: + assert not _is_nested_dataframe(DATA_EXAMPLES[data]) + + +@pytest.mark.parametrize("input_data", DATA_TYPES) +@pytest.mark.parametrize("output_data", DATA_TYPES) +def test_convertX(input_data, output_data): + # dont test conversion from unequal supporting to equal only, or multivariate to + # univariate only. pd-wide seems unsupported. + X = convertX(DATA_EXAMPLES[input_data], output_data) + t = get_type(X) + assert t == output_data From a6925171e1795b81fb0edfd7e02bab4c299ac180 Mon Sep 17 00:00:00 2001 From: Tony Bagnall Date: Wed, 12 Jul 2023 16:37:21 +0100 Subject: [PATCH 10/14] remove conversions from this PR --- aeon/utils/validation/_convert_collection.py | 192 ----------------- aeon/utils/validation/collection.py | 193 ------------------ .../utils/validation/tests/test_collection.py | 63 ------ 3 files changed, 448 deletions(-) delete mode 100644 aeon/utils/validation/_convert_collection.py delete mode 100644 aeon/utils/validation/collection.py delete mode 100644 aeon/utils/validation/tests/test_collection.py diff --git a/aeon/utils/validation/_convert_collection.py b/aeon/utils/validation/_convert_collection.py deleted file mode 100644 index ef226dc26f..0000000000 --- a/aeon/utils/validation/_convert_collection.py +++ /dev/null @@ -1,192 +0,0 @@ -# -*- coding: utf-8 -*- -"""Collection data converters.""" -import numpy as np -import pandas as pd - -from aeon.utils.validation.collection import DATA_TYPES - -convert_dict = dict() - - -def convert_identity(obj, store=None): - """Convert identity.""" - return obj - - -# assign identity function to type conversion to self -for x in DATA_TYPES: - convert_dict[(x, x)] = convert_identity - - -def from_numpy3d_to_pd_multiindex(X): - """Convert numpy3D collection to pandas multi-index Panel. 
- - Parameters - ---------- - X : np.ndarray - 3-dimensional NumPy array (n_instances, n_channels, n_timepoints) - - Returns - ------- - X_mi : pd.DataFrame - The multi-indexed pandas DataFrame - """ - if X.ndim != 3: - msg = " ".join( - [ - "Input should be 3-dimensional NumPy array with shape", - "(n_instances, n_channels, n_timepoints).", - ] - ) - raise TypeError(msg) - - n_instances, n_channels, n_timepoints = X.shape - multi_index = pd.MultiIndex.from_product( - [range(n_instances), range(n_channels), range(n_timepoints)], - names=["instances", "columns", "timepoints"], - ) - - X_mi = pd.DataFrame({"X": X.flatten()}, index=multi_index) - X_mi = X_mi.unstack(level="columns") - X_mi.columns = [f"var_{i}" for i in range(n_channels)] - return X_mi - - -def from_numpy3d_to_nested_univ(X): - """Convert numpy3D collection to nested_univ pd.DataFrame. - - Convert NumPy ndarray with shape (n_instances, n_channels, n_timepoints) - into nested pandas DataFrame (with time series as pandas Series in cells) - - Parameters - ---------- - X : np.ndarray - 3-dimensional NumPy array (n_instances, n_channels, n_timepoints) - - Returns - ------- - df : pd.DataFrame - """ - n_instances, n_channels, n_timepoints = X.shape - array_type = X.dtype - container = pd.Series - column_names = [f"var_{i}" for i in range(n_channels)] - column_list = [] - for j, column in enumerate(column_names): - nested_column = ( - pd.DataFrame(X[:, j, :]) - .apply(lambda x: [container(x, dtype=array_type)], axis=1) - .str[0] - .rename(column) - ) - column_list.append(nested_column) - df = pd.concat(column_list, axis=1) - return df - - -def from_numpy3d_to_np_list(X, store=None): - """Convert 3D np.darray to a list of 2D numpy. 
- - Converts 3D numpy array (n_instances, n_channels, n_timepoints) to - a 2D list length [n_instances] each of shape (n_channels, n_timepoints) - - Parameters - ---------- - X : np.ndarray - The input array with shape (n_instances, n_channels, n_timepoints) - - Returns - ------- - list : list [n_instances] np.ndarray - A list of np.ndarray - """ - np_list = [] - for arr in X: - np_list.append(arr) - return np_list - - -def from_numpy3d_to_df_list(X, store=None): - """Convert 3D np.darray to a list of dataframes in wide format. - - Converts 3D numpy array (n_instances, n_channels, n_timepoints) to - a 2D list length [n_instances] of pd.DataFrames shape (n_channels, n_timepoints) - - Parameters - ---------- - X : np.ndarray - The input array with shape (n_instances, n_channels, n_timepoints) - - Returns - ------- - df : pd.DataFrame - """ - df_list = [] - for arr in X: - df_list.append(pd.DataFrame(arr)) - return df_list - - -def from_numpy3d_to_pd_wide(X, store=None): - """Convert 3D np.darray to a list of dataframes in wide format. - - Only valid with univariate time series. Converts 3D numpy array (n_instances, 1, - n_timepoints) to a dataframe (n_instances, n_timepoints) - - Parameters - ---------- - X : np.ndarray - The input array with shape (n_instances, 1, n_timepoints) - - Returns - ------- - df : a dataframe (n_instances, n_timepoints) - - Raise - ----- - ValueError if X has n_channels>1 - """ - if X.shape[1] > 1: - raise ValueError( - "Error, numpy3D passed with > 1 channel, cannot convert to " "pd-wide" - ) - return pd.DataFrame(X.squeeze()) - - -def from_numpyflat_to_nested_univ(X): - """Convert np.ndarray to nested_univ format pd.DataFrame with a single column. 
- - Parameters - ---------- - X : np.ndarray shape (n_cases, n_timepoints) - - Returns - ------- - Xt : pd.DataFrame - DataFrame with a single column of pd.Series - """ - container = pd.Series - n_instances, n_timepoints = X.shape - time_index = np.arange(n_timepoints) - kwargs = {"index": time_index} - - Xt = pd.DataFrame( - pd.Series([container(X[i, :], **kwargs) for i in range(n_instances)]) - ) - return Xt - - -def from_pd_wide_to_nested_univ(X): - """Convert wide pd.DataFrame to nested_univ format pd.DataFrame. - - Parameters - ---------- - X : pd.DataFrame shape (n_cases, n_timepoints) - - Returns - ------- - Xt : pd.DataFrame - Transformed DataFrame with a single column of pd.Series - """ - X = X.to_numpy() - return from_numpyflat_to_nested_univ(X) diff --git a/aeon/utils/validation/collection.py b/aeon/utils/validation/collection.py deleted file mode 100644 index 39fa22d045..0000000000 --- a/aeon/utils/validation/collection.py +++ /dev/null @@ -1,193 +0,0 @@ -# -*- coding: utf-8 -*- -"""Conversion and checking for collections of time series.""" -import numpy as np -import pandas as pd - -from aeon.datatypes._panel._convert import convert_dict - -DATA_TYPES = [ - "numpy3D", # 3D np.ndarray of format (n_cases, n_channels, n_timepoints) - "np-list", # python list of 2D numpy array of length [n_cases], each of shape ( - # n_channels, n_timepoints_i) - "df-list", # python list of 2D pd.DataFrames of length [n_cases], each a of - # shape (n_timepoints_i, n_channels) - "numpyflat", # 2D np.ndarray of shape (n_cases, n_timepoints) - "pd-wide", # 2D pd.DataFrame of shape (n_cases, n_timepoints) - "nested_univ", # pd.DataFrame (n_cases, n_channels) with each cell a pd.Series, -] -# "pd-multiindex", d.DataFrame with multi-index, -# "dask_panel": not used anywhere - - -def convertX(X, to_type): - """Convert from one of DATA_TYPE to another. - - Parameters - ---------- - X : data structure. 
- to_type : string, one of DATA_TYPES - - Returns - ------- - Data structure conforming to "to_type" - - Raises - ------ - ValueError if - X pd.ndarray but wrong dimension - X is list but not of np.ndarray or p.DataFrame. - X is a pd.DataFrame on non float primitives. - - Example - ------- - >>> X=convertX(np.zeros(shape=(10, 3, 20)), "np-list") - >>> type(X) - list - """ - input_type = get_type(X) - return convert_dict[(input_type, to_type, "Panel")](X) - - -def get_type(X): - """Get the string identifier associated with different data structures. - - Parameters - ---------- - X : data structure. - - Returns - ------- - input_type : string, one of DATA_TYPES - - Raises - ------ - ValueError if - X pd.ndarray but wrong dimension - X is list but not of np.ndarray or p.DataFrame. - X is a pd.DataFrame on non float primitives. - - Example - ------- - >>> equal_length( np.zeros(shape=(10, 3, 20)), "numpy3D") - True - """ - if isinstance(X, np.ndarray): # “numpy3D” or numpyflat - if X.ndim == 3: - return "numpy3D" - elif X.ndim == 2: - return "numpyflat" - else: - raise ValueError("ERROR np.ndarray must be either 2D or 3D") - elif isinstance(X, list): # np-list or df-list - if isinstance(X[0], np.ndarray): # if one a numpy they must all be 2D numpy - for a in X: - if not (isinstance(a, np.ndarray) and a.ndim == 2): - raise ValueError("ERROR np-list np.ndarray must be either 2D or 3D") - return "np-list" - elif isinstance(X[0], pd.DataFrame): - for a in X: - if not isinstance(a, pd.DataFrame): - raise ValueError("ERROR df-list must only contain pd.DataFrame") - return "df-list" - elif isinstance(X, pd.DataFrame): # Nested univariate, hierachical or pd-wide - if _is_nested_dataframe(X): - return "nested_univ" - if isinstance(X.index, pd.MultiIndex): - return "pd-multiindex" - elif _is_pd_wide(X): - return "pd-wide" - raise ValueError( - "ERROR unknown pd.DataFrame, contains non float values, " - "not hierarchical nor is it nested pd.Series" - ) - # if isinstance(X, 
dask.dataframe.core.DataFrame): - # return "dask_panel" - raise ValueError(f"ERROR unknown input type {type(X)}") - - -def equal_length(X, input_type): - """Test if X contains equal length time series. - - Assumes input_type is a valid type (DATA_TYPES). - - Parameters - ---------- - X : data structure. - input_type : string, one of DATA_TYPES - - Returns - ------- - boolean: True if all series in X are equal length, False otherwise - - Raises - ------ - ValueError if input_type equals "dask_panel" or not in DATA_TYPES. - - Example - ------- - >>> equal_length( np.zeros(shape=(10, 3, 20)), "numpy3D") - True - """ - always_equal = {"numpy3D", "numpyflat", "pd-wide"} - if input_type in always_equal: - return True - if input_type == "np-list": - first = X[0].shape[1] - for i in range(1, len(X)): - if X[i].shape[1] != first: - return False - return True - if input_type == "df-list": - first = X[0].shape[0] - for i in range(1, len(X)): - if X[i].shape[0] != first: - return False - return True - if input_type == "nested_univ": # Nested univariate or hierachical - return _nested_uni_is_equal(X) - if input_type == "pd-multiindex": - # TEMPORARY: WORK OUT HOW TO TEST THESE - return True - # raise ValueError(" Multi index not supported here ") - if input_type == "dask_panel": - raise ValueError(" DASK panel not supported here ") - raise ValueError(f" unknown input type {input_type}") - return False - - -def has_missing(X, input_type): - """Check if X has missing values.""" - # if isinstance(X, np.ndarray): # “numpy3D” or numpyflat - # elif isinstance(X, list): # np-list or df-list - return False - - -def _nested_uni_is_equal(X): - """Check whether series are unequal length.""" - length = X.iloc[0, 0].size - for series in X.iloc[0]: - if series.size != length: - return False - return True - - -def _is_nested_dataframe(X): - """Check if X is nested dataframe.""" - # Otherwise check all entries are pd.Series - if not isinstance(X, pd.DataFrame): - return False - for _, series 
in X.items(): - for cell in series: - if not isinstance(cell, pd.Series): - return False - return True - - -def _is_pd_wide(X): - """Check whether the input nested DataFrame is "pd-wide" type.""" - # only test is if all values are float. This from chatgpt seems stupid - float_cols = X.select_dtypes(include=[np.float]).columns - for col in float_cols: - if not np.issubdtype(X[col].dtype, np.floating): - return False - return True diff --git a/aeon/utils/validation/tests/test_collection.py b/aeon/utils/validation/tests/test_collection.py deleted file mode 100644 index 090f944ec1..0000000000 --- a/aeon/utils/validation/tests/test_collection.py +++ /dev/null @@ -1,63 +0,0 @@ -#!/usr/bin/env python3 -u -# -*- coding: utf-8 -*- -"""Unit tests for aeon.utils.validation.collection check/convert functions.""" -import numpy as np -import pandas as pd -import pytest - -# from aeon.datasets._data_generators import make_example_multi_index_dataframe -from aeon.utils._testing.tests.test_collection import make_nested_dataframe_data -from aeon.utils.validation.collection import ( # _nested_uni_is_equal,; has_missing, - DATA_TYPES, - _is_nested_dataframe, - convertX, - equal_length, - get_type, -) - -np_list = [] -for _ in range(10): - np_list.append(np.zeros(shape=(20, 2))) -df_list = [] -for _ in range(10): - df_list.append(pd.DataFrame(np.zeros(shape=(20, 2)))) -nested, _ = make_nested_dataframe_data() -# multi = make_example_multi_index_dataframe() - -DATA_EXAMPLES = { - "numpy3D": np.zeros(shape=(10, 3, 20)), - "numpyflat": np.zeros(shape=(10, 20)), - "np-list": np_list, - "df-list": df_list, - "pd-wide": pd.DataFrame(np.zeros(shape=(10, 20))), - "nested_univ": nested, -} -# "pd-multiindex": multi, - - -@pytest.mark.parametrize("data", DATA_TYPES) -def test_equal_length(data): - assert equal_length(DATA_EXAMPLES[data], data) - - -@pytest.mark.parametrize("data", DATA_TYPES) -def test_get_type(data): - assert get_type(DATA_EXAMPLES[data]) == data - - 
-@pytest.mark.parametrize("data", DATA_TYPES) -def test_is_nested_dataframe(data): - if data == "nested_univ": - assert _is_nested_dataframe(DATA_EXAMPLES[data]) - else: - assert not _is_nested_dataframe(DATA_EXAMPLES[data]) - - -@pytest.mark.parametrize("input_data", DATA_TYPES) -@pytest.mark.parametrize("output_data", DATA_TYPES) -def test_convertX(input_data, output_data): - # dont test conversion from unequal supporting to equal only, or multivariate to - # univariate only. pd-wide seems unsupported. - X = convertX(DATA_EXAMPLES[input_data], output_data) - t = get_type(X) - assert t == output_data From 065fc9e2907ab3948d8f08e13a3769293a619466 Mon Sep 17 00:00:00 2001 From: Tony Bagnall Date: Sat, 22 Jul 2023 20:02:20 +0100 Subject: [PATCH 11/14] remove method stub --- aeon/datasets/_data_generators.py | 20 -------------------- 1 file changed, 20 deletions(-) diff --git a/aeon/datasets/_data_generators.py b/aeon/datasets/_data_generators.py index 697acfce40..8ae0d07f5d 100644 --- a/aeon/datasets/_data_generators.py +++ b/aeon/datasets/_data_generators.py @@ -179,26 +179,6 @@ def make_example_long_table(n_cases=50, n_channels=2, n_timepoints=20): return df -def make_example_nested_dataframe(n_instances=10, n_channels=3, n_timepoints=20): - """Generate example nested dataframe, type "nested_univ". - - Parameters - ---------- - n_instances : int - Number of instances. - n_channels : int - Number of columns (series) in multi-indexed DataFrame. - n_timepoints : int - Number of timepoints per instance-column pair. - - Returns - ------- - nested_df : pd.DataFrame. each cell a pd.Series length n_timepoints - - """ - return None - - def make_example_multi_index_dataframe(n_instances=50, n_channels=3, n_timepoints=20): """Generate example multi-index DataFrame. 
From 5e5f6dec5c1854c9a82c447e8c7b66c796c745ea Mon Sep 17 00:00:00 2001 From: MatthewMiddlehurst Date: Mon, 24 Jul 2023 17:02:27 +0100 Subject: [PATCH 12/14] storage and benchmarking --- examples/datasets/benchmarking_data.ipynb | 164 +++++---- examples/datasets/data_conversions.ipynb | 10 +- examples/datasets/data_loading.ipynb | 22 +- examples/datasets/data_storage.ipynb | 417 +++++++++++++++------- 4 files changed, 401 insertions(+), 212 deletions(-) diff --git a/examples/datasets/benchmarking_data.ipynb b/examples/datasets/benchmarking_data.ipynb index 514aac7a38..7ad658f6a5 100644 --- a/examples/datasets/benchmarking_data.ipynb +++ b/examples/datasets/benchmarking_data.ipynb @@ -6,16 +6,20 @@ "# Downloading and loading benchmarking datasets\n", "\n", "It is common to use standard collections of data to compare different estimators for\n", - "classification, clustering, regression and forecasting. Some of these datasets are\n", - "shipped with aeon in the datasets/data directory. However, the files are far too\n", - "big to include them all. aeon p[rovides tools to download these data to use in benchmarking experiments.\n", - "Classification and regression data are stored in .ts format. Forecasting\n", - "data are stored in the equivalent .tsf format. See the [data formats notebook](examples/data_formats.ipynb) for more info.\n", + "classification, clustering, regression and forecasting. Some of the smaller datasets from\n", + "these collections are included with `aeon` in the `aeon/datasets/data` directory. However,\n", + "there are far too many datasets to include them all, and some of the files are far too big\n", + "to consider including in the package. `aeon` provides tools to download these data to use\n", + "in benchmarking experiments. Classification and regression data are stored in .ts format.\n", + "Forecasting data are stored in the equivalent .tsf format. 
See the\n", + "[data loading notebook](examples/data_loading.ipynb) for more info.\n", "\n", - "Classification and regression are loaded into 3D numpy arrays of shape `(n_cases, n_channels, n_timepoints)` if equal length\n", - "or a list of `[n_cases]` of 2D numpy if `n_timepoints` is different for different\n", - "cases. Forecasting data are loaded into pd.DataFrame. For more information on\n", - "aeon data types see the [data storage notebook](examples/data_storage.ipynb).\n", + "Classification and regression are loaded into 3D numpy arrays of shape\n", + "`(n_cases, n_channels, n_timepoints)` if equal length or a list of length\n", + "`n_cases` of 2D numpy arrays of shape `(n_channels, n_timepoints)` if\n", + "`n_timepoints` is different between cases. Forecasting data are loaded into\n", + "pd.DataFrame. For more information on aeon data types see the\n", + "[data storage notebook](examples/data_storage.ipynb).\n", "\n", "Note that this notebook is dependent on external websites, so will not function if\n", "you are not online or the associated website is down. 
We use the following three\n", @@ -27,13 +31,17 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 9, "outputs": [], "source": [ "from aeon.datasets import load_classification, load_forecasting, load_regression" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:38.987856700Z", + "start_time": "2023-07-24T15:29:38.941979900Z" + } } }, { @@ -41,14 +49,14 @@ "source": [ "## Time Series Classification Archive\n", "\n", - "[UCR/TSML Time Series Classification Archive](https://timeseriesclassification.com)\n", - "hosts the UCR univariate TSC archive [1], also available from [UCR](ucrweb) and\n", + "The [UCR/TSML Time Series Classification Archive](https://timeseriesclassification.com)\n", + "hosts the UCR univariate TSC archive (also available from\n", + "[UCR](https://www.cs.ucr.edu/%7Eeamonn/time_series_data_2018/)) [1], and\n", "the multivariate archive [2] (previously called the UEA archive, soon to change). We\n", - "provide seven of these in the datasets/data directort: ACSF1, ArrowHead, BasicMotions,\n", + "provide seven of these in the datasets/data directory: ACSF1, ArrowHead, BasicMotions,\n", "GunPoint, ItalyPowerDemand, JapaneseVowels and PLAID. The archive is much bigger. The\n", - " last batch release was for 128 univariate [1] and 33 multivariate [2]. If you just\n", - " want to download them all, please go to the [website]\n", - " (https://timeseriesclassification.com)" + "last batch release was for 128 univariate [1] and 33 multivariate [2] datasets. If you just\n", + "want to download them all, please go to the [website](https://timeseriesclassification.com)." 
], "metadata": { "collapsed": false @@ -56,13 +64,13 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 10, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Univariate length = 127\n", + "Univariate length = 128\n", "Multivariate length = 33\n" ] } @@ -75,7 +83,11 @@ "print(\"Multivariate length = \", len(multivariate))" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:38.988882400Z", + "start_time": "2023-07-24T15:29:38.949956800Z" + } } }, { @@ -87,9 +99,9 @@ " /Chinatown/Chinatown_TRAIN.ts\n", " /Chinatown/Chinatown_TEST.ts\n", "\n", - "You can load these problems directly from TSC.com and load them into memory. Note by\n", - "default, these functions return the data and associated metadata. This usage combines\n", - " the train and test splits and loads them into one `X` and one `y` array." + "You can load these problems directly from [https://timeseriesclassification.com] and load\n", + "them into memory. Note by default, these functions return the data and associated metadata.\n", + "This usage combines the train and test splits and loads them into one `X` and one `y` array." ], "metadata": { "collapsed": false @@ -97,7 +109,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 11, "outputs": [ { "name": "stdout", @@ -118,20 +130,25 @@ "print(\"\\nMeta data = \", meta)" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:39.067643100Z", + "start_time": "2023-07-24T15:29:38.954944100Z" + } } }, { "cell_type": "markdown", "source": [ - "If you look in aeon/datasets you should see a directory called `local_data`\n", + "If you look in `aeon/datasets/local_data/` you should see a directory called `Chinatown`\n", "containing the Chinatown datasets. All of the zips have `.ts` files. Some also have\n", "`.arff` and `.txt` files. 
If you load again, it will not download again if the file is\n", - "already there. If you want to store data somewhere else, you can specify a file path.\n", - " Also, you can load the train and test separately. This code will download the data\n", - " to Temp once, and load into separate train/test splits. The split argument is not\n", - " case sensitive. Once downloaded, `load_classification` is a equivalent to a call to\n", - " `load_from_tsfile`" + "already there. If you want to store data somewhere else, you can specify a file path\n", + "using the `extract_path` parameter. Additionally, you can load the train and test\n", + "separately as shown below.\n", + "\n", + "This code will download the data and load it into separate train/test splits. The split argument is not\n", + "case sensitive. Once downloaded, `load_classification` is equivalent to a call to `load_from_tsfile`." ], "metadata": { "collapsed": false @@ -139,46 +156,31 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 12, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train shape = (20, 1, 512)\n", - "Test shape = (20, 1, 512)\n", - "Loaded directly shape = (20, 1, 512)\n" + "Test shape = (20, 1, 512)\n" ] - }, - { - "data": { - "text/plain": "array([1.7400873, 1.7331051, 1.7091917, 1.6333304, 1.5405759])" - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" } ], "source": [ "X_train, y_train = load_classification(\n", - " \"BeetleFly\", extract_path=\"C://Temp/\", split=\"TRAIN\", return_metadata=False\n", - ")\n", - "X_test, y_test = load_classification(\n", - " \"BeetleFly\", extract_path=\"C://Temp/\", split=\"test\", return_metadata=False\n", + " \"BeetleFly\", split=\"TRAIN\", return_metadata=False\n", ")\n", + "X_test, y_test = load_classification(\"BeetleFly\", split=\"test\", return_metadata=False)\n", "print(\"Train shape = \", X_train.shape)\n", - "print(\"Test shape = \", X_test.shape)\n", - "from aeon.datasets 
import load_from_tsfile\n", - "\n", - "X_train, y_train = load_from_tsfile(\n", - " full_file_path_and_name=\"C://Temp/BeetleFly/BeetleFLY_TRAIN\"\n", - ")\n", - "print(\"Loaded directly shape = \", X_train.shape)\n", - "\n", - "X_test[0][0][:5]" + "print(\"Test shape = \", X_test.shape)" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:39.229225500Z", + "start_time": "2023-07-24T15:29:39.068640Z" + } } }, { @@ -186,10 +188,10 @@ "source": [ "## Time Series (Extrinsic) Regression\n", "\n", - "[The Monash Time Series Extrinsic Regression Archive]() [3] repo (called extrinsic to\n", - " diffentiate if from sliding window based regression) currently contains 19\n", - " regression problems in .ts format. One of these, Covid3Month, is in `datasets\\data`.\n", - " The usage of `load_regression` is identical to `load_classification`\n" + "The [Monash Time Series Extrinsic Regression Archive](http://tseregression.org/) [3] repo\n", + "(called extrinsic to differentiate it from sliding window based regression) currently\n", + "contains 19 regression problems in `.ts` format. One of these, Covid3Month, is in\n", + "`datasets\\data`. 
The usage of `load_regression` is identical to `load_classification`\n" ], "metadata": { "collapsed": false @@ -197,13 +199,13 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 13, "outputs": [ { "data": { "text/plain": "['AppliancesEnergy',\n 'AustraliaRainfall',\n 'BIDMCHR',\n 'BIDMCRR',\n 'BIDMCSpO2',\n 'BeijingPM10Quality',\n 'BeijingPM25Quality',\n 'BenzeneConcentration',\n 'Covid3Month',\n 'FloodModeling1',\n 'FloodModeling2',\n 'FloodModeling3',\n 'HouseholdPowerConsumption1',\n 'HouseholdPowerConsumption2',\n 'IEEEPPG',\n 'LiveFuelMoistureContent',\n 'NewsHeadlineSentiment',\n 'NewsTitleSentiment',\n 'PPGDalia']" }, - "execution_count": 5, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } @@ -214,12 +216,16 @@ "list_available_tser_datasets()" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:39.237204700Z", + "start_time": "2023-07-24T15:29:39.230223400Z" + } } }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 14, "outputs": [ { "name": "stdout", @@ -234,7 +240,11 @@ "print(\"Shape of X = \", X.shape)" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:42.341536600Z", + "start_time": "2023-07-24T15:29:39.237204700Z" + } } }, { @@ -253,13 +263,13 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 15, "outputs": [ { "data": { "text/plain": "['australian_electricity_demand_dataset',\n 'car_parts_dataset_with_missing_values',\n 'car_parts_dataset_without_missing_values',\n 'cif_2016_dataset',\n 'covid_deaths_dataset',\n 'covid_mobility_dataset_with_missing_values',\n 'covid_mobility_dataset_without_missing_values',\n 'dominick_dataset',\n 'elecdemand_dataset',\n 'electricity_hourly_dataset',\n 'electricity_weekly_dataset',\n 'fred_md_dataset',\n 'hospital_dataset',\n 'kaggle_web_traffic_dataset_with_missing_values',\n 
'kaggle_web_traffic_dataset_without_missing_values',\n 'kaggle_web_traffic_weekly_dataset',\n 'kdd_cup_2018_dataset_with_missing_values',\n 'kdd_cup_2018_dataset_without_missing_values',\n 'london_smart_meters_dataset_with_missing_values',\n 'london_smart_meters_dataset_without_missing_values',\n 'm1_monthly_dataset',\n 'm1_quarterly_dataset',\n 'm1_yearly_dataset',\n 'm3_monthly_dataset',\n 'm3_other_dataset',\n 'm3_quarterly_dataset',\n 'm3_yearly_dataset',\n 'm4_daily_dataset',\n 'm4_hourly_dataset',\n 'm4_monthly_dataset',\n 'm4_quarterly_dataset',\n 'm4_weekly_dataset',\n 'm4_yearly_dataset',\n 'nn5_daily_dataset_with_missing_values',\n 'nn5_daily_dataset_without_missing_values',\n 'nn5_weekly_dataset',\n 'pedestrian_counts_dataset',\n 'saugeenday_dataset',\n 'solar_10_minutes_dataset',\n 'solar_4_seconds_dataset',\n 'solar_weekly_dataset',\n 'sunspot_dataset_with_missing_values',\n 'sunspot_dataset_without_missing_values',\n 'tourism_monthly_dataset',\n 'tourism_quarterly_dataset',\n 'tourism_yearly_dataset',\n 'traffic_hourly_dataset',\n 'traffic_weekly_dataset',\n 'us_births_dataset',\n 'weather_dataset',\n 'wind_4_seconds_dataset',\n 'wind_farms_minutely_dataset_with_missing_values',\n 'wind_farms_minutely_dataset_without_missing_values']" }, - "execution_count": 7, + "execution_count": 15, "metadata": {}, "output_type": "execute_result" } @@ -270,12 +280,16 @@ "list_available_tsf_datasets()" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:42.347519900Z", + "start_time": "2023-07-24T15:29:42.341536600Z" + } } }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 16, "outputs": [ { "name": "stdout", @@ -307,7 +321,11 @@ "print(data)" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:49.815481300Z", + "start_time": "2023-07-24T15:29:42.347519900Z" + } } }, { @@ -319,9 +337,9 @@ " and experimental evaluation of recent 
algorithmic advances, Data Mining and\n", " Knowledge Discovery 35(2), 2021\n", "[3] Tan et. al, Time Series Extrinsic Regression, Data Mining and Knowledge\n", - "Discovery, 2021\n", - "[4] Godahewa et. al, Monash Time Series Forecasting Archive,Neural Information\n", - "Processing Systems Track on Datasets and Benchmarks, 2021\n" + " Discovery, 2021\n", + "[4] Godahewa et. al, Monash Time Series Forecasting Archive, Neural Information\n", + " Processing Systems Track on Datasets and Benchmarks, 2021\n" ], "metadata": { "collapsed": false diff --git a/examples/datasets/data_conversions.ipynb b/examples/datasets/data_conversions.ipynb index a228a7affd..6aa41242d4 100644 --- a/examples/datasets/data_conversions.ipynb +++ b/examples/datasets/data_conversions.ipynb @@ -6,11 +6,11 @@ "# Data conversions in aeon\n", "\n", "We recommend you follow the data storage described in the [data storage notebook](examples/datasets/data_storage.ipynb)\n", - "which can be summarised as follows: Use `pd.Series` or `pd.DataFrame` for forecasting\n", - " and for classification, clustering and regression, use 3D numpy of shape `(n_cases,\n", - " n_channels, n_timepoints)` if your collection of time series are equal length, or a\n", - " list of 2D numpy of length `[n_cases]` if not equal length. All are [data loaders]\n", - " (examples/datasets/data_loading.ipynb) use this format.\n", + "which can be summarised as follows: Use `pd.Series` or `pd.DataFrame` for tasks\n", + "which focus on single series such as forecasting, and for tasks such as classification,\n", + "clustering and regression use a 3D numpy array of shape `(n_cases, n_channels, n_timepoints)`\n", + "if the time series in your collection are equal length, or a list of 2D numpy arrays of length `n_cases`\n", + "if not equal length. All our [data loaders](examples/datasets/data_loading.ipynb) use this format.\n", "\n", "However, `aeon` provides a range of converters in the `datatypes` package. 
These are\n", "grouped into converters for single series and converters for collections of series" diff --git a/examples/datasets/data_loading.ipynb b/examples/datasets/data_loading.ipynb index 758106e8ed..045531830e 100644 --- a/examples/datasets/data_loading.ipynb +++ b/examples/datasets/data_loading.ipynb @@ -3,15 +3,23 @@ { "cell_type": "markdown", "source": [ - "# Loading data into aeon\n", - "aeon supports a range of data input formats. Example problems are described in\n", - "provided_data.ipyn. Downloading data is described in benchmarking_data.ipynb. You\n", - "can of course load and format the data so that it conforms to the input types\n", - "describe in data_storage. aeon also provides data formats for time series for both\n", - "forecasting and machine learning. These are all text files with a particular\n", + "# Loading data in aeon\n", + "\n", + "`aeon` supports a range of data input formats. Accepted datatypes are provided in the\n", + "[data conversions](examples/datasets/data_conversions.ipynb) and\n", + "[data storage](examples/datasets/data_storage.ipynb) notebooks. Example problems are\n", + "described in the [provided data notebook](examples/datasets/provided_data.ipynb), with\n", + "guidance on downloading popular benchmark data provided in the\n", + "[benchmarking data notebook](examples/datasets/benchmarking_data.ipynb).\n", + "\n", + "This notebook provides guidance on loading data from a few popular data file formats used in\n", + "time series machine learning and forecasting scenarios.\n", + "You can of course load data from whatever format you wish and then format the data so that\n", + "it conforms to the input types described. These are all text files with a particular\n", "structure. Both formats store a single time series per row.\n", "\n", - "1. The `.ts` and `.tsf` format used by the aeon packages and the [time series](https://timeseriesclassification.com) and [forecasting](https://forecastingdata.org)\n", + "1. `.csv`\n", + "2. 
The `.ts` and `.tsf` format used by the aeon packages and the [time series](https://timeseriesclassification.com) and [forecasting](https://forecastingdata.org)\n", " repositories. More information on the `.tsf` format is\n", "[here](https://openreview.net/pdf?id=wEc1mgAjU-)\n", "Links to download all of the UCR univariate and the tsml multivariate data in `.ts`\n", diff --git a/examples/datasets/data_storage.ipynb b/examples/datasets/data_storage.ipynb index 880c4c0b5f..e8bf8fc67f 100644 --- a/examples/datasets/data_storage.ipynb +++ b/examples/datasets/data_storage.ipynb @@ -5,36 +5,44 @@ "source": [ "# Storing data to use for aeon estimators\n", "\n", - "aeon includes time series forecasting and machine learning. These two communities\n", - "have different conventions on how to store data and what to call data structures.\n", - "Some of the differences are\n", + "`aeon` includes multiple time series tasks such as forecasting and machine learning\n", + "(i.e. classification, regression and clustering). These two communities have different\n", + "conventions and requirements for storing data and what to call data structures. We try\n", + "to accommodate for both, which leads to some differences between `aeon` packages. Some\n", + "differences are:\n", "\n", - "1. Forecasters almost always stores data in pandas data structures, whereas machine\n", - "learners use numpy arrays almost exclusively.\n", - "2. n forecasting a 2 dimensional data is almost always shape `(n_timepoints, n_timeseries)` whereas in\n", - "machine learning we would tend to store data in a `(n_timeseries, n_timepoints)` array.\n", - "3. In forecasting, a variable `y` refers to a time series for which we are attempting\n", + "1. Forecasters almost always store data in pandas data structures internally, whereas machine\n", + " learners use numpy arrays almost exclusively.\n", + "2. 
Most forecasting estimators (but not all) will take a single series as a 1D or 2D array-like\n", + " as the data to learn from, whereas machine learning estimators will take a collection of series\n", + " as a 3D or 2D array-like.\n", + "3. In forecasting 2D arrays are almost always single series of shape `(n_timepoints, n_channels)`\n", + " whereas in machine learning we would tend to store data in a `(n_cases, n_timepoints)`\n", + " collection of series.\n", + "4. In forecasting, a variable `y` refers to a time series for which we are attempting\n", " to make a forecast, hence `y` is assumed to be ordered. In machine learning,\n", " `y` is a list of either class labels (for classification) or observations of a\n", - " response vairable (for regression). The ordering of values in `y` is determined by\n", + " response variable (for regression). The ordering of values in `y` is determined by\n", " the ordering of the `X` input.\n", "\n", - "Because of these sources of confusion, we recommend that you store data in\n", - "pandas data structures for forecasting and numpy arrays for machine learning." - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ + "Because of these sources of confusion, we recommend carefully reading the documentation for the task\n", + "prior to usage to ensure you are using the correct input data type. We also recommend that you store\n", + "data in pandas data structures for forecasting and numpy arrays for machine learning tasks. 
All of\n", + "our accepted input types can be used given they are compatible with the algorithms (see the\n", + "[data conversions notebook](examples/datasets/data_conversions.ipynb) for more accepted types), but\n", + "keeping to the recommended types is likely to reduce the number of data conversions and make finding help\n", + "easier.\n", + "\n", + "In the following, we provide guidance and examples for storing data for forecasting and machine learning\n", + "using our recommended data types.\n", + "\n", "## Forecasting data\n", "\n", - "aeon forecasting uses pd.Series, pd.DataFrame and pd.Multiindex to store data. It has\n", - "some built in forecasting datasets and tools for downloading commonly used\n", - "benchmarks, loading_data.ipynb forecasting section. For details of the forecasting\n", - "functionality, see the numerous forecasting notebooks.\n", + "The `aeon` forecasting module primarily uses pd.Series, pd.DataFrame and pd.MultiIndex to store data.\n", + "It has some built-in forecasting datasets and tools for downloading commonly used benchmarks; see the\n", + "forecasting section of the [data loading notebook](examples/datasets/data_loading.ipynb). For details of\n", + "the forecasting functionality, see the [forecasting user guide](examples/forecasting/forecasting.ipynb)\n", + "and the numerous forecasting notebooks on the [examples page](examples).\n", "\n", "`pd.Series` are used to store a univariate time series with entries corresponding to\n", "different time points." 
@@ -45,13 +53,13 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 19, "outputs": [ { "data": { - "text/plain": "5 120.0\n6 140.0\n7 160.0\ndtype: float64" + "text/plain": "0 20.0\n1 40.0\n2 60.0\n3 80.0\n4 100.0\ndtype: float64" }, - "execution_count": 1, + "execution_count": 19, "metadata": {}, "output_type": "execute_result" } @@ -61,23 +69,48 @@ "import numpy as np\n", "import pandas as pd\n", "\n", + "y = pd.Series([20.0, 40.0, 60.0, 80.0, 100.0])\n", + "y" + ], + "metadata": { + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T11:45:57.192506600Z", + "start_time": "2023-07-24T11:45:57.132654400Z" + } + } + }, + { + "cell_type": "code", + "execution_count": 20, + "outputs": [ + { + "data": { + "text/plain": "5 120.0\n6 140.0\n7 160.0\ndtype: float64" + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ "from aeon.forecasting.trend import TrendForecaster\n", "\n", - "y = pd.Series([20.0, 40.0, 60.0, 80.0, 100.0])\n", - "forecaster = TrendForecaster()\n", - "forecaster.fit(y) # fit the forecaster\n", - "forecaster.predict(fh=[1, 2, 3]) # forecast the next 3 values" + "tf = TrendForecaster()\n", + "tf.fit(y) # fit the forecaster\n", + "tf.predict(fh=[1, 2, 3]) # forecast the next 3 values" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T11:45:57.194505900Z", + "start_time": "2023-07-24T11:45:57.140619300Z" + } } }, { "cell_type": "markdown", "source": [ - "`pd.Series` are used to store a univariate time series with entries corresponding to\n", - "different time points.\n", - "\n", "`pd.DataFrame` are used to store multiple time series, where each column is a time\n", "series, and each row corresponds to a different, distinct time point. The index\n", "is the time point and should be monotonic. 
This creates two series called Sales and\n", @@ -89,27 +122,14 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 21, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " Sales Temperature\n", - "0 111 26\n", - "1 100 21\n", - "2 90 19\n", - "3 80 14\n", - "4 65 12\n", - "5 89 22\n" - ] - }, { "data": { - "text/plain": " Sales Temperature\n6 89.0 22.0\n7 89.0 22.0\n8 89.0 22.0", - "text/html": "
" + "text/plain": " Sales Temperature\n0 111 26\n1 100 21\n2 90 19\n3 80 14\n4 65 12\n5 89 22", + "text/html": "
" }, - "execution_count": 2, + "execution_count": 21, "metadata": {}, "output_type": "execute_result" } @@ -121,15 +141,43 @@ "}\n", "# Create DataFrame\n", "ice_creams = pd.DataFrame(ice_creams)\n", - "print(ice_creams)\n", + "ice_creams" + ], + "metadata": { + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T11:45:57.245339100Z", + "start_time": "2023-07-24T11:45:57.148598Z" + } + } + }, + { + "cell_type": "code", + "execution_count": 22, + "outputs": [ + { + "data": { + "text/plain": " Sales Temperature\n6 89.0 22.0\n7 89.0 22.0\n8 89.0 22.0", + "text/html": "
" + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ "from aeon.forecasting.exp_smoothing import ExponentialSmoothing\n", "\n", - "forecaster = ExponentialSmoothing()\n", - "forecaster.fit(ice_creams)\n", - "forecaster.predict(fh=[1, 2, 3])" + "es = ExponentialSmoothing()\n", + "es.fit(ice_creams)\n", + "es.predict(fh=[1, 2, 3])" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T11:45:57.256309400Z", + "start_time": "2023-07-24T11:45:57.156602400Z" + } } }, { @@ -143,21 +191,16 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 23, "outputs": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - " Sales Temperature\n", - "datetime \n", - "2018-01-06 23:15:00 111 26\n", - "2019-02-09 01:48:00 100 21\n", - "2020-08-06 13:20:00 90 19\n", - "2021-07-03 14:50:00 80 14\n", - "2022-07-06 11:50:00 65 12\n", - "2023-03-05 16:50:00 89 22\n" - ] + "data": { + "text/plain": " Sales Temperature\ndatetime \n2018-01-06 23:15:00 111 26\n2019-02-09 01:48:00 100 21\n2020-08-06 13:20:00 90 19\n2021-07-03 14:50:00 80 14\n2022-07-06 11:50:00 65 12\n2023-03-05 16:50:00 89 22", + "text/html": "
" + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ @@ -172,17 +215,21 @@ " ]\n", ")\n", "ice_creams = ice_creams.set_index(\"datetime\")\n", - "print(ice_creams)" + "ice_creams" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T11:45:57.257307Z", + "start_time": "2023-07-24T11:45:57.179516200Z" + } } }, { "cell_type": "markdown", "source": [ "`pd.DataFrame` also have the capability to store multiple indexes, which can be used\n", - "to represent whats called Panel data in forecasting hierarchical data. A Panel is a\n", + "to represent what's called Panel data in forecasting hierarchical data. A Panel is a\n", "collection of (possibly) multivariate data." ], "metadata": { @@ -191,14 +238,14 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 24, "outputs": [ { "data": { - "text/plain": " c0\nh0 h1 time \nh0_0 h1_0 2000-01-01 2.199534\n 2000-01-02 5.267746\n 2000-01-03 4.792742\n 2000-01-04 3.115800\n 2000-01-05 5.581822", - "text/html": "
" + "text/plain": " c0\nh0 h1 time \nh0_0 h1_0 2000-01-01 4.249534\n 2000-01-02 2.899939\n 2000-01-03 2.671320\n 2000-01-04 4.380220\n 2000-01-05 5.538047\n... ...\nh0_1 h1_3 2000-01-08 3.658460\n 2000-01-09 3.672319\n 2000-01-10 2.938018\n 2000-01-11 2.902982\n 2000-01-12 2.871146\n\n[96 rows x 1 columns]", + "text/html": "
" }, - "execution_count": 4, + "execution_count": 24, "metadata": {}, "output_type": "execute_result" } @@ -207,40 +254,47 @@ "from aeon.utils._testing.hierarchical import _make_hierarchical\n", "\n", "y = _make_hierarchical()\n", - "y.head()" + "y" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T11:45:57.258304600Z", + "start_time": "2023-07-24T11:45:57.188516600Z" + } } }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 25, "outputs": [ { "data": { - "text/plain": " c0\nh0 h1 time \nh0_0 h1_0 2000-01-13 4.076904\n 2000-01-14 4.076904\n h1_1 2000-01-13 5.185745\n 2000-01-14 5.185745\n h1_2 2000-01-13 3.773312\n 2000-01-14 3.773312\n h1_3 2000-01-13 2.851027\n 2000-01-14 2.851027\nh0_1 h1_0 2000-01-13 3.468474\n 2000-01-14 3.468474\n h1_1 2000-01-13 4.421536\n 2000-01-14 4.421536\n h1_2 2000-01-13 3.791238\n 2000-01-14 3.791238\n h1_3 2000-01-13 4.026049\n 2000-01-14 4.026049", - "text/html": "
" + "text/plain": " c0\nh0 h1 time \nh0_0 h1_0 2000-01-13 4.200625\n 2000-01-14 4.200625\n h1_1 2000-01-13 3.714500\n 2000-01-14 3.714500\n h1_2 2000-01-13 3.982618\n 2000-01-14 3.982618\n h1_3 2000-01-13 3.911963\n 2000-01-14 3.911963\nh0_1 h1_0 2000-01-13 3.627664\n 2000-01-14 3.627664\n h1_1 2000-01-13 3.844651\n 2000-01-14 3.844651\n h1_2 2000-01-13 3.889248\n 2000-01-14 3.889248\n h1_3 2000-01-13 3.119286\n 2000-01-14 3.119286", + "text/html": "
" }, - "execution_count": 5, + "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "forecaster.fit(y, fh=[1, 2]).predict()" + "es.fit(y, fh=[1, 2]).predict()" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T11:45:57.418875200Z", + "start_time": "2023-07-24T11:45:57.200459Z" + } } }, { "cell_type": "markdown", "source": [ "`np.ndarray` can be used with the forecasters in aeon, although we recommend using\n", - "pandas. One dimensional np.ndarray are treated as a single time series. 2D numpy\n", - "array are treated as multiple series of shape `(n_timeseries, n_timepoints)`.\n", - "Forecasters fit independently on each series." + "pandas. One-dimensional np.ndarray are treated as a single time series. 2D numpy\n", + "arrays are treated as multiple series of shape `(n_timeseries, n_timepoints)`." ], "metadata": { "collapsed": false @@ -248,13 +302,13 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 26, "outputs": [ { "data": { "text/plain": "array([[120.],\n [140.],\n [160.]])" }, - "execution_count": 6, + "execution_count": 26, "metadata": {}, "output_type": "execute_result" } @@ -266,18 +320,22 @@ "forecaster.predict(fh=[1, 2, 3]) # forecast the next 3 values" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T11:45:57.420869900Z", + "start_time": "2023-07-24T11:45:57.299224700Z" + } } }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 27, "outputs": [ { "data": { "text/plain": "array([[120., 50.],\n [140., 40.],\n [160., 30.]])" }, - "execution_count": 7, + "execution_count": 27, "metadata": {}, "output_type": "execute_result" } @@ -290,7 +348,11 @@ "forecaster.predict(fh=[1, 2, 3]) # forecast the next 3 values" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T11:45:57.444806900Z", + "start_time": 
"2023-07-24T11:45:57.308171700Z" + } } }, { @@ -310,7 +372,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 28, "outputs": [ { "name": "stdout", @@ -330,12 +392,16 @@ "print(\"X shape = \", X.shape, \" First series =\", X[0], \"second series = \", X[1])" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T11:45:57.445803400Z", + "start_time": "2023-07-24T11:45:57.324129700Z" + } } }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 29, "outputs": [ { "name": "stdout", @@ -351,14 +417,6 @@ " [ 14. 70. 60. 22.]\n", " [ 49. 49. 66. 9.]]\n" ] - }, - { - "data": { - "text/plain": "array([0, 1, 1, 1], dtype=int64)" - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" } ], "source": [ @@ -371,7 +429,30 @@ " ]\n", ")\n", "# n_cases = 4, n_channels =3, n_timepoints = 4\n", - "print(\"X shape = \", X.shape, \"\\n First series =\\n\", X[0], \"\\nsecond series = \\n\", X[1])\n", + "print(\"X shape = \", X.shape, \"\\n First series =\\n\", X[0], \"\\nsecond series = \\n\", X[1])" + ], + "metadata": { + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T11:45:57.446800500Z", + "start_time": "2023-07-24T11:45:57.330112900Z" + } + } + }, + { + "cell_type": "code", + "execution_count": 30, + "outputs": [ + { + "data": { + "text/plain": "array([0, 1, 1, 1], dtype=int64)" + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ "from aeon.clustering.k_means import TimeSeriesKMeans\n", "\n", "kmeans = TimeSeriesKMeans(metric=\"euclidean\", n_clusters=2)\n", @@ -379,7 +460,11 @@ "kmeans.predict(X)" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T11:45:57.473727600Z", + "start_time": "2023-07-24T11:45:57.337094Z" + } } }, { @@ -394,13 +479,13 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 31, "outputs": [ { "data": { 
"text/plain": "array(['pass', 'pass', 'fail', 'fail'], dtype=' Date: Wed, 26 Jul 2023 11:17:35 +0100 Subject: [PATCH 13/14] fixes --- examples/datasets/benchmarking_data.ipynb | 7 ++++++- examples/datasets/data_loading.ipynb | 3 +-- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/examples/datasets/benchmarking_data.ipynb b/examples/datasets/benchmarking_data.ipynb index 7ad658f6a5..76b0d21134 100644 --- a/examples/datasets/benchmarking_data.ipynb +++ b/examples/datasets/benchmarking_data.ipynb @@ -99,7 +99,8 @@ " /Chinatown/Chinatown_TRAIN.ts\n", " /Chinatown/Chinatown_TEST.ts\n", "\n", - "You can load these problems directly from [https://timeseriesclassification.com] and load\n", + "You can load these problems directly from\n", + "[https://timeseriesclassification.com](https://timeseriesclassification.com) and load\n", "them into memory. Note by default, these functions return the data and associated metadata.\n", "This usage combines the train and test splits and loads them into one `X` and one `y` array." ], @@ -332,12 +333,16 @@ "cell_type": "markdown", "source": [ "## References\n", + "\n", "[1] Dau et. al, The UCR time series archive, IEEE/CAA Journal of Automatica Sinica, 2019\n", + "\n", "[2] Ruiz et. al, The great multivariate time series classification bake off: a review\n", " and experimental evaluation of recent algorithmic advances, Data Mining and\n", " Knowledge Discovery 35(2), 2021\n", + "\n", "[3] Tan et. al, Time Series Extrinsic Regression, Data Mining and Knowledge\n", " Discovery, 2021\n", + "\n", "[4] Godahewa et. al, Monash Time Series Forecasting Archive, Neural Information\n", " Processing Systems Track on Datasets and Benchmarks, 2021\n" ], diff --git a/examples/datasets/data_loading.ipynb b/examples/datasets/data_loading.ipynb index 045531830e..74b13b464a 100644 --- a/examples/datasets/data_loading.ipynb +++ b/examples/datasets/data_loading.ipynb @@ -18,8 +18,7 @@ "it conforms to the input types described. 
These are all text files with a particular\n", "structure. Both formats store a single time series per row.\n", "\n", - "1. `.csv`\n", - "2. The `.ts` and `.tsf` format used by the aeon packages and the [time series](https://timeseriesclassification.com) and [forecasting](https://forecastingdata.org)\n", + "1. The `.ts` and `.tsf` format used by the aeon packages and the [time series](https://timeseriesclassification.com) and [forecasting](https://forecastingdata.org)\n", " repositories. More information on the `.tsf` format is\n", "[here](https://openreview.net/pdf?id=wEc1mgAjU-)\n", "Links to download all of the UCR univariate and the tsml multivariate data in `.ts`\n", From 7db60f252b5eafbcdd63e12d71f195078c4e2d33 Mon Sep 17 00:00:00 2001 From: MatthewMiddlehurst Date: Mon, 9 Oct 2023 14:23:34 +0100 Subject: [PATCH 14/14] rename --- examples/datasets/benchmarking_data.ipynb | 375 --------------------- examples/datasets/load_data_from_web.ipynb | 167 +++++---- 2 files changed, 95 insertions(+), 447 deletions(-) delete mode 100644 examples/datasets/benchmarking_data.ipynb diff --git a/examples/datasets/benchmarking_data.ipynb b/examples/datasets/benchmarking_data.ipynb deleted file mode 100644 index 76b0d21134..0000000000 --- a/examples/datasets/benchmarking_data.ipynb +++ /dev/null @@ -1,375 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "source": [ - "# Downloading and loading benchmarking datasets\n", - "\n", - "It is common to use standard collections of data to compare different estimators for\n", - "classification, clustering, regression and forecasting. Some of the smaller datasets from\n", - "these datasets included with `aeon` in the `aeon/datasets/data` directory. However,\n", - "there is way to many datasets to include them all, and some of the files are far too big\n", - "to consider including in the package. `aeon` provides tools to download these data to use\n", - "in benchmarking experiments. 
Classification and regression data are stored in .ts format.\n", - "Forecasting data are stored in the equivalent .tsf format. See the\n", - "[data loading notebook](examples/data_loading.ipynb) for more info.\n", - "\n", - "Classification and regression are loaded into 3D numpy arrays of shape\n", - "`(n_cases, n_channels, n_timepoints)` if equal length or a list of length\n", - "`n_cases` of 2D numpy arrays of shape `(n_channels, n_timepoints)` if\n", - "`n_timepoints` is different between cases. Forecasting data are loaded into\n", - "pd.DataFrame. For more information on aeon data types see the\n", - "[data storage notebook](examples/data_storage.ipynb).\n", - "\n", - "Note that this notebook is dependent on external websites, so will not function if\n", - "you are not online or the associated website is down. We use the following three\n", - "functions" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": 9, - "outputs": [], - "source": [ - "from aeon.datasets import load_classification, load_forecasting, load_regression" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-07-24T15:29:38.987856700Z", - "start_time": "2023-07-24T15:29:38.941979900Z" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "## Time Series Classification Archive\n", - "\n", - "The [UCR/TSML Time Series Classification Archive](https://timeseriesclassification.com)\n", - "hosts the UCR univariate TSC archive (also available from\n", - "[UCR](https://www.cs.ucr.edu/%7Eeamonn/time_series_data_2018/)) [1], and\n", - "the multivariate archive [2] (previously called the UEA archive, soon to change). We\n", - "provide seven of these in the datasets/data directory: ACSF1, ArrowHead, BasicMotions,\n", - "GunPoint, ItalyPowerDemand, JapaneseVowels and PLAID. The archive is much bigger. The\n", - "last batch release was for 128 univariate [1] and 33 multivariate [2] datasets. 
If you just\n", - "want to download them all, please go to the [website](https://timeseriesclassification.com)." - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": 10, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Univariate length = 128\n", - "Multivariate length = 33\n" - ] - } - ], - "source": [ - "from aeon.datasets.tsc_data_lists import multivariate, univariate\n", - "\n", - "# This file also contains sub lists by type, e.g. unequal length\n", - "print(\"Univariate length = \", len(univariate))\n", - "print(\"Multivariate length = \", len(multivariate))" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-07-24T15:29:38.988882400Z", - "start_time": "2023-07-24T15:29:38.949956800Z" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "A default train and test split is provided for this data. The file structure for a\n", - "problem such as Chinatown is\n", - "\n", - " /Chinatown/Chinatown_TRAIN.ts\n", - " /Chinatown/Chinatown_TEST.ts\n", - "\n", - "You can load these problems directly from\n", - "[https://timeseriesclassification.com](https://timeseriesclassification.com) and load\n", - "them into memory. Note by default, these functions return the data and associated metadata.\n", - "This usage combines the train and test splits and loads them into one `X` and one `y` array." - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": 11, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Shape of X = (363, 1, 24)\n", - "First case = [ 573. 375. 301. 212. 55. 34. 25. 33. 113. 143. 303. 615.\n", - " 1226. 1281. 1221. 1081. 866. 1096. 1039. 975. 746. 581. 409. 182.] 
has label = 1\n", - "\n", - "Meta data = {'problemname': 'chinatown', 'timestamps': False, 'missing': False, 'univariate': True, 'equallength': True, 'classlabel': True, 'targetlabel': False, 'class_values': ['1', '2']}\n" - ] - } - ], - "source": [ - "X, y, meta = load_classification(\"Chinatown\")\n", - "print(\"Shape of X = \", X.shape)\n", - "print(\"First case = \", X[0][0], \" has label = \", y[0])\n", - "print(\"\\nMeta data = \", meta)" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-07-24T15:29:39.067643100Z", - "start_time": "2023-07-24T15:29:38.954944100Z" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "If you look in `aeon/datasets/local_data/` you should see a directory called `Chinatown`\n", - "containing the Chinatown datasets. All of the zips have `.ts` files. Some also have\n", - "`.arff` and `.txt` files. If you load again, it will not download again if the file is\n", - "already there. If you want to store data somewhere else, you can specify a file path\n", - "using the `extract_path` parameter. Additionally, you can load the train and test\n", - "separately as shown below.\n", - "\n", - "This code will download the data and load into separate train/test splits. The split argument is not\n", - "case sensitive. Once downloaded, `load_classification` is a equivalent to a call to `load_from_tsfile`." 
- ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": 12, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Train shape = (20, 1, 512)\n", - "Test shape = (20, 1, 512)\n" - ] - } - ], - "source": [ - "X_train, y_train = load_classification(\n", - " \"BeetleFly\", split=\"TRAIN\", return_metadata=False\n", - ")\n", - "X_test, y_test = load_classification(\"BeetleFly\", split=\"test\", return_metadata=False)\n", - "print(\"Train shape = \", X_train.shape)\n", - "print(\"Test shape = \", X_test.shape)" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-07-24T15:29:39.229225500Z", - "start_time": "2023-07-24T15:29:39.068640Z" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "## Time Series (Extrinsic) Regression\n", - "\n", - "The [Monash Time Series Extrinsic Regression Archive](http://tseregression.org/) [3] repo\n", - "(called extrinsic to differentiate if from sliding window based regression) currently\n", - "contains 19 regression problems in `.ts` format. One of these, Covid3Month, is in\n", - "`datasets\\data`. 
The usage of `load_regression` is identical to `load_classification`\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": 13, - "outputs": [ - { - "data": { - "text/plain": "['AppliancesEnergy',\n 'AustraliaRainfall',\n 'BIDMCHR',\n 'BIDMCRR',\n 'BIDMCSpO2',\n 'BeijingPM10Quality',\n 'BeijingPM25Quality',\n 'BenzeneConcentration',\n 'Covid3Month',\n 'FloodModeling1',\n 'FloodModeling2',\n 'FloodModeling3',\n 'HouseholdPowerConsumption1',\n 'HouseholdPowerConsumption2',\n 'IEEEPPG',\n 'LiveFuelMoistureContent',\n 'NewsHeadlineSentiment',\n 'NewsTitleSentiment',\n 'PPGDalia']" - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from aeon.datasets.dataset_collections import list_available_tser_datasets\n", - "\n", - "list_available_tser_datasets()" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-07-24T15:29:39.237204700Z", - "start_time": "2023-07-24T15:29:39.230223400Z" - } - } - }, - { - "cell_type": "code", - "execution_count": 14, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Shape of X = (673, 1, 266)\n" - ] - } - ], - "source": [ - "X, y, meta = load_regression(\"FloodModeling1\")\n", - "print(\"Shape of X = \", X.shape)" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-07-24T15:29:42.341536600Z", - "start_time": "2023-07-24T15:29:39.237204700Z" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "## Time Series Forecasting\n", - "\n", - "The [Monash time series forecasting](https://forecastingdata.org/) repo contains a\n", - "large number of forecasting data, including competition data such as M1, M3 and M4.\n", - "Usage is the same as the other problems, although there is no provided train/test\n", - "splits.\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": 15, - "outputs": [ - { - "data": { - 
"text/plain": "['australian_electricity_demand_dataset',\n 'car_parts_dataset_with_missing_values',\n 'car_parts_dataset_without_missing_values',\n 'cif_2016_dataset',\n 'covid_deaths_dataset',\n 'covid_mobility_dataset_with_missing_values',\n 'covid_mobility_dataset_without_missing_values',\n 'dominick_dataset',\n 'elecdemand_dataset',\n 'electricity_hourly_dataset',\n 'electricity_weekly_dataset',\n 'fred_md_dataset',\n 'hospital_dataset',\n 'kaggle_web_traffic_dataset_with_missing_values',\n 'kaggle_web_traffic_dataset_without_missing_values',\n 'kaggle_web_traffic_weekly_dataset',\n 'kdd_cup_2018_dataset_with_missing_values',\n 'kdd_cup_2018_dataset_without_missing_values',\n 'london_smart_meters_dataset_with_missing_values',\n 'london_smart_meters_dataset_without_missing_values',\n 'm1_monthly_dataset',\n 'm1_quarterly_dataset',\n 'm1_yearly_dataset',\n 'm3_monthly_dataset',\n 'm3_other_dataset',\n 'm3_quarterly_dataset',\n 'm3_yearly_dataset',\n 'm4_daily_dataset',\n 'm4_hourly_dataset',\n 'm4_monthly_dataset',\n 'm4_quarterly_dataset',\n 'm4_weekly_dataset',\n 'm4_yearly_dataset',\n 'nn5_daily_dataset_with_missing_values',\n 'nn5_daily_dataset_without_missing_values',\n 'nn5_weekly_dataset',\n 'pedestrian_counts_dataset',\n 'saugeenday_dataset',\n 'solar_10_minutes_dataset',\n 'solar_4_seconds_dataset',\n 'solar_weekly_dataset',\n 'sunspot_dataset_with_missing_values',\n 'sunspot_dataset_without_missing_values',\n 'tourism_monthly_dataset',\n 'tourism_quarterly_dataset',\n 'tourism_yearly_dataset',\n 'traffic_hourly_dataset',\n 'traffic_weekly_dataset',\n 'us_births_dataset',\n 'weather_dataset',\n 'wind_4_seconds_dataset',\n 'wind_farms_minutely_dataset_with_missing_values',\n 'wind_farms_minutely_dataset_without_missing_values']" - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from aeon.datasets.dataset_collections import list_available_tsf_datasets\n", - "\n", - "list_available_tsf_datasets()" - 
], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-07-24T15:29:42.347519900Z", - "start_time": "2023-07-24T15:29:42.341536600Z" - } - } - }, - { - "cell_type": "code", - "execution_count": 16, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "(23000, 3)\n", - "{'frequency': 'yearly', 'forecast_horizon': 6, 'contain_missing_values': False, 'contain_equal_length': False}\n", - " series_name start_timestamp \\\n", - "0 T1 1979-01-01 12:00:00 \n", - "1 T2 1979-01-01 12:00:00 \n", - "2 T3 1979-01-01 12:00:00 \n", - "3 T4 1979-01-01 12:00:00 \n", - "4 T5 1979-01-01 12:00:00 \n", - "\n", - " series_value \n", - "0 [5172.1, 5133.5, 5186.9, 5084.6, 5182.0, 5414.... \n", - "1 [2070.0, 2104.0, 2394.0, 1651.0, 1492.0, 1348.... \n", - "2 [2760.0, 2980.0, 3200.0, 3450.0, 3670.0, 3850.... \n", - "3 [3380.0, 3670.0, 3960.0, 4190.0, 4440.0, 4700.... \n", - "4 [1980.0, 2030.0, 2220.0, 2530.0, 2610.0, 2720.... \n" - ] - } - ], - "source": [ - "X, metadata = load_forecasting(\"m4_yearly_dataset\")\n", - "print(X.shape)\n", - "print(metadata)\n", - "data = X.head()\n", - "print(data)" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-07-24T15:29:49.815481300Z", - "start_time": "2023-07-24T15:29:42.347519900Z" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "## References\n", - "\n", - "[1] Dau et. al, The UCR time series archive, IEEE/CAA Journal of Automatica Sinica, 2019\n", - "\n", - "[2] Ruiz et. al, The great multivariate time series classification bake off: a review\n", - " and experimental evaluation of recent algorithmic advances, Data Mining and\n", - " Knowledge Discovery 35(2), 2021\n", - "\n", - "[3] Tan et. al, Time Series Extrinsic Regression, Data Mining and Knowledge\n", - " Discovery, 2021\n", - "\n", - "[4] Godahewa et. 
al, Monash Time Series Forecasting Archive, Neural Information\n", - " Processing Systems Track on Datasets and Benchmarks, 2021\n" - ], - "metadata": { - "collapsed": false - } - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 2 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/examples/datasets/load_data_from_web.ipynb b/examples/datasets/load_data_from_web.ipynb index c2eff91acf..76b0d21134 100644 --- a/examples/datasets/load_data_from_web.ipynb +++ b/examples/datasets/load_data_from_web.ipynb @@ -6,16 +6,20 @@ "# Downloading and loading benchmarking datasets\n", "\n", "It is common to use standard collections of data to compare different estimators for\n", - "classification, clustering, regression and forecasting. Some of these datasets are\n", - "shipped with aeon in the datasets/data directory. However, the files are far too\n", - "big to include them all. aeon p[rovides tools to download these data to use in benchmarking experiments.\n", - "Classification and regression data are stored in .ts format. Forecasting\n", - "data are stored in the equivalent .tsf format. See the [data loading notebook](data_loading.ipynb) for more info.\n", + "classification, clustering, regression and forecasting. Some of the smaller datasets from\n", + "these collections are included with `aeon` in the `aeon/datasets/data` directory. However,\n", + "there are far too many datasets to include them all, and some of the files are far too big\n", + "to consider including in the package. `aeon` provides tools to download these data to use\n", + "in benchmarking experiments. 
Classification and regression data are stored in .ts format.\n", + "Forecasting data are stored in the equivalent .tsf format. See the\n", + "[data loading notebook](examples/data_loading.ipynb) for more info.\n", "\n", - "Classification and regression are loaded into 3D numpy arrays of shape `(n_cases, n_channels, n_timepoints)` if equal length\n", - "or a list of `[n_cases]` of 2D numpy if `n_timepoints` is different for different\n", - "cases. Forecasting data are loaded into pd.DataFrame. For more information on\n", - "aeon data types see the [data structures notebook](data_structures.ipynb).\n", + "Classification and regression are loaded into 3D numpy arrays of shape\n", + "`(n_cases, n_channels, n_timepoints)` if equal length or a list of length\n", + "`n_cases` of 2D numpy arrays of shape `(n_channels, n_timepoints)` if\n", + "`n_timepoints` is different between cases. Forecasting data are loaded into\n", + "pd.DataFrame. For more information on aeon data types see the\n", + "[data storage notebook](examples/data_storage.ipynb).\n", "\n", "Note that this notebook is dependent on external websites, so will not function if\n", "you are not online or the associated website is down. 
We use the following three\n", @@ -27,13 +31,17 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 9, "outputs": [], "source": [ "from aeon.datasets import load_classification, load_forecasting, load_regression" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:38.987856700Z", + "start_time": "2023-07-24T15:29:38.941979900Z" + } } }, { @@ -41,14 +49,14 @@ "source": [ "## Time Series Classification Archive\n", "\n", - "[UCR/TSML Time Series Classification Archive](https://timeseriesclassification.com)\n", - "hosts the UCR univariate TSC archive [1], also available from [UCR](https://www.cs.ucr.edu/~eamonn/time_series_data_2018/) and\n", + "The [UCR/TSML Time Series Classification Archive](https://timeseriesclassification.com)\n", + "hosts the UCR univariate TSC archive (also available from\n", + "[UCR](https://www.cs.ucr.edu/%7Eeamonn/time_series_data_2018/)) [1], and\n", "the multivariate archive [2] (previously called the UEA archive, soon to change). We\n", - "provide seven of these in the datasets/data directort: ACSF1, ArrowHead, BasicMotions,\n", + "provide seven of these in the datasets/data directory: ACSF1, ArrowHead, BasicMotions,\n", "GunPoint, ItalyPowerDemand, JapaneseVowels and PLAID. The archive is much bigger. The\n", - " last batch release was for 128 univariate [1] and 33 multivariate [2]. If you just\n", - " want to download them all, please go to the [website]\n", - " (https://timeseriesclassification.com)" + "last batch release was for 128 univariate [1] and 33 multivariate [2] datasets. If you just\n", + "want to download them all, please go to the [website](https://timeseriesclassification.com)." 
], "metadata": { "collapsed": false @@ -56,7 +64,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 10, "outputs": [ { "name": "stdout", @@ -75,7 +83,11 @@ "print(\"Multivariate length = \", len(multivariate))" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:38.988882400Z", + "start_time": "2023-07-24T15:29:38.949956800Z" + } } }, { @@ -87,9 +99,10 @@ " /Chinatown/Chinatown_TRAIN.ts\n", " /Chinatown/Chinatown_TEST.ts\n", "\n", - "You can load these problems directly from TSC.com and load them into memory. Note by\n", - "default, these functions return the data and associated metadata. This usage combines\n", - " the train and test splits and loads them into one `X` and one `y` array." + "You can load these problems directly from\n", + "[https://timeseriesclassification.com](https://timeseriesclassification.com) and load\n", + "them into memory. Note by default, these functions return the data and associated metadata.\n", + "This usage combines the train and test splits and loads them into one `X` and one `y` array." ], "metadata": { "collapsed": false @@ -97,7 +110,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 11, "outputs": [ { "name": "stdout", @@ -118,20 +131,25 @@ "print(\"\\nMeta data = \", meta)" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:39.067643100Z", + "start_time": "2023-07-24T15:29:38.954944100Z" + } } }, { "cell_type": "markdown", "source": [ - "If you look in aeon/datasets you should see a directory called `local_data`\n", + "If you look in `aeon/datasets/local_data/` you should see a directory called `Chinatown`\n", "containing the Chinatown datasets. All of the zips have `.ts` files. Some also have\n", "`.arff` and `.txt` files. If you load again, it will not download again if the file is\n", - "already there. 
If you want to store data somewhere else, you can specify a file path.\n", - " Also, you can load the train and test separately. This code will download the data\n", - " to Temp once, and load into separate train/test splits. The split argument is not\n", - " case sensitive. Once downloaded, `load_classification` is a equivalent to a call to\n", - " `load_from_tsfile`" + "already there. If you want to store data somewhere else, you can specify a file path\n", + "using the `extract_path` parameter. Additionally, you can load the train and test\n", + "separately as shown below.\n", + "\n", + "This code will download the data and load it into separate train/test splits. The split argument is not\n", + "case sensitive. Once downloaded, `load_classification` is equivalent to a call to `load_from_tsfile`." ], "metadata": { "collapsed": false @@ -139,46 +157,31 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 12, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train shape = (20, 1, 512)\n", - "Test shape = (20, 1, 512)\n", - "Loaded directly shape = (20, 1, 512)\n" + "Test shape = (20, 1, 512)\n" ] - }, - { - "data": { - "text/plain": "array([1.7400873, 1.7331051, 1.7091917, 1.6333304, 1.5405759])" - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" } ], "source": [ "X_train, y_train = load_classification(\n", - " \"BeetleFly\", extract_path=\"./Temp/\", split=\"TRAIN\", return_metadata=False\n", - ")\n", - "X_test, y_test = load_classification(\n", - " \"BeetleFly\", extract_path=\"./Temp/\", split=\"test\", return_metadata=False\n", + " \"BeetleFly\", split=\"TRAIN\", return_metadata=False\n", ")\n", + "X_test, y_test = load_classification(\"BeetleFly\", split=\"test\", return_metadata=False)\n", "print(\"Train shape = \", X_train.shape)\n", - "print(\"Test shape = \", X_test.shape)\n", - "from aeon.datasets import load_from_tsfile\n", - "\n", - "X_train, y_train = load_from_tsfile(\n", - " 
full_file_path_and_name=\"./Temp/BeetleFly/BeetleFLY_TRAIN\"\n", - ")\n", - "print(\"Loaded directly shape = \", X_train.shape)\n", - "\n", - "X_test[0][0][:5]" + "print(\"Test shape = \", X_test.shape)" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:39.229225500Z", + "start_time": "2023-07-24T15:29:39.068640Z" + } } }, { @@ -186,10 +189,10 @@ "source": [ "## Time Series (Extrinsic) Regression\n", "\n", - "[The Monash Time Series Extrinsic Regression Archive]() [3] repo (called extrinsic to\n", - " diffentiate if from sliding window based regression) currently contains 19\n", - " regression problems in .ts format. One of these, Covid3Month, is in `datasets\data`.\n", - " The usage of `load_regression` is identical to `load_classification`\n", + "The [Monash Time Series Extrinsic Regression Archive](http://tseregression.org/) [3] repo\n", + "(called extrinsic to differentiate it from sliding window based regression) currently\n", + "contains 19 regression problems in `.ts` format. One of these, Covid3Month, is in\n", + "`datasets\data`. 
The usage of `load_regression` is identical to `load_classification`\n" ], "metadata": { "collapsed": false @@ -197,13 +200,13 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 13, "outputs": [ { "data": { "text/plain": "['AppliancesEnergy',\n 'AustraliaRainfall',\n 'BIDMCHR',\n 'BIDMCRR',\n 'BIDMCSpO2',\n 'BeijingPM10Quality',\n 'BeijingPM25Quality',\n 'BenzeneConcentration',\n 'Covid3Month',\n 'FloodModeling1',\n 'FloodModeling2',\n 'FloodModeling3',\n 'HouseholdPowerConsumption1',\n 'HouseholdPowerConsumption2',\n 'IEEEPPG',\n 'LiveFuelMoistureContent',\n 'NewsHeadlineSentiment',\n 'NewsTitleSentiment',\n 'PPGDalia']" }, - "execution_count": 5, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } @@ -214,12 +217,16 @@ "list_available_tser_datasets()" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:39.237204700Z", + "start_time": "2023-07-24T15:29:39.230223400Z" + } } }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 14, "outputs": [ { "name": "stdout", @@ -234,7 +241,11 @@ "print(\"Shape of X = \", X.shape)" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:42.341536600Z", + "start_time": "2023-07-24T15:29:39.237204700Z" + } } }, { @@ -253,13 +264,13 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 15, "outputs": [ { "data": { "text/plain": "['australian_electricity_demand_dataset',\n 'car_parts_dataset_with_missing_values',\n 'car_parts_dataset_without_missing_values',\n 'cif_2016_dataset',\n 'covid_deaths_dataset',\n 'covid_mobility_dataset_with_missing_values',\n 'covid_mobility_dataset_without_missing_values',\n 'dominick_dataset',\n 'elecdemand_dataset',\n 'electricity_hourly_dataset',\n 'electricity_weekly_dataset',\n 'fred_md_dataset',\n 'hospital_dataset',\n 'kaggle_web_traffic_dataset_with_missing_values',\n 
'kaggle_web_traffic_dataset_without_missing_values',\n 'kaggle_web_traffic_weekly_dataset',\n 'kdd_cup_2018_dataset_with_missing_values',\n 'kdd_cup_2018_dataset_without_missing_values',\n 'london_smart_meters_dataset_with_missing_values',\n 'london_smart_meters_dataset_without_missing_values',\n 'm1_monthly_dataset',\n 'm1_quarterly_dataset',\n 'm1_yearly_dataset',\n 'm3_monthly_dataset',\n 'm3_other_dataset',\n 'm3_quarterly_dataset',\n 'm3_yearly_dataset',\n 'm4_daily_dataset',\n 'm4_hourly_dataset',\n 'm4_monthly_dataset',\n 'm4_quarterly_dataset',\n 'm4_weekly_dataset',\n 'm4_yearly_dataset',\n 'nn5_daily_dataset_with_missing_values',\n 'nn5_daily_dataset_without_missing_values',\n 'nn5_weekly_dataset',\n 'pedestrian_counts_dataset',\n 'saugeenday_dataset',\n 'solar_10_minutes_dataset',\n 'solar_4_seconds_dataset',\n 'solar_weekly_dataset',\n 'sunspot_dataset_with_missing_values',\n 'sunspot_dataset_without_missing_values',\n 'tourism_monthly_dataset',\n 'tourism_quarterly_dataset',\n 'tourism_yearly_dataset',\n 'traffic_hourly_dataset',\n 'traffic_weekly_dataset',\n 'us_births_dataset',\n 'weather_dataset',\n 'wind_4_seconds_dataset',\n 'wind_farms_minutely_dataset_with_missing_values',\n 'wind_farms_minutely_dataset_without_missing_values']" }, - "execution_count": 7, + "execution_count": 15, "metadata": {}, "output_type": "execute_result" } @@ -270,12 +281,16 @@ "list_available_tsf_datasets()" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:42.347519900Z", + "start_time": "2023-07-24T15:29:42.341536600Z" + } } }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 16, "outputs": [ { "name": "stdout", @@ -307,21 +322,29 @@ "print(data)" ], "metadata": { - "collapsed": false + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-07-24T15:29:49.815481300Z", + "start_time": "2023-07-24T15:29:42.347519900Z" + } } }, { "cell_type": "markdown", "source": [ "## References\n", + 
"\n", "[1] Dau et. al, The UCR time series archive, IEEE/CAA Journal of Automatica Sinica, 2019\n", + "\n", "[2] Ruiz et. al, The great multivariate time series classification bake off: a review\n", " and experimental evaluation of recent algorithmic advances, Data Mining and\n", " Knowledge Discovery 35(2), 2021\n", + "\n", "[3] Tan et. al, Time Series Extrinsic Regression, Data Mining and Knowledge\n", - "Discovery, 2021\n", - "[4] Godahewa et. al, Monash Time Series Forecasting Archive,Neural Information\n", - "Processing Systems Track on Datasets and Benchmarks, 2021\n" + " Discovery, 2021\n", + "\n", + "[4] Godahewa et. al, Monash Time Series Forecasting Archive, Neural Information\n", + " Processing Systems Track on Datasets and Benchmarks, 2021\n" ], "metadata": { "collapsed": false