[DOC] Update data_storage and benchmarking_data notebooks #599

Closed
wants to merge 39 commits into from
39 commits
71c78fc
datatypes notebook
TonyBagnall Jul 5, 2023
6817086
datatypes notebook
TonyBagnall Jul 5, 2023
8821a6e
revert collection to panel to find circular import
TonyBagnall Jul 5, 2023
77f37ca
revert notebook to _panel
TonyBagnall Jul 5, 2023
c24b6ae
removed isinstance
chrisholder Jul 5, 2023
6017bc7
fixed the bug
chrisholder Jul 5, 2023
7c17e8a
setup
chrisholder Jul 5, 2023
ad0cf25
removed test
chrisholder Jul 5, 2023
87274a1
Merge branch 'main' into distance-numba-fixes
TonyBagnall Jul 5, 2023
ab0110f
Merge branch 'distance-numba-fixes'
TonyBagnall Jul 6, 2023
a4d0018
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 6, 2023
d968077
Merge branch 'main' into ajb/datatypes_docs
TonyBagnall Jul 7, 2023
38622bb
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 7, 2023
17c127a
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 7, 2023
1ee8ddf
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 7, 2023
2eac140
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 9, 2023
cfda9b5
new converters
TonyBagnall Jul 9, 2023
19d3511
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 10, 2023
9274755
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 11, 2023
a70e023
Merge branch 'main' into ajb/datatypes_docs
TonyBagnall Jul 12, 2023
a692517
remove conversions from this PR
TonyBagnall Jul 12, 2023
b47531b
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 12, 2023
0461056
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 14, 2023
07f4132
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 16, 2023
42505c5
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 17, 2023
0327801
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 20, 2023
c094b31
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 20, 2023
fbbcaa0
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 20, 2023
ef3d35e
Merge branch 'main' into ajb/datatypes_docs
TonyBagnall Jul 20, 2023
4d4d653
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 20, 2023
88b7b90
Merge branch 'main' of https://github.com/aeon-toolkit/aeon
TonyBagnall Jul 21, 2023
065fc9e
remove method stub
TonyBagnall Jul 22, 2023
b95a381
Merge branch 'main' into ajb/datatypes_docs
TonyBagnall Jul 22, 2023
5e5f6de
storage and benchmarking
MatthewMiddlehurst Jul 24, 2023
c348ae8
Merge branch 'main' of https://github.com/aeon-toolkit/aeon into mm/d…
MatthewMiddlehurst Jul 26, 2023
1940abb
fixes
MatthewMiddlehurst Jul 26, 2023
98fd35d
Merge branch 'main' of https://github.com/aeon-toolkit/aeon into mm/d…
MatthewMiddlehurst Oct 9, 2023
7db60f2
rename
MatthewMiddlehurst Oct 9, 2023
ab4ca98
merge
MatthewMiddlehurst Apr 22, 2024
101 changes: 58 additions & 43 deletions examples/datasets/data_conversions.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -5,12 +5,12 @@
"source": [
"# Data conversions in aeon\n",
"\n",
"We recommend you follow the following strategy: Use `pd.Series` or `pd\n",
".DataFrame` for forecasting and for classification, clustering and regression, use 3D\n",
" numpy of shape `(n_cases, n_channels, n_timepoints)` if your collection of time series are equal length, or a\n",
" list of 2D numpy of length `[n_cases]` if not equal length. All data loaded from\n",
" file with our [data loaders](data_loading.ipynb) use this\n",
" strategy.\n",
"We recommend you follow the data storage format described in the [data storage notebook](examples/datasets/data_storage.ipynb),\n",
"which can be summarised as follows: use `pd.Series` or `pd.DataFrame` for tasks\n",
"which focus on a single series, such as forecasting; for tasks such as classification,\n",
"clustering and regression, use a 3D numpy array of shape `(n_cases, n_channels, n_timepoints)`\n",
"if your time series are all of equal length, or a list of 2D numpy arrays of length `[n_cases]`\n",
"if they are not. All our [data loaders](examples/datasets/data_loading.ipynb) use this format.\n",
"\n",
"However, `aeon` provides a range of converters in the `datatypes` package. These are\n",
"grouped into converters for single series and converters for collections of series"
Expand All @@ -22,7 +22,7 @@
{
"cell_type": "markdown",
"source": [
"## Series Converters\n",
"# Series Converters\n",
"\n",
"Single time series can be stored in the following data structures\n",
"\n",
Expand All @@ -32,16 +32,16 @@
"- \"xr.DataArray\": xarray DataArray for a univariate or multivariate time series\n",
"- \"dask_series\": Dask DataFrame for a univariate or multivariate time series\n",
"\n",
"The above strings are used to internally specify each different data structure for\n",
"internal conversion purposes. NOTE the 2D numpy array representation is not consistent with that used in\n",
"The above strings are used to internally specify each different data structure. NOTE the\n",
" 2D numpy array representation is not consistent with that used in\n",
"collections. This is an unfortunate difference that is a result of legacy design and\n",
"norms in different research fields.\n",
"\n",
"Conversion to and from these data structures is fairly straightforward, but we\n",
"provide tools to help. `aeon` contains converters that are wrapped by the method\n",
"`convert`. This method will attempt to convert from one of the five types to another,\n",
" and raise an exception if the conversion is invalid (e.g. if the object is not in\n",
" fact of type \"from_type\"). Note that estimators will attempt to automatically\n",
" fact of type \"from_type\"). Note that series estimators will attempt to automatically\n",
" perform this conversion to the specified internal type of that estimator."
],
"metadata": {
Expand All @@ -50,13 +50,13 @@
},
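The note above about the 2D numpy representation differing between series and collections can be illustrated with plain numpy. This is only a sketch: it assumes a single series is stored as `(n_timepoints, n_channels)` (the diff does not show the series shape, so that layout is an assumption), while each case in a collection is `(n_channels, n_timepoints)`. The variable names are illustrative and no `aeon` functions are used.

```python
import numpy as np

n_timepoints, n_channels = 100, 3

# ASSUMPTION: a single multivariate series stored with time points as rows.
series = np.random.default_rng(0).random((n_timepoints, n_channels))

# A case inside a collection is laid out the other way round:
# (n_channels, n_timepoints), so a transpose is needed.
as_case = series.T

# Adding a leading axis gives a one-case collection of shape
# (n_cases, n_channels, n_timepoints).
collection = as_case[np.newaxis, :, :]

print(series.shape, as_case.shape, collection.shape)
```

Passing a series-layout array where a collection-layout array is expected would not raise an error on its own, since both are valid 2D arrays, which is why the mismatch is worth checking explicitly.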
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 27,
"outputs": [
{
"data": {
"text/plain": "xarray.core.dataarray.DataArray"
},
"execution_count": 1,
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
Expand Down Expand Up @@ -86,13 +86,13 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 28,
"outputs": [
{
"data": {
"text/plain": "pandas.core.frame.DataFrame"
},
"execution_count": 2,
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
Expand All @@ -113,13 +113,13 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 29,
"outputs": [
{
"data": {
"text/plain": "dask.dataframe.core.DataFrame"
},
"execution_count": 3,
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
Expand All @@ -134,13 +134,13 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 30,
"outputs": [
{
"data": {
"text/plain": "xarray.core.dataarray.DataArray"
},
"execution_count": 4,
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
Expand All @@ -156,15 +156,11 @@
{
"cell_type": "markdown",
"source": [
"## Collections Converters\n",
"# Collections Converters\n",
"\n",
"Collections of time series are the fundamental data type for machine\n",
"learning algorithms. In older versions of the toolkit, collections of time series\n",
"were called panels (a term from econometrics, not machine learning), and there are\n",
"still references to panel. The main\n",
"characteristics of collections of time series that affect storage are that they can be\n",
"univariate or multivariate and they can be equal length or unequal length. The main\n",
"data structures for storing collections are as follows:\n",
"Previously, collections of time series were called panels (a term from econometrics,\n",
"not machine learning), and there are still references to panel. The main\n",
"data structures for storing collections are as follows\n",
"\n",
"- \"numpy3D\": 3D np.ndarray of format `(n_cases, n_channels, n_timepoints)`\n",
"- \"np-list\": python list of 2D numpy array of length `[n_cases]`, each of shape\n",
Expand All @@ -173,25 +169,25 @@
"`(n_timepoints_i, n_channels)`\n",
"- \"numpy2D\": 2D np.ndarray of shape `(n_cases, n_timepoints)`\n",
"\n",
"Other supported types which may be useful are:\n",
"Other supported types which may be useful in forecasting are\n",
"\n",
"- \"nested_univ\": a pd.DataFrame of shape `(n_cases, n_channels)` where each cell is a\n",
" pd.Series of length `(n_timepoints)`\n",
" - \"pd-multiindex\": pd.DataFrame with multi-index `(cases, timepoints)`\n",
" - \"pd-wide\": pd.DataFrame in wide format, with shape `(n_timepoints, n_cases)`\n",
" - \"pd-wide\": pd.DataFrame in wide format, `cols = (instance*timepoints)`\n",
" - \"dask_panel\": dask frame with one instance and one time index\n",
"\n",
"As with series, collection conversion can be performed with the method `convert`,\n",
"which wraps methods in `aeon.datatypes._panel._convert`. However, internal estimator\n",
"conversion is now handled with the function `_convert_X` in the `aeon.utils.validation\n",
".collection` package as follows"
"As with series, conversion is performed with the method `convert` and auto conversion\n",
" happens in estimator base classes. These wrap methods in `aeon.datatypes\n",
"._panel._convert`"
],
"metadata": {
"collapsed": false
}
},
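To make the relationship between the first few collection types concrete, here is a hand-rolled sketch using only numpy and pandas. It does not call `convert` or any other `aeon` function; it only shows the shapes involved. It assumes each array in an "np-list" has shape `(n_channels, n_timepoints_i)` (that shape is cut off in the hunk above), and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# 10 multivariate series, 3 channels, length 100, in "numpy3D" layout.
multi = np.random.default_rng(0).random((10, 3, 100))

# "np-list" style: one 2D array per case, each (n_channels, n_timepoints).
np_list = list(multi)

# "df-list" style: one DataFrame per case, each (n_timepoints, n_channels),
# so every case is transposed.
df_list = [pd.DataFrame(case.T) for case in multi]

print(type(np_list), type(np_list[0]), np_list[0].shape)
print(type(df_list), type(df_list[0]), df_list[0].shape)
```

The list-based forms are the ones that can hold unequal length series, since each element carries its own number of time points.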
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 35,
"outputs": [
{
"name": "stdout",
Expand All @@ -206,7 +202,7 @@
"\n",
"# 10 multivariate time series with 3 channels of length 100 in \"numpy3D\" format\n",
"multi = np.random.random(size=(10, 3, 100))\n",
"np_list = convert_collection(multi, output_type=\"np-list\")\n",
"np_list = convert(multi, from_type=\"numpy3D\", to_type=\"np-list\")\n",
"print(\n",
" f\" Type = {type(np_list)}, type first {type(np_list[0])} shape first \"\n",
" f\"{np_list[0].shape}\"\n",
Expand All @@ -218,7 +214,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 36,
"outputs": [
{
"name": "stdout",
Expand All @@ -229,7 +225,7 @@
}
],
"source": [
"df_list = convert_collection(multi, output_type=\"df-list\")\n",
"df_list = convert(multi, from_type=\"numpy3D\", to_type=\"df-list\")\n",
"print(\n",
" f\" Type = {type(df_list)}, type first {type(df_list[0])} shape first \"\n",
" f\"{df_list[0].shape}\"\n",
Expand All @@ -254,13 +250,21 @@
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"execution_count": 39,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Type = <class 'pandas.core.frame.DataFrame'>,shape (3000, 4)\n"
]
}
],
"source": [
"from aeon.utils.conversion._convert_collection import _from_numpy3d_to_pd_multiindex\n",
"\n",
"mi = _from_numpy3d_to_pd_multiindex(multi)\n",
"print(f\" Type = {type(mi)},shape {mi.shape}\")"
"long = from_3d_numpy_to_long(multi)\n",
"print(f\" Type = {type(long)},shape {long.shape}\")"
],
"metadata": {
"collapsed": false,
Expand All @@ -271,9 +275,20 @@
},
{
"cell_type": "code",
"execution_count": 7,
"outputs": [],
"source": [],
"execution_count": 40,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Type = <class 'pandas.core.frame.DataFrame'>,shape (1000, 3)\n"
]
}
],
"source": [
"mi = from_3d_numpy_to_multi_index(multi)\n",
"print(f\" Type = {type(mi)},shape {mi.shape}\")"
],
"metadata": {
"collapsed": false
}
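The two recorded outputs above, `(3000, 4)` for the long format and `(1000, 3)` for the multi-index format, follow directly from the `(10, 3, 100)` input. The sketch below rebuilds frames with those shapes using only numpy and pandas. It is not the implementation of `from_3d_numpy_to_long` or `from_3d_numpy_to_multi_index`, and the column and index names (`case`, `channel`, `timepoint`, `value`, `channel_0`, ...) are made up for illustration.

```python
import numpy as np
import pandas as pd

multi = np.random.default_rng(0).random((10, 3, 100))
n_cases, n_channels, n_timepoints = multi.shape

# Multi-index layout: one row per (case, timepoint), one column per channel.
index = pd.MultiIndex.from_product(
    [range(n_cases), range(n_timepoints)], names=["case", "timepoint"]
)
# Move channels to the last axis, then flatten (case, timepoint) into rows.
values = multi.transpose(0, 2, 1).reshape(n_cases * n_timepoints, n_channels)
mi = pd.DataFrame(
    values, index=index, columns=[f"channel_{c}" for c in range(n_channels)]
)
print(mi.shape)  # (1000, 3)

# Long layout: one row per individual observation, with three index columns
# and one value column.
cases, channels, timepoints = np.indices(multi.shape)
long = pd.DataFrame(
    {
        "case": cases.ravel(),
        "channel": channels.ravel(),
        "timepoint": timepoints.ravel(),
        "value": multi.ravel(),
    }
)
print(long.shape)  # (3000, 4)
```

The row counts are `n_cases * n_timepoints = 1000` and `n_cases * n_channels * n_timepoints = 3000` respectively.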
Expand Down
19 changes: 13 additions & 6 deletions examples/datasets/data_loading.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,19 @@
{
"cell_type": "markdown",
"source": [
"# Loading data into aeon\n",
"aeon supports a range of data input formats. Example problems are described in\n",
"provided_data.ipynb. Downloading data is described in benchmarking_data.ipynb. You\n",
"can of course load and format the data so that it conforms to the input types\n",
"described in data_storage. aeon also provides data formats for time series for both\n",
"forecasting and machine learning. These are all text files with a particular\n",
"# Loading data in aeon\n",
"\n",
"`aeon` supports a range of data input formats. Accepted datatypes are provided in the\n",
"[data conversions](examples/datasets/data_conversions.ipynb) and\n",
"[data storage](examples/datasets/data_storage.ipynb) notebooks. Example problems are\n",
"described in the [provided data notebook](examples/datasets/provided_data.ipynb), with\n",
"guidance on downloading popular benchmark data provided in the\n",
"[benchmarking data notebook](examples/datasets/benchmarking_data.ipynb).\n",
"\n",
"This notebook provides guidance on loading data from a few popular data file formats used in\n",
"time series machine learning and forecasting scenarios.\n",
"You can of course load data from whatever format you wish and then format the data so that\n",
"it conforms to the input types described. These are all text files with a particular\n",
"structure. Both formats store a single time series per row.\n",
"\n",
"1. The `.ts` and `.tsf` format used by the aeon packages and the [time series](https://timeseriesclassification.com) and [forecasting](https://forecastingdata.org)\n",
Expand Down