aeon-toolkit · MatthewMiddlehurst · Jul 5, 2023
diff --git a/examples/datasets/data_conversions.ipynb b/examples/datasets/data_conversions.ipynb
@@ -5,12 +5,12 @@
    "source": [
     "# Data conversions in aeon\n",
     "\n",
-    "We recommend you follow the following strategy: Use `pd.Series` or `pd\n",
-    ".DataFrame` for forecasting and for classification, clustering and regression, use 3D\n",
-    " numpy of shape `(n_cases, n_channels, n_timepoints)` if your collection of time series are equal length, or a\n",
-    "  list of 2D numpy of length `[n_cases]` if not equal length. All data loaded from\n",
-    "  file with our [data loaders](data_loading.ipynb)  use this\n",
-    "  strategy.\n",
+    "We recommend you follow the data storage conventions described in the [data storage notebook](examples/datasets/data_storage.ipynb),\n",
+    "which can be summarised as follows: use `pd.Series` or `pd.DataFrame` for tasks\n",
+    "that focus on a single series, such as forecasting. For tasks such as classification,\n",
+    "clustering and regression, use a 3D numpy array of shape `(n_cases, n_channels, n_timepoints)`\n",
+    "if the series in your collection are of equal length, or a list of 2D numpy arrays of length `[n_cases]`\n",
+    "if they are not. All of our [data loaders](examples/datasets/data_loading.ipynb) use this format.\n",
     "\n",
     "However, `aeon` provides a range of converters in the `datatypes` package. These are\n",
     "grouped into converters for single series and converters for collections of series"
@@ -22,7 +22,7 @@
   {
    "cell_type": "markdown",
    "source": [
-    "## Series Converters\n",
+    "# Series Converters\n",
     "\n",
     "Single time series can be stored in the following data structures\n",
     "\n",
@@ -32,16 +32,16 @@
     "- \"xr.DataArray\": xarray DataArray a for a univariate or multivariate time series\n",
     "- \"dask_series\": Dask DataFrame for a univariate or multivariate time series\n",
     "\n",
-    "The above strings are used to internally specify each different data structure for\n",
-    "internal conversion purposes. NOTE the 2D numpy array representation is not consistent with that used in\n",
+    "The above strings are used internally to identify each data structure. NOTE: the\n",
+    " 2D numpy array representation is not consistent with that used in\n",
     "collections. This is an unfortunate difference that is a result of legacy design and\n",
     "norms in different research fields.\n",
     "\n",
     "Conversion to and from these data structures is fairly straightforward, but we\n",
     "provide tools to help. `aeon` contains converters that are wrapped by the method\n",
     "`convert`. This method will attempt to convert from one of the five types to another,\n",
     " and raise an exception if the conversion is invalid (e.g. if the object is not in\n",
-    " fact of type \"from_type\"). Note that estimators will attempt to automatically\n",
+    " fact of type \"from_type\"). Note that series estimators will attempt to automatically\n",
     "  perform this conversion to the specified internal type of that estimator."
    ],
    "metadata": {
@@ -50,13 +50,13 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 27,
    "outputs": [
     {
      "data": {
       "text/plain": "xarray.core.dataarray.DataArray"
      },
-     "execution_count": 1,
+     "execution_count": 27,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -86,13 +86,13 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 28,
    "outputs": [
     {
      "data": {
       "text/plain": "pandas.core.frame.DataFrame"
      },
-     "execution_count": 2,
+     "execution_count": 28,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -113,13 +113,13 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 29,
    "outputs": [
     {
      "data": {
       "text/plain": "dask.dataframe.core.DataFrame"
      },
-     "execution_count": 3,
+     "execution_count": 29,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -134,13 +134,13 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 30,
    "outputs": [
     {
      "data": {
       "text/plain": "xarray.core.dataarray.DataArray"
      },
-     "execution_count": 4,
+     "execution_count": 30,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -156,15 +156,11 @@
   {
    "cell_type": "markdown",
    "source": [
-    "## Collections Converters\n",
+    "# Collections Converters\n",
     "\n",
-    "Collections of time series are the fundamental data type for machine\n",
-    "learning algorithms. In older versions of the toolkit, collections of time series\n",
-    "were called panels (a term from econometrics, not machine learning), and there are\n",
-    "still references to panel. The main\n",
-    "characteristics of collections of time series that effect storage is that they can be\n",
-    "univariate or multivariate and they can be equal length or unequal length. The main\n",
-    "data structures for storing collections are as follows:\n",
+    "Previously, collections of time series were called panels (a term from econometrics,\n",
+    "not machine learning), and there are still references to \"panel\" in the toolkit. The main\n",
+    "data structures for storing collections are as follows:\n",
     "\n",
     "- \"numpy3D\": 3D np.ndarray of format `(n_cases, n_channels, n_timepoints)`\n",
     "- \"np-list\": python list of 2D numpy array of length `[n_cases]`, each of shape\n",
@@ -173,25 +169,25 @@
     "`(n_timepoints_i, n_channels)`\n",
     "- \"numpy2D\": 2D np.ndarray of shape `(n_cases, n_timepoints)`\n",
     "\n",
-    "Other supported types which may be useful are:\n",
+    "Other supported types which may be useful in forecasting are\n",
     "\n",
     "- \"nested_univ\": a pd.DataFrame of shape `(n_cases, n_channels)` where each cell is a\n",
     " pd.Series of length `(n_timepoints)`\n",
     " - \"pd-multiindex\": pd.DataFrame with multi-index `(cases, timepoints)`\n",
-    " - \"pd-wide\": pd.DataFrame in wide format, with shape  `(n_timepoints, n_cases)`\n",
+    " - \"pd-wide\": pd.DataFrame in wide format, `cols = (instance*timepoints)`\n",
+    " - \"dask_panel\": dask frame with one instance and one time index\n",
     "\n",
-    "AS with series, collection conversion can be  performed with the method `convert`,\n",
-    "which wraps methods in `aeon.datatypes._panel._convert`. However, internal estimator\n",
-    "conversion is now handled with the function `_convert_X` in the `aeon.utils.validation\n",
-    ".collection` package as follows"
+    "As with series, conversion is performed with the method `convert`, and automatic\n",
+    "conversion happens in the estimator base classes. Both wrap methods in\n",
+    "`aeon.datatypes._panel._convert`."
    ],
    "metadata": {
     "collapsed": false
    }
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 35,
    "outputs": [
     {
      "name": "stdout",
@@ -206,7 +202,7 @@
     "\n",
     "# 10 multivariate time series with 3 channels of length 100 in \"numpy3D\" format\n",
     "multi = np.random.random(size=(10, 3, 100))\n",
-    "np_list = convert_collection(multi, output_type=\"np-list\")\n",
+    "np_list = convert(multi, from_type=\"numpy3D\", to_type=\"np-list\")\n",
     "print(\n",
     "    f\" Type = {type(np_list)}, type first {type(np_list[0])} shape first \"\n",
     "    f\"{np_list[0].shape}\"\n",
@@ -218,7 +214,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 36,
    "outputs": [
     {
      "name": "stdout",
@@ -229,7 +225,7 @@
     }
    ],
    "source": [
-    "df_list = convert_collection(multi, output_type=\"df-list\")\n",
+    "df_list = convert(multi, from_type=\"numpy3D\", to_type=\"df-list\")\n",
     "print(\n",
     "    f\" Type = {type(df_list)}, type first {type(df_list[0])} shape first \"\n",
     "    f\"{df_list[0].shape}\"\n",
@@ -254,13 +250,21 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "outputs": [],
+   "execution_count": 39,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " Type = <class 'pandas.core.frame.DataFrame'>,shape (3000, 4)\n"
+     ]
+    }
+   ],
    "source": [
     "from aeon.utils.conversion._convert_collection import _from_numpy3d_to_pd_multiindex\n",
     "\n",
-    "mi = _from_numpy3d_to_pd_multiindex(multi)\n",
-    "print(f\" Type = {type(mi)},shape {mi.shape}\")"
+    "long = from_3d_numpy_to_long(multi)\n",
+    "print(f\" Type = {type(long)},shape {long.shape}\")"
    ],
    "metadata": {
     "collapsed": false,
@@ -271,9 +275,20 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
-   "outputs": [],
-   "source": [],
+   "execution_count": 40,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " Type = <class 'pandas.core.frame.DataFrame'>,shape (1000, 3)\n"
+     ]
+    }
+   ],
+   "source": [
+    "mi = from_3d_numpy_to_multi_index(multi)\n",
+    "print(f\" Type = {type(mi)},shape {mi.shape}\")"
+   ],
    "metadata": {
     "collapsed": false
    }

diff --git a/examples/datasets/data_loading.ipynb b/examples/datasets/data_loading.ipynb
@@ -3,12 +3,19 @@
   {
    "cell_type": "markdown",
    "source": [
-    "# Loading data into aeon\n",
-    "aeon supports a range of data input formats. Example problems are described in\n",
-    "provided_data.ipyn. Downloading data is described in benchmarking_data.ipynb. You\n",
-    "can of course load and format the data so that it conforms to the input types\n",
-    "describe in data_storage. aeon also provides data formats for time series for both\n",
-    "forecasting and machine learning. These are all text files with a particular\n",
+    "# Loading data in aeon\n",
+    "\n",
+    "`aeon` supports a range of data input formats. Accepted datatypes are described in the\n",
+    "[data conversions](examples/datasets/data_conversions.ipynb) and\n",
+    "[data storage](examples/datasets/data_storage.ipynb) notebooks. Example problems are\n",
+    "described in the [provided data notebook](examples/datasets/provided_data.ipynb), with\n",
+    "guidance on downloading popular benchmark data provided in the\n",
+    "[benchmarking data notebook](examples/datasets/benchmarking_data.ipynb).\n",
+    "\n",
+    "You can of course load data from whatever format you wish and then format the data so that\n",
+    "it conforms to the input types described. This notebook provides guidance on loading data\n",
+    "from a few popular file formats used in time series machine learning and forecasting\n",
+    "scenarios. These are all text files with a particular\n",
     "structure. Both formats store a single time series per row.\n",
     "\n",
     "1. The `.ts` and `.tsf` format used by the aeon packages and the [time series](https://timeseriesclassification.com) and [forecasting](https://forecastingdata.org)\n",