Merge pull request #379 from lincc-frameworks/clientless_default
Switch the Ensemble to default to not using a distributed client
dougbrn authored Feb 16, 2024
2 parents b698d38 + 462314b commit f740b6f
Showing 8 changed files with 207 additions and 291 deletions.
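In short, the change below makes a clientless `Ensemble` the default. A minimal sketch of the resulting usage, based on the updated tutorials (the variable names and worker counts are illustrative, not defaults):

```python
from dask.distributed import Client

from tape import Ensemble

# New default: the Ensemble runs without a Distributed Client.
ens = Ensemble()

# Opt back in to a background Distributed Client and inspect it.
ens_with_client = Ensemble(client=True)
ens_with_client.client_info()

# Or pass a pre-configured Dask Client for full control of the client API.
client = Client(n_workers=3, threads_per_worker=2)  # illustrative worker settings
ens_custom = Ensemble(client=client)
```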
11 changes: 1 addition & 10 deletions docs/tutorials/binning_slowly_changing_sources.ipynb
@@ -263,15 +263,6 @@
"ax.set_xlabel(\"Time (MJD)\")\n",
"ax.set_ylabel(\"Source Count\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ens.client.close() # Tear down the ensemble client"
]
}
],
"metadata": {
@@ -290,7 +281,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
"version": "3.10.6"
},
"vscode": {
"interpreter": {
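With the clientless default, this tutorial no longer needs a teardown cell. A hedged sketch of the teardown pattern that still applies when a client is explicitly requested (this assumes the client remains reachable via the `client` attribute, as in the removed cell):

```python
from tape import Ensemble

ens = Ensemble(client=True)  # explicitly opt in to a Distributed Client
# ... load data and run analyses ...
ens.client.close()  # tear down the client when finished (assumes `.client` is still exposed)
```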
55 changes: 21 additions & 34 deletions docs/tutorials/scaling_to_large_data.ipynb
@@ -28,9 +28,7 @@
"source": [
"### The `Dask` Client and Scheduler\n",
"\n",
"An important aspect of `Dask` to understand for optimizing it's performance for large datasets is the Distributed Client. `Dask` has [thorough documentation](https://distributed.dask.org/en/stable/client.html) on this, but the general idea is that the Distributed Client is the entrypoint for setting up a distributed system. The Distributed Client enables asynchronous computation, where Dask's `compute` and `persist` methods are able to run in the background and persist in memory while we continue doing other work.\n",
"\n",
"In the TAPE `Ensemble`, by default a Distributed Client is spun up in the background, which can be accessed using `Ensemble.client_info()`:\n",
"An important aspect of `Dask` to understand for optimizing it's performance for large datasets is the Distributed Client. `Dask` has [thorough documentation](https://distributed.dask.org/en/stable/client.html) on this, but the general idea is that the Distributed Client is the entrypoint for setting up a distributed system. By default, the Tape `Ensemble` operates without a Distributed Client. In the TAPE `Ensemble`, we can have a Distributed Client spun up in the background by indicating `client=True` when initializing the Ensemble, which can be accessed using `Ensemble.client_info()`:\n",
"\n"
]
},
@@ -42,7 +40,7 @@
"source": [
"from tape import Ensemble\n",
"\n",
"ens = Ensemble()\n",
"ens = Ensemble(client=True)\n",
"\n",
"ens.client_info()"
]
@@ -72,7 +70,7 @@
"metadata": {},
"outputs": [],
"source": [
"ens = Ensemble(n_workers=3, threads_per_worker=2)\n",
"ens = Ensemble(client=True, n_workers=3, threads_per_worker=2)\n",
"\n",
"ens.client_info()"
]
@@ -141,23 +139,6 @@
"This may be preferable for those who want full control of the `Dask` client API, which may be beneficial when working on external machines/services or when a more complex setup is desired."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, there may be instances where you would prefer to not use the Distributed Client, particularly when working with smaller amounts of data. In these instances, we allow users to disable the creation of a Distributed Client by passing `client=False`, as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ens=Ensemble(client=False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
@@ -181,13 +162,15 @@
"ens = Ensemble(client=client)\n",
"\n",
"# Read in data from a parquet file\n",
"ens.from_parquet(\"../../tests/tape_tests/data/source/test_source.parquet\",\n",
" id_col='ps1_objid',\n",
" time_col='midPointTai',\n",
" flux_col='psFlux',\n",
" err_col='psFluxErr',\n",
" band_col='filterName',\n",
" partition_size='5KB')\n",
"ens.from_parquet(\n",
" \"../../tests/tape_tests/data/source/test_source.parquet\",\n",
" id_col=\"ps1_objid\",\n",
" time_col=\"midPointTai\",\n",
" flux_col=\"psFlux\",\n",
" err_col=\"psFluxErr\",\n",
" band_col=\"filterName\",\n",
" partition_size=\"5KB\",\n",
")\n",
"\n",
"ens.info()"
]
@@ -211,8 +194,12 @@
"from tape.analysis.stetsonj import calc_stetson_J\n",
"import numpy as np\n",
"\n",
"mapres = ens.batch(calc_stetson_J, use_map=True) # will not know to look at multiple partitions to get lightcurve data\n",
"groupres = ens.batch(calc_stetson_J, use_map=False) # will know to look at multiple partitions, with shuffling costs\n",
"mapres = ens.batch(\n",
" calc_stetson_J, use_map=True\n",
") # will not know to look at multiple partitions to get lightcurve data\n",
"groupres = ens.batch(\n",
" calc_stetson_J, use_map=False\n",
") # will know to look at multiple partitions, with shuffling costs\n",
"\n",
"print(\"number of lightcurve results in mapres: \", len(mapres))\n",
"print(\"number of lightcurve results in groupres: \", len(groupres))\n",
@@ -249,7 +236,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "py310",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
@@ -263,11 +250,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
"version": "3.10.11"
},
"vscode": {
"interpreter": {
"hash": "08968836a6367873274ed1d5e98a07391f42fc3a62bd5aba54afbd7b11ba8673"
"hash": "83afbb17b435d9bf8b0d0042367da76f26510da1c5781f0ff6e6c518eab621ec"
}
}
},