Small changes to the docs
hombit committed May 23, 2024
1 parent 78efcd6 commit de39358
Showing 4 changed files with 23 additions and 24 deletions.
31 changes: 20 additions & 11 deletions docs/gettingstarted/best_practices.ipynb
@@ -18,15 +18,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Like Dask, Nested-Dask is focused towards working with large amounts of data. In particular, the threshold where this really will matter is when the amount of data exceeds the available memory of your machine/system. In such cases, Nested-Dask provides built-in tooling for working with these datasets and is recommended over using Nested-Pandas. These tools encompassing (but not limited to): \n",
+"Like Dask, Nested-Dask is focused on working with large amounts of data. In particular, this really matters when the amount of data exceeds the available memory of your machine/system, or when parallel computing is needed. In such cases, Nested-Dask provides built-in tooling for working with these datasets and is recommended over Nested-Pandas. These tools encompass (but are not limited to): \n",
"\n",
"* **lazy computation**: enabling construction of workflows with more control over when computation actually begins\n",
"\n",
"* **partitioning**: breaking data up into smaller partitions that can fit into memory, enabling work on each chunk while keeping the overall memory footprint smaller than the full dataset size\n",
"\n",
"* **progress tracking**: The [Dask Dashboard](https://docs.dask.org/en/latest/dashboard.html) can be used to track the progress of complex workflows, assess memory usage, find bottlenecks, etc.\n",
"\n",
-"* **parallel processing**: Dask workers are able to work in parallel on the partitions of a dataset."
+"* **parallel processing**: Dask workers are able to work in parallel on the partitions of a dataset, both on a local machine and on a distributed cluster."
]
},
{
@@ -47,7 +47,7 @@
"metadata": {},
"outputs": [],
"source": [
-"# Setting up a Dask client\n",
+"# Set up a Dask client, which enables parallel processing\n",
"from dask.distributed import Client\n",
"\n",
"client = Client()\n",
@@ -64,9 +64,7 @@
{
"cell_type": "markdown",
"metadata": {},
-"source": [
-"By contrast, when working with smaller datasets able to fit into memory it's often better to work directly with Nested-Pandas. This is particularly relevant for workflows that start with large amounts of data and filter down to a small dataset. By the nature of lazy computation, these filtering operations are not automatically applied to the dataset and therefore you're still working effectively at scale. Let's walk through an example where we load a \"large\" dataset, in this case it will fit into memory but let's imagine that it is larger than memory."
-]
+"source": "By contrast, when working with smaller datasets that fit into memory, it's often better to work directly with Nested-Pandas. This is particularly relevant for workflows that start with large amounts of data, filter down to a small dataset, and do not require computationally heavy processing of that small dataset. By the nature of lazy computation, these filtering operations are not automatically applied to the dataset, and therefore you're still effectively working at scale. Let's walk through an example where we load a \"large\" dataset; in this case it fits into memory, but let's imagine that it is larger than memory."
},
{
"cell_type": "code",
@@ -99,9 +97,7 @@
{
"cell_type": "markdown",
"metadata": {},
-"source": [
-"When `compute()` is called above, the Dask task graph is executed. However, the ndf object above is still a lazy Dask object meaning that any subsequent work will still need to apply this query work all over again."
-]
+"source": "When `compute()` is called above, the Dask task graph is executed and the query is run. However, the ndf object above is still a lazy Dask object, meaning that any subsequent `.compute()`-like method (e.g. `.head()` or `.to_parquet()`) will need to apply the query work all over again."
},
{
"cell_type": "code",
@@ -115,8 +111,11 @@
"# The result will be a series with float values\n",
"meta = pd.Series(name=\"mean\", dtype=float)\n",
"\n",
-"# Dask has to reapply the query here\n",
-"ndf.reduce(np.mean, \"nested.flux\", meta=meta).compute()"
+"# Apply a mean operation on the \"nested.flux\" column\n",
+"mean_flux = ndf.reduce(np.mean, \"nested.flux\", meta=meta)\n",
+"\n",
+"# Dask has to reapply the query over `ndf` here, then apply the mean operation\n",
+"mean_flux.compute()"
]
},
{
@@ -140,6 +139,16 @@
"isinstance(nf, npd.NestedFrame)"
]
},
+{
+"metadata": {},
+"cell_type": "code",
+"outputs": [],
+"execution_count": null,
+"source": [
+"# Now we can apply the mean operation directly to the nested_pandas.NestedFrame\n",
+"nf.reduce(np.mean, \"nested.flux\")"
+]
+},
{
"cell_type": "markdown",
"metadata": {},
3 changes: 0 additions & 3 deletions docs/gettingstarted/contributing.rst
@@ -14,6 +14,3 @@ Download code and install dependencies in a conda environment. Run unit tests at
git clone https://github.com/lincc-frameworks/nested-dask.git
cd nested-dask/
bash ./.setup_dev.sh
-pip install pytest
-pytest
7 changes: 1 addition & 6 deletions docs/gettingstarted/installation.rst
@@ -25,13 +25,8 @@ development version of nested-dask, you should instead build 'nested-dask' from
git clone https://github.com/lincc-frameworks/nested-dask.git
cd nested-dask
pip install .
-pip install .[dev] # it may be necessary to use `pip install .'[dev]'` (with single quotes) depending on your machine.
+pip install '.[dev]'
The ``pip install .[dev]`` command is optional, and installs dependencies needed to run the unit tests and build
the documentation. The latest source version of nested-dask may be less stable than a release, and so we recommend
running the unit test suite to verify that your local install is performing as expected.

-.. code-block:: bash
-pip install pytest
-pytest
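The quoting change above sidesteps a shell quirk: in zsh, unquoted square brackets are glob patterns, so an unquoted extras specifier can fail with "no matches found". A hedged sketch of the two invocations:

```shell
# zsh expands unquoted [dev] as a glob and may fail with "no matches found":
# pip install .[dev]

# Quoting passes the extras specifier to pip literally in any common shell
pip install '.[dev]'
```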
6 changes: 2 additions & 4 deletions docs/gettingstarted/quickstart.ipynb
@@ -10,9 +10,7 @@
{
"cell_type": "markdown",
"metadata": {},
-"source": [
-"With a valid Python environment, nested-dask and it's dependencies are easy to install using the `pip` package manager. The following command can be used to install it:"
-]
+"source": "With a valid Python environment, nested-dask and its dependencies are easy to install using the `pip` package manager. The following command can be used to install it:"
},
{
"cell_type": "code",
@@ -54,7 +52,7 @@
"* `npartitions=1` indicates how many partitions the dataset has been split into.\n",
"* The `0` and `9` tell us the \"divisions\" of the partitions. When the dataset is sorted by the index, these divisions are ranges to show which index values reside in each partition.\n",
"\n",
-"We can signal to Dask that we'd like to actually view the data by using `compute`."
+"We can signal to Dask that we'd like to actually obtain the data as a `nested_pandas.NestedFrame` by using `compute`."
]
},
{
