Merge pull request #65 from lincc-frameworks/add_quickstart

init quickstart
lincc-frameworks · May 8, 2024 · 5a74ad5
2 parents 3d65f46 + 01c9419
Showing 2 changed files with 235 additions and 1 deletion.
diff --git a/docs/gettingstarted.rst b/docs/gettingstarted.rst
@@ -9,4 +9,5 @@ we encourage you to open an issue on the
     :maxdepth: 1
 
     Installing nested-pandas <gettingstarted/installation>
-    Contribution Guide <gettingstarted/contributing>
+    Contribution Guide <gettingstarted/contributing>
+    Quickstart Guide <gettingstarted/quickstart>
diff --git a/docs/gettingstarted/quickstart.ipynb b/docs/gettingstarted/quickstart.ipynb
@@ -0,0 +1,233 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Quickstart"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "With a valid Python environment, nested-pandas and its dependencies are easy to install using the `pip` package manager. Install it with the following command:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Uncomment the next line to install from within the notebook\n",
+    "# %pip install nested-pandas"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "nested-pandas is tailored to the efficient analysis of nested datasets. Let's load a toy dataset to show how it works."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from nested_pandas.datasets import generate_data\n",
+    "\n",
+    "# generate_data creates some toy data\n",
+    "nf = generate_data(10, 100)  # 10 rows, 100 nested rows per row\n",
+    "nf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The above dataframe is a `NestedFrame`, which extends the capabilities of the Pandas `DataFrame` to support columns with nested information. In this example, the top-level dataframe has 10 rows and 2 ordinary columns, \"a\" and \"b\". The \"nested\" column contains a dataframe in each row. We can inspect the contents of the \"nested\" column using Pandas API tooling like `loc`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nf.loc[0][\"nested\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here we see that within the \"nested\" column there are `NestedFrame` objects with their own data. In this case we have 3 columns (\"t\", \"flux\", and \"band\"). Alternatively, we could inspect the available columns using some custom properties of the `NestedFrame`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Shows which columns have nested data\n",
+    "nf.nested_columns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Provides a dictionary of \"base\" (top-level) and nested column labels\n",
+    "nf.all_columns"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "nested-pandas extends the Pandas API, meaning any operation you could do in Pandas is available within nested-pandas. In addition, nested-pandas provides functionality and tooling to better support working with nested datasets. For example, let's look at `query`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Normal queries work as expected, rejecting rows from the dataframe that don't meet the criteria\n",
+    "nf.query(\"a > 0.2\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The above query is native Pandas. With nested-pandas, however, we can use hierarchical column names to extend `query` to nested layers."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Applies the query to \"nested\", keeping only nested rows where \"t > 17.0\"\n",
+    "nf_g = nf.query(\"nested.t > 17.0\")\n",
+    "nf_g"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This query does not affect the rows of the top-level dataframe, but rather applies the query to the \"nested\" dataframes. If we look at one of them, we can see the effect of the query."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# All nested rows with t <= 17.0 have been removed\n",
+    "nf_g.loc[0][\"nested\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Only a limited set of functions has been extended in this way so far. The aim is to fully support this hierarchical access wherever it applies in the Pandas API."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Finally, we'll end with the flexible `reduce` function. `reduce` works much like Pandas' `apply`, but it flattens (reduces) the inputs from nested layers into arrays before passing them to the given function. For example, let's find the mean flux for each dataframe in \"nested\":"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "# Use a hierarchical column name to select the flux column;\n",
+    "# each row's flux values are passed to np.mean as an array\n",
+    "nf.reduce(np.mean, \"nested.flux\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`reduce` can apply any custom function you need for your analysis. To illustrate, let's define a custom function that simply returns its inputs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def show_inputs(*args):\n",
+    "    return args"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Passing a base column and a nested column through `reduce` shows how it sends inputs to a given function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nf_inputs = nf.reduce(show_inputs, \"a\", \"nested.band\")\n",
+    "nf_inputs"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nf_inputs.loc[0]"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
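
For review purposes, the nested `query` and `reduce` behaviour that the quickstart describes can be sketched in plain Python. This is a conceptual illustration only, not the nested-pandas implementation: `sketch_reduce`, `sketch_nested_query`, `mean`, and the toy `rows` data are all hypothetical names introduced here, and each nested table is modelled as a dict of lists.

```python
# Conceptual sketch only: plain Python, NOT the nested-pandas implementation.
# Each top-level row holds base values plus a "nested" table (a dict of lists).
rows = [
    {"a": 0.1, "nested": {"t": [1.0, 18.0, 19.5], "flux": [10.0, 20.0, 30.0]}},
    {"a": 0.7, "nested": {"t": [5.0, 17.5], "flux": [4.0, 6.0]}},
]


def sketch_reduce(rows, func, *columns):
    """Call func once per top-level row.

    Base column names pass a scalar; "nested.x" names pass the whole
    nested column for that row as a list.
    """
    results = []
    for row in rows:
        args = []
        for name in columns:
            if name.startswith("nested."):
                args.append(list(row["nested"][name.split(".", 1)[1]]))
            else:
                args.append(row[name])
        results.append(func(*args))
    return results


def sketch_nested_query(rows, column, predicate):
    """Filter nested rows on one nested column; every top-level row is kept."""
    out = []
    for row in rows:
        keep = [i for i, v in enumerate(row["nested"][column]) if predicate(v)]
        nested = {k: [vals[i] for i in keep] for k, vals in row["nested"].items()}
        out.append({**row, "nested": nested})
    return out


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# One mean flux per top-level row
mean_flux = sketch_reduce(rows, mean, "nested.flux")

# Keep only nested entries with t > 17.0; the number of top-level rows is unchanged
filtered = sketch_nested_query(rows, "t", lambda t: t > 17.0)
```

Here `mean_flux` holds one value per top-level row, and `filtered` still contains both top-level rows, each with only its `t > 17.0` nested entries kept.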