Commit 64a8710
Merge branch 'main' into new_head
wilsonbb committed Mar 28, 2024
2 parents 0370fd3 + 3321b58 commit 64a8710
Showing 21 changed files with 485 additions and 305 deletions.
2 changes: 1 addition & 1 deletion docs/examples/rrlyr-period.ipynb
@@ -42,7 +42,7 @@
"outputs": [],
"source": [
"# Load SDSS Stripe 82 RR Lyrae catalog\n",
"ens = Ensemble(client=False).from_dataset(\"s82_rrlyrae\")"
"ens = Ensemble(client=False).from_dataset(\"s82_rrlyrae\", sorted=True)"
]
},
{
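The `sorted=True` flag added to these loaders follows Dask's convention: it asserts that the input is already ordered by the id/index column, letting Dask set the index without a cross-partition shuffle. Whether TAPE forwards the flag exactly this way is an assumption here; the sketch below shows the underlying Dask behavior, with illustrative column names.

```python
import pandas as pd
import dask.dataframe as dd

# Toy source table, already ordered by object id (columns are illustrative)
pdf = pd.DataFrame({
    "id": [10, 10, 11, 11, 12],
    "mjd": [59000.1, 59001.2, 59000.3, 59002.4, 59001.5],
    "flux": [1.2, 1.3, 0.7, 0.8, 2.1],
})
ddf = dd.from_pandas(pdf, npartitions=2)

# sorted=True promises Dask that "id" is already in order, so setting the
# index skips the expensive shuffle and yields known partition divisions.
ddf = ddf.set_index("id", sorted=True)
print(ddf.divisions)
```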
26 changes: 18 additions & 8 deletions docs/gettingstarted/quickstart.ipynb
@@ -12,12 +12,22 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The latest release of TAPE is installable via pip, using the following command:\n",
"\n",
"```\n",
"pip install lf-tape\n",
"```\n",
"\n",
"The latest release of TAPE is installable via pip, using the following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install lf-tape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more detailed installation instructions, see the [Installation Guide](installation.html)."
]
},
@@ -38,7 +48,7 @@
"from tape import Ensemble\n",
"\n",
"ens = Ensemble() # Initialize a TAPE Ensemble\n",
"ens.from_dataset(\"s82_qso\")"
"ens.from_dataset(\"s82_qso\", sorted=True)"
]
},
{
@@ -200,7 +210,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
"version": "3.10.14"
},
"vscode": {
"interpreter": {
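Condensed, the updated quickstart runs as the following sketch (assuming the `lf-tape` package and the bundled `s82_qso` dataset are available in your environment):

```python
# In a notebook cell: %pip install lf-tape
from tape import Ensemble

ens = Ensemble()                          # Initialize a TAPE Ensemble
ens.from_dataset("s82_qso", sorted=True)  # Load the Stripe 82 QSO sample
```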
50 changes: 8 additions & 42 deletions docs/index.rst
@@ -11,59 +11,25 @@ TAPE offers a complete ecosystem for loading, filtering, and analyzing
timeseries data. TAPE is built to enable users to run provided and user-defined
analysis functions at scale in a parallelized and/or distributed manner.

Over the survey lifetime of the [LSST](https://www.lsst.org/about), on order
~billionsof objects will have multiband lightcurves available, and TAPE has
Over the survey lifetime of the `LSST <https://www.lsst.org/about>`_, on order
of ~billions of objects will have multiband lightcurves available, and TAPE has
been built as a framework with the goal of making analysis of LSST-scale
data accessible.

TAPE is built on top of `Dask <https://www.dask.org/>`_, and leverages
its "lazy evaluation" to only load data and run computations when needed.

Start with the Getting Started section to learn the basics of installation and
How to Use This Guide
==============================================

Begin with the `Getting Started <https://tape.readthedocs.io/en/latest/gettingstarted.html>`_ guide to learn the basics of installation and
walk through a simple example of using TAPE.

The Tutorials section showcases the fundamental features of TAPE.
The `Tutorials <https://tape.readthedocs.io/en/latest/tutorials.html>`_ section showcases the fundamental features of TAPE.

API-level information about TAPE is viewable in the
API Reference section.



Dev Guide - Getting Started
---------------------------

Before installing any dependencies or writing code, it's a great idea to create a
virtual environment. LINCC-Frameworks engineers primarily use `conda` to manage virtual
environments. If you have conda installed locally, you can run the following to
create and activate a new environment.

.. code-block:: console
>> conda create env -n <env_name> python=3.11
>> conda activate <env_name>
Once you have created a new environment, you can install this project for local
development using the following commands:

.. code-block:: console
>> pip install -e .'[dev]'
>> pre-commit install
>> conda install pandoc
Notes:
`API Reference <https://tape.readthedocs.io/en/latest/autoapi/index.html>`_ section.

1) The single quotes around ``'[dev]'`` may not be required for your operating system.
2) ``pre-commit install`` will initialize pre-commit for this local repository, so
that a set of tests will be run prior to completing a local commit. For more
information, see the Python Project Template documentation on
`pre-commit <https://lincc-ppt.readthedocs.io/en/latest/practices/precommit.html>`_.
3) Installing ``pandoc`` allows you to verify that automatic rendering of Jupyter notebooks
into documentation for ReadTheDocs works as expected. For more information, see
the Python Project Template documentation on
`Sphinx and Python Notebooks <https://lincc-ppt.readthedocs.io/en/latest/practices/sphinx.html#python-notebooks>`_.


.. toctree::
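The lazy evaluation the page credits to Dask works as follows: building a computation only records a task graph, and no data is read or computed until a result is explicitly requested. A minimal illustration in plain Dask, independent of TAPE (the file name is hypothetical):

```python
import dask.dataframe as dd

# Nothing is loaded yet; this only records tasks in the graph.
df = dd.read_parquet("source_catalog.parquet")  # hypothetical file
mean_flux = df.groupby("band")["flux"].mean()   # still lazy

# I/O and computation happen only now, parallelized across partitions.
print(mean_flux.compute())
```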
7 changes: 4 additions & 3 deletions docs/tutorials/batch_showcase.ipynb
@@ -67,6 +67,7 @@
"ens.from_source_dict(\n",
" source_dict,\n",
" column_mapper=ColumnMapper(id_col=\"id\", time_col=\"mjd\", flux_col=\"flux\", err_col=\"err\", band_col=\"band\"),\n",
" sorted=True,\n",
")"
]
},
@@ -391,10 +392,10 @@
"metadata": {},
"outputs": [],
"source": [
"# Overwrite the _meta property\n",
"# Update the metadata\n",
"\n",
"res1_noindex = res1.reset_index()\n",
"res1_noindex._meta = real_meta_from_dataframe\n",
"res1_noindex = res1_noindex.map_partitions(TapeFrame, meta=real_meta_from_dataframe)\n",
"res1_noindex"
]
},
Expand Down Expand Up @@ -584,7 +585,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
"version": "3.10.11"
},
"vscode": {
"interpreter": {
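The batch-showcase change above replaces a direct write to the private `_meta` attribute with `map_partitions(..., meta=...)`, the supported way to hand Dask a new schema: each partition is re-wrapped by the given constructor, and the explicit `meta` becomes the collection's metadata. A rough sketch of the pattern, with a plain `pd.DataFrame` standing in for TAPE's `TapeFrame`:

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"id": [1, 2], "stat": [0.5, 0.9]}), npartitions=1)

# An empty frame describing the schema the result should have.
expected_meta = pd.DataFrame({"id": pd.Series(dtype="int64"),
                              "stat": pd.Series(dtype="float64")})

# Apply the constructor to every partition and register the metadata,
# rather than overwriting the private ddf._meta attribute in place.
ddf = ddf.map_partitions(pd.DataFrame, meta=expected_meta)
```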
36 changes: 29 additions & 7 deletions docs/tutorials/binning_slowly_changing_sources.ipynb
@@ -36,6 +36,7 @@
" flux_col=\"psFlux\",\n",
" err_col=\"psFluxErr\",\n",
" band_col=\"filterName\",\n",
" sorted=True,\n",
")"
]
},
@@ -118,6 +119,19 @@
"metadata": {},
"outputs": [],
"source": [
"ens = Ensemble() # initialize an ensemble object\n",
"\n",
"# Read in data from a parquet file\n",
"ens.from_parquet(\n",
" \"../../tests/tape_tests/data/source/test_source.parquet\",\n",
" id_col=\"ps1_objid\",\n",
" time_col=\"midPointTai\",\n",
" flux_col=\"psFlux\",\n",
" err_col=\"psFluxErr\",\n",
" band_col=\"filterName\",\n",
" sorted=True,\n",
")\n",
"\n",
"ens.bin_sources(time_window=28.0, offset=0.0, custom_aggr={\"midPointTai\": \"min\"})\n",
"fig, ax = plt.subplots(1, 1)\n",
"ax.hist(ens.source[\"midPointTai\"].compute().tolist(), 500)\n",
@@ -147,7 +161,7 @@
" \"band\": [\"g\", \"g\", \"g\", \"g\", \"g\", \"g\"],\n",
"}\n",
"cmap = ColumnMapper(id_col=\"id\", time_col=\"midPointTai\", flux_col=\"flux\", err_col=\"err\", band_col=\"band\")\n",
"ens.from_source_dict(rows, column_mapper=cmap)\n",
"ens.from_source_dict(rows, column_mapper=cmap, sorted=True)\n",
"\n",
"fig, ax = plt.subplots(1, 1)\n",
"ax.hist(ens.source[\"midPointTai\"].compute().tolist(), 60)\n",
@@ -175,7 +189,7 @@
" \"band\": [\"g\", \"g\", \"g\", \"g\", \"g\", \"g\"],\n",
"}\n",
"cmap = ColumnMapper(id_col=\"id\", time_col=\"midPointTai\", flux_col=\"flux\", err_col=\"err\", band_col=\"band\")\n",
"ens.from_source_dict(rows, column_mapper=cmap)\n",
"ens.from_source_dict(rows, column_mapper=cmap, sorted=True)\n",
"ens.bin_sources(time_window=1.0, offset=0.0)\n",
"\n",
"fig, ax = plt.subplots(1, 1)\n",
@@ -205,7 +219,7 @@
" \"band\": [\"g\", \"g\", \"g\", \"g\", \"g\", \"g\"],\n",
"}\n",
"cmap = ColumnMapper(id_col=\"id\", time_col=\"midPointTai\", flux_col=\"flux\", err_col=\"err\", band_col=\"band\")\n",
"ens.from_source_dict(rows, column_mapper=cmap)\n",
"ens.from_source_dict(rows, column_mapper=cmap, sorted=True)\n",
"ens.bin_sources(time_window=1.0, offset=0.5)\n",
"\n",
"fig, ax = plt.subplots(1, 1)\n",
@@ -243,6 +257,7 @@
" flux_col=\"psFlux\",\n",
" err_col=\"psFluxErr\",\n",
" band_col=\"filterName\",\n",
" sorted=True,\n",
")\n",
"suggested_offset = ens.find_day_gap_offset()\n",
"print(f\"Suggested offset is {suggested_offset}\")\n",
@@ -255,19 +270,26 @@
" \"band\": [\"g\", \"g\", \"g\", \"g\", \"g\", \"g\"],\n",
"}\n",
"cmap = ColumnMapper(id_col=\"id\", time_col=\"midPointTai\", flux_col=\"flux\", err_col=\"err\", band_col=\"band\")\n",
"ens.from_source_dict(rows, column_mapper=cmap)\n",
"ens.from_source_dict(rows, column_mapper=cmap, sorted=True)\n",
"ens.bin_sources(time_window=1.0, offset=0.5)\n",
"\n",
"fig, ax = plt.subplots(1, 1)\n",
"ax.hist(ens.source[\"midPointTai\"].compute().tolist(), 60)\n",
"ax.set_xlabel(\"Time (MJD)\")\n",
"ax.set_ylabel(\"Source Count\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "py310",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
@@ -281,11 +303,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
"version": "3.10.11"
},
"vscode": {
"interpreter": {
"hash": "08968836a6367873274ed1d5e98a07391f42fc3a62bd5aba54afbd7b11ba8673"
"hash": "83afbb17b435d9bf8b0d0042367da76f26510da1c5781f0ff6e6c518eab621ec"
}
}
},
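Conceptually, `bin_sources(time_window=..., offset=...)` aggregates observations whose times land in the same fixed-width window, with `offset` shifting the bin edges (which is what `find_day_gap_offset` suggests a value for). A sketch of the bin assignment only — an illustration of the idea, not TAPE's internal implementation:

```python
import numpy as np

times = np.array([0.2, 0.4, 1.1, 1.6, 2.3, 2.4])  # times in days (MJD-like)
window, offset = 1.0, 0.5

# Observations sharing a bin index fall in the same half-open interval
# [offset + k*window, offset + (k+1)*window).
bin_index = np.floor((times - offset) / window).astype(int)
print(bin_index)  # [-1 -1  0  1  1  1]
```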
55 changes: 48 additions & 7 deletions docs/tutorials/common_data_operations.ipynb
@@ -37,7 +37,7 @@
"\n",
"ens = Ensemble()\n",
"\n",
"ens.from_dataset(\"s82_rrlyrae\", sort=True)"
"ens.from_dataset(\"s82_rrlyrae\", sorted=True)"
]
},
{
@@ -141,7 +141,10 @@
"source": [
"### Access using a known ID\n",
"\n",
"If you'd like to access a particular lightcurve given an ID, you can use the `to_timeseries()` function. This allows you to supply a given object ID, and returns a `TimeSeries` object (see [working_with_the_timeseries](working_with_the_timeseries.ipynb))."
"If you'd like to access a particular lightcurve given an ID, you can use the `to_timeseries()` function. This allows you to supply a given object ID, and returns a `TimeSeries` object (see [working_with_the_timeseries](working_with_the_timeseries.ipynb)).\n",
"\n",
"> **_Note:_**\n",
"that this loads data from all available bands."
]
},
{
@@ -249,9 +252,9 @@
"metadata": {},
"outputs": [],
"source": [
"ens.calc_nobs(by_band=True)\n",
"ens.calc_nobs(by_band=True, temporary=False)\n",
"\n",
"ens.object[[\"nobs_u\", \"nobs_g\", \"nobs_r\", \"nobs_i\", \"nobs_z\", \"nobs_total\"]].head(5)"
"ens.object.head(5)[[\"nobs_u\", \"nobs_g\", \"nobs_r\", \"nobs_i\", \"nobs_z\", \"nobs_total\"]]"
]
},
{
@@ -464,8 +467,8 @@
"metadata": {},
"outputs": [],
"source": [
"ens.source.repartition(partition_size=\"100MB\").update_ensemble() # 100MBs is generally recommended\n",
"ens.source # In this case, we have a small set of data that easily fits into one partition"
"ens.source.repartition(partition_size=\"100MB\") # 100MBs is generally recommended\n",
"# In this case, we have a small set of data that easily fits into one partition"
]
},
{
@@ -492,6 +495,28 @@
"print(\"Number of post-sampled objects: \", len(subset_ens.object))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reproducible results, you can also specify a random seed via the `random_state` parameter. By re-using the same seed in your `random_state`, you can ensure that a given `Ensemble` will always be sampled the same way."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"subset_ens = ens.sample(\n",
" frac=0.2, # select a ~fifth of the objects\n",
" random_state=53783594, # set a random seed for reproducibility\n",
")\n",
"\n",
"print(\"Number of pre-sampled objects: \", len(ens.object))\n",
"print(\"Number of post-sampled objects: \", len(subset_ens.object))"
]
},
{
"attachments": {},
"cell_type": "markdown",
@@ -523,6 +548,15 @@
"In some situations, you may find yourself running a given workflow many times. Due to the nature of lazy-computation, this will involve repeated execution of data I/O, pre-processing steps, initial analysis, etc. In these situations, it may be effective to instead save the ensemble state to disk after completion of these initial processing steps. To accomplish this, we can use the `Ensemble.save_ensemble()` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ens.object.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -551,6 +585,13 @@
"new_ens = Ensemble()\n",
"new_ens.from_ensemble(\"./ensemble\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -569,7 +610,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
"version": "3.10.14"
},
"vscode": {
"interpreter": {
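The new sampling cells lean on the usual `random_state` contract: re-using a seed reproduces the draw exactly. That behavior comes from the pandas/Dask sampling machinery underneath, sketched here on a plain DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"objid": range(100)})

# The same random_state always selects the identical subset, which is
# what makes Ensemble.sample(..., random_state=...) reproducible.
a = df.sample(frac=0.2, random_state=53783594)
b = df.sample(frac=0.2, random_state=53783594)
assert a.equals(b)
```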