Merge pull request #374 from lincc-frameworks/load_lsdb

Loaders for LSDB and hipscat (reimplementation using LSDB) data
lincc-frameworks · Feb 8, 2024 · d7b5d17 · d7b5d17
2 parents 65d1fbf + f26917d
commit d7b5d17
Show file tree

Hide file tree

Showing 34 changed files with 936 additions and 64 deletions.
diff --git a/docs/tutorials.rst b/docs/tutorials.rst
@@ -8,6 +8,7 @@ functionality.
 
     Working with the TAPE Ensemble object <tutorials/working_with_the_ensemble>
     Working with the TAPE Timeseries object <tutorials/working_with_the_timeseries>
+    Working with HiPSCat and LSDB data <tutorials/working_with_hipscat_and_lsdb>
     Scaling to Large Data Volume <tutorials/scaling_to_large_data>
     Working with Structure Function <tutorials/working_with_structure_function>
     Binning Sources in the Ensemble <tutorials/binning_slowly_changing_sources>

diff --git a/docs/tutorials/working_with_hipscat_and_lsdb.ipynb b/docs/tutorials/working_with_hipscat_and_lsdb.ipynb
@@ -0,0 +1,255 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rel_path = \"../../tests/tape_tests/data/small_sky_hipscat\""
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Using TAPE with LSDB and HiPSCat Data"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The [Hierarchical Progressive Survey Catalog (HiPSCat)](https://hipscat.readthedocs.io/en/latest/) format is a partitioning of objects on a sphere. Its purpose is for storing data from large astronomy surveys, with the main feature being the adaptive sizing of partitions based on the number of objects in a given region of the sky, using [healpix](https://healpix.jpl.nasa.gov/).\n",
+    "\n",
+    "The [Large Survey Database (LSDB)](https://lsdb.readthedocs.io/en/latest/) is a framework that facilitates and enables spatial analysis for extremely large astronomical databases (i.e. querying and crossmatching O(1B) sources). This package uses dask to parallelize operations across multiple HiPSCat partitioned surveys.\n",
+    "\n",
+    "Both HiPSCat and LSDB are strong tools in the arsenal of a TAPE user. HiPSCat provides a scalable data format for working at the scale of LSST. While LSDB provides tooling to prepare more complex datasets for TAPE analysis, including operations like cross-matching multiple surveys, cone searches to select data from specific regions of the sky, etc. In this notebook, we'll walk through the process by which these can be used with TAPE."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading from HiPSCat data"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "TAPE offers a built-in HiPSCat loader function, which can be used to quickly load in a dataset that is in the HiPSCat format. We'll use a small dummy dataset for this example. Before loading, let's just peek at the data we'll be working with."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pyarrow.parquet as pq\n",
+    "import os\n",
+    "\n",
+    "object_path = os.path.join(rel_path, \"small_sky_object_catalog\")\n",
+    "source_path = os.path.join(rel_path, \"small_sky_source_catalog\")\n",
+    "\n",
+    "# Object Schema\n",
+    "pq.read_metadata(os.path.join(object_path, \"_common_metadata\")).schema"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Source Schema\n",
+    "pq.read_metadata(os.path.join(source_path, \"_common_metadata\")).schema"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The schema indicates which fields are available in each catalog. Notice the `_hipscat_index` in both, this is a specially constructed index that the data is sorted on and enables efficient use of the HiPSCat format. It's recommended to use this as the ID column in TAPE when loading from hipscatted object and source catalogs. With this established, let's load this data into TAPE."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from tape import Ensemble, ColumnMapper\n",
+    "\n",
+    "ens = Ensemble(client=False)\n",
+    "\n",
+    "# Setup a ColumnMapper\n",
+    "colmap = ColumnMapper(\n",
+    "    id_col=\"_hipscat_index\",  # using _hipscat_index is recommended\n",
+    "    time_col=\"mjd\",  # pulling these from the source schema list above\n",
+    "    flux_col=\"mag\",\n",
+    "    err_col=\"Norder\",  # we don't have an error column, using a random column for this toy example\n",
+    "    band_col=\"band\",\n",
+    ")\n",
+    "\n",
+    "ens.from_hipscat(source_path, object_path, column_mapper=colmap, object_index=\"id\", source_index=\"object_id\")\n",
+    "\n",
+    "ens.object.head(5)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the `from_hipscat` call, we additionally needed to specify `object_index` and `source_index`, these are a column from both tables that map to the same object-level identifier. It's used to join object and source, and convert the source `_hipscat_index` (which is unique per source) to use the object `_hipscat_index` (unique per object). From here, the `_hipscat_index` will serve as an object ID that ties sources together for TAPE operations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# We're now free to work with our TAPE Ensemble as normal\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "ts = ens.to_timeseries(12751184493818150912)  # select a lightcurve using the _hipscat_index\n",
+    "\n",
+    "# Let's plot this, though it's toy data so it won't look like anything...\n",
+    "plt.plot(ts.data[\"mjd\"], ts.data[\"mag\"], \".\")\n",
+    "plt.title(ts.meta[\"id\"])"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading from LSDB Catalogs\n",
+    "\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`Ensemble.from_hipscat` is used to directly ingest HiPSCat data into TAPE. In many cases, you may prefer to do a few operations on your HiPSCat data first using LSDB. Let's walk through how this would look."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Loading into LSDB\n",
+    "import lsdb\n",
+    "\n",
+    "# Load the dataset into LSDB catalog objects\n",
+    "object_cat = lsdb.read_hipscat(object_path)\n",
+    "source_cat = lsdb.read_hipscat(source_path)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We've now loaded our catalogs into LSDB catalog objects. From here, we can do LSDB operations on the catalogs. For example, let's perform a cone search to narrow down our list of objects."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "object_cat_cone = object_cat.cone_search(\n",
+    "    ra=315.0,\n",
+    "    dec=-69.5,\n",
+    "    radius=10.0,\n",
+    ")\n",
+    "\n",
+    "print(f\"Original Number of Objects: {len(object_cat._ddf)}\")\n",
+    "print(f\"New Number of Objects: {len(object_cat_cone._ddf)}\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "With our cone search performed, we can now move into TAPE. We'll first need to create a new source catalog, `joined_source_cat`, which incorporates the result of the cone search and also reindexes onto the object `_hipscat_index`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# We do this to get the source catalog indexed by the objects hipscat index\n",
+    "joined_source_cat = object_cat_cone.join(\n",
+    "    source_cat, left_on=\"id\", right_on=\"object_id\", suffixes=(\"_object\", \"\")\n",
+    ")\n",
+    "\n",
+    "colmap = ColumnMapper(\n",
+    "    id_col=\"_hipscat_index\",\n",
+    "    time_col=\"mjd\",\n",
+    "    flux_col=\"mag\",\n",
+    "    err_col=\"Norder\",  # no error column...\n",
+    "    band_col=\"band\",\n",
+    ")\n",
+    "\n",
+    "ens = Ensemble(client=False)\n",
+    "\n",
+    "# We just pass in the catalog objects\n",
+    "ens.from_lsdb(joined_source_cat, object_cat_cone, column_mapper=colmap)\n",
+    "\n",
+    "ens.object.compute()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "And from here, we're once again able to work with our TAPE Ensemble as normal."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.11"
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "83afbb17b435d9bf8b0d0042367da76f26510da1c5781f0ff6e6c518eab621ec"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/pyproject.toml b/pyproject.toml
@@ -28,6 +28,7 @@ dependencies = [
     "deprecated",
     "ipykernel", # Support for Jupyter notebooks
     "light-curve>=0.7.3,<0.8.0",
+    "lsdb"
 ]
 
 # On a mac, install optional dependencies with `pip install '.[dev]'` (include the single quotes)