-
Notifications
You must be signed in to change notification settings - Fork 3
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #374 from lincc-frameworks/load_lsdb
Loaders for LSDB and hipscat (reimplementation using LSDB) data
- Loading branch information
Showing
34 changed files
with
936 additions
and
64 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,255 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"rel_path = \"../../tests/tape_tests/data/small_sky_hipscat\"" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Using TAPE with LSDB and HiPSCat Data" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The [Hierarchical Progressive Survey Catalog (HiPSCat)](https://hipscat.readthedocs.io/en/latest/) format is a partitioning of objects on a sphere. Its purpose is for storing data from large astronomy surveys, with the main feature being the adaptive sizing of partitions based on the number of objects in a given region of the sky, using [healpix](https://healpix.jpl.nasa.gov/).\n", | ||
"\n", | ||
"The [Large Survey Database (LSDB)](https://lsdb.readthedocs.io/en/latest/) is a framework that facilitates and enables spatial analysis for extremely large astronomical databases (i.e. querying and crossmatching O(1B) sources). This package uses dask to parallelize operations across multiple HiPSCat partitioned surveys.\n", | ||
"\n", | ||
"Both HiPSCat and LSDB are strong tools in the arsenal of a TAPE user. HiPSCat provides a scalable data format for working at the scale of LSST. While LSDB provides tooling to prepare more complex datasets for TAPE analysis, including operations like cross-matching multiple surveys, cone searches to select data from specific regions of the sky, etc. In this notebook, we'll walk through the process by which these can be used with TAPE." | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Loading from HiPSCat data" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"TAPE offers a built-in HiPSCat loader function, which can be used to quickly load in a dataset that is in the HiPSCat format. We'll use a small dummy dataset for this example. Before loading, let's just peek at the data we'll be working with." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import pyarrow.parquet as pq\n", | ||
"import os\n", | ||
"\n", | ||
"object_path = os.path.join(rel_path, \"small_sky_object_catalog\")\n", | ||
"source_path = os.path.join(rel_path, \"small_sky_source_catalog\")\n", | ||
"\n", | ||
"# Object Schema\n", | ||
"pq.read_metadata(os.path.join(object_path, \"_common_metadata\")).schema" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Source Schema\n", | ||
"pq.read_metadata(os.path.join(source_path, \"_common_metadata\")).schema" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The schema indicates which fields are available in each catalog. Notice the `_hipscat_index` in both, this is a specially constructed index that the data is sorted on and enables efficient use of the HiPSCat format. It's recommended to use this as the ID column in TAPE when loading from hipscatted object and source catalogs. With this established, let's load this data into TAPE." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from tape import Ensemble, ColumnMapper\n", | ||
"\n", | ||
"ens = Ensemble(client=False)\n", | ||
"\n", | ||
"# Setup a ColumnMapper\n", | ||
"colmap = ColumnMapper(\n", | ||
" id_col=\"_hipscat_index\", # using _hipscat_index is recommended\n", | ||
" time_col=\"mjd\", # pulling these from the source schema list above\n", | ||
" flux_col=\"mag\",\n", | ||
" err_col=\"Norder\", # we don't have an error column, using a random column for this toy example\n", | ||
" band_col=\"band\",\n", | ||
")\n", | ||
"\n", | ||
"ens.from_hipscat(source_path, object_path, column_mapper=colmap, object_index=\"id\", source_index=\"object_id\")\n", | ||
"\n", | ||
"ens.object.head(5)" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In the `from_hipscat` call, we additionally needed to specify `object_index` and `source_index`, these are a column from both tables that map to the same object-level identifier. It's used to join object and source, and convert the source `_hipscat_index` (which is unique per source) to use the object `_hipscat_index` (unique per object). From here, the `_hipscat_index` will serve as an object ID that ties sources together for TAPE operations." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# We're now free to work with our TAPE Ensemble as normal\n", | ||
"import matplotlib.pyplot as plt\n", | ||
"\n", | ||
"ts = ens.to_timeseries(12751184493818150912) # select a lightcurve using the _hipscat_index\n", | ||
"\n", | ||
"# Let's plot this, though it's toy data so it won't look like anything...\n", | ||
"plt.plot(ts.data[\"mjd\"], ts.data[\"mag\"], \".\")\n", | ||
"plt.title(ts.meta[\"id\"])" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Loading from LSDB Catalogs\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"`Ensemble.from_hipscat` is used to directly ingest HiPSCat data into TAPE. In many cases, you may prefer to do a few operations on your HiPSCat data first using LSDB. Let's walk through how this would look." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Loading into LSDB\n", | ||
"import lsdb\n", | ||
"\n", | ||
"# Load the dataset into LSDB catalog objects\n", | ||
"object_cat = lsdb.read_hipscat(object_path)\n", | ||
"source_cat = lsdb.read_hipscat(source_path)" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We've now loaded our catalogs into LSDB catalog objects. From here, we can do LSDB operations on the catalogs. For example, let's perform a cone search to narrow down our list of objects." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"object_cat_cone = object_cat.cone_search(\n", | ||
" ra=315.0,\n", | ||
" dec=-69.5,\n", | ||
" radius=10.0,\n", | ||
")\n", | ||
"\n", | ||
"print(f\"Original Number of Objects: {len(object_cat._ddf)}\")\n", | ||
"print(f\"New Number of Objects: {len(object_cat_cone._ddf)}\")" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"With our cone search performed, we can now move into TAPE. We'll first need to create a new source catalog, `joined_source_cat`, which incorporates the result of the cone search and also reindexes onto the object `_hipscat_index`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# We do this to get the source catalog indexed by the objects hipscat index\n", | ||
"joined_source_cat = object_cat_cone.join(\n", | ||
" source_cat, left_on=\"id\", right_on=\"object_id\", suffixes=(\"_object\", \"\")\n", | ||
")\n", | ||
"\n", | ||
"colmap = ColumnMapper(\n", | ||
" id_col=\"_hipscat_index\",\n", | ||
" time_col=\"mjd\",\n", | ||
" flux_col=\"mag\",\n", | ||
" err_col=\"Norder\", # no error column...\n", | ||
" band_col=\"band\",\n", | ||
")\n", | ||
"\n", | ||
"ens = Ensemble(client=False)\n", | ||
"\n", | ||
"# We just pass in the catalog objects\n", | ||
"ens.from_lsdb(joined_source_cat, object_cat_cone, column_mapper=colmap)\n", | ||
"\n", | ||
"ens.object.compute()" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"And from here, we're once again able to work with our TAPE Ensemble as normal." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.11" | ||
}, | ||
"vscode": { | ||
"interpreter": { | ||
"hash": "83afbb17b435d9bf8b0d0042367da76f26510da1c5781f0ff6e6c518eab621ec" | ||
} | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.