-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge branch 'nits' of https://github.com/lincc-frameworks/nested-dask …
…into nits
- Loading branch information
Showing
5 changed files
with
259 additions
and
19 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,236 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "6171e5bbd47ce869", | ||
"metadata": {}, | ||
"source": [ | ||
"# Load large catalog data from the LSDB\n", | ||
"\n", | ||
"Here we load a small part of ZTF DR14 stored as HiPSCat catalog using the [LSDB](https://lsdb.readthedocs.io/)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "c055a44b8ce3b34", | ||
"metadata": {}, | ||
"source": [ | ||
"## Install LSDB and its dependencies and import the necessary modules\n", | ||
"\n", | ||
"We also need `aiohttp`, which is an optional LSDB's dependency, needed to access the catalog data from the web." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "af1710055600582", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-05-24T12:54:00.759441Z", | ||
"start_time": "2024-05-24T12:53:58.854875Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import pandas as pd\n", | ||
"\n", | ||
"# Comment the following line to skip LSDB installation\n", | ||
"%pip install aiohttp lsdb" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "e4d03aa76aeeb1c0", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-05-24T12:54:03.834087Z", | ||
"start_time": "2024-05-24T12:54:00.761330Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import nested_pandas as npd\n", | ||
"from lsdb import read_hipscat\n", | ||
"from nested_dask import NestedFrame\n", | ||
"from nested_pandas.series.packer import pack" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e169686259687cb2", | ||
"metadata": {}, | ||
"source": [ | ||
"## Load ZTF DR14\n", | ||
"For the demonstration purposes we use a light version of the ZTF DR14 catalog distributed by LINCC Frameworks, a half-degree circle around RA=180, Dec=10.\n", | ||
"We load the data from HTTPS as two LSDB catalogs: objects (metadata catalog) and source (light curve catalog)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "a403e00e2fd8d081", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-05-24T12:54:06.745405Z", | ||
"start_time": "2024-05-24T12:54:03.834904Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"catalogs_dir = \"https://epyc.astro.washington.edu/~lincc-frameworks/half_degree_surveys/ztf/\"\n", | ||
"\n", | ||
"lsdb_object = read_hipscat(\n", | ||
" f\"{catalogs_dir}/ztf_object\",\n", | ||
" columns=[\"ra\", \"dec\", \"ps1_objid\"],\n", | ||
")\n", | ||
"lsdb_source = read_hipscat(\n", | ||
" f\"{catalogs_dir}/ztf_source\",\n", | ||
" columns=[\"mjd\", \"mag\", \"magerr\", \"band\", \"ps1_objid\", \"catflags\"],\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "4ed4201f2c59f542", | ||
"metadata": {}, | ||
"source": [ | ||
"We need to merge these two catalogs to get the light curve data.\n", | ||
"It is done with LSDB's `.join()` method which would give us a new catalog with all the columns from both catalogs. " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "b9b57bced7f810c0", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-05-24T12:54:06.770931Z", | ||
"start_time": "2024-05-24T12:54:06.746786Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# We can ignore warning here - for this particular case we don't need margin cache\n", | ||
"lsdb_joined = lsdb_object.join(\n", | ||
" lsdb_source,\n", | ||
" left_on=\"ps1_objid\",\n", | ||
" right_on=\"ps1_objid\",\n", | ||
" suffixes=(\"\", \"\"),\n", | ||
")\n", | ||
"joined_ddf = lsdb_joined._ddf\n", | ||
"joined_ddf" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "8ac9c6ceaf6bc3d2", | ||
"metadata": {}, | ||
"source": [ | ||
"## Convert LSDB joined catalog to `nested_dask.NestedFrame`" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "347f97583a3c1ba4", | ||
"metadata": {}, | ||
"source": [ | ||
"First, we plan the computation to convert the joined Dask DataFrame to a NestedFrame." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "9522ce0977ff9fdf", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-05-24T12:54:06.789983Z", | ||
"start_time": "2024-05-24T12:54:06.772721Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"def convert_to_nested_frame(df: pd.DataFrame, nested_columns: list[str]):\n", | ||
" other_columns = [col for col in df.columns if col not in nested_columns]\n", | ||
"\n", | ||
" # Since object rows are repeated, we just drop duplicates\n", | ||
" object_df = df[other_columns].groupby(level=0).first()\n", | ||
" nested_frame = npd.NestedFrame(object_df)\n", | ||
"\n", | ||
" source_df = df[nested_columns]\n", | ||
" # lc is for light curve\n", | ||
" # https://github.com/lincc-frameworks/nested-pandas/issues/88\n", | ||
" # nested_frame.add_nested(source_df, 'lc')\n", | ||
" nested_frame[\"lc\"] = pack(source_df, name=\"lc\")\n", | ||
"\n", | ||
" return nested_frame\n", | ||
"\n", | ||
"\n", | ||
"ddf = joined_ddf.map_partitions(\n", | ||
" lambda df: convert_to_nested_frame(df, nested_columns=lsdb_source.columns),\n", | ||
" meta=convert_to_nested_frame(joined_ddf._meta, nested_columns=lsdb_source.columns),\n", | ||
")\n", | ||
"nested_ddf = NestedFrame.from_dask_dataframe(ddf)\n", | ||
"nested_ddf" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "de6820724bcd4781", | ||
"metadata": {}, | ||
"source": [ | ||
"Second, we compute the NestedFrame." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "a82bd831fc1f6f92", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-05-24T12:54:19.282406Z", | ||
"start_time": "2024-05-24T12:54:06.790699Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"ndf = nested_ddf.compute()\n", | ||
"ndf" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "9c82f593efca9d30", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-05-24T12:54:19.284710Z", | ||
"start_time": "2024-05-24T12:54:19.283179Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 2 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython2", | ||
"version": "2.7.6" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |