diff --git a/docs/tutorials.rst b/docs/tutorials.rst index 553c832..b012c61 100644 --- a/docs/tutorials.rst +++ b/docs/tutorials.rst @@ -4,5 +4,4 @@ Tutorials .. toctree:: Loading Data into Nested-Dask - Work with HiPSCat catalogs via LSDB Using the `nest` Accessor diff --git a/docs/tutorials/work_with_lsdb.ipynb b/docs/tutorials/work_with_lsdb.ipynb deleted file mode 100644 index 77369fd..0000000 --- a/docs/tutorials/work_with_lsdb.ipynb +++ /dev/null @@ -1,447 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "6171e5bbd47ce869", - "metadata": {}, - "source": [ - "# Load large catalog data with LSDB\n", - "\n", - "Here we load a small part of ZTF DR14 stored as HiPSCat catalog using [LSDB](https://lsdb.readthedocs.io/).\n", - "\n", - "The notebook is an adaptation of [the tutorial](https://github.com/lincc-frameworks/Rare_Gems_Demo/blob/main/Notebook_2_Basic_Time_Domain.ipynb) presented by Neven Caplar at the Rare Gems in Big Data conference, May 2024. " - ] - }, - { - "cell_type": "markdown", - "id": "c055a44b8ce3b34", - "metadata": {}, - "source": [ - "## Install dependencies for the notebook\n", - "\n", - "The notebook requires `nested-dask` and few other packages to be installed.\n", - "- `lsdb` to load and join \"object\" (pointing) and \"source\" (detection) ZTF catalogs\n", - "- `aiohttp` is `lsdb`'s optional dependency to download the data via web\n", - "- `light-curve` to extract features from light curves\n", - "- `matplotlib` to plot the results" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "af1710055600582", - "metadata": { - "ExecuteTime": { - "end_time": "2024-05-30T21:15:57.356540Z", - "start_time": "2024-05-30T21:15:55.782198Z" - } - }, - "outputs": [], - "source": [ - "# Uncomment the following line to install nested-dask\n", - "# %pip install nested-dask\n", - "\n", - "# Comment the following line to skip dependencies installation\n", - "%pip install --quiet lsdb==0.3.0 aiohttp light-curve matplotlib" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e4d03aa76aeeb1c0", - "metadata": { - "ExecuteTime": { - "end_time": "2024-05-30T21:15:59.778253Z", - "start_time": "2024-05-30T21:15:57.358002Z" - } - }, - "outputs": [], - "source": [ - "import dask.array\n", - "import dask.distributed\n", - "import light_curve as licu\n", - "import matplotlib.pyplot as plt\n", - "import nested_pandas as npd\n", - "import numpy as np\n", - "import pandas as pd\n", - "from dask_expr import from_legacy_dataframe\n", - "from lsdb import read_hipscat\n", - "from matplotlib.colors import LogNorm\n", - "from nested_dask import NestedFrame" - ] - }, - { - "cell_type": "markdown", - "id": "e169686259687cb2", - "metadata": {}, - "source": [ - "## Load ZTF DR14\n", - "For the demonstration purposes we use a light version of the ZTF DR14 catalog distributed by LINCC Frameworks, a half-degree circle around RA=180, Dec=10.\n", - "We load the data from HTTPS as two LSDB catalogs: objects (metadata catalog) and source (light curve catalog)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a403e00e2fd8d081", - "metadata": { - "ExecuteTime": { - "end_time": "2024-05-30T21:16:02.906421Z", - "start_time": "2024-05-30T21:15:59.779007Z" - } - }, - "outputs": [], - "source": [ - "catalogs_dir = \"https://epyc.astro.washington.edu/~lincc-frameworks/half_degree_surveys/ztf/\"\n", - "\n", - "lsdb_object = read_hipscat(\n", - " f\"{catalogs_dir}/ztf_object\",\n", - " columns=[\"ra\", \"dec\", \"ps1_objid\"],\n", - ")\n", - "lsdb_source = read_hipscat(\n", - " f\"{catalogs_dir}/ztf_source\",\n", - " columns=[\"mjd\", \"mag\", \"magerr\", \"band\", \"ps1_objid\", \"catflags\"],\n", - ")\n", - "lc_columns = [\"mjd\", \"mag\", \"magerr\", \"band\", \"catflags\"]" - ] - }, - { - "cell_type": "markdown", - "id": "4ed4201f2c59f542", - "metadata": {}, - "source": [ - "We need to merge these two catalogs to get the light curve data.\n", - "It is done with LSDB's `.join()` method which would give us a new catalog with all the columns from both catalogs. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b9b57bced7f810c0", - "metadata": { - "ExecuteTime": { - "end_time": "2024-05-30T21:16:02.931031Z", - "start_time": "2024-05-30T21:16:02.907834Z" - } - }, - "outputs": [], - "source": [ - "# We can ignore warning here - for this particular case we don't need margin cache\n", - "lsdb_joined = lsdb_object.join(\n", - " lsdb_source,\n", - " left_on=\"ps1_objid\",\n", - " right_on=\"ps1_objid\",\n", - " suffixes=(\"\", \"\"),\n", - ")\n", - "joined_ddf = lsdb_joined._ddf\n", - "joined_ddf" - ] - }, - { - "cell_type": "markdown", - "id": "8ac9c6ceaf6bc3d2", - "metadata": {}, - "source": [ - "## Convert LSDB joined catalog to `nested_dask.NestedFrame`" - ] - }, - { - "cell_type": "markdown", - "id": "347f97583a3c1ba4", - "metadata": {}, - "source": [ - "First, we plan the computation to convert the joined Dask DataFrame to a NestedFrame." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9522ce0977ff9fdf", - "metadata": { - "ExecuteTime": { - "end_time": "2024-05-30T21:16:02.951964Z", - "start_time": "2024-05-30T21:16:02.931819Z" - } - }, - "outputs": [], - "source": [ - "def convert_to_nested_frame(df: pd.DataFrame, nested_columns: list[str]):\n", - " other_columns = [col for col in df.columns if col not in nested_columns]\n", - "\n", - " # Since object rows are repeated, we just drop duplicates\n", - " object_df = df[other_columns].groupby(level=0).first()\n", - " nested_frame = npd.NestedFrame(object_df)\n", - "\n", - " source_df = df[nested_columns]\n", - " # lc is for light curve\n", - " return nested_frame.add_nested(source_df, \"lc\")\n", - "\n", - "\n", - "ddf = joined_ddf.map_partitions(\n", - " lambda df: convert_to_nested_frame(df, nested_columns=lc_columns),\n", - " meta=convert_to_nested_frame(joined_ddf._meta, nested_columns=lc_columns),\n", - ")\n", - "nested_ddf = NestedFrame.from_dask_dataframe(from_legacy_dataframe(ddf))\n", - "nested_ddf" - ] - }, - { - "cell_type": "markdown", - "id": "de6820724bcd4781", - "metadata": {}, - "source": [ - "Now we filter our dataframe by the `catflags` column (0 flags correspond to the perfect observational conditions) and the `band` column to be equal to `r`.\n", - "After filtering the detections, we are going to count the number of detections per object and keep only those objects with more than 10 detections." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a82bd831fc1f6f92", - "metadata": { - "ExecuteTime": { - "end_time": "2024-05-30T21:16:02.963360Z", - "start_time": "2024-05-30T21:16:02.952708Z" - } - }, - "outputs": [], - "source": [ - "r_band = nested_ddf.query(\"lc.catflags == 0 and lc.band == 'r'\")\n", - "nobs = r_band.reduce(np.size, \"lc.mjd\", meta={0: int}).rename(columns={0: \"nobs\"})\n", - "r_band = r_band[nobs[\"nobs\"] > 10]\n", - "r_band" - ] - }, - { - "cell_type": "markdown", - "id": "f99f2d9052f5843f", - "metadata": {}, - "source": [ - "Later we are going to extract features, so we need to prepare light-curve data to be in the same float format." - ] - }, - { - "cell_type": "markdown", - "id": "92d6ec9f5b0889e1", - "metadata": {}, - "source": [ - "### Extract features from ZTF light curves\n", - "\n", - "Now we are going to extract some features:\n", - "- Top periodogram peak\n", - "- Mean magnitude\n", - "- Von Neumann's eta statistics\n", - "- Excess variance statistics\n", - "- Number of observations\n", - "\n", - "We are going to use [`light-curve`](https://github.com/light-curve/light-curve-python) package for this purposes" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "672b9dbe67192b2d", - "metadata": { - "ExecuteTime": { - "end_time": "2024-05-30T21:16:02.978308Z", - "start_time": "2024-05-30T21:16:02.964225Z" - } - }, - "outputs": [], - "source": [ - "extractor = licu.Extractor(\n", - " licu.Periodogram(\n", - " peaks=1, max_freq_factor=50.0, fast=True\n", - " ), # Would give two features: peak period and signa-to-noise ratio of the peak\n", - " licu.WeightedMean(), # Mean magnitude\n", - " licu.Eta(), # Von Neumann's eta statistics\n", - " licu.ExcessVariance(), # Excess variance statistics\n", - " licu.ObservationCount(), # Number of observations\n", - ")\n", - "\n", - "\n", - "# light-curve requires all arrays to be the same dtype.\n", - "# It also requires the time array to be ordered and to have no duplicates.\n", - "def extract_features(mjd, mag, magerr, **kwargs):\n", - " # We offset date, so we still would have <1 second precision\n", - " t = np.asarray(mjd - 60000, dtype=np.float32)\n", - " _, sort_index = np.unique(t, return_index=True)\n", - " features = extractor(\n", - " t[sort_index],\n", - " mag[sort_index],\n", - " magerr[sort_index],\n", - " **kwargs,\n", - " )\n", - " # Return the features as a dictionary\n", - " return dict(zip(extractor.names, features))\n", - "\n", - "\n", - "features = r_band.reduce(\n", - " extract_features,\n", - " \"lc.mjd\",\n", - " \"lc.mag\",\n", - " \"lc.magerr\",\n", - " meta={name: np.float32 for name in extractor.names},\n", - ")\n", - "\n", - "df_w_features = r_band.join(features)\n", - "df_w_features" - ] - }, - { - "cell_type": "markdown", - "id": "6f2db5bc7d5f23c", - "metadata": {}, - "source": [ - "Before we are going next and actually run the computation, let's create a Dask client which would allow us to run the computation in parallel." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a8f0e5c9816cd9f9", - "metadata": { - "ExecuteTime": { - "end_time": "2024-05-30T21:16:03.687894Z", - "start_time": "2024-05-30T21:16:02.979043Z" - } - }, - "outputs": [], - "source": [ - "client = dask.distributed.Client()" - ] - }, - { - "cell_type": "markdown", - "id": "fb55750bd55f324f", - "metadata": {}, - "source": [ - "Now we can collect some statistics and plot it. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "eb6fb4857b9dc97d", - "metadata": { - "ExecuteTime": { - "end_time": "2024-05-30T21:16:30.171468Z", - "start_time": "2024-05-30T21:16:03.688907Z" - } - }, - "outputs": [], - "source": [ - "lg_period_bins = np.linspace(-1, 2, 101)\n", - "lg_snr_bins = np.linspace(0, 2, 101)\n", - "\n", - "lg_period = dask.array.log10(df_w_features[\"period_0\"].to_dask_array())\n", - "lg_snr = dask.array.log10(df_w_features[\"period_s_to_n_0\"].to_dask_array())\n", - "\n", - "hist2d = dask.array.histogram2d(\n", - " lg_period,\n", - " lg_snr,\n", - " bins=[lg_period_bins, lg_snr_bins],\n", - ")\n", - "# Run the computation\n", - "hist2d = hist2d[0].compute()\n", - "\n", - "# Plot the 2D histogram\n", - "plt.imshow(\n", - " hist2d.T,\n", - " extent=(lg_period_bins[0], lg_period_bins[-1], lg_snr_bins[0], lg_snr_bins[-1]),\n", - " origin=\"lower\",\n", - " norm=LogNorm(vmin=1, vmax=hist2d.max()),\n", - ")\n", - "plt.colorbar(label=\"Number of stars\")\n", - "plt.xlabel(\"lg Period/day\")\n", - "plt.ylabel(\"lg S/N\")" - ] - }, - { - "cell_type": "markdown", - "id": "88dfbcbd324f9ac2", - "metadata": {}, - "source": [ - "Let's select a bright star with a high signal-to-noise ratio and plot its phase light curve. We also filter periods ~1 day, because they are very likely to be bogus." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8f8adf145ef471fb", - "metadata": { - "ExecuteTime": { - "end_time": "2024-05-30T21:16:47.074714Z", - "start_time": "2024-05-30T21:16:30.174392Z" - } - }, - "outputs": [], - "source": [ - "bright_periodic_stars = df_w_features.query(\n", - " \"period_s_to_n_0 > 20 and weighted_mean < 15 and (period_0 < 0.9 or period_0 > 1.1)\"\n", - ")\n", - "df = bright_periodic_stars.compute()\n", - "\n", - "obj = df.iloc[0]\n", - "lc = obj[\"lc\"]\n", - "period = obj[\"period_0\"]\n", - "ra = obj[\"ra\"]\n", - "dec = obj[\"dec\"]\n", - "\n", - "plt.errorbar(lc[\"mjd\"] % period / period, lc[\"mag\"], lc[\"magerr\"], fmt=\"o\")\n", - "plt.gca().invert_yaxis()\n", - "plt.xlabel(\"Phase\")\n", - "plt.ylabel(\"Magnitude\")\n", - "plt.xlim([0, 1])\n", - "plt.title(f\"RA={ra:.4f}, Dec={dec:.4f} period={period:.4f} days\")\n", - "\n", - "print(\"Search this object for on the SNAD ZTF Viewer:\")\n", - "print(f\"https://ztf.snad.space/search/{ra}%20{dec}/1\")" - ] - }, - { - "cell_type": "markdown", - "id": "5922a307eb316c33", - "metadata": {}, - "source": [ - "It looks like a nice RR Lyrae star!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "76cfeb23acf497cd", - "metadata": { - "ExecuteTime": { - "end_time": "2024-05-30T21:16:47.499116Z", - "start_time": "2024-05-30T21:16:47.075603Z" - } - }, - "outputs": [], - "source": [ - "client.close()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 2 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "3.10.14" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}