Commit

Grammar/clarity; update plot title
OliviaLynn committed Nov 1, 2024
1 parent c43acbb commit 83c62ba
Showing 5 changed files with 41 additions and 33 deletions.
Binary file modified docs/tutorials/_static/crossmatching-performance.png
4 changes: 2 additions & 2 deletions docs/tutorials/manual_verification.ipynb
@@ -6,7 +6,7 @@
"source": [
"# Manual catalog verification\n",
"\n",
"This notebook presents methods for verifying that a directory contains a valid HATS catalog, and performing manual verification through inspecting the catalog metadata and contents."
"This notebook presents methods for verifying that a directory contains a valid HATS catalog and performing manual verification through inspecting the catalog metadata and contents."
]
},
{
@@ -40,7 +40,7 @@
"source": [
"### Explaining the input and output\n",
"\n",
"The `strict` argument takes us through a different code path that rigorously tests the contents of all ancillary metadata files, and the consistency of the partition pixels.\n",
"The `strict` argument takes us through a different code path that rigorously tests the contents of all ancillary metadata files and the consistency of the partition pixels.\n",
"\n",
"Here, we use the `verbose=True` argument to print out a little bit more information about our catalog. It will repeat the path that we're looking at, display the total number of partitions, and calculate the approximate sky coverage, based on the area of the HATS tiles.\n",
"\n",
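To make the `strict`/`verbose` discussion concrete, here is a minimal sketch of a strict verification call. It assumes the validation helper is hats' `is_valid_catalog` (the import path and exact signature may differ between hats versions), and the catalog path is a placeholder.

```python
from hats.io.validation import is_valid_catalog  # assumed import path; may differ across hats versions

catalog_path = "path/to/my_hats_catalog"  # placeholder

# strict=True exercises the rigorous path described above: ancillary metadata files
# plus partition-pixel consistency. verbose=True also reports the path, the number
# of partitions, and the approximate sky coverage of the HATS tiles.
is_valid_catalog(catalog_path, strict=True, verbose=True)
```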
24 changes: 15 additions & 9 deletions docs/tutorials/margins.ipynb
@@ -6,12 +6,14 @@
"source": [
"# Margins\n",
"\n",
"LSDB can handle datasets larger than memory by breaking them down into smaller spatially-connected parts and working on each part one at a time. One of the main tasks enabled by LSDB are spatial queries such as cross-matching; To ensure accurate comparisons, all nearby data points need to be loaded simultaneously. LSDB uses HATS' method of organizing data spatially to achieve this. However, there's a limitation: at the boundaries of each divided section, some data points are going to be missed. This means that for operations requiring comparisons with neighboring points, such as cross-matching, the process might miss some matches for points near these boundaries because not all nearby points are included when analyzing one section at a time.\n",
"LSDB can handle datasets larger than memory by breaking them down into smaller, spatially-connected parts and working on each part one at a time. One of the main tasks enabled by LSDB are spatial queries such as cross-matching; to ensure accurate comparisons, all nearby data points need to be loaded simultaneously. LSDB uses HATS' method of organizing data spatially to achieve this.\n",
"\n",
"However, there's a limitation: at the boundaries of each divided section, some data points are going to be missed. This means that for operations requiring comparisons with neighboring points, such as cross-matching, the process might miss some matches for points near these boundaries because not all nearby points are included when analyzing one section at a time.\n",
"\n",
"![Margin Boundary Example](_static/pixel-boundary-example.png)\n",
"*Here we see an example of a boundary between HEALPix pixels, where the green points are in one partition and the red points in another. Working with one partition at a time, we would miss potential matches with points close to the boundary*\n",
"\n",
"To solve this, we could try to also load the neighboring partitions for each partition we crossmatch. However, this would mean needing to load lots of unnecessary data, slowing down operations and causing issues with running out of memory. So for each catalog we also create a margin cache. This means that for each partition, we create a file that contains the points in the catalog within a certain distance to the pixel's boundary.\n",
"To solve this, we could try to also load the neighboring partitions for each partition we crossmatch. However, this would require loading lots of unnecessary data, slowing down operations and causing issues with running out of memory. So, for each catalog, we also create a margin cache. This means that for each partition, we create a file that contains the points in the catalog within a certain distance to the pixel's boundary.\n",
"\n",
"![Margin Cache Example](_static/margin-pix.png)\n",
"*An example of a margin cache (orange) for the same green pixel. The margin cache for this pixel contains the points within 10 arcseconds of the boundary.*"
@@ -76,7 +78,7 @@
"collapsed": false
},
"source": [
"Here we see the margin catalog that has been loaded with the catalog, in this case using a margin threshold of 10 arcseconds.\n",
"Here we see the margin catalog that has been loaded with the catalog, using a margin threshold of 10 arcseconds.\n",
"\n",
"Let's plot the catalog and its margin together:"
]
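One way to produce such a plot, sketched here with matplotlib on computed partitions, continuing the loading sketch above. The `ra`/`dec` column names are assumptions, and in practice you would compute only a small filtered region rather than a full survey.

```python
import matplotlib.pyplot as plt

catalog_df = ztf.compute()        # rows from the main partitions
margin_df = ztf.margin.compute()  # rows from the margin cache

plt.scatter(catalog_df["ra"], catalog_df["dec"], s=1, c="green", label="catalog")
plt.scatter(margin_df["ra"], margin_df["dec"], s=1, c="orange", label="margin")
plt.xlabel("RA (deg)")
plt.ylabel("Dec (deg)")
plt.legend()
plt.show()
```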
@@ -183,9 +185,9 @@
"source": [
"## Using the Margin Catalog\n",
"\n",
"Performing operations like cross-matching and joining require a margin to be loaded in the catalog on the right side of the operation. If this right catalog has been loaded with a margin, the function will be carried out accurately using the margin, and by default will throw an error if the margin has not been set. This can be overwritten using the `require_right_margin` parameter, but this may cause inaccurate results!\n",
"Performing operations like cross-matching and joining requires a margin to be loaded in the catalog on the right side of the operation. If this right catalog has been loaded with a margin, the function will be carried out accurately using the margin, and by default will throw an error if the margin has not been set. This can be overwritten using the `require_right_margin` parameter, but this may cause inaccurate results!\n",
"\n",
"We can see this trying to perform a crossmatch with gaia"
"We can see this when trying to perform a crossmatch with gaia:"
]
},
{
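For the demonstrations that follow, imagine Gaia loaded without any margin cache; the path is a placeholder and this is only a sketch, not the notebook's actual cell.

```python
# Gaia loaded without a margin_cache, so it cannot be used as the right-hand
# catalog of a crossmatch unless the margin requirement is explicitly relaxed.
gaia = lsdb.read_hats("path/to/gaia_dr3")  # placeholder path
```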
@@ -209,7 +211,7 @@
"collapsed": false
},
"source": [
"If we perform a crossmatch with gaia on the left and the ztf catalog we loaded with a margin on the right, the function works and we get the result"
"If we perform a crossmatch with gaia on the left and the ztf catalog we loaded with a margin on the right, the function works, and we get the result:"
]
},
{
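A sketch of the working direction, continuing the snippets above, with Gaia on the left and the margin-backed ZTF catalog on the right; the search radius and keyword names follow LSDB's k-d tree crossmatch as I understand it, so treat them as assumptions.

```python
# ztf carries a margin cache, so it is a valid right-hand catalog here.
matched = gaia.crossmatch(ztf, radius_arcsec=1, n_neighbors=1, suffixes=("_gaia", "_ztf"))
matched_df = matched.compute()  # executes the lazy task graph
matched_df.head()
```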
@@ -246,7 +248,7 @@
"collapsed": false
},
"source": [
"If we try the other way around, we have not loaded the right catalog (gaia) with a margin cache, and so we get an error"
"If we try the other way around, we have not loaded the right catalog (gaia) with a margin cache, and so we get an error."
]
},
{
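The reverse direction can be sketched like this: crossmatching with Gaia (no margin) on the right is refused by default, and only the `require_right_margin` parameter mentioned above relaxes the check. The exact exception class and the relaxing value are assumptions.

```python
try:
    ztf.crossmatch(gaia, radius_arcsec=1)  # gaia has no margin cache attached
except Exception as error:  # the exact exception class is not spelled out in the tutorial
    print(f"Crossmatch refused: {error}")

# Passing require_right_margin=False (assumed to be the relaxing value) would force
# the crossmatch through, at the risk of missing matches near partition boundaries.
```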
@@ -272,7 +274,7 @@
"collapsed": false
},
"source": [
"We can plot the result of the crossmatch below, with the gaia objects in green and the ztf objects in red"
"We can plot the result of the crossmatch below, with the gaia objects in green and the ztf objects in red."
]
},
{
@@ -356,7 +358,11 @@
"source": [
"## Avoiding Duplicates\n",
"\n",
"Joining the margin cache to the catalog's data would introduce duplicate points, where points near the boundary would appear in the margin of one partition, and the main file of another partition. To avoid this, we keep two separate task graphs, one for the catalog and one for its margin. For operations that don't require the margin, the task graphs are kept separate, and when `compute` is called on the catalog, only the catalog's task graph is computed without joining the margin or even loading the margin files. For operations like crossmatching that require the margin, the task graphs are combined with the margin joined and used. For these operations, we use only the margin for the catalog on the right side of the operation. This means that for each left catalog point that is considered, all of the possible nearby matches in the right catalog are also loaded, and so the results are kept accurate. But since there are no duplicates of the left catalog points, there are no duplicate results."
"Joining the margin cache to the catalog's data would introduce duplicate points, where points near the boundary would appear in both the margin of one partition and the main file of another partition. To avoid this, we keep two separate task graphs, one for the catalog and one for its margin.\n",
"\n",
"For operations that don't require the margin, the task graphs remain separate, and when `compute` is called on the catalog, only the catalog's task graph is computed—without joining the margin or even loading the margin files.\n",
"\n",
"For operations like crossmatching that require the margin, the task graphs are combined with the margin joined and used. For these operations, we use only the margin for the catalog on the right side of the operation. This means that for each left catalog point that is considered, all of the possible nearby matches in the right catalog are also loaded, so the results are kept accurate. But since there are no duplicates of the left catalog points, there are no duplicate results."
]
}
],
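A small sketch of the consequence described above, continuing the earlier snippets: computing the catalog alone never touches the margin files, so no margin rows (and hence no duplicates) appear in the output. The `.margin` attribute is the same assumption as before.

```python
# Only the catalog's task graph runs here; the margin files are never read.
main_rows = ztf.compute()

# The margin lives in its own task graph and is only joined in for operations,
# such as crossmatching, that need neighbouring points from the right catalog.
margin_rows = ztf.margin.compute()

# Within any single partition, the main rows and margin rows are disjoint: a
# boundary point sits in the main file of its own partition and only appears
# again in *other* partitions' margin caches.
print(len(main_rows), len(margin_rows))
```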
24 changes: 13 additions & 11 deletions docs/tutorials/performance.rst
@@ -3,26 +3,28 @@ Performance

LSDB is a high-performance package built to support the analysis of large-scale astronomical datasets.
One of the performance goals of LSDB is to add as little overhead over the input-output operations as possible.
We achieve this aim for catalog cross-matching, spatial and data filtering operations by using
We achieve this for catalog cross-matching and spatial- and data-filtering operations by using
the `HATS <https://github.com/astronomy-commons/hats>`_ data format,
efficient algorithms,
and `Dask <https://dask.org/>`_ framework for parallel computing.

Here we demonstrate the results of the performance tests of LSDB for cross-matching operations,
performed on `Bridges2 cluster at Pittsburgh Supercomputing Center <https://www.psc.edu/resources/bridges-2/>`_
using a asingle node with 128 cores and 256 GB of memory.
Here, we demonstrate the results of LSDB performance tests for cross-matching operations,
performed on the `Bridges2 cluster at Pittsburgh Supercomputing Center <https://www.psc.edu/resources/bridges-2/>`_
using a single node with 128 cores and 256 GB of memory.

Cross-matching performance overhead
-----------------------------------

We compare I/O speed and cross-matching performance of LSDB on an example cross-matching of
ZTF DR14 (metadata only, 1.2B rows, 60GB)
and Gaia DR3 (1.8B rows, 972GB) catalogs.

The cross-matching took 46 minutes and produced a catalog of 498GB.
LSDB would read more data than it would write, so to get a lower boundary estimate we would use the output size, which gives us 185MB/s of the cross-matching speed.
LSDB would read more data than it would write in this case, so to get a lower boundary estimate, we use the output size, which gives us 185MB/s as the cross-matching speed.

We compare this to just copying both catalogs with ``cp -r`` command, which took 86 minutes and produced 1030GB of data,
which corresponds to 204MB/s as the copy speed.

We compare it to just copying both catalogs with ``cp -r`` command, which took 86 minutes and produced 1030GB of data,
which corresponds to 204MB/s of the copy speed.
These allow us to conclude that LSDB cross-matching overhead is 5-15% compared to the I/O operations.
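The quoted speeds follow from simple arithmetic, sketched below using the binary GB-to-MB conversion that reproduces the rounded numbers in the text.

```python
# Cross-match: 498 GB written in 46 minutes
xmatch_speed = 498 * 1024 / (46 * 60)   # ~185 MB/s
# Plain copy: 1030 GB copied in 86 minutes
copy_speed = 1030 * 1024 / (86 * 60)    # ~204 MB/s

overhead = 1 - xmatch_speed / copy_speed  # ~0.10, i.e. within the quoted 5-15% range
print(round(xmatch_speed), round(copy_speed), round(overhead * 100, 1))
```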

The details of this analysis are given in
@@ -65,13 +67,13 @@ The results of the analysis are shown in the following plot:

Some observations from the plot:

* Construction of the ``SkyCoord`` objects in astropy is the most time-consuming step; in this step spherical coordinates are converted to Cartesian, so ``match_coordinates_sky()`` has less work to do comparing to other algorithms. So if your analysis doesn't require the ``SkyCoord`` objects anywhere else, it would be more fair to add up the time of the ``SkyCoord`` objects construction and the ``match_coordinates_sky()`` execution.
* Construction of the ``SkyCoord`` objects in astropy is the most time-consuming step; in this step, spherical coordinates are converted to Cartesian, so ``match_coordinates_sky()`` has less work to do compared to other algorithms. So if your analysis doesn't require the ``SkyCoord`` objects anywhere else, it would be more fair to add up the time of the ``SkyCoord`` objects construction and the ``match_coordinates_sky()`` execution.
* All algorithms but LSDB have a nearly linear dependency on the number of rows in the input catalogs starting from a small number of rows. LSDB has a constant overhead associated with the graph construction and Dask overhead, which is negligible for large catalogs, where the time starts to grow linearly.
* LSDB is the only method allowing to parallelize the cross-matching operation, so we run it with 1, 4, 16, and 64 workers.
* 16 and 64-worker cases show the same performance, which shows the limits of the parallelization, at least with the hardware setup used in the analysis.
* LSDB is the only method allowing users to parallelize the cross-matching operation, so we run it with 1, 4, 16, and 64 workers.
* 16 and 64-worker cases show the same performance, which demonstrates the limits of the parallelization, at least with the hardware setup used in the analysis.
* Despite the fact that LSDB's crossmatching algorithm does similar work converting spherical coordinates to Cartesian, it's getting faster than astropy's algorithm for larger catalogs, even with a single worker. This is probably due to the fact that LSDB utilises a batching approach, which constructs shallower k-D trees for each partition of the data, and thus less time is spent on the tree traversal.

Summarizing, the cross-matching approach implemented in LSDB is competitive with the existing tools and is more efficient for large catalogs, starting with roughly one million rows.
Also, LSDB allows work with out-of-memory datasets, which is not possible with astropy and smatch, and not demonstrated in the analysis.
Also, LSDB enables the use of out-of-memory datasets, which is not possible with astropy and smatch, and not demonstrated in the analysis.
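
As a rough sketch of how such a parallel run is set up — not the benchmark's exact configuration — catalog paths, worker counts, and the search radius below are placeholders.

```python
from dask.distributed import Client

import lsdb

# The benchmark varied the worker count between 1 and 64; paths are placeholders.
client = Client(n_workers=16)

gaia = lsdb.read_hats("path/to/gaia_dr3")
ztf = lsdb.read_hats("path/to/ztf_dr14", margin_cache="path/to/ztf_dr14_margin")

# Build the lazy crossmatch graph, then let Dask execute it across the workers.
matched = gaia.crossmatch(ztf, radius_arcsec=1, n_neighbors=1)
result = matched.compute()

client.close()
```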

The complete code of the analysis is available `here <https://github.com/lincc-frameworks/notebooks_lf/tree/main/sprints/2024/05_30/xmatch_bench>`_.
22 changes: 11 additions & 11 deletions docs/tutorials/remote_data.ipynb
@@ -10,7 +10,7 @@
"\n",
"We use [fsspec](https://github.com/fsspec/filesystem_spec) and [universal_pathlib](https://github.com/fsspec/universal_pathlib) to create connections to remote data sources. Please refer to their documentation for a list of supported filesystems and any filesystem-specific parameters.\n",
"\n",
"If you're using pypi/pip for package management, you can install ALL of the fsspec implementations, as well as some other nice-to-have dependencies with `pip install 'lsdb[full]'`\n",
"If you're using PyPI/pip for package management, you can install ALL of the fsspec implementations, as well as some other nice-to-have dependencies with `pip install 'lsdb[full]'`.\n",
"\n",
"Below, we provide some a basic workflow for accessing remote data, as well as filesystem-specific hints."
]
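A minimal sketch of that basic workflow over HTTPS, using universal_pathlib's `UPath`; the catalog URL is a placeholder, not a real endpoint guaranteed to exist.

```python
import lsdb
from upath import UPath

# Any HATS catalog served over HTTP(S); replace with a real endpoint.
catalog_path = UPath("https://example.org/hats/my_catalog")

catalog = lsdb.read_hats(catalog_path)
catalog
```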
@@ -57,16 +57,16 @@
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Occasionally, with HTTPS data, you may see issues with missing certificates. If you encounter a `FileNotFoundError`, but you're pretty sure the file should be found:\n",
"\n",
"1. Check your network and server availability\n",
"2. On Linux, be sure that openSSL and ca-certificates are in place\n",
"3. On Mac, run `/Applications/Python\\ 3.*/Install\\ Certificates.command`"
]
}
"cell_type": "markdown",
"metadata": {},
"source": [
"Occasionally, with HTTPS data, you may see issues with missing certificates. If you encounter a `FileNotFoundError`, but you're pretty sure the file should be found:\n",
"\n",
"1. Check your network and server availability\n",
"2. On Linux, be sure that openSSL and ca-certificates are in place\n",
"3. On Mac, run `/Applications/Python\\ 3.*/Install\\ Certificates.command`"
]
}
],
"metadata": {
"kernelspec": {
