From 874189d4eb6ca08cb603e60aebccc24e01c7ca04 Mon Sep 17 00:00:00 2001 From: "Corey J. Nolet" Date: Fri, 8 Sep 2023 17:12:06 -0400 Subject: [PATCH] More updates to ann-bench docs (#1810) Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Divye Gala (https://github.com/divyegala) - Ray Douglass (https://github.com/raydouglass) URL: https://github.com/rapidsai/raft/pull/1810 --- conda/recipes/raft-ann-bench-cpu/meta.yaml | 2 + conda/recipes/raft-ann-bench/meta.yaml | 7 +++ docs/source/ann_benchmarks_param_tuning.md | 66 +++++++++++++--------- docs/source/raft_ann_benchmarks.md | 19 +++---- 4 files changed, 57 insertions(+), 37 deletions(-) diff --git a/conda/recipes/raft-ann-bench-cpu/meta.yaml b/conda/recipes/raft-ann-bench-cpu/meta.yaml index 699e485d0b..06737b0497 100644 --- a/conda/recipes/raft-ann-bench-cpu/meta.yaml +++ b/conda/recipes/raft-ann-bench-cpu/meta.yaml @@ -50,6 +50,7 @@ requirements: - nlohmann_json {{ nlohmann_json_version }} - python - pyyaml + - pandas run: - glog {{ glog_version }} @@ -57,6 +58,7 @@ requirements: - matplotlib - python - pyyaml + - pandas - benchmark about: diff --git a/conda/recipes/raft-ann-bench/meta.yaml b/conda/recipes/raft-ann-bench/meta.yaml index a5c20b0a28..b817968379 100644 --- a/conda/recipes/raft-ann-bench/meta.yaml +++ b/conda/recipes/raft-ann-bench/meta.yaml @@ -75,6 +75,12 @@ requirements: - faiss-proc=*=cuda - libfaiss {{ faiss_version }} {% endif %} + - h5py {{ h5py_version }} + - benchmark + - matplotlib + - python + - pandas + - pyyaml run: - python @@ -94,6 +100,7 @@ requirements: - glog {{ glog_version }} - matplotlib - python + - pandas - pyyaml about: home: https://rapids.ai/ diff --git a/docs/source/ann_benchmarks_param_tuning.md b/docs/source/ann_benchmarks_param_tuning.md index ca8ffa5e18..712d22f0aa 100644 --- a/docs/source/ann_benchmarks_param_tuning.md +++ b/docs/source/ann_benchmarks_param_tuning.md @@ -11,43 +11,49 @@ IVF-flat uses an inverted-file index, which partitions the 
vectors into a series of clusters, or lists. IVF-flat is a simple algorithm which won't save any space, but it provides competitive search times even at higher levels of recall. -| Parameter | Type | Required | Data Type | Default | Description | |-----------|------------------|----------|---------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `nlists` | `build_param` | Y | Positive Integer >0 | | Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained. | -| `niter` | `build_param` | N | Positive Integer >0 | 20 | Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained. | -| `ratio` | `build_param` | N | Positive Integer >0 | 2 | `1/ratio` is the number of training points which should be used to train the clusters. | -| `nprobe` | `search_params` | Y | Positive Integer >0 | | The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index. | +| Parameter | Type | Required | Data Type | Default | Description | |-----------------------|------------------|----------|----------------------------|----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `nlists` | `build_param` | Y | Positive Integer >0 | | Number of clusters to partition the vectors into. Larger values will put fewer points into each cluster, but this will impact index build time as more clusters need to be trained.
| +| `niter` | `build_param` | N | Positive Integer >0 | 20 | Number of k-means iterations to use when training the clusters. | +| `ratio` | `build_param` | N | Positive Integer >0 | 2 | `1/ratio` is the number of training points which should be used to train the clusters. | +| `dataset_memory_type` | `build_param` | N | ["device", "host", "mmap"] | "device" | What memory type should the dataset reside in? | +| `query_memory_type` | `search_params` | N | ["device", "host", "mmap"] | "device" | What memory type should the queries reside in? | +| `nprobe` | `search_params` | Y | Positive Integer >0 | | The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index. | ### `raft_ivf_pq` IVF-pq is an inverted-file index, which partitions the vectors into a series of clusters, or lists, in a similar way to IVF-flat above. The difference is that IVF-PQ uses product quantization to also compress the vectors, giving the index a smaller memory footprint. Unfortunately, higher levels of compression can also shrink recall, which a refinement step can improve when the original vectors are still available. -| Parameter | Type | Required | Data Type | Default | Description | |-------------------------|----------------|---|------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `nlists` | `build_param` | Y | Positive Integer >0 | | Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained. | -| `niter` | `build_param` | N | Positive Integer >0 | 20 | Number of k-means iterations to use when training the clusters.
| -| `ratio` | `build_param` | N | Positive Integer >0 | 2 | `1/ratio` is the number of training points which should be used to train the clusters. | +| Parameter | Type | Required | Data Type | Default | Description | |-------------------------|----------------|---|----------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `nlists` | `build_param` | Y | Positive Integer >0 | | Number of clusters to partition the vectors into. Larger values will put fewer points into each cluster, but this will impact index build time as more clusters need to be trained. | +| `niter` | `build_param` | N | Positive Integer >0 | 20 | Number of k-means iterations to use when training the clusters. | +| `ratio` | `build_param` | N | Positive Integer >0 | 2 | `1/ratio` is the number of training points which should be used to train the clusters. | | `pq_dim` | `build_param` | N | Positive Integer. Multiple of 8. | 0 | Dimensionality of the vector after product quantization. When 0, a heuristic is used to select this value. `pq_dim` * `pq_bits` must be a multiple of 8. | -| `pq_bits` | `build_param` | N | Positive Integer. [4-8] | 8 | Bit length of the vector element after quantization. | -| `codebook_kind` | `build_param` | N | ["cluster", "subspace"] | "subspace" | Type of codebook. See the [API docs](https://docs.rapids.ai/api/raft/nightly/cpp_api/neighbors_ivf_pq/#_CPPv412codebook_gen) for more detail | -| `nprobe` | `search_params` | Y | Positive Integer >0 | | The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index. | -| `internalDistanceDtype` | `search_params` | N | [`float`, `half`] | `half` | The precision to use for the distance computations. Lower precision can increase performance at the cost of accuracy.
| -| `smemLutDtype` | `search_params` | N | [`float`, `half`, `fp8`] | `half` | The precision to use for the lookup table in shared memory. Lower precision can increase performance at the cost of accuracy. | -| `refine_ratio` | `search_params` | N| Positive Number >=0 | 0 | `refine_ratio * k` nearest neighbors are queried from the index initially and an additional refinement step improves recall by selecting only the best `k` neighbors. | +| `pq_bits` | `build_param` | N | Positive Integer. [4-8] | 8 | Bit length of the vector element after quantization. | +| `codebook_kind` | `build_param` | N | ["cluster", "subspace"] | "subspace" | Type of codebook. See the [API docs](https://docs.rapids.ai/api/raft/nightly/cpp_api/neighbors_ivf_pq/#_CPPv412codebook_gen) for more detail. | +| `dataset_memory_type` | `build_param` | N | ["device", "host", "mmap"] | "device" | What memory type should the dataset reside in? | +| `query_memory_type` | `search_params` | N | ["device", "host", "mmap"] | "device" | What memory type should the queries reside in? | +| `nprobe` | `search_params` | Y | Positive Integer >0 | | The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index. | +| `internalDistanceDtype` | `search_params` | N | [`float`, `half`] | `half` | The precision to use for the distance computations. Lower precision can increase performance at the cost of accuracy. | +| `smemLutDtype` | `search_params` | N | [`float`, `half`, `fp8`] | `half` | The precision to use for the lookup table in shared memory. Lower precision can increase performance at the cost of accuracy. | +| `refine_ratio` | `search_params` | N | Positive Number >=0 | 0 | `refine_ratio * k` nearest neighbors are queried from the index initially and an additional refinement step improves recall by selecting only the best `k` neighbors.
| ### `raft_cagra` CAGRA uses a graph-based index, which creates an intermediate, approximate kNN graph using IVF-PQ and then further refines and optimizes it to create a final kNN graph. This kNN graph is used by CAGRA as an index for search. -| Parameter | Type | Required | Data Type | Default | Description | |-----------|----------------|----------|---------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `graph_degree` | `build_param` | N | Positive Integer >0 | 64 | Degree of the final kNN graph index. | -| `intermediate_graph_degree` | `build_param` | N | Positive Integer >0 | 128 | Degree of the intermediate kNN graph. | -| `itopk` | `search_wdith` | N | Positive Integer >0 | 64 | Number of intermediate search results retained during the search. Higher values improve search accuracy at the cost of speed. | -| `search_width` | `search_param` | N | Positive Integer >0 | 1 | Number of graph nodes to select as the starting point for the search in each iteration. | -| `max_iterations` | `search_param` | N | Integer >=0 | 0 | Upper limit of search iterations. Auto select when 0. | -| `algo` | `search_param` | N | string | "auto" | Algorithm to use for search. Possible values: {"auto", "single_cta", "multi_cta", "multi_kernel"} | +| Parameter | Type | Required | Data Type | Default | Description | |-----------------------------|----------------|----------|----------------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `graph_degree` | `build_param` | N | Positive Integer >0 | 64 | Degree of the final kNN graph index. | +| `intermediate_graph_degree` | `build_param` | N | Positive Integer >0 | 128 | Degree of the intermediate kNN graph.
| +| `dataset_memory_type` | `build_param` | N | ["device", "host", "mmap"] | "device" | What memory type should the dataset reside in? | +| `query_memory_type` | `search_params` | N | ["device", "host", "mmap"] | "device" | What memory type should the queries reside in? | +| `itopk` | `search_param` | N | Positive Integer >0 | 64 | Number of intermediate search results retained during the search. Higher values improve search accuracy at the cost of speed. | +| `search_width` | `search_param` | N | Positive Integer >0 | 1 | Number of graph nodes to select as the starting point for the search in each iteration. | +| `max_iterations` | `search_param` | N | Integer >=0 | 0 | Upper limit of search iterations. Auto select when 0. | +| `algo` | `search_param` | N | string | "auto" | Algorithm to use for search. Possible values: {"auto", "single_cta", "multi_cta", "multi_kernel"} | ## FAISS Indexes @@ -85,6 +91,12 @@ IVF-pq is an inverted-file index, which partitions the vectors into a series of ### `hnswlib` -## GGNN Index +| Parameter | Type | Required | Data Type | Default | Description | |------------------|-----------------|----------|--------------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `efConstruction` | `build_param` | Y | Positive Integer >0 | | Controls index time and accuracy. Bigger values increase the index quality. At some point, increasing this will no longer improve the quality. | +| `M` | `build_param` | Y | Positive Integer often between 2-100 | | Number of bi-directional links created for every new element during construction.
Higher values work for higher intrinsic dimensionality and/or high recall, while low values can work for datasets with low intrinsic dimensionality and/or low recall. Also affects the algorithm's memory consumption. | +| `numThreads` | `build_param` | N | Positive Integer >0 | 1 | Number of threads to use to build the index. | +| `ef` | `search_param` | Y | Positive Integer >0 | | Size of the dynamic list for the nearest neighbors used for search. A higher value leads to a more accurate but slower search. Cannot be lower than `k`. | +| `numThreads` | `search_params` | N | Positive Integer >0 | 1 | Number of threads to use for queries. | -### `ggnn` +Please refer to the [HNSW algorithm parameters guide](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md) from `hnswlib` to learn more about these arguments. diff --git a/docs/source/raft_ann_benchmarks.md b/docs/source/raft_ann_benchmarks.md index af0b040d34..8ae2d2535b 100644 --- a/docs/source/raft_ann_benchmarks.md +++ b/docs/source/raft_ann_benchmarks.md @@ -34,12 +34,11 @@ There are 4 general steps to running the benchmarks and visualizing the results: We provide a collection of lightweight Python scripts that are wrappers over lower level scripts and executables to run our benchmarks. Either Python scripts or [low-level scripts and executables](ann_benchmarks_low_level.md) are valid methods to run benchmarks, -however plots are only provided through our Python scripts. An environment variable `RAFT_HOME` is -expected to be defined to run these scripts; this variable holds the directory where RAFT is cloned. +however plots are only provided through our Python scripts.
### End-to-end example: Million-scale -The steps below demonstrate how to download, install, and run benchmarks on a subset of 10M vectors from the Yandex Deep-1B dataset By default the datasets will be stored and used from the folder indicated by the RAPIDS_DATASET_ROOT_DIR environment variable if defined, otherwise a datasets sub-folder from where the script is being called: +The steps below demonstrate how to download, install, and run benchmarks on a subset of 10M vectors from the Yandex Deep-1B dataset. By default, the datasets will be stored and used from the folder indicated by the `RAPIDS_DATASET_ROOT_DIR` environment variable if defined, otherwise from a `datasets` sub-folder in the directory where the script is being called: ```bash @@ -56,7 +55,7 @@ python -m raft-ann-bench.data_export --dataset deep-image-96-inner python -m raft-ann-bench.plot --dataset deep-image-96-inner ``` -Configuration files already exist for the following list of the million-scale datasets. Please refer to [ann-benchmarks datasets](https://github.com/erikbern/ann-benchmarks/#data-sets) for more information, including actual train and sizes. These all work out-of-the-box with the `--dataset` argument. Other million-scale datasets from `ann-benchmarks.com` will work, but will require a json configuration file to be created in `python/raft-ann-bench/src/raft-ann-bench/run/conf`. +Configuration files already exist for the following list of million-scale datasets. Please refer to [ann-benchmarks datasets](https://github.com/erikbern/ann-benchmarks/#data-sets) for more information, including actual train and test sizes. These all work out-of-the-box with the `--dataset` argument. Other million-scale datasets from `ann-benchmarks.com` will work, but will require a JSON configuration file to be created in `$CONDA_PREFIX/lib/python3.xx/site-packages/raft-ann-bench/run/conf`, or you can specify the `--configuration` option to use a specific file.
- `deep-image-96-angular` - `fashion-mnist-784-euclidean` - `glove-50-angular` @@ -93,7 +92,7 @@ python -m raft-ann-bench.data_export --dataset deep-1B python -m raft-ann-bench.plot --dataset deep-1B ``` -The usage of `python -m raft-ann-bench.split-groundtruth` is: +The usage of `python -m raft-ann-bench.split_groundtruth` is: ```bash usage: split_groundtruth.py [-h] --groundtruth GROUNDTRUTH @@ -125,7 +124,7 @@ will be normalized to inner product. So, for example, the dataset `glove-100-ang will be written at location `datasets/glove-100-inner/`. #### Step 2: Build and Search Index -The script `bench/ann/run.py` will build and search indices for a given dataset and its +The script `raft-ann-bench.run` will build and search indices for a given dataset and its specified configuration. To configure which algorithms are available, we use `algos.yaml`. To configure building/searching indices for a dataset, look at [index configuration](#json-index-config). @@ -182,7 +181,7 @@ it is assumed both are `True`. is available in `algos.yaml` and not disabled, as well as having an associated executable. #### Step 3: Data Export -The script `bench/ann/data_export.py` will convert the intermediate JSON outputs produced by `raft-ann-bench.run` to more +The script `raft-ann-bench.data_export` will convert the intermediate JSON outputs produced by `raft-ann-bench.run` to more easily readable CSV files, which are needed to build charts made by `raft-ann-bench.plot`. ```bash @@ -198,7 +197,7 @@ Build statistics CSV file is stored in `/result/build//result/search/`. #### Step 4: Plot Results -The script `bench/ann/plot.py` will plot results for all algorithms found in index search statistics +The script `raft-ann-bench.plot` will plot results for all algorithms found in index search statistics CSV file in `/result/search/<-k{k}-batch_size{batch_size}>.csv`.
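As a hedged illustration of where Step 4 looks for its input, the following Python sketch assembles a search-statistics CSV path following the `-k{k}-batch_size{batch_size}.csv` pattern described above; the helper name and its arguments are hypothetical, not part of the `raft-ann-bench` tooling:

```python
import os

def search_result_csv(dataset_dir: str, index_name: str, k: int, batch_size: int) -> str:
    # Hypothetical helper: mirrors the documented naming pattern
    # <dataset dir>/result/search/<index>-k{k}-batch_size{batch_size}.csv
    return os.path.join(dataset_dir, "result", "search",
                        f"{index_name}-k{k}-batch_size{batch_size}.csv")

print(search_result_csv("datasets/deep-image-96-inner", "raft_cagra", 10, 10000))
```

On a POSIX filesystem this yields a path like `datasets/deep-image-96-inner/result/search/raft_cagra-k10-batch_size10000.csv`.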
The usage of this script is: @@ -262,7 +261,7 @@ The `index` section will contain a list of index objects, each of which will hav "algo": "algo_name", "file": "sift-128-euclidean/algo_name/param1_val1-param2_val2", "build_param": { "param1": "val1", "param2": "val2" }, - "search_params": { "search_param1": "search_val1" } + "search_params": [{ "search_param1": "search_val1" }] } ``` @@ -345,7 +344,7 @@ How to interpret these JSON objects is totally left to the implementation and sh } ``` -2. Next, add corresponding `if` case to functions `create_algo()` (in `bench/ann/) and `create_search_param()` by calling parsing functions. The string literal in `if` condition statement must be the same as the value of `algo` in configuration file. For example, +2. Next, add a corresponding `if` case to the functions `create_algo()` (in `cpp/bench/ann/`) and `create_search_param()` by calling the parsing functions. The string literal in the `if` condition must be the same as the value of `algo` in the configuration file. For example, ```c++ // JSON configuration file contains a line like: "algo" : "hnswlib" if (algo == "hnswlib") {
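Since the patch changes `search_params` from a single object to a list, a small sketch may help. This Python snippet shows the shape an index entry takes after the change; the algorithm name, file path, and parameter names are the placeholder values from the docs' example, not a working configuration:

```python
import json

# Illustrative index entry in the shape described in the patch. All values
# are placeholders taken from the documentation example.
index_entry = {
    "algo": "algo_name",
    "file": "sift-128-euclidean/algo_name/param1_val1-param2_val2",
    "build_param": {"param1": "val1", "param2": "val2"},
    # Now a list: each element is one set of search parameters to benchmark.
    "search_params": [{"search_param1": "search_val1"}],
}

print(json.dumps(index_entry, indent=2))
```

Serializing with `json.dumps` and reloading with `json.loads` is a quick way to sanity-check a hand-edited configuration file before passing it to the benchmark scripts.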