diff --git a/docs/source/raft_ann_benchmarks.html b/docs/source/raft_ann_benchmarks.html deleted file mode 100644 index ca3c7dad2b..0000000000 --- a/docs/source/raft_ann_benchmarks.html +++ /dev/null @@ -1,1459 +0,0 @@ -raft_ann_benchmarks

RAFT ANN Benchmarks

-

This project provides a benchmark program for various ANN search implementations. It’s especially suitable for comparing GPU implementations as well as comparing GPU against CPU.

-

Installing the benchmarks

-

There are two main ways pre-compiled benchmarks are distributed: Docker and conda. The following sub sections demonstrate how to install and run each path.

-

Docker

-

We provide images for GPU enabled systems, as well as systems without a GPU. The following images are available:

- -

The containers can be used in two manners:

- -

For GPU systems, where $DATA_FOLDER is a local folder where you want datasets stored in $DATA_FOLDER/datasets and results in $DATA_FOLDER/results:

-
docker run --gpus all --rm -it \
-    -v $DATA_FOLDER:/home/rapids/benchmarks  \
-    rapidsai/raft-ann-bench:23.10a-cuda11.8-py3.10 \
-     "--dataset deep-image-96-angular" \
-     "--normalize" \
-     "--algorithms raft_cagra" \
-     ""
-
- -

Where:

-
docker run --gpus all --rm -it \
-    -v $DATA_FOLDER:/home/rapids/benchmarks  \ # <- local folder to store datasets and results
-    rapidsai/raft-ann-bench:23.10a-cuda11.8-py3.10 \ # <- image to use, either `raft-ann-bench` or `raft-ann-bench-datasets`
-     "--dataset deep-image-96-angular" \ # <- dataset name
-     "--normalize" \ # <- whether to normalize the dataset
-     "--algorithms raft_cagra" \ # <- what algorithm(s) to use as a ; separated list, as well as any other argument to pass to `raft_ann_benchmarks.run`
-     "" # optional arguments to pass to `raft_ann_benchmarks.plot`
-
- -

For CPU systems the same interface applies, except that the --gpus argument is not needed and the CPU images are used: -

docker run --rm -it \
-    -v $DATA_FOLDER:/home/rapids/benchmarks  \
-    rapidsai/raft-ann-bench-cpu:23.10a-py3.10 \
-     "--dataset deep-image-96-angular" \
-     "--normalize" \
-     "--algorithms raft_cagra" \
-     ""
-

- -
docker run --gpus all --rm -it \
-    -v $DATA_FOLDER:/home/rapids/benchmarks  \
-    rapidsai/raft-ann-bench:23.10a-cuda11.8-py3.10 \
-    --entrypoint /bin/bash
-
- -

Conda

-

If containers are not an option or not preferred, the easiest way to install the ANN benchmarks is through conda. We provide packages for GPU enabled systems, as well for systems without a GPU. We suggest using mamba as it generally leads to a faster install time:

-
mamba create --name raft_ann_benchmarks
-conda activate raft_ann_benchmarks
-
-# to install GPU package:
-mamba install -c rapidsai -c conda-forge -c nvidia raft-ann-bench cuda-version=11.8*
-
-# to install CPU package for usage in CPU-only systems:
-mamba install -c rapidsai -c conda-forge  raft-ann-bench-cpu
-
- -

The channel rapidsai can easily be substituted with rapidsai-nightly if nightly benchmarks are desired. The CPU package currently allows running the HNSW benchmarks.

-

Please see the build instructions to build the benchmarks from source.

-

Running the benchmarks

-

Python Package Usage

-

There are 4 general steps to running the benchmarks and visualizing the results:
1. Prepare Dataset
2. Build Index and Search Index
3. Data Export
4. Plot Results

-

We provide a collection of lightweight Python scripts that are wrappers over lower-level scripts and executables to run our benchmarks. Either the Python scripts or the low-level scripts and executables are valid methods to run benchmarks; however, plots are only provided through our Python scripts.

-

End-to-end example: Million-scale

-

The steps below demonstrate how to download, install, and run benchmarks on a subset of 10M vectors from the Yandex Deep-1B dataset. By default, the datasets will be stored in and used from the folder indicated by the RAPIDS_DATASET_ROOT_DIR environment variable if defined, otherwise in a datasets sub-folder relative to where the script is called:

-
# (1) prepare dataset.
-python -m raft-ann-bench.get_dataset --dataset deep-image-96-angular --normalize
-
-# (2) build and search index
-python -m raft-ann-bench.run --dataset deep-image-96-inner
-
-# (3) export data
-python -m raft-ann-bench.data_export --dataset deep-image-96-inner
-
-# (4) plot results
-python -m raft-ann-bench.plot --dataset deep-image-96-inner
-
- -

Configuration files already exist for the following million-scale datasets. Please refer to ann-benchmarks datasets for more information, including the actual train and test sizes. These all work out-of-the-box with the --dataset argument. Other million-scale datasets from ann-benchmarks.com will work, but require a json configuration file to be created in $CONDA_PREFIX/lib/python3.xx/site-packages/raft-ann-bench/run/conf, or you can specify the --configuration option to use a specific file.
- deep-image-96-angular
- fashion-mnist-784-euclidean
- glove-50-angular
- glove-100-angular
- lastfm-65-angular
- mnist-784-euclidean
- nytimes-256-angular
- sift-128-euclidean

-

End-to-end example: Billion-scale

-

raft-ann-bench.get_dataset cannot be used to download the billion-scale datasets because they are so large. You should instead use our billion-scale datasets guide to download and prepare them. All other Python scripts mentioned below work as intended once the billion-scale dataset has been downloaded. To download billion-scale datasets, visit big-ann-benchmarks.

-

The steps below demonstrate how to download, install, and run benchmarks on a subset of 100M vectors from the Yandex Deep-1B dataset. Please note that datasets of this scale are recommended for GPUs with larger amounts of memory, such as the A100 or H100. -

mkdir -p datasets/deep-1B
-# (1) prepare dataset
-# download manually "Ground Truth" file of "Yandex DEEP"
-# suppose the file name is deep_new_groundtruth.public.10K.bin
-python -m raft-ann-bench.split_groundtruth --groundtruth datasets/deep-1B/deep_new_groundtruth.public.10K.bin
-# two files 'groundtruth.neighbors.ibin' and 'groundtruth.distances.fbin' should be produced
-
-# (2) build and search index
-python -m raft-ann-bench.run --dataset deep-1B
-
-# (3) export data
-python -m raft-ann-bench.data_export --dataset deep-1B
-
-# (4) plot results
-python -m raft-ann-bench.plot --dataset deep-1B
-

-

The usage of python -m raft-ann-bench.split_groundtruth is: -

usage: split_groundtruth.py [-h] --groundtruth GROUNDTRUTH
-
-options:
-  -h, --help            show this help message and exit
-  --groundtruth GROUNDTRUTH
-                        Path to billion-scale dataset groundtruth file (default: None)
-

-
Step 1: Prepare Dataset
-

The script raft-ann-bench.get_dataset will download and unpack the dataset in a directory that the user provides. As of now, only million-scale datasets are supported by this script. For more information on datasets and formats, refer to the dataset configuration section below.

-

The usage of this script is: -

usage: get_dataset.py [-h] [--dataset DATASET] [--dataset-path DATASET_PATH] [--normalize]
-
-options:
-  -h, --help            show this help message and exit
-  --dataset DATASET     dataset to download (default: glove-100-angular)
-  --dataset-path DATASET_PATH
-                        path to download dataset (default: ${RAPIDS_DATASET_ROOT_DIR})
-  --normalize           normalize cosine distance to inner product (default: False)
-

-

When option normalize is provided to the script, any dataset that has cosine distances -will be normalized to inner product. So, for example, the dataset glove-100-angular -will be written at location datasets/glove-100-inner/.
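The normalization can be sketched as follows. This is an illustrative NumPy implementation of the idea (L2-normalizing each row so inner product on the normalized vectors matches cosine similarity on the originals), not the benchmark's actual code:

```python
import numpy as np

def normalize_rows(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row so that the inner product of two
    normalized vectors equals the cosine similarity of the originals."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return x / norms

vectors = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = normalize_rows(vectors)  # each row now has L2 norm 1
```

After this transformation an inner-product index can serve cosine-distance queries, which is why the normalized dataset is written under a -inner name.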

-

Step 2: Build and Search Index

-

The script raft-ann-bench.run will build and search indices for a given dataset and its specified configuration. To configure which algorithms are available, we use algos.yaml. To configure building/searching indices for a dataset, look at index configuration. An entry in algos.yaml looks like: -

raft_ivf_pq:
-  executable: RAFT_IVF_PQ_ANN_BENCH
-  requires_gpu: true
-
executable: specifies the name of the binary that will build/search the index. It is assumed to be available in raft/cpp/build/.
requires_gpu: denotes whether an algorithm requires a GPU to run.

-

The usage of the script raft-ann-bench.run is: -

usage: run.py [-h] [-k COUNT] [-bs BATCH_SIZE] [--configuration CONFIGURATION] [--dataset DATASET] [--dataset-path DATASET_PATH] [--build] [--search] [--algorithms ALGORITHMS] [--indices INDICES]
-              [-f]
-
-options:
-  -h, --help            show this help message and exit
-  -k COUNT, --count COUNT
-                        the number of nearest neighbors to search for (default: 10)
-  -bs BATCH_SIZE, --batch-size BATCH_SIZE
-                        number of query vectors to use in each query trial (default: 10000)
-  --configuration CONFIGURATION
-                        path to configuration file for a dataset (default: None)
-  --dataset DATASET     dataset whose configuration file will be used (default: glove-100-inner)
-  --dataset-path DATASET_PATH
-                        path to dataset folder (default: ${RAPIDS_DATASET_ROOT_DIR})
-  --build
-  --search
-  --algorithms ALGORITHMS
-                        run only comma separated list of named algorithms (default: None)
-  --indices INDICES     run only comma separated list of named indices. parameter `algorithms` is ignored (default: None)
-  -f, --force           re-run algorithms even if their results already exist (default: False)
-

-

configuration and dataset : configuration is the path to a configuration file for a given dataset. The configuration file should be named <dataset>.json. It is optional if the name of the dataset is provided with the dataset argument, in which case a configuration file will be searched for at python/raft-ann-bench/src/raft-ann-bench/run/conf/<dataset>.json. For every algorithm run by this script, it outputs an index build statistics JSON file at <dataset-path>/<dataset>/result/build/<algo>-k{k}-batch_size{batch_size}.json and an index search statistics JSON file at <dataset-path>/<dataset>/result/search/<algo>-k{k}-batch_size{batch_size}.json.

-

dataset-path :
1. data is read from <dataset-path>/<dataset>
2. indices are built in <dataset-path>/<dataset>/index
3. build/search results are stored in <dataset-path>/<dataset>/result

-

build and search : if neither parameter is supplied to the script, then both are assumed to be True.
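This defaulting behaviour can be sketched with a small argparse example. It is illustrative only; resolve_modes is a hypothetical helper, not part of the benchmark's actual parser:

```python
import argparse

# Two boolean flags, both off by default, mirroring --build/--search.
parser = argparse.ArgumentParser()
parser.add_argument("--build", action="store_true")
parser.add_argument("--search", action="store_true")

def resolve_modes(args: argparse.Namespace):
    """If neither --build nor --search is supplied, do both."""
    if not args.build and not args.search:
        return True, True
    return args.build, args.search
```

So running with no flags builds and searches, while passing only one flag restricts the run to that phase.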

-

indices and algorithms : these parameters ensure that the algorithm specified for an index is available in algos.yaml, is not disabled, and has an associated executable.

-

Step 3: Data Export

-

The script raft-ann-bench.data_export will convert the intermediate JSON outputs produced by raft-ann-bench.run to more easily readable CSV files, which are needed to build the charts made by raft-ann-bench.plot.

-

usage: data_export.py [-h] [--dataset DATASET] [--dataset-path DATASET_PATH]
-
-options:
-  -h, --help            show this help message and exit
-  --dataset DATASET     dataset to download (default: glove-100-inner)
-  --dataset-path DATASET_PATH
-                        path to dataset folder (default: ${RAPIDS_DATASET_ROOT_DIR})
-
-

The build statistics CSV file is stored at <dataset-path>/<dataset>/result/build/<algo>-k{k}-batch_size{batch_size}.csv and the index search statistics CSV file at <dataset-path>/<dataset>/result/search/<algo>-k{k}-batch_size{batch_size}.csv.
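Once exported, the search CSV can be analyzed with ordinary tools such as pandas. The sketch below assumes the exported file has recall and qps columns; check the actual CSV header produced by raft-ann-bench.data_export before relying on these names:

```python
import pandas as pd

def best_config_at_recall(df: pd.DataFrame, min_recall: float = 0.95):
    """Return the run with the highest throughput among those meeting
    the recall threshold, or None if no run qualifies.
    The 'recall' and 'qps' column names are assumptions about the export."""
    eligible = df[df["recall"] >= min_recall]
    if eligible.empty:
        return None
    return eligible.loc[eligible["qps"].idxmax()]

# e.g. df = pd.read_csv("<dataset-path>/<dataset>/result/search/...csv")
```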

-

Step 4: Plot Results

-

The script raft-ann-bench.plot will plot results for all algorithms found in the index search statistics CSV files at <dataset-path>/<dataset>/result/search/<algo>-k{k}-batch_size{batch_size}.csv.

-

The usage of this script is: -

usage: plot.py [-h] [--dataset DATASET] [--dataset-path DATASET_PATH] [--output-filepath OUTPUT_FILEPATH] [--algorithms ALGORITHMS] [-k COUNT] [-bs BATCH_SIZE] [--build] [--search]
-               [--x-scale X_SCALE] [--y-scale {linear,log,symlog,logit}] [--raw]
-
-options:
-  -h, --help            show this help message and exit
-  --dataset DATASET     dataset to download (default: glove-100-inner)
-  --dataset-path DATASET_PATH
-                        path to dataset folder (default: ${RAPIDS_DATASET_ROOT_DIR})
-  --output-filepath OUTPUT_FILEPATH
-                        directory for PNG to be saved (default: os.getcwd())
-  --algorithms ALGORITHMS
-                        plot only comma separated list of named algorithms (default: None)
-  -k COUNT, --count COUNT
-                        the number of nearest neighbors to search for (default: 10)
-  -bs BATCH_SIZE, --batch-size BATCH_SIZE
-                        number of query vectors to use in each query trial (default: 10000)
-  --build
-  --search
-  --x-scale X_SCALE     Scale to use when drawing the X-axis. Typically linear, logit or a2 (default: linear)
-  --y-scale {linear,log,symlog,logit}
-                        Scale to use when drawing the Y-axis (default: linear)
-  --raw                 Show raw results (not just Pareto frontier) in faded colours (default: False)
-

-

The figure below is the resulting plot of running our benchmarks as of August 2023 for a batch size of 10, on an NVIDIA H100 GPU and an Intel Xeon Platinum 8480CL CPU. It presents the throughput (in Queries-Per-Second) performance for every level of recall.

-

Throughput vs recall plot comparing popular ANN algorithms with RAFT's at batch size 10

-

Creating and customizing dataset configurations

-

A single configuration file will often define a set of algorithms, with associated index and search parameters, for a specific dataset. A configuration file uses JSON format with 4 major parts:
1. Dataset information
2. Algorithm information
3. Index parameters
4. Search parameters

-

Below is a simple example configuration file for the 1M-scale sift-128-euclidean dataset:

-
{
-  "dataset": {
-    "name": "sift-128-euclidean",
-    "base_file": "sift-128-euclidean/base.fbin",
-    "query_file": "sift-128-euclidean/query.fbin", 
-    "subset_size": 1000000,
-    "groundtruth_neighbors_file": "sift-128-euclidean/groundtruth.neighbors.ibin",
-    "distance": "euclidean"
-  },
-  "index": []
-}
-
- -

The index section will contain a list of index objects, each of which will have the following form: -

{
-   "name": "algo_name.unique_index_name",
-   "algo": "algo_name",
-   "file": "sift-128-euclidean/algo_name/param1_val1-param2_val2",
-   "build_param": { "param1": "val1", "param2": "val2" },
-   "search_params": [{ "search_param1": "search_val1" }]
-}
-
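When sweeping many build-parameter combinations, the list of index objects can be generated programmatically rather than written by hand. The sketch below emits entries following the schema above; the naming and file-path scheme is an illustrative choice, not a requirement of the benchmark:

```python
import itertools

def make_index_entries(algo, dataset, build_grid, search_params):
    """Generate one index object per combination of build parameters.
    build_grid maps each build parameter to a list of candidate values;
    the name/file scheme below is illustrative, not prescribed."""
    entries = []
    keys = sorted(build_grid)
    for values in itertools.product(*(build_grid[k] for k in keys)):
        build_param = dict(zip(keys, values))
        tag = "-".join(f"{k}_{v}" for k, v in build_param.items())
        entries.append({
            "name": f"{algo}.{tag}",
            "algo": algo,
            "file": f"{dataset}/{algo}/{tag}",
            "build_param": build_param,
            "search_params": search_params,
        })
    return entries
```

The resulting list can be dropped into the "index" field of the configuration file with json.dump.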

-

The table below contains the possible settings for the algo field. Each unique algorithm will have its own set of build_param and search_params settings. The ANN Algorithm Parameter Tuning Guide contains detailed instructions on choosing build and search parameters for each supported algorithm.

Library   Algorithms
FAISS     faiss_gpu_ivf_flat, faiss_gpu_ivf_pq
GGNN      ggnn
HNSWlib   hnswlib
RAFT      raft_cagra, raft_ivf_flat, raft_ivf_pq
-

By default, the index will be placed in bench/ann/data/<dataset_name>/index/<name>. Using sift-128-euclidean for the dataset with the algo example above, the indexes would be placed in bench/ann/data/sift-128-euclidean/index/algo_name/param1_val1-param2_val2.

-

Adding a new ANN algorithm

-

Implementation and Configuration

-

Implementation of a new algorithm should be a C++ class that inherits class ANN (defined in cpp/bench/ann/src/ann.h) and implements all the pure virtual functions.

-

In addition, it should define two structs for building and searching parameters. The searching parameter class should inherit struct ANN<T>::AnnSearchParam. Take class HnswLib as an example, its definition is: -

template<typename T>
-class HnswLib : public ANN<T> {
-public:
-  struct BuildParam {
-    int M;
-    int ef_construction;
-    int num_threads;
-  };
-
-  using typename ANN<T>::AnnSearchParam;
-  struct SearchParam : public AnnSearchParam {
-    int ef;
-    int num_threads;
-  };
-
-  // ...
-};
-

-

The benchmark program uses a JSON configuration file to specify the indexes to build, along with the build and search parameters. To add the new algorithm to the benchmark, you need to be able to specify build_param, whose value is a JSON object, and search_params, whose value is an array of JSON objects, for this algorithm in the configuration file. The build_param and search_params arguments will vary depending on the algorithm. Take the configuration for HnswLib as an example: -

{
-  "name" : "hnswlib.M12.ef500.th32",
-  "algo" : "hnswlib",
-  "build_param": {"M":12, "efConstruction":500, "numThreads":32},
-  "file" : "/path/to/file",
-  "search_params" : [
-    {"ef":10, "numThreads":1},
-    {"ef":20, "numThreads":1},
-    {"ef":40, "numThreads":1}
-  ],
-  "search_result_file" : "/path/to/file"
-},
-
How to interpret these JSON objects is entirely left to the implementation and should be specified in cpp/bench/ann/src/factory.cuh:
1. First, add two functions for parsing the JSON object into struct BuildParam and struct SearchParam, respectively: -
template<typename T>
-void parse_build_param(const nlohmann::json& conf,
-                       typename cuann::HnswLib<T>::BuildParam& param) {
-  param.ef_construction = conf.at("efConstruction");
-  param.M = conf.at("M");
-  if (conf.contains("numThreads")) {
-    param.num_threads = conf.at("numThreads");
-  }
-}
-
-template<typename T>
-void parse_search_param(const nlohmann::json& conf,
-                        typename cuann::HnswLib<T>::SearchParam& param) {
-  param.ef = conf.at("ef");
-  if (conf.contains("numThreads")) {
-    param.num_threads = conf.at("numThreads");
-  }
-}
-

-
2. Next, add a corresponding if case to the functions create_algo() (in cpp/bench/ann/) and create_search_param() by calling the parsing functions. The string literal in the if condition must be the same as the value of algo in the configuration file. For example: -

      // JSON configuration file contains a line like:  "algo" : "hnswlib"
      if (algo == "hnswlib") {
         // ...
      }
-

Adding a CMake Target

-

In raft/cpp/bench/ann/CMakeLists.txt, we provide a CMake function to configure a new Benchmark target with the following signature: -

ConfigureAnnBench(
-  NAME <algo_name> 
-  PATH </path/to/algo/benchmark/source/file> 
-  INCLUDES <additional_include_directories> 
-  CXXFLAGS <additional_cxx_flags>
-  LINKS <additional_link_library_targets>
-)
-

-

To add a target for HNSWLIB, we would call the function as: -

ConfigureAnnBench(
-  NAME HNSWLIB PATH bench/ann/src/hnswlib/hnswlib_benchmark.cpp INCLUDES
-  ${CMAKE_CURRENT_BINARY_DIR}/_deps/hnswlib-src/hnswlib CXXFLAGS "${HNSW_CXX_FLAGS}"
-)
-

-

This will create an executable called HNSWLIB_ANN_BENCH, which can then be used to run HNSWLIB benchmarks.

\ No newline at end of file