DOC Multiple updates to docs and remove straggling print in code
dantegd committed Oct 5, 2023
1 parent 46d0c7e commit e8f0d68
Showing 2 changed files with 95 additions and 80 deletions.
173 changes: 95 additions & 78 deletions docs/source/raft_ann_benchmarks.md
@@ -2,11 +2,86 @@

This project provides a benchmark program for various ANN search implementations. It's especially suitable for comparing GPU implementations as well as comparing GPU against CPU.

## Table of Contents

- [Installing and Running the Benchmarks](#installing-and-running-the-benchmarks)
- [Using conda](#conda)
- [End-to-end example: Million-scale](#end-to-end-example-million-scale)
- [Using Docker](#docker)
- [End-to-end example: Billion-scale](#end-to-end-example-billion-scale)
- [Creating and customizing dataset configurations](#creating-and-customizing-dataset-configurations)
- [Adding a new ANN algorithm](#adding-a-new-ann-algorithm)

## Installing and Running the Benchmarks

There are two main ways pre-compiled benchmarks are distributed:

- [Conda](#conda): a great solution for users who do not want containers but do want an easy-to-install and easy-to-use Python package. Pip wheels are planned as a future alternative for users who cannot use conda and prefer not to use containers.
- [Docker](#docker): a great solution that only needs Docker and NVIDIA Docker. It provides a single `docker run` command for basic dataset benchmarking, as well as all the functionality of the conda solution inside the containers.

## Conda

If containers are not an option or are not preferred, the easiest way to install the ANN benchmarks is through conda. We provide packages for GPU-enabled systems, as well as for systems without a GPU. We suggest using mamba, as it generally leads to faster install times:

```bash

mamba create --name raft_ann_benchmarks
conda activate raft_ann_benchmarks

# to install GPU package:
mamba install -c rapidsai -c conda-forge -c nvidia raft-ann-bench cuda-version=11.8*

# to install CPU package for usage in CPU-only systems:
mamba install -c rapidsai -c conda-forge raft-ann-bench-cpu
```

The channel `rapidsai` can easily be substituted with `rapidsai-nightly` if nightly benchmarks are desired. The CPU package currently supports running the HNSW benchmarks.
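
For example, a minimal sketch of a nightly GPU install, assuming the package keeps the same name on the `rapidsai-nightly` channel:

```bash
# sketch: swap the rapidsai channel for rapidsai-nightly to pick up nightly builds
mamba install -c rapidsai-nightly -c conda-forge -c nvidia raft-ann-bench cuda-version=11.8*
```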

Please see the [build instructions](ann_benchmarks_build.md) to build the benchmarks from source.

## Running the benchmarks

### Python Package Usage
There are 4 general steps to running the benchmarks and visualizing the results:
1. Prepare Dataset
2. Build Index and Search Index
3. Data Export
4. Plot Results

We provide a collection of lightweight Python scripts that wrap lower-level scripts and executables
to run our benchmarks. Either the Python scripts or the
[low-level scripts and executables](ann_benchmarks_low_level.md) are valid ways to run the benchmarks;
however, plots are only provided through the Python scripts.

### End-to-end example: Million-scale

The steps below demonstrate how to download, install, and run benchmarks on a subset of 10M vectors from the Yandex Deep-1B dataset. By default, the datasets will be stored in and used from the folder indicated by the `RAPIDS_DATASET_ROOT_DIR` environment variable if it is defined, and otherwise from a `datasets` sub-folder of the directory from which the script is called:

```bash

# (1) prepare dataset.
python -m raft-ann-bench.get_dataset --dataset deep-image-96-angular --normalize

# (2) build and search index
python -m raft-ann-bench.run --dataset deep-image-96-inner

# (3) export data
python -m raft-ann-bench.data_export --dataset deep-image-96-inner

# (4) plot results
python -m raft-ann-bench.plot --dataset deep-image-96-inner
```

Configuration files already exist for the following list of million-scale datasets. Please refer to [ann-benchmarks datasets](https://github.com/erikbern/ann-benchmarks/#data-sets) for more information, including the actual train and test sizes. These all work out-of-the-box with the `--dataset` argument. Other million-scale datasets from `ann-benchmarks.com` will also work, but require a JSON configuration file to be created in `$CONDA_PREFIX/lib/python3.xx/site-packages/raft-ann-bench/run/conf`, or you can specify the `--configuration` option to use a specific file (see the sketch after this list).
- `deep-image-96-angular`
- `fashion-mnist-784-euclidean`
- `glove-50-angular`
- `glove-100-angular`
- `mnist-784-euclidean`
- `nytimes-256-angular`
- `sift-128-euclidean`
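
As a minimal sketch, and assuming `--configuration` accepts a path to such a JSON file, a run against a hypothetical custom dataset might look like:

```bash
# hypothetical dataset name and configuration path; the JSON file must describe the dataset and its index parameters
python -m raft-ann-bench.run --dataset my-dataset-96-inner --configuration /path/to/my-dataset.json
```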

## Docker

We provide images for GPU-enabled systems, as well as for systems without a GPU. The following images are available:

@@ -35,7 +110,7 @@ Supported Python versions: 3.9 and 3.10.
docker pull nvcr.io/nvidia/rapidsai/raft-ann-bench:23.08-cuda11.8-py3.10 # substitute raft-ann-bench with the exact desired container.
```

### Container Usage

The container can be used in two different ways:

@@ -44,7 +119,8 @@ The container can be used in two different ways:
For GPU systems, where `$DATA_FOLDER` is a local folder in which you want datasets stored under `$DATA_FOLDER/datasets` and results under `$DATA_FOLDER/result` (we highly recommend that `$DATA_FOLDER` be a folder dedicated to the datasets and results of the containers):

```bash
export DATA_FOLDER=path/to/store/datasets/and/results
docker run --gpus all --rm -it -u $(id -u) \
-v $DATA_FOLDER:/home/rapids/benchmarks \
rapidsai/raft-ann-bench:23.10a-cuda11.8-py3.10 \
"--dataset deep-image-96-angular" \
@@ -56,18 +132,22 @@
Where:

```bash
export DATA_FOLDER=path/to/store/datasets/and/results # <- local folder to store datasets and results
docker run --gpus all --rm -it -u $(id -u) \
-v $DATA_FOLDER:/home/rapids/benchmarks \
rapidsai/raft-ann-bench:23.10a-cuda11.8-py3.10 \ # <- image to use, either `raft-ann-bench` or `raft-ann-bench-datasets`, can choose RAPIDS, cuda and python versions.
"--dataset deep-image-96-angular" \ # <- dataset name
"--normalize" \ # <- whether to normalize the dataset, leave string empty ("") to not normalize.
"--algorithms raft_cagra" \ # <- what algorithm(s) to use as a ; separated list, as well as any other argument to pass to `raft_ann_benchmarks.run`
"" # optional arguments to pass to `raft_ann_benchmarks.plot`
```

**Note about user and file permissions:** The flag `-u $(id -u)` makes the user inside the container match the `uid` of the user outside the container, allowing the container to read and write to the mounted volume indicated by `$DATA_FOLDER`.

For CPU systems the same interface applies, except that the `--gpus` argument is not needed and the CPU images are used:
```bash
export DATA_FOLDER=path/to/store/datasets/and/results
docker run --rm -it -u $(id -u) \
-v $DATA_FOLDER:/home/rapids/benchmarks \
rapidsai/raft-ann-bench-cpu:23.10a-py3.10 \
"--dataset deep-image-96-angular" \
@@ -81,85 +161,22 @@
2. **Using the preinstalled `raft_ann_benchmarks` python package (advanced mode)**: The docker containers are built using the conda packages described in the following section, so they can be used directly as if they were installed manually following the instructions in the next section. This is recommended for advanced users, and is the option that allows the full flexibility of the benchmarking scripts. To use the python scripts directly, use the following command:

```bash
export DATA_FOLDER=path/to/store/datasets/and/results
docker run --gpus all --rm -it -u $(id -u) \
-v $DATA_FOLDER:/home/rapids/benchmarks \
rapidsai/raft-ann-bench:23.10a-cuda11.8-py3.10 \
--entrypoint /bin/bash
```

This will drop you into a command line in the container, with the `raft_ann_benchmarks` python package ready to use, as described in the prior [conda section](#conda):

```
(base) root@00b068fbb862:/home/rapids#
```
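
From this shell, the same Python modules described in the [conda section](#conda) are available. For example, a minimal sketch of preparing and running a benchmark from inside the container:

```bash
# inside the container: the same workflow as the conda package
python -m raft-ann-bench.get_dataset --dataset deep-image-96-angular --normalize
python -m raft-ann-bench.run --dataset deep-image-96-inner
```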

Additionally, the containers can be run in detached mode without any issue.
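
For example, a sketch of a detached run, assuming the same image and arguments as above (progress can then be followed with `docker logs -f <container id>`):

```bash
# -d runs the container in the background instead of interactively
export DATA_FOLDER=path/to/store/datasets/and/results
docker run -d --gpus all -u $(id -u) \
    -v $DATA_FOLDER:/home/rapids/benchmarks \
    rapidsai/raft-ann-bench:23.10a-cuda11.8-py3.10 \
    "--dataset deep-image-96-angular" \
    "--normalize" \
    "--algorithms raft_cagra" \
    ""
```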

See the [python package usage](#python-package-usage) for more details on how to use the python package.

## End-to-end example: Billion-scale
`raft-ann-bench.get_dataset` cannot be used to download the [billion-scale datasets](ann_benchmarks_dataset.md#billion-scale)
because they are so large. You should instead use our billion-scale datasets guide to download and prepare them.
All other python scripts mentioned below work as intended once the
@@ -196,7 +213,7 @@ options:
Path to billion-scale dataset groundtruth file (default: None)
```

#### Step 1: Prepare Dataset<a id='prep-dataset'></a>
The script `raft-ann-bench.get_dataset` will download and unpack the dataset in a directory
that the user provides. As of now, only million-scale datasets are supported by this
script. See [datasets and formats](ann_benchmarks_dataset.md) for more information.
@@ -217,7 +234,7 @@
When the option `normalize` is provided to the script, any dataset that has cosine distances
will be normalized to inner product. So, for example, the dataset `glove-100-angular`
will be written at location `datasets/glove-100-inner/`.
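
For instance, a minimal sketch of preparing that dataset with the flags shown earlier:

```bash
# downloads glove-100-angular, normalizes it, and writes it to datasets/glove-100-inner/
python -m raft-ann-bench.get_dataset --dataset glove-100-angular --normalize
```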

### Step 2: Build and Search Index
The script `raft-ann-bench.run` will build and search indices for a given dataset and its
specified configuration.
To configure which algorithms are available, we use `algos.yaml`.
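
For example, a minimal sketch restricting a run to a single algorithm (the names that are actually available depend on `algos.yaml`):

```bash
# build and search only the raft_cagra indices for this dataset
python -m raft-ann-bench.run --dataset deep-image-96-inner --algorithms raft_cagra
```
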
@@ -274,7 +291,7 @@ it is assumed both are `True`.
`indices` and `algorithms`: these parameters ensure that the algorithm specified for an index
is available in `algos.yaml` and not disabled, and that it has an associated executable.

### Step 3: Data Export
The script `raft-ann-bench.data_export` will convert the intermediate JSON outputs produced by `raft-ann-bench.run` to more
easily readable CSV files, which are needed to build charts made by `raft-ann-bench.plot`.
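
A minimal sketch of exporting the results and then building the charts, reusing the dataset from the earlier example:

```bash
# convert intermediate JSON results to CSV, then plot the charts from them
python -m raft-ann-bench.data_export --dataset deep-image-96-inner
python -m raft-ann-bench.plot --dataset deep-image-96-inner
```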

2 changes: 0 additions & 2 deletions python/raft-ann-bench/src/raft-ann-bench/run/__main__.py
@@ -36,9 +36,7 @@ def positive_int(input_str: str) -> int:

def validate_algorithm(algos_conf, algo, gpu_present):
    algos_conf_keys = set(algos_conf.keys())
-    print("algo", algo)
    if gpu_present:
-        print("algo gpu_present", algo)
        return algo in algos_conf_keys
    else:
        return (
