DOC Multiple updates to docs and remove straggling print in code
dantegd committed Oct 5, 2023
1 parent 46d0c7e commit e8f0d68
Showing 2 changed files with 95 additions and 80 deletions.
173 changes: 95 additions & 78 deletions docs/source/raft_ann_benchmarks.md
@@ -2,11 +2,86 @@

This project provides a benchmark program for various ANN search implementations. It's especially suitable for comparing GPU implementations as well as comparing GPU against CPU.

## Table of Contents

- [Installing and Running the Benchmarks](#installing-and-running-the-benchmarks)
- [Using conda](#conda)
- [End-to-end example: Million-scale](#end-to-end-example-million-scale)
- [Using Docker](#docker)
- [End-to-end example: Billion-scale](#end-to-end-example-billion-scale)
- [Creating and customizing dataset configurations](#creating-and-customizing-dataset-configurations)
- [Adding a new ANN algorithm](#adding-a-new-ann-algorithm)

## Installing and Running the Benchmarks

There are two main ways pre-compiled benchmarks are distributed:

- [Conda](#conda): a great solution for users who do not want containers but do want an easy-to-install and easy-to-use Python package. Pip wheels are planned as a future alternative for users who cannot use conda and prefer not to use containers.
- [Docker](#docker): a great solution that only needs Docker and NVIDIA Docker. It provides a single `docker run` command for basic dataset benchmarking, as well as all the functionality of the conda solution inside the containers.

## Conda

If containers are not an option or are not preferred, the easiest way to install the ANN benchmarks is through conda. We provide packages for GPU-enabled systems, as well as for systems without a GPU. We suggest using mamba, as it generally leads to faster install times:

```bash

mamba create --name raft_ann_benchmarks
conda activate raft_ann_benchmarks

# to install GPU package:
mamba install -c rapidsai -c conda-forge -c nvidia raft-ann-bench cuda-version=11.8*

# to install CPU package for usage in CPU-only systems:
mamba install -c rapidsai -c conda-forge raft-ann-bench-cpu
```

The channel `rapidsai` can easily be substituted with `rapidsai-nightly` if nightly benchmarks are desired. The CPU package currently supports running the HNSW benchmarks.
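
For example, a minimal sketch of a nightly GPU install, assuming the package keeps the same name on the `rapidsai-nightly` channel:

```bash
# sketch: swap the rapidsai channel for rapidsai-nightly to pick up nightly builds
mamba install -c rapidsai-nightly -c conda-forge -c nvidia raft-ann-bench cuda-version=11.8*
```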

Please see the [build instructions](ann_benchmarks_build.md) to build the benchmarks from source.

## Running the benchmarks

### Python Package Usage
There are 4 general steps to running the benchmarks and visualizing the results:
1. Prepare Dataset
2. Build Index and Search Index
3. Data Export
4. Plot Results

We provide a collection of lightweight Python scripts that wrap lower-level scripts and executables
to run our benchmarks. Either the Python scripts or the
[low-level scripts and executables](ann_benchmarks_low_level.md) are valid ways to run the benchmarks;
however, plots are only provided through the Python scripts.

### End-to-end example: Million-scale

The steps below demonstrate how to download, install, and run benchmarks on a subset of 10M vectors from the Yandex Deep-1B dataset. By default, the datasets will be stored in and used from the folder indicated by the `RAPIDS_DATASET_ROOT_DIR` environment variable if it is defined, and otherwise from a `datasets` sub-folder of the directory from which the script is called:

```bash

# (1) prepare dataset.
python -m raft-ann-bench.get_dataset --dataset deep-image-96-angular --normalize

# (2) build and search index
python -m raft-ann-bench.run --dataset deep-image-96-inner

# (3) export data
python -m raft-ann-bench.data_export --dataset deep-image-96-inner

# (4) plot results
python -m raft-ann-bench.plot --dataset deep-image-96-inner
```

Configuration files already exist for the following list of million-scale datasets. Please refer to [ann-benchmarks datasets](https://github.com/erikbern/ann-benchmarks/#data-sets) for more information, including the actual train and test sizes. These all work out-of-the-box with the `--dataset` argument. Other million-scale datasets from `ann-benchmarks.com` will also work, but require a JSON configuration file to be created in `$CONDA_PREFIX/lib/python3.xx/site-packages/raft-ann-bench/run/conf`, or you can specify the `--configuration` option to use a specific file (see the sketch after this list).
- `deep-image-96-angular`
- `fashion-mnist-784-euclidean`
- `glove-50-angular`
- `glove-100-angular`
- `mnist-784-euclidean`
- `nytimes-256-angular`
- `sift-128-euclidean`
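
As a minimal sketch, and assuming `--configuration` accepts a path to such a JSON file, a run against a hypothetical custom dataset might look like:

```bash
# hypothetical dataset name and configuration path; the JSON file must describe the dataset and its index parameters
python -m raft-ann-bench.run --dataset my-dataset-96-inner --configuration /path/to/my-dataset.json
```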

## Docker

We provide images for GPU-enabled systems, as well as for systems without a GPU. The following images are available:

@@ -35,7 +110,7 @@ Supported Python versions: 3.9 and 3.10.
docker pull nvcr.io/nvidia/rapidsai/raft-ann-bench:23.08-cuda11.8-py3.10 # substitute raft-ann-bench with the exact desired container.
```

### Container Usage

The container can be used in two different ways:

@@ -44,7 +119,8 @@ The container can be used in two different ways:
For GPU systems, where `$DATA_FOLDER` is a local folder in which you want datasets stored under `$DATA_FOLDER/datasets` and results under `$DATA_FOLDER/result` (we highly recommend that `$DATA_FOLDER` be a folder dedicated to the datasets and results of the containers):

```bash
export DATA_FOLDER=path/to/store/datasets/and/results
docker run --gpus all --rm -it -u $(id -u) \
-v $DATA_FOLDER:/home/rapids/benchmarks \
rapidsai/raft-ann-bench:23.10a-cuda11.8-py3.10 \
"--dataset deep-image-96-angular" \
@@ -56,18 +132,22 @@
Where:

```bash
export DATA_FOLDER=path/to/store/datasets/and/results # <- local folder to store datasets and results
docker run --gpus all --rm -it -u $(id -u) \
-v $DATA_FOLDER:/home/rapids/benchmarks \
rapidsai/raft-ann-bench:23.10a-cuda11.8-py3.10 \ # <- image to use, either `raft-ann-bench` or `raft-ann-bench-datasets`, can choose RAPIDS, cuda and python versions.
"--dataset deep-image-96-angular" \ # <- dataset name
"--normalize" \ # <- whether to normalize the dataset, leave string empty ("") to not normalize.
"--algorithms raft_cagra" \ # <- what algorithm(s) to use as a ; separated list, as well as any other argument to pass to `raft_ann_benchmarks.run`
"" # optional arguments to pass to `raft_ann_benchmarks.plot`
```

**Note about user and file permissions:** The flag `-u $(id -u)` makes the user inside the container match the `uid` of the user outside the container, allowing the container to read and write to the mounted volume indicated by `$DATA_FOLDER`.

For CPU systems the same interface applies, except that the `--gpus` argument is not needed and the CPU images are used:
```bash
export DATA_FOLDER=path/to/store/datasets/and/results
docker run --rm -it -u $(id -u) \
-v $DATA_FOLDER:/home/rapids/benchmarks \
rapidsai/raft-ann-bench-cpu:23.10a-py3.10 \
"--dataset deep-image-96-angular" \
@@ -81,85 +161,22 @@
2. **Using the preinstalled `raft_ann_benchmarks` python package (advanced mode)**: The docker containers are built using the conda packages described in the following section, so they can be used directly as if they were installed manually following the instructions in the next section. This is recommended for advanced users, and is the option that allows the full flexibility of the benchmarking scripts. To use the python scripts directly, use the following command:

```bash
export DATA_FOLDER=path/to/store/datasets/and/results
docker run --gpus all --rm -it -u $(id -u) \
-v $DATA_FOLDER:/home/rapids/benchmarks \
rapidsai/raft-ann-bench:23.10a-cuda11.8-py3.10 \
--entrypoint /bin/bash
```

This will drop you into a command line in the container, with the `raft_ann_benchmarks` python package ready to use, as described in the prior [conda section](#conda):

```
(base) root@00b068fbb862:/home/rapids#
```
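
From this shell, the same Python modules described in the [conda section](#conda) are available. For example, a minimal sketch of preparing and running a benchmark from inside the container:

```bash
# inside the container: the same workflow as the conda package
python -m raft-ann-bench.get_dataset --dataset deep-image-96-angular --normalize
python -m raft-ann-bench.run --dataset deep-image-96-inner
```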

Additionally, the containers can be run in detached mode without any issue.
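
For example, a sketch of a detached run, assuming the same image and arguments as above (progress can then be followed with `docker logs -f <container id>`):

```bash
# -d runs the container in the background instead of interactively
export DATA_FOLDER=path/to/store/datasets/and/results
docker run -d --gpus all -u $(id -u) \
    -v $DATA_FOLDER:/home/rapids/benchmarks \
    rapidsai/raft-ann-bench:23.10a-cuda11.8-py3.10 \
    "--dataset deep-image-96-angular" \
    "--normalize" \
    "--algorithms raft_cagra" \
    ""
```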

See the [python package usage](#python-package-usage) for more details on how to use the python package.

## End-to-end example: Billion-scale
`raft-ann-bench.get_dataset` cannot be used to download the [billion-scale datasets](ann_benchmarks_dataset.md#billion-scale)
because they are so large. You should instead use our billion-scale datasets guide to download and prepare them.
All other python scripts mentioned below work as intended once the
@@ -196,7 +213,7 @@ options:
Path to billion-scale dataset groundtruth file (default: None)
```

#### Step 1: Prepare Dataset<a id='prep-dataset'></a>
The script `raft-ann-bench.get_dataset` will download and unpack the dataset in a directory
that the user provides. As of now, only million-scale datasets are supported by this
script. See [datasets and formats](ann_benchmarks_dataset.md) for more information.
@@ -217,7 +234,7 @@
When the option `normalize` is provided to the script, any dataset that has cosine distances
will be normalized to inner product. So, for example, the dataset `glove-100-angular`
will be written at location `datasets/glove-100-inner/`.
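
For instance, a minimal sketch of preparing that dataset with the flags shown earlier:

```bash
# downloads glove-100-angular, normalizes it, and writes it to datasets/glove-100-inner/
python -m raft-ann-bench.get_dataset --dataset glove-100-angular --normalize
```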

### Step 2: Build and Search Index
The script `raft-ann-bench.run` will build and search indices for a given dataset and its
specified configuration.
To configure which algorithms are available, we use `algos.yaml`.
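
For example, a minimal sketch restricting a run to a single algorithm (the names that are actually available depend on `algos.yaml`):

```bash
# build and search only the raft_cagra indices for this dataset
python -m raft-ann-bench.run --dataset deep-image-96-inner --algorithms raft_cagra
```
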
@@ -274,7 +291,7 @@ it is assumed both are `True`.
`indices` and `algorithms`: these parameters ensure that the algorithm specified for an index
is available in `algos.yaml` and not disabled, and that it has an associated executable.

### Step 3: Data Export
The script `raft-ann-bench.data_export` will convert the intermediate JSON outputs produced by `raft-ann-bench.run` to more
easily readable CSV files, which are needed to build charts made by `raft-ann-bench.plot`.
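
A minimal sketch of exporting the results and then building the charts, reusing the dataset from the earlier example:

```bash
# convert intermediate JSON results to CSV, then plot the charts from them
python -m raft-ann-bench.data_export --dataset deep-image-96-inner
python -m raft-ann-bench.plot --dataset deep-image-96-inner
```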

2 changes: 0 additions & 2 deletions python/raft-ann-bench/src/raft-ann-bench/run/__main__.py
@@ -36,9 +36,7 @@ def positive_int(input_str: str) -> int:

def validate_algorithm(algos_conf, algo, gpu_present):
    algos_conf_keys = set(algos_conf.keys())
-    print("algo", algo)
    if gpu_present:
-        print("algo gpu_present", algo)
        return algo in algos_conf_keys
    else:
        return (
