SNMG ANN #1993

Conversation
The PR is ready for a first review. In its current state, it implements the build, extend and search ANN methods (IVF-Flat and IVF-PQ only for now) in index duplication and sharding modes. For now, the index duplication mode only works by copying the index dataset to each GPU and building the index on each GPU separately. I am now looking to improve the API so that it allows building the index on one GPU and copying it over to the others. Serialization to disk would work, but does not seem ideal, and transferring the index attributes through NCCL does not seem very safe. What would you recommend?
for (int rank = 0; rank < num_ranks_; rank++) {
  RAFT_CUDA_TRY(cudaSetDevice(dev_ids_[rank]));
  auto& ann_if = ann_interfaces_.emplace_back();
  ann_if.build(dev_resources_[rank], index_params, index_dataset);
Will every GPU copy the host dataset into device memory, so that the total number of copies is num_ranks_?
Another related question: will GPU 1 wait to start until the GPU 0 build finishes? If that's the case, the total runtime of the for loop seems to be single-GPU build time * num_ranks_.
Yes exactly, in the index duplication mode the dataset is copied in full to each GPU for training. An alternative method is to train a model locally, serialize it and distribute it with one of the `distribute_flat`, `distribute_pq` or `distribute_cagra` functions.

> Another related question: will GPU 1 wait to start until the GPU 0 build finishes? If that's the case, the total runtime of the for loop seems to be single-GPU build time * num_ranks_.

The `build`, `extend` and `search` functions take a `handle` parameter containing the CUDA stream on which the kernels should be launched. These operations are supposed to be asynchronous, allowing fast switching between GPUs. However, this has not been tested yet; an actual benchmark would be necessary to confirm that things scale as expected.
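To make that concrete, here is a minimal sketch (not the PR's actual code) of driving each rank's work through its own handle/stream and only synchronizing at the end; it reuses the `num_ranks_`/`dev_ids_`/`dev_resources_` members from the snippet above and raft's `sync_stream` helper:

```cpp
#include <raft/core/device_resources.hpp>
#include <raft/core/resource/cuda_stream.hpp>

// Each rank owns its own raft::device_resources and therefore its own CUDA stream:
// work submitted through a rank's handle is enqueued on that stream, so the loop can
// move on to the next GPU as long as the called algorithm never blocks the CPU.
for (int rank = 0; rank < num_ranks_; rank++) {
  RAFT_CUDA_TRY(cudaSetDevice(dev_ids_[rank]));
  // ... enqueue this rank's build/extend/search work through dev_resources_[rank] ...
}

// Only synchronize once work has been enqueued for every rank.
for (int rank = 0; rank < num_ranks_; rank++) {
  raft::resource::sync_stream(dev_resources_[rank]);
}
```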
> An alternative method is to train a model locally, serialize it and distribute it with one of the `distribute_flat`, `distribute_pq` or `distribute_cagra` functions.

This is definitely what we want here. We're going to have to wait for the index to build anyway, but in replicated mode we should only have to build it once and then broadcast it to the other GPUs.
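As a rough illustration of that build-once-then-replicate flow, something along these lines could work. It assumes IVF-Flat and that `serialize`/`deserialize` overloads taking a `std::ostream`/`std::istream` are available (if only file-based overloads exist, a temporary file or a custom streambuf would be needed), so treat the exact calls as assumptions rather than the PR's API:

```cpp
#include <sstream>

#include <raft/core/device_resources.hpp>
#include <raft/neighbors/ivf_flat.cuh>
#include <raft/neighbors/ivf_flat_serialize.cuh>

// Build once on the root GPU, then replicate to the other GPUs through an
// in-memory byte stream instead of a file on disk.
RAFT_CUDA_TRY(cudaSetDevice(dev_ids_[0]));
auto root_index =
  raft::neighbors::ivf_flat::build(dev_resources_[0], index_params, index_dataset);

std::stringstream buf;  // host-side staging buffer (assumed std::ostream overload)
raft::neighbors::ivf_flat::serialize(dev_resources_[0], buf, root_index);

for (int rank = 1; rank < num_ranks_; rank++) {
  RAFT_CUDA_TRY(cudaSetDevice(dev_ids_[rank]));
  buf.seekg(0);
  // Each rank materializes its own device-side copy of the index (assumed
  // std::istream overload returning the index by value).
  auto replica =
    raft::neighbors::ivf_flat::deserialize<float, int64_t>(dev_resources_[rank], buf);
  // ... hand `replica` to this rank's ann_interfaces_ entry ...
}
```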
We have a problem here. Building a GPU index is not only a GPU operation. It can involve significant CPU work (e.g. CAGRA graph optimization, NN Descent data pre/post-processing, host-side sub-sampling for IVF methods). Furthermore, there are cases where our algorithms block the CPU thread while waiting for GPU kernels to finish (e.g. waiting for return values that determine memory allocation sizes).

We cannot launch `build` on a single CPU thread and expect it to run in parallel just because the GPU ops are asynchronous. Most are, but the few that I cite above will essentially serialize the whole process. At least we would need a different worker thread for each GPU stream, but I would recommend one process per GPU.

We should also keep in mind that `build` is multi-threaded: it spawns OpenMP threads to help shuffle data in host memory (a single thread is not enough to saturate memory bandwidth). We should document that this can be controlled with the `OMP_NUM_THREADS` variable.
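For illustration, a minimal sketch of the worker-thread-per-GPU variant of the loop shown above; it assumes `ann_interfaces_` can be resized up front (default-constructible entries) so the threads do not race on `emplace_back`, and a one-process-per-GPU design would replace the threads with separate processes:

```cpp
#include <thread>
#include <vector>

// One worker thread per GPU so that the CPU-side parts of build() (sub-sampling,
// graph optimization, blocking synchronizations) can overlap across ranks.
ann_interfaces_.resize(num_ranks_);  // assumption: pre-sized so threads don't race
std::vector<std::thread> workers;
for (int rank = 0; rank < num_ranks_; rank++) {
  workers.emplace_back([&, rank] {
    RAFT_CUDA_TRY(cudaSetDevice(dev_ids_[rank]));  // the current device is per-thread state
    ann_interfaces_[rank].build(dev_resources_[rank], index_params, index_dataset);
  });
}
for (auto& w : workers) { w.join(); }
```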
RAFT_NCCL_TRY(ncclCommInitAll(nccl_comms_.data(), num_ranks_, dev_ids_.data()));
for (int rank = 0; rank < num_ranks_; rank++) {
  RAFT_CUDA_TRY(cudaSetDevice(dev_ids_[rank]));
  raft::comms::build_comms_nccl_only(&dev_resources_[rank], nccl_comms_[rank], num_ranks_, rank);
The NCCL initialization seems to be "one process, multiple GPUs".
Is it possible to adapt it to "one process or thread per GPU"? That may require something like `std::thread`, but the benefit is that the APIs of this PR would become reusable from Dask/Spark. Both Dask and Spark currently follow the one-process-per-GPU convention when initializing NCCL.
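For reference, a minimal sketch of what a one-rank-per-thread NCCL bootstrap could look like, reusing the `num_ranks_`/`dev_ids_` members and the `RAFT_NCCL_TRY` macro from the snippet above. In a real Dask/Spark deployment each rank would live in its own process and the unique id would be broadcast out of band (e.g. by the scheduler):

```cpp
#include <nccl.h>

#include <thread>
#include <vector>

// The root generates a unique id; every rank joins the clique from its own thread
// with ncclCommInitRank instead of the single-process ncclCommInitAll used above.
ncclUniqueId uid;
RAFT_NCCL_TRY(ncclGetUniqueId(&uid));

std::vector<ncclComm_t> comms(num_ranks_);
std::vector<std::thread> workers;
for (int rank = 0; rank < num_ranks_; rank++) {
  workers.emplace_back([&, rank] {
    RAFT_CUDA_TRY(cudaSetDevice(dev_ids_[rank]));
    RAFT_NCCL_TRY(ncclCommInitRank(&comms[rank], num_ranks_, uid, rank));
  });
}
for (auto& w : workers) { w.join(); }
```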
The single-process solution was better suited to implementing this much-requested feature in RAFT for now. But I agree that in the end we should definitely look into making it possible to run things on Dask/Spark. This would probably involve the use of multiple processes/threads and a much broader use of NCCL.

cc @cjnolet
Note that the RAFT developer guide also suggests one process per GPU: https://github.com/rapidsai/raft/blob/branch-24.06/docs/source/developer_guide.md#multi-gpu
Thank you Victor. I have learned a lot from the code! I like the idea of combining three algorithms into one unified interface. A few questions to make myself more familiar with the PR and its design choices. It would be wonderful if Spark Rapids ML could leverage the APIs in this PR.
Hey Victor. This is a pretty sizeable PR so my suggestions will come in a few different passes through the changes. I took an initial look. Overall I think it's headed in the right direction. A lot of my suggestions so far are mechanical things. I'll take a closer look at the impl next.
@@ -472,6 +472,19 @@
    {"nprobe": 2000}
  ]
},
{
  "name": "raft_ann_mg.nlist16384",
  "algo": "raft_ann_mg",
I think `raft_ivf_flat_mg` and `raft_ivf_pq_mg` might make more sense here.
Thanks Victor for the PR! The code is well structured and clean, but I want to point out a few issues that we need to discuss (see below). I think these can conceptually be fixed easily by adhering to our one-process-per-GPU principle.
auto d_trans = raft::make_device_vector<IdxT, IdxT>(root_handle_, num_ranks_);
raft::copy(d_trans.data_handle(), h_trans.data(), num_ranks_, resource::get_cuda_stream(root_handle_));
auto translations = std::make_optional<raft::device_vector_view<IdxT, IdxT>>(d_trans.view());
raft::neighbors::brute_force::knn_merge_parts<float, IdxT>(root_handle_,
Out of scope for the current PR, but we might consider it as a follow-up: IVF-PQ and CAGRA-Q only return approximate distances. While merging parts based on the approximate distances, we might be throwing out good neighbors due to inaccurate distance values. If we plan to do refinement, then we can treat the `in_neighbors` as candidates for refinement and run refinement directly instead of calling `knn_merge_parts`.
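As a rough sketch of that follow-up idea (not part of this PR), the gathered candidate lists could be re-ranked with `raft::neighbors::refine`; the exact signature, and the assumption that the full dataset is reachable from the root handle, are mine rather than the PR's:

```cpp
#include <raft/neighbors/refine.cuh>

// Instead of knn_merge_parts on approximate distances, recompute exact distances
// for the union of candidates gathered from all ranks and keep the best k.
// `candidates` is a [n_queries, num_ranks_ * k] device matrix of in_neighbors;
// `dataset` and `queries` are device matrices visible to the root GPU.
raft::neighbors::refine(root_handle_,
                        dataset,         // raft::device_matrix_view<const float, IdxT>
                        queries,         // raft::device_matrix_view<const float, IdxT>
                        candidates,      // raft::device_matrix_view<const IdxT, IdxT>
                        out_neighbors,   // raft::device_matrix_view<IdxT, IdxT>, [n_queries, k]
                        out_distances,   // raft::device_matrix_view<float, IdxT>, [n_queries, k]
                        metric);         // e.g. raft::distance::DistanceType::L2Expanded
```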
/ok to test
Giving this a `packaging-codeowners` approval... building/packaging changes look fine to me.
This PR implements a distributed (single-node-multiple-GPUs) implementation of ANN indexes. It allows building, extending and searching an index on multiple GPUs. Before building the index, the user has to choose between two modes:

**Sharding mode**: The index dataset is split; each GPU trains its own index on its respective share of the dataset. This is intended to increase both the search throughput and the maximal size of the index.

**Index duplication mode**: The index is built once on a GPU and then copied over to the others. Alternatively, the index dataset is sent to each GPU to be built there. This is intended to increase the search throughput.

SNMG indexes can be serialized and deserialized. Local models can also be deserialized and deployed in index duplication mode.

![bench](https://github.com/user-attachments/assets/e313d0ef-02eb-482a-9104-9e1bb400456d)

Migrated from rapidsai/raft#1993

Authors:
- Victor Lafargue (https://github.com/viclafargue)
- James Lamb (https://github.com/jameslamb)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)
- James Lamb (https://github.com/jameslamb)
- Corey J. Nolet (https://github.com/cjnolet)

URL: #231
The goal of this PR is to implement a distributed (single-node-multiple-GPUs) implementation of ANN indexes. It will allow the user to `build`, `extend` and `search` an index on multiple GPUs. Before building the index, the user has to choose between two modes: