Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diskann Benchmarking Wrapper #260

Open
wants to merge 108 commits into
base: branch-25.02
Choose a base branch
from

Conversation

tarang-jain
Copy link
Contributor

@tarang-jain tarang-jain commented Jul 29, 2024

Brings DiskANN into cuvs-bench

  • Build and search in-memory DiskANN index
  • Build and search SSD DiskANN index
  • Build a cuvs Vamana index on GPU and serialize it in DiskANN format. Search on CPU using in-memory DiskANN search API.

@tarang-jain tarang-jain added feature request New feature or request non-breaking Introduces a non-breaking change labels Jul 29, 2024
@tarang-jain tarang-jain self-assigned this Jul 29, 2024
@github-actions github-actions bot added the Python label Aug 3, 2024
Copy link

copy-pr-bot bot commented Dec 2, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tarang-jain
Copy link
Contributor Author

/ok to test

@@ -32,6 +32,12 @@ option(CUVS_ANN_BENCH_USE_CUVS_BRUTE_FORCE "Include cuVS brute force knn in benc
option(CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB "Include cuVS CAGRA with HNSW search in benchmark" ON)
option(CUVS_ANN_BENCH_USE_HNSWLIB "Include hnsw algorithm in benchmark" ON)
option(CUVS_ANN_BENCH_USE_GGNN "Include ggnn algorithm in benchmark" OFF)
option(CUVS_ANN_BENCH_USE_DISKANN "Include DISKANN search in benchmark" ON)
option(CUVS_ANN_BENCH_USE_CUVS_VAMANA "Include cuVS Vamana with DiskANN search in benchmark" ON)
if(CMAKE_SYSTEM_PROCESSOR MATCHES "(ARM|arm|aarch64)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does MSFT DiskANN repo not support ARM?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they have mkl-devel as a dependency, which is not meant to be installed in aarch64.

cpp/bench/ann/src/cuvs/cuvs_cagra_diskann_wrapper.h Outdated Show resolved Hide resolved
template <typename T>
void diskann_ssd<T>::build(const T* dataset, size_t nrow)
{
diskann::build_disk_index<float>(base_file_.c_str(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this defined? Is it using the CPU DiskANN repo to build the ssd verson of the index?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is right. It is just to benchmark the CPU build

@cjnolet
Copy link
Member

cjnolet commented Dec 5, 2024

/ok to test

@tarang-jain
Copy link
Contributor Author

/ok to test

@cjnolet cjnolet changed the base branch from branch-24.12 to branch-25.02 December 10, 2024 20:47
@cjnolet
Copy link
Member

cjnolet commented Dec 12, 2024

/ok to test

Copy link
Contributor

@achirkin achirkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid algorithm-specific changes to cpp/bench/ann/src/common/benchmark.hpp as per our offline discussion.
I'm not opposed to exposing dataset/index filepaths to algorithms in general though.
But maybe it would make sense to hide (filter out) them when dumping all build parameters to avoid cluttering the console/reports and exposing private user filepaths?

cpp/bench/ann/src/cuvs/cuvs_vamana_wrapper.h Show resolved Hide resolved
@tarang-jain
Copy link
Contributor Author

@achirkin are you suggesting that we expose the dataset / index filepaths to all algorithms, but only use them for diskann ssd indexes?

@@ -144,7 +150,8 @@ void bench_build(::benchmark::State& state,

const auto algo_property = parse_algo_property(algo->get_preference(), index.build_param);

const T* base_set = dataset->base_set(algo_property.dataset_memory_type);
const T* base_set = nullptr;
if (index.algo != "diskann_ssd") base_set = dataset->base_set(algo_property.dataset_memory_type);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@achirkin if we do not have this line, the entire dataset will be read into the base_set variable needlessly for DiskANN ssd based indexes. For these indexes, only the path to the dataset needs to be known and the DiskANN index building APIs will read the data from the path. In fact, it is important we do it this way because those SSD indexes are designed to let the data remain on disk if there are extremely large datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
benchmarking CMake cpp feature request New feature or request non-breaking Introduces a non-breaking change Python
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

6 participants