Commit
Add `wiki_all` dataset config and documentation. (#1918)
…et more clarify on how the dataset was generated. Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #1918
Showing 5 changed files with 652 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
# Wiki-all Dataset | ||
|
||
The `wiki-all` dataset was created to stress vector search algorithms at scale with both a large number of vectors and dimensions. The entire dataset contains 88M vectors with 768 dimensions and is meant for testing the types of vectors one would typically encounter in retrieval augmented generation (RAG) workloads. The full dataset is ~251GB in size, which is intentionally larger than the typical memory of GPUs. The massive scale is intended to promote the use of compression and efficient out-of-core methods for both indexing and search. | ||
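
As a rough sanity check on the quoted size, the arithmetic can be sketched as follows. It assumes each value is stored as a 4-byte float32 (an assumption, not stated above) and uses the rounded count of 88M vectors, so the result is only approximate. Under those assumptions the ~251GB figure appears to be in binary gigabytes (GiB).

```bash
# Assumption: every vector component is a 4-byte float32.
num_vectors=88000000   # rounded count from the text
dim=768
bytes_per_value=4

total_bytes=$((num_vectors * dim * bytes_per_value))
total_gib=$((total_bytes / 1024 / 1024 / 1024))   # integer division, rounds down

echo "${total_bytes} bytes (~${total_gib} GiB)"   # prints: 270336000000 bytes (~251 GiB)
```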
|
||
## Getting the dataset | ||
|
||
The dataset is composed of English wiki texts from [Kaggle](https://www.kaggle.com/datasets/jjinho/wikipedia-20230701) and multi-lingual wiki texts from [Cohere Wikipedia](https://huggingface.co/datasets/Cohere/wikipedia-22-12).
|
||
Cohere's English texts are older (2022) and smaller than the Kaggle English Wiki texts (2023), so the Cohere English texts were dropped entirely. The final Wiki texts combine the English Wiki from Kaggle with the other languages from Cohere. The English texts constitute 50% of the total text size.
|
||
To form the final dataset, the Wiki texts were chunked into 85 million 128-token pieces. For reference, Cohere chunks Wiki texts into 104-token pieces. Finally, the embeddings of each chunk were computed using the [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) embedding model. The resulting dataset is an embedding matrix of size 88 million by 768. Also included with the dataset is a query file containing 10k query vectors and a groundtruth file to evaluate nearest neighbors algorithms. | ||
|
||
### Full dataset | ||
|
||
A version of the dataset is made available in the binary format that can be used directly by the [raft-ann-bench](https://docs.rapids.ai/api/raft/nightly/raft_ann_benchmarks/) tool. The full 88M dataset is ~251GB and the download link below contains tarballs that have been split into multiple parts. | ||
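
The binary layout itself is not described here. The following is a minimal sketch for inspecting a file, assuming the layout commonly used for `.fbin`/`.ibin` benchmark files: an 8-byte header holding two little-endian unsigned 32-bit integers (row count, then dimension), followed by the raw values. It also assumes a little-endian host, and `read_bin_header` is a hypothetical helper name.

```bash
# Print the row count and dimension stored in the first 8 bytes of a file.
# Assumption: header = two little-endian uint32 values (rows, then columns).
read_bin_header() {
  local file="$1"
  # -A n: no offset column, -t u4: unsigned 4-byte integers, -N 8: first 8 bytes only
  od -A n -t u4 -N 8 "$file" | awk '{ print "rows=" $1, "dim=" $2 }'
}
```

If the files follow this layout, `read_bin_header /datasets/wiki_all_10M/queries.fbin` would be expected to report 10000 rows and 768 dimensions for the query set.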
|
||
The following will download all 10 parts and untar them to a `wiki_all_88M` directory:
```bash
curl -s https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.{00..09} | tar -xf - -C /datasets/wiki_all_88M/
```
|
||
The above has the unfortunate drawback that if the command should fail for any reason, all the parts need to be re-downloaded. The files can also be downloaded individually and then untarred to the directory. Each file is ~27GB and there are 10 of them. | ||
|
||
```bash
# -O saves each part under its remote file name instead of writing it to stdout
curl -s -O https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.00
...
curl -s -O https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.09

cat wiki_all.tar.* | tar -xf - -C /datasets/wiki_all_88M/
```
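
Since each part is large, it may help to make the individual downloads resumable. The following is a sketch under assumptions rather than part of the official instructions: `download_parts` is a hypothetical helper, and the `.partial` suffix is an arbitrary naming choice used to tell finished parts from interrupted ones.

```bash
base_url="https://data.rapids.ai/raft/datasets/wiki_all"

# Download any parts that are not already complete in the current directory.
download_parts() {
  local part file
  for part in 00 01 02 03 04 05 06 07 08 09; do
    file="wiki_all.tar.${part}"
    # A part without the .partial suffix finished in an earlier run; skip it.
    [ -f "$file" ] && continue
    # -f: fail on HTTP errors, -L: follow redirects,
    # -C -: resume from the size of an existing .partial file, -o: output name
    if curl -fL -C - -o "${file}.partial" "${base_url}/${file}"; then
      mv "${file}.partial" "$file"
    else
      return 1
    fi
  done
}
```

Once `download_parts` returns successfully, `cat wiki_all.tar.[0-9][0-9] | tar -xf - -C /datasets/wiki_all_88M/` extracts the archive as before. Rerunning the function after a failure only fetches the parts that are still missing.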
|
||
### 1M and 10M subsets | ||
|
||
Also available are 1M and 10M subsets of the full dataset which are 2.9GB and 29GB, respectively. These subsets also include query sets of 10k vectors and corresponding groundtruth files. | ||
|
||
```bash
# -O saves each tarball under its remote file name instead of writing it to stdout
curl -s -O https://data.rapids.ai/raft/datasets/wiki_all_1M/wiki_all_1M.tar
curl -s -O https://data.rapids.ai/raft/datasets/wiki_all_10M/wiki_all_10M.tar
```
|
||
## Using the dataset | ||
|
||
After the dataset is downloaded and extracted to the `wiki_all_88M` directory (or `wiki_all_1M`/`wiki_all_10M` if one of the subsets is used), the files can be used in the benchmarking tool. The dataset name is `wiki_all_88M` (or `wiki_all_1M`/`wiki_all_10M`), and it is selected by passing the appropriate name to the scripts, for example `--dataset wiki_all_88M`.
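
A hedged sketch of how such an invocation might be wrapped is shown below. Only the `--dataset` flag comes from the text above. The module name `raft-ann-bench.run` and the `--dataset-path` flag are assumptions based on the benchmarking tool's package layout and should be checked against its documentation, and `run_wiki_all_bench` is a hypothetical helper name.

```bash
# Run the benchmark for one of the three wiki-all variants.
# Assumptions: `python -m raft-ann-bench.run` is the entry point and
# `--dataset-path` points at the directory containing the extracted datasets.
run_wiki_all_bench() {
  local dataset="$1" dataset_path="${2:-/datasets}"
  case "$dataset" in
    wiki_all_1M|wiki_all_10M|wiki_all_88M) ;;
    *) echo "unknown dataset: $dataset" >&2; return 2 ;;
  esac
  python -m raft-ann-bench.run --dataset "$dataset" --dataset-path "$dataset_path"
}
```

For example, `run_wiki_all_bench wiki_all_10M /datasets` would select the 10M subset, while an unrecognized name is rejected before the tool is started.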
200 changes: 200 additions & 0 deletions
python/raft-ann-bench/src/raft-ann-bench/run/conf/wiki_all_10M.json
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,200 @@ | ||
{ | ||
"dataset": { | ||
"name": "wiki_all_10M", | ||
"base_file": "wiki_all_10M/base.88M.fbin", | ||
"query_file": "wiki_all_10M/queries.fbin", | ||
"groundtruth_neighbors_file": "wiki_all_10M/groundtruth.88M.neighbors.ibin", | ||
"distance": "euclidean" | ||
}, | ||
"search_basic_param": { | ||
"batch_size": 10000, | ||
"k": 10 | ||
}, | ||
"index": [ | ||
{ | ||
"name": "hnswlib.M16.ef50", | ||
"algo": "hnswlib", | ||
"build_param": { "M": 16, "efConstruction": 50, "numThreads": 56 }, | ||
"file": "wiki_all_10M/hnswlib/M16.ef50", | ||
"search_params": [ | ||
{ "ef": 10, "numThreads": 56 }, | ||
{ "ef": 20, "numThreads": 56 }, | ||
{ "ef": 40, "numThreads": 56 }, | ||
{ "ef": 60, "numThreads": 56 }, | ||
{ "ef": 80, "numThreads": 56 }, | ||
{ "ef": 120, "numThreads": 56 }, | ||
{ "ef": 200, "numThreads": 56 }, | ||
{ "ef": 400, "numThreads": 56 }, | ||
{ "ef": 600, "numThreads": 56 }, | ||
{ "ef": 800, "numThreads": 56 } | ||
] | ||
}, | ||
{ | ||
"name": "faiss_ivf_pq.M32-nlist16K", | ||
"algo": "faiss_gpu_ivf_pq", | ||
"build_param": { | ||
"M": 32, | ||
"nlist": 16384, | ||
"ratio": 2 | ||
}, | ||
"file": "wiki_all_10M/faiss_ivf_pq/M32-nlist16K_ratio2", | ||
"search_params": [ | ||
{ "nprobe": 10 }, | ||
{ "nprobe": 20 }, | ||
{ "nprobe": 30 }, | ||
{ "nprobe": 40 }, | ||
{ "nprobe": 50 }, | ||
{ "nprobe": 100 }, | ||
{ "nprobe": 200 }, | ||
{ "nprobe": 500 } | ||
] | ||
}, | ||
{ | ||
"name": "faiss_ivf_pq.M64-nlist16K", | ||
"algo": "faiss_gpu_ivf_pq", | ||
"build_param": { | ||
"M": 64, | ||
"nlist": 16384, | ||
"ratio": 2 | ||
}, | ||
"file": "wiki_all_10M/faiss_ivf_pq/M64-nlist16K_ratio2", | ||
"search_params": [ | ||
{ "nprobe": 10 }, | ||
{ "nprobe": 20 }, | ||
{ "nprobe": 30 }, | ||
{ "nprobe": 40 }, | ||
{ "nprobe": 50 }, | ||
{ "nprobe": 100 }, | ||
{ "nprobe": 200 }, | ||
{ "nprobe": 500 } | ||
] | ||
}, | ||
{ | ||
"name": "raft_ivf_pq.d128-nlist16K", | ||
"algo": "raft_ivf_pq", | ||
"build_param": { | ||
"pq_dim": 128, | ||
"pq_bits": 8, | ||
"nlist": 16384, | ||
"niter": 10, | ||
"ratio": 10 | ||
}, | ||
"file": "wiki_all_10M/raft_ivf_pq/d128-nlist16K", | ||
"search_params": [ | ||
{ "nprobe": 20, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 1 }, | ||
{ "nprobe": 30, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 1 }, | ||
{ "nprobe": 40, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 1 }, | ||
{ "nprobe": 50, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 1 }, | ||
{ "nprobe": 100, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 1 }, | ||
{ "nprobe": 200, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 1 }, | ||
{ "nprobe": 500, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 1 } | ||
] | ||
}, | ||
{ | ||
"name": "raft_ivf_pq.d64-nlist16K", | ||
"algo": "raft_ivf_pq", | ||
"build_param": { | ||
"pq_dim": 64, | ||
"pq_bits": 8, | ||
"nlist": 16384, | ||
"niter": 10, | ||
"ratio": 10 | ||
}, | ||
"file": "wiki_all_10M/raft_ivf_pq/d64-nlist16K", | ||
"search_params": [ | ||
{ "nprobe": 20, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 4 }, | ||
{ "nprobe": 30, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 4 }, | ||
{ "nprobe": 40, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 4 }, | ||
{ "nprobe": 50, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 4 }, | ||
{ "nprobe": 100, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 4 }, | ||
{ "nprobe": 200, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 4 }, | ||
{ "nprobe": 500, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 4 } | ||
] | ||
}, | ||
{ | ||
"name": "raft_ivf_pq.d32-nlist16K", | ||
"algo": "raft_ivf_pq", | ||
"build_param": { | ||
"pq_dim": 32, | ||
"pq_bits": 8, | ||
"nlist": 16384, | ||
"niter": 10, | ||
"ratio": 10 | ||
}, | ||
"file": "wiki_all_10M/raft_ivf_pq/d32-nlist16K", | ||
"search_params": [ | ||
{ "nprobe": 20, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 32 }, | ||
{ "nprobe": 30, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 32 }, | ||
{ "nprobe": 40, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 32 }, | ||
{ "nprobe": 50, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 32 }, | ||
{ "nprobe": 100, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 32 }, | ||
{ "nprobe": 200, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 32 }, | ||
{ "nprobe": 500, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 32 } | ||
] | ||
}, | ||
{ | ||
"name": "raft_ivf_pq.d32X-nlist16K", | ||
"algo": "raft_ivf_pq", | ||
"build_param": { | ||
"pq_dim": 32, | ||
"pq_bits": 8, | ||
"nlist": 16384, | ||
"niter": 10, | ||
"ratio": 10 | ||
}, | ||
"file": "wiki_all_10M/raft_ivf_pq/d32-nlist16K", | ||
"search_params": [ | ||
{ "nprobe": 20, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 16 }, | ||
{ "nprobe": 30, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 16 }, | ||
{ "nprobe": 40, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 16 }, | ||
{ "nprobe": 50, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 16 }, | ||
{ "nprobe": 100, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 16 }, | ||
{ "nprobe": 200, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 16 }, | ||
{ "nprobe": 500, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 16 }, | ||
{ "nprobe": 30, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 8 }, | ||
{ "nprobe": 40, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 8 }, | ||
{ "nprobe": 50, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 8 }, | ||
{ "nprobe": 100, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 8 }, | ||
{ "nprobe": 200, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 8 }, | ||
{ "nprobe": 500, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 8 }, | ||
{ "nprobe": 30, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 4 }, | ||
{ "nprobe": 40, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 4 }, | ||
{ "nprobe": 50, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 4 }, | ||
{ "nprobe": 100, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 4 }, | ||
{ "nprobe": 200, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 4 }, | ||
{ "nprobe": 500, "internalDistanceDtype": "half", "smemLutDtype": "half", "refine_ratio": 4 } | ||
|
||
] | ||
}, | ||
{ | ||
"name": "raft_cagra.dim32.multi_cta", | ||
"algo": "raft_cagra", | ||
"build_param": { "graph_degree": 32, "intermediate_graph_degree": 48 }, | ||
"file": "wiki_all_10M/raft_cagra/dim32.ibin", | ||
"search_params": [ | ||
{ "itopk": 32, "search_width": 1, "max_iterations": 0, "algo": "multi_cta" }, | ||
{ "itopk": 32, "search_width": 1, "max_iterations": 32, "algo": "multi_cta" }, | ||
{ "itopk": 32, "search_width": 1, "max_iterations": 36, "algo": "multi_cta" }, | ||
{ "itopk": 32, "search_width": 1, "max_iterations": 40, "algo": "multi_cta" }, | ||
{ "itopk": 32, "search_width": 1, "max_iterations": 44, "algo": "multi_cta" }, | ||
{ "itopk": 32, "search_width": 1, "max_iterations": 48, "algo": "multi_cta" }, | ||
{ "itopk": 32, "search_width": 2, "max_iterations": 16, "algo": "multi_cta" }, | ||
{ "itopk": 32, "search_width": 2, "max_iterations": 24, "algo": "multi_cta" }, | ||
{ "itopk": 32, "search_width": 2, "max_iterations": 26, "algo": "multi_cta" }, | ||
{ "itopk": 32, "search_width": 2, "max_iterations": 32, "algo": "multi_cta" }, | ||
{ "itopk": 64, "search_width": 4, "max_iterations": 16, "algo": "multi_cta" }, | ||
{ "itopk": 64, "search_width": 1, "max_iterations": 64, "algo": "multi_cta" }, | ||
{ "itopk": 96, "search_width": 2, "max_iterations": 48, "algo": "multi_cta" }, | ||
{ "itopk": 128, "search_width": 8, "max_iterations": 16, "algo": "multi_cta" }, | ||
{ "itopk": 128, "search_width": 2, "max_iterations": 64, "algo": "multi_cta" }, | ||
{ "itopk": 192, "search_width": 8, "max_iterations": 24, "algo": "multi_cta" }, | ||
{ "itopk": 192, "search_width": 2, "max_iterations": 96, "algo": "multi_cta" }, | ||
{ "itopk": 256, "search_width": 8, "max_iterations": 32, "algo": "multi_cta" }, | ||
{ "itopk": 384, "search_width": 8, "max_iterations": 48, "algo": "multi_cta" }, | ||
{ "itopk": 512, "search_width": 8, "max_iterations": 64, "algo": "multi_cta" } | ||
] | ||
} | ||
|
||
] | ||
} | ||
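
Before launching a run with this config, it can help to confirm that the files named in its `dataset` block exist under the dataset root. This is a minimal sketch: `check_dataset_files` is a hypothetical helper, and `/datasets` is assumed as the root directory, as in the download commands of the documentation.

```bash
# Report every listed file that is missing under the given root directory.
# Returns 0 when all files exist and 1 otherwise.
check_dataset_files() {
  local root="$1" missing=0 f
  shift
  for f in "$@"; do
    if [ ! -f "${root}/${f}" ]; then
      echo "missing: ${root}/${f}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

With the paths from the config above, the check would be `check_dataset_files /datasets wiki_all_10M/base.88M.fbin wiki_all_10M/queries.fbin wiki_all_10M/groundtruth.88M.neighbors.ibin`.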