Fixing docs and datasets.yaml
cjnolet committed Nov 1, 2023
1 parent 657b0db commit 67380ed
Showing 2 changed files with 7 additions and 7 deletions.
8 changes: 4 additions & 4 deletions docs/source/wiki_all_dataset.md
@@ -2,8 +2,6 @@

The `wiki-all` dataset was created to stress-test vector search algorithms at scale, with both a large number of vectors and a high dimensionality. The entire dataset contains 88M vectors with 768 dimensions and is meant for testing the types of vectors one would typically encounter in retrieval-augmented generation (RAG) workloads. The full dataset is ~251GB in size, which is intentionally larger than the memory capacity of a typical GPU. The massive scale is intended to promote the use of compression and efficient out-of-core methods for both indexing and search.
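The ~251GB figure is consistent with simply storing 88M float32 vectors of 768 dimensions; a quick back-of-envelope check (illustrative only, not from the original docs):

```python
# Back-of-envelope size of the raw embedding matrix stored as float32.
num_vectors = 88_000_000
dims = 768
bytes_per_float32 = 4

total_bytes = num_vectors * dims * bytes_per_float32
print(f"{total_bytes / 1024**3:.1f} GiB")  # ~251.8 GiB, matching the ~251GB quoted above
```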

-## Getting the dataset

The dataset is composed of all the available languages in the [Cohere Wikipedia dataset](https://huggingface.co/datasets/Cohere/wikipedia-22-12). An [English version](https://www.kaggle.com/datasets/jjinho/wikipedia-20230701) is also available.
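For readers who want to peek at the source texts, a minimal sketch using the Hugging Face `datasets` library is shown below; the `"en"` configuration name and the `title`/`text` field names are assumptions about how the Cohere dataset is published, so adjust them if they differ:

```python
# Minimal sketch: stream a few rows of one language of the Cohere Wikipedia dump.
# Requires: pip install datasets
from datasets import load_dataset

# Streaming avoids downloading the entire dump just to inspect a handful of rows.
docs = load_dataset("Cohere/wikipedia-22-12", "en", split="train", streaming=True)

for i, row in enumerate(docs):
    print(row.get("title"), str(row.get("text"))[:80])
    if i == 2:
        break
```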


@@ -13,13 +11,15 @@ Cohere's English Texts are older (2022) and smaller than the Kaggle English Wiki

To form the final dataset, the Wiki texts were chunked into 85 million 128-token pieces. For reference, Cohere chunks Wiki texts into 104-token pieces. Finally, the embeddings of each chunk were computed using the [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) embedding model. The resulting dataset is an embedding matrix of size 88 million by 768. Also included with the dataset is a query file containing 10k query vectors and a groundtruth file to evaluate nearest neighbors algorithms.
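A rough sketch of that pipeline is shown below; it is not the script used to build the dataset, and the greedy 128-token splitting is an assumption about how the chunking was done:

```python
# Minimal sketch: 128-token chunking followed by embedding with the same model.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
tokenizer = model.tokenizer  # underlying Hugging Face tokenizer

def chunk_by_tokens(text: str, max_tokens: int = 128):
    """Greedily split text into pieces of at most max_tokens tokens."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    for start in range(0, len(ids), max_tokens):
        yield tokenizer.decode(ids[start:start + max_tokens])

passages = ["...a Wikipedia paragraph...", "...another paragraph..."]
chunks = [piece for text in passages for piece in chunk_by_tokens(text)]

# Each embedding has 768 dimensions, matching the matrix shape described above.
embeddings = model.encode(chunks)
print(embeddings.shape)  # (num_chunks, 768)
```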

+## Getting the dataset

### Full dataset

A version of the dataset is made available in the binary format that can be used directly by the [raft-ann-bench](https://docs.rapids.ai/api/raft/nightly/raft_ann_benchmarks/) tool. The full 88M dataset is ~251GB, and the download below is split into multiple tarball parts.

The following will download all 10 parts and untar them to a `wiki_all_88M` directory:
```bash
-curl -s https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.{00..9} | tar -xf - -C /datasets/wiki_all_88M/
+curl -s https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.{00..9} | tar -xf - -C wiki_all_88M/
```

The above has the drawback that if the command fails for any reason, all of the parts need to be re-downloaded. The files can also be downloaded individually and then untarred into the directory. Each file is ~27GB and there are 10 of them.
@@ -29,7 +29,7 @@ curl -s https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.00
...
curl -s https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.09

-cat wiki_all.tar.* | tar -xf - -C /datasets/wiki_all_88M/
+cat wiki_all.tar.* | tar -xf - -C wiki_all_88M/
```
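Once extracted, the `.fbin`/`.ibin` files can be sanity-checked with a short script like the one below, e.g. to confirm the base vectors really are 768-dimensional (matching the `dims` entries in `datasets.yaml` further down). The header layout assumed here, two little-endian `uint32` values giving the number of rows and dimensions followed by row-major data, is the common big-ann-benchmarks convention and is an assumption rather than something stated in these docs:

```python
# Minimal sketch: read the header of an .fbin/.ibin file to check rows and dims.
# Assumes: uint32 num_rows, uint32 num_dims, then num_rows * num_dims values.
import numpy as np

def read_bin_header(path: str) -> tuple[int, int]:
    with open(path, "rb") as f:
        num_rows, num_dims = np.fromfile(f, dtype=np.uint32, count=2)
    return int(num_rows), int(num_dims)

rows, dims = read_bin_header("wiki_all_88M/base.88M.fbin")
print(f"{rows} vectors x {dims} dims")  # expect dims == 768 for wiki-all
```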

### 1M and 10M subsets
6 changes: 3 additions & 3 deletions datasets.yaml
@@ -100,21 +100,21 @@
distance: euclidean

- name: wiki_all_1M,
-dims: 784
+dims: 768
base_file: wiki_all_1M/base.1MM.fbin,
query_file: wiki_all_1M/queries.fbin,
groundtruth_neighbors_file: wiki_all_1M/groundtruth.1M.neighbors.ibin,
distance: euclidean

- name: wiki_all_10M,
-dims: 784
+dims: 768
base_file: wiki_all_10M/base.10M.fbin,
query_file: wiki_all_10M/queries.fbin,
groundtruth_neighbors_file: wiki_all_10M/groundtruth.10M.neighbors.ibin,
distance: euclidean

- name: wiki_all_88M,
-dims: 784
+dims: 768
base_file: wiki_all_88M/base.88M.fbin,
query_file: wiki_all_88M/queries.fbin,
groundtruth_neighbors_file: wiki_all_88M/groundtruth.88M.neighbors.ibin,
