Edit benchmark guide
tfeher committed Nov 23, 2023
1 parent 55f7039 commit 226745c
Showing 1 changed file with 8 additions and 0 deletions.
8 changes: 8 additions & 0 deletions docs/source/ann_benchmarks_dataset.md
@@ -46,6 +46,14 @@ Commonly used datasets can be downloaded from two websites:
```
Besides ground truth files for the whole billion-scale datasets, this site also provides ground truth files for the first 10M or 100M vectors of the base sets. This means these billion-scale datasets can also be used as million-scale datasets. To facilitate this, the optional dataset parameter `subset_size` can be used. See the next step for further explanation.
3. Synthetic dataset
To generate a synthetic dataset with random data, you can use the following command:
```bash
python -m raft-ann-bench.generate_dataset --rows 1000000 --cols 128 --dtype float32 dataset/base.fbin
```
Here `rows` determines the number of dataset vectors, and `cols` refers to the number of features each vector has.
By default, random blobs are generated using [make_blobs](https://docs.rapids.ai/api/cuml/latest/api/#cuml.datasets.make_blobs); alternatively, uniform random data can be used. Keep in mind that a large number of dimensions combined with uniform random numbers leads to a dataset that is hard to search accurately using ANN methods.
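To illustrate the difference between the two kinds of random data, here is a minimal CPU-only sketch in NumPy. It is not the implementation behind `generate_dataset`: the helper names (`make_blobs`, `make_uniform`, `write_fbin`, `read_fbin_header`, `relative_contrast`) are hypothetical, the blob generator is a simple stand-in for cuML's `make_blobs`, and the file layout is an assumption (a little-endian header of two `uint32` values, rows then columns, followed by row-major `float32` data). The `relative_contrast` measure, the mean distance divided by the nearest-neighbor distance, is only used here to show why uniform data in many dimensions is hard to search: the closer it is to 1, the less the nearest neighbor stands out.

```python
import struct

import numpy as np


def make_blobs(rows, cols, n_centers=10, cluster_std=1.0, seed=0):
    """Gaussian blobs around random centers (NumPy stand-in, not cuML)."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-10.0, 10.0, size=(n_centers, cols))
    labels = rng.integers(0, n_centers, size=rows)
    data = centers[labels] + rng.normal(0.0, cluster_std, size=(rows, cols))
    return data.astype(np.float32)


def make_uniform(rows, cols, seed=0):
    """Uniform random vectors in [0, 1)."""
    rng = np.random.default_rng(seed)
    return rng.random((rows, cols), dtype=np.float32)


def write_fbin(path, data):
    """Write data with the assumed layout: uint32 rows, uint32 cols, float32 values."""
    data = np.ascontiguousarray(data, dtype="<f4")
    with open(path, "wb") as f:
        f.write(struct.pack("<II", data.shape[0], data.shape[1]))
        f.write(data.tobytes())


def read_fbin_header(path):
    """Return (rows, cols) read from the first 8 bytes of the file."""
    with open(path, "rb") as f:
        return struct.unpack("<II", f.read(8))


def relative_contrast(data, n_queries=20):
    """Average of (mean distance / nearest-neighbor distance) over a few queries."""
    queries, base = data[:n_queries], data[n_queries:]
    # Pairwise Euclidean distances, shape (n_queries, n_base).
    dist = np.linalg.norm(queries[:, None, :] - base[None, :, :], axis=2)
    return float(np.mean(dist.mean(axis=1) / dist.min(axis=1)))
```

For example, comparing `relative_contrast(make_blobs(2000, 128))` with `relative_contrast(make_uniform(2000, 128))` should show a clearly higher value for the blobs, and `write_fbin("base.fbin", make_blobs(2000, 128))` produces a small file whose header can be checked with `read_fbin_header`.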
## Generate ground truth
If you have a dataset but no corresponding ground truth file, you can generate the ground truth using the `generate_groundtruth` utility. Example usage: