From 226745ccefa6076f5be7349f73d2adf80f2266b8 Mon Sep 17 00:00:00 2001
From: Tamas Bela Feher
Date: Thu, 23 Nov 2023 16:34:14 +0100
Subject: [PATCH] Edit benchmark guide

---
 docs/source/ann_benchmarks_dataset.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/docs/source/ann_benchmarks_dataset.md b/docs/source/ann_benchmarks_dataset.md
index 821345b07c..fd950843fe 100644
--- a/docs/source/ann_benchmarks_dataset.md
+++ b/docs/source/ann_benchmarks_dataset.md
@@ -46,6 +46,14 @@ Commonly used datasets can be downloaded from two websites:
 ```
 Besides ground truth files for the whole billion-scale datasets, this site also provides ground truth files for the first 10M or 100M vectors of the base sets. This mean we can use these billion-scale datasets as million-scale datasets. To facilitate this, an optional parameter `subset_size` for dataset can be used. See the next step for further explanation.

+3. Synthetic dataset
+To generate a synthetic dataset with random data, you can use the following command:
+```bash
+python -m raft-ann-bench.generate_dataset --rows 1000000 --cols 128 --dtype float32 dataset/base.fbin
+```
+Here `rows` determines the number of dataset vectors, and `cols` refers to the number of features each vector has.
+By default, random blobs are generated using [make_blobs](https://docs.rapids.ai/api/cuml/latest/api/#cuml.datasets.make_blobs); alternatively, uniform random data can also be used. Keep in mind that combining a large number of dimensions with uniform random numbers leads to a dataset that is hard to search accurately using ANN methods.
+
 ## Generate ground truth
 If you have a dataset, but no corresponding ground truth file, then you can generate ground trunth using the `generate_groundtruth` utility. Example usage: