Issue Summary:
I am experimenting with the Filtered-DiskANN algorithm (the in-memory version) on the SIFT dataset (1M). I followed the steps outlined in the markdown file to generate synthetic labels, build the index, and run the search. However, recall is low when I use label distributions other than random, such as one_per_point (uniform distribution) and Zipf.
Expected Behavior:
I expected recall to be consistent across label distributions, as suggested in the Filtered-DiskANN paper. Specifically, I expected recall to remain high for distributions other than random.
Observed Behavior:
With the one_per_point and Zipf distributions, recall is significantly lower than with the random distribution. The only difference I can see is the specificity: in the random setting it is 50%, whereas in the other cases it is low (5%).
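To make the specificity figures concrete: by specificity I mean the fraction of base points that carry a given label. Below is a minimal Python sketch of how it could be computed. It assumes the label file is plain text with one line per point and comma-separated labels on each line; the function name `label_specificity` is hypothetical and not part of DiskANN.

```python
from collections import Counter


def label_specificity(label_lines):
    """Return {label: fraction of points carrying that label}.

    Assumption: label_lines has one entry per base point, each entry a
    comma-separated list of labels (an empty entry is an unlabeled point).
    """
    counts = Counter()
    for line in label_lines:
        # set() so a label repeated on one line is counted once per point
        counts.update({lab.strip() for lab in line.split(",") if lab.strip()})
    n_points = len(label_lines)
    return {lab: c / n_points for lab, c in counts.items()}


# Toy example: 4 points, two labels
spec = label_specificity(["1,2", "1", "2", "1"])
print(spec)
```

For a real label file, the lines could be passed in with something like `open(path).read().splitlines()`, where `path` is wherever the generated labels were written.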
Steps to Reproduce:
I used the same parameters as described in the Filtered-DiskANN paper.
I have attached the Bash script to reproduce the experiment.
Results:
Reading truthset file data/sift_gt.bin ...
Metadata: #pts = 10000, #dims = 10...
L2: Using AVX2 distance computation DistanceL2Float
Resizing took: 0.0387484s
From graph header, expected_file_size: 379395480, _max_observed_degree: 96, _start: 123742, file_frozen_pts: 0
Loading vamana graph data/sift_filtered_index...done. Index has 1000000 nodes and 93848864 out-edges, _start is set to 123742
Identified 13 distinct label(s)
Num frozen points:0 _nd: 1000000 _start: 123742 size(_location_to_tag): 0 size(_tag_to_location):0 Max points: 1000000
Index loaded
Using 48 threads to search
  Ls        QPS   Avg dist cmps   Mean Latency (mus)   99.9 Latency   Recall@10
=================================================================================
  10   51225.34          296.26               869.38       24950.13       46.54
 100   13521.95         1165.10              3472.06       30612.68       49.84
 600    3233.68         4394.70             14768.81       55484.78       49.64
 650    3143.47         4660.87             14896.70       55146.52       49.64
Done searching. Now saving results
Random
Metadata: #pts = 10000, #dims = 10...
L2: Using AVX2 distance computation DistanceL2Float
Resizing took: 0.0366022s
From graph header, expected_file_size: 380582992, _max_observed_degree: 96, _start: 123742, file_frozen_pts: 0
Loading vamana graph data/sift_filtered_index...done. Index has 1000000 nodes and 94145742 out-edges, _start is set to 123742
Identified 13 distinct label(s)
Num frozen points:0 _nd: 1000000 _start: 123742 size(_location_to_tag): 0 size(_tag_to_location):0 Max points: 1000000
Index loaded
Using 48 threads to search
  Ls        QPS   Avg dist cmps   Mean Latency (mus)   99.9 Latency   Recall@10
=================================================================================
  10   43119.56          599.91              1056.97       29310.13       81.11
 100    8915.00         2456.43              5341.05       30007.46       98.84
 600    1743.98         8778.31             27424.71       92325.24       99.86
 650    1612.75         9283.70             29650.47       90650.07       99.87
Done searching. Now saving results
Questions:
I'm trying to understand the cause of this low recall. Is it due to the label distribution itself, is it linked directly to the specificity, or does it come from the set of parameters used?
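One way I could try to separate these causes is to break Recall@10 down by the filter label of each query and compare it against that label's specificity. The sketch below is only an illustration. It assumes the result IDs, ground-truth IDs, and per-query filter labels are already loaded as Python lists; `recall_at_k_by_label` and its arguments are hypothetical names, not DiskANN APIs.

```python
from collections import defaultdict


def recall_at_k_by_label(results, truth, query_labels, k=10):
    """Return {filter label: recall@k in percent} over the queries using it.

    Assumptions: results[i] and truth[i] are lists of point IDs for query i,
    and query_labels[i] is the single filter label applied to query i.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for res, gt, lab in zip(results, truth, query_labels):
        gt_k = set(gt[:k])
        # Count how many of the true top-k neighbors were returned
        hits[lab] += len(gt_k.intersection(res[:k]))
        totals[lab] += len(gt_k)
    return {lab: 100.0 * hits[lab] / totals[lab] for lab in totals if totals[lab]}


# Toy example with k=2: the query on label "a" finds 1 of 2 true neighbors,
# the query on label "b" finds both.
per_label = recall_at_k_by_label(
    results=[[1, 2], [3, 4]],
    truth=[[1, 5], [3, 4]],
    query_labels=["a", "b"],
    k=2,
)
print(per_label)
```

If recall dropped only for low-specificity labels regardless of how the labels were generated, that would point to specificity rather than the distribution.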
Thanks in advance.