
Forward-merge branch-24.02 to branch-24.04 #2131

Merged · 1 commit · Jan 25, 2024
Conversation

GPUtester
Contributor

Forward-merge triggered by push to branch-24.02 that creates a PR to keep branch-24.04 up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge.

This PR addresses #1901 by subsampling the input dataset for PQ codebook training to reduce the runtime.

Currently, a similar strategy is applied to the `per_cluster` method but not to the default `per_subset` method; this PR closes that gap. Following the subsampling mechanism of the `per_cluster` method, we select at least `256 * max(pq_book_size, pq_dim)` input rows for training each codebook.

https://github.com/rapidsai/raft/blob/cf4e03d0b952c1baac73f695f94d6482d8c391d8/cpp/include/raft/neighbors/detail/ivf_pq_build.cuh#L408

The following performance numbers were generated on the Deep-100M dataset. After subsampling, search time and accuracy are not impacted (within ±5%), except for one case with a 9% drop in search performance (using a 10K query batch). More extensive benchmarking across datasets would be needed to fully justify the change.

Dataset | n_iter | n_list | pq_bits | pq_dim | ratio | Original time (s) | Subsampling (s) | Speedup [subsampling]
-- | -- | -- | -- | -- | -- | -- | -- | --
Deep-100M | 25 | 50000 | 4 | 96 | 10 | 129 | 89.5 | 1.44
Deep-100M | 25 | 50000 | 5 | 96 | 10 | 128 | 89.4 | 1.43
Deep-100M | 25 | 50000 | 6 | 96 | 10 | 131 | 90 | 1.46
Deep-100M | 25 | 50000 | 7 | 96 | 10 | 129 | 91.1 | 1.42
Deep-100M | 25 | 50000 | 8 | 96 | 10 | 149 | 93.4 | 1.60

Note that after subsampling, PQ codebook generation is no longer a bottleneck in IVF-PQ index building, so further optimization of codebook generation seems unnecessary. Although we could in principle combine the custom kernel approach (#2050) with subsampling, my early tests show the current GEMM approach performs better than the custom kernel once subsampling is applied.

Using multiple streams could improve performance further by overlapping the kernels for different `pq_dim` values, since those kernels are small after subsampling and may not fully utilize the GPU. However, as mentioned above, because the entire PQ codebook generation is already fast, this optimization may not be worthwhile.

TODO 

- [x] Benchmark the performance/accuracy impacts on multiple datasets

Authors:
  - Rui Lan (https://github.com/abc99lr)
  - Ray Douglass (https://github.com/raydouglass)
  - gpuCI (https://github.com/GPUtester)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2052
@GPUtester GPUtester requested review from a team as code owners January 25, 2024 05:50
@GPUtester GPUtester merged commit ea0d2f6 into branch-24.04 Jan 25, 2024
17 of 18 checks passed
@GPUtester
Contributor Author

SUCCESS - forward-merge complete.
