
Forward-merge branch-24.02 to branch-24.04 #2131

Merged · 1 commit · Jan 25, 2024
Conversation

GPUtester
Contributor

Forward-merge triggered by push to branch-24.02 that creates a PR to keep branch-24.04 up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge.

This PR addresses #1901 by subsampling the input dataset for PQ codebook training to reduce the runtime.

Currently, a similar strategy is applied to the `per_cluster` method but not to the default `per_subset` method; this PR closes that gap. Following the subsampling mechanism of the `per_cluster` method, we select at least `256 * max(pq_book_size, pq_dim)` input rows for training each codebook.

https://github.com/rapidsai/raft/blob/cf4e03d0b952c1baac73f695f94d6482d8c391d8/cpp/include/raft/neighbors/detail/ivf_pq_build.cuh#L408

The following performance numbers were generated on the Deep-100M dataset. After subsampling, search time and accuracy are not impacted (within ±5%), except for one case with a 9% drop in search performance (using a 10K query batch). More extensive benchmarking across datasets would be needed to fully justify the change.

Dataset | n_iter | n_list | pq_bits | pq_dim | ratio | Original time (s) | Subsampling (s) | Speedup [subsampling]
-- | -- | -- | -- | -- | -- | -- | -- | --
Deep-100M | 25 | 50000 | 4 | 96 | 10 | 129 | 89.5 | 1.44
Deep-100M | 25 | 50000 | 5 | 96 | 10 | 128 | 89.4 | 1.43
Deep-100M | 25 | 50000 | 6 | 96 | 10 | 131 | 90 | 1.46
Deep-100M | 25 | 50000 | 7 | 96 | 10 | 129 | 91.1 | 1.42
Deep-100M | 25 | 50000 | 8 | 96 | 10 | 149 | 93.4 | 1.60

Note that after subsampling, PQ codebook generation is no longer a bottleneck in IVF-PQ index building, so further optimization of codebook generation seems unnecessary. Although we could in principle combine the custom kernel approach (#2050) with subsampling, my early tests show the current GEMM approach performs better than the custom kernel once subsampling is applied.

Using multiple streams could improve performance further by overlapping the kernels for different `pq_dim` values, since those kernels are small after subsampling and may not fully utilize the GPU. However, as mentioned above, because the entire PQ codebook generation is already fast, this optimization may not be worthwhile.

TODO 

- [x] Benchmark the performance/accuracy impacts on multiple datasets

Authors:
  - Rui Lan (https://github.com/abc99lr)
  - Ray Douglass (https://github.com/raydouglass)
  - gpuCI (https://github.com/GPUtester)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2052
@GPUtester GPUtester requested review from a team as code owners January 25, 2024 05:50
@GPUtester GPUtester merged commit ea0d2f6 into branch-24.04 Jan 25, 2024
17 of 18 checks passed
@GPUtester
Contributor Author

SUCCESS - forward-merge complete.
