
[RELEASE] raft v24.04 #2240

Merged: 73 commits merged into main from branch-24.04 on Apr 10, 2024

Conversation

raydouglass (Member) commented Mar 21, 2024

❄️ Code freeze for branch-24.04 and v24.04 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-24.04 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-24.04 into main for the release

raydouglass and others added 30 commits January 18, 2024 14:51
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
Part of rapidsai/rmm#1389. This removes the now-optional and soon-to-be-deprecated `supports_streams()` from RAFT's custom `device_memory_resource` implementations.

Authors:
  - Mark Harris (https://github.com/harrism)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Michael Schellenberger Costa (https://github.com/miscco)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2121
Forward-merge branch-24.02 to branch-24.04
We can build a CAGRA graph even for datasets that do not fit in GPU memory. The IVF-PQ build method only requires that the temporary IVF-PQ index (which we use for creating the knn graph) fits in GPU memory. But once the CAGRA graph is constructed, we initialize the CAGRA index, which would copy the dataset to device memory.

This PR adds a build flag that disables copying the dataset into the index. This enables building a CAGRA graph for large datasets, and also allows users to customize the allocator used for storing the data.
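
For illustration, a minimal sketch of how such a flag might be used when building from a host-resident dataset. The flag name `add_dataset_on_build` is hypothetical (the actual parameter name is defined in #2126); the rest uses the public `raft::neighbors::cagra` API.

```cpp
#include <raft/core/device_resources.hpp>
#include <raft/core/host_mdspan.hpp>
#include <raft/neighbors/cagra.cuh>

// Sketch only: build a CAGRA graph from a dataset kept in host memory,
// without attaching the dataset to the index on build.
void build_graph_only(raft::device_resources const& res,
                      raft::host_matrix_view<const float, int64_t> dataset)
{
  raft::neighbors::cagra::index_params params;
  params.graph_degree              = 64;
  params.intermediate_graph_degree = 128;
  // params.add_dataset_on_build = false;  // hypothetical name; see #2126

  auto index = raft::neighbors::cagra::build<float, uint32_t>(res, params, dataset);
  // The index now holds only the graph; the dataset can be attached later
  // (e.g. via index.update_dataset) using an allocator of the user's choice.
}
```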

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2126
This PR fixes using RAFT from a build dir, i.e. `cmake -S cuml/cpp -B cuml/cpp/build -Draft_ROOT=raft/cpp/build`.

Without this fix, CMake fails when cuML calls `find_package(RAFT)` with the following error:
```
CMake Error at raft/cpp/build/latest/raft-targets.cmake:56 (set_target_properties):
  The link interface of target "raft::raft" contains:

    hnswlib::hnswlib

  but the target was not found.  Possible reasons include:


    * There is a typo in the target name.
    * A find_package call is missing for an IMPORTED target.
    * An ALIAS target is missing.
```

Authors:
  - Paul Taylor (https://github.com/trxcllnt)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #2145
…atency (#1786)

This PR aims at reducing the latency in IVF-PQ and related functions, especially with small work sizes and in the "throughput" benchmark mode.

 - Add kernel config caching to ivf_pq::search::compute_similarity kernel
 - Add kernel config caching to select::warpsort
 - Fix the memory_resource usage in `matrix::select_k`: make sure all temporary allocations use raft's workspace memory resource.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #1786
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
* switches to CUDA 12.2.2 for building conda packages and wheels
* adds new tests running against CUDA 12.2.2

### Notes for Reviewers

This is part of ongoing work to build and test packages against CUDA 12.2.2 across all of RAPIDS.

For more details see:

* rapidsai/build-planning#7
* rapidsai/shared-workflows#166

A second round of PRs is planned to revert these references back to a proper `branch-24.{nn}` release branch of `shared-workflows` once rapidsai/shared-workflows#166 is merged.

I intentionally did not add a CUDA 12.2 environment for ANN benchmarks, as I assumed that would be more involved, and because it isn't strictly necessary for building and publishing packages that support CUDA 12.2.

https://github.com/rapidsai/raft/blob/93a504e00229c89c5b61814bdc24de09afe26534/dependencies.yaml#L23-L26

*(created with `rapids-reviser`)*

Authors:
  - James Lamb (https://github.com/jameslamb)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Jake Awe (https://github.com/AyodeAwe)

URL: #2092
Forward-merge branch-24.02 to branch-24.04
Forward-merge branch-24.02 to branch-24.04
bdice and others added 18 commits March 13, 2024 13:40
NumPy 2 is expected to be released in the near future. For the RAPIDS 24.04 release, we will pin to `numpy>=1.23,<2.0a0`. This PR adds an upper bound to affected RAPIDS repositories.

xref: rapidsai/build-planning#29

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2222
- Batch the input matrix along the N dimension into chunks of at most 65535 to avoid the gridDim.y limitation of the CUTLASS kernel (see the sketch below).
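
For illustration, a generic sketch of the batching idea (not RAFT's actual implementation): CUDA caps gridDim.y at 65535, so the N dimension is processed in chunks no larger than that, with one kernel launch per chunk.

```cpp
#include <algorithm>
#include <cstdint>

constexpr int64_t kMaxGridY = 65535;  // CUDA limit on gridDim.y

// Call `launch(offset, batch)` once per chunk; the caller maps `batch`
// rows onto gridDim.y inside its kernel launch.
template <typename LaunchFn>
void launch_in_batches(int64_t n, LaunchFn&& launch)
{
  for (int64_t offset = 0; offset < n; offset += kMaxGridY) {
    int64_t batch = std::min(kMaxGridY, n - offset);
    launch(offset, batch);
  }
}
```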

Authors:
  - Mahesh Doijade (https://github.com/mdoijade)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #2215
As of rapidsai/ucx-py#1032 and rapidsai/ucxx#205, the `ucx` version pinning in RAPIDS has been updated. This PR aligns with those changes.

Closes rapidsai/build-planning#27.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)
  - https://github.com/jakirkham

URL: #2227
This will avoid confusion for users launching only `./build.sh pylibraft`.

Authors:
  - Micka (https://github.com/lowener)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)
  - Corey J. Nolet (https://github.com/cjnolet)
  - Robert Maynard (https://github.com/robertmaynard)
  - Ray Douglass (https://github.com/raydouglass)

URL: #2090
This PR is based on @seberg's work in #1928.

From the PR:


This is a follow-up on #1926, since the rank sorting seemed a bit hard to understand.

It does modify the logic in the sense that hosts are now sorted by IP as a way to group workers by host. But I don't really think that host sorting was ever a goal?

If the goal is really about being deterministic, then this approach is more (or at least more clearly) deterministic about the order of worker IPs.

OTOH, if the NVML device order doesn't matter, we could just sort the workers directly. 

The original #1587 mentions:

> NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure. If such a requirement exists, I would want to do one of the following:

- Ensure we can guarantee this, but this requires initializing workers that are not involved in the operation.
- At least raise an error, because if NCCL ends up raising the error it will be very confusing.
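
As a small illustration of the quoted expectation (not raft-dask code), the rank-to-device mapping NCCL is said to assume looks like this:

```cpp
#include <cassert>

// Illustration only: with ranks assigned host by host, a process with
// rank r on a node that has g GPUs is expected to use device r % g.
int expected_local_device(int rank, int num_gpus_per_node)
{
  assert(num_gpus_per_node > 0);
  return rank % num_gpus_per_node;
}
```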

Authors:
  - Vibhu Jawa (https://github.com/VibhuJawa)
  - Sebastian Berg (https://github.com/seberg)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2228
This PR addresses #2204 and #2205. 


* fixes illegal access / test coverage for mean row-wise kernel
* fixes illegal access / test coverage for stdev row-wise kernel
* modifies sum kernels to utilize Kahan/Neumaier summation per thread, and also increases the load per thread to benefit from this (see the sketch below)
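
For reference, a minimal scalar sketch of Neumaier's compensated summation (illustrative only; the RAFT kernels apply the same idea per thread):

```cpp
#include <cmath>

// Neumaier's variant of Kahan summation: carry a compensation term that
// captures the low-order bits lost when adding values of very different
// magnitudes, and add it back at the end.
template <typename T>
T neumaier_sum(const T* values, int n)
{
  T sum          = T(0);
  T compensation = T(0);
  for (int i = 0; i < n; i++) {
    T t = sum + values[i];
    if (std::abs(sum) >= std::abs(values[i])) {
      compensation += (sum - t) + values[i];  // low-order bits of values[i] were lost
    } else {
      compensation += (values[i] - t) + sum;  // low-order bits of sum were lost
    }
    sum = t;
  }
  return sum + compensation;
}
```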

FYI, @tfeher

Authors:
  - Malte Förster (https://github.com/mfoerste4)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #2223
…2220)

The local `copyright.py` script is bug-prone. Replace it with a more robust centralized script from `pre-commit-hooks`.

Issue: rapidsai/build-planning#30

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)

URL: #2220
Add a `cagra::compress` function that implements CAGRA-Q (VQ + PQ) compression of a given dataset.
The result, `compressed_dataset`, is supposed to complement the CAGRA graph during `cagra::search` in place of a raw dataset.

### Current state:

  - The code runs and produces a meaningful output (tested internally by running the original prototype search with the generated compressed dataset); the recall levels are approximately the same as with the prototype implementation.
  - No test coverage yet (need to coordinate with the search PR #2206)
  - Full `pq_bits` support ([4,5,6,7,8] - same as in IVF-PQ)
  - Any `pq_dim` value is accepted, but the dataset is not padded, and thus `dim` must be a multiple of `pq_dim`.
  - The codebook math type is hardcoded to `half` to match the prototype implementation for now. This could be a runtime (build) parameter as well.
  - All common input data types should work (`uint8_t`, `int8_t`, `half`, and `float` compile), but I tested only `float`.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #2213
…2212)

Compilation of IVF-PQ search kernels can be time-consuming. In `libraft.so` the compilation is done in parallel for kernels without filtering and with the `int64_t` index type.

We have tests with the `uint32_t` index type as well as tests for `bitset_filter` with both 32- and 64-bit index types. This PR adds explicit template instantiations for the tests. This way we avoid repeated compilation of the kernels with a filter, and it also enables parallel compilation of the `compute_similarity` kernel for different template types. The kernels with these additional type parameters are not added to `libraft.so`; they are only linked into the test executable.

Note that this PR does not increase the number of compiled kernels, but it enables compiling them in parallel (see the sketch below).
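
For illustration, a generic sketch of the technique (the names are hypothetical, not RAFT's): the template definition lives in a shared header, and each type combination needed only by the tests gets its own explicit instantiation in a separate translation unit, so the instances compile in parallel and are linked only into the test executable.

```cpp
#include <cstdint>

// Hypothetical stand-in for a heavy templated kernel wrapper.
template <typename T, typename IdxT>
void compute_similarity(const T* queries, IdxT n_queries, float* out)
{
  for (IdxT i = 0; i < n_queries; i++) { out[i] = static_cast<float>(queries[i]); }
}

// Test-only explicit instantiation, normally placed in its own .cu file so it
// builds in parallel with other instantiations and stays out of libraft.so.
template void compute_similarity<float, uint32_t>(const float*, uint32_t, float*);
```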

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Ben Frederickson (https://github.com/benfred)

URL: #2212
Generating ANN benchmark ground truth is affected by bug #2171 when k > 1024. This PR fixes the issue for ground truth generation.

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2180
There was a bug for negative floating-point numbers with a max reduce operation: `std::numeric_limits<T>::min()` is greater than all negative floating-point values, whereas we want the initial value to be smaller than all representable values.

This PR replaces the `min` with the `lowest`.
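
A minimal standalone illustration of the difference (not RAFT code):

```cpp
#include <algorithm>
#include <cstdio>
#include <limits>

int main()
{
  float values[] = {-3.5f, -1.25f, -7.0f};

  // std::numeric_limits<float>::min() is the smallest positive normalized
  // value (~1.18e-38), so as the identity of a max reduction it incorrectly
  // "wins" against every negative input.
  float with_min    = std::numeric_limits<float>::min();
  float with_lowest = std::numeric_limits<float>::lowest();  // most negative finite value

  for (float v : values) {
    with_min    = std::max(with_min, v);
    with_lowest = std::max(with_lowest, v);
  }
  std::printf("min():    %g\n", with_min);     // 1.17549e-38 (wrong)
  std::printf("lowest(): %g\n", with_lowest);  // -1.25 (correct)
  return 0;
}
```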

Authors:
  - Akif ÇÖRDÜK (https://github.com/akifcorduk)
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #2226
- Adds a cosine 1-NN CUTLASS-based kernel for SM 8.0 or higher using tensor cores, based on 3xTF32.
- Unifies the fusedDistanceNN kernels for L2/cosine.
- Exposes this API in pylibraft as `fused_distance_nn_arg_min`, supporting cosine & L2 distance metrics.

Authors:
  - Mahesh Doijade (https://github.com/mdoijade)
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Ben Frederickson (https://github.com/benfred)

URL: #2125
The random sampling of IVF methods was reverted (#2144) due to large memory utilization (#2141).

This PR improves the memory consumption of subsampling: it is O(n_train), where n_train is the size of the subsampled dataset.

This PR adds the following new APIs:
- `random::excess_sampling` (TODO: may just be called `sample_without_replacement`)
- `matrix::sample_rows`
- `matrix::gather` for host input matrices

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Ben Frederickson (https://github.com/benfred)

URL: #2155
This PR is a follow-up to #2169. To enable IVF-Flat with k > 256 we need an additional select_k invocation, which was unexpectedly slow. There are two reasons for that:

The first problem is the data handed to select_k: the valid data length per row is much smaller than the conservative maximum that could be reached by probing the N largest lists. Therefore each query row contains roughly 50% dummy values. This is also the case for IVF-PQ, but it did not show up as prominently due to the second reason.

The second problem, which is also a difference from the IVF-PQ algorithm, is that a 64-bit payload data type is used for select_k. The performance of select_k with a 64-bit index type is significantly slower than with 32-bit, especially when many elements are in the same range:
```
Benchmark                                                           Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------
SelectK/float/uint32_t/kRadix11bitsExtraPass/1/manual_time       1.68 ms         1.74 ms          413 1357#200000#512
SelectK/float/uint32_t/kRadix11bitsExtraPass/3/manual_time       2.31 ms         2.37 ms          302 1357#200000#512#same-leading-bits
SelectK/float/int64_t/kRadix11bitsExtraPass/1/manual_time        5.92 ms         5.98 ms          116 1357#200000#512
SelectK/float/int64_t/kRadix11bitsExtraPass/3/manual_time        83.7 ms         83.8 ms            8 1357#200000#512#same-leading-bits
-----------------------------------------------------------------------------------------------------
```
The data distribution within an IVF-Flat benchmark resulted in a select_k time of ~24 ms.

### scope:
* An additional parameter is added to select_k to optionally pass individual row lengths for every batch entry. This parameter is utilized by both IVF-Flat and IVF-PQ and results in a ~2x speedup (50 nodes out of 5000) of the final `select_k`.
* The IVF-Flat search is refactored to work with 32-bit indices by storing positions instead of actual indices. This allows utilizing the 32-bit-index select_k for a ~10x speedup in the final `select_k` (see the sketch below).
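
For illustration, a self-contained sketch of the position-based trick in the second bullet (hypothetical helper, not RAFT's API): the top-k selection runs on 32-bit positions into the candidate lists, and only the selected positions are translated back into 64-bit dataset indices.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Select the k smallest distances by 32-bit *position*, then gather the
// corresponding 64-bit dataset indices. A naive partial_sort stands in for
// the real 32-bit select_k to keep the sketch self-contained.
std::vector<int64_t> select_then_gather(const std::vector<float>& candidate_dists,
                                        const std::vector<int64_t>& candidate_ids,
                                        int k)
{
  std::vector<uint32_t> pos(candidate_dists.size());
  for (uint32_t i = 0; i < pos.size(); i++) { pos[i] = i; }
  std::partial_sort(pos.begin(), pos.begin() + k, pos.end(),
                    [&](uint32_t a, uint32_t b) { return candidate_dists[a] < candidate_dists[b]; });

  std::vector<int64_t> out(k);
  for (int i = 0; i < k; i++) { out[i] = candidate_ids[pos[i]]; }  // gather 64-bit ids
  return out;
}
```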

FYI @tfeher @achirkin 

### not in scope:
* General optimization of select_k: in the current implementation there is no distinction between the payload type and the actual index type. The type of the histogram in particular has a large effect on performance (due to the atomics).

Authors:
  - Malte Förster (https://github.com/mfoerste4)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #2221
Rel: #1889

## Limitations
- Only 8-bit PQ is supported
- Only a sub-space size of 2 is supported

Authors:
  - tsuki (https://github.com/enp1s0)
  - Artem M. Chirkin (https://github.com/achirkin)
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #2206
Add the relevant options to the CAGRA parameter parser and refinement to the CAGRA ANN benchmark.
No changes to the library code.

NB: the new option won't work correctly until #2206 is merged.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #2233
- This PR is one part of the feature described in #1969.

Authors:
  - James Rong (https://github.com/rhdong)

Approvers:
  - Ben Frederickson (https://github.com/benfred)
  - Micka (https://github.com/lowener)
  - Corey J. Nolet (https://github.com/cjnolet)

Authors:
  - rhdong (https://github.com/rhdong)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Micka (https://github.com/lowener)

URL: #2109
@raydouglass raydouglass merged commit e0d40e5 into main Apr 10, 2024
3 of 5 checks passed