
[WIP] Persistent CAGRA kernel #2316

Closed
wants to merge 46 commits

Conversation

@achirkin (Contributor) commented on May 14, 2024

An experimental version of the single-CTA CAGRA kernel that runs persistently while allowing many CPU threads to submit queries in small batches very efficiently.

[Benchmark plots: CAGRA throughput @ recall = 0.976 and CAGRA latency @ recall = 0.976]

API

In the current implementation, the public API barely changes: an extra parameter persistent is added to ann::cagra::search_params (only valid when algo == SINGLE_CTA).
The persistent kernel is managed by a global runner object held in a shared_ptr; the first CPU thread to call the kernel spawns the runner, and subsequent calls/threads only update a global "heartbeat" atomic variable (runner_base_t::last_touch). When there has been no heartbeat for the last few seconds (kLiveInterval), the runner shuts down the kernel and cleans up the associated resources.
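For illustration, here is a minimal caller-side sketch. The namespaces and the search signature follow the public RAFT CAGRA API as it exists today and are assumptions of this example; only the persistent flag itself comes from this PR.

```cpp
#include <cstdint>
#include <raft/core/device_mdspan.hpp>
#include <raft/core/resources.hpp>
#include <raft/neighbors/cagra.cuh>

// Hypothetical helper: submit one small batch of queries to the persistent kernel.
// Many CPU threads may call this concurrently with the same index.
void search_persistent(raft::resources const& res,
                       raft::neighbors::cagra::index<float, uint32_t> const& index,
                       raft::device_matrix_view<const float, int64_t> queries,
                       raft::device_matrix_view<uint32_t, int64_t> neighbors,
                       raft::device_matrix_view<float, int64_t> distances)
{
  raft::neighbors::cagra::search_params params;
  params.algo       = raft::neighbors::cagra::search_algo::SINGLE_CTA;
  params.persistent = true;  // the new flag; only valid together with SINGLE_CTA
  // The first thread to reach this call spawns the persistent runner; later
  // calls only refresh its heartbeat and enqueue their queries.
  raft::neighbors::cagra::search(res, params, index, queries, neighbors, distances);
}
```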

An alternative solution would be to control the kernel explicitly, in a client-server style. This would be more controllable, but would require significant re-thinking of the RAFT/cuVS API.

Integration notes

lightweight_uvector

RMM memory resources and device buffers are not zero-cost, even when the allocation size is zero (a common pattern for conditionally-used buffers). They do at least a couple of cudaGetDevice calls during initialization. Normally, this overhead is negligible. However, when the number of concurrent threads is large (hundreds of threads), any CUDA call can become a bottleneck due to a single mutex guarding a critical section somewhere in the driver.

To work around this, I introduce a lightweight_uvector in /detail/cagra/search_plan.cuh for several buffers used in CAGRA kernels. This is a stripped-down, "multi-device-unsafe" version of rmm::device_uvector: it does not check during resize/destruction whether the current device has changed since construction.
We may consider putting this in a common folder to use across other RAFT/cuVS algorithms.
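A rough sketch of the idea follows; the actual lightweight_uvector in detail/cagra/search_plan.cuh differs in details, this is only to illustrate why an empty, conditionally-used buffer becomes free of CUDA calls:

```cpp
#include <cstddef>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

// Illustrative sketch: no CUDA calls are made while the size stays zero, and
// the current device is never queried or tracked (hence "multi-device-unsafe").
// Contents are not preserved on resize, which is enough for scratch buffers.
template <typename T>
struct lightweight_uvector {
  rmm::mr::device_memory_resource* mr;
  rmm::cuda_stream_view stream;
  T* ptr{nullptr};
  std::size_t count{0};

  lightweight_uvector(rmm::mr::device_memory_resource* mr_, rmm::cuda_stream_view s)
    : mr{mr_}, stream{s}
  {
    // No cudaGetDevice, no allocation here.
  }

  void resize(std::size_t new_count)
  {
    if (new_count == count) { return; }
    if (ptr != nullptr) {
      mr->deallocate(ptr, count * sizeof(T), stream);
      ptr = nullptr;
    }
    if (new_count > 0) { ptr = static_cast<T*>(mr->allocate(new_count * sizeof(T), stream)); }
    count = new_count;
  }

  ~lightweight_uvector()
  {
    // NB: unlike rmm::device_uvector, we do not check or restore the current device here.
    if (ptr != nullptr) { mr->deallocate(ptr, count * sizeof(T), stream); }
  }
};
```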

Shared resource queues / ring buffers

resource_queue_t is an atomic counter-based ring buffer used to distribute the worker resources (CTAs) and pre-allocated job descriptors across CPU I/O threads.
We may consider putting this in a common public namespace in raft if we envision more uses for it.
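For context, a simplified sketch of the kind of structure this refers to; the design below follows the well-known bounded MPMC ring buffer with per-slot sequence counters, while the actual resource_queue_t differs, e.g. in how it waits when the queue is empty:

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <optional>

// Bounded MPMC ring buffer driven by monotonically increasing atomic counters.
// It hands out small resource handles (CTA slots, job-descriptor indices) to
// many CPU I/O threads. Capacity must be a power of two.
template <typename T, uint32_t Capacity>
class resource_ring {
  static constexpr uint32_t kMask = Capacity - 1;
  struct slot_t {
    std::atomic<uint32_t> seq;
    T value;
  };
  std::array<slot_t, Capacity> slots_;
  std::atomic<uint32_t> head_{0};  // next position to pop
  std::atomic<uint32_t> tail_{0};  // next position to push

 public:
  resource_ring()
  {
    for (uint32_t i = 0; i < Capacity; i++) { slots_[i].seq.store(i, std::memory_order_relaxed); }
  }

  // Returns false if the ring is full (cannot happen when at most Capacity
  // handles circulate, which is the resource-pool use case).
  bool try_push(T value)
  {
    uint32_t pos = tail_.load(std::memory_order_relaxed);
    while (true) {
      slot_t& s  = slots_[pos & kMask];
      uint32_t q = s.seq.load(std::memory_order_acquire);
      int32_t d  = int32_t(q - pos);
      if (d == 0) {
        if (tail_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
          s.value = value;
          s.seq.store(pos + 1, std::memory_order_release);
          return true;
        }
      } else if (d < 0) {
        return false;  // full
      } else {
        pos = tail_.load(std::memory_order_relaxed);
      }
    }
  }

  // Returns std::nullopt if the ring is currently empty.
  std::optional<T> try_pop()
  {
    uint32_t pos = head_.load(std::memory_order_relaxed);
    while (true) {
      slot_t& s  = slots_[pos & kMask];
      uint32_t q = s.seq.load(std::memory_order_acquire);
      int32_t d  = int32_t(q - (pos + 1));
      if (d == 0) {
        if (head_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
          T value = s.value;
          s.seq.store(pos + Capacity, std::memory_order_release);
          return value;
        }
      } else if (d < 0) {
        return std::nullopt;  // empty
      } else {
        pos = head_.load(std::memory_order_relaxed);
      }
    }
  }
};
```

In the resource-pool usage pattern, the owner pushes all handles once (e.g. CTA ids 0..N-1), and each I/O thread pops a handle, uses the corresponding worker or descriptor, and pushes the handle back.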

Persistent runner structs

launcher_t and persistent_runner_base_t look like they could be abstracted from the cagra kernel and re-used in other algos. The code in its current state, however, is not ready for this.
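To make the intended lifecycle concrete, here is a much-simplified sketch of the implicit runner mechanics described in the API section above. The names persistent_runner_base_t, last_touch, and kLiveInterval come from the PR description; everything else, including the value of kLiveInterval and the watcher logic, is illustrative and not the real implementation.

```cpp
#include <atomic>
#include <chrono>
#include <memory>
#include <mutex>

using heartbeat_clock = std::chrono::steady_clock;
constexpr auto kLiveInterval = std::chrono::seconds(2);  // assumed value, not from the PR

struct persistent_runner_base_t {
  // Heartbeat: the time of the last search call, stored as a clock tick count.
  std::atomic<heartbeat_clock::rep> last_touch{0};
  // ... kernel launch state, worker/job queues, completion flags, etc.
  ~persistent_runner_base_t() { /* stop the kernel, free pinned/device buffers */ }
};

// Global runner shared by all CPU threads that use the persistent kernel.
std::shared_ptr<persistent_runner_base_t> runner;
std::mutex runner_mutex;

std::shared_ptr<persistent_runner_base_t> get_runner()
{
  std::lock_guard<std::mutex> guard(runner_mutex);
  if (!runner) {
    // First caller: spawn the runner (which launches the persistent kernel).
    runner = std::make_shared<persistent_runner_base_t>();
  }
  // Every caller refreshes the heartbeat before submitting its queries.
  runner->last_touch.store(heartbeat_clock::now().time_since_epoch().count(),
                           std::memory_order_relaxed);
  return runner;
}

// A watcher (e.g. a background thread) periodically checks the heartbeat and
// resets the shared_ptr once it goes stale, destroying the runner and the kernel.
void maybe_shutdown()
{
  std::lock_guard<std::mutex> guard(runner_mutex);
  if (!runner) { return; }
  auto last = heartbeat_clock::time_point(
    heartbeat_clock::duration(runner->last_touch.load(std::memory_order_relaxed)));
  if (heartbeat_clock::now() - last > kLiveInterval) { runner.reset(); }
}
```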

Other code changes (solved)

This depends on (includes all the changes from):

achirkin added 30 commits May 14, 2024 10:57
The host fills in the work_descriptors, which are in pinned memory, and then arrives at the
input barriers (managed memory, on device) to mark that the descriptors are ready to read.
Then it waits on the completion latch (managed memory, on host). (A simplified sketch of this handshake follows the commit list below.)

The device reads the descriptors when the readiness barriers allow that.
The descriptors are read by multiple threads at the same time (hoping for a single coalesced read).
Minimize the host<->device latencies by using host pinned memory and device memory for intra-device comm
…mark loop event sync optional

When using the persistent kernel variant, the calling CPU thread has to synchronize
with the GPU (wait on the completion flag) - i.e. there's no way to use events for this.
As a result, the event recording and sync in the benchmark loop introduce significant latency overheads.
To avoid this, I make the event optional (dependent on the search mode: persistent/original).

Originally, the benchmark used size_t indices, whereas CAGRA operated with uint32_t.
As a result, we had to do a linear mapping (on GPU), which adds a kernel to the benchmark loop,
which goes against the event optimization above.
Hence, I changed the benchmark index type.
Restructure input/output a bit to pad the atomics to 128 bytes.
This reduces the latency/single threaded time by 3x on a PCIe machine.
1) Make the persistent kernel allocate the hashmap in advance.
2) Introduce lightweight_uvector, which does not call any CUDA functions when not needed.
Since the worker and job queues were decoupled, it's not necessary to wait for the job to be
read anymore. As soon as the descriptor handle is read, it can be returned to the queue.
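To illustrate the host<->device handshake and the 128-byte padding mentioned in the commit messages above, here is a stripped-down sketch. The real kernel uses richer descriptors, cuda::barrier/latch objects, and queues of such slots; everything below is illustrative, and it assumes a platform with unified addressing and system-scope atomics (compute capability 6.0+).

```cpp
#include <cstdio>
#include <new>
#include <cuda/atomic>
#include <cuda_runtime.h>

// One work slot: a descriptor plus its readiness/completion flags, each padded
// to 128 bytes so that flags polled by different threads never share a cache line.
struct work_descriptor {
  const float* queries;  // device pointer to the query batch
  unsigned n_queries;
};
struct alignas(128) padded_flag {
  cuda::atomic<unsigned, cuda::thread_scope_system> value{0};
};
static_assert(sizeof(padded_flag) == 128, "each flag occupies its own 128-byte slot");

__global__ void persistent_worker(const work_descriptor* desc, padded_flag* ready, padded_flag* done)
{
  // Wait until the host marks the descriptor as ready (the "input barrier").
  if (threadIdx.x == 0) {
    while (ready->value.load(cuda::std::memory_order_acquire) == 0) {}
  }
  __syncthreads();
  // All threads of the CTA read the small descriptor together (ideally one coalesced read).
  work_descriptor d = *desc;
  (void)d;  // ... run the search for this batch ...
  __syncthreads();
  // Signal completion back to the waiting CPU thread (the "completion latch").
  if (threadIdx.x == 0) { done->value.store(1, cuda::std::memory_order_release); }
}

int main()
{
  // Descriptors and flags live in host-pinned memory so both sides can access
  // them directly, avoiding extra copies and stream synchronization.
  work_descriptor* desc;
  padded_flag *ready, *done;
  cudaHostAlloc(reinterpret_cast<void**>(&desc), sizeof(work_descriptor), cudaHostAllocDefault);
  cudaHostAlloc(reinterpret_cast<void**>(&ready), sizeof(padded_flag), cudaHostAllocDefault);
  cudaHostAlloc(reinterpret_cast<void**>(&done), sizeof(padded_flag), cudaHostAllocDefault);
  new (ready) padded_flag{};
  new (done) padded_flag{};

  persistent_worker<<<1, 128>>>(desc, ready, done);

  // Host: fill in the descriptor, then arrive at the input barrier.
  desc->queries   = nullptr;
  desc->n_queries = 0;
  ready->value.store(1, cuda::std::memory_order_release);

  // Wait on the completion flag instead of recording/synchronizing an event.
  while (done->value.load(cuda::std::memory_order_acquire) == 0) {}
  cudaDeviceSynchronize();
  std::printf("batch done\n");
  return 0;
}
```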
@achirkin added the labels "feature request" (New feature or request), "non-breaking" (Non-breaking change), and "2 - In Progress" (Currently a work in progress) on May 14, 2024
@achirkin self-assigned this on May 14, 2024
@github-actions bot removed the "CMake" label on May 15, 2024
@cjnolet added the "5 - DO NOT MERGE" (Hold off on merging; see PR for details) label on May 17, 2024
@achirkin changed the base branch from branch-24.06 to branch-24.08 on June 13, 2024 05:45
@tfeher (Contributor) left a comment:

Thanks Artem for this work! Just a few initial comments here.

For the API discussion, one question is how to start/stop the kernel. The current approach is nice: just one bool flag in search params, and the rest is done automatically. I believe this would not preclude adding explicit start/stop methods if we deem that necessary.

The other question is which parameters we need to control for the persistent kernel (apart from whether or not to use it at all), e.g. max number of threads or kLiveInterval. If we want to provide control for these, then we could add fields to search_params, but we could also think of them as a property of the index, since the long-running kernel is associated with the index.

@@ -73,7 +76,7 @@ struct AlgoProperty {
 class AnnBase {
  public:
-  using index_type = size_t;
+  using index_type = uint32_t;
Contributor:

Do we want to change this for all benchmarks?

achirkin (Contributor, Author):

That was a temporary change for testing; I've removed it in the cuVS PR (cuVS doesn't support uint32_t indexes in the refinement step).

@@ -124,6 +124,8 @@ struct search_params : ann::search_params {
   uint32_t num_random_samplings = 1;
   /** Bit mask used for initial random seed node selection. */
   uint64_t rand_xor_mask = 0x128394;
+  /** Whether to use the persistent version of the kernel (only SINGLE_CTA is supported a.t.m.) */
+  bool persistent = false;
Contributor:

Is this the only parameter that needs to be controlled? What about the temporary buffer size (queue for queries, or max number of threads)?

achirkin (Contributor, Author):

The buffer sizes are compile-time constants in the current design. I'd like to move some of the new constants to the parameters, but that conflicts with the implicit mechanics of running the kernel: they are not search parameters but "runner" parameters, and should not change across calls to the kernel.
So this brings up again the question of whether we want an implicit or an explicit kernel runner.

achirkin (Contributor, Author):

Closing this in favor of rapidsai/cuvs#215

Labels: 2 - In Progress (Currently a work in progress), 5 - DO NOT MERGE (Hold off on merging; see PR for details), cpp, feature request (New feature or request), non-breaking (Non-breaking change)