[WIP] Persistent CAGRA kernel #2316
Conversation
The host fills in the work_descriptors, which live in pinned memory, and then arrives at the input barriers (managed memory, on device) to mark that the descriptors are ready to read. Then it waits on the completion latch (managed memory, on host). The device reads the descriptors once the readiness barrier allows it; the descriptors are read by multiple threads at the same time (hoping for a single coalesced read).
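A minimal sketch of this handshake, assuming libcu++ system-scope atomics in place of the PR's actual barrier/latch types; all names here are illustrative, not the PR's code:

```cpp
#include <cuda/atomic>
#include <cstdint>

using flag_t = cuda::atomic<uint32_t, cuda::thread_scope_system>;  // in managed memory

struct alignas(128) work_descriptor_t {  // lives in pinned host memory
  const float* queries;
  uint32_t*    neighbors;
  uint32_t     n_queries;
};

// Host side: write the descriptor, signal readiness, wait for completion.
inline void submit(work_descriptor_t* desc, flag_t* ready, flag_t* completed,
                   const work_descriptor_t& job)
{
  *desc = job;                                   // plain write to pinned memory
  ready->store(1, cuda::memory_order_release);   // "arrive": descriptor is valid
  while (completed->load(cuda::memory_order_acquire) == 0) { /* spin / yield */ }
}

// Device side: one thread spins on the readiness flag, then the whole CTA
// copies the descriptor word-by-word, hoping for a single coalesced read.
__global__ void worker(const work_descriptor_t* desc, flag_t* ready, flag_t* completed)
{
  __shared__ work_descriptor_t job;
  if (threadIdx.x == 0) {
    while (ready->load(cuda::memory_order_acquire) == 0) {}
  }
  __syncthreads();
  if (threadIdx.x * 4 < sizeof(work_descriptor_t)) {
    reinterpret_cast<uint32_t*>(&job)[threadIdx.x] =
      reinterpret_cast<const uint32_t*>(desc)[threadIdx.x];
  }
  __syncthreads();
  // ... process `job` ...
  if (threadIdx.x == 0) { completed->store(1, cuda::memory_order_release); }
}
```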
Minimize the host<->device latencies by using host pinned memory and device memory for intra-device comm
…ueue submitting with worker releasing
…safety related to the runner.
…mark loop event sync optional. When using the persistent kernel variant, the calling CPU thread has to synchronize with the GPU by waiting on the completion flag, i.e. there's no way to use events for this. As a result, the event recording and sync in the benchmark loop introduce significant latency overheads. To avoid this, I make the event optional (dependent on the search mode: persistent/original). Originally, the benchmark used size_t indices, whereas CAGRA operated with uint32_t. As a result, we had to do a linear mapping (on GPU), which adds a kernel to the benchmark loop and goes against the event optimization above. Hence, I changed the benchmark index type.
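A sketch of the optional event sync, assuming a benchmark loop of roughly this shape; `bench_loop_t`, `run_search`, and the member names are illustrative, not the PR's actual code:

```cpp
#include <cuda_runtime.h>

struct bench_loop_t {
  cudaStream_t stream_;
  cudaEvent_t  event_;
  bool         use_event_;  // false when search_params.persistent is set

  void iteration()
  {
    run_search(stream_);  // with the persistent kernel, this call itself blocks
                          // on the host-side completion flag, so no event is needed
    if (use_event_) {     // original (non-persistent) mode: sync via CUDA event
      cudaEventRecord(event_, stream_);
      cudaEventSynchronize(event_);
    }
  }

  void run_search(cudaStream_t stream);  // hypothetical: submits one search batch
};
```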
Restructure input/output a bit to pad the atomics to 128 bytes. This reduces the latency / single-threaded time by 3x on a PCIe machine.
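A sketch of the padding idea, assuming libcu++ atomics (the struct name is illustrative). Giving each system-scope flag its own 128-byte slot keeps two flags from sharing a cache line, so concurrent threads don't serialize on the same line over PCIe:

```cpp
#include <cuda/atomic>
#include <cstdint>

struct alignas(128) padded_flag_t {
  cuda::atomic<uint32_t, cuda::thread_scope_system> value{0};
  // the remaining bytes of the 128-byte slot are padding
};
static_assert(sizeof(padded_flag_t) == 128, "one flag per 128-byte slot");
```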
1) Make the persistent kernel allocate the hashmap in advance. 2) Introduce lightweight_uvector, which does not call any CUDA functions when not needed.
… the throughput when .pop is the bottleneck
Since the worker and job queues were decoupled, it's not necessary to wait for the job to be read anymore. As soon as the descriptor handle is read, it can be returned to the queue.
…the 'persistent' search parameter
…ment helper to avoid unnecessary allocations and stream syncs in that case
Thanks Artem for this work! Just a few initial comments here.
For the API discussion, one question is how to start / stop the kernel. The current approach is nice: just one bool flag in the search params, and the rest is done automatically. I believe this would not preclude adding explicit start/stop methods if we deem that necessary.
The other question is which parameters we need to control for the persistent kernel (apart from whether or not to use it), e.g. max num threads, `kLiveInterval`? If we want to provide control for these, then we could have additional fields in `search_params`, but we could also think of them as a property of the index, since the long-running kernel is associated with the index.
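For illustration of the explicit start/stop alternative mentioned here, a hypothetical API shape might look like the sketch below; none of these functions or types exist in the PR, and all names are invented:

```cpp
#include <chrono>
#include <cstdint>
#include <raft/core/resources.hpp>
#include <raft/neighbors/cagra_types.hpp>

namespace raft::neighbors::cagra {

struct runner_params {                  // "runner" knobs, fixed across calls
  uint32_t max_num_threads;             // bound on concurrently calling CPU threads
  std::chrono::seconds live_interval;   // auto-shutdown timeout (cf. kLiveInterval)
};

// Spawn the persistent kernel for this index; regular search() calls would
// then be served by the running kernel until stop_persistent() is called.
template <typename T, typename IdxT>
void start_persistent(raft::resources const& res,
                      index<T, IdxT> const& idx,
                      runner_params const& params);

void stop_persistent(raft::resources const& res);  // shut down, free resources

}  // namespace raft::neighbors::cagra
```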
@@ -73,7 +76,7 @@ struct AlgoProperty {

 class AnnBase {
  public:
-  using index_type = size_t;
+  using index_type = uint32_t;
Do we want to change this for all benchmarks?
That was a temporary change for testing; I've removed it in the cuVS PR (cuVS doesn't support uint32_t indices in the refinement step).
@@ -124,6 +124,8 @@ struct search_params : ann::search_params {
   uint32_t num_random_samplings = 1;
   /** Bit mask used for initial random seed node selection. */
   uint64_t rand_xor_mask = 0x128394;
+  /** Whether to use the persistent version of the kernel (only SINGLE_CTA is supported a.t.m.) */
+  bool persistent = false;
Is this the only parameter that needs to be controlled? What about the temporary buffer size (queue for queries, or max number of threads)?
The buffer sizes are compile-time constants in the current design. I'd like to move some of the new constants to the parameters, but that conflicts with the implicit mechanics of running the kernel: they are not search parameters but "runner" parameters, and they must not change across calls to the kernel.
So this brings up again the question of whether we want an implicit or an explicit kernel runner.
Closing this in favor of rapidsai/cuvs#215
An experimental version of the single-CTA CAGRA kernel that runs persistently while allowing many CPU threads to submit queries in small batches very efficiently.
API
In the current implementation, the public API does not change. An extra parameter `persistent` is added to `ann::cagra::search_params` (only valid when `algo == SINGLE_CTA`).

The persistent kernel is managed by a global runner object in a `shared_ptr`; the first CPU thread to call the kernel spawns the runner, and subsequent calls/threads only update a global "heartbeat" atomic variable (`runner_base_t::last_touch`). When there's no heartbeat in the last few seconds (`kLiveInterval`), the runner shuts down the kernel and cleans up the associated resources.

An alternative solution would be to control the kernel explicitly, in a client-server style. This would be more controllable, but would require significant re-thinking of the RAFT/cuVS API.
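A minimal usage sketch, assuming the CAGRA search API of this branch; the exact overload and template parameters here are illustrative:

```cpp
#include <raft/neighbors/cagra.cuh>

void search_with_persistent_kernel(
  raft::resources const& res,
  raft::neighbors::cagra::index<float, uint32_t> const& index,
  raft::device_matrix_view<const float, int64_t> queries,
  raft::device_matrix_view<uint32_t, int64_t> neighbors,
  raft::device_matrix_view<float, int64_t> distances)
{
  raft::neighbors::cagra::search_params params;
  params.algo       = raft::neighbors::cagra::search_algo::SINGLE_CTA;
  params.persistent = true;  // first call spawns the runner; later calls from
                             // any thread only touch the heartbeat
  raft::neighbors::cagra::search(res, params, index, queries, neighbors, distances);
  // ~kLiveInterval seconds after the last call, the runner shuts down on its own
}
```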
Integration notes
lightweight_uvector
RMM memory resources and device buffers are not zero-cost, even when the allocation size is zero (a common pattern for conditionally-used buffers). They make at least a couple of `cudaGetDevice` calls during initialization. Normally, this overhead is negligible. However, when the number of concurrent threads is large (hundreds of threads), any CUDA call can become a bottleneck due to a single mutex guarding a critical section somewhere in the driver.

To work around this, I introduce a `lightweight_uvector` in `/detail/cagra/search_plan.cuh` for several buffers used in CAGRA kernels. This is a stripped-down, "multi-device-unsafe" version of `rmm::uvector`: it does not check during resize/destruction whether the current device has changed since construction.

We may consider putting this in a common folder to use across other RAFT/cuVS algorithms.
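A minimal sketch of the idea, not the PR's exact code: defer all CUDA work until the first non-zero resize, so constructing an empty buffer performs no CUDA calls at all.

```cpp
#include <cstddef>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

template <typename T>
struct lightweight_uvector {
  rmm::cuda_stream_view stream_;
  rmm::mr::device_memory_resource* mr_;
  T* ptr_{nullptr};
  std::size_t n_{0};

  // No CUDA calls here: just remember where to allocate from later.
  lightweight_uvector(rmm::cuda_stream_view stream,
                      rmm::mr::device_memory_resource* mr)
    : stream_{stream}, mr_{mr}
  {
  }

  // NB: does not preserve old contents and, unlike rmm::device_uvector, never
  // calls cudaGetDevice/cudaSetDevice: the caller must stay on the device the
  // buffer was created on ("multi-device-unsafe").
  void resize(std::size_t new_n)
  {
    if (new_n == n_) { return; }
    if (ptr_ != nullptr) {
      mr_->deallocate(ptr_, n_ * sizeof(T), stream_);
      ptr_ = nullptr;
    }
    if (new_n > 0) { ptr_ = static_cast<T*>(mr_->allocate(new_n * sizeof(T), stream_)); }
    n_ = new_n;
  }

  ~lightweight_uvector()
  {
    if (ptr_ != nullptr) { mr_->deallocate(ptr_, n_ * sizeof(T), stream_); }
  }
};
```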
Shared resource queues / ring buffers
`resource_queue_t` is an atomic counter-based ring buffer used to distribute the worker resources (CTAs) and pre-allocated job descriptors across CPU I/O threads.

We may consider putting this in a common public namespace in RAFT if we envision more uses for it.
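A hedged sketch of the atomic counter-based ring buffer technique for handing out a fixed pool of resource handles (e.g. worker-CTA slots or descriptor indices); the real `resource_queue_t` differs in details, and all names below are illustrative:

```cpp
#include <array>
#include <atomic>
#include <cstdint>

template <typename T, uint32_t Capacity>  // Capacity >= pool size
struct resource_ring_t {
  static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");
  static constexpr T kEmpty = T(~T(0));   // sentinel: slot not yet filled
                                          // (valid handles must differ from it)

  std::atomic<uint64_t> head_{0};         // ticket counter for push
  std::atomic<uint64_t> tail_{0};         // ticket counter for pop
  std::array<std::atomic<T>, Capacity> slots_;

  resource_ring_t()
  {
    for (auto& s : slots_) { s.store(kEmpty, std::memory_order_relaxed); }
  }

  // Return a resource handle to the pool. The pool is fixed-size and never
  // exceeds Capacity, so a push always finds room.
  void push(T v)
  {
    auto i = head_.fetch_add(1, std::memory_order_relaxed) & (Capacity - 1);
    slots_[i].store(v, std::memory_order_release);
  }

  // Claim a ticket, then spin until the matching push has filled the slot.
  T pop()
  {
    auto i = tail_.fetch_add(1, std::memory_order_relaxed) & (Capacity - 1);
    T v;
    while ((v = slots_[i].exchange(kEmpty, std::memory_order_acquire)) == kEmpty) {}
    return v;
  }
};
```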
Persistent runner structs
`launcher_t` and `persistent_runner_base_t` look like they could be abstracted from the CAGRA kernel and re-used in other algos. The code in its current state, however, is not ready for this.

Other code changes (solved)
This depends on (includes all the changes from):