Adding throughput and latency modes to raft-ann-bench #1920
Conversation
I'm finding that googlebench appears to reserve the right to iterate state as it chooses, which makes it really hard to run the iteration all the way through to the end and rely on the end-to-end time. I could be wrong about this, but that's the behavior I'm seeing so far. I need to dig in a bit further. So far I've tried stream per thread, the stream pool, various numbers of threads, syncing after each search, and syncing only at the end. The behavior I'm seeing just doesn't match my expectations, and while the GPU seems to be getting higher utilization than without the thread pool, it's still not getting high enough to match the cost of the threading. Something else is going on here.
So far, the results seem too good to be real, for both HNSW and RAFT. I'm still investigating to make sure we're computing the timings correctly. From what I've explored so far, it seems we are computing them properly and I do notice the GPU utilization stays near 100% during queries. Here are some initial results at batch size 100.
looks good! some minor comments/questions below:
looks great!
/merge
Hi @cjnolet, I am late to the party, but I would still like to give my comments on the PR. Overall this looks great, and I am happy to see this change. Still, I think we should have a follow-up PR to improve the benchmark columns and the explanations we give.
if (state.thread_index() == 0) {
  auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
  state.counters.insert({{"end_to_end", duration}});
}
I do not see why we need end_to_end. I would prefer that we have two values, Latency and Throughput. Throughput is already there as items_per_second, and average latency can be defined as follows:
Suggested change (replacing the block above):

auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
state.counters.insert(
  {"Latency", {duration / double(state.iterations()), benchmark::Counter::kAvgThreads}});
Using kAvgThreads ensures that we see an average value over all threads instead of an accumulated value over all threads.
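For readers unfamiliar with the counter flags, here is a minimal, standalone sketch (not code from this PR; DummySearch and the sleep are placeholders) showing how a kAvgThreads counter is reported in a multithreaded gbench run:

#include <benchmark/benchmark.h>
#include <chrono>
#include <thread>

static void DummySearch(benchmark::State& state)
{
  for (auto _ : state) {
    // Stand-in for one search batch.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  // Each thread reports the same per-iteration value; with kAvgThreads the
  // reported column shows the average across threads rather than the sum.
  state.counters.insert({{"Latency", {1e-3, benchmark::Counter::kAvgThreads}}});
}
BENCHMARK(DummySearch)->Threads(8)->UseRealTime();
BENCHMARK_MAIN();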
End-to-end time is the most important indicator of latency here AFAIC, and this is an important measurement. Given that this is a pipelined benchmark and gbench guarantees all threads will reach the beginning and end of the state loop together, we have the ability to capture the time spent in the state loop by all threads. This is essentially the time a user can expect to wait to process total_queries. This is probably the most realistic measure of end-to-end latency we could provide.
Please also note that the measure you are suggesting here would yield the same result, but it would be an average. The end_to_end field is intentionally not an average. One thing I'd like to discuss going forward is allowing the number of iterations to be locked to the number of queries (10k in most cases) so that the end-to-end time becomes more directly comparable across experiments. It's an easy argument to set on the gbench side.
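For reference, a hedged sketch of what locking the iteration count could look like on the gbench side (the registration call, names, and counts are illustrative, not the raft-ann-bench code):

#include <benchmark/benchmark.h>
#include <cstdint>

// Illustrative stand-in for the search benchmark body.
static void search_bench_fn(benchmark::State& state)
{
  for (auto _ : state) { benchmark::DoNotOptimize(state.iterations()); }
}

int main(int argc, char** argv)
{
  const int64_t total_queries = 10000;  // e.g. the 10k queries mentioned above
  const int64_t batch_size    = 100;    // illustrative
  benchmark::Initialize(&argc, argv);
  benchmark::RegisterBenchmark("search_bench", search_bench_fn)
    ->Threads(8)
    ->Iterations(total_queries / batch_size)  // pin iterations to the query count
    ->UseRealTime();
  benchmark::RunSpecifiedBenchmarks();
  return 0;
}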
I agree, since threads are synced, end_to_end time and Latency are redundant. But they answer different questions, so let's keep them both.
The number of iterations should not be locked to the number of queries: sometimes we need short benchmarks.
Let's continue here: https://github.com/rapidsai/raft/pull/1940/files#r1376961045
The number of iterations should not be locked to the number of queries: sometimes we need short benchmarks.
My point in mentioning that is that sometimes we want to lock the number of iterations so we can directly compare the end-to-end times across experiments. This is something that's possible today, and I actually found it somewhat useful while verifying the various measurements in this PR.
if (cudart.found()) {
  state.counters.insert({{"GPU Time", gpu_timer.total_time() / state.iterations()},
                         {"GPU QPS", queries_processed / gpu_timer.total_time()}});
  state.counters.insert({{"GPU", gpu_timer.total_time() / double(state.iterations())}});
I would keep it as "GPU Time" or "GPU Latency", since that is more descriptive.
state.counters.insert({{"GPU", gpu_timer.total_time() / double(state.iterations())}}); | |
double gpu_latency_per_iteration = gpu_timer.total_time() / double(state.iterations()); | |
state.counters.insert({"GPU Latency", {gpu_latency_per_iteration, benchmark::Counter::kAvgThreads}}); |
I think we should look into the gpu_timer as a separate issue. What is the expectation for the ANN::search() methods that we call within the benchmark loop: are they expected to sync? Currently they do, but I would like to review it later because different streams and events are synced.
I got a little stuck here. Ideally we would be using the same stream to queue up the start/end events that we are using in the raft device_resources objects, right? That way we know that the events being measured on that stream are blocking only while the actual kernels for search() are being executed, so we are getting an accurate measure of only the time spent on the GPU (and making this data available on each stream for profiling). Right now, I don't think that's the case, but I could be mistaken.
At the moment, the device_resources instances are using cuda_default_stream_per_thread, but the gpu_timer is creating a new stream "with flags". I didn't do a ton of investigating here, but is there a default stream per thread with flags? Does that even make sense? I take it the flags part is needed to queue up the start/stop events?
I had actually removed the GPU column during the search step altogether, but added it back before merging so that we could have this conversation (since it's technically not changing any behavior from the way it was done before threading: it was still using the same streams before).
Yes, the "Does that even make sense?" question also occurred to me and I suspect the answer is no (but @achirkin might prove me wrong).
I may be missing something, but as far as I remember, the new-stream-per-test is for better robustness and fairness against other implementations.
First of all, raft-ann-bench has been used with other GPU algorithms, including third-party, internal, and, sometimes, unfinished ones. This means:
- An implementation may be unaware of any raft machinery
- An implementation may use its own thread/stream pools, and we know nothing about the streams used internally.
- An implementation may mess up the used stream and make debugging harder
I created a new stream with flags to make sure it's non-blocking against whatever streams are allocated internally in any implementation. Hence, all synchronization happens only via explicit waiting on the start/stop events.
Also, in this setup, the GPU timing includes not only the kernel time but also all CPU time. It really should be the same as the CPU wall time, because the start event is submitted to an empty stream on each iteration before the algorithm starts. The GPU timing is high-precision; it is supposed to be more accurate than the wall time if the iteration time is very small.
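For illustration only, here is a compact sketch of the pattern described above: a non-blocking stream created with flags, with start/stop events bracketing the timed call. This is not the PR's cuda_lap code, and the names are made up:

#include <cuda_runtime.h>

// Minimal sketch of event-based lap timing on a dedicated non-blocking stream.
struct lap_timer {
  cudaStream_t stream_;
  cudaEvent_t start_, stop_;

  lap_timer()
  {
    // Non-blocking so it does not implicitly synchronize with the legacy default stream.
    cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking);
    cudaEventCreate(&start_);
    cudaEventCreate(&stop_);
  }

  void start()
  {
    cudaStreamSynchronize(stream_);    // wait for any previous work in stream_
    cudaEventRecord(start_, stream_);  // assumed to trigger (almost) immediately
  }

  float stop()  // returns elapsed milliseconds
  {
    cudaEventRecord(stop_, stream_);   // submitted only after the CPU side returns
    cudaEventSynchronize(stop_);       // blocks until all work in stream_ is done
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start_, stop_);
    return ms;
  }

  ~lap_timer()
  {
    cudaEventDestroy(start_);
    cudaEventDestroy(stop_);
    cudaStreamDestroy(stream_);
  }
};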
From my understanding and the behavior I've seen, a new gpu_timer stream is created for each thread and the call to search() for each RAFT index type is synchronous with respect to its thread, so the gpu_timer will wait for the search to complete before each lap, and that wait time will produce the same time gap between the start and end events.
I'm not sure the faiss indexes will function this way, as I believe they are syncing on the stream that's passed into search() (plus any additional syncs that happen to the stream in the corresponding StandardGpuResources instance). We should be able to set the stream on that instance, so we could very well set the CUDA stream per thread on it, I believe, which would make it function more like raft.
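For what it's worth, FAISS does let you point its GPU resources object at a specific stream, so the per-thread default stream is an option; a minimal sketch of that idea (the device index 0 is an assumption for illustration):

#include <faiss/gpu/StandardGpuResources.h>
#include <cuda_runtime.h>

int main()
{
  // Sketch only: point FAISS at the per-thread default stream so its internal
  // syncs line up with the stream the benchmark thread is using.
  faiss::gpu::StandardGpuResources res;
  res.setDefaultStream(0, cudaStreamPerThread);
  return 0;
}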
I think the assumption here is that we will have a cudaStreamSync() or equivalent while calling ivf_pq::search(). That would block the CPU thread, and delay scheduling the stop_ event.
[Update]: I had missed that we actually pass the stream from the timer to search: ivf_pq::search(..., gpu_timer.stream()), so the ANN algorithm has a chance to (and is expected to) sync with it.
Let's continue the discussion here: https://github.com/rapidsai/raft/pull/1940/files#r1376957640
I think the assumption here is that we will have a cudaStreamSync() or equivalent while calling ivf_pq::search(). That would block the CPU thread, and delay scheduling the stop_ event.
This should be the case, though, shouldn't it?
The idea behind cuda_lap is to not require any extra synchronization, such as cudaStreamSync(). The only requirement is that ivf_pq::search() does something in the stream passed to it.
What cuda_lap does:
- Waits till previous work in stream_ finishes and submits the start_ event into it. A somewhat risky assumption here is that the event triggers immediately. Maybe, in reality, it could trigger some time between the submission and the first use of stream_ (which could be a synchronization, a data transfer, or a kernel)? To fix it, we can just add another cudaEventSynchronize or cudaStreamSynchronize right after submitting the event (but I'm not sure this needs fixing).
- Submits the stop_ event into the stream_ after the call to the implementation and immediately synchronizes on it. Hence it captures everything the implementation has done in the stream_. However, since the implementation is called in the same thread, the stop_ event is submitted no earlier than the CPU side of the algorithm finishes. Hence, the stop_ time is the larger of the two (CPU completion time, GPU completion time).
As a result, the two events together capture the total execution time, including the GPU work that could still be happening in the stream_ after the CPU side has already finished. Yet, the GPU time cannot be larger than our recorded wall time, because we synchronize on the stop_ event within the benchmark iteration (in the destructor of the cuda_lap).
The implementation must mark its work in the stream_, but not necessarily via synchronizing with the host.
Currently, both the FAISS and raft_ivf_pq wrappers have their own streams created as a part of their states. That is, we don't create a new raft_resources handle with this stream and don't set this as the main stream in FAISS! This is done to make sure we can cache the algo handles between runs without using already-destroyed streams. In both cases, I use an extra sync event to establish a dependency between the internal implementation streams and the passed cuda_lap stream.
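The "extra sync event" mentioned here is the standard event-record / stream-wait handshake; a hedged sketch of that pattern (illustrative names, not the wrapper code):

#include <cuda_runtime.h>

// Sketch: make the benchmark's lap stream depend on work already submitted to
// an implementation's internal stream, without synchronizing with the host.
// 'internal_stream' and 'lap_stream' are illustrative names.
inline void mark_dependency(cudaStream_t internal_stream, cudaStream_t lap_stream)
{
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
  cudaEventRecord(ev, internal_stream);    // capture the implementation's pending work
  cudaStreamWaitEvent(lap_stream, ev, 0);  // lap stream (and its stop_ event) now waits on it
  cudaEventDestroy(ev);
}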
}

// This will be the total number of queries across all threads
state.counters.insert({{"total_queries", queries_processed}});
I would also add the dataset rows as a benchmark column.
That's not a bad idea.
 * The following are important for getting accurate QPS measurements on both CPU
 * and GPU. These make sure that
 * - `end_to_end` ~ (`Time` * `Iterations`)
 * - `items_per_second` ~ (`total_queries` / `end_to_end`)
 * - `Time` = `end_to_end` / `Iterations`
 *
 * - Latency = `Time`
 * - Throughput = `items_per_second`
I think this still does not have the desired effect of making Time == Latency. As I recommended above, I would add a specific latency column.
I've been finding that, empirically, Time == Latency because we are using the process time to measure the CPU time and the "Time" is based on the average of the individual latencies spent in the thread loops. It's not exact based on end_to_end, but it's very, very close. Close enough that I think it's well above the noise level. I'm hoping we can utilize the gbench timers where they make sense so that we can spare users the pain of having a bunch of additional columns that contain similar values. If there are measurements that are obviously wrong, I think we should consider adding additional columns with the correct measurements. I was at least able to consistently see that Time was pretty much Latency with our current settings. I was even able to compute values that were very close to it using our manually measured columns.
For some benchmarks I have measured a small but statistically significant decrease in Time while increasing the number of threads. That cannot be the case if Time is the actual latency. To highlight the issue, I have added a dummy ANN class in #1940, which only sleeps for a fixed amount of time. Let's continue the discussion there.
That cannot be the case if Time is actual latency.
I think we might be saying different things here when we say "latency". I'm referring to latency as the average latency of each individual batch (for each thread), not the total latency summed together. From my understanding, this is why Time becomes ~CPU / n_threads up until the saturation point where the hardware starts to kick back and stall while it's waiting to schedule new work. Have you also observed this behavior? IVF-PQ is a great example there, where after 8 threads or so, you notice the Time field stays the same or even increases.
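To put rough, made-up numbers on that averaging effect: if a single batch takes ~8 ms of work and 8 threads keep batches pipelined with no contention, gbench spreads that work across the threads' iterations and reports Time of roughly 8 ms / 8 = 1 ms per iteration, even though each individual batch still took ~8 ms from submission to completion; once the hardware saturates, adding threads no longer shrinks the reported Time.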
Separating the way the benchmarks are measured into throughput and latency modes.
- latency mode accumulates the times for each batch to be processed and then estimates QPS and provides the average time spent doing the processing on the GPU. For a batch size of 1, this becomes a fairly accurate estimate of the average latency per query. For larger batches, it becomes a fairly accurate estimate of the time spent per batch.
- throughput mode pipelines the individual batches using a thread pool (and a stream pool for the GPU algos). For both smaller and larger batches, this gives a good estimate of the amount of data we can push through the hardware in a period of time.
A good comprehensive comparison will include both of these numbers.
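As a rough sketch of how the two modes map onto gbench registration (thread counts and names are illustrative, not the exact raft-ann-bench code):

#include <benchmark/benchmark.h>

// Illustrative stand-in for one search batch.
static void search_batch(benchmark::State& state)
{
  for (auto _ : state) { benchmark::DoNotOptimize(state.iterations()); }
}

// Latency mode: a single thread, so each iteration's time approximates how long
// a caller waits for one batch.
BENCHMARK(search_batch)->Threads(1)->UseRealTime();

// Throughput mode: many threads pipeline batches (with a stream pool on the GPU
// side), so items_per_second reflects how much work the hardware can sustain.
BENCHMARK(search_batch)->ThreadRange(2, 16)->UseRealTime();

BENCHMARK_MAIN();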