Benchmarking CUDA streams and graphs
We want to implement a benchmark that studies the performance of co-executing kernels using either CUDA streams or CUDA graphs. The benchmark should categorize (when and why) and quantify (how much) the overhead costs or performance benefits of co-execution.
We will measure the throughput (total execution time to complete the benchmark) across different configurations, varying the following parameters:
- Number of benchmark iterations (epochs) (`E`)
- Number of CUDA streams / fan-out of the fork for CUDA graphs (`S`)
- Number of kernels launched per stream / per branch of the fork for CUDA graphs (`K`)
- Total size of the problem (`A`)
- Block size (`B`)
- Multithreaded (boolean) (`M`)
The total amount of work is `W = E * A`, and the ideal speedup is then `P = S`. `K` should not affect the results if there is no measurable overhead from breaking one kernel call into several.
The benchmark receives `E`, `S`, `A`, `K`, `B`, and `M` as command-line parameters and outputs the throughput.
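The argument handling might look like the following minimal sketch (positional order and variable types are assumptions; the text only fixes which parameters exist):

```cuda
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    if (argc != 7) {
        std::fprintf(stderr, "usage: %s E S A K B M\n", argv[0]);
        return 1;
    }
    int  E = std::atoi(argv[1]);       // benchmark iterations (epochs)
    int  S = std::atoi(argv[2]);       // streams / graph fan-out
    long A = std::atol(argv[3]);       // total problem size
    int  K = std::atoi(argv[4]);       // kernels per stream / per branch
    int  B = std::atoi(argv[5]);       // block size
    bool M = std::atoi(argv[6]) != 0;  // multithreaded host launches
    // ... run the benchmark, then print the measured throughput ...
    return 0;
}
```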
Kernels:
- empty
- axpy
- newton (with a fixed number of iterations)
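The three kernels could look like the following minimal sketches (the signatures and the function solved by `newton` are assumptions, not specified above):

```cuda
// empty: does nothing; measures pure launch/scheduling overhead.
__global__ void empty() {}

// axpy: y = a*x + y over this kernel's slice of the arrays.
__global__ void axpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// newton: a fixed number of Newton iterations per element.
// As an assumed example, solve v^2 - y[i] = 0 (square root).
__global__ void newton(int n, int iters, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = y[i] > 0.f ? y[i] : 1.f;   // initial guess
        for (int k = 0; k < iters; ++k)
            v = 0.5f * (v + y[i] / v);       // Newton step for v^2 = y[i]
        y[i] = v;
    }
}
```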
CUDA streams:
The benchmark splits the work on an array (or arrays) of size `A` into `S*K` pieces. The work is performed on `S` streams, each launching `K` sequential kernels, with every kernel working on a problem of size `A/(S*K)`; the GPU is then synchronized. If `M` is true, we create `S` CPU threads, associate each CUDA stream with one CPU thread, and launch the `K` kernels from each thread onto its own stream. This is repeated `E` times.
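A sketch of one epoch of the single-threaded streams path, using the hypothetical `axpy` kernel from above (names and the absence of error checking are assumptions):

```cuda
#include <cuda_runtime.h>
#include <vector>

// One epoch: S streams, K sequential kernels per stream,
// each kernel working on a piece of size A/(S*K).
void run_streams_epoch(int S, int K, int A, int B,
                       float a, const float* d_x, float* d_y) {
    int piece = A / (S * K);
    std::vector<cudaStream_t> streams(S);
    for (int s = 0; s < S; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < S; ++s) {
        for (int k = 0; k < K; ++k) {
            int offset = (s * K + k) * piece;   // this kernel's slice
            int grid = (piece + B - 1) / B;     // blocks of size B
            axpy<<<grid, B, 0, streams[s]>>>(piece, a,
                                             d_x + offset, d_y + offset);
        }
    }
    cudaDeviceSynchronize();                    // end of one epoch

    for (int s = 0; s < S; ++s) cudaStreamDestroy(streams[s]);
}
```

In the multithreaded (`M`) variant, the outer loop over `s` would instead run in `S` host threads, each issuing its `K` launches into its own stream.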
CUDA graphs:
The benchmark splits the work on an array (or arrays) of size `A` into `S*K` pieces. The graph forks into `S` branches, each launching `K` sequential kernels, with every kernel working on a problem of size `A/(S*K)`; the GPU is then synchronized. This is repeated `E` times.
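One way to build such a fork/join graph is stream capture with events; the text does not prescribe how the graph is constructed, so the following is only a sketch under that assumption (again reusing the hypothetical `axpy` kernel, without error checking):

```cuda
#include <cuda_runtime.h>
#include <vector>

// Capture a graph that forks into S branches of K sequential kernels,
// then replay it E times.
void run_graph(int E, int S, int K, int A, int B,
               float a, const float* d_x, float* d_y) {
    int piece = A / (S * K);
    std::vector<cudaStream_t> streams(S);
    for (auto& st : streams) cudaStreamCreate(&st);
    cudaStream_t origin = streams[0];

    cudaStreamBeginCapture(origin, cudaStreamCaptureModeGlobal);
    cudaEvent_t fork;
    cudaEventCreate(&fork);
    cudaEventRecord(fork, origin);
    for (int s = 1; s < S; ++s)           // pull side streams into the capture
        cudaStreamWaitEvent(streams[s], fork, 0);

    for (int s = 0; s < S; ++s)
        for (int k = 0; k < K; ++k) {
            int offset = (s * K + k) * piece;
            axpy<<<(piece + B - 1) / B, B, 0, streams[s]>>>(
                piece, a, d_x + offset, d_y + offset);
        }

    std::vector<cudaEvent_t> join(S);
    for (int s = 1; s < S; ++s) {         // join all branches back on origin
        cudaEventCreate(&join[s]);
        cudaEventRecord(join[s], streams[s]);
        cudaStreamWaitEvent(origin, join[s], 0);
    }

    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamEndCapture(origin, &graph);
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    for (int e = 0; e < E; ++e) {         // one launch per epoch
        cudaGraphLaunch(exec, origin);
        cudaStreamSynchronize(origin);
    }
}
```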
Profiling:
Use either `nvprof` or `nsys`; `nvprof` illustrates the overlap in execution better, but it is not supported on the A100.