CUDA C++ is a language that is very similar to C++. When compiled by nvcc, CUDA code can be linked with standard C++ files to create GPU-accelerated programs.
CUDA implementation code usually has the file extension .cu, which tells CMake that it should use nvcc to compile the code.
CUDA headers use the same .h files as C and C++ code, so C++ is easily linkable with CUDA C++.
The heart of this simple CUDA application is a function that adds the elements of two arrays.
These arrays can be very large (~1M elements in this program). Processing that many elements one after another on a single CPU core is comparatively slow, but CUDA allows the work to be divided among many GPU threads that run in parallel.
Here's how it's done:
    // function to add the elements of two arrays
    __global__
    void add(int n, float* x, float* y)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int i = index; i < n; i += stride)
            y[i] = x[i] + y[i];
    }
- __global__ tells the compiler that this function is intended to be run on a GPU and called from the CPU. __global__ functions are known as kernels. Host code (code that runs on the CPU) calls kernels that run device code (code that runs on the GPU).
- The CUDA compiler provides its own built-in variables to __global__ functions. Four of these are:
  - threadIdx.x (the index of the thread within its block)
  - blockIdx.x (the index of that thread's block within the grid)
  - blockDim.x (the number of threads in each block)
  - gridDim.x (the number of blocks in the grid)
- The structure of this for loop is so common that it has a name: the grid-stride loop.
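
To show how host code might launch this kernel, here is a minimal sketch. Several details are assumptions rather than part of the original program: the helper name run_add is hypothetical, the block size of 256 is an arbitrary common choice, and unified memory (cudaMallocManaged) is just one way to share the arrays between host and device.

```cuda
#include <cassert>
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as above: thread `index` handles elements
// index, index + stride, index + 2 * stride, ...
__global__
void add(int n, float* x, float* y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Hypothetical helper (not from the original program): fills x with 1.0f
// and y with 2.0f, runs the kernel, and returns the largest deviation of
// y[i] from 3.0f. Returns -1.0f if any CUDA call fails.
float run_add(int n, int blockSize)
{
    assert(n > 0 && blockSize > 0);

    float* x = nullptr;
    float* y = nullptr;
    float maxError = -1.0f;

    // Assumption: unified memory, accessible from both CPU and GPU.
    if (cudaMallocManaged(&x, n * sizeof(float)) == cudaSuccess &&
        cudaMallocManaged(&y, n * sizeof(float)) == cudaSuccess) {
        for (int i = 0; i < n; ++i) {
            x[i] = 1.0f;
            y[i] = 2.0f;
        }

        // Round up so that numBlocks * blockSize covers all n elements.
        int numBlocks = (n + blockSize - 1) / blockSize;
        add<<<numBlocks, blockSize>>>(n, x, y);

        // Launches are asynchronous: check the launch, then wait for the
        // kernel to finish before reading y on the host.
        if (cudaGetLastError() == cudaSuccess &&
            cudaDeviceSynchronize() == cudaSuccess) {
            maxError = 0.0f;
            for (int i = 0; i < n; ++i)
                maxError = std::fmax(maxError, std::fabs(y[i] - 3.0f));
        }
    }

    // cudaFree(nullptr) is a no-op, so this is safe on every path.
    cudaFree(x);
    cudaFree(y);
    return maxError;
}

int main()
{
    const int n = 1 << 20;  // roughly 1M elements
    float maxError = run_add(n, 256);
    std::printf("max error: %f\n", maxError);
    return maxError == 0.0f ? 0 : 1;
}
```

Because the kernel uses a grid-stride loop, it stays correct even if fewer blocks are launched than the round-up formula gives. Each thread then simply loops over more elements. Assuming the file is saved as add.cu, it can be built directly with nvcc add.cu -o add.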