This looks to me like a setting similar to OMP_NUM_THREADS, but I don't think it is useful in the same way. A value below 32 would just leave a warp underutilized; a value between 32 and 64 would reduce the total number of threads per SM on some GPUs (consumer GPUs cannot track as many thread blocks per SM as the HPC GPUs); and anything above 64 would not change the total number of threads per SM, since the GPU would simply place fewer, larger blocks on each SM. In addition, if num_threads is not a divisor of the (GPU-dependent) maximum number of threads per SM, the resident thread count per SM would oscillate.
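To give an illustrative (made-up) set of numbers: on a GPU that can keep 1536 threads resident per SM, a block size of 128 fits 12 blocks (all 1536 threads), while a block size of 640 fits only 2 blocks (1280 threads), so part of the SM's thread capacity is left unused.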
I suggest choosing a value for this per kernel, or at least improving the default to 128. That always works, independently of the register count, and maximizes occupancy on all GPUs.
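As a hedged illustration of the per-kernel approach (not code from the repository), CUDA's occupancy API can suggest a block size for a specific kernel at runtime; the kernel and buffer names below are hypothetical:

```cuda
// occupancy_sketch.cu -- illustrative only; the kernel and its launch site
// are hypothetical, not taken from the MD-Bench sources.
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void force_kernel(float *f, const float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        f[i] = 0.5f * x[i];   // placeholder work standing in for the real force computation
    }
}

int main(void) {
    int n = 1 << 20;
    float *x, *f;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&f, n * sizeof(float));

    // Let the runtime suggest a block size that maximizes occupancy for this
    // particular kernel, based on its register and shared-memory usage.
    int min_grid_size = 0, block_size = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, force_kernel, 0, 0);

    int grid_size = (n + block_size - 1) / block_size;
    printf("Suggested block size: %d (grid size %d)\n", block_size, grid_size);

    force_kernel<<<grid_size, block_size>>>(f, x, n);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(f);
    return 0;
}
```

Because the call accounts for the kernel's own register and shared-memory usage, the suggested size adapts per kernel and per GPU instead of relying on a single global value.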
The setting does actually have a performance impact:
> ./MDBench-VL-NVCC-X86-AVX2-SP
...
Performance: 303.70 million atom updates per second
> NUM_THREADS=128 ./MDBench-VL-NVCC-X86-AVX2-SP
...
Performance: 393.34 million atom updates per second
In /src/common/util.c, there is a function, used in several other places, that determines the number of threads per thread block (the CUDA block size):
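For reference, a minimal sketch of what such a helper might look like, assuming it reads the NUM_THREADS environment variable and falls back to a fixed default; the function name, the default of 32, and the lack of validation are assumptions rather than code copied from the repository:

```c
/* Sketch only: assumed shape of the block-size helper in /src/common/util.c. */
#include <stdlib.h>

int get_num_threads(void) {
    const char *env = getenv("NUM_THREADS");   /* e.g. NUM_THREADS=128 ./MDBench-... */
    if (env == NULL) {
        return 32;                              /* assumed built-in default */
    }
    return atoi(env);
}
```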