This looks to me like a setting similar to OMP_NUM_THREADS, but I don't think it is useful in the same way. A value below 32 would just leave a warp underutilized; a value between 32 and 64 would reduce the total number of threads per SM on some GPUs (consumer GPUs cannot track as many thread blocks per SM as the HPC GPUs); and anything above 64 would not change the total number of threads per SM, since the GPU would simply place fewer, larger blocks on each SM. In addition, if num_threads is not a divisor of the (GPU-dependent) maximum number of threads per SM, the resident thread count per SM would oscillate.
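To give an illustrative (made-up) set of numbers: on a GPU that can keep 1536 threads resident per SM, a block size of 128 fits 12 blocks (all 1536 threads), while a block size of 640 fits only 2 blocks (1280 threads), so part of the SM's thread capacity is left unused.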
I suggest choosing a value for this per kernel, or at least improving the default to 128. That always works, independently of the register count, and maximizes occupancy on all GPUs.
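As a hedged illustration of the per-kernel approach (not code from the repository), CUDA's occupancy API can suggest a block size for a specific kernel at runtime; the kernel and buffer names below are hypothetical:

```cuda
// occupancy_sketch.cu -- illustrative only; the kernel and its launch site
// are hypothetical, not taken from the MD-Bench sources.
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void force_kernel(float *f, const float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        f[i] = 0.5f * x[i];   // placeholder work standing in for the real force computation
    }
}

int main(void) {
    int n = 1 << 20;
    float *x, *f;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&f, n * sizeof(float));

    // Let the runtime suggest a block size that maximizes occupancy for this
    // particular kernel, based on its register and shared-memory usage.
    int min_grid_size = 0, block_size = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, force_kernel, 0, 0);

    int grid_size = (n + block_size - 1) / block_size;
    printf("Suggested block size: %d (grid size %d)\n", block_size, grid_size);

    force_kernel<<<grid_size, block_size>>>(f, x, n);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(f);
    return 0;
}
```

Because the call accounts for the kernel's own register and shared-memory usage, the suggested size adapts per kernel and per GPU instead of relying on a single global value.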
The setting does actually have a performance impact:
> ./MDBench-VL-NVCC-X86-AVX2-SP
...
Performance: 303.70 million atom updates per second
> NUM_THREADS=128 ./MDBench-VL-NVCC-X86-AVX2-SP
...
Performance: 393.34 million atom updates per second
In /src/common/util.c, there is a function, used in several other places, that determines the number of threads per thread block (the CUDA block size):
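For reference, a minimal sketch of what such a helper might look like, assuming it reads the NUM_THREADS environment variable and falls back to a fixed default; the function name, the default of 32, and the lack of validation are assumptions rather than code copied from the repository:

```c
/* Sketch only: assumed shape of the block-size helper in /src/common/util.c. */
#include <stdlib.h>

int get_num_threads(void) {
    const char *env = getenv("NUM_THREADS");   /* e.g. NUM_THREADS=128 ./MDBench-... */
    if (env == NULL) {
        return 32;                              /* assumed built-in default */
    }
    return atoi(env);
}
```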