Currently, we select the blocking factors on the command line; the default, {2,8}, is optimal for 16 threads.
In our benchmarks, we pick the best factors for each number of threads, but the compiler can't do that, because OpenMP's OMP_NUM_THREADS is only known at run time.
We need to lower code that reads that environment variable (via the OpenMP dialect) and builds the loop blocking dynamically from run-time values, so that we generate the code once and it runs on any number of threads.
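A rough sketch of the run-time side, in Python rather than MLIR, to make the idea concrete. It assumes the two factors split a 2-D iteration space into a grid of blocks, which may not match exactly how the compiler applies {2,8}. The names `num_threads_from_env` and `blocked_tiles` are hypothetical and not part of any existing pass.

```python
import os


def num_threads_from_env(default=1):
    """Return OMP_NUM_THREADS as a positive int, or `default` if unset/invalid."""
    raw = os.environ.get("OMP_NUM_THREADS", "")
    # The variable may be a comma-separated list (nested parallelism);
    # this sketch only looks at the first entry.
    first = raw.split(",")[0].strip()
    if first.isdigit() and int(first) > 0:
        return int(first)
    return default


def blocked_tiles(rows, cols, factors):
    """Yield (row_range, col_range) blocks for a grid of at most
    factors[0] x factors[1] blocks. The factors are ordinary run-time
    values, so the same code serves any thread count."""
    grid_r, grid_c = factors
    step_r = -(-rows // grid_r)  # ceiling division
    step_c = -(-cols // grid_c)
    for r in range(0, rows, step_r):
        for c in range(0, cols, step_c):
            yield (range(r, min(r + step_r, rows)),
                   range(c, min(c + step_c, cols)))
```

With factors (2, 8) on a 4 x 16 space this yields 16 blocks, one per thread in the 16-thread case; the generated code would do the equivalent with loop bounds computed at run time.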
We also need to know the best factors for each number of threads (a cost model, per architecture) and to have a generated dispatch table so that we can choose them at run time.
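A minimal sketch of what such a dispatch table could look like, again in Python for illustration only. Only the 16-thread entry, (2, 8), comes from this issue; every other entry is a placeholder, not a measured result. The fallback policy (use the largest tabulated thread count that does not exceed the requested one) is also an assumption. `DISPATCH_TABLE` and `select_factors` are hypothetical names.

```python
# Hypothetical dispatch table: thread count -> blocking factors.
# Only 16 -> (2, 8) is taken from the issue; the other rows are
# placeholders that a per-architecture cost model would have to fill in.
DISPATCH_TABLE = {
    1: (1, 1),
    2: (1, 2),
    4: (2, 2),
    8: (2, 4),
    16: (2, 8),
}


def select_factors(num_threads, table=DISPATCH_TABLE):
    """Return the factors for the largest tabulated thread count
    that is <= num_threads (assumed fallback policy)."""
    candidates = [n for n in table if n <= num_threads]
    if not candidates:
        raise ValueError(f"no entry for {num_threads} thread(s)")
    return table[max(candidates)]
```

In the compiler, the table would be generated per target and the lookup would run once at start-up, after the thread count is known.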