[FEATURE] Add support for layout local size(s) #60
Comments
In here, the |
@almarklein that is correct, and there is another relevant component in the shader. You are probably already aware of this, but I'll outline it for completeness. In your "CPU" code you specify the dispatch size, i.e. the number of workgroups to run (which is the one you specified). Then in the GPU shader you specify the size of the thread block, the "layout size". In more practical terms, if you have a buffer with 500x200 elements, you can define the layout size inside the shader as: layout (local_size_x = 5, local_size_y = 2, local_size_z = 1) in; This means that each workgroup processes a 5x2x1 block of elements, so your dispatch can be something like:
This means it will run 100x100x1 workgroups of that shader, each covering 5x2x1 elements. You can split this into different sizes to process the same dataset. For example, if your layout size is (1, 1, 1) and your dispatch size is (500, 100, 2), you would still end up processing all the elements, since 500x100x2 invocations matches the 500x200 element count.
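The arithmetic relating buffer size, layout size and dispatch size can be sketched in a few lines of Python. This is only an illustration of the numbers in the comment above; `dispatch_counts` and `total_invocations` are hypothetical helper names, not part of any library discussed in this issue.

```python
import math


def dispatch_counts(data_shape, local_size):
    """Workgroups needed per axis to cover data_shape (rounded up)."""
    # -(-n // l) is integer ceiling division.
    return tuple(-(-n // l) for n, l in zip(data_shape, local_size))


def total_invocations(dispatch, local_size):
    """Total shader invocations: workgroups times threads per workgroup."""
    return math.prod(d * l for d, l in zip(dispatch, local_size))


data_shape = (500, 200, 1)
local_size = (5, 2, 1)  # layout (local_size_x = 5, local_size_y = 2, local_size_z = 1)
dispatch = dispatch_counts(data_shape, local_size)
print(dispatch)                                 # (100, 100, 1)
print(total_invocations(dispatch, local_size))  # 100000
```

When the data shape is not a multiple of the layout size, the rounding up launches a few extra invocations, so the shader would need a bounds check before touching the buffer.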
Actually no :) TBH most of my experience with OpenGL was with the ES 2.0 subset. So thanks for the details! And this sounds like a useful feature indeed.
Hi, I really need this feature in order to achieve any sort of passable performance. My task is implementing matrix multiplication (GEMM), and for it to be faster than my CPU (at least a couple of hundred GFLOPS) I need local layout control and shared memory. Specifically, I need to be able to do something like this (GLSL code, example of shared (group/local) memory):
(The local invocation id is also required.) Shared memory could be made its own GitHub issue. Being able to use shared memory for group-local computation would roughly double the performance of my matrix multiplication (still a far cry from the full possible speed, but somewhat passable). This is a lot faster because local memory is much faster than global memory: copying a chunk of global memory to local (on-chip) memory first, and then computing on that local chunk, avoids repeated global reads. The GPU's caching system can only do so much to alleviate this problem.
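The access pattern described above (copy a chunk to local memory, then compute on that chunk) can be emulated on the CPU. The sketch below is plain Python, not GPU code, so it shows only the structure and none of the speedup; `matmul_tiled`, `tile_a` and `tile_b` are hypothetical names, with the two tile lists standing in for shared memory and the `(i, j)` pair standing in for the local invocation id.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), one output tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Each (row0, col0) pair plays the role of one workgroup.
    for row0 in range(0, n, tile):
        for col0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Copy one chunk of each input from "global" to "shared" memory.
                tile_a = [row[k0:k0 + tile] for row in a[row0:row0 + tile]]
                tile_b = [row[col0:col0 + tile] for row in b[k0:k0 + tile]]
                # Each (i, j) plays the role of one local invocation id.
                for i, a_row in enumerate(tile_a):
                    for j in range(len(tile_b[0])):
                        c[row0 + i][col0 + j] += sum(
                            a_row[p] * tile_b[p][j] for p in range(len(tile_b))
                        )
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(a, b))  # [[58.0, 64.0], [139.0, 154.0]]
```

On a GPU, the two copies would be followed by a barrier so that every invocation in the workgroup sees the fully loaded tiles before the inner product starts.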
It would be very useful if it's possible to define layout as part of the function definition. It could be something like: