This repository has been archived by the owner on Oct 11, 2021. It is now read-only.

[FEATURE] Add support for layout local size(s) #60

Open
axsaucedo opened this issue Nov 7, 2020 · 4 comments


@axsaucedo

It would be very useful if it were possible to define the layout (local size) as part of the function definition. It could be something like:

@python2shader
def compute_shader_multiply(
        index: ("input", "GlobalInvocationId", ivec3),
        data1: ("buffer", 0, Array(f32)),
        data2: ("buffer", 1, Array(f32)),
        data3: ("buffer", 2, Array(f32)),
        layout=[x,y,z]):
    i = index.x
    data3[i] = data1[i] * data2[i]
@almarklein
Member

In here, the x, y, and z are ints, I assume? Do you mean the shape for how the shader is dispatched, like the n in here, or is there an equivalent in GLSL?

@axsaucedo
Author

@almarklein that is correct; there is another relevant component in the shader.

You probably are already aware of this, but I'll outline it for completeness:

In your "CPU" code you have to specify the "dispatch workgroup" size to run, ie the number of times to run the shader (which is the one you specified). Then in the GPU shader you have to specify the size of the thread block "layout size".

In more practical terms, if you have a buffer with 500x200 elements, you can define your thread block "layout size" inside the shader as:

layout (local_size_x = 5, local_size_y = 2, local_size_z = 1) in;

This means that each thread block processes 5*2*1 = 10 buffer elements.

Your dispatch can then be something like:

obj.dispatch(100, 100, 1)

This runs the shader workgroup 100x100x1 = 10,000 times, which at 10 elements per workgroup covers all 500x200 = 100,000 buffer elements.

You can split this into different sizes and still process the same dataset. For example, if your layout size is (1, 1, 1) and your dispatch size is (500, 100, 2), you would still end up processing all 100,000 elements.
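
To make the arithmetic concrete, here is a minimal plain-Python sketch (no GPU involved; the sizes are taken from the example above, and total_invocations is just a hypothetical helper name) checking that both configurations cover the whole buffer:

def total_invocations(layout, dispatch):
    # Per axis, the invocation count is layout size * dispatch size.
    x = layout[0] * dispatch[0]
    y = layout[1] * dispatch[1]
    z = layout[2] * dispatch[2]
    return x * y * z

elements = 500 * 200  # the 500x200 buffer

assert total_invocations((5, 2, 1), (100, 100, 1)) == elements
assert total_invocations((1, 1, 1), (500, 100, 2)) == elements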

@almarklein
Member

> You probably are already aware of this, but I'll outline it for completeness:

Actually, no :) TBH, most of my experience with OpenGL was with the ES 2 subset. So thanks for the details! This sounds like a useful feature indeed.

@CaiusTSM

CaiusTSM commented Jan 27, 2021

Hi, I really need this feature to achieve any sort of passable performance. My task is implementing matrix multiplication (GEMM), and for it to be faster than my CPU (at least a couple of hundred GFLOPS) I need local layout control and shared memory. Specifically, I need to be able to do something like this (GLSL code, an example of shared (group-local) memory):

#version 450
#define SIZE 64

layout (local_size_x = SIZE, local_size_y = 1, local_size_z = 1) in;

// Illustrative bindings; the original sketch elides the buffer setup.
layout (std430, set = 0, binding = 0) buffer InBuf { float in_data[]; };
layout (std430, set = 0, binding = 1) buffer OutBuf { float out_data[]; };

shared float shared_data[SIZE];

void main() {
    // Each work item in the work group computes one element of shared_data.
    shared_data[gl_LocalInvocationID.x] = in_data[gl_GlobalInvocationID.x];
    // The work group is synchronized with barrier() / memoryBarrierShared():
    // each work item waits for the entire group to finish filling its part.
    memoryBarrierShared();
    barrier();
    // Then, for example, shared_data is summed up all together and the
    // output at index.x is set to that result.
    float total = 0.0;
    for (int i = 0; i < SIZE; ++i) total += shared_data[i];
    out_data[gl_GlobalInvocationID.x] = total;
}

(the local invocation ID, gl_LocalInvocationID, is also required)

Shared memory support could be made its own GitHub issue. Being able to use shared memory for group-local computation would roughly double the performance of my matrix multiplication (still a far cry from the full possible speed, but somewhat passable). The reason this is a lot faster is that local memory is much faster than global memory: copying a chunk of global memory to local memory first, and then computing on that local chunk (on-chip), avoids repeated slow global reads. The GPU's caching system can only do so much to alleviate this problem.
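
As a rough CPU-side illustration of why this helps, here is a sketch of the same tiling idea in NumPy (the matrix size is made up; on the GPU, each work group would own one tile and the per-tile copies would live in shared memory):

import numpy as np

TILE = 64  # analogous to the work group / shared memory size
N = 256    # hypothetical matrix size, a multiple of TILE
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)

for i in range(0, N, TILE):
    for j in range(0, N, TILE):
        for k in range(0, N, TILE):
            # Copy a chunk of "global" memory into small local tiles first...
            a_tile = A[i:i+TILE, k:k+TILE].copy()
            b_tile = B[k:k+TILE, j:j+TILE].copy()
            # ...then compute on those on-chip-sized chunks.
            C[i:i+TILE, j:j+TILE] += a_tile @ b_tile

assert np.allclose(C, A @ B, atol=1e-2)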
