PSO function optimization #135

Open
NicerNewerCar opened this issue Jun 22, 2023 · 1 comment


NicerNewerCar commented Jun 22, 2023

This issue will be used to document the optimization of the PSO function.

Information was obtained from Intel VTune Profiler after tracking 20 frames (10 mc3, 10 rad) of the WN00105 dataset.

Initial Results

image

As the profile shows, the worst offender during the PSO function run is the ncc function:

float ncc(Buffer* f, Buffer* g, Buffer* mask, unsigned n)
{
    float nbPixel = ncc_sum(mask, n);
    float meanF = ncc_sum(f, n) / nbPixel;
    float meanG = ncc_sum(g, n) / nbPixel;
#if DEBUG
    std::cerr << "meanF: " << meanF << std::endl;
    std::cerr << "meanG: " << meanG << std::endl;
#endif
    size_t numThreads, numBlocks, sizeMem;
    get_device_params(n, numThreads, numBlocks, sizeMem);
    Kernel* kernel = ncc_kernel_.compile(Ncc_cl, "ncc_kernel");
    kernel->block1d(numThreads);
    kernel->grid1d(numBlocks);
    kernel->addBufferArg(f);
    kernel->addArg(meanF);
    kernel->addBufferArg(g);
    kernel->addArg(meanG);
    kernel->addBufferArg(mask);
    kernel->addBufferArg(d_nums);
    kernel->addBufferArg(d_den1s);
    kernel->addBufferArg(d_den2s);
    kernel->addArg(n);
    kernel->launch();
    delete kernel;
    float den = sqrt(ncc_sum(d_den1s,n)*ncc_sum(d_den2s,n));
    if (den < 1e-5) {
        return 1e5;
    }
    return ncc_sum(d_nums,n)/den;
}

The majority of the ncc run time comes from the ncc_sum function, which is expected since a single run of ncc makes 6 calls to it. The sum function is in turn dominated by launching the kernel and by the Buffer::write method. Within the kernel launch, the time is spent in OpenCL API calls: clEnqueueNDRangeKernel takes ~5.196 seconds and clFinish takes ~4.330 seconds. The Buffer::write name is a little counter-intuitive, because the method actually reads from the device buffer and writes to the host variable passed in. All of the run time for Buffer::write is spent in the clEnqueueReadBuffer command.
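To make the cost per ncc call concrete, here is a rough counting sketch. It assumes that each pass of the while (n > 1) loop in ncc_sum reduces n elements to ceil(n / blockSize) partial sums; the real get_device_params may choose sizes differently, and the names sum_passes, ncc_cost and NccCost are hypothetical helpers, not part of the code base.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tally of OpenCL work issued by one call to ncc().
struct NccCost {
    std::size_t launches; // kernel launches (clEnqueueNDRangeKernel)
    std::size_t reads;    // blocking read-backs (clEnqueueReadBuffer)
};

// Number of reduction passes the while (n > 1) loop in ncc_sum runs,
// ASSUMING each pass leaves ceil(n / blockSize) partial sums.
std::size_t sum_passes(std::size_t n, std::size_t blockSize)
{
    assert(blockSize >= 2); // otherwise n would never shrink
    std::size_t passes = 0;
    while (n > 1) {
        n = (n + blockSize - 1) / blockSize;
        ++passes;
    }
    return passes;
}

// One ncc() call: six ncc_sum calls (mask, f, g, d_den1s, d_den2s, d_nums)
// plus a single launch of ncc_kernel.
NccCost ncc_cost(std::size_t n, std::size_t blockSize)
{
    const std::size_t sumCalls = 6;
    NccCost cost;
    cost.launches = sumCalls * sum_passes(n, blockSize) + 1;
    cost.reads = sumCalls; // one d_sums->write(&h_sum, ...) per ncc_sum call
    return cost;
}
```

For an assumed 512x512 image (n = 262144) and an assumed block size of 256, sum_passes returns 3, so ncc_cost reports 19 kernel launches and 6 read-backs for a single ncc call.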

image

image

static float ncc_sum(Buffer* f, unsigned n)
{
    size_t numThreads, numBlocks, sizeMem;
    get_device_params(n, numThreads, numBlocks, sizeMem);
    Kernel* kernel = ncc_sum_kernel_.compile(NccSum_cl, "ncc_sum_kernel");
    while (n > 1)
    {
#if DEBUG
        std::cerr << "ncc_sum[" << n << "] numThreads = " << numThreads << std::endl;
        std::cerr << "ncc_sum[" << n << "] numBlocks = " << numBlocks << std::endl;
        std::cerr << "ncc_sum[" << n << "] sizeMem = " << sizeMem << std::endl;
#endif
        kernel->block2d(numThreads, 1);
        kernel->grid2d(1, numBlocks);
        kernel->addBufferArg(f);
        kernel->addBufferArg(d_sums);
        kernel->addLocalMem(sizeMem);
        kernel->addArg(n);
        kernel->launch();
#if DEBUG
        float *tmp = new float[numBlocks];
        d_sums->write(tmp, numBlocks*sizeof(float));
        for (unsigned j=0; j<numBlocks; j++) {
            std::cerr << " " << tmp[j];
        }
        std::cerr << std::endl;
        delete[] tmp; // array form to match new float[numBlocks]
#endif
        n = numBlocks;
        get_device_params(n, numThreads, numBlocks, sizeMem);
        f = d_sums;
        kernel->reset();
    }
    delete kernel;
    float h_sum;
    d_sums->write(&h_sum, sizeof(float));
    return h_sum;
}

Moving forward

Buffer::write

I would suggest removing the unnecessary call to Buffer::write from within the ncc_sum driver:

float h_sum;
d_sums->write(&h_sum, sizeof(float));
return h_sum;

That way we can keep the sums on the GPU, since they are used by the regular ncc kernel. The calculation of meanF and meanG can then take place inside the ncc kernel instead of on the host:

float nbPixel = ncc_sum(mask, n);
float meanF = ncc_sum(f, n) / nbPixel;
float meanG = ncc_sum(g, n) / nbPixel;

kernel->addBufferArg(f);
kernel->addArg(meanF);
kernel->addBufferArg(g);
kernel->addArg(meanG);
kernel->addBufferArg(mask);
kernel->addBufferArg(d_nums);
kernel->addBufferArg(d_den1s);
kernel->addBufferArg(d_den2s);
kernel->addArg(n);

We will still need to read from the summation buffer once the regular ncc kernel has finished, in order to return the expected float value. This reduces the calls to Buffer::write in the ncc driver from 6 to 3. We could also investigate using a kernel to calculate the den value, which would further reduce the calls from 3 to 2.

float den = sqrt(ncc_sum(d_den1s,n)*ncc_sum(d_den2s,n));
if (den < 1e-5) {
    return 1e5;
}
return ncc_sum(d_nums,n)/den;
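Any restructuring along these lines should still return the same value as the current driver. A small CPU reference can help check that. The sketch below is only an assumption about the math: the ncc_kernel source is not shown in this issue, so it assumes the kernel accumulates products of mean-centred values for pixels where mask > 0.5 and writes zero elsewhere. The name ncc_reference is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for the value ncc() is expected to return.
// ASSUMPTION: ncc_kernel only contributes pixels with mask > 0.5.
float ncc_reference(const std::vector<float>& f,
                    const std::vector<float>& g,
                    const std::vector<float>& mask)
{
    assert(f.size() == g.size() && f.size() == mask.size());

    // Mirrors the three leading ncc_sum calls: sums over all n elements.
    double nbPixel = 0.0, sumF = 0.0, sumG = 0.0;
    for (std::size_t i = 0; i < f.size(); ++i) {
        nbPixel += mask[i];
        sumF += f[i];
        sumG += g[i];
    }
    const double meanF = sumF / nbPixel;
    const double meanG = sumG / nbPixel;

    // Mirrors d_nums, d_den1s and d_den2s after reduction.
    double num = 0.0, den1 = 0.0, den2 = 0.0;
    for (std::size_t i = 0; i < f.size(); ++i) {
        if (mask[i] > 0.5f) {
            const double dF = f[i] - meanF;
            const double dG = g[i] - meanG;
            num += dF * dG;
            den1 += dF * dF;
            den2 += dG * dG;
        }
    }

    // Same degenerate-case handling as the driver.
    const double den = std::sqrt(den1 * den2);
    if (den < 1e-5) {
        return 1e5f;
    }
    return static_cast<float>(num / den);
}
```

Comparing this against the GPU result on a few frames, with a tolerance for float rounding, would show whether moving the mean or den computation onto the device changed the output.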

Kernel::launch

I would suggest looking into ways to improve the sum kernel itself. I am currently building Oclgrind to aid in this investigation. Note that Oclgrind does not support OpenGL-OpenCL interop.

__kernel
void ncc_sum_kernel(
    __global const float* f,
    __global float* sums,
    __local float* buffer,
    unsigned n)
{
    unsigned i = get_global_id(0) + get_global_id(1)*get_global_size(1); // global index
    unsigned t = get_local_id(0); // thread index
    buffer[t] = (i < n) ? f[i] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);
    for(unsigned s = get_local_size(0)/2; s > 0; s >>= 1) {
        if (t < s) {
            buffer[t] += buffer[t + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (t == 0) {
        sums[get_global_id(1)] = buffer[0];
    }
}
// vim: ts=4 syntax=cpp noexpandtab
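When experimenting with changes to this kernel, it can help to emulate one work-group's tree reduction on the CPU. The sketch below reproduces only the loop over s; the barrier calls are implicit because each step finishes for every t before the next step starts. workgroup_reduce and is_power_of_two are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates the local-memory reduction of a single work-group.
// The buffer size plays the role of get_local_size(0).
float workgroup_reduce(std::vector<float> buffer)
{
    assert(!buffer.empty());
    for (std::size_t s = buffer.size() / 2; s > 0; s >>= 1) {
        // Work-items t < s each add the element s positions above them.
        for (std::size_t t = 0; t < s; ++t) {
            buffer[t] += buffer[t + s];
        }
    }
    return buffer[0]; // what work-item 0 stores into sums[]
}

// The halving loop only visits every element when the size is a power of two.
bool is_power_of_two(std::size_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}
```

With eight elements of 1.0f, workgroup_reduce returns 8. With six elements it returns 4, because the step from s = 3 to s = 1 never adds buffer[2]. The kernel as written therefore relies on the local size being a power of two, which get_device_params presumably guarantees; any change to the block-size selection should preserve that.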

Outside of the PSO function

Everything outside of the PSO function appears to run in negligible time. NOTE: Entries in the stack of the form func@{memory address} are calls to functions from external libraries such as OpenCL, OpenGL, Microsoft Direct3D, Windows USER32, etc.

image

@amymmorton
Collaborator

This was really helpful to revisit today- Thanks @NicerNewerCar
