float den = sqrt(ncc_sum(d_den1s,n)*ncc_sum(d_den2s,n));
if (den < 1e-5) {
    return 1e5;
}
return ncc_sum(d_nums,n)/den;
}
    unsigned i = get_global_id(0) + get_global_id(1)*get_global_size(1); // global index
    unsigned t = get_local_id(0); // thread index
    buffer[t] = (i < n) ? f[i] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);
    for(unsigned s = get_local_size(0)/2; s > 0; s >>= 1) {
        if (t < s) {
            buffer[t] += buffer[t + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (t == 0) {
        sums[get_global_id(1)] = buffer[0];
    }
}
// vim: ts=4 syntax=cpp noexpandtab
This issue will be used to document the optimization of the PSO function.
Information was obtained from Intel VTune Profiler after tracking 20 frames (10 mc3, 10 rad) of the WN00105 dataset.
Initial Results
As we can see, the worst offender during the PSO function run is the NCC function:
Autoscoper/libautoscoper/src/gpu/opencl/Ncc.cpp
Lines 186 to 226 in 62a7679
We can also see that the majority of the NCC function's run time comes from the ncc_sum function, which is expected since a single run of the ncc function makes 6 calls to the sum function. The sum function is in turn dominated by launching the kernel and by the Buffer::write method. Within the kernel launch, calls are made to the OpenCL API, with clEnqueueNDRangeKernel taking ~5.196 seconds and clFinish taking ~4.330 seconds. The buffer writes are a little counter-intuitive, because we are actually reading from the buffer and writing to the variable we passed in; all of the run time for Buffer::write is caused by the clEnqueueReadBuffer command.
Autoscoper/libautoscoper/src/gpu/opencl/Ncc.cpp
Lines 106 to 153 in 62a7679
Moving forward
Buffer::Write
I would suggest removing the unnecessary call to Buffer::Write from within the ncc_sum driver:
Autoscoper/libautoscoper/src/gpu/opencl/Ncc.cpp
Lines 150 to 152 in 62a7679
That way we can keep the sums on the GPU, since they are used by the regular NCC kernel. The calculation of meanG and meanF can then take place in the NCC kernel:
Autoscoper/libautoscoper/src/gpu/opencl/Ncc.cpp
Lines 188 to 190 in 62a7679
Autoscoper/libautoscoper/src/gpu/opencl/Ncc.cpp
Lines 205 to 213 in 62a7679
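To make the idea of deriving the means at the point of use concrete, here is a minimal CPU sketch in C++. It assumes the textbook NCC definition (the actual Autoscoper kernel may differ, for example through masking), and the names `NccSums` and `ncc_accumulate` are hypothetical, not taken from the repository. The raw sums `sumF` and `sumG` stand in for values a kernel could read directly from device buffers instead of receiving them from the host.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for the three accumulators (not an Autoscoper type).
struct NccSums {
    float num;   // sum of (f - meanF) * (g - meanG)
    float den1;  // sum of (f - meanF)^2
    float den2;  // sum of (g - meanG)^2
};

// CPU model, assuming the textbook NCC definition.
// Preconditions: f and g have the same, non-zero length.
NccSums ncc_accumulate(const std::vector<float>& f,
                       const std::vector<float>& g,
                       float sumF, float sumG)
{
    const float n = static_cast<float>(f.size());
    // The means are derived from the raw sums right where they are needed,
    // which is what an in-kernel computation would do without a host readback.
    const float meanF = sumF / n;
    const float meanG = sumG / n;

    NccSums s{0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < f.size(); ++i) {
        const float df = f[i] - meanF;
        const float dg = g[i] - meanG;
        s.num  += df * dg;
        s.den1 += df * df;
        s.den2 += dg * dg;
    }
    return s;
}
```

A model like this could also serve as a reference result when checking that a refactored GPU path still produces the same sums.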
We will still need to read from the summation buffer once the regular ncc kernel has finished executing, in order to return the expected float value. This will reduce our calls to Buffer::write during the ncc driver from 6 to 3. We could also investigate the viability of using a kernel to calculate the den value, further reducing our calls from 3 to 2:
Autoscoper/libautoscoper/src/gpu/opencl/Ncc.cpp
Lines 219 to 225 in 62a7679
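As a rough illustration of what such a den step would have to compute, the following C++ function models the arithmetic on the CPU. The name `ncc_finalize` is hypothetical; the guard threshold and the `1e5` sentinel are copied from the quoted driver code, written here as float literals. On the device this would be a single work-item reading three already-reduced sums, but that kernel is not shown and is only an assumption about how the suggestion could be realized.

```cpp
#include <cmath>

// Hypothetical finalize step (not an Autoscoper function): takes the three
// reduced sums and applies the same guard as the quoted ncc driver code.
float ncc_finalize(float num, float den1, float den2)
{
    const float den = std::sqrt(den1 * den2);
    if (den < 1e-5f) {
        return 1e5f;  // sentinel used by the driver for a degenerate denominator
    }
    return num / den;
}
```

If this ran on the GPU, only the final float would need to cross back to the host, so the host would issue one read for the result rather than one per sum.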
Kernel::Launch
I would suggest looking into improving the sum kernel somehow. Currently, I am building OCLGrind to aid in this investigation. OCLGrind does not support OpenGL-OpenCL interop.
Autoscoper/libautoscoper/src/gpu/opencl/kernel/NccSum.cl
Lines 1 to 26 in 62a7679
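When experimenting with changes to the sum kernel, a CPU model of its work-group tree reduction can be handy for checking candidate kernels against a known result. The sketch below is a simplified 1D model in C++: the function names `group_sum` and `partial_sums` are hypothetical, and the 2D global indexing of the real kernel is replaced by a plain group offset.

```cpp
#include <cstddef>
#include <vector>

// CPU model of one work-group of a tree reduction in local memory.
// Precondition: local_size is a non-zero power of two; with other sizes the
// halving loop silently skips elements.
float group_sum(const std::vector<float>& f,
                std::size_t group_start, std::size_t local_size)
{
    // Load phase: each "work item" t copies one element, padding with zeros.
    std::vector<float> buffer(local_size);
    for (std::size_t t = 0; t < local_size; ++t) {
        const std::size_t i = group_start + t;
        buffer[t] = (i < f.size()) ? f[i] : 0.0f;
    }
    // Reduction phase: each pass of the outer loop corresponds to one
    // barrier-separated step in the kernel.
    for (std::size_t s = local_size / 2; s > 0; s >>= 1) {
        for (std::size_t t = 0; t < s; ++t) {
            buffer[t] += buffer[t + s];
        }
    }
    return buffer[0];
}

// One partial sum per work-group, as the kernel writes into its sums buffer.
std::vector<float> partial_sums(const std::vector<float>& f,
                                std::size_t local_size)
{
    const std::size_t groups = (f.size() + local_size - 1) / local_size;
    std::vector<float> sums(groups);
    for (std::size_t g = 0; g < groups; ++g) {
        sums[g] = group_sum(f, g * local_size, local_size);
    }
    return sums;
}
```

The model makes one cost visible: every pass over the data shrinks it only by a factor of `local_size`, so large inputs need several kernel launches before a single value remains.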
Outside of the PSO function
It would appear that everything outside of the PSO function runs in a negligible time. NOTE: Anything in the stack in the format func@{memory address} is a call to a function from an external library such as OpenCL, OpenGL, Microsoft Direct3D, Windows USER32, etc.