OpenCL Device Vector Performance Parameters
This page discusses the performance of bolt::cl::device_vector in some special cases.
A device_vector is a container that encapsulates an OpenCL buffer. Its performance depends on the type of the buffer and on where the buffer lives relative to where the algorithm runs. For instance, avoid a CL_MEM_READ_WRITE buffer, which is allocated on the GPU, if the algorithm ends up executing on the host. The examples assume a system with a host CPU and a discrete, OpenCL-capable GPU.
#include <bolt/cl/transform.h>
#include <bolt/cl/device_vector.h>
#include <bolt/cl/control.h>
#include <bolt/cl/functional.h> // bolt::cl::plus
...
// Calls device_vector constructor with default flag: CL_MEM_READ_WRITE
bolt::cl::device_vector<int> in_buffer( h_vector.begin(),
h_vector.end() );
// Create a control structure
bolt::cl::control ctl;
// Force to run on GPU device
ctl.setForceRunMode(bolt::cl::control::OpenCL);
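// plus<int> is a binary functor, so transform needs two input ranges.
// Passing in_buffer as both inputs adds each element to itself, in place.
bolt::cl::transform( ctl,
                     in_buffer.begin(),
                     in_buffer.end(),
                     in_buffer.begin(),
                     in_buffer.begin(),
                     bolt::cl::plus<int>() );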
...
The snippet above demonstrates the recommended usage of device_vector: the buffer resides on the GPU and the code runs on the GPU.
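When checking that a device run produced the right values, it can help to compute the same result on the host with the standard library. The following is a minimal sketch, not part of Bolt: it assumes the intended operation is adding each element to itself, and host_reference is a hypothetical helper name chosen for illustration.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical helper: host-only reference for an element-wise "in + in".
// Uses only the C++ standard library; no OpenCL or Bolt involved.
std::vector<int> host_reference(const std::vector<int>& input) {
    std::vector<int> expected(input.size());
    // Binary std::transform: expected[i] = input[i] + input[i]
    std::transform(input.begin(), input.end(),
                   input.begin(),
                   expected.begin(),
                   std::plus<int>());
    return expected;
}
```

The returned vector can then be compared element by element against the buffer contents after the device transform.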
#include <bolt/cl/transform.h>
#include <bolt/cl/device_vector.h>
#include <bolt/cl/control.h>
#include <bolt/cl/functional.h> // bolt::cl::plus
...
// Calls device_vector constructor with default flag: CL_MEM_READ_WRITE
bolt::cl::device_vector<int> in_buffer( h_vector.begin(),
h_vector.end() );
// Create a control structure
bolt::cl::control ctl;
// Force the algorithm to run on the multicore host CPU
ctl.setForceRunMode(bolt::cl::control::MultiCoreCpu);
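// plus<int> is a binary functor, so transform needs two input ranges.
// Passing in_buffer as both inputs adds each element to itself, in place.
bolt::cl::transform( ctl,
                     in_buffer.begin(),
                     in_buffer.end(),
                     in_buffer.begin(),
                     in_buffer.begin(),
                     bolt::cl::plus<int>() );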
...
In the snippet above, an OpenCL buffer is created on the GPU with the CL_MEM_READ_WRITE flag. However, transform takes the TBB path on the host, as directed by ctl, which gives the system an additional job: copying the buffer from GPU memory back to the host. To avoid this performance hit, either use a host container such as std::vector, or construct the device_vector with the CL_MEM_USE_HOST_PTR flag so that the buffer resides in host memory.
bolt::cl::device_vector<int> in_buffer( h_vector.begin(),
h_vector.end(),
CL_MEM_USE_HOST_PTR );
Note that if a host vector such as std::vector is passed to any algorithm, a corresponding device_vector is created with the CL_MEM_USE_HOST_PTR flag.
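The guidance on this page can be condensed into a small decision helper. The sketch below is an illustration only, not Bolt code: RunMode, mem_flags, the two constants, and choose_buffer_flags are hypothetical names. The constants are stand-ins that carry the same bit values as CL_MEM_READ_WRITE and CL_MEM_USE_HOST_PTR in CL/cl.h, so the sketch compiles without OpenCL headers. Treating the automatic mode like the GPU case is an assumption.

```cpp
#include <cstdint>

// Stand-ins for OpenCL's cl_mem_flags constants (same bit values as CL/cl.h).
// Real code would include <CL/cl.h> and use the CL_MEM_* macros directly.
using mem_flags = std::uint64_t;
const mem_flags kMemReadWrite  = 1u << 0;  // CL_MEM_READ_WRITE
const mem_flags kMemUseHostPtr = 1u << 3;  // CL_MEM_USE_HOST_PTR

// Hypothetical mirror of the run modes discussed on this page.
enum class RunMode { Automatic, SerialCpu, MultiCoreCpu, OpenCL };

// Pick buffer flags so the buffer lives where the algorithm will run.
inline mem_flags choose_buffer_flags(RunMode mode) {
    switch (mode) {
        case RunMode::SerialCpu:
        case RunMode::MultiCoreCpu:
            // Host execution: keep the data in host memory to avoid a copy back.
            return kMemUseHostPtr;
        case RunMode::OpenCL:
        case RunMode::Automatic:  // assumption: treated like the GPU case
            return kMemReadWrite;
    }
    return kMemReadWrite;  // unreachable for valid enum values
}
```

The returned value would be passed as the flags argument of the device_vector constructor, in the position where CL_MEM_USE_HOST_PTR appears in the snippet above.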