
OpenCL Device Vector Performance Parameters

Jay edited this page Feb 26, 2014 · 13 revisions

This page discusses the performance of bolt::cl::device_vector in some special cases.

bolt::cl::device_vector is a container designed to encapsulate an OpenCL buffer on the device. Its performance varies depending on the memory flags of the buffer and where the buffer resides. For instance, a CL_MEM_READ_WRITE buffer, which lives in device memory, is a poor choice if the algorithm ultimately executes on the host, because the data must first be copied back. In the examples below, we assume a system consisting of a host CPU and a discrete GPU device with OpenCL support.

#include <bolt/cl/transform.h>
#include <bolt/cl/device_vector.h>
#include <bolt/cl/functional.h>   // for bolt::cl::plus
#include <bolt/cl/control.h>

...
// Calls device_vector constructor with default flag: CL_MEM_READ_WRITE
bolt::cl::device_vector<int> in_buffer( h_vector.begin(),   
                                        h_vector.end() );

// Create a control structure
bolt::cl::control ctl;

// Force execution on the GPU (OpenCL path)
ctl.setForceRunMode(bolt::cl::control::OpenCL);

// plus<int> is a binary functor, so the binary transform overload is
// needed: both input ranges are the same buffer, doubling each element.
bolt::cl::transform( ctl,
                     in_buffer.begin(),
                     in_buffer.end(),
                     in_buffer.begin(),   // second input range
                     in_buffer.begin(),   // output written in place
                     bolt::cl::plus<int>() );
...

The code snippet above demonstrates the correct usage of device_vector: the buffer resides on the GPU and the algorithm also executes on the GPU, so no extra data transfer is required.

#include <bolt/cl/transform.h>
#include <bolt/cl/device_vector.h>
#include <bolt/cl/functional.h>   // for bolt::cl::plus
#include <bolt/cl/control.h>

...

// Calls device_vector constructor with default flag: CL_MEM_READ_WRITE
bolt::cl::device_vector<int> in_buffer( h_vector.begin(),
                                        h_vector.end() );

// Create a control structure
bolt::cl::control ctl;

// Force execution on the multicore host CPU (TBB path)
ctl.setForceRunMode(bolt::cl::control::MulticoreCpu);

// As above, plus<int> requires the binary transform overload.
bolt::cl::transform( ctl,
                     in_buffer.begin(),
                     in_buffer.end(),
                     in_buffer.begin(),   // second input range
                     in_buffer.begin(),   // output written in place
                     bolt::cl::plus<int>() );
...

In the code snippet above, an OpenCL buffer with the CL_MEM_READ_WRITE flag is created on the GPU, but transform takes the multicore (TBB) path as directed by ctl. This forces an additional step on the system: copying the buffer from GPU memory back to the host before the computation can run. To avoid this performance hit, either use a host container such as std::vector, or construct the device_vector with the CL_MEM_USE_HOST_PTR flag so that the buffer resides in host memory.

bolt::cl::device_vector<int> in_buffer( h_vector.begin(),
                                        h_vector.end(),
                                        CL_MEM_USE_HOST_PTR );

Note that if a host container such as std::vector is passed to a Bolt algorithm, a corresponding device_vector is created internally with the CL_MEM_USE_HOST_PTR flag.
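As an alternative to constructing the device_vector explicitly, the host vector can therefore be passed to the algorithm directly. A minimal sketch, assuming a Bolt build with TBB support (the vector size and initial values are illustrative, not from the page above):

```cpp
#include <vector>
#include <bolt/cl/transform.h>
#include <bolt/cl/functional.h>
#include <bolt/cl/control.h>

int main()
{
    // Host vector; Bolt wraps it internally in a device_vector with
    // CL_MEM_USE_HOST_PTR, so no GPU-to-host copy is needed on this path.
    std::vector<int> h_vector( 1024, 1 );

    bolt::cl::control ctl;
    ctl.setForceRunMode( bolt::cl::control::MulticoreCpu );

    // Binary transform: adds each element to itself, in place.
    bolt::cl::transform( ctl,
                         h_vector.begin(), h_vector.end(),
                         h_vector.begin(),   // second input range
                         h_vector.begin(),   // output written in place
                         bolt::cl::plus<int>() );
    return 0;
}
```

Because the buffer already lives in host memory, the multicore path operates on it directly, which is the behavior the CL_MEM_USE_HOST_PTR recommendation above is meant to achieve.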
