
Part 5: OpenCL Buffers and Memory Affinity

24 May 2011 · CPOL · 9 min read
In this fifth article in a series on portable multithreaded programming using OpenCL™, Rob Farber discusses OpenCL™ buffers and demonstrates how to tie computation to data in a multi-device, multi-GPU environment.


The previous article, part 4 of this series, covered the OpenCL™ runtime and demonstrated how to perform concurrent computations among the work queues of heterogeneous devices.

The example source code from part 4 was adapted to queue a user-specified number of tasks split amongst multiple CPU and GPU command queues. Both blocking I/O and mapped I/O are used to tie data to kernels running on each device. The supplied OpenCL kernel, which only adds a number to itself, was purposely kept simple to focus attention on the queue and data management issues. The reader can easily supply other, more complicated kernels via a command-line argument to evaluate them in multiple environments when utilizing (1) just the host processor, (2) one or more GPU devices, or (3) a mix of all devices at once. As in the example in part 4, concurrent execution is initiated succinctly via an OpenMP pragma.

The source code in this article continues to use a simple yet useful preprocessor capability to pass C++ template types to an OpenCL kernel. While the example in this article does not vary the data type to evaluate performance, this capability was kept as a convenience should the reader choose to evaluate the performance of alternative kernels in a mixed, multi-device environment.

Memory Affinity

OpenCL uses a relaxed memory consistency model as discussed in part 2 of this series. The beauty inherent in the OpenCL memory design is that data is localized and associated with a work item, or within a work-group, by the programmer. These work items can then be queued on many devices to achieve very high performance and scalability. This model does require that the programmer assume responsibility to ensure all tasks see a consistent view of memory.

OpenCL utilizes buffer objects within a context to share data between devices, which differs from the programming model expected by those who program conventional shared-memory machines. Buffer objects provide the foundation through which the programmer and OpenCL runtime work in concert to create a single program that can run without recompilation on a multitude of machine and device configurations. For example, the executable from this article ran without recompilation on a single processor, a system with a single GPU, and a larger system with a CPU and multiple GPUs.

It is worth noting that many OpenCL programmers become confused when they encounter the maximum size limitation of a buffer, which is generally much smaller than the size of the memory on their GPU. To avoid this limitation, it is necessary to think in terms of partitioning a computation to run across one or more devices. A buffer then becomes a natural way to express the distribution of work - rather than a mechanism through which a memory image is moved to a single massively threaded device. To preserve efficiency, OpenCL provides both mapped and asynchronous buffers so computation can proceed while additional data is being transferred. In this way, OpenCL programmers can create applications that deliver both high performance and portable parallelism without recompilation on a wide variety of user hardware configurations.
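
To make this partitioning mindset concrete, the following minimal standalone sketch (not part of the article's example code; the sizes and variable names are illustrative only) creates one modest buffer per device rather than a single oversized buffer:

C++
// Minimal sketch: partition one large host array into one buffer per
// device instead of allocating a single oversized buffer.
#include <CL/cl.hpp>
#include <vector>

int main() {
  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);
  std::vector<cl::Device> devices;
  platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);
  cl::Context context(devices);

  const size_t n = 1 << 22;                  // total elements
  const size_t perDev = n / devices.size();  // elements per device
  std::vector<float> host(n, 1.0f);          // host-side data to split

  std::vector<cl::Buffer> chunks;            // one buffer per device
  for (size_t d = 0; d < devices.size(); d++)
    chunks.push_back(cl::Buffer(context, CL_MEM_READ_WRITE,
                                perDev * sizeof(float)));
  return 0;
}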

Explicit, programmer initiated transfers occur by queuing one or more transfers on a command queue. Examples include:

C API:

  • clEnqueueReadBuffer(), clEnqueueReadImage()
  • clEnqueueWriteBuffer(), clEnqueueWriteImage()
  • clEnqueueCopyBuffer(), clEnqueueCopyImage()

C++ API

  • cl::enqueueReadBuffer(), cl::enqueueWriteBuffer()

Data transfers can be either blocking, in which case the queue waits for the transfer to complete, or asynchronous, requiring the use of events for notification when a transfer has completed. Using asynchronous data transfers benefits application performance by allowing computation to overlap with data movement - thus decreasing the time to solution. Since the PCIe bus is full duplex, meaning it can transfer data in both directions at the same time, a potential two-fold increase in effective data transfer bandwidth can be achieved.
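
For readers who want to see the difference in code, here is a minimal sketch contrasting a blocking write with a non-blocking write that signals completion through an event (it assumes a context and queue already exist, as elsewhere in this article, and uses illustrative sizes):

C++
// Sketch: blocking vs. non-blocking transfers (assumes 'context' and
// 'queue' already exist; sizes are illustrative).
std::vector<float> host(1024, 1.0f);
const size_t bytes = host.size() * sizeof(float);
cl::Buffer d_buf(context, CL_MEM_READ_WRITE, bytes);

// Blocking: the call returns only after the transfer has completed.
queue.enqueueWriteBuffer(d_buf, CL_TRUE, 0, bytes, &host[0]);

// Non-blocking: the call returns immediately; the event signals
// completion so computation can overlap the transfer.
cl::Event copyDone;
queue.enqueueWriteBuffer(d_buf, CL_FALSE, 0, bytes, &host[0],
                         NULL, &copyDone);
// ... enqueue unrelated work here ...
copyDone.wait(); // synchronize before reusing 'host'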

Alternatively, regions of the object data can be implicitly transferred by mapping buffers into the host address space. These transfers can occur both asynchronously and on a demand basis, meaning only portions of the data required by a calculation are moved and cached on a device. API examples include the following (a short code sketch follows these lists):

C API:

  • clEnqueueMapBuffer(), clEnqueueMapImage()
  • clEnqueueUnmapMemObject()

C++ API:

  • cl::Buffer() (via various flags discussed below)
  • cl::enqueueMapBuffer(), cl::enqueueMapImage()
  • cl::enqueueUnmapMemObject()
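
The following minimal sketch shows the basic map/modify/unmap pattern with the C++ wrapper (again assuming an existing context and queue; sizes are illustrative):

C++
// Sketch: implicit transfers by mapping a buffer into the host
// address space (assumes 'context' and 'queue' already exist).
const size_t nBytes = 1024 * sizeof(float);
cl::Buffer d_buf(context, CL_MEM_ALLOC_HOST_PTR, nBytes);

// Map for writing; the returned pointer is ordinary host memory.
float* p = (float*)queue.enqueueMapBuffer(d_buf, CL_TRUE,
                                          CL_MAP_WRITE, 0, nBytes);
for (size_t i = 0; i < nBytes / sizeof(float); i++) p[i] = (float)i;

// Unmapping releases the region; the runtime may move (portions of)
// the data to the device on demand.
queue.enqueueUnmapMemObject(d_buf, p);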

It can be confusing to determine which flags, or combination of flags, are needed to correctly create and potentially map a buffer to a device. (The following is excerpted from an excellent explanation by user bwatt on the Khronos.org message board.)

There are three flags that can be used in mapping memory or when creating a buffer with the C++ OpenCL wrapper:

  • CL_MEM_ALLOC_HOST_PTR
  • CL_MEM_COPY_HOST_PTR
  • CL_MEM_USE_HOST_PTR

Five valid combinations can be created with these three flags:

  1. Non-mapped, requiring manual data transfers:
    1. No flags specified
    2. CL_MEM_COPY_HOST_PTR
  2. Mapped buffers:
    1. CL_MEM_ALLOC_HOST_PTR
    2. (CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR)
    3. CL_MEM_USE_HOST_PTR

Letting OpenCL allocate the memory (options 1a and 2a) provides the greatest likelihood of delivering good performance in a portable fashion as the buffer can be internally allocated to best conform to alignment, pinned memory, and other device specific performance criteria. If porting an existing application, it might be necessary to use already allocated regions of memory, which means that CL_MEM_USE_HOST_PTR may be the only option. This may not be the best option from a performance point of view.
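
To make the distinction concrete, this short sketch contrasts options 2a and 2c (it assumes an existing context; sizes and names are illustrative):

C++
// Sketch contrasting two mapped-buffer creation styles
// (assumes 'context' already exists).
const size_t nBytes = 1024 * sizeof(float);

// Option 2a: let OpenCL allocate host-accessible memory - the most
// portable route to well-aligned and possibly pinned storage.
cl::Buffer allocBuf(context, CL_MEM_ALLOC_HOST_PTR, nBytes);

// Option 2c: wrap memory the application already owns, e.g. when
// porting existing code; often slower than letting OpenCL allocate.
std::vector<float> existing(nBytes / sizeof(float));
cl::Buffer useBuf(context, CL_MEM_USE_HOST_PTR, nBytes, &existing[0]);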

Choosing a Queuing Model

Derek Gerstmann gave an excellent presentation at SIGGRAPH Asia 2009, “Advanced OpenCL Event Model Usage”, which discusses several use cases including:

  • 1x In-Order Queue, 1x Context, 1x Device.
  • 1x Out-of-Order Queue, 1x Context, 1x Device.
  • 2x In-Order Queues, 2x Separate Contexts, 2x Devices.
  • 2x In-Order Queues, 1x Combined Context, 2x Devices.

The example in this article utilizes a single context with a single command queue per device, or the “Cooperative Multi-Device Usage Model”. This implies that all objects created in the single context are shared by all the command-queues. Appendix A section A.1 of the OpenCL specification, “Shared OpenCL Objects” indicates:

  • OpenCL memory objects, program objects and kernel objects are created using a context and can be shared across multiple command-queues created using the same context.
  • A command-queue can cache changes to the state of a memory object on the device associated with the command-queue.
  • The application needs to implement appropriate synchronization across threads on the host processor to ensure that the changes to the state of a shared object (such as a command-queue object, memory object, program or kernel object) happen in the correct order (deemed correct by the application) when multiple command-queues in multiple threads are making changes to the state of a shared object.

Succinctly, be careful how buffers are utilized by multiple devices as the runtime may introduce copies and other behavior that can affect performance and even program correctness. A good, recent discussion on this topic occurred in this thread on the Khronos.org message board. On the basis of this discussion, the example code in this article dedicated a buffer per device for data transfers and when mapping memory to the devices.

Example

The source code for testSum.cpp creates a two-dimensional integer array in host memory according to the sizes specified by the user on the command-line. This array is filled with random numbers and transferred (or mapped) to one or more devices. The OpenCL kernel, simpleAdd.cl, simply adds each array element to itself.

As discussed in part 4, the C++ template header file, testSum.hpp, uses template specialization to get the typename of a template parameter for several atomic datatypes (float, double, int, etc.). This method keeps the code simple and avoids complicating it with too much C++ wizardry. Since the kernel is built inside the instantiated class, this typename can be passed to the OpenCL kernel build via preprocessor defines.

The complete code for testSum.hpp follows. Note that it has been adapted to support multiple kernel invocations by essentially removing all the code inside the initData method. Aside from that, this code is very similar to the C++ template files used in the previous two articles.

C++
#include <cmath>
#include <algorithm>
// The following defines specialized templates to provide a string
// containing the typename
template<class T>
struct TypeName {
  string getName();
private:
  T *t; 
};
 
template<> string TypeName<double>::getName() {return(string("double")); }
template<> string TypeName<float>::getName() {return(string("float")); }
template<> string TypeName<unsigned long>::getName() {return(string("ulong"));}
template<> string TypeName<long>::getName() { return(string("long")); }
template<> string TypeName<unsigned int>::getName() {return(string("uint"));}
template<> string TypeName<int>::getName() {return(string("int")); }
template<> string TypeName<unsigned char>::getName() {return(string("uchar"));}
template<> string TypeName<char>::getName() {return(string("char")); }
 
// specification of the OclTest template
template <typename TYPE1>
class OclTest {
private:
  cl::Kernel kernel;
  string myType;
  cl::Event event;
 
  // variables for the test
  int vecsize;
 
public:
  cl::Event *getEventPtr() { return &event;}
  OclTest() {}
 
  OclTest( cl::CommandQueue& queue, const char* kernelFile, 
          int argc, char *argv[])
  {
    cl::Device device = queue.getInfo<CL_QUEUE_DEVICE>();
    cl::Context context = queue.getInfo<CL_QUEUE_CONTEXT>();
    cout << "---------- building OpenCL kernel (" 
        << kernelFile << ") -----" << endl;
    myType= TypeName<TYPE1>().getName();
    cout << "   My type is " << myType.c_str() << endl;
 
    // Demonstrate using defines in the ocl build
    string buildOptions;
    { // create preprocessor defines for the kernel
      char buf[256]; 
      sprintf(buf,"-D TYPE1=%s ", myType.c_str());
      buildOptions += string(buf);
    }
    
    // build the program from the source in the file
    ifstream file(kernelFile);
    string prog(istreambuf_iterator<char>(file),
              (istreambuf_iterator<char>()));
    cl::Program::Sources source( 1, make_pair(prog.c_str(),
                                        prog.length()+1));
    cl::Program program(context, source);
    file.close();
 
    try {
      cerr << "   buildOptions " << buildOptions << endl;
      cl::vector<cl::Device> buildDevices; // build for just this device
      buildDevices.push_back(device);
      program.build(buildDevices, buildOptions.c_str() );
    } catch(cl::Error& err) {
      // Get the build log
      cerr << "Build failed! " << err.what() 
          << '(' << err.err() << ')' << endl;
      cerr << "retrieving  log ... " << endl;
      cerr 
       << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device)
       << endl;
      exit(-1);
    }
    // Get the name of the kernel from the filename
    string kernelName = string(kernelFile)
      .substr(0,string(kernelFile).find(".cl"));
    kernel = cl::Kernel(program, kernelName.c_str());
  }
  
  inline void initData(int _vecsize)
  {
    vecsize = _vecsize;
  }
 
  inline cl::Kernel& getKernel() { return(kernel); }
  // Methods to return information for queuing work-groups
  cl::NDRange getGlobalWorkItems() {
    return( cl::NDRange( vecsize ) ); 
  }
  cl::NDRange getWorkItemsInWorkGroup() {
    // Only one work item per workgroup. Note the NDRange dimensions
    // must match the 1D global range returned by getGlobalWorkItems()
    return( cl::NDRange(1) ); 
  }
};

TestSum.cpp

The focus of testSum.cpp is to demonstrate the creation of multiple command queues within a single context while maintaining data affinity for multiple kernel invocations across a number of devices. Using the C++ OpenCL wrapper makes the code to find all the devices and put them into a context fairly compact:

C++
cl::vector< cl::Platform > platformList;
cl::Platform::get(&platformList);

// Get all the appropriate devices for the platform the
// implementation thinks we should be using.
// find the user-specified devices
cl::vector<cl::Device> devices;
for(int i=0; i < deviceType.size(); i++) {
  cl::vector<cl::Device> dev;
  platformList[0].getDevices(deviceType[i], &dev);
  for(int j=0; j < dev.size(); j++) devices.push_back(dev[j]);
}

// set a single context; tie it to the platform the devices came from
cl_context_properties cprops[] = {CL_CONTEXT_PLATFORM,
   (cl_context_properties)(platformList[0])(), 0};
cl::Context context(devices, cprops);
cout << "Using the following device(s) in one context" << endl;
for(int i=0; i < devices.size(); i++)  {
  cout << "  " << devices[i].getInfo<CL_DEVICE_NAME>() << endl;
}

This code has been tested on a system containing two ATI Radeon HD 5870 GPUs and an AMD Phenom™ II X6 1055T processor running the latest AMD Accelerated Parallel Processing (APP) SDK (formerly known as ATI Stream). (Note: the code in part 3 also works correctly on multi-GPU systems.)

Creating the separate command queues, one per device, is also compact:

C++
    // Create the separate command queues to perform work
    cl::vector< cl::CommandQueue > contextQueues;
    for(int i=0; i < devices.size(); i++)  {
#ifdef PROFILING
      cl::CommandQueue queue(context, devices[i],CL_QUEUE_PROFILING_ENABLE);
#else
      cl::CommandQueue queue(context, devices[i],0);
#endif
      contextQueues.push_back( queue );
    }

C++ preprocessor conditionals are utilized to choose between mapped and unmapped buffers. Setting the preprocessor variable USE_MAP will compile the code using mapped buffers that implicitly transfer the data to the OpenCL devices. The default is to use explicit, blocking transfers.

C++
    int nDevices = contextQueues.size();
    unsigned int* vec = new uint[nvec*vecsize];
    int vecBytes=vecsize*sizeof(uint);
    // Fill the host memory with random data for the sums
    srand(0);
    for(int i=0; i < (nvec*vecsize); i++) vec[i] = (rand()&0xffffff);
 
    // Create a separate buffer for each device in the context
#ifdef USE_MAP
    // This maps all of the host data into memory so it does not need
    // to be manually copied.
    cl::vector< cl::Buffer > d_vec;
    for(int i=0; i < contextQueues.size(); i++) {
      d_vec.push_back(cl::Buffer(context, CL_MEM_COPY_HOST_PTR, 
                            nvec* vecBytes, vec) );
    }
    int vecOffset=vecBytes; // the buffer is of size vec, so use row offset
#else
    cl::vector< cl::Buffer > d_vec;
    for(int i=0; i < contextQueues.size(); i++) {
      d_vec.push_back(cl::Buffer(context, CL_MEM_READ_WRITE, vecBytes) );
    }
    int vecOffset=0; // the buffer is the size of one vector so no offset
#endif

As with the example in part 4, OpenMP has been utilized to concurrently queue commands in parallel on all the devices. This provides a simple and concise way to introduce parallelism via an OpenMP pragma as seen in the code snippet below. More information on OpenMP can be found on the Internet; an excellent starting place is the Wikipedia article on OpenMP. Note that the kernel runs multiple times on each queue.

C++
#pragma omp parallel for
    for(int i=0; i < contextQueues.size(); i++) {
      test[i].initData(vecsize);
      test[i].getKernel().setArg(0,vecsize);
      test[i].getKernel().setArg(1,d_vec[i]);
 
      for(int j=i; j < nvec; j += nDevices) {
#ifdef USE_MAP
       test[i].getKernel().setArg(2,j); // set the offset for the kernel
#else
       test[i].getKernel().setArg(2,0);
       // manually transfer the data to the device
       contextQueues[i].enqueueWriteBuffer(d_vec[i], CL_TRUE,0, vecBytes,
                                      &vec[j*vecsize]);
#endif
       contextQueues[i].enqueueNDRangeKernel(
                    test[i].getKernel(), 
                    cl::NullRange, // offset starts at 0,0
                    test[i].getGlobalWorkItems(), // total number of work items
                    test[i].getWorkItemsInWorkGroup(), // workgroup size
                    NULL, test[i].getEventPtr());
       // manually transfer the data from the device
       contextQueues[i].enqueueReadBuffer(d_vec[i], CL_TRUE,
                                     j * vecOffset,
                                     vecBytes,
                                     &vec[j*vecsize]);
      }
      contextQueues[i].finish(); // wait for everything to finish
    }

Once all the devices have completed, the host double-checks the results and prints out either success or failure:

C++
// perform the golden test
{
  int i;
  srand(0);
  for(i=0; i < (nvec*vecsize); i++) {
   unsigned int r = (rand()&0xffffff);
   r += r;
   if(r != vec[i]) break;
  }
  if(i == (nvec*vecsize)) {
   cout << "test passed" << endl;
  } else {
   cout << "TEST FAILED!" << endl;
  }
}

Following is the complete code for testSum.cpp.

C++
#define PROFILING // Define to see the time the kernel takes
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS // needed for exceptions
#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
using namespace std;
 
#include "testSum.hpp"
 
int main(int argc, char* argv[])
{
  if( argc < 5) {
    cerr << "Use: {cpu|gpu|both} kernelFile nvec vecsize" << endl;
    exit(EXIT_FAILURE);
  }
 
  // handle command-line arguments
  const string platformName(argv[1]);
  const char* kernelFile = argv[2];
  int nvec = atoi(argv[3]);
  int vecsize = atoi(argv[4]);
 
  cl::vector<int> deviceType;
  cl::vector< cl::CommandQueue > contextQueues;
 
  // crudely parse the command line arguments
  if(platformName.compare("cpu")==0)
    deviceType.push_back(CL_DEVICE_TYPE_CPU);
  else if(platformName.compare("gpu")==0) 
    deviceType.push_back(CL_DEVICE_TYPE_GPU);
  else if(platformName.compare("both")==0) {
    deviceType.push_back(CL_DEVICE_TYPE_GPU);
    deviceType.push_back(CL_DEVICE_TYPE_CPU);
  } else { cerr << "Invalid device type!" << endl; return(1); }
 
 
  // create the contexts and queues
  try {
    cl::vector< cl::Platform > platformList;
    cl::Platform::get(&platformList);
 
    // Get all the appropriate devices for the platform the
    // implementation thinks we should be using.
    // find the user-specified devices
    cl::vector<cl::Device> devices;
    for(int i=0; i < deviceType.size(); i++) {
      cl::vector<cl::Device> dev;
      platformList[0].getDevices(deviceType[i], &dev);
      for(int j=0; j < dev.size(); j++) devices.push_back(dev[j]);
    }
 
    // set a single context; tie it to the platform the devices came from
    cl_context_properties cprops[] = {CL_CONTEXT_PLATFORM,
       (cl_context_properties)(platformList[0])(), 0};
    cl::Context context(devices, cprops);
    cout << "Using the following device(s) in one context" << endl;
    for(int i=0; i < devices.size(); i++)  {
      cout << "  " << devices[i].getInfo<CL_DEVICE_NAME>() << endl;
    }
 
    // Create the separate command queues to perform work
    cl::vector< cl::CommandQueue > contextQueues;
    for(int i=0; i < devices.size(); i++)  {
#ifdef PROFILING
      cl::CommandQueue queue(context, devices[i],CL_QUEUE_PROFILING_ENABLE);
#else
      cl::CommandQueue queue(context, devices[i],0);
#endif
      contextQueues.push_back( queue );
    }
    
    // Create tests for all the queues
    cl::vector< OclTest<uint> > test;
    for(int i=0; i < contextQueues.size(); i++) {
      test.push_back(OclTest<uint>(contextQueues[i],
                              kernelFile, argc-3, argv+3));
    }
 
    int nDevices = contextQueues.size();
    unsigned int* vec = new uint[nvec*vecsize];
    int vecBytes=vecsize*sizeof(uint);
    // Fill the host memory with random data for the sums
    srand(0);
    for(int i=0; i < (nvec*vecsize); i++) vec[i] = (rand()&0xffffff);
 
    // Create a separate buffer for each device in the context
#ifdef USE_MAP
    // This maps all of the host data into memory so it does not need
    // to be manually copied.
    cl::vector< cl::Buffer > d_vec;
    for(int i=0; i < contextQueues.size(); i++) {
      d_vec.push_back(cl::Buffer(context, CL_MEM_COPY_HOST_PTR, 
                            nvec* vecBytes, vec) );
    }
    int vecOffset=vecBytes; // the buffer is of size vec, so use row offset
#else
    cl::vector< cl::Buffer > d_vec;
    for(int i=0; i < contextQueues.size(); i++) {
      d_vec.push_back(cl::Buffer(context, CL_MEM_READ_WRITE, vecBytes) );
    }
    int vecOffset=0; // the buffer is the size of one vector so no offset
#endif
 
    // run the tests
#pragma omp parallel for
    for(int i=0; i < contextQueues.size(); i++) {
      test[i].initData(vecsize);
      test[i].getKernel().setArg(0,vecsize);
      test[i].getKernel().setArg(1,d_vec[i]);
 
      for(int j=i; j < nvec; j += nDevices) {
#ifdef USE_MAP
       test[i].getKernel().setArg(2,j); // set the offset for the kernel
#else
       test[i].getKernel().setArg(2,0);
       // manually transfer the data to the device
       contextQueues[i].enqueueWriteBuffer(d_vec[i], CL_TRUE,0, vecBytes,
                                      &vec[j*vecsize]);
#endif
       contextQueues[i].enqueueNDRangeKernel(
                    test[i].getKernel(), 
                    cl::NullRange, // offset starts at 0,0
                    test[i].getGlobalWorkItems(), // total number of work items
                    test[i].getWorkItemsInWorkGroup(), // workgroup size
                    NULL, test[i].getEventPtr());
       // manually transfer the data from the device
       contextQueues[i].enqueueReadBuffer(d_vec[i], CL_TRUE,
                                     j * vecOffset,
                                     vecBytes,
                                     &vec[j*vecsize]);
      }
      contextQueues[i].finish(); // wait for everything to finish
    }
    // perform the golden test
    {
      int i;
      srand(0);
      for(i=0; i < (nvec*vecsize); i++) {
       unsigned int r = (rand()&0xffffff);
       r += r;
       if(r != vec[i]) break;
      }
      if(i == (nvec*vecsize)) {
       cout << "test passed" << endl;
      } else {
       cout << "TEST FAILED!" << endl;
      }
    }
    delete [] vec;
      
  } catch (cl::Error error) {
    cerr << "caught exception: " << error.what() 
        << '(' << error.err() << ')' << endl;
  }
  return EXIT_SUCCESS;
}

The following commands build and run the code using the recently released AMD APP SDK version 2.5. The first command builds the version that uses explicit transfers; the second specifies the C++ preprocessor define USE_MAP to build the mapped-buffer version. (Note that both commands write the same testSum executable, so build the variant you intend to run.)

g++ -I $ATISTREAMSDKROOT/include -fopenmp testSum.cpp -L $ATISTREAMSDKROOT/lib/x86_64 -lOpenCL -o testSum

g++ -D USE_MAP -I $ATISTREAMSDKROOT/include -fopenmp testSum.cpp -L $ATISTREAMSDKROOT/lib/x86_64 -lOpenCL -o testSum

simpleAdd.cl

The complete source listing for simpleAdd.cl follows:

OpenCL C
inline __kernel void simpleAdd(int veclen, __global TYPE1* c, int offset)
{
  // get the index of the test we are performing
  int index = get_global_id(0);
 
  c[index + offset*veclen] += c[index + offset*veclen];
}

Command-line options allow setting the following:

  1. cpu, gpu, both:
    1. cpu: run only on the processor cores.
    2. gpu: run on all the GPU devices.
    3. both: run on both the CPU and GPU devices.
  2. The filename of the OpenCL kernel.
  3. The number of rows in the array. This also defines how many kernel invocations will happen across the devices (one kernel invocation per row).
  4. The number of columns in each row. This defines how much data will be transferred across the PCIe bus.

The following shows an example output when running on all the devices:

$ ./testSum both simpleAdd.cl 300 10240
Using the following device(s) in one context
  Cypress
  Cypress
  AMD Phenom(tm) II X6 1055T Processor
---------- building OpenCL kernel (simpleAdd.cl) -----
   My type is uint
   buildOptions -D TYPE1=uint 
---------- building OpenCL kernel (simpleAdd.cl) -----
   My type is uint
   buildOptions -D TYPE1=uint 
---------- building OpenCL kernel (simpleAdd.cl) -----
   My type is uint
   buildOptions -D TYPE1=uint 
test passed

Summary

The example in this article demonstrated that it is possible to concisely specify an application that can run on numerous device configurations without recompilation. This is the essence of the portable parallelism of OpenCL.

Tying data to computation is a key capability that permits work to be queued on many devices to achieve very high performance and scalability. Rather than thinking in terms of a single memory image, OpenCL programmers need to think in terms of buffers that can be moved as needed amongst devices. In this way, the OpenCL mapping and data transfer capabilities can be utilized. This differs from the monolithic view of a single large memory presented by most conventional SMP (symmetric multiprocessing) architectures. With buffer objects, the OpenCL programmer and runtime can work together to have a single program run, without recompilation, on a single processor, a hybrid CPU/GPU system, or a system with a CPU and multiple GPUs.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).


Written By
United States
Rob Farber is a senior scientist and research consultant at the Irish Centre for High-End Computing in Dublin, Ireland, and Chief Scientist for BlackDog Endeavors, LLC in the US.

Rob has been on staff at several US national laboratories including Los Alamos National Laboratory, Lawrence Berkeley National Laboratory, and Pacific Northwest National Laboratory. He also served as an external faculty member at the Santa Fe Institute, co-founded two successful start-ups, and has been a consultant to Fortune 100 companies. His articles have appeared in Dr. Dobb's Journal and Scientific Computing, among others. He recently completed a book teaching massively parallel computing.
