VexCL: Vector expression template library for OpenCL

Denis Demidov

4.77/5 (18 votes)

Jul 3, 2012

MIT

13 min read

56457

737

This article is an introduction to VexCL. VexCL is vector expression template library created for ease of C++ based OpenCL development.

Download source - 193 KB

Introduction

VexCL is vector expression template library for OpenCL. It has been created for ease of C++ based OpenCL development. Multi-device (and multi-platform) computations are supported. Source code of the library is publicly available under MIT license. See the Doxygen-generated documentation at http://ddemidov.github.com/vexcl.

This is the first of two articles describing the VexCL library. The first part is an introduction to the VexCL interface. The second part compares VexCL performance with existing GPGPU implementations. The comparison is based on odeint - a modern C++ library for numerical solutions of ordinary differential equations.

To quote Khronos group, the organization behind the OpenCL standard, OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers, and handheld/embedded devices. OpenCL (Open Computing Language) greatly improves speed and responsiveness for a wide spectrum of applications in numerous market categories from gaming and entertainment to scientific and medical software.

The weakest sides of OpenCL are, probably, lack of tools and libraries around it and the amount of boilerplate code needed to develop OpenCL applications. The VexCL library tries to solve the latter issue. To start with an example, consider the "hello world" problem of OpenCL: addition of two vectors. The following code block contains the program implementing the task with pure OpenCL. Note that the official C++ bindings are used here; the C variant would be much more verbose.

#include <iostream>
#include <vector>
#include <string>

#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>

// Compute c = a + b.
static const char source[] =
    "#if defined(cl_khr_fp64)\n"
    "#  pragma OPENCL EXTENSION cl_khr_fp64: enable\n"
    "#elif defined(cl_amd_fp64)\n"
    "#  pragma OPENCL EXTENSION cl_amd_fp64: enable\n"
    "#else\n"
    "#  error double precision is not supported\n"
    "#endif\n"
    "kernel void add(\n"
    "       uint n,\n"
    "       global const double *a,\n"
    "       global const double *b,\n"
    "       global double *c\n"
    "       )\n"
    "{\n"
    "    uint i = get_global_id(0);\n"
    "    if (i < n) {\n"
    "       c[i] = a[i] + b[i];\n"
    "    }\n"
    "}\n";

int main() {
    const unsigned int N = 1 << 20;

    try {
	// Get list of OpenCL platforms.
	std::vector<cl::Platform> platform;
	cl::Platform::get(&platform);

	if (platform.empty()) {
	    std::cerr << "OpenCL platforms not found." << std::endl;
	    return 1;
	}

	// Get first available GPU device which supports double precision.
	cl::Context context;
	std::vector<cl::Device> device;
	for(auto p = platform.begin(); device.empty() && p != platform.end(); p++) {
	    std::vector<cl::Device> pldev;

	    try {
		p->getDevices(CL_DEVICE_TYPE_GPU, &pldev);

		for(auto d = pldev.begin(); device.empty() && d != pldev.end(); d++) {
		    if (!d->getInfo<CL_DEVICE_AVAILABLE>()) continue;

		    std::string ext = d->getInfo<CL_DEVICE_EXTENSIONS>();

		    if (
			    ext.find("cl_khr_fp64") == std::string::npos &&
			    ext.find("cl_amd_fp64") == std::string::npos
		       ) continue;

		    device.push_back(*d);
		    context = cl::Context(device);
		}
	    } catch(...) {
		device.clear();
	    }
	}

	if (device.empty()) {
	    std::cerr << "GPUs with double precision not found." << std::endl;
	    return 1;
	}

	std::cout << device[0].getInfo<CL_DEVICE_NAME>() << std::endl;

	// Create command queue.
	cl::CommandQueue queue(context, device[0]);

	// Compile OpenCL program for found device.
	cl::Program program(context, cl::Program::Sources(
		    1, std::make_pair(source, strlen(source))
		    ));

	try {
	    program.build(device);
	} catch (const cl::Error&) {
	    std::cerr
		<< "OpenCL compilation error" << std::endl
		<< program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device[0])
		<< std::endl;
	    return 1;
	}

	cl::Kernel add = cl::Kernel(program, "add");

	// Prepare input data.
	std::vector<double> a(N, 1);
	std::vector<double> b(N, 2);
	std::vector<double> c(N);

	// Allocate device buffers and transfer input data to device.
	cl::Buffer A(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
		a.size() * sizeof(double), a.data());

	cl::Buffer B(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
		b.size() * sizeof(double), b.data());

	cl::Buffer C(context, CL_MEM_READ_WRITE,
		c.size() * sizeof(double));

	// Set kernel arguments.
	add.setArg(0, N);
	add.setArg(1, A);
	add.setArg(2, B);
	add.setArg(3, C);

	// Launch kernel on the compute device.
	queue.enqueueNDRangeKernel(add, cl::NullRange, N, cl::NullRange);

	// Get result back to host.
	queue.enqueueReadBuffer(C, CL_TRUE, 0, c.size() * sizeof(double), c.data());

	// Should get '3' here.
	std::cout << c[42] << std::endl;
    } catch (const cl::Error &err) {
	std::cerr
	    << "OpenCL error: "
	    << err.what() << "(" << err.err() << ")"
	    << std::endl;
	return 1;
    }
}

I am sorry for wasting page space, but that is the point of this example. Feel free to collapse this ugly monster away as soon as you note that it contains 133 lines of code. As you can see, the basic steps of an OpenCL program are selection of the compute device, initialization of the OpenCL context, building the OpenCL program, allocating and initializing memory on the selected device, and running the compute kernel. VexCL strives to simplify (and get rid of, where possible) these steps. A program that solves the same problem with the help of the VexCL library is presented below.

#include <iostream>
#include <vector>
#include <vexcl/vexcl.hpp>

int main() {
    const unsigned int N = 1 << 20;

    try {
	// Init VexCL context: grab one GPU with double precision.
	vex::Context ctx(
		vex::Filter::Type(CL_DEVICE_TYPE_GPU) &&
		vex::Filter::DoublePrecision &&
		vex::Filter::Count(1)
		);

	if (ctx.queue().empty()) {
	    std::cerr << "GPUs with double precision not found." << std::endl;
	    return 1;
	}

	std::cout << ctx << std::endl;

	// Prepare input data.
	std::vector<double> a(N, 1);
	std::vector<double> b(N, 2);
	std::vector<double> c(N);

	// Allocate device vectors and transfer input data to device.
	vex::vector<double> A(ctx.queue(), a);
	vex::vector<double> B(ctx.queue(), b);
	vex::vector<double> C(ctx.queue(), N);

	// Launch kernel on compute device.
	C = A + B;

	// Get result back to host.
	vex::copy(C, c);

	// Should get '3' here.
	std::cout << c[42] << std::endl;
    } catch (const cl::Error &err) {
	std::cerr << "OpenCL error: " << err << std::endl;
	return 1;
    }
}

This program is much more concise (45 lines to be precise) and almost as effective. VexCL uses template metaprogramming techniques, so most of the work is done at compilation time. The only considerable overhead the VexCL example has is dynamic construction of kernel source. But that is a relatively lightweight operation and it is performed only once, when an expression is first encountered in your program.

VexCL makes heavy use of C++11 features, so your compiler has to be modern enough. GCC versions 4.6 and above are fully supported. Microsoft Visual C++ 2010 manages to compile the project with some features disabled: since it does not support variadic templates, only one-argument built-in functions are enabled; user functions are not available at all.

VexCL interface description

Compute device selection

VexCL supports multi-device computations. Compute devices you use may even belong to different OpenCL platforms. For example, a single VexCL context may include AMD and NVIDIA graphic cards as well as Intel CPU. The easiest way to initialize VexCL is by using the vex::Context class. Its constructor accepts a device filter or a functor acting on cl::Device. Several standard filters are defined. Compute devices may be filtered by name, vendor, platform, type (CPU or GPU), etc. Device filters may be combined with logical operators. For example, the following piece of code lists all available GPUs supporting double precision computations:

vex::Context ctx(
    vex::Filter::Type(CL_DEVICE_TYPE_GPU) && vex::Filter::DoublePrecision
    );
std::cout << ctx << std::endl;

In case built-in functionality is not enough, users may provide their own filters.

Almost any class in VexCL accepts std::vector of cl::CommandQueues as a constructor argument. Each command queue is presumably located on a separate compute device. This list of queues may be obtained by a call to the queue() method of the initialized vex::Context class, or be supplied by a user. This allows to easily incorporate VexCL into existing projects that have their own means of OpenCL initialization.

Vector arithmetic

Once you have initialized the VexCL context, you can allocate device vectors on the associated devices. Each device in the VexCL context gets its own share of vector memory and computational work. Vectors in VexCL are partitioned across all devices given at construction time. All vectors use the same partitioning scheme. This guarantees that corresponding elements of two equally sized vectors are located on the same compute devices and no inter-device data transfer is required for computations.

In the example below, device vector x is allocated across all devices supporting double precision.

vex::Context ctx(vex::Filter::DoublePrecision);
const size_t N = 1024 * 1024;
vex::vector<double> x(ctx.queue(), N);

The default partitioning scheme is based on the measuring performance of the code a = b + c; on every device in the context, where a, b, and c are vectors residing on the benchmarked device. This test is performed first time any multi-device vector is allocated. Each device gets part of a vector proportional to its performance. This is another overhead introduced when using the VexCL library in a multi-device context. But this overhead is easily removed by providing a device weighting function. If, for example, you have a homogeneous set of compute devices, then equal partitioning is the best choice. In order to skip the bandwidth test, you assign the same weight to each device:

vex::partitioning_scheme<>::set(
    [](const cl::Context&, const cl::Device&) { return 1.0; }
    );

Note that you have to set the partitioning scheme before you allocate any vector. Otherwise, the default partitioning scheme will be used. The partitioning behaviour may be altered only once. This guarantees that all vectors are consistently partitioned.

Once the device vectors are allocated, simple and intuitive arithmetic expressions may be used to alter their state. For every expression you use, the appropriate kernel is generated, compiled (first time it is encountered in your program), and called automatically. If you want to get sources of the generated kernels output to the standard stream, define a VEXCL_SHOW_KERNELS macro before including the VexCL headers.

const size_t n = 1 << 20;
std::vector<float> x(n);
std::generate(x.begin(), x.end(), [](){ return (float)rand() / RAND_MAX; });

vex::vector<float> X(ctx.queue(), x);
vex::vector<float> Y(ctx.queue(), n);
vex::vector<float> Z(ctx.queue(), n);

Y = 42;
Z = sqrt(2 * X) + pow(cos(Y), 2.0);

Computational work is split between devices. Vector expressions supported by VexCL are embarrassingly parallel, so if you have two or more compute devices, then you should get linear speedup for your code. You can copy the result back to the host or you can use vex::vector::operator[] to read (or write) vector elements directly. Though the latter technique is very ineffective and should be used for debugging purposes only:

copy(Z, x);
assert(x[42] == Z[42]);

Reductions

Reduction of a vector to a single value is a commonly used operation. You use reduction to obtain the sum of vector elements, or to find the maximum/minimum element of a vector. The class template vex::Reductor allows to perform reduction of an arbitrary expression. Supported types of reduction are SUM, MIN, and MAX. This is how you obtain the sum of all elements in an expression:

vex::vector<double> x(ctx.queue(), 4096);
vex::vector<double> y(ctx.queue(), 4096);

// ...

vex::Reductor<double,vex::SUM> sum(ctx.queue());

// Sum:
std::cout << sum(x) << std::endl;

// Inner product:
std::cout << sum(x * y) << std::endl;

// How many elements in x are greater than 1?
std::cout << sum(x > 1) << std::endl;

Another example is finding the maximum distance from the origin for a set of 2D points. Point coordinates are stored in x and y vectors:

vex::Reductor<double, vex::MAX> max(ctx.queue());
std::cout << max(sqrt(pow(x, 2.0) + pow(y, 2.0))) << std::endl;

Note that the vex::Reductor instance allocates a small temporary buffer on each device in its queue list at construction time, so for performance reasons it should be allocated once and used throughout the application lifetime.

User-defined functions

Simple functions to be used in kernels are easily defined in VexCL. All you need is to define the function body, its return type, and the types of its arguments. After that, the function may be used in vector expressions in the same way built-in functions are used. Note that the function body has to be defined at global scope and belongs to the extern const char[] type. These requirements are necessary in order to use the body string as a template parameter. The following example counts 2D points located in the first quadrant:

extern const char between_body[] = "return prm1 <= prm2 && prm2 <= prm3;";
UserFunction<between_body, bool(double, double, double)> between;

// Number of points in first quadrant
size_t points_in_1q(
    const vex::Reductor<size_t, vex::SUM> &sum,
    const vex::vector<double> &x,
    const vex::vector<double> &y
    )
{
    return sum(between(0.0, atan2(y, x), M_PI/2));
}

Any valid vector expression may be used as a function argument. Note that function parameters in the body definition are always named as prm1, prm2, etc. The function body does not have to be a one-liner: any valid sequence of OpenCL operators is allowed. Due to the fact that OpenCL kernels are compiled at runtime, you will not get any compilation errors regarding your function until the first expression using it is encountered in your program.

Custom functions may be used not only to introduce new functionality, but also for performance sake. See the Kernel generation section for further explanations.

Sparse matrix-vector products and stencil convolutions

One of the most common operations in linear algebra is matrix-vector multiplication. The class vex::SpMat holds the representation of a sparse matrix, spanning several devices. Its constructor accepts a sparse matrix stored in CRS format. On the compute devices, the matrix is stored in hybrid ELL-CRS format. In the example below, the device matrix is constructed from host vectors and is used to compute the maximum residual value:

// Construct the matrix.
vex::SpMat<double> A(ctx.queue(), n, row.data(), col.data(), val.data());

// Compute residual. r, f, and u are device vectors.
r = f - A * u;
double res = max(fabs(r));

See file examples/cg.cpp for an example of implementing the conjugate gradient method for the numerical solution of a system of linear equations.

Another commonly used operation is the convolution of a vector with a stencil. VexCL supports three kinds of stencils: simple stencils, generalized stencils, and user-defined stencil operators. Simple stencil is based on a one-dimensional array and is used as is. Generalized stencil is based on a dense two-dimensional matrix and is used together with a built-in or user-defined function of one argument. To define a stencil operator, the user has to provide its dimensions (width and center), and body string.

Simple stencil convolution comes in handy in many situations. For example, it allows us to apply a moving average filter to a device vector. All you need is to construct a vex::stencil object from std::vector, a pair of iterators, or from an initializer list. You also need to specify the position of the stencil center:

// Moving average with 5-points window.
std::vector<double> sdata(5, 0.2);
vex::stencil<double> s(ctx.queue(), sdata, sdata.size() / 2);

// Convolve x with s. x and y are device vectors.
y = x * s;

Generalized stencil is basically a small dense coefficient matrix. It is used in combination with a built-in or user-defined function of one argument. For example, the following nonlinear vector operator:

y[i] = sin(x[i-1] - x[i]) + sin(x[i] - x[i+1]);

may be implemented as:

const uint rows   = 2;
const uint cols   = 3;
const uint center = 1;

vex::gstencil<double> S(ctx.queue(), rows, cols, center, {
    1, -1,  0,	// stencil coefficients for first sin argument.
    0,  1, -1	// stencil coefficients for second sin argument.
    });

y = sin(x * S);

A user-defined function may be used together with generalized stencils. To implement the following operator:

y[i] = x[i] + pow(x[i-1] + x[i+1], 3);

you would write:

extern const char pow3_body[] = "return pow(prm1, 3);";
UserFunction<pow3_body, double(double)> pow3;

void pow3_oper(const vex::vector<double> &x, vex::vector<double> &y) {
    gstencil<double> S(ctx.queue(), 1, 3, 1, {1, 0, 1});
    y = x + pow3(x * S);
}

User-defined stencil operators allow to define efficient arbitrary stencils. You need to specify stencil dimensions and provide the body for the function returning the local value of the stencil. Let us implement the previous example with the help of a stencil operator. Note that in the body, string elements of a vector that stencil is applied to are referred as X[i], where i is relative to the stencil center:

// Defined at global scope:
extern const char pow3_op_body[] =
    "double t = X[-1] + X[1];\n"
    "return X[0] + t * t * t;"

// Define stencil operator:
const uint width  = 3;
const uint center = 1;
vex::StencilOperator<double, width, center, pow3_op_body> pow3_op(ctx.queue());

// Apply stencil operator to device vector x:
y = pow3_op(x);

The latter implementation is more effective because it uses single kernel launch. You see, stencil convolution, as well as sparse matrix-vector multiplication, is a two-step operation. First, halo points have to be exchanged between neighboring compute devices. Second, the OpenCL kernel has to be launched. Because of this, stencil convolution is allowed only in additive expressions. Internally, such expressions are computed separately. First, all terms except stencil convolution are computed and stored to the result vector. Second, stencil convolution is computed and added to the result. So the expression y = x + pow3(x * S); is functionally equivalent to:

y = x;              // First kernel launch.
y += pow3(x * S);   // Second kernel launch.

Kernel generation

Each expression you use in your code results in the OpenCL kernel. For example, the following expression:

x = 2 * y - sin(z);

triggers generation, compilation, and launch of this kernel:

#if defined(cl_khr_fp64)
#  pragma OPENCL EXTENSION cl_khr_fp64: enable
#elif defined(cl_amd_fp64)
#  pragma OPENCL EXTENSION cl_amd_fp64: enable
#endif
kernel void Sub_Mul_cvsinv(
	ulong n,
	global double *res,
	int prmll,
	global double *prmlr,
	global double *prmr1
	)
{
	size_t i = get_global_id(0);
	size_t grid_size = get_num_groups(0) * get_local_size(0);
	while (i < n) {
		res[i] = ((prmll * prmlr[i]) - sin(prmr1[i]));
		i += grid_size;
	}
}

Kernel for each expression is generated and compiled only once (per OpenCL context), when it is first encountered in your program. Expression in the above example corresponds to the expression tree in the figure below:

Expression tree

As you can see, the type of an expression tree depends on its operations and types of its operands. So an expression 2.0 * y - sin(z) would result in a similar, but different tree, since the type of the constant here is double instead of int. This should be kept in mind if you want to limit the number of generated kernels. Another thing to remember is that there is no way to tell if several terminals of a tree refer to the same data. Consider the example from the User-defined functions section, where we had to count points in the first quadrant. We could implement the example without using a custom function:

return sum(atan2(y, x) >= 0.0 && atan2(y, x) <= M_PI/2);

The resulting kernel would read data from vectors x and y twice, which is not very nice from a performance standpoint. The introduction of the between function allows us to halve the number of global memory reads here!

Custom kernels

As Kozma Prutkov repeatedly said, "One cannot embrace the unembraceable". So in order to be usable, VexCL has to support custom kernels. When writing OpenCL kernel, you have to remember that VexCL vectors are distributed across all compute devices present in the context. vex::vector::operator() returns a cl::Buffer object that stores part of a vector on a given device. Other methods useful for kernel authors are vex::vector::part_size() and vex::vector::part_start(). The former returns the size of a vector partition located on a given device; the latter returns the number of first vector elements located on a given device. So x.part_start(d+1) - x.part_start(d) is always equal to x.part_size(d). If the result depends on the neighbor points, you have to remember that these points are possibly located on a different compute device. In this case the exchange of these halo points has to be arranged manually.

The following example builds and launches a custom kernel for each device in the context:

// Use environment variables to control device selection.
vex::Context ctx( vex::Filter::Env );

const size_t n = 1 << 20;
vex::vector<float> x(ctx.queue(), n);

std::vector<cl::Kernel> kernel(ctx.size());

for(uint d = 0; d < ctx.size(); d++) {
    cl::Program program = vex::build_sources(ctx.context(d), std::string(
	"kernel void dummy(ulong size, global float *x)\n"
	"{\n"
	"    size_t i = get_global_id(0);\n"
	"    if (i < size) x[i] = 4.2;\n"
	"}\n"
	));
    kernel[d] = cl::Kernel(program, "dummy");
}

for(uint d = 0; d < ctx.size(); d++) {
    kernel[d].setArg(0, static_cast<cl_ulong>(x.part_size());
    kernel[d].setArg(1, x(d));
    ctx.queue(d).enqueueNDRangeKernel(kernel[d], cl::NullRange, n, cl::NullRange);
}

In rare cases your kernel contains a syntax error, the error description will be output to the standard error stream and a cl::Error exception will be thrown. By the way, VexCL overloads the output to a stream operator for cl::Error objects to get more readable error messages.

Scalability

Scalability of the library with respect to the number of compute devices is presented in this section. Effective performance (GFLOPS) and bandwidth (GB/sec) were measured by launching a big number of test kernels on one, two, or three Nvidia Tesla C2070 cards. The effect of adding a fourth, somewhat slower, device (Intel Core i7) was tested as well. The results shown are averaged over 20 runs. The details of the experiments may be found in file examples/benchmark.cpp. Basically, the performance of the following code was measured:

// Vector arithmetic
a += b + c * d;
// Reduction
double s = sum(a * b);
// Stencil convolution
y = x * s;
// SpMV
y += A * x;

Effective performance

Effective bandwidth

As you can see, VexCL scales almost linearly with the addition of compute devices. Note that the performance and bandwidth for a stencil convolution operation are much higher than for other primitives. This is due to the fact that much faster local (shared) memory is used in this algorithm, and formulas for effective performance and bandwidth do not take this into account. Another thing worth noting is the overall degradation of performance after an Intel CPU is added to the VexCL context. The only primitive gaining speed from this addition is vector arithmetic. This is probably because the performance of vector arithmetic was used as a basis for problem partitioning.

Conclusion

VexCL allows to write compact and readable code without sacrificing too much performance. In the next article, headmyshoulder will talk about integrating VexCL into odeint - a modern C++ library for numerical solution of ordinary differential equations. odeint now supports GPU computations both with CUDA by using the Thrust library and OpenCL by using VexCL. This allows us to compare the performance of the two solutions.

Acknowledgments

This work has been partially supported by the Russian Foundation for Basic Research (RFBR) grants No 12-01-00033 and 12-07-0007.