
Port a CUDA App to oneAPI and DPC++ in 5 Minutes

A quick 5-minute introduction to porting a CUDA app to Data Parallel C++ (DPC++)

Introduction

CUDA is an Nvidia-owned parallel computing platform and programming model for running software on GPUs. It is widely used by researchers and industry practitioners to accelerate computationally heavy workloads without requiring a wholly unfamiliar workflow or programming model compared to traditional software development. Adopting CUDA also brings immediate access to a wide array of existing libraries, along with a number of tools for debugging and visualizing CUDA code.

In this article, we'll explain how one might port CUDA code to Intel's oneAPI toolkits, and in particular, port a CUDA kernel to Intel's DPC++ compiler. The oneAPI toolkits comprise the Data Parallel C++ (DPC++ for short) programming model along with a number of APIs intended to support high-performance computing applications. DPC++ is a compiler built on LLVM's Clang, extending modern C++ with SYCL, an open standard designed to allow C++ applications to target heterogeneous systems.

Why Port CUDA to oneAPI?

You might be wondering why we'd want to do such a port, given CUDA's widespread usage in the community for image analysis, machine learning, and more. In short, there are a few compelling advantages to Intel's platform worth considering.

First, DPC++ can target FPGA accelerators as easily as it can target GPUs.

Second, DPC++ is built on top of Clang and open standards produced by Khronos. Intel is keen on upstreaming its DPC++ work to the LLVM project, which would have an immediate impact on the usefulness of the various parallel STL algorithms.

Third, it's worth porting code to DPC++ to at least understand how the general programming model works, which may translate to new insights into how best to architect code that requires acceleration in the future.

Perhaps the greatest potential benefit is the ability to deploy oneAPI software to the Intel DevCloud, a cloud environment providing CPUs, GPUs, and FPGAs at your disposal. Much of the hardware available there is cutting edge and perhaps impractical to experiment with at home or in the office; for example, with a few commands, you can easily benchmark your application against both an Arria 10 FPGA and a Xeon Platinum CPU. There are also subjective reasons to prefer writing DPC++ code, namely that DPC++ programs read as semantically correct C++, without the foreign syntax or attributes you may be accustomed to from CUDA.

The CUDA Application

The first order of business is to select a CUDA application to port for demonstration purposes. Here, we'll port the venerable Mandelbrot fractal generator, since we're more interested in learning the DPC++ programming model than in the application itself. Let's first take a quick scan of the CUDA code. We need routines to multiply two complex numbers, add two complex numbers, and compute the squared magnitude of a complex number:

C++
struct complex
{
    float r;
    float i;
};

// __device__ := Invoke this function from device and execute it on device
__device__ complex operator*(complex a, complex b)
{
    return {a.r * b.r - a.i * b.i, a.r * b.i + a.i * b.r};
}

__device__ complex operator+(complex a, complex b)
{
   return {a.r + b.r, a.i + b.i};
}

__device__ float sqr_magnitude(complex c)
{
    return c.r * c.r + c.i * c.i;
}

In CUDA, functions that we intend to invoke and execute on the accelerator device require the __device__ attribute. Next, we'll write the function that computes the Mandelbrot "value" associated with each pixel:

C++
constexpr static uint32_t max_iterations = 12000u;

__device__ uint32_t mandelbrot_pixel(complex c)
{
    // Evaluate iteratively z_{n + 1} = z_n^2 + c
    // Terminate if the max iterations are reached or if the norm exceeds 2
    complex z = {};

    uint32_t i = 0;
    for (; i != max_iterations; ++i) {
        complex z_next = z * z + c;
        if (sqr_magnitude(z_next) > 4.0) {
            return i;
        } else {
            z = z_next;
        }
    }

    return i;
}

Briefly, this function accepts a constant c, initializes a variable z to 0, then repeatedly evaluates z_next = z^2 + c; z = z_next until the magnitude of z_next exceeds 2 (equivalently, its squared magnitude exceeds 4) or the iteration cap is reached. The function returns the number of iterations performed. Next, we need the kernel function, which will evaluate and write out the color of the pixel corresponding to each invocation.

C++
__global__ void mandelbrot(uint8_t* output, int width, int height)
{
    // Remap workgroup and thread ID to an x-y coordinate on a 2D raster
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
    {
        return;
    }

    // Remap pixel values to a range from [-2, 1] for the real part and
    // [-1.5, 1.5] for the imaginary part
    complex c = {static_cast<float>(x) / width * 3.f - 2.f,
                 static_cast<float>(y) / height * 3.f - 1.5f};

    // Evaluate the mandelbrot iterations for a single thread and write out the
    // result after first normalizing to the range [0, 256)
    uint32_t iterations = mandelbrot_pixel(c);

    // Tonemap color
    uint32_t color = iterations * 6;

    // For stylistic reasons, draw saturated values as black
    output[y * width + x] = color >= 256 ? 0 : color;
}

The mandelbrot function uses the __global__ attribute to indicate that it is a kernel: a function invoked from the host and executed on the device. The pattern employed in this kernel is fairly common, namely, the block size, block index, and thread index are used to associate a specific invocation of the kernel with a pixel in the output raster. The coordinates of this pixel are used to evaluate a color, which is then written out to the output buffer. As every invocation dispatched targets a unique pixel in the output raster, each invocation can operate independently of all other invocations, without needing locks, atomics, or any other synchronization primitive.
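
For a preview of how this same pattern reads after the port, the snippet below is a hand-written sketch of the index remapping in DPC++/SYCL. It is not the output of any tool, and the function name is our own invention; in SYCL, an nd_item takes the place of the CUDA built-ins, and by convention dimension 0 is the slower-varying (y) axis.

C++
#include <CL/sycl.hpp>
#include <cstdint>
namespace sycl = cl::sycl; // newer toolchains also expose ::sycl directly

// get_group() plays the role of blockIdx, get_local_range() of blockDim,
// and get_local_id() of threadIdx.
void mandelbrot_invocation(sycl::nd_item<2> item, int width, int height)
{
    int x = item.get_group(1) * item.get_local_range(1) + item.get_local_id(1);
    int y = item.get_group(0) * item.get_local_range(0) + item.get_local_id(0);
    // Equivalently: x = item.get_global_id(1); y = item.get_global_id(0);
    if (x >= width || y >= height)
    {
        return;
    }
    // The remainder of the body is unchanged from the CUDA kernel above.
}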

Finally, we need a main function to allocate device memory to write to, dispatch our kernel, allocate host memory, read back the output, and finally write the output to an image. For emitting the image, we'll use the single-header stb_image_write.h from the stb library collection for simplicity.

C++
int main(int argc, char* argv[])
{
    constexpr static int width          = 512;
    constexpr static int height         = 512;
    constexpr static size_t buffer_size = width * height;

    // Allocate a 512x512 8-bit greyscale image on device
    uint8_t* buffer;
    cudaMalloc(&buffer, buffer_size);

    // Operate with 8x8 workgroup sizes (1 AMD wavefront, 2 NVIDIA warps)
    dim3 workgroup_dim{8, 8};
    dim3 workgroup_count{width / workgroup_dim.x, height / workgroup_dim.y};

    mandelbrot<<<workgroup_count, workgroup_dim>>>(buffer, width, height);

    // Flush all work queued to device
    cudaDeviceSynchronize();

    // Write back device memory to host memory and deallocate device memory
    uint8_t* host_buffer = reinterpret_cast<uint8_t*>(std::malloc(buffer_size));
    cudaMemcpy(host_buffer, buffer, buffer_size, cudaMemcpyDeviceToHost);
    cudaFree(buffer);

    // Write out results to an image
    int result = stbi_write_png("mandelbrot.png", width, height, 1, host_buffer, width);

    std::free(host_buffer);

    return 0;
}

Finally, if you're following along, please be sure to include the following needed headers at the top of the file:

C++
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iostream>

#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"

After compiling and running the code above, you should be able to produce the following PNG image:

[Image: the rendered 512x512 greyscale Mandelbrot fractal, mandelbrot.png]

Porting to DPC++

To perform a port from CUDA to DPC++, we could painstakingly "translate" CUDA code to DPC++. Fortunately, Intel provides the DPC++ Compatibility Tool to streamline the porting process. While the tool is still in "beta" status at the time of this writing, I had no issues porting the CUDA Mandelbrot code.

First, you'll need to ensure that you have both DPC++ and the compatibility tool installed on your machine. The simplest way to do this is to install the oneAPI toolkits. Both the compiler and compatibility tool are provided in the base toolkit. If you need to target the CUDA backend, you may need to build the toolchain yourself with CUDA support, as the CUDA-enabled toolchain is still experimental. To learn how to do this, please consult the documentation here. Additionally, if you are currently working with CUDA 11, you must have CUDA 10 or 9 installed alongside it for the compatibility tool to run.

Next, after opening a shell in the operating system of your choice, you'll need to source a script that sets the environment variables required for the Intel oneAPI libraries and executables to be located. On UNIX-like platforms, the script is called setvars.sh and is located in the installation root of the toolkit (typically something like /opt/intel/oneapi or ~/intel/oneapi). On Windows, the script is provided as setvars.bat and is similarly located in the installation root.

Once the PATH is set correctly, the dpct compatibility tool should be available. For our simple example with a single main.cu file, the following command is enough to perform the conversion and emit the output to the dpct_output folder in the same directory.

BAT
dpct --extra-arg="-std=c++17" main.cu --out-root dpct_output

The dpct_output directory typically contains C++ source files with the .dp.cpp extension. In addition, you may see various YAML files that enumerate the code replacements made to each file in your project. While they do not participate in compilation, they are handy for understanding what changes were made and for troubleshooting any issues that arise.
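
To give a feel for what the converted source looks like, here is a hand-written approximation of the ported host code in the style the tool typically produces. The exact output differs between dpct versions (and often routes helper calls through a dpct header), so treat the structure and names below as illustrative assumptions rather than the tool's literal output:

C++
// A hand-written approximation of the converted host code (illustrative only,
// not literal dpct output). It uses USM device allocations, which mirror the
// cudaMalloc/cudaMemcpy calls of the original program.
#include <CL/sycl.hpp>
#include <cstdint>
#include <cstdlib>
namespace sycl = cl::sycl; // newer toolchains also expose ::sycl directly

int main()
{
    constexpr int width          = 512;
    constexpr int height         = 512;
    constexpr size_t buffer_size = width * height;

    sycl::queue q{sycl::default_selector{}};

    // Device allocation, analogous to cudaMalloc
    uint8_t* buffer = sycl::malloc_device<uint8_t>(buffer_size, q);

    // 8x8 work-group size, matching the CUDA dispatch; dimension 0 is the y axis
    sycl::range<2> local{8, 8};
    sycl::range<2> global{height, width};

    q.submit([&](sycl::handler& h) {
        h.parallel_for<class mandelbrot_kernel>(
            sycl::nd_range<2>{global, local}, [=](sycl::nd_item<2> item) {
                int x = item.get_global_id(1);
                int y = item.get_global_id(0);
                if (x >= width || y >= height)
                    return;
                // ... evaluate mandelbrot_pixel(c) exactly as in the CUDA kernel ...
                buffer[y * width + x] = 0; // placeholder for the tonemapped color
            });
    });
    q.wait(); // analogous to cudaDeviceSynchronize

    // Copy the result back to the host, analogous to cudaMemcpyDeviceToHost
    uint8_t* host_buffer = static_cast<uint8_t*>(std::malloc(buffer_size));
    q.memcpy(host_buffer, buffer, buffer_size).wait();
    sycl::free(buffer, q);

    // ... write mandelbrot.png with stbi_write_png as before ...
    std::free(host_buffer);
    return 0;
}

Notice that the __device__ attributes disappear entirely: any function called from the kernel lambda is compiled for the device automatically, which is a large part of why the result reads as ordinary C++.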

To compile the code and test it, invoke the following commands:

BAT
mkdir build
cd build
dpcpp ../main.dp.cpp -o mandelbrot -lsycl -lOpenCL

On Windows, you'll want to emit an executable with the .exe extension instead. In the same terminal, executing the mandelbrot program should generate an identical image to what we produced above with CUDA.

One small gotcha you may encounter is that invoking the executable produced above in a new terminal, or from your file explorer, may result in runtime errors complaining that various shared libraries could not be located. This is because dpcpp dynamically links the SYCL runtime library by default, which lets the program receive passive updates should the oneAPI installation be upgraded in the future. To remedy this issue, you may wish to either ship the library in the same directory as the executable or modify the library load path.

Deploying to the Intel DevCloud Platform

To wrap up our port, let's deploy our application to Intel's DevCloud. This will allow us to experiment with hardware provided by Intel. To begin, create an account via the DevCloud sign-up page. Then follow the unique login link emailed to you and SSH into the provisioned DevCloud instance. The page you are redirected to immediately after sign-up contains instructions on how to make this connection on your OS; for the most part, it amounts to adding a Host entry to your SSH configuration, mapping devcloud to a proxy connection with your credentials.

Afterwards, we can use scp to transfer our source files to the DevCloud instance:

BAT
scp -r dpct_output devcloud:~/mandelbrot

In addition, you'll need a Makefile and script to run your application. The following Makefile can be used to compile our example:

Makefile
CXX = dpcpp
CXXFLAGS = -o
LDFLAGS = -lOpenCL -lsycl
EXE_NAME = mandelbrot
SOURCES = main.dp.cpp
BINDIR = bin

all: main

main:
	[ -d $(BINDIR) ] || mkdir $(BINDIR)
	$(CXX) $(CXXFLAGS) $(BINDIR)/$(EXE_NAME) $(SOURCES) $(LDFLAGS)

run:
	$(BINDIR)/$(EXE_NAME)

clean:
	rm -rf $(BINDIR)/$(EXE_NAME)

A script to invoke make and execute the compiled program is also needed (here, we name it run.sh but you can choose your own name and adapt the following command accordingly):

Bash
#!/bin/bash
source /opt/intel/inteloneapi/setvars.sh
make run

With this, we're now able to submit jobs to various hardware queues in the DevCloud. The full documentation for interfacing with the job queues is provided here. As a demonstration, the following commands dispatch our job, check its status, and read back the result.

Bash
# ON DEVCLOUD
# Queue submission with a job label, working directory, and script to run
qsub -N mandelbrot -d . run.sh

# Show job submission status
qstat

# ON HOST
scp devcloud:~/mandelbrot/mandelbrot.png .
# Verify the image looks correct

Conclusion

In this article, we've demonstrated how to port an existing CUDA application to DPC++, compile it, and run it on the DevCloud. Given familiarity with the commands and the simplicity of the original CUDA program, such a port and deployment to remote hardware can be done in minutes. More sophisticated projects may require additional steps not covered here: for example, invoking the compatibility tool on a Visual Studio project or on a compilation database produced by a tool such as CMake. Also not covered are various features of the DevCloud, such as the ability to target specific classes of hardware or compute nodes, as well as scripts that time execution. To leverage these features and learn more about Intel's DPC++ compiler, please consult the documentation on the Intel Developer Zone.

History

  • 9th November, 2020: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Technical Lead WB Games
United States
Jeremy is a Principal Engineer at WB Games. He's worked throughout the game engine tech stack, touching everything from rendering and animation, to gameplay scripting and virtual machines, to netcode and server code. He's most passionate about the boundary between applied mathematics and computer science, and you'll often find him puzzling over one or the other in roughly equal parts. When he's not coding, Jeremy is probably spending time with his wife and dog, climbing, enjoying a chess game, or some combination of the above.
