CUDA Programming Model on AMD GPUs and Intel CPUs

Nick Kopp

5.00/5 (36 votes)

Apr 4, 2013

LGPL3

14 min read

81097

4385

This article builds upon the earlier High Performance Queries: GPU vs. PLINQ vs. LINQ and ports this to also support OpenCL devices and adds benchmarking so you can easily compare performance.

Introduction

CUDAfy.NET is a powerful open source library for non-graphics programming on NVIDIA Graphics Processing Units (GPU) from Microsoft .NET. Recently support for AMD GPUs and x86 CPUs has been added. This brings the CUDA programming model to devices beyond those from NVIDIA. You can get a better background on CUDAfy in the following CodeProject articles:

Without duplicating too much from these earlier articles, let's ask briefly why we should consider using a GPU for non-graphics tasks? The reason is that given the right task, there is a massive amount of horsepower available in GPUs that is outside games, video processing and CAD packages relatively unused. Initially in specific markets such as seismology, finance, physics, biology, we saw increasing use of GPUs to accelerate the extremely compute intensive tasks. However because programming of GPUs is still a different approach to conventional programming, the benefit had to be significant enough to make the effort worthwhile. NVIDIA's CUDA platform has really simplified development. Arguments that it is still too complicated miss the point that massively parallel programming actually needs a different way of looking at things. How you split an algorithm over 1000s of cores must and should be different. Do not get fooled by Intel's claims for their Xeon Phi. Just because each core is roughly speaking x86, it does not mean you can just run your app there and get improvements. Also be wary of joining the OpenCL camp too quickly. Yes it does in theory allow targeting of AMD GPUs, Intel CPUs, NVIDIA GPUs and even FPGAs, but anyone who has compared CUDA runtime with OpenCL quickly sees how complex and verbose OpenCL can be.

The CUDA programming model remains the finest means of handling massively parallel architectures. Tools such as CUDAfy have lowered the hurdle for .NET developers by exposing the programming interface in a clean and easy manner, arguably cleaner than CUDA itself. There remains the one issue with CUDA - it only supports NVIDIA GPUs. Until now. CUDAfy.NET has been updated to allow use of the CUDA model with AMD GPUs and x86 CPUs. Programming of some of Bittware's FPGA boards should also be possible. How is this done? Under the hood CUDAfy can make use of OpenCL. This requires installing the free NVIDIA, AMD and/or Intel OpenCL SDKs. You as someone with CUDA experience do not need to learn much new to now target all these other processors.

This article builds upon the earlier High Performance Queries: GPU vs. PLINQ vs. LINQ and ports this to also support OpenCL devices and adds benchmarking so you can easily compare performance. Ready for a LINQ vs PLINQ vs CUDA vs AMD GPU vs Intel CPU shoot-out?

Background

In this article, we will consider a GPS track point database. This is a list of all the routes taken by a GPS recorder. This kind of functionality is commonly used in company trucks and cars to monitor where vehicles have been - this is increasingly required by tax authorities to prove private and work related mileage. In our example, we want to query our database of GPS points to return all points within a certain radius of a number of targets between a given start and end date. Doing this the brute force way of calculating the distance between two points on the earth's surface represents a tough challenge, and should prove to be a good example of the pros and cons of performing this with LINQ, PLINQ CUDA and OpenCL.

Prerequisites

We will be using CUDAfy.NET for programming the GPU. The CUDAfy library is included in the downloads, however if you want to use CUDA or OpenCL, you will need to download and install:

For an NVIDIA GPU, you can use CUDA or OpenCL. Get the CUDA SDK here.
For an AMD GPU, you use OpenCL. Get the AMD OpenCL SDK here.
For running CUDAfy on an Intel CPU, download the Intel OpenCL SDK.

Ironically getting CUDA working is the toughest option since it also requires Visual Studio. Without Visual Studio, you can target NVIDIA GPUs by using OpenCL which is part of the CUDA SDK. Please see my previous articles for more information on getting up and running.

See this article on base64 encoding on a GPU, or this one as a basic introduction to CUDAfy and CUDA.

Using the Code

This application generates a random GPS route for a user defined number of points and then allows the user to input search criteria including:

Number of targets
Date and time range
Radius from targets in meters
The same query can then be executed using LINQ, PLINQ, or any of the CUDA and OpenCL capable devices discovered.

The basis of the code is the GPS track point and GPS track query. These are defined as TrackPoint and TrackQuery. It would be convenient if the host and GPU code can be shared where possible:

Code which runs on the GPU is called device code, and it needs to be flagged so when CUDAfy converts to CUDA C code, it picks it up. To do this, add the Cudafy attribute.

[Cudafy]
[StructLayout(LayoutKind.Sequential)]
public struct TrackPoint
{
    public TrackPoint(float lon, float lat, long timeStamp)
    {
        Longitude = lon;
        Latitude = lat;
        TimeStamp = timeStamp;
    }
    ...
}

When Cudafying a struct, all members will be translated by default. To ignore a member, use the CudafyIgnore attribute. In this example, we have a property Time. This is a DateTime type. Properties and DateTime are not supported on the GPU, so we need to flag it to be ignored.

[CudafyIgnore]
public DateTime Time
{
    get { return new DateTime(TimeStamp);  }
    set { TimeStamp = value.Ticks; }
}

Now compared to CUDA, OpenCL is rather primitive and basic. Constructors and methods cannot be placed in structs. Because of this, we've needed to change the code used in the previous article. Nothing more complicated than copy-paste and minor edit, but this is a limitation of OpenCL and CUDAfy cannot (currently) shield you from this. This also answers the question of why keep supporting CUDA if NVIDIA GPUs also support OpenCL - because much more is possible with CUDA.

Now another change that sharp eyed readers will have noticed is that here we are using float instead of double. This is because unlike in CUDA if a device does not support double, then the code will not compile. It does not silently revert to single. Since double is only supported on higher end AMD GPUs, we'll live without it for now.

The method that will do the work in testing whether a GPS point is within radius meters of a target point is defined as:

public double DistanceTo(TrackPoint A, TrackPoint B)
{
    float dDistance = Single.MinValue;
    float dLat1InRad = A.Latitude * CONSTS.PI2;
    float dLong1InRad = A.Longitude * CONSTS.PI2;
    float dLat2InRad = B.Latitude * CONSTS.PI2;
    float dLong2InRad = B.Longitude * CONSTS.PI2;
    
    float dLongitude = dLong2InRad - dLong1InRad;
    float dLatitude = dLat2InRad - dLat1InRad;

    // Intermediate result a.
    float a = GMath.Pow(Math.Sin(dLatitude / 2.0F), 2.0F) +
               GMath.Cos(dLat1InRad) * GMath.Cos(dLat2InRad) *
               GMath.Pow(GMath.Sin(dLongitude / 2.0F), 2.0F);

    // Intermediate result c (great circle distance in Radians).
    float c = 2.0F * GMath.Atan2(GMath.Sqrt(a), GMath.Sqrt(1.0F - a));

    // Distance
    dDistance = CONSTS.EARTHRADIUS * c;
    return dDistance * 1000.0F;
}

The shortest distance between any two points on the surface of the earth is given by the great circle distance. If you look at in-flight magazines, you'll see standard 2D maps that show you that your flight from Europe to California does so via Greenland. This goes much further north than either the departure or arrival point because it's the great circle. Calculating the distance requires a fair bit of floating point math processing.

The code for running the query on the CPU is shown below. We pass an array of targets as well as the other query parameters. Then using the Where method of the LINQ extensions to the _track array, we return all points where the GetNearestTargetIndex method returns less than 255. If the value is 255, then the point was not within the radius of any of the targets, else the index of the target nearest to the point is returned. The standard LINQ will execute on a one processor core. To make use of more cores, simply insert AsParallel before the Where method.

private IEnumerable<TrackPoint> GetPoints(TrackPoint[] targets, 
        float radius, DateTime startTime, DateTime endTime, bool parallel)
{
    long targetStartTime = startTime.Ticks;
    long targetEndTime = endTime.Ticks;
    // Running the query in parallel is as easy as adding AsParallel().
    // We're not concerned with the ordering, so it's ideal.
    if(parallel)
        return _track.AsParallel().Where(tp => GetNearestTargetIndex(tp, 
                targets, radius, targetStartTime, targetEndTime) < 255);
    else
        return _track.Where(tp => GetNearestTargetIndex(tp, targets, 
                radius, targetStartTime, targetEndTime) < 255);
}

private byte GetNearestTargetIndex(TrackPoint tp, TrackPoint[] targets, 
	float radius, long targetStartTime, long targetEndTime)
{
	float minDist = Single.MaxValue;
	byte index = 255; 
	// If we're not within the time range then no need to look further.
	if (tp.TimeStamp >= targetStartTime && tp.TimeStamp <= targetEndTime)
	{
		int noTargets = targets.Length;
		// Go through the targets and get the index of the closest one.
		for (int i = 0; i < noTargets; i++)
		{
			TrackPoint target = targets[i];
			float d = TrackQuery.DistanceTo(tp, target);
			if (d <= radius && d < minDist)
			{
				minDist = d;
				index = (byte)i;
			}
		}
		}
	return index;
}

Let's now look at the code that will communicate with the GPU. This is implemented in the TrackQuery class. The constructor takes care of Cudafying or loading from a previously serialized module. For CUDA, this is the process of generating CUDA C code, compiling this, and then storing the output (PTX) into a module. This module can be serialized in order to save time in future runs or to run on an end user system without the NVIDIA CUDA SDK installed. For OpenCL, it is roughly the same - we generate OpenCL code and store this code into the module, but we do not compile. Compilation takes place when loading the module. Here we wish to Cudafy two types: TrackQuery and TrackPoint.

public class TrackQuery
{
    public TrackQuery(GPGPU gpu)
    {
        var props = gpu.GetDeviceProperties();
        var name = props.PlatformName + props.Name;
        // Does an already serialized valid cudafy module xml file exist. If so
        // no need to re-generate it.
        var mod = CudafyModule.TryDeserialize(name);
        // The gpu can be either a CudaGPU or an OpenCLDevice. 
        // Realize that an NVIDIA GPU can be both!
        // And an Intel CPU can show up as both an AMD and an 
        // Intel OpenCL device if both OpenCL SDKs are 
        // installed.
        CudafyTranslator.Language = (gpu is CudaGPU) ? eLanguage.Cuda : eLanguage.OpenCL;
        if (mod == null || !mod.TryVerifyChecksums())
        {
            // Convert the .NET code within these two types 
            // into either CUDA C or OpenCL C. If CUDA C then
            // also compile into PTX.
            mod = CudafyTranslator.Cudafy(typeof(TrackPoint), typeof(TrackQuery));
            // Store to file for re-use
            mod.Serialize(name);
        }
        try
        {
            // Load the module on to the device. If OpenCL then compile the source.
            gpu.LoadModule(mod);
        }
        catch (Exception ex)
        {
            System.Diagnostics.Debug.WriteLine(mod.CompilerOutput);
            throw;
        }
        _gpu = gpu;
    }

The key to efficient use of a GPU for general purpose programming is to minimize the amount of data that needs to transfer between the host and the GPU device. Here we do this by first loading the track. We can then run multiple queries without having to reload the track. Since by default we're going to operate on 10,000,000 GPS points which represent 160,000,000 bytes (two System.Single and one System.Int64 fields per GPS point: longitude, latitude, and timestamp), this is important. The track is uploaded as an array of points. For clarity, variables that reside on the GPU are suffixed with _dev. As an optimization, if the already allocated memory on the GPU is greater than the new track to be uploaded, we do not free it and reallocate, but just use the required amount (_currentLength). The variable _indexes is an array of bytes equal to the length of the track. Per corresponding point, we will fill in either 255 if the point is not within the radius and date range of any of the targets, or the index of the target if it is within:

public void LoadTrack(TrackPoint[] trackPoints)
{
    _currentLength = trackPoints.Length;
    // If the current number of points on the GPU are less than the number of points
    // we want to load then we resize the allocated memory on the GPU. We simply free
    // the existing memory and allocate new memory. We need arrays on the GPU to hold
    // the track points and the selected indexes. We make an array on the host to hold
    // the returned indexes.
    if (_trackPoints.Length < trackPoints.Length)
    {
        if (_trackPoints_dev != null)
        {
            _gpu.Free(_trackPoints_dev);
            _gpu.Free(_indexes_dev);
        }
        _trackPoints_dev = _gpu.Allocate(trackPoints);
        _indexes_dev = _gpu.Allocate<byte>(trackPoints.Length);
        _indexes = new byte[trackPoints.Length];
    }
    _trackPoints = trackPoints;
    // Copy the GPS points to the GPU.
    _gpu.CopyToDevice(trackPoints, 0, _trackPoints_dev, 0, trackPoints.Length);
}

The method we will call from our application is SelectPoints. GPUs have multiple forms of memory. CPU only systems also do since the CPU has its own cache, however we never explicitly use this. However, on the GPU, use of different memory types is advisable as it can have a major impact on performance. OpenCL and CUDA are very similar in this regard so adding OpenCL support to CUDAfy was relatively straightforward. The biggest and also slowest memory is global. That's where we've stored our track. Modern GPUs typically have at least 1GB of this memory. Another memory is constant memory. It can only be written to by the host. For the GPU, it is read-only and very fast. We will use it for storing our targets. Keep in mind that the amount of constant memory is very limited (think KBytes, not MBytes) and cannot be freed. Constant memory is handled differently on OpenCL, but CUDAfy shields you from this and standard CUDA practice can be used. Device functions are said to be launched. The Launch method of the GPU device takes the following parameters:

Number of blocks of threads in the grid
Number of threads per block (maximum of 1024)
Name of function to launch
Array of track points on device
The number of track points
Radius from targets in meters
Start time in ticks
End time in ticks
Array of bytes for result
The number of targets

Launching essentially starts blocksPerGrid*ciTHREADSPERBLOCK threads in parallel. This number can be quite high but if the GPU function takes more than 0.5 seconds or so, then it is advisable to split the process into multiple calls. The NVIDIA driver does not like blocking the GPU for too long and the launch will time out beyond a point. OpenCL specifies the blocks and threads differently from CUDA but again this is transparent to the CUDAfy developer. Once done, we copy the _indexes_dev array back to the host and then search it for values less than 255, returning the corresponding TrackPoint. We could elect to do something with the index so we identify which target was closest. This can be done using the TrackPointResult struct which is included in the sources as an example.

public IEnumerable<TrackPoint> SelectPoints(TrackPoint[] targets, 
       float radius, DateTime startTime, 
       DateTime endTime, bool naive = false)
{
    int blocksPerGrid;
    // Validate the parameters and calculate how
    // many blocks of threads we will need on the GPU.
    // Each block of threads will execute ciTHREADSPERBLOCK threads.
    Initialize(radius, startTime, endTime, out blocksPerGrid);
    // Copy the targets to constant memory.
    _gpu.CopyToConstantMemory(targets, 0, _targets, 0, targets.Length);
    // Launch blocksPerGrid*ciTHREADSPERBLOCK threads in parallel.
    // Each thread will test one GPS point against all targets.
    _gpu.Launch(blocksPerGrid, ciTHREADSPERBLOCK, "SelectPointsKernel",
                _trackPoints_dev, _currentLength, radius, 
                startTime.Ticks, endTime.Ticks, _indexes_dev, targets.Length);
    // Copy the indexes array back from the GPU to the host
    // and search for all points that have an index < 255.
    // These correspond to GPS points lying within the search criteria.
    _gpu.CopyFromDevice(_indexes_dev, 0, _indexes, 0, _currentLength);
    for (int i = 0; i < _currentLength; i++)
    {
        byte index = _indexes[i];
        if (index < 255)
            yield return _trackPoints[i];
    }
}

Each thread executes the following device code which performs a test on one point. GThread identifies the unique ID of the thread so the thread knows which point to access. OpenCL uses a rather different means of accessing the IDs, one that is arguably cleaner than CUDA. With CUDAfy, you can have your cake and eat it because either the OpenCL or CUDA means can be used for both OpenCL and CUDA targets. For this example, we will make use of yet another form of GPU memory: shared. This memory is shared between all the threads within one block. Again OpenCL and CUDA are similar here. When inefficient use of the global memory is made by, for example, reading small amounts of data in a non-sequential way, performance suffers. In such instances, it is better to use shared memory as an intermediate storage. Having said all that, the new Fermi architectures appear to do a very good job of handling this code, and the simpler implementation SelectPointsKernelNaive will run as fast as the one below with shared memory. Try it and see.

[Cudafy]
public static void SelectPointsKernel(GThread thread, TrackPoint[] track, 
       int noPoints, float radius, long startTime, long endTime, 
       byte[] indexes, int noTargets)
{
    // Here we use another form of GPU memory called shared memory.
    // This is shared between threads of a single block.
    // The size of the shared memory must be constant
    // and here we set it to the number of threads per block.
    // This can be more efficient than the naive implementation below,
    // however on the latest fermi architecture there is little
    // or no difference.
    byte[] cache = thread.AllocateShared<byte>("cache", ciTHREADSPERBLOCK);
    // Get the unique index of the thread: size of the
    // block * block index + thread index.
    int tid = thread.blockDim.x * thread.blockIdx.x + thread.threadIdx.x;
    // Check we are not beyond the end of the points.
    if (tid < noPoints)
    {
        TrackPoint tp = track[tid];
        float minDist = Single.MaxValue;
        byte index = 255;
        if (tp.TimeStamp >= startTime && tp.TimeStamp <= endTime)
        {
            for (int i = 0; i < noTargets; i++)
            {
                // get the target from constant memory
                TrackPoint target = _targets[i];
                // Calculate distance to target and if less
                // than radius and current nearest target
                // set minDist and index.
                float d = DistanceTo(tp, target);
                if (d <= radius && d < minDist)
                {
                    minDist = d;
                    index = (byte)i;
                }
            }
        }
        // set the index.
        cache[thread.threadIdx.x] = index;
        // Synchronize all threads in block
        thread.SyncThreads();
        // Write the results into global memory array
        indexes[tid] = (byte)cache[thread.threadIdx.x];
    }
}

Benchmarks

The time is in milliseconds and lower is better. The results are rather impressive. We see that the use of PLINQ gives a very worthwhile benefit over LINQ but for very demanding applications, the GPU comes into its own with up to 30 times improvement in this example. Even if the track requires re-uploading to the GPU (cost of 150ms on the laptop), the advantage remains considerable. This is very significant and means that using a GPU in more normal line of business applications should be considered. The code here can be readily adapted for other purposes. When integrated into a (web) server application, the load could be drastically reduced providing worthwhile cost savings in terms of equipment and energy.

The difference between CUDA and OpenCL on the same NVIDIA device is minimal in this case. However in some algorithms the difference is much more pronounced. What is interesting is that OpenCL on the first generation Intel Core i5 CPU is about twice as good as PLinq. For socket 1155 Ivy Bridge we can expect better results since the Intel OpenCL SDK can make use of the integrated GPUs. As we move towards Haswell, Intel OpenCL could be very promising especially for low power, small footprint solutions.

The above benchmark was run on a mighty Intel i7 980X machine. I left out the LINQ test since it would make the other times virtually invisible. On-board is an NVIDIA GTX660Ti and an AMD 6570. Obviously things are a bit quicker than with the laptop. The most interesting result is actually that running on the Intel i7 CPU the AMD OpenCL SDK (yellow) wipes the floor with the Intel OpenCL SDK (grey)! Both are of course still left standing when measured against the GPUs. Even the relatively cheap AMD 6570 makes a good effort.

The flexibility of OpenCL to target GPU, CPU or even FPGA is enticing, but having said this, if you know that you will be using an NVIDIA device, there are still many compelling reasons to stick with CUDA. CUDA remains a much more pleasurable development experience, the language and tools are far more refined and the availability of a wide range of libraries that are interoperable with CUDA is a deal clincher. Think of FFT, BLAS and Random number generation as just the beginning of the CUDA offerings. Many of these libraries have been made accessible from CUDAfy, but only for CUDA targets.

Task Manager for LINQ and PLINQ

LINQ versus PLINQ

Task Manager when running query on GPU

GPU

License

This project makes use of the splash screen code featured in the Code Project article How to do Application Initialization while showing a SplashScreen.

The Cudafy.NET SDK includes two large example projects featuring amongst others ray tracing, ripple effects, and fractals. Many of the examples are fully supported on both CUDA and OpenCL. It is available as a dual license software library. The LGPL version is suitable for the development of proprietary or Open Source applications if you can comply with the terms and conditions contained in GNU LGPL version 2.1. Visit the Cudafy website for more information. If using the LGPL version, we ask you to please consider making a donation to Harmony through Education. This small charity is helping handicapped children in developing countries. Read more on the charity page of the Cudafy website.

History

4^th April, 2013: First release
17^th September, 2013: Updated to use CUDAfy V1.26