Faster Sorting in C# by Utilizing GPU with NVIDIA CUDA

Adnan Boz

4.91/5 (9 votes)

Aug 8, 2011

CPOL

8 min read

50373

An entry level example of how to use NVIDIA CUDA technology to achieve better performance within C# with minimum possible amount of code

Introduction

In this post, I would like to give an entry level example of how you can use NVIDIA CUDA technology to achieve better performance within C# with minimum possible amount of code. This is a very specific scenario where the main application is in C#, and the sorting algorithm is provided in C++ as an API and you have an NVIDIA CUDA capable graphics card. I hope this basic sample will give you some insight and encourage you to invest into parallel technologies.

Background

Parallel Computing

For more than two decades, CPUs drove rapid performance increase and price reductions. With that, we may get used to having speed boost of our code running on a new piece of hardware. As you may recall, this relentless performance improvement is described with two words: Moore’s law. While the continuation of Moore’s law is debated on many platforms, because of the slow down starting in 2003, many of us already jumped to the parallel bandwagon using systems based on SIMD or MIMD models. With parallel computing, application logic suitable for parallelization will most likely run faster without waiting for tomorrow’s hardware. (It is “most likely” because there are theoretical limits explained by Amdahl’s law.)

GPGPU

A GPU is a massively multithreaded processor which supports thousands of concurrent threads. Using GPU to do general purpose scientific and engineering computing is named GPGPU. NVIDIA is one of the leading GPU manufacturers and they are providing fully programmable GPU technology for high-level languages like C, C++ and Fortran with the architecture named CUDA. The NVIDIA hardware uses the SIMT (single-instruction multiple-thread) execution model rather than the popular SIMD model. I gave a SIMD example in one my previous posts, how to utilize an SSE4 CPU instruction within C#. And in this post, we will be using SIMT execution model through NVIDIA CUDA. (Under the cover, on SIMD multiple data elements for a single instruction are collected and packed into a single register. On the other hand on SIMT, all threads process data in their own registers.)

Thrust

Writing code using CUDA API is very powerful in terms of controlling the hardware, but it requires to handle memory and execution which is outside the scope of this post. Therefore, I’ll use a high-level programming interface named Thrust. As described here, Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL). Thrust library provides many fundamental programming logic like sorting, prefix-sums, reductions, transformations, etc.

Thrust consists of header files and is distributed with CUDA 4.0 installation. Therefore, you don’t have to install it additionally.

Radix Sort

The reason I choose sorting is that it is a widely used fundamental computational building block to many problems. Radix sort is one of the oldest algorithms and is very efficient for sorting small keys sequentially with the algorithmic complexity of O(n). One of the two sorting algorithms Thrust provides is Radix Sort, therefore thrust fits our example very well. There is a paper about Designing Efficient Sorting Algorithms for Manycore GPUs, which explains how the algorithm works on the GPU.

Setup

We will be using Visual Studio 2010 to host one C# application and one C++ library. The only additional API is the NVIDIA CUDA Toolkit 4.0. You will need to install the CUDA Toolkit and a supported graphics device driver from the same link.
Also, do not forget to install the Build Customization BUG FIX Update from the same link or from here.

Implementation

The solution consists of one C# console application and a C++ native DLL. The C++ native DLL exports a method which has the code to run the sort algorithm on the GPU. The C# code uses platform invoke to call the exported method of the C++ DLL.

For comparison purposes, I’m also providing a C# implementation of Radix Sort. The C# implementation takes an array of 32 bit integers and processes it in three steps for every 8 bit part:

Creating histogram of data
Running prefix sum(scan)
Adding the data to bin

On some implementations, the histogram is named as count or difference, but here I used the name histogram, because there is very good CUDA implementation I may write about in a future post. It is basically finding out how many of every byte is in the array. Because the max byte value is 0xFF, therefore the size of the histogram is 256, reflecting the maximum number of different values.

The second step is to run prefix sum on the histogram, which is simply adding one value to the right neighbor in the array. In that way, every element in the array will contain its left neighbor. Actually, there is a very efficient way to prefix sum on parallel environment explained by the paper Efficient Parallel Scan Algorithms for GPUs which you may find useful for understanding the logic behind the GPU implementation.

The last step is to place the value into the correct bin by using the prefix sum result at the value index to determine the temporary array index.

The code is as follows:

public static void CPURadixSort(int[] sourceData)
{
 int size = sourceData.Length;
 //temporary array for every byte iteration
 int[] tempData = new int[size];
 //histogram of the last byte
 int[] histogram = new int[256];
 //The prefix sum of the histogram
 int[] prefixSum = new int[256]; 

 unsafe
 {
  fixed (int* pTempData = tempData)
  {
   fixed (int* pSourceData = sourceData)
   {
    int* pTemp = pTempData;
    int* pBck;
    int* pSource = pSourceData;

    //Loop through every byte of 4 byte integer
    for (int byteIdx = 0; byteIdx < 4; byteIdx++)
    {
     	int shift = byteIdx * 8; //total bits to shift the numbers

     	//Calculate histogram of the last byte of the data
     	for (int i = 0; i < size; i++)
     	 histogram[(pSource[i] >> shift) & 0xFF]++;

     	//Calculate prefix-sum of the histogram
     	prefixSum[0] = 0;
     	for (int i = 1; i < 256; i++)
     	 prefixSum[i] = prefixSum[i - 1] + histogram[i - 1];

     	//Get the prefix-sum array index of the last byte, increase
        //it by one. That gives us the index we want to place
        //the data
     	for (int i = 0; i < size; i++)
     	 pTemp[prefixSum[(pSource[i] >> shift) & 0xFF]++] = pSource[i];

     	//Swap the pointers
     	pBck = pSource;
     	pSource = pTemp;
     	pTemp = pBck;

     	//reset the histogram
     	for (int i = 0; i < 256; i++)
     	 histogram[i] = 0;
    }
   }
  }
 }
}

One more detail about the code is that I used unsafe code in order to swap the pointers for the temporary buffer and source data to gain another half second. But, it is possible to remove that part and use the tempData.CopyTo(sourceData, 0) instead.

When I run the code above on an Intel Xeon 2.40 Ghz CPU for 2^25 (33,554,432) random integers, it executed in 3.038 seconds.

The next step is to create the sorting code on the GPU and call it within our C# code. For this, we need to create a DLL in C++ by using the default .NET project wizard. Please follow the steps to accomplish this:

Step 1

Create a Visual C++ console project named “RadixSortGPU.dll” in Visual Studio 2010 by selecting DLL and Empty project on the wizard.

Step 2

Open Build Customization from the C++ project context menu, and check the CUDA 4.0 (.targets, .props) checkbox.

Step 3

Open the project properties, expand the Configuration Properties, Linker and select Input. Edit the additional dependencies and add cudart.lib.

Step 4

Add a new empty source file ending with .cu and paste the following code into it:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/detail/type_traits.h>

extern "C" __declspec(dllexport) void __cdecl GPURadixSort(int*, unsigned int);

extern void GPURadixSort(int* data, unsigned int numElements)
{
	thrust::device_vector<int> d_data(data, data + numElements);

	thrust::stable_sort(d_data.begin(), d_data.end());

	thrust::copy(d_data.begin(), d_data.end(), data);
}

This code is the only program logic you need to sort a given integer array on the GPU. For the sake of this example, I did not include any exception handling and validation. But for any production code, you most probably want to add exception handling and validation logic.

As you can see in the code, starting using Thrust is very simple by just including the required headers.

The dllexport definition makes the GPURadixSort usable from within another library. In this way, we will be able to call this method from our C# library.

The GPURadixSort method itself is simply copying the data array into a container named thrust::device_vector, resulting in the data to be placed into the GPU global memory, sorting the data on the GPU and copying back the data from the GPU memory to our input array and
the begin() and end() methods are iterators pointing to the data at the beginning or end of the array. For more information how to use vectors and iterators, please visit the Thrust Quick Start Guide.

Step 5

I’m assuming you already have a C# project you can paste the following code into your class:

[DllImport("RadixSortGPU.dll", CallingConvention = CallingConvention.Cdecl)]
public static extern void GPURadixSort(
 [MarshalAsAttribute(UnmanagedType.LPArray, ArraySubType = UnmanagedType.I4)]
 int[] data, uint numElements);

This code provides the information needed to call the GPURadixSort function exported from our native DLL.

Step 6

That’s it! Now you can call the GPURadixSort within your C# code to sort your array.

Based on my environment with a NVIDIA Quadro 600 GPU, the sort takes about 1.211 seconds, which is almost 3x faster than the CPU version. The NVIDIA Quadro 600 GPU I used is a low profile graphics processor which cost about $150. 96 cores, 1GB memory and 25.6 GB/s memory bandwidth maybe sound powerful, but it is low profile compared to GPUs like Quadro 6000, with 448 cores, 6 GB memory and 144 GB/s memory bandwidth. There are also the Tesla family GPUs for high performance computing needs you may want to check out at NVIDIA.

Beyond

Below I provided some code to check if the GPU memory is enough to hold the given array, which is one of the important validations you want to perform if you target heterogeneous environments.

using namespace std;
...
extern void GPURadixSort(int* data, unsigned int numElements)
{
 int deviceID = -1;
 if (cudaSuccess == cudaGetDevice(&deviceID))
 {
     cudaDeviceProp devprop;
     cudaGetDeviceProperties(&devprop, deviceID);
     unsigned int totalMem = numElements * sizeof(int);
     if (devprop.totalGlobalMem < totalMem)
     {
        //Handle error
     }
 }
 ...
}

As I mentioned before, under the high-level Thrust library, the powerful CUDA API resides. If Thrust does not have the functionality you are looking for, you can always use CUDA directly. It is not as easy as using Thrust, but if you will follow the resources below, it will not take you much to start using this great technology.

You can visit Syllabus for the CUDA Certification Exam for a quick start up.
I’m strongly recommending the following books:
- Programming Massively Parallel Processors: A hands-on Approach
- CUDA by Example: An Introduction to General-Purpose GPU Programming
The following online courses are also very valuable:
- Programming Massively Parallel Processors with CUDA by Stanford University
- High-Performance Computing for Engineering Applications by Wisconsin University

Conclusion

We are using parallelism mostly to gain performance when a program runs slow or we need to handle huge amount of calculation. But, I believe that parallelism is a very important software aspect allowing us to use the underlying hardware wisely and gain advantage in this competitive market, where two functionality wise similar applications can only compete on software qualities like performance. There are many parallel technologies which will allow you to achieve higher performance with the same hardware. One of them is definitely NVIDIA CUDA, and some others you want to look into are TPL, PPL, C++ AMP, CUDA, OpenCL, MPI.

CodeProject