GPU Performance Tests

22 Nov 2015

It has never been easier for C# desktop developers to write code that takes advantage of the amazing computing performance of modern graphics cards. In this post I will share some ad hoc performance test results for a simple program written in C#, as obtained from my current desktop computer: a Dell Precision T3600 with 16GB RAM, an Intel Xeon E5-2665 @ 2.40GHz, and an NVidia GTX Titan. I will also leave you with the source code so you can run this same test on your hardware of choice.

[Photo: the NVidia GTX Titan]

Source Code

https://www.assembla.com/code/cudafyperformance/subversion/nodes

Introduction

It is likely that the graphics card in your computer supports CUDA or OpenCL. If it does, then you are in for a real treat if you take the time to explore its capabilities. In this article I am showing off the new 4.5 teraflop GTX Titan card from NVidia ($1000). The one-year-old GTX 680 costs half that and comes in at a still staggering 3 teraflops. Even if you have a lower cost GPU card, chances are that its performance will still be pretty impressive compared to your CPU.

We will run the test using the GPU with CUDA, the GPU with OpenCL, the CPU with OpenCL, and the CPU using straight C# - all within the safe confines of a managed C# application. Then we will explore the concept of streams, which allow us to overlap computations with memory transfers. Later on we will leave C# for straight C, and find that there are no performance gains to be found down that path. Finally we will tune our GPU code in ways that may make your head hurt, but that really extract all the computing power from our GPU.

Source code for all of this is provided (see above) and a checklist of required downloads is provided below.

The Test

Smooth one million floating point values using a set of 63 smoothing coefficients. Each output value is the coefficient-weighted sum of the input values in a window centered on that point (clipped at the ends of the array), divided by the number of window positions that fell inside the array.

Here is the function that computes the smoothed value of a given point:

static float SmoothPoint(float[] data, int index)
{
    var sum = 0f;
    var count = 0;

    for (var coefficientIndex = 0; coefficientIndex < Coefficients.Length; coefficientIndex++)
    {
        // The window is centered on 'index': 32 points before, the point itself,
        // and the points after (Coefficients[63] is always zero).
        var sourceIndex = index + coefficientIndex - 32;
        if (sourceIndex >= 0 && sourceIndex < data.Length)
        {
            sum += data[sourceIndex] * Coefficients[coefficientIndex];
            count++;
        }
    }

    return sum / count;
}

Before calling this function, we set up the coefficients as follows. Note that the static constructor fills indices 0 through 62 with a triangular window and leaves Coefficients[63] at zero, which is why the test speaks of 63 coefficients even though the array holds 64:

public static float[] Coefficients = new float[64];

static PerfTest()
{
    // Triangular window: ramps up from 0 to 1 over indices 0..31,
    // and back down over indices 31..62; index 63 stays 0.
    for (var i = 0; i < 32; i++)
    {
        Coefficients[i] = Coefficients[62 - i] = i / 31f;
    }
}

The Results

Here are the results up front, to save you the time of scrolling to the end – which is what you would do now anyway.

[Screenshot: benchmark results for each implementation]

For this specific task, the CPU is 70x slower than one of the CUDA implementations.

The CPU

To execute this test using the Xeon, I set up the call this way:

public static void CpuSmooth()
{
    const int mb = 1024 * 1024;         // one million data points
    var dataIn = new float[mb];
    var dataOut = new float[mb];
    for (var i = 0; i < mb; i++)
    {
        dataIn[i] = i;
    }

    const int loops = 1024;

    for (var loop = 0; loop < loops; loop++)
    {
        // Smooth all one million points, spread across every available core.
        Parallel.For(0, mb, index => dataOut[index] = SmoothPoint(dataIn, index));
    }
}

As you can see below, the parallel for loop does a nice job of keeping all the processors busy.

[Screenshot: CPU usage graphs showing all cores busy]
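
The exact timing harness is in the source download; here is a minimal sketch of how a number like the one above can be taken (the wrapper itself is illustrative, not the article's code):

var stopwatch = System.Diagnostics.Stopwatch.StartNew();
PerfTest.CpuSmooth();   // 1024 loops over one million points
stopwatch.Stop();
Console.WriteLine("CPU: {0:N0} ms total, {1:F2} ms per loop",
    stopwatch.ElapsedMilliseconds, stopwatch.ElapsedMilliseconds / 1024.0);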

The GPU

To execute the SmoothPoint function on the GTX Titan, I need to do the following:

1) Obtain software:

a) For CUDA on your NVidia GPU, obtain CUDA from NVidia: https://developer.nvidia.com/cuda-downloads

b) For OpenCL on your GPU, your video driver should be all that you need.

c) For OpenCL on your Intel CPU, obtain this SDK: http://software.intel.com/en-us/vcsource/tools/opencl-sdk

d) For CUDAfy, for your C# development, obtain this DLL: http://cudafy.codeplex.com/. As of this writing, you need version 1.21 (beta) if you want OpenCL support in CUDAfy.

e) For development, obtain Visual Studio. If you are doing CUDA work, you need the 2010 C++ compiler lurking on your HDD. You can use VS2010 or VS2012 for all the work in this article.

2) For CUDA work, ensure that your PATH environment variable contains the VS2010 C++ compiler. Mine includes this string: C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin

3) Add a reference to Cudafy.NET to the application.

4) Replace the CPU parallel for loop with this code:

public static void Smooth(GPGPU gpu)
{
    const int mb = 1024 * 1024;
    var cpuMemIn = new float[mb];
    var cpuMemOut = new float[mb];
    for (var i = 0; i < mb; i++)
    {
        cpuMemIn[i] = i;
    }

    var gpuMemIn = gpu.Allocate<float>(mb);
    var gpuMemOut = gpu.Allocate<float>(mb);

    // Copy the 64 coefficients into the GPU's fast constant memory.
    gpu.CopyToConstantMemory(Coefficients, Coefficients);

    const int loops = 1024;
    for (var loop = 0; loop < loops; loop++)
    {
        gpu.CopyToDevice(cpuMemIn, gpuMemIn);
        // 1024 blocks of 1024 threads: one thread per data point.
        gpu.Launch(1024, 1024, SmoothKernel, gpuMemIn, gpuMemOut);
        gpu.CopyFromDevice(gpuMemOut, cpuMemOut);
    }

    gpu.Free(gpuMemIn);
    gpu.Free(gpuMemOut);
}

5) Write the SmoothKernel function:

static void SmoothKernel(GThread gThread, float[] dataIn, float[] dataOut)
{
    // Each thread smooths the single point selected by its global thread id.
    dataOut[gThread.get_global_id(0)] = SmoothPoint(dataIn, gThread.get_global_id(0));
}

6) Adorn Coefficients, SmoothKernel, and SmoothPoint with the [Cudafy] attribute.
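
That step is nothing more than tagging each GPU-visible member (a sketch; the method bodies are the ones shown earlier):

[Cudafy]
public static float[] Coefficients = new float[64];

[Cudafy]
static void SmoothKernel(GThread gThread, float[] dataIn, float[] dataOut)
{
    dataOut[gThread.get_global_id(0)] = SmoothPoint(dataIn, gThread.get_global_id(0));
}

[Cudafy]
static float SmoothPoint(float[] data, int index)
{
    // ... body exactly as shown in "The Test" above ...
}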

7) Execute some setup code to access the GPU.

If I want to use CUDA, I create a GPGPU object like this:

public static GPGPU NewCuda()
{
    var gpu = CudafyHost.GetDevice(eGPUType.Cuda);
    CudafyTranslator.Language = eLanguage.Cuda;
    // sm_35 targets the GTX Titan's Kepler GK110 architecture.
    var module = CudafyTranslator.Cudafy(eArchitecture.sm_35);
    gpu.LoadModule(module);
    return gpu;
}

If I want to use OpenCL, I create a GPGPU object like this:

public static GPGPU NewOpenCl()
{
    var gpu = CudafyHost.GetDevice(eGPUType.OpenCL);
    CudafyTranslator.Language = eLanguage.OpenCL;
    // No architecture argument: the video driver compiles the OpenCL source at run time.
    var module = CudafyTranslator.Cudafy();
    gpu.LoadModule(module);
    return gpu;
}
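
With those two factory methods in place, wiring the whole test together is one call (a trivial sketch):

var gpu = NewCuda();   // or NewOpenCl() for the OpenCL path
Smooth(gpu);           // runs the 1024-loop smoothing benchmark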


CUDA vs OpenCL

As you can see above, CUDAfy gives you a choice of GPU technologies to use with your C# application. I find it pretty amazing that I can write code in C# and have it executed on the GPU using either CUDA or OpenCL, and on the CPU using straight C# or OpenCL for Intel CPUs. There have been a few heated debates on CUDA vs. OpenCL for GPUs, and I certainly do not want to give the impression that I know which technology is better. Here are some points to consider:

1) OpenCL is available for many video card technologies. CUDA is available for NVidia-based cards only (from Asus, EVGA, MSI, etc.). OpenCL is also available as a driver that uses the main CPU.

2) CUDAfy with OpenCL uses the video card driver to compile the code. CUDAfy with CUDA uses the C++ compiler at run time - but you can use a premade CUDAfy module (*.cdfy) or embed the code in the .NET assembly using the cudafycl tool (see the sketch after this list).

3) Streaming in CUDA can achieve a 2X improvement in performance. I've been told OpenCL supports streams too, but I have not figured out how that works yet.
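
On point 2: CUDAfy's usual idiom for skipping the run-time compile is to serialize the translated module and reload it on later runs. A sketch of that idiom (the standard CUDAfy pattern, not code from this article's test):

// Reuse a previously translated module (*.cdfy) when nothing has changed.
CudafyModule module = CudafyModule.TryDeserialize();
if (module == null || !module.TryVerifyChecksums())
{
    module = CudafyTranslator.Cudafy(eArchitecture.sm_35);  // recompile
    module.Serialize();                                     // cache for next run
}
gpu.LoadModule(module);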

Under the Hood

Behind the scenes, CUDAfy magically creates either a CUDA or an OpenCL rendition of your code. The CUDA code must be compiled using a C++ compiler with the NVidia CUDA extensions. The OpenCL code is processed by the device driver, so there is much less headache in the distribution of your code.

OpenCL.cpp

#if defined(cl_khr_fp64)
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#elif defined(cl_amd_fp64)
#pragma OPENCL EXTENSION cl_amd_fp64: enable
#endif

// GlsecPerfCSharp.PerfTest
__kernel void SmoothKernel(global float* dataIn, int dataInLen0, global float* dataOut, int dataOutLen0, __constant float* Coefficients);
// GlsecPerfCSharp.PerfTest
float SmoothPoint(global float* data, int dataLen0, int index, __constant float* Coefficients);

// GlsecPerfCSharp.PerfTest
#define CoefficientsLen0 64
// GlsecPerfCSharp.PerfTest
__kernel void SmoothKernel(global float* dataIn, int dataInLen0, global float* dataOut, int dataOutLen0, __constant float* Coefficients)
{
    dataOut[(get_global_id(0))] = SmoothPoint(dataIn, dataInLen0, get_global_id(0), Coefficients);
}
// GlsecPerfCSharp.PerfTest
float SmoothPoint(global float* data, int dataLen0, int index, __constant float* Coefficients)
{
    float num = 0.0f;
    int num2 = 0;
    for (int i = 0; i < CoefficientsLen0; i++)
    {
        int num3 = index + i - 32;
        if (num3 >= 0 && num3 < dataLen0)
        {
            num += data[(num3)] * Coefficients[(i)];
            num2++;
        }
    }
    return num / (float)num2;
}

Cuda.cu

__device__ int get_global_id(int dimension)
{
    int result = 0;
    if (dimension == 0)
        result = blockIdx.x * blockDim.x + threadIdx.x;
    else if (dimension == 1)
        result = blockIdx.y * blockDim.y + threadIdx.y;
    else if (dimension == 2)
        result = blockIdx.z * blockDim.z + threadIdx.z;
    return result;
}

// GlsecPerfCSharp.PerfTest
extern "C" __global__ void SmoothKernel(float* dataIn, int dataInLen0, float* dataOut, int dataOutLen0);
// GlsecPerfCSharp.PerfTest
__device__ float SmoothPoint(float* data, int dataLen0, int index);

// GlsecPerfCSharp.PerfTest
__constant__ float Coefficients[64];
#define CoefficientsLen0 64
// GlsecPerfCSharp.PerfTest
extern "C" __global__ void SmoothKernel(float* dataIn, int dataInLen0, float* dataOut, int dataOutLen0)
{
    dataOut[(get_global_id(0))] = SmoothPoint(dataIn, dataInLen0, get_global_id(0));
}
// GlsecPerfCSharp.PerfTest
__device__ float SmoothPoint(float* data, int dataLen0, int index)
{
    float num = 0.0f;
    int num2 = 0;
    for (int i = 0; i < CoefficientsLen0; i++)
    {
        int num3 = index + i - 32;
        if (num3 >= 0 && num3 < dataLen0)
        {
            num += data[(num3)] * Coefficients[(i)];
            num2++;
        }
    }
    return num / (float)num2;
}

CUDA Streaming

Simply stated, "streaming" in CUDA allows the GPU to perform concurrent tasks. In this application, the performance gains in CUDA come from three overlapped operations. At any point in the performance test, the CUDA code is performing each of these three tasks concurrently:

1) Uploading raw data from host (CPU) memory to device (GPU) memory.

2) Processing (smoothing) the data in device memory.

3) Downloading smoothed data from the device back to the host.

A synchronize then waits for all operations issued on a given stream to complete before proceeding.

The slight difference in performance between the CUDA entries is due to the way the tasks are scheduled. These are the three scheduling methods I implemented:

[Screenshot: timing of the three stream-scheduling methods]

Now I don't have the stamina to turn this blog post into a tutorial on CUDA streaming; the sketch below shows the basic shape of one streamed pass, and you can examine the source code to see how the three scheduling methods above are implemented.
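
As I understand CUDAfy's async API, one streamed pass uses pinned host buffers plus the Async copy/launch calls, roughly like this (my simplification, not the article's exact code):

const int mb = 1024 * 1024;
const int streamId = 1;

// Async transfers require page-locked (pinned) host memory.
IntPtr hostIn = gpu.HostAllocate<float>(mb);
IntPtr hostOut = gpu.HostAllocate<float>(mb);
var devIn = gpu.Allocate<float>(mb);
var devOut = gpu.Allocate<float>(mb);

// All three operations are queued on the same stream and execute in order
// on the GPU while the CPU keeps going; with two or more streams, the
// copies of one stream overlap the kernel of another.
gpu.CopyToDeviceAsync(hostIn, 0, devIn, 0, mb, streamId);
gpu.LaunchAsync(1024, 1024, streamId, "SmoothKernel", devIn, devOut);
gpu.CopyFromDeviceAsync(devOut, 0, hostOut, 0, mb, streamId);

gpu.SynchronizeStream(streamId);
// (cleanup with gpu.Free / gpu.HostFree omitted for brevity)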

CUDA C vs. CUDAfy C#

Some have wondered if the overhead of C# could be significant, so I put together a straight C version of the same streaming performance test. The source code at Assembla now includes this new test.

The results show that, in this test at least, there is no overhead in using C#. Here are the results:

[Screenshot: CUDA C vs. CUDAfy C# results]

Faster!

It turns out that much of the time in the smoothing kernel is spent retrieving the input data and the smoothing coefficients from the GPU's RAM, which NVidia calls "device memory". Each smoothing coefficient is accessed 1 million times and each source data point is accessed 64 times. NVidia tells us that device memory is relatively slow, so maybe we can do something about that.

I had already broken the smoothing problem down into 1024 "blocks", where each block has 1024 threads. This means I have allocated 1 thread per data point. It turns out that the threads within a block can share some really fast memory called, well, "shared memory". Shared memory is at least two orders of magnitude faster than device memory. So the idea is to allocate shared memory and load it with all the smoothing coefficients and all the data points from device memory that the threads in that block will need. We need the 64 coefficients and, because we are smoothing +/- 32 values around each data point, 32 + 1024 + 32 data points loaded from device memory into shared memory.

Since we have 1024 threads, I decided to let them move the first 1024 data points from device memory into shared memory in parallel:

static void FastKernel(GThread gThread, float[] dataIn, float[] dataOut)
{
    var threadIndex = gThread.threadIdx.x;
    var dataIndex = gThread.blockIdx.x * ThreadsPerBlock + threadIndex;

    // The shared buffer holds this block's 1024 points plus a 32-point apron on each side.
    var dataCopy = gThread.AllocateShared<float>("d", ThreadsPerBlock + 64);

    dataCopy[threadIndex + 32] = dataIn[dataIndex];

I also decided to use the first 64 threads to copy the 64 coefficients from device memory to shared memory.

    var coefficients = gThread.AllocateShared<float>("c", 64);

    if (threadIndex < 64)
    {
        coefficients[threadIndex] = Coefficients[threadIndex];
    }

Finally, we load in the 32 points on either side of the 1024 data points, being careful not to exceed the source data range.

    else if (threadIndex < 96)
    {
        // Threads 64..95 fetch the 32-point apron just before this block.
        var zeroTo31 = threadIndex - 64;
        var tempDataIndex = gThread.blockIdx.x * ThreadsPerBlock - 32 + zeroTo31;
        if (tempDataIndex >= 0)
            dataCopy[zeroTo31] = dataIn[tempDataIndex];
    }
    else if (threadIndex < 128)
    {
        // Threads 96..127 fetch the 32-point apron just after this block.
        var zeroTo31 = threadIndex - 96;
        var tempDataIndex = (gThread.blockIdx.x + 1) * ThreadsPerBlock + zeroTo31;
        if (tempDataIndex < dataIn.Length)
            dataCopy[ThreadsPerBlock + 32 + zeroTo31] = dataIn[tempDataIndex];
    }

    gThread.SyncThreads();

The call to SyncThreads ensures that all 1024 threads in this block have copied their assigned data to shared memory before any thread proceeds past this point.

The rest of the code is pretty much self-explanatory.

    var sum = 0f;
    var count = 0;

    for (var coefficientIndex = 0; coefficientIndex < coefficients.Length; coefficientIndex++)
    {
        var sourceIndex = dataIndex + coefficientIndex - 32;
        if (sourceIndex >= 0 && sourceIndex < dataIn.Length)
        {
            // Map the global source index into this block's shared copy (offset by the apron).
            var copyIndex = sourceIndex - gThread.blockIdx.x * ThreadsPerBlock + 32;
            sum += dataCopy[copyIndex] * coefficients[coefficientIndex];
            count++;
        }
    }

    dataOut[dataIndex] = sum / count;
}

It turns out that copying the coefficients to shared memory did not buy me much in terms of performance, but copying the data sure helped.

The Future

I tried executing a "null" kernel that did nothing but return. This gave me a time of about 0.74 ms. Therefore, the next place to take time out of this system is to obtain an NVidia card that sports what is called a "dual copy engine", which allows one upload, one download, and several kernels to all run concurrently.
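
For reference, the "null" kernel is just the obvious no-op (a sketch; the measured version is in the linked source):

[Cudafy]
static void NullKernel(GThread gThread)
{
    // Does nothing but return; any time measured here is pure launch overhead.
}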
