
Advancing OpenCL™ for FPGAs

29 Mar 2019 · CPOL
Boosting Performance with Intel® FPGA SDK for OpenCL™ Technology

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

Field programmable gate arrays (FPGAs) are capable of very high performance, especially power-performance. This is perhaps not surprising: FPGA hardware itself is malleable, configurable to match the application rather than the other way around. Also not surprising is that this additional degree of freedom, the ability of the application developer to change the hardware as well as the software, leads to increased complexity everywhere in the application development workflow.

And indeed, this has been the case. Until recently, most developers of FPGA applications relied on tools and methods that have more in common with those used by hardware designers than by software programmers. The languages used have been hardware description languages (HDLs) such as Verilog* and VHDL*. These describe the nature of the logic rather than a flow of instructions. The compile times (called synthesis) have been very long. And an uncomfortable amount of system knowledge has been required, especially for debug and test.

All of this is now changing. Intel has created Intel® FPGA SDK for OpenCL™ technology1, which provides an alternative to HDL programming. The technology and related tools belong to a class of techniques called high-level synthesis (HLS), which enable designs to be expressed at higher levels of abstraction. Intel FPGA SDK for OpenCL technology is now in widespread use. Amazingly for longtime FPGA application developers, the performance achieved is often close to, or even better than, that of HDL code. But it also seems apparent that achieving this performance is often limited to developers who already know how the C-to-hardware translation works, and who have an in-house toolkit of optimization methods.

At Boston University, we’ve worked on enumerating, characterizing, and systematizing one such optimization toolkit. A number of best-practices documents for FPGA OpenCL already exist. This work augments them, largely by applying additional methods well known to the high-performance computing (HPC) community2. We believe this methodology is on the right track: it’s taken decades for HPC performance programming to reach its current level of sophistication, and we shouldn’t be surprised that users of Intel FPGA SDK for OpenCL technology need to follow a similar path and learning curve.

Please note that you can find more details on the work described here in references 3 and 4 at the end of the article. The first uses the FFT as a detailed case study; the second describes the empirically guided optimization framework. Also of potential interest is related work in references 5 and 6, which shows how we augmented the toolflow so that design functionality and performance can be tested and verified without generating hardware for the entire system. As a result, we can identify design bottlenecks and the impact of optimizations with greater accuracy, and thus achieve rapid turnaround. Figure 1 shows how these pieces fit together with the existing toolflow.

Empirically Guided Code Optimizations

We’ve proposed a series of systematic and empirically guided code optimizations for OpenCL that augment current best practices and substantially improve performance. Our work characterizes and measures the impact of all these optimizations. This not only enables programmers to follow a script when optimizing their own kernels, but also opens the way for the development of autotuners to perform optimizations automatically.

Figure 1. The standard Intel® FPGA SDK for OpenCL™ technology toolflow is shown in light blue. Our augmentations to the standard toolflow are shown in yellow, green, and purple. This article describes the systematic optimizations.

We broadly categorize code optimizations in this domain into three sets:

  1. Intel’s best practices (IBPs)
  2. Universal code optimizations (UCOs)
  3. FPGA-specific optimizations (FSOs)

IBPs are design strategies given in the Intel Best Practices Guide7, which show how to express hardware using OpenCL semantics. We separate these from UCOs and FSOs because IBPs are well-known to the FPGA OpenCL community and there have been several studies characterizing their behavior.

UCOs consist of general approaches to optimizing programs that, to a large degree, are independent of the compute platform (two of these are sketched just after the list), e.g.:

  • Use 1D arrays
  • Records of arrays
  • Predication
  • Loop merging
  • Scalar replacement
  • Precomputing constants
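
As a rough illustration, the sketch below applies two of these, use of 1D arrays and scalar replacement, to a small column-scaling loop. The kernel and all names are hypothetical, chosen only to make the two transformations visible; they are not taken from our benchmarks.

    // Hypothetical OpenCL kernel illustrating two UCOs: 1D array indexing and
    // scalar replacement of a value that is reused in every iteration.
    __kernel void scale_column(__global const int *restrict a,     // n x n matrix stored as a 1D array
                               __global const int *restrict scale,
                               __global int *restrict out,
                               const int n, const int k)
    {
        int s = scale[k];                  // scalar replacement: load the reused value once
        for (int i = 0; i < n; i++) {
            int aik = a[i * n + k];        // 1D indexing in place of a[i][k]
            out[i] = aik * s + aik;        // both uses read the temporary, not memory
        }
    }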

Though described elsewhere (for example, in reference 2), UCOs are largely missing from the IBP documentation. FSOs consist of a number of FPGA-specific optimizations that typically augment IBPs. They are based on:

  • Obtaining a particular FPGA-specific mapping not found as an IBP
  • Facts stated in IBPs, but which we have leveraged and converted into optimizations
  • Commonly used practices that (we have found) should actually be avoided

There are seven code versions, discussed in detail in references 4 and 6, which are incrementally developed. Each version contains one or more applied optimizations. Table 1 summarizes the optimizations and their type (IBP, FSO, and/or UCO).

Table 1. Summary of code versions and optimizations

Version | Optimizations | Type
0 | (GPU code for porting to FPGA OpenCL) | -
1 | Single-thread code with cache optimization | IBP, FSO
2 | Implement task-parallel computations in separate kernels and connect them using channels | IBP
2 | Fully unroll all loops with #pragma unroll | IBP, UCO
2 | Minimize variable declaration outside compute loops (use temps where possible) | IBP, UCO
2 | Use constants for problem sizes and data values (instead of relying on off-chip memory access) | IBP, FSO, UCO
2 | Coalesce memory operations | IBP, UCO
3 | Implement the entire computation within a single kernel and avoid using channels | FSO
4 | Reduce array sizes to infer pipeline registers as registers instead of BRAMs | FSO
5 | Perform computations in detail, using temporary variables to store intermediate results | FSO, UCO
6 | Use predication instead of conditional branch statements when defining forks in the data path | FSO, UCO

Version 0: Sub-Optimal Baseline Code

A popular starting point (for example, in reference 8) is kernels based on multiple work items (MWI), as is appropriate for GPUs. Advantages of starting here include the ease of exploiting data parallelism through SIMD, and compute unit replication (CUR), which is exclusive to MWI structures.

Algorithm 1 shows a V0-type kernel (based on reference 9). The core operation is to populate a matrix using known values of the first row and the first column. Each unknown entry is computed from the values at its left, up, and up-left locations. This is achieved using loops that iterate in order over all matrix entries. The max function is implemented using if-else statements. In Algorithm 1, SIZE represents the dimension of the blocks of matrix entries being processed.

Algorithm 1. Needleman Wunsch-V0

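The Algorithm 1 listing appears as an image in the original article. As a rough sketch of the MWI structure just described, and not the actual V0 code, one common arrangement launches the kernel once per anti-diagonal so that the work-items of each launch are independent. All names, the PENALTY constant, and the scoring details below are illustrative.

    #define PENALTY 10   // illustrative gap penalty

    // Hypothetical V0-style NDRange (MWI) kernel: the host enqueues one launch per
    // anti-diagonal d; every work-item fills one cell of that diagonal using its
    // left, up, and up-left neighbors, with max written as if-else statements.
    __kernel void nw_v0(__global int *restrict score,        // (N+1) x (N+1); row 0 and column 0 pre-filled
                        __global const int *restrict ref,    // N x N substitution scores
                        const int N, const int d)            // matrix dimension; current anti-diagonal (2..2N)
    {
        int k = get_global_id(0);                            // position along the anti-diagonal
        int i = (d <= N) ? (d - 1 - k) : (N - k);
        int j = d - i;
        if (i < 1 || i > N || j < 1 || j > N)
            return;

        int up_left = score[(i - 1) * (N + 1) + (j - 1)] + ref[(i - 1) * N + (j - 1)];
        int up      = score[(i - 1) * (N + 1) + j] - PENALTY;
        int left    = score[i * (N + 1) + (j - 1)] - PENALTY;

        int best = up_left;                                  // max via if-else, as in the text
        if (up > best)   best = up;
        if (left > best) best = left;
        score[i * (N + 1) + j] = best;
    }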

Version 1: Preferred Baseline Code (Used for Reference)

A less intuitive, but preferred, alternative is to use (as a baseline) single-threaded CPU code. In particular, initial designs should be implemented as single work item (SWI) kernels as recommended by IBPs. SWI kernels can infer and exploit all forms of parallelism effectively, and do so in a more efficient way than MWI kernels. The CPU-like baseline code should also be optimized for cache performance. This:

  • Helps the compiler infer connectivity between parallel pipelines (i.e., whether data can potentially be directly transferred between pipelines instead of being stored in memory)
  • Improves bandwidth for on-chip data access
  • Efficiently uses the internal cache of load store units which are responsible for off-chip memory transactions

Algorithm 2 shows the preferred baseline kernel. The first row and column of the matrix are Vector A and Vector B, respectively.

Algorithm 2. Needleman Wunsch-V1

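The Algorithm 2 listing is likewise an image. A minimal sketch of a V1-style SWI kernel, assuming the same illustrative names and scoring as the V0 sketch above, is simply the single-threaded, row-major CPU loop nest:

    #define PENALTY 10   // illustrative gap penalty

    // Hypothetical V1-style single-work-item (SWI) kernel: one thread, plain nested
    // loops in row-major order, so off-chip accesses stay sequential and the
    // compiler sees a single deep pipeline.
    __kernel void nw_v1(__global int *restrict score,        // row 0 holds Vector A, column 0 holds Vector B
                        __global const int *restrict ref,
                        const int N)
    {
        for (int i = 1; i <= N; i++) {
            for (int j = 1; j <= N; j++) {
                int up_left = score[(i - 1) * (N + 1) + (j - 1)] + ref[(i - 1) * N + (j - 1)];
                int up      = score[(i - 1) * (N + 1) + j] - PENALTY;
                int left    = score[i * (N + 1) + (j - 1)] - PENALTY;

                int best = up_left;                          // max via if-else, as in V0
                if (up > best)   best = up;
                if (left > best) best = left;
                score[i * (N + 1) + j] = best;
            }
        }
    }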

Version 2: IBPs

Given the preferred baseline code, we then apply the following commonly used IBPs:

  • Multiple task-parallel kernels
  • Full loop unrolling
  • Minimized state register usage
  • Constant arrays
  • Memory coalescing

Algorithm 3 shows the Needleman Wunsch kernel structure after we apply IBPs. Parallelism is exploited using a systolic array, with each processing element (PE) implemented in a separate kernel. Channels are used to connect PEs in a specified sequence. For each inner loop iteration, PEs compute consecutive columns within the same row, which ensures spatial locality for memory transactions. The drawback is data dependencies between kernels, which the compiler cannot reliably resolve since it optimizes each kernel as an individual entity. Thus, the overhead of synchronizing data paths can result in performance degradation.

Algorithm 3. Needleman Wunsch-V2

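The Algorithm 3 listing is an image. A deliberately minimal skeleton of the V2 structure might look like the following: each PE is its own kernel, and the PEs are chained with channels. The per-cell computation is elided, and the channel built-ins are spelled as in current Intel documentation (read_channel_intel/write_channel_intel); the SDK release used in this article may use the older *_altera names.

    #pragma OPENCL EXTENSION cl_intel_channels : enable

    channel int pe_link;      // running score handed from PE0 to PE1
    channel int result_link;  // finished cells handed from PE1 to the drain kernel

    // Hypothetical two-PE skeleton of a V2-style systolic chain.
    __kernel void pe0(__global const int *restrict score_in, const int rows)
    {
        for (int i = 0; i < rows; i++) {
            int v = score_in[i];
            // ... per-cell max computation for this PE's column would go here ...
            write_channel_intel(pe_link, v);      // forward to the next PE
        }
    }

    __kernel void pe1(const int rows)
    {
        for (int i = 0; i < rows; i++) {
            int v = read_channel_intel(pe_link);
            // ... per-cell max computation for this PE's column would go here ...
            write_channel_intel(result_link, v);
        }
    }

    __kernel void drain(__global int *restrict score_out, const int rows)
    {
        for (int i = 0; i < rows; i++)
            score_out[i] = read_channel_intel(result_link);
    }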

Version 3: Single-Kernel Design

In Version 3, we merge the IBP-optimized task-parallel kernels and declare all compute loops within the same kernel. This is possible because the compiler is still able to automatically infer task-parallel pipelines within a single kernel. Having a single kernel carries a number of advantages over the multi-kernel approach, e.g.:

  • Inherent global synchronization
  • Reduced resource usage and delays through pipeline merging/reordering
  • Simplified control logic

Algorithm 4 shows the kernel structure for implementing the systolic array as a single kernel. The compiler can now optimize the entire computation, as opposed to individual PEs. Synchronization overhead is also reduced, since almost all computation is tied to a single loop variable (j). Nested loops are used because, in this particular case, the cost of higher initiation intervals is outweighed by the reduction in resource usage; the compiler was unable to infer data access patterns when the loops were coalesced.

Algorithm 4. Needleman Wunsch-V3

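The Algorithm 4 listing is an image. A sketch of the V3 idea, reusing the illustrative names from the earlier sketches: the PEs become a fully unrolled inner loop inside one SWI kernel, with the running "left" value carried through a register chain from one PE to the next. This simplified version keeps the score matrix off chip and assumes N is a multiple of NUM_PE.

    #define PENALTY 10
    #define NUM_PE  16    // illustrative number of processing elements

    // Hypothetical V3-style single kernel: the unrolled pe loop plays the role of
    // the systolic array, and no channels or inter-kernel synchronization remain.
    __kernel void nw_v3(__global int *restrict score,
                        __global const int *restrict ref,
                        const int N)
    {
        for (int i = 1; i <= N; i++) {
            int left = score[i * (N + 1)];                   // column 0 of this row (from Vector B)
            for (int j = 1; j <= N; j += NUM_PE) {
                #pragma unroll
                for (int pe = 0; pe < NUM_PE; pe++) {        // one "PE" per column of the block
                    int col = j + pe;
                    int up_left = score[(i - 1) * (N + 1) + (col - 1)] + ref[(i - 1) * N + (col - 1)];
                    int up      = score[(i - 1) * (N + 1) + col] - PENALTY;

                    int best = up_left;
                    if (up > best)             best = up;
                    if (left - PENALTY > best) best = left - PENALTY;

                    score[i * (N + 1) + col] = best;
                    left = best;                             // register chain feeding the next PE
                }
            }
        }
    }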

Version 4: Reduced Array Sizes

Having large variable arrays results in pipeline registers being inferred as BRAMs instead of registers, which can have significant drawbacks for the design. Since BRAMs can’t source and sink data with the same throughput as registers, barrel shifters and memory replication are required, which drastically increases resource usage. Moreover, the compiler is unable to launch stall-free iterations of compute loops due to memory dependencies. The solution is to break large arrays corresponding to intermediate variables into smaller ones.

Algorithm 5 shows the kernel structure for inferring pipeline registers as registers. All arrays are expressed as individual variables, generated using scripts, with the exception of the local storage of Vector B in "left", which has low throughput requirements.

Algorithm 5. Needleman Wunsch-V4

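The Algorithm 5 listing is an image. The transformation itself is mechanical, so the sketch below shows it on a small, self-contained delay-line kernel rather than on NW: an intermediate array becomes a set of individually named variables (in our work these declarations are generated by scripts). DEPTH here stands in for the much larger intermediate arrays in the real kernels; all names are illustrative.

    #define DEPTH 4   // illustrative; the effect matters for much larger arrays

    // V3-style: the intermediate values live in one array indexed inside the loop.
    // For large arrays the compiler infers BRAM and must replicate it to feed
    // every reader, as described above.
    __kernel void delay_array(__global const int *restrict in,
                              __global int *restrict out, const int n)
    {
        int line[DEPTH] = {0};
        for (int i = 0; i < n; i++) {
            #pragma unroll
            for (int k = DEPTH - 1; k > 0; k--)
                line[k] = line[k - 1];                   // shift the delay line
            line[0] = in[i];
            out[i] = line[0] + line[DEPTH - 1];          // current value plus the oldest one
        }
    }

    // V4-style: the same values as individually named variables, which the
    // compiler keeps in pipeline registers.
    __kernel void delay_scalar(__global const int *restrict in,
                               __global int *restrict out, const int n)
    {
        int d0 = 0, d1 = 0, d2 = 0, d3 = 0;
        for (int i = 0; i < n; i++) {
            d3 = d2; d2 = d1; d1 = d0;                   // shift through registers
            d0 = in[i];
            out[i] = d0 + d3;
        }
    }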

Version 5: Detailed Computations

The OpenCL compiler doesn’t reliably break down large computations being assigned to a single variable into intermediate stages. This reduces the number of possible pipeline stages and can result in larger critical paths and data dependency stalls. Our solution is to do computations in as much detail as possible by using intermediate variables to help the compiler infer pipelines. If the logic is already optimal, these variables will be synthesized away and won’t waste resources.

Algorithm 6 shows the kernel structure after performing computations in detail, with a number of intermediate variables added. The "max" function is also explicitly implemented.

Algorithm 6. Needleman Wunsch-V5

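The Algorithm 6 listing is an image. A hypothetical V5-style version of the per-cell update from the earlier sketches, written as a helper function: every step is held in its own intermediate variable and the max function is spelled out explicitly (still with if-else at this stage). Names are illustrative.

    // Hypothetical V5-style per-cell update with scalar intermediates.
    inline int nw_cell_v5(int up_left_val, int up_val, int left_val,
                          int ref_val, int penalty)
    {
        int t_upleft = up_left_val + ref_val;   // candidate from the diagonal
        int t_up     = up_val   - penalty;      // candidate from above
        int t_left   = left_val - penalty;      // candidate from the left

        int best_ud;                            // explicit two-input max
        if (t_upleft > t_up) best_ud = t_upleft;
        else                 best_ud = t_up;

        int best;                               // final two-input max
        if (t_left > best_ud) best = t_left;
        else                  best = best_ud;

        return best;
    }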

Version 6: Predication

We optimize conditional operations by explicitly specifying architecture states that ensure the validity of the computation. Since hardware is persistent and will always exist once synthesized, we avoid using conditional branch statements. Instead, variable values are conditionally assigned such that the output of invalid operations is not committed and hence does not impact the overall result. Algorithm 7 shows the "if-else" operations replaced with conditional assignments.

Algorithm 7. Needleman Wunsch-V6

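The Algorithm 7 listing is an image. Continuing the hypothetical helper from the V5 sketch, the if-else branches become conditional assignments, so both candidates are always computed and only the valid one is committed:

    // Hypothetical V6-style per-cell update: predication via conditional assignment.
    inline int nw_cell_v6(int up_left_val, int up_val, int left_val,
                          int ref_val, int penalty)
    {
        int t_upleft = up_left_val + ref_val;
        int t_up     = up_val   - penalty;
        int t_left   = left_val - penalty;

        int best_ud = (t_upleft > t_up)  ? t_upleft : t_up;    // predicated select
        int best    = (t_left > best_ud) ? t_left   : best_ud; // predicated select
        return best;
    }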

Hardware Specifications

The designs are implemented using an Intel® Arria® 10 FPGA (10AX115H3F34I2SG) and Intel® FPGA SDK for OpenCL™ technology 16.0. This FPGA has 427,200 ALMs, 1,506K logic elements, 1,518 DSP blocks, and 53 Mb of on-chip storage. For GPU implementations, we use the NVIDIA Tesla* P100 PCIe 12GB GPU with CUDA* 8.0. It has 3,584 CUDA cores and a peak memory bandwidth of 549 GB/s. CPU codes are implemented on a 14-core, 2.4 GHz Intel® Xeon® E5-2680 v4 processor with Intel® C++ Compiler v16.0.1.

Optimization Characterization

The optimizations are tested for the full OpenCL compilation flow using these benchmarks:

  • Needleman Wunsch (NW)
  • Fast Fourier Transform (FFT)
  • Range Limited Molecular Dynamics (Range Limited)
  • Particle Mesh Ewald (PME)
  • Dense Matrix-Matrix Multiplication (MMM)
  • Sparse Matrix Dense Vector Multiplication (SpMV)
  • Cyclic Redundancy Check (CRC)

Table 2 provides a summary of these benchmarks, their associated dwarfs8, and the tested problem sizes. Not all code versions apply to every benchmark; the versions developed for each are detailed in references 4 and 6.

Table 2. Benchmark summary

Benchmark | Dwarf | Problem Size
NW | Dynamic Programming | 16K x 16K Integer Table
FFT | Spectral Methods | 64-point Radix-2 1D FFT, 8,192 Vectors
Range Limited | N-Body | 180 Particles per Cell, 15% Pass Rate
PME | Structured Grids | 1,000,000 Particles, 32³ Grid, 3D Tri-Cubic
MMM | Dense Linear Algebra | 1K x 1K Matrix, Single Precision
SpMV | Sparse Linear Algebra | 1K x 1K Matrix, Single Precision, 5% Sparsity, NZ = 51,122
CRC | Combinational Logic | 100 MB CRC32

Figure 2 shows the results of the individual optimizations. In almost all cases, we see the same trend: traditional optimizations (V2) yield only a fraction of the possible speedup. Applying the additional optimizations on top of V2 improves performance by orders of magnitude.

Figure 2. Impact of systematic application of the proposed optimizations to a cache-optimized CPU baseline code. In almost all cases, every subsequent code version shows increasing performance, with up to orders-of-magnitude better performance possible for fully optimized kernels.

The average impact of individual optimizations is shown in Figure 3. Generally, each successive set of optimizations applied results in increasing performance. The exception is V5. This is due to higher execution times of V5 for NW and SpMV. In both cases, performing computations in as much detail as possible results in the use of conditional statements that outweigh the benefits of the optimization. Once these statements are removed in V6, the speedup increases.

Figure 3. Performance for the different code versions, obtained by averaging the speedups of all applicable benchmarks.

To demonstrate the overall effectiveness of the approach, we compare the performance of the optimized kernels against existing CPU, GPU, Verilog, and FPGA-OpenCL implementations. Table 3 lists the references for these designs. They’re either obtained from the literature or implemented using available source code/libraries (the latter denoted by an asterisk). The Verilog FFT measurement from reference 3 has been extended to include off-chip access overhead.

Table 3. References for existing implementations

Benchmark CPU GPU Verilog* OpenCL™
NW Rodinia*9 Rodinia*9 Benkrid*10 Zohouri*11
FFT MKL*12 cuFFT*13 Sanaullah*3 Intel14
Range Limited Yang*15 Yang15
PME Ferit16 Ferit*16 Sanaullah17
MMM MKL*12 cuBLAS*18 Shen*19 Spector*20
SpMV MKL*12 cuSPARSE*18 Zhou*22 OpenDwarfs*8
CRC Brumme*23 Anand*24 OpenDwarfs8

Figure 4 shows the average speedup achieved over the CPU code, while Figure 5 shows the normalized execution times for all implementations. From the results, we observe that our work outperforms multicore CPU implementations by approximately 1.2x; the margin is modest because the CPU baselines are largely written using the highly optimized Intel® Math Kernel Library (Intel® MKL). We’ve also achieved an average of approximately 5x lower execution time than existing FPGA OpenCL work. The GPU speedup of 2.4x relative to our work is due to the use of a high-end GPU (Tesla* P100) compared to a midrange FPGA (Intel® Arria® 10). We therefore also provide an estimate of high-end FPGA performance (Intel® Stratix® 10) using a conservative factor of 4x, which accounts only for the increase in resources. Results show that the optimized kernels on Stratix 10 are expected to outperform the GPU designs by 65%, on average.

Conclusions

Comparison with existing Verilog* implementations shows that the OpenCL kernels are, on average, within 12% of hand-tuned HDL. This demonstrates that the optimizations are able to bridge the performance-programmability gap for FPGAs and deliver HDL-like performance using OpenCL.

Figure 4. Average speedup with respect to the CPU across all applicable benchmarks.

Figure 5. Performance of our work compared to existing CPU, GPU, Verilog*, and FPGA OpenCL implementations. Our work outperforms the CPU and OpenCL implementations for most of the benchmarks. Moreover, we also achieve speedups over the GPU (SpMV, PME) and Verilog (SpMV, Range Limited) implementations.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United States
You may know us for our processors. But we do so much more. Intel invents at the boundaries of technology to make amazing experiences possible for business and society, and for every person on Earth.

Harnessing the capability of the cloud, the ubiquity of the Internet of Things, the latest advances in memory and programmable solutions, and the promise of always-on 5G connectivity, Intel is disrupting industries and solving global challenges. Leading on policy, diversity, inclusion, education and sustainability, we create value for our stockholders, customers and society.