Intel AVX: New Frontiers in Performance Improvements and Energy Efficiency

Intel Software Network

Jun 19, 2008

CPOL


Introduction

As the need for more computing performance continues to grow across industry segments, Intel continues to lead in innovation and the delivery of greater compute capacity to support these growing demands and evolving usage models. Intel has a long history of innovating and of leading the charge in expanding the capabilities of the world’s most popular and broadly used computer architecture, Intel® architecture. Intel continues this legacy of innovation with the introduction of the Intel® Advanced Vector Extensions (Intel® AVX) instructions, which drive the industry-leading Intel® SSE4 to new levels of performance, flexibility and energy efficiency.

To benefit a broad audience of consumer and business customers, Intel will introduce a new set of instructions called Intel AVX, supported by a wide range of Intel platforms starting in the 2010 timeframe. Building on the rich legacy of Intel SSE4 and the Intel® 64 instruction set architecture (ISA), Intel AVX provides the infrastructure and building blocks for delivering the performance required by applications in areas such as financial analysis, media content creation, the natural resource industry, and high-performance computing (HPC). Intel AVX will introduce an instruction set extension to enable flexibility in the programming environment, efficient coding with vector and scalar data sets, and power-efficient performance across wide and narrow vector data processing.

This paper provides a brief background on ISA, then gives an overview of the new instructions and capabilities of Intel AVX and the advantages of these instructions across various applications and programming models.

Instruction Set Architecture and Microarchitecture

To better appreciate the significance of these new instructions, it helps to understand the different architectures used in developing today’s modern microprocessors and their uses.

ISA (instruction set architecture) is the part of an overall computer architecture related to programming. It includes the native data types, instructions, registers, addressing modes, memory architecture, interrupt and exception handling, and external I/O. The ISA defines a specification of functions (machine commands) implemented by a particular microprocessor design. Within a family of processors, the ISA is often enhanced over time with new instructions to deliver better performance and expose new features, while at the same time maintaining compatibility with existing applications.
Microarchitecture refers to the actual design, layout and implementation of an ISA in silicon. It includes the overall block design, cores, execution units and types (such as floating-point, integer, branch prediction, SIMD, etc.), pipeline definitions, cache memory design, and peripheral support. Within a family of processors, the microarchitecture is often enhanced over time to deliver improvements in performance, energy efficiency and capabilities, while maintaining compatibility with the ISA.

Leading the Instruction Set Revolution – A Long History in ISA

Intel uses ISA to deliver the superior capabilities of its microarchitecture while maintaining the necessary application-level compatibility across processor generations. Good examples in maintaining instruction set compatibility include the new Intel® Core™2 family of processors. Like the previous generation Intel Core 2 Duo and Intel Pentium D processors, the Intel Core 2 processors implement a nearly identical version of the ISA and provide application-level compatibility. However, internally they have a new and improved design. Nearly all applications built for Intel Pentium D processors will run on the Intel Core 2 Duo processor and the Intel Core 2 processor without any modification. Even better, nearly all of these applications automatically benefit from the superior performance and energy-efficiency of these processors.

As Intel process technology and microarchitecture are continuously evolving at the pace of our new cadence, so are Intel instruction sets. In each evolution:

Intel will optimize existing instructions to enable them to receive maximum benefit from the latest microarchitecture improvements and deliver greater performance and power efficiency for existing applications without modifications.
Intel will also introduce new sets of instructions designed to optimize the performance and lower the power needs of a broad range of existing and new applications. To maximize the benefit of these new instructions, existing applications should be recompiled with an updated compiler.

As you can see, in each case, existing software will continue to run correctly as our instruction sets evolve and new ones are added. Equally important, new applications incorporating these instructions – and existing applications recompiled to take advantage of them – will see exciting performance improvements.

Intel’s lead in ISA extends to a broad ecosystem of operating systems, including Microsoft Windows* and Vista*, UNIX*, Linux*, and now the Macintosh* operating systems. Our continuing commitment to extending our ISA for the industry includes:

Pioneering architectural consistency to enable software innovations across operating systems and application domains through extended industry ecosystem support.
Providing a seamless approach for software vendors to address the market dynamics of product opportunities in 32-bit and 64-bit ISA
Listening to software developers and independent software vendors (ISVs) in our development of new instructions to help developers succeed more easily with us.
Ensuring that existing applications run correctly and perform better.
Maintaining correctness as applications use the new instructions.
Providing ISA leadership to other architecture vendors so that the Intel ISA remains cohesive and performs as a standard – simplifying the job of the ISV community.

Developers benefit from processor capabilities along multiple performance vectors: higher throughput from executing multiple instructions concurrently, and processing multiple data elements with a single instruction. Intel has long encouraged such coding practices to help increase overall processor throughput and utilization. Early on, Intel began a proactive program to improve application performance on Intel processors by developing special instruction sets. Early examples include the floating-point (FP) instruction set extensions defined for the 8086 family and implemented by the 8087 math coprocessor. More recent examples include Single Instruction, Multiple Data (SIMD) and MMX™ technology. Using the MMX technology instruction set, programmers could execute instructions on multiple data elements loaded into MMX technology registers, delivering increased performance in media applications such as graphics, gaming, streaming video, and more. In the P6 processor family, Intel introduced Intel Streaming SIMD Extensions (Intel SSE). Designed for the Intel® Pentium® III processor, Intel SSE extended MMX technology and allowed SIMD computations to be performed on four packed single-precision FP data elements simultaneously using 128-bit registers. With the Intel NetBurst® microarchitecture, Intel SSE2 expanded the SIMD instruction set to a wider spectrum of application domains by offering double-precision FP and 128-bit SIMD integer processing capabilities. Intel SSE2 instructions gave software developers more flexibility in implementing algorithms and provided performance enhancements to software such as MPEG-2 video, MP3, 3D graphics, and more.

The launch of the 90 nm process-based Pentium 4 processor brought the Intel SSE3 extensions. Intel SSE3 added 13 SIMD instructions that are primarily designed to improve thread synchronization and x87-FP math capabilities. A further advancement, Supplemental Intel SSE3, is now available in the Intel Core microarchitecture. Supplemental Intel SSE3 adds 32 new opcodes (including align and multiply-add) for even greater performance.

The most recent Intel ISA innovation is Intel SSE4, introduced in 2007, which offers a broad collection of instructions for significant performance gains and programming productivity. Intel SSE4 has several compiler vectorization primitives for more efficient media performance, as well as new string processing instructions. Beginning with the 45 nm Intel microarchitecture based processors, these new instructions have started ramping in a wide range of Intel platforms including desktop, mobile, and server. Intel SSE4 offers dozens of new instructions in two major categories:

Intel SSE4 Vectorizing Compiler and Media Accelerators.
Intel SSE4 Efficient Accelerated String and Text Processing.

AES-NI and PCLMULQDQ

Intel ISA innovations continue in 2009 with seven new instructions to accelerate data encryption and decryption. Intel processors based on the Westmere code name will provide six new instructions for symmetric encryption/decryption using the Advanced Encryption Standard (AES) and one instruction performing carry-less multiplication (PCLMULQDQ) for advanced block cipher encryption. These hardware-based primitives provide an added security benefit by avoiding table lookups, which protects against software side-channel attacks.
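
Carry-less multiplication treats its operands as polynomials over GF(2): partial products are combined with XOR instead of addition, so no carries propagate between bit positions. The sketch below is a portable software model of a 64 x 64 to 128-bit carry-less product, included only to illustrate the operation. The names clmul128 and clmul64 are hypothetical, and this is not how the hardware instruction or any production library is implemented.

```c
#include <stdint.h>

/* 128-bit result of a 64 x 64 carry-less multiply (hypothetical helper type). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} clmul128;

/* Software model: shift-and-XOR where ordinary multiplication uses shift-and-add. */
static clmul128 clmul64(uint64_t a, uint64_t b)
{
    clmul128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1u) {
            r.lo ^= a << i;
            if (i > 0)
                r.hi ^= a >> (64 - i);  /* bits of a shifted past bit 63 */
        }
    }
    return r;
}
```

For example, clmul64(3, 3) yields 5 rather than 9, because (x + 1)(x + 1) = x^2 + 1 when coefficients are reduced modulo 2.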

Intel® AVX Architecture

Background and Overview

The need for more computing performance continues to grow across industry segments. To support these growing demands and evolving usage models, Intel continues to lead in innovation and the delivery of greater compute capability:

For financial services that need compute-intensive performance to support timely decisions.
For resource and manufacturing industries that construct and model software solutions across multiple dimensions of space and time.
For service-oriented software innovations targeting personalized or customer-centric experiences that will require new algorithms to distill multiple data sets, correlate historical profiles, and transform or decompose feature space representations, along with ubiquitous availability of power-efficient compute hardware.

Intel AVX is a new 256-bit SIMD FP vector extension of Intel Architecture. Its introduction is targeted for the Sandy Bridge processor family in the 2010 timeframe. Intel AVX accelerates the trends towards FP intensive computation in general purpose applications like image, video, and audio processing, engineering applications such as 3D modeling and analysis, scientific simulation, and financial analytics.

Intel AVX is a comprehensive ISA extension of the Intel 64 Architecture. The main elements of Intel AVX are:

Support for wider vector data (up to 256-bit).
Efficient instruction encoding scheme that supports 3 and 4 operand instruction syntax.
Flexibility in programming environment, ranging from branch handling to relaxed memory alignment requirements.
New data manipulation and arithmetic compute primitives, including broadcast, permute, fused multiply-add, etc.
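
As a rough illustration of several of these elements together, the following sketch computes out[i] = s * x[i] + y[i] using 256-bit vectors, a broadcast of the scalar s, and unaligned loads and stores. It assumes GCC or Clang on an x86 target (the target attribute and __builtin_cpu_supports are compiler-specific), and the function names scale_add and scale_add_avx are hypothetical. A separate multiply and add is used rather than a fused multiply-add so that the sketch relies only on the base AVX intrinsics.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX_PATH 1

/* Compiled for AVX even if the rest of the file is not (GCC/Clang attribute). */
__attribute__((target("avx")))
static void scale_add_avx(const float *x, const float *y, float s,
                          float *out, size_t n)
{
    __m256 vs = _mm256_set1_ps(s);           /* broadcast s to all 8 lanes */
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);  /* unaligned 256-bit loads */
        __m256 vy = _mm256_loadu_ps(y + i);
        __m256 r  = _mm256_add_ps(_mm256_mul_ps(vs, vx), vy);
        _mm256_storeu_ps(out + i, r);        /* unaligned 256-bit store */
    }
    for (; i < n; i++)                       /* scalar tail */
        out[i] = s * x[i] + y[i];
}
#endif

/* Uses the AVX path when the running CPU reports support, else plain C. */
void scale_add(const float *x, const float *y, float s, float *out, size_t n)
{
#ifdef HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx")) {
        scale_add_avx(x, y, s, out, n);
        return;
    }
#endif
    for (size_t i = 0; i < n; i++)
        out[i] = s * x[i] + y[i];
}
```

The runtime check keeps the same binary usable on processors without Intel AVX; on other compilers or architectures only the scalar loop is built.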

While any application making heavy use of floating-point or integer SIMD can use Intel AVX, the applications that show the best benefit are those that are strongly floating-point compute intensive and can be vectorized. Example applications include audio processing and audio codecs, image and video editing applications, financial services analysis and modeling software, and manufacturing and engineering software.

When Intel’s engineers set out to design Intel AVX several years ago, it was essential that we provide a comprehensive, backwards-compatible solution with built-in extensibility. Our three-operand and wider-vector syntax is based on an instruction encoding format that applies to the complete set of existing Intel SSE instructions. In addition, Intel AVX added a four-operand syntax in the encoding to support FMA, variable blends, and generalized shuffles. Intel AVX’s encoding is highly compact: these instructions typically take fewer bytes than the 64-bit forms of current floating-point instructions, yet they have many reserved fields for future features.

Key Benefits of Intel® AVX

Intel AVX is a comprehensive ISA enhancement that adds new functionality in addition to the compact new encoding format.

A large number (200+) of legacy Intel SSEx instructions are upgraded by the enhanced instruction encoding to take advantage of features like a distinct source operand and flexible memory alignment.
A moderate number (< 100) of legacy 128-bit Intel SSEx instructions have been promoted to process 256-bit vector data.
A number of new data processing and arithmetic operations (< 100) not present in legacy Intel SSEx are added to Intel processors to be launched in 2010 and beyond.

The key advantages of Intel AVX are:

Performance: Intel AVX can improve performance on existing and new applications that lend themselves well to largely vectorizable data sets:
- Wider vector data sets can be processed at up to twice the throughput of 128-bit data sets.
- Application performance can scale up with the number of hardware threads and number of cores.
- Application domain can scale out with advanced platform interconnect fabrics, such as Intel QPI.
Power Efficiency: Intel AVX is extremely power efficient. Incremental power is insignificant when the instructions are unused or scarcely used. Combined with the high performance that it can deliver, applications that lend themselves heavily to using Intel AVX can be much more energy efficient and realize a higher performance-per-watt.
Extensibility: Intel AVX has powerful built-in extensibility options for the future without resorting to code growth:
- OS context management rework only needs to be done once.
- Future vector integer support to 256 and 512 bits.
- Future vector FP support to 512 bits and even 1024 bits.
Compatibility: Intel AVX is backward compatible with previous ISA extensions including Intel SSE4:
- Simple porting of existing Intel SSE applications to Intel AVX-128.
- Straightforward porting of existing Intel SSE to Intel AVX-256.
Ubiquity: Intel AVX will be available in a wide range of Intel platforms, from sub-notebook to multi-processor servers.
Support: Intel's comprehensive range of developer tools and an extensive online support presence at Intel® Software Network make it easy for developers to start working with Intel AVX today.
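
To make the "up to twice the throughput" point under Performance concrete, the arithmetic can be sketched as follows: a 256-bit register holds twice as many elements as a 128-bit one, so a loop over the same data needs half as many vector operations. The helper names lanes and vector_ops are hypothetical, and the sketch counts operations only; it says nothing about actual latency, memory bandwidth, or measured speedup.

```c
#include <stddef.h>

/* Number of elements of elem_bits each that fit in a vector_bits register. */
static size_t lanes(size_t vector_bits, size_t elem_bits)
{
    return vector_bits / elem_bits;
}

/* Vector operations needed to cover n elements (the last one may be partial). */
static size_t vector_ops(size_t n, size_t vector_bits, size_t elem_bits)
{
    size_t w = lanes(vector_bits, elem_bits);
    return (n + w - 1) / w;
}
```

With 32-bit single-precision elements, lanes(128, 32) is 4 and lanes(256, 32) is 8, so 1000 elements take 250 vector operations at 128 bits and 125 at 256 bits.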

Software development platforms will be available in the first half of 2010. Prior to 2010, ISVs will be able to start development using an emulator and other tools that will be available for download from the Intel AVX web site.

Intel will be providing various tools, white papers, and a support forum to help ISVs start development. There are multiple paths for development while keeping in mind that with Intel AVX:

Most apps written with intrinsics need only recompile.
Porting existing Intel SSE code to Intel AVX-256 is straightforward with Intel libraries, Intel® Integrated Performance Primitives (Intel® IPP), etc.
All Intel SSE/2 instructions are extended via a simple prefix ("VEX").
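
The porting path from Intel SSE intrinsics to Intel AVX-256 can be sketched with a simple array addition. In this example the 128-bit and 256-bit versions have the same structure; the port changes the intrinsic prefix from _mm_ to _mm256_ and the loop stride from 4 to 8 floats. The function names are hypothetical, the guards assume GCC or Clang on x86-64, and the runtime dispatch is just one possible way to keep a single binary working on processors with and without Intel AVX.

```c
#include <stddef.h>

#if defined(__GNUC__) && defined(__x86_64__)
#include <immintrin.h>
#define HAVE_X86_SIMD 1

/* Existing Intel SSE version: 4 floats per 128-bit register. */
static void add_sse(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i,
                      _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; i++)                       /* scalar tail */
        out[i] = a[i] + b[i];
}

/* Ported AVX-256 version: same structure, 8 floats per 256-bit register. */
__attribute__((target("avx")))
static void add_avx(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(out + i,
                         _mm256_add_ps(_mm256_loadu_ps(a + i),
                                       _mm256_loadu_ps(b + i)));
    for (; i < n; i++)                       /* scalar tail */
        out[i] = a[i] + b[i];
}
#endif

/* Pick the widest path the running CPU supports. */
void add_arrays(const float *a, const float *b, float *out, size_t n)
{
#ifdef HAVE_X86_SIMD
    if (__builtin_cpu_supports("avx"))
        add_avx(a, b, out, n);
    else
        add_sse(a, b, out, n);
#else
    for (size_t i = 0; i < n; i++)           /* portable fallback */
        out[i] = a[i] + b[i];
#endif
}
```

For the recompile-only path, building unchanged SSE intrinsic code with AVX code generation enabled (for example, -mavx with GCC or Clang) lets the compiler emit the VEX-encoded 128-bit forms without source changes.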

Summary

Intel’s leadership and ongoing work in the development of instruction set extensions for Intel architecture provide a continuing path for enhancing the performance, power efficiency and capabilities of a wide range of software. With Intel AVX we are continuing to work with the ISV community to deliver instruction set extensions that truly enhance software products to provide real benefits (everything from improved performance to substantial cost savings) to their customers.