Remove Memory Bottlenecks Using Intel® Advisor

Intel

5.00/5 (2 votes)

Apr 14, 2020

CPOL

15 min read

6105

Understanding How Your Program is Accessing Memory Helps You Get More from Your Hardware

How your application accesses memory dramatically impacts performance. It’s not enough to parallelize your application by adding threads and vectorization. Effective use of memory bandwidth is just as important. But often, software developers just don’t understand it. Tools that help minimize memory latency and increase bandwidth can help pinpoint performance bottlenecks and diagnose their causes. One such tool is Intel® Advisor, which has features to help optimize memory access and eliminate memory bottlenecks:

Roofline analysis with the new Integrated Roofline feature
Memory Access Pattern Analysis (MAP)
Memory Footprint analysis

Getting Great Performance

To get top performance out of your application, you need to know how well you’re using all system resources. You can see some useful metrics on your overall program in the Intel Advisor Summary view (Figure 1), which gives you an indication of how well the application is vectorized.

Figure 1 – Intel® Advisor Summary view

You’ll also need to systematically investigate the loops in your program that are taking the most time. A key metric for this is Vectorization Efficiency (Figure 2). In this example, Intel Advisor is showing a vectorization gain of 2.19x. But this only gives us a vectorization efficiency score of 55%. Where did we lose 45% of our efficiency? There are many factors that can cause inefficient vectorization.

Figure 2 – Intel Advisor Vectorization Efficiency view

Performance Problems

Bad Access Pattern

Indirect memory access is a common cause of slowdowns. Notice in the following code snippet that we can’t decode A without first decoding B[i]:

  for (i = 0; i < N; i++)
    A[B[i]] = C[i] * D[i];

This gives us an irregular access pattern. The compiler can often vectorize this by using a technique called gather/scatter―which is great because it allows that loop to vectorize, but bad because these gather/scatter instructions aren’t as fast as sequential access. For fast code, it’s important to try to have your data structures arranged so that data is accessed in unit stride. (You’ll see how Intel Advisor can show you this information later.)

Memory Subsystem Latency/Throughput

Getting your code to fit into the various memory caches, and making optimal use of data reuse, are crucial to getting the best performance out of your system. In the following example we’re indexing by i over a very large piece of data. This data is too big to fit in cache, which is bad―and made doubly so by A being a multidimensional array:

  for (i = 0; i < VERY_BIG; i++)
    c[i] = z * A[i][j];

References to A[i][j] and A[i+1][j] are not located next to each other in memory. So, to get each new
reference, we need to bring in a new cache line―and potentially evict a cache line. This “cache thrashing” will
have a negative impact on performance. Techniques such as cache blocking, where we add a new inner loop that
indexes over a much smaller range that is designed to fit in cache, can help optimize these types of applications.

Branchy Code

Applications with a lot of branches (e.g., the for loop below with the if(cond(i)) can be vectorized using mask registers to block the SIMD lanes where the condition is not true. In these iterations, a SIMD lane does not do useful work. Intel Advisor uses the Mask utilization metric (Figure 3). Three elements are being suppressed, giving us a Mask utilization of 5/8 = 62.5%.

  for (i = 0; i < MAX; i++)
    if (cond(i))
      C[i] = A[i] + B[i];

Figure 3 – Mask utilization metric

You could potentially access your data in a unit stride fashion and have excellent vector efficiency, but still not get the performance you need because of low mask utilization (Figure 4). Table 1 shows Memory access types.

Figure 4 – Mask utilization versus efficiency

Table 1. Memory access types

Access Pattern	Small Memory Footprint	Large Memory Footprint
Unit Stride	Effective SIMD No latency or bandwidth bottlenecks	Effective SIMD Bandwidth bottleneck
Constant Stride	Medium SIMD Latency bottlenecks possible	Medium SIMD Latency and bottlenecks possible
Irregular Access, Gather/Scatter	Bad SIMD Latency bottlenecks possible	Bad SIMD Latency Bottlenecks

Are You Bound by CPU/VPU or Memory?

If your application is memory bound, there are several features in Intel Advisor that can help you optimize. But first, you need to determine if you’re memory bound or CPU/VPU bound. A quick way to determine this is by looking at your instructions. The Intel Advisor Code Analytics windows (Figure 5) can give you a very basic way to see the mix of instructions that your code is executing.

Figure 5 – Code Analytics windows

A good rule of thumb is that applications that are executing a lot of memory instructions tend to be memory bound, whereas those that are executing a lot of compute instructions tend to be compute bound. Notice the breakdown in Figure 5. The ratio of scalar to vector instructions is particularly important. You should try to have as many vector instructions as possible.

Another, slightly more complicated, technique is to use the Traits column in the Intel Advisor Survey view (Figure 6).

Figure 6 – Traits column in the Intel Advisor Survey view

Think of Traits as what the compiler needed to do to vectorize your loop. In the latest vector instructions sets, such as Intel® AVX-512, there are many new instructions and idioms the compiler can use to vectorize your code. Techniques like register masking and compress instructions, shown in Figure 6, do allow applications to vectorize when this was not previously possible—but sometimes at a cost. Anything the compiler needed to do to get your data structures to fit in a vector (such as memory manipulation) will often appear in the Traits column. These Traits often indicate a problem that you can explore with Memory Access Pattern analysis.

Helpful Optimization Features

Roofline Analysis

A Roofline chart is a visual representation of application performance in relation to hardware limitations, including memory bandwidth and computational peaks. It was first proposed by researchers at the University of California at Berkeley in the 2008 paper “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” In 2014, this model was extended by researchers at the Technical University of Lisbon in a paper called “Cache-Aware Roofline Model: Upgrading the Loft.” Traditionally, Roofline charts have been calculated and plotted manually. But Intel Advisor now automatically builds Roofline plots.

The Roofline provides insight into:

Where your performance bottlenecks are
How much performance is left on the table because of them
Which bottlenecks are possible to address, and which ones are worth addressing
Why these bottlenecks are most likely occurring
What your next steps should be

Figure 7 – Roofline analysis

The horizontal lines in Figure 7 represent the number of floating-point or integer computations of a given type your hardware can perform in a given span of time. The diagonal lines represent how many bytes of data a given memory subsystem can deliver per second. Each dot is a loop or function in your program, with its position indicating its performance, which is affected by its optimization and its arithmetic intensity (AI).

Intel Advisor Integrated Roofline

The Integrated Roofline model offers a more detailed analysis, showing directly where the bottleneck comes from. Intel Advisor collects data for all memory types using cache simulation (Figure 8).

Figure 8 – Cache simulation in Intel Advisor

With this data, Intel Advisor counts the number of data transfers for a given cache level and computes the specific AI for each loop and each memory level. By observing the changes in this traffic from one level to another, and then comparing it to respective roofs representing the best possible bandwidths for these levels, it’s possible to pinpoint the memory hierarchy bottleneck for the kernel and determine optimization steps (Figure 9).

Figure 9 – Pinpointing the memory hierarchy bottleneck

Memory Access Pattern (MAP) Analysis

Intel Advisor MAP analysis gives you the deepest insight into how you’re accessing memory. Your memory access pattern affects both the efficiency of vectorization as well as how much memory bandwidth you can ultimately achieve. The MAP collection observes data accesses during execution and spots the instructions that contain the memory accesses. The data collected and analyzed appears in the Memory Access Patterns Report tab of the Refinement Reports window.

To run a MAP analysis from the GUI (Figure 10), you need to select loops using the checkboxes in the Survey report and run a MAP collection.

Figure 10 – Memory Access Patterns report

You can also run a MAP collection from the command-line. Use the -mark-up-list option to select
loops to be analyzed.

  advixe-cl -collect map -mark-up-
  list=Multiply.c:78,Multiply.c:71,Multiply.c:50,Multiply.c:6
  1 -project-dir C:/my_advisor_project -- my_application.exe

The Memory Access Patterns report provides information about the types of strides observed in the memory access operations during loop execution. The tool reports both unit/uniform and constant strides (Figure 11).

Unit/Uniform Stride Types

Unit stride (stride 1) instruction accesses memory that consistently changes by one element from iteration to iteration.
Uniform stride 0 instruction accesses the same memory from iteration to iteration.
Constant stride (stride N) instruction accesses memory that consistently changes by N elements (N>1) from iteration to iteration.

Variable Stride Types

Irregular stride instruction accesses memory addresses that change by an unpredictable number of elements from iteration to iteration.
Gather (irregular) stride is detected for v(p)gather* instructions on AVX2 Instruction Set Architecture.

Figure 11 – Stride types

Double-click any line in the Memory Access Patterns report tab to see the selected operation’s source code (Figure 12).

The Source and Details views (Figure 13) both give you insights into another key Intel Advisor memory feature, Memory Footprint.

Memory Footprint Analysis

Memory Footprint is basically the range of memory a given loop accesses. This footprint can be a key indicator of your memory bandwidth. If the range is very large, then you might not be able to fit in cache. Optimization strategies such as cache blocking can make a big difference in these cases. Intel Advisor has three different memory footprint metrics (Figure 14).

Figure 12 – See the selected operation’s source code

Figure 13 – Details view

Figure 14 – Memory footprint metrics

Two basic footprint metrics represent just some aspects of your memory footprint. These metrics are collected by default in Memory Access Pattern (MAP) analysis:

Max Per-Instruction Address Range represents the maximum distance between minimum and maximum memory address values accessed by instructions in this loop. For each memory access instruction, the minimum and maximum address of its access is recorded and the maximum range of this address for all loop instructions is reported. It covers more than one loop instance, with some filtering, which is why sometimes Intel Advisor is less confident in this metric and reports it in gray.
First Instance Site Footprint is a more accurate memory footprint, since it’s aware of overlaps in address ranges in the loop iterations and gaps between address ranges accessed by the loop, but is calculated only for the first instance (call) of this loop.

There’s a more advanced footprint calculated based on cache simulation, called the Simulated Memory Footprint. This metric shows the summarized and overlap-aware picture across all loop instances, but limited to one thread. It is calculated as the number of unique cache lines accessed during cache simulation multiplied by cache line size. To enable it in the GUI, select the Enable CPU cache simulation checkbox in the Memory Access Patterns tab of the Project Properties, and select Model Cache Misses and Loop Footprint Simulation Mode in the dropdown list (Figure 15). Then select the loops of interest with the checkboxes in the Survey view and run a MAP analysis.

Figure 15 – Simulated Memory Footprint

To enable in the command-line, you need to use the MAP command, as previously specified, with these options: -enable-cache-simulation and -cachesim-mode=footprint.

  advixe-cl -collect map -mark-up-
  list=tiling_inter.cpp:56,output.c:1073 -enable-cache-
  simulation -cachesim-mode=footprint -project-dir
  C:\my_advisor_project -- my_application.exe

You can see the results of the analysis in the Intel Advisor GUI Refinement Report view (Figure 16). The more detailed cache-related metrics―like the total number of memory loads, stores, cache misses, and cache-simulated memory footprint―allow a more detailed study of loop behavior with respect to memory. Table 2 shows Intel Advisor footprint metrics applicability, limitations, and relevance for analyzing different types of codes.

Figure 16 – Intel Advisor GUI Refinement Report view

Table 2. Intel Advisor footprint metrics

	Max Per- Instruction Address Range	First Instance Site Footprint	Simulated Memory Footprint
Threads analyzed for the loop/site	1	1	1
Loop instances analyzed	All instances, but with some shortcuts	1, only first instance	Depends on loop-call-count limit option
Aware of address range overlap?	No	Yes	Yes
Suitable for codes with random memory access	No	No	Yes

A Real-World Example

Some of the most common problems in computational science require matrix multiplication. The list of domains that use matrices is almost endless, but artificial intelligence, simulation, and modeling are just a few examples. The sample algorithm below is a triply nested loop where we do a multiply and an add for each iteration. Besides being very computationally intensive, it also accesses a lot of memory. Let’s use Intel Advisor to see how much.

  for(i=0; i<msize; i++) {
    for(j=0; j<msize; j++) {
      for(k=0; k<msize; k++) {
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
    }
  }

Create a Baseline

The elapsed time was 53.94 seconds for our initial run. Figure 17 is a Cache-aware Roofline chart. The red dot is our main computational loop. It’s far below even DRAM bandwidth, and even farther below L1 bandwidth, which is the maximum bandwidth we’re trying to achieve. You can see the precise bandwidth we’re achieving at each level of the memory hierarchy using the Code Analytics tab for the loop (Figure 18).

Why is our performance so poor? How can we do better? These are questions Intel Advisor was designed to answer. First, we need to examine the Survey view (Figure 19) to see what’s going on and whether Intel Advisor has any recommendations. Intel Advisor has noted that we have an Inefficient memory access pattern, and also that the loop has not been vectorized because of an assumed dependency. To examine the memory access pattern, we can run a Memory Access Pattern (MAP) analysis (Figure 20).

Figure 17 – Cache-aware Roofline chart

Figure 18 – Data transfers and bandwidth

Figure 19 – Survey view

Figure 20 – Memory Access Pattern (MAP) analysis

Intel Advisor has detected a constant stride on our read access and a uniform stride of 0 for the write. The range of memory we’re accessing is 32MB, far bigger than any of our cache sizes (Figure 21). We can also see how well the caches are performing using the MAP report (Figure 22). We’re missing over 2,300 cache lines, so it’s no wonder performance is bad. But there are several ways we can fix this.

Figure 21 – Strides distribution

Figure 22 – MAP report

Step 1

Do a loop interchange so that we don’t need a constant stride and also don’t need to access memory over such a wide range. We can also vectorize the loop by including a pragma ivdep that informs the compiler that we don’t have a dependency that prevents vectorization.

  for(i=tidx; i<msize; i=i+numt) {
    for(k=0; k<msize; k++) {
  #pragma ivdep
      for(j=0; j<msize; j++) {
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
    }
  }

The elapsed time for our new run is 4.12 seconds, an improvement of more than12x. Why is our new performance so much better? First, let’s take a look at our Integrated Roofline chart (Figure 23). Each of the red circles represents the bandwidth of the corresponding memory hierarchy: L1, L2, L3, and DRAM. We can see that our computational loop’s L1 memory bandwidth, represented by the leftmost red circle, is now 95 GB/second. We can also use the Survey view (Figure 24) to see that we’re also now vectorized at 100% efficiency using AVX2 instructions.

Figure 23 – Integrated Roofline chart

Figure 24 – Survey view

Our MAP report (Figure 25) informs us that all our accesses are now unit stride and the maximum address
range is 16KB, well within the range of our cache size. Our cache is also performing much better (Figure 26).
We’ve dropped to 512 cache misses, down from 2,302. So we’re getting better performance, but we’re still
not near the peak.

Figure 25 – Map report

Figure 26 – Cache performance

Step 2

Implement cache-blocking so that our computations are over a smaller range of memory:

  for (i0 = ibeg; i0 < ibound; i0 +=mblock) { 
    for (k0 = 0; k0 < msize; k0 += mblock) { 
      for (j0 =0; j0 < msize; j0 += mblock) { 
        for (i = i0; i < i0 + mblock; i++) { 
          for (k = k0; k < k0 + mblock; k++) { 
  #pragma ivdep 
  #ifdef ALIGNED 
    #pragma vector aligned 
  #endif //ALIGNED 
          #pragma nounroll 
            for (j = j0; j < 10 + mblock; j++) { 
              c[i] [j] = c[i] [j] + a[i] [k] * b[k] [j]; 
            }
          }
        }
      }
    }
  }

In the above code, we’re adding three additional nested loops so that we do our computations in sections (or blocks). After we’re done with one block, we move to the next. The elapsed time of our cache-blocked case is 2.60 seconds, a 1.58x improvement from the previous run (Figure 27). Our loop’s L1 memory bandwidth is now 182 GB/second, much closer to the L1 roof. Our vectorization and striding have not changed, but we now have only 15 cache misses for our inner loop, and our address range has been reduced to 480 bytes (Table 3).

Figure 27 – Performance improvements

Table 3. Summary of results

Run	Time Elapsed, Seconds	Total GFlops	Memory Address Range	Cache Misses	Improvement (Time)
Baseline	53.94	0.32	32 MB	2,302	N/A (baseline)
Loop Interchange	4.19	4.17	16 KB	511	12.87x
Blocked	2.6	6.61	480 B	15	20.74x

Optimizing Memory Accesses

It’s crucial to optimize the memory accesses of your program. Understanding how your program is accessing memory, using a tool like Intel Advisor, can help you get the most out of your hardware. By using the Roofline and new Integrated Roofline features of Intel Advisor, you can visualize your memory bottlenecks. You can get even greater memory insight when you combine Roofline features with Memory Access Pattern analysis.

Related Resources

Special Thanks

This article is from The Parallel Universe, Intel’s quarterly magazine that helps you take your software development into the future with the latest tools, tips, and training to expand your expertise. Get it here.