Tuning Autonomous Driving Using Intel® System Studio

Intel

Aug 31, 2017

CPOL

Intel® GO™ SDK Offers Automotive Solution Developers an Integrated Solutions Environment

Download a free trial of Intel System Studio

Lavanya Chockalingam, Software Technical Consulting Engineer, Intel Corporation

The Internet of Things is a collection of smart devices connected to the cloud. "Things" can be as small and simple as a connected watch or a smartphone, or they can be as large and complex as a car. In fact, cars are rapidly becoming some of the world’s most intelligent connected devices, using sensor technology and powerful processors to sense and continuously respond to their surroundings. Powering these cars requires a complex set of technologies:

Sensors that pick up LIDAR, sonar, radar, and optical signals
A sensor fusion hub that gathers millions of data points
A microprocessor that processes the data
Machine learning algorithms that require an enormous amount of computing power to make the data intelligent and useful

Successfully realizing the enormous opportunities of these automotive innovations has the potential to not only change driving but also to transform society.

Intel® GO™ Automotive Software Development Kit (SDK)

From car to cloud, and the connectivity in between, there is a need for automated driving solutions that include high-performance platforms, software development tools, and robust technologies for the data center. With Intel GO automotive driving solutions, Intel brings its deep expertise in computing, connectivity, and the cloud to the automotive industry.

Autonomous driving on a global scale takes more than high-performance sensing and computing in the vehicle. It requires an extensive infrastructure of data services and connectivity. This data will be shared with all autonomous vehicles to continuously improve their ability to accurately sense and safely respond to surroundings. To communicate with the data center, infrastructure on the road, and other cars, autonomous vehicles will need high-bandwidth, reliable two-way communication along with extensive data center services to receive, label, process, store, and transmit huge quantities of data every second. The software stack within autonomous driving systems must be able to efficiently handle demanding real-time processing requirements while minimizing power consumption.

The Intel GO automotive SDK helps developers and system designers maximize hardware capabilities with a variety of tools:

Computer vision, deep learning, and OpenCL™ toolkits to rapidly develop the necessary middleware and algorithms for perception, fusion, and decision-making
Sensor data labeling tool for the creation of "ground truth" for deep learning training and environment modeling
Autonomous driving-targeted performance libraries, leading compilers, performance and power analyzers, and debuggers to enable full stack optimization and rapid development in a functional safety compliance workflow
Sample reference applications, such as lane change detection and object avoidance, to shorten the learning curve for developers

Intel® System Studio

Intel also provides software development tools that help accelerate time to market for automated driving solutions. Intel System Studio is a comprehensive, integrated tool suite of compilers, performance libraries, power and performance analyzers, and debuggers that helps developers maximize hardware capabilities while speeding the pace of development. Its system tools and technologies help accelerate the delivery of next-generation, power-efficient, high-performance, and reliable embedded and mobile devices. This includes tools to:

Build and optimize your code
Debug and trace your code to isolate and resolve defects
Analyze your code for power, performance, and correctness

Build and Optimize Your Code

Intel® C++ Compiler: A high-performance, optimized C and C++ cross-compiler that can offload compute-intensive code to Intel® HD Graphics.
Intel® Math Kernel Library (Intel® MKL): A set of highly optimized linear algebra, fast Fourier transform (FFT), vector math, and statistics functions.
Intel® Threading Building Blocks (Intel® TBB): C++ parallel computing templates to boost embedded system performance.
Intel® Integrated Performance Primitives (Intel® IPP): A software library that provides a broad range of highly optimized functionality including general signal and image processing, computer vision, data compression, cryptography, and string manipulation.

Debug and Trace Your Code to Isolate and Resolve Defects

Intel® System Debugger: Includes a System Debug feature for source-level debugging of OS kernel software, drivers, and firmware, plus a System Trace feature. System Trace is an Eclipse* plug-in that accesses the Intel® Trace Hub and provides advanced SoC-wide instruction and data event tracing through its trace viewer.
GNU* Project Debugger (GDB): An Intel-enhanced GDB for debugging applications natively and remotely on Intel® architecture-based systems.

Analyze Your Code for Power, Performance, and Correctness

Intel® VTune™ Amplifier: This software performance analysis tool is for users developing serial and multithreaded applications.
Intel® Energy Profiler: A platform-wide energy consumption analyzer of power-related data collected on a target platform using the SoC Watch tool.
Intel® Performance Snapshot: Provides a quick, simple view into performance optimization opportunities.
Intel® Inspector: A dynamic memory and threading error-checking tool for users developing serial and multithreaded applications on embedded platforms.
Intel® Graphics Performance Analyzers: Real-time, system-level performance analyzers to optimize CPU/GPU workloads.

Optimizing Performance

Advanced Hotspot Analysis

Matrix multiplication is a commonly used operation in autonomous driving. Intel System Studio tools, mainly the performance analyzers and libraries, can help maximize its performance. Consider this example of a very naïve implementation of matrix multiplication using three nested for-loops:

void multiply0(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM], TYPE t[][NUM])
{
    int i, j, k;

    // Basic serial implementation
    for (i = 0; i < msize; i++) {
        for (j = 0; j < msize; j++) {
            for (k = 0; k < msize; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
}

Advanced hotspot analysis is a fast and easy way to identify performance-critical code sections (hotspots). Intel VTune Amplifier periodically samples the instruction pointer to identify the code locations where an application spends the most time. A function may consume a lot of time either because its code is slow or because it is called frequently. Either way, speeding up such functions should have a big impact on overall application performance.

Running an advanced hotspot analysis on the previous matrix multiplication code using Intel VTune Amplifier shows a total elapsed time of 22.9 seconds (Figure 1). Of that time, the CPU was actively executing for 22.6 seconds. The CPI rate (cycles per instruction) of 1.142 is flagged as a problem. Modern superscalar processors can issue four instructions per cycle, suggesting an ideal CPI of 0.25, but various effects in the pipeline (such as long-latency memory instructions, branch mispredictions, or instruction starvation in the front end) tend to increase the observed CPI. A CPI of one or less is considered good, but different application domains will have different expected values. In our case, we can analyze the application further to see whether the CPI can be lowered. Intel VTune Amplifier’s advanced hotspot analysis also lists the top five hotspot functions to consider for optimization.
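As a rough illustration of the arithmetic behind this metric, the C sketch below computes CPI from cycle and instruction counts, along with the best-case CPI for a given issue width. The function names are hypothetical and are not part of any Intel VTune Amplifier API; the tool derives these values from hardware counters itself.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles per instruction: total cycles divided by retired instructions.
 * Lower is better. */
static double cpi(uint64_t cycles, uint64_t instructions)
{
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}

/* Best-case CPI for a core that can issue `width` instructions per cycle. */
static double ideal_cpi(unsigned width)
{
    assert(width > 0);
    return 1.0 / (double)width;
}
```

With an issue width of four, ideal_cpi(4) is 0.25, so the measured CPI of 1.142 is roughly 4.6 times the best case.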

Figure 1. Elapsed time and top hotspots before optimization

CPU Utilization

As shown in Figure 2, analysis of the original code indicates that only one of the 88 logical CPUs is being used. This means there is significant room for performance improvement if we can parallelize this sample code.

Figure 2. CPU usage

Parallelizing the sample code as shown below gives an immediate 12x speedup (Figure 3). The CPI has also dropped below 1, which is a significant improvement.

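void multiply1(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM], TYPE t[][NUM])
{
    int i, j, k;

    // Basic parallel implementation.
    // Only i is made private automatically; j and k are declared outside the
    // loop nest, so they must be listed as private to avoid a data race.
    #pragma omp parallel for private(j, k)
    for (i = 0; i < msize; i++) {
        for (j = 0; j < msize; j++) {
            for (k = 0; k < msize; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
}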

Figure 3. Performance improvement from parallelization

General Exploration Analysis

Once you have used Basic Hotspots or Advanced Hotspots analysis to determine hotspots in your code, you can perform General Exploration analysis to understand how efficiently your code is passing through the core pipeline. During General Exploration analysis, Intel VTune Amplifier collects a complete list of events for analyzing a typical client application. It calculates a set of predefined ratios used for the metrics and helps identify hardware-level performance problems. Superscalar processors can be conceptually divided into the front end (where instructions are fetched and decoded into the operations that constitute them) and the back end (where the required computation is performed). General Exploration analysis estimates how each pipeline slot was used and divides all pipeline slots into four categories:

Pipeline slots containing useful work that issued and retired (retired)
Pipeline slots containing useful work that issued and canceled (bad speculation)
Pipeline slots that could not be filled with useful work due to problems in the front end (front-end bound)
Pipeline slots that could not be filled with useful work due to a backup in the back end (back-end bound)
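
Because every pipeline slot lands in exactly one of these categories, the four percentages add up to 100. A minimal C sketch of that bookkeeping follows. The struct and function names are hypothetical, and in practice the slot counts come from hardware event counters that Intel VTune Amplifier collects and converts for you.

```c
#include <assert.h>

/* Hypothetical slot counts for the four top-level categories. */
struct slot_counts {
    double retired;
    double bad_speculation;
    double front_end_bound;
    double back_end_bound;
};

/* Convert raw counts into fractions of all pipeline slots (each in 0..1). */
static struct slot_counts slot_fractions(struct slot_counts s)
{
    double total = s.retired + s.bad_speculation
                 + s.front_end_bound + s.back_end_bound;
    assert(total > 0.0);

    struct slot_counts f = {
        s.retired / total,
        s.bad_speculation / total,
        s.front_end_bound / total,
        s.back_end_bound / total,
    };
    return f;
}
```

A large back-end-bound fraction, for example, means most slots were stalled waiting on memory or execution resources rather than lost to fetch problems or mispredicted branches.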

Figure 4 shows the results of running a general exploration analysis on the parallelized example code using Intel VTune Amplifier. Notice that 77.2 percent of pipeline slots are blocked by back-end issues. Drilling down into the source code shows where these back-end issues occur (Figure 5, 49.4 + 27.8 = 77.2 percent back-end bound). The memory-bound and L3-bound metrics are both very high. The memory-bound metric shows how memory subsystem issues affect performance, and the L3-bound metric shows how often the CPU stalled on the L3 cache. Avoiding these cache misses (L2 misses that hit in L3) reduces latency and increases performance.

Figure 4. General exploration analysis results

Figure 5. Identifying issues

Memory Access Analysis

Intel VTune Amplifier’s Memory Access analysis identifies memory-related issues, such as NUMA (non-uniform memory access) problems and bandwidth-limited accesses, and attributes performance events to memory objects (data structures). It derives this attribution by instrumenting memory allocations and deallocations and by resolving static and global variables from symbol information.

By selecting the Function/Memory Object/Allocation Stack grouping (Figure 6), you can identify the memory objects that are affecting performance. Of the three objects listed in the multiply1 function, one has a very high latency of 82 cycles. Double-clicking this object takes you to the source code, which shows that array "b" has the highest latency. This is because the innermost loop (over k) reads b[k][j] down a column, while C stores arrays in row-major order, so consecutive accesses are a full row apart in memory. Interchanging the j and k loops makes the innermost loop walk along a row of "b" instead, which reduces the latency and results in better performance (Figure 7).
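
The interchange can be sketched in C as follows. This is an illustration under assumptions, not the measured code: N, multiply_ijk, and multiply_ikj are hypothetical names, and the matrices are plain double arrays. Both functions accumulate into c, so c must be zeroed before the call.

```c
#include <assert.h>

#define N 64  /* hypothetical matrix dimension for this sketch */

/* Original order (i, j, k): the inner loop reads b[k][j] with k varying,
 * so each access jumps a full row (N doubles) ahead in memory. */
static void multiply_ijk(int n, double a[][N], double b[][N], double c[][N])
{
    assert(n >= 0 && n <= N);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Interchanged order (i, k, j): the inner loop reads b[k][j] and writes
 * c[i][j] with j varying, so both are walked contiguously along a row. */
static void multiply_ikj(int n, double a[][N], double b[][N], double c[][N])
{
    assert(n >= 0 && n <= N);
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c[i][j] += a[i][k] * b[k][j];
}
```

Each c[i][j] still receives its terms in the same k order, so the two versions produce identical results; only the memory access pattern changes.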

Figure 6. Identifying memory objects affecting performance

Figure 7. General exploration analysis results

We can see that although the sample is still back-end-bound, it is no longer memory-bound; it is now only core-bound. Both a shortage of hardware compute resources and dependencies among the software’s instructions fall under core-bound. This tells us that the machine may have run out of out-of-order resources, that certain execution units are overloaded, or that dependencies in the program’s data or instruction flow are limiting performance. In this case, vector capacity usage is low, which indicates that floating-point scalar or vector instructions are using only part of the available vector capacity. This can be addressed by vectorizing the code.
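
One possible way to encourage vectorization, sketched in C under the assumption that the loops are already in i, k, j order and that the matrices do not overlap in memory, is to mark the unit-stride inner loop with an OpenMP SIMD directive. The names N and multiply_simd are hypothetical. Whether the loop is actually vectorized depends on the compiler and its options, and the pragma is ignored unless OpenMP or OpenMP SIMD support is enabled.

```c
#include <assert.h>

#define N 64  /* hypothetical matrix dimension for this sketch */

/* Accumulates a * b into c; c must be zeroed by the caller.
 * Assumes a, b, and c are distinct, non-overlapping arrays. */
static void multiply_simd(int n, double a[][N], double b[][N], double c[][N])
{
    assert(n >= 0 && n <= N);
    for (int i = 0; i < n; i++) {
        double *restrict ci = c[i];            /* row i of the result */
        for (int k = 0; k < n; k++) {
            const double aik = a[i][k];        /* invariant in the inner loop */
            const double *restrict bk = b[k];  /* row k of b, read contiguously */

            /* Hint that iterations over j are independent and can be
             * executed in vector lanes. */
            #pragma omp simd
            for (int j = 0; j < n; j++)
                ci[j] += aik * bk[j];
        }
    }
}
```

Hoisting a[i][k] and the row pointers out of the inner loop leaves a simple multiply-add over contiguous data, which is the shape compilers generally vectorize most readily.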

Another optimization option is to use Intel Math Kernel Library, which offers highly optimized and threaded implementations of many mathematical operations, including matrix multiplication. The dgemm routine multiplies two double-precision matrices:

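void multiply5(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM], TYPE t[][NUM])
{
    // Computes c = alpha * a * b + beta * c; beta = 0.0 overwrites c.
    double alpha = 1.0, beta = 0.0;

    // Row-major layout, no transposition. a is passed first so that the
    // product matches c = a * b from the loop-based versions.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                NUM, NUM, NUM,
                alpha, (const double *)a, NUM,
                (const double *)b, NUM,
                beta, (double *)c, NUM);
}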

Performance Analysis and Tuning for Image Resizing

Image resizing is a commonly used operation in the autonomous driving space. For example, we ran Intel VTune Amplifier’s advanced hotspots analysis on an open-source OpenCV* version of image resize (Figure 8). The elapsed time is 0.33 seconds, and the top hotspot is the cv::HResizeLinear function, which consumes 0.19 seconds of the total CPU time.

Figure 8. Advanced hotspot analysis

Intel IPP offers developers highly optimized, production-ready building blocks for image processing, signal processing, and data processing (data compression/decompression and cryptography) applications. These building blocks are optimized using the Intel® Streaming SIMD Extensions (Intel® SSE) and Intel® Advanced Vector Extensions (Intel® AVX, Intel® AVX2) instruction sets. Figure 9 shows the analysis results for an image resize that takes advantage of Intel IPP. The elapsed time has gone down by a factor of two, and since only one core is currently used, there is an opportunity for further performance improvement through parallelism with Intel Threading Building Blocks.

Figure 9. Checking code with Intel® VTune™ Amplifier

Conclusion

The Intel System Studio tools in the Intel GO SDK give automotive solution developers an integrated development environment with the ability to build, debug and trace, and tune the performance and power usage of their code. This helps both system and embedded developers meet some of their most daunting challenges:

Accelerate software development to bring competitive automated driving cars to market faster
Quickly target and help resolve defects in complex automated driving (AD), advanced driver assistance systems (ADAS), or software-defined cockpit (SDC) systems
Help speed performance and reduce power consumption

This is all provided in one easy-to-use software package.

Learn more

Intel® System Studio