
Neon Intrinsics for Optimized Math, Networking, and String Operations

12 Feb 2021 · CPOL · 9 min read
In this article, we explore multiple ways to incorporate Neon Intrinsics in your application.


Since 2015, Arm has maintained a repository of optimized routines on GitHub. These routines cover mathematical functions, networking helpers, and string manipulation, and they leverage Neon intrinsics and assembly code to run faster than generic implementations.

In this article, we’ll first take a tour of the optimized routines provided by Arm. Then, we’ll discuss the Neon intrinsics themselves and their performance characteristics. Finally, we’ll show how we can use Neon intrinsics to accelerate a custom routine in an application and give some practical guidance on structuring code amenable to vectorization.

A Quick SIMD Overview

Let’s start with a quick refresher on what a CPU does when executing a program. CPUs operate in a pipelined fashion, wherein each instruction of a program undergoes a series of stages before ultimately executing.

These stages start with an instruction fetch and decode. In the ideal scenario, the instruction would reside in the instruction cache or resolve from a branch prediction table. Otherwise, the instruction would be fetched from the memory mapped from the executable.

Afterwards, the registers referenced by the instruction are renamed to support out-of-order (OOO) execution. Depending on the instruction type, the instruction is dispatched to one of the available issue queues.

Finally, the instruction runs on the targeted execution unit, which is itself a pipeline. The execution unit is either an Arithmetic Logic Unit (ALU) pipeline for integer add/mul/div hardware, a pipeline for branch and condition logic, or a single instruction, multiple data (SIMD) and floating-point pipeline.

One common method of improving program throughput is to employ additional cores, effectively going wide on the available silicon. Many Arm CPUs employed outside of low-wattage microcontrollers have multiple cores available for multithreaded applications. In an ideal case, the speedup afforded is linear in the number of cores (that is, four available cores can at best produce a 4x speedup). However, such a multiplier is often idealistic, as disparate cores must be externally synchronized.

SIMD offers a completely orthogonal axis to multithreading to achieve increased throughput. In contrast to multithreading, synchronization isn’t a concern as SIMD instructions operate within the context of a single CPU core. The basic idea is that when dispatching an instruction, we can request that the operation occur on multiple data streams instead of just one.
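
To make this concrete, here is a minimal sketch (with hypothetical function names) contrasting a scalar loop with one that uses the Neon intrinsics covered later in this article to process four floats per instruction:

C++
#include <arm_neon.h>

// Scalar: one addition per iteration.
void add_scalar(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// SIMD: each vaddq_f32 adds four lanes at once
// (n is assumed to be a multiple of 4 for brevity).
void add_neon(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        float32x4_t va = vld1q_f32(a + i);      // load 4 floats
        float32x4_t vb = vld1q_f32(b + i);      // load 4 floats
        vst1q_f32(out + i, vaddq_f32(va, vb));  // add and store 4 floats
    }
}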

SIMD usage (also known as vectorization) is fully complementary to multithreading, and both techniques should be employed if maximum system throughput is desired.

Neon is the SIMD instruction set targeted specifically at Arm CPUs. The full list of Neon intrinsics available is provided in a searchable registry here. We’ll write some Neon code soon, but first, let’s survey the routines provided by Arm.

Using the Arm Optimized Routines

To begin, you’ll want to integrate the library of routines into your toolchain. This isn’t required if you are targeting Android or glibc on Linux, as the C-Runtime Library (CRT) already uses the routines in its implementation. For other toolchains, the suggested approach is to compile the library using the provided Makefile, and link the resulting library to your application.

By default, the Makefile targets AArch64 (ARM64), so if you wish to deploy the code to a 32-bit CPU, you’ll want to change the target ARCH variable in the config.mk file. Note that the repository actually contains three separate libraries corresponding to math routines, string-processing routines, and network routines (currently, this just contains an optimized checksum). These libraries can be compiled and linked separately.

To see the routines provided, refer to the following headers: mathlib.h, networking.h, and stringlib.h (note that the networking header filename lacks the "lib" suffix).

The math library implements the functions (exp, log, pow, sin, cos) for single and double precision.

The networking library provides a checksumming routine.

The string library provides the familiar memory routines: memcpy, memmove, memset, memchr, memrchr, and the string equivalents.

Because the code is generously MIT-licensed, reading the implementation is particularly instructive, especially for practitioners newer to the instruction set.

Options for Employing Vectorization

The routines Arm provides for use in the CRT operate on scalar or vector quantities. For example, invoking cosf(x) computes the cosine of a single value, but does so using SIMD instructions internally. Alternatively, the compiler can select the vector variants if auto-vectorization permits, or the vector variants (e.g. __v_cosf) can be invoked directly, as sketched below.
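
Here is a minimal sketch of calling both forms. It assumes mathlib.h from the optimized-routines repository is on the include path and that the library was built with the vector variants enabled; the exact name and vector type of the variant depend on the library version, so check mathlib.h for your build.

C++
#include <arm_neon.h>
#include <cstddef>
#include "mathlib.h" // from the optimized-routines repository

void cosine_array(const float* in, float* out, size_t n)
{
    size_t i = 0;
    // Vector variant: four cosines per call.
    // Assumes __v_cosf takes and returns a four-lane vector of floats.
    for (; i + 4 <= n; i += 4)
    {
        float32x4_t x = vld1q_f32(in + i);
        vst1q_f32(out + i, __v_cosf(x));
    }
    // Scalar routine for the remaining tail elements.
    for (; i < n; ++i)
        out[i] = cosf(in[i]);
}

In general, if we wish to employ SIMD instructions for other routines, we have three options.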

First, we could instruct the compiler to enable auto-vectorization and hope that our code is amenable to vectorization. For GCC, auto-vectorization is on by default when compiling with -O3. For other optimization levels, the flag -ftree-vectorize can be passed.
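
For instance, a simple loop like this hypothetical one, where every iteration is independent, is a good candidate for auto-vectorization:

C++
// Hypothetical example: compiled with -O3 (or -ftree-vectorize), the compiler
// can emit Neon instructions that process several lanes per iteration because
// the iterations are independent and the loop body is branch-free.
// Marking the pointers __restrict helps the compiler prove they don't alias.
void scale_add(float* __restrict out, const float* __restrict a,
               const float* __restrict b, float k, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}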

The benefit of relying on the compiler for vectorization is that the code remains maximally portable and works with every instruction set the compiler supports. Furthermore, the code will generally be free of inline assembly and intrinsics, which tends to make it easier to maintain.

While auto-vectorization is an active area of research that continues to progress, there are still a number of cases where it is not possible. For example, auto-vectorization often breaks down when compiling loops with inter-iteration dependencies, break statements, or complex branching conditions, as in the hypothetical example below. For more information on compiling for Neon with auto-vectorization, refer to this guide from Arm.
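
Here, the early exit makes the trip count depend on the data, which typically prevents the compiler from vectorizing the loop:

C++
// Hypothetical example of a loop that resists auto-vectorization: the break
// makes the trip count data-dependent, and the floating-point reduction can
// only be reassociated for vectorization under -ffast-math.
float sum_until_negative(const float* a, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
    {
        if (a[i] < 0.0f)
            break;       // data-dependent early exit
        sum += a[i];     // loop-carried dependence (reduction)
    }
    return sum;
}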

Second, we can use assembly, either as standalone code modules or as inline assembly. The available floating point and SIMD instructions are cataloged in this online reference. Compared to using intrinsics (the last option we’ll explore), direct assembly allows you to control register allocation and load/store alignment. Of the options, assembly is the least portable and the most difficult to maintain, but potentially the best-performing route.

Third, we can opt to write vectorized code using Neon intrinsics. Intrinsics look like function calls in source code, but each one is defined in terms of a particular instruction, and the resulting code still undergoes compiler optimization. So there is no guarantee that you get the exact instruction named in the documentation, only that you get something at least as efficient as the instruction the intrinsic maps to.

Compared to writing raw assembly, intrinsics operate directly on variables instead of registers. This means you can continue to let the compiler perform register allocation, and you can ignore the intricacies of function calling conventions. Thus, intrinsics afford more explicit vectorization than implicit auto-vectorization, less control than raw assembly, but also less effort than writing and maintaining raw assembly.

For many applications that demand performance, intrinsics are an ideal tradeoff between simplicity and efficiency. To get started with intrinsics, there are two guides that walk you through setting up and applying Neon intrinsics: one implements and benchmarks a dot product, and the other implements a 1D-signal convolution and threshold operation.
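
To give a flavor of what those guides build toward, here is a minimal dot-product sketch (not their exact code). It assumes n is a multiple of 4; vfmaq_f32 and vaddvq_f32 are AArch64 intrinsics.

C++
#include <arm_neon.h>

float dot_product_neon(const float* a, const float* b, int n)
{
    float32x4_t acc = vdupq_n_f32(0.0f);  // four running partial sums
    for (int i = 0; i < n; i += 4)
    {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        acc = vfmaq_f32(acc, va, vb);     // acc += va * vb, fused multiply-add
    }
    return vaddvq_f32(acc);               // horizontal add of the four lanes
}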

Simple Collision Detection with Neon Intrinsics

Some of the routines in the optimized-routines repository (such as cosf and logf) demonstrate using vector intrinsics to accelerate what is otherwise a scalar operation, that is, a function that accepts a single scalar argument.

Another common approach is structure-of-arrays (SoA)-style vectorization. With this approach, the algorithm itself does not change; instead, we use intrinsics to run the same algorithm on multiple lanes of data at once.

Consider the following simple collision detection routine between two circles:

C++
struct circle
{
    float radius;
    float center_x;
    float center_y;
};

bool does_collide(circle const& c1, circle const& c2)
{
    // Two circles collide if the distance from c1 to c2 is less
    // than the sum of their radii, or equivalently if the squared
    // distance is less than the square of the radii sum.
    float dx = c1.center_x - c2.center_x;
    float dy = c1.center_y - c2.center_y;
    float d2 = dx * dx + dy * dy;
    float r_sum = c1.radius + c2.radius;
    float r2 = r_sum * r_sum;

    return d2 < r2;
}

/* Disassembly 
        ldr     s0, [x0, 4]
        ldr     s1, [x1, 4]
        fsub    s0, s0, s1
        ldr     s2, [x0, 8]
        ldr     s1, [x1, 8]
        fsub    s2, s2, s1
        ldr     s1, [x0]
        ldr     s3, [x1]
        fmul    s0, s0, s0
        fmul    s2, s2, s2
        fadd    s0, s0, s2
        fadd    s1, s1, s3
        fmul    s1, s1, s1
        fcmpe   s0, s1
        cset    w0, mi
        ret

*/

One way to accelerate this is to notice that several of these operations can be performed in parallel and vectorize like so:

C++
#include <arm_neon.h> // assume this is included for snippets below

bool does_collide_neon(circle const& c1, circle const& c2)
{
    // Pack the circle centers into registers with 2 float lanes
    // Note that while unaligned loads into SIMD registers are supported,
    // you are responsible for ensuring that the struct packing and layout
    // is done in a way that leaves the register contents well-defined
    float32x2_t c1_center = vld1_f32(&c1.center_x);
    float32x2_t c2_center = vld1_f32(&c2.center_x);

    // Compute the deltas and square them
    float32x2_t d = vsub_f32(c1_center, c2_center);
    float32x2_t dxd = vmul_f32(d, d);

    float d2 = vpadds_f32(dxd);

    float r_sum = c1.radius + c2.radius;
    float r_sum2 = r_sum * r_sum;
    return d2 < r_sum2;
}

/* Disassembly
        ldr     d0, [x0, 4]
        ldr     d1, [x1, 4]
        fsub    v0.2s, v0.2s, v1.2s
        fmul    v0.2s, v0.2s, v0.2s
        faddp   s0, v0.2s
        ldr     s1, [x0]
        ldr     s2, [x1]
        fadd    s1, s1, s2
        fmul    s1, s1, s1
        fcmpe   s0, s1
        cset    w0, mi
        ret

*/

Above, we vectorize the implementation by noticing we can parallelize the subtraction and multiplication operations when computing the squared-distance.

However, the function above may not be faster than our original implementation. Any throughput gain is impeded by the memory layout of our input, which requires a number of instructions to pack the vector registers. Furthermore, we’re only able to perform two data-parallel operations (a sub and a mul) before needing to perform a cross-lane operation.

Summarizing, the declaration of the circle struct implies that the data is interleaved, which inhibits vectorization.

An alternative approach is to reconsider our memory layout, performing the deinterleaving up front. Suppose we know that, more often than not, we want to test one circle against a collection of other circles. As a motivating example, imagine you have an aiming reticle with a certain radius and you want to see which bounding circles intersect with the aiming reticle. Here’s how we might accelerate this with intrinsics:

C++
struct circles
{
    size_t size;
    // When allocating the arrays below, always round up to a multiple of 4.
    float* radii;
    float* center_xs;
    float* center_ys;
};

// Check if collider collides with each circle within input
// out should point to an array of input.size booleans, also rounded up to a
// multiple of 4, since results are written four at a time
void does_collide_neon_soa(circles const& input, circle const& collider, bool* out)
{
    // Duplicate the collider properties in 3 separate 4-lane vector registers
    float32x4_t c1_x = vdupq_n_f32(collider.center_x);
    float32x4_t c1_y = vdupq_n_f32(collider.center_y);
    float32x4_t c1_r = vdupq_n_f32(collider.radius);

    for (size_t offset = 0; offset < input.size; offset += 4)
    {    
        // Perform 4 collision tests at a time
        float32x4_t x = vld1q_f32(input.center_xs + offset);
        float32x4_t y = vld1q_f32(input.center_ys + offset);

        float32x4_t dx = vsubq_f32(c1_x, x);
        float32x4_t dy = vsubq_f32(c1_y, y);
        float32x4_t dx2 = vmulq_f32(dx, dx);
        float32x4_t dy2 = vmulq_f32(dy, dy);
        float32x4_t d2 = vaddq_f32(dx2, dy2);

        float32x4_t r = vld1q_f32(input.radii + offset);
        float32x4_t r_sum = vaddq_f32(c1_r, r);
        float32x4_t r_sum2 = vmulq_f32(r_sum, r_sum);
        uint32x4_t mask = vcltq_f32(d2, r_sum2);

        // Unpack each lane; masking with 1 avoids relying on the implicit
        // uint32_t (all-ones) to bool conversion
        out[offset] = 1 & vgetq_lane_u32(mask, 0);
        out[offset + 1] = 1 & vgetq_lane_u32(mask, 1);
        out[offset + 2] = 1 & vgetq_lane_u32(mask, 2);
        out[offset + 3] = 1 & vgetq_lane_u32(mask, 3);
    }
}

Here, instead of trying to accelerate a single collision computation, we simply deinterleave the data upfront, and then perform the same algorithm as before, only this time, colliding one circle against four at a time. Compared to our first attempt at using Neon intrinsics, this attempt no longer pays the hefty cost of copying memory to pack registers and performs the majority of operations in a vectorized fashion.
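
The deinterleaving pass itself can also be vectorized. Here is a hedged sketch of a hypothetical helper that converts a tightly packed array of the circle struct above into the circles layout, assuming count and the destination arrays have been rounded up to a multiple of 4 as the struct comment suggests. vld3q_f32 loads twelve consecutive floats and splits the interleaved radius/center_x/center_y fields into three four-lane registers.

C++
// Hypothetical helper: deinterleave an array of circle into the SoA layout.
// Assumes circle is tightly packed (three floats, no padding) and that count
// and the destination arrays are rounded up to a multiple of 4.
void deinterleave_circles(circle const* src, size_t count, circles& dst)
{
    for (size_t i = 0; i < count; i += 4)
    {
        // Load 4 circles (12 floats) and separate the interleaved fields.
        float32x4x3_t c = vld3q_f32(&src[i].radius);
        vst1q_f32(dst.radii + i,     c.val[0]);
        vst1q_f32(dst.center_xs + i, c.val[1]);
        vst1q_f32(dst.center_ys + i, c.val[2]);
    }
    dst.size = count;
}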

When profiling the three does_collide variants above, each function was decorated with the GCC attribute noinline to inhibit the auto-vectorization that can occur when the code is inlineable. This is more indicative of a real-world function, but you are encouraged to benchmark inlined scenarios as well, since inlining interacts with register allocation and auto-vectorization in the calling context. Here is a table summarizing the results:

Function (16,384 circle-circle tests)   Time per invocation   Speedup
does_collide                            2.724 ns              1x
does_collide_neon                       2.717 ns              1.003x
does_collide_neon_soa                   0.925 ns              2.945x

For each test above, the function in the left column was used to perform 16,384 collision tests over 100,000 trials to compute the time-per-invocation in the center column. In all cases, the code was compiled with -O3 and run on a Samsung S20. As you can see, with the non-SoA Neon implementation, the speedup is nominal. However, restructuring the data in SoA form yields an impressive nearly 3x speedup.

Wrapping Up

In this article, we explored multiple ways to incorporate Neon intrinsics in your application. The first method to consider is using the existing pre-optimized routines. If your toolchain doesn’t already incorporate Arm’s optimized-routines (as Android and glibc-based Linux toolchains do), you should seriously consider integrating the library into your project.

If restructuring your data isn’t an option, there are often opportunities to vectorize a function’s implementation, assuming auto-vectorization hasn’t taken place. This method is more intrusive to the structure of the algorithm, but is transparent to the callers of the function.

If restructuring data is an option, it lets you vectorize an algorithm without modifying it, by replicating the original computation across multiple lanes. Navigating all the available options is undoubtedly an investment, requiring a clear understanding of the various tradeoffs as well as benchmarking. However, the payoff in terms of throughput and energy efficiency is compelling.

For further reading, consult the Neon intrinsics registry, the Arm optimized-routines repository, and the Neon programming guides referenced above.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Technical Lead WB Games
United States
Jeremy is a Principal Engineer at WB Games. He's worked throughout the game engine tech stack, touching everything from rendering and animation, to gameplay scripting and virtual machines, to netcode and server code. He's most passionate about the boundary between applied mathematics and computer science, and you'll often find him puzzling over one or the other in roughly equal parts. When he's not coding, Jeremy is probably spending time with his wife and dog, climbing, enjoying a chess game, or some combination of the above.
