
Porting Intel Intrinsics to Arm Neon Intrinsics

4 May 2021 · CPOL · 14 min read
In this article, we look at approaches for porting non-portable x86 SSE code to Arm Neon.

This article is a sponsored article. Articles such as these are intended to provide you with information on products and services that we consider useful and of value to developers.

If you are in the business of maintaining code accelerated by SSE intrinsics on Intel and AMD platforms, you have likely looked into how to best port SSE code to Arm devices. Years ago, x86-targeted and Arm-targeted assembly code was neatly partitioned along usage boundaries. In particular, x86 code typically ran in desktop and server environments, while Arm code typically ran on edge devices and mobile hardware.

With the advent of Windows on Arm, Apple's M1-based Macs, and other platforms, the lines between x86 and Arm usage scenarios have started to blur, and it is increasingly important to support both. While both Microsoft and Apple provide x86 emulation when running their respective operating systems on Arm, your program will likely suffer reduced performance and thermal efficiency compared to a native port.

Unfortunately, transitioning from x86 to Arm or vice versa can be difficult, depending on the usage of non-portable code. This article aims to cover a few different approaches to this task, specifically for non-portable x86 SSE code, porting some sample code in the process.

Porting Intrinsics and Performance

First, let us restrict the scope of the article somewhat. We will focus specifically on porting SSE intrinsics (used on Intel and AMD hardware) to Neon intrinsics (targeting Arm’s SIMD instruction set). That is, we will not be covering the underlying assembly that the intrinsics ultimately compile to. While eventually learning to read assembly is important for the low-level programmer, getting started with intrinsics first makes learning the assembly relatively straightforward with tools like the Compiler Explorer.

As an aside, if you have worked with Neon intrinsics on much older versions of GCC and felt that the compiled output was lackluster, it is worth giving them another try, as the instruction generation of compiler backends for Arm has generally improved.

Also, we will not cover in great detail the performance characteristics of your code after a port, except to point out some things to avoid during a port for performant code. This may seem like a serious omission, but proper coverage in this regard is quite difficult.

For optimizing low-level code on x86, researchers have been able to microbenchmark instruction performance down to the micro-op dispatch level (see the famous uops study). In comparison, Arm’s instruction set is available in a plethora of fabricated chips with distinct performance characteristics and optimization guidelines. After your initial port, aside from conducting benchmarks, it is recommended that you refer to Arm’s optimization manuals for the specific chips you intend to target. For example, here is the optimization guide for the Cortex-A78.

Intrinsics Refresher

As a quick reminder, SSE intrinsics look like the following:

C++
#include <xmmintrin.h>

__m128 mul(__m128 a, __m128 b)
{
    return _mm_mul_ps(a, b);
}

This simple snippet defines a function mul which takes two 128-bit vectors as arguments, multiplies them lane-wise, and returns the result.

Intrinsics are popular because they let the compiler assist the programmer. In particular, when code is expressed as intrinsics instead of raw assembly, the compiler retains responsibility for register allocation and for honoring calling conventions across function call boundaries, and it can often optimize the generated code further, just as the optimizer does for ordinary code.

In contrast to the SSE snippet above, the same function with Neon intrinsics looks like the following:

C++
#include <arm_neon.h>

float32x4_t mul(float32x4_t a, float32x4_t b)
{
    return vmulq_f32(a, b);
}

This snippet rhymes closely with the SSE snippet, albeit with some important differences.

First, note the specification of the input arguments and output result as a float32x4_t instead of a __m128 type. Unlike SSE register types, Neon register types lead with the component type, followed by the component's bit width and the lane count.

Now, suppose we wanted to port code that operated on 128-bit integer registers instead, say holding four 32-bit signed integers. The Neon type describing four 32-bit integers, and thus corresponding to this use of SSE's __m128i, is int32x4_t. What about the Neon type corresponding to SSE's __m128d? In this case, the 128-bit register contains two 64-bit floats, so we might expect the Neon type to be float64x2_t, and indeed this is the case!

The important thing to remember here is that SSE types describe the width of the entire vector register, while Neon types describe the width of each component and the component count.
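
To make the correspondence concrete, here is a short reference sketch mapping the common 128-bit SSE types to their Neon counterparts (the variable names are only for illustration):

C++
#include <arm_neon.h>

// Neon counterparts of the common 128-bit SSE register types:
float32x4_t f32x4; // __m128  : four 32-bit floats
float64x2_t f64x2; // __m128d : two 64-bit doubles (AArch64)
int32x4_t   s32x4; // __m128i : four 32-bit signed integers (among other layouts)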

Another important distinction between SSE and Neon types is the treatment of unsigned quantities. In this regard, Neon offers a bit more type safety, encoding the signed versus unsigned nature of the data in the type itself by offering register types such as uint32x4_t and int32x4_t. In contrast, SSE offers only the single type __m128i to store integer data, signed and unsigned alike.

For SSE programmers, if we want the data to be treated as unsigned, we must choose the appropriate intrinsic function, picking the _epu* suffix over _epi* to treat the operands as unsigned integral data. Neon, however, enforces this at the type level: the signedness is part of both the vector type and the intrinsic suffix (_u32 versus _s32), and programmers need to perform conversions explicitly where necessary. A nice property of this organization is that accidentally mixing signed and unsigned data produces a compile error rather than a silent reinterpretation.

Furthermore, if a particular overload is not supported, the compiler will provide a helpful error message as in the following snippet.

C++
#include <arm_neon.h>

uint32x4_t sat_add(uint32x4_t a, uint32x4_t b)
{
    return vqaddq_u32(a, b);
}

int32x4_t sat_add(int32x4_t a, int32x4_t b)
{
    // Compile error! "cannot convert 'int32x4_t' to 'uint32x4_t'"
    return vqaddq_u32(a, b);
}

The snippet above uses the Arm-specific intrinsic vqaddq_u32 to add unsigned integers in vectorized fashion, saturating instead of overflowing. Note that A64 GCC will fail to compile the second function because vqaddq_u32 is defined only for unsigned types.
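
When signed data genuinely needs to flow into an unsigned operation (or vice versa), the conversion is an explicit, bit-for-bit reinterpretation. A minimal sketch using the vreinterpretq family (the function name is ours, purely for illustration):

C++
#include <arm_neon.h>

// Reinterpret the bits of a signed vector as unsigned before performing
// a saturating unsigned add. The cast compiles to no instructions; it
// only changes the type the compiler tracks.
uint32x4_t sat_add_mixed(int32x4_t a, uint32x4_t b)
{
    return vqaddq_u32(vreinterpretq_u32_s32(a), b);
}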

Neon intrinsic names also take some getting used to if you are accustomed to reading SSE intrinsics. SSE intrinsics are typically structured like so:

[width-prefix]_[op]_[return-type]
_mm_extract_epi32

For example, _mm_extract_epi32 denotes an intrinsic operating on 128-bit registers (denoted by the width prefix _mm) that performs an extract operation to produce a 32-bit signed value. The intrinsic _mm256_mul_ps performs a mul operation on packed single-precision floats in a 256-bit register.

In contrast, many Neon intrinsics have the following form:

[op][q]_[type]
vaddq_f64

The presence of the "q" in the intrinsic name indicates that the intrinsic accepts 128-bit registers (as opposed to 64-bit registers). Many of the op names will lead with a "v" meaning "vector."

For example, vaddq_f64 performs a vector add of 64-bit floats. We can infer from the "q" that this intrinsic operates on 128-bit vectors. Thus, the accepted arguments must be float64x2_t, since only two 64-bit floats fit in a 128-bit vector.

The more general form of a Neon intrinsic also supports operations that act on lanes of the SIMD register, among other options. The full form of a Neon intrinsic and its specification is described here.
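
To make the naming scheme concrete, here are a few intrinsic names decoded (the wrapper functions are purely illustrative):

C++
#include <arm_neon.h>

// vaddq_f64: vector add, "q" = 128-bit register, f64 elements, so the
// operands must be float64x2_t.
float64x2_t add_f64(float64x2_t a, float64x2_t b) { return vaddq_f64(a, b); }

// vaddq_u16: the same operation on eight unsigned 16-bit lanes.
uint16x8_t add_u16(uint16x8_t a, uint16x8_t b) { return vaddq_u16(a, b); }

// vadd_f32: no "q", so this variant operates on a 64-bit register
// holding two 32-bit floats.
float32x2_t add_f32_narrow(float32x2_t a, float32x2_t b) { return vadd_f32(a, b); }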

With that, you should be able to decipher intrinsics wherever you encounter them and, all being well, follow along in the next section without too much difficulty. Now let us investigate two alternative approaches to porting SSE code to run on Arm platforms.

Porting Intrinsics by Hand

The first option worth considering when porting existing SSE code is a manual port of each SSE routine. This is especially viable when porting short, isolated snippets of code. In addition, code that leverages few "exotic" intrinsics and avoids extremely wide registers (256-bit and wider) will be easier to port.

Let us look at an example from Klein, a C++ library written using SSE intrinsics to compute operators in Geometric Algebra (in particular, Projective Geometric Algebra for modeling 3D Euclidean space). The following snippet of SSE code conjugates a vector denoting a plane’s orientation with a rotor (also known as a quaternion), rotating the plane in space.

C++
#include <xmmintrin.h>

#define KLN_SWIZZLE(reg, x, y, z, w) \
    _mm_shuffle_ps((reg), (reg), _MM_SHUFFLE(x, y, z, w))

// a := plane (components indicate orientation and distance from the origin)
// b := rotor (rotor group isomorphic to the quaternions)
__m128 rotate_plane(__m128 a, __m128 b) noexcept
{
    // LSB
     //
     //  a0 (b2^2 + b1^2 + b0^2 + b3^2)) e0 +
     //
     // (2a2(b0 b3 + b2 b1) +
     //  2a3(b1 b3 - b0 b2) +
     //  a1 (b0^2 + b1^2 - b3^2 - b2^2)) e1 +
     //
     // (2a3(b0 b1 + b3 b2) +
     //  2a1(b2 b1 - b0 b3) +
     //  a2 (b0^2 + b2^2 - b1^2 - b3^2)) e2 +
     //
     // (2a1(b0 b2 + b1 b3) +
     //  2a2(b3 b2 - b0 b1) +
     //  a3 (b0^2 + b3^2 - b2^2 - b1^2)) e3
     //
     // MSB

     // Double-cover scale
     __m128 dc_scale = _mm_set_ps(2.f, 2.f, 2.f, 1.f);
     __m128 b_xwyz   = KLN_SWIZZLE(b, 2, 1, 3, 0);
     __m128 b_xzwy   = KLN_SWIZZLE(b, 1, 3, 2, 0);
     __m128 b_xxxx   = KLN_SWIZZLE(b, 0, 0, 0, 0);

     __m128 tmp1
         = _mm_mul_ps(KLN_SWIZZLE(b, 0, 0, 0, 2), KLN_SWIZZLE(b, 2, 1, 3, 2));
     tmp1 = _mm_add_ps(
         tmp1,
         _mm_mul_ps(KLN_SWIZZLE(b, 1, 3, 2, 1), KLN_SWIZZLE(b, 3, 2, 1, 1)));
     // Scale later with (a0, a2, a3, a1)
     tmp1 = _mm_mul_ps(tmp1, dc_scale);

     __m128 tmp2 = _mm_mul_ps(b, b_xwyz);

     tmp2 = _mm_sub_ps(tmp2,
                       _mm_xor_ps(_mm_set_ss(-0.f),
                                  _mm_mul_ps(KLN_SWIZZLE(b, 0, 0, 0, 3),
                                             KLN_SWIZZLE(b, 1, 3, 2, 3))));
     // Scale later with (a0, a3, a1, a2)
     tmp2 = _mm_mul_ps(tmp2, dc_scale);

     // Alternately add and subtract to improve low component stability
     __m128 tmp3 = _mm_mul_ps(b, b);
     tmp3        = _mm_sub_ps(tmp3, _mm_mul_ps(b_xwyz, b_xwyz));
     tmp3        = _mm_add_ps(tmp3, _mm_mul_ps(b_xxxx, b_xxxx));
     tmp3        = _mm_sub_ps(tmp3, _mm_mul_ps(b_xzwy, b_xzwy));
     // Scale later with a

     __m128 out = _mm_mul_ps(tmp1, KLN_SWIZZLE(a, 1, 3, 2, 0));
     out = _mm_add_ps(out, _mm_mul_ps(tmp2, KLN_SWIZZLE(a, 2, 1, 3, 0)));
     out = _mm_add_ps(out, _mm_mul_ps(tmp3, a));
     return out;
 }

The code pattern above should be fairly familiar to SSE programmers. A general approach is to start from the component-by-component computation to be performed; in this case, we are given two 4-component vectors as __m128 registers. Then, factor out common subexpressions in "vector" fashion before composing and returning the final result. The first parameter (simply named "a" here for brevity) denotes a plane, with the components encoding the plane's orientation and its distance from the origin (lane 0 holds the e0 coefficient).

The second parameter "b" is also a four-component register, in this case representing the four components of a rotor. The operation we are computing is the well-known "sandwich" operator, which conjugates the plane a by the rotor b as b a ~b, where ~b denotes the reverse of b.

Let us start our port to Neon with the function signature.

C++
float32x4_t rotate_plane(float32x4_t a, float32x4_t b) noexcept
{
    // TODO
}

Next, we need to learn how to initialize a float32x4_t with some constant values. One portable way is to load them from an aggregate-initialized array with vld1q_f32:

C++
float32_t tmp[4] = {1.f, 2.f, 2.f, 2.f};
float32x4_t dc_scale = vld1q_f32(tmp);

Note that the element at the lowest address is loaded into lane 0, unlike with the _mm_set_ps intrinsic, whose argument list leads with the most significant lane.
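
As an aside, GCC and Clang also accept brace initialization of Neon vector types directly (a compiler extension rather than something guaranteed by arm_neon.h), and vdupq_n_f32 broadcasts a single scalar to every lane; either can be convenient for constants like this one:

C++
// Brace initialization lists lane 0 first (GCC/Clang extension).
float32x4_t dc_scale_alt = {1.f, 2.f, 2.f, 2.f};
// vdupq_n_f32 splats one scalar into every lane: (2, 2, 2, 2).
float32x4_t twos = vdupq_n_f32(2.f);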

Unlike constant register initialization, the swizzle operation performed with _mm_shuffle_ps is a common pattern in SSE code that is significantly more difficult to port, as there is no single equivalent intrinsic in Neon. To emulate the functionality, we need a few tools.

First is vgetq_lane_f32, which lets us retrieve a specified component of a vector as a scalar. The corresponding intrinsic for setting a lane from a scalar is vsetq_lane_f32. For moving a component from one vector to another, we have vcopyq_laneq_f32, and for broadcasting a lane to all four components, we have vdupq_laneq_f32 (the trailing "laneq" indicates that the source lane comes from a 128-bit vector). With these, it should be clear how we could go line by line, replacing every swizzle with the corresponding lane queries and assignments.
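
For example, here is a sketch of KLN_SWIZZLE(b, 2, 1, 3, 0) from the SSE listing above (which produces (b0, b3, b1, b2), listing lane 0 first) emulated purely with lane operations; the helper name is ours:

C++
#include <arm_neon.h>

// Emulate KLN_SWIZZLE(b, 2, 1, 3, 0), i.e. the shuffle producing
// (b0, b3, b1, b2) with lane 0 listed first, one lane at a time.
// For illustration only; as discussed below, this is not how you
// want to permute on Neon.
float32x4_t swizzle_b_xwyz(float32x4_t b)
{
    float32x4_t r = vdupq_laneq_f32(b, 0); // every lane = b[0]
    r = vcopyq_laneq_f32(r, 1, b, 3);      // r[1] = b[3]
    r = vcopyq_laneq_f32(r, 2, b, 1);      // r[2] = b[1]
    r = vcopyq_laneq_f32(r, 3, b, 2);      // r[3] = b[2]
    return r;
}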

Unfortunately, replacing the swizzles this way is unlikely to produce good results on Arm hardware. On Intel hardware, for example, a shuffle has a 1-cycle latency and a throughput of 1 cycle per instruction. In contrast, the DUP instruction used to extract a lane has a 3-cycle latency penalty on an Arm Cortex-A78, and each MOV needed to assign a lane incurs another 2-cycle latency penalty.

To get better performance out of Neon, we need to use instructions that operate at a coarser granularity than a single lane. For a great overview of the various options for data permutation, refer to this section of Arm’s Coding for Neon guide.

For starters, we have vextq_f32, which concatenates two source vectors and extracts four consecutive components starting from a provided index. In addition, we have a family of rev intrinsics, which reverse the order of components within fixed-size chunks.

Note that we can also reinterpret a float32x4_t as a differently shaped vector (a float64x2_t, for example) and back to generate further permutations in this manner. Each REV16, REV32, or REV64 instruction has a 2-cycle latency, but potentially coalesces many individual lane gets and sets.
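
A brief sketch of what these building blocks do (the function names are illustrative):

C++
#include <arm_neon.h>

// vextq_f32(a, b, n) concatenates a (low) and b (high) and extracts four
// consecutive lanes starting at lane n. With both arguments equal, it
// rotates the vector: (b0, b1, b2, b3) -> (b1, b2, b3, b0).
float32x4_t rotate_by_one(float32x4_t b)
{
    return vextq_f32(b, b, 1);
}

// vrev64q_f32 reverses the 32-bit lanes inside each 64-bit half:
// (b0, b1, b2, b3) -> (b1, b0, b3, b2).
float32x4_t swap_adjacent_pairs(float32x4_t b)
{
    return vrev64q_f32(b);
}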

By permuting the input vectors carefully and minimally, we arrive at the following function:

C++
#include <arm_neon.h>

float32x4_t rotate_plane(float32x4_t a, float32x4_t b) noexcept
{
    // LSB
    //
    //  a0 (b0^2 + b1^2 + b2^2 + b3^2)) e0 + // tmp 4
    //
    // (2a2(b0 b3 + b2 b1) +                 // tmp 1
    //  2a3(b1 b3 - b0 b2) +                 // tmp 2
    //  a1 (b0^2 + b1^2 - b3^2 - b2^2)) e1 + // tmp 3
    //
    // (2a3(b0 b1 + b3 b2) +                 // tmp 1
    //  2a1(b2 b1 - b0 b3) +                 // tmp 2
    //  a2 (b0^2 + b2^2 - b1^2 - b3^2)) e2 + // tmp 3
    //
    // (2a1(b0 b2 + b1 b3) +                 // tmp 1
    //  2a2(b3 b2 - b0 b1) +                 // tmp 2
    //  a3 (b0^2 + b3^2 - b2^2 - b1^2)) e3   // tmp 3
    //
    // MSB

    // Trailing comments give Execution Latency : Execution Throughput

    // Broadcast b[0] to all components of b_0000
    float32x4_t b_0000 = vdupq_laneq_f32(b, 0); // 3:1

    // We need b_.312, b_.231, b_.123 (contents of component 0 don’t matter)
    float32x4_t b_3012 = vextq_f32(b, b, 3);                // 2:2
    float32x4_t b_3312 = vcopyq_laneq_f32(b_3012, 1, b, 3); // 2:2
    float32x4_t b_1230 = vextq_f32(b, b, 1);                // 2:2
    float32x4_t b_1231 = vcopyq_laneq_f32(b_1230, 3, b, 1); // 2:2

    // We also need a_.231 and a_.312
    float32x4_t a_1230 = vextq_f32(a, a, 1);                // 2:2
    float32x4_t a_1231 = vcopyq_laneq_f32(a_1230, 3, a, 1); // 2:2
    float32x4_t a_2311 = vextq_f32(a_1231, a_1231, 1);      // 2:2
    float32x4_t a_2312 = vcopyq_laneq_f32(a_2311, 3, a, 2); // 2:2

    // After the permutations above are done, the rest of the port is more natural
    float32x4_t tmp1 = vfmaq_f32(vmulq_f32(b_0000, b_3312), b_1231, b);
    tmp1 = vmulq_f32(tmp1, a_1231);

    float32x4_t tmp2 = vfmsq_f32(vmulq_f32(b, b_3312), b_0000, b_1231);
    tmp2 = vmulq_f32(tmp2, a_2312);

    float32x4_t tmp3_1 = vfmaq_f32(vmulq_f32(b_0000, b_0000), b, b);
    float32x4_t tmp3_2 = vfmaq_f32(vmulq_f32(b_3312, b_3312), b_1231, b_1231);
    float32x4_t tmp3 = vmulq_f32(vsubq_f32(tmp3_1, tmp3_2), a);

    // tmp1 + tmp2 + tmp3
    float32x4_t out = vaddq_f32(vaddq_f32(tmp1, tmp2), tmp3);

    // Compute 0 component and set it directly
    float32x4_t b2 = vmulq_f32(b, b);
    // Add the top two components and the bottom two components
    float32x2_t b2_hadd = vadd_f32(vget_high_f32(b2), vget_low_f32(b2));
    // dot(b, b) in both float32 components
    float32x2_t b_dot_b = vpadd_f32(b2_hadd, b2_hadd);

    float32x4_t tmp4 = vmulq_lane_f32(a, b_dot_b, 0);
    out = vcopyq_laneq_f32(out, 0, tmp4, 0);

    return out;
}

The annotated expression in the comment at the top of the function shows how the various temporaries needed to evaluate the expression were constructed. The compiled output is a compact routine of 33 instructions, reproduced below:

ASM
rotate_plane(__Float32x4_t, __Float32x4_t):
ext v16.16b, v0.16b, v0.16b, #4
ext v3.16b, v1.16b, v1.16b, #12
mov v6.16b, v0.16b
fmul v4.4s, v1.4s, v1.4s
ins v16.s[3], v0.s[1]
ins v3.s[1], v1.s[3]
dup v2.4s, v1.s[0]
ext v7.16b, v1.16b, v1.16b, #4
ext v0.16b, v16.16b, v16.16b, #4
fmul v19.4s, v1.4s, v3.4s
fmul v18.4s, v2.4s, v3.4s
ins v7.s[3], v1.s[1]
ins v0.s[3], v6.s[2]
dup d17, v4.d[1]
dup d5, v4.d[0]
fmul v3.4s, v3.4s, v3.4s
mov v4.16b, v0.16b
mov v0.16b, v19.16b
fadd v5.2s, v5.2s, v17.2s
mov v17.16b, v18.16b
fmla v3.4s, v7.4s, v7.4s
fmls v0.4s, v2.4s, v7.4s
fmul v2.4s, v2.4s, v2.4s
faddp v5.2s, v5.2s, v5.2s
fmla v17.4s, v7.4s, v1.4s
fmul v0.4s, v4.4s, v0.4s
fmla v2.4s, v1.4s, v1.4s
fmul v5.4s, v6.4s, v5.s[0]
fmla v0.4s, v17.4s, v16.4s
fsub v2.4s, v2.4s, v3.4s
fmla v0.4s, v6.4s, v2.4s
ins v0.s[0], v5.s[0]
ret

With optimizations enabled, Armv8 Clang opted to produce a slightly better sequence of instructions to permute the vector than the intrinsics literally specify. While leaning on the optimizer is an option for an even more brute-force approach, there is no guarantee it will notice every possible code improvement.

Using Platform-agnostic Headers

The process of writing efficient intrinsics for Neon hardware can seem daunting. Direct ports of SSE code to Arm are often time consuming and do not always produce the desired result.

Fortunately, there is at least one mature abstraction that eases the task of porting, and can potentially even complete the porting effort in one go: the SIMD Everywhere project, abbreviated SIMDe.

The premise of SIMDe is that the only change needed in your code is a replacement of the header through which you would normally include the platform intrinsics. Instead of including, say, xmmintrin.h, you include SIMDe’s variant matching the instruction set you originally targeted (for example, x86/sse.h).

Internally, the SIMDe header detects the target architecture you are compiling for and generates instructions matching the intrinsics used when writing code for the original target.

As an example, suppose our original code used the _mm_mul_ps intrinsic. After swapping the include for SIMDe’s sse.h header, code that invokes _mm_mul_ps will continue to use the native intrinsic when targeting x86 hardware. However, compiling for Arm will also succeed, because the SIMDe header converts the _mm_mul_ps invocation into a vmulq_f32.
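
A minimal sketch of what the swap looks like in practice, assuming SIMDe's optional native-alias mode (SIMDE_ENABLE_NATIVE_ALIASES), which lets the original SSE names be used unchanged:

C++
// Instead of: #include <xmmintrin.h>
#define SIMDE_ENABLE_NATIVE_ALIASES
#include "simde/x86/sse.h"

// The existing SSE code compiles unchanged. On x86 this maps to the
// native intrinsic; on Arm, SIMDe lowers it to vmulq_f32.
__m128 mul(__m128 a, __m128 b)
{
    return _mm_mul_ps(a, b);
}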

To see exactly how this intrinsic "rewriting" happens, you can refer to the SIMDe implementation of _mm_mul_ps here. The same approach is taken for all the supported intrinsics, and the SIMDe implementation tries to select the most efficient replacement possible. A commit like this one may be all you need to get up and running with Neon quickly.

The plan now is straightforward. With a single-line change to each file that includes SSE headers, pointing it at the SIMDe headers instead, you should have a codebase that compiles for Arm hardware.

The next step is to profile the result to see whether the performance of the SIMDe drop-in port is acceptable. While porting with SIMDe is much quicker, we have already seen that direct replacement of x86 intrinsics with their Arm counterparts can result in inefficient code. By profiling the ported code, you can migrate problematic portions to a native handwritten port on a case-by-case basis.

To see the effect of SIMDe on our plane rotating function, we can swap out the line to include the SSE header with the following snippet:

C++
#include <arm_neon.h>
typedef float32x4_t __m128;

inline __attribute__((always_inline)) __m128 _mm_set_ps(float e3, float e2, float e1, float e0)
{
    __m128 r;
    alignas(16) float data[4] = {e0, e1, e2, e3};
    r = vld1q_f32(data);
    return r;
}

#define _MM_SHUFFLE(z, y, x, w) (((z) << 6) | ((y) << 4) | ((x) << 2) | (w))

inline __attribute__((always_inline)) __m128 _mm_mul_ps(__m128 a, __m128 b) {
    return vmulq_f32(a, b);
}

inline __attribute__((always_inline)) __m128 _mm_add_ps(__m128 a, __m128 b) {
    return vaddq_f32(a, b);
}

inline __attribute__((always_inline)) __m128 _mm_sub_ps(__m128 a, __m128 b) {
    return vsubq_f32(a, b);
}

inline __attribute__((always_inline)) __m128 _mm_set_ss(float a) {
    return vsetq_lane_f32(a, vdupq_n_f32(0.f), 0);
}

inline __attribute__((always_inline)) __m128 _mm_xor_ps(__m128 a, __m128 b) {
    // XOR the raw bits via integer vectors (eor is not defined for floats).
    return vreinterpretq_f32_u32(
        veorq_u32(vreinterpretq_u32_f32(a), vreinterpretq_u32_f32(b)));
}

#define _mm_shuffle_ps(a, b, imm8)                                \
    __extension__({                                               \
        float32x4_t ret;                                          \
        ret = vmovq_n_f32(vgetq_lane_f32(a, (imm8) & 0x3));       \
        ret = vsetq_lane_f32(                                     \
            vgetq_lane_f32(a, ((imm8) >> 2) & 0x3), ret, 1);      \
        ret = vsetq_lane_f32(                                     \
            vgetq_lane_f32(b, ((imm8) >> 4) & 0x3), ret, 2);      \
        ret = vsetq_lane_f32(                                     \
            vgetq_lane_f32(b, ((imm8) >> 6) & 0x3), ret, 3);      \
        ret;                                                      \
    })

These routines are lifted directly from the SIMDe header so you can see how the various SSE intrinsics and shuffles map to Neon intrinsics. The AArch64 assembly code generated from this is as follows:

ASM
rotate_plane(__Float32x4_t, __Float32x4_t):      // @rotate_plane(__Float32x4_t, __Float32x4_t)
        dup     v3.4s, v1.s[2]
        ext     v3.16b, v1.16b, v3.16b, #4
        dup     v2.4s, v1.s[0]
        ext     v20.16b, v1.16b, v3.16b, #12
        dup     v4.4s, v1.s[1]
        dup     v5.4s, v1.s[3]
        adrp    x8, .LCPI0_1
        ext     v7.16b, v1.16b, v2.16b, #4
        ext     v19.16b, v3.16b, v2.16b, #12
        ext     v3.16b, v3.16b, v20.16b, #12
        dup     v6.4s, v0.s[0]
        ext     v16.16b, v1.16b, v4.16b, #4
        ext     v5.16b, v1.16b, v5.16b, #4
        ext     v17.16b, v1.16b, v7.16b, #12
        ext     v18.16b, v1.16b, v7.16b, #8
        fmul    v3.4s, v19.4s, v3.4s
        ldr     q19, [x8, :lo12:.LCPI0_1]
        ext     v6.16b, v0.16b, v6.16b, #4
        ext     v17.16b, v7.16b, v17.16b, #12
        ext     v7.16b, v7.16b, v18.16b, #12
        ext     v18.16b, v1.16b, v16.16b, #8
        ext     v20.16b, v1.16b, v5.16b, #8
        ext     v2.16b, v5.16b, v2.16b, #12
        ext     v16.16b, v16.16b, v18.16b, #12
        ext     v18.16b, v0.16b, v6.16b, #8
        ext     v5.16b, v5.16b, v20.16b, #12
        ext     v20.16b, v0.16b, v6.16b, #12
        adrp    x8, .LCPI0_0
        ext     v18.16b, v6.16b, v18.16b, #12
        ext     v6.16b, v6.16b, v20.16b, #12
        fmul    v20.4s, v1.4s, v1.4s
        fmul    v2.4s, v2.4s, v5.4s
        fmul    v5.4s, v17.4s, v1.4s
        mov     v1.s[0], v4.s[0]
        ldr     q4, [x8, :lo12:.LCPI0_0]
        eor     v2.16b, v2.16b, v19.16b
        fmul    v1.4s, v16.4s, v1.4s
        fadd    v2.4s, v5.4s, v2.4s
        fmul    v5.4s, v17.4s, v17.4s
        fadd    v5.4s, v20.4s, v5.4s
        dup     v16.4s, v20.s[0]
        fadd    v1.4s, v3.4s, v1.4s
        fmul    v7.4s, v7.4s, v7.4s
        fadd    v5.4s, v16.4s, v5.4s
        fmul    v2.4s, v2.4s, v4.4s
        fmul    v1.4s, v1.4s, v4.4s
        fadd    v3.4s, v7.4s, v5.4s
        fmul    v2.4s, v6.4s, v2.4s
        fmul    v1.4s, v18.4s, v1.4s
        fadd    v1.4s, v1.4s, v2.4s
        fmul    v0.4s, v3.4s, v0.4s
        fadd    v0.4s, v0.4s, v1.4s
        ret

Even with the same optimization settings (-O2) as before, the code we ended up with is 53 instructions long, with many more permutation (DUP/EXT) instructions than our hand-ported version's 33.

The effect of SIMDe on your codebase will depend on several factors, a significant one being how many of the SSE intrinsics you use do not map cleanly onto the Arm architecture.

Porting to a Unified Vector Library

One more approach worth mentioning is the use of an intermediate library for expressing vector manipulation. Perhaps one of the most mature options taking this approach is xsimd.

The idea behind this approach is that, instead of attempting to maintain a bespoke set of routines and algorithms for each instruction set, the implementer writes against a common abstraction layer that has an efficient implementation on each supported architecture.
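
As a rough sketch of what that looks like (API names follow recent xsimd releases and may differ between versions), the function below compiles to SSE instructions on x86 and Neon instructions on Arm from the same source:

C++
#include <cstddef>
#include "xsimd/xsimd.hpp"

// Multiply two arrays lane-wise using whatever SIMD width the target
// architecture provides; a scalar loop would handle the remaining
// n % batch::size elements.
void mul_arrays(const float* a, const float* b, float* out, std::size_t n)
{
    using batch = xsimd::batch<float>;
    constexpr std::size_t step = batch::size;
    for (std::size_t i = 0; i + step <= n; i += step)
    {
        batch va = batch::load_unaligned(a + i);
        batch vb = batch::load_unaligned(b + i);
        (va * vb).store_unaligned(out + i);
    }
}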

The major downside to this approach is that integrating a library like xsimd is extremely invasive. As with SIMDe, optimization opportunities will likely be missed once you lose the ability to drop closer to the hardware. In some cases, xsimd does not expose certain operations at all because they perform well on one architecture yet poorly on another.

Despite these problems, for engineers who do not have the time to dedicate to profiling and optimizing for each architecture, using a library like xsimd can be much better than using poor manual ports.

Conclusion

Porting SSE code to Neon by hand is likely preferable for those who do not have much code to port (relative to their time commitments), or when the performance requirements are known to push hardware boundaries.

For smaller codebases, if there is too much research and maintenance needed to optimize bespoke implementations per architecture, libraries like xsimd can be used to simplify working with vectorized code.

Rather than writing or rewriting code to use an abstraction layer like xsimd, SIMDe can be used to port x86 code to the Arm architecture, with custom code substituted for the portions that lack a direct x86-to-Arm mapping or that could benefit from further optimization.

Whichever method you choose to port your code, having code that can run everywhere is now typical, even for low-level engineers. There is an interesting tension between platforms gaining more differentiation (AVX-512, for example) while simultaneously proliferating into domains they had not flourished in before (Arm in the cloud, for example).

Fortunately, tooling to support multi-architecture targeting is rapidly maturing as demand grows. Aside from abstractions like SIMDe and xsimd, portable instruction sets such as SPIR-V and WebAssembly are likewise here to stay. That is to say, when porting code, you have the freedom to exercise some discretion between opting for agility and staying close to the hardware, reclaiming every wasted cycle where possible.

For further reading, be sure to check out Arm’s Coding for Neon series. Consider the Neon Intrinsics Reference equivalent to Intel’s Intrinsics Guide. If you opt to use SIMD Everywhere, their documentation is available on GitHub. The xsimd project is also available on GitHub with additional web documentation. Additionally, free Arm Performance Libraries are available for compiling and running your application.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Technical Lead, WB Games
United States
Jeremy is a Principal Engineer at WB Games. He's worked throughout the game engine tech stack, touching everything from rendering and animation, to gameplay scripting and virtual machines, to netcode and server code. He's most passionate about the boundary between applied mathematics and computer science, and you'll often find him puzzling over one or the other in roughly equal parts. When he's not coding, Jeremy is probably spending time with his wife and dog, climbing, enjoying a chess game, or some combination of the above.
