xsimd: from xtensor to Firefox

Serge « sans paille » Guelton

Compiler Engineer / Story Teller / Mozilla employee

PyData Paris — 25th September 2024

#α Vector Instruction Unit

Raptor Lake die from an i9-13900K

#0 SIMD Illustration

[scalar]            [vector]

   X     |  [ x_0 | x_1 | x_2 | x_3 ]
   +     |     +     +     +     +
   Y     |  [ y_0 | y_1 | y_2 | y_3 ]
   =     |     =     =     =     =
   Z     |  [ z_0 | z_1 | z_2 | z_3 ]

Important: Register Width
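
Why the register width matters: it fixes how many elements one instruction processes. A minimal sketch, assuming the usual widths of 128 bits (SSE, NEON), 256 bits (AVX) and 512 bits (AVX-512); the helper name `lanes` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Number of elements ("lanes") that fit in one vector register.
constexpr std::size_t lanes(std::size_t register_bits, std::size_t element_bytes) {
    return register_bits / (8 * element_bytes);
}

static_assert(lanes(128, sizeof(float)) == 4, "128-bit register: 4 floats");
static_assert(lanes(256, sizeof(float)) == 8, "256-bit register: 8 floats");
static_assert(lanes(512, sizeof(std::uint8_t)) == 64, "512-bit register: 64 bytes");
```

The loop step in the intrinsics slides (`i += 8` for AVX floats) is exactly this lane count.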

#1a Where?

# basically any modern hardware

#1b Where?

# basically any regular, compute-intensive workload

#2 How? using Intrinsics

#include <immintrin.h>

for(size_t i = 0; i < vec_size; i += 8)
{
  __m256 v0 = _mm256_load_ps(ptr0 + i),
         v1 = _mm256_load_ps(ptr1 + i);
  __m256 dot = _mm256_fmadd_ps(v0, v1, v1);
  _mm256_store_ps(ptr0 + i, dot);
}

Tedious! Arch-specific!
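
For reference, a scalar sketch of what the loop above computes, one element at a time. The function name `fma_scalar` is hypothetical; `std::fma` stands in for the fused multiply-add that `_mm256_fmadd_ps` applies to eight floats at once.

```cpp
#include <cmath>
#include <cstddef>

// Scalar reference: ptr0[i] <- ptr0[i] * ptr1[i] + ptr1[i].
inline void fma_scalar(float* ptr0, float const* ptr1, std::size_t vec_size) {
    for (std::size_t i = 0; i < vec_size; ++i)
        ptr0[i] = std::fma(ptr0[i], ptr1[i], ptr1[i]);
}
```

Unlike the intrinsics version, this one needs no alignment and no size that is a multiple of 8.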

#3 Programming using Intrinsics (again)

#include <arm_neon.h>

for(size_t i = 0; i < vec_size; i += 4)  // float32x4_t holds 4 floats
{
  float32x4_t v0 = vld1q_f32(ptr0 + i),
              v1 = vld1q_f32(ptr1 + i);
  // vfmaq_f32(a, b, c) computes a + b * c, so the addend comes first
  float32x4_t dot = vfmaq_f32(v1, v0, v1);
  vst1q_f32(ptr0 + i, dot);
}

Same… but different register width, different arch, different names

#4a Costless Abstraction

#include <xsimd/xsimd.hpp>

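// lane count of the default architecture, assuming float data
constexpr size_t simd_size = xsimd::batch<float>::size;

for(size_t i = 0; i < vec_size; i += simd_size)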
{
  auto v0 = xsimd::load_aligned(ptr0 + i);
  auto v1 = xsimd::load_aligned(ptr1 + i);
  auto dot = xsimd::fma(v0, v1, v1);
  dot.store_aligned(ptr0 + i);
}

Same, but arch-independent
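
All three loops assume `vec_size` is a multiple of the lane count. A minimal sketch of the usual fix, a vector-sized main loop followed by a scalar tail, written with the standard library only: the template parameter `W` and the name `fma_blocked` are hypothetical, and the inner lane loop stands in for a single vector instruction.

```cpp
#include <cmath>
#include <cstddef>

// W is a hypothetical lane count standing in for the batch size.
template <std::size_t W>
void fma_blocked(float* ptr0, float const* ptr1, std::size_t vec_size) {
    // Largest multiple of W that does not exceed vec_size.
    std::size_t const vec_end = vec_size - vec_size % W;
    std::size_t i = 0;
    for (; i < vec_end; i += W)
        for (std::size_t lane = 0; lane < W; ++lane)  // stands in for one vector FMA
            ptr0[i + lane] = std::fma(ptr0[i + lane], ptr1[i + lane], ptr1[i + lane]);
    for (; i < vec_size; ++i)  // scalar tail for the leftover elements
        ptr0[i] = std::fma(ptr0[i], ptr1[i], ptr1[i]);
}
```

With `W = 4` and 10 elements, the main loop covers indices 0 to 7 and the tail handles 8 and 9.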

<3 C++

#4b Costless?

https://godbolt.org/z/7xxs73494

#5a XSIMD - a humble abstraction

Only 25k of header-only C++11 code

originally based on Boost.SIMD

#5b XSIMD - or others

On the way to standardization?

#6a Example of static dispatch

Generic:

template <class A, class T, class>
batch<T, A> ????(batch<T, A> const& self, batch<T, A> const& other, requires_arch<generic>) noexcept
{
    const auto diffmax = std::numeric_limits<T>::max() - self;
    const auto mindiff = min(diffmax, other);
    return self + mindiff;
}

#6b Example of static dispatch

AVX512bw:

template <class A, class T, class>
batch<T, A> sadd(batch<T, A> const& self, batch<T, A> const& other, requires_arch<avx512bw>) noexcept
{
    XSIMD_IF_CONSTEXPR(sizeof(T) == 1)
    {
        return _mm512_adds_epu8(self, other);
    }
    ...
}
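
A toy model of the two slides above, not xsimd's actual code: architecture tags form an inheritance chain, and overload resolution picks the most specific kernel available, falling back to the generic one. The tags `generic_tag`, `fast_tag` and `other_tag` are hypothetical, the kernels work on a single unsigned value instead of a batch, and a widened addition stands in for the native saturating instruction.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical architecture tags: specific architectures derive from the generic one.
struct generic_tag {};
struct fast_tag : generic_tag {};
struct other_tag : generic_tag {};

// Generic fallback, same formula as the generic slide (unsigned T assumed).
template <class T>
T sadd(T self, T other, generic_tag const&) {
    T const diffmax = std::numeric_limits<T>::max() - self;  // room left before overflow
    T const mindiff = std::min(diffmax, other);              // never add more than that
    return static_cast<T>(self + mindiff);
}

// Specialised kernel: only the 8-bit case has a dedicated path here.
template <class T>
T sadd(T self, T other, fast_tag const&) {
    if (sizeof(T) == 1) {
        // Stand-in for a native saturating add: widen, add, clamp.
        unsigned const wide = unsigned(self) + unsigned(other);
        return static_cast<T>(std::min(wide, 255u));
    }
    return sadd(self, other, generic_tag{});
}
```

Calling `sadd(a, b, fast_tag{})` selects the specialised overload because the tag matches exactly; `sadd(a, b, other_tag{})` has no dedicated kernel and binds to the generic one through the base class.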

#6c Example of dynamic dispatch

#include "sum.hpp"

// The architectures we want to target.
using target_archs = xsimd::arch_list<xsimd::avx2, xsimd::sse2>;

// Create the dispatching function.
auto dispatched = xsimd::dispatch<target_archs>(sum{});

// Call the appropriate implementation based on runtime information.
float res = dispatched(data, 17);
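
The idea behind runtime dispatch, as a toy model with the standard library only. Everything here is hypothetical: `sum_narrow` and `sum_wide` stand in for two builds of the same kernel, and the boolean passed to `select_sum` stands in for a real CPU feature query.

```cpp
#include <cstddef>

// Hypothetical stand-in for the baseline build of the kernel.
inline float sum_narrow(float const* data, std::size_t size) {
    float acc = 0.f;
    for (std::size_t i = 0; i < size; ++i) acc += data[i];
    return acc;
}

// Hypothetical stand-in for the wider build: four partial sums, like four lanes.
inline float sum_wide(float const* data, std::size_t size) {
    float acc[4] = {0.f, 0.f, 0.f, 0.f};
    std::size_t i = 0;
    for (; i + 4 <= size; i += 4)
        for (std::size_t lane = 0; lane < 4; ++lane) acc[lane] += data[i + lane];
    float total = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (; i < size; ++i) total += data[i];  // scalar tail
    return total;
}

using sum_fn = float (*)(float const*, std::size_t);

// Pick an implementation once, at run time, from a feature flag.
inline sum_fn select_sum(bool cpu_has_wide_vectors) {
    return cpu_has_wide_vectors ? &sum_wide : &sum_narrow;
}
```

The caller keeps a single entry point, and the binary can ship several implementations and still run on the older hardware.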

#7 XSIMD + ???

#8a XSIMD in Firefox!

Powers some rendering routines

xsimd::batch<int16_t, Arch> vectmask(/*...*/shortMask);
for (; i < vectWalkEnd; i += numUnicharsPerVector) {
    const auto vect =
      xsimd::batch<int16_t, Arch>::load_aligned(str + i);
    if (xsimd::any((vect & vectmask) != 0))
      return i;
}

#8b Example of migration

$ git diff 3b8da9bd15c0^..3b8da9bd15c0 --numstat
10   1       gfx/thebes/gfxAlphaRecovery.cpp
6    9       gfx/thebes/gfxAlphaRecovery.h
129  0       gfx/thebes/gfxAlphaRecoveryGeneric.h
9    0       gfx/thebes/gfxAlphaRecoveryNeon.cpp
3    124     gfx/thebes/gfxAlphaRecoverySSE2.cpp

#8c XSIMD in Firefox!

Compute kernel for translation in Firefox: gemmology

Based on the intgemm project

#ω Links & References
