Bye bye boost.simd, welcome xsimd

boost.simd provides a C++ abstraction of vector type, allowing for efficient vectorization of array computations. It has been (optionally) used as part of the expression template engine of Pythran for a long time, a great collaboration that led to several patches in boost.simd, and great performance for Pythran.

Unfortunately, the project has been silent over the last months (see for instance this issue) and I had to maintain a few custom patches. Turns out the project has been re-branded as bSIMD with a more restrictive license, as detailed in another issue. From the Pythran perspective, this is no good news.

Fortunately, the people from QuantStack have put tremendous effort into providing an equivalent to boost.simd, xsimd. And their library actually provides some improvements in the context of Pythran, it's under a BSD-3-Clause license, so when they proposed to fund the move to xsimd, it was just perfect.

So here is the deal: I do the port, report any API and/or performance issue, and eventually provide patches when relevant. That's what I did over the last three months, let's have a look at the results.

User-level Changes

In order to activate explicit vectorisation, one must pass -DUSE_XSIMD -march=native to the Pythran compiler, in place of -DUSE_BOOST_SIMD -march=native. Fair enough.

For instance, consider the following kernel, taken from the numpy benchmarks suite I maintain.

#pythran export arc_distance(float64 [], float64[], float64[], float64[])

import numpy as np
def arc_distance(theta_1, phi_1,
                       theta_2, phi_2):
    """
    Calculates the pairwise arc distance between all points in vector a and b.
    """
    temp = np.sin((theta_2-theta_1)/2)**2+np.cos(theta_1)*np.cos(theta_2)*np.sin((phi_2-phi_1)/2)**2
    distance_matrix = 2 * (np.arctan2(np.sqrt(temp),np.sqrt(1-temp)))
    return distance_matrix

When compiled with GCC 7.3 and benchmarked with the perf module, one gets

CC=clang CXX=clang++ pythran arc_distance.py -O3 -march=native
python -m perf timeit -s 'N = 10000 ; import numpy as np ; np.random.seed(0); t0, p0, t1, p1 = np.random.randn(N), np.random.randn(N), np.random.randn(N), np.random.randn(N); from arc_distance import arc_distance' 'arc_distance(t0, p0, t1, p1)'
.....................
Mean +- std dev: 1.48 ms +- 0.01 ms

That's our base line. If we recompile it with -DUSE_XSIMD, we get an extra speedup (AVX instructions are available on my laptop, and enabled by -march=native).

CC=clang CXX=clang++ python -m pythran.run arc_distance.py -O3 -march=native -DUSE_XSIMD
python -m perf timeit -s 'N = 10000 ; import numpy as np ; np.random.seed(0); t0, p0, t1, p1 = np.random.randn(N), np.random.randn(N), np.random.randn(N), np.random.randn(N); from arc_distance import arc_distance' 'arc_distance(t0, p0, t1, p1)'
.....................
Mean +- std dev: 199 us +- 4 us

That's roughly 7 times faster. And using Pythran 0.8.7, the last release with boost.simd support, we have

CC=clang CXX=clang++ python -m pythran.run arc_distance.py -O3 -march=native -DUSE_BOOST_SIMD
python -m perf timeit -s 'N = 10000 ; import numpy as np ; np.random.seed(0); t0, p0, t1, p1 = np.random.randn(N), np.random.randn(N), np.random.randn(N), np.random.randn(N); from arc_distance import arc_distance' 'arc_distance(t0, p0, t1, p1)'
.....................
Mean +- std dev: 284 us +- 8 us

This is slightly slower, but within the same magnitude order. Out of curiosity, I ran the same three experiments using Clang 6 as a backend compiler and I get the following timings:

clang + boost.simd: 220 us +- 8 us
clang + xsimd:      273 us +- 11 us
clang:             1.41 ms +- 0.04 ms

Interestingly, on that example, Clang generates better code for the boost.simd version. So let's be wary of hasty conclusion and just state that with both engines, I can get efficient vectorization of my code.

Complex Numbers

Thanks to xsimd, Pythran is now able to naively support complex number vectorization. I state naively because we don't support changing internal representation from array-of-struct to struct-of-array, as we stick to numpy's layout. Still that's something new for Pythran as showcased by the following kernel:

#pythran export normalize_complex_arr(complex[])

import numpy as np
def normalize_complex_arr(a):
    a_oo = a - a.real.min() - 1j*a.imag.min() # origin offsetted
    return a_oo/np.abs(a_oo).max()

Pythran provides a vectorized version of np.min and np.max operators, so thanks to complex support, it should provide some decent acceleration. Note that the two calls to np.min() do not involve complex numbers, but that the remaining parts of the expression do. Let's check that!

First, the reference numpy version:

python -m perf timeit -s 'import numpy as np; np.random.seed(0); N = 100000; x = np.random.random(N) + 1j *  np.random.random(N); from normalize_complex_arr import normalize_complex_arr' 'normalize_complex_arr(x)'
.....................
Mean +- std dev: 3.19 ms +- 0.02 ms

Then with Pythran, no explicit vectorization:

CC=gcc CXX=g++ pythran -march=native -O3 normalize_complex_arr.py
python -m perf timeit -s 'import numpy as np; np.random.seed(0); N = 100000; x = np.random.random(N) + 1j *  np.random.random(N); from normalize_complex_arr import normalize_complex_arr' 'normalize_complex_arr(x)'
.....................
Mean +- std dev: 2.84 ms +- 0.01 ms

And with vectorization on .

CC=gcc CXX=g++ pythran -march=native -O3 make_decision.py -DUSE_XSIMD
python -m perf timeit -s 'import numpy as np; np.random.seed(0); N = 100000; x = np.random.random(N) + 1j *  np.random.random(N); from normalize_complex_arr import normalize_complex_arr' 'normalize_complex_arr(x)'
.....................
Mean +- std dev: 723 us +- 14 us

Cool! Speedup for complex! For the record, the numpy version already ran at roughly 3.19 ms +- 0.02 ms.

Scalar Version

That's probably a detail for many xsimd users, but thanks to this cooperation, xsimd now exposes a scalar version of all the mathematical function inside the xsimd:: namespace. That way one can write higher level functions based on xsimd, and they would work for both scalar and vector version:

template<class T>
T euclidian_distance_squared(T x, T y)
{
    auto tmp = xsimd::hypot(x, y);
    return tmp * tmp;
}

In the context of Pythran, this makes the expression template engine easier to write. Good point.

Compilation Time

Pythran is an Ahead of Time compiler, so compilation time is generally not a good metric. But there's one situation where it matters to me: Continuous Integration. Because Travis has time limits, the faster we compile, the more tests we can pass! As Pythran validates for Python2 and Python3, for Clang and GCC, with and without SIMD, with and without OpenMP, that's a lot of configurations to test. Roughly... 20hours of cumulated tests actually, see this recent build for instance.

In pre-xsimd setting, compiling the above arc_distance.py file in simd mode is relatively slow. As a reference consider the compilation of the sequential version:

time pythran -O3 -march=native normalize_complex_arr.py -E # generate the .cpp
0.91s user 0.28s system 130% cpu 0.908 total

time pythran -O3 -march=native arc_distance.cpp
5.67s user 0.61s system 104% cpu 6.001 total

Ok, roughly 5 seconds in sequential mode. What about vectorized version? With boost, it's pretty damn slow:

time pythran -O3 -march=native normalize_complex_arr.cpp -DUSE_BOOST_SIMD
12.10s user 0.79s system 102% cpu 12.616 total

With xsimd, it's slightly faster (no boost dependencies, and less C++ magic):

time pythran -O3 -march=native arc_distance.cpp -DUSE_XSIMD
10.32s user 0.65s system 102% cpu 10.688 total

Performance of Basic Functions

Using airspeed velocity, I've compared how well xsimd behaves for simple operations on 1D array. All the benchmarks hereafter have the following form:

#pythran export cos_array(float64 [])
#setup: import numpy as np ; np.random.seed(0); N = 10000 ; x = np.random.random(N) * 2 * np.pi
#run: cos_array(x)
import numpy as np
def cos_array(x):
    return np.cos(x)

The results are obtained through the asv compare commit_id0 commit_id1 command.

All benchmarks:

    before     after       ratio
  [99d8234f] [60632651]
     9.90μs     9.89μs      1.00  benchmarks.TimeSuite.time_abs_array
+   36.82μs    58.44μs      1.59  benchmarks.TimeSuite.time_acos_array
    36.25μs    33.60μs      0.93  benchmarks.TimeSuite.time_asin_array
-   50.47μs    33.03μs      0.65  benchmarks.TimeSuite.time_atan_array
-   48.62μs    35.72μs      0.73  benchmarks.TimeSuite.time_cos_array
-   73.82μs    43.81μs      0.59  benchmarks.TimeSuite.time_cosh_array
-   47.55μs    35.52μs      0.75  benchmarks.TimeSuite.time_sin_array
-   91.45μs    47.86μs      0.52  benchmarks.TimeSuite.time_sinh_array
    18.35μs    17.91μs      0.98  benchmarks.TimeSuite.time_sqrt_array
     9.60μs    10.05μs      1.05  benchmarks.TimeSuite.time_square_array
-   71.71μs    33.35μs      0.47  benchmarks.TimeSuite.time_tan_array
-   84.63μs    42.28μs      0.50  benchmarks.TimeSuite.time_tanh_array

Looks pretty good! Apart from a regression on acos, this is either on-par or faster than before.

Out of curiosity, I also ran the same benchmark, but using Clang as back-end.

All benchmarks:

    before     after       ratio
  [99d8234f] [60632651]
     9.57μs    10.00μs      1.05  benchmarks.TimeSuite.time_abs_array
+   34.20μs    58.53μs      1.71  benchmarks.TimeSuite.time_acos_array
    36.09μs    33.91μs      0.94  benchmarks.TimeSuite.time_asin_array
-   45.02μs    33.86μs      0.75  benchmarks.TimeSuite.time_atan_array
+   39.44μs    45.48μs      1.15  benchmarks.TimeSuite.time_cos_array
-   65.98μs    44.78μs      0.68  benchmarks.TimeSuite.time_cosh_array
+   39.39μs    45.48μs      1.15  benchmarks.TimeSuite.time_sin_array
-  110.62μs    48.44μs      0.44  benchmarks.TimeSuite.time_sinh_array
    18.18μs    18.54μs      1.02  benchmarks.TimeSuite.time_sqrt_array
    10.05μs     9.56μs      0.95  benchmarks.TimeSuite.time_square_array
-   56.82μs    45.32μs      0.80  benchmarks.TimeSuite.time_tan_array
-   98.85μs    44.16μs      0.45  benchmarks.TimeSuite.time_tanh_array

Wow, that's significant changes. Regression on both cos, sin and acos are not good news.

What conclusion should we draw? My take on this is that these benchmarks are not synthetic enough to state xsimd implementation of function X is better or worse than boost.simd implementation. But maybe there are bad interactions with Pythran's expression templates? A single register spill could wreak havoc in the performance, and I know there is room for improvement there.

Conclusions

I'm indeed very happy with the changes. The xsimd team is very reactive, it's cool to chat with them about performance, Python, C++... And did I say xsimd supports NEON, AVX512? I should try to run cross-compiled Pythran code on a Raspberry, but... That's for another story!

Again thanks a lot to (alphabetical order) Johan, Martin, Sylvain and Wolf. Let's meet again in front of a generous choucroute!