Getting the best of both worlds: Cython and Pythran working together
Once upon a time, on IRC, Serge Guelton asked me whether I wanted to work on having Cython use Pythran for Numpy-related computations. I wasn't really sure what I was getting into, but I've always liked optimizing software, and it sounded like an interesting challenge to understand both projects. On another important note, this whole project has been financed by the OpenDreamKit project!
That's the end of the short story; now let's get to the real stuff!
Why mix Cython and Pythran?
On one side, when Cython code contains operations on Numpy arrays, Cython relies on the original Numpy package to compute them. This involves falling back to the Python interpreter, and thus misses several optimization opportunities, especially with complex expressions: even if each Numpy call is decently optimized, their combination is not.
On the other side, Pythran has a full C++ implementation of a major subset of the Numpy API. Among the advantages of this implementation are its support for expression templates and SIMD instructions. Expression templates make it possible to fuse the loops that occur when expressions with multiple operators are computed. For instance, the expression a + b * c is transformed by Cython into two calls: one for the multiplication of b by c, and one for the addition of a to the result of this multiplication. Each call ends up as one loop that reads memory, computes the operation, and writes back to (newly allocated) memory. The second loop has the same pattern. On today's architectures, memory bandwidth is often the limiting factor in this kind of operation. It is thus really interesting to merge these loops, and load and store the memory only once.
Expression templating is a C++ technique that delays the evaluation of expressions until they are stored to memory. Thus, in this case, the two loops are automatically fused by the C++ compiler, and we get an optimized version of this code. Note that this technique is used for instance by the C++ wrapper of the GMP library. Using xsimd, it is even possible to automagically vectorize these computations.
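To make the loop-fusion point concrete, here is a hand-written sketch (plain Python, purely illustrative; the function names are made up) of what the two evaluation strategies of a + b * c amount to:
import numpy as np

def unfused(a, b, c):
    # what happens with one Numpy call per operator: two loops, one temporary
    tmp = b * c          # loop 1: reads b and c, writes tmp
    return a + tmp       # loop 2: reads a and tmp, writes the result

def fused(a, b, c):
    # what the expression-template evaluation boils down to:
    # a single loop, no temporary, each element loaded and stored once
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i] * c[i]
    return out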
The project focused on using this Pythran backend for Numpy arrays in Cython whenever possible. At the time this integration was written, Pythran had a few limitations regarding the Numpy arrays it could handle:
- array "views" are not supported. That means that arrays must be stored in contiguous memory. Fortran and C-style format are supported.
- the endianess of the integers must be the same that the one of the targeted architecture (note that Cython has the same limitation, and that it is still true today)
However, we still achieve interesting speedups, without the need for manual loops.
Implementation details within Cython
The overall idea of this feature's implementation is to generate code that uses the pythonic backend instead of calls to the Numpy Python functions. Moreover, as Pythran doesn't support every kind of Numpy array, we need a mechanism to switch back to the original implementation when necessary.
In order to explain this, let's take an example with this simple Cython function:
import numpy as np
cimport numpy as np
def add(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
return a+b
When we encounter such a definition, we want to generate various functions, depending on the shapes of a and b at runtime:
- the original Cython code, if neither a nor b can be handled by Pythran
- the version where only a can be handled by Pythran
- the version where only b can be handled by Pythran
- the version where both a and b can be handled by Pythran
Note that, in the case of the add function, only the first and last versions are really of value. For now, we don't try to be smart about this and generate all of these versions.
In order to do that, we rely on the type infrastructure that already exists in Cython. For every argument that is a potentially Pythran-supported Numpy array, we convert its type into a Cython FusedType. A FusedType declares a union of types, and multiple FusedTypes can be used within a function. In the case of our add function, this generates the four aforementioned versions, and the dispatching is done at runtime. What's nice is that we just need to declare these FusedType types, and Cython already handles all the dispatching and the generation of the various functions.
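As an illustration of the mechanism (a hand-written sketch, not the actual code Cython generates for the Pythran backend; the names elem_a, elem_b and first_sum are made up), declaring one fused type per argument makes Cython emit one specialization per combination and pick the right one at call time:
ctypedef fused elem_a:
    float
    double

ctypedef fused elem_b:
    float
    double

def first_sum(elem_a[::1] a, elem_b[::1] b):
    # Cython emits one specialization per (elem_a, elem_b) combination
    # -- four in total -- and dispatches on the runtime buffer types.
    return a[0] + b[0]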
Once this is done, we use these rules to know when we can generate pythonic-based code:
- unary and binary operators on pythonic arrays are supported operations
- calling a function of a module implemented in Pythran is a supported operation
If none of these rules apply, we fall back to Python objects and use the original Cython implementation.
How to use it
The Pythran-Numpy backend isn't activated by default within Cython. There are multiple ways to activate it:
- if you are using Cython directly from the command line, you can pass the --np-pythran flag to the cython program
- if you are using distutils, you can just add a comment with # cython: np_pythran=True at the top of the relevant Cython files (a minimal build sketch follows below)
More detailed information can be found within the Cython documentation.
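For the distutils route, a build script could look like the following sketch (untested; kernel.pyx is a hypothetical file whose first lines contain the np_pythran=True and cxx=True directives used in the examples below, Pythran must be installed, and depending on your installation you may also need to add Pythran's header directory to include_dirs):
# setup.py -- minimal sketch for building a hypothetical kernel.pyx
# whose header enables the backend:
#   # cython: np_pythran=True
#   # cython: cxx=True
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

ext = Extension(
    "kernel",
    sources=["kernel.pyx"],
    include_dirs=[numpy.get_include()],  # Numpy headers for the buffer interface
    language="c++",                      # the Pythran backend emits C++
)

setup(ext_modules=cythonize([ext]))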
Benchmarks and Examples
cos norm
>>> import numpy as np
...
>>> n = 10000
>>> x, y = np.random.random((2, n))
>>> def np_cos_norm(a, b):
... val = np.sum(1. - np.cos(a-b))
... return np.sqrt(val / 2. / a.shape[0])
>>> %timeit np_cos_norm(x, y)
138 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Ok, that's our baseline.
>>> %load_ext Cython
>>> %%cython
>>> import numpy as np
>>> def cy_np_cos_norm(a, b):
... val = np.sum(1. - np.cos(a-b))
... return np.sqrt(val / 2. / a.shape[0])
>>> %timeit cy_np_cos_norm(x, y)
137 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Nothing surprising: there are no type annotations, and there isn't much interpretation overhead to remove anyway.
>>> %%cython
>>> import numpy as np
>>> cimport numpy as np
>>> def cy_np_typed_cos_norm(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
... val = np.sum(1. - np.cos(a-b))
... return np.sqrt(val / 2. / a.shape[0])
>>> %timeit cy_np_typed_cos_norm(x, y)
140 µs ± 765 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Same here: adding types is not enough; Cython still uses Numpy's implementation for each individual operation.
>>> %%cython
>>> #cython: np_pythran=True
>>> #cython: cxx=True
>>> import numpy as np
>>> cimport numpy as np
...
>>> def cy_np_typed_pythran_cos_norm(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
... cdef int n = len(a)
... val = np.sum(1. - np.cos(a-b))
... return np.sqrt(val / 2. / n)
>>> %timeit cy_np_typed_pythran_cos_norm(x, y)
131 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
That's slightly better, but not really impressive. That's because the execution time of the kernel is dominated by the cos call, and Pythran cannot do much about it.
>>> %%cython
...
>>> import numpy as np
>>> cimport numpy as np
>>> from libc.math cimport cos, sqrt
...
>>> cimport cython
>>> @cython.boundscheck(False) # turn off bounds-checking for entire function
>>> @cython.wraparound(False) # turn off negative index wrapping for entire function
>>> def cy_np_typed_python_cos_norm(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
... cdef int n = len(a), i
... cdef double acc = 0,res
... for i in range(n):
... acc += 1 - cos(a[i]-b[i])
... res = sqrt(acc / 2. / n)
... return res
>>> %timeit cy_np_typed_python_cos_norm(x, y)
130 µs ± 2.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Indeed, even the explicit C-style loop does not give us a great speedup...
>>> %%cython -c=-DUSE_XSIMD -c=-march=native
>>> #cython: np_pythran=True
>>> #cython: cxx=True
>>> import numpy as np
>>> cimport numpy as np
...
>>> def cy_np_typed_pythran_xsimd_cos_norm(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
... cdef int n = len(a)
... val = np.sum(1. - np.cos(a-b))
... return np.sqrt(val / 2. / n)
>>> %timeit cy_np_typed_pythran_xsimd_cos_norm(x, y)
34.8 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
That is interesting. By using the -DUSE_XSIMD flag and allowing the use of the machine-specific instruction set (in our case, AVX), we get a solid 4x speedup. And we still get to use the nice, high-level Numpy syntax.
laplacien
>>> def np_laplacien(image):
... out_image = np.abs(4*image[1:-1,1:-1] -
... image[0:-2,1:-1] - image[2:,1:-1] -
... image[1:-1,0:-2] - image[1:-1,2:])
... valmax = np.max(out_image)
... valmax = max(1.,valmax)+1.E-9
... out_image /= valmax
... return out_image
>>> N = 500 ; image = np.random.randn(N,N)
>>> %timeit np_laplacien(image)
2.61 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Again, a high-level Numpy baseline.
>>> %%cython
>>> import numpy as np
>>> def cy_np_laplacien(image):
... out_image = np.abs(4*image[1:-1,1:-1] -
... image[0:-2,1:-1] - image[2:,1:-1] -
... image[1:-1,0:-2] - image[1:-1,2:])
... valmax = np.max(out_image)
... valmax = max(1.,valmax)+1.E-9
... out_image /= valmax
... return out_image
>>> %timeit cy_np_laplacien(image)
2.83 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And it comes as no surprise that merely cythonizing it does not help much.
>>> %%cython
>>> #cython: np_pythran=True
>>> #cython: cxx=True
>>> import numpy as np
>>> cimport numpy as np
...
>>> def cy_np_pythran_laplacien(np.ndarray[double, ndim=2] image):
...
... out_image = np.abs(4*image[1:-1,1:-1] -
... image[0:-2,1:-1] - image[2:,1:-1] -
... image[1:-1,0:-2] - image[1:-1,2:])
... valmax = np.max(out_image)
... valmax = max(1.,valmax)+1.E-9
... out_image /= valmax
... return out_image
>>> %timeit cy_np_pythran_laplacien(image)
640 µs ± 68.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This time, plain Pythran without vectorization is already a nice win, as there is no costly math call hiding the other optimizations.
>>> %%cython
>>> import numpy as np
>>> cimport numpy as np
>>> from libc.math cimport fabs
...
>>> cimport cython
>>> @cython.boundscheck(False) # turn off bounds-checking for entire function
>>> @cython.wraparound(False) # turn off negative index wrapping for entire function
>>> def cy_py_laplacien(np.ndarray[double, ndim=2] image):
... cdef int i, j
... cdef int n = image.shape[0], m = image.shape[1]
... cdef np.ndarray[double, ndim=2] out_image = np.empty((n-2,m-2))
... cdef double valmax
... for i in range(n-2):
... for j in range(m-2):
... out_image[i,j] = fabs(4*image[1+i,1+j] -
... image[i,1+j] - image[2+i,1+j] -
... image[1+i,j] - image[1+i,2+j])
... valmax = out_image[0,0]
... for i in range(n-2):
... for j in range(m-2):
... if out_image[i,j] > valmax:
... valmax = out_image[i,j]
... valmax = max(1.,valmax)+1.E-9
... for i in range(n-2):
... for j in range(m-2):
... out_image[i,j] /= valmax
... return out_image
>>> %timeit cy_py_laplacien(image)
852 µs ± 47.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And the plain-loop Cython version is not even as good as Pythran's :-) This is probably due to fabs, which implies a call to libm, while Pythran's np.abs does not have this overhead.
Future work
Pythran does not yet support memory views, while Cython does. Memory views are an official CPython API for transparently passing buffers back and forth between the native world and Python. Like Numpy arrays, they support various shapes and orderings, and Numpy arrays can themselves be processed as memory views. The next move for Pythran would be to support this, and for Cython to be able to use the Pythran backend for memory views!
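For reference, here is what the cos-norm kernel looks like with Cython typed memoryviews (a sketch with a made-up name; this spelling works in Cython today, but does not go through the Pythran backend yet):
import numpy as np

def mv_cos_norm(double[::1] a, double[::1] b):
    # double[::1] declares a one-dimensional, C-contiguous memory view;
    # np.asarray() gives back a Numpy view on the same buffer, without a copy.
    val = np.sum(1. - np.cos(np.asarray(a) - np.asarray(b)))
    return np.sqrt(val / 2. / a.shape[0])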
This blog post was originally a Jupyter Notebook. You can download it if you want. The conversion was done using nbconvert and a custom template to match the style of the rest of the blog.