Getting the best of both worlds: Cython and Pythran working together
Once upon a time, on IRC, Serge Guelton asked me whether I wanted to work on having Cython use Pythran for Numpy-related computations. I wasn't really sure what I was getting into, but I've always liked optimizing software, and it sounded like an interesting challenge to understand both projects. On another important note, this whole project has been financed by the OpenDreamKit project!
That's the end of the short story; now let's get to the real stuff!
Why mix Cython and Pythran?
On one side, when Cython code contains operations on Numpy arrays, Cython relies on the original Numpy package to compute them. This involves falling back to the Python interpreter, and thus misses several optimization opportunities, especially with complex expressions: even if each Numpy call is decently optimized, their combination is not.
On the other side, Pythran has a full C++ implementation of a major subset of the Numpy API. Among the advantages of this implementation are its support for expression templates and SIMD instructions. Expression templates make it possible to fuse the loops that occur when expressions with multiple operators are computed. For instance, the expression a + b * c is transformed by Cython into two calls: one for the multiplication of b by c, and one for the addition of a to the result of this multiplication. Each call ends up as one loop that reads memory, computes the operation, and writes back to (newly allocated) memory. The second loop has the same pattern. On today's architectures, memory bandwidth is often the limiting factor in this kind of operation. It is thus really interesting to merge these loops, and load and store the memory only once.
Expression templating is a C++ technique that delays the evaluation of expressions until they are stored to memory. Thus, in this case, the two loops are automatically fused by the C++ compiler, and we get an optimized version of this code. Note that this technique is used for instance by the C++ wrapper of the GMP library. Using xsimd, it is even possible to automagically vectorize these computations.
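To make the loop-fusion point concrete, here is a hand-written sketch (plain Python, purely illustrative; the function names are made up) of what the two evaluation strategies of a + b * c amount to:
import numpy as np

def unfused(a, b, c):
    # what happens with one Numpy call per operator: two loops, one temporary
    tmp = b * c          # loop 1: reads b and c, writes tmp
    return a + tmp       # loop 2: reads a and tmp, writes the result

def fused(a, b, c):
    # what the expression-template evaluation boils down to:
    # a single loop, no temporary, each element loaded and stored once
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i] * c[i]
    return out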
The project focused on using this Pythran backend for Numpy arrays in Cython whenever possible. At the time this integration was written, Pythran had a few limitations regarding the Numpy arrays it could handle:
- array "views" are not supported. That means that arrays must be stored in contiguous memory. Fortran and C-style format are supported.
- the endianess of the integers must be the same that the one of the targeted architecture (note that Cython has the same limitation, and that it is still true today)
However, we still achieve interesting speedups, without the need for manual loops.
Implementation details within Cython
The overall idea of this feature's implementation is to generate code that uses the pythonic backend instead of calls to the Numpy Python functions. Moreover, as Pythran doesn't support every kind of Numpy array, we need a mechanism to switch back to the original implementation when necessary.
In order to explain this, let's take an example with this simple Cython function:
import numpy as np
cimport numpy as np
def add(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
return a+b
When we encounter such a definition, we want to generate various functions, depending on the shapes of a and b at runtime:
- the original Cython code, if neither a nor b can be handled by Pythran
- the version where only a can be handled by Pythran
- the version where only b can be handled by Pythran
- the version where both a and b can be handled by Pythran
Note that, in the case of the add function, only the first and last versions are really of value. For now, we don't try to be smart about this and generate all of these versions.
In order to do that, we rely on the type infrastructure that already exists in Cython. For every argument that is a potentially Pythran-supported Numpy array, we convert its type into a Cython FusedType. A FusedType declares a union of types, and multiple FusedTypes can be used within a function. In the case of our add function, this generates the four aforementioned versions, and the dispatching is done at runtime. What's nice is that we just need to declare these FusedType types, and Cython already handles all the dispatching and the generation of the various functions.
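As an illustration of the mechanism (a hand-written sketch, not the actual code Cython generates for the Pythran backend; the names elem_a, elem_b and first_sum are made up), declaring one fused type per argument makes Cython emit one specialization per combination and pick the right one at call time:
ctypedef fused elem_a:
    float
    double

ctypedef fused elem_b:
    float
    double

def first_sum(elem_a[::1] a, elem_b[::1] b):
    # Cython emits one specialization per (elem_a, elem_b) combination
    # -- four in total -- and dispatches on the runtime buffer types.
    return a[0] + b[0]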
Once this is done, we use these rules to know when we can generate pythonic-based code:
- unary and binary operators on pythonic arrays are supported operations
- calling a function of a module implemented in Pythran is a supported operation
If none of these rules apply, we fall back to Python objects and use the original Cython implementation.
How to use it
The Pythran-Numpy backend isn't activated by default within Cython. There are multiple ways to activate it:
- if you are using Cython directly from the command line, you can pass the --np-pythran flag to the cython program
- if you are using distutils, you can just add a comment with # cython: np_pythran=True at the top of the relevant Cython files (a minimal build sketch follows below)
More detailed information can be found within the Cython documentation.
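For the distutils route, a build script could look like the following sketch (untested; kernel.pyx is a hypothetical file whose first lines contain the np_pythran=True and cxx=True directives used in the examples below, Pythran must be installed, and depending on your installation you may also need to add Pythran's header directory to include_dirs):
# setup.py -- minimal sketch for building a hypothetical kernel.pyx
# whose header enables the backend:
#   # cython: np_pythran=True
#   # cython: cxx=True
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

ext = Extension(
    "kernel",
    sources=["kernel.pyx"],
    include_dirs=[numpy.get_include()],  # Numpy headers for the buffer interface
    language="c++",                      # the Pythran backend emits C++
)

setup(ext_modules=cythonize([ext]))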
Benchmarks and Examples
cos norm
>>> import numpy as np
...
>>> n = 10000
>>> x, y = np.random.random((2, n))
>>> def np_cos_norm(a, b):
... val = np.sum(1. - np.cos(a-b))
... return np.sqrt(val / 2. / a.shape[0])
>>> %timeit np_cos_norm(x, y)
138 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Ok, that's our baseline.
>>> %load_ext Cython
>>> %%cython
>>> import numpy as np
>>> def cy_np_cos_norm(a, b):
... val = np.sum(1. - np.cos(a-b))
... return np.sqrt(val / 2. / a.shape[0])
>>> %timeit cy_np_cos_norm(x, y)
137 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Nothing surprising: there are no type annotations, and there isn't much interpretation overhead to remove anyway.
>>> %%cython
>>> import numpy as np
>>> cimport numpy as np
>>> def cy_np_typed_cos_norm(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
... val = np.sum(1. - np.cos(a-b))
... return np.sqrt(val / 2. / a.shape[0])
>>> %timeit cy_np_typed_cos_norm(x, y)
140 µs ± 765 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Same here: adding types is not enough; Cython still uses Numpy's implementation for each individual operation.
>>> %%cython
>>> #cython: np_pythran=True
>>> #cython: cxx=True
>>> import numpy as np
>>> cimport numpy as np
...
>>> def cy_np_typed_pythran_cos_norm(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
... cdef int n = len(a)
... val = np.sum(1. - np.cos(a-b))
... return np.sqrt(val / 2. / n)
>>> %timeit cy_np_typed_pythran_cos_norm(x, y)
131 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
That's slightly better, but not really impressive. That's because the execution time of the kernel is dominated by the cos call, and Pythran cannot do much about it.
>>> %%cython
...
>>> import numpy as np
>>> cimport numpy as np
>>> from libc.math cimport cos, sqrt
...
>>> cimport cython
>>> @cython.boundscheck(False) # turn off bounds-checking for entire function
>>> @cython.wraparound(False) # turn off negative index wrapping for entire function
>>> def cy_np_typed_python_cos_norm(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
... cdef int n = len(a), i
... cdef double acc = 0,res
... for i in range(n):
... acc += 1 - cos(a[i]-b[i])
... res = sqrt(acc / 2. / n)
... return res
>>> %timeit cy_np_typed_python_cos_norm(x, y)
130 µs ± 2.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Indeed, even the explicit C-style loop does not give us a great speedup...
>>> %%cython -c=-DUSE_XSIMD -c=-march=native
>>> #cython: np_pythran=True
>>> #cython: cxx=True
>>> import numpy as np
>>> cimport numpy as np
...
>>> def cy_np_typed_pythran_xsimd_cos_norm(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
... cdef int n = len(a)
... val = np.sum(1. - np.cos(a-b))
... return np.sqrt(val / 2. / n)
>>> %timeit cy_np_typed_pythran_xsimd_cos_norm(x, y)
34.8 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
That is interesting. By using the -DUSE_XSIMD flag and allowing the use of the machine-specific instruction set (in our case, AVX), we get a solid 4x speedup. And we still get to use the nice, high-level Numpy syntax.
laplacien
>>> def np_laplacien(image):
... out_image = np.abs(4*image[1:-1,1:-1] -
... image[0:-2,1:-1] - image[2:,1:-1] -
... image[1:-1,0:-2] - image[1:-1,2:])
... valmax = np.max(out_image)
... valmax = max(1.,valmax)+1.E-9
... out_image /= valmax
... return out_image
>>> N = 500 ; image = np.random.randn(N,N)
>>> %timeit np_laplacien(image)
2.61 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Again, a high-level Numpy baseline.
>>> %%cython
>>> import numpy as np
>>> def cy_np_laplacien(image):
... out_image = np.abs(4*image[1:-1,1:-1] -
... image[0:-2,1:-1] - image[2:,1:-1] -
... image[1:-1,0:-2] - image[1:-1,2:])
... valmax = np.max(out_image)
... valmax = max(1.,valmax)+1.E-9
... out_image /= valmax
... return out_image
>>> %timeit cy_np_laplacien(image)
2.83 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And it comes as no surprise that merely cythonizing it does not help much.
>>> %%cython
>>> #cython: np_pythran=True
>>> #cython: cxx=True
>>> import numpy as np
>>> cimport numpy as np
...
>>> def cy_np_pythran_laplacien(np.ndarray[double, ndim=2] image):
...
... out_image = np.abs(4*image[1:-1,1:-1] -
... image[0:-2,1:-1] - image[2:,1:-1] -
... image[1:-1,0:-2] - image[1:-1,2:])
... valmax = np.max(out_image)
... valmax = max(1.,valmax)+1.E-9
... out_image /= valmax
... return out_image
>>> %timeit cy_np_pythran_laplacien(image)
640 µs ± 68.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This time, plain Pythran without vectorization is already a nice win, as there is no costly math call hiding the other optimizations.
>>> %%cython
>>> import numpy as np
>>> cimport numpy as np
>>> from libc.math cimport fabs
...
>>> cimport cython
>>> @cython.boundscheck(False) # turn off bounds-checking for entire function
>>> @cython.wraparound(False) # turn off negative index wrapping for entire function
>>> def cy_py_laplacien(np.ndarray[double, ndim=2] image):
... cdef int i, j
... cdef int n = image.shape[0], m = image.shape[1]
... cdef np.ndarray[double, ndim=2] out_image = np.empty((n-2,m-2))
... cdef double valmax
... for i in range(n-2):
... for j in range(m-2):
... out_image[i,j] = fabs(4*image[1+i,1+j] -
... image[i,1+j] - image[2+i,1+j] -
... image[1+i,j] - image[1+i,2+j])
... valmax = out_image[0,0]
... for i in range(n-2):
... for j in range(m-2):
... if out_image[i,j] > valmax:
... valmax = out_image[i,j]
... valmax = max(1.,valmax)+1.E-9
... for i in range(n-2):
... for j in range(m-2):
... out_image[i,j] /= valmax
... return out_image
>>> %timeit cy_py_laplacien(image)
852 µs ± 47.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And the plain-loop Cython version is not even as good as Pythran's :-) This is probably due to fabs, which implies a call to libm, while Pythran's np.abs does not have this overhead.
Future work
Pythran does not yet support memory views, while Cython does. Memory views are an official CPython API for transparently passing buffers back and forth between the native world and Python. Like Numpy arrays, they support various shapes and orderings, and Numpy arrays can themselves be processed as memory views. The next move for Pythran would be to support this, and for Cython to be able to use the Pythran backend for memory views!
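For reference, here is what the cos-norm kernel looks like with Cython typed memoryviews (a sketch with a made-up name; this spelling works in Cython today, but does not go through the Pythran backend yet):
import numpy as np

def mv_cos_norm(double[::1] a, double[::1] b):
    # double[::1] declares a one-dimensional, C-contiguous memory view;
    # np.asarray() gives back a Numpy view on the same buffer, without a copy.
    val = np.sum(1. - np.cos(np.asarray(a) - np.asarray(b)))
    return np.sqrt(val / 2. / a.shape[0])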
This blog post was originally a Jupyter Notebook. You can download it if you want. The conversion was done using nbconvert and a custom template to match the style of the rest of the blog.