Being more than a translator
This blog post was originally a Jupyter Notebook. You can download it if you want. The conversion was done using nbconvert
and a custom template to match the style of the rest of the blog.
Special thanks to w1gz for his review btw o/
Optimizing with Cython
Jake VanderPlas recently wrote a blogpost about Cython, where the following piece of Python code and the associated benchmark can be found:
>>> import numpy as np
>>> def ising_step(field, beta=0.4):
...     N, M = field.shape
...     for n_offset in range(2):
...         for m_offset in range(2):
...             for n in range(n_offset, N, 2):
...                 for m in range(m_offset, M, 2):
...                     _ising_update(field, n, m, np.float32(beta))
...     return field
...
>>> def _ising_update(field, n, m, beta):
...     total = 0
...     N, M = field.shape
...     for i in range(n-1, n+2):
...         for j in range(m-1, m+2):
...             if i == n and j == m:
...                 continue
...             total += field[i % N, j % M]
...     dE = 2 * field[n, m] * total
...     if dE <= 0:
...         field[n, m] *= -1
...     elif np.exp(-dE * beta) > np.random.rand():
...         field[n, m] *= -1
...
>>> def random_spin_field(N, M):
...     return np.random.choice([-1, 1], size=(N, M))
...
>>> field = random_spin_field(200, 200)
>>> %timeit ising_step(field)
1 loop, best of 3: 938 ms per loop
The blogpost also contains a Cython implementation of the same kernel:
>>> %load_ext cython
>>> %%cython
... cimport cython
...
... import numpy as np
... cimport numpy as np
...
... from libc.math cimport exp
... from libc.stdlib cimport rand
... cdef extern from "limits.h":
...     int RAND_MAX
...
...
... @cython.boundscheck(False)
... @cython.wraparound(False)
... def cy_ising_step(np.int64_t[:, :] field, float beta=0.4):
...     cdef int N = field.shape[0]
...     cdef int M = field.shape[1]
...     cdef int n_offset, m_offset, n, m
...     for n_offset in range(2):
...         for m_offset in range(2):
...             for n in range(n_offset, N, 2):
...                 for m in range(m_offset, M, 2):
...                     _cy_ising_update(field, n, m, beta)
...     return np.array(field)
...
...
... @cython.boundscheck(False)
... @cython.wraparound(False)
... cdef _cy_ising_update(np.int64_t[:, :] field, int n, int m, float beta):
...     cdef int total = 0
...     cdef int N = field.shape[0]
...     cdef int M = field.shape[1]
...     cdef int i, j
...     for i in range(n-1, n+2):
...         for j in range(m-1, m+2):
...             if i == n and j == m:
...                 continue
...             total += field[i % N, j % M]
...     cdef float dE = 2 * field[n, m] * total
...     if dE <= 0:
...         field[n, m] *= -1
...     elif exp(-dE * beta) * RAND_MAX > rand():
...         field[n, m] *= -1
...
>>> %timeit cy_ising_step(field)
The slowest run took 4.57 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 8.64 ms per loop
As expected, this is much faster than the Python version with its explicit loops. Here, Cython is typically used as a guided translator: it translates Python code into C code, using the type annotations and the extra Cython directives to drive the translation process, removing most of the calls to the Python C API.
Using the Pythran compiler
When I wanted to use Pythran to convert the original Python code into native code, I went the traditional Pythran way, i.e. copy-paste the Python code, add a #pythran export
line and that's all folks. Unfortunately, I ended up with code roughly two times slower than the Cython version. That's still good with respect to the Python version, but a bit disappointing.
So I did some benchmarking, compared the C code generated by Cython with the C++ code generated by Pythran, made sure the compiler flags were similar, etc. And I realized that Pythran uses int64
for Python integers, including loop indices, whereas the Cython version of this kernel is specialized to use int
. So I was not comparing the same computations. In a similar manner, the Pythran version -- just like the Python version -- uses double precision floating point numbers, whereas the Cython version is specialized to use single precision numbers.
This alone should not explain the difference between the two implementations, but it turns out it does, because one of the hotspots of the program is the modulo computation in the loop indexing. Indeed, the modulo is a relatively costly operation compared to an add, and even compared to a branch misprediction; check this great table for a reminder about that. Moreover, the cost of the modulo increases with the integer size.
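To get a feel for the gap, here is a quick, machine-dependent micro-benchmark sketch (array sizes, names and the constant 200 are illustrative choices of mine, not from the original post) comparing a vectorized modulo against a vectorized add:

```python
import timeit

import numpy as np

# Time a modulo against an add over the same int64 data, and the same
# modulo on int32 data. Absolute numbers depend on the machine; the
# point is that the modulo is noticeably more expensive than the add.
a64 = np.arange(1_000_000, dtype=np.int64)
a32 = a64.astype(np.int32)

t_add64 = timeit.timeit(lambda: a64 + 200, number=20)
t_mod64 = timeit.timeit(lambda: a64 % 200, number=20)
t_mod32 = timeit.timeit(lambda: a32 % 200, number=20)

print(f"add int64: {t_add64:.4f}s")
print(f"mod int64: {t_mod64:.4f}s")
print(f"mod int32: {t_mod32:.4f}s")
```

On typical hardware the int64 modulo comes out several times slower than the add, and the int32 modulo sits in between, which is consistent with the integer-width effect described above.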
After some tinkering, and without changing the original Python code, I managed to get Pythran to produce code that runs faster than the Cython version, without changing the integer type. Demonstration:
>>> import pythran
>>> %load_ext pythran.magic
>>> %%pythran
... #pythran export ising_step(int64[:,:])
...
... import numpy as np
... def ising_step(field, beta=0.4):
...     N, M = field.shape
...     for n_offset in range(2):
...         for m_offset in range(2):
...             for n in range(n_offset, N, 2):
...                 for m in range(m_offset, M, 2):
...                     _ising_update(field, n, m, beta)
...     return field
...
... def _ising_update(field, n, m, beta):
...     total = 0
...     N, M = field.shape
...     for i in range(n-1, n+2):
...         for j in range(m-1, m+2):
...             if i == n and j == m:
...                 continue
...             total += field[i % N, j % M]
...     dE = 2 * field[n, m] * total
...     if dE <= 0:
...         field[n, m] *= -1
...     elif np.exp(-dE * beta) > np.random.rand():
...         field[n, m] *= -1
...
>>> %timeit ising_step(field)
100 loops, best of 3: 7.76 ms per loop
The timings get even better if beta
is forced into a single precision float, to match Cython's code, but that's not the goal of this article.
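For reference, keeping the computation in single precision in plain NumPy just means making every operand of the exponential a float32 (a small sketch with made-up values, not the code Pythran generates):

```python
import numpy as np

# If beta and dE are both float32 scalars, NumPy keeps the whole
# Metropolis probability computation in single precision, matching
# what the specialized Cython kernel does.
beta = np.float32(0.4)
dE = np.float32(2 * 1 * 6)   # example energy delta from a +/-1 spin field
p = np.exp(-dE * beta)       # float32 in, float32 out

assert p.dtype == np.float32
```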
So what happened? That's the topic of the second part of this post :-)
Implementing a new Pythran optimization
Once one realizes that the above code is bound by the modulo computation, the natural optimization goal becomes to get rid of the modulo. One way to do so is to notice that in the two expressions i % N
and j % M
(let us denote them as i_n
and j_m
):

- i
and j
are loop induction variables, iterating through an increasing range;
- N
and M
are positive values.

Thanks to the above properties, instead of computing the modulo each time, it is possible to use the inductive formula i_n = 0 if i_n == N - 1 else i_n + 1
.
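As a sanity check, here is the inductive rewrite next to the plain modulo on a simple 1-D walk (function names are illustrative, not Pythran output):

```python
def indices_with_mod(start, stop, N):
    # reference implementation: one modulo per iteration
    return [i % N for i in range(start, stop)]

def indices_inductive(start, stop, N):
    # one modulo to seed i_n, then only a compare and an add per iteration
    out = []
    i_n = start % N
    for _ in range(start, stop):
        out.append(i_n)
        i_n = 0 if i_n == N - 1 else i_n + 1
    return out
```

Both functions produce the same sequence, e.g. `indices_with_mod(-1, 10, 4)` and `indices_inductive(-1, 10, 4)` both start with `[3, 0, 1, 2, ...]`, but the second one replaces the per-iteration division with a branch that the hardware predicts well.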
Proving that i
and j
are loop induction variables is relatively simple: Pythran already provides the tooling for identifier binding, so binding the identifier range
to the corresponding builtin is not an issue, and walking the use-def chain is also within the scope of Pythran's analyses.
That N
is positive can be deduced from its assignment, which results from tuple unpacking of an array shape. And a shape only contains positive numbers.
Pythran models the shape
attribute correctly and can see through the tuple destructuring after a simplification step. Let's showcase that and use it as an opportunity to play with some Pythran internals.
>>> from pythran import passmanager, backend
>>> import gast as ast
gast
is just a thin portability layer over the Python standard ast
module. It provides the same API with a few extras, and helps cope with the Python 2/Python 3 transition.
>>> code = '''
... def upper_dim(dat):
...     M, _ = dat.shape
...     return M
... '''
>>> node = ast.parse(code)
We first instantiate a pass manager to apply Pythran's transformations. A few normalization steps are necessary: they put the Python AST into a normalized form that is easier for Pythran's optimizations to process.
>>> from pythran.transformations import NormalizeTuples, NormalizeMethodCalls
>>> pm = passmanager.PassManager("test")
>>> pm.apply(NormalizeTuples, node)
>>> pm.apply(NormalizeMethodCalls, node)
(True, <gast.gast.Module at 0x7f00c4fa32d0>)
True means something got changed in the process; by using the Python backend of Pythran, we can get back a Python view of the transformed code:
>>> print(pm.dump(backend.Python, node))
def upper_dim(dat):
    M = __builtin__.getattr(dat, 'shape')[0]
    _ = __builtin__.getattr(dat, 'shape')[1]
    return M
Using the RangeValues
analysis, it is possible to collect range information about the various nodes and variables in this function:
>>> from pythran.analyses import RangeValues
...
>>> rv = pm.gather(RangeValues, node)
>>> rv['M']
Interval(low=0, high=inf)
In the end, given a simple representation of the original code, a (simplified) Pythran pipeline can optimize it into Python code that can be translated to more efficient native code:
>>> code = '''
... def foo(x):
...     y = len(x)
...     for i in range(3):
...         z = i % y
... '''
>>> node = ast.parse(code)
>>> from pythran.transformations import ExpandBuiltins
>>> from pythran.optimizations import ModIndex, IterTransformation
>>> pm.apply(ExpandBuiltins, node)
>>> pm.apply(IterTransformation, node)
>>> pm.apply(ModIndex, node)
>>> print(pm.dump(backend.Python, node))
def foo(x):
    y = __builtin__.len(x)
    i_m = ((0 - 1) % y)
    for i in __builtin__.xrange(3):
        i_m = (0 if ((i_m + 1) == y) else (i_m + 1))
        z = i_m
Here is a summary of what happens:

- ExpandBuiltins
makes sure that any identifier that is available because the __builtin__
module (builtins
in Python 3) is loaded by default has its full path specified. It turns range
into __builtin__.range
.
- IterTransformation
turns functions that create a list into their counterparts that create a generator, when it is legal to do so (map
→ itertools.imap
etc., in Python 2 terminology).
- ModIndex
simplifies the modulo operation into its inductive form.
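To make the effect of ModIndex concrete on the Ising kernel itself, here is a hand-written sketch of what the strength-reduced neighbourhood sum is morally equivalent to (an illustration of the idea, not the code Pythran actually emits):

```python
import numpy as np

def neighbor_sum_mod(field, n, m):
    # reference: modulo-based periodic neighbourhood sum, as in _ising_update
    N, M = field.shape
    total = 0
    for i in range(n - 1, n + 2):
        for j in range(m - 1, m + 2):
            if i == n and j == m:
                continue
            total += field[i % N, j % M]
    return total

def neighbor_sum_inductive(field, n, m):
    # same sum, with each per-iteration modulo strength-reduced
    # into a wrap-around increment
    N, M = field.shape
    total = 0
    i_n = (n - 1) % N              # one modulo to seed the induction variable
    for i in range(n - 1, n + 2):
        j_m = (m - 1) % M          # ditto for the inner loop
        for j in range(m - 1, m + 2):
            if not (i == n and j == m):
                total += field[i_n, j_m]
            j_m = 0 if j_m == M - 1 else j_m + 1
        i_n = 0 if i_n == N - 1 else i_n + 1
    return total
```

The two versions agree on every cell, including on the borders where the periodic wrap-around kicks in; only the second one keeps the division out of the hot loop.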
Extra transformations can be applied, just for the fun of it:
>>> from pythran.transformations import FalsePolymorphism
>>> from pythran.optimizations import DeadCodeElimination, LoopFullUnrolling
...
>>> pm.apply(LoopFullUnrolling, node)
>>> pm.apply(FalsePolymorphism, node) # basically scalar renaming
...
>>> while pm.apply(DeadCodeElimination, node)[0]:
...     pass
>>> print(pm.dump(backend.Python, node))
def foo(x):
    pass
    pass
    pass
    pass
    pass
    pass
    pass
    pass
    pass
    pass
    pass
Conclusion
The transformation described in this post could be made more generic (by supporting a broader range of expressions as the modulo parameter, for instance). Still, it showcases the underlying idea of Pythran: given enough test cases, it should be possible to pile up more and more transformations, so that the user can focus on writing code and let the compiler take care of all the details, taking advantage of all the knowledge gathered in the compiler.
There's a great benefit to reasoning at Python's level: the builtins provide high-level functionality whose semantics carry information a smart enough compiler can take advantage of.