Python: When Performance Matters

Proudly made in Namek by serge-sans-paille & pbrunet

`/me`

Serge « sans paille » Guelton

$ whoami
sguelton

R&D engineer at QuarksLab on compilation for security
Associate researcher at Télécom Bretagne

Myth: Python is slow

The Computer Language Benchmarks Game

binary-trees
	
source 	    secs 
Python 3    152.06
C++ g++  	6.98

Reality: CPython is fast (enough) in some cases

On the pidigits benchmark, that uses BigInts

× 	source 	secs
1.0 Pascal  1.73
1.0 C gcc	1.73
1.0 Rust	1.74
1.1 Fortran	1.92
1.3 Python3	2.20
1.3 C++ g++	2.29

Reality: CPython is really slow for numeric computations

On the spectral-norm benchmark, that uses list and scalars

×   source    secs
1.0 C gcc     1.98
2.1 Java      4.26
8.0 Node.js  15.77
126 Python3 250.12

But that's not a property of Python!

$ gcc sn.c -o sn -O3
$ time ./sn 5500
./sn 5500  4.86s user 0.00s system 99% cpu 4.864 total
$ pythran sn.py
$ python -m timeit 'import sn' 'sn.main(5500)'
10 loops, best of 3: 4.79 sec per loop

Tell me Why!

Interpreted code


>>> import dis
>>> code = lambda x, y: x + y
>>> dis.dis(code)
  1           0 LOAD_FAST                0 (x)
              3 LOAD_FAST                1 (y)
              6 BINARY_ADD          
              7 RETURN_VALUE

VS.


foo:
    leal    (%rdi,%rsi), %eax
    ret

Tell me Why!

Indirections Everywhere

LOAD_FAST ⇒ array lookup
BINARY_ADD ⇒ dict lookup

Funcion Call Overhead

Suspend current frame
Create a new frame
Push it on the stack

Tell me Why!

No optimization

(except somme peephole optimization on the bytecode)

$ gcc sn.c -o sn -O0 -fsanitize=address -fsanitize=null -fsanitize=signed-integer-overflow -fsanitize=integer-divide-by-zero
$ time ./sn 5500
./a.out 5500  11.04s user 0.02s system 99% cpu 11.053 total

Tell me Why!

Poor Parallelism Support

Global Interpreter Lock ⇒ few parallism gain (except io/native calls) Do not speak about vectorization

Prove me wrong

Efficient dictionnary
Cached String hashing
BigInt comparable to GMP

Many Native Library Wrappers

Then comes NumPy

A Python packages for numeric computations

Flat Data Types optimized for Native ↔ Python transfer
Lightweight BLAS wrappers
All math routines and operators
Many libraries (random, polynomial, linalg, fft)
Relatively extensible (aggregate, ufunc, f2py)

→ Keystone of the Python scientific ecosystem!

Example of (bad) Python code

def pairwise(X):
    M, N = X.shape
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return D

Do it the NumPy way!

import numpy as np
def pairwise(X):
    return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))

Numpy Ecosystem

plot: Matplotlib
Image/Signal: Scipy
Stats: Scipy
Machine learning: SKlearn

Performance Tips

You don't need performance Tips, you're a scientist, get the job done!
Benchmark! Profile!

Use mpi4py for cluster parallelization
Use Cython/Hope/Parakeet/Pythran/Numba for more raw perf
Use multithreading from Cython/Pythran
Use GPUs with PyCUDA/PyOpenCL/GT-Py

Don't use explicit indexing/looping

Example of MPI interaction

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(100, dtype='i')
else:
    data = np.empty(100, dtype='i')
comm.Bcast(data, root=0)

Calling Native Code from Python

`ctypes`

>>> from ctypes import *
>>> libm = cdll.LoadLibrary('libm.so.6')
>>> sqrt = libm.sqrt
>>> sqrt.argtypes, sqrt.restype = (c_double,), c_double
>>> sqrt(4.)

Building and Calling Native Code from Python

`cffi`

>>> from cffi import FFI
>>> ffi = FFI()
>>> ffi.cdef('int add(int, int);')
>>> C = ffi.verify('int add(int x, int y) { return x+y;}')
>>> C.add(40, 2)
42

Building Native Modules

`Cython`

cimport cython
cpdef double add(double x, double y):
    return x + y

Then

$ cython -a add.pyx
$ gcc `python-config --cflags --ldflags` -fPIC -shared add.c -o add.so

Building Native Modules

`Pythran`

#pythran export add(double, double)
def add(x, y):
    return x + y

Then

$ pythran add.py

Wrapping Native Code

SWIG

Through a custom compiler + interface file

Boost.Python

Regular C++ code + runtime deps

pybind11

Regular C++ code + modern compiler

PyBind11 example


#include 
int add(int i, int j) {
    return i + j;
}
namespace py = pybind11;
PYBIND11_PLUGIN(example) {
    py::module m("example", "pybind11 example plugin");
    m.def("add", &add, "A function which adds two numbers");
    return m.ptr();
}

General Considerations

Pro:

Leverage existing libs
Parallelism
Vectorization (SIMD code)

Cons:

Type conversion cost
Packaging issues
Runtime dependencies
Exceptions / Error handling

Just in Time Compilation

`PyPy`

a full interpreter

Python2 and Python3
Partial Numpy Support
cffi compatibility

Promise: no code change

Just in Time Compilation

`Numba`

Based on LLVM

from numba import jit
@jit
def add(x, y): return x + y

`Hope`

Based on C++

from hope import jit
@jit
def add(x, y): return x + y

Profiling Interpreted Code

`cProf`

$ python -m cProfile -s cumtime myscript.py

`timeit`

$ python -m timeit -s 'import mymodule' 'mymodule.run(2)'

Profiling native Code

Usual suspects

`perf`

$ perf run python myscript.py
$ perf report

`valgrind`

$ valgrind --tool=callgrind python myscript.py
$ kcachegrind callgrind.out.1982

THE END

Some reading

https://wiki.python.org/moin/PythonSpeed
High Performance Python, Practical Performant Programming for Humans by Micha Gorelick, Ian Ozsvald