Python: When Performance Matters

Proudly made in Namek by serge-sans-paille & pbrunet

/me

Serge « sans paille » Guelton

$ whoami
sguelton
  • R&D engineer at QuarksLab on compilation for security
  • Associate researcher at Télécom Bretagne

Myth: Python is slow

The Computer Language Benchmarks Game

binary-trees
	
source 	    secs 
Python 3    152.06
C++ g++  	6.98

Reality: CPython is fast (enough) in some cases

On the pidigits benchmark, that uses BigInts

× 	source 	secs
1.0 Pascal  1.73
1.0 C gcc	1.73
1.0 Rust	1.74
1.1 Fortran	1.92
1.3 Python3	2.20
1.3 C++ g++	2.29

                    

Reality: CPython is really slow for numeric computations

On the spectral-norm benchmark, that uses list and scalars

×   source    secs
1.0 C gcc     1.98
2.1 Java      4.26
8.0 Node.js  15.77
126 Python3 250.12
                    

But that's not a property of Python!

$ gcc sn.c -o sn -O3
$ time ./sn 5500
./sn 5500  4.86s user 0.00s system 99% cpu 4.864 total
$ pythran sn.py
$ python -m timeit 'import sn' 'sn.main(5500)'
10 loops, best of 3: 4.79 sec per loop

Tell me Why!

Interpreted code


>>> import dis
>>> code = lambda x, y: x + y
>>> dis.dis(code)
  1           0 LOAD_FAST                0 (x)
              3 LOAD_FAST                1 (y)
              6 BINARY_ADD          
              7 RETURN_VALUE
VS.

foo:
    leal    (%rdi,%rsi), %eax
    ret

Tell me Why!

Indirections Everywhere

  • LOAD_FAST ⇒ array lookup
  • BINARY_ADD ⇒ dict lookup

Funcion Call Overhead

  1. Suspend current frame
  2. Create a new frame
  3. Push it on the stack

Tell me Why!

No optimization

(except somme peephole optimization on the bytecode)
$ gcc sn.c -o sn -O0 -fsanitize=address -fsanitize=null -fsanitize=signed-integer-overflow -fsanitize=integer-divide-by-zero
$ time ./sn 5500
./a.out 5500  11.04s user 0.02s system 99% cpu 11.053 total
                

Tell me Why!

Poor Parallelism Support

Global Interpreter Lock ⇒ few parallism gain (except io/native calls) Do not speak about vectorization

Prove me wrong

  • Efficient dictionnary
  • Cached String hashing
  • BigInt comparable to GMP

Many Native Library Wrappers

Then comes NumPy

A Python packages for numeric computations

  • Flat Data Types optimized for Native ↔ Python transfer
  • Lightweight BLAS wrappers
  • All math routines and operators
  • Many libraries (random, polynomial, linalg, fft)
  • Relatively extensible (aggregate, ufunc, f2py)

→ Keystone of the Python scientific ecosystem!

Example of (bad) Python code

def pairwise(X):
    M, N = X.shape
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return D

Do it the NumPy way!

import numpy as np
def pairwise(X):
    return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))

Numpy Ecosystem

  • plot: Matplotlib
  • Image/Signal: Scipy
  • Stats: Scipy
  • Machine learning: SKlearn

Performance Tips

  1. You don't need performance Tips, you're a scientist, get the job done!
  2. Benchmark! Profile!
    • Use mpi4py for cluster parallelization
    • Use Cython/Hope/Parakeet/Pythran/Numba for more raw perf
    • Use multithreading from Cython/Pythran
    • Use GPUs with PyCUDA/PyOpenCL/GT-Py
  3. Don't use explicit indexing/looping

Example of MPI interaction

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(100, dtype='i')
else:
    data = np.empty(100, dtype='i')
comm.Bcast(data, root=0)

Calling Native Code from Python

ctypes

>>> from ctypes import *
>>> libm = cdll.LoadLibrary('libm.so.6')
>>> sqrt = libm.sqrt
>>> sqrt.argtypes, sqrt.restype = (c_double,), c_double
>>> sqrt(4.)

Building and Calling Native Code from Python

cffi

>>> from cffi import FFI
>>> ffi = FFI()
>>> ffi.cdef('int add(int, int);')
>>> C = ffi.verify('int add(int x, int y) { return x+y;}')
>>> C.add(40, 2)
42

Building Native Modules

Cython

cimport cython
cpdef double add(double x, double y):
    return x + y
Then
$ cython -a add.pyx
$ gcc `python-config --cflags --ldflags` -fPIC -shared add.c -o add.so

Building Native Modules

Pythran

#pythran export add(double, double)
def add(x, y):
    return x + y
Then
$ pythran add.py

Wrapping Native Code

SWIG

Through a custom compiler + interface file

Boost.Python

Regular C++ code + runtime deps

pybind11

Regular C++ code + modern compiler

PyBind11 example


#include 
int add(int i, int j) {
    return i + j;
}
namespace py = pybind11;
PYBIND11_PLUGIN(example) {
    py::module m("example", "pybind11 example plugin");
    m.def("add", &add, "A function which adds two numbers");
    return m.ptr();
}

General Considerations

Pro:

  • Leverage existing libs
  • Parallelism
  • Vectorization (SIMD code)

Cons:

  • Type conversion cost
  • Packaging issues
  • Runtime dependencies
  • Exceptions / Error handling

Just in Time Compilation

PyPy

a full interpreter

  • Python2 and Python3
  • Partial Numpy Support
  • cffi compatibility

Promise: no code change

Just in Time Compilation

Numba

Based on LLVM
from numba import jit
@jit
def add(x, y): return x + y

Hope

Based on C++
from hope import jit
@jit
def add(x, y): return x + y

Profiling Interpreted Code

cProf

$ python -m cProfile -s cumtime myscript.py
                

timeit

$ python -m timeit -s 'import mymodule' 'mymodule.run(2)'
                

Profiling native Code

Usual suspects

perf

$ perf run python myscript.py
$ perf report
                

valgrind

$ valgrind --tool=callgrind python myscript.py
$ kcachegrind callgrind.out.1982
                

THE END

Some reading