Pythran stories

Compiler Flags

When Size Matters

Everything started a few days ago with a Pythran user complaining about the size of the binaries generated by Pythran. In essence, take the following code cda.py:

#pythran export closest_distance_arrays(float, float, float[], float[])
import numpy as np
import math
def closest_distance_arrays(lat1, long1, latitudes, longitudes):
    degrees_to_radians = math.pi/180.0
    phi1 = (90.0 - lat1)*degrees_to_radians
    phi2 = (90.0 - latitudes)*degrees_to_radians
    theta1 = long1*degrees_to_radians
    theta2 = longitudes*degrees_to_radians
    cos = (math.sin(phi1)*np.sin(phi2)*np.cos(theta1 - theta2) +
           math.cos(phi1)*np.cos(phi2))
    arc = np.arccos( cos )
    return np.argmin(arc), arc.min()
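
As a quick sanity check (not part of the original post), the function can be exercised on a tiny dataset where the nearest point is known in advance; the city coordinates below are just illustrative values:

```python
import math
import numpy as np

def closest_distance_arrays(lat1, long1, latitudes, longitudes):
    degrees_to_radians = math.pi/180.0
    phi1 = (90.0 - lat1)*degrees_to_radians
    phi2 = (90.0 - latitudes)*degrees_to_radians
    theta1 = long1*degrees_to_radians
    theta2 = longitudes*degrees_to_radians
    cos = (math.sin(phi1)*np.sin(phi2)*np.cos(theta1 - theta2) +
           math.cos(phi1)*np.cos(phi2))
    arc = np.arccos(cos)
    return np.argmin(arc), arc.min()

# Query point near (48.85, 2.35); the second candidate is by far the closest.
lats = np.array([40.71, 48.86, -33.87])
lons = np.array([-74.00, 2.35, 151.21])
idx, dist = closest_distance_arrays(48.85, 2.35, lats, lons)
print(idx, dist)  # index 1, angular distance in radians (tiny)
```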

It doesn't even weigh a kilobyte, and when benchmarked, it runs in a few milliseconds:

> python -m timeit -s 'import numpy as np; n = 20000 ; lat, lon = np.random.rand(n), np.random.rand(n); x,y = np.random.rand(), np.random.rand(); from cda import closest_distance_arrays' 'closest_distance_arrays(x,y,lat, lon)'
100 loops, best of 3: 1.95 msec per loop

Thanks to the #pythran export annotation, Pythran can turn it into a native library that runs slightly faster than the Python version:

> pythran cda.py
> python -m timeit -s 'import numpy as np; n = 20000 ; lat, lon = np.random.rand(n), np.random.rand(n); x,y = np.random.rand(), np.random.rand(); from cda import closest_distance_arrays' 'closest_distance_arrays(x,y,lat, lon)'
1000 loops, best of 3: 1.17 msec per loop

It is, however, a very big binary:

> ls -lh cda.so
-rwxr-xr-x 1 sguelton sguelton 1.3M Mar 29 18:10 cda.so*

Who wants to multiply the binary size by 2e3 to get less than a x2 speedup?

The culprit: Debug Information

One can call Pythran with the -v flag to inspect part of its internals, especially the C++ compiler invocation used to perform object code generation and linking:

> pythran cda.py -v
running build_ext
running build_src
build_src
building extension "cda" sources
build_src: building npy-pkg config files
new_compiler returns distutils.unixccompiler.UnixCCompiler
INFO     customize UnixCCompiler
customize UnixCCompiler using build_ext
********************************************************************************
distutils.unixccompiler.UnixCCompiler
linker_exe    = ['gcc']
compiler_so   = ['gcc', '-DNDEBUG', '-g', '-fwrapv', '-O2', '-Wall', '-Wstrict-prototypes', '-fno-strict-aliasing', '-g', '-O2', '-fPIC']
archiver      = ['x86_64-linux-gnu-gcc-ar', 'rc']
preprocessor  = ['gcc', '-E']
linker_so     = ['x86_64-linux-gnu-gcc', '-pthread', '-shared', '-Wl,-O1', '-Wl,-Bsymbolic-functions', '-Wl,-z,relro', '-fno-strict-aliasing', '-DNDEBUG', '-g', '-fwrapv', '-O2', '-Wall', '-Wstrict-prototypes', '-Wdate-time', '-D_FORTIFY_SOURCE=2', '-g', '-fstack-protector-strong', '-Wformat', '-Werror=format-security', '-Wl,-z,relro', '-g', '-O2']
compiler_cxx  = ['g++']
ranlib        = None
compiler      = ['gcc', '-DNDEBUG', '-g', '-fwrapv', '-O2', '-Wall', '-Wstrict-prototypes', '-fno-strict-aliasing', '-g', '-O2']
libraries     = []
library_dirs  = []
include_dirs  = ['/usr/include/python2.7']
[...]
INFO     Generated module: cda
INFO     Output: /home/sguelton/sources/pythran/cda.so

That's a pretty long trace, but that's what verbose mode is for. The enlightened reader will have noticed that we use distutils under the hood to abstract the compiler calls, which is why we get some funky compiler flags like -g -fwrapv -O2 -Wall -fno-strict-aliasing -g -O2 -fPIC, or even funkier ones like -fstack-protector-strong -Wformat -Werror=format-security -Wl,-z,relro. That's the default for native Python extensions on my distro. Funnily enough, the last ones are hardening flags used to improve the security of the binary, and I wrote a (fascinating) article about them for Quarkslab [0].

It turns out that -g (combined with heavily templated C++) is responsible for the fat binary: if we simply strip the binary, we get back to a decent size:

> strip cda.so
> ls -lh cda.so
-rwxr-xr-x 1 sguelton sguelton 151K Mar 29 18:26 cda.so

As Pythran users generally don't want debug info in the generated native code, we chose to strip it by default, using the linker flag -Wl,-strip-all, which removes all symbol information, including debug symbols.

A Step further: Default Symbol visibility

While we're at it, let's call nm to check if any symbol remains in the binary. After all, the Python interpreter still needs some of them to load the native extension!

> nm -C -D cda.so
[...] skipping > 900 entries
000000000001ed00 u nt2::ext::implement<nt2::tag::rem_pio2_ (boost::dispatch::meta::scalar_<boost::dispatch::meta::double_<double> >, boost::dispatch::meta::scalar_<boost::dispatch::meta::double_<double> >, boost::dispatch::meta::scalar_<boost::dispatch::meta::double_<double> >), boost::dispatch::tag::cpu_, void>::__kernel_rem_pio2(double*, double*, int, int, int, int const*)::PIo2
000000000001edc0 u nt2::ext::implement<nt2::tag::rem_pio2_ (boost::dispatch::meta::scalar_<boost::dispatch::meta::double_<double> >, boost::dispatch::meta::scalar_<boost::dispatch::meta::double_<double> >, boost::dispatch::meta::scalar_<boost::dispatch::meta::double_<double> >), boost::dispatch::tag::cpu_, void>::__ieee754_rem_pio2(double, double*)::two_over_pi
000000000001ed40 u nt2::ext::implement<nt2::tag::rem_pio2_ (boost::dispatch::meta::scalar_<boost::dispatch::meta::double_<double> >, boost::dispatch::meta::scalar_<boost::dispatch::meta::double_<double> >, boost::dispatch::meta::scalar_<boost::dispatch::meta::double_<double> >), boost::dispatch::tag::cpu_, void>::__ieee754_rem_pio2(double, double*)::npio2_hw

I can tell you Python is not using nt2's dispatch mechanism to load native extensions. Again, the default compiler settings are responsible for this noise, and the relevant compiler flag is -fvisibility=hidden, which tells the compiler that only the functions flagged with a special attribute are part of the external ABI; the other ones are not exported. As Python uses a single entry point to load Pythran modules, namely PyInit_cda for Python3 modules and initcda for Python2 modules [1], one can add __attribute__ ((visibility("default"))) on this symbol and it will be the only exported one. This slightly impacts the code size, may decrease loading time and eventually gives the compiler more optimization opportunities, but nothing significant here (131K), apart from the pleasure of generating cleaner binaries. That's also going to be the default in the next Pythran version.

Out of chance: getting faster binaries

In the (huge) info pages of GCC, near the documentation of -fvisibility=hidden, there's this (GCC-only) compiler flag, -fwhole-program, which implements a kind of Link Time Optimization, in the sense that it tells the compiler to consider the current compilation unit as a whole program. As specified in the GCC man page, "All public functions and variables with the exception of "main" and those merged by attribute "externally_visible" become static functions and in effect are optimized more aggressively by interprocedural optimizers.", which basically means that every function is considered static except for "main" and the ones explicitly told not to be. This allows the compiler, for instance, to remove functions that are always inlined, and thus save space. So we flag the initcda function with __attribute__ ((externally_visible)). That sounds a bit redundant with the visibility attribute to me, but it turns out this triggers a bunch of different optimization paths that give us a significantly smaller binary that runs slightly faster:

> pythran cda.py -fvisibility=hidden -fwhole-program -Wl,-strip-all
> ls -lh cda.so
-rwxr-xr-x 1 sguelton sguelton 31K Mar 29 18:52 cda.so*
> python -m timeit -s 'import numpy as np; n = 20000 ; lat, lon = np.random.rand(n), np.random.rand(n); x,y = np.random.rand(), np.random.rand(); from cda import closest_distance_arrays' 'closest_distance_arrays(x,y,lat, lon)'
1000 loops, best of 3: 1.15 msec per loop

All these flags are now the default on Linux.

Playing with the optimization flags too

The default optimization flag is -O2, and that's generally a decent choice. On cda.py, using -O3 does not change much (gcc 4.9):

> pythran cda.py -fvisibility=hidden -fwhole-program -Wl,-strip-all -O3
> python -m timeit [...]
1000 loops, best of 3: 1.14 msec per loop

Asking for code specific to my CPU using -march=native actually gives some improvement:

> pythran cda.py -fvisibility=hidden -fwhole-program -Wl,-strip-all -O3 -march=native
> python -m timeit [...]
1000 loops, best of 3: 1.11 msec per loop

But the best speedup has a price: relaxing standard compliance with -Ofast can be beneficial if you're not using denormalized numbers, infinities and the monstrosity that lies within NaN:

> pythran cda.py -fvisibility=hidden -fwhole-program -Wl,-strip-all -Ofast -march=native
> python -m timeit [...]
1000 loops, best of 3: 1.02 msec per loop
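What -Ofast actually relaxes is, among other things, the order of floating-point operations: the compiler may reassociate them as if they were associative, which IEEE 754 arithmetic is not. A plain Python illustration (not from the original post) of why such reordering can change results:

```python
# Double-precision addition is not associative: summing the small term
# first or last gives different answers. -Ofast lets the compiler pick
# whichever order it likes.
a, b, c = 1e20, 1.0, -1e20
left = (a + b) + c   # 1.0 is absorbed by 1e20 first: result is 0.0
right = (a + c) + b  # the big terms cancel first: result is 1.0
print(left, right)   # 0.0 1.0
```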

If you're really into compiler flag tuning, you can try out -funroll-loops or tune the -finline-limit=N parameter (which actually gets me down to 1ms per loop), but that's going a bit too far :-)

Don't forget Vectorization

Combining -O3 and -march=native triggers compiler auto-vectorization [2], but that did not help much in our case. Indeed, automatic vectorization, as in « I am using the multimedia instruction set of my CPU », is still a difficult task for compilers. Fortunately Pythran helps here: passing the not-so-experimental-anymore-but-still-not-default flag -DUSE_BOOST_SIMD triggers some hard-coded vectorization based on boost.simd [3], and that did help:

> # maximum dose
> python -m pythran.run cda.cpp -fvisibility=hidden -fwhole-program -Wl,-strip-all -Ofast -march=native -funroll-loops -finline-limit=100000000 -DUSE_BOOST_SIMD
> python -m timeit [...]
1000 loops, best of 3: 462 usec per loop

And that one weighs 63 kilobytes :-)
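To see what vectorization buys conceptually (a sketch, not Pythran's actual generated code): processing packs of elements with one instruction, as boost.simd does at the hardware level, is analogous to replacing a scalar loop by a whole-array NumPy expression:

```python
import math
import numpy as np

def arc_scalar(phi1, theta1, phi2s, theta2s):
    # one element at a time: the scalar fallback
    out = np.empty_like(phi2s)
    for i in range(len(phi2s)):
        c = (math.sin(phi1) * math.sin(phi2s[i]) * math.cos(theta1 - theta2s[i])
             + math.cos(phi1) * math.cos(phi2s[i]))
        out[i] = math.acos(c)
    return out

def arc_packed(phi1, theta1, phi2s, theta2s):
    # whole arrays at once: the SIMD analogue, packs of 2/4/8 doubles
    c = (math.sin(phi1) * np.sin(phi2s) * np.cos(theta1 - theta2s)
         + math.cos(phi1) * np.cos(phi2s))
    return np.arccos(c)

rng = np.random.default_rng(0)
phi2s, theta2s = rng.random(100), rng.random(100)
assert np.allclose(arc_scalar(0.5, 0.3, phi2s, theta2s),
                   arc_packed(0.5, 0.3, phi2s, theta2s))
```

Both produce the same values; the packed form simply exposes the data-parallelism that the compiler (or boost.simd) can map onto vector registers.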

Concluding Remarks

Source-to-source compilers do generate ugly intermediate code, and Pythran is no exception. One benefit, though, is that you get full control over the backend compiler, which means you can tune it to your needs. Given some knowledge and benchmarking effort, it can get you closer to your goal without changing the original code.

[0] And I am shamelessly advertising it :-) http://blog.quarkslab.com/clang-hardening-cheat-sheet.html
[1] If you really want to inspect the intermediate C++ code generated by Pythran, use the -E flag and a cda.cpp will be generated.
[2] Only GCC needs this; clang turns on vectorisation at -O2. -march=native allows the compiler to use a more recent instruction set if available.
[3] Thanks NumScale https://www.numscale.com/boost-simd/