Pythran stories

Testing Pythran on random kernels

This blog post was originally a Jupyter Notebook. You can download it if you want. The conversion was done using nbconvert and a custom template to match the style of the other parts of the blog.

Every now and then, I hang around on stackoverflow, looking for numerical kernels to run through pythran. Here is the result of my recent wanderings :-)

>>> # It's import(ant)
>>> import pythran, numpy as np
>>> %load_ext pythran.magic

From stackoverflow

Euclidean distance

from https://stackoverflow.com/questions/50658884/why-this-numba-code-is-6x-slower-than-numpy-code . This kernel is interesting because it uses np.newaxis, np.sum along an axis, and a matrix against transposed matrix dot product.

>>> import numpy as np
>>> def euclidean_distance_square(x1, x2):
...     return -2*np.dot(x1, x2.T) + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1) + np.sum(np.square(x2), axis=1)
>>> %%pythran
>>> #pythran export pythran_euclidean_distance_square(float64[1,:], float64[:,:])
>>> import numpy as np
>>> def pythran_euclidean_distance_square(x1, x2):
...     return -2*np.dot(x1, x2.T) + np.sum(np.square(x1), axis=1)[:, np.newaxis] + np.sum(np.square(x2), axis=1)
>>> import numpy as np
>>> x1 = np.random.random((1, 512))
>>> x2 = np.random.random((10000, 512))
>>> %timeit euclidean_distance_square(x1, x2)
>>> %timeit pythran_euclidean_distance_square(x1, x2)
16.1 ms ± 905 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
11.1 ms ± 76.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
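
As a quick sanity check, both versions agree on the result:

>>> assert np.allclose(euclidean_distance_square(x1, x2),
...                    pythran_euclidean_distance_square(x1, x2))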

As a side note, at some point in history, pythran failed to match the np.dot(x1, x2.T) pattern, but it now calls the appropriate BLAS routine (cblas_dgemm here, given the float64 arguments) with the correct arguments, without any copy.
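
For the curious, here is a sketch of the same no-copy trick through scipy's BLAS bindings; pythran targets the C API directly, but the trans_b flag plays the same role in both cases:

>>> from scipy.linalg.blas import dgemm
>>> # trans_b=True makes gemm read x2 as if transposed, so x2.T is never materialized
>>> d = dgemm(1.0, x1, x2, trans_b=True)  # same result as np.dot(x1, x2.T)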

Updated centers

from https://stackoverflow.com/questions/50931002/cython-numpy-array-manipulation-slower-than-python/50964759. This is a funny kernel because of its use of a list comprehension.

>>> import numpy as np
>>> def updated_centers(point, start, center):
...     return np.array([__cluster_mean(point[start[c]:start[c + 1]], center[c]) for c in range(center.shape[0])])
... 
>>> def __cluster_mean(point, center):
...     return (np.sum(point, axis=0) + center) / (point.shape[0] + 1)
>>> %%pythran
>>> #pythran export pythran_updated_centers(float64[:,:], intc[:], float64[:,:])
>>> import numpy as np
>>> def pythran_updated_centers(point, start, center):
...     return np.array([__cluster_mean(point[start[c]:start[c + 1]], center[c]) for c in range(center.shape[0])])
... 
>>> def __cluster_mean(point, center):
...     return (np.sum(point, axis=0) + center) / (point.shape[0] + 1)
>>> import numpy as np
>>> n, m = 100000, 5
>>> k = n//2
>>> point = np.random.rand(n, m)
>>> start = 2*np.arange(k+1, dtype=np.int32)
>>> center = np.random.rand(k, m)
>>> %timeit updated_centers(point, start, center)
>>> %timeit pythran_updated_centers(point, start, center)
295 ms ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
11.9 ms ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's a cool speedup, but it's expected: the kernel mixes an explicit loop with array indexing, and that's not where numpy shines.
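
To see why, here is a sketch of the same computation with the comprehension unrolled into an explicit loop; this loop-plus-indexing shape is exactly what pythran turns into tight native code:

>>> def updated_centers_loop(point, start, center):
...     out = np.empty_like(center)
...     for c in range(center.shape[0]):
...         chunk = point[start[c]:start[c + 1]]  # the points assigned to cluster c
...         out[c] = (np.sum(chunk, axis=0) + center[c]) / (chunk.shape[0] + 1)
...     return out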

Gaussian Process

from https://stackoverflow.com/questions/46334298/kernel-function-in-gaussian-processes, a very high level kernel. The pythran version uses indexing through np.newaxis instead of reshaping, and generates a specialized version for arguments whose last dimension is known to be one. There are two versions of the pythran kernel, compiled with different flags to showcase the effect of vectorization.

>>> import numpy as np
>>> def gp(a, b, gamma=0.1):
...     """ GP squared exponential kernel """
...     sq_dist = np.sum(a**2, 1).reshape(-1, 1) + np.sum(b**2, 1) - 2*np.dot(a, b.T)
...     return np.exp(-0.5 * (1 / gamma) * sq_dist)
>>> %%pythran
>>> #pythran export pythran_gp_novect(float64[:,1], float64[:,1])
>>> import numpy as np
>>> def pythran_gp_novect(a, b, gamma=0.1):
...     """ GP squared exponential kernel """
...     sq_dist = np.sum(a**2, 1)[:, np.newaxis] + np.sum(b**2, 1) - 2*np.dot(a, b.T)
...     return np.exp(-0.5 * (1 / gamma) * sq_dist)
>>> %%pythran -DUSE_BOOST_SIMD -march=native
>>> #pythran export pythran_gp_vect(float64[:,1], float64[:,1])
>>> import numpy as np
>>> def pythran_gp_vect(a, b, gamma=0.1):
...     """ GP squared exponential kernel """
...     sq_dist = np.sum(a**2, 1)[:, np.newaxis] + np.sum(b**2, 1) - 2*np.dot(a, b.T)
...     return np.exp(-0.5 * (1 / gamma) * sq_dist)
>>> import numpy as np
>>> n = 300
>>> X = np.linspace(-5, 5, n).reshape(-1, 1)
>>> %timeit gp(X, X)
>>> %timeit pythran_gp_novect(X, X)
>>> %timeit pythran_gp_vect(X, X)
1.51 ms ± 6.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.21 ms ± 6.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
348 µs ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Not that much of a gain without vectorization enabled, but Pythran can still squeeze some extra performance out of a very high level kernel. Unleashing vectorization is plain awesome here :-)
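
For the record, the same flags can be passed to the pythran command line when compiling a standalone module (kernel.py standing for whatever file holds the exported function):

>>> !pythran -DUSE_BOOST_SIMD -march=native kernel.py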

Image processing

from https://stackoverflow.com/questions/45714178/python-improving-image-processing-with-numpy, that's the kind of kernel where an explicit loop seems a natural fit, and where pythran shines.

>>> import numpy as np
>>> def image_processing(A, B, sum_arr): # Proposed approach
...     B_ext = np.concatenate((B[1:], B))
...     n = len(A)
...     for i in range(n-1,-1,-1):
...         A *= B_ext[i:i+n] #roll B with i-increment and multiply
...         A[n-1-i] += sum_arr #add sum to A at index
...     return A
>>> %%pythran
>>> import numpy as np
>>> #pythran export pythran_image_processing(int64[], int64[], int64)
>>> def pythran_image_processing(A, B, sum_arr): # Proposed approach
...     B_ext = np.concatenate((B[1:], B))
...     n = len(A)
...     for i in range(n-1,-1,-1):
...         A *= B_ext[i:i+n] #roll B with i-increment and multiply
...         A[n-1-i] += sum_arr #add sum to A at index
...     return A
>>> N = 10000
>>> A = np.random.randint(0,255,(N))
>>> B = np.random.randint(0,255,(N))
>>> A_copy = A.copy()
>>> sum_arr = int(np.sum(B))
>>> %timeit image_processing(A, B, sum_arr)
>>> %timeit pythran_image_processing(A_copy, B, sum_arr)
60 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
51.7 ms ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
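
Note that both kernels update A in place, which is why the timings above use A_copy. As a quick check on fresh copies, both compute the same thing:

>>> A1 = np.random.randint(0, 255, N)
>>> A2 = A1.copy()
>>> assert np.array_equal(image_processing(A1, B, sum_arr),
...                       pythran_image_processing(A2, B, sum_arr))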

Lorenz Attractor

from https://gist.github.com/dean-shaff/d1d0cdabf79e225ab96918b73916289f, yet another kernel with loops, but sometimes that's the way the problem is naturally expressed. Note that Pythran does not support *args yet, hence the explicit parameters in the pythran version :-/

>>> import numpy as np
>>> def rungekuttastep(h,y,fprime,*args):
...     k1 = h*fprime(y,*args)
...     k2 = h*fprime(y + 0.5*k1,*args)
...     k3 = h*fprime(y + 0.5*k2,*args)
...     k4 = h*fprime(y + k3,*args)
...     y_np1 = y + (1./6.)*k1 + (1./3.)*k2 + (1./3.)*k3 + (1./6.)*k4
...     return y_np1
... 
>>> def fprime_lorenz_numpy(y,*args):
...     sigma, rho, beta = args
...     yprime = np.zeros(y.shape[0])
...     yprime[0] = sigma*(y[1] - y[0])
...     yprime[1] = y[0]*(rho - y[2]) - y[1]
...     yprime[2] = y[0]*y[1] - beta*y[2]
...     return yprime
... 
>>> def attractor(n_iter, sigma, rho, beta):
...     y = np.arange(3)
...     for i in np.arange(n_iter):
...         y = rungekuttastep(0.001,y,fprime_lorenz_numpy,sigma, rho, beta)
...     return y
... 
>>> %%pythran
>>> import numpy as np
>>> def rungekuttastep(h,y,fprime,sigma, rho, beta):
...     k1 = h*fprime(y,sigma, rho, beta)
...     k2 = h*fprime(y + 0.5*k1,sigma, rho, beta)
...     k3 = h*fprime(y + 0.5*k2,sigma, rho, beta)
...     k4 = h*fprime(y + k3,sigma, rho, beta)
...     y_np1 = y + (1./6.)*k1 + (1./3.)*k2 + (1./3.)*k3 + (1./6.)*k4
...     return y_np1
... 
>>> def fprime_lorenz_numpy(y,sigma, rho, beta):
...     yprime = np.zeros(y.shape[0])
...     yprime[0] = sigma*(y[1] - y[0])
...     yprime[1] = y[0]*(rho - y[2]) - y[1]
...     yprime[2] = y[0]*y[1] - beta*y[2]
...     return yprime
... 
>>> #pythran export pythran_attractor(int, float, float, float)
>>> def pythran_attractor(n_iter, sigma, rho, beta):
...     y = np.arange(3)
...     for i in np.arange(n_iter):
...         y = rungekuttastep(0.001,y,fprime_lorenz_numpy,sigma, rho, beta)
...     return y
>>> %timeit attractor(1000, 10.,28.,8./3.)
>>> %timeit pythran_attractor(1000, 10.,28.,8./3.)
17 ms ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
682 µs ± 8.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Again, that's a lot of non-vectorized operations, not the best fit for numpy, but that's fine for pythran. There's also a function passed as a parameter to another function, but pythran can cope with that.
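
As a last sanity check, both versions land on the same point of the attractor:

>>> assert np.allclose(attractor(1000, 10., 28., 8./3.),
...                    pythran_attractor(1000, 10., 28., 8./3.))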