GSoC’21 Improve performance through the use of Pythran
Project Overview
Many algorithms in SciPy use Cython to speed up code that would be too slow as pure Python, e.g. algorithms in scipy.spatial, scipy.stats and scipy.optimize. Recently, SciPy added experimental support for Pythran to make it easier to accelerate Python code. Compared with Cython, Pythran code is usually more readable because it stays plain Python, and in some cases it is even faster. SciPy also uses Airspeed Velocity for performance benchmarking. Therefore, our project includes:
- Writing benchmarks for the algorithms in SciPy
- Accelerating SciPy algorithms with Pythran
- Finding and solving potential issues in Pythran
My full proposal can be accessed here.
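To give a feel for what "accelerating with Pythran" looks like, here is a minimal sketch. The function `pairwise_abs_sum` is a made-up example and not one of the SciPy functions from this project. The only Pythran-specific part is the `#pythran export` comment that declares the argument types; the body is ordinary Python, so the same file also runs unmodified in the interpreter.

```python
import numpy as np

# The export comment tells Pythran which signature to compile.
# Without compilation, the comment is ignored and the function runs as plain Python.
#pythran export pairwise_abs_sum(float64[:])
def pairwise_abs_sum(x):
    """Sum of |x[i] - x[j]| over all pairs i < j (hypothetical example kernel)."""
    n = x.shape[0]
    total = 0.0
    # Explicit nested loops are slow in pure Python but compile to tight native loops.
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += abs(x[i] - x[j])
    return total


# Usage. In a real Pythran module this part would live in a separate file,
# since the compiled module should only contain the exported functions.
x = np.array([1.0, 4.0, 6.0])
result = pairwise_abs_sum(x)
print(result)  # 3 + 5 + 2 = 10.0
```

Kernels with obvious loops like this one are the kind of code where I would expect the largest speedup.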
What I have done
Pull Requests
SciPy
In SciPy, I mainly worked on writing benchmarks to measure the performance of algorithms and on using Pythran to accelerate those algorithms. I also looked into open issues from time to time and helped fix them.
- BENCH: add benchmark for f_oneway
- BENCH: add benchmark for energy_distance and wasserstein_distance
- BENCH: add more benchmarks for inferential statistics tests
- MAINT: Modify to use new random API in benchmarks: Most of the current benchmarks use np.random.seed(), but it is recommended to use np.random.default_rng() instead.
- BENCH: add benchmark for somersd
- ENH: use Pythran to speedup somersd and _tau_b
- DOC: clarify meaning of rvalue in stats.linregress : helped fix a bug and review the PR.
- BUG: fix stats.binned_statistic_dd issue with values close to bin edge : helped fix a bug.
- ENH: Pythran implementation of _compute_prob_outside_square and _compute_prob_inside_method to speedup stats.ks_2samp
- ENH: improved binned_statistic_dd via Pythran
- ENH: improve cspline1d, qspline1d, and relative funcs via Pythran
- ENH: improve siegelslopes via pythran
- ENH: Pythran implementation of _cdf_distance : The Pythran version is slightly faster than the Python one after fixing np.searchsorted(). When SciPy begins using SIMD in the future, it may become faster still, so this PR is currently on hold.
- WIP: ENH: improve _count_paths_outside_method via pythran : This PR got stuck on a Mac-specific error and we haven’t found out why.
- WIP: ENH: improve sort_vertices_of_regions via Pythran and made it more readable : Two tests currently fail because (1) we can’t do an in-place sort with Pythran and (2) the input type changes in the Pythran function.
- ENH: improve _sosfilt_float via Pythran : _sosfilt_float is already implemented in Cython. We were considering replacing it but found that Pythran's performance is not much better than Cython's, and Pythran does not support the object type, so we decided not to merge it.
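The benchmark PRs above all follow the Airspeed Velocity convention of a class with a `setup` method and one or more `time_*` methods. Below is a rough sketch of that shape using the new random API. The class name `SearchSortedBench`, the seed and the sizes are invented for illustration; the real benchmarks time SciPy functions such as somersd rather than np.searchsorted.

```python
import numpy as np


class SearchSortedBench:
    """Hypothetical asv-style benchmark: asv runs setup() and then times each time_* method."""

    # asv runs the benchmark once per parameter value.
    params = [100, 10_000]
    param_names = ["n"]

    def setup(self, n):
        # Old style: np.random.seed(12345); np.random.random(n)
        # New style: a local Generator, so no global state is touched.
        rng = np.random.default_rng(12345)
        self.sorted_values = np.sort(rng.random(n))
        self.queries = rng.random(n)

    def time_searchsorted(self, n):
        # Only the body of this method is timed.
        np.searchsorted(self.sorted_values, self.queries)


# Running the class by hand, outside asv, just to show it works.
bench = SearchSortedBench()
bench.setup(100)
bench.time_searchsorted(100)
```

Because the generator is created inside `setup`, each benchmark gets the same data on every run without depending on what other benchmarks did to the global seed.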
Pythran
When using Pythran to improve SciPy algorithms, I found that some important features were not supported or gave wrong results in Pythran, e.g. boolean arguments such as keepdims were not supported because the return type changes depending on the value of keepdims (True or False). Therefore, I implemented general support for such cases.
- Import test cases from scipy : Import the Pythran functions in SciPy as test cases in Pythran
- Feature/add keep dims : support keepdims argument in np.mean() in Pythran
- Support boolean arguments in numpy unique
- General implementation of supporting immediate arguments: Generalize the above two solutions to support immediate arguments.
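The reason these boolean arguments are awkward for a compiler is easy to see in plain NumPy: the flag changes the type of the result, not just its value. The small sketch below only uses NumPy itself and does not show Pythran's internals; the array contents are arbitrary.

```python
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)

# keepdims changes the number of dimensions of the result.
reduced = np.mean(a, axis=1)
kept = np.mean(a, axis=1, keepdims=True)
print(reduced.shape)  # (2,)
print(kept.shape)     # (2, 1)

# In np.unique, a boolean flag switches between an array and a tuple of arrays.
values = np.array([3, 1, 3, 2, 1])
only_unique = np.unique(values)
with_counts = np.unique(values, return_counts=True)
print(type(only_unique).__name__)  # ndarray
print(type(with_counts).__name__)  # tuple
```

A statically typed backend has to know the return type at compile time, so the value of the flag must be known at compile time as well. As I understand it, that is what treating such arguments as "immediate" arguments provides.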
Issues
In addition to the issues mentioned above, I ran into more problems while using Pythran and reported them in the Pythran issue tracker. My mentors often helped solve them, and I then tested whether the fixes worked.
- Pythran makes np.searchsorted much slower
- Pythran may make a function slower?
- u_values[u_sorter].searchsort would cause "Function path is chained attributes and name" but np.search would not
- all_values.sort() would cause compilation error but np.sort(all_values) would not
- u_values[u_sorter].searchsort would cause "Function path is chained attributes and name" but np.searchsort would not
- Support scipy.special.binom?
- Got AttributeError: module 'scipy' has no attribute 'special' when building scipy with special import
- Got compilation error when the inner variable type changes
- Can't index an 2d array like a1[int, tuple]
- keep_dims is not supported in np.mean()
- can't use np.expand_dims with specified keyword argument
- bus error on Mac but works fine on Linux for _count_paths_outside_method pythran version
- array assignment res[cond1] = ax[cond1] works fine for int[] or float[] or float[:,:] but not int[:,:]
Work Left
As the project proceeded, I found it difficult to find suitable algorithms to implement with Pythran. A suitable algorithm should meet at least three requirements:
- It is currently slow.
- It does not rely on features that Pythran doesn't support, e.g. class types or imported SciPy modules.
- It has obvious loops so that the speedup would be large.
I looked through almost all the algorithms but found few that qualify. Moreover, in our experience with Pythran so far, we often ran into things that are easy to get wrong, such as passing arrays that are views as input to a Pythranized function, or using different dtypes. Therefore, we need better testing, and we decided to change the plan and write better testing infrastructure for Pythran extensions: WIP: TST: add tests for Pythran somersd
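A rough sketch of the kind of check such testing infrastructure could run is shown below. It is not the code from the PR: `kernel` is a pure-Python stand-in for a compiled function, and the helper `make_inputs` and the chosen dtypes are assumptions made for illustration. The idea is to feed the same data as a contiguous array and as several views, for several dtypes, and compare against a NumPy reference.

```python
import numpy as np


def kernel(x):
    """Stand-in for a Pythran-compiled function (hypothetical)."""
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i]
    return total


def make_inputs(dtype):
    """Build a contiguous array and several non-contiguous views of the given dtype."""
    base = np.arange(12, dtype=dtype)
    return {
        "contiguous": base.copy(),
        "strided_view": base[::2],
        "reversed_view": base[::-1],
        "column_of_2d": base.reshape(3, 4)[:, 1],
    }


checked = 0
for dtype in (np.int32, np.int64, np.float32, np.float64):
    for name, arr in make_inputs(dtype).items():
        # Reference result computed by NumPy on a contiguous copy.
        expected = float(np.sum(np.ascontiguousarray(arr), dtype=np.float64))
        got = kernel(arr)
        assert np.isclose(got, expected), (dtype, name)
        checked += 1

print(checked)  # 4 dtypes x 4 layouts = 16
```

With a real compiled extension, a dtype or layout that is missing from the export signature would typically show up here as an error rather than a wrong number, which is exactly what we want the tests to catch early.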
Project Experience
It has been a great experience working on this project in GSoC'21. My mentors are really friendly and responsive, and the community is always willing to help.
Special thanks to my mentors, Ralf and Serge, who provided immense support for me to get through the difficulties. I’m very fortunate to get the chance to dive into and contribute to SciPy and Pythran this summer, especially with such awesome mentors. I have learnt a lot, both intellectually and spiritually. I would love to continue contributing to SciPy and Pythran in the future :)
Thanks to Google Summer of Code and the Python Software Foundation!