Numerical code

If your goal is to write numerical code, an excellent strategy is to start directly with a NumPy implementation. Using NumPy is a safe bet because it is available and tested on many platforms and, as we have seen in the earlier chapters, many other packages treat NumPy arrays as first-class citizens.

When properly written (such as by taking advantage of broadcasting and other techniques we learned in Chapter 2, Pure Python Optimizations), NumPy performance is already quite close to the native performance achievable by C code, and won't require further optimization. That said, certain algorithms are hard to express efficiently using NumPy's data structures and methods. When this happens, two very good options can be, for example, Numba or Cython.
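As a reminder of the broadcasting style mentioned above, here is a minimal, hypothetical sketch (the data and the row-centering task are invented for illustration); the subtraction is applied across all rows in a single vectorized step rather than in a Python loop:

```python
import numpy as np

# Hypothetical example: center each row of a matrix on its mean.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# keepdims=True gives row_means shape (2, 1), so broadcasting aligns
# each mean with its own row.
row_means = data.mean(axis=1, keepdims=True)
centered = data - row_means  # broadcast: shape (2, 3) minus (2, 1)
```

The same computation written with explicit Python loops would be both longer and considerably slower on large arrays.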

Cython is a very mature tool used intensively by many important projects, such as scipy and scikit-learn. Cython's explicit, static type declarations make the code easy to understand, and most Python programmers will have no problem picking up its familiar syntax. Additionally, the absence of "magic", together with good inspection tools, makes it easy for the programmer to predict performance and to make educated guesses about what to change to achieve maximum performance.

Cython, however, has some drawbacks. Cython code needs to be compiled before it can be executed, breaking the convenience of the Python edit-run cycle, and a compatible C compiler must be available on the target platform. This also complicates distribution and deployment, as multiple platforms, architectures, configurations, and compilers need to be tested for every target.

The Numba API, on the other hand, requires only the definition of pure-Python functions, which get compiled on the fly, maintaining the fast Python edit-run cycle. In general, Numba requires an LLVM toolchain to be available on the target platform. Note that, as of version 0.30, there is some limited support for Ahead-Of-Time (AOT) compilation of Numba functions, so that compiled functions can be packaged and deployed without requiring Numba and LLVM installations.
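A minimal sketch of the Numba workflow follows (the function and data are invented for illustration). To keep the sketch runnable where Numba is not installed, it falls back to plain Python; with Numba present, the decorated function is compiled to machine code on its first call:

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Fallback so the sketch still runs without Numba installed:
    # the function simply executes as ordinary Python.
    def jit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap

# nopython=True asks Numba to compile the whole function to native
# code; the source remains a pure-Python function.
@jit(nopython=True)
def array_sum(a):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i]
    return total

values = np.arange(5, dtype=np.float64)  # [0., 1., 2., 3., 4.]
result = array_sum(values)               # compiled on the fly
```

Note how no edit-compile step is needed: re-running the script after changing the function is enough.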

Note that both Numba and Cython are usually available pre-packaged with all of their dependencies (including compilers) on the default channels of the conda package manager. Therefore, deployment of both tools can be greatly simplified on the platforms where the conda package manager is available.

What if Cython and Numba are still not enough? While this is generally not required, an additional strategy would be to implement a pure C module (which can be further optimized using compiler flags or hand-tuning) and use it from a Python module using either the cffi package (https://cffi.readthedocs.io/en/latest/) or Cython. 
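To illustrate what calling C code from Python looks like, here is a small sketch using the standard library's ctypes module rather than cffi (a deliberate substitution, since ctypes requires no extra installation); it loads the system math library and calls its sqrt routine, and a hand-written C module would be loaded and declared in the same way:

```python
import ctypes
import ctypes.util

# Locate and load the system C math library (libm). A custom,
# hand-optimized C module compiled as a shared library would be
# loaded the same way, by passing its path to CDLL.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double sqrt(double)
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

result = libm.sqrt(9.0)
```

cffi offers a similar but richer interface, including the ability to parse C declarations directly from header-style source.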

Using NumPy, Numba, and Cython is a very effective strategy to obtain near-optimal performance on serial code. For many applications, serial code is certainly enough and, even if the ultimate plan is to have a parallel algorithm, it's still well worth working on a serial reference implementation, both for debugging purposes and because a serial implementation is likely to be faster on small datasets.

Parallel implementations vary considerably in complexity depending on the particular application. In many cases, the program can be easily expressed as a series of independent calculations followed by some sort of aggregation and is parallelizable using simple process-based interfaces, such as multiprocessing.Pool or ProcessPoolExecutor, which have the advantage of being able to execute generic Python code in parallel without much trouble.
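A minimal sketch of this pattern using multiprocessing.Pool follows (the square function and inputs are invented for illustration): each input is an independent calculation handled by a worker process, and map gathers the results as the aggregation step:

```python
from multiprocessing import Pool

def square(x):
    # Independent calculation: each input is processed in a worker
    # process, with no shared state between workers.
    return x * x

def parallel_squares(values, workers=2):
    # Pool.map distributes the inputs across the workers and collects
    # the results in order -- the aggregation step.
    with Pool(processes=workers) as pool:
        return pool.map(square, values)

if __name__ == "__main__":
    print(parallel_squares([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

concurrent.futures.ProcessPoolExecutor offers an equivalent interface with submit and map methods that return futures.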

To avoid the time and memory overhead of starting multiple processes, one can use threads. NumPy functions typically release the GIL and are good candidates for thread-based parallelization. Additionally, Cython and Numba provide special nogil statements as well as automatic parallelization, which makes them suitable for simple, lightweight parallelization.
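As a hypothetical sketch of this idea (the arrays and the reduction are invented for illustration), a thread pool can run several NumPy reductions concurrently; because NumPy releases the GIL inside its compiled routines, the threads are not serialized by the interpreter:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Four large arrays filled with the constants 0, 1, 2, 3.
arrays = [np.full(100_000, fill_value=i, dtype=np.float64)
          for i in range(4)]

# np.sum releases the GIL while it runs, so the threads can execute
# the reductions concurrently instead of taking turns.
with ThreadPoolExecutor(max_workers=4) as executor:
    sums = list(executor.map(np.sum, arrays))
```

For pure-Python workloads, by contrast, threads offer little speedup, since the GIL allows only one thread to execute bytecode at a time.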

For more complex use cases, you may have to change the algorithm significantly. In those cases, Dask arrays, which provide an almost drop-in replacement for standard NumPy arrays, are a decent option. Dask has the further advantage of operating transparently and being easy to tweak.

Specialized applications that make intensive use of linear algebra routines (such as deep learning and computer graphics) may benefit from packages such as Theano and TensorFlow, which perform automatic parallelization with built-in GPU support and are capable of very high performance.

Finally, mpi4py can be used to deploy parallel Python scripts on MPI-based supercomputers (commonly available to researchers at universities).
