Parallel Cython with OpenMP

Cython provides a convenient interface to perform shared-memory parallel processing through OpenMP. This lets you write extremely efficient parallel code directly in Cython without having to create a C wrapper.

OpenMP is a specification and an API for writing multithreaded, parallel programs. The OpenMP specification defines a series of compiler directives (written as #pragma lines in C and C++) to manage threads, and provides work sharing, load balancing, and synchronization features. Several C/C++ and Fortran compilers (including GCC) implement the OpenMP API.

We can introduce the Cython parallel features with a small example. Cython provides a simple API based on OpenMP in the cython.parallel module. The simplest way to achieve parallelism is through prange, a construct that automatically distributes the iterations of a loop across multiple threads.
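To make the idea of distributing iterations concrete, the following plain-Python sketch splits an index range into contiguous, nearly equal chunks, one per thread. The function name static_chunks is hypothetical, and the even split is an assumption that only resembles a static schedule; the actual assignment performed by prange depends on the OpenMP runtime and the schedule in use.

```python
def static_chunks(n_iterations, n_threads):
    """Split range(n_iterations) into contiguous, near-equal chunks.

    Illustrative only: a real OpenMP runtime decides the assignment itself.
    """
    base, extra = divmod(n_iterations, n_threads)
    chunks = []
    start = 0
    for thread_id in range(n_threads):
        # The first `extra` threads take one additional iteration each.
        stop = start + base + (1 if thread_id < extra else 0)
        chunks.append(range(start, stop))
        start = stop
    return chunks


for thread_id, chunk in enumerate(static_chunks(10, 4)):
    print(thread_id, list(chunk))
```

With 10 iterations and 4 threads, the first two threads receive three iterations each and the last two receive two each.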

First of all, we can write the serial version of a program that computes the square of each element of a NumPy array in the hello_parallel.pyx file. We define a function, square_serial, that takes a buffer as input and populates an output array with the squares of the input array elements; square_serial is shown in the following code snippet:

    import numpy as np

    def square_serial(double[:] inp):
        cdef int i, size
        cdef double[:] out
        size = inp.shape[0]
        out_np = np.empty(size, 'double')
        out = out_np

        for i in range(size):
            out[i] = inp[i]*inp[i]

        return out_np

Implementing a parallel version of the loop over the array elements involves substituting the range call with prange. There is a caveat: to use prange, the body of the loop must be interpreter-free. As already explained, threads can only run in parallel once the GIL is released and, since interpreter calls generally acquire the GIL, they must be avoided inside the loop.

In Cython, you can release the GIL using the nogil context, as follows:

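    from cython.parallel import prange  # at the top of the module

    # inside the function, replacing the serial loop
    with nogil:
        for i in prange(size):
            out[i] = inp[i]*inp[i]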

Alternatively, you can pass the option nogil=True to prange, which automatically wraps the loop in a nogil block:

    for i in prange(size, nogil=True):
        out[i] = inp[i]*inp[i]
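Because each iteration writes only to out[i], the parallel loop should produce the same array as the serial one. A small check along the following lines could confirm that once the module is built. The helper squares_match and the stand-in square_python are hypothetical names introduced here so that the sketch runs without the compiled extension.

```python
import numpy as np


def squares_match(square_func, size=1000, seed=0):
    """Return True if square_func(inp) equals inp * inp element-wise."""
    rng = np.random.default_rng(seed)
    inp = rng.random(size)  # float64, compatible with a double[:] argument
    out = np.asarray(square_func(inp))
    return out.shape == inp.shape and np.allclose(out, inp * inp)


def square_python(inp):
    """Pure-Python stand-in for the compiled function (assumption)."""
    return np.array([x * x for x in inp])


print(squares_match(square_python))
```

After compilation, the same helper could be called as squares_match(hello_parallel.square_serial), and likewise with whatever name you give the prange version.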

Attempts to call Python code in a prange block will produce a compilation error. Prohibited operations include Python function calls, object initialization, and so on. To enable such operations in a prange block (you may want to do so for debugging purposes), you have to re-acquire the GIL using the with gil statement:

    for i in prange(size, nogil=True):
        out[i] = inp[i]*inp[i]
        with gil:
            x = 0 # Python assignment

We can now test our code by compiling it as a Python extension module. To enable OpenMP support, the setup.py file must pass the option -fopenmp (the flag used by GCC) to both the compiler and the linker. This can be achieved by creating an instance of the distutils.extension.Extension class and passing it to cythonize. On Python 3.12 and later, where distutils has been removed, setup and Extension can be imported from setuptools instead. The complete setup.py file is as follows:

    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Build import cythonize

    hello_parallel = Extension('hello_parallel',
                               ['hello_parallel.pyx'],
                               extra_compile_args=['-fopenmp'],
                               extra_link_args=['-fopenmp'])

    setup(
        name='Hello',
        ext_modules = cythonize(['cevolve.pyx', hello_parallel]),
    )
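The -fopenmp flag is specific to GCC-style compilers. As a hedged sketch, a setup script could choose the flags based on the platform. The helper name openmp_flags is hypothetical, and the code assumes that Windows builds use MSVC (whose flag is /openmp) and that every other platform uses GCC or a compatible compiler; it does not handle the extra setup that Apple's default clang on macOS typically requires.

```python
import sys


def openmp_flags(platform=sys.platform):
    """Return (compile_args, link_args) for the assumed default compiler."""
    if platform.startswith("win"):
        # Assumption: MSVC toolchain, which takes /openmp and no link flag.
        return ["/openmp"], []
    # Assumption: GCC or a compatible compiler built with OpenMP support.
    return ["-fopenmp"], ["-fopenmp"]


compile_args, link_args = openmp_flags()
print(compile_args, link_args)
```

The two returned lists could then be passed as extra_compile_args and extra_link_args when constructing the Extension.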

Using prange, we can easily parallelize the Cython version of our ParticleSimulator. The following code contains the c_evolve function of the cevolve.pyx Cython module that was written in Chapter 4, C Performance with Cython:

    def c_evolve(double[:, :] r_i, double[:] ang_speed_i,
                 double timestep, int nsteps):

        # cdef declarations

        for i in range(nsteps):
            for j in range(nparticles):
                # loop body

First, we will invert the order of the loops so that the outermost loop, whose iterations are independent of one another, is the one executed in parallel. Since the particles don't interact with each other, we can change the order of iteration safely, as shown in the following snippet:

    for j in range(nparticles):
        for i in range(nsteps):
            # loop body
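The claim that the loops can be swapped safely can be checked with a small pure-Python experiment. The update rule in step below is a hypothetical stand-in, not the loop body from cevolve.pyx; the only property that matters is that each particle's update reads and writes that particle's own state.

```python
import numpy as np


def step(position, speed, timestep):
    """Hypothetical per-particle update (not the book's loop body)."""
    return position + speed * timestep


def steps_outer(positions, speeds, timestep, nsteps):
    out = positions.copy()
    for i in range(nsteps):          # time steps outermost
        for j in range(len(out)):
            out[j] = step(out[j], speeds[j], timestep)
    return out


def particles_outer(positions, speeds, timestep, nsteps):
    out = positions.copy()
    for j in range(len(out)):        # particles outermost
        for i in range(nsteps):
            out[j] = step(out[j], speeds[j], timestep)
    return out


positions = np.array([0.0, 1.0, 2.0])
speeds = np.array([1.0, -0.5, 0.25])
a = steps_outer(positions, speeds, 0.1, 100)
b = particles_outer(positions, speeds, 0.1, 100)
print(np.array_equal(a, b))
```

Both orderings apply the same sequence of operations to each particle, so the results are identical. If the update for one particle read another particle's position, the swap would no longer be valid.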

Next, we will replace the range call of the outer loop with prange and remove calls that acquire the GIL. Since our code was already enhanced with static types, the nogil option can be applied safely as follows:

    for j in prange(nparticles, nogil=True):

We can now compare the functions by wrapping them in the benchmark function to assess any performance improvement:

    In [3]: %timeit benchmark(10000, 'openmp') # Running on 4 processors
    1 loops, best of 3: 599 ms per loop
    In [4]: %timeit benchmark(10000, 'cython')
    1 loops, best of 3: 1.35 s per loop

Interestingly, the parallel version written with prange runs in 599 ms against 1.35 s for the serial one, a speedup of roughly 2.25x on 4 processors.
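The speedup, and how far it falls short of the ideal for 4 processors, can be worked out directly from the two timings above. The function name speedup_and_efficiency is introduced here only for illustration.

```python
def speedup_and_efficiency(serial_s, parallel_s, n_workers):
    """Return (speedup, parallel efficiency) from two wall-clock timings."""
    speedup = serial_s / parallel_s
    return speedup, speedup / n_workers


# Timings taken from the %timeit session: 1.35 s serial, 599 ms parallel.
speedup, efficiency = speedup_and_efficiency(1.35, 0.599, 4)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
# speedup: 2.25x, efficiency: 56%
```

An efficiency well below 100% means that the four processors are not fully used by the parallel loop.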
