Monte Carlo approximation of pi

As an example, we will implement a canonical, embarrassingly parallel program--the Monte Carlo approximation of pi. Imagine that we have a square with sides of 2 units; its area will be 4 units. Now, we inscribe a circle of 1 unit radius in this square; the area of the circle will be pi * r^2. Substituting the value of r, we get that the numerical value of the area of the circle is pi * (1)^2 = pi. You can refer to the following figure for a graphical representation.

If we shoot a lot of random points at this figure, some points will fall into the circle--we'll call those hits--while the remaining points, the misses, will be outside the circle. The number of hits will be proportional to the area of the circle, while the total number of shots will be proportional to the area of the square. Therefore, the ratio of hits to total shots approximates the ratio of the two areas, pi/4, and solving for pi gives us our estimate:

    hits/total = area_circle/area_square = pi/4
    pi = 4 * hits/total
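
For example, if 785,398 out of 1,000,000 shots land inside the circle, our estimate is pi ≈ 4 * 785398/1000000 = 3.141592.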

The strategy we will employ in our program will be as follows:

  • Generate a large number of uniformly distributed random (x, y) points with both coordinates in the range (-1, 1)
  • Count as a hit every point that lies inside the circle, that is, every point satisfying x**2 + y**2 <= 1

The first step when writing a parallel program is to write a serial version and verify that it works. In a real-world scenario, you also want to leave parallelization as the last step of your optimization process: first, because you need to identify the slow parts, and second, because parallelization is time-consuming and gives you, at most, a speedup equal to the number of processors. The implementation of the serial program is as follows:

    import random

    samples = 1000000
    hits = 0

    for i in range(samples):
        x = random.uniform(-1.0, 1.0)
        y = random.uniform(-1.0, 1.0)

        if x**2 + y**2 <= 1:
            hits += 1

    pi = 4.0 * hits/samples

The accuracy of our approximation improves as we increase the number of samples; the statistical error of a Monte Carlo estimate shrinks roughly as the inverse square root of the number of samples. Note that each loop iteration is independent of the others--this problem is embarrassingly parallel.

To parallelize this code, we can write a function, called sample, that corresponds to a single hit-miss check. If the sample hits the circle, the function will return 1; otherwise, it will return 0. By running sample multiple times and summing the results, we'll get the total number of hits. We can run sample over multiple processors with apply_async and get the results in the following way:

    import random
    import multiprocessing

    def sample():
        x = random.uniform(-1.0, 1.0)
        y = random.uniform(-1.0, 1.0)

        if x**2 + y**2 <= 1:
            return 1
        else:
            return 0

    pool = multiprocessing.Pool()
    results_async = [pool.apply_async(sample) for i in range(samples)]
    hits = sum(r.get() for r in results_async)

We can wrap the two versions in the pi_serial and pi_apply_async functions (you can find their implementation in the pi.py file).
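
As a rough sketch (the actual code in pi.py may differ in detail), the two wrappers could look like this:

    import random
    import multiprocessing

    samples = 1000000

    def sample():
        # single hit-miss check: 1 if the random point falls inside the circle
        x = random.uniform(-1.0, 1.0)
        y = random.uniform(-1.0, 1.0)
        return 1 if x**2 + y**2 <= 1 else 0

    def pi_serial():
        # serial version: loop over all samples in a single process
        hits = sum(sample() for i in range(samples))
        return 4.0 * hits/samples

    def pi_apply_async():
        # parallel version: submit one task per sample (intentionally naive)
        pool = multiprocessing.Pool()
        results_async = [pool.apply_async(sample) for i in range(samples)]
        hits = sum(r.get() for r in results_async)
        return 4.0 * hits/samples

We can now benchmark the execution speed of the two versions, as follows: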

$ time python -c 'import pi; pi.pi_serial()'
real 0m0.734s
user 0m0.731s
sys 0m0.004s
$ time python -c 'import pi; pi.pi_apply_async()'
real 1m36.989s
user 1m55.984s
sys 0m50.386s

As shown in the preceding benchmark, our first parallel version is actually far slower than the serial one. The reason is that the time spent doing the actual calculation is tiny compared to the overhead required to send and distribute the tasks to the workers and to collect the results.
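
To get a feel for the size of this per-task overhead, we can time the dispatch of tasks that do essentially nothing; any time measured is then pure overhead. This is a quick, hypothetical experiment (the noop function and the task count are illustrative, not part of pi.py):

    import time
    import multiprocessing

    def noop():
        # a task that does essentially no work, so any measured
        # time is dispatch and serialization overhead
        return 0

    if __name__ == '__main__':
        pool = multiprocessing.Pool()
        start = time.time()
        results = [pool.apply_async(noop) for i in range(10000)]
        for r in results:
            r.get()
        print("seconds per task:", (time.time() - start) / 10000)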

To solve the issue, we have to make the overhead negligible compared to the calculation time. For example, we can ask each worker to handle more than one sample at a time, thus reducing the task communication overhead. We can write a sample_multiple function that processes a batch of samples and modify our parallel version to divide the problem into 10 larger, more computationally intensive tasks, as shown in the following code:

    def sample_multiple(samples_partial):
        return sum(sample() for i in range(samples_partial))

    n_tasks = 10
    chunk_size = samples // n_tasks
    pool = multiprocessing.Pool()
    results_async = [pool.apply_async(sample_multiple, (chunk_size,))
                     for i in range(n_tasks)]
    hits = sum(r.get() for r in results_async)

We can wrap this in a function called pi_apply_async_chunked.
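
As a sketch (assuming the samples, sample, and sample_multiple definitions shown earlier; the version in pi.py may differ), the wrapper could look like this:

    def pi_apply_async_chunked():
        # chunked parallel version: 10 coarse-grained tasks
        # instead of one task per sample
        n_tasks = 10
        chunk_size = samples // n_tasks
        pool = multiprocessing.Pool()
        results_async = [pool.apply_async(sample_multiple, (chunk_size,))
                         for i in range(n_tasks)]
        hits = sum(r.get() for r in results_async)
        return 4.0 * hits/samples

Running it, we get the following timing: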

$ time python -c 'import pi; pi.pi_apply_async_chunked()'
real 0m0.325s
user 0m0.816s
sys 0m0.008s

The results are much better; we more than doubled the speed of our program with respect to the serial version. Also notice that the user metric is larger than real: the total CPU time is larger than the wall-clock time because more than one CPU worked at the same time. If you increase the number of samples, the ratio of communication to calculation decreases, giving even better speedups.

Everything is nice and simple when dealing with embarrassingly parallel problems. However, sometimes you have to share data between processes.
