Profiling Theano

Given the importance of measuring and analyzing performance, Theano provides powerful and informative profiling tools. To generate profiling data, the only change required is adding the profile=True option to th.function:

    calculate_pi = th.function([x, y], pi_est, profile=True)

The profiler will collect data as the function is being run (for example, through timeit or direct invocation). The profiling summary can be printed by calling the summary method of the profile attribute, as follows:

    calculate_pi.profile.summary()

To generate profiling data, we can rerun our script after adding the profile=True option (for this experiment, we will set the OMP_NUM_THREADS environment variable to 1). We will also revert our script to the version that performs the casting of hit_tests implicitly.
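A minimal sketch of such a rerun, assuming the symbolic definitions from the earlier version of the script, could look like the following; the 30,000-point sample size is an illustrative choice, while the 100,000 calls match the profile reported below:

    import os
    os.environ['OMP_NUM_THREADS'] = '1'  # set before Theano compiles any OpenMP code

    import numpy as np
    import theano as th
    import theano.tensor as T

    # Assumed reconstruction of the earlier pi-estimation script
    x = T.vector('x')
    y = T.vector('y')
    hit_tests = x ** 2 + y ** 2 < 1            # boolean hit test, cast implicitly
    pi_est = 4.0 * hit_tests.sum() / x.shape[0]

    calculate_pi = th.function([x, y], pi_est, profile=True)

    x_val = np.random.uniform(-1, 1, 30000)
    y_val = np.random.uniform(-1, 1, 30000)

    for _ in range(100000):                    # direct invocation; timeit works equally well
        calculate_pi(x_val, y_val)

    calculate_pi.profile.summary()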

You can also set up profiling globally using the config.profile option.
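For example, assuming the global flag behaves like the per-function option, it can be switched on before any function is compiled (setting the THEANO_FLAGS environment variable to profile=True has the same effect):

    import theano as th
    import theano.tensor as T

    th.config.profile = True        # every function compiled after this point is profiled

    x, y = T.vector('x'), T.vector('y')
    pi_est = 4.0 * ((x ** 2 + y ** 2) < 1).sum() / x.shape[0]
    calculate_pi = th.function([x, y], pi_est)   # no per-function profile=True needed

    # Once the function has run, calculate_pi.profile.summary() is available as before.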

The output printed by calculate_pi.profile.summary() is quite long and informative; a part of it is reported in the next block of text. The output consists of three sections with timings sorted by Class, Ops, and Apply. In our example, we are concerned with Ops, which roughly map to the functions used in the Theano-compiled code. As you can see, roughly 80% of the time is spent in the element-wise operation that squares the two coordinates, adds them, and compares the result (the hit test), while most of the remaining time is spent in the sum that counts the hits:

    Function profiling
    ==================
    Message: test_theano.py:15

    ... other output
    Time in 100000 calls to Function.__call__: 1.015549e+01s
    ... other output

    Class
    ---
    <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
    .... timing info by class

    Ops
    ---

    <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
    80.0% 80.0% 6.722s 6.72e-05s C 100000 1 Elemwise{Composite{LT((sqr(i0) + sqr(i1)), i2)}}
    19.4% 99.4% 1.634s 1.63e-05s C 100000 1 Sum{acc_dtype=int64}
    0.3% 99.8% 0.027s 2.66e-07s C 100000 1 Elemwise{Composite{((i0 * i1) / i2)}}
    0.2% 100.0% 0.020s 2.03e-07s C 100000 1 Shape_i{0}
    ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

    Apply
    ------
    <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
    ... timing info by apply

This information is consistent with what we found in our first benchmark, where the code went from about 11 seconds with a single thread to roughly 8 seconds with two threads. From these numbers, we can analyze how the time was spent.

Out of these 11 seconds, 80% of the time (about 8.8 seconds) was spent doing element-wise operations. This means that, under perfectly parallel conditions, using two threads would halve that portion, saving about 4.4 seconds. In this scenario, the theoretical execution time would be 6.6 seconds. Considering that we obtained a timing of about 8 seconds, it looks like there is some extra overhead (1.4 seconds) associated with the thread usage.
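The same accounting can be written out explicitly; the numbers below are the rounded figures quoted above, not new measurements:

    serial_time = 11.0                 # single-thread runtime from the first benchmark (s)
    elemwise_fraction = 0.80           # share of time in the element-wise Op (from the profile)

    elemwise_time = serial_time * elemwise_fraction   # ~8.8 s
    ideal_saving = elemwise_time / 2                  # ~4.4 s saved if two threads scale perfectly
    ideal_time = serial_time - ideal_saving           # ~6.6 s theoretical runtime
    overhead = 8.0 - ideal_time                       # ~1.4 s attributable to threading overhead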
