Getting started with Theano

Theano is somewhat similar to a compiler, with the added benefit that it can express, manipulate, and optimize mathematical expressions as well as run code on the CPU and GPU. Since 2010, Theano has improved release after release and has been adopted by several other Python projects as a way to automatically generate efficient computational models on the fly.

In Theano, you first define the function you want to run by specifying variables and transformations using a pure Python API. This specification is then compiled to machine code for execution.

As a first example, let's examine how to implement a function that computes the square of a number. The input will be represented by a scalar variable, a, and then we will transform it to obtain its square, indicated by a_sq. In the following code, we will use the T.scalar function to define the variable and use the normal ** operator to obtain a new variable:

    import theano.tensor as T
    import theano as th

    a = T.scalar('a')
    a_sq = a ** 2
    print(a_sq)
    # Output:
    # Elemwise{pow,no_inplace}.0

As you can see, no specific value is computed and the transformation we apply is purely symbolic. In order to use this transformation, we need to generate a function. To compile a function, you can use the th.function utility that takes a list of the input variables as its first argument, and the output transformation (in our case a_sq) as its second argument:

    compute_square = th.function([a], a_sq)

Theano will take some time to translate the expression into efficient C code and compile it, all in the background! The return value of th.function is a ready-to-use Python function, and its usage is demonstrated in the next line of code:

    compute_square(2)
    # Result:
    # 4.0

Unsurprisingly, compute_square correctly returns the input value squared. Note, however, that the return type is not an integer (like the input type) but a floating-point number. This is because the default Theano variable type is float64. You can verify this by inspecting the dtype attribute of the a variable:

    a.dtype
    # Result:
    # float64

Theano's behavior is quite different from what we saw with Numba. Theano doesn't compile generic Python code, nor does it perform any type inference; defining Theano functions requires a more precise specification of the types involved.

The real power of Theano comes from its support for array expressions. Defining a one-dimensional vector can be done with the T.vector function; the returned variable supports broadcasting operations with the same semantics as NumPy arrays. For instance, we can take two vectors and compute the element-wise sum of their squares, as follows:

    a = T.vector('a')
    b = T.vector('b')
    ab_sq = a**2 + b**2
    compute_square = th.function([a, b], ab_sq)

    compute_square([0, 1, 2], [3, 4, 5])
    # Result:
    # array([ 9., 17., 29.])

The idea is, again, to use the Theano API as a mini-language: we combine various NumPy-like array expressions that will later be compiled to efficient machine code.

One of the selling points of Theano is its ability to perform arithmetic simplifications and automatic gradient calculations. For more information, refer to the official documentation (http://deeplearning.net/software/theano/introduction.html).

To demonstrate Theano functionality on a familiar use case, we can implement our parallel calculation of pi again. Our function will take two collections of random coordinates as input and return the pi estimate. The input random numbers will be defined as vectors named x and y, and we can test their position inside the circle using a standard element-wise operation that we will store in the hit_test variable:

    x = T.vector('x')
    y = T.vector('y')

    hit_test = x ** 2 + y ** 2 < 1

At this point, we need to count the number of True elements in hit_test, which can be done by taking its sum (the Boolean values will be implicitly cast to integers). To obtain the pi estimate, we finally need to calculate the ratio of hits versus the total number of trials. The calculation is illustrated in the following code block:

    hits = hit_test.sum()
    total = x.shape[0]
    pi_est = 4 * hits/total
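For reference, the same estimate can be written directly in NumPy; this is a plain, non-compiled sketch of the computation the Theano expressions above describe (the seeded generator is only there to make the example reproducible):

```python
import numpy as np

def calculate_pi_numpy(x_val, y_val):
    # Element-wise hit test: True where the point lies inside the unit circle
    hit_test = x_val ** 2 + y_val ** 2 < 1
    hits = hit_test.sum()          # Booleans are implicitly cast to integers
    total = x_val.shape[0]
    return 4 * hits / total

rng = np.random.default_rng(42)
x_val = rng.uniform(-1, 1, 30000)
y_val = rng.uniform(-1, 1, 30000)
print(calculate_pi_numpy(x_val, y_val))  # roughly 3.14
```

The Theano version performs exactly these steps, but only after compiling the whole expression graph to machine code.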

We can benchmark the execution of the Theano implementation using th.function and the timeit module. In our test, we will pass two arrays of size 30,000 and use the timeit.timeit utility to execute the calculate_pi function multiple times:

    calculate_pi = th.function([x, y], pi_est)

    import numpy as np
    x_val = np.random.uniform(-1, 1, 30000)
    y_val = np.random.uniform(-1, 1, 30000)

    import timeit
    res = timeit.timeit("calculate_pi(x_val, y_val)",
                        "from __main__ import x_val, y_val, calculate_pi",
                        number=100000)
    print(res)
    # Output:
    # 10.905971487998613

The serial execution of this benchmark (100,000 iterations) takes about 10 seconds. Theano is capable of automatically parallelizing the code by implementing element-wise and matrix operations using specialized tools, such as OpenMP and the Basic Linear Algebra Subprograms (BLAS) linear algebra routines. Parallel execution can be enabled using configuration options.

In Theano, you can set up configuration options by modifying variables in the theano.config object at import time. For example, you can issue the following commands to enable OpenMP support:

    import theano
    theano.config.openmp = True
    theano.config.openmp_elemwise_minsize = 10

The parameters relevant to OpenMP are as follows:

  • openmp_elemwise_minsize: This is an integer number that represents the minimum size of the arrays where element-wise parallelization should be enabled (the overhead of the parallelization can harm performance for small arrays)
  • openmp: This is a Boolean flag that controls the activation of OpenMP compilation (it should be activated by default)

Controlling the number of threads assigned for OpenMP execution can be done by setting the OMP_NUM_THREADS environment variable before executing the code.

We can now write a simple benchmark to demonstrate the OpenMP usage in practice. In a file test_theano.py, we will put the complete code for the pi estimation example:

    # File: test_theano.py
    import numpy as np
    import theano.tensor as T
    import theano as th

    th.config.openmp_elemwise_minsize = 1000
    th.config.openmp = True

    x = T.vector('x')
    y = T.vector('y')

    hit_test = x ** 2 + y ** 2 < 1
    hits = hit_test.sum()
    total = x.shape[0]
    pi_est = 4 * hits/total

    calculate_pi = th.function([x, y], pi_est)

    x_val = np.random.uniform(-1, 1, 30000)
    y_val = np.random.uniform(-1, 1, 30000)

    import timeit
    res = timeit.timeit("calculate_pi(x_val, y_val)",
                        "from __main__ import x_val, y_val, calculate_pi",
                        number=100000)
    print(res)

At this point, we can run the code from the command line and assess the scaling with an increasing number of threads by setting the OMP_NUM_THREADS environment variable:

    $ OMP_NUM_THREADS=1 python test_theano.py
    10.905971487998613
    $ OMP_NUM_THREADS=2 python test_theano.py
    7.538279129999864
    $ OMP_NUM_THREADS=3 python test_theano.py
    9.405846934998408
    $ OMP_NUM_THREADS=4 python test_theano.py
    14.634153957000308

Interestingly, there is a modest speedup when using two threads, but the performance degrades quickly as we increase their number. This means that, for this input size, it is not advantageous to use more than two threads: the price you pay to start new threads and synchronize their shared data is higher than the speedup you can obtain from the parallel execution.

Achieving good parallel performance can be tricky as it will depend on the specific operations and how they access the underlying data. As a general rule, measuring the performance of a parallel program is crucial and obtaining substantial speedups is a work of trial and error.

As an example, we can see that the parallel performance quickly degrades with a slightly different code. In our hit test, we used the sum method directly and relied on the implicit casting of the hit_test Boolean array. If we make the cast explicit, Theano generates slightly different code that benefits less from multiple threads. We can modify the test_theano.py file to verify this effect:

    # Older version
    # hits = hit_test.sum()
    hits = hit_test.astype('int32').sum()

If we rerun our benchmark, we see that the number of threads no longer affects the running time significantly. Despite that, the timings improve considerably compared to the original version:

    $ OMP_NUM_THREADS=1 python test_theano.py
    5.822126664999814
    $ OMP_NUM_THREADS=2 python test_theano.py
    5.697357518001809
    $ OMP_NUM_THREADS=3 python test_theano.py
    5.636914656002773
    $ OMP_NUM_THREADS=4 python test_theano.py
    5.764030176000233