Running code on a GPU

In this subsection, we will demonstrate the usage of a GPU with Theano and TensorFlow. As an example, we will benchmark the execution of a very simple matrix multiplication on the GPU and compare its running time to that on a CPU.

The code in this subsection requires a GPU. For learning purposes, it is possible to use the Amazon EC2 service (https://aws.amazon.com/ec2) to request a GPU-enabled instance.

The following code performs a simple matrix multiplication using Theano. We use the T.matrix function to initialize a two-dimensional array, and then we use the T.dot method to perform the matrix multiplication:

    from theano import function, config
    import theano.tensor as T
    import numpy as np
    import time

    N = 5000

    A_data = np.random.rand(N, N).astype('float32')
    B_data = np.random.rand(N, N).astype('float32')

    A = T.matrix('A')
    B = T.matrix('B')

    f = function([A, B], T.dot(A, B))

    start = time.time()
    f(A_data, B_data)

    print("Matrix multiply ({}) took {} seconds".format(N, time.time() - start))
    print('Device used:', config.device)

It is possible to ask Theano to execute this code on a GPU by setting the device configuration option to gpu. For convenience, we can set this configuration value from the command line using the THEANO_FLAGS environment variable, as shown next. After copying the previous code into the test_theano_matmul.py file, we can benchmark the execution time by issuing the following command:

    $ THEANO_FLAGS=device=gpu python test_theano_matmul.py
    Matrix multiply (5000) took 0.4182612895965576 seconds
    Device used: gpu
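
If you would rather not type the flag on every invocation, the same setting can be made persistent through Theano's configuration file. The following is a minimal sketch, assuming the default setup where Theano reads a .theanorc file from your home directory; the floatX entry simply matches the float32 arrays used in the script:

    [global]
    device = gpu
    floatX = float32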

We can analogously run the same code on the CPU using the device=cpu configuration option:

    $ THEANO_FLAGS=device=cpu python test_theano_matmul.py
    Matrix multiply (5000) took 2.9623231887817383 seconds
    Device used: cpu

As you can see, the GPU runs roughly seven times faster than the CPU for this example!

For comparison, we can benchmark equivalent code using TensorFlow. The TensorFlow version is reported in the next code snippet. The main differences from the Theano version are as follows:

  • The usage of the tf.device context manager to specify the target device (/cpu:0 or /gpu:0)
  • The matrix multiplication is performed using the tf.matmul operator:
    import tensorflow as tf
    import time
    import numpy as np

    N = 5000

    A_data = np.random.rand(N, N)
    B_data = np.random.rand(N, N)

    # Creates a graph, placing the operations on the requested device.
    with tf.device('/gpu:0'):
        A = tf.placeholder('float32')
        B = tf.placeholder('float32')
        C = tf.matmul(A, B)

    with tf.Session() as sess:
        start = time.time()
        sess.run(C, {A: A_data, B: B_data})
        print('Matrix multiply ({}) took: {}'.format(N, time.time() - start))

If we run the test_tensorflow_matmul.py script with the appropriate tf.device option, we obtain the following timings:

    # Ran with tf.device('/gpu:0')
    Matrix multiply (5000) took: 1.417285680770874

    # Ran with tf.device('/cpu:0')
    Matrix multiply (5000) took: 2.9646761417388916

As you can see, the performance gain is substantial (but not as good as the Theano version) in this simple case.
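
If you want to verify which device actually executed the operations, TensorFlow can log the placement of each node in the graph. The following is a minimal sketch using the log_device_placement option of tf.ConfigProto; it assumes the A, B, C, A_data, and B_data variables defined in the previous snippet:

    # Print the device each operation is placed on when the graph runs.
    config = tf.ConfigProto(log_device_placement=True)
    with tf.Session(config=config) as sess:
        sess.run(C, {A: A_data, B: B_data})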

Another way to achieve automatic GPU computation is through the now-familiar Numba. With Numba, it is possible to compile Python code into programs that can run on a GPU. This flexibility allows for advanced GPU programming as well as simpler interfaces. In particular, Numba makes it extremely easy to write GPU-ready generalized universal functions (we'll sketch one right after the expon_gpu example below).

In the next example, we will demonstrate how to write a universal function that applies an exponential function to two numbers and sums the results. As we already saw in Chapter 5, Exploring Compilers, this can be accomplished using the nb.vectorize function (we'll also specify the cpu target explicitly):

    import numba as nb
    import math

    @nb.vectorize(target='cpu')
    def expon_cpu(x, y):
        return math.exp(x) + math.exp(y)

The expon_cpu universal function can be compiled for the GPU device using the target='cuda' option. Also, note that it is necessary to specify the input and return types for CUDA universal functions. The implementation of expon_gpu is as follows:

    @nb.vectorize(['float32(float32, float32)'], target='cuda')
    def expon_gpu(x, y):
        return math.exp(x) + math.exp(y)
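
The same approach extends to generalized universal functions, which operate on whole subarrays rather than on scalars. The following is a minimal sketch using the nb.guvectorize decorator; the running_sum_gpu function, its signature, and the layout string are our own illustration (not part of the original example) and assume a CUDA-capable device:

    import numba as nb
    import numpy as np

    # For each input row of length n, write the running (cumulative) sum
    # of its elements into the corresponding output row.
    @nb.guvectorize(['void(float32[:], float32[:])'], '(n)->(n)', target='cuda')
    def running_sum_gpu(row, out):
        acc = 0.0
        for i in range(row.shape[0]):
            acc += row[i]
            out[i] = acc

    a = np.random.rand(100, 1000).astype('float32')
    result = running_sum_gpu(a)  # one running sum per row; shape (100, 1000)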

We can now benchmark the execution of the two functions by applying them to two arrays of 1,000,000 elements each. Also, note that we execute the functions once before measuring the timings to trigger the Numba just-in-time compilation:

    import numpy as np
    import time

    N = 1000000
    niter = 100

    a = np.random.rand(N).astype('float32')
    b = np.random.rand(N).astype('float32')

    # Trigger compilation
    expon_cpu(a, b)
    expon_gpu(a, b)

    # Timing
    start = time.time()
    for i in range(niter):
        expon_cpu(a, b)
    print("CPU:", time.time() - start)

    start = time.time()
    for i in range(niter):
        expon_gpu(a, b)
    print("GPU:", time.time() - start)

    # Output:
    # CPU: 2.4762887954711914
    # GPU: 0.8668839931488037

Thanks to the GPU execution, we were able to achieve a 3x speedup over the CPU version. Note that transferring data to the GPU is quite expensive; therefore, GPU execution becomes advantageous only for very large arrays.
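
When the same arrays are reused across many calls, one way to amortize this cost is to copy them to the GPU once and pass device arrays to the ufunc, so that only the final result is transferred back. The following is a minimal sketch using numba.cuda; the d_a, d_b, and d_out names are our own, and the snippet assumes the expon_gpu, a, b, and niter variables defined above:

    from numba import cuda

    # Copy the inputs to the GPU once, outside the timed loop.
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)

    start = time.time()
    for i in range(niter):
        # Device-array inputs avoid repeated host-to-device copies;
        # the result is returned as a device array and stays on the GPU.
        d_out = expon_gpu(d_a, d_b)
    print("GPU (device arrays):", time.time() - start)

    result = d_out.copy_to_host()  # transfer the result back only once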
