© Nikhil Ketkar, Jojo Moolayil 2021
N. Ketkar, J. Moolayil, Deep Learning with Python, https://doi.org/10.1007/978-1-4842-5364-9_4

4. Automatic Differentiation in Deep Learning

Nikhil Ketkar1   and Jojo Moolayil2
(1)
Bangalore, Karnataka, India
(2)
Vancouver, BC, Canada
 

While exploring stochastic gradient descent in Chapter 3, we treated the computation of the gradient of the loss function, $$ {\nabla}_x L(x) $$, as a black box. In this chapter, we open that black box and cover the theory and practice of automatic differentiation, and we explore PyTorch’s Autograd module, which implements it. Automatic differentiation is a mature technique that allows for the effortless and efficient computation of gradients of arbitrarily complicated loss functions. This is critical when it comes to minimizing loss functions of interest; at the heart of building any deep learning model lies an optimization problem that is invariably solved using stochastic gradient descent, which, in turn, requires one to compute gradients.

Automatic differentiation is distinct from both numerical and symbolic differentiation. We start by covering enough about each of these so that the distinction becomes clear. For the purposes of illustration, assume that our function of interest is f : R → R and that we intend to find the derivative of f, denoted by f ′(x).

Numerical Differentiation

Numerical differentiation, in its basic form, follows from the definition of the derivative/gradient and is used to estimate the derivative of a mathematical function. The derivative of y with respect to x defines the rate of change of y with respect to x. A simple way to estimate it is to compute the slope of the line through the points (x, f(x)) and (x + h, f(x + h)).

So, given that
$$ {f}^{\prime }(x)=\frac{df}{dx}=\underset{\Delta x\to 0}{\lim}\frac{f\left(x+\Delta x\right)-f(x)}{\Delta x} $$
we can approximate f ′(x) using the forward difference method as
$$ {f}^{\prime }(x)={D}_{+}(h)=\frac{f\left(x+h\right)-f(x)}{h} $$
setting a suitably small value for h. Similarly, we can approximate f ′(x) using the backward difference method as
$$ {f}^{\prime }(x)={D}_{-}(h)=\frac{f(x)-f\left(x-h\right)}{h} $$

again, by setting a suitably small value for h.

A more symmetric form is the central difference approach, which computes f ′(x) as
$$ {f}^{\prime }(x)={D}_0(h)=\frac{f\left(x+h\right)-f\left(x-h\right)}{2h} $$
Extrapolation is the process of using known values to estimate a value outside the existing known range. Richardson extrapolation is a technique that combines difference estimates computed at two step sizes to produce a higher-order-accurate estimate of the derivative:
$$ {f}^{\prime }(x)=\frac{4{D}_0(h)-{D}_0(2h)}{3} $$

The approximation errors for the forward and backward differences are on the order of h, that is, O(h), whereas those for the central difference and Richardson extrapolation are O(h²) and O(h⁴), respectively.
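To make these formulas concrete, here is a minimal Python sketch (not from the book) that compares the four approximations on f(x) = sin(x), whose true derivative is cos(x); the choices x = 1.0 and h = 1e-3 are arbitrary, purely for illustration.
#Compare finite-difference approximations against the known derivative of sin(x)
import math
def f(x):
    return math.sin(x)
def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h
def backward_diff(f, x, h):
    return (f(x) - f(x - h)) / h
def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)
def richardson(f, x, h):
    return (4 * central_diff(f, x, h) - central_diff(f, x, 2 * h)) / 3
x, h = 1.0, 1e-3
true_value = math.cos(x)
for name, approx in [("forward", forward_diff), ("backward", backward_diff),
                     ("central", central_diff), ("richardson", richardson)]:
    estimate = approx(f, x, h)
    print(name, "estimate:", estimate, "abs error:", abs(estimate - true_value))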

The key problems with numerical differentiation are its computational cost, which grows with the number of parameters in the loss function, truncation errors, and round-off errors. The truncation error is the inaccuracy in the computation of f ′(x) due to h not being zero. The round-off error is inherent to using floating-point numbers and floating-point arithmetic (as opposed to using infinite-precision numbers, which would be prohibitively expensive).

Numerical differentiation is thus not a feasible approach for computing gradients while building deep learning models. The only place where numerical differentiation comes in handy is quickly checking whether gradients are being computed correctly. This is highly recommended when you have computed gradients manually or with a new/unknown automatic differentiation library. Ideally, this check should be put in as an automated check/assertion before starting SGD.
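A minimal sketch of such a check is shown below, using a toy loss whose analytic gradient is known; the helper names (loss_fn, analytic_grad, numerical_grad) are hypothetical, not part of any library.
#Gradient check: compare an analytic gradient with a central-difference estimate
import torch
def loss_fn(w):
    #Toy scalar loss; in practice this would be the model's loss as a function of w
    return (w ** 2).sum()
def analytic_grad(w):
    #Hand-derived gradient of the toy loss: d/dw of sum(w^2) is 2w
    return 2 * w
def numerical_grad(fn, w, h=1e-5):
    grad = torch.zeros_like(w)
    for i in range(w.numel()):
        w_plus, w_minus = w.clone(), w.clone()
        w_plus.view(-1)[i] += h
        w_minus.view(-1)[i] -= h
        grad.view(-1)[i] = (fn(w_plus) - fn(w_minus)) / (2 * h)
    return grad
w = torch.randn(5, dtype=torch.float64)
assert torch.allclose(analytic_grad(w), numerical_grad(loss_fn, w), atol=1e-6)
print("Gradient check passed.")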

Note

Numerical differentiation is implemented in the Python package SciPy. We do not cover it here, as it is not directly relevant to deep learning.

Symbolic Differentiation

Symbolic differentiation, in its basic form, is a set of symbol rewriting rules applied to the loss function to arrive at the derivatives/gradients. Consider two such simple rules:
$$ \frac{d}{dx}\left(f(x)+g(x)\right)=\frac{d}{dx}f(x)+\frac{d}{dx}g(x) $$
and
$$ \frac{d}{dx}{x}^n=n{x}^{\left(n-1\right)} $$
Given a function such as f(x) = 2x³ + x², we can successively apply the symbol rewriting rules to first arrive at
$$ {f}^{\prime }(x)=\frac{d}{dx}\left(2{x}^3\right)+\frac{d}{dx}\left({x}^2\right) $$
by applying the first rewriting rule, and
$$ {f}^{\prime }(x)=6{x}^2+2x $$

by applying the second rule.
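As a quick illustrative aside (assuming the SymPy package is installed), the same result can be reproduced programmatically; this is only a sketch to confirm the hand derivation.
#Reproduce the hand-derived result symbolically
import sympy
x = sympy.symbols("x")
f = 2 * x**3 + x**2
print(sympy.diff(f, x))   #prints 6*x**2 + 2*x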

Symbolic differentiation is thus automating what we do when we derive gradients manually. Of course, the number of such rules can be large, and more sophisticated algorithms can be leveraged to make this symbol rewriting more efficient. In its essence, however, symbolic differentiation is simply the application of a set of symbol rewriting rules. The key advantage of symbolic differentiation is that it generates a legible mathematical expression for the derivative/gradient that can be understood and analyzed.

The key problem with symbolic differentiation is that it is limited to the symbolic differentiation rules already defined, which can cause us to hit roadblocks when trying to minimize complicated loss functions. An example of this is when your loss function involves an if-else clause or a for/while loop. In a sense, symbolic differentiation is differentiating a (closed form) mathematical expression; it is not differentiating a given computational procedure.

Another problem with symbolic differentiation is that a naïve application of symbol rewriting rules can, in some cases, lead to an explosion of symbolic terms (expression swell) and make the process computationally infeasible. Typically, a fair amount of computational effort is required to simplify such expressions and produce a closed-form expression of the derivative.
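The short sketch below (again purely illustrative, assuming SymPy) hints at this behavior: differentiating a repeatedly composed expression yields a symbolic result that is far larger than the original expression.
#Illustrate expression swell: the derivative of a deeply composed expression
import sympy
x = sympy.symbols("x")
expr = x
for _ in range(6):
    expr = sympy.sin(expr) * sympy.cos(expr)   #compose the expression repeatedly
derivative = sympy.diff(expr, x)
print("operations in f :", sympy.count_ops(expr))
print("operations in f':", sympy.count_ops(derivative))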

Note

Symbolic differentiation is implemented in a Python package called SymPy. We do not cover it here, as it is not directly relevant to deep learning.

Automatic Differentiation Fundamentals

The first key intuition behind automatic differentiation is that all functions of interest (which we intend to differentiate) can be expressed as compositions of elementary functions for which corresponding derivative functions are known. Composite functions thus can be differentiated by applying the chain rule for derivatives. This intuition is also at the basis of symbolic differentiation.

The second key intuition behind automatic differentiation is that rather than storing and manipulating intermediate symbolic forms of derivatives of primitive functions, we can simply evaluate them (for a specific set of input values) and thus address the issue of expression swell. Because intermediate symbolic forms are being evaluated, we do not have the burden of simplifying the expression. Note that this prevents us from getting a closed-form mathematical expression of the derivative like the one symbolic differentiation gives us; what we get via automatic differentiation is the value of the derivative evaluated for a given set of inputs.

The third key intuition behind automatic differentiation is that because we are evaluating derivatives of primitive forms, we can deal with arbitrary computational procedures and not just closed form mathematical expressions. That is, our function can contain if-else statements, for loops, or even recursion. The way automatic differentiation deals with any computational procedure is to treat a single evaluation of the procedure (for a given set of inputs) as a finite list of elementary function evaluations over the input variables to produce one or more output variables. Although there might be control flow statements (if-else statements, for loops, etc.), ultimately, there is a specific list of function evaluations that transform the given input to the output. Such a list/evaluation trace is referred to as a Wengert list.
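The sketch below (a hypothetical illustration, not the book's code) records the evaluation trace of a small procedure containing a loop and a branch; for a given input, the trace is simply a fixed list of elementary operations, which is exactly what a Wengert list captures.
#Record the evaluation trace (Wengert list) of a procedure with control flow
def traced_function(x, trace):
    v = x
    for i in range(3):                       #control flow: a for loop
        if v > 1.0:                          #control flow: an if-else branch
            v = v * v
            trace.append("v{} = v{} * v{}".format(i + 1, i, i))
        else:
            v = v + 2.0
            trace.append("v{} = v{} + 2.0".format(i + 1, i))
    return v, trace
value, trace = traced_function(0.5, ["v0 = x"])
print("output:", value)
for step in trace:
    print(step)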

To understand how automatic differentiation works for a deep learning use case, let’s take a simple function, differentiate it manually using the chain rule, and then look at the PyTorch equivalent of implementing the same.

In deep learning networks, the entire flow is represented using a computational graph, which is a directed graph whose nodes represent mathematical operations. This provides an easy-to-evaluate mathematical expression. Computational graphs can be translated into data structures, so the problem can be approached programmatically with a programming language, making it intuitive to solve much larger problems.

We will use a relatively small and easy to compute function to work through our example.

Assume that f(x, y, z) = (x + y)*z and that we have values for the three variables: x = 1, y = -2, and z = 3.

We can represent this function using a computational graph, as shown in Figure 4-1.
Figure 4-1

A computational graph

Along with the input variables (x, y, and z), we will see the variable a, which is an intermediate variable that stores the computed value of (x + y), and the variable f, which stores the final value of (x + y)z—i.e., a*z.

In the forward pass, we will substitute the values and arrive at the final value as

x = 1, y =-2, z= 3

Then,

(x + y)z = (1 - 2) * 3 = -3

Therefore,

f = -3

We can visualize this using the computational graph shown in Figure 4-2.
Figure 4-2

A computational graph with computed values

Now, with automatic differentiation, we want to find the gradients of f with respect to the input variables (x, y, and z), represented as $$ \frac{\partial f}{\partial x} $$, $$ \frac{\partial f}{\partial y} $$, and $$ \frac{\partial f}{\partial z} $$.

In the feed-forward network, essentially, we find the gradients of the loss function with respect to the weights. To solve this, we can use the chain rule.

Let’s find the partial derivatives for the above equation.

We know that a = (x + y) and thus f = az.

Therefore,

$$ \frac{\partial f}{\partial z}=\frac{\partial (az)}{\partial z}=a $$ = (x + y) = (1 - 2) = -1

and
$$ \frac{\partial f}{\partial a}=\frac{\partial (az)}{\partial a}=z $$

If we go one step further, we can find the partial derivatives of a with regard to x and y.

$$ \frac{\partial a}{\partial x}=\frac{\partial \left(x+y\right)}{\partial x}=1 $$, and $$ \frac{\partial a}{\partial y}=\frac{\partial \left(x+y\right)}{\partial y}=1 $$

Now we come to our end objective: finding the gradients of f with respect to x, y, and z. We have already computed the required gradient with respect to z. For x and y, we can leverage the previously computed values in the chain rule as
$$ \frac{\partial f}{\partial x}=\frac{\partial f}{\partial a}\frac{\partial a}{\partial x}=z\ast 1=3 $$
$$ \frac{\partial f}{\partial y}=\frac{\partial f}{\partial a}\frac{\partial a}{\partial y}=z\ast 1=3 $$

We now have computed all the values required.

$$ \frac{\partial f}{\partial x}=3 $$, $$ \frac{\partial f}{\partial y}=3 $$, and $$ \frac{\partial f}{\partial z}=-1 $$

Essentially, what the network infers is that x and y positively influence the outcome, whereas z negatively influences it (Figure 4-3). This information is used to reduce the loss by incrementally updating the weights of the network toward the minimum.
Figure 4-3

A computational graph with partial derivatives
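Before turning to PyTorch, here is a plain-Python sketch (not from the book) that mirrors the manual computation above: a forward pass through the graph followed by a backward application of the chain rule.
#Forward pass through the computational graph, then chain rule backward
x, y, z = 1.0, -2.0, 3.0
a = x + y              #a = -1.0
f = a * z              #f = -3.0
#Backward pass: local derivatives combined via the chain rule
df_da = z              #partial of f with respect to a
df_dz = a              #partial of f with respect to z
da_dx = 1.0            #partial of a with respect to x
da_dy = 1.0            #partial of a with respect to y
df_dx = df_da * da_dx  #equals 3.0
df_dy = df_da * da_dy  #equals 3.0
print("f =", f, "df/dx =", df_dx, "df/dy =", df_dy, "df/dz =", df_dz)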

Implementing Automatic Differentiation

Let’s now consider how automatic differentiation is implemented within PyTorch. The preceding example was very simple; working through the approach on paper becomes truly complicated for large functions (i.e., deep learning models). In most common networks, the number of parameters involved is very high, making it a herculean task to manually program the computation of gradients.

PyTorch provides the Autograd package, which essentially simplifies the entire process for us. Recall the loss.backward() function that we leveraged in Chapter 3 for the toy neural network: that single call computes all the necessary gradients of the loss with respect to the weights. Let’s explore this further.

What Is Autograd?

The Autograd package within PyTorch provides automatic differentiation for all operations on tensors. It performs the necessary computations within backpropagation for our neural network. When the backward() function is called, the module computes all the backpropagation gradients automatically. We can also access individual gradients through a variable’s grad attribute.

The Autograd module provides ready-to-use tools (functions/classes) for implementing automatic differentiation of arbitrary scalar-valued functions. To enable gradients to be computed for a tensor, we only need to set its requires_grad keyword to True.

Let’s replicate the same example we used to manually implement automatic differentiation but using PyTorch (Listing 4-1).
#Import required libraries
import torch
#Define tensors
x = torch.Tensor([1])
y = torch.Tensor([-2])
z = torch.Tensor([3])
print("Default value for requires_grad for x:",x.requires_grad)
#Set the keyword requires_grad as True (default is False)
x.requires_grad=True
y.requires_grad=True
z.requires_grad=True
print("Updated  value for requires_grad for x:",x.requires_grad)
#Compute a
a = x + y
#Finally define the function f
f = z * a
print("Final value for Function f = ",f)
#Compute gradients
f.backward()
#Print the gradient value
print("Gradient value for x:",x.grad)
print("Gradient value for y:",y.grad)
print("Gradient value for z:",z.grad)
Output[]
Default value for requires_grad for x: False
Updated value for requires_grad for x: True
Final value for Function f = tensor([-3.], grad_fn=<MulBackward0>)
Gradient value for x: tensor([3.])
Gradient value for y: tensor([3.])
Gradient value for z: tensor([-1.])
Listing 4-1

Implementing Automatic Differentiation (Autograd) in PyTorch

The gradient values here match exactly with what we computed manually earlier.

In the preceding example, we first created a tensor and then set the requires_grad keyword to True. We can also combine this with the tensor’s definition.
x = torch.autograd.Variable(torch.Tensor([1]),requires_grad=True)
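In recent PyTorch releases, the Variable wrapper has been merged into Tensor, so an equivalent (and now preferred) way to create the same tensor is the following; this is an alternative, not the book’s listing.
x = torch.tensor([1.0], requires_grad=True)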

When we define a network in PyTorch, a lot of these details are taken care of for us. When we define a network layer with nn.Linear(64, 256) (refer to the Chapter 3 example), PyTorch creates the weight and bias tensors with the necessary values (setting requires_grad as True). The input tensors do not need gradients; hence, we never set the flag in our example and used the default (i.e., False).
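As a small illustration (a sketch, not the Chapter 3 network itself), the snippet below shows that the parameters created by nn.Linear carry requires_grad=True and receive gradients once backward() is called on a scalar value.
#Inspect the gradients of an nn.Linear layer's parameters
import torch
import torch.nn as nn
layer = nn.Linear(64, 256)
print(layer.weight.requires_grad, layer.bias.requires_grad)   #True True
inputs = torch.randn(8, 64)        #input batch; requires_grad defaults to False
loss = layer(inputs).sum()         #a toy scalar "loss" for illustration
loss.backward()
print(layer.weight.grad.shape)     #torch.Size([256, 64])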

Summary

This chapter covered the basics of automatic differentiation. Backpropagation is a special case of automatic differentiation used in training deep neural networks. In modern deep learning literature, the term backpropagation is often used interchangeably with automatic differentiation, although automatic differentiation is the more general concept. The key takeaway from this chapter is that automatic differentiation enables the computation of gradients for arbitrarily complex loss functions and is one of the key enabling technologies for deep learning. You should internalize the concepts of automatic differentiation and how it differs from both symbolic and numerical differentiation.

In the next chapter, we will study some additional topics related to deep learning in more detail, including performance metrics and model evaluation, analyzing overfitting and underfitting, regularization, and hyperparameter tuning. Finally, we will combine all the foundational bits about deep learning we’ve covered so far into a practical example that implements feed-forward neural networks for a real-world dataset.
