In this appendix, we introduce the distinguishing elements of PyTorch and contrast it with its primary competitor, TensorFlow.
In Chapter 14, we introduced PyTorch at a high level. In this section, we continue by examining the library’s core attributes.
PyTorch operates using what’s called an autograd system, which relies on the principle of reverse-mode automatic differentiation. As detailed in Chapter 7, the end product of forward propagating through a deep neural network is the result of a series of functions chained together. Reverse-mode automatic differentiation applies the chain rule to differentiate the cost at the end with respect to the inputs, working backwards (introduced in Chapter 8 and detailed in Appendix B). At each iteration, the activations of the neurons in the network are computed by forward propagation, and each function is recorded on a graph. At the end of the forward pass, this graph can be traversed backwards to calculate the gradient at each neuron.
What makes autograd especially interesting is the define-by-run nature of the framework: The calculations for backpropagation are defined with each forward pass. This is important because it means that the backpropagation step is only dependent on how your code is run, and as such the backpropagation mathematics can vary with each forward pass. This means that every round of training (see Figure 8.5) can be different. This is useful in settings such as natural language processing, where static-graph frameworks typically require the input sequence length to be fixed at the maximum length (i.e., the longest sentence in the corpus), with longer sequences truncated and shorter sequences padded with zeros (as we did in Chapter 11). PyTorch, in contrast, natively supports dynamic inputs, circumventing the need for this truncating and padding.
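To make the define-by-run idea concrete, here is a minimal sketch in which the number of recorded operations depends on the input itself. The function name `repeated_scale`, the threshold of 10, and the input lengths are all hypothetical choices made for illustration; they do not come from the notebooks that accompany this book.

```python
import torch

def repeated_scale(x, weight):
    # The number of loop iterations depends on the data itself, so the
    # graph recorded by autograd can differ from one forward pass to the next.
    steps = 0
    while x.norm() < 10:
        x = x * weight
        steps += 1
    return x.sum(), steps

weight = torch.tensor(2.0, requires_grad=True)
results = {}

for length in (3, 30):
    # Inputs of different lengths, with no truncating or padding
    x = torch.ones(length)
    out, steps = repeated_scale(x, weight)
    weight.grad = None  # discard the gradient from the previous pass
    out.backward()      # backpropagate through whatever graph was just recorded
    results[length] = (steps, weight.grad.item())

print(results)
```

The shorter input passes through the loop three times and the longer one only once, so the two backward passes differentiate different chains of operations even though the Python code is unchanged.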
The define-by-run framework also means that the framework is not asynchronous. When a line is executed, the code is run, making debugging much simpler. When the code throws an error, you’re able to see exactly which line caused the error. Furthermore, by running an appropriate helper function, this so-called eager execution can be easily replaced with a traditional graph-based model—wherein the graphs are defined in advance, which brings with it speed and optimization benefits.
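As a hedged illustration of moving from eager execution to a predefined graph, the sketch below uses `torch.jit.trace`. The text above does not name the helper function, so treating `torch.jit.trace` as that helper is an assumption on our part, and the tiny model and tensor shapes are arbitrary.

```python
import torch

# A small, arbitrary model used purely for illustration
model = torch.nn.Sequential(
    torch.nn.Linear(4, 3),
    torch.nn.Sigmoid(),
)
example_input = torch.randn(2, 4)

# Eager execution: each operation runs as soon as its line is reached
eager_out = model(example_input)

# torch.jit.trace runs the model once on the example input, records the
# operations that were executed, and returns a graph-based version of it
traced_model = torch.jit.trace(model, example_input)
traced_out = traced_model(example_input)

# Both versions should produce the same outputs
print(torch.allclose(eager_out, traced_out))
```

Because tracing records only the operations executed for the example input, data-dependent control flow is not captured this way.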
You might now wonder when to select PyTorch over TensorFlow. There is no clear-cut answer, but we’ll explore some of the advantages and disadvantages of each library here.
One relevant topic is adoption: TensorFlow is currently more widely used than PyTorch. PyTorch was first released to the public in January 2017, whereas TensorFlow was released a little over a year prior, in November 2015. In the rapidly developing world of deep learning, this is a significant head start. Indeed, the 1.0.0 version of PyTorch was only released on December 7, 2018. In this way, TensorFlow gained traction and a large body of tutorials and Stack Overflow posts emerged online, giving Google’s library an edge.
A second consideration is that PyTorch’s dynamic interface makes iteration easier and quicker relative to the static nature of TensorFlow.1 With PyTorch you can define, change, and execute nodes as you go, as opposed to defining the entire model in advance. Debugging is significantly easier in PyTorch, largely because graphs are defined at run time. This means that errors occur when the code is executed and are more easily traceable to the offending line of code.
1. The Eager mode central to TensorFlow 2.0 intends to remedy this.
Visualization in TensorFlow is intuitive and easy with the built-in TensorBoard platform (see Figure 9.8). However, TensorBoard integrations with PyTorch do exist, and data are more implicitly available during PyTorch model training, so custom solutions can be built using other libraries (for example, with matplotlib).
TensorFlow is used in both development and production at Google, and for this reason the library has much more sophisticated deployment options, including mobile support and distributed training support. PyTorch has historically lagged in these departments; however, with the release of PyTorch 1.0.0, a new just-in-time (JIT) compiler and a new distributed library are available to address these shortcomings. Additionally, all of the major cloud providers have announced PyTorch integrations, including ones with TensorBoard and TPU support on Google Cloud!2
2. One might have expected Google to drag its feet on integrations with the library of one of its primary competitors—in this case, Facebook.
When it comes to everyday use, PyTorch feels more “Pythonic” than TensorFlow: It was written specifically as a Python library, and so it will feel familiar to Python developers. While TensorFlow has an established Python implementation that’s widely used, the library was originally written in C++, and so its Python implementation can feel cumbersome. Of course, Keras exists to try to solve this problem, but in the process it obscures some of TensorFlow’s functionality.3 On the topic of Keras, PyTorch has the Fast.ai library,4 which aims to provide high-level abstractions to PyTorch that are analogous to those provided by Keras to TensorFlow.
3. TensorFlow 2.0’s tight coupling with Keras therein intends to correct many of these issues.
Taking all of these topics into account, if you’re doing research or if your in-production execution demands are not very high, PyTorch might be the optimal choice. The speed of iteration when experimenting, coupled with simpler debugging and extensive NumPy integration, make this library well suited to research. However, if you’re deploying deep learning models into a production environment, you’ll find more support with TensorFlow. This is especially the case if you’re using distributed training or performing inference on a mobile platform.
In this section, we go over the basics of PyTorch installation and use.
Alongside TensorFlow and Keras, PyTorch is one of the libraries in the Docker container we recommended installing5 for running the Jupyter notebooks throughout this book. So, if you followed those instructions, you’re already all set. If you’re working outside of our recommended Docker setup, then you can consult the installation notes that are available on the PyTorch homepage.6
5. See the beginning of Chapter 5 for these instructions.
The fundamental units within PyTorch are tensors and variables, which we describe in turn here.
As in TensorFlow, tensor is little more than a fancy name for a matrix or vector. Tensors are functionally the same as NumPy arrays, except that PyTorch provides specific methods to perform computation with them on GPUs. Under the hood, these tensors also keep a record of the graph (for the autograd system) and the gradients.
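The GPU support mentioned above can be sketched as follows. This is a minimal example that falls back to the CPU when no GPU is present; the tensor shapes and variable names are arbitrary.

```python
import torch

# Use a GPU if one is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.ones(2, 3, device=device)  # create a tensor directly on the device
b = torch.ones(2, 3).to(device)      # or move an existing tensor across
c = a + b                            # computed on the device that holds a and b

print(c.device.type, c.sum().item())
```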
The default tensor is usually FloatTensor. PyTorch has eight types of tensors, which contain either integers or floats. When you define which type of tensor you’d like to use, that choice has memory and precision implications; 8-bit integers can only store 256 values (i.e., [0 : 255]) and occupy much less memory than 64-bit7 integers. However, in cases where, say, integers up to 255 are all that is required, using higher-order integers would be unnecessary. This consideration is especially relevant when you’re running models on GPU architectures, because memory is generally the limiting factor on GPUs, as compared to running models on the CPU, where installing more RAM is relatively cheap.
7. 64-bit integers can store values as large as 2^63 − 1, which is roughly 9.2 quintillion.
    import torch

    x = torch.zeros(28, 28, 1, dtype=torch.uint8)
    y = torch.randn(28, 28, 1, dtype=torch.float32)
This code (which is available in our PyTorch Jupyter notebook, along with all of the other examples in this appendix) creates a 28×28×1 tensor, x, that’s filled with zeros, of the type uint8.8 You could also have used torch.ones() to create a comparable tensor filled with ones. The second tensor, y, contains random numbers from the standard normal distribution.9 By definition, these cannot be 8-bit integers, so we specified 32-bit floats here.
8. The “u” in uint8 stands for unsigned, meaning that these 8-bit integers span from 0 to 255 instead of from −128 to 127.
9. The standard normal distribution has a mean of 0 and a standard deviation of 1.
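To put numbers on the memory trade-off between tensor types discussed earlier, the following sketch compares two tensors of the same 28×28×1 shape. The variable names are ours, chosen for illustration.

```python
import torch

small = torch.zeros(28, 28, 1, dtype=torch.uint8)
large = torch.zeros(28, 28, 1, dtype=torch.int64)

# element_size() gives the bytes per element; numel() gives the element count
small_bytes = small.element_size() * small.numel()
large_bytes = large.element_size() * large.numel()
print(small_bytes, large_bytes)

# torch.iinfo reports the range that an integer type can represent
print(torch.iinfo(torch.uint8).max, torch.iinfo(torch.int64).max)
```

The 64-bit tensor occupies eight times the memory of the 8-bit one, which is wasted space whenever the values never exceed 255.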
As mentioned initially, these tensors have a lot in common with NumPy n-dimensional arrays. For example, it’s easy to generate a PyTorch tensor from a NumPy array with the torch.from_numpy() method. The PyTorch library also contains many math operations that can be efficiently performed on these tensors, many of which mirror their NumPy counterparts.
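As a brief sketch of this NumPy interoperability (the array contents here are arbitrary):

```python
import numpy as np
import torch

array = np.arange(6, dtype=np.float32).reshape(2, 3)
tensor = torch.from_numpy(array)  # the tensor shares memory with the array

# Many operations mirror their NumPy counterparts
numpy_total = np.sqrt(array).sum()
torch_total = torch.sqrt(tensor).sum().item()

# Because memory is shared, an in-place change to the array is
# visible through the tensor as well
array[0, 0] = 100.0
print(tensor[0, 0].item())

# Converting back to NumPy is just as direct
back = tensor.numpy()
```

Note that torch.from_numpy() does not copy the data, so call clone() on the result if you need an independent tensor.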
PyTorch tensors can natively store the computational graph for the network as well as the gradients. This is enabled by setting the requires_grad argument to True when you create the tensor. Each such tensor then has a grad attribute that stores the gradient. Initially, this is set to None, and it stays that way until the backward() method is called. The backward() method reverses through the record of operations and calculates the gradient at each point in the graph. After the first call to backward(), the grad attribute becomes filled with gradient values.
In the following code block, we define a simple tensor, perform some mathematical operations, and call the backward() method to reverse through the graph and calculate the gradients. Subsequently, the grad attribute will store gradients.
    import torch

    x = torch.zeros(3, 3, dtype=torch.float32, requires_grad=True)
    y = x - 4
    z = y ** 3 * 6
    out = z.mean()
    out.backward()
    print(x.grad)
Because x had its requires_grad flag set, we can perform backpropagation on this series of computations. PyTorch has accumulated the functions that generated the final output using its autograd system, so calling out.backward() will calculate the gradients and store them in x.grad. The final line prints the following:
    tensor([[32., 32., 32.],
            [32., 32., 32.],
            [32., 32., 32.]])
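The value of 32 can be checked by hand. Writing the nine entries of x as x_i, the computation above takes the mean of 6(x_i − 4)^3, and every entry starts at zero:

```latex
\begin{aligned}
\text{out} &= \frac{1}{9}\sum_{i=1}^{9} 6\,(x_i - 4)^3 \\
\frac{\partial\,\text{out}}{\partial x_i} &= \frac{1}{9}\cdot 18\,(x_i - 4)^2 = 2\,(x_i - 4)^2 \\
\left.\frac{\partial\,\text{out}}{\partial x_i}\right|_{x_i = 0} &= 2\,(-4)^2 = 32
\end{aligned}
```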
As this example demonstrates, PyTorch takes the hassle out of automatic differentiation. Next, we cover the basics of building a neural network in PyTorch.
The essential paradigm of building neural networks should be familiar: They consist of multiple layers that are stacked together (as in Figure 4.2). In the examples throughout this book, we used the Keras library as a high-level abstraction over the raw TensorFlow functions. Similarly, the PyTorch nn module contains layerlike modules that receive tensors as inputs and return tensors as outputs. In the following example, we build a two-layer network akin to the dense nets we used to classify handwritten digits in Part II:
    import torch

    # Define random tensors for the inputs and outputs
    x = torch.randn(32, 784, requires_grad=True)
    y = torch.randint(low=0, high=10, size=(32,))

    # Define the model, using the Sequential class
    model = torch.nn.Sequential(
        torch.nn.Linear(784, 100),
        torch.nn.Sigmoid(),
        torch.nn.Linear(100, 10),
        torch.nn.LogSoftmax(dim=1)
    )

    # Define the optimizer and loss function
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = torch.nn.NLLLoss()

    for step in range(1000):
        # Make predictions by forward propagation
        y_hat = model(x)

        # Calculate the loss
        loss = loss_fn(y_hat, y)

        # Zero-out the gradient before performing a backward pass
        optimizer.zero_grad()

        # Compute the gradients w.r.t. the loss
        loss.backward()

        # Print the results
        print('Step: {:4d} - loss: {:0.4f}'.format(step+1, loss.item()))

        # Update the model parameters
        optimizer.step()
Let’s break this down step by step:
The x and y tensors are placeholders for the input and output values of the model.
We use the Sequential class to begin building our model as a series of layers (Linear() through to LogSoftmax()), in much the same way as we did in Keras.
We initialize an optimizer; in this case we use Adam with its default values. We also pass into the optimizer all of the tensors we’d like optimized, which in this case is model.parameters().
We also initialize the loss function, although it doesn’t require any parameters. We opted for the built-in negative log-likelihood loss function, torch.nn.NLLLoss().10
We manually iterate over the number of rounds of training (Figure 8.5) that we’d like to take (in this case, 1000), and during each round we:
Calculate the model outputs using y_hat = model(x).
Calculate the loss using the function we defined earlier, passing in the predicted ŷ values and the true y values.
Zero the gradients. This is necessary because the gradients are accumulated in buffers, and not overwritten.
Perform backpropagation to recalculate the gradients, given the loss.
Finally, take a step using the optimizer. This updates the model weights using the gradients.
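The need to zero the gradients in each round can be seen in a minimal sketch that uses a single scalar weight. The variable names here are hypothetical and chosen only for illustration.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(3w)/dw = 3
(3 * w).backward()
first = w.grad.item()

# Second backward pass without zeroing: the new gradient is added
# to the value already sitting in the buffer
(3 * w).backward()
accumulated = w.grad.item()

# Zero the buffer and run a third pass: the gradient is back to 3
w.grad.zero_()
(3 * w).backward()
after_zeroing = w.grad.item()

print(first, accumulated, after_zeroing)
```

In a training loop, optimizer.zero_grad() performs this reset for every parameter that was passed to the optimizer.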
10. Pairing a LogSoftmax() output layer with the torch.nn.NLLLoss() cost function in PyTorch is equivalent to using a softmax output layer with cross-entropy cost in Keras. PyTorch does have a cross_entropy() cost function, but it incorporates the softmax calculation, so if you were to use it, you wouldn’t need to apply the softmax activation function to your model output.
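The equivalence described in this footnote can be checked numerically. The sketch below uses the torch.nn.CrossEntropyLoss class (the module counterpart of the cross_entropy() function) on randomly generated values; the fixed seed and the tensor shapes are arbitrary choices.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(32, 10)  # raw model outputs, before any softmax
targets = torch.randint(low=0, high=10, size=(32,))

# Option 1: an explicit log-softmax followed by negative log-likelihood
log_probs = torch.nn.LogSoftmax(dim=1)(logits)
nll = torch.nn.NLLLoss()(log_probs, targets)

# Option 2: cross-entropy applied directly to the raw outputs
ce = torch.nn.CrossEntropyLoss()(logits, targets)

print(torch.allclose(nll, ce))
```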
This procedure diverges from the model.fit() method we employed in Keras. However, with all of the theory covered in this book and the hands-on examples we’ve worked through together, hopefully it’s not a stretch to appreciate what’s taking place in this PyTorch code. Without too much effort, you should be able to adapt the deep learning models in this book from Keras into PyTorch.11
11. Note that our example PyTorch neural network in this appendix isn’t learning anything meaningful. The loss decreases, but the model is simply memorizing (overfitting to) the training data we randomly generated. We’re feeding in random numbers as inputs and mapping them to other random numbers. If we randomly generated validation data, too, the validation loss wouldn’t decrease. If you’re feeling adventurous, you could initialize x and y with actual data from, say, the MNIST dataset (you can import these data with Keras, as in Example 5.2) and train a PyTorch model to map a meaningful relationship!