Chapter 1. Introducing deep learning: why you should learn it
Welcome to Grokking Deep Learning
You’re about to learn some of the most valuable skills of the century!
Why you should learn deep learning
It’s a powerful tool for the incremental automation of intelligence
Deep learning has the potential for significant automation of skilled labor
Will this be difficult to learn?
How hard will you have to work before there’s a “fun” payoff?
It has a uniquely low barrier to entry
It will help you understand what’s inside a framework (Torch, TensorFlow, and so on)
All math-related material will be backed by intuitive analogies
Everything after the introduction chapters is “project” based
You’ll probably need some Python knowledge
Python is my teaching language of choice, but I’ll provide a few others online
Chapter 2. Fundamental concepts: how do machines learn?
Parametric vs. nonparametric learning
Oversimplified: Trial-and-error learning vs. counting and probability
Supervised parametric learning
Oversimplified: Trial-and-error learning using knobs
Unsupervised parametric learning
Chapter 3. Introduction to neural prediction: forward propagation
A simple neural network making a prediction
What does this neural network do?
It multiplies the input by a weight. It “scales” the input by a certain amount
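For concreteness, here is a minimal sketch of that single-weight network; the weight value 0.1 and the toe counts are illustrative, not a required listing:

    # A single-weight network: prediction = input * weight
    weight = 0.1                          # the one "knob"
    number_of_toes = [8.5, 9.5, 10, 9]

    def neural_network(input, weight):
        return input * weight             # scale the input by the weight

    pred = neural_network(number_of_toes[0], weight)
    print(pred)                           # 0.85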
Making a prediction with multiple inputs
Neural networks can combine intelligence from multiple datapoints
Multiple inputs: What does this neural network do?
It multiplies three inputs by three knob weights and sums them. This is a weighted sum
Multiple inputs: Complete runnable code
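In spirit, that runnable code reduces to a weighted sum; a minimal sketch with illustrative inputs and weights:

    # Three inputs, three weights: prediction = weighted sum
    def w_sum(a, b):
        assert len(a) == len(b)
        output = 0
        for i in range(len(a)):
            output += a[i] * b[i]         # multiply each input by its weight and accumulate
        return output

    weights = [0.1, 0.2, 0.0]
    toes, wlrec, nfans = 8.5, 0.65, 1.2   # three datapoints about one game
    print(w_sum([toes, wlrec, nfans], weights))   # 0.98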
Making a prediction with multiple outputs
Neural networks can also make multiple predictions using only a single input
Predicting with multiple inputs and outputs
Neural networks can predict multiple outputs given multiple inputs
Multiple inputs and outputs: How does it work?
It performs three independent weighted sums of the input to make three predictions
To predict, neural networks perform repeated weighted sums of the input
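A sketch of those repeated weighted sums: with multiple inputs and outputs, the weights form a matrix, and each row produces one prediction (all values are illustrative):

    import numpy as np

    weights = np.array([[0.1, 0.1, -0.3],   # row 0 -> first prediction
                        [0.1, 0.2,  0.0],   # row 1 -> second prediction
                        [0.0, 1.3,  0.1]])  # row 2 -> third prediction
    input = np.array([8.5, 0.65, 1.2])

    preds = weights.dot(input)               # three independent weighted sums
    print(preds)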
Chapter 4. Introduction to neural learning: gradient descent
Comparing gives a measurement of how much a prediction “missed” by
Learning tells each weight how it can change to reduce the error
Compare: Does your network make good predictions?
Measuring error simplifies the problem
Different ways of measuring error prioritize error differently
What’s the simplest form of neural learning?
Characteristics of hot and cold learning
Problem 2: Sometimes it’s impossible to predict the exact goal prediction
Calculating both direction and amount from error
One iteration of gradient descent
This performs a weight update on a single training example (input->true) pair
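A minimal sketch of one such update on a single (input, goal) pair; the starting values are illustrative:

    # One iteration of gradient descent on one training example
    weight, input, goal_pred = 0.5, 0.5, 0.8

    pred = input * weight
    error = (pred - goal_pred) ** 2       # squared error: always positive
    delta = pred - goal_pred              # how much the prediction missed by
    weight_delta = delta * input          # direction and amount to change the weight
    weight = weight - weight_delta        # the actual update
    print(error, weight)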
Learning is just reducing error
Let’s watch several steps of learning
Why does this work? What is weight_delta, really?
Let’s back up and talk about functions. What is a function? How do you understand one?
Concept: Learning is adjusting the weight to reduce the error to 0
A box with rods poking out of it
Still a little unsure about them? Let’s take another perspective
With derivatives, you can pick any two variables in any formula, and know how they interact
What you don’t really need to know
How to use a derivative to learn
Visualizing the overcorrections
It’s the simplest way to prevent overcorrecting weight updates
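A sketch of the idea: scale each update by a small alpha so that a large input can’t make the weight overshoot (alpha = 0.01 is an illustrative choice):

    # Alpha dampens the weight update to avoid overcorrecting
    weight, goal_pred, input, alpha = 0.5, 0.8, 2.0, 0.01   # a large input makes raw updates overshoot

    for iteration in range(20):
        pred = input * weight
        error = (pred - goal_pred) ** 2
        derivative = input * (pred - goal_pred)
        weight = weight - alpha * derivative                 # alpha shrinks the step
        print("Error:", error, "Prediction:", pred)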
Chapter 5. Learning multiple weights at a time: generalizing gradient descent
Gradient descent learning with multiple inputs
Gradient descent with multiple inputs explained
Simple to execute, and fascinating to understand
How do you turn a single delta (on the node) into three weight_delta values?
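A sketch of the answer: multiply the single output delta by each input, giving one weight_delta per weight (values are illustrative):

    # One delta on the output node becomes one weight_delta per incoming weight
    weights = [0.1, 0.2, -0.1]
    inputs  = [8.5, 0.65, 1.2]
    goal_pred, alpha = 1.0, 0.01

    pred = sum(i * w for i, w in zip(inputs, weights))
    delta = pred - goal_pred
    weight_deltas = [delta * i for i in inputs]              # delta * each input
    weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
    print(weight_deltas, weights)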
Let’s watch several steps of learning
Freezing one weight: What does it do?
Gradient descent learning with multiple outputs
Neural networks can also make multiple predictions using only a single input
Gradient descent with multiple inputs and outputs
Each weight tries to reduce the error, but what do they learn in aggregate?
Visualizing dot products (weighted sums)
Chapter 6. Building your first deep neural network: introduction to backpropagation
This toy problem considers how a network learns entire datasets
Matrices and the matrix relationship
Translate the streetlight into math
Creating a matrix or two in Python
The neural network has been learning only one streetlight. Don’t we want it to learn them all?
Full, batch, and stochastic gradient descent
Stochastic gradient descent updates weights one example at a time
(Full) gradient descent updates weights one dataset at a time
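A sketch contrasting the two schedules on a toy dataset (the data and alpha are illustrative):

    import numpy as np

    X = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]], dtype=float)
    y = np.array([1.0, 1.0, 0.0, 0.0])
    alpha = 0.1

    # Stochastic gradient descent: update the weights after every example
    w_sgd = np.zeros(3)
    for i in range(len(X)):
        delta = X[i].dot(w_sgd) - y[i]
        w_sgd -= alpha * delta * X[i]

    # Full (batch) gradient descent: sum the gradient over the whole dataset, then update once
    w_full = np.zeros(3)
    grad = np.zeros(3)
    for i in range(len(X)):
        grad += (X[i].dot(w_full) - y[i]) * X[i]
    w_full -= alpha * grad / len(X)
    print(w_sgd, w_full)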
Neural networks learn correlation
Edge case: Conflicting pressure
If your data doesn’t have correlation, create intermediate data that does!
Stacking neural networks: A review
Chapter 3 briefly mentioned stacked neural networks. Let’s review
Backpropagation: Long-distance error attribution
Backpropagation: Why does this work?
This is probably the hardest concept in the book. Let’s take it slowly
Why the neural network still doesn’t work
If you trained the three-layer network as it is now, it wouldn’t converge
The secret to sometimes correlation
That last part probably felt a little abstract, and that’s totally OK
Your first deep neural network
You can learn the amount that each weight contributes to the final error
One iteration of backpropagation
Here’s the self-sufficient program you should be able to run (runtime output follows)
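For orientation, here is a compact sketch of the kind of program that section builds: a network with one hidden relu layer trained on the streetlight dataset. The seed, alpha, and hidden size are illustrative, so the printed errors won’t match the book’s output exactly:

    import numpy as np
    np.random.seed(1)

    def relu(x):         return (x > 0) * x    # zero out negative values
    def relu2deriv(out): return out > 0        # slope is 1 where the output is positive, else 0

    streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
    walk_vs_stop = np.array([[1], [1], [0], [0]])

    alpha, hidden_size = 0.2, 4
    weights_0_1 = 2 * np.random.random((3, hidden_size)) - 1
    weights_1_2 = 2 * np.random.random((hidden_size, 1)) - 1

    for iteration in range(60):
        total_error = 0
        for i in range(len(streetlights)):
            layer_0 = streetlights[i:i + 1]
            layer_1 = relu(layer_0.dot(weights_0_1))
            layer_2 = layer_1.dot(weights_1_2)

            total_error += np.sum((layer_2 - walk_vs_stop[i:i + 1]) ** 2)
            layer_2_delta = layer_2 - walk_vs_stop[i:i + 1]
            layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

            weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
            weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)
        if iteration % 10 == 9:
            print("Error:", total_error)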
What’s the point of creating “intermediate datasets” that have correlation?
Chapter 7. How to picture neural networks: in your head and on paper
It’s impractical to think about everything all the time. Mental tools can help
This is the key to sanely moving forward to more advanced neural networks
The previously overcomplicated visualization
While simplifying the mental picture, let’s simplify the visualization as well
Neural networks are like LEGO bricks, and each brick is a vector or matrix
The dimensionality of the matrices is determined by the layers
Let’s see this network predict
Let’s picture data from the streetlight example flowing through the system
Visualizing using letters instead of pictures
All these pictures and detailed explanations are actually a simple piece of algebra
The letters can be combined to indicate functions and operations
Let’s see the visualization, algebra formula, and Python code in one place
The importance of visualization tools
Chapter 8. Learning signal and ignoring noise: introduction to regularization and batching
Let’s return to the MNIST dataset and attempt to classify it with the new network
The neural network perfectly learned to predict all 1,000 images
Memorization vs. generalization
Memorizing 1,000 images is easier than generalizing to all images
Overfitting in neural networks
The simplest regularization: Early stopping
Industry standard regularization: Dropout
The method: Randomly turn off neurons (set them to 0) during training
Why dropout works: Ensembling works
Dropout is a form of training a bunch of networks and averaging them
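A sketch of the mechanism: a random binary mask turns off roughly half of a hidden layer during training, and the surviving activations are doubled so the layer’s expected magnitude is unchanged (the layer size is illustrative):

    import numpy as np

    layer_1 = np.random.random((1, 100))                      # some hidden-layer activations
    dropout_mask = np.random.randint(2, size=layer_1.shape)   # 0 or 1, roughly 50/50
    layer_1 = layer_1 * dropout_mask * 2                      # turn off ~half the neurons, rescale the rest
    # During backprop, multiply layer_1_delta by the same mask so the "off" neurons don't learn:
    # layer_1_delta *= dropout_mask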
Here’s a method for increasing the speed of training and the rate of convergence
Chapter 9. Modeling probabilities and nonlinearities: activation functions
What is an activation function?
It’s a function applied to the neurons in a layer during prediction
Constraint 1: The function must be continuous and infinite in domain
Constraint 2: Good activation functions are monotonic, never changing direction
Constraint 3: Good activation functions are nonlinear (they squiggle or turn)
Constraint 4: Good activation functions (and their derivatives) should be efficiently computable
Standard hidden-layer activation functions
Of the infinite possible functions, which ones are most commonly used?
Standard output layer activation functions
Choosing the best one depends on what you’re trying to predict
Configuration 1: Predicting raw data values (no activation function)
Configuration 2: Predicting unrelated yes/no probabilities (sigmoid)
Configuration 3: Predicting which-one probabilities (softmax)
The core issue: Inputs have similarity
Different numbers share characteristics. It’s good to let the network believe that
Softmax raises e to the power of each input value, then divides by the layer’s sum of those exponentials
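A sketch of that computation (the raw output values are illustrative):

    import numpy as np

    def softmax(x):
        temp = np.exp(x)               # e raised to each input value
        return temp / np.sum(temp)     # divide by the layer's sum of exponentials

    raw_output = np.array([1.0, 2.0, 0.1])
    probs = softmax(raw_output)
    print(probs, probs.sum())          # probabilities that sum to 1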
Activation installation instructions
How do you add your favorite activation function to any layer?
Multiplying delta by the slope
To compute layer_delta, multiply the backpropagated delta by the layer’s slope
Converting output to slope (derivative)
Most great activations can convert their output to their slope. (Efficiency win!)
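A sketch of that output-to-slope trick for two common activations, and of the layer_delta computation it feeds (the delta values are illustrative):

    import numpy as np

    def sigmoid(x):         return 1 / (1 + np.exp(-x))
    def sigmoid2deriv(out): return out * (1 - out)    # slope computed from sigmoid's own output
    def relu(x):            return (x > 0) * x
    def relu2deriv(out):    return out > 0            # slope computed from relu's own output

    layer_output = sigmoid(np.array([0.5, -1.0]))
    backprop_delta = np.array([0.2, -0.3])
    layer_delta = backprop_delta * sigmoid2deriv(layer_output)   # delta * slope
    print(layer_delta)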
Let’s upgrade the MNIST network to reflect what you’ve learned
Chapter 10. Neural learning about edges and corners: intro to convolutional neural networks
Reusing weights in multiple places
If you need to detect the same feature in multiple places, use the same weights!
Lots of very small linear layers are reused in every position, instead of a single big one
A simple implementation in NumPy
Just think mini-linear layers, and you already know what you need to know
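A minimal NumPy sketch of that idea: one small kernel (a tiny linear layer) is reused at every position of the input; the sizes and random values are illustrative:

    import numpy as np

    image = np.random.random((6, 6))        # a toy single-channel "image"
    kernel = np.random.random((3, 3))       # one small linear layer, reused everywhere

    out = np.zeros((4, 4))                  # 6 - 3 + 1 = 4 valid positions per axis
    for row in range(4):
        for col in range(4):
            patch = image[row:row + 3, col:col + 3]
            out[row, col] = np.sum(patch * kernel)   # same weights, different position
    print(out.shape)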
Reusing weights is one of the most important innovations in deep learning
Chapter 11. Neural networks that understand language: king – man + woman == ?
What does it mean to understand language?
Natural language processing (NLP)
You can predict whether people post positive or negative reviews
Capturing word correlation in input data
Bag of words: Given a review’s vocabulary, predict the sentiment
With the encoding strategy and the previous network, you can predict sentiment
How did the choice of architecture affect what the network learned?
What should you see in the weights connecting words and hidden neurons?
What is the meaning of a neuron?
Meaning is entirely based on the target labels being predicted
Learn richer meanings for words by having a richer signal to learn
Neural networks don’t really learn data; they minimize the loss function
The choice of loss function determines the neural network’s knowledge
Word analogies are an interesting consequence of the previously built network
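A sketch of how such an analogy query is typically computed from the learned embedding matrix; here weights_0_1, word2index, and vocab stand in for the trained network’s values and are assumed to exist:

    import numpy as np

    def analogy(positive, negative, weights_0_1, word2index, vocab, topn=5):
        # king - man + woman: add the "positive" embeddings, subtract the "negative" ones
        query = np.zeros(weights_0_1.shape[1])
        for word in positive:
            query += weights_0_1[word2index[word]]
        for word in negative:
            query -= weights_0_1[word2index[word]]
        # rank every word by squared distance to the query vector
        scores = ((weights_0_1 - query) ** 2).sum(axis=1)
        return [vocab[i] for i in scores.argsort()[:topn]]

    # e.g. analogy(['king', 'woman'], ['man'], weights_0_1, word2index, vocab)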
You’ve learned a lot about neural word embeddings and the impact of loss on learning
Chapter 12. Neural networks that write like Shakespeare: recurrent layers for variable-length data
The challenge of arbitrary length
Let’s model arbitrarily long sequences of data with neural networks!
Why should you care about whether you can compare two sentence vectors?
The surprising power of averaged word vectors
It’s the amazingly powerful go-to tool for neural prediction
How is information stored in these embeddings?
How does a neural network use embeddings?
Neural networks detect the curves that have correlation with a target label
The limitations of bag-of-words vectors
Using identity vectors to sum word embeddings
Matrices that change absolutely nothing
Let’s create sentence embeddings using identity matrices in Python
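A sketch of the construction: multiplying the running sentence vector by an identity matrix between words changes nothing, which is exactly the point before those matrices are allowed to learn. The embedding size and random vectors are illustrative:

    import numpy as np

    word_vects = {w: np.random.random(3) for w in ["red", "sox", "defeat", "yankees"]}
    identity = np.eye(3)                    # a transition matrix that (so far) changes nothing

    # "red sox defeat yankees" -> a sentence vector built word by word
    sent = word_vects["red"]
    for word in ["sox", "defeat", "yankees"]:
        sent = sent.dot(identity) + word_vects[word]   # transform the running summary, then add the next word
    print(sent)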
Learning the transition matrices
What if you allowed the identity matrices to change to minimize the loss?
Learning to create useful sentence vectors
Create the sentence vector, make a prediction, and modify the sentence vector via its parts
Let’s take this idea and see how to perform a simple forward propagation
How do you backpropagate into this?
It might seem trickier, but they’re the same steps you already learned
You have all the tools; let’s train the network on a toy corpus
Before you can create matrices, you need to learn how many parameters you have
Forward propagation with arbitrary length
You’ll forward propagate using the same logic described earlier
Backpropagation with arbitrary length
You’ll backpropagate using the same logic described earlier
Weight update with arbitrary length
You’ll update weights using the same logic described earlier
Looking at predictions can help you understand what’s going on
Recurrent neural networks predict over arbitrary-length sequences
Chapter 13. Introducing automatic optimization: let’s build a deep learning framework
What is a deep learning framework?
Good tools reduce errors, speed development, and increase runtime performance
Introduction to automatic gradient computation (autograd)
Previously, you performed backpropagation by hand. Let’s make it automatic!
Everything in Tensor is another form of lessons already learned
Tensors that are used multiple times
Upgrading autograd to support multiuse tensors
How does addition backpropagation work?
Let’s study the abstraction to learn how to add support for more functions
Adding support for additional functions
Subtraction, multiplication, sum, expand, transpose, and matrix multiplication
Using autograd to train a neural network
Adding support for layer types
Oversimplified, frameworks are autograd + a list of prebuilt layers and optimizers
Let’s add nonlinear functions to Tensor and then create some layer types
Before you can build the embedding layer, autograd needs to support indexing
The embedding layer (revisited)
Now you can finish forward propagation using the new .index_select() method
The recurrent neural network layer
By combining several layers, you can learn over time series
You can learn to fit the task you previously accomplished in the preceding chapter
Frameworks are efficient, convenient abstractions of forward and backward logic
Chapter 14. Learning to write like Shakespeare: long short-term memory
The need for truncated backpropagation
Technically, it weakens the theoretical maximum of the neural network
By sampling from the predictions of the model, you can write Shakespeare!
Vanishing and exploding gradients
A toy example of RNN backpropagation
To see vanishing/exploding gradients firsthand, let’s synthesize an example
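A sketch of such a synthetic example: push an activation through many timesteps, then backpropagate a gradient through the same steps and watch the sigmoid slopes shrink it toward zero (swapping in relu and large weights makes it explode instead). The weight matrix is an illustrative choice:

    import numpy as np

    sigmoid = lambda x: 1 / (1 + np.exp(-x))
    weights = np.array([[1.0, 4.0], [4.0, 1.0]])
    activation = sigmoid(np.array([1.0, 0.01]))

    activations = []
    for step in range(10):                       # forward through 10 "timesteps"
        activation = sigmoid(activation.dot(weights))
        activations.append(activation)

    gradient = np.ones_like(activation)
    for activation in reversed(activations):     # backward through the same 10 steps
        gradient = activation * (1 - activation) * gradient   # sigmoid slope shrinks the signal
        gradient = gradient.dot(weights.T)
        print(gradient)                          # the values vanish toward zero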
Long short-term memory (LSTM) cells
LSTMs are the industry standard model to counter vanishing/exploding gradients
Some intuition about LSTM gates
LSTM gates are semantically similar to reading/writing from memory
The long short-term memory layer
Upgrading the character language model
Training the LSTM character language model
Tuning the LSTM character language model
I spent about two days tuning this model, and it trained overnight
Chapter 15. Deep learning on unseen data: introducing federated learning
The problem of privacy in deep learning
Deep learning (and tools for it) often means you have access to your training data
You don’t have to have access to a dataset in order to learn from it
Let’s say you want to train a model across people’s emails to detect spam
The previous example was plain vanilla deep learning. Let’s protect privacy
Hacking into federated learning
Let’s use a toy example to see how to still learn the training dataset
Let’s average weight updates from zillions of people before anyone can see them
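A sketch of that aggregation idea with a linear model: each person computes a weight update locally, and only the average of all the updates ever reaches the central model. The local_update helper and the random per-person data are stand-ins, not the chapter’s listing:

    import numpy as np

    def local_update(global_weights, local_X, local_y, alpha=0.1):
        # one local gradient-descent step; returns only the weight *change*, never the data
        w = global_weights.copy()
        delta = local_X.dot(w) - local_y
        w -= alpha * local_X.T.dot(delta) / len(local_X)
        return w - global_weights

    global_weights = np.zeros(3)
    people = [(np.random.random((5, 3)), np.random.random(5)) for _ in range(3)]

    updates = [local_update(global_weights, X, y) for X, y in people]
    global_weights += np.mean(updates, axis=0)   # only the averaged update is seen centrally
    print(global_weights)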
Homomorphically encrypted federated learning
Let’s use homomorphic encryption to protect the gradients being aggregated
Federated learning is one of the most exciting breakthroughs in deep learning
Chapter 16. Where to go from here: a brief guide
If you’re reading this, you’ve made it through nearly 300 pages of deep learning
Step 1: Start learning PyTorch
The deep learning framework you made most closely resembles PyTorch
Step 2: Start another deep learning course
I learned deep learning by relearning the same concepts over and over
Step 3: Grab a mathy deep learning textbook
You can reverse engineer the math from your deep learning knowledge
Step 4: Start a blog, and teach deep learning
Nothing I’ve ever done has helped my knowledge or career more
Step 6: Implement academic papers
Step 7: Acquire access to a GPU (or many)
The more time you have to do deep learning, the faster you’ll learn
Step 9: Join an open source project
Step 10: Develop your local community
I really learned deep learning because I enjoyed hanging out with friends who were doing it too