Table of Contents

Copyright

Brief Table of Contents

Table of Contents

Preface

Acknowledgments

About this book

About the author

Chapter 1. Introducing deep learning: why you should learn it

Welcome to Grokking Deep Learning

You’re about to learn some of the most valuable skills of the century!

Why you should learn deep learning

It’s a powerful tool for the incremental automation of intelligence

Deep learning has the potential for significant automation of skilled labor

It’s fun and creative. You’ll discover much about what it is to be human by trying to simulate intelligence and creativity

Will this be difficult to learn?

How hard will you have to work before there’s a “fun” payoff?

Why you should read this book

It has a uniquely low barrier to entry

It will help you understand what’s inside a framework (Torch, TensorFlow, and so on)

All math-related material will be backed by intuitive analogies

Everything after the introduction chapters is “project” based

What you need to get started

Install Jupyter Notebook and the NumPy Python library

Pass high school mathematics

Find a personal problem you’re interested in

You’ll probably need some Python knowledge

Python is my teaching language of choice, but I’ll provide a few others online

How much coding experience should you have?

Summary

Chapter 2. Fundamental concepts: how do machines learn?

What is deep learning?

Deep learning is a subset of methods for machine learning

What is machine learning?

Supervised machine learning

Supervised learning transforms datasets

Unsupervised machine learning

Unsupervised learning groups your data

Parametric vs. nonparametric learning

Oversimplified: Trial-and-error learning vs. counting and probability

Supervised parametric learning

Oversimplified: Trial-and-error learning using knobs

Step 1: Predict

Step 2: Compare to the truth pattern

Step 3: Learn the pattern

Unsupervised parametric learning

Nonparametric learning

Oversimplified: Counting-based methods

Summary

Chapter 3. Introduction to neural prediction: forward propagation

Step 1: Predict

This chapter is about prediction

A simple neural network making a prediction

Let’s start with the simplest neural network possible

What is a neural network?

Here is your first neural network

What does this neural network do?

It multiplies the input by a weight. It “scales” the input by a certain amount

Making a prediction with multiple inputs

Neural networks can combine intelligence from multiple datapoints

Multiple inputs: What does this neural network do?

It multiplies three inputs by three knob weights and sums them. This is a weighted sum

Multiple inputs: Complete runnable code

Making a prediction with multiple outputs

Neural networks can also make multiple predictions using only a single input

Predicting with multiple inputs and outputs

Neural networks can predict multiple outputs given multiple inputs

Multiple inputs and outputs: How does it work?

It performs three independent weighted sums of the input to make three predictions

Predicting on predictions

Neural networks can be stacked!

A quick primer on NumPy

NumPy does a few things for you. Let’s reveal the magic

Summary

To predict, neural networks perform repeated weighted sums of the input

Chapter 4. Introduction to neural learning: gradient descent

Predict, compare, and learn

Compare

Comparing gives a measurement of how much a prediction “missed” by

Learn

Learning tells each weight how it can change to reduce the error

Compare: Does your network make good predictions?

Let’s measure the error and find out!

Why measure error?

Measuring error simplifies the problem

Different ways of measuring error prioritize error differently

Why do you want only positive error?

What’s the simplest form of neural learning?

Learning using the hot and cold method

Hot and cold learning

This is perhaps the simplest form of learning

Characteristics of hot and cold learning

It’s simple

Problem 1: It’s inefficient

Problem 2: Sometimes it’s impossible to predict the exact goal prediction

Calculating both direction and amount from error

Let’s measure the error and find the direction and amount!

One iteration of gradient descent

This performs a weight update on a single training example (an input -> true pair)

Learning is just reducing error

You can modify weight to reduce error

Let’s watch several steps of learning

Will we eventually find the bottom of the bowl?

Why does this work? What is weight_delta, really?

Let’s back up and talk about functions. What is a function? How do you understand one?

Tunnel vision on one concept

Concept: Learning is adjusting the weight to reduce the error to 0

A box with rods poking out of it

Derivatives: Take two

Still a little unsure about them? Let’s take another perspective

What you really need to know

With derivatives, you can pick any two variables in any formula, and know how they interact

What you don’t really need to know

Calculus

How to use a derivative to learn

weight_delta is your derivative

Look familiar?

Breaking gradient descent

Just give me the code!

Visualizing the overcorrections

Divergence

Sometimes neural networks explode in value. Oops?

Introducing alpha

It’s the simplest way to prevent overcorrecting weight updates

Alpha in code

Where does our “alpha” parameter come into play?

Memorizing

It’s time to really learn this stuff

Chapter 5. Learning multiple weights at a time: generalizing gradient descent

Gradient descent learning with multiple inputs

Gradient descent also works with multiple inputs

Gradient descent with multiple inputs explained

Simple to execute, and fascinating to understand

How do you turn a single delta (on the node) into three weight_delta values?

Let’s watch several steps of learning

Freezing one weight: What does it do?

Gradient descent learning with multiple outputs

Neural networks can also make multiple predictions using only a single input

Gradient descent with multiple inputs and outputs

Gradient descent generalizes to arbitrarily large networks

What do these weights learn?

Each weight tries to reduce the error, but what do they learn in aggregate?

Visualizing weight values

Visualizing dot products (weighted sums)

Summary

Gradient descent is a general learning algorithm

Chapter 6. Building your first deep neural network: introduction to backpropagation

The streetlight problem

This toy problem considers how a network learns entire datasets

Preparing the data

Neural networks don’t read streetlights

Matrices and the matrix relationship

Translate the streetlight into math

Good data matrices perfectly mimic the outside world

Matrices A and B both contain the same underlying pattern

Creating a matrix or two in Python

Import the matrices into Python

Building a neural network

Learning the whole dataset

The neural network has been learning only one streetlight. Don’t we want it to learn them all?

Full, batch, and stochastic gradient descent

Stochastic gradient descent updates weights one example at a time

(Full) gradient descent updates weights one dataset at a time

Batch gradient descent updates weights after n examples

Neural networks learn correlation

What did the last neural network learn?

Up and down pressure

It comes from the data

Edge case: Overfitting

Sometimes correlation happens accidentally

Edge case: Conflicting pressure

Sometimes correlation fights itself

It doesn’t always work out like this

Learning indirect correlation

If your data doesn’t have correlation, create intermediate data that does!

Creating correlation

Stacking neural networks: A review

Chapter 3 briefly mentioned stacked neural networks. Let’s review

Backpropagation: Long-distance error attribution

The weighted average error

Backpropagation: Why does this work?

The weighted average delta

Linear vs. nonlinear

This is probably the hardest concept in the book. Let’s take it slowly

Why the neural network still doesn’t work

If you trained the three-layer network as it is now, it wouldn’t converge

The secret to sometimes correlation

Turn off the node when the value would be below 0

A quick break

That last part probably felt a little abstract, and that’s totally OK

Your first deep neural network

Here’s how to make the prediction

Backpropagation in code

You can learn the amount that each weight contributes to the final error

One iteration of backpropagation

Putting it all together

Here’s the self-sufficient program you should be able to run (runtime output follows)

Why do deep networks matter?

What’s the point of creating “intermediate datasets” that have correlation?

Chapter 7. How to picture neural networks: in your head and on paper

It’s time to simplify

It’s impractical to think about everything all the time. Mental tools can help

Let’s start by reviewing the concepts you’ve learned so far

Correlation summarization

This is the key to sanely moving forward to more advanced neural networks

The previously overcomplicated visualization

While simplifying the mental picture, let’s simplify the visualization as well

The simplified visualization

Neural networks are like LEGO bricks, and each brick is a vector or matrix

Simplifying even further

The dimensionality of the matrices is determined by the layers

Let’s see this network predict

Let’s picture data from the streetlight example flowing through the system

Visualizing using letters instead of pictures

All these pictures and detailed explanations are actually a simple piece of algebra

Linking the variables

The letters can be combined to indicate functions and operations

Everything side by side

Let’s see the visualization, algebra formula, and Python code in one place

The importance of visualization tools

We’re going to be studying new architectures

Chapter 8. Learning signal and ignoring noise: introduction to regularization and batching

Three-layer network on MNIST

Let’s return to the MNIST dataset and attempt to classify it with the new network

Well, that was easy

The neural network perfectly learned to predict all 1,000 images

Memorization vs. generalization

Memorizing 1,000 images is easier than generalizing to all images

Overfitting in neural networks

Neural networks can get worse if you train them too much!

Where overfitting comes from

What causes neural networks to overfit?

The simplest regularization: Early stopping

Stop training the network when it starts getting worse

Industry standard regularization: Dropout

The method: Randomly turn off neurons (set them to 0) during training

Why dropout works: Ensembling works

Dropout is a form of training a bunch of networks and averaging them

Dropout in code

Here’s how to use dropout in the real world

Dropout evaluated on MNIST

Batch gradient descent

Here’s a method for increasing the speed of training and the rate of convergence

Summary

Chapter 9. Modeling probabilities and nonlinearities: activation functions

What is an activation function?

It’s a function applied to the neurons in a layer during prediction

Constraint 1: The function must be continuous and infinite in domain

Constraint 2: Good activation functions are monotonic, never changing direction

Constraint 3: Good activation functions are nonlinear (they squiggle or turn)

Constraint 4: Good activation functions (and their derivatives) should be efficiently computable

Standard hidden-layer activation functions

Of the infinite possible functions, which ones are most commonly used?

sigmoid is the bread-and-butter activation

tanh is better than sigmoid for hidden layers

Standard output layer activation functions

Choosing the best one depends on what you’re trying to predict

Configuration 1: Predicting raw data values (no activation function)

Configuration 2: Predicting unrelated yes/no probabilities (sigmoid)

Configuration 3: Predicting which-one probabilities (softmax)

The core issue: Inputs have similarity

Different numbers share characteristics. It’s good to let the network believe that

softmax computation

softmax raises each input value exponentially and then divides by the layer’s sum

Activation installation instructions

How do you add your favorite activation function to any layer?

Multiplying delta by the slope

To compute layer_delta, multiply the backpropagated delta by the layer’s slope

Converting output to slope (derivative)

Most great activations can convert their output to their slope. (Efficiency win!)

Upgrading the MNIST network

Let’s upgrade the MNIST network to reflect what you’ve learned

Chapter 10. Neural learning about edges and corners: intro to convolutional neural networks

Reusing weights in multiple places

If you need to detect the same feature in multiple places, use the same weights!

The convolutional layer

Lots of very small linear layers are reused in every position, instead of a single big one

A simple implementation in NumPy

Just think mini-linear layers, and you already know what you need to know

Summary

Reusing weights is one of the most important innovations in deep learning

Chapter 11. Neural networks that understand language: king – man + woman == ?

What does it mean to understand language?

What kinds of predictions do people make about language?

Natural language processing (NLP)

NLP is divided into a collection of tasks or challenges

Supervised NLP

Words go in, and predictions come out

IMDB movie reviews dataset

You can predict whether people post positive or negative reviews

Capturing word correlation in input data

Bag of words: Given a review’s vocabulary, predict the sentiment

Predicting movie reviews

With the encoding strategy and the previous network, you can predict sentiment

Intro to an embedding layer

Here’s one more trick to make the network faster

After running the previous code, run this code

Interpreting the output

What did the neural network learn along the way?

Neural architecture

How did the choice of architecture affect what the network learned?

What should you see in the weights connecting words and hidden neurons?

Comparing word embeddings

How can you visualize weight similarity?

What is the meaning of a neuron?

Meaning is entirely based on the target labels being predicted

Filling in the blank

Learn richer meanings for words by having a richer signal to learn

Meaning is derived from loss

Neural networks don’t really learn data; they minimize the loss function

The choice of loss function determines the neural network’s knowledge

King – Man + Woman ~= Queen

Word analogies are an interesting consequence of the previously built network

Word analogies

Linear compression of an existing property in the data

Summary

You’ve learned a lot about neural word embeddings and the impact of loss on learning

Chapter 12. Neural networks that write like Shakespeare: recurrent layers for variable-length data

The challenge of arbitrary length

Let’s model arbitrarily long sequences of data with neural networks!

Do comparisons really matter?

Why should you care about whether you can compare two sentence vectors?

The surprising power of averaged word vectors

It’s the amazingly powerful go-to tool for neural prediction

How is information stored in these embeddings?

When you average word embeddings, average shapes remain

How does a neural network use embeddings?

Neural networks detect the curves that have correlation with a target label

The limitations of bag-of-words vectors

Order becomes irrelevant when you average word embeddings

Using identity vectors to sum word embeddings

Let’s implement the same logic using a different approach

Matrices that change absolutely nothing

Let’s create sentence embeddings using identity matrices in Python

Learning the transition matrices

What if you allowed the identity matrices to change to minimize the loss?

Learning to create useful sentence vectors

Create the sentence vector, make a prediction, and modify the sentence vector via its parts

Forward propagation in Python

Let’s take this idea and see how to perform a simple forward propagation

How do you backpropagate into this?

It might seem trickier, but they’re the same steps you already learned

Let’s train it!

You have all the tools; let’s train the network on a toy corpus

Setting things up

Before you can create matrices, you need to learn how many parameters you have

Forward propagation with arbitrary length

You’ll forward propagate using the same logic described earlier

Backpropagation with arbitrary length

You’ll backpropagate using the same logic described earlier

Weight update with arbitrary length

You’ll update weights using the same logic described earlier

Execution and output analysis

Looking at predictions can help you understand what’s going on

Summary

Recurrent neural networks predict over arbitrary-length sequences

Chapter 13. Introducing automatic optimization: let’s build a deep learning framework

What is a deep learning framework?

Good tools reduce errors, speed development, and increase runtime performance

Introduction to tensors

Tensors are an abstract form of vectors and matrices

Introduction to automatic gradient computation (autograd)

Previously, you performed backpropagation by hand. Let’s make it automatic!

A quick checkpoint

Everything in Tensor is another form of lessons already learned

Tensors that are used multiple times

The basic autograd has a rather pesky bug. Let’s squish it!

Upgrading autograd to support multiuse tensors

Add one new function, and update three old ones

How does addition backpropagation work?

Let’s study the abstraction to learn how to add support for more functions

Adding support for negation

Let’s modify the support for addition to support negation

Adding support for additional functions

Subtraction, multiplication, sum, expand, transpose, and matrix multiplication

Using autograd to train a neural network

You no longer have to write backpropagation logic!

Adding automatic optimization

Let’s make a stochastic gradient descent optimizer

Adding support for layer types

You may be familiar with layer types in Keras or PyTorch

Layers that contain layers

Layers can also contain other layers

Loss-function layers

Some layers have no weights

How to learn a framework

Oversimplified, frameworks are autograd + a list of prebuilt layers and optimizers

Nonlinearity layers

Let’s add nonlinear functions to Tensor and then create some layer types

The embedding layer

An embedding layer translates indices into activations

Adding indexing to autograd

Before you can build the embedding layer, autograd needs to support indexing

The embedding layer (revisited)

Now you can finish forward propagation using the new .index_select() method

The cross-entropy layer

Let’s add cross entropy to the autograd and create a layer

The recurrent neural network layer

By combining several layers, you can learn over time series

You can learn to fit the task you accomplished in the preceding chapter

Summary

Frameworks are efficient, convenient abstractions of forward and backward logic

Chapter 14. Learning to write like Shakespeare: long short-term memory

Character language modeling

Let’s tackle a more challenging task with the RNN

The need for truncated backpropagation

Backpropagating through 100,000 characters is intractable

Truncated backpropagation

Technically, it weakens the theoretical maximum of the neural network

Let’s see how to iterate using truncated backpropagation

A sample of the output

By sampling from the predictions of the model, you can write Shakespeare!

Vanishing and exploding gradients

Vanilla RNNs suffer from vanishing and exploding gradients

A toy example of RNN backpropagation

To see vanishing/exploding gradients firsthand, let’s synthesize an example

Long short-term memory (LSTM) cells

LSTMs are the industry standard model to counter vanishing/exploding gradients

Some intuition about LSTM gates

LSTM gates are semantically similar to reading/writing from memory

The long short-term memory layer

You can use the autograd system to implement an LSTM

Upgrading the character language model

Let’s swap out the vanilla RNN with the new LSTM cell

Training the LSTM character language model

The training logic also hasn’t changed much

Tuning the LSTM character language model

I spent about two days tuning this model, and it trained overnight

Summary

LSTMs are incredibly powerful models

Chapter 15. Deep learning on unseen data: introducing federated learning

The problem of privacy in deep learning

Deep learning (and tools for it) often means you have access to your training data

Federated learning

You don’t have to have access to a dataset in order to learn from it

Learning to detect spam

Let’s say you want to train a model across people’s emails to detect spam

Let’s make it federated

The previous example was plain vanilla deep learning. Let’s protect privacy

Hacking into federated learning

Let’s use a toy example to see how to still learn the training dataset

Secure aggregation

Let’s average weight updates from zillions of people before anyone can see them

Homomorphic encryption

You can perform arithmetic on encrypted values

Homomorphically encrypted federated learning

Let’s use homomorphic encryption to protect the gradients being aggregated

Summary

Federated learning is one of the most exciting breakthroughs in deep learning

Chapter 16. Where to go from here: a brief guide

Congratulations!

If you’re reading this, you’ve made it through nearly 300 pages of deep learning

Step 1: Start learning PyTorch

The deep learning framework you made most closely resembles PyTorch

Step 2: Start another deep learning course

I learned deep learning by relearning the same concepts over and over

Step 3: Grab a mathy deep learning textbook

You can reverse engineer the math from your deep learning knowledge

Step 4: Start a blog, and teach deep learning

Nothing I’ve ever done has helped my knowledge or career more

Step 5: Twitter

A lot of AI conversation happens on Twitter

Step 6: Implement academic papers

Twitter + your blog = tutorials on academic papers

Step 7: Acquire access to a GPU (or many)

The faster you can experiment, the faster you can learn

Step 8: Get paid to practice

The more time you have to do deep learning, the faster you’ll learn

Step 9: Join an open source project

The best way to network and career-build in AI is to become a core developer in an open source project

Step 10: Develop your local community

I really learned deep learning because I enjoyed hanging out with friends who were learning it too

Index
