
12. Deep Learning

Ervin Varga, Kikinda, Serbia

Deep learning is a field of machine learning that is a true enabler of cutting-edge achievements in the domain of artificial intelligence. The term deep implies a complex structure that is designed to handle massive datasets using intensive parallel computations (mostly by leveraging clusters of GPU-equipped machines). The term learning in this context means that feature engineering and customization of model parameters are left to the machine. In practice, the combination of these terms in the form of deep learning implies multilayered neural networks. Neural networks are heavily used for tasks like image classification, voice recognition/synthesis, time series analysis, and so forth. Neural networks tend to mimic how our brain cells work in tandem in decision-making activities. This chapter introduces you to neural networks and how to build them using PyTorch, an open-source Python framework (visit https://pytorch.org ) whose API will feel familiar to anyone accustomed to NumPy. Furthermore, as the last chapter in this book, it exemplifies many stages of the data science life cycle model (data preparation, feature engineering, data visualization, data analysis, and data product deployment). First, though, let’s consider the notion of intelligence as well as when, how, and why it matters.

Intelligent Machines

The meaning of the term intelligence is subjective and has changed over history. In the 18th and 19th centuries, people were amazed by mechanical contrivances and automated steam engines, understandably so; when I visited Tower Bridge in London, I was impressed by the amount of automation present in the coal-driven steam engine system that, until 1974, operated the bascule bridge. When Tower Bridge became operational back in 1894, many Londoners probably thought that the machinery inside Tower Bridge’s engine room was “intelligent.” Of course, no one today is likely to attach the attribute of intelligent to any sort of steam engine. Why? Well, as people’s understanding of the mechanics behind a technology becomes commonplace, their associated level of amazement drops accordingly. So, we may establish a correlation between being intelligent and our excitement factor regarding some technology. Here are some questions to ponder:
  • What makes a car intelligent? Is a self-driving car intelligent? Is a self-parking car also intelligent (albeit not as much as a fully self-driving one)? Is a car with a cruise control system intelligent at all? Where is the borderline?

  • Is a common fruit fly intelligent? Is an artificial, self-reproductive fruit fly the state-of-the-art of AI in 2019+?

  • Are there indirect signs of intelligence? (Take a look at Figure 12-1, discussed a bit later.)

Suppose that you have a “stupid” washing machine without any fuzzy logic or similar smart algorithm built into it. Furthermore, assume that you have stumbled across an “intelligent” washing detergent (such as one marketed at https://www.skip.co.za/product-format ). Don’t contemplate too long about what constitutes intelligence in a washing detergent or how fuzzy logic may help your machine become cleverer. These aspects aren’t important for now. Instead, try to answer the following questions: Would the simple act of pouring an “intelligent” washing detergent into your “stupid” machine make it intelligent? How would you make a judgment? Would you simply evaluate the outcome based on the cleanliness of the washed clothes from the machine? Would you also monitor how your machine operates? Would you disassemble your machine and analyze the parts separately (including the washing detergent), hoping to find intelligence?
Figure 12-1. You and your team on a mission to find a habitable planet

Imagine that you and your team are searching nearby solar systems for habitable planets. You approach a good candidate planet, but there is a problem. You have noticed a shovel on the beach. You are afraid to land, not because you are afraid of the shovel, but afraid of what it represents. It is a clear sign of the presence of intelligent beings, who might not welcome you. It doesn’t even matter whether the planet is colonized by intelligent living beings or by robots. What matters is the indirect sign of intelligence, which sparks some level of respect. This idea is nicely elaborated in reference [1] and we will further develop it in this chapter.

Estimating the level of intelligence of humans and machines has been a known conundrum for a long time. One way to approach it is embodied in the Turing test, which measures a machine’s ability to exhibit intelligent behavior indistinguishable from that of a human, as depicted in Figure 12-2. This is definitely a black-box test, which tries to answer a question purely by observing external properties and outcomes.
Figure 12-2. Turing test setup (source: https://commons.wikimedia.org/wiki/File:Test_de_Turing.jpg ). The interrogator (C) tries to reveal which player (A or B) is a human only by presenting questions and looking at the written answers.

The Turing test aims to establish the ultimate criteria regarding what constitutes an intelligent machine. Maybe one day we will manage to attain such a level of sophistication. Until then, we must focus our attention on more pragmatic objectives. Returning to our washing machine example, do we really care who or what cleaned an item of clothing? The real question is, “Can we distinguish whether clothing was washed manually by a human or washed by a washing machine?” If there is no difference, then the washing machine is a useful utensil that makes us more productive and relieves us of mundane tasks. This is the whole point of human–machine interaction. In this sense, intelligence is just a requisite score related to the complexity of tasks that we want to tackle with or without a machine. For more advanced jobs, we need to reach for more advanced techniques, technologies, and tools. We don’t usually care whether the desired level of a machine’s intelligence is due to stupid hardware + super-smart software, ultra-smart hardware + dumb software, or shrewd hardware and software combined. Sometimes, even an ordinary calculator can make us powerful enough to solve a problem on time with the required level of accuracy.

Note

In the spirit of the previous text, I only contemplate using deep neural networks when simpler, more interpretable solutions are unfeasible. Deep neural networks may seem mystical to external spectators, but internally they are roughly linear algebra and calculus. Nonetheless, the amount of research and human effort that preceded them is staggering (developing the algorithm to train deep neural networks required around 30 years of concerted hard work of many scientists). You may also want to consult references [2] and [3] for more information regarding intelligence and different ways of looking at things.

One of the fascinating and amusing achievements of neural networks is embodied in the 20Q game (for more details, read reference [4]). This product constantly learns from users as they play the game. You may also want to consult reference [5] for examples of when neural networks are a good fit as well as to learn about Keras (I will present PyTorch in this chapter).

Intelligence As Mastery of Symbols

I have pondered a lot about how to best exemplify what is happening inside a neural network. Jumping immediately to nodes, weights, layers, and activation functions seems inappropriate to me. Luckily, I managed to find a suitable problem statement on Topcoder, for a game called AToughGame, to demonstrate how abstractions interact in creating something that appears intelligent (to follow this discussion, you first must read the specification; click the link for AToughGame at https://www.topcoder.com/blog/how-to-come-up-with-problem-ideas ). Intelligence may be treated as a mastery of producing symbols (abstractions) at multiple levels of granularity, as mentioned in reference [1]. Abstract hierarchical structures are accumulated and built upon each other until the final solution may be trivially described in terms of them.

Manual Feature Engineering

Figure 12-3 shows the general structure of the AToughGame problem as a state diagram. The player progresses from one level to another according to the provided probabilities. The main idea is to process the states in pairs; in this manner, the number of states decreases by one after each iteration. This is a typical greedy algorithm whose safe move is aggregating two states. The only real work is then to implement this algorithm with the combine operator. The difficulty is to find the expected value of the treasure over all possible ways to complete the two levels comprising the pair. The joint probability is simply the product of the individual probabilities of the levels. All in all, this results in a very fast linear algorithm (visiting each state only once is sufficient).
Figure 12-3. The overall state diagram of the problem

If the probability of passing some level is p, then the opposite outcome is q = 1 – p. The final state is the winning one. The goal is to calculate the expected amount of treasure collected after completing all levels. The diagram shows the initial pair of states that is going to be aggregated first.

Modeling the Aggregated State

There are two questions to answer about the combined state:
  • What is the joint probability of the new state?

  • What is the joint value (I will use v to denote a value) of the new state?

We have already answered the first question, as also shown in Figure 12-3. The new value is the sum of the last value and the expected amount of treasure, taking into account all possible ways to leave the two states. The most trivial scenario is to leave the states in succession without dying at either level. The next possibility is dying once at level 0 and afterward finishing both states in succession. The third scenario is to die twice in succession at level 0 and afterward finish both states in succession. This pattern continues indefinitely. We may describe all these scenarios by $$ E(T) = v_1 + v_0 \ast p_1 \ast p_0 \ast \sum_{i=0}^{\infty} q_0^i $$, where E(T) is the expected amount of treasure.

The last expression already encompasses two powerful abstractions. We should synthesize them now. Don’t forget that there are many more possibilities to finish these two levels, which means that manipulating raw probabilities will soon become unwieldy. The first abstraction comes from algebra and provides a closed-form solution to the summation. This is the geometric sum $$ \sum_{i=0}^{\infty} r^i = \frac{1}{1-r}, \quad r < 1. $$ In our case, the parameter r will always be a non-negative real number. It is important to name abstractions, and this sum will be denoted as series1(r). Our second abstraction is γ = p0 ∗ series1(q0), with the meaning of “the probability of leaving level 0.” Notice that we can leave this level without dying, or by dying once, twice, and so on.
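These named abstractions translate directly into code; here is a minimal sketch (the function names series1 and gamma simply mirror the text):
def series1(r):
    # Closed form of the geometric sum, valid for r < 1.
    return 1 / (1 - r)
def gamma(p0):
    # Probability of eventually leaving level 0, possibly after dying many times.
    return p0 * series1(1 - p0)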

The next case of leaving the two levels is depicted by the following pseudocode:
repeat an arbitrary number of times:
    leave level 0 in any way
    die at level 1
leave level 0 by dying at least once
pass level 1

Thanks to the previously introduced abstractions, we may formulate the joint probability for the preceding use case in a succinct fashion as series1(γ ∗ q1) ∗ q0 ∗ γ ∗ p1. Observe how γ is nested inside series1.

The next use case is related to the ability of accumulating wealth by dying multiple times at level 1 without dying at level 0. In other words, this scenario is depicted with the following pseudocode:
repeat an arbitrary number of times:
    pass level 0
    die at level 1
pass level 0
pass level 1

In this case, after each iteration, the amount of treasure left at level 1 is equal to v0 ∗ m, where m is the number of iterations. To describe this case effectively, we need a new abstraction, which is a derivative of the geometric sum: $$ \sum_{i=1}^{\infty} i \ast r^{i-1} = \frac{1}{(1-r)^2}, \quad r < 1. $$ We will name this abstraction series2(r). Therefore, the joint probability is given by series2(p0 ∗ q1) ∗ p0 ∗ p1.
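Continuing the earlier sketch, the new abstraction is a one-liner:
def series2(r):
    # Closed form of sum over i >= 1 of i * r**(i - 1), valid for r < 1.
    return 1 / (1 - r) ** 2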

One final remark is that the last use case can also be associated with the second one. In other words, it is possible to accumulate wealth and then, in the last round, lose everything by leaving level 0 after dying at least once.

Tying All Pieces Together

The whole solution is depicted in Listing 12-1 after refactoring the formulas for each use case. The combine function receives as inputs the raw probabilities and values for two consecutive levels and returns the aggregated probability and expected value of treasure. In between, it builds abstractions from raw data. This is very similar to how neural networks start from raw input nodes, create a hierarchy of abstractions via hidden layers, and finally output the result.

We can get rid of γ, as it equals 1. Furthermore, we can simply inline series1 and series2 as well as simplify the expression series1(q1) into 1 / p1 (since series1(q1) = 1 / (1 − q1) = 1 / p1). The final result barely resembles the expanded version. Optimization should be left as the last step. These sorts of optimizations also happen inside neural networks. Not all features produced by hidden layers are equally useful, nor are all of them used in a stand-alone manner (many get integrated into higher-level abstractions).
from functools import reduce

class AToughGame:
    def expectedGain(self, prob, value):
        def combine(level0, level1):
            # Raw probability and treasure value of two consecutive levels.
            p0, v0, p1, v1 = level0[0], level0[1], level1[0], level1[1]
            q0, q1 = 1 - p0, 1 - p1
            # Joint probability and expected treasure of the aggregated state.
            return p0 * p1, v1 + v0 * p1 * (p0 + q0 / p1) * (1 - p0 * q1) ** -2
        # Probabilities are specified as integers per mille, hence the scaling.
        return reduce(combine, zip(map(lambda p: p / 1000, prob), value))[1]
Listing 12-1. AToughGame.py Module That Implements the AToughGame Topcoder Problem

The class name and the sole method’s signature are part of the requirements specification (see Exercise 12-1). The internal details, despite all abstractions being manually created, would be unfathomable to someone who hasn’t read the preceding description. Therefore, you must take care to accompany your condensed code with a proper design document.

By looking at the preceding code, you might wonder where the intelligence is in these few lines of code. Maybe you would marvel at its ingenuity by running it and seeing how it spits out the solution in a fraction of a second. Can you imagine that it has all the necessary knowledge to take into account all possible ways the player could finish the game? This program nicely illustrates the state of affairs in AI before the 1990s. Software solutions were equipped beforehand with all the necessary heuristics and rules.

Now, imagine a software wizard that is capable of deciphering all the necessary abstractions to deal with a problem. This is exactly where neural networks shine. You simply define the architecture of the solution and leave the hard work of producing higher-level symbols to the network. Upper layers reuse abstractions from lower ones: more layers, more sophisticated features.

Machine-Based Feature Engineering

We will now build two versions of a neural network to demonstrate automatic feature engineering and how such a network works. The first version will be built from scratch and the second one will use PyTorch. You have already seen linear regression in Chapter 7 (about machine learning), where the output is estimated as $$ \hat{y} = W \ast \mathrm{features} + b $$ (W is the weight of the features and b is the bias term). The features may contain nonlinear components, though. It is possible to associate an activation function with this output; by doing this, you may turn linear regression into logistic regression for classification purposes. An activation function can even be a trivial step function that outputs 1 if $$ \hat{y} \ge 0 $$ and 0 otherwise. As a matter of fact, any function that has some threshold to discern positive and negative cases can serve as an activation function.
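As a toy illustration (not part of any project in this chapter), here is a linear output combined with the two activation functions just mentioned:
import numpy as np

def linear_output(W, features, b):
    # y_hat = W * features + b, as in linear regression.
    return np.dot(W, features) + b

def step(y_hat):
    # Trivial threshold activation for classification.
    return 1 if y_hat >= 0 else 0

def sigmoid(y_hat):
    # Logistic activation squashing the output into (0, 1).
    return 1 / (1 + np.exp(-y_hat))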

Figure 12-4 shows the general structure of a neural network with one input layer, one hidden layer, and one output layer. The input layer has as many nodes as there are different features (in this case two) plus an extra constant to represent the bias. The number of nodes in the hidden layer is configurable. The output layer has as many nodes as there are targets (in this case only one).

The shaded nodes in the hidden layer are composed of an aggregator and an activation function. The node’s aggregator computes the weighted sum of its inputs (the dot product of the incoming values and the matching weights), while the activation function transforms the aggregator’s value. The final output node may also apply an activation function, when you are doing classification instead of regression. In Figure 12-4, the activation function is the sigmoid $$ \sigma(x) = \frac{1}{1 + e^{-x}} $$, whose derivative with respect to its input is very neat: $$ \sigma'(x) = \sigma(x)(1 - \sigma(x)) $$.

Every neural network must be trained before use, where training is essentially an iterative and incremental process to find the proper weights. At the beginning, weights are initialized to some small random values. One training cycle (iteration) is composed of two parts: forward pass, which calculates the final output, and backpropagation, which sends back an error through the network to update weights. Cycles are repeated until the network converges to stable weights. Each iteration is traditionally called an epoch.
Figure 12-4. The architecture of a simple neural network for a regression task with one hidden layer

The error is the difference between the true label and the predicted value, represented as $$ E(W) = y - \hat{y} $$. A set of partial derivatives of this error function with respect to the weights gives us the desired gradients for updating weights; this is the crux of the gradient descent method. The amount by which to update the weight $$ W_{ij}^{(1)} $$ is $$ -\frac{\partial E}{\partial W_{ij}^{(1)}} = -\frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial h_j} \frac{\partial h_j}{\partial W_{ij}^{(1)}}, $$ which is an application of the chain rule in calculating derivatives. The previous expression would have an additional term had the final output also applied an activation function. At any rate, these partial derivatives are only feasible if the activation function (or functions; you may use different functions in each layer) is continuous and differentiable. Furthermore, to avoid the vanishing gradient issue in deep neural networks (when the product of all partial derivatives becomes tiny), the activation function should spread the output over a larger range. In this respect, the sigmoid function isn’t very good. This is why you will often see the hyperbolic tangent (tanh) or rectified linear unit (ReLU) in action. The latter is the default choice for hidden layers.
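Spelling out these factors for the network in Figure 12-4 (sigmoid hidden layer, linear output) gives the weight updates that Listing 12-2 below accumulates, where $$ \eta $$ denotes the learning rate and $$ h_j(1 - h_j) $$ is the sigmoid’s handy derivative:

$$ \Delta W_{jk}^{(2)} = \eta \, h_j \left( y_k - \hat{y}_k \right), \qquad \Delta W_{ij}^{(1)} = \eta \, x_i \, h_j \left( 1 - h_j \right) \sum_k W_{jk}^{(2)} \left( y_k - \hat{y}_k \right) $$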

To avoid changing the weights abruptly, there is a hyperparameter called the learning rate. Every gradient is multiplied by this quantity, so that the process moves cautiously toward a minimum (most often a good local minimum).

The power of neural networks comes from the nonlinearity of outputs from hidden nodes. In deep neural networks, as each layer reuses outputs from a previous one, complex features may be created out of raw input. The beauty is that you don’t need to worry about how the network will describe the problem in succinct fashion. Of course, the downside is that you will have a hard time interpreting the network’s decision-making procedure. There is also an amazing project for producing visual effects from intermediary features created by deep neural networks (visit https://github.com/google/deepdream ).

Implementation from Scratch

Listing 12-2 provides a full implementation of our simple neural network by only using NumPy arrays. This code is actually my solution for Udacity’s bike-sharing project at http://bit.ly/project-bikes (I also recommend the excellent free course “Intro to Deep Learning with PyTorch” at Udacity).
import numpy as np
class NeuralNetwork:
    def __init__(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes
        # Initialize weights to small random values using Normal distribution.
        self.weights_input_to_hidden = np.random.normal(
            scale = self.input_nodes ** -0.5,
            size = (self.input_nodes, self.hidden_nodes))
        self.weights_hidden_to_output = np.random.normal(
            scale = self.hidden_nodes ** -0.5,
            size = (self.hidden_nodes, self.output_nodes))
        self.lr = learning_rate
        self.activation_function = lambda x : 1 / (1 + np.exp(-x)) # sigmoid
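    # One training cycle: accumulate weight deltas over the whole batch,
    # then update the weights once (see __update_weights).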
    def train(self, features, targets):
        delta_weights_i_h = np.zeros(self.weights_input_to_hidden.shape)
        delta_weights_h_o = np.zeros(self.weights_hidden_to_output.shape)
        for X, y in zip(features, targets):
            y_hat, hidden_outputs = self.__forward(X)
            delta_weights_i_h, delta_weights_h_o = self.__backward(
                y_hat, hidden_outputs,
                X, y,
                delta_weights_i_h, delta_weights_h_o)
        self.__update_weights(delta_weights_i_h, delta_weights_h_o)
    def run(self, X):
        return self.__forward(X)[0]
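    # Forward pass: sigmoid activation in the hidden layer, linear output (regression).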
    def __forward(self, X):
        hidden_inputs = np.dot(X, self.weights_input_to_hidden)
        hidden_outputs = self.activation_function(hidden_inputs)
        final_inputs = np.dot(hidden_outputs, self.weights_hidden_to_output)
        y_hat = final_inputs
        return y_hat, hidden_outputs
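    # Backpropagation: compute the error terms and accumulate the weight deltas.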
    def __backward(self, y_hat, hidden_outputs, X, y, delta_weights_i_h, delta_weights_h_o):
        error = y - y_hat
        hidden_error = np.dot(self.weights_hidden_to_output, error)
        output_error_term = error
        hidden_error_term = hidden_error * hidden_outputs * (1 - hidden_outputs)
        delta_weights_i_h += np.dot(
            X[:, np.newaxis], hidden_error_term[np.newaxis, :])
        delta_weights_h_o += np.dot(
            hidden_outputs[:, np.newaxis], output_error_term[np.newaxis, :])
        return delta_weights_i_h, delta_weights_h_o
    def __update_weights(self, delta_weights_i_h, delta_weights_h_o):
        self.weights_hidden_to_output += self.lr * delta_weights_h_o
        self.weights_input_to_hidden += self.lr * delta_weights_i_h
Listing 12-2. Simple Neural Network Implementation for the Regression Task Shown in Figure 12-4 (Without Biases)

There are two public methods, train and run. The private __update_weights method doesn’t compute the average of the delta weights but assumes that the provided learning rate includes this factor. This is a common practice, since this parameter is an arbitrary number that must be tuned for each problem separately.

The weights are initialized to small random values using a normal distribution with a standard deviation of $$ 1/\sqrt{n} $$, where n is the number of input nodes of the matching layer. This is known as Xavier initialization, which you may read more about at http://bit.ly/xavier-init . This is another way to mitigate the vanishing gradient issue.

Here is a simple recipe to see this network in action:
  1. Issue git clone https://github.com/udacity/deep-learning-v2-pytorch.git to clone the course repository.
  2. Go into the project-bikesharing folder and copy into it the simple_network1.py file from this chapter’s source code.
  3. Delete the my_answers.py file and rename simple_network1.py to my_answers.py.
  4. Open Predicting_bike_sharing_data.ipynb in your Jupyter notebook instance and follow the narrative.
The hyperparameters in the simple_network1.py file are set as follows:
iterations = 1000
learning_rate = 0.005
hidden_nodes = 20
output_nodes = 1

All unit tests should pass, and you should get a line plot for the test data as shown in Figure 12-5. The model predicts the data quite well, except for the last week of the year. You can see that from December 22 until the end of the year the prediction is higher than the actual data. The network had not been properly trained to recognize this period, when most people take vacation around Christmas. If you look carefully in the notebook, you will see that the test data does not cover a typical period, and this critical period was withheld during training. Furthermore, workingday as a feature was also removed from the data.

Implementation with PyTorch

Listing 12-3 provides a full implementation of our simple neural network, but this time using PyTorch. PyTorch helps you to reduce the amount of code that you need to write, thereby making your product easier to maintain. There is also less chance for you to introduce subtle bugs into your implementation. Most importantly, PyTorch has lots of powerful capabilities, like support for deep neural networks, including convolutional, recurrent, gated recurrent, and long short-term memory networks. Since training deep neural networks is computationally quite intensive, PyTorch allows you to utilize GPUs on your machine (if your environment also supports the CUDA programming model).
Figure 12-5. The actual and predicted outputs for the test data

from collections import OrderedDict
import torch
from torch import nn
class NeuralNetwork:
    def __init__(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
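        # The same architecture as in Figure 12-4: linear -> sigmoid -> linear.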
        self.model = nn.Sequential(OrderedDict([
                ('fc', nn.Linear(input_nodes, hidden_nodes)),
                ('sigmoid', nn.Sigmoid()),
                ('output', nn.Linear(hidden_nodes, output_nodes))]))
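        # Prefer the GPU when CUDA is available.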
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.criterion = nn.MSELoss()
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr = learning_rate)
    def train(self, features, targets):
        features, targets = features.to(self.device), targets.to(self.device)
        self.model.train()
        self.optimizer.zero_grad()
        output = self.model(features)
        loss = self.criterion(output, targets)
        loss.backward()
        self.optimizer.step()
    def run(self, x):
        self.model.eval()
        with torch.no_grad():
            return self.model(torch.tensor(x.values, dtype = torch.float)
                       .to(self.device)).cpu().numpy()
Listing 12-3. Version of Our Network with PyTorch, Using GPUs If Available

The hyperparameters were also altered (look in the simple_network2.py module).

The NeuralNetwork class uses composition over inheritance with duck typing and saves the PyTorch network object as an internal attribute, model. OrderedDict is useful to name each of the layers of the network; you can easily refer to them later by typing self.model.<name> (a quick demonstration follows the list below). The whole code is simply a sequence of declarations instead of low-level implementation details. You can immediately read out the high-level backpropagation algorithm from the body of the train method:
  1. Set the gradients to zero (the same as we did in Listing 12-2).
  2. Make a forward pass through the network.
  3. Calculate the error.
  4. Backpropagate the error to find the proper deltas for updating the weights.
  5. Update the weights.
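Regarding the named layers, here is a quick sketch of accessing them (assuming a feature count of 56, which matches the prepared bike-sharing data but should be adjusted to your own dataset):
net = NeuralNetwork(56, 20, 1, 0.005)
print(net.model.fc)       # Linear(in_features=56, out_features=20, bias=True)
print(net.model.sigmoid)  # Sigmoid()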
PyTorch operates with tensors, which are n-dimensional arrays. To run this new version, you will need to alter the code cell inside the Predicting_bike_sharing_data.ipynb notebook for training the network. Don’t try to run the unit tests, as they aren’t compatible with this code (you may want to tweak them as an additional exercise). Listing 12-4 shows the updated script that uses the PyTorch DataLoader class (compared to the original notebook cell, some lines are additions and some lines have been removed).
import sys
import torch
import torch.utils.data as data_utils
from my_answers import NeuralNetwork, iterations, learning_rate, hidden_nodes, output_nodes
# MSE, np, train_features, train_targets, val_features, and val_targets are
# defined in earlier cells of the notebook.
print("Is CUDA available?", "Yes" if torch.cuda.is_available() else "No")
N_i = train_features.shape[1]
network = NeuralNetwork(N_i, hidden_nodes, output_nodes, learning_rate)
losses = {'train':[], 'validation':[]}
train = data_utils.TensorDataset(torch.Tensor(np.array(train_features)),
                                 torch.Tensor(np.array(train_targets)))
train_loader = data_utils.DataLoader(train, batch_size = 128, shuffle = True)
for epoch in range(1, iterations + 1):
    for X, y in train_loader:
        network.train(X, y)
    # Printing out the training progress
    train_loss = MSE(network.run(train_features).T, train_targets['cnt'].values)
    val_loss = MSE(network.run(val_features).T, val_targets['cnt'].values)
    sys.stdout.write("\rProgress: {:2.1f}".format(100 * epoch / iterations)
                     + "% ... Training loss: " + str(train_loss)[:5]
                     + " ... Validation loss: " + str(val_loss)[:5])
    sys.stdout.flush()
    losses['train'].append(train_loss)
    losses['validation'].append(val_loss)
Listing 12-4. Updated Script Using the PyTorch DataLoader Class

Figure 12-6 shows how the training and validation losses change over time. After around 200 epochs, there is no significant improvement in either quantity. You should definitely train your network on a beefed-up machine with GPUs, because otherwise it will take a while to finish.
Figure 12-6. The efficiency of the training process. You should always monitor the validation loss curve. If it starts to rise, then your network is overfitting. If the training loss doesn’t drop, then you are underfitting.

Exercise 12-1. Custom Parallelization

In Chapter 11 we applied Dask to perform operations in parallel over an array. There are situations where this form of concurrency isn’t viable. Dask also provides an option to parallelize custom algorithms through the dask.delayed interface.

The expectedGain function returns the expected amount of treasure at the end of a game. Suppose that you want to simply return the total probability of passing all levels in succession. This is currently returned as the first element in the final tuple. Create a new method totalProbability that calculates just this quantity in parallel.

Take a look at the example code about tree summation (reduction) at https://examples.dask.org/delayed.html . Instead of using the add operation from the Dask tutorial, you would use multiplication to merge nodes. For this simple case of tree reduction, you don’t need a custom procedure, but the aim is to try out the dask.delayed interface and monitor in the Dask dashboard how computations are carried out.
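Here is a minimal sketch of such a tree reduction with dask.delayed (the standalone function below is illustrative; it assumes the probabilities have already been scaled into [0, 1]):
from operator import mul
from dask import delayed

def total_probability(probs):
    # Wrap each probability as a lazy Dask task.
    layer = [delayed(p) for p in probs]
    # Merge neighboring nodes pairwise, halving the layer each round.
    while len(layer) > 1:
        merged = [delayed(mul)(a, b) for a, b in zip(layer[::2], layer[1::2])]
        if len(layer) % 2 == 1:
            merged.append(layer[-1])  # carry the odd node to the next round
        layer = merged
    return layer[0].compute()

print(total_probability([0.9, 0.8, 0.95]))  # 0.684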

Exercise 12-2. Deployment into Production

PyTorch (starting from version 1.0) offers the capability to convert your Python model into an intermediary format that can be utilized from a C++ environment. In this manner, you can develop your solution in Python and later deploy it as a C++ application. This approach addresses the performance requirements associated with a production setup.

Consult the tutorial about loading your PyTorch model in C++ at https://pytorch.org/tutorials/advanced/cpp_export.html . In our case, because the forward implementation is unified (there is no conditional logic based on input), you can transform the PyTorch model to Torch Script via tracing.
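A minimal tracing sketch, assuming network is a trained instance of the NeuralNetwork class from Listing 12-3 and that the prepared data has 56 features (the file name is hypothetical):
import torch

example_input = torch.rand(1, 56)  # dummy input with the assumed feature count
traced = torch.jit.trace(network.model, example_input)
traced.save("bike_model.pt")       # load it from C++ with torch::jit::load("bike_model.pt")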

Summary

Using PyTorch, or some other framework for neural networks, is essential to cope with the inherent complexities of deep neural networks. There are lots of additional options to optimize the training process, which are readily available in PyTorch: dropout, batch normalization, various advanced optimizers, different activation functions, and so on. PyTorch also allows you to persist your network into external storage for later use. You might want to save your model each time you manage to reduce the validation loss. At the end, you can select the best-performing variant.
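For instance, a checkpointing sketch woven into the training loop of Listing 12-4 might look as follows (best_val and the file name are illustrative additions):
best_val = float("inf")  # before the training loop

# Inside the loop, right after computing val_loss:
if val_loss < best_val:
    best_val = val_loss
    torch.save(network.model.state_dict(), "best_model.pt")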

PyTorch is bundled with two major extensions: torchvision, which is useful for image processing, and torchtext, which is useful for handling text (such as doing sentiment analysis). You can also reuse publicly available trained models to realize the concept of transfer learning. For example, there are models trained on images from ImageNet ( http://www.image-net.org ) with cool features for image classification.

There is an interesting initiative called PyTorch Hub ( https://pytorch.org/hub ) for efficiently sharing pretrained models, thus realizing the vision of transfer learning. Neural networks are also a very popular option at the IoT edge. For this you need an ultra-light prepacked engine. One example is Intel’s Neural Compute Stick ( http://bit.ly/neural-stick ). All in all, there are very innovative and interesting approaches for each use case and domain.

References

  1. Douglas R. Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid, Anniversary Edition, Basic Books, 1999.
  2. Charles Petzold, Code: The Hidden Language of Computer Hardware and Software, Microsoft Press, 2000.
  3. Garry Kasparov, “Don’t Fear Intelligent Machines. Work with Them,” TED2017, https://www.ted.com/talks/garry_kasparov_don_t_fear_intelligent_machines_work_with_them , April 2017.
  4. Karen Schrock, “Twenty Questions, Ten Million Synapses,” Scienceline, https://scienceline.org/2006/07/tech-schrock-20q , July 28, 2006.
  5. Jojo Moolayil, Learn Keras for Deep Neural Networks: A Fast-Track Approach to Modern Deep Learning with Python, Apress, 2018.