
5. Supervised Learning Using PyTorch


Supervised machine learning is the most widely applied branch of machine learning. It is in use in almost all fields, including artificial intelligence, cognitive computing, and language processing. Machine learning literature broadly describes three types of learning: supervised, unsupervised, and reinforcement learning. In supervised learning, the machine learns to predict a known output; hence, it is task driven, and the task can be classification or regression.

In unsupervised learning, the machine learns patterns from a set of input features alone; no labeled output is available, and the learning must generalize to new datasets. In reinforcement learning, learning happens in response to a system that reacts to situations.

This chapter covers regression techniques in detail with a machine learning approach and interprets the output from regression methods in the context of a business scenario. The algorithmic classification is shown in Figure 5-1.
../images/474315_1_En_5_Chapter/474315_1_En_5_Fig1_HTML.png
Figure 5-1 Algorithmic classification

Each object or row represents one event, and each event is assigned to a group. Identifying which group a record belongs to is called classification; the target variable has specific labels or tags attached to the events. For example, in a bank database, each customer is tagged as either a loyal customer or not a loyal customer. In a medical records database, each patient’s disease is tagged. In the telecom industry, each subscriber is tagged as a churn or non-churn customer. These are examples in which a supervised algorithm performs classification. The word classification comes from the classes available in the target column.

In regression learning, the objective is to predict the value of a continuous variable. For example, given the features of a property, such as the number of bedrooms, square footage, nearby areas, and the township, a regression model can determine the asking price for the house. Similar examples include predicting stock prices or the sales, revenue, and profit of a business.

In an unsupervised learning algorithm, we do not have an outcome variable, and tagging or labeling is not available. We are interested in finding the natural grouping of the observations, records, or rows in a dataset. This natural grouping should be such that similarity within groups is at a maximum and similarity between groups is at a minimum.

In real-world scenarios, there are cases where simple regression is not enough to predict the target variable, and we may need advanced regression methods to capture the patterns existing in the dataset. In supervised regression techniques, the input data is also known as training data; for each record, there is a label with a continuous numerical value. The model is prepared through a training process that learns to predict the right output, and the process continues until the desired level of accuracy is achieved.

Introduction to Linear Regression

Linear regression analysis is among the most reliable, easiest to apply, and most widely used of all statistical techniques. It assumes linear, additive relationships between the dependent and independent variables. The objective of linear regression is to predict the dependent, or target, variable from the independent variables. The specification of the linear regression model is as follows.

Y = α + βX

This formula has a property in which the prediction for Y is a straight-line function of each of the X variables, keeping all others fixed, and the contributions of different X variables for the predictions are additive. The slopes of their individual straight-line relationships with Y are the coefficients of the variables. The coefficients and intercept are estimated by least squares (i.e., setting them equal to the unique values that minimize the sum of squared errors within the sample of data to which the model is fitted).

The model’s prediction errors are typically assumed to be independently and identically normally distributed. When a beta coefficient becomes zero, the corresponding input variable X has no impact on the dependent variable. The OLS method attempts to minimize the sum of the squared residuals, where the residuals are defined as the differences between the points on the regression line and the actual data points in the scatterplot. This process estimates the beta coefficients in a multiple linear regression model.
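
For simple linear regression with one predictor, the least squares estimates have a closed form, obtained by setting the derivatives of the sum of squared errors to zero:

β̂ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

α̂ = ȳ − β̂x̄

where x̄ and ȳ are the sample means of the predictor and the target.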

Let’s take a sample dataset of 15 people. We capture the height and weight for each of them. By taking only their heights, can we predict the weight of a person using a linear regression technique? The answer is yes.

Person  Height  Weight
1       58      115
2       59      117
3       60      120
4       61      123
5       62      126
6       63      129
7       64      132
8       65      135
9       66      139
10      67      142
11      68      146
12      69      150
13      70      154
14      71      159
15      72      164

To represent this graphically, we measure height on the x axis and weight on the y axis. The linear regression equation is shown on the graph, where the intercept is 87.517 and the coefficient is 3.45. The data points are represented by dots, and the fitted line shows the linear relationship (see Figure 5-2).
../images/474315_1_En_5_Chapter/474315_1_En_5_Fig2_HTML.png
Figure 5-2 Height and weight relationships

Why do we assume that a linear relationship exists between the dependent variable and a set of independent variables, when most real-life scenarios reflect relationships that are anything but linear? The reasons why we stick to a linear relationship are described next.

It is easy to understand and interpret; there are ways to transform deviations from linearity to make a relationship linear; and it is simple to generate predictions.

The field of predictive modeling is mainly concerned with minimizing the errors in a predictive model, or making the most accurate predictions possible. Linear regression was developed in the field of statistics. It is studied as a model for understanding the relationship between the input and the output of numerical variables, but it has been borrowed by machine learning. It is both a statistical algorithm and a machine learning algorithm. The linear regression model depends on the following set of assumptions.
  • There should be a linear relationship between the dependent and independent variables.

  • There should not be any multicollinearity among the predictors. If we have more than two predictors in the input feature space, the input features should not be correlated with one another.

  • There should not be any autocorrelation.

  • There should not be any heteroscedasticity; the variance of the error term should be constant across the range of the predictors.

  • The error term should be normally distributed. The error term is defined as the difference between the actual and predicted values.

Within linear regression, there are different variants, but in machine learning we consider them as one method. For example, if we are using one explanatory variable to predict the dependent variable, it is called a simple linear regression model. If we are using more than one explanatory variable, the model is called a multiple linear regression model. Ordinary least squares is the statistical technique for estimating the linear regression model; hence, the linear regression model is sometimes also known as an ordinary least squares model.

Linear regression is very sensitive to missing values and outliers because the statistical computation of a linear regression depends on the mean, standard deviation, and covariance of the variables. The mean is sensitive to outlier values; therefore, we need to clear out the outliers before fitting the linear regression model.

In machine learning literature, the optimum beta coefficients that minimize the error of a regression model are found by a method called the gradient descent algorithm. How does gradient descent work? It starts with initial values, often zero, and iteratively updates the coefficients by a step scaled by the learning rate until the error term is minimized.
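
As a minimal sketch of this idea, consider simple linear regression on a few of the height-weight pairs from the table; the data subset, learning rate, and iteration count here are illustrative choices.

import numpy as np

# Gradient descent for simple linear regression (illustrative data).
x = np.array([58, 59, 60, 61, 62], dtype=float)
y = np.array([115, 117, 120, 123, 126], dtype=float)

alpha, beta = 0.0, 0.0   # start both coefficients at zero
lr = 0.0001              # learning rate

for _ in range(10000):
    error = (alpha + beta * x) - y       # prediction error
    grad_alpha = 2 * error.mean()        # d(MSE)/d(alpha)
    grad_beta = 2 * (error * x).mean()   # d(MSE)/d(beta)
    alpha -= lr * grad_alpha
    beta -= lr * grad_beta

print(alpha, beta)   # moves toward the least squares estimates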

Understanding linear regression from a machine learning point of view requires special data preparation that avoids violating the model's assumptions while keeping the original data intact. Data transformation may be required to make the model more robust.

Recipe 5-1. Data Preparation for the Supervised Model

Problem

How do we perform data preparation for creating a supervised learning model using PyTorch?

Solution

We take an open source dataset, mtcars.csv, which is a regression dataset, to test how to create an input and output tensor.

How It Works

First, the necessary library needs to be imported.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figa_HTML.jpg

The predictor for the supervised algorithm is qsec, which is used to predict the miles per gallon (mpg) of each car. What is important here is the data type. First, we bring the data, which is in NumPy format, into PyTorch tensor format. NumPy arrays default to float64 (double), so the tensor data type must be set consistently; a mismatched data type would cause errors when running the optimization function. We also reshape each 1-D series into a column by using the unsqueeze function and specifying that the dimension is equal to 1.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figb_HTML.jpg
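
Because the code appears here only as an image, the following is a minimal sketch of what this preparation might look like, assuming the file is named mtcars.csv and contains the qsec and mpg columns.

import pandas as pd
import torch

torch.manual_seed(1234)          # reproducibility, as in the text

df = pd.read_csv("mtcars.csv")   # assumed file name

# Convert the NumPy columns to float tensors and add a feature
# dimension with unsqueeze, giving each tensor shape (n_samples, 1).
x = torch.from_numpy(df["qsec"].values).float().unsqueeze(dim=1)
y = torch.from_numpy(df["mpg"].values).float().unsqueeze(dim=1)

print(x.dtype, x.shape)   # torch.float32 torch.Size([32, 1])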

To reproduce the same result, a manual seed needs to be set, so torch.manual_seed(1234) was used. Although we see that the object is a tensor, checking it with the type function shows a double tensor, because it was created from a float64 NumPy array; a consistent tensor data type is required for the optimization function.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figc_HTML.jpg

Recipe 5-2. Forward and Backward Propagation

Problem

How do we build a neural network torch class function so that we can build a forward propagation method?

Solution

Design the neural network class function, including the connections from the input layer to the hidden layer and from the hidden layer to the output layer. The number of neurons in the hidden layer also needs to be specified in the network architecture.

How It Works

In the Net() class, we first initialize the feature, hidden, and output layers. Then we define the forward propagation method, using the rectified linear unit (ReLU) as the activation function in the hidden layer; the backward pass is handled automatically by PyTorch's autograd.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figd_HTML.jpg
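
A minimal sketch of such a Net class follows; the layer names hidden and predict are illustrative.

import torch
import torch.nn.functional as F

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   # input -> hidden
        self.predict = torch.nn.Linear(n_hidden, n_output)   # hidden -> output

    def forward(self, x):
        x = F.relu(self.hidden(x))   # ReLU activation in the hidden layer
        return self.predict(x)       # linear output for regression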

The following image shows the ReLU activation function. It is popularly used across different neural network models; however, the choice of activation function should be based on accuracy. If a sigmoid function gives better accuracy, we should consider it instead.

../images/474315_1_En_5_Chapter/474315_1_En_5_Fige_HTML.jpg

Now the network architecture is specified for the supervised learning model. The n_feature argument is the number of neurons in the input layer; since we have one input variable, qsec, we use 1. The number of neurons in the hidden layer can be decided based on the input and the degree of accuracy required by the learning model. We set n_hidden to 20, which means 20 neurons in the hidden layer, and the output layer has 1 neuron.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figf_HTML.jpg
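
The instantiation might then look like this sketch, reusing the Net class defined above.

net = Net(n_feature=1, n_hidden=20, n_output=1)   # 1 input (qsec), 20 hidden, 1 output (mpg)
print(net)   # prints the layer configuration of the network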
The role of the optimization function is to minimize the loss function with respect to the model's parameters, using the chosen learning rate. The learning rate chosen here is 0.2. We also pass the neural network parameters into the optimizer. There are various optimization functions, listed next; a construction sketch follows the list.
  • SGD. Implements stochastic gradient descent (optionally with momentum). Its parameters include momentum, learning rate, and weight decay.

  • Adadelta. An adaptive learning rate method. Has five different arguments: the parameters of the network, a coefficient used for computing a running average of the squared gradients, a term added for numerical stability, the learning rate, and a weight decay parameter for regularization.

  • Adagrad. An adaptive subgradient method for online learning and stochastic optimization. Has arguments such as an iterable of parameters to optimize, the learning rate, learning rate decay, and weight decay.

  • Adam. A method for stochastic optimization. This function has six different arguments: an iterable of parameters to optimize, the learning rate, betas (coefficients used for computing running averages of the gradient and its square), a parameter to improve numerical stability, and so forth.

  • ASGD. Averaged stochastic gradient descent (acceleration of stochastic approximation by averaging). It has five different arguments: an iterable of parameters to optimize, the learning rate, a decay term, weight decay, and so forth.

  • RMSprop. Uses the magnitude of recent gradients to normalize the current gradients.

  • SparseAdam. Implements a lazy version of the Adam algorithm suitable for sparse tensors. In this variant, only the moments that show up in the gradient are updated, and only those portions of the gradient are applied to the parameters.
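
A construction sketch for these optimizers, assuming the net defined earlier; the learning rates for the alternatives are illustrative.

optimizer = torch.optim.SGD(net.parameters(), lr=0.2)   # learning rate from the text

# The alternatives listed above would be constructed similarly:
# optimizer = torch.optim.Adam(net.parameters(), lr=0.01, betas=(0.9, 0.999))
# optimizer = torch.optim.Adagrad(net.parameters(), lr=0.01, lr_decay=0.0)
# optimizer = torch.optim.RMSprop(net.parameters(), lr=0.01)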

Apart from the optimization function, a loss function needs to be selected before running the supervised learning model. Again, there are various loss functions; for a regression problem, the relevant one is the following.
  • MSELoss. Creates a criterion that measures the mean squared error between the elements of the input variable and the target variable. For regression-related problems, this is the usual default loss function.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figg_HTML.jpg
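
A minimal sketch of selecting this criterion:

loss_func = torch.nn.MSELoss()         # mean squared error for regression
# inside the training loop: loss = loss_func(prediction, y)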

After running the supervised learning model, which is a regression model, we need to print the actual versus predicted values and represent them in a graphical format; therefore, we turn on interactive plotting (for example, plt.ion() in matplotlib) so the graph can be updated during training.

Recipe 5-3. Optimization and Gradient Computation

Problem

How do we build a basic supervised neural network training model using PyTorch with different iterations?

Solution

The basic neural network model in PyTorch requires six steps: preparing the training data, initializing the weights, creating a basic network model, calculating the loss function, selecting the learning rate, and optimizing the loss function with respect to the model's parameters.

How It Works

Let’s follow a step-by-step approach to create a basic neural network model.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figh_HTML.jpg
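
Since the steps appear only in the image, here is a minimal training loop sketch, assuming the x, y, net, optimizer, and loss_func objects defined in the earlier recipes; the iteration count is illustrative.

for step in range(200):
    prediction = net(x)               # forward pass
    loss = loss_func(prediction, y)   # mean squared error

    optimizer.zero_grad()             # clear gradients from the last step
    loss.backward()                   # backpropagate the error
    optimizer.step()                  # update the weights

    if step % 50 == 0:
        print(step, loss.item())     # watch the loss decrease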

The prediction results from the model at the first iteration and the last iteration are represented in the following graph.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figi_HTML.jpg

In the initial step, the loss was 276.91. After optimization, the loss decreased to 35.1890. The fitted regression line and the way it fits the dataset are shown.

Recipe 5-4. Viewing Predictions

Problem

How do we extract the best results from the PyTorch-based supervised learning model?

Solution

The computational graph network is represented by nodes connected through functions. Various techniques can be applied to minimize the error function and get the best predictive model: we can increase the number of iterations, estimate the loss function, optimize the function, print actual and predicted values, and show them in a graph.

How It Works

To apply tensor differentiation, the backward() method needs to be called on the output tensor. Let's take an example to see how the error gradients are backpropagated. The .grad attribute of a tensor holds the final output from the tensor differentiation.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figj_HTML.jpg
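A toy example of this mechanism, with illustrative tensors:

import torch

x = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([0.5, 0.5, 0.5], requires_grad=True)

y = (w * x).sum()   # y = w1*x1 + w2*x2 + w3*x3
y.backward()        # backpropagate; dy/dw_i = x_i
print(w.grad)       # tensor([1., 2., 3.])
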
The accuracy of the supervised learning model, which is a regression use case, can be tuned with the following.
  • Number of iterations

  • Selection of the loss function

  • Selection of the optimization method

  • Learning rate

  • Decay in the learning rate

  • Momentum required for optimization

../images/474315_1_En_5_Chapter/474315_1_En_5_Figk_HTML.jpg

The real dataset looks like the following.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figl_HTML.jpg

The following script reads the mpg and qsec columns from the mtcars.csv dataset, converts the two variables to tensors (reshaping them with the unsqueeze function), and then uses them inside the neural network model for prediction.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figm_HTML.jpg
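
A minimal sketch of that flow, assuming the file name mtcars.csv and the trained net from the earlier recipes:

import pandas as pd
import torch

df = pd.read_csv("mtcars.csv")   # assumed file name
x = torch.from_numpy(df["qsec"].values).float().unsqueeze(dim=1)
y = torch.from_numpy(df["mpg"].values).float().unsqueeze(dim=1)

net.eval()                # the trained model from the earlier recipes
with torch.no_grad():     # no gradients needed for prediction
    predicted = net(x)

for actual, pred in zip(y[:5], predicted[:5]):
    print(actual.item(), round(pred.item(), 2))   # actual vs. predicted mpg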

After 1000 iterations, the model converges.

../images/474315_1_En_5_Chapter/474315_1_En_5_Fign_HTML.jpg

The neural networks in the torch library are typically built using the nn module. Let's take a look at that.

Neural networks can be constructed using the torch.nn package, which provides almost all neural network related functionality, including the following.
  • Linear layers : nn.Linear, nn.Bilinear

  • Convolution layers : nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.ConvTranspose2d

  • Nonlinearities : nn.Sigmoid, nn.Tanh, nn.ReLU, nn.LeakyReLU

  • Pooling layers : nn.MaxPool1d, nn.AvgPool2d

  • Recurrent networks : nn.LSTM, nn.GRU

  • Normalization : nn.BatchNorm2d

  • Dropout : nn.Dropout, nn.Dropout2d

  • Embedding : nn.Embedding

  • Loss functions : nn.MSELoss, nn.CrossEntropyLoss, nn.NLLLoss

The standard classification algorithm is another version of a supervised learning algorithm, in which the target column is a class variable and the features can be numeric or categorical.

Recipe 5-5. Supervised Model Logistic Regression

Problem

How do we deploy a logistic regression model using PyTorch?

Solution

The computational graph network is represented by nodes connected through functions. For logistic regression, the output layer applies a sigmoid so the model predicts a class probability. Various techniques can be applied to minimize the error function and get the best predictive model: we can increase the number of iterations, estimate the loss function, optimize the function, print actual and predicted values, and show them in a graph.

How It Works

To train the logistic regression model, tensor differentiation is applied by calling the backward() method on the loss. Let's look at an example.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figo_HTML.jpg

The following shows data preparation for a logistic regression model.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figp_HTML.jpg

Let’s look at the sample dataset for classification.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figq_HTML.jpg

Set up the neural network module for the logistic regression model.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figr_HTML.jpg
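
A minimal sketch of such a module follows; the class name, feature count, and learning rate are illustrative, and the book's actual code is in the image above.

import torch

class LogisticRegression(torch.nn.Module):
    def __init__(self, n_features):
        super(LogisticRegression, self).__init__()
        self.linear = torch.nn.Linear(n_features, 1)

    def forward(self, x):
        # The sigmoid squashes the linear output into a probability in (0, 1)
        return torch.sigmoid(self.linear(x))

model = LogisticRegression(n_features=2)
loss_func = torch.nn.BCELoss()   # binary cross entropy for a sigmoid output
optimizer = torch.optim.SGD(model.parameters(), lr=0.02)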

Check the neural network configuration.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figs_HTML.jpg

Run iterations and find the best solution for the sample graph.

../images/474315_1_En_5_Chapter/474315_1_En_5_Figt_HTML.jpg
The first iteration provides almost 99% accuracy, and subsequently, the model provides 100% accuracy on the training data (see Figures 5-3 and 5-4).
../images/474315_1_En_5_Chapter/474315_1_En_5_Fig3_HTML.jpg
Figure 5-3 Initial accuracy

../images/474315_1_En_5_Chapter/474315_1_En_5_Fig4_HTML.jpg
Figure 5-4 Final accuracy

The final accuracy of 100% is a clear case of overfitting, but we can control this by introducing dropout, which is covered in the next chapter.

Conclusion

This chapter discussed two major types of supervised learning algorithms, linear regression and logistic regression, and their implementation using sample datasets and PyTorch programs. Both algorithms are linear models, one predicting a real-valued output and the other separating one class from another. Although we considered a two-class classification in the logistic regression example, it can be extended to a multiclass classification model.
