Supervised learning is one of the most widely used branches of machine learning. It is applied in almost all fields, including artificial intelligence, cognitive computing, and language processing. Machine learning literature broadly describes three types of learning: supervised, unsupervised, and reinforcement learning. In supervised learning, the machine learns to predict a known output from labeled examples; hence, it is task driven, and the task can be classification or regression.
In unsupervised learning, the machine learns patterns from a set of input features without any labeled output, and it applies those patterns to new data. In reinforcement learning, learning happens through interaction with an environment: the system acts, the environment reacts, and the system adjusts its behavior based on that feedback.
Each object or row represents one event, and each event is categorized into groups. Identifying which group a record belongs to is called classification, in which the target variable has specific labels or tags attached to the events. For example, in a bank database, each customer is tagged as either a loyal customer or not a loyal customer. In a medical records database, each patient’s disease is tagged. In the telecom industry, each subscriber is tagged as a churn or non-churn customer. These are examples in which a supervised algorithm performs classification. The word classification comes from the classes available in the target column.
In regression learning, the objective is to predict the value of a continuous variable; for example, given the features of a property, such as the number of bedrooms, square feet, nearby areas, the township, and so forth, the asking price for the house is predicted. In such scenarios, regression models can be used. Similar examples include predicting stock prices or the sales, revenue, and profit of a business.
In an unsupervised learning algorithm, we do not have an outcome variable, and tagging or labeling is not available. We are interested in knowing the natural grouping of the observations, or records, or rows in a dataset. This natural grouping should be in such a way that within groups, similarity should be at a maximum and between groups similarity should be at a minimum.
In supervised regression techniques, the input data is also known as training data. For each record, there is a label that has a continuous numerical value. The model is prepared through a training process in which it makes predictions and is corrected when they are wrong, and the process continues until the desired level of accuracy is achieved. In real-world scenarios, there are cases where simple regression does not predict the target variable well, and we may need advanced regression methods to understand the pattern existing in the dataset.
Introduction to Linear Regression
Linear regression is among the most widely used statistical techniques because it is reliable and easy to apply. It assumes linear, additive relationships between dependent and independent variables. The objective of linear regression is to predict the dependent or target variable through independent variables. The specification of the linear regression model is as follows.
Y = α + βX
This formula has a property in which the prediction for Y is a straight-line function of each of the X variables, keeping all others fixed, and the contributions of different X variables for the predictions are additive. The slopes of their individual straight-line relationships with Y are the coefficients of the variables. The coefficients and intercept are estimated by least squares (i.e., setting them equal to the unique values that minimize the sum of squared errors within the sample of data to which the model is fitted).
The model’s prediction errors are typically assumed to be independently and identically normally distributed. When the beta coefficient is zero, the input variable X has no impact on the dependent variable. The ordinary least squares (OLS) method attempts to minimize the sum of the squared residuals. A residual is defined as the difference between an actual data point in the scatterplot and the corresponding point on the regression line. This process seeks to estimate the beta coefficients in a multiple linear regression model.
Person | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Height | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 |
Weight | 115 | 117 | 120 | 123 | 126 | 129 | 132 | 135 | 139 | 142 | 146 | 150 | 154 | 159 | 164 |
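As a minimal sketch of the least squares estimation described above, the following pure Python snippet fits Y = α + βX to the height and weight values from the table, treating height as X and weight as Y. The function name fit_simple_ols is a hypothetical helper introduced only for this illustration.

```python
# Height and weight values copied from the table above.
heights = list(range(58, 73))
weights = [115, 117, 120, 123, 126, 129, 132, 135,
           139, 142, 146, 150, 154, 159, 164]


def fit_simple_ols(x, y):
    """Return (alpha, beta) that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form least squares solution for one predictor
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    beta = s_xy / s_xx
    alpha = mean_y - beta * mean_x
    return alpha, beta


alpha, beta = fit_simple_ols(heights, weights)

# Residual = actual value minus the value on the fitted line
residuals = [yi - (alpha + beta * xi) for xi, yi in zip(heights, weights)]

print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")  # alpha = -87.52, beta = 3.45
```

With an intercept in the model, the residuals of a least squares fit sum to zero, which is a quick sanity check on the estimates.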
Why do we assume that a linear relationship exists between the dependent variable and a set of independent variables, when most real-life scenarios reflect some other type of relationship? The reasons why we stick to a linear relationship are described next.
It is easy to understand and interpret.
There are ways to transform an existing deviation from linearity and make it linear.
It is simple to generate predictions.
Linear regression rests on the following assumptions.
There should be a linear relationship between the dependent and independent variables.
There should not be any multicollinearity among the predictors. If we have two or more predictors in the input feature space, the input features should not be correlated with one another.
There should not be any autocorrelation.
There should not be any heteroscedasticity. The variance of the error term should be constant across all values of the predictors.
The error term should be normally distributed. The error term is defined as the difference between an actual and a predicted value.
Within linear regression, there are different variants, but in machine learning we consider them as one method. For example, if we are using one explanatory variable to predict the dependent variable, it is called a simple linear regression model. If we are using more than one explanatory variable, then the model is called a multiple linear regression model. Ordinary least squares (OLS) is a statistical technique for estimating the coefficients of the linear regression model; hence, the linear regression model is sometimes also known as an ordinary least squares model.
Linear regression is very sensitive to missing values and outliers because the statistical method of computing a linear regression depends on the mean, standard deviation, and covariance between the variables. The mean is sensitive to outlier values; therefore, we usually need to remove or otherwise treat the outliers before fitting the linear regression model.
In machine learning literature, the optimum beta coefficients that minimize the error in a regression model are found by a method called the gradient descent algorithm. How does the gradient descent algorithm work? It starts with initial values for the coefficients, often zero, and iteratively updates them in the direction that reduces the error, with each update scaled by a learning rate, until the error term is minimized.
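The update rule can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the toy data (generated from y = 2x + 1), the learning rate of 0.05, and the 2,000 iterations are arbitrary choices for the example, and the error being minimized is the mean squared error.

```python
# Toy data generated from y = 2x + 1 (an assumption for this sketch)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

alpha, beta = 0.0, 0.0   # start both coefficients at zero
learning_rate = 0.05     # illustrative value
n = len(x)

for _ in range(2000):
    # Prediction errors with the current coefficients
    errors = [(alpha + beta * xi) - yi for xi, yi in zip(x, y)]
    # Gradients of the mean squared error with respect to alpha and beta
    grad_alpha = (2.0 / n) * sum(errors)
    grad_beta = (2.0 / n) * sum(e * xi for e, xi in zip(errors, x))
    # Move against the gradient, scaled by the learning rate
    alpha -= learning_rate * grad_alpha
    beta -= learning_rate * grad_beta

print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")  # alpha = 1.0000, beta = 2.0000
```

A learning rate that is too large makes the updates diverge, while one that is too small needs many more iterations to reach the same coefficients.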
Understanding linear regression based on a machine learning approach requires special data preparation that avoids assumptions by keeping the original data intact. Data transformation is required to make your model more robust.
Recipe 5-1. Data Preparation for the Supervised Model
Problem
How do we perform data preparation for creating a supervised learning model using PyTorch?
Solution
We take an open source dataset, mtcars.csv, which is a regression dataset, to test how to create an input and output tensor.
How It Works
First, the necessary library needs to be imported.
The predictor for the supervised algorithm is qsec, which is used to predict the miles per gallon (mpg) provided by the car. What is important here is the data type and the shape. First, we import the data, which is in NumPy format, and convert it into a PyTorch tensor. NumPy stores these columns as 64-bit floats, so the converted tensor is a double tensor, whereas the default PyTorch tensor type is a 32-bit float. Mixing the two types causes errors when performing the optimization, so the data type of the data tensors must match the data type of the model parameters. Separately, the unsqueeze function with the dimension set to 1 reshapes each column from a one-dimensional tensor into a two-dimensional tensor with a single column.
To reproduce the same result, a manual seed needs to be set; so torch.manual_seed(1234) was used. Although the data is a tensor, checking its type shows double, because it was created from a 64-bit NumPy array.
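The conversion can be sketched as follows. The two short arrays are illustrative stand-ins for the qsec and mpg columns; in the recipe these values would be read from mtcars.csv. The final cast to 32-bit floats is one assumed way to align the data with the default parameter type of PyTorch layers.

```python
import numpy as np
import torch

torch.manual_seed(1234)  # fix the random number generator for reproducibility

# Illustrative stand-ins for the qsec and mpg columns of mtcars.csv
qsec = np.array([16.46, 17.02, 18.61, 19.44, 17.02])
mpg = np.array([21.0, 21.0, 22.8, 21.4, 18.7])

# from_numpy keeps NumPy's float64 type; unsqueeze adds a column dimension
x = torch.unsqueeze(torch.from_numpy(qsec), dim=1)
y = torch.unsqueeze(torch.from_numpy(mpg), dim=1)

print(x.shape, x.dtype)  # torch.Size([5, 1]) torch.float64

# Cast to 32-bit floats so the data matches the default layer parameter type
x32, y32 = x.float(), y.float()
print(x32.dtype)  # torch.float32
```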
Recipe 5-2. Forward and Backward Propagation
Problem
How do we build a neural network class in PyTorch that includes a forward propagation method?
Solution
Design the neural network class, including the connection from the input layer to the hidden layer and from the hidden layer to the output layer. In the neural network architecture, the number of neurons in the hidden layer also needs to be specified.
How It Works
In the Net class, we first initialize the hidden and output layers from the number of features, hidden neurons, and outputs. Then we define the forward propagation function, using the rectified linear unit (ReLU) as the activation function in the hidden layer.
The following image shows the ReLU activation function, which returns max(0, x). It is popularly used across different neural network models; however, the choice of the activation function should be based on accuracy. If we get more accuracy with a sigmoid function, we should consider that.
Now the network architecture is specified for the supervised learning model. The n_feature argument gives the number of neurons in the input layer. Since we have one input variable, qsec, we use 1. The number of neurons in the hidden layer can be decided based on the input and the degree of accuracy required in the learning model. We set n_hidden equal to 20, which means 20 neurons in the hidden layer, and the number of output neurons is 1.
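A minimal sketch of such a class is shown next. The argument names n_feature and n_hidden follow the text; the argument name n_output and the attribute names hidden and predict are assumptions made for this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super().__init__()
        # Input layer to hidden layer
        self.hidden = nn.Linear(n_feature, n_hidden)
        # Hidden layer to output layer
        self.predict = nn.Linear(n_hidden, n_output)

    def forward(self, x):
        # Forward propagation: ReLU activation on the hidden layer
        x = F.relu(self.hidden(x))
        # Linear output, suitable for regression
        return self.predict(x)


net = Net(n_feature=1, n_hidden=20, n_output=1)
print(net)

# A batch of five single-feature rows yields five single-value predictions
out = net(torch.randn(5, 1))
print(out.shape)  # torch.Size([5, 1])
```

Only the forward method is written by hand; the backward pass is derived automatically by PyTorch when backward() is called on the loss.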
SGD. Implements stochastic gradient descent (optionally with momentum). The parameters include momentum, learning rate, and weight decay.
Adadelta. Adaptive learning rate method. Has five different arguments: the parameters of the network, a coefficient used for computing a running average of the squared gradients, a term added for numerical stability, the learning rate, and a weight decay parameter to apply regularization.
Adagrad. Adaptive subgradient method for online learning and stochastic optimization. Its arguments include an iterable of parameters to optimize, the learning rate, the learning rate decay, and the weight decay.
Adam. A method for stochastic optimization. This function has six different arguments, including an iterable of parameters to optimize, the learning rate, betas (the coefficients used for computing running averages of the gradient and its square), a parameter to improve numerical stability, and so forth.
ASGD. Averaged stochastic gradient descent, based on acceleration of stochastic approximation by averaging. It has five different arguments, including an iterable of parameters to optimize, the learning rate, a decay term, weight decay, and so forth.
RMSprop. Normalizes each gradient by a running average of the magnitudes of recent gradients.
SparseAdam. Implements a lazy version of the Adam algorithm suitable for sparse tensors. In this variant, only moments that show up in the gradient are updated, and only those portions of the gradient are applied to the parameters.
MSELoss. Creates a criterion that measures the mean squared error between elements in the input variable and target variable. For regression-related problems, this is a common default loss function.
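As a rough illustration of how these optimizers and the loss criterion are created, the following sketch instantiates several of them for a placeholder model. The single linear layer and all hyperparameter values are assumptions chosen only for the example, not recommended settings.

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)  # placeholder model for the sketch
params = list(model.parameters())

# Each optimizer takes the parameters to optimize plus its own hyperparameters
optimizers = {
    "SGD": torch.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=1e-4),
    "Adadelta": torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-6),
    "Adagrad": torch.optim.Adagrad(params, lr=0.01, lr_decay=0.0),
    "Adam": torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-8),
    "ASGD": torch.optim.ASGD(params, lr=0.01),
    "RMSprop": torch.optim.RMSprop(params, lr=0.01, alpha=0.99),
}

# Mean squared error between predictions and targets
loss_func = nn.MSELoss()
prediction = torch.tensor([[2.0], [4.0]])
target = torch.tensor([[1.0], [6.0]])
loss = loss_func(prediction, target)  # ((2 - 1)^2 + (4 - 6)^2) / 2

print(loss.item())  # 2.5
```

In practice only one optimizer is attached to a model at a time; they are collected in a dictionary here only to show the constructor calls side by side.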
After running the supervised learning model, which is a regression model, we need to print the actual vs. predicted values and represent them in a graphical format; therefore, we need to turn on interactive plotting.
Recipe 5-3. Optimization and Gradient Computation
Problem
How do we build a basic supervised neural network training model using PyTorch with different iterations?
Solution
The basic neural network model in PyTorch requires six different steps: preparing training data, initializing weights, creating a basic network model, calculating loss function, selecting the learning rate, and optimizing the loss function with respect to the parameters of the model.
How It Works
Let’s follow a step-by-step approach to create a basic neural network model.
The final prediction result from the model with the first iteration and the last iteration is now represented in the following graph.
In the initial step, the loss was 276.91. After optimization, the loss became 35.1890. The graph shows the fitted regression line and how it fits the dataset.
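The six steps listed in the solution can be sketched as a short training loop. This is not the mtcars model: it uses synthetic data generated from y = 2x + 1, a small nn.Sequential network, and an assumed learning rate of 0.05, so its loss values differ from those reported above.

```python
import torch
import torch.nn as nn

torch.manual_seed(1234)

# Step 1: prepare training data (synthetic, an assumption for this sketch)
x = torch.linspace(-1, 1, 50).unsqueeze(1)
y = 2 * x + 1

# Steps 2 and 3: create the model; nn.Linear initializes its own weights
model = nn.Sequential(nn.Linear(1, 20), nn.ReLU(), nn.Linear(20, 1))

# Step 4: choose the loss function
loss_func = nn.MSELoss()

# Step 5: select the learning rate (illustrative value)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# Step 6: optimize the loss with respect to the model parameters
losses = []
for step in range(200):
    prediction = model(x)            # forward pass
    loss = loss_func(prediction, y)  # compute the error
    optimizer.zero_grad()            # clear gradients from the previous step
    loss.backward()                  # backpropagate the error
    optimizer.step()                 # update the parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Calling zero_grad() before backward() matters because PyTorch accumulates gradients across calls by default.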
Recipe 5-4. Viewing Predictions
Problem
How do we extract the best results from the PyTorch-based supervised learning model?
Solution
The computational graph network is represented by nodes and connected through functions. Various techniques can be applied to minimize the error function and get the best predictive model. We can increase the iteration numbers, estimate the loss function, optimize the function, print actual and predicted values, and show it in a graph.
How It Works
To apply tensor differentiation, the backward() method needs to be called on the output tensor, which is typically the loss. Let’s take an example to see how the error gradients are backpropagated. After the call, the grad attribute of each tensor that requires gradients holds the result of the differentiation.
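A minimal sketch of tensor differentiation on a single value is shown next. The function y = x² + 3x is an arbitrary choice for illustration; its derivative, 2x + 3, equals 7 at x = 2.

```python
import torch

# requires_grad=True asks PyTorch to track operations on x
x = torch.tensor(2.0, requires_grad=True)

y = x ** 2 + 3 * x  # dy/dx = 2x + 3

y.backward()        # compute the gradient of y with respect to x

print(x.grad)       # tensor(7.)
```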
The following settings can be tuned to get the best results.
Number of iterations
Type of loss function
Selection of optimization method
Learning rate
Decay in the learning rate
Momentum required for optimization
The real dataset looks like the following.
The following script explains reading the mpg and qsec columns from the mtcars.csv dataset. It converts those two variables to tensors using the unsqueeze function, and then uses it inside the neural network model for prediction.
After 1000 iterations, the model converges.
Neural networks in PyTorch are typically built with the torch.nn module. Let’s take a look at what it provides.
Linear layers: nn.Linear, nn.Bilinear
Convolution layers: nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.ConvTranspose2d
Nonlinearities: nn.Sigmoid, nn.Tanh, nn.ReLU, nn.LeakyReLU
Pooling layers: nn.MaxPool1d, nn.AvgPool2d
Recurrent networks: nn.LSTM, nn.GRU
Normalization: nn.BatchNorm2d
Dropout: nn.Dropout, nn.Dropout2d
Embedding: nn.Embedding
Loss functions: nn.MSELoss, nn.CrossEntropyLoss, nn.NLLLoss
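As a brief sketch of how these building blocks combine, the following example stacks a few of the listed modules with nn.Sequential. The layer sizes and the dropout probability are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

# Stack a linear layer, a nonlinearity, dropout, and an output layer
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(8, 1),
)

model.eval()  # evaluation mode disables dropout

batch = torch.randn(3, 4)  # three rows with four features each
output = model(batch)

# A loss module is applied in the same way as a layer
loss = nn.MSELoss()(output, torch.zeros(3, 1))

print(output.shape)  # torch.Size([3, 1])
```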
The standard classification algorithm is another version of a supervised learning algorithm, in which the target column is a class variable and the features could be numeric and categorical.
Recipe 5-5. Supervised Model Logistic Regression
Problem
How do we deploy a logistic regression model using PyTorch?
Solution
The computational graph network is represented by nodes and connected through functions. Various techniques can be applied to minimize the error function and get the best predictive model. We can increase the iteration numbers, estimate the loss function, optimize the function, print actual and predicted values, and show it in a graph.
How It Works
To apply tensor differentiation, the backward() method needs to be called on the loss tensor. Let’s look at an example.
The following shows data preparation for a logistic regression model.
Let’s look at the sample dataset for classification.
Set up the neural network module for the logistic regression model.
Check the neural network configuration.
Run iterations and find the best solution for the sample graph.
The final accuracy shows 100 percent, which is a likely sign of overfitting, but we can control this by introducing dropout, which is covered in the next chapter.
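The steps in this recipe can be sketched compactly as follows. This is an illustration under stated assumptions, not the recipe's own listing: the data is two synthetic Gaussian clusters, the model is a single linear layer trained with nn.BCEWithLogitsLoss, and the learning rate and iteration count are arbitrary. Its accuracy is not expected to match the figure quoted above.

```python
import torch
import torch.nn as nn

torch.manual_seed(1234)

# Data preparation: two synthetic clusters, one per class (an assumption)
n = 100
class0 = torch.randn(n, 2) + 2.0
class1 = torch.randn(n, 2) - 2.0
x = torch.cat([class0, class1])
y = torch.cat([torch.zeros(n, 1), torch.ones(n, 1)])

# Logistic regression: one linear layer; the sigmoid is folded into the loss
model = nn.Linear(2, 1)
loss_func = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Run iterations to minimize the classification loss
for step in range(200):
    logits = model(x)
    loss = loss_func(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Convert predicted probabilities to class labels and measure accuracy
with torch.no_grad():
    predicted = (torch.sigmoid(model(x)) > 0.5).float()
    accuracy = (predicted == y).float().mean().item()

print(f"training accuracy: {accuracy:.2f}")
```

Because the accuracy here is measured on the same data used for training, it says little about generalization; a held-out set would be needed for that.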
Conclusion
This chapter discussed two major types of supervised learning algorithms, linear regression and logistic regression, and their implementation using sample datasets and PyTorch. Both algorithms are linear models, one for predicting real-valued output and the other for separating one class from another class. Although we considered a two-class classification in the logistic regression example, it can be extended to a multiclass classification model.