
2. Supervised Learning for Regression Analysis

Vaibhav Verdhan, Limerick, Ireland

“The only certainty is uncertainty.”

— Pliny the Elder

The future has uncertainty for sure. The best we can do is plan for it. And for meticulous planning, we need to have a peek into the future. If we know beforehand what the expected demand for our product is, we can manufacture adequately—not more, not less. We can also rectify the bottlenecks in our supply chain if we know what the expected traffic of goods is. Airports can plan resources better if they know the expected inflow of passengers. Ecommerce portals can plan for the expected load if they know how many clicks to expect during the upcoming sale season.

It may not be possible to forecast with complete accuracy, but these values still have to be predicted. This is done, with or without ML-based predictive modeling, for financial budget planning, allocation of resources, guidance issued, expected rate of growth, and so on. Hence, the estimation of such values is of paramount importance. In this second chapter, we will study the concepts used to predict such values.

In the first chapter, we introduced supervised learning. The differentiation between supervised, unsupervised, and semi-supervised was also discussed. We also examined the two types of supervised algorithms: regression and classification. In this second chapter, we will study and develop deeper concepts of supervised regression algorithms.

We will examine the regression process, how a model is trained, what happens behind the scenes, and finally the execution of each algorithm. The assumptions for the algorithms, the pros and cons, and the statistical background beneath each one will be studied. We will also develop code in Python using the algorithms. The steps of data preparation, data preprocessing, variable creation, train-test split, fitting the ML model, getting the significant variables, and measuring the accuracy will all be studied and developed using Python. The code and datasets are uploaded to a GitHub repository for easy access. You are advised to replicate the code yourself.

Technical Toolkit Required

We are going to use Python 3.5 or above in this book. You are advised to get Python installed on your machine. We will be using Jupyter notebook; installing Anaconda-Navigator is required for executing the codes. All the datasets and codes have been uploaded to the Github repository at https://github.com/Apress/supervised-learning-w-python/tree/master/Chapter2 for easy download and execution.

The major libraries used are numpy, pandas, matplotlib, seaborn, scikit learn, and so on. You are advised to install these libraries in your Python environment.

Let us go into the regression analysis and examine the concepts in detail!

Regression Analysis and Use Cases

Regression analysis is used to estimate the value of a continuous variable. Recall that a continuous variable is a variable which can take any numerical value; for example, sales, amount of rainfall, number of customers, number of transactions, and so on. If we want to estimate the sales for the next month or the number of passengers expected to visit the terminal in the next week or the number of customers expected to make bank transactions, we use regression analysis.

Simply put, if we want to predict the value of a continuous variable using supervised learning algorithms, we use regression methods. Regression analysis is quite central to business problem solving and decision making. The predicted values help the stakeholders allocate resources accordingly and plan for the expected increase/decrease in the business.

The following use cases will make the usage of regression algorithms clear:
  1. A retailer wants to estimate the number of customers it can expect in the upcoming sale season. Based on the estimation, the business can plan the inventory of goods, the number of staff required, the resources required, and so on to be able to cater to the demand. Regression algorithms can help with that estimation.

  2. A manufacturing plant is doing yearly budget planning. As a part of the exercise, expenditures under various headings like electricity, water, raw material, human resources, and so on have to be estimated in relation to the expected demand. Regression algorithms can help assess historical data and generate estimates for the business.

  3. A real estate agency wishes to increase its customer base. One important attribute is the price of the apartments, which has to be estimated and set judiciously. The agency wants to analyze multiple parameters which impact the price of property, and this can be achieved by regression analysis.

  4. An international airport wishes to improve its planning and gauge the expected traffic in the next month. This will allow the team to maintain the best service quality. Regression analysis can help make an estimate of the number of expected passengers.

  5. A bank which offers credit cards to its customers has to identify how much credit should be offered to a new customer. Based on customer details like age, occupation, monthly salary, expenditure, historical records, and so on, a credit limit has to be prescribed. Supervised regression algorithms will be helpful in that decision.

     
There are quite a few statistical algorithms to model regression problems. The major ones are listed here:
  1. Linear regression
  2. Decision tree
  3. Random forest
  4. SVM
  5. Bayesian methods
  6. Neural networks

We will study the first three algorithms in this chapter and the rest in Chapter 4. We are starting with linear regression in the next section.

What Is Linear Regression

Recall from the section in Chapter 1 where we discussed the house price prediction problem using area, number of bedrooms, balconies, location, and so on. Figure 2-1(i) shows the representation of the data in a vector-space diagram, while on the right in Figure 2-1(ii) we have suggested an ML equation, termed the line of best fit, to explain the randomness in the data and predict the prices.
../images/499122_1_En_2_Chapter/499122_1_En_2_Fig1_HTML.png
Figure 2-1

(i) The data in a vector-space diagram depicts how price is dependent on various variables. (ii) An ML regression equation called the line of best fit is used to model the relationship here, which can be used for making future predictions for the unseen dataset.

In the preceding example, there is an assumption that price is correlated to size, number of bedrooms, and so on. We discussed correlation in the last chapter. Let’s refresh some points on correlation:
  • Correlation analysis is used to measure the strength of association (linear relationship) between two variables.

  • If two variables move together in the same direction, they are positively correlated; for example, height and weight have a positive relationship. If the two variables move in opposite directions, they are negatively correlated; for example, cost and profit are negatively related.

  • The correlation coefficient ranges from –1 to +1. A value of –1 indicates perfect negative correlation, and +1 indicates perfect positive correlation.

  • If the correlation is 0, it means there is hardly any linear relationship; for example, the price of shoes and the price of computers will have a correlation close to 0. A quick way to compute correlation in Python is shown after this list.
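As referenced in the list above, the following is a minimal sketch, assuming NumPy is available, of computing the Pearson correlation coefficient between two variables; the area and price numbers are the toy housing values used later in this chapter.
import numpy as np

# Toy data: house area (sq ft) and price (in 1000 $)
area  = np.array([1200, 1500, 1600, 1100, 1500, 2200, 2100, 1410])
price = np.array([100, 120, 125, 95, 125, 200, 195, 110])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is the coefficient
corr_matrix = np.corrcoef(area, price)
print(corr_matrix[0, 1])   # a value close to +1 indicates a strong positive linear relationship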

The objective of the linear regression analysis is to measure this relationship and arrive at a mathematical equation for the relationship. The relationship can be used to predict the values for unseen data points. For example, in the case of the house price problem, predicting the price of a house will be the objective of the analysis.

Formally put, linear regression is used to predict the value of a dependent variable based on the values of at least one independent variable. It also explains the impact of changes in an independent variable on the dependent variable. The dependent variable is also called the target variable or endogenous variable. The independent variable is also termed the explanatory variable or exogenous variable.

Linear regression has been in existence for a long time. Though there are quite a few advanced algorithms (some of which we discuss in later chapters), linear regression is still widely used. It serves as a benchmark model and generally is the first step in studying supervised learning. You are advised to understand and examine linear regression closely before you graduate to higher-degree algorithms.

Let us say we have a set of observations of x and Y where x is the independent variable and Y is the dependent variable. Equation 2-1 describes the linear relation between x and Y:
$$ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$
(Equation 2-1)

where

Yi = Dependent variable or the target variable which we want to predict

xi = Independent variable or the predicting variables used to predict the value of Y

β0 = Population Y intercept. It represents the value of Y when the value of xi is zero

β1 = Population slope coefficient. It represents the expected change in the value of Y by a unit change in xi

ε = random error term in the model

Sometimes, β0 and β1 are called the population model coefficients.

In the preceding equation, we are suggesting that the changes in Y are assumed to be caused by changes in x. And hence, we can use the equation to predict the values of Y. The representation of the data and Equation 2-1 will look like Figure 2-2.
../images/499122_1_En_2_Chapter/499122_1_En_2_Fig2_HTML.jpg
Figure 2-2

A linear regression equation depicted on a graph showing the intercept, slope, and actual and predicted values for the target variable; the red line shows the line of best fit

These model coefficients (β0 and β1) have an important role to play. The Y intercept (β0) is the value of the dependent variable when the value of the independent variable is 0; that is, it is the default value of the dependent variable. The slope (β1) is the change expected in the value of Y with a unit change in the value of x. It measures the impact of x on the value of Y. The higher the absolute value of β1, the higher the final impact.

Figure 2-2 also shows the predicted values. We can understand that for the value of x for observation i, the actual value of the dependent variable is Yi and the predicted or estimated value is $$ \hat{Y}_i $$.

There is one more term here, the random error, which is represented by ε. After we have made the estimates, we would like to know how well we have done, that is, how far the predicted value is from the actual value. This is captured by the random error, which is the difference between the predicted and actual values of Y and is given by $$ \varepsilon = \left( \hat{Y}_i - Y_i \right) $$. It is important to note that the smaller the value of this error, the better the prediction.

There is one more important point to ponder upon here. There can be multiple lines which can be said to represent the relationship. For example, in the house price prediction problem, we can use many equations to determine the relationship as shown in Figure 2-3(ii) in different colors.
../images/499122_1_En_2_Chapter/499122_1_En_2_Fig3_HTML.png
Figure 2-3

(i) The data in a vector-space diagram depicts how price is dependent on various variables. (ii) While trying to model for the data, there can be multiple linear equations which can fit, but the objective will be to find the equation which gives the minimum loss for the data at hand.

Hence, it turns out that we have to find out the best mathematical equation which can minimize the random error and hence can be used for making the predictions. This is achieved during training of the regression algorithm. During the process of training the linear regression model, we get the values of β0 and β1, which will minimize the error and can be used to generate the predictions for us.

The linear regression model makes some assumptions about the data being used. These conditions are a litmus test to check whether the data we are analyzing meets the requirements. More often than not, the data does not conform to the assumptions, and hence we have to take corrective measures to make it better. These assumptions are discussed now.

Assumptions of Linear Regression

The linear regression has a few assumptions which need to be fulfilled. We check these conditions and depending on the results, decide on next steps:
  1. Linear regression needs the relationship between the dependent and independent variables to be linear. It is also important to check for outliers since linear regression is sensitive to outliers. The linearity assumption can best be tested with scatter plots. In Figure 2-4, we can see the meaning of linearity vs. nonlinearity. The figure on the left has a somewhat linear relationship, whereas the one on the right does not.
    ../images/499122_1_En_2_Chapter/499122_1_En_2_Fig4_HTML.jpg
    Figure 2-4

    (i) There is a linear relationship between X and Y. (ii) There is not much of a relationship between X and Y variables; in such a case it will be difficult to model the relationship.

     
  2. Linear regression requires all the independent variables to be multivariate normal. This assumption can be tested using a histogram or a Q-Q plot. Normality can also be checked with a goodness-of-fit test like the Kolmogorov-Smirnov test. If the data is not normally distributed, a nonlinear transformation like a log transformation can help fix the issue.

     
  3. The third assumption is that there is little or no multicollinearity in the data. Multicollinearity arises when the independent variables are highly correlated with each other. It can be tested using three methods:
    a. Correlation matrix: In this method, we measure Pearson's bivariate correlation coefficient between all the independent variables. The closer the value is to 1 or –1, the higher the correlation.

    b. Tolerance: Tolerance is derived during the regression analysis itself and measures how strongly one independent variable is influenced by the other independent variables. It is represented as (1 – R2); we discuss R2 in the next section. If the value of tolerance is less than 0.1, there is a chance that multicollinearity is present in the data. If it is less than 0.01, then we are sure that multicollinearity is indeed present.

    c. Variance Inflation Factor (VIF): VIF can be defined as the inverse of tolerance (1/T). If the value of VIF is greater than 10, there is a chance that multicollinearity is present in the data. If it is greater than 100, then we are sure that multicollinearity is indeed present. A code sketch for computing VIF and related diagnostics appears after Figure 2-5.

    Note Centering the data (deducting the mean from each score) helps to solve the problem of multicollinearity. We will examine this in detail in Chapter 5!
  4. Linear regression assumes that there is little or no autocorrelation present in the data. Autocorrelation means that the residuals are not independent of each other. The best example of autocorrelated data is time-series data like stock prices, where the price at time Tn+1 depends on the price at Tn. While a scatter plot can be used to check for autocorrelation, we can also use the Durbin-Watson d-test. Its null hypothesis is that the residuals are not autocorrelated. The statistic d can generally take any value between 0 and 4; a value of d around 2 means there is no autocorrelation, and as a rule of thumb 1.5 < d < 2.5 indicates no autocorrelation in the data. But there is a catch in the Durbin-Watson test: since it analyzes only linear autocorrelation and only between direct neighbors, a scatter plot often serves the purpose for us.

     
  5. The last assumption in a linear regression problem is homoscedasticity, which means that the residuals have equal variance along the regression line. While a scatter plot can be used to check for homoscedasticity, the Goldfeld-Quandt test can also be used to check for heteroscedasticity. This test divides the data into two groups and then tests whether the variance of the residuals is similar across the two groups. Figure 2-5 shows an example of residuals which are not homoscedastic; heteroscedasticity results in a distinctive cone shape in the residual plot. In such a case, a nonlinear correction can fix the problem.

     
../images/499122_1_En_2_Chapter/499122_1_En_2_Fig5_HTML.jpg
Figure 2-5

Heteroscedasticity is present in the dataset and hence there is a cone-like shape in the residual scatter plot
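To make these checks concrete, the following is a minimal sketch, assuming statsmodels is installed, of how the multicollinearity (VIF) and autocorrelation (Durbin-Watson) diagnostics described above can be run; the DataFrame, its column names, and its values are hypothetical and purely illustrative.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

# Hypothetical toy data: two predictors and a continuous target
df = pd.DataFrame({
    'sqft':     [1200, 1500, 1600, 1100, 1500, 2200, 2100, 1410],
    'bedrooms': [2, 3, 3, 2, 2, 4, 4, 2],
    'price':    [100, 120, 125, 95, 125, 200, 195, 110]})

X = sm.add_constant(df[['sqft', 'bedrooms']])   # predictors plus an intercept column
y = df['price']

# Multicollinearity check: VIF for each predictor (the constant column is skipped)
for i, col in enumerate(X.columns):
    if col != 'const':
        print(col, variance_inflation_factor(X.values, i))

# Fit an OLS model and check the residuals for autocorrelation
model = sm.OLS(y, X).fit()
print('Durbin-Watson:', durbin_watson(model.resid))   # a value around 2 suggests no autocorrelation
# A Goldfeld-Quandt test for heteroscedasticity is also available in statsmodels.stats.diagnostic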

It is vital to test these assumptions, and more often than not we have to transform and clean our data accordingly. This is imperative, since we want our solution to work very well and have a good prediction accuracy. An accurate ML model will have a low loss. The accuracy measurement process for an ML model is based on the target variable, and it is different for classification and regression problems. In this chapter, we are discussing regression-based models, while in the next chapter we will examine the classification model's accuracy measurement parameters. That brings us to the important topic of measuring the accuracy of a regression problem, which is the next topic of discussion.

Measuring the Efficacy of Regression Problem

There are different ways to measure the robustness of a regression model. The ordinary least squares (OLS) method is one of the most used and most quoted. In this method, β0 and β1 are obtained by finding the values which minimize the sum of the squared distances between Y and $$ \hat{Y} $$, which are nothing but the actual and predicted values of the dependent variable, as shown in Figure 2-6. This is referred to as the loss function, that is, the loss incurred in making a prediction. Needless to say, we want to minimize this loss to get the best model. The errors are also referred to as residuals.
../images/499122_1_En_2_Chapter/499122_1_En_2_Fig6_HTML.jpg
Figure 2-6

The difference between the actual values and predicted values of the target variable. This is the error made while making the prediction, and as a best model, this error is to be minimized.

A very important point: why do we take the square of the errors? The reason is that if we do not square the errors, the positive and negative terms can cancel each other out. For example, if error1 is +2 and error2 is –2, then the net error will be 0!

From Figure 2-6, we can say that the minimum squared sum of error is
$$ \min \sum e_i^2 = \min \sum \left( Y_i - \hat{Y}_i \right)^2 = \min \sum \left[ Y_i - \left( \beta_0 + \beta_1 x_i \right) \right]^2 $$
(Equation 2-2)
The estimated slope coefficient is
$$ \beta_1 = \frac{\sum_{i=1}^{n} \left( x_i - \bar{x} \right) \left( y_i - \bar{y} \right)}{\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2} $$
(Equation 2-3)
The estimated intercept coefficient is
$$ \beta_0 = \bar{y} - \beta_1 \bar{x} $$
(Equation 2-4)

A point to be noted is that the regression line always passes through the point ($$ \bar{x} $$, $$ \bar{y} $$).
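As a quick check of these formulas, here is a minimal NumPy sketch that applies Equations 2-3 and 2-4 directly; the area and price numbers are the toy housing values used in the simple linear regression example later in this chapter, and the printed coefficients should roughly agree with the ones quoted there.
import numpy as np

# Toy data: area (sq ft) and price (in 1000 $)
x = np.array([1200, 1500, 1600, 1100, 1500, 2200, 2100, 1410], dtype=float)
y = np.array([100, 120, 125, 95, 125, 200, 195, 110], dtype=float)

# Equation 2-3: slope = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Equation 2-4: intercept = y_bar - slope * x_bar
beta_0 = y.mean() - beta_1 * x.mean()

print(beta_0, beta_1)   # roughly -28.07 and 0.10 for this data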

Based on the preceding discussion, we can study measures of variation. The total variation is made of two parts, which can be represented by the following equation:
$$ SST = SSR + SSE $$
$$ \text{Total sum of squares} = \text{Regression sum of squares} + \text{Error sum of squares} $$
$$ SST = \sum \left( y_i - \bar{y} \right)^2 \qquad SSR = \sum \left( \hat{Y}_i - \bar{y} \right)^2 \qquad SSE = \sum \left( y_i - \hat{Y}_i \right)^2 $$

where $$ \bar{y} $$: Average value of the dependent variable

Yi: Observed value of the dependent variable

$$ \hat{Y}_i $$: Predicted value of y for the given value of xi

In the preceding equation, SST is the total sum of squares. It measures the variation of yi around their mean $$ \bar{y} $$. SSR is the regression sum of squares, which is the explained variation attributable to the linear relationship between x and y. SSE is the error sum of squares, which is the variation attributable to factors other than the linear relationship between x and y. The best way to understand this concept is by means of Figure 2-7.
../images/499122_1_En_2_Chapter/499122_1_En_2_Fig7_HTML.png
Figure 2-7

SST = Total sum of squares, SSE = Error sum of squares, and SSR = Regression sum of squares

The concepts discussed in the preceding section form a foundation for us to study the various measures and parameters used to check the accuracy of a regression model, which we discuss now:
  1. Mean absolute error (MAE): As the name suggests, it is the average of the absolute difference between the actual and predicted values of the target variable, as shown in Equation 2-5.

     
$$ \text{Mean absolute error} = \frac{\sum \left| \hat{Y}_i - y_i \right|}{n} $$
(Equation 2-5)
The greater the value of MAE, the greater the error in our model.
  2. Mean squared error (MSE): MSE is the average of the square of the error, that is, the difference between the actual and predicted values, as shown in Equation 2-6. Similar to MAE, a higher value of MSE means a higher error in our model.

     
$$ \text{Mean squared error} = \frac{\sum \left( \hat{Y}_i - y_i \right)^2}{n} $$
(Equation 2-6)
  3. Root mean squared error (RMSE): RMSE is the square root of the average squared error and is represented by Equation 2-7.

     
$$ \text{Root mean squared error} = \sqrt{\frac{\sum \left( \hat{Y}_i - y_i \right)^2}{n}} $$
(Equation 2-7)
  4. R square (R2): It represents how much of the randomness in the data is being explained by our model; in other words, out of the total variation in our data, how much our model is able to decipher.

    R2 = SSR/SST = Regression sum of squares/Total sum of squares

    R2 will always be between 0 and 1, or 0% and 100%. The higher the value of R2, the better it is. The way to visualize R2 is depicted in the following figures. In Figure 2-8, the value of R2 is equal to 1, which means 100% of the variation in the value of Y is explainable by x. Figure 2-9 depicts the remaining cases: when R2 is between 0 and 1, some of the variation is understood by the model, and when R2 = 0, no variation is understood by the model. In a normal business scenario, we get R2 between 0 and 1, that is, a portion of the variation is explained by the model.
    ../images/499122_1_En_2_Chapter/499122_1_En_2_Fig8_HTML.jpg
    Figure 2-8

    R2 is equal to 1; this shows that 100% of the variation in the values of the independent variable is explained by our regression model

    ../images/499122_1_En_2_Chapter/499122_1_En_2_Fig9_HTML.jpg
    Figure 2-9

    If the value of R2 is between 0 and 1 then partial variation is explained by the model. If the value is 0, then no variation is explained by the regression model

     
  5. Pseudo R square (also called adjusted R square): It extends the concept of R square by penalizing the value if we include insignificant variables in the model. We can calculate pseudo R2 as in Equation 2-8.

     
$$ \text{Pseudo } R^2 = 1 - \frac{SSE / \left( n - k - 1 \right)}{SST / \left( n - 1 \right)} $$
(Equation 2-8)

where n is the sample size and k is the number of independent variables

Using R square, we measure all the randomness explained by the model. But if we include all the independent variables, including insignificant ones, we cannot claim that the model is robust. Hence, pseudo R square is a better measure of the robustness of a model.

Tip

Between R2 and pseudo R2, we prefer pseudo R2. The higher the value, the better the model.
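As a quick illustration of these measures, the following is a minimal sketch using scikit-learn's metrics module on hypothetical actual and predicted arrays; the pseudo (adjusted) R2 line follows Equation 2-8, and the predicted values are invented for illustration only.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values of a continuous target
y_true = np.array([100, 120, 125, 95, 125, 200, 195, 110])
y_pred = np.array([95, 126, 136, 85, 126, 198, 187, 117])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

# Pseudo (adjusted) R2 as in Equation 2-8, with n observations and k independent variables
n, k = len(y_true), 1
pseudo_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(mae, mse, rmse, r2, pseudo_r2)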

Now we are clear on the assumptions of a regression model and how we can measure the performance of a regression model. Now let us study types of linear regression and then develop a Python solution.

Linear regression can be studied in two formats: simple linear regression and multiple linear regression.

Simple linear regression is, as its name implies, simple to understand and easy to use. It is used to predict the value of the target variable using only one independent variable. As in the preceding example, if we have only the house area as an independent variable to predict the house prices, it is an example of simple linear regression. The sample data, which contains only the area and the price, is shown in the following table; Figure 2-10 shows the scatter plot of the same data.

Square feet (x)    Price (in 1000 $)
1200               100
1500               120
1600               125
1100               95
1500               125
2200               200
2100               195
1410               110

../images/499122_1_En_2_Chapter/499122_1_En_2_Fig10_HTML.jpg
Figure 2-10

Scatter plot of the price and area data

We can represent the data in Equation 2-9.
$$ \text{Price} = \beta_0 + \beta_1 \times \text{Area} $$
(Equation 2-9)

Here price is Y (dependent variable) and area is the x variable (independent variable). The goal of the simple linear regression will be to estimate the values of β0 and β1 so that we can predict the values of price.

If we run a regression analysis, we will get the respective values of coefficients β0 and β1 as –28.07 and 0.10. We will be running the code using Python later, but let’s interpret the meaning for understanding.
  1. β0 having a value of –28.07 means that when the square footage is 0, the price is –$28,070. Now, it is not possible to have a house with 0 area; it just indicates that, for houses within the range of sizes observed, $28,070 is the portion of the house price not explained by the area.

  2. β1 being 0.10 indicates that the average value of a house increases by 0.10 × ($1000) = $100, on average, for each additional square foot of size.

There is an important limitation to linear regression: we cannot use it to make predictions beyond the range of the variable values used in training. For example, in the preceding dataset we cannot use the model to predict house prices for areas above 2200 square feet or below 1100 square feet, because we have not trained the model for such values. Hence, the model is good enough only for values between the minimum and maximum limits; in the preceding example the valid range is 1100 square feet to 2200 square feet.
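As a sanity check, the following minimal sketch, assuming scikit-learn is available, fits the toy table above and should reproduce coefficients close to the ones quoted (β0 ≈ –28.07, β1 ≈ 0.10); the 1800-square-foot query is an illustrative value that lies inside the trained range.
import numpy as np
from sklearn.linear_model import LinearRegression

area = np.array([1200, 1500, 1600, 1100, 1500, 2200, 2100, 1410]).reshape(-1, 1)
price = np.array([100, 120, 125, 95, 125, 200, 195, 110])   # in 1000 $

model = LinearRegression()
model.fit(area, price)

print(model.intercept_, model.coef_[0])    # approximately -28.07 and 0.10
print(model.predict(np.array([[1800]])))   # predict only within the trained range (1100 to 2200 sq ft)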

It is time to hit the lab and solve the problem using a dataset in Python. We will be using scikit-learn's LinearRegression; the following are a few of its important hyperparameters, attributes, and functions, which are used when we train the model, set the parameters, and make predictions:
  • Common hyperparameters
    • fit_intercept: whether to calculate the intercept for the model; it is not required if the data is already centered

    • normalize: if set, X is normalized before regression by subtracting the mean and dividing by the l2-norm; by standardizing the data before fitting the model, the coefficients tell us the importance of the features

  • Common attributes
    • coef_: the weights (coefficients) estimated for each independent variable

    • intercept_: the bias or independent term of the linear model

  • Common functions
    • fit: trains the model; it takes X and Y as input values

    • predict: once the model is trained, Y can be predicted for a given X using the predict function

  • Training the model
    • X should be in a rows-of-observations format, that is, X.ndim == 2

    • Y should be 1D for a single target and 2D for more than one target

    • the fit function is used for training the model

We will be creating two examples for simple linear regression. In the first example, we are going to generate a dataset and fit a simple linear regression to it. Then in example 2, a simple linear regression problem will be solved using a real dataset. Let us proceed to example 1 now.

Example 1: Creating a Simple Linear Regression

We are going to create a simple linear regression by generating a dataset. This code snippet is only a warm-up exercise and will familiarize you with the simplest form of simple linear regression.

Step 1: Import the necessary libraries in the first step. We are going to import numpy, pandas, matplotlib, and scikit learn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
Step 2: In the next step, we are going to create a sample dataset which can be used for regression.
X,Y = make_regression(n_features=1, noise=5, n_samples=5000)

In the preceding code, n_features is the number of features we want to have in the dataset, n_samples is the number of samples we want to generate, and noise is the standard deviation of the Gaussian noise which gets applied to the output.

Step 3: Plot the data using matplotlib library. Xlabel and ylabel will give the labels to the figure.
plt.xlabel('Feature - X')
plt.ylabel('Target - Y')
plt.scatter(X,Y,s=5)
../images/499122_1_En_2_Chapter/499122_1_En_2_Figa_HTML.jpg
Step 4: Initialize an instance of linear regression now. The name of the variable is linear_model
linear_model = LinearRegression()
Step 5: Fit the linear regression now. The input independent variable is X and target variable is Y
linear_model.fit(X,Y)
Step 6: The model is trained now. Let us have a look at the coefficient values for both the intercept and the slope of the linear regression model
linear_model.coef_
linear_model.intercept_

The value for the intercept is 0.6759344 and slope is 33.1810

Step 7: We use this trained model to predict the value using X and then plot it
pred = linear_model.predict(X)
plt.scatter(X,Y,s=25, label="training")
plt.scatter(X,pred,s=25, label="prediction")
plt.xlabel('Feature - X')
plt.ylabel('Target - Y')
plt.legend()
plt.show()
../images/499122_1_En_2_Chapter/499122_1_En_2_Figb_HTML.jpg

Blue dots map to the actual target data, while orange dots represent the predicted values.

We can see how close the actual and predicted values are. Now, we will create a solution for simple linear regression using a real dataset.

Example 2: Simple Linear Regression for Housing Dataset

We have a dataset which has one independent variable (area in square feet), and we have to predict the prices. It is again an example of simple linear regression, that is, we have one input variable. The code and dataset are uploaded to the Github link shared at the start of this chapter.

Step 1: Import all the required libraries here:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")
Step 2: Load the data set using pandas function:
house_df= pd.read_csv('House_data_LR.csv')
Note

It is possible that the data is present in the form of an .xls or .txt file. Sometimes, the data has to be loaded directly from a database by making a connection to that database.

house_df.head()
../images/499122_1_En_2_Chapter/499122_1_En_2_Figc_HTML.jpg
Step 3: Check if there is any null value present in the dataset:
house_df.isnull().any()
Step 4: There is a variable which does not make sense. We are dropping the variable “Unnamed”:
house_df.drop('Unnamed: 0', axis = 1, inplace = True)
Step 5: After dropping the variable, let’s have a look at the first few rows in the dataset:
house_df.head()
../images/499122_1_En_2_Chapter/499122_1_En_2_Figd_HTML.jpg
Step 6: We will now prepare the dataset for model building by separating the independent variable and target variable.
X = house_df.iloc[:, :1].values
y = house_df.iloc[:, -1].values

Step 7: The data is split into train and test now.

Train/Test Split: Creating train and test datasets involves splitting the dataset into training and testing sets, which are mutually exclusive. We then train with the training set and test with the testing set. This provides a more accurate evaluation of out-of-sample accuracy, because the testing dataset is not part of the data that has been used to train the model. It is more realistic for real-world problems.

This means that we know the outcome of each data point in the test dataset, making it great to test with! And since this data has not been used to train the model, the model has no knowledge of the outcome of these data points. So, in essence, it is truly out-of-sample testing. Here the test data is 25%, or 0.25, of the dataset.

Random state, as the name suggests, initializes the internal random number generator, which in turn decides the split of the train/test indices. Keeping it fixed allows us to replicate the same train/test split and hence verify the output.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 5)
Step 8: Fit the data now using the linear regression model:
from sklearn.linear_model import LinearRegression
simple_lr= LinearRegression()
simple_lr.fit(X_train, y_train)
Step 9: The model is trained now. Let us use it to predict on the test data
y_pred = simple_lr.predict(X_test)
Step 10: We will first test the model on the training data. We will try to predict on training data and visualize the results on it.
plt.scatter(X_train, y_train, color = 'r')
plt.plot(X_train, simple_lr.predict(X_train), color = 'b')
plt.title('Sqft Living vs Price for Training')
plt.xlabel('Square feet')
plt.ylabel('House Price')
plt.show()
../images/499122_1_En_2_Chapter/499122_1_En_2_Fige_HTML.jpg
Step 11: Now let us test the model on the testing data. It is the correct measurement to check how robust the model is.
plt.scatter(X_test, y_test, color = 'r')
plt.plot(X_train, simple_lr.predict(X_train), color = 'b')
plt.title('Sqft Living vs Price for Test')
plt.xlabel('Square feet')
plt.ylabel('House Price')
../images/499122_1_En_2_Chapter/499122_1_En_2_Figf_HTML.jpg
Step 12: Now let’s figure out how good or how bad we are doing in the predictions. We will calculate the MSE and R2.
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(y_test, y_pred))
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
adj_r2 = 1 - float(len(y)-1)/(len(y)-len(simple_lr.coef_)-1)*(1 - r2)
rmse, r2, adj_r2, simple_lr.coef_, simple_lr.intercept_
../images/499122_1_En_2_Chapter/499122_1_En_2_Figg_HTML.jpg
Step 13: We will now make a prediction on unseen value of x:
import numpy as np
x_unseen=np.array([1500]).reshape(1,1)
simple_lr.predict(x_unseen)
The prediction is 376666.84

In the preceding two examples, we saw how we can use simple linear regression to train a model and make a prediction. In real-world problems, however, having only one independent variable almost never happens. Most business problems have more than one variable, and such problems are solved using multiple linear regression, which we discuss now.

Multiple linear regression, or multiple regression, can be seen as an extension of simple linear regression where, instead of one independent variable, we have more than one. For example, the following table and Figure 2-11 show the representation of more than one variable of a similar dataset in a vector-space diagram:

Square feet    No. of bedrooms    Price (in 1000 $)
1200           2                  100
1500           3                  120
1600           3                  125
1100           2                  95
1500           2                  125
2200           4                  200
2100           4                  195
1410           2                  110

../images/499122_1_En_2_Chapter/499122_1_En_2_Fig11_HTML.png
Figure 2-11

Multiple regression model depiction in vector-space diagram where we have two independent variables, x1 and x2

Hence the equation for multiple linear regression is shown in Equation 2-10.
$$ Y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \varepsilon_i $$
(Equation 2-10)

Hence, when training a multiple linear regression model, we will get estimated values for β0, β1, β2, and so on.

The residuals in the case of a multiple regression model are shown in Figure 2-12. The example is for a two-variable model, and we can clearly visualize the value of residual, which is the difference between actual and predicted values.
../images/499122_1_En_2_Chapter/499122_1_En_2_Fig12_HTML.png
Figure 2-12

The multiple linear regression model depicted by x1 and x2 and the residual shown with respect to a sample observation. The residual is the difference between actual and predicted values.

We will now create two example cases using multiple linear regression models. During model development, we are going to perform EDA as the first step, resolve the problem of null values in the data, and also see how to handle categorical variables.

Example 3: Multiple Linear Regression for Housing Dataset

We are working on the house price dataset. The target variable is the house price to be predicted, and there are several independent variables. The dataset and the code are uploaded to the GitHub link shared at the start of this chapter.

Step 1: Import all the required libraries first.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")
Step 2: Import the data which is in the form of .csv file. Then check the first few rows.
 house_mlr = pd.read_csv('House_data.csv')
house_mlr.head()
../images/499122_1_En_2_Chapter/499122_1_En_2_Figh_HTML.jpg

We have 21 variables in this dataset.

Step 3: Next let’s explore the dataset we have. This is done using the house_mlr.info() command.

../images/499122_1_En_2_Chapter/499122_1_En_2_Figi_HTML.jpg

By analyzing the output, we can see that out of the 21 variables, a few are float variables, some are objects, and some are integers. We will be converting the categorical (object) variables to numeric variables.

Step 4: house_mlr.describe() command will give the details about all the numeric variables.

../images/499122_1_En_2_Chapter/499122_1_En_2_Figj_HTML.jpg

Here we can see the range for all the numeric variables: the mean, standard deviation, the values at the 25th percentile, 50th percentile, and 75th percentile. The minimum and maximum values are also shown.

A very good way to visualize a variable is with a box-and-whisker plot, created using the following code.
fig = plt.figure(1, figsize=(9, 6))
ax = fig.add_subplot(111)
ax.boxplot(house_mlr['sqft_living15'])

../images/499122_1_En_2_Chapter/499122_1_En_2_Figk_HTML.jpg

The plot shows that there are a few outliers. In this case, we are not treating the outliers. In later chapters, we shall examine the best practices to deal with outliers.

Step 5: Now we are going to check for the correlations between the variables. This will be done using a correlation matrix which is developed using the following code:
house_mlr.drop(['id', 'date'], axis = 1, inplace = True)
fig, ax = plt.subplots(figsize = (12,12))
ax = sns.heatmap(house_mlr.corr(),annot = True)

The analysis of the correlation matrix shows that there is some correlation between a few variables. For example, between sqft_above and sqft_living there is a correlation of 0.88. And that is quite expected. For this first simple example, we are not treating the correlated variables.

../images/499122_1_En_2_Chapter/499122_1_En_2_Figl_HTML.jpg

Step 6: Now we will clean the data a little. There are a few null values present in the dataset. We are dropping those null values now. We are examining the concepts of missing value treatment in Chapter 5.
house_mlr.isnull().any()
house_mlr ['basement'] = (house_mlr ['sqft_basement'] > 0).astype(int)
house_mlr ['renovated'] = (house_mlr ['yr_renovated'] > 0).astype(int)
to_drop = ['sqft_basement', 'yr_renovated']
house_mlr.drop(to_drop, axis = 1, inplace = True)
house_mlr.head()
../images/499122_1_En_2_Chapter/499122_1_En_2_Figm_HTML.jpg

Step 7: The categorical variables are converted to numeric ones using one-hot encoding.

One-hot encoding converts categorical variables to numeric ones. Simply put, it adds new columns to the dataset with 0 or 1 assigned depending on the value of the categorical variable, as shown in the following:

../images/499122_1_En_2_Chapter/499122_1_En_2_Fign_HTML.jpg
categorical_variables = ['waterfront', 'view', 'condition', 'grade', 'floors','zipcode']
house_mlr = pd.get_dummies(house_mlr, columns = categorical_variables, drop_first=True)
house_mlr.head()
../images/499122_1_En_2_Chapter/499122_1_En_2_Figo_HTML.jpg
Step 8: We will now split the data into train and test and then fit the model. Test size is 25% of the data.
X = house_mlr.iloc[:, 1:].values
y = house_mlr.iloc[:, 0].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 5)
from sklearn.linear_model import LinearRegression
multiple_regression = LinearRegression()
multiple_regression.fit(X_train, y_train)
Step 9: Predict the test set results.
y_pred = multiple_regression.predict(X_test)
Step 10: We will now check the accuracy of our model.
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(y_test, y_pred))
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
adj_r2 = 1 - float(len(y)-1)/(len(y)-len(multiple_regression.coef_)-1)*(1 - r2)
rmse, r2, adj_r2

The output we receive is (147274.98522602883, 0.8366403242154088, 0.8358351477235848).

The steps used in this example can be extended to any example where we want to predict a continuous variable. In this problem, we have predicted the value of a continuous variable but we have not selected the significant variables from the list of available variables. Significant variables are the ones which are more important than other independent variables in making the predictions. There are multiple ways to shortlist the variables. We will discuss one of them using the ensemble technique in the next section. The popular methodology of using p-value is discussed in Chapter 3.

With this, we have discussed the concepts and implementation using linear regression. So far, we have assumed that the relationship between dependent and independent variables is linear. But what if this relation is not linear? That is our next topic: nonlinear regression.

Nonlinear Regression Analysis

Consider this. In physics, we have laws of motion to describe the relationship between a body and all the forces acting upon it, and how the motion responds to those forces. One of the equations of motion relates the distance traveled to the initial velocity and time:

Distance = initial velocity × time + ½ × acceleration × time², OR s = ut + ½at²

If we analyze this equation, we can see that the relation between distance and time is not linear but quadratic in nature. Nonlinear regression is used to model such relationships. We can review the scatter plot to identify if the relationship is nonlinear, as shown in Figure 2-13.

Formally put, if the relationship between target variable and independent variables is not linear in nature, then nonlinear regression is used to fit the data.

The model form for nonlinear regression is a mathematical equation, which can be represented as Equation 2-11 and as the curve in Figure 2-13. The shape of the curve will depend on the value of “n” and the respective values of β0, β1…, and so on.
$$ Y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \dots + \beta_n x_1^n + \varepsilon_i $$
(Equation 2-11)

Where β0: Y intercept

β1: regression coefficient for linear effect of X on Y

β2 = regression coefficient for quadratic effect on Y and so on

εi = random error in Y for observation i

../images/499122_1_En_2_Chapter/499122_1_En_2_Fig13_HTML.jpg
Figure 2-13

A nonlinear regression will do a better job to model the data set than a simple linear model. There is no linear relationship between the dependent and the independent variable

Let’s dive deeper and understand nonlinear relationships by taking quadratic equations as an example. If we have a quadratic relationship between the dependent variables, it can be represented by Equation 2-12.
$$ Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{1i}^2 + \varepsilon_i $$
(Equation 2-12)

Where β0: Y intercept

β1: regression coefficient for linear effect of X on Y

β2 = regression coefficient for quadratic effect on Y

εi = random error in Y for observation i

The quadratic model will take the following shapes depending on the values of β1 and β2, as shown in Figure 2-14, in which β1 is the coefficient of the linear term and β2 is the coefficient of the squared term.
../images/499122_1_En_2_Chapter/499122_1_En_2_Fig14_HTML.jpg
Figure 2-14

The value of the curve is dependent on the values of β1 and β2. The figure depicts the different shapes of the curve in different directions.

Depending on the values of the coefficients, the curve will change its shape. An example of such a case is shown in the following data; Figure 2-15 shows the relationship between the two variables, velocity and distance:

Velocity    Distance
3           9
4           15
5           28
6           38
7           45
8           69
10          96
12          155
18          260
20          410
25          650

../images/499122_1_En_2_Chapter/499122_1_En_2_Fig15_HTML.jpg
Figure 2-15

Nonlinear relationship between velocity and distance is depicted here

The data and its plot indicate a nonlinear relationship. Still, we should know how to detect whether the relationship between the target and independent variables is nonlinear, which we discuss next.

Identifying a Nonlinear Relationship

Before we try to fit a nonlinear model, we first have to make sure that a nonlinear equation is indeed required. The significance of the quadratic effect can be checked with a standard null hypothesis test.
  1. The estimate in the case of linear regression is ŷ = b0 + b1x1.

  2. The estimate in the case of quadratic regression is ŷ = b0 + b1x1 + b2x1².

  3. The hypotheses will be
    a. H0: β2 = 0 (the quadratic term does not improve the model)
    b. H1: β2 ≠ 0 (the quadratic term improves the model)

  4. Once we run the statistical test, we will either accept or reject the null hypothesis.
But it might not always be feasible to run a statistical test. While working on practical business problems, we can take these two steps:
  1. To identify a nonlinear relationship, we can analyze the residuals after fitting a linear model. If we try to fit a linear relationship while the actual relationship is nonlinear, then the residuals will not be random but will have a pattern, as shown in Figure 2-16. If a nonlinear relationship is modeled instead, the residuals will be random in nature.
    ../images/499122_1_En_2_Chapter/499122_1_En_2_Fig16_HTML.jpg
    Figure 2-16

    If the residuals follow a pattern, it signifies that there can be a nonlinear relationship present in the data which is being modeled using a linear model. Linear fit does not give random residuals, while nonlinear fit will give random residuals.

     
  2. We can also compare the respective R2 values of the linear and nonlinear regression models. If the R2 of the nonlinear model is higher, it means the nonlinear model captures the relationship better, as shown in the sketch below.
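The following is a minimal sketch, assuming the velocity/distance table above, that compares the R2 of a plain linear fit against a quadratic fit built with scikit-learn's PolynomialFeatures; the variable names are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Velocity/distance data from the earlier table
velocity = np.array([3, 4, 5, 6, 7, 8, 10, 12, 18, 20, 25]).reshape(-1, 1)
distance = np.array([9, 15, 28, 38, 45, 69, 96, 155, 260, 410, 650])

# Plain linear fit
linear_fit = LinearRegression().fit(velocity, distance)
r2_linear = r2_score(distance, linear_fit.predict(velocity))

# Quadratic fit: add a squared term, then fit a linear model on [x, x^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
velocity_poly = poly.fit_transform(velocity)
quad_fit = LinearRegression().fit(velocity_poly, distance)
r2_quad = r2_score(distance, quad_fit.predict(velocity_poly))

print(r2_linear, r2_quad)   # the quadratic R2 should be noticeably higher for this data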

     

Similar to linear regression, nonlinear models too have some assumptions, which we will discuss now.

Assumptions for a Nonlinear Regression

  1. The errors in a nonlinear regression follow a normal distribution. The error is the difference between the actual and predicted values, and nonlinear regression requires these errors to follow a normal distribution.

  2. All the errors must have a constant variance.

  3. The errors are independent of each other and do not follow a pattern. This is a very important assumption, since if the errors are not independent of each other, it means that there is still some information in the data which our model equation has not extracted.
We can also use a log transformation to solve some of the nonlinear models. The equation of a log transformation can be put forward as Equation 2-13 and Equation 2-14.
$$ Y = \beta_0 X_1^{\beta_1} X_2^{\beta_2} \varepsilon $$
(Equation 2-13)
and by taking a log transformation on both sides:
$$ \log \left( Y \right) = \log \left( \beta_0 \right) + \beta_1 \log \left( X_1 \right) + \beta_2 \log \left( X_2 \right) + \log \left( \varepsilon \right) $$
(Equation 2-14)

The coefficient of the independent variable can be interpreted as follows: a 1% change in the independent variable X1 leads to an estimated β1 percentage change in the average value of Y.

Note

Sometimes, β1 can also refer to elasticity of Y with respect to change in X1.
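A minimal sketch of this log transformation approach is shown below, assuming hypothetical positive-valued data with two independent variables; following Equation 2-14, we simply fit a linear model on the log-transformed values.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical positive-valued data for a multiplicative relationship
X = np.array([[110, 3], [150, 4], [200, 4], [260, 5], [330, 6], [410, 8]], dtype=float)
y = np.array([95, 120, 150, 190, 240, 310], dtype=float)

# Fit Equation 2-14: log(y) = log(b0) + b1*log(X1) + b2*log(X2)
log_model = LinearRegression()
log_model.fit(np.log(X), np.log(y))

print(log_model.coef_)                 # b1 and b2: elasticities of y with respect to X1 and X2
print(np.exp(log_model.intercept_))    # back-transformed estimate of b0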

We have studied different types of regression models. Regression is one of the most stable solutions, but like any other tool or solution, there are a few pitfalls with regression models too. Some of these are uncovered as important insights while doing EDA. These challenges have to be dealt with while we are doing the data preparation for our statistical modeling. We will now discuss these challenges.

Challenges with a Regression Model

Though regression is quite a robust model to use, there are a few challenges we face with the algorithm:
  1. Nonlinearities: Real-world data points are much more complex and generally do not follow a linear relationship. Even if we have a very large number of data points, a linear method might prove too simplistic to create a robust model, as shown in Figure 2-17. A linear model will not be able to do a good job for the data on the left, while a nonlinear model fits it better. For example, we will seldom find price and size following a perfectly linear relationship.
    ../images/499122_1_En_2_Chapter/499122_1_En_2_Fig17_HTML.jpg
    Figure 2-17

    The figure on the left shows that we are trying to fit a linear model to nonlinear data. The figure on the right shows the correct equation. A linear relationship is one of the important assumptions in linear regression

     
  2. Multicollinearity: We have discussed the concept of multicollinearity earlier in the chapter. If we use correlated variables in our model, we will face the problem of multicollinearity. For example, if we include both sales in thousands of units and sales revenue in $, both variables essentially carry the same information. If we have a problem of multicollinearity, it impacts the model as follows:
    a. The estimated coefficients of the independent variables become quite sensitive to even small changes in the model and hence their values can change quickly.
    b. The overall predictive power of our model takes a hit, as the precision of the estimated coefficients of the independent variables can be impacted.
    c. We may not be able to trust the p-values of the coefficients and hence we cannot completely rely on the significance shown by the model.
    d. Hence, multicollinearity undermines the overall quality of the trained model and needs to be acted upon.
  3. Heteroscedasticity: This is one more challenge we face while modeling a regression problem. It occurs when the variability of the target variable changes with the values of the independent variables. It creates a cone-shaped pattern in the residuals, which is visible in Figure 2-5. It creates a few challenges for us, like the following:
    a. Heteroscedasticity messes with the significance of the independent variables. It inflates the variance of the coefficient estimates. We would expect this increase to be detected by the OLS process, but OLS is not able to do so. Hence, the t-values and F-values calculated are wrong, and consequently the respective p-values estimated are too low. That leads us to make incorrect conclusions about the significance of the independent variables.
    b. Heteroscedasticity leads to less precise coefficient estimates for the independent variables, and hence the resultant model will be less reliable.
  4. Outliers: Outliers lead to a lot of problems in our regression model. They change our results and make a huge impact on our insights and the ML model. The impacts are as follows:
    a. Our model equation takes a serious hit in the case of outliers, as visible in Figure 2-18. In the presence of outliers, the regression equation tries to fit them too, and hence the actual equation will not be the best one.
      ../images/499122_1_En_2_Chapter/499122_1_En_2_Fig18_HTML.jpg
      Figure 2-18

      Outliers in a dataset seriously impact the accuracy of the regression model since the equation will try to fit the outlier points too; hence, the results will be biased

       
    b. Outliers bias the estimates for the model and increase the error variance.

    c. If a statistical test is to be performed, its power and impact take a serious hit.

    d. Overall, from the data analysis, we cannot trust the coefficients of the model, and hence all the insights from the model will be erroneous.
Note

We will be dealing with how to detect outliers and how to tackle them in Chapter 5.

Linear regression is one of the most widely used techniques to predict continuous variables. Its usages are vast and spread across multiple domains. It is generally among the first methods used to model a continuous variable, and it acts as a benchmark for other ML models.

This brings us to the end of our discussion on linear regression models. Next we will discuss a quite popular family of ML models called tree models or tree-based models. Tree-based models can be used for both regression and classification problems. In this chapter we will be studying only the regression problems and in the next chapter we will be working on classification problems.

Tree-Based Methods for Regression

The next type of algorithms used to solve ML problems are tree-based algorithms. Tree-based algorithms are very simple to comprehend and develop. They can be used for both regression and classification problems. A decision tree is a supervised learning algorithm; hence we have a target variable, and depending on the problem it will be either a classification or a regression problem.

A decision tree looks like Figure 2-19. As you can see, the entire population is continuously split into groups and subgroups based on a criterion. We start with the entire population at the beginning of the tree and subsequently the population is divided into smaller subsets; at the same time, an associated decision tree is incrementally developed. Like we do in real life, we consider the most important factor first and then divide the possibilities around it. In a similar way, decision tree building starts by finding the features for the best splitting condition.
../images/499122_1_En_2_Chapter/499122_1_En_2_Fig19_HTML.jpg
Figure 2-19

Basic structure of a decision tree shows how iterative decisions are made for splitting

Decision trees can be used to predict both continuous variables and categorical variables. In the case of a regression tree, the value returned by a terminal node is the mean of the observations falling in that region, while in the case of classification trees it is the mode of the observations. We will be discussing both methods in this book. In this chapter, we are examining regression problems; in the next chapter, classification problems are solved using decision trees.

Before moving ahead, it is imperative that we study the building blocks of a decision tree. As you can see in Figure 2-20, a decision tree is represented using a root node, decision nodes, terminal nodes, and branches.
../images/499122_1_En_2_Chapter/499122_1_En_2_Fig20_HTML.jpg
Figure 2-20

Building blocks of a decision tree consisting of root node, decision node, terminal node, and a branch

  1. Root node is the entire population which is being analyzed; it is displayed at the top of the decision tree.
  2. Decision node represents a subnode which gets further split into subnodes.
  3. Terminal node is the final element in a decision tree. When a node cannot split any further, that path ends and the node is called a terminal node; it is sometimes also referred to as a leaf.
  4. Branch is a subsection of a tree. It is sometimes called a subtree.
  5. Parent node and child node are relative references between nodes: a node which is divided is the parent node, and its subnodes are called child nodes.

Let us now understand the decision tree using the house price prediction problem. For the purpose of understanding, let us assume the first criterion for splitting is area. If the area is less than 100 sq. km, the entire population is split into two nodes, as shown in Figure 2-21(i). On the right-hand side, the next criterion is the number of bedrooms: if the number of bedrooms is less than four, the predicted price is 100; otherwise, the predicted price is 150. For the left side, the criterion for splitting is distance. This process continues until the price values are predicted.

A decision tree can also be thought of as a group of nested IF-ELSE conditions (as shown in Figure 2-21(ii)), modeled as a tree wherein the decisions are made at the internal nodes and the output is obtained at the leaf nodes. A decision tree divides the space of independent variables into non-overlapping regions that are distinct from each other. Geometrically, a decision tree can also be thought of as a set of parallel hyperplanes that divide the space into a number of hypercuboids, as shown in Figure 2-21(iii). We predict the value for unseen data based on which hypercuboid it falls into.
../images/499122_1_En_2_Chapter/499122_1_En_2_Fig21_HTML.jpg
Figure 2-21

(i) Decision tree based split for the housing prediction problem. (ii) Decision tree can be thought as a nested IF-ELSE block. (iii) Geometric representation of a decision tree shows parallel hyperplanes.
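As a minimal sketch of the nested IF-ELSE view, the function below mimics the tree described above. The area and bedroom splits and the prices 100 and 150 come from the text; the branch orientation, the distance threshold, and the prices on the distance branch are hypothetical placeholders chosen only for illustration.
def predict_house_price(area, bedrooms, distance):
    """A decision tree expressed as nested IF-ELSE conditions (Figure 2-21(ii))."""
    if area < 100:                 # first split: area (threshold from the text)
        if distance < 10:          # hypothetical split on distance
            return 80              # hypothetical leaf value
        else:
            return 60              # hypothetical leaf value
    else:
        if bedrooms < 4:           # second split: number of bedrooms
            return 100             # leaf value from the text
        else:
            return 150             # leaf value from the text

print(predict_house_price(area=150, bedrooms=5, distance=12))   # prints 150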

Now you have understood what a decision tree is. Let us also examine the criteria of splitting a node.

A decision tree utilizes a top-down greedy approach. As we know, a decision tree starts with the entire population and then recursively splits the data; hence it is called top-down. It is called a greedy approach because, at each split, the algorithm chooses the best available criterion (that is, variable) for the current split only, without considering future splits that might result in a better overall model. In other words, a greedy approach focuses on the current split only and not on future splits. This splitting takes place recursively until the tree is fully grown or the stopping criteria are reached. In the case of a classification tree, there are three methods of splitting: Gini index, entropy loss, and classification error. Since they deal with classification problems, we will study these criteria in the next chapter.

For a regression tree, variance reduction is the criterion for splitting. In variance-based splitting, the variance at each node is calculated using the following formula.
$$ \mathrm{Variance} = \frac{\sum {\left( x - \bar{x} \right)}^2}{n} $$
(Equation 2-15)

We calculate the variance of each candidate split as the weighted average of the variances of the resulting nodes, and the split with the lowest weighted variance is selected.
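A minimal sketch of this criterion, using a made-up target array, is shown below: we compute the weighted variance for two candidate splits of the same node and pick the lower one.
import numpy as np

def weighted_split_variance(y_left, y_right):
    """Weighted average of node variances after a candidate split (Equation 2-15)."""
    n_left, n_right = len(y_left), len(y_right)
    n_total = n_left + n_right
    return (n_left / n_total) * np.var(y_left) + (n_right / n_total) * np.var(y_right)

# Toy target values at a node and two candidate splits of that node
y_node = np.array([10, 12, 11, 30, 32, 31])
split_a = weighted_split_variance(y_node[:3], y_node[3:])   # groups similar values together
split_b = weighted_split_variance(y_node[:2], y_node[2:])   # mixes the two clusters

print('Variance before split:', round(np.var(y_node), 2))
print('Candidate split A    :', round(split_a, 2))
print('Candidate split B    :', round(split_b, 2))
# The split with the lowest weighted variance (A here) would be chosen.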

There are quite a few decision tree algorithms available, like ID3, CART, C4.5, CHAID, and so on. These algorithms are explored in more detail in Chapter 3 after we have discussed concepts of classification using a decision tree.

Case study: Petrol consumption using Decision tree

It is time to develop a Python solution using a decision tree. The code and dataset are uploaded to the GitHub repository. You are advised to download the dataset from the GitHub link shared at the start of the chapter.

Step 1: Import all the libraries first.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Step 2: Import the dataset using read_csv file command.
petrol_data = pd.read_csv('petrol_consumption.csv')
petrol_data.head(5)
../images/499122_1_En_2_Chapter/499122_1_En_2_Figp_HTML.jpg
Step 3: Let’s explore the major metrics of the independent variables:
petrol_data.describe()
../images/499122_1_En_2_Chapter/499122_1_En_2_Figq_HTML.jpg
Step 4: We will now split the data into train and test sets and then fit the model. First the X and y variables are segregated, and then they are split into train and test; the test set is 20% of the data.
X = petrol_data.drop('Petrol_Consumption', axis=1)
y = petrol_data ['Petrol_Consumption']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.tree import DecisionTreeRegressor
decision_regressor = DecisionTreeRegressor()
decision_regressor.fit(X_train, y_train)
Step 5: Use the model for making a prediction on the test dataset.
y_pred = decision_regressor.predict(X_test)
df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
df
../images/499122_1_En_2_Chapter/499122_1_En_2_Figr_HTML.jpg
Step 6: Measure the performance of the model created by using various measuring parameters.
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
../images/499122_1_En_2_Chapter/499122_1_En_2_Figs_HTML.jpg

The MAE for our algorithm is 50.9, which is less than 10% of the mean of all the values in the 'Petrol_Consumption' column, indicating that our algorithm is doing a good job.
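Assuming the objects from the preceding steps (metrics, y_test, y_pred, and petrol_data) are still in scope, this claim can be checked directly:
mae = metrics.mean_absolute_error(y_test, y_pred)
mean_consumption = petrol_data['Petrol_Consumption'].mean()
print('MAE as a fraction of the mean consumption:', round(mae / mean_consumption, 3))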

The visualization of the preceding decision tree is shown below; the code to generate it is checked in at the GitHub link, and a minimal sketch of one way to produce such a plot follows the image.

../images/499122_1_En_2_Chapter/499122_1_En_2_Figt_HTML.jpg
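A minimal sketch of one way to generate such a plot, assuming a reasonably recent version of scikit-learn and the decision_regressor and X objects created above:
from sklearn import tree

plt.figure(figsize=(16, 8))
tree.plot_tree(decision_regressor,
               feature_names=list(X.columns),
               max_depth=3,        # show only the top levels to keep the plot readable
               filled=True,
               fontsize=8)
plt.show()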

This is the Python implementation of a decision tree. The code can be replicated for any problem we want to solve using a decision tree. We will now explore the pros and cons of decision tree algorithms.

Advantages of Decision Tree
  1. Decision trees are easy to build and comprehend. Since they mimic human logic in decision making, the output looks very structured and is easy to grasp.
  2. They require very little data preparation. They work for both regression and classification problems and can handle huge datasets.
  3. Decision trees are not impacted much by collinearity of the variables. Significant variable identification is inbuilt, and we can validate the outputs of decision trees using statistical tests.
  4. Perhaps the most important advantage of decision trees is that they are very intuitive. Stakeholders or decision makers who are not from a data science background can also understand the tree.
Disadvantages of Decision Tree
  1. Overfitting is the biggest problem faced with decision trees. Overfitting occurs when the model achieves good training accuracy but very low testing accuracy; in other words, the model has learned the training data well but struggles with unseen data. Overfitting is a nuisance and we have to reduce it in our models (a minimal sketch at the end of this section shows the effect of limiting tree depth). We deal with overfitting and how to reduce it in Chapter 5.
  2. A greedy approach is used to create decision trees. Hence, it may not result in the best or globally optimum tree. Methods have been proposed to reduce the impact of the greedy algorithm, such as dual information distance (DID). The DID heuristic selects attributes by considering both the immediate and the future potential effects on the overall solution; the classification tree is constructed by searching for the shortest paths over a graph of partitions, and the shortest path identified is defined by the selected features. The DID method considers
     a. the orthogonality between the selected partitions, and
     b. the reduction of uncertainty on the class partition given the selected attributes.
  3. Decision trees are quite sensitive to changes in the training data; sometimes a small change in the training data can lead to a change in the final predictions.
  4. For classification trees, the splitting is biased towards variables with a higher number of classes.

We have discussed the concepts of decision tree and developed a case study using Python. It is very easy to comprehend, visualize, and explain. Everyone is able to relate to a decision tree, as it works in the way we make our decisions. We choose the best parameter and direction, and then make a decision on the next step. Quite an intuitive approach!
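As referenced in the discussion on overfitting above, here is a minimal sketch on the petrol consumption data (assuming the X_train, X_test, y_train, and y_test objects from the case study are still in scope). We compare an unconstrained tree with a depth-limited one; the exact numbers will vary, but the gap between train and test scores is usually much smaller for the shallower tree.
from sklearn.tree import DecisionTreeRegressor

full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)                  # no constraints
pruned_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)   # depth limited

for name, model in [('unconstrained', full_tree), ('max_depth=3', pruned_tree)]:
    print(name,
          '-> train R2:', round(model.score(X_train, y_train), 2),
          ', test R2:', round(model.score(X_test, y_test), 2))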

This brings us towards the end of decision tree–based solutions. So far we have discussed simple linear regression, multinomial regression, nonlinear regression, and decision tree. We understood the concepts, the pros and cons, and the assumptions, and we developed respective solutions using Python. It is a very vital and relevant step towards ML.

But all of these algorithms work individually, one at a time. This brings us to the next generation of solutions, called ensemble methods, which we will examine now.

Ensemble Methods for Regression

“United we stand” is the motto for ensemble methods. They use multiple predictors and then “unite” or collate the information to make a final decision.

Formally put, ensemble methods train multiple predictors on a dataset. These predictor models may or may not be weak predictors individually. They are selected and trained in such a way that each sees a slightly different training dataset and may produce slightly different results, so the individual predictors may learn different patterns from each other. Finally, their individual predictions are combined and a final decision is made. Sometimes this combined group of learners is referred to as a meta model.

In ensemble methods, we ensure that each predictor gets a slightly different dataset for training. This is usually achieved by sampling at random with replacement, also known as bootstrapping. Alternatively, we can adjust the weights assigned to individual data points, which increases the focus on those data points.
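As a minimal sketch of bootstrapping, assuming the petrol_data DataFrame loaded in the earlier case study, we can draw a few samples with replacement and see that some rows repeat while others are left out:
# Three bootstrap samples of the same size as the original data;
# sampling with replacement means some rows repeat and some are left out.
bootstrap_samples = [petrol_data.sample(frac=1.0, replace=True, random_state=i)
                     for i in range(3)]

for i, sample in enumerate(bootstrap_samples):
    unique_rows = sample.index.nunique()
    print(f'Sample {i}: {len(sample)} rows, {unique_rows} unique original rows')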

To visualize how ensemble methods make a prediction, we can consider Figure 2-22 where a random forest is shown and predictor models are trained on slightly different datasets.
../images/499122_1_En_2_Chapter/499122_1_En_2_Fig22_HTML.jpg
Figure 2-22

Ensemble learning–based random forest where raw data is split into randomly selected subfeatures and then individual independent parallel trees are created. The final result is the average of all the predictions by subtrees.

Ensemble methods can be divided into two broad categories: bagging and boosting (a minimal sketch contrasting the two appears just before the random forest case study).
  1. Bagging models, or bootstrap aggregation, improve the overall accuracy by means of several weak models. The following are the major attributes of a bagging model:
     a. Bagging uses sampling with replacement to generate multiple datasets.
     b. It builds multiple predictors simultaneously and independently of each other.
     c. To reach the final decision, the individual predictions are averaged or voted on: for a regression model the average or median of the respective predictions is taken, while for a classification model a vote is taken.
     d. Bagging is an effective solution to tackle variance and reduce overfitting.
     e. Random forest is one example of a bagging method (as shown in Figure 2-22).
  2. Boosting: Similar to bagging, boosting is also an ensemble method. The following are the main points about a boosting algorithm:
     a. In boosting, the learners are grown sequentially, each one built on the output of the last.
     b. Each subsequent learner improves on the last iteration and focuses more on the errors made in the last iteration.
     c. During the process of voting, a higher weight is awarded to the learners which have performed better.
     d. Boosting is generally slower than bagging but mostly performs better.
     e. Gradient boosting, extreme gradient boosting, and AdaBoost are a few example algorithms.

It is time for us to develop a solution using random forest. We will be exploring more on boosting in Chapter 4, where we study supervised classification algorithms.
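Before building that solution, here is a minimal sketch contrasting a bagging regressor with a boosting regressor, reusing the petrol train/test split created earlier (an assumption; the chapter's case studies use a random forest instead):
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn import metrics

# Bagging: independent trees on bootstrap samples, predictions averaged.
# BaggingRegressor uses a decision tree as its default base learner.
bagging = BaggingRegressor(n_estimators=100, random_state=0)

# Boosting: trees added sequentially, each focusing on the previous errors.
boosting = GradientBoostingRegressor(n_estimators=100, random_state=0)

for name, model in [('Bagging ', bagging), ('Boosting', boosting)]:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name, 'MAE:', round(metrics.mean_absolute_error(y_test, preds), 2))
Which of the two wins on this small dataset can go either way; the point is only how differently the two families are constructed.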

Case study: Petrol consumption using Random Forest

For a random forest regression problem, we will be using the same case study we used for the decision tree. In the interest of space, we proceed directly after creating the training and testing datasets.

Step 1: Import all the libraries and the dataset. We have already covered the steps while we were implementing decision tree algorithms.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
petrol_data = pd.read_csv('petrol_consumption.csv')
X = petrol_data.drop('Petrol_Consumption', axis=1)
y = petrol_data['Petrol_Consumption']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
Step 2: Import the random forest regressor library and initiate a RandomForestRegressor variable.
from sklearn.ensemble import RandomForestRegressor
randomForestModel = RandomForestRegressor(n_estimators=200,
                               bootstrap = True,
                               max_features = 'sqrt')
Step 3: Now fit the model on training and testing data.
randomForestModel.fit(X_train, y_train)
Step 4: We will now predict values for the test set and check the accuracy of the model.
rf_predictions = randomForestModel.predict(X_test)
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, rf_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, rf_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, rf_predictions)))

../images/499122_1_En_2_Chapter/499122_1_En_2_Figu_HTML.jpg

Step 5: We will now extract the most important features. Get the list of all the columns present in the dataset, and then obtain the numeric feature importances.
feature_list=X_train.columns
importances = list(randomForestModel.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
../images/499122_1_En_2_Chapter/499122_1_En_2_Figv_HTML.jpg
Step 6: We will now re-create the model with the important variables.
rf_most_important = RandomForestRegressor(n_estimators=500, random_state=5)
important_features = ['Paved_Highways', 'Average_income', 'Population_Driver_licence(%)']
train_important = X_train.loc[:, important_features]
test_important = X_test.loc[:, important_features]
Step 7: Train the random forest algorithm.
rf_most_important.fit(train_important, y_train)
Step 8: Make predictions and determine the error.
predictions = rf_most_important.predict(test_important)
predictions
Step 9: Print the mean absolute error, mean squared error, and root mean squared error.
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
../images/499122_1_En_2_Chapter/499122_1_En_2_Figw_HTML.jpg

As we can observe, after selecting the significant variables the error has decreased for the random forest.

Ensemble learning allows us to collate the power of multiple models and then make a prediction. These models individually are weak but together act as a strong model for prediction. And that is the beauty of ensemble learning. We will now discuss pros and cons of ensemble learning.

Advantages of ensemble learning:
  1. An ensemble model can result in lower variance and lower bias. Ensembles generally develop a better understanding of the data.
  2. The accuracy of ensemble methods is generally higher than that of regular methods.
  3. The random forest model is used to tackle overfitting, which is generally a concern for decision trees. Boosting is used for bias reduction.
  4. Most importantly, ensemble methods are a collection of individual models; hence, a more complex understanding of the data is generated.
Challenges with ensemble learning:
  1. Owing to their complexity, ensemble models are difficult to comprehend. For example, while we can easily visualize a decision tree, it is difficult to visualize a random forest model.
  2. The complexity of these models also makes them harder to train, test, deploy, and refresh than simpler models.
  3. Ensemble models can take a long time to converge and train, which increases the overall training time.

We have covered the concept of ensemble learning and developed a regression solution using random forest. Ensemble learning has been popular for a long time and has won quite a few competitions in Kaggle. You are advised to understand the concepts and replicate the code implementation.

Before we close the discussion on ensemble learning, we have to study an additional concept: feature selection using tree-based methods. Recall the earlier section where we developed a multiple regression problem; we will continue with the same problem to select the significant variables.

Feature Selection Using Tree-Based Methods

We continue with the dataset we used in the earlier section, where we developed a multiple regression solution on the house data. We use the ensemble-based ExtraTreesClassifier to select the most significant features.

The initial steps of importing the libraries and dataset remain the same.

Step 1: Import the libraries and dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
feature_df = pd.read_csv('House_data.csv')
Step 2: Perform the same data preprocessing we did in the regression problem.
feature_df['basement'] = (feature_df['sqft_basement'] > 0).astype(int)
feature_df['renovated'] = (feature_df['yr_renovated'] > 0).astype(int)
to_drop = ['id', 'date', 'sqft_basement', 'yr_renovated']
feature_df.drop(to_drop, axis = 1, inplace = True)
cat_cols = ['waterfront', 'view', 'condition', 'grade', 'floors']
feature_df = pd.get_dummies(feature_df, columns = cat_cols, drop_first=True)
y = feature_df.iloc[:, 0].values
X = feature_df.iloc[:, 1:].values
Step 3: Let us now create an ExtraTreesClassifier.
from sklearn.ensemble import ExtraTreesClassifier
tree_clf = ExtraTreesClassifier()
tree_clf.fit(X, y)
tree_clf.feature_importances_
Step 4: We now obtain the respective importance of each variable and order them in descending order of importance.
importances = tree_clf.feature_importances_
feature_names = feature_df.iloc[:, 1:].columns.tolist()
feature_names
feature_imp_dir = dict(zip(feature_names, importances))
features = sorted(feature_imp_dir.items(), key=lambda x: x[1], reverse=True)
feature_imp_dir
Step 5: We will visualize the features in the order of their importance.
plt.bar(range(len(features)), [imp[1] for imp in features], align="center")
plt.title('The important features in House Data');
../images/499122_1_En_2_Chapter/499122_1_En_2_Figx_HTML.jpg
Step 6: We will now analyze how many variables have been selected and how many have been removed.
from sklearn.feature_selection import SelectFromModel
abc = SelectFromModel(tree_clf, prefit = True)
x_updated = abc.transform(X)
print('Total Features count:', np.array(X).shape[1])
print('Selected Features: ',np.array(x_updated).shape[1])

The output shows that the total number of variables was 30; from that list, 11 variables have been found to be significant.
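To see which variables were retained, we can ask the selector directly (a short sketch assuming the abc and feature_names objects created above):
selected_idx = abc.get_support(indices=True)          # positions of the retained features
selected_features = [feature_names[i] for i in selected_idx]
print('Selected features:', selected_features)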

Using the ensemble learning–based ExtraTreesClassifier is one technique to shortlist significant variables. Alternatively, we can look at the respective p-values and shortlist the variables that way.
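As a minimal sketch of the p-value route, assuming the X, y, and feature_names objects from the preceding steps and a fully numeric feature matrix, we can refit an ordinary least squares model with statsmodels purely to inspect the coefficient p-values (note this is a linear-model view of significance, distinct from the tree-based importances):
import statsmodels.api as sm

X_named = pd.DataFrame(X, columns=feature_names)     # wrap the array so results carry names
ols_model = sm.OLS(y, sm.add_constant(X_named)).fit()

# Variables with the smallest p-values show the strongest linear evidence
print(ols_model.pvalues.sort_values().head(10))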

Ensemble learning is a very powerful method to combine the power of weak predictors and make them strong enough to produce better predictions. It offers a fast, easy, and flexible solution which is applicable to both classification and regression problems. Owing to their flexibility, ensemble models can sometimes overfit, but bagging solutions like random forest tend to overcome that problem. Since ensemble techniques promote diversity in the modeling approach and use a variety of predictors to make the final decision, they have often outperformed classical algorithms and hence have gained a lot of popularity.

This brings us to the end of ensemble learning methods. We are going to revisit them in Chapter 3 and Chapter 4.

So far, we have covered simple linear regression, multiple linear regression, nonlinear regression, decision tree, and random forest and have developed Python codes for them too. It is time for us to move to the summary of the chapter.

Summary

We are harnessing the power of data in more innovative ways. Be it through reports and dashboards, visualizations, data analysis, or statistical modeling, data is powering the decisions of our businesses and processes. Supervised regression learning is quickly and quietly impacting the decision-making process. We are able to predict the various indicators of our business and take proactive measures for them. The use cases are across pricing, marketing, operations, quality, Customer Relationship Management (CRM), and in fact almost all business functions.

And regression solutions are a family of such powerful solutions. They are versatile, robust, and convenient. Though there are fewer use cases of regression problems compared to classification problems, they still serve as the foundation of supervised learning models. Regression solutions are quite sensitive to outliers and to changes in the values of the target variable, the major reason being that the target variable is continuous in nature.

Regression solutions help in modeling the trends and patterns, deciphering the anomalies, and predicting for the unseen future. Business decisions can be more insightful in light of regression solutions. At the same time, we should be cautious and aware that the regression will not cater to unseen events and values which it has not been trained for. Events such as war, natural calamities, government policy changes, macro/micro economic factors, and so on which are not planned will not be captured in the model. We should be cognizant of the fact that any ML model is dependent on the quality of the data. And to have access to a clean and robust dataset, an effective data-capturing process and mature data engineering are prerequisites. Only then can the real power of data be harnessed.

In the first chapter, we learned about ML, data and attributes of data quality, and various ML processes. In this second chapter, we have studied regression models in detail. We examined how a model is created, how we can assess the model’s accuracy, pros and cons of the model, and implementation in Python too. In the next chapter, we are going to work on supervised learning classification algorithms.

You should be able to answer the following questions.

EXERCISE QUESTIONS

Question 1: What is regression, and what are the use cases of regression problems?

Question 2: What are the assumptions of linear regression?

Question 3: What are the pros and cons of linear regression?

Question 4: How does a decision tree make a split for a node?

Question 5: How does an ensemble method make a prediction?

Question 6: What is the difference between the bagging and boosting approaches?

Question 7: Load the data set Iris using the following command:
from sklearn.datasets import load_iris
iris = load_iris()

Predict the sepal length of the iris flowers using linear regression and decision tree and compare the results.

Question 8: Load the auto-mpg.csv from the Github link and predict the mileage of a car using decision tree and random forest and compare the results. Get the most significant variables and re-create the solution to compare the performance.

Question 9: The next dataset contains information about the used cars listed on www.cardekho.com. It can be used for price prediction. Download the dataset from https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho, perform the EDA, and fit a linear regression model.

Question 10: The next dataset is a record of seven common species of fish. Download the data from https://www.kaggle.com/aungpyaeap/fish-market and estimate the weight of fish using regression techniques.

Question 11: Go through the research papers on decision trees at https://ieeexplore.ieee.org/document/9071392 and https://ieeexplore.ieee.org/document/8969926.

Question 12: Go through the research papers on regression at https://ieeexplore.ieee.org/document/9017166 and https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2013WR014203.
