© Vaibhav Verdhan 2020
V. Verdhan, Supervised Learning with Python, https://doi.org/10.1007/978-1-4842-6156-9_3

3. Supervised Learning for Classification Problems

Vaibhav Verdhan1 
(1)
Limerick, Ireland
 

“Prediction is very difficult, especially if it’s about the future.”

— Niels Bohr

We live in a world where predictions about an event help us modify our plans. If we know that it is going to rain today, we will not go camping. If we know that the share market will crash, we will hold on to our investments for some time. If we know that a customer is about to churn from our business, we will take steps to lure them back. All such predictions are insightful and hold strategic importance for our business.

It also helps if we know the factors which make our sales go up or down, make an email campaign work, or result in a product failure. We can work on our shortcomings and build on the positives. The entire customer-targeting strategy can be modified using such knowledge. We can revamp the online interactions or change the product testing: the applications are many. The ML models that generate such insights are referred to as classification algorithms, the focus of this chapter.

In Chapter 2, we studied regression problems, which are used to predict a continuous variable. In this third chapter we will examine the concepts needed to predict a categorical variable, that is, how confident we are that an event will happen or not. We will study logistic regression, decision tree, k-nearest neighbor, naïve Bayes, and random forest in this chapter. For each algorithm we will study the concepts and also develop Python code using an actual dataset. We will deal with missing values, duplicates, and outliers, do an EDA, measure the accuracy of the algorithms, and choose the best algorithm. Finally, we will solve a case study to complete the understanding.

Technical Toolkit Required

We are using Python 3.5+ in this book. You are advised to get Python installed on your machine. We are using Jupyter notebooks; installing Anaconda-Navigator is required for executing the code.

The major libraries used are numpy, pandas, matplotlib, seaborn, scikit-learn, and so on. You are advised to install these libraries in your Python environment. All the code and datasets have been uploaded to the Github repository at the following link: https://github.com/Apress/supervised-learning-w-python/tree/master/Chapter%203

Before starting with classification machine learning, it is imperative to examine the statistical concepts of the critical region and the p-value, which we study now. They are used to judge the significance of a variable among all the independent variables. A very strong and critical concept indeed!

Hypothesis Testing and p-Value

Imagine a new drug X is launched in the market, which claims to cure diabetes in 90% of patients in 5 weeks. The company tested it on 100 patients and 90 of them got cured within 5 weeks. How can we be sure that the drug is indeed effective, rather than the company making false claims or the sampling technique being biased?

Hypothesis testing helps answer precisely such questions.

In hypothesis testing, we decide on a hypothesis first. In the preceding example of a new drug, our hypothesis is that drug X cures diabetes in 90% of patients in 5 weeks. This is called the null hypothesis and is represented by H0; in this case, H0 is that the cure rate is 0.9. If the null hypothesis is rejected based on evidence, an alternate hypothesis H1 has to be accepted; in this case, H1 is that the cure rate is less than 0.9. We always start by assuming that the null hypothesis is true.

Then we define our significance level, α. It is a measure of how unlikely we want the results of the sample to be before we reject the null hypothesis H0. Refer to Figure 3-1(i).
../images/499122_1_En_3_Chapter/499122_1_En_3_Fig1_HTML.png
Figure 3-1

(i) The significance level has to be specified, that is, up to what point we will accept the results and not reject the null hypothesis. (ii) The critical region is the 5% region shown in the middle image. (iii) The right-side image shows the p-value and how it falls in the critical region in this case.

We then define the critical value "c", which marks the boundary of the critical region, as shown in Figure 3-1(ii).

If X represents the number of diabetic patients cured, the critical region is defined by P(X < c) < α, where α = 5%. With a 95% confidence level, there is a 5% chance that the sample statistic falls in this extreme region even when the null hypothesis is true. The interpretation is: if the sample we are interested in falls in the critical region, then we can safely reject the null hypothesis.

This is precisely why 5%, or 0.05, is referred to as the significance level. If we want a confidence level of 99%, then 0.01 is the significance level.

Then the next step is to get the p-value. Refer to Figure 3-1(iii) for better understanding.

Formally put, the p-value is the probability of getting a value as extreme as (or more extreme than) the one observed in the sample, in the direction of the critical region. It is a method to check whether the results of a sample fall in the critical region of the hypothesis test. So, based on the p-value we decide whether to reject the null hypothesis or not.

Once the p-value is calculated, we analyze if the value falls in the critical region. If the answer we get is yes, we can reject the null hypothesis.

This knowledge of hypothesis testing and p-value is critical as it paves the way for identifying the significant variables. Whenever we train our ML algorithms, along with other results we get a p-value for each of the variables. As a rule of thumb, if the p-value is less than or equal to 0.05, the variable is considered as significant.

In this case, the null hypothesis is that the independent variable is not significant and does not impact the target variable. If the p-value is less than or equal to 0.05, we can reject the null hypothesis and hence conclude that the variable is indeed significant. In the language of statistics, if the p-value for an independent variable x is less than or equal to 0.05, it suggests strong evidence against the null hypothesis, as there is less than a 5% chance of observing a result at least this extreme if the null hypothesis were correct. In other words, the variable x is a significant variable for making a prediction of the target variable y. But it does not mean there is a 95% probability of the alternate hypothesis being correct. Note that the p-value is conditioned on the null hypothesis and is unrelated to the alternate hypothesis.

We use the p-value to shortlist the significant variables and compare their respective importance. It is a universally accepted metric to choose significant variables.
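As a quick illustration of how a p-value is obtained in practice, here is a minimal sketch for the drug example above, assuming scipy 1.7 or later is installed (earlier versions expose the same test as stats.binom_test):

from scipy import stats

# H0: the drug cures 90% of patients. We observed 90 cures out of 100 patients
# and test whether the evidence points to a cure rate below 90%.
result = stats.binomtest(k=90, n=100, p=0.9, alternative='less')
print("p-value:", round(result.pvalue, 4))

# If the p-value were <= alpha (0.05), we would reject H0 in favour of H1 (cure rate < 90%).
# Here the sample matches the claimed rate, so the p-value is large and we fail to reject H0.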

We will now proceed to the concepts of classification algorithms in the next section.

Classification Algorithms

In our day-to-day business, we make decisions to either invest in stock or not, send a communication to a customer or not, accept a product or reject it, accept an application or ignore it. The basis of these decisions is some sort of insight we have about our business, our processes, our goals, and the factors which enter into our decision making. At the same time, we do expect a favorable output from this decision of ours.

Supervised classification algorithms are used to generate such insights and help us make these decisions. They predict the probability of an event happening or not. At the same time, depending on the choice of supervised learning algorithm, we get to know the factors which impact the occurrence of such an event.

Formally put, classification algorithms are a branch of supervised ML algorithms which are used to model the probability for a certain class. For example, if we want to perform binary classification, we will be modeling for two classes like pass or fail, healthy or sick, fraudulent or genuine, yes or no, and so on. It can be extended to multiclass classification problems—for example, good/bad/neutral, red/yellow/blue/green, cat/dog/horse/car, and so on.

Like a regression model, there is a target variable and independent variables. The only difference is that the target variable is categorical in nature. The independent variables used to make the predictions can be continuous or categorical, and it depends on the algorithm used for modeling.

The following use cases will make the usage of classification algorithms clear:
  1. 1.

    A retailer is losing its repeat customers, that is, customers who used to make purchases are not coming back. The supervised learning algorithm will help identify the customers who are more prone to churn and not come back. The retailer can then target those customers selectively and can offer discounts to bring them back to the business.

     
  2. 2.

    A manufacturing plant has to maintain the best quality of their products. And for that the technical team would like to ascertain if a particular combination of tools, raw materials, and physical conditions will lead to the best quality and yield. Supervised algorithms can help in that selection.

     
  3. 3.

    An insurance provider wants to model whether a customer should be given a policy or not. Depending on the customer's previous history, employment details, transaction patterns, and so on, the decision has to be made. Here a classification ML model can help predict an acceptance score for each customer, which can be used to accept or reject the application.

     
  4. 4.

    A bank offering credit cards to its customers has to identify which incoming transactions are fraudulent and which are genuine. Based on transaction details like origin, amount, time of transaction, mode, and other customer parameters, a decision has to be made. Supervised classification algorithms are helpful in making that decision.

     
  5. 5.

    A telecom operator wishes to launch a new data product in the market. For this, it needs to target the subscribers who have a higher probability of being interested in the product and recharging with it. Supervised classification algorithms can generate a score for each subscriber, and subsequently an offer can be made.

     

The preceding use cases are some of the pragmatic implementations in the industry. A classification algorithm generates a probability score for an event and accordingly the sales/marketing/operations/quality/risk teams can take a business call. Quite a powerful usage and very handy too!

There are quite a few algorithms which serve the purpose; we will discuss a few in this chapter and the rest in Chapter 4.

The algorithms which can be used are
  • Logistic regression

  • k-nearest neighbor

  • Decision tree

  • Random forest

  • Naïve Bayes

  • SVM

  • Gradient boosting

  • Neural networks

We discuss the first five algorithms in this chapter and the rest in the next chapter. Let us start the discussion with the logistic regression algorithm in the next section.

Logistic Regression for Classification

In Chapter 2 we learned how to predict the value for a continuous variable like number of customers, sales, rainfall, and so on using linear regression. Now we have to predict whether a customer will visit or not, whether the sale will go up or not, and so on. Using logistic regression, we can model and solve the preceding problems.

Formally put, logistic regression is a statistical model that utilizes the logit function to model classification problems, that is, a categorical dependent variable. In the basic form, we model a binary classification problem and refer to it as binary logistic regression. In more complex problems, where more than two categories have to be modeled, we use multinomial logistic regression.

Let us understand logistic regression by means of an example.

Consider that we have to decide whether a credit card transaction is fraudulent or not. The response is binary (Yes or No); if the transaction seems genuine it will be accepted, otherwise not. We have incoming transaction attributes like amount, time of transaction, payment mode, and so on, and we have to make a decision based on them.

For such a problem, logistic regression models the probability of fraud. In the preceding case, the probability of fraud can be

Probability (fraud = Yes | amount)

The value of this probability will lie between 0 and 1. It can be interpreted as follows: given a value of the transaction amount, we can make a prediction about the genuineness of a credit card transaction.

The question arises of how we model such a relationship. If we use a linear equation, we can simply write it as
$$ \mathrm{Probability\ or\ } p(x) = \beta_0 + \beta_1 x $$
(Equation 3-1)

If we fit the preceding formula to predict the outcome using the amount, the result looks like the line in Figure 3-2(i). For smaller values of the amount, the predicted probability is less than zero, while for large values the fraud probability is greater than one. Both of these situations are not possible.

Hence, logistic regression is used to tackle this problem. Logistic regression uses the sigmoid function, which takes input as any real value and gives an output between 0 and 1. The standard logistic regression equation can be represented as in Equation 3-2 and shown in Figure 3-2.
$$ P(x) = \frac{e^t}{e^t + 1} \quad \mathrm{where}\ t = \beta_0 + \beta_1 x $$
(Equation 3-2)
../images/499122_1_En_3_Chapter/499122_1_En_3_Fig2_HTML.jpg
Figure 3-2

(i) A linear regression function will not be able to do justice to the task of predicting fraud. (ii) Logistic regression, with its "S"-shaped curve, is more suitable as it gives scores between 0 and 1.

which can be rewritten as
$$ P(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} $$
(Equation 3-3)

The next question that comes to mind is how to fit this equation. Recall that in the case of linear regression, we want the predictions of the target variable to be as close as possible to the actual values. A similar approach is followed here too. The fitting of logistic regression is done using the maximum likelihood function. The likelihood function measures the goodness of fit of a statistical model; for logistic regression it is usually the log-likelihood that is maximized. The mathematical proof is beyond the scope of the book.

If we manipulate Equation 3-3 we will get
$$ \log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x $$
(Equation 3-4)
If we exponentiate both sides,
$$ \frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x} $$
(Equation 3-5)

In Equation 3-5, the quantity $$ \frac{p(x)}{1 - p(x)} $$ is referred to as the odds. The name comes from betting, where odds are a more intuitive and common way of expressing chance than probability.

The term $$ \log\left(\frac{p(x)}{1 - p(x)}\right) $$ is called the logit. If we compare with the linear regression equation, we can easily see that with each unit increase in x, the logit (sometimes called the log-odds) changes by β1. The odds can take any value between 0 and infinity, while the logit can take any real value. We can visualize the function in Figure 3-2(ii).
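The following is a minimal NumPy sketch of the link between the logit, the odds, and the probability. The coefficient values β0 and β1 are made up purely for illustration:

import numpy as np

def sigmoid(t):
    # Logistic function: maps any real value t into the interval (0, 1)
    return np.exp(t) / (1 + np.exp(t))

beta_0, beta_1 = -4.0, 0.002            # hypothetical coefficients
amount = np.array([500, 1500, 3000])    # example transaction amounts

log_odds = beta_0 + beta_1 * amount     # the logit: beta_0 + beta_1 * x
odds = np.exp(log_odds)                 # p(x) / (1 - p(x)), between 0 and infinity
prob = sigmoid(log_odds)                # P(fraud), always between 0 and 1

print("log-odds:", log_odds)
print("odds    :", odds.round(3))
print("P(fraud):", prob.round(3))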

In most business problems, we have more than one independent variable, and the model extends naturally to multiple predictors. Mathematically, it can be represented as follows:
$$ \log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n $$
(Equation 3-6)

But there is a question which still remains unanswered: why do we need logistic regression when we have linear regression with us?

Let’s say a bank is making an assessment of its service quality, based on the customer’s historical transactions and service details. In such a case, the predicted response is going to be positive, negative, or neutral. We code these responses as
../images/499122_1_En_3_Chapter/499122_1_En_3_Figa_HTML.png

If we treat the target variable as a continuous variable, it means we have to predict the actual value of y. But this implies that positive is one less than negative and negative is one less than neutral. And the difference between positive and negative is the same as the difference between negative and neutral. This argument is intrinsically wrong and does not make any practical sense.

Moreover, even if we reduce the number of responses from three to two, say positive and negative only, linear regression might still give us a probability score above 1 or below 0, which is mathematically not possible. Even if we fit the best regression line, it does not give us a clear cut-off point by which we can differentiate between the two classes: it will classify some positives as negative and vice versa. Moreover, if we get a score of 0.5 from linear regression, should it be classified as positive or negative? And an outlier can completely distort the outputs. Hence it is practically more sensible to use a classification algorithm like logistic regression instead of a linear regression model to solve the problem.

Like linear regression, logistic regression has a few assumptions:
  1. 1)

    Being a classification algorithm, the outcome of the logistic regression model is a binary or dichotomous variable like success/fail, yes/no, or zero/one.

     
  2. 2)

    There exists a linear relationship between the logit of the outcome and each of the independent variables.

     
  3. 3)

    Outliers do not exist or at least there are no significant outliers for continuous variables.

     
  4. 4)

    There is no (or very little) correlation among the independent variables, that is, no multicollinearity.

     

An important point to be noted is that the accuracy of the algorithm depends on the training data which has been used to train the algorithm. If the training data is not representative, then the resultant model will not be a robust one. The training data should conform to the data quality standards we discussed in Chapter 1. We will be revisiting this concept in detail in Chapter 5.

Tip

To be able to have a representative dataset, it is advisable to have a minimum of 10 data points for each of the independent variables with reference to their least frequent value. For example, for 20 independent variables and a least frequent outcome of 0.2, we should have (20*10)/0.2 = 1000 data points.

Key points to note about logistic regression:
  1. 1)

    The output of a logistic classification model is generally a probability score for an event. It can be used for both binary and multiclass classification problems.

     
  2. 2)

    Since the output is a probability, it cannot go beyond 1 and cannot be less than 0. Hence the shape of the logistic curve is an "S".

     
  3. 3)

    It can handle any number of classes as the target variable as well as both categorical and continuous independent variables.

     
  4. 4)

    The maximum likelihood algorithm helps to determine the respective coefficients for the equation. It is not required for the independent variables to be normally distributed or have an equal variance in each group.

     
  5. 5)

    $$ \frac{p}{1 - p} $$ is the odds; whenever this value is greater than 1, the probability of success is above 50%.

     
  6. 6)

    The interpretation of coefficients is more difficult in logistic regression, as the relationship is not as straightforward as in the case of linear regression.

     

Before moving further, it is imperative to carefully examine the accuracy measurement methods. You are advised to be thorough with each of them. A vital component in supervised learning, indeed!

Assessing the Accuracy of the Solution

The objective of creating an ML solution is to make predictions for future events. But before deploying the model to a production environment, it is imperative that we measure its performance. Moreover, we generally train multiple algorithms with multiple iterations and have to choose the best one based on various accuracy KPIs. In this section, we study the most important accuracy assessment criteria.

The most important measures of the efficacy of a classification solution are as follows:
  1. 1.

    Confusion matrix : One of the most popular methods is confusion matrix. It can be used for both binary and multiclass problems. In its simplest form it is represented as a 2×2 matrix in Figure 3-3.

     
../images/499122_1_En_3_Chapter/499122_1_En_3_Fig3_HTML.png
Figure 3-3

Confusion matrix is a great way to measure the efficacy of an ML model. Using it, we can calculate precision, recall, accuracy, and F1 score to get the model’s performance.

We will now learn about each of the parameters separately:

  1. a.

    Accuracy: Accuracy is the proportion of predictions that were made correctly. In the preceding example, the accuracy is (131+27)/(131+27+3+24) ≈ 85%

     
  2. b.

    Precision: Precision is, out of all positive predictions, how many were actually positive. In the preceding example, precision is 131/(131+24) ≈ 84%

     
  3. c.

    Recall or sensitivity: Recall is, out of all the actual positive events, how many we were able to capture. In this example, 131/(131+3) ≈ 97%

     
  4. d.

    Specificity or true negative rate: Specificity is, out of all actual negatives, how many were predicted correctly. In this example, 27/(27+24) ≈ 53%

     
  5. 2.
    ROC curve and AUC value: ROC (receiver operating characteristic) curves are used to compare different models. An ROC curve is a plot of TPR (true positive rate) against FPR (false positive rate). The area under the ROC curve (AUC) is a measure of how good a model is: the higher the AUC, the better the model, as depicted in Figure 3-4. The straight line at 45° represents a random model (AUC = 0.5). A good model has an area above 0.5 and hugs the top left corner of the graph, as shown in Figure 3-4; the green curve appears to be the best model here.
    ../images/499122_1_En_3_Chapter/499122_1_En_3_Fig4_HTML.jpg
    Figure 3-4

    (i) An ROC curve is shown on the left side. (ii) Different ROC curves are shown on the right. The green ROC curve is the best; it hugs the top left corner and has the maximum AUC value.

     
  6. 3.

    Gini coefficient: We also use the Gini coefficient to measure the goodness of fit of our model. Formally put, it is a ratio of areas in an ROC curve and is a scaled version of the AUC.

    $$ GI = 2 \times AUC - 1 $$
    (Equation 3-7)

    Similar to AUC values, a higher-value Gini coefficient is preferred.

     
  7. 4.

    F1 score: Many times, we face the problem of which KPI to choose (i.e., higher precision or higher recall) when we are comparing the models. F1 score solves this dilemma for us.

    $$ \mathrm{F1\ Score} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} $$
    (Equation 3-8)

    F1 score is the harmonic mean of precision and recall. The higher the F1 score, the better.

     
  8. 5.

    AIC and BIC: Akaike information criterion (AIC) and Bayesian information criterion (BIC) are used to choose the best model. AIC is derived from frequentist probability, while BIC is derived from Bayesian probability.

    $$ AIC = -\frac{2}{N} \times LL + \frac{2 \times k}{N} $$
    (Equation 3-9)
    while
    $$ BIC = -2 \times LL + \log(N) \times k $$
    (Equation 3-10)

    In both formulas, N is the number of examples in the training set, LL is the log-likelihood of the model on the training dataset, and k is the number of variables in the model. The log in BIC is the natural log to the base e (the natural logarithm).

    We prefer lower values of AIC and BIC. AIC penalizes the model for its complexity, but BIC penalizes the model more than AIC. If we have a choice between the two, AIC will choose a more complex model as compared to BIC.

     
Tip

Given a choice between a very complex model and a simple model with comparable accuracy, choose the simpler one. Remember, nature always prefers simplicity!

  1. 6.

    Concordance and discordance: Concordance is one of the measures to gauge your model. Let us first understand the meaning of concordance and discordance.

    Consider that you are building a model to predict whether a customer will churn from the business or not. The output is the probability of churn. The data is shown in Table 3-1.
    Table 3-1

    Respective Probability Scores for a Customer to Churn or Not

    Cust ID | Probability | Churn
    1001    | 0.75        | 1
    2001    | 0.24        | 0
    3001    | 0.34        | 1
    4001    | 0.62        | 0

    Group 1: (churn = 1): Customer 1001 and 3001

    Group 2: (churn = 0): Customer 2001 and 4001

    Now we create pairs by taking one data point from Group 1 and one from Group 2 and then comparing them. The pairs look like this:

    Pair 1: 1001 and 2001

    Pair 2: 1001 and 4001

    Pair 3: 3001 and 2001

    Pair 4: 3001 and 4001

    By analyzing the pairs, we can see that in the first three pairs the model assigns the higher probability to the customer who actually churned; here the model's ordering is correct, and these pairs are called concordant pairs. In Pair 4, the model assigns the lower probability to the churner, which does not make sense; this pair is called discordant. If the two observations in a pair have equal probabilities, the pair is referred to as a tied pair.

    We can measure the quality using Somers D, which is given by

    Somers D = (percentage of concordant pairs – percentage of discordant pairs). The higher the Somers D, the better the model.

    Concordance alone cannot be a parameter to make a model selection. It should be used as one of the measures and other measures should also be checked.

     
  2. 7.

    KS stats: The KS or Kolmogorov-Smirnov statistic is one of the measures to gauge the efficacy of the model. It is the maximum difference between the cumulative true positive rate and the cumulative false positive rate. The higher the KS, the better the model.

     
  3. 8.
    It is also a recommended practice to test the performance of the model on the following datasets and compare the KPIs:
    1. a.

      Training dataset: the dataset used for training the algorithm

       
    2. b.

      Testing dataset: the dataset used for testing the algorithm

       
    3. c.

      Validation dataset: this dataset is used only once and in the final validation stage

       
    4. d.

      Out-of-time validation: It is a good practice to have out-of-time testing. For example, if the training/testing/validation datasets are from Jan 2015 to Dec 2017, we can use Jan 2018–Dec 2018 as the out-of-time sample. The objective is to test a model’s performance on an unseen dataset.

       
     

These are the various measures which are used to check the model’s accuracy. We generally create more than one model using multiple algorithms. And for each algorithm, there are multiple iterations done. Hence, these measures are also used to compare the models and pick and choose the best one.
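To make these measures concrete, here is a minimal scikit-learn sketch. The labels are constructed to reproduce the hypothetical counts discussed with Figure 3-3 (TP = 131, FN = 3, FP = 24, TN = 27), and the AUC value used for the Gini illustration is just an assumed number, not a computed one.

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Labels reconstructed to match the worked example above: 131 TP, 3 FN, 24 FP, 27 TN
y_true = np.array([1] * 131 + [1] * 3 + [0] * 24 + [0] * 27)
y_pred = np.array([1] * 131 + [0] * 3 + [1] * 24 + [0] * 27)

print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted (classes 0, 1)

# These reproduce, up to rounding, the accuracy, precision, and recall figures computed by hand above
print("accuracy :", round(accuracy_score(y_true, y_pred), 2))
print("precision:", round(precision_score(y_true, y_pred), 2))
print("recall   :", round(recall_score(y_true, y_pred), 2))
print("F1 score :", round(f1_score(y_true, y_pred), 2))

# Gini coefficient from an AUC value (Equation 3-7); 0.81 is just an assumed AUC here
auc = 0.81
print("Gini     :", round(2 * auc - 1, 2))

# AIC and BIC (Equations 3-9 and 3-10) are reported directly by statsmodels
# results as result.aic and result.bic, so they rarely need manual computation.

Against a real model, y_pred would come from model.predict() and the AUC from roc_auc_score on the predicted probabilities, as shown in the case study that follows.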

There is one more point we should be cognizant of. We always want our systems to be accurate: we want to predict better whether share prices will increase or decrease, or whether it will rain tomorrow or not. But sometimes accuracy alone can be misleading. We will understand this with an example.

For example, while developing a credit card fraud detection system, our business goal is to detect transactions which are fraudulent. Generally most (more than 99%) of the transactions are not fraudulent. This means that if a model predicts every incoming transaction as genuine, the model will still be 99% accurate! But it is not meeting its business objective of spotting fraudulent transactions. In such a business case, recall is the important parameter we should target.

With this, we conclude our discussion of accuracy assessment. Generally, for a classification problem, logistic regression is the very first algorithm we use as a baseline. Let us now solve an example of a logistic regression problem.

Case Study: Credit Risk

Business Context: Credit risk is the risk of default on a loan arising from a borrower failing to make the required payments. In the banking sector, this is an important factor to be considered before approving an applicant's loan. Dream Housing Finance company deals in all kinds of home loans and has a presence across urban, semiurban, and rural areas. Customers first apply for a home loan; after that the company validates the customers' eligibility for the loan.

Business Objective: The company wants to automate the loan eligibility process (in real time) based on the customer details provided while filling out the online application form. These details are gender, marital status, education, number of dependents, income, loan amount, credit history, and others. To automate this process, the task is to identify the customer segments that are eligible for a loan amount so that these customers can be specifically targeted. A partial dataset has been provided.

Dataset: The dataset and the code are available at the Github link for the book shared at the start of the chapter. A description of the variables is given in the following:

Variable Description
  1. a.

    Loan_ID: Unique Loan ID

     
  2. b.

    Gender: Male/Female

     
  3. c.

    Married: Applicant married (Y/N)

     
  4. d.

    Dependents: Number of dependents

     
  5. e.

    Education: Applicant Education (Graduate/Undergraduate)

     
  6. f.

    Self_Employed: Self-employed (Y/N)

     
  7. g.

    ApplicantIncome: Applicant income

     
  8. h.

    CoapplicantIncome: Coapplicant income

     
  9. i.

    LoanAmount: Loan amount in thousands

     
  10. j.

    Loan_Amount_Term: Term of loan in months

     
  11. k.

    Credit_History: credit history meets guidelines

     
  12. l.

    Property_Area: Urban/ Semi Urban/ Rural

     
  13. m.

    Loan_Status: Loan approved (Y/N)

     

Let’s start the coding part using logistic regression. We will explore the dataset, clean and transform it, fit a model, and then measure the accuracy of the solution.

Step 1: Import all the requisite libraries first. We import seaborn for statistical plots. To split the data frame into a training set and a test set, we will use the sklearn package's train_test_split function, which is based on random sampling. To calculate accuracy measures and confusion matrices, we import metrics from sklearn.
import pandas as pd
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import numpy as np
import os,sys
from scipy import stats
from sklearn import metrics
import seaborn as sn
%matplotlib inline
Step 2: Load the dataset using read_csv command. The output is shown in the following.
loan_df = pd.read_csv('CreditRisk.csv')
loan_df.head()
../images/499122_1_En_3_Chapter/499122_1_En_3_Figb_HTML.jpg
Step 3: Examine the shape of the data:
loan_df.shape
Step 4: Drop the Loan_ID column, as it is just a unique identifier (a one-to-one mapping with the rows) and carries no predictive information:
credit_df = loan_df.drop('Loan_ID', axis=1)
credit_df.head()
../images/499122_1_En_3_Chapter/499122_1_En_3_Figc_HTML.jpg
Step 5: Next, examine the distribution of Loan_Amount_Term (value_counts with normalize=True returns proportions) and visualize it with a histogram.
credit_df['Loan_Amount_Term'].value_counts(normalize=True)
plt.hist(credit_df['Loan_Amount_Term'], 50)
../images/499122_1_En_3_Chapter/499122_1_En_3_Figd_HTML.jpg
../images/499122_1_En_3_Chapter/499122_1_En_3_Fige_HTML.jpg
Step 6: Next, visualize LoanAmount as a line chart.
plt.plot(credit_df.LoanAmount)
plt.xlabel('Loan Amount')
plt.ylabel('Frequency')
plt.title("Plot of the Loan Amount")
../images/499122_1_En_3_Chapter/499122_1_En_3_Figf_HTML.jpg
Tip

We have shown only one visualization. You are advised to generate more graphs and plots. Remember, plots are a fantastic way to represent data intuitively!

Step 7: The Loan_Amount_Term is highly skewed and hence we are deleting this variable.
credit_df.drop(['Loan_Amount_Term'], axis=1, inplace=True)
Step 8: Missing value treatment is done next, and each variable's missing values are replaced with 0. You can compare the results after replacing the missing values with the median instead (see the commented line).
credit_df = credit_df.fillna('0')
## Alternative: fill missing values with the column medians instead of 0
## credit_df = credit_df.fillna(credit_df.median())
credit_df
Step 9: Next we will analyze how our variables are distributed.
credit_df.describe().transpose()
../images/499122_1_En_3_Chapter/499122_1_En_3_Figg_HTML.jpg

You are advised to create box-plot diagrams as we have discussed in Chapter 2.

Step 10: Let us look at the target column, ‘Loan_Status’, to understand how the data is distributed among the various values.
credit_df.groupby(["Loan_Status"]).mean()
../images/499122_1_En_3_Chapter/499122_1_En_3_Figh_HTML.jpg
Step 11: Now we will convert the target variable Loan_Status and the Credit_History variable to categorical variables.
credit_df['Loan_Status'] = credit_df['Loan_Status'].astype('category')
credit_df['Credit_History'] = credit_df['Credit_History'].astype('category')
Step 12: Check the data types present in the data now (e.g., with credit_df.dtypes), as shown in the output:
../images/499122_1_En_3_Chapter/499122_1_En_3_Figi_HTML.jpg
Step 13: Check how the data is balanced. We will get the following output.
prop_Y = credit_df['Loan_Status'].value_counts(normalize=True)
print(prop_Y)
../images/499122_1_En_3_Chapter/499122_1_En_3_Figj_HTML.jpg

There seems to be a slight imbalance in the dataset as one class is 31.28% and the other is 68.72%.

Note

While the dataset is not heavily imbalanced, we will also examine how to deal with data imbalance in Chapter 5.

Step 14: We will define the X and Y variables now.
X = credit_df.drop('Loan_Status', axis=1)
Y = credit_df[['Loan_Status']]
Step 15: Using one-hot encoding we will convert the categorical variables to numeric variables:
X = pd.get_dummies(X, drop_first=True)
Step 16: Now split into training and test sets. We are splitting into a ratio of 70:30
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30)
Step 17: Build the actual logistic regression model now:
import statsmodels.api as sm
logit = sm.Logit(y_train, sm.add_constant(X_train))
lg = logit.fit()
Step 18: We will now check the summary of the model. The results are as follows:
from scipy import stats
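# Compatibility patch (only needed with certain statsmodels/scipy version combinations):
# older statsmodels summary() code calls stats.chisqprob, which newer scipy releases removed;
# the next line restores it.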
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
print(lg.summary())
../images/499122_1_En_3_Chapter/499122_1_En_3_Figk_HTML.jpg

Let us interpret the results. The pseudo r-square shows that only 24% of the entire variation in the data is explained by the model. It is really not a good model!

Step 19: Next we will calculate the odds ratio from the coefficients using the formula odds ratio = exp(coef). Then we will calculate the probability from the odds ratio using the formula probability = odds / (1 + odds).
log_coef = pd.DataFrame(lg.params, columns=['coef'])
log_coef.loc[:, "Odds_ratio"] = np.exp(log_coef.coef)
log_coef['probability'] = log_coef['Odds_ratio']/(1+log_coef['Odds_ratio'])
log_coef['pval']=lg.pvalues
pd.options.display.float_format = '{:.2f}'.format
Step 20: We will now filter all the independent variables by significant p-value (p value <0.1) and sort descending by odds ratio. We will get the following output:
log_coef = log_coef.sort_values(by="Odds_ratio", ascending=False)
pval_filter = log_coef['pval']<=0.1
log_coef[pval_filter]
../images/499122_1_En_3_Chapter/499122_1_En_3_Figl_HTML.jpg

If we analyze the output, we can see that the customers who have a credit history of 1 have a 97% probability of defaulting on the loan, while the ones having a history of 0 have a 98% probability of defaulting.

Similarly, the customers in semiurban areas have 2.50 times the odds of defaulting as compared to others.

Step 21: We now fit the scikit-learn model using the training data. The .fit() method is used for this.
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
Step 22: Once the model is ready and fit, we can use it to make a prediction. But first we have to check the accuracy of the model on the training data using confusion matrix; the output is as follows.
pred_train = log_reg.predict(X_train)
from sklearn.metrics import classification_report,confusion_matrix
mat_train = confusion_matrix(y_train,pred_train)
print("confusion matrix = ",mat_train)
../images/499122_1_En_3_Chapter/499122_1_En_3_Figm_HTML.jpg
Step 23: Next we make predictions for the test set and visualize them; we will get the following output.
pred_test = log_reg.predict(X_test)
mat_test = confusion_matrix(y_test,pred_test)
print("confusion matrix = ",mat_test)
ax= plt.subplot()
ax.set_ylim(2.0, 0)
annot_kws = {"ha": 'left',"va": 'top'}
sns.heatmap(mat_test, annot=True, ax = ax, fmt= 'g',annot_kws=annot_kws); #annot=True to annotate cells
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels');
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Not Approved', 'Approved']);
ax.yaxis.set_ticklabels(['Not Approved', 'Approved']);
../images/499122_1_En_3_Chapter/499122_1_En_3_Fign_HTML.jpg
Step 24: Let us now create the AUC ROC curve and get the AUC score. We will get the following output.
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, log_reg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, log_reg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
../images/499122_1_En_3_Chapter/499122_1_En_3_Figo_HTML.jpg
auc_score = metrics.roc_auc_score(y_test, log_reg.predict_proba(X_test)[:,1])
round( float( auc_score ), 2 )
The output is 0.81.
Interpretation of the Results: By comparing the training confusion matrix and the testing confusion matrix, we can determine the efficacy of the solution, as shown in the following figure.
../images/499122_1_En_3_Chapter/499122_1_En_3_Figp_HTML.jpg

On testing data, the model’s overall accuracy is 85%. Sensitivity or recall is 97% and precision is 84%. The model has a good overall accuracy. However, the model can be improved as we can see that 24 applications were predicted as approved while they were actually not approved.
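For a per-class view of precision, recall, and F1 in one place, the classification_report function (imported in Step 22 but not used there) can be applied to the test predictions. A minimal sketch, assuming the objects created in the earlier steps (log_reg, X_test, y_test, pred_test) are still in scope:

from sklearn.metrics import classification_report

# Per-class precision, recall, F1, and support on the test set
print(classification_report(y_test, pred_test))

# Overall test accuracy for a quick comparison with the training results
print("Test accuracy:", round(log_reg.score(X_test, y_test), 2))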

Additional Notes

You can do a quick visualization of all the variables using only one command, with the following code, as shown in the following graph. It depicts the relationship of loan status with all the variables.
import seaborn as sns
sns.pairplot(credit_df, hue="Loan_Status", palette="husl")
../images/499122_1_En_3_Chapter/499122_1_En_3_Figq_HTML.jpg
Note

If the testing accuracy is not similar to training accuracy and is significantly lower, it means the model is overfitting. We will study how to tackle this problem in Chapter 5.

Logistic regression is generally among the first algorithms used whenever we approach a classification problem. It is fast, easy to comprehend, and able to handle categorical and continuous data points alike. Hence, it is quite popular. It can also be used to identify the significant variables for a problem.

Now that we have examined logistic regression in detail, let us move to the second very important classifier used: naïve Bayes. Don’t go by the word “naïve”; this algorithm is quite robust in making classifications!

Naïve Bayes for Classification

Consider this: you are planning to go camping. The trip will depend on a few factors like how the weather is, whether rain is predicted, whether you got a day off from the office, whether your friends are coming, and so on. You have historical data, shown in Table 3-2, from which to predict whether you will go camping or not.
Table 3-2

Factors to Consider When Planning a Camping Trip

Day off | Weather  | Friends coming | Humidity | Going camping
Yes     | Rainy    | No             | High     | Yes
No      | Sunny    | Yes            | Low      | Yes
Yes     | Overcast | No             | Low      | No
Yes     | Rainy    | No             | High     | Yes
Yes     | Sunny    | Yes            | Low      | Yes
Yes     | Rainy    | Yes            | High     | Yes
No      | Sunny    | No             | High     | No
No      | Overcast | Yes            | Low      | Yes

As shown in the table, the final decision to go camping or not depends on the outcomes of other events. This brings us to the concept of conditional probability. We will first discuss a few key points related to probability to build the understanding:
  1. 1)

    If A is any event, then the complement of A, denoted by Â, is the event that A does not occur.

     
  2. 2)

    The probability of A is represented by P(A), and the probability of its complement is P(Â) = 1 – P(A).

     
  3. 3)

    Let A and B be any events with probabilities P(A) and P(B). If you are told that B has occurred, then the probability of A might change. As in the previous case, if the weather is rainy then the probability of camping changes. This new probability of A is called the conditional probability of A given B, written as P(A|B).

     
  4. 4)

    Mathematically, $$ P(A|B) = \frac{P(A \mathrm{\ and\ } B)}{P(B)} $$, where P(A|B) means the probability of A given B, that is, the probability of A if B is known to have occurred.

     
  5. 5)

    This relationship can be viewed as probabilistic dependency and is called conditional probability. It means that knowledge of one event is of importance when assessing the probability of the other.

     
  6. 6)

    If the two events are mutually independent, then the multiplication rule simplifies to P (A and B) = P(A)P(B). For example, there will be no impact on your camping plans based on the price of milk.

     

There are many events which are mutually dependent on each other, and hence it becomes imperative to understand the relationship between P(A|B) and P(B|A). This is true of business activities too: sales depend on the number of customers visiting the store, whether a customer will come back for shopping depends on previous experiences, and so on. Bayes' theorem helps to model such factors and make a prediction.

As per Bayes' rule, if we have two events A and B, then the conditional probability of A given B can be represented as
$$ P(A|B) = \frac{P(B|A) \times P(A)}{P(B)} $$
(Equation 3-11)

where P(A) and P(B): probability of A and B respectively, P(A|B): probability of A given B and P(B|A): probability of B given A.

For example, suppose we want to find a patient's probability of having heart disease given that they have diabetes. The data we have is as follows: 10% of patients entering the clinic have heart disease, while 5% of the patients have diabetes. Among the patients diagnosed with heart disease, 8% are diabetic. Then P(A) = 0.10, P(B) = 0.05, and P(B|A) = 0.08. Using Bayes' rule, P(A|B) = (0.08 × 0.1)/0.05 = 0.16.
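The same arithmetic in Python, as a minimal sketch using the numbers from the clinic example:

p_heart = 0.10                  # P(A): patient has heart disease
p_diabetes = 0.05               # P(B): patient has diabetes
p_diabetes_given_heart = 0.08   # P(B|A): diabetic among heart-disease patients

# Bayes' rule (Equation 3-11): P(A|B) = P(B|A) * P(A) / P(B)
p_heart_given_diabetes = p_diabetes_given_heart * p_heart / p_diabetes
print(round(p_heart_given_diabetes, 2))   # 0.16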

If we generalize the rule, let A1 through An be a set of mutually exclusive outcomes. The probabilities of these events, P(A1) through P(An), are called prior probabilities. Because an observed piece of information B might influence our thinking about the probability of each Ai, we need to find the conditional probability P(Ai|B) for each outcome Ai. This is called the posterior probability of Ai.

Using Bayes’ rule, we can say that
$$ P(A_i|B) = \frac{P(B|A_i)\,P(A_i)}{P(B|A_1)\,P(A_1) + P(B|A_2)\,P(A_2) + \dots + P(B|A_n)\,P(A_n)} $$
(Equation 3-12)
Bayes’ rule says that the posterior is the likelihood times the prior, divided by a sum of likelihood times priors. The denominator in Bayes’ rule is the probability P(B).
$$ \mathrm{Posterior\ probability} = \frac{\mathrm{Conditional\ probability} \times \mathrm{Prior\ probability}}{\mathrm{evidence}} $$
(Equation 3-13)

When Bayes' theorem is used for classification (binary or multiclass), the resulting method is referred to as naïve Bayes. It is called naïve due to the very strong assumption that the variables and features are independent of each other, which is generally not true in the real world. Even though this assumption is often violated, naïve Bayes tends to perform well. The idea is to factor all available evidence in the form of predictors into the naïve Bayes rule to obtain a more accurate probability for class prediction.

As per Bayes' rule, naïve Bayes estimates conditional probability, that is, the probability that something will happen given that something else has already occurred. For example, an email spam filter can estimate the probability that a given mail is spam given the appearance of the word "discount." Naïve Bayes is easy to implement, fast, robust, and quite accurate. Because of its ease of use, it is quite a popular technique.

Advantages of naïve Bayes algorithm:
  1. 1.

    It is a simple, easy, fast, and very robust method.

     
  2. 2.

    It does well with both clean and noisy data.

     
  3. 3.

    It requires few examples for training, but the underlying assumption is that the training dataset is a true representative of the population.

     
  4. 4.

    It is easy to get the probability for a prediction.

     
Disadvantages of naïve Bayes algorithm:
  1. 1.

    It relies on a very big assumption that independent variables are not related.

     
  2. 2.

    It is generally not suitable for datasets with large numbers of numerical attributes.

     
  3. 3.

    The probabilities predicted by the naïve Bayes algorithm are considered less reliable in practice than the predicted classes.

     
  4. 4.

    In some cases, it has been observed that if a rare category is absent from the training data but present in the test data, then the estimated probability will be wrong.

     

But when some of our independent variables are continuous, we cannot calculate conditional probabilities by simple counting. Hence, for real-life data with continuous variables, naïve Bayes is extended to Gaussian naïve Bayes.

In Gaussian naïve Bayes, the continuous values associated with each attribute or independent variable are assumed to follow a Gaussian (normal) distribution. This is also easier to work with, as for training we only have to estimate the mean and standard deviation of each continuous variable within each class.
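To see what this means in code, here is a minimal sketch of the Gaussian likelihood that Gaussian naïve Bayes computes for a continuous feature. The class means and standard deviations below are made-up numbers purely for illustration:

import numpy as np

def gaussian_likelihood(x, mean, std):
    # P(x | class) under a normal distribution with the class's mean and std
    return np.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (np.sqrt(2 * np.pi) * std)

# Hypothetical per-class statistics for a continuous feature such as age
age = 40
print("P(age=40 | income >50K) :", gaussian_likelihood(age, mean=44.0, std=10.0))
print("P(age=40 | income <=50K):", gaussian_likelihood(age, mean=36.0, std=13.0))

# Naive Bayes multiplies such likelihoods across features and by the class priors,
# then assigns the class with the highest posterior.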

Let us move to the case study to develop the naïve Bayes solution.

Case Study: Income Prediction on Census Data

Business Objective: We have census data and the objective is to predict whether income exceeds 50K/yr for an individual based on the value of other attributes.

Dataset: The dataset and code is available at the Github link shared at the start of this chapter.

Variable description:
  • Age: continuous

  • Workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

  • fnlwgt: continuous.

  • Education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

  • Education-num: continuous.

  • Marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

  • Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

  • Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

  • Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

  • Sex: Female, Male.

  • Capital-gain: continuous.

  • Capital-loss: continuous.

  • Hours-per-week: continuous.

  • Native-country: United States, Cambodia, England, Puerto Rico, Canada, Germany, Outlying-US (Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican Republic, Laos, Ecuador, Taiwan, Haiti, Colombia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El Salvador, Trinadad&Tobago, Peru, Hong, Holland-Netherlands.

  • Class: >50K, <=50K

Step 1: Import the necessary libraries here.
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split# used to split the dataset into train and test datasets
from sklearn.naive_bayes import GaussianNB # To model the Gaussian Naive Bayes classifier
from sklearn.metrics import accuracy_score # To calculate the accuracy score of the model
Step 2: We now import the data. Please note that this file has a .data extension. For importing the census data, we are passing four parameters. The 'adult.data' parameter is the file name. The header parameter indicates whether the first row of the data contains headers; for our dataset there are no headers, so we pass 'None'. The delimiter parameter indicates the delimiter separating the data values; here we use the regular expression ' *, *', which also strips the spaces before and after the data values. This is very helpful when there is inconsistency in the spaces used around data values.
census_df = pd.read_csv('adult.data', header = None, delimiter=' *, *', engine="python")
Step 3: We now add the headers to the dataframe. This is required so that we can refer to the columns by name later:
census_df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']
Step 4: Let us print the total number of records (rows) in the dataframe:
len(census_df)
The output is 32561
Step 5: Check the presence of null values in our dataset, as follows:
census_df.isnull().sum()
../images/499122_1_En_3_Chapter/499122_1_En_3_Figr_HTML.jpg

The preceding output shows that there is no “null” value in our dataset.

There can be some categorical variables with missing values. We will check for those; in this dataset they sometimes have "?" in place of missing values.
for value in ['workclass','education','marital_status','occupation','relationship','race','sex','native_country','income']:
    print(value,":", sum(census_df[value] == '?'))
../images/499122_1_En_3_Chapter/499122_1_En_3_Figs_HTML.jpg

The output of the preceding code snippet shows that there are 1836 missing values in the workclass attribute, 1843 missing values in the occupation attribute, and 583 missing values in the native_country attribute.

Step 6: We will now proceed to data preprocessing. First, we will create a deep copy of our data frame:
census_df_rev = census_df.copy(deep=True)
Step 7: Before doing missing value handling tasks, we need some summary statistics of our data frame. For this, we can use the describe() method. It can be used to generate various summary statistics, excluding NaN values.
census_df_rev.describe()
The output is as follows:
../images/499122_1_En_3_Chapter/499122_1_En_3_Figt_HTML.jpg
census_df_rev.describe(include= 'all')
If include='all' is passed, it means we want to check the summary of all the attributes, as follows:
../images/499122_1_En_3_Chapter/499122_1_En_3_Figu_HTML.jpg
Step 8: We will now impute the missing categorical values:
for value in ['workclass','education','marital_status','occupation','relationship','race','sex','native_country','income']:
    # the 'top' row of describe(include='all') holds the most frequent value of the column
    replaceValue = census_df_rev.describe(include='all')[value][2]
    census_df_rev.loc[census_df_rev[value] == '?', value] = replaceValue
Step 9: Label encoding to convert all the categorical variables to numeric (LabelEncoder assigns an integer code to each category):
le = preprocessing.LabelEncoder()
workclass_category = le.fit_transform(census_df.workclass)
education_category = le.fit_transform(census_df.education)
marital_category   = le.fit_transform(census_df.marital_status)
occupation_category = le.fit_transform(census_df.occupation)
relationship_category = le.fit_transform(census_df.relationship)
race_category = le.fit_transform(census_df.race)
sex_category = le.fit_transform(census_df.sex)
native_country_category = le.fit_transform(census_df.native_country)
Step 10: We will now initialize the encoded categorical columns:
census_df_rev['workclass_category'] = workclass_category
census_df_rev['education_category'] = education_category
census_df_rev['marital_category'] = marital_category
census_df_rev['occupation_category'] = occupation_category
census_df_rev['relationship_category'] = relationship_category
census_df_rev['race_category'] = race_category
census_df_rev['sex_category'] = sex_category
census_df_rev['native_country_category'] = native_country_category
Step 11: Look at the first few lines of our data:
census_df_rev.head()
../images/499122_1_En_3_Chapter/499122_1_En_3_Figv_HTML.jpg
Step 12: There is no need for the old categorical columns, and we can drop them safely:
dummy_fields = ['workclass','education','marital_status','occupation','relationship','race', 'sex', 'native_country']
census_df_rev = census_df_rev.drop(dummy_fields, axis = 1)
Step 13: We will have to reindex all the columns and for that we will use the reindex_axis method.
census_df_rev = census_df_rev.reindex_axis(['age', 'workclass_category', 'fnlwgt', 'education_category', 'education_num', 'marital_category', 'occupation_category', 'relationship_category', 'race_category', 'sex_category', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country_category', 'income'], axis=1)
census_df_rev.head(5)
../images/499122_1_En_3_Chapter/499122_1_En_3_Figw_HTML.jpg
The reindex_axis method has been deprecated, and hence we receive the error shown in the preceding illustration. Let's use the newer reindex method instead; then we will not get the error.
census_df_rev = census_df_rev.reindex(['age', 'workclass_category', 'fnlwgt', 'education_category', 'education_num', 'marital_category', 'occupation_category',  'relationship_category', 'race_category', 'sex_category', 'capital_gain',  'capital_loss', 'hours_per_week', 'native_country_category', 'income'], axis= 1)
census_df_rev.head(5)
../images/499122_1_En_3_Chapter/499122_1_En_3_Figx_HTML.jpg
Step 14: We will now arrange our data into independent variables and the target variable:
X = census_df_rev.values[:,:14]  ## These are the input variables
Y = census_df_rev.values[:,14]  ## This is the Target variable
Step 15: Now split the data into train and test in the ratio of 75:25.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 5)
Step 16: We will fit the naïve Bayes model now.
clf = GaussianNB()
clf.fit(X_train, Y_train)
Step 17: The classifier is now trained on the training data and is ready to make predictions. We can use the predict() method with the test set features as its parameter.
Y_pred = clf.predict(X_test)
Step 18: Check the accuracy of the model now.
accuracy_score(Y_test, Y_pred, normalize = True)
The accuracy we are getting is 0.79032.
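A single accuracy number hides how the errors are distributed across the two income classes; a confusion matrix and per-class metrics give a fuller picture. A minimal sketch, assuming Y_test and Y_pred from the previous steps are still in scope:

from sklearn.metrics import confusion_matrix, classification_report

# How the ~79% accuracy breaks down across the <=50K and >50K classes
print(confusion_matrix(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred))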

With this, we have implemented naïve Bayes on a real dataset. You are advised to understand each of the steps and practice the solution by replicating it.

Naïve Bayes is a fantastic algorithm to practice and use. Bayesian statistics is gathering a lot of attention and its power is being harnessed in many research areas; you might have heard the term Bayesian optimization. The beauty of Bayes' theorem lies in its simplicity, which is very much visible in our day-to-day life. A straightforward method indeed!

We have covered logistic regression and naïve Bayes so far. Now we will examine one more very widely used classifier called k-nearest neighbor. One of the popular methods, easy to understand and implement—let’s study knn next.

k-Nearest Neighbors for Classification

"Birds of a feather flock together." This old adage is perfect for k-nearest neighbors. It is one of the most popular ML techniques, where the learning is based on the similarity of data points to each other. knn is a nonparametric method; it does not construct a "model", and the classification is based on a simple majority vote from the neighbors. It can be used for classification where the relationship between attributes and target classes is complex and difficult to understand, yet items in a class tend to be fairly homogeneous in their attribute values. It might not be the best choice for an unclean dataset or where the target classes are not distinctly separated; if the target classes are not clearly demarcated, the majority vote becomes confused. knn can also be used for regression problems to make a prediction for a continuous variable; in that case, the final output is the average of the neighbors' values, which is assigned to the target variable. Let us examine this visually:

For example, we have some data points represented by circles and plus signs in a vector space diagram as shown in Figure 3-5. There are clearly two classes in this case. The objective is to classify a new data point (marked in yellow) shown in Figure 3-5(ii) and identify which class it belongs to.
../images/499122_1_En_3_Chapter/499122_1_En_3_Fig5_HTML.jpg
Figure 3-5

(i) The distribution of two classes shown as green circles and black plus signs. (ii) The right side shows the new data point to be classified (shown in yellow). The k-nearest neighbor algorithm is used to make the classification.

This yellow point can be a circle or a plus sign and nothing else. knn will help in this classification by taking a majority vote from the other data points in the vicinity. The value of “k” guides us on how many data points are to be considered for voting. Let’s assume we take k = 4. We will now draw a circle with the yellow point at its center, and the circle should be just big enough to enclose only four data points. It is represented in Figure 3-6(i).
../images/499122_1_En_3_Chapter/499122_1_En_3_Fig6_HTML.jpg
Figure 3-6

(i) The presence of new unseen data if we select four nearest neighbors. It is an easy decision to make. (ii) The right side shows that the new data point is difficult to classify as the four neighbors are mixed.

The four closest points to the yellow point all belong to circles or we can say that all the neighbors of the unknown yellow point are circles. Hence, with a good confidence level we can predict that the yellow point should belong to the circle. Here the choice was comparatively easy and straightforward. Refer to Figure 3-6(ii), where it is not that simple. Hence, the choice of k plays a very crucial role.

The steps which are followed in k-nearest neighbor are as follows:
  1. We receive the raw and unclassified dataset which has to be worked upon.
  2. We choose a distance metric such as Euclidean, Manhattan, or Minkowski.
  3. We then calculate the distance between the new data point and the known, classified training data points.
  4. The number of neighbors to be considered is defined by the value of "k".
  5. We look at the classes of the k points with the shortest distances and count the number of times each class appears.
  6. The class with the highest vote wins: the class which appears the greatest number of times is assigned to the unknown data point.
Tip

Parametric models make some assumptions about the input data, like having a normal distribution. Nonparametric methods, in contrast, assume that the data distribution cannot be defined by a finite set of parameters and hence make no such assumptions.

From the steps discussed for k-nearest neighbor, we can clearly understand that the final accuracy depends on the distance metric used and the value of "k".
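To make these steps concrete, here is a minimal from-scratch sketch of the majority-vote idea using Euclidean distance. The toy arrays and the helper knn_predict are hypothetical and for illustration only; in practice we use scikit-learn's KNeighborsClassifier, as shown later in the chapter.
import numpy as np
from collections import Counter
def knn_predict(X_known, y_known, x_new, k=3):
    # Step 3: Euclidean distance from the new point to every known point
    distances = np.sqrt(((X_known - x_new) ** 2).sum(axis=1))
    # Steps 4 and 5: pick the k closest points and count their classes
    nearest_labels = y_known[np.argsort(distances)[:k]]
    # Step 6: the class with the highest vote wins
    return Counter(nearest_labels).most_common(1)[0][0]
# Hypothetical toy data: two "circle" points and two "plus" points
X_toy = np.array([[1.0, 1.0], [1.5, 1.2], [4.0, 4.2], [4.5, 4.0]])
y_toy = np.array(['circle', 'circle', 'plus', 'plus'])
print(knn_predict(X_toy, y_toy, np.array([1.2, 1.1]), k=3))   # prints 'circle'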

Popular distance metrics used are
  1. Euclidean Distance: probably the most common and easiest distance to calculate between two points. It is the square root of the sum of the squared differences:

     $$ \text{Euclidean Distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} $$
     (Equation 3-14)

  2. Manhattan Distance: the distance between two points measured along axes at right angles, sometimes also referred to as city block distance. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is

     $$ \text{Manhattan Distance} = \left|x_1 - x_2\right| + \left|y_1 - y_2\right| $$
     (Equation 3-15)

  3. Minkowski Distance: a metric in a normed vector space, used in ML to measure the distance similarity between two or more vectors. It is a generalized distance metric and can be represented by the following formula:

     $$ \text{Minkowski Distance} = \left( \sum_{i=1}^{n} \left|x_i - y_i\right|^p \right)^{1/p} $$
     (Equation 3-16)

     where different values of p give different distances: with p = 1 we get Manhattan distance, with p = 2 we get Euclidean distance, and with p = ∞ we get Chebyshev distance.

  4. Cosine Similarity: a measure of similarity between two nonzero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0, π] radians.
The various distances can be viewed as shown in Figure 3-7.
../images/499122_1_En_3_Chapter/499122_1_En_3_Fig7_HTML.jpg
Figure 3-7

(i) Euclidean distance; (ii) Manhattan distance; (iii) cosine similarity
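All four metrics can be computed directly, for example with scipy. A small sketch on two hypothetical vectors:
from scipy.spatial import distance
a, b = [1, 2, 3], [4, 0, 3]           # hypothetical vectors
print(distance.euclidean(a, b))       # sqrt((1-4)^2 + (2-0)^2 + 0^2)
print(distance.cityblock(a, b))       # Manhattan: |1-4| + |2-0| + 0
print(distance.minkowski(a, b, p=3))  # generalized form with p = 3
print(1 - distance.cosine(a, b))      # cosine similarity (scipy returns the cosine distance)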

Tip

While approaching a knn problem, generally we start with Euclidean distance. In most business problems, it serves the purpose. The distance metric is also an important parameter in unsupervised clustering methods like k-means clustering.

Advantages of k-nearest neighbor
  1. It is a nonparametric method and does not make any assumptions about the distributions of the various classes in vector space.
  2. It can be used for binary classification as well as multiclass classification problems.
  3. The method is quite easy to comprehend and implement.
  4. The method is robust: if the value of k is large, it is less impacted by outliers.
Disadvantages of k-nearest neighbor
  1. The accuracy depends on the value of k, and finding the most suitable value can sometimes be difficult.
  2. The method requires the class distributions to be non-overlapping.
  3. There is no explicit model as an output, and if the value of k is small, the method is negatively impacted by the presence of outliers.
  4. The method is computation intensive as it is a lazy learner: the distances have to be calculated between all the points before a majority vote is taken. Hence, it is not a fast method to use.

There are other forms of knn too, which are as follows:

Radius Neighbor Classifier
  1. This classifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.
  2. We prefer this method when the data sampling is not uniform. But with many independent variables and a sparse dataset, it suffers from the curse of dimensionality.
Tip

When the number of dimensions increases, the volume of space increases at a very fast pace and the resultant data becomes very sparse. This is called the curse of dimensionality. Data sparsity makes statistical analysis for any dataset quite a challenging task.

KD Tree Nearest Neighbor
  1. This method is effective when the dataset is large but the number of independent variables is small.
  2. The method takes less time to compute as compared to a brute-force neighbor search. A short sketch of both variants follows below.
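Both variants are available in scikit-learn. A minimal sketch on synthetic data (the radius value and the generated dataset are arbitrary and for illustration only):
from sklearn.datasets import make_classification
from sklearn.neighbors import RadiusNeighborsClassifier, KNeighborsClassifier
X_syn, y_syn = make_classification(n_samples=500, n_features=5, random_state=42)
# Radius-based voting: every neighbor within radius r takes part in the vote
radius_clf = RadiusNeighborsClassifier(radius=2.0, outlier_label='most_frequent')
radius_clf.fit(X_syn, y_syn)
# KD-tree backed knn: efficient when samples are many but dimensions are few
kd_clf = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
kd_clf.fit(X_syn, y_syn)
print(radius_clf.score(X_syn, y_syn), kd_clf.score(X_syn, y_syn))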

It is now time to create a Python solution using knn, which we are executing next.

Case Study: k-Nearest Neighbor

The dataset to be audited consists of a wide variety of intrusions simulated in a military network environment. The environment was set up to acquire raw TCP/IP dump data for a network by simulating a typical US Air Force LAN. The LAN was operated like a real environment and blasted with multiple attacks. A connection is a sequence of TCP packets starting and ending at some well-defined times, during which data flows to and from a source IP address to a target IP address under some well-defined protocol. Each connection is labeled either as normal or as an attack with exactly one specific attack type, and each connection record consists of about 100 bytes. For each TCP/IP connection, 41 qualitative and quantitative features are obtained from normal and attack data (3 qualitative and 38 quantitative features).

The class variable has two categories: normal and anomalous.

The Dataset

The dataset is available at the git repository as Network_Intrusion.csv file. The code is also available at the Github link shared at the start of the chapter.

Business Objective

We have to fit a k-nearest neighbor algorithm to detect network intrusion.

Step 1: Import all the required libraries. We are importing pandas, numpy, matplotlib, seaborn.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Step 2: Import the dataset using read.csv method from pandas. Let’s have a look at the top five rows of the data first as follows:
network_data= pd.read_csv('Network_Intrusion.csv')
network_data.head()
../images/499122_1_En_3_Chapter/499122_1_En_3_Figy_HTML.jpg
Step 3: Now we will do the regular checkup of the data using the info() and describe() commands.
network_data.info()
../images/499122_1_En_3_Chapter/499122_1_En_3_Figz_HTML.jpg
network_data.describe().transpose()
../images/499122_1_En_3_Chapter/499122_1_En_3_Figaa_HTML.jpg
Step 4: Now check for the null values. In our dataset there are no null values fortunately.
network_data.isnull().sum()
Note

This dataset does not have any null values; we will study in detail on how to deal with null values, NA, NaN, and so on in Chapter 5.

Step 5: Let’s have a look at the class distribution. And we will visualize it too.
network_data["class"].value_counts(normalize=True)
../images/499122_1_En_3_Chapter/499122_1_En_3_Figab_HTML.jpg
pd.value_counts(network_data["class"]).plot(kind="bar")
../images/499122_1_En_3_Chapter/499122_1_En_3_Figac_HTML.jpg
Step 6: There are a few categorical variables in our dataset. We have to convert them to numerical variables; here we use label encoding with scikit-learn's LabelEncoder.
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
network_data['class'] = label_encoder.fit_transform(network_data['class'])
network_data['protocol_type'] = label_encoder.fit_transform(network_data['protocol_type'])
network_data['service'] = label_encoder.fit_transform(network_data['service'])
network_data['flag'] = label_encoder.fit_transform(network_data['flag'])
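If true one-hot encoding is preferred instead of label encoding, a possible sketch using pandas on the raw categorical columns (i.e., before the label encoding above is applied) is shown below; get_dummies adds one column per category, so the number of variables grows.
import pandas as pd
# One-hot encode the three qualitative predictors; drop_first avoids the dummy-variable trap
network_data_ohe = pd.get_dummies(network_data,
                                  columns=['protocol_type', 'service', 'flag'],
                                  drop_first=True)
print(network_data_ohe.shape)   # more columns than the original frame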
Step 7: Unlike one-hot encoding, label encoding does not add new variables to the dataset. Let's check the columns of the dataset:
network_data.columns
Step 8: Next we will standardize our dataset by using StandardScaler in scikit learn.
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
X_std = pd.DataFrame(StandardScaler().fit_transform(network_data))
X_std.columns = network_data.columns
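Since knn is distance based, feature scaling genuinely matters. The code that follows continues with the unscaled frame; if you want to carry the scaled features forward instead, one possible sketch (assuming network_data as loaded above) is:
from sklearn.preprocessing import StandardScaler
# Scale only the predictors; keep the label-encoded target untouched
feature_cols = [c for c in network_data.columns if c != 'class']
X_scaled = StandardScaler().fit_transform(network_data[feature_cols])
y_labels = network_data['class'].values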
Step 9: Now it is time to split the data into train and test. We are dividing the data in an 80:20 ratio.
import numpy as np
from sklearn.cross_validation import train_test_split
X = np.array(network_data.ix[:, 1:5]) #Transform data into features
y = np.array(network_data['class']) #Transform data into targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
../images/499122_1_En_3_Chapter/499122_1_En_3_Figad_HTML.jpg
Step 10: The sklearn.cross_validation module has been deprecated and hence we received this error. Let's split into train and test using sklearn.model_selection instead. Note that the pandas .ix indexer has also been removed in recent versions, so we use .iloc below.
from sklearn.model_selection import train_test_split
# Transform data into features and target
X = np.array(network_data.iloc[:, 1:5])
y = np.array(network_data['class'])
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
Step 11: Print the shape of the training data.
print(X_train.shape)
print(y_train.shape)
../images/499122_1_En_3_Chapter/499122_1_En_3_Figae_HTML.jpg
Step 12: Print the shape of the test data.
print(X_test.shape)
print(y_test.shape)
../images/499122_1_En_3_Chapter/499122_1_En_3_Figaf_HTML.jpg
Step 13: We will now train the model using training data and iterate with different values of k=3,5,9.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
# instantiate learning model (k = 3)
knn_model = KNeighborsClassifier(n_neighbors = 3)
# Fitting the model
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test) # predict the response
print(accuracy_score(y_test, y_pred)) # Evaluate accuracy
The answer is 0.9902758483826156
knn_model = KNeighborsClassifier(n_neighbors=5) # With k = 5
knn_model.fit(X_train, y_train) # Fitting the model
y_pred = knn_model.predict(X_test) # Predict the response
print(accuracy_score(y_test, y_pred)) # Evaluate accuracy
The answer is 0.9882913276443739
# With k = 9
knn_model = KNeighborsClassifier(n_neighbors=9)
knn_model.fit(X_train, y_train) # Fitting the model
y_pred = knn_model.predict(X_test) # Predict the response
print(accuracy_score(y_test, y_pred)) # Evaluate accuracy
The answer is 0.9867037110537805
Step 14: We have tested with three values of k. We will now iterate over multiple values of k: we will run knn with the number of neighbors set to 1, 3, 5, ..., 19 and then find the optimal number of neighbors based on the lowest misclassification error.
k_list = list(range(1,20)) # creating odd list of K for KNN
k_neighbors = list(filter(lambda x: x % 2 != 0, k_list)) # subsetting just the odd ones
ac_scores = [] # empty list that will hold accuracy scores
# perform accuracy metrics for values from 1,3,5....19
for k in k_neighbors:
    knn_model = KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(X_train, y_train)
    y_pred = knn_model.predict(X_test)   # predict the response
    scores = accuracy_score(y_test, y_pred)  # evaluate accuracy
    ac_scores.append(scores)
# changing to misclassification error
MSE = [1 - x for x in ac_scores]
# determining best k
optimal_k = k_neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)
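Tuning k against the test set, as done above, risks overfitting to that particular split. A more robust alternative, sketched here and not part of the original flow, is to cross-validate on the training data, for example with scikit-learn's GridSearchCV:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
# 5-fold cross-validation over the odd values of k, using the training data only
param_grid = {'n_neighbors': list(range(1, 20, 2))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)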
Step 15: Let's plot the impact of different values of k on the misclassification error.
import matplotlib.pyplot as plt
# plot misclassification error vs k
plt.plot(k_neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
../images/499122_1_En_3_Chapter/499122_1_En_3_Figag_HTML.jpg
Step 16: It turns out that k=3 gives us the best result. Let’s implement it:
#Use k=3 as the final model for prediction
knn = KNeighborsClassifier(n_neighbors = 3)
# fitting the model
knn.fit(X_train, y_train)
# predict the response
y_pred = knn.predict(X_test)
# evaluate accuracy
print(accuracy_score(y_test, y_pred))
print(recall_score(y_test, y_pred))

The accuracy and recall are 0.9902758483826156 and 0.9911944869831547 respectively.

With k = 3, we are getting a very good accuracy and a great recall value too. This model can be considered as the final one to be used.

We developed a solution using knn and are getting good accuracies. k-nearest neighbor is very easy to explain and visualize. Because it is not very statistics- or mathematics-heavy, the method is not tedious to understand even for non–data science users, and this is one of its most important properties. Being a nonparametric method, it makes no assumptions about the data and hence requires less data preparation.

We have covered logistic regression, naïve Bayes, and k-nearest neighbor. Generally, when we start any classification solution, we start with these three algorithms to check their accuracy. The next in the series are tree-based algorithms: decision tree and random forest. We have discussed the theoretical concepts about both in Chapter 2. In the next section, we will cover the differences and then implement them on a dataset.

Tree-Based Algorithms for Classification

Recall in Chapter 2 that we studied decision trees to predict the values of a continuous variable. Since decision trees can be used for both classification and regression problems, in this chapter we will study classification solutions using decision tree. The building blocks for a decision tree remain the same as shown in Figure 3-8.
../images/499122_1_En_3_Chapter/499122_1_En_3_Fig8_HTML.jpg
Figure 3-8

A decision tree is composed of a root node, decision nodes, and terminal nodes. A subtree is called a branch

The difference lies in the splitting process followed by a classification algorithm, which is discussed now.

The objective of splitting is to create as many pure nodes as possible. If a resultant node after splitting contains all the data points belonging to the same class, it is called pure or homogeneous. If the node contains records belonging to different classes, the node is impure or heterogeneous. There are three primary ways to measure the impurity: entropy, Gini index, and classification error. We describe them in detail now and compare their respective processes. Consider that we have a dataset like the one in Table 3-3.
Table 3-3

Transportation Mode Dependency on Other Factors Like Gender and Income Level

Vehicle Count | Gender | Cost   | Income       | Transportation Mode
1             | Female | Less   | Middle class | Train
0             | Male   | Less   | Low income   | Bus
1             | Female | High   | Middle class | Train
1             | Male   | Less   | Middle class | Bus
0             | Male   | Medium | Middle class | Train
1             | Male   | Less   | Middle class | Bus
2             | Female | High   | High class   | Car
0             | Female | Less   | Low income   | Bus
2             | Male   | High   | Middle class | Car
1             | Female | High   | High class   | Car

We want to train an algorithm to predict the transportation mode. As per the preceding example, we can calculate that the respective probabilities are
Probability (Bus) = 4/10 = 0.4
Probability (Car) = 3/10 = 0.3
Probability (Train) = 3/10 = 0.3
The three methods are entropy, Gini index, and classification error, which are described now:
  • Entropy: Entropy and information gain go hand in hand. A pure node requires less information to describe itself, while an impure node requires more. Information gain is the reduction in entropy achieved by a split: the lower the entropy of the resulting nodes, the higher the information gain. For a node containing two classes, the entropy is

$$ \text{Entropy of the system} = -p \log_2 p - q \log_2 q $$
(Equation 3-17)
  • where p and q are the probabilities of the two classes in that node and the logarithm is to the base 2. For more than two classes, this generalizes to $-\sum_i p_i \log_2 p_i$.

  • In the preceding example, entropy = –0.4 log2(0.4) – 0.3 log2(0.3) – 0.3 log2(0.3) = 1.571

  • Entropy of a pure node is 0. The maximum entropy for different numbers of classes can be represented as in Figure 3-9.

../images/499122_1_En_3_Chapter/499122_1_En_3_Fig9_HTML.png
Figure 3-9

Values of maximum entropy for different numbers of classes (n). Probability p = 1/n.

  • Gini index: The Gini index can also be used to measure the impurity. The formula to be used is as follows:

    $$ \text{Gini Index} = 1 - \sum_j p_j^2 $$
    (Equation 3-18)
  • In the preceding example, Gini index = 1 – (0.4² + 0.3² + 0.3²) = 0.660

  • The Gini index of a node containing a single class is 0 because its probability is 1. Like entropy, it also takes its maximum value when all the classes in the node have the same probability. The movement of the Gini index can be represented as in Figure 3-10.

../images/499122_1_En_3_Chapter/499122_1_En_3_Fig10_HTML.png
Figure 3-10

Values of maximum Gini index for different number of classes (n). Probability p = 1/n.

  • The value of Gini will always be between 0 and 1 irrespective of the number of classes in the model.

  • Classification Error: The next way to measure the degree of impurity is the classification error. It is given by the formula

$$ \text{Classification Error Index} = 1 - \max_i \left( p_i \right) $$
(Equation 3-19)
  • where i indexes the classes and p_i is the probability of class i in the node.

  • Similar to the other two, its value lies between 0 and 1.

  • For the preceding example, classification error index = 1 – max(0.4, 0.3, 0.3) = 0.6
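The three impurity values for this example can be verified with a few lines of Python; the class probabilities are the ones derived from Table 3-3.
import numpy as np
p = np.array([0.4, 0.3, 0.3])              # Bus, Car, Train
entropy = -np.sum(p * np.log2(p))          # ≈ 1.571
gini = 1 - np.sum(p ** 2)                  # = 0.66
classification_error = 1 - p.max()         # = 0.6
print(entropy, gini, classification_error)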

We can use any of these three splitting methods for classification. There are some common decision tree algorithms which are used for regression and classification problems. Since we have studied both regression and classification concepts, it is a good time to examine different types of decision tree algorithms, which is the next section.

Types of Decision Tree Algorithms

There are some significant decision tree algorithms which are used in the industry. Some of them are suitable for classification problems while some are a better choice for regression solutions. We are discussing all the aspects of the algorithms.

The prominent algorithms are as follows:
  1. ID3 or Iterative Dichotomizer 3 is a decision tree algorithm that uses a greedy search to split the dataset in each iteration. It uses entropy, or equivalently information gain, as the criterion to perform the split iteratively. For each successive iteration, it considers the variables not used in earlier iterations, calculates the entropy for those variables, and then selects the variable with the lowest entropy, or in other words, the variable with the highest information gain. ID3 can lead to overfitting and may not be the most optimal choice. It fares quite well with categorical variables, but when the dataset contains continuous variables it becomes slower to converge, since there are many values on which the node split can be made.

  2. CART or classification and regression tree is a flexible tree-based solution. It can model both continuous and categorical target variables, and hence it is one of the most widely used tree-based algorithms. Like a regular decision tree algorithm, we choose input variables and split the nodes iteratively until we achieve a robust tree. The selection of the input variables is done using a greedy approach with the objective of minimizing the loss. Tree construction stops based on a predefined criterion, such as the minimum number of observations present in each leaf. The Python library scikit-learn uses an optimized version of CART but does not support categorical variables as of now.

  3. C4.5 is an extension of ID3 and is used for classification problems. Similar to ID3, it utilizes entropy or information gain to make the split. It is a robust choice since it can handle both categorical and continuous variables. For continuous variables, it assigns a threshold value and splits based on it: records with values above the threshold go into one bucket, while records with values less than or equal to the threshold go into another. It allows missing values in the data, as the missing values are not considered while calculating the entropy.

  4. CHAID (chi-square automatic interaction detection) is a popular algorithm in the field of market research and marketing; for example, when we want to understand how a certain group of customers will respond to a new marketing campaign. This campaign can be for a new product or service, and the results will help the marketing team strategize accordingly. CHAID is primarily based on adjusted significance testing and is mostly used when we have a categorical target variable and categorical independent variables. It proves to be quite a handy and convenient method to visualize such a dataset.

  5. MARS or multivariate adaptive regression splines is a nonparametric regression technique. It is mostly suitable for modeling nonlinear relationships between variables. It is a flexible regression model which can handle both categorical and continuous variables. It is quite a robust solution for handling massive datasets and requires very little data preparation, making it comparatively fast to implement. Owing to its flexibility and ability to model nonlinearities in the dataset, MARS is generally a good choice to tackle overfitting.

The tree-based algorithms discussed previously are unique in their own way. Some of them are more suitable for classification problems, while some are a better choice for regression problems. CART can be used for both classification and regression problems.

Now it is time to develop a case study using decision tree.

The code and the dataset are available at the Github link shared at the start of this chapter.

We can use the same dataset we have used for the logistic regression problem. The implementation follows after we have created the training and testing data.

Step 1: Import the necessary libraries first.
from sklearn.tree import DecisionTreeClassifier
Step 2: Now we are calling the decision tree classifier and training the model.
dt_classifier = DecisionTreeClassifier()
dt_classifier.fit(X_train, y_train)
Step 3: Use the trained model to make a prediction on the test data.
y_pred = dt_classifier.predict(X_test)
Step 4: Get the confusion matrix and visualize it, using the same approach as in the logistic regression solution.
mat_test = confusion_matrix(y_test, y_pred)
print(mat_test)
ax= plt.subplot()
ax.set_ylim(2.0, 0)
annot_kws = {"ha": 'left',"va": 'top'}
sns.heatmap(mat_test, annot=True, ax = ax, fmt= 'g',annot_kws=annot_kws); #annot=True to annotate cells
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels');
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Not Approved', 'Approved']);
ax.yaxis.set_ticklabels(['Not Approved', 'Approved']);
../images/499122_1_En_3_Chapter/499122_1_En_3_Figah_HTML.jpg
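To see what the classifier has actually learned, the fitted tree can also be visualized. A short sketch, assuming dt_classifier is the model fitted above (plot_tree is available in scikit-learn 0.21 and later):
from sklearn import tree
import matplotlib.pyplot as plt
# Inspect only the first few levels; deep trees become unreadable
plt.figure(figsize=(12, 6))
tree.plot_tree(dt_classifier, max_depth=2, filled=True)
plt.show()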

We will now model the same problem using a random forest model. Recall that random forest is an ensemble-based technique where it creates multiple smaller trees using a subset of data. The final decision is based on the voting mechanism by each of the trees. In the last chapter, we have used random forest for a regression problem; here we are using random forest for a classification problem.

Tip

Decision trees are generally prone to overfitting; ensemble-based random forest model is a good choice to tackle overfitting.

Step 1: Import the library and fit the model. Create the model with 500 trees.
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=500, bootstrap = True,
                               max_features = 'sqrt')
# Now we will fit on the training data
rf_model.fit(X_train, y_train)
Step 2: Predict on test data and plot the confusion matrix.
y_pred = rf_model.predict(X_test)
mat_test = confusion_matrix(y_test, y_pred)
print(mat_test)
ax= plt.subplot()
ax.set_ylim(2.0, 0)
annot_kws = {"ha": 'left',"va": 'top'}
sns.heatmap(mat_test, annot=True, ax = ax, fmt= 'g',annot_kws=annot_kws); #annot=True to annotate cells
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels');
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Not Approved', 'Approved']);
ax.yaxis.set_ticklabels(['Not Approved', 'Approved']);
../images/499122_1_En_3_Chapter/499122_1_En_3_Figai_HTML.jpg
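One advantage of random forest is that it exposes feature importances after fitting. A brief sketch, assuming rf_model and X_train from the steps above; the indices map to the columns used to build X_train:
import numpy as np
# Rank features by the impurity-based importance computed during training
importances = rf_model.feature_importances_
for idx in np.argsort(importances)[::-1][:10]:
    print(f"feature {idx}: {importances[idx]:.4f}")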

This is the implementation of a decision tree algorithm and ensemble-based random forest algorithm. Tree-based algorithms are very easy to comprehend and implement. They are generally the first few algorithms which we implement and test the accuracy of the system. Tree-based solutions are recommended if we want to create a quick solution, but they are prone to overfitting. We can use tree pruning or setting a constraint on tree size to overcome overfitting. We will again visit this concept in Chapter 5, in which we will discuss all the techniques to overcome the problem of overfitting in our ML model.
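As one illustration of constraining tree size to curb overfitting, here is a sketch only; the depth, leaf size, and pruning values are hypothetical and would normally be tuned:
from sklearn.tree import DecisionTreeClassifier
# Constrain depth and leaf size, and apply cost-complexity pruning
pruned_tree = DecisionTreeClassifier(max_depth=5,
                                     min_samples_leaf=20,
                                     ccp_alpha=0.001)
pruned_tree.fit(X_train, y_train)
print(pruned_tree.score(X_test, y_test))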

With this, we come to the end of our discussion on tree-based algorithms. In this chapter, we have studied the classification algorithms and implemented them too. These algorithms are quite popular in the industry and powerful enough to help us make a robust ML model. Generally, we test the data on these algorithms at the start and then choose the one which is giving us the best results. And then we tune it further till we achieve the most desirable output. The desirable output may not be the most complex solution but surely will be one which will deliver the desired level of measurement parameters, reproducibility, robustness, flexibility, and ease of deployment. Remember, complexity is not proportional to accuracy. A more complex model does not mean a higher degree of performance!

Summary

Prediction is a powerful tool in our hands. Using these ML algorithms, we can not only take a confident decision but also ascertain the factors which affect that decision. These algorithms are heavily used across sectors like banking, retail, manufacturing, insurance, aviation, and so on. The uses include fraud detection, quality inspection, churn prediction, loan default prediction, and so on.

You should note that these algorithms are not the only sources of knowledge. A sound exploratory analysis is a prerequisite for a good ML algorithm. And the most important resource is “data” itself. A good-quality and representative dataset is of paramount importance. We have discussed the qualities of a good dataset in Chapter 1.

It is also imperative that a sound business problem has been framed from the start. The choice of the target variable should align with the business problem at hand. The training data used to train the algorithm plays a very crucial role, as the patterns learned by the algorithm depend on it. It is important that we measure the performance of the algorithms using the various parameters like precision, recall, AUC, F1 score, and so on. An algorithm which performs well on the training, testing, and validation datasets will be the best algorithm. Still, there are a few other parameters based on which we choose the final algorithm to be deployed into production, which we discuss in Chapter 5.

In Chapter 1, we examined ML, various types, data and attributes of data quality and ML process. In Chapter 2, we studied ML algorithms to model a continuous variable. In this third chapter, we complemented the knowledge with classification algorithms. These first chapters have created a firm base for you to solve most of the business problems in the data science world. Also, you are now ready to take the next step in Chapter 4.

In the first three chapters, we have discussed basic and intermediate algorithms. In the next chapter, we are going to cover much more complex algorithms like SVM, gradient boosting, and neural networks for regression and classification problems. So stay focused!

Exercise Questions

Question 1: How does a logistic regression algorithm make a classification prediction?

Question 2: What is the difference between precision and recall?

Question 3: What is posterior probability?

Question 4: What are the assumptions in a naïve Bayes algorithm?

Question 5: How can we choose the value of k in k-nearest neighbor?

Question 6: What are the various performance measurement parameters for classification algorithms?

Question 7: The sinking of the ship Titanic in 1912 was indeed heart-breaking. Some passengers survived, some did not. Download the dataset from https://www.kaggle.com/c/titanic. Using ML, predict which passengers were more likely to survive than others based on the various attributes of the passengers.

Question 8: Load the dataset Iris using the following command:
from sklearn.datasets import load_iris
iris = load_iris()

We have worked upon the same dataset in the last chapter. Here, you have to classify the type of the flower using classification algorithms and compare the results.

Question 9: Download the Bank Marketing Dataset from the link https://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification problem goal is to predict if the client will subscribe to a term deposit. Create various classification models and compare the respective KPIs.

Question 10: Get the German Credit Risk data from https://www.kaggle.com/uciml/german-credit. The dataset contains attributes of each person who takes credit from the bank. The objective is to classify each person as a good or bad credit risk according to their attributes. Use classification algorithms to create the model and choose the best model.

Question 11: Go through the research paper on logistic regression at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6696525/.

Question 12: Go through the research paper on random forest at https://aip.scitation.org/doi/pdf/10.1063/1.4977376. There is one more good paper at https://aip.scitation.org/doi/10.1063/1.4952607.

Question 13: Examine the research paper on knn at https://pdfs.semanticscholar.org/a196/39771e987588b378879c65300b61b4af86af.pdf.

Question 14: Study the research paper on naïve Bayes at https://www.cc.gatech.edu/~isbell/reading/papers/Rish.pdf.
