Predicting with logistic regression

A logistic regression model consists of input units and a single output unit that holds the predicted value. The values of the input units are combined into one transformed output value by the output unit's activation function. An activation function has two parts: the first part is the combination function, which merges all of the inputs into a single value (a weighted sum, for example); the second part is the transfer function, which maps the value of the combination function to the output value of the unit. In logistic regression, the transfer function is the sigmoid or logistic function, which is S-shaped. Training a logistic regression model is the process of finding the weights on the inputs that maximize the quality of the predicted output. The formula for the logistic function is:

   F(x) = 1 / (1 + e^(-x))

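To make the two-step calculation concrete, here is a minimal R sketch that applies a weighted-sum combination function followed by the logistic transfer function to a single case. The weights, input values, and intercept are made up purely for illustration and are not taken from the target mail data:

sigmoid <- function(x) 1 / (1 + exp(-x));   # logistic (sigmoid) transfer function
w <- c(0.8, -0.5, 1.2);                     # hypothetical weights on three inputs
x <- c(1.0, 2.0, 0.5);                      # hypothetical input values
b <- -0.3;                                  # hypothetical intercept (bias)
combination <- sum(w * x) + b;              # combination function: weighted sum
sigmoid(combination);                       # transfer function: value between 0 and 1
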
The following figure shows the logistic regression algorithm graphically:

The logistic regression algorithm

In the previous module, you have already seen the rxLogit() function from the RevoScaleR package in action. Now you are going to learn how to use the glm() function from the base R installation. In addition, you will see how you can improve the quality of the predictions by improving the data and by selecting different algorithms.

Let's start the advanced analytics session by re-reading the target mail data. In addition, the following code defines the ordered levels of the Education factor variable and changes the BikeBuyer variable to a labeled factor:

TM = read.table("C:\\SQL2017DevGuide\\Chapter14_TM.csv", 
                sep=",", header=TRUE, 
                stringsAsFactors = TRUE); 
TM$Education = factor(TM$Education, ordered=TRUE,  
                      levels=c("Partial High School",  
                               "High School","Partial College", 
                               "Bachelors", "Graduate Degree")); 
TM$BikeBuyer <- factor(TM$BikeBuyer, 
                       levels = c(0,1), 
                       labels = c("No","Yes")); 

Next, you need to prepare the training and test sets. The split must be random. However, by setting the seed, you can reproduce the same split later, meaning you get the same cases in the training and test sets every time:

set.seed(1234); 
train <- sample(nrow(TM), 0.7 * nrow(TM)); 
TM.train <- TM[train,]; 
TM.test <- TM[-train,]; 
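
You can quickly confirm that the split is approximately 70/30; this check is optional:

nrow(TM.train);   # number of cases used for training
nrow(TM.test);    # number of cases held out for testing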

Now it's time to create the first logistic regression model. For the sake of simplicity, the first model uses three input variables only:

TMLogR <- glm(BikeBuyer ~ 
              YearlyIncome + Age + NumberCarsOwned, 
              data=TM.train, family=binomial()); 
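
Before scoring the test set, you can optionally inspect the fitted model with the standard summary() function, which shows the coefficient estimates, their standard errors, and their significance:

summary(TMLogR); 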

You test the model by performing predictions on the test dataset. Logistic regression returns the output as a continuous value between 0 and 1. The following code recodes the output value to a factor: a value greater than 0.5 is transformed to Yes, meaning a predicted bike buyer, and any other value to No, meaning a predicted non-buyer. The results are shown in a pivot table together with the actual values:

probLR <- predict(TMLogR, TM.test, type = "response"); 
predLR <- factor(probLR > 0.5, 
                 levels = c(FALSE, TRUE), 
                 labels = c("No","Yes")); 
perfLR <- table(TM.test$BikeBuyer, predLR, 
                dnn = c("Actual", "Predicted")); 
perfLR; 

The pivot table created is called the classification (or confusion) matrix. Here is the result:

          Predicted
    Actual   No  Yes
       No  1753 1084
       Yes 1105 1604
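
To condense the matrix into a single number, you can also compute the overall accuracy, that is, the share of correctly predicted cases. The accuracy() helper below is not part of the original example; it is just a convenient sketch that you can reuse for the models that follow:

accuracy <- function(cm) sum(diag(cm)) / sum(cm);   # correct predictions / all predictions
accuracy(perfLR); 

For this first model, the accuracy works out to roughly 60 percent.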

You can compare the predicted and the actual numbers to check the true positives (the cases correctly predicted as Yes), the true negatives (the cases correctly predicted as No), the false positives (the cases incorrectly predicted as Yes), and the false negatives (the cases incorrectly predicted as No). The result is not overly exciting, so let's continue the session by creating and testing a new logistic regression model, this time with all possible input variables, to see whether we can get a better result:

TMLogR <- glm(BikeBuyer ~ 
              MaritalStatus + Gender + 
              TotalChildren + NumberChildrenAtHome + 
              Education + Occupation + 
              HouseOwnerFlag + NumberCarsOwned + 
              CommuteDistance + Region + 
              YearlyIncome + Age, 
              data=TM.train, family=binomial()); 
probLR <- predict(TMLogR, TM.test, type = "response"); 
predLR <- factor(probLR > 0.5, 
                 levels = c(FALSE, TRUE), 
                 labels = c("No","Yes")); 
perfLR <- table(TM.test$BikeBuyer, predLR, 
                dnn = c("Actual", "Predicted")); 
perfLR; 

This time, the results are slightly better, as you can see from the following classification matrix. Still, the results are not very exciting:

          Predicted
    Actual   No  Yes
       No  1798 1039
       Yes  928 1781

In the target mail dataset, many variables are stored as integers, although they are actually factors (nominal or ordinal variables). Let's try to help the algorithm by explicitly defining them as factors; all of them can also be treated as ordered. Of course, after changing the original dataset, you need to recreate the training and test sets. The following code does all of these tasks:

TM$TotalChildren = factor(TM$TotalChildren, ordered=TRUE); 
TM$NumberChildrenAtHome = factor(TM$NumberChildrenAtHome, ordered=TRUE); 
TM$NumberCarsOwned = factor(TM$NumberCarsOwned, ordered=TRUE); 
TM$HouseOwnerFlag = factor(TM$HouseOwnerFlag, ordered=TRUE); 
set.seed(1234); 
train <- sample(nrow(TM), 0.7 * nrow(TM)); 
TM.train <- TM[train,]; 
TM.test <- TM[-train,]; 

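If you are curious how glm() will encode these variables, note that R uses polynomial contrasts for ordered factors by default. You can inspect the levels and the contrast matrix of one of the recoded variables; this check is optional and not part of the original example:

levels(TM$NumberCarsOwned);      # ordered levels of the recoded variable
contrasts(TM$NumberCarsOwned);   # polynomial contrasts used in the model matrix
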
Now let's rebuild and test the model with all possible independent variables:

TMLogR <- glm(BikeBuyer ~ 
                MaritalStatus + Gender + 
                TotalChildren + NumberChildrenAtHome + 
                Education + Occupation + 
                HouseOwnerFlag + NumberCarsOwned + 
                CommuteDistance + Region + 
                YearlyIncome + Age, 
              data=TM.train, family=binomial()); 
probLR <- predict(TMLogR, TM.test, type = "response"); 
predLR <- factor(probLR > 0.5, 
                 levels = c(FALSE, TRUE), 
                 labels = c("No","Yes")); 
perfLR <- table(TM.test$BikeBuyer, predLR, 
                dnn = c("Actual", "Predicted")); 
perfLR; 

The results have improved again:

          Predicted
    Actual   No  Yes
       No  1850  987
       Yes  841 1868
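
If you kept the accuracy() helper sketched earlier, you can quantify the improvement; the share of correctly predicted cases rises to roughly 67 percent, compared to roughly 60 percent for the first model:

accuracy(perfLR); 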

Still, there is room for improving the predictions. However, this time we are going to use a different algorithm.
