Classification of the reviews

At the beginning of this section, we will try to classify the corpus using algorithms we have already discussed (k-NN and Naïve Bayes). We will then briefly discuss two new algorithms: logistic regression and support vector machines.

Document classification with k-NN

We already know k-Nearest Neighbors, so we'll jump straight into the classification. We will try with three neighbors and with five neighbors:

library(class) # knn() is in the class package
library(caret) # confusionMatrix() is in the caret package
set.seed(975)
Class3n = knn(TrainDF[,-1], TrainDF[,-1], TrainDF[,1], k = 3)
Class5n = knn(TrainDF[,-1], TrainDF[,-1], TrainDF[,1], k = 5)
confusionMatrix(Class3n, as.factor(TrainDF$quality))

The confusion matrix and the following statistics (the output is partially reproduced) show that classification with three neighbors doesn't seem too bad: the accuracy is 0.74. Yet the kappa value, which should be at least 0.60, is not good:

Confusion Matrix and Statistics

Reference
Prediction    0    1
         0  358  126
         1  134  382
                                          
Accuracy : 0.74
95% CI : (0.7116, 0.7669)
No Information Rate : 0.508
P-Value [Acc > NIR] : <2e-16 

Kappa : 0.4876
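Since we will judge every model by its kappa value, it is worth seeing how that value is computed: it measures the agreement between predictions and reference beyond what chance alone would produce. The following is a minimal sketch (kappa2x2() is our own helper, not part of caret; it should closely reproduce the value reported above):

kappa2x2 = function(m) {
  n  = sum(m)
  po = sum(diag(m)) / n                    # observed agreement (the accuracy)
  pe = sum(rowSums(m) * colSums(m)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)                     # Cohen's kappa
}
kappa2x2(table(Class3n, TrainDF$quality))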

Let's examine the solution with five neighbors:

confusionMatrix(Class5n, as.factor(TrainDF$quality))

The output shows that the five-neighbor solution performs worse than the three-neighbor one: the accuracy drops to 0.682 and the kappa value to 0.364:

Confusion Matrix and Statistics
Reference
Prediction    0    1
         0  358  126
         1  134  382
Accuracy : 0.682
95% CI : (0.6521, 0.7108)
No Information Rate : 0.508 
P-Value [Acc > NIR] : <2e-16 
Kappa : 0.364 

Further, we have so far only looked at the training dataset. How well would things go with the testing dataset? We'll only take a look at the three-neighbor solution:

set.seed(975)
Class3nTest = knn(TrainDF[,-1], TestDF[,-1], TrainDF[,1], k = 3)
confusionMatrix(Class3nTest, as.factor(TestDF$quality))

The following output shows that the classification is pretty bad on the testing dataset: the accuracy is just about what we would expect if we assigned all cases to a single class, and the kappa value is about 0, showing no improvement over classification by chance:

Confusion Matrix and Statistics

Reference
Prediction    0    1
         0  235  226
         1  273  266
                                          
Accuracy : 0.501           
95% CI : (0.4695, 0.5324)
No Information Rate : 0.508           
P-Value [Acc > NIR] : 0.68243
                                          
Kappa : 0.032         
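Note that k = 3 and k = 5 were fairly arbitrary choices. A quick sweep over several values (a sketch reusing the objects defined above) shows at a glance whether another k would fare better on the testing set:

set.seed(975)
for (k in c(1, 3, 5, 7, 9, 11)) {
  pred = knn(TrainDF[,-1], TestDF[,-1], TrainDF[,1], k = k)
  cat("k =", k, "accuracy =", round(mean(pred == TestDF$quality), 3), "\n")
}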

Maybe we will have more luck using Naïve Bayes. Let's see!

Document classification with Naïve Bayes

Let's start by computing the model, after which we will try to classify the training dataset:

library(e1071)
set.seed(345)
model <- naiveBayes(TrainDF[-1], as.factor(TrainDF[[1]]))
classifNB = predict(model, TrainDF[,-1])
confusionMatrix(as.factor(TrainDF$quality), classifNB)

The partial output here presents the confusion matrix for the training dataset and some performance information. We can see that the classification is not too bad with regard to accuracy; the kappa value, however, is still too low (it should be at least 0.60):

Confusion Matrix and Statistics

Reference
Prediction    0    1
         0  353  139
         1   74  434
                                          
Accuracy : 0.787 
95% CI : (0.7603, 0.812)
No Information Rate : 0.573 
P-Value [Acc > NIR] : < 2e-16 

Kappa : 0.573          

Let's examine how well we can classify the test dataset using the model we just computed:

classifNB = predict(model, TestDF[,-1])
confusionMatrix(as.factor(TestDF$quality), classifNB)

The following output shows that the results on the testing data are still disappointing; the accuracy has dropped to 71 percent and the kappa value is quite bad:

Confusion Matrix and Statistics

Reference
Prediction    0    1
         0  335  173
         1  120  372

Accuracy : 0.707           
95% CI : (0.6777, 0.7351)
No Information Rate : 0.545 
P-Value [Acc > NIR] : < 2e-16 

Kappa : 0.4148          
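Before moving on, note that naiveBayes() accepts a laplace argument for additive smoothing, which can help when some terms never occur with one of the classes in the training data. Whether it helps here is an empirical question; the following is a quick sketch reusing the objects above (laplace = 1 is an illustrative value, not a tuned choice):

modelLap = naiveBayes(TrainDF[-1], as.factor(TrainDF[[1]]), laplace = 1)
classifLapTest = predict(modelLap, TestDF[,-1])
confusionMatrix(as.factor(TestDF$quality), classifLapTest)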

Classification using logistic regression

As a quick tutorial, we can check the association between review length and quality using logistic regression. There is sadly no space to explain logistic regression in detail here. Let's simply say that logistic regression predicts the probability of an outcome, rather than a value as in linear regression. We will only give an interpretation of the results, but first let's compute the model:

model = glm(quality ~ lengths, family = binomial)
summary(model)

The output of summary(model) is not reproduced here; it shows an intercept of -0.6383373 and a slope of 0.0018276 for lengths, with the slope being statistically significant.

The output shows that the length of the review is significantly associated with the perceived quality of the movie. The estimates are given as log odds, so we must exponentiate the slope to understand what it means:

exp(0.0018276)

The output is 1.001828. A value of exactly 1 would mean there is no relationship between the length of the review and the perceived quality of the movie. Here, the value is slightly higher than 1, but remember that the unit is a single term in the processed review. The exponentiated intercept represents the odds that a movie with a review containing zero terms is considered good; it therefore has no interpretable meaning here. The exponentiated slope means that the odds of a movie being considered good increase by about 0.18 percent for each additional term. Let's compute the fitted values in terms of probability and examine the relationship by plotting them.
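Before doing so, we can quickly double-check this percentage interpretation (a small sketch; the coefficient value is taken from the model summary above):

b = 0.0018276               # the slope for lengths from summary(model)
(exp(b) - 1) * 100          # about 0.18 percent higher odds per extra term
(exp(100 * b) - 1) * 100    # about 20 percent higher odds per 100 extra terms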

We obtain the probabilities of movies being classified as good for each review using either of the following lines of code:

Prob1 = exp(-0.6383373 + lengths * 0.0018276) / 
   (1 + exp(-0.6383373 + lengths * 0.0018276))
Prob2 = model$fitted

The object Prob1 contains the values computed manually, whereas Prob2 contains the values computed by the glm() function. Both agree up to the fourth decimal.
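Base R's plogis() function implements the same logistic transformation, so the manual formula can also be written more compactly. As a small check (using the objects defined above), we can confirm how closely the values agree:

Prob3 = plogis(-0.6383373 + lengths * 0.0018276) # built-in logistic function
max(abs(Prob1 - Prob3)) # identical by construction
max(abs(Prob1 - Prob2)) # tiny difference due to the rounded coefficients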

We can classify documents with a probability higher than 0.5 as positive reviews and display the confusion matrix:

classif = Prob1
classif[classif>0.5] = 1
classif[classif<=0.5] = 0
table(classif, quality)

The following output shows that more instances are correctly classified than incorrectly classified, but that a substantial number are still misclassified:

       quality
classif   0   1
      0 614 507
      1 386 493
The output of the following line of code shows that the kappa value is only 0.11, which is pretty bad:

library(psych) # cohen.kappa() is in the psych package
cohen.kappa(table(classif, quality))

Now that the basics of logistic regression are understood, let's get down to business. Can we obtain a reliable classification using logistic regression by including the linguistic content of the reviews in the analysis (the 100 terms)? We will attempt this here:

model2 = glm(quality ~ ., family = binomial, data = TrainDF)

Note

In the chapter on linear regression, we warned you about multicollinearity: when predictors are strongly correlated, the reliability of the estimates (the slope coefficients) is threatened because the assumption of independence of the predictors is violated. But rest assured, this is not a problem for the predictive power of a model; the adjusted coefficient of determination (the adjusted R-squared value we discussed in that chapter) is unaffected by multicollinearity. In other words, in the case of multicollinearity, we cannot disentangle the contributions of the separate predictors, but we can accurately know their contribution together. Including so many predictors is therefore not much of a problem, because we are interested not in the slope coefficients but only in the predictions.
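To make this concrete, here is a toy simulation (hypothetical data, unrelated to our reviews) showing that near-collinearity destabilizes the individual coefficients while leaving the predictions essentially untouched:

set.seed(42)
x1 = rnorm(200)
x2 = x1 + rnorm(200, sd = 0.01)             # nearly a copy of x1
y  = rbinom(200, 1, plogis(x1))
mBoth = glm(y ~ x1 + x2, family = binomial) # coefficients split unstably...
mOne  = glm(y ~ x1, family = binomial)
max(abs(fitted(mBoth) - fitted(mOne)))      # ...but predictions barely differ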

Let's now examine the fitted values in the training set and how well we can predict perceived movie quality:

TrainDF$classif = fitted(model2) # fitted values of a binomial glm are probabilities
TrainDF$classif[TrainDF$classif>0.5] = 1
TrainDF$classif[TrainDF$classif<=0.5] = 0

Let's now examine the performance of our classification in detail. We will omit the full output and only summarize the results:

confusionMatrix(as.factor(TrainDF$quality), as.factor(TrainDF$classif))

We can see that we have an accuracy of 0.857 (which is meaningful given that we have an almost equal number of positive and negative reviews) and a kappa value of 0.71. We can therefore be satisfied with our classification.

We can now use the model we just created to predict the values in the testing set:

TestDF$classif = predict(model2, TestDF, type = "response")
TestDF$classif[TestDF$classif>0.5] = 1
TestDF$classif[TestDF$classif<=0.5] = 0
confusionMatrix(as.factor(TestDF$quality), as.factor(TestDF$classif))

As you can see on your screen, the predictions on the testing dataset are poorer than what we observed on the training dataset. We still have an accuracy of 0.72, which is not too bad, but the kappa value has dropped to 0.45. We might obtain better results using an algorithm we have not yet discussed: support vector machines.

Document classification with support vector machines

Support vector machines (SVM) attempt to find a separation between the two classes that is as broad as possible; cases are then classified depending on their position relative to this separation. Unlike logistic regression, SVMs are not limited to linear relationships: virtually any kind of relationship can be discovered by an SVM using the kernel trick. We will not go into detailed explanations of SVM here, as they are quite complex. The interested reader can refer to the book Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond by Schölkopf and Smola (2001).

Let's directly fit the model using SVM and examine the reliability of the predictions:

library(e1071)
TrainDF$classif = NULL # remove the columns added in the logistic regression
TestDF$classif = NULL  # section, so that they don't leak into "quality ~ ."
modelSVM = svm(quality ~ ., data = TrainDF)
probSVMtrain = predict(modelSVM, TrainDF[,-1])
classifSVMtrain = probSVMtrain
classifSVMtrain[classifSVMtrain>0.5] = 1
classifSVMtrain[classifSVMtrain<=0.5] = 0
confusionMatrix(as.factor(TrainDF$quality), as.factor(classifSVMtrain))

We have an excellent classification using SVM on the training dataset. It is even better than with logistic regression: the accuracy is 0.93 and the kappa value is 0.85.
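Note that the fit above used e1071's default radial basis function kernel. Other kernels can be requested explicitly through the kernel argument of svm(); the settings below are illustrative examples rather than tuned choices:

modelLinear = svm(quality ~ ., data = TrainDF, kernel = "linear")
modelPoly = svm(quality ~ ., data = TrainDF, kernel = "polynomial", degree = 2)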

What performance will we get for the classification using the testing set? Let's find out:

probSVMtest = predict(modelSVM, TestDF[,-1])
classifSVMtest = probSVMtest
classifSVMtest[classifSVMtest>0.5] = 1
classifSVMtest[classifSVMtest<=0.5] = 0
confusionMatrix(as.factor(TestDF$quality), as.factor(classifSVMtest))

Sadly, our classification of the testing set was not better than with logistic regression (it was even a little worse). Depending on the context, the performance of the algorithm (here, 71 percent of data was correctly classified, kappa = 0.42) might be sufficient. In other contexts, much better performance might be required.
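When more performance is needed, a common first step is to tune the SVM hyperparameters by cross-validation, for which e1071 provides tune.svm(). The following is a minimal sketch (the grids are illustrative starting points, and tuning carries no guarantee of better test performance):

set.seed(123)
tuned = tune.svm(quality ~ ., data = TrainDF,
                 gamma = 10^(-3:-1), cost = 10^(0:2))
summary(tuned)             # cross-validated performance for each combination
bestSVM = tuned$best.model # the model refitted with the best parameters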

We have shown you several alternatives so that you can try them on your own data. We didn't have much luck with our classification of the reviews on the testing dataset here; results really depend on the dataset. Before you get too disappointed by document classification, let's examine a successful case.
