Classification/regression

We very briefly discussed supervised learning in Chapter 1, Introduction to Scala and Machine Learning. Supervised learning is an ML technique used when we have historical data whose labels are already known. We will discuss two specific approaches to supervised learning: classification and regression. In both approaches, we ultimately create a learning model that allows us to assign a label to previously unseen data. We call the label assignment classification when the label is categorical and regression when the label is continuous. For example, predicting the next day's stock price is a regression problem, while determining whether or not it will rain tomorrow is a classification problem (specifically, a binary classification problem). Let's now look at the different algorithms provided by Spark.

Linear methods

In Spark, the linear methods use convex optimization to minimize an objective function. Two optimization methods are available in Spark: Stochastic Gradient Descent (SGD) and Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS), of which SGD is the more widely used across the different algorithms. These optimizers power the different linear algorithms: linear SVM, logistic regression, and linear regression (using ordinary least squares, ridge regression, or lasso). Note that there are variants of these algorithms with either SGD or L-BFGS. Spark also provides a streaming variant of regression through the StreamingLinearRegressionWithSGD class. The following illustration shows how the linear regression algorithm fits a straight line to the data:

[Figure: a linear regression line fitted to a set of data points]

The following illustration shows how the SVM algorithm finds the hyperplane that best separates points in a plane:

[Figure: an SVM hyperplane separating two classes of points]
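To give a feel for how these linear methods are invoked, here is a minimal sketch of training a linear SVM with SGD; the input path, the train/test split, and the iteration count are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext(new SparkConf().setAppName("LinearSVMSketch"))

// Load LibSVM-formatted data (path is an assumption) and hold out 20% for testing
val data = MLUtils.loadLibSVMFile(sc, "datasets/sample.libsvm")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

// Train a linear SVM for 100 SGD iterations
val model = SVMWithSGD.train(train, numIterations = 100)

// Fraction of test points that fall on the correct side of the hyperplane
val accuracy = test
  .map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0)
  .mean()
println(s"Test accuracy: $accuracy")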

Naive Bayes

The Naive Bayes model is based on the Bayes rule, which gives the probability of a class label given the feature values of a data instance. When we expand this calculation using the Bayes rule, conditional dependencies among the features appear; these are eliminated by the naive assumption that the probability of a feature value given the class does not depend on any other feature value. This naive assumption defines the Naive Bayes model. It also makes Naive Bayes fast to train, and in practice the model performs multiclass classification with high accuracy.

Bayes rule:

$$P(C \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid C)\,P(C)}{P(x_1, \ldots, x_n)}$$

The Bayes rule under the naive assumption:

$$P(C \mid x_1, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$$

Decision trees

Decision trees differ from the other models in that they do not fit a mathematical function for regression or classification. Instead, a decision tree is an explicit representation of yes/no decisions arranged in a tree structure. We can perform both classification and regression using decision trees.

Ensembles

Spark also provides ensembles of decision trees: random forests and gradient-boosted trees. An ensemble method is an approach to building learning models in which the final model is composed of several simpler models. In the case of a decision tree ensemble used for classification, the final label for an instance is generated by majority voting among the trees.
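As an example of how such an ensemble is trained in Spark MLlib, here is a minimal sketch of building a random forest for a binary problem; train and test are assumed to be RDD[LabeledPoint]s already loaded, and all hyperparameter values are illustrative:

import org.apache.spark.mllib.tree.RandomForest

// Train a forest of 20 trees; "auto" lets Spark choose how many features
// each tree considers at every split
val forest = RandomForest.trainClassifier(
  train,                                      // RDD[LabeledPoint], assumed loaded
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),  // treat all features as continuous
  numTrees = 20,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32,
  seed = 42)

// predict() internally takes the majority vote across the 20 trees
val label = forest.predict(test.first().features)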

Let's now see some examples of classification/regression algorithms in action. The first dataset we will consider is the student performance dataset: https://archive.ics.uci.edu/ml/datasets/Student+Performance.

The student performance dataset can be treated as either a classification or a regression problem. Each student is assigned exactly one of 21 grades (0 to 20). Since the class label, the grade, is discrete, countable, and finite, we can use a Naive Bayes multiclass classifier for this problem. We can also model it as a regression problem by treating the target label as a real number.

First, let's take a look at the Naive Bayes implementation. Naive Bayes takes a lambda (smoothing) parameter with a value from 0 to 1. We will vary this parameter in increasing steps and observe its effect on the predictions.

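The following is a minimal sketch of this experiment, assuming the student data has already been converted to LibSVM/LabeledPoint form; the object name, the input-path argument, and the 80/20 split are illustrative assumptions rather than the book's exact listing:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

object StudentGradesNaiveBayes {
  // Root mean square error of a prediction function over a labeled dataset
  def rmse(predict: Vector => Double, data: RDD[LabeledPoint]): Double =
    math.sqrt(data.map { p =>
      val diff = predict(p.features) - p.label
      diff * diff
    }.mean())

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StudentGradesNaiveBayes"))
    val data = MLUtils.loadLibSVMFile(sc, args(0))  // assumed LibSVM input path
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
    train.cache()

    // Sweep the smoothing parameter lambda from 0.1 to 1.0 in steps of 0.1
    for (lambda <- (1 to 10).map(_ / 10.0)) {
      val model = NaiveBayes.train(train, lambda)
      println(f"lambda=$lambda%.1f " +
        f"trainRMSE=${rmse(v => model.predict(v), train)}%.4f " +
        f"testRMSE=${rmse(v => model.predict(v), test)}%.4f")
    }
    sc.stop()
  }
}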

Here are the results, which show the effect of the lambda parameter on Root Mean Square Error (RMSE):

[Figure: train and test RMSE plotted against the lambda parameter]

Next, we will implement both a DecisionTree classifier and a DecisionTree regressor and see how they perform on the same problem.

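Again as a sketch: the hyperparameter values below are illustrative, and train is the same RDD[LabeledPoint] as in the previous listing:

import org.apache.spark.mllib.tree.DecisionTree

// Classifier: the 21 possible grades (0 to 20) are treated as class labels
val dtClassifier = DecisionTree.trainClassifier(
  train,
  numClasses = 21,
  categoricalFeaturesInfo = Map[Int, Int](),  // all features continuous here
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)

// Regressor: the grade is predicted as a real number ("variance" impurity)
val dtRegressor = DecisionTree.trainRegressor(
  train,
  categoricalFeaturesInfo = Map[Int, Int](),
  impurity = "variance",
  maxDepth = 5,
  maxBins = 32)

// Both models can then be scored with the same rmse helper as before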

Here are the results of the three different algorithms:

Algorithm                       Train accuracy       Train RMSE          Test accuracy        Test RMSE
Naive Bayes                     0.361963190184049    5.58847777526052    0.249169435215947    5.70364457709227
Decision tree regression        0.049079754601227    2.05902978183439    0.026578073089701    1.9848078258
Decision tree classification    0.3865030675         3.9249086582        0.3621262458         4.5724147607

As you can observe, the train accuracy for the regression tree is very low, but that is because the accuracy metric does an exact one-to-one comparison. Since the regression tree generates a real-valued output, its prediction will typically differ from the true label by a small amount and therefore fail the equality comparison. However, when we look at the RMSE for both train and test data, we observe a significant decrease in the overall error. So for this dataset, regression seems to be the right choice. Note also that the grades have a natural ordering: a higher value means a better grade. For problems where there is no implicit ordering among the class labels, we should choose a classification algorithm instead.

The next dataset that we will explore is the TV News Channel Commercial Detection dataset: https://archive.ics.uci.edu/ml/datasets/TV+News+Channel+Commercial+Detection+Dataset.

Let's download the dataset into the datasets folder inside the codebase using the following commands:

$ cd code/datasets/
$ wget -c https://archive.ics.uci.edu/ml/machine-learning-databases/00326/TV_News_Channel_Commercial_Detection_Dataset.zip
$ mkdir TVData/
$ cd TVData/
$ unzip ../TV_News_Channel_Commercial_Detection_Dataset.zip

This dataset is available to us in the LibSVM format so it is fairly easy to load into Spark. We have data for five news channels—CNN IBN, NDTV 24X7, Times Now, BBC, and CNN. The data for each of the channels is present in a separate text file, so in total we have five text files (LibSVM format). We don't need to do any processing except to concatenate them. The class labels are binary so we can use any binary classification algorithm: SVM, Naive Bayes, logistic regression, and so on. The total dataset size is 191 MB uncompressed.
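As a minimal sketch of the loading step (assuming the unzipped directory contains only the five LibSVM files):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext(new SparkConf().setAppName("TVNewsLoad"))
// loadLibSVMFile reads via textFile, so pointing it at the directory pulls in
// all five per-channel files together, with no manual concatenation needed
val data = MLUtils.loadLibSVMFile(sc, "datasets/TVData/")
println(s"Instances: ${data.count()}, features: ${data.first().features.size}")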

For this problem, we will first use the Naive Bayes binary classifier. The code is exactly the same as the one we used for the student performance dataset.

Run it as follows:

$ cd code/
$ sbt "run-main chapter04.TVNewsChannelsNaiveBayes datasets/TVData/"
... OUTPUT SKIPPED ...
$ python scripts/ch04/student-grade-plots.py

Here are the results with varying lambda values:

[Figure: train and test error plotted against the lambda parameter]

The change in error is not dramatic. The train error drops for lambda values in the range 0.4 to 0.9 and then rises again from 0.9 to 1.0. The test error drops consistently for lambda values from 0.1 to 0.4 and then increases. So, if we were to pick a lambda value, somewhere near 0.4 would be a reasonable choice.

Now let's look at how this problem behaves when using a decision tree regression algorithm.

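A minimal sketch of this variant, assuming train and test are RDD[LabeledPoint]s loaded as above with 0/1 labels; thresholding the real-valued prediction at 0.5 is our illustrative choice for recovering a hard label:

import org.apache.spark.mllib.tree.DecisionTree

// Regress the binary commercial/non-commercial label as a real value
val model = DecisionTree.trainRegressor(
  train,
  categoricalFeaturesInfo = Map[Int, Int](),
  impurity = "variance",
  maxDepth = 5,
  maxBins = 32)

// RMSE on the raw real-valued predictions
val rmse = math.sqrt(test.map { p =>
  val d = model.predict(p.features) - p.label
  d * d
}.mean())

// An exact-equality "accuracy" will be near zero, as noted earlier;
// thresholding the prediction gives a meaningful accuracy instead
val accuracy = test.map { p =>
  val hard = if (model.predict(p.features) >= 0.5) 1.0 else 0.0
  if (hard == p.label) 1.0 else 0.0
}.mean()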

Run it as follows:

$ cd code/
$ sbt "run-main chapter04.TVNewsChannelComparison datasets/TVData/"
... OUTPUT SKIPPED …
$ python scripts/ch04/tv-news-plots.py

This code was run on a single machine, so it took some time. We can monitor its progress using the Spark UI:

[Figure: job progress shown in the Spark UI]

Here are the final results:

Algorithm                   Train accuracy   Train RMSE     Test accuracy          Test RMSE
Naive Bayes classifier      0.7046054071     1.0870043108   0.7021088278           1.0915881498
Decision tree regression    0.00018015       0.6698337317   9.62056491957208E-5    0.6739810325

As we observed with the student dataset, the train/test accuracy of the regression tree is very low, for the reason mentioned earlier: its real-valued predictions rarely match the labels exactly. Looking at the RMSE values, however, we again see a significant decrease.
