Building a scalable classifier with NB

In this section, we walk through a step-by-step example using the Naive Bayes (NB) algorithm. As already stated, NB is highly scalable: the number of parameters it requires is linear in the number of variables (features/predictors) in a learning problem. This scalability has enabled the Spark community to perform predictive analytics on large-scale datasets using this algorithm. The current implementation of NB in Spark MLlib supports both multinomial NB and Bernoulli NB.
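To make the linear-in-features claim concrete, the following minimal sketch counts the values an NB model has to store: one prior per class plus one conditional parameter per class and feature. The helper name nbParameterCount and the sizes used (10 classes, 16 or 32 features) are assumptions chosen for illustration, not figures reported by Spark.

```scala
// Hypothetical helper: number of stored values in an NB model with
// one prior per class and one conditional parameter per (class, feature) pair.
def nbParameterCount(numClasses: Int, numFeatures: Int): Long =
  numClasses.toLong + numClasses.toLong * numFeatures

// Assumed sizes, for illustration only
val small = nbParameterCount(numClasses = 10, numFeatures = 16) // 170
val large = nbParameterCount(numClasses = 10, numFeatures = 32) // 330

println(s"16 features -> $small parameters, 32 features -> $large parameters")
```

Doubling the number of features adds exactly one more parameter per class and feature, so the model grows linearly rather than quadratically.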

Bernoulli NB is useful if the feature vectors are binary. One application is text classification with a bag-of-words (BOW) approach, where each feature records only whether a word occurs in a document. Multinomial NB, on the other hand, is typically used for discrete counts. For example, in a text classification problem, we can take the idea of Bernoulli trials one step further and, instead of recording the presence or absence of each word in a document, use how often it occurs.
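The difference between the two feature representations can be sketched in plain Scala. The vocabulary, the sample document, and the helper names termCounts and termPresence are made up for illustration; they are not part of any Spark API.

```scala
// Hypothetical four-word vocabulary
val vocabulary = Vector("spark", "naive", "bayes", "digit")

// Multinomial-style features: how often each vocabulary term occurs
def termCounts(doc: String): Vector[Double] = {
  val tokens = doc.toLowerCase.split("\\s+").toVector
  vocabulary.map(term => tokens.count(_ == term).toDouble)
}

// Bernoulli-style features: 1.0 if the term occurs at all, else 0.0
def termPresence(doc: String): Vector[Double] =
  termCounts(doc).map(c => if (c > 0) 1.0 else 0.0)

val doc = "Spark naive Bayes on Spark"
println(termCounts(doc))   // Vector(2.0, 1.0, 1.0, 0.0)
println(termPresence(doc)) // Vector(1.0, 1.0, 1.0, 0.0)
```

In Spark ML, the variant is chosen on the estimator with setModelType("multinomial") or setModelType("bernoulli").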

In this section, we will see how to predict the digits from the Pen-Based Recognition of Handwritten Digits dataset by incorporating Spark machine learning APIs including Spark MLlib, Spark ML, and Spark SQL:

Step 1. Data collection, preprocessing, and exploration - The Pen-Based Recognition of Handwritten Digits dataset originates from the UCI Machine Learning Repository; the LIBSVM-format copy used here was downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/pendigits. The dataset was generated by collecting around 250 digit samples from each of 44 writers, recording the location of the pen at fixed time intervals of 100 milliseconds. Each digit was written inside a 500 x 500 pixel box. Finally, the recorded coordinates were scaled to integer values between 0 and 100 to create consistent scaling between observations. A well-known spatial resampling technique was used to obtain 3 and 8 regularly spaced points on an arc trajectory. A sample image, along with the lines from point to point, can be visualized by plotting the 3 or 8 sampled points based on their (x, y) coordinates. The number of samples per digit in the training and test sets is shown in the following table:

Set        '0'   '1'   '2'   '3'   '4'   '5'   '6'   '7'   '8'   '9'   Total
Training   780   779   780   719   780   720   720   778   718   719   7493
Test       363   364   364   336   364   335   336   364   335   336   3497
Table 2: Number of digits used for the training and the test set

The training set in the preceding table consists of samples written by 30 writers, and the test set consists of samples written by the remaining 14 writers.
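A quick sanity check of Table 2 in plain Scala confirms the totals and shows that the ten digit classes are roughly balanced. The counts are copied from the table; the variable names are illustrative.

```scala
// Per-digit sample counts for digits '0' to '9', copied from Table 2
val trainingCounts = Vector(780, 779, 780, 719, 780, 720, 720, 778, 718, 719)
val testCounts     = Vector(363, 364, 364, 336, 364, 335, 336, 364, 335, 336)

val numClasses    = trainingCounts.length // 10 digit classes
val trainingTotal = trainingCounts.sum    // 7493
val testTotal     = testCounts.sum        // 3497

// Share of each digit in the training set; every value is close to 1/10
val trainingShares = trainingCounts.map(_.toDouble / trainingTotal)

println(s"classes = $numClasses, training = $trainingTotal, test = $testTotal")
println(trainingShares.map(s => f"$s%.3f").mkString(" "))
```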

Figure 4: Example of digit 3 and 8 respectively
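The spatial resampling mentioned above can be sketched as linear interpolation at equal arc-length intervals along the pen trajectory. This is a minimal illustration under that assumption, not the dataset authors' exact procedure; the function name resample and the sample path are hypothetical.

```scala
// Resample a polyline to n points that are equally spaced along its arc length.
def resample(path: Vector[(Double, Double)], n: Int): Vector[(Double, Double)] = {
  require(path.length >= 2 && n >= 2, "need at least two input and two output points")
  // Length of each segment and cumulative arc length at each original point
  val segLens = path.zip(path.tail).map { case ((x1, y1), (x2, y2)) =>
    math.hypot(x2 - x1, y2 - y1)
  }
  val cum = segLens.scanLeft(0.0)(_ + _)
  val total = cum.last
  (0 until n).map { k =>
    val target = total * k / (n - 1)
    // Segment that contains the target arc length (clamped to the last segment)
    val i = math.min(cum.lastIndexWhere(_ <= target), segLens.length - 1)
    val t = if (segLens(i) == 0) 0.0 else (target - cum(i)) / segLens(i)
    val (x1, y1) = path(i)
    val (x2, y2) = path(i + 1)
    (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
  }.toVector
}

// Hypothetical L-shaped stroke of total length 20
val stroke = Vector((0.0, 0.0), (10.0, 0.0), (10.0, 10.0))
println(resample(stroke, 5))
// Vector((0.0,0.0), (5.0,0.0), (10.0,0.0), (10.0,5.0), (10.0,10.0))
```

Resampling by arc length rather than by time makes the representation independent of how fast the writer moved the pen.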

More on this dataset can be found at http://archive.ics.uci.edu/ml/machine-learning-databases/pendigits/pendigits-orig.names. A digital representation of a sample snapshot of the dataset is shown in the following figure:

Figure 5: A snapshot of 20 rows of the handwritten digit dataset

Now, to predict the dependent variable (that is, the label) using the independent variables (that is, the features), we need to train a multiclass classifier since, as shown previously, the dataset has ten classes, that is, the ten handwritten digits 0 to 9. For the prediction, we will use the Naive Bayes classifier and evaluate the model's performance.

Step 2. Load the required library and packages:

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

Step 3. Create an active Spark session:

val spark = SparkSession
  .builder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/home/exp/")
  .appName("NaiveBayes")
  .getOrCreate()

Note that the master URL is set to local[*], which means that all the cores of your machine will be used to process the Spark job. Set the SQL warehouse directory and the other configuration parameters according to your requirements.
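As a rough guide to what local[*] resolves to on a given machine, the JVM can report the number of logical cores it sees. The following sketch needs no Spark dependency; the alternative master URLs are examples of the local[N] form and the variable names are illustrative.

```scala
// Number of logical cores visible to the JVM; local[*] requests one worker thread per core
val cores: Int = Runtime.getRuntime.availableProcessors()

// Example master URLs for local runs
val allCores   = "local[*]"
val explicit   = s"local[$cores]" // same thread count, spelled out
val twoThreads = "local[2]"       // cap the job at two threads

println(s"cores = $cores, equivalent master URL = $explicit")
```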

Step 4. Create the DataFrame - Load the data stored in LIBSVM format as a DataFrame:

val data = spark.read.format("libsvm")
  .load("data/pendigits.data")

For digit classification, the input feature vectors are usually sparse, and sparse vectors should be supplied as input to take advantage of sparsity. In this example the training data is used only once, so caching is not required. However, because the dataset is relatively small (a few MB), you can cache the DataFrame if you use it more than once.
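To illustrate what the LIBSVM format looks like and why it maps naturally onto sparse vectors, here is a minimal plain-Scala parser. Each line has the form label index:value index:value ..., with 1-based indices and zero-valued features omitted. The names SparseRow, parseLibSvmLine, and toDense, as well as the sample line, are invented for this sketch; Spark's own reader is the one invoked by format("libsvm").

```scala
// Sparse representation: only non-zero features are stored
final case class SparseRow(label: Double, indices: Vector[Int], values: Vector[Double])

def parseLibSvmLine(line: String): SparseRow = {
  val tokens = line.trim.split("\\s+")
  val (indices, values) = tokens.tail.map { pair =>
    val Array(i, v) = pair.split(":")
    (i.toInt - 1, v.toDouble) // LIBSVM indices are 1-based; convert to 0-based
  }.toVector.unzip
  SparseRow(tokens.head.toDouble, indices, values)
}

// Expand to a dense vector of the given size, filling the gaps with 0.0
def toDense(row: SparseRow, size: Int): Vector[Double] = {
  val lookup = row.indices.zip(row.values).toMap
  Vector.tabulate(size)(i => lookup.getOrElse(i, 0.0))
}

// Made-up line: label 8, non-zero values at features 1, 2 and 5
val row = parseLibSvmLine("8 1:47 2:100 5:57")
println(row)             // SparseRow(8.0,Vector(0, 1, 4),Vector(47.0, 100.0, 57.0))
println(toDense(row, 6)) // Vector(47.0, 100.0, 0.0, 0.0, 57.0, 0.0)
```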

Step 5. Prepare the training and test set - Split the data into training and test sets (25% held out for testing):

val Array(trainingData, testData) = data
  .randomSplit(Array(0.75, 0.25), seed = 12345L)
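Because the split is random, the two parts contain approximately, not exactly, 75% and 25% of the rows, and fixing the seed makes the result reproducible. The following plain-Scala sketch mimics that behaviour with a per-row random draw; it is an illustration of the idea, not Spark's implementation, and the name bernoulliSplit is hypothetical.

```scala
import scala.util.Random

// Send each row to the training side with probability trainFraction,
// using a seeded generator so the split is reproducible.
def bernoulliSplit[A](rows: Vector[A], trainFraction: Double, seed: Long): (Vector[A], Vector[A]) = {
  val rng = new Random(seed)
  rows.partition(_ => rng.nextDouble() < trainFraction)
}

val sampleRows = (1 to 1000).toVector
val (trainRows, testRows) = bernoulliSplit(sampleRows, 0.75, seed = 12345L)
val (trainRowsAgain, _)   = bernoulliSplit(sampleRows, 0.75, seed = 12345L)

println(s"train = ${trainRows.size}, test = ${testRows.size}") // close to 750 / 250, not exact
println(s"reproducible = ${trainRows == trainRowsAgain}")
```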

Step 6. Train the Naive Bayes model - Train a Naive Bayes model using the training set as follows:

val nb = new NaiveBayes()
val model = nb.fit(trainingData)
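To give an idea of what fitting a multinomial NB model involves, the following from-scratch sketch estimates smoothed log priors and log feature probabilities per class and predicts the class with the highest log score. It is a simplified illustration, not Spark's code; the names TinyModel, fitTiny, and predictTiny and the toy data are hypothetical.

```scala
final case class TinyModel(
  labels: Vector[Double],
  logPrior: Vector[Double],
  logTheta: Vector[Vector[Double]])

// Estimate log priors and log feature probabilities with additive (Laplace) smoothing
def fitTiny(rows: Seq[(Double, Vector[Double])], smoothing: Double = 1.0): TinyModel = {
  val numFeatures = rows.head._2.length
  val byLabel = rows.groupBy(_._1).toVector.sortBy(_._1)
  val labels = byLabel.map(_._1)
  val logPrior = byLabel.map { case (_, group) =>
    math.log((group.size + smoothing) / (rows.size + smoothing * labels.size))
  }
  val logTheta = byLabel.map { case (_, group) =>
    // Sum each feature over all rows of this class
    val sums = group.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    val total = sums.sum + smoothing * numFeatures
    sums.map(s => math.log((s + smoothing) / total))
  }
  TinyModel(labels, logPrior, logTheta)
}

// Score each class as log prior + sum of feature value * log probability; pick the best
def predictTiny(m: TinyModel, features: Vector[Double]): Double = {
  val scores = m.logPrior.zip(m.logTheta).map { case (prior, theta) =>
    prior + theta.zip(features).map { case (t, x) => t * x }.sum
  }
  m.labels(scores.indexOf(scores.max))
}

// Toy data: class 0.0 is heavy on feature 0, class 1.0 on feature 1
val toy = Seq(
  (0.0, Vector(5.0, 0.0)), (0.0, Vector(4.0, 1.0)),
  (1.0, Vector(0.0, 5.0)), (1.0, Vector(1.0, 4.0)))
val tiny = fitTiny(toy)
println(predictTiny(tiny, Vector(3.0, 0.0))) // 0.0
println(predictTiny(tiny, Vector(0.0, 3.0))) // 1.0
```

Training is a single pass that accumulates counts per class, which is why the algorithm distributes well. In Spark ML, the smoothing constant corresponds to the estimator's setSmoothing parameter.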

Step 7. Calculate the prediction on the test set - Calculate the prediction using the model transformer and finally show the prediction against each label as follows:

val predictions = model.transform(testData)
predictions.show()

Figure 6: Prediction against each label (that is, each digit)

As you can see in the preceding figure, some labels were predicted correctly and others were not. Rather than judging the model by inspecting individual rows, we need to compute the accuracy and the weighted precision, recall, and f1 measures.

Step 8. Evaluate the model - Select the prediction and the true label to compute test error and classification performance metrics such as accuracy, precision, recall, and f1 measure as follows:

// setMetricName mutates the evaluator and returns the same instance,
// so build a separate evaluator for each metric
def evaluatorFor(metric: String) = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName(metric)
val evaluator1 = evaluatorFor("accuracy")
val evaluator2 = evaluatorFor("weightedPrecision")
val evaluator3 = evaluatorFor("weightedRecall")
val evaluator4 = evaluatorFor("f1")

Step 9. Compute the performance metrics - Compute the classification accuracy, precision, recall, f1 measure, and error on test data as follows:

val accuracy = evaluator1.evaluate(predictions)
val precision = evaluator2.evaluate(predictions)
val recall = evaluator3.evaluate(predictions)
val f1 = evaluator4.evaluate(predictions)

Step 10. Print the performance metrics:

println("Accuracy = " + accuracy)
println("Precision = " + precision)
println("Recall = " + recall)
println("F1 = " + f1)
println(s"Test Error = ${1 - accuracy}")

You should observe values similar to the following:

Accuracy = 0.8284365162644282
Precision = 0.8361211320692463
Recall = 0.828436516264428
F1 = 0.8271828540349192
Test Error = 0.17156348373557184
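Accuracy and recall in this output agree to about fifteen digits. That is expected: when each class's recall is weighted by the class's share of the true labels, the weighted recall reduces to the overall fraction of correct predictions. The following plain-Scala sketch computes the four metrics from (label, prediction) pairs to show this; the helper name weightedMetrics and the toy pairs are hypothetical, and the sketch is not Spark's evaluator.

```scala
// Compute accuracy and label-frequency-weighted precision, recall and F1
def weightedMetrics(pairs: Seq[(Double, Double)]): Map[String, Double] = {
  val n = pairs.size.toDouble
  val labels = pairs.map(_._1).distinct
  val perClass = labels.map { c =>
    val tp        = pairs.count { case (l, p) => l == c && p == c }.toDouble
    val predicted = pairs.count(_._2 == c).toDouble
    val actual    = pairs.count(_._1 == c).toDouble
    val precision = if (predicted == 0) 0.0 else tp / predicted
    val recall    = tp / actual
    val f1 =
      if (precision + recall == 0) 0.0
      else 2 * precision * recall / (precision + recall)
    val weight = actual / n // share of this class among the true labels
    (weight * precision, weight * recall, weight * f1)
  }
  Map(
    "accuracy"          -> pairs.count { case (l, p) => l == p } / n,
    "weightedPrecision" -> perClass.map(_._1).sum,
    "weightedRecall"    -> perClass.map(_._2).sum,
    "f1"                -> perClass.map(_._3).sum)
}

// Toy (label, prediction) pairs: 7 of 10 are correct
val toyPairs = Seq(
  (0.0, 0.0), (0.0, 0.0), (0.0, 1.0),
  (1.0, 1.0), (1.0, 1.0), (1.0, 0.0),
  (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (2.0, 1.0))
val toyMetrics = weightedMetrics(toyPairs)
toyMetrics.foreach { case (name, value) => println(s"$name = $value") }
```

Weighted precision and f1 generally differ from accuracy, which matches the pattern in the output above.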

The performance is reasonable for such a simple model. However, you can still increase the classification accuracy by performing hyperparameter tuning. There are further opportunities to improve the prediction accuracy by selecting appropriate algorithms (that is, classifier or regressor) through cross-validation and train-validation split, which will be discussed in the following section.
