Exploration and preparation of the OCR dataset

According to the dataset description, glyphs are scanned into the computer using an OCR reader and are then automatically converted into pixels. Consequently, all 16 statistical attributes (in figure 2) are recorded as well. The concentration of black pixels across various areas of the box provides a way to differentiate the 26 letters, either using OCR or by training a machine learning algorithm.

Recall that support vector machines (SVM), Logistic Regression, Naive Bayes classifiers, and most other classification algorithms (along with their associated learners) require all the features to be numeric. LIBSVM allows you to use a sparse training dataset in its own format: when the normal training dataset is transformed into LIBSVM format, only the nonzero values are stored, in a sparse array/matrix form. Each stored value is paired with an index that specifies the column of the instance data (the feature index), and any missing data is taken as holding a zero value. The index serves as a way to distinguish between the features/parameters. For example, for three features, indices 1, 2, and 3 would correspond to the x, y, and z coordinates, respectively. The correspondence between the same index values of different data instances is merely mathematical when constructing the hyperplane; the values at these indices serve as coordinates. If you skip any index in between, that feature is assigned the default value of zero.
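The rule that skipped indices default to zero can be illustrated with a minimal sketch in plain Scala. It is not part of LIBSVM or Spark; the object name LibsvmParseSketch, the function parseLine, and the assumption that the number of features is known in advance are all hypothetical choices made for illustration.

```scala
object LibsvmParseSketch {
  // Parse one LIBSVM line "label idx:val idx:val ..." into (label, dense features).
  // Assumption: numFeatures is known up front; indices in the line are 1-based.
  def parseLine(line: String, numFeatures: Int): (Double, Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = Array.fill(numFeatures)(0.0) // skipped indices stay at zero
    tokens.tail.foreach { tok =>
      val Array(idx, value) = tok.split(":")
      features(idx.toInt - 1) = value.toDouble  // shift to a 0-based array position
    }
    (label, features)
  }

  def main(args: Array[String]): Unit = {
    // Index 2 is absent from the line, so the second feature is read as zero.
    val (label, features) = parseLine("1 1:0.5 3:2.0", numFeatures = 3)
    println(s"label=$label features=${features.mkString(",")}")
  }
}
```

Here the line stores only two of the three features, yet the parsed vector still has three coordinates, which is what lets sparse and dense rows share the same feature space.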

In most practical cases, we might need to normalize the data across all the features. In short, we need to convert the current tab-separated OCR data into LIBSVM format to make the training step easier. Thus, I'm assuming you have downloaded the data and converted it into LIBSVM format using their own script. The resulting dataset, transformed into LIBSVM format and consisting of labels and features, is shown in the following figure:

Figure 3: A snapshot of 20 rows of the OCR dataset in LIBSVM format

Interested readers can refer to the following research article for in-depth knowledge: Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology, 2:27:1--27:27, 2011. You can also refer to a public script provided on my GitHub repository at https://github.com/rezacsedu/RandomForestSpark/ that directly converts the OCR data in CSV into LIBSVM format. The script reads the data for all the letters and assigns a unique numeric value to each. All you need to do is specify the input and output file paths and run the script.
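To make the conversion concrete, here is a minimal sketch of what such a step could look like in plain Scala. It is not the script from the repository: the object name LetterToLibsvmSketch, the label mapping (A to 0 through Z to 25), and the assumption that each record is comma-separated with the letter first are hypothetical and chosen only for illustration.

```scala
object LetterToLibsvmSketch {
  // Hypothetical label mapping: 'A' -> 0, 'B' -> 1, ..., 'Z' -> 25.
  def letterToLabel(letter: Char): Int = letter.toUpper - 'A'

  // Convert one record such as "T,2,0,3" into a LIBSVM line.
  // Zero-valued features are omitted, as the sparse format allows.
  def toLibsvm(csvLine: String): String = {
    val fields = csvLine.split(",").map(_.trim)
    val label = letterToLabel(fields.head.head)
    val features = fields.tail.zipWithIndex.collect {
      case (value, i) if value.toDouble != 0.0 => s"${i + 1}:$value" // 1-based index
    }
    (label.toString +: features).mkString(" ")
  }

  def main(args: Array[String]): Unit = {
    println(toLibsvm("T,2,0,3"))
  }
}
```

Any one-to-one mapping from letters to numbers would serve equally well; what matters is that the same mapping is used when the predictions are translated back into letters.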

Now let's dive into the example. The example that I will be demonstrating has 11 steps, including data parsing, Spark session creation, model building, and model evaluation.

Step 1. Creating Spark session - Create a Spark session by specifying master URL, Spark SQL warehouse, and application name as follows:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
                     .master("local[*]") // change accordingly
                     .config("spark.sql.warehouse.dir", "/home/exp/")
                     .appName("OneVsRestExample")
                     .getOrCreate()

Step 2. Loading, parsing, and creating the data frame - Load the data file from the HDFS or local disk and create a data frame, and finally show the data frame structure as follows:

val inputData = spark.read.format("libsvm")
                     .load("data/Letterdata_libsvm.data")
inputData.show()

Step 3. Generating training and test sets to train the model - Let's generate the training and test sets by using 70% of the data for training and 30% for testing:

val Array(train, test) = inputData.randomSplit(Array(0.7, 0.3))

Step 4. Instantiate the base classifier - Here the base classifier is the binary learner from which the multiclass classifier is built. For this case, it is the Logistic Regression algorithm, which can be instantiated by specifying parameters such as the maximum number of iterations, the tolerance, the regularization parameter, and the Elastic Net parameter.

Note that Logistic Regression is an appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, Logistic Regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio level independent variables.

For a Spark-based implementation of the Logistic Regression algorithm, interested readers can refer to https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression.

In brief, the following parameters are used to train a Logistic Regression classifier:

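MaxIter: This specifies the maximum number of iterations. A larger value gives the optimizer more room to converge, at the cost of longer training time.
Tol: This is the convergence tolerance used as the stopping criterion. A smaller value makes the optimizer run longer before stopping, which can improve the fit. In Spark's LogisticRegression, the default value is 1E-6.
FitIntercept: This is a Boolean value that signifies whether an intercept (bias) term should be fitted in the decision function.
Standardization: This is a Boolean value that signifies whether the training features should be standardized before the model is fitted.
AggregationDepth: This is the suggested depth of the tree aggregation used while training (at least 2). A larger value can help when the number of features or the number of partitions is large.
RegParam: This is the regularization parameter (zero or greater). Smaller values apply a weaker penalty to the coefficients; the best value depends on the dataset.
ElasticNetParam: This is the Elastic Net mixing parameter, in the range [0, 1]. A value of 0 gives a pure L2 penalty and a value of 1 gives a pure L1 penalty.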
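How RegParam and ElasticNetParam interact is easier to see in the objective that is minimized. The following is a sketch of that objective for one binary problem, as described in the Spark linear methods documentation; the symbols are introduced here for illustration, with \lambda standing for RegParam, \alpha for ElasticNetParam, w for the coefficient vector, and labels y_i assumed to be encoded as -1 or +1.

```latex
\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \, w^{\top} x_i}\right)
\; + \; \lambda \left( \alpha \, \lVert w \rVert_1 + \frac{1 - \alpha}{2} \, \lVert w \rVert_2^2 \right),
\qquad \lambda \ge 0, \quad \alpha \in [0, 1]
```

With the values used in this example (\lambda = 0.0001 and \alpha = 0.01), the penalty is weak and almost entirely of the L2 kind.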

You can set the fit-intercept flag to true or false depending upon your problem type and dataset properties:

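import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

val classifier = new LogisticRegression()
                        .setMaxIter(500)            // upper bound on optimizer iterations
                        .setTol(1E-4)               // convergence tolerance
                        .setFitIntercept(true)      // fit an intercept (bias) term
                        .setStandardization(true)   // standardize features before fitting
                        .setAggregationDepth(50)    // suggested depth for tree aggregation
                        .setRegParam(0.0001)        // regularization strength
                        .setElasticNetParam(0.01)   // L1/L2 mix: 0 = pure L2, 1 = pure L1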

Step 5. Instantiate the OVTR classifier - Now instantiate an OVTR classifier to convert the multiclass classification problem into multiple binary classifications as follows:

val ovr = new OneVsRest().setClassifier(classifier)

Here classifier is the Logistic Regression estimator. Now it's time to train the model.
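The idea behind the one-versus-the-rest reduction can be sketched in plain Scala without Spark. This is an illustration of the general technique, not Spark's implementation; the names OneVsRestSketch, binarize, and predict are hypothetical, and the per-class scores are assumed to come from already trained binary models.

```scala
object OneVsRestSketch {
  // Relabel a multiclass dataset for the binary problem "target versus the rest".
  def binarize(labels: Seq[Int], target: Int): Seq[Int] =
    labels.map(l => if (l == target) 1 else 0)

  // Given one confidence score per class (position = class label),
  // predict the class whose binary model is the most confident.
  def predict(scoresPerClass: Seq[Double]): Int =
    scoresPerClass.zipWithIndex.maxBy(_._1)._2

  def main(args: Array[String]): Unit = {
    val labels = Seq(0, 2, 1, 2)
    println(binarize(labels, target = 2)) // labels seen by the binary model for class 2
    println(predict(Seq(0.1, 0.7, 0.2)))  // index of the highest score
  }
}
```

For the 26 letters, this means 26 binary Logistic Regression models are trained, one per letter, and each prediction picks the letter whose model reports the highest confidence.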

Step 6. Train the multiclass model - Let's train the model using the training set as follows:

val ovrModel = ovr.fit(train)

Step 7. Score the model on the test set - We can score the model on test data using the transformer (that is, ovrModel) as follows:

val predictions = ovrModel.transform(test)

Step 8. Evaluate the model - In this step, we will compare the predicted labels with the true labels of the characters (the first column). Before that, we need to instantiate an evaluator to compute classification performance metrics such as accuracy, precision, recall, and f1 measure as follows:

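import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// setMetricName modifies the evaluator it is called on and returns that same
// instance, so create a separate evaluator for each metric.
def evaluatorFor(metric: String) = new MulticlassClassificationEvaluator()
                           .setLabelCol("label")
                           .setPredictionCol("prediction")
                           .setMetricName(metric)
val evaluator1 = evaluatorFor("accuracy")
val evaluator2 = evaluatorFor("weightedPrecision")
val evaluator3 = evaluatorFor("weightedRecall")
val evaluator4 = evaluatorFor("f1")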

Step 9. Compute performance metrics - Compute the classification accuracy, precision, recall, f1 measure, and error on test data as follows:

val accuracy = evaluator1.evaluate(predictions)
val precision = evaluator2.evaluate(predictions)
val recall = evaluator3.evaluate(predictions)
val f1 = evaluator4.evaluate(predictions)

Step 10. Print the performance metrics:

println("Accuracy = " + accuracy)
println("Precision = " + precision)
println("Recall = " + recall)
println("F1 = " + f1)
println(s"Test Error = ${1 - accuracy}")

You should observe values similar to the following (the exact numbers vary from run to run because the split is random):

Accuracy = 0.5217246545696688
Precision = 0.488360500637862
Recall = 0.5217246545696688
F1 = 0.4695649096879411
Test Error = 0.47827534543033123
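Note that the recall printed above is identical to the accuracy. This is expected: when per-class recall is weighted by each class's share of the true labels, the result is exactly the fraction of correct predictions. The following minimal sketch in plain Scala shows how the weighted metrics can be computed from (true label, predicted label) pairs; the object and function names are hypothetical and the code is an illustration, not Spark's implementation.

```scala
object WeightedMetricsSketch {
  // Each pair is (true label, predicted label).
  def accuracy(pairs: Seq[(Int, Int)]): Double =
    pairs.count { case (y, p) => y == p }.toDouble / pairs.size

  // Recall of each class, weighted by that class's share of the true labels.
  def weightedRecall(pairs: Seq[(Int, Int)]): Double =
    pairs.groupBy(_._1).values.map { group =>
      val recall = group.count { case (y, p) => y == p }.toDouble / group.size
      recall * group.size / pairs.size
    }.sum

  // Precision of each class, weighted by that class's share of the true labels.
  // A class that is never predicted is given a precision of zero.
  def weightedPrecision(pairs: Seq[(Int, Int)]): Double = {
    val predictedCounts = pairs.groupBy(_._2).map { case (c, g) => c -> g.size }
    pairs.groupBy(_._1).map { case (c, group) =>
      val truePositives = group.count { case (y, p) => y == p }
      val predicted = predictedCounts.getOrElse(c, 0)
      val precision = if (predicted == 0) 0.0 else truePositives.toDouble / predicted
      precision * group.size / pairs.size
    }.sum
  }

  def main(args: Array[String]): Unit = {
    val pairs = Seq((0, 0), (0, 1), (1, 1), (2, 1))
    println(s"accuracy = ${accuracy(pairs)}")
    println(s"weighted recall = ${weightedRecall(pairs)}")
    println(s"weighted precision = ${weightedPrecision(pairs)}")
  }
}
```

Weighted precision, by contrast, can differ from accuracy, because it divides by the number of times each class was predicted rather than the number of times it actually occurred.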

Step 11. Stop the Spark session:

spark.stop() // Stop Spark session

This way, we can convert a multinomial classification problem into multiple binary classification problems without changing the problem type. However, from step 10, we can observe that the classification accuracy is not good at all. There may be several reasons for this, such as the nature of the dataset we used to train the model. Even more importantly, we did not tune the hyperparameters while training the Logistic Regression model. Moreover, the OVTR transformation itself may have cost some accuracy.
