Improving the handwritten digit recognition application

Let's see how our CNN architecture looks when written in Java. We'll also run the Java application and test the improved model from the graphical user interface. We'll draw some digits and ask the models for predictions, and perhaps find a case where the convolutional network outperforms the simple neural network model.

Before checking out the code, let's first look at the CNN architecture that we saw in the previous section from a different point of view:

The table lists the layers in the left-hand column. Next come two activation columns: the activations are simply the input, the convolution layers, or the hidden layers, and the first activation column shows the shape of the matrix dimensions, while the second shows the total size, which is just the product of the values in the first column. The parameters are the connections between the activations of one layer and those of the next. We can see that the activations decrease from layer to layer until we reach the output size of 10, while the parameters more or less increase over time. The zeros appear because there is nothing to learn for max pooling, and at the input layer nothing has been learned yet. After the convolutions, the parameters increase dramatically; as we have already seen, it is the fully connected layers that carry most of the parameters to learn.
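To make these numbers concrete, here is a small sketch, not part of the application code, that recomputes the activation sizes and parameter counts for the architecture we are about to build (assuming no padding and the kernel sizes, strides, and filter counts used in the code below):

public class CnnShapeWalkthrough {

    // Output size of one convolution/pooling dimension with no padding:
    // out = (in - kernel) / stride + 1
    private static int outSize(int in, int kernel, int stride) {
        return (in - kernel) / stride + 1;
    }

    public static void main(String[] args) {
        // Input: 28 x 28 x 1
        int size = 28;

        // Convolution 5 x 5, stride 1, 20 filters -> 24 x 24 x 20
        size = outSize(size, 5, 1);
        int conv1Params = (5 * 5 * 1 + 1) * 20;              // 520 (weights + biases)

        // Max pooling 2 x 2, stride 2 -> 12 x 12 x 20, nothing to learn
        size = outSize(size, 2, 2);

        // Convolution 5 x 5, stride 1, 50 filters -> 8 x 8 x 50
        size = outSize(size, 5, 1);
        int conv2Params = (5 * 5 * 20 + 1) * 50;             // 25,050

        // Max pooling 2 x 2, stride 2 -> 4 x 4 x 50 = 800 activations
        size = outSize(size, 2, 2);
        int flattened = size * size * 50;                    // 800, matches nIn(800) below

        // Fully connected part: 800 -> 128 -> 64 -> 10
        int fc1Params = 800 * 128 + 128;                     // 102,528
        int fc2Params = 128 * 64 + 64;                       // 8,256
        int outParams = 64 * 10 + 10;                        // 650

        System.out.printf("flattened=%d conv1=%d conv2=%d fc1=%d fc2=%d out=%d%n",
                flattened, conv1Params, conv2Params, fc1Params, fc2Params, outParams);
    }
}

The activation size shrinks from the 784 input pixels down to the 10 output classes, while the trainable parameters jump sharply once the fully connected layers begin, exactly the pattern described above.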

Now, let's jump into the code. As always, let's start with some parameters:

public class DigitRecognizerConvolutionalNeuralNetwork {

    private static final String OUT_DIR = "HandWrittenDigitRecognizer/src/main/resources/cnnCurrentTrainingModels";
    private static final String TRAINED_MODEL_FILE = "HandWrittenDigitRecognizer/src/main/resources/cnnTrainedModels/bestModel.bin";

    private MultiLayerNetwork preTrainedModel;

    /**
     * Number of input channels; 1 because the MNIST images are grayscale.
     */
    private static final int CHANNELS = 1;
    /**
     * Number of prediction classes.
     * We have the digits 0-9, so 10 classes in total.
     */
    private static final int OUTPUT = 10;
    /**
     * Mini-batch gradient descent size, or the number of matrices processed in parallel.
     * For a Core i7 CPU, 16 is good; for a GPU, please change to 128 and up.
     */
    private static final int MINI_BATCH_SIZE = 16;
    /**
     * Number of total traversals through the data, used here as the maximum number of epochs
     * we allow. Each epoch performs (training examples / MINI_BATCH_SIZE) weight updates.
     */
    private static final int MAX_EPOCHS = 20;
    /**
     * The alpha learning rate, defining the size of the step towards the minimum.
     */
    private static final double LEARNING_RATE = 0.01;
    /**
     * https://en.wikipedia.org/wiki/Random_seed
     */
    private static final int SEED = 123;

We specify the number of channels with private static final int CHANNELS, which is 1 because the images are grayscale. The number of prediction classes is given by private static final int OUTPUT, which is 10, one for each of the digits 0 to 9. The mini-batch size, specified by private static final int MINI_BATCH_SIZE, is the level of parallelism; it is 16 because we are using a CPU, but for a GPU, feel free to increase this value. By an epoch, in MAX_EPOCHS, we mean one full pass through the training data, so MAX_EPOCHS is the maximum number of such passes.
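To be precise, an epoch is one full pass over the training data, while a weight update happens once per mini-batch. As a rough back-of-the-envelope calculation, assuming the standard 60,000-example MNIST training set:

int trainingExamples = 60_000;                              // standard MNIST training set size
int updatesPerEpoch = trainingExamples / MINI_BATCH_SIZE;   // 60,000 / 16 = 3,750 weight updates
int totalUpdates = updatesPerEpoch * MAX_EPOCHS;            // 3,750 * 20 = 75,000 updates at most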

Having these parameters in place, we are now ready to build the architecture.

First, we define the network configuration, together with the early stopping configuration:

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(SEED)
        .learningRate(LEARNING_RATE)
        .weightInit(WeightInit.XAVIER)
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .updater(Updater.NESTEROVS)
        .list()
        .layer(0, new ConvolutionLayer.Builder(5, 5)
                .nIn(CHANNELS)
                .stride(1, 1)
                .nOut(20)
                .activation(Activation.IDENTITY)
                .build())
        .layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                .kernelSize(2, 2)
                .stride(2, 2)
                .build())
        .layer(2, new ConvolutionLayer.Builder(5, 5)
                .nIn(20)
                .stride(1, 1)
                .nOut(50)
                .activation(Activation.IDENTITY)
                .build())
        .layer(3, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                .kernelSize(2, 2)
                .stride(2, 2)
                .build())
        .layer(4, new DenseLayer.Builder().activation(Activation.RELU)
                .nIn(800)
                .nOut(128).build())
        .layer(5, new DenseLayer.Builder().activation(Activation.RELU)
                .nIn(128)
                .nOut(64).build())
        .layer(6, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nOut(OUTPUT)
                .activation(Activation.SOFTMAX)
                .build())
        .setInputType(InputType.convolutionalFlat(28, 28, 1))
        .backprop(true).pretrain(false).build();

EarlyStoppingConfiguration earlyStoppingConfiguration = new EarlyStoppingConfiguration.Builder()
        .epochTerminationConditions(new MaxEpochsTerminationCondition(MAX_EPOCHS))
        .scoreCalculator(new AccuracyCalculator(new MnistDataSetIterator(MINI_BATCH_SIZE, false, 12345)))
        .evaluateEveryNEpochs(1)
        .modelSaver(new LocalFileModelSaver(OUT_DIR))
        .build();

Things such as the weight initialization, WeightInit.XAVIER, help us get off to a good start. Then we use the exponentially-weighted average updater, or momentum updater (Updater.NESTEROVS), and then we move to the first convolution layer, which is just a 5 x 5 filter with a stride of one. We have two ones in .stride(1, 1) because it's possible to use a different stride horizontally and vertically, but we use one for both. Now, notice that we aren't defining the first two dimensions of the input.

When using Deeplearning4j, the first two dimensions are figured out for you; you only need to define the number of channels, which is the third dimension of the matrix. We start with one channel for black-and-white images, and then we have 20 as output, so nOut(20) means we have the 5 x 5 x 20 layer shown in the architecture. We then move to the max pooling layer, which is defined by .kernelSize(2, 2) and .stride(2, 2). We don't need to define any input or output here, because the max pooling layer doesn't change the number of channels, and the first two dimensions are already known.

Then we have the other convolution layer, 5 x 5; the number of input channels is 20 and the output is 50, so we increase the number of channels. Then we have the other max pooling layer, 2 x 2 with a stride of two. That's it for the convolution part. Now begins the part we've already seen: the dense layers, or fully connected hidden layers. The number of inputs is 800, so .nIn(800), and the number of outputs is 128, .nOut(128); this is the number of neurons in the first fully connected hidden layer, and we use a ReLU activation, which is the default choice nowadays. Then we have the second hidden layer, which takes 128 inputs and produces 64 outputs, and again we use the ReLU activation. In the end, we close with the output layer, which is a softmax over the 10 digits, 0-9. Finally, .setInputType(InputType.convolutionalFlat(28, 28, 1)) is how Deeplearning4j learns the first two dimensions: as we said at the beginning, the shape of the input is 28 x 28, and it only requires the number of channels, which is 1, and then it calculates the first two dimensions of every layer.

The number of inputs isn't required, but in this example, we defined it for the sake of clarity; maybe in other examples we'll omit it.
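For example, because .setInputType(InputType.convolutionalFlat(28, 28, 1)) lets Deeplearning4j work out the input size of each layer, the first dense layer could just as well be written without nIn, along these lines:

.layer(4, new DenseLayer.Builder()
        .activation(Activation.RELU)
        // nIn is omitted; Deeplearning4j infers the 800 inputs from the previous layer
        .nOut(128)
        .build())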

For training, we use a slightly different technique, called early stopping:

// mnistTrain is the MNIST training data iterator (its creation is not shown in this snippet)
EarlyStoppingTrainer trainer = new EarlyStoppingTrainer(earlyStoppingConfiguration, conf, mnistTrain);
EarlyStoppingResult<MultiLayerNetwork> result = trainer.fit();

log.info("Termination reason: " + result.getTerminationReason());
log.info("Termination details: " + result.getTerminationDetails());
log.info("Total epochs: " + result.getTotalEpochs());
log.info("Best epoch number: " + result.getBestModelEpoch());
log.info("Score at best epoch: " + result.getBestModelScore());
}

As soon as the goal you define is reached, early stopping immediately stops training and gives you the best model seen so far. In our case the goal is simple: stop as soon as we reach 20 epochs (MAX_EPOCHS).
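In our configuration, MaxEpochsTerminationCondition is the only goal, but the same builder accepts other termination conditions from Deeplearning4j. As a sketch of what that could look like (these extra conditions are not used in our application), we could also stop when the score stops improving or after a fixed amount of wall-clock time:

EarlyStoppingConfiguration earlyStoppingConfiguration = new EarlyStoppingConfiguration.Builder()
        // stop after MAX_EPOCHS at the latest...
        .epochTerminationConditions(new MaxEpochsTerminationCondition(MAX_EPOCHS),
                // ...or earlier, if the score has not improved for 3 consecutive epochs
                new ScoreImprovementEpochTerminationCondition(3))
        // additionally, give up after 2 hours of training
        .iterationTerminationConditions(new MaxTimeIterationTerminationCondition(2, TimeUnit.HOURS))
        .scoreCalculator(new AccuracyCalculator(new MnistDataSetIterator(MINI_BATCH_SIZE, false, 12345)))
        .evaluateEveryNEpochs(1)
        .modelSaver(new LocalFileModelSaver(OUT_DIR))
        .build();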

Early stopping requires an evaluator. We use a simple one that takes the test dataset, evaluates the current model against it at each epoch, and keeps the model with the highest accuracy:

public class AccuracyCalculator implements ScoreCalculator<MultiLayerNetwork> {

    private static final Logger log = LoggerFactory.getLogger(AccuracyCalculator.class);

    private final MnistDataSetIterator dataSetIterator;
    private int i = 0;

    public AccuracyCalculator(MnistDataSetIterator dataSetIterator) {
        this.dataSetIterator = dataSetIterator;
    }

    @Override
    public double calculateScore(MultiLayerNetwork network) {
        // Evaluate the current model against the test dataset
        Evaluation evaluate = network.evaluate(dataSetIterator);
        double accuracy = evaluate.accuracy();
        log.info("Accuracy at iteration " + i++ + " " + accuracy);
        // Early stopping minimizes the score, so return 1 - accuracy
        return 1 - accuracy;
    }
}

So, at each epoch, it evaluates the current model against the test dataset and saves the model from the epoch that performs best. In the end, the best model, in other words the one with the highest accuracy, is saved to the OUT_DIR directory, and we also log the details that tell us why early stopping terminated.
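Once training finishes, the best model can be picked up from the result object (or loaded back from the output directory) and stored wherever the application expects it. A minimal sketch, assuming the TRAINED_MODEL_FILE constant defined earlier and Deeplearning4j's ModelSerializer:

// Take the best model found during early stopping...
MultiLayerNetwork bestModel = result.getBestModel();

// ...save it under the path the application loads its pre-trained CNN model from...
ModelSerializer.writeModel(bestModel, new File(TRAINED_MODEL_FILE), true);

// ...and restore it later, for example when the user interface starts up
MultiLayerNetwork preTrainedModel = ModelSerializer.restoreMultiLayerNetwork(new File(TRAINED_MODEL_FILE));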

Now, let's run the application. It takes a couple of hours to get good results, somewhere around 99%. Let's look at the log of a run of about 40 minutes:

It started at 96%, then moved slowly to 98.65%, then to 98.98%, and finally reached 99%:

The good thing about the convolutional network is that the more time you give it, the more it improves, whereas the standard neural network, even if you leave it running for longer, gets stuck at around 97%. So the convolution really helps to detect more features and to gain higher accuracy.

Now, let's see the application from the graphical user interface. So let's try with the first digit, a 3:

Let's see with a 6:

The simple neural network says 4, while the CNN is able to detect the 6:

Similarly, with a 9, the simple neural network says 7, while the CNN says 9. This shows that the CNN does a better job.
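Under the hood, asking either model for a prediction from the user interface comes down to turning the drawn image into a normalized 28 x 28 input and feeding it through the network. The following is only an illustrative sketch (the helper method and the float[] input are our own, not the book's code):

// Hypothetical helper: 'pixels' holds a 28 x 28 drawing flattened to 784 values in [0, 1],
// with 0 for background and 1 for fully drawn strokes, matching the MNIST normalization.
public int predictDigit(MultiLayerNetwork network, float[] pixels) {
    INDArray input = Nd4j.create(pixels, new int[]{1, 784});
    INDArray output = network.output(input);     // 10 softmax probabilities, one per digit
    return Nd4j.argMax(output, 1).getInt(0);     // index of the most likely digit
}

Normalizing the drawn pixels to the same 0-1 range as the MNIST training data matters; otherwise both models would be asked about inputs unlike anything they saw during training.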
