In this section, we'll implement a convolutional neural network for image classification. We are going to use the famous Modified National Institute of Standards and Technology (MNIST) dataset of handwritten digits, which can be found at http://yann.lecun.com/exdb/mnist/. The dataset is a standard benchmark, built from data collected by the US National Institute of Standards and Technology, used to calibrate and compare image recognition methods based on machine learning, primarily neural networks.
The creators of the dataset used samples from the US Census Bureau, later adding samples written by American high-school students. All the samples are normalized, anti-aliased grayscale images of 28 x 28 pixels. The MNIST database contains 60,000 images for training and 10,000 images for testing. There are four files:
- train-images-idx3-ubyte: Training set images
- train-labels-idx1-ubyte: Training set labels
- t10k-images-idx3-ubyte: Test set images
- t10k-labels-idx1-ubyte: Test set labels
The files that contain labels are in the following format:
| Offset | Type           | Value             | Description              |
|--------|----------------|-------------------|--------------------------|
| 0      | 32-bit integer | 0x00000801 (2049) | Magic number (MSB first) |
| 4      | 32-bit integer | 60,000 or 10,000  | Number of items          |
| 8      | Unsigned byte  | ??                | Label                    |
| 9      | Unsigned byte  | ??                | Label                    |
| ...    | ...            | ...               | ...                      |
The label values range from 0 to 9. The files that contain images are in the following format:
| Offset | Type           | Value             | Description              |
|--------|----------------|-------------------|--------------------------|
| 0      | 32-bit integer | 0x00000803 (2051) | Magic number (MSB first) |
| 4      | 32-bit integer | 60,000 or 10,000  | Number of images         |
| 8      | 32-bit integer | 28                | Number of rows           |
| 12     | 32-bit integer | 28                | Number of columns        |
| 16     | Unsigned byte  | ??                | Pixel                    |
| 17     | Unsigned byte  | ??                | Pixel                    |
| ...    | ...            | ...               | ...                      |
Pixels are stored in a row-wise manner, with values in the range of [0, 255]. 0 means background (white), while 255 means foreground (black).
In this example, we are using the PyTorch deep learning framework. This framework is used primarily from the Python language, but its core is written in C++, and it has a well-documented and actively developed C++ API called LibTorch. The framework is built on top of the ATen tensor library, which can use Nvidia's CUDA technology for performance improvement. The Python and C++ APIs are largely the same apart from language-specific notation, so we can use the official Python documentation to learn how to use the framework. The documentation also contains a section describing the differences between the C++ and Python APIs, as well as dedicated articles on using the C++ API.
The PyTorch framework is widely used for deep learning research. As we discussed previously, it provides functionality for managing big datasets: it can automatically parallelize loading data from disk and manage pre-fetch buffers, which reduces memory usage and limits the number of expensive disk I/O operations. It provides the torch::data::Dataset base class for implementing custom user datasets. We only need to override two methods here: get and size. These methods are not declared virtual because this class uses C++ template-based (compile-time) polymorphism, so we inherit from it by passing our derived type as a template argument.