Understanding the MNIST dataset

Modified National Institute of Standards and Technology (MNIST) is a dataset that contains images of handwritten digits. It is widely used in the ML community for implementing and testing computer vision algorithms. MNIST is an open dataset made available by Professor Yann LeCun at http://yann.lecun.com/exdb/mnist/, where the training dataset and the test dataset are provided as separate files. The labels corresponding to the training and test datasets are also available as separate files. The training dataset has 60,000 samples and the test dataset has 10,000 samples.

The following diagram shows some sample images from the MNIST dataset. Each image comes with a label indicating the digit it shows:

Sample images from MNIST dataset

The labels for the images shown in the preceding diagram are 5, 0, 4, and 1. Each image in the dataset is a grayscale image and is represented in 28 x 28 pixels. A sample image represented with pixels is shown in the following screenshot:

Sample image from MNIST dataset represented with 28 x 28 pixels

It is possible to flatten the 28 x 28 pixel matrix and represent it as a vector of 784 pixel values. Essentially, the training dataset is a 60,000 x 784 matrix that could be used with ML algorithms. The test dataset is a 10,000 x 784 matrix. The training and test datasets may be downloaded from the source with the following code:

# setting the working directory where the files need to be downloaded
setwd('/home/sunil/Desktop/book/chapter 19/MNIST')
# download the training and testing dataset from source
download.file("http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz","train-images-idx3-ubyte.gz")
download.file("http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz","train-labels-idx1-ubyte.gz")
download.file("http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz","t10k-images-idx3-ubyte.gz")
download.file("http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz","t10k-labels-idx1-ubyte.gz")
# unzip the training and test zip files that are downloaded
R.utils::gunzip("train-images-idx3-ubyte.gz")
R.utils::gunzip("train-labels-idx1-ubyte.gz")
R.utils::gunzip("t10k-images-idx3-ubyte.gz")
R.utils::gunzip("t10k-labels-idx1-ubyte.gz")
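
Running the preceding code again repeats all four downloads. The following is a minimal sketch of one way to skip files that are already present. The names fetch_mnist(), mnist_files, and dest_dir are hypothetical and are not part of the original code; the sketch also assumes that the source URLs are still reachable and that the R.utils package is installed:

```r
# Hypothetical helper (not part of the original code): download and unzip
# each MNIST file only when it is not already present in dest_dir.
mnist_files = c("train-images-idx3-ubyte", "train-labels-idx1-ubyte",
                "t10k-images-idx3-ubyte", "t10k-labels-idx1-ubyte")

fetch_mnist = function(dest_dir = ".",
                       base_url = "http://yann.lecun.com/exdb/mnist/",
                       files = mnist_files) {
  for (f in files) {
    raw_path = file.path(dest_dir, f)
    gz_path = paste0(raw_path, ".gz")
    # skip files that were already downloaded and unzipped
    if (file.exists(raw_path)) next
    if (!file.exists(gz_path)) {
      # mode = "wb" writes the archive as a binary file
      download.file(paste0(base_url, f, ".gz"), gz_path, mode = "wb")
    }
    # gunzip() writes the uncompressed file next to the archive
    R.utils::gunzip(gz_path)
  }
  # return the paths of the uncompressed files invisibly
  invisible(file.path(dest_dir, files))
}
```

Calling fetch_mnist() with no arguments would work in the current working directory, so it could follow the setwd() call shown earlier.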

Once the data is downloaded and unzipped, we will see the files in our working directory. However, these files are in a binary format and cannot be loaded directly with the regular read.csv function. The following custom functions read the training and test data from the binary files:

# function to load the image files
load_image_file = function(filename) {
  # opening the binary file in read mode
  f = file(filename, 'rb')
  # skipping the magic number at the start of the header
  readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  # reading the number of images and the image dimensions from the header
  n = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  nrow = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  ncol = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  # reading the pixel values (one unsigned byte each) into a vector called x
  x = readBin(f, 'integer', n = n * nrow * ncol, size = 1, signed = FALSE)
  # closing the file
  close(f)
  # reshaping x into one row per image and returning it as a dataframe
  data.frame(matrix(x, ncol = nrow * ncol, byrow = TRUE))
}
# function to load label files
load_label_file = function(filename) {
  # opening the binary file in read mode
  f = file(filename, 'rb')
  # skipping the magic number, then reading the number of labels
  readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  n = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  # reading the labels (one unsigned byte each) into the y vector
  y = readBin(f, 'integer', n = n, size = 1, signed = FALSE)
  # closing the file
  close(f)
  # returning the y vector
  y
}
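
Both loaders rely on the layout of the IDX file format: a big-endian 32-bit magic number, one big-endian 32-bit integer per dimension, and then the data as unsigned bytes. The following sketch illustrates this layout on a tiny synthetic file rather than on the real dataset. The file contents and the names tiny_file, magic, dims, pixels, and images are made up for illustration:

```r
# Build a tiny synthetic IDX file: 2 images of 2 x 3 pixels each.
tiny_file = tempfile(fileext = ".idx")
con = file(tiny_file, "wb")
# header: magic number 2051 (image files), then count, rows, and columns
writeBin(c(2051L, 2L, 2L, 3L), con, size = 4, endian = "big")
# pixel data: one unsigned byte per pixel, stored row by row
writeBin(as.raw(c(0, 50, 100, 150, 200, 255,
                  255, 200, 150, 100, 50, 0)), con)
close(con)

# Read the file back in the same way that load_image_file() does.
con = file(tiny_file, "rb")
magic = readBin(con, "integer", n = 1, size = 4, endian = "big")
dims = readBin(con, "integer", n = 3, size = 4, endian = "big")
pixels = readBin(con, "integer", n = prod(dims), size = 1, signed = FALSE)
close(con)

# one row per image, one column per pixel
images = matrix(pixels, ncol = dims[2] * dims[3], byrow = TRUE)
print(magic)        # 2051
print(dim(images))  # 2 6
```

Label files follow the same pattern with a single dimension, which is why load_label_file() reads only one count after the magic number.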

The functions may be called with the following code:

# load training images data through the load_image_file custom function
train = load_image_file("train-images-idx3-ubyte")
# load test data through the load_image_file custom function
test  = load_image_file("t10k-images-idx3-ubyte")
# load the train dataset labels
train.y = load_label_file("train-labels-idx1-ubyte")
# load the test dataset labels
test.y  = load_label_file("t10k-labels-idx1-ubyte")

In RStudio, when we execute the code, we see train, test, train.y, and test.y displayed under the Environment tab. This confirms that the datasets are successfully loaded and that the respective dataframes and label vectors are created, as shown in the following screenshot:
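
The same confirmation can be made in code instead of by inspecting the Environment tab. The following sketch assumes that images and labels have the shapes described earlier. The helper check_mnist() is hypothetical, and it is exercised here on small synthetic data so that the block runs without the downloaded files:

```r
# Hypothetical helper: verify that images and labels are consistent.
check_mnist = function(images, labels, n_pixels = 784) {
  stopifnot(
    nrow(images) == length(labels),   # one label per image
    ncol(images) == n_pixels,         # flattened 28 x 28 images
    all(labels %in% 0:9),             # labels are the digits 0 to 9
    all(images >= 0 & images <= 255)  # pixels are unsigned bytes
  )
  invisible(TRUE)
}

# Synthetic stand-in for the real data: 5 blank images with labels.
fake_images = data.frame(matrix(0L, nrow = 5, ncol = 784))
fake_labels = c(5L, 0L, 4L, 1L, 9L)
check_mnist(fake_images, fake_labels)
```

With the real data, the corresponding calls would be check_mnist(train, train.y) and check_mnist(test, test.y).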

Once the image data is loaded into the dataframe, it is in the form of a series of numbers that represent the pixel values. The following is a helper function that visualizes the pixel data as an image in RStudio:

# helper function to visualize image given a record of pixels
show_digit = function(arr784, col = gray(12:1 / 12), ...) {
  image(matrix(as.matrix(arr784), nrow = 28)[, 28:1], col = col, ...)
}
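
The column reversal [, 28:1] in show_digit() is needed because matrix() fills column by column while the pixels are stored row by row, and image() draws the first column of its input along the bottom of the plot. The following sketch traces this on a 3 x 3 image without plotting anything. The names img, flat, m, and flipped are illustrative only:

```r
# A 3 x 3 "image" whose top row is bright (9) and whose other rows are dark.
img = matrix(c(9, 9, 9,
               0, 0, 0,
               0, 0, 0), nrow = 3, byrow = TRUE)
# flatten row by row, as the MNIST files store the pixels
flat = as.vector(t(img))
# matrix() fills column by column, so column 1 of m is row 1 of the image
m = matrix(flat, nrow = 3)
# image() would draw column 1 at the bottom; reversing the columns moves
# the top row of the image to the last column, which is drawn at the top
flipped = m[, 3:1]
print(flipped[, 3])  # 9 9 9
```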

The show_digit() function may be called like any other R function, with a record (row) of the dataframe as its argument. For example, the following code visualizes the third record in the training dataset as an image in RStudio:

# viewing image corresponding to record 3 in the train dataset
show_digit(train[3, ])

This will give the following output:

Dr. David Robinson, in his blog post "Exploring handwritten digit classification: a tidy analysis of the MNIST dataset" (http://varianceexplained.org/r/digit-eda/), performed a thorough exploratory data analysis of the MNIST dataset, which will help you better understand the dataset.
