Importing the UCI ML handwritten digits dataset

We will again work with handwritten digits, as we did with the MNIST dataset in Chapter 04, Advanced Matplotlib, since we will be demonstrating visualization along with model building in machine learning. To speed up the training process, however, we will take a shortcut: instead of using the 60,000 images of 28x28 pixels each, we will import a similar dataset of 8x8-pixel images from the scikit-learn package.

This dataset comes from the University of California, Irvine Machine Learning Repository, found at http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits. It is a preprocessed version of images of digits written by 43 people, with 30 contributing to the training set and 13 to the testing set. The preprocessing method is described in M. D. Garris, J. L. Blue, G. T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C. L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469, 1994.

The code to import the dataset is as follows:

from sklearn.datasets import load_digits

To begin, let's store our dataset in a variable. We will call it digits and reuse it throughout the chapter:

digits = load_digits()

Let's see what is loaded by the load_digits() function by printing out the variable digits:

print(type(digits))
print(digits)

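The type of digits is <class 'sklearn.utils.Bunch'> (recent scikit-learn releases may print it as sklearn.utils._bunch.Bunch). This is a dictionary-like container that scikit-learn uses for its bundled sample datasets.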
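Because a Bunch is a dictionary subclass, its members can be reached either as keys or as attributes. The following is a minimal sketch, assuming scikit-learn is installed; the variable names is_dict, by_key, and by_attr are only illustrative:

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Bunch subclasses dict, so the usual dictionary interface is available
is_dict = isinstance(digits, dict)
print(is_dict)

# The same member can be read by key or by attribute
by_key = digits["target"]
by_attr = digits.target
print(by_key is by_attr)
```

Attribute access is the shorter form and is the one used in the rest of this section.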

As the output for print(digits) is somewhat long, we will show its beginning and end in two screenshots:

The following screenshot shows the tail of the output:

We can see that the digits object contains five members (newer scikit-learn releases add a few more, such as 'feature_names' and 'frame'):

'data': Pixel values flattened into 1D NumPy arrays
'target': A NumPy array of identity labels of each element in the dataset
'target_names': A NumPy array of the unique labels existing in the dataset (the integers 0-9 in this case)
'images': Pixel values reshaped into 2D NumPy arrays in the dimension of images
'DESCR': Description of the dataset
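Printing the whole object is verbose, so it can be easier to look at each member's shape instead. The following is a rough sketch of one way to do that; the shapes dictionary is a helper introduced here for illustration, and np.shape is used so that the loop works whether a member is an array or a list:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Collect the shape of each array-like member
shapes = {name: np.shape(digits[name])
          for name in ("data", "target", "target_names", "images")}
for name, shape in shapes.items():
    print(f"{name:>12}: {shape}")

# DESCR is a plain string; print only its first few lines
print("\n".join(digits.DESCR.splitlines()[:5]))
```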

Besides having smaller image dimensions than MNIST, this dataset also has far fewer images. So, how many are there in total? The shape attribute of a NumPy array returns its dimensions as a tuple, that is, nd.shape, where nd is the array. Hence, to inquire about the shape of digits.images, we call:

print(digits.images.shape)

We get (1797, 8, 8) as the result, that is, 1,797 images of 8x8 pixels each.
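The shape also makes it easy to check how the 'images' and 'data' members relate, and how the samples are spread across the ten digits. The following is a minimal sketch; the names flat, same_values, and counts are introduced here for illustration only:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
n_samples, height, width = digits.images.shape

# Flatten each 8x8 image into a row of 64 pixel values
flat = digits.images.reshape(n_samples, height * width)
print(flat.shape)

# The flattened images should match the 'data' member value for value
same_values = np.array_equal(flat, digits.data)
print(same_values)

# Count how many samples carry each label 0-9
counts = np.bincount(digits.target)
for label, count in enumerate(counts):
    print(label, count)
```

If the class counts turn out to be roughly equal, plain accuracy is a reasonable first metric when evaluating a classifier on this data.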

You may wonder why the number is so peculiar. If you have particularly sharp eyes, you might have seen that there are 5,620 instances in the description. In fact, the description is retrieved from the archive web page. The data we have loaded is actually the testing portion of the full dataset. You may also download the plain text equivalent from http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes.

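If you are interested in getting the full MNIST dataset, scikit-learn also offers an API to fetch it. The fetch_mldata function originally used for this purpose was removed in scikit-learn 0.22, and fetch_openml is its replacement. Here, custom_data_home stands for a cache directory of your choice:

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, data_home=custom_data_home)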
