We introduce TensorFlow datasets (TFDS). We discuss many facets of a TFDS with code examples. We continue with a complete TFDS modeling example.
Notebooks for chapters are located at the following URL: https://github.com/paperd/tensorflow.
TFDS is a collection of datasets ready to use with TensorFlow. Like all TensorFlow consumable datasets, TFDS are exposed as tf.data.Datasets, which allows us to create easy-to-use, high-performance input pipelines.
1. Click Runtime in the top-left menu.
2. Click Change runtime type from the drop-down menu.
3. Choose GPU from the Hardware accelerator drop-down menu.
4. Click SAVE.
Import the tensorflow library and display the GPU device name. If ‘/device:GPU:0’ is displayed, the GPU is active. If an empty string (‘’) is displayed, only the regular CPU is active.
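The check can be sketched as follows. We assume the device name comes from tf.test.gpu_device_name(), which, as far as we know, returns an empty string when no GPU is visible. The helper names describe_device and active_device are ours, chosen for illustration.

```python
def describe_device(device_name: str) -> str:
    """Map the device string reported by TensorFlow to a short label."""
    # '/device:GPU:0' (or any non-empty name) means a GPU is visible;
    # an empty string means TensorFlow falls back to the CPU.
    return "GPU" if device_name else "CPU"


def active_device() -> str:
    """Ask TensorFlow which accelerator is active in this runtime."""
    import tensorflow as tf  # imported here so describe_device has no dependencies

    return describe_device(tf.test.gpu_device_name())
```

Calling active_device() in a Colab cell after changing the runtime type should report GPU.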
TensorFlow Datasets (TFDS)
Start with the first URL as it introduces TFDS. The second URL demonstrates how to display a list of all TFDS and additional technical information. The third URL shows how TFDS are categorized.
Colab Abends
1. Restart all runtimes.
2. Close the program and restart it from scratch.
Available TFDS
Begin by importing the tfds module. Use the list_builders() function to display the names of all available datasets.
Wow! There are 244 datasets (as of this writing) we can use to practice TensorFlow.
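A minimal sketch of this step, assuming the conventional alias tfds for the tensorflow_datasets package. The wrapper name available_datasets is hypothetical.

```python
def available_datasets() -> list:
    """Return the names of every dataset builder registered with TFDS."""
    import tensorflow_datasets as tfds  # conventional alias for the package

    return tfds.list_builders()
```

Printing len(available_datasets()) shows how many datasets the installed version offers. The count depends on the installed release, so it may differ from the figure quoted above.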
Peruse the following URL for a really nice tutorial on TFDS:
https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb
Load a TFDS
We can load a TFDS with one line of code! The tutorial on the TFDS website states that we must install tensorflow-datasets, but in Google Colab we don’t have to do this.
If you are working in an environment other than Google Colab, you can install the tfds module by running the following code snippet: !pip install tensorflow-datasets
The tfds.load function loads the named dataset into a tf.data.Dataset. We add the info element (by setting with_info=True) to enable display of useful metadata about the dataset. Metadata is data that describes and gives information about other data.
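The load step can be sketched as below. The function name load_mnist is hypothetical, and requesting both splits in one call is our assumption about how the train and test data are obtained.

```python
def load_mnist():
    """Load the MNIST train and test splits together with the dataset metadata."""
    import tensorflow_datasets as tfds

    # with_info=True makes tfds.load return (datasets, info) instead of datasets only.
    (train, test), info = tfds.load(
        "mnist", split=["train", "test"], with_info=True
    )
    return train, test, info
```

The first call downloads the data, so it needs network access. Later calls reuse the copy prepared on disk.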
We loaded the MNIST train and test dataset from the TFDS container. The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. The database consists of a set of 60,000 training examples and 10,000 test examples. We included the info element, which provides detailed information about the dataset.
In machine learning vernacular, a data element is typically called either an example or a sample. The two terms are used interchangeably.
Although feedforward neural nets don’t tend to perform well on images, MNIST is an exception because it is heavily preprocessed and the images are small. That is, MNIST images are roughly of the same small size, centered in the middle of the image space, and vertically oriented.
Extract Useful Information
The info element includes attributes and methods that allow us to extract specific information about a dataset.
Extracting meaningful information from a dataset
We just extracted the number of classes and class labels from the features attribute of the info element.
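A short sketch of this extraction, assuming info is the object returned by tfds.load with with_info=True and that the label feature is stored under the key label. The helper name label_metadata is ours.

```python
def label_metadata(info):
    """Return (number of classes, class names) from a TFDS info object."""
    label = info.features["label"]  # the class-label feature of the dataset
    return label.num_classes, label.names
```

For MNIST we would expect ten classes, one per digit.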
Inspect the TFDS
1. Print the element.
2. Print the element with the element_spec property.
Either way, the output is very similar. Tensor shapes are (28, 28, 1), so training and test data consist of 28 × 28 images. The trailing 1 is the number of channels, which means that images are grayscale. Image data (the feature set) is stored as tf.uint8, and label (or target) data is stored as tf.int64.
A grayscale image is one where the value of each pixel is a single sample representing only an amount of light. That is, it carries only intensity information. Grayscale images are composed exclusively of shades of gray. The contrast of an image ranges from black at the weakest intensity to white at the strongest.
The tfds.show_examples function displays sample images from a tf.data.Dataset.
Feature Dictionaries
All TFDS contain feature dictionaries that map feature names to tensor values. By default, tfds.load returns a tf.data.Dataset whose elements are dictionaries of tf.Tensors. A tf.Tensor represents a rectangular array of data in TensorFlow.
A typical dataset, like MNIST, has two keys: image and label. Let’s inspect one sample with take(1). The number we feed into the take function determines the number of samples we receive from the dataset.
We see the two keys as expected. The image key references the images in the dataset. The label key references the labels in the dataset. The formal dictionary structure is represented as {‘image’: tf.Tensor, ‘label’: tf.Tensor}.
The shape of the first feature sample is (28, 28, 1), and the value of the first label is 4. We used the numpy() method to convert the target tensor to a scalar value. Since every example in a dataset consumed by a machine learning algorithm must have the same shape, we can get the shape of all examples from a single sample!
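The inspection of a single sample can be sketched as follows. We assume ds is a dictionary-style tf.data.Dataset with the keys image and label, as returned by tfds.load by default. The helper name first_example is hypothetical.

```python
def first_example(ds):
    """Return (image shape, label value) for the first element of a dictionary-style dataset."""
    for example in ds.take(1):  # take(1) yields a dataset holding a single example
        image, label = example["image"], example["label"]
        # .numpy() converts the scalar label tensor into a plain number.
        return tuple(image.shape), int(label.numpy())
    raise ValueError("dataset is empty")
```

Printing ds.element_spec gives the same shape and datatype information without reading any data.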
We see [4, 1, 0, 7, 8, 1, 2, 7, 1], which matches the labels from show_examples in the previous section.
The sample is in the form (image, label).
We use tfds.as_numpy to convert tf.Tensor to np.array and tf.data.Dataset to Generator[np.array].
By using batch_size=-1, we can load the full dataset in a single batch. We see a NumPy array of shape (60000, 28, 28, 1), which means the training data contains 60,000 grayscale images of 28 × 28 pixels.
In summary, tfds.load returns a dictionary of tf.Tensor by default, a tuple of tf.Tensor with as_supervised=True, or np.array data when combined with tfds.as_numpy. Be careful that your dataset can fit in memory and that all examples have the same shape.
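To make the memory warning concrete, the sketch below estimates the size of the full MNIST training array and shows one way to load it as NumPy data. The function names array_nbytes and load_mnist_as_numpy are hypothetical, and combining batch_size=-1 with as_supervised=True in a single call is our choice.

```python
import math


def array_nbytes(shape, itemsize=1):
    """Bytes needed to hold a dense array of the given shape (itemsize=1 for uint8)."""
    return math.prod(shape) * itemsize


def load_mnist_as_numpy():
    """Load the whole MNIST training split as (images, labels) NumPy arrays."""
    import tensorflow_datasets as tfds

    # batch_size=-1 returns the full split as one batch; as_supervised=True
    # yields an (image, label) tuple instead of a dictionary.
    images, labels = tfds.as_numpy(
        tfds.load("mnist", split="train", batch_size=-1, as_supervised=True)
    )
    return images, labels


print(array_nbytes((60000, 28, 28, 1)))  # 47040000 bytes, roughly 47 MB of uint8 pixels
```

MNIST fits comfortably in memory, but the same estimate for a large image dataset can quickly exceed what a Colab runtime offers.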
Build the Input Pipeline
Prefetching improves training throughput because it overlaps data preparation with model execution. While our training algorithm is working on one batch, TensorFlow is working on the dataset in parallel to get the next batch ready.
Use the test set we loaded earlier in the chapter. We don’t shuffle test data because it is considered new data.
As expected, feature shapes are (None, 28, 28, 1). The first dimension is None, which indicates that batch size can be any value.
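A minimal sketch of such a pipeline, assuming ds is a tf.data.Dataset. The function name build_pipeline, the batch size, and the shuffle buffer size are hypothetical values chosen for illustration, not the settings used in the chapter.

```python
def build_pipeline(ds, batch_size, shuffle_buffer=None):
    """Optionally shuffle, then batch and prefetch a tf.data.Dataset."""
    if shuffle_buffer is not None:
        ds = ds.shuffle(shuffle_buffer)  # training data only; test data stays in order
    ds = ds.batch(batch_size)
    # prefetch(1) prepares the next batch while the model works on the current one.
    return ds.prefetch(1)
```

For training data we might call build_pipeline(train, batch_size=128, shuffle_buffer=60000). For test data we omit shuffle_buffer so the order is left untouched.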
Build the Model
The model has an input layer, a dense hidden layer, and a dense output layer. The input layer flattens images for processing in the next layer. The hidden layer accepts data into 512 neurons. The output layer accepts data from the hidden layer into ten neurons that represent the ten digit classes.
Model Summary
The first layer accepts 28 × 28 grayscale images. So we get output shape (None, 784). None is the first parameter because TensorFlow models can accept any batch size. We get the second parameter 784 by multiplying 28 by 28 by 1. So each image has 784 pixels. The number of parameters is 0 because the first layer only reshapes the data and has no trainable weights.
The second layer output shape is (None, 512) since we have 512 neurons. The number of trainable parameters is 401,920. We get 401,408 weights by multiplying the 784 outputs of the first layer by 512 neurons in this layer. We then add 512 bias terms, one per neuron in this layer, to get 401,920.
The third layer output shape is (None, 10) since we have ten classes. We get 5,130 by multiplying 10 by 512 neurons from the second layer and adding 10 bias terms, one per neuron in this layer.
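The parameter arithmetic above can be reproduced with a few lines of Python. The sketch also shows one way the three-layer model might be declared in Keras. The activation functions are assumptions, since the text does not name them, and the function names dense_params and build_model are ours.

```python
def dense_params(inputs: int, units: int) -> int:
    """Trainable parameters of a dense layer: one weight per input-unit pair plus one bias per unit."""
    return inputs * units + units


def build_model(input_shape=(28, 28, 1), hidden_units=512, num_classes=10):
    """Flatten -> Dense(hidden) -> Dense(classes), following the description in the text."""
    import tensorflow as tf

    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Flatten(),                                  # no trainable weights
        tf.keras.layers.Dense(hidden_units, activation="relu"),     # activation is an assumption
        tf.keras.layers.Dense(num_classes, activation="softmax"),   # activation is an assumption
    ])


flat = 28 * 28 * 1                  # values per flattened image
hidden = dense_params(flat, 512)    # weights plus biases of the hidden layer
output = dense_params(512, 10)      # weights plus biases of the output layer
print(flat, hidden, output, hidden + output)  # 784 401920 5130 407050
```

Calling build_model().summary() should report the same per-layer counts.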
Compile and Train the Model
We train the model for three epochs. That is, we pass data three times through the model. We validate the model with test data. We get pretty good results with little overfitting after just three epochs!
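Compilation and training can be sketched as below. The optimizer and loss are assumptions: the text does not name them, and sparse categorical cross-entropy is a common choice when labels are integers and the output layer emits class probabilities. We also assume both datasets yield (image, label) tuples, as produced with as_supervised=True. The function name compile_and_train is hypothetical.

```python
def compile_and_train(model, train_ds, test_ds, epochs=3):
    """Compile a Keras classifier for integer labels and fit it, validating on test data."""
    model.compile(
        optimizer="adam",                        # assumed optimizer
        loss="sparse_categorical_crossentropy",  # integer labels, probability outputs
        metrics=["accuracy"],
    )
    # validation_data is evaluated at the end of every epoch.
    return model.fit(train_ds, epochs=epochs, validation_data=test_ds)
```

The returned history object records loss and accuracy per epoch, which is what a later performance plot would draw on.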
Generalize on Test Data
Visualize Performance
Visualize training performance
The visualization shows that our model fits data pretty well because train and test accuracy are very well aligned!
DatasetBuilder (tfds.builder)
Since tfds.load is really just a thin convenience wrapper around DatasetBuilder, we can build an input pipeline with the MNIST dataset directly with tfds.builder. DatasetBuilder is an abstract base class for all datasets. That is, every TensorFlow dataset is exposed as a DatasetBuilder.
Where to download data from and how to extract it and write it to a standard format with DatasetBuilder.download_and_prepare
How to load data from disk with DatasetBuilder.as_dataset
All information about data including names, types, feature shapes, number of records in each train and test split, and source URLs with DatasetBuilder.info
Ability to directly instantiate any DatasetBuilder with tfds.builder
Unlike tfds.load, however, we have to manually fetch the DatasetBuilder by name, call download_and_prepare(), and call as_dataset(). The advantage of tfds.builder is that it allows more control over the loading process should we need it.
Load MNIST with tfds.builder
Begin with tfds.builder to create the dataset builder. Access metadata through its info attribute. Download and process the data with the download_and_prepare method. Place the processed dataset into a variable with the as_dataset method.
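These steps can be sketched as follows. The function name load_with_builder is hypothetical, and as_supervised=True is our assumption so that each example is an (image, label) tuple.

```python
def load_with_builder(name="mnist"):
    """Fetch a DatasetBuilder by name, prepare the data, and return (train, test, info)."""
    import tensorflow_datasets as tfds

    builder = tfds.builder(name)    # look up the DatasetBuilder by name
    info = builder.info             # metadata about the dataset
    builder.download_and_prepare()  # download the data and write it to disk
    train = builder.as_dataset(split="train", as_supervised=True)
    test = builder.as_dataset(split="test", as_supervised=True)
    return train, test, info
```

Compared with tfds.load, each stage is an explicit call, which is where the extra control over the loading process comes from.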
We see that the first feature has a shape of (28, 28, 1) and target value of 4.
MNIST Metadata
We see a lot of useful information about MNIST.
We see useful information about images and labels.
Number of classes and class labels
tfds.load has access to the same metadata as tfds.builder.
Show Examples
As demonstrated earlier in the chapter, tfds.show_examples allows us to conveniently visualize images (and labels) from an image classification dataset.
Prepare DatasetBuilder Data
Prepare the input pipeline for DatasetBuilder train and test data.
As expected, feature shapes are (None, 28, 28, 1).
Build the Model
Compile the Model
Train the Model
As expected, results are very similar to training with tfds.load.
Generalize on Test Data
Load CIFAR-10
Let’s use the tfds.builder function to manipulate another dataset and show some examples. The CIFAR-10 dataset consists of 60,000 color images of 32 × 32 pixels in ten classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.
The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. Class automobile includes sedans, SUVs, and other items of that sort. Class truck includes only big trucks. Neither includes pickup trucks.
The dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly selected images from each class. Training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, training batches contain exactly 5,000 images from each class. Since CIFAR-10 is preprocessed with batches, we can use this information to better batch data for the training model.
Prepare the CIFAR-10 dataset for TensorFlow consumption
We see that image tensors have shape (32, 32, 3) and label tensors have shape (). So each image is represented by 32 × 32 pixels. The 3 value is the number of color channels, which means that images are in color. Image tensors are datatype tf.uint8. Each label is a scalar value. Label tensors are datatype tf.int64.
TensorFlow leverages the RGB color model to produce color images. The RGB color model is an additive color model where red, green, and blue lights are added together in various ways to reproduce a broad array of colors. The name of the model comes from the initials of the three additive primary colors, namely, red, green, and blue.
Inspect the Dataset
To simplify coding, we used list comprehension.
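One way a list comprehension might be used for such an inspection is sketched below. We assume ds yields (image, label) tuples and info is the dataset's info element. The helper name first_label_names and the choice to translate class indices into names are ours, not necessarily what the chapter's listing does.

```python
def first_label_names(ds, info, n=9):
    """Class names of the first n examples of an (image, label) dataset."""
    to_name = info.features["label"].int2str  # maps a class index to its name
    # The list comprehension replaces an explicit loop with append().
    return [to_name(int(label.numpy())) for _, label in ds.take(n)]
```

For CIFAR-10 the result is a list of class names such as automobile or truck rather than raw integers.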
Prepare the Input Pipeline
Model the Data
We must get the input shape correct. For CIFAR-10, input shape is (32, 32, 3)!
The first layer accepts and flattens 32 × 32 color images. So we get output shape (None, 3072). None is the first parameter because TensorFlow models can accept any batch size. We get the second parameter 3072 by multiplying 32 by 32 by 3. Each image has 1,024 pixels, which is the result of multiplying 32 by 32. Since each pixel has three color values, we multiply 1,024 by 3 to get 3,072 values per image. The number of parameters is 0 because flattening only reshapes the data and has no trainable weights.
The second layer output shape is (None, 512) since we have 512 neurons. The number of parameters is 1,573,376. We get 1,572,864 weights by multiplying the 3,072 outputs of the first layer by 512 neurons in this layer. We then add 512 bias terms, one per neuron in this layer, to get 1,573,376.
The third layer output shape is (None, 10) since we have ten classes. We get 5,130 by multiplying 10 by 512 neurons from the second layer and adding 10 bias terms, one per neuron in this layer.
Accuracy below 50% is not good. Our model performed very poorly because feedforward neural networks are not designed to work well with image data.
Our goal was to show you how to load and model a TFDS. We introduce models that work well with image data in a later chapter.