
3. Working with TensorFlow Data

David Paper, Logan, UT, USA

We introduce TensorFlow Datasets (TFDS). We discuss many facets of TFDS with code examples. We continue with a complete TFDS modeling example.

Notebooks for chapters are located at the following URL: https://github.com/paperd/tensorflow.

TFDS is a collection of datasets ready to use with TensorFlow. Like all TensorFlow consumable datasets, TFDS are exposed as tf.data.Datasets, which allows us to create easy-to-use, high-performance input pipelines.

Enable the GPU (if not already enabled):
  1. Click Runtime in the top-left menu.
  2. Click Change runtime type from the drop-down menu.
  3. Choose GPU from the Hardware accelerator drop-down menu.
  4. Click SAVE.
Test if GPU is active:
import tensorflow as tf
# display tf version and test if GPU is active
tf.__version__, tf.test.gpu_device_name()

Import the tensorflow library. If ‘/device:GPU:0’ is displayed, the GPU is active. If an empty string is displayed, the regular CPU is active.
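Recent TF 2.x releases also offer a device-listing API we can use for the same check; a minimal sketch:
# list physical GPU devices; an empty list means no GPU is attached
gpus = tf.config.list_physical_devices('GPU')
print ('GPU active:', len(gpus) > 0)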

TensorFlow Datasets (TFDS)

The TensorFlow Datasets documentation (https://www.tensorflow.org/datasets) introduces TFDS, and the dataset catalog (https://www.tensorflow.org/datasets/catalog/overview) lists every available dataset, organized by category, along with additional technical information.

Colab Abends

When we run Google Colab for a long time (several hours) without pause, or load large datasets into memory and process them, the notebook may crash (abnormally end, or abend). When this happens, you have two choices that we know of:
  1. Restart all runtimes.
  2. Close the program and restart it from scratch.

Available TFDS

Let’s begin by displaying a list of available TFDS:
import tensorflow_datasets as tfds
# See available datasets
tfds.list_builders()

Begin by importing the tfds module. Use the list_builders() method to display the available datasets.

Find out how many TFDS are in the tensorflow_datasets container:
print (str(len(tfds.list_builders())) + ' datasets')

Wow! There are 244 datasets (as of this writing) we can use to practice TensorFlow.
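Since the list is long, a quick membership test confirms that a dataset name is available before we try to load it; a small sketch (the second name is a made-up placeholder):
# verify a builder name exists before loading
print ('mnist' in tfds.list_builders())
print ('not_a_real_dataset' in tfds.list_builders())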

Peruse the following URL for a really nice tutorial on TFDS:

https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb

Load a TFDS

We can load a TFDS with one line of code! The tutorial on the TFDS website states that we must install tensorflow-datasets, but Google Colab already has it preinstalled, so we don’t have to.

Tip

If you are working in an environment other than Google Colab, you can install the tfds module by running the following code snippet: !pip install tensorflow-datasets

Import the tfds module:
import tensorflow_datasets as tfds
Load training data directly with tfds.load:
# load train set
train, info = tfds.load('mnist', split="train",
                        with_info=True)
info

The tfds.load function loads the named dataset into a tf.data.Dataset. Passing with_info=True also returns an info object that exposes useful metadata about the dataset. Metadata is a set of data that describes and gives information about other data.

Load test data directly:
# load test data
test, info = tfds.load('mnist', split="test", with_info=True)
info

We loaded the MNIST train and test dataset from the TFDS container. The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. The database consists of a set of 60,000 training examples and 10,000 test examples. We included the info element, which provides detailed information about the dataset.
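As an aside, tfds.load also accepts a list of split names, so both sets can be loaded in one call; a sketch that combines the two loads above:
# load train and test splits together; the first return value is a list
(train, test), info = tfds.load('mnist', split=['train', 'test'],
                                with_info=True)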

Note

In machine learning vernacular, a data element is typically described as either an example or sample. The word example is used interchangeably with sample.

Although feedforward neural nets don’t tend to perform well on images, MNIST is an exception because it is heavily preprocessed and the images are small. That is, MNIST images are small and uniformly sized, centered in the middle of the image space, and vertically oriented.

Extract Useful Information

The info element includes methods that allow us to extract specific information about a dataset.

Listing 3-1 extracts information about the classes.
# create a variable to hold a newline character
br = '\n'
# display number of classes
num_classes = info.features['label'].num_classes
class_labels = info.features['label'].names
# display class labels
print ('number of classes:', num_classes)
print ('class labels:', class_labels)
Listing 3-1

Extracting meaningful information from a dataset

We just extracted the number of classes and class labels from the features attribute of the info object.
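Because the label feature is a tfds ClassLabel, it can also translate between integer labels and their string names; a minimal sketch:
# convert between integer labels and string class names
print (info.features['label'].int2str(4))
print (info.features['label'].str2int('4'))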

Inspect the TFDS

We have two ways to inspect a dataset’s elements:
  1. Print the dataset.
  2. Print the dataset’s element_spec attribute.
Print elements:
# display training and test set
print (train)
print (test)
Print elements with element_spec:
# display with element_spec method
print (train.element_spec)
print (test.element_spec)

Either way, the output is very similar. Image tensor shapes are (28, 28, 1), so training and test data consist of 28 × 28 images. The 1 means that images have a single channel, that is, grayscale. Image data (the feature set) is composed of tf.uint8 data, and label (or target) data is composed of tf.int64 data.

A grayscale image is one where the value of each pixel is a single sample representing only an amount of light. That is, it carries only intensity information. Grayscale images are composed exclusively of shades of gray. The contrast of an image ranges from black at the weakest intensity to white at the strongest.
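We can confirm the intensity range by inspecting one raw image, which should hold tf.uint8 values from 0 (black) to 255 (white); a quick sketch:
# check the raw pixel range of one training image
for sample in train.take(1):
  print (sample['image'].dtype)
  print (tf.reduce_min(sample['image']).numpy(),
         tf.reduce_max(sample['image']).numpy())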

We can also display training examples from a TFDS with one line of code:
# Show train feature image examples
fig = tfds.show_examples(train, info)

The show_examples method displays sample images from a tf.data.Dataset.

Feature Dictionaries

All TFDS contain feature dictionaries that map feature names to tensor values. By default, tfds.load returns a dictionary of tf.Tensors. A tf.Tensor represents a rectangular array of data in TensorFlow.

A typical dataset, like MNIST, has two keys: image and label. Let’s inspect one sample with take(1). The number we feed into the take function determines how many samples we receive from the dataset.

Take one sample from the train dataset and display its keys:
for sample in train.take(1):
  print (list(sample.keys()))

We see the two keys as expected. The image key references the images in the dataset. The label key references the labels in the dataset. The formal dictionary structure is represented as {‘image’: tf.Tensor, ‘label’: tf.Tensor}.

Now that we know the keys, we can easily display the feature shape and target value from the first train sample:
for sample in train.take(1):
  print ('feature shape:', sample['image'].shape)
  print ('target value: ', sample['label'].numpy())

The shape of the first feature sample is (28, 28, 1), and the value of the first label is 4. We used the numpy() method to convert the target tensor to a scalar value. Since every sample consumed by a machine learning algorithm must have the same shape, we can get the shape of the whole dataset from a single sample!

Let’s get nine examples from the train set:
n, ls = 9, []
for sample in train.take(n):
  ls.append(sample['label'].numpy())
ls

We see [4, 1, 0, 7, 8, 1, 2, 7, 1], which matches the labels from show_examples in the previous section.

By using the as_supervised=True parameter with tfds.load, we get a tuple of (feature, label) instead of a dictionary:
ds = tfds.load('mnist', split="train", as_supervised=True)
ds = ds.take(1)
for image, label in ds:
  print (image.shape, br, label)

The sample is in the form (image, label).

We can also get a numpy tuple of (feature, label):
ds = tfds.load('mnist', split="train", as_supervised=True)
ds = ds.take(1)
for image, label in tfds.as_numpy(ds):
  print (type(image), type(label), label)

We use tfds.as_numpy to convert tf.Tensor to np.array and tf.data.Dataset to Generator[np.array].

Finally, we can get a batched tf.Tensor:
image, label = tfds.as_numpy(tfds.load(
    'mnist',
    split='train',
    batch_size=-1,
    as_supervised=True,
))
type(image), image.shape

By using batch_size=-1, we can load the full dataset in a single batch. We see a NumPy array of shape (60000, 28, 28, 1), which means the training data contains 60,000 28 × 28 grayscale images.

In summary, tfds.load returns a dictionary of tf.Tensors by default, a (feature, label) tuple of tf.Tensors with as_supervised=True, and NumPy equivalents via tfds.as_numpy. When loading a full dataset as a single batch, be careful that it fits in memory and that all examples have the same shape.
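A back-of-the-envelope check of whether a full batch fits in memory can be computed from the metadata; a sketch that ignores any framework overhead:
# raw uint8 size of the full training split in megabytes
n = info.splits['train'].num_examples
h, w, c = info.features['image'].shape
print (n * h * w * c / 1e6, 'MB')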

Build the Input Pipeline

Scale, shuffle, batch, and prefetch train data:
train_sc = train.map(lambda items:
                     (tf.cast(items['image'], tf.float32) / 255.,
                      items['label']))
train_ds = train_sc.shuffle(10000).batch(32).prefetch(1)

Use the train dataset we loaded earlier in the chapter. Although this looks complicated, we use a lambda function to cast each image to float32 and divide it by 255. We then shuffle, batch, and prefetch.

Prefetching improves model performance because it adds efficiency to the batching process. While our training algorithm is working on one batch, TensorFlow is working on the dataset in parallel to get the next batch ready.
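In newer TensorFlow releases, tf.data can pick the prefetch buffer size itself; a sketch of the same pipeline with autotuning (we assume tf.data.AUTOTUNE is available, which arrived in TF 2.4; older releases spell it tf.data.experimental.AUTOTUNE):
# let tf.data tune the prefetch buffer automatically
train_ds = train_sc.shuffle(10000).batch(32).prefetch(tf.data.AUTOTUNE)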

Scale, batch, and prefetch test data:
test_sc = test.map(lambda items:
                   (tf.cast(items['image'], tf.float32) / 255.,
                    items['label']))
test_ds = test_sc.batch(32).prefetch(1)

Use the test set we loaded earlier in the chapter. We don’t shuffle test data because it is considered new data.
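For a dataset this small, we could also cache the scaled images so later epochs skip the map step; a sketch, assuming the scaled data fits in RAM:
# cache scaled test images in memory before batching
test_ds = test_sc.cache().batch(32).prefetch(1)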

Inspect the tensors:
train_ds, test_ds

As expected, feature shapes are (None, 28, 28, 1). The first dimension is None, which indicates that batch size can be any value.

Build the Model

Import libraries:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
Create the model:
# clear previous model
tf.keras.backend.clear_session()
model = Sequential([
  Flatten(input_shape=[28, 28, 1]),
  Dense(512, activation="relu"),
  Dense(10, activation="softmax")
])

The model has an input layer, a dense hidden layer, and a dense output layer. The input layer flattens images for processing in the next layer. The hidden layer accepts data into 512 neurons. The output layer accepts data from the hidden layer into ten neurons that represent the ten digit classes.

Model Summary

Display a summary of the model:
model.summary()

The first layer accepts 28 × 28 grayscale images, so we get output shape (None, 784). None is the first parameter because TensorFlow models can accept any batch size. We get the second parameter 784 by multiplying 28 by 28 by 1, so each flattened image has 784 pixels. The number of parameters is 0 because the Flatten layer only reshapes the data; it has no trainable weights.

The second layer output shape is (None, 512) since we have 512 neurons. The number of trainable parameters is 401920. We get 401408 by multiplying the 784 inputs from the first layer by the 512 neurons in this layer. We then add 512 bias terms, one per neuron, to get 401920.

The third layer output shape is (None, 10) since we have ten classes. We get 5130 by multiplying the 512 inputs from the second layer by the 10 neurons in this layer and adding 10 bias terms.
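We can verify both counts with the dense-layer formula, inputs × units + units (weights plus biases); a quick check:
# dense layer parameters = inputs * units + units
print (784 * 512 + 512)
print (512 * 10 + 10)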

Compile and Train the Model

Compile the model with optimizer, loss, and metrics parameters:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Train the model:
epochs = 3
history = model.fit(train_ds, epochs=epochs, verbose=1,
                     validation_data=test_ds)

We train the model for three epochs. That is, we pass data three times through the model. We validate the model with test data. We get pretty good results with not much overfitting with just three epochs!

Generalize on Test Data

It’s always a good idea to evaluate based on test data:
model.evaluate(test_ds)

Visualize Performance

Get the training record into a variable:
# get training record into a variable
history_dict = history.history
Listing 3-2 plots accuracy and loss for the model.
import matplotlib.pyplot as plt
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']
epochs = range(1, len(acc) + 1)
plt.figure(figsize=(12,9))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim((0.5,1))
plt.show()
# clear previous figure
plt.clf()
plt.figure(figsize=(12,9))
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
Listing 3-2

Visualize training performance

The visualization shows that our model fits the data quite well because training and validation accuracy are closely aligned!

DatasetBuilder (tfds.builder)

Since tfds.load is really just a thin convenience wrapper around DatasetBuilder, we can build an input pipeline with the MNIST dataset directly with tfds.builder. DatasetBuilder is an abstract base class for all datasets. That is, every TensorFlow dataset is exposed as a DatasetBuilder.

DatasetBuilder performs several duties for TensorFlow datasets:
  • Defines where to download the data from and how to extract it and write it to a standard format with DatasetBuilder.download_and_prepare

  • Loads the data from disk with DatasetBuilder.as_dataset

  • Exposes all information about the data, including names, types, feature shapes, the number of records in each train and test split, and source URLs, with DatasetBuilder.info

  • Can be instantiated directly by name with tfds.builder

Unlike tfds.load, however, we have to manually fetch the DatasetBuilder by name, call download_and_prepare(), and call as_dataset(). The advantage of tfds.builder is that it allows more control over the loading process should we need it.

Listing 3-3 shows how to load MNIST with tfds.builder.
mnist_builder = tfds.builder('mnist')
mnist_info = mnist_builder.info
mnist_builder.download_and_prepare()
datasets = mnist_builder.as_dataset()
Listing 3-3

Load MNIST with tfds.builder

Begin with tfds.builder to create the dataset builder. Capture its metadata with the info attribute. Download and process the data with the download_and_prepare method. Place the processed dataset into a variable with as_dataset.

Build the train and test sets:
mnist_train, mnist_test = datasets['train'], datasets['test']
Use the feature dictionary to get critical information:
for sample in mnist_train.take(1):
  print ('feature shape:', sample['image'].shape)
  print ('target value: ', sample['label'].numpy())

We see that the first feature has a shape of (28, 28, 1) and target value of 4.

MNIST Metadata

Like tfds.load, tfds.builder can access metadata about MNIST:
mnist_info

We see a lot of useful information about MNIST.

Access feature information:
mnist_info.features

We see useful information about images and labels.

Listing 3-4 displays the number of classes and class labels.
# display number of classes
num_classes = mnist_info.features['label'].num_classes
class_labels = mnist_info.features['label'].names
# display class labels
print ('number of classes:', num_classes)
print ('class labels:', class_labels)
Listing 3-4

Number of classes and class labels

Access shapes and datatypes:
print (mnist_info.features.shape)
print (mnist_info.features.dtype)
Access image information:
print (mnist_info.features['image'].shape)
print (mnist_info.features['image'].dtype)
Access label information:
print (mnist_info.features['label'].shape)
print (mnist_info.features['label'].dtype)
Train and test splits:
print (mnist_info.splits)
Available split keys:
print (list(mnist_info.splits.keys()))
Number of train and test examples:
print (mnist_info.splits['train'].num_examples)
print (mnist_info.splits['test'].num_examples)
Note

tfds.load has access to the same metadata as tfds.builder.
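For example, the split counts shown above are equally reachable through the info object returned by tfds.load; a small sketch (the variable name load_info is our own):
# the same metadata through tfds.load
_, load_info = tfds.load('mnist', split="train", with_info=True)
print (load_info.splits['train'].num_examples)
print (load_info.splits['test'].num_examples)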

Show Examples

As demonstrated earlier in the chapter, tfds.show_examples allows us to conveniently visualize images (and labels) from an image classification dataset.

Let’s show examples from the test set:
fig = tfds.show_examples(mnist_test, mnist_info)

Prepare DatasetBuilder Data

Prepare the input pipeline for DatasetBuilder train and test data.

Scale, shuffle, batch, and prefetch train data:
train_sc = mnist_train.map(lambda items:
                           (tf.cast(items['image'], tf.float32) / 255.,
                            items['label']))
train_build = train_sc.shuffle(1024).batch(128).prefetch(1)
Scale, batch, and prefetch test data:
test_sc = mnist_test.map(lambda items:
                         (tf.cast(items['image'], tf.float32) / 255.,
                          items['label']))
test_build = test_sc.batch(128).prefetch(1)
Inspect tensors:
train_build, test_build

As expected, feature shapes are (None, 28, 28, 1).

Build the Model

Create the model:
tf.keras.backend.clear_session()
model = Sequential([
  Flatten(input_shape=[28, 28, 1]),
  Dense(512, activation="relu"),
  Dense(10, activation="softmax")
])

Compile the Model

Compile:
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Train the Model

Train:
model.fit(train_build, epochs=3, validation_data=test_build)

As expected, results are very similar to training with tfds.load.

Generalize on Test Data

Evaluate based on test data:
model.evaluate(test_build)

Load CIFAR-10

Let’s use the DatasetBuilder method tfds.builder to manipulate another dataset and show some examples. The CIFAR-10 dataset consists of 60,000 32 × 32 color images in ten classes, with 6000 images per class. There are 50,000 training images and 10,000 test images.

The ten classes are
[airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck]

The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. Class automobile includes sedans, SUVs, and other items of that sort. Class truck includes only big trucks. Neither includes pickup trucks.

The dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5,000 images from each class. Since CIFAR-10 comes preprocessed into these batches, we can use this information to better batch data for model training.

Listing 3-5 loads and processes the dataset for TensorFlow consumption.
cifar10_builder = tfds.builder('cifar10')
cifar10_info = cifar10_builder.info
cifar10_builder.download_and_prepare()
cifar10_train = cifar10_builder.as_dataset(split='train')
cifar10_test = cifar10_builder.as_dataset(split='test')
Listing 3-5

Prepare the CIFAR-10 dataset for TensorFlow consumption

Inspect the train set:
cifar10_train

We see that image tensors have shape (32, 32, 3) and label tensors have shape (). So each image is represented by 32 × 32 pixels, and the 3 means that each pixel carries three color channels. Image tensors are datatype tf.uint8. Each label is a scalar value of datatype tf.int64.

TensorFlow leverages the RGB color model to produce color images. The RGB color model is an additive color model where red, green, and blue lights are added together in various ways to reproduce a broad array of colors. The name of the model comes from the initials of the three additive primary colors, namely, red, green, and blue.
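We can see the three channels directly by slicing one raw image along its last axis; a quick sketch:
# slice one CIFAR-10 image into red, green, and blue channels
for sample in cifar10_train.take(1):
  img = sample['image'].numpy()
  print (img.shape)
  red, green, blue = img[..., 0], img[..., 1], img[..., 2]
  print (red.shape)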

Inspect the Dataset

Get information about the dataset:
cifar10_info
Access feature information:
cifar10_info.features
Get class names:
cifar10_info.features['label'].names
Get available split keys:
print (list(cifar10_info.splits.keys()))
Show train examples:
fig = tfds.show_examples(cifar10_train, cifar10_info)
Use the feature dictionary to display train labels:
[sample['label'].numpy() for sample in cifar10_train.take(9)]

To simplify coding, we used list comprehension.

For completeness, show test examples:
fig = tfds.show_examples(cifar10_test, cifar10_info)

Prepare the Input Pipeline

Scale, shuffle, batch, and prefetch train data:
train_sc = cifar10_train.map(lambda items:
                             (tf.cast(items['image'], tf.float32) / 255.,
                              items['label']))
train_cd = train_sc.shuffle(1024).batch(128).prefetch(1)
Scale, batch, and prefetch test data:
test_sc = cifar10_test.map(lambda items:
                           (tf.cast(items['image'], tf.float32) / 255.,
                            items['label']))
test_cd = test_sc.batch(128).prefetch(1)
Inspect tensors:
train_cd, test_cd

Model the Data

Create the model:
tf.keras.backend.clear_session()
model = Sequential([
  Flatten(input_shape=[32, 32, 3]),
  Dense(512, activation="relu"),
  Dense(10, activation="softmax")
])

We must get the input shape correct. For CIFAR-10, input shape is (32, 32, 3)!

Inspect the model:
model.summary()

The first layer accepts and flattens 32 × 32 color images, so we get output shape (None, 3072). None is the first parameter because TensorFlow models can accept any batch size. We get the second parameter 3072 by multiplying 32 by 32 by 3. Each image has 1,024 pixels, which is the result of multiplying 32 by 32. Since each image is in color, we multiply 1,024 by 3 to get 3,072 values. The number of parameters is 0 because the Flatten layer has no trainable weights.

The second layer output shape is (None, 512) since we have 512 neurons. The number of parameters is 1573376. We get 1572864 by multiplying the 3,072 inputs from the first layer by the 512 neurons in this layer. We then add 512 bias terms to get 1,573,376.

The third layer output shape is (None, 10) since we have ten classes. We get 5130 by multiplying the 512 inputs from the second layer by the 10 neurons in this layer and adding 10 bias terms.

Compile the model:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Train the model:
epochs = 3
history = model.fit(train_cd, epochs=epochs, verbose=1,
                    validation_data=test_cd)

Accuracy below 50% is not good. Our model performed very poorly because feedforward neural networks are not designed to work well with image data.

Our goal was to show you how to load and model a TFDS. We introduce models that work well with image data in a later chapter.
