Chapter 2. Data Preprocessing

Introduction

In this chapter, you’ll learn how to prepare and set up data for training. Some of the most common data formats for ML work are tables, images, and text. There are commonly practiced techniques associated with each, though how you set up your data engineering pipeline will, of course, depend on your problem statement and what you are trying to predict.

I’ll look at all three formats in detail, using specific examples to walk you through the techniques. All of the data can be read directly into your Python runtime memory; however, this is not the most efficient way to use your compute resources, so you’ll also see how to stream data into the training process. When I discuss text data, I’ll give particular attention to tokenization and dictionaries. By the end of this chapter, you will know how to prepare tabular, image, and text data for training.

Preparing tabular data for training

In a tabular dataset, it is important to identify which columns are categorical, because you have to encode their values as classes or as binary representations of the classes (one-hot encoding), rather than as numerical values. Another aspect of tabular datasets is the potential for interactions among multiple features. This section will also look at the API that TensorFlow provides to make it easier to model column interactions.
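
As a quick illustration of what one-hot encoding means, here is a toy example using pandas’ get_dummies on a made-up column (rather than the TensorFlow feature_column API you’ll see later in this section): a categorical column with three distinct values becomes three binary columns.

import pandas as pd

# Toy example: one-hot encode a categorical column with three distinct values.
df = pd.DataFrame({'embark_town': ['Southampton', 'Cherbourg', 'Queenstown']})
print(pd.get_dummies(df, columns=['embark_town']))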

It’s common to encounter tabular datasets as comma-separated value (CSV) files or simply as structured output from a database query. For this example, we’ll start with a dataset that’s already in a pandas DataFrame and then learn how to transform it and set it up for model training. We’ll use the Titanic survivor dataset, an open-source tabular dataset that is often used for teaching because of its manageable size and availability. This dataset contains attributes for each passenger, such as age, gender, cabin grade, and whether or not they survived. We are going to try to predict each passenger’s probability of survival based on their attributes, or features. Be aware that this is a small dataset for teaching and learning purposes only; in reality, your dataset will often be much larger. You may make different choices for some of the input parameters and default values shown here, so don’t take them too literally.

Let’s start with loading all the necessary libraries:

import functools
import numpy as np
import tensorflow as tf
import pandas as pd
from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

Load the data from Google’s public storage:

TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)

Now take a look at train_file_path:

print(train_file_path)
/root/.keras/datasets/train.csv

You will see this file path pointing to a CSV.

Now read the CSV as a pandas DataFrame:

titanic_df = pd.read_csv(train_file_path, header='infer')

Figure 2-1 shows what titanic_df looks like as a pandas DataFrame.

Titanic dataset as a pandas DataFrame
Figure 2-1. Titanic dataset as a pandas DataFrame

Marking columns

As you can see in Figure 2-1, there are numeric as well as categorical columns in this data. The target column, or the column for prediction, is the survived column. You’ll need to mark it as the target and mark the rest of the columns as features.

Note

A best practice in TensorFlow is to convert your table into a streaming dataset. This practice ensures that memory consumption does not grow with the size of the data.

TensorFlow provides the tf.data.experimental.make_csv_dataset API to do exactly that:

LABEL_COLUMN = 'survived'
LABELS = [0, 1]
train_ds = tf.data.experimental.make_csv_dataset(
      train_file_path,
      batch_size=3,
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True)
test_ds = tf.data.experimental.make_csv_dataset(
      test_file_path,
      batch_size=3,
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True)

In the function signature above, you specify the file path from which you wish to generate a dataset object. The batch_size is arbitrarily set to something small (here, 3) for convenience in inspecting the data. You also set label_name to the survived column. For data quality, if a question mark (“?”) appears in any cell, you want it to be interpreted as “NA” (not applicable). For training, set num_epochs to 1 to iterate over the dataset once. Finally, ignore_errors=True skips parsing errors and empty lines.

Next, inspect the data:

for batch, label in train_ds.take(1):
  print(label)
  for key, value in batch.items():
    print("{}: {}".format(key,value.numpy()))

It will appear similar to Figure 2-2:

A batch of the Titanic dataset
Figure 2-2. A batch of the Titanic dataset

Here are the major steps for preparing your training dataset for consumption by the training paradigm:

  1. Designate columns by feature types.

  2. Decide whether or not to embed or cross columns.

  3. Choose the columns of interest, possibly through experimentation.

  4. Create a ‘feature layer’ for consumption by the training paradigm.

Now that you have set up the data as datasets, you can designate each column by its feature type, such as numeric or categorical, and bucketize it (by binning) if necessary. You can also embed a column if it has too many unique categories and dimension reduction would be helpful, and you can cross columns to model feature interactions.

Let’s go ahead with step 1. There are four numeric columns: age, n_siblings_spouses, parch, and fare. Five columns are categorical: sex, class, deck, embark_town, and alone. You will create a feature_columns list to hold all the feature columns once you are done.

Here is how to designate numeric columns based strictly on the actual numeric values, without any transformation:

feature_columns = []
# numeric cols
for header in ['age', 'n_siblings_spouses', 'parch', 'fare']:
  feature_columns.append(feature_column.numeric_column(header))

Note that in addition to using age as-is, you could also bin age into buckets, such as by the quartiles of the age distribution. But what should the bin boundaries be? You can inspect the general statistics of the numeric columns in the pandas DataFrame:

titanic_df.describe()

The output of these general statistics is shown in Figure 2-3:

Statistics for numeric columns in Titanic dataset
Figure 2-3. Statistics for numeric columns in Titanic dataset

Try three bin boundaries for age: 23, 28, and 35. These are the quartile boundaries (the 25%, 50%, and 75% values) shown in Figure 2-3, so you’ll bin passenger ages into four quartile-based buckets:

age = feature_column.numeric_column('age')
age_buckets = feature_column.bucketized_column(age, boundaries=[23, 28, 35])

Therefore, in addition to age, you have generated another column: age_buckets.
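
If you want to see what a feature column actually produces, you can pass one batch from train_ds through a tf.keras.layers.DenseFeatures layer. Here is a minimal sketch; the demo helper and example_batch are inspection aids introduced here, not part of the feature_column API:

example_batch = next(iter(train_ds))[0]

def demo(fc):
  # Wrap the feature column in a DenseFeatures layer and apply it to one batch.
  layer = tf.keras.layers.DenseFeatures(fc)
  print(layer(example_batch).numpy())

demo(age_buckets)

Each age in the batch is now represented as a one-hot vector indicating which of the four bins it falls into.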

To understand the nature of each categorical column, it helps to know the distinct values it contains. You’ll encode each categorical column with a vocabulary list of its unique entries, so first you need to determine what those unique entries are:

h = {}
for col in titanic_df:
  if col in ['sex', 'class', 'deck', 'embark_town', 'alone']:
    print(col, ':', titanic_df[col].unique())
    h[col] = titanic_df[col].unique()

The result is shown in Figure 2-4.

Unique values in each categorical column of Titanic dataset
Figure 2-4. Unique values in each categorical column of Titanic dataset

You need to keep track of these unique values in a dictionary format so the model can do the mapping and lookup. For example, here is how to encode the unique categorical values in the sex column:

sex_type = feature_column.categorical_column_with_vocabulary_list(
      'sex', ['male', 'female'])
sex_type_one_hot = feature_column.indicator_column(sex_type)

However, if the list is long, it becomes inconvenient to write it out by hand. That’s why, as you iterated through the categorical columns above, you also saved each column’s unique values in the Python dictionary h for future lookup. Now you can pass those unique values into the vocabulary lists:

sex_type = feature_column.categorical_column_with_vocabulary_list(
      'sex', h.get('sex').tolist())
sex_type_one_hot = feature_column.indicator_column(sex_type)
class_type = feature_column.categorical_column_with_vocabulary_list(
      'class', h.get('class').tolist())
class_type_one_hot = feature_column.indicator_column(class_type)
deck_type = feature_column.categorical_column_with_vocabulary_list(
      'deck', h.get('deck').tolist())
deck_type_one_hot = feature_column.indicator_column(deck_type)
embark_town_type = feature_column.categorical_column_with_vocabulary_list(
      'embark_town', h.get('embark_town').tolist())
embark_town_type_one_hot = feature_column.indicator_column(embark_town_type)
alone_type = feature_column.categorical_column_with_vocabulary_list(
      'alone', h.get('alone').tolist())
alone_one_hot = feature_column.indicator_column(alone_type)

You can also embed the deck column, since there are eight unique values, more than any other categorical column. Reduce its dimension to 3:

deck = feature_column.categorical_column_with_vocabulary_list(
      'deck', titanic_df.deck.unique())
deck_embedding = feature_column.embedding_column(deck, dimension=3)

Another way to reduce the dimensions of a categorical column is by using a hashed feature column. This method calculates a hash value based on the input data and then assigns the data to one of a fixed number of hash buckets. The code below reduces the dimension of the class column to 4:

class_hashed = feature_column.categorical_column_with_hash_bucket(
      'class', hash_bucket_size=4)
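
If you defined the demo helper above, you can inspect what the hashed column produces for a batch. Note that a hashed column must be wrapped in an indicator (or embedding) column before a dense layer can consume it, and that hashing can map different values to the same bucket; that collision risk is the tradeoff for not needing a vocabulary:

# One-hot output over the four hash buckets for each example in the batch.
demo(feature_column.indicator_column(class_hashed))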

Encoding column interactions as possible features

Now comes the most interesting part: you’re going to cross different columns and encode their interaction as a possible feature. This is also where your intuition and domain knowledge can benefit your feature engineering endeavor. For example, a question that comes to mind based on the historical background of the Titanic disaster is: were women in high-class cabins more likely to survive than women in lower-class cabins? To rephrase this as a data science question, you’ll need to consider interactions between the gender and class of the passengers. Then you’ll need to pick a starting dimension size to represent the data variability. Let’s say you arbitrarily decide to bin the variability into five dimensions (hash_bucket_size):

cross_type_feature = feature_column.crossed_column(['sex', 'class'], hash_bucket_size=5)

Now that you have created all the features, you need to put them together, and perhaps experiment to decide which to include in the training process. For that, you’ll create a list to hold all the feature columns you want to use and append each feature of interest to it:

feature_columns = []
# append numeric cols
for header in ['age', 'n_siblings_spouses', 'parch', 'fare']:
  feature_columns.append(feature_column.numeric_column(header))
# append bucketized cols
age = feature_column.numeric_column('age')
age_buckets = feature_column.bucketized_column(age, boundaries=[23, 28, 35])
feature_columns.append(age_buckets)
# append categorical columns
indicator_column_names = ['sex', 'class', 'deck', 'embark_town', 'alone']
for col_name in indicator_column_names:
  categorical_column = feature_column.categorical_column_with_vocabulary_list(
      col_name, titanic_df[col_name].unique())
  indicator_column = feature_column.indicator_column(categorical_column)
  feature_columns.append(indicator_column)
# append embedding columns
deck = feature_column.categorical_column_with_vocabulary_list(
      'deck', titanic_df.deck.unique())
deck_embedding = feature_column.embedding_column(deck, dimension=3)
feature_columns.append(deck_embedding)
# append crossed columns
feature_columns.append(feature_column.indicator_column(cross_type_feature))

Now create a feature layer:

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

This layer will serve as the first (input) layer in the model you are about to build and train. This is how you’ll feed all the feature engineering logic into the model’s training process.
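
As an optional sanity check, and reusing the example_batch pulled from train_ds earlier, you can apply the feature layer to one batch to see the width of the transformed feature vector the model will actually receive:

# Shape is (batch_size, total width of all transformed feature columns).
print(feature_layer(example_batch).shape)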

Creating a cross-validation dataset

Before you start training, you have to create a small dataset for cross-validation purposes. Since there are only two partitions (training and testing) to begin with, one way to generate a cross-validation dataset is to simply subdivide one of them. The split below operates on pandas DataFrames, so first assign the training DataFrame you loaded earlier to train_df and read the test CSV into test_df, then split the test partition:

train_df = titanic_df
test_df = pd.read_csv(test_file_path, header='infer')
val_df, test_df = train_test_split(test_df, test_size=0.4)

Here, you’ll randomly reserve 40% of the original test_df partition as the new test_df, and the remaining 60% becomes val_df. Usually, test datasets are the smallest of the three (training, validation, test), since they don’t contribute to model training; they are used only for final evaluation.

Now that you have taken care of feature engineering and data partitioning, there is one last thing to do: stream the data into the training process with the dataset. You’ll convert all three DataFrames (training, validation and testing) into their respective datasets:

batch_size = 33
labels = train_df.pop('survived')
working_ds = tf.data.Dataset.from_tensor_slices((dict(train_df), labels))
working_ds = working_ds.shuffle(buffer_size=len(train_df))
train_ds = working_ds.batch(batch_size)

As shown in the code above, first you arbitrarily decide the number of samples in a batch (batch_size). Then you need to set aside the label column (‘survived’). tf.data.Dataset.from_tensor_slices takes a tuple as an argument. In this tuple, there are two elements: the feature columns and the labels.

The first element is dict(train_df). This dict operation essentially transforms the DataFrame into key-value pairs, where each key is a column name and the corresponding value is an array of that column’s values. The other element is labels.
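
Here is a toy illustration of that dict operation, using a made-up two-column DataFrame rather than the Titanic data:

toy_df = pd.DataFrame({'age': [22, 38], 'fare': [7.25, 71.28]})
d = dict(toy_df)
print(list(d.keys()))   # ['age', 'fare']
print(d['age'].values)  # [22 38]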

Finally, shuffle and batch the dataset.

Since this conversion will be applied to all three DataFrames, it is convenient to wrap these steps in a helper function to reduce repetition:

def pandas_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('survived')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

Now you can apply this function to both validation and test data:

val_ds = pandas_to_dataset(val_df, shuffle=False, batch_size=batch_size)
test_ds = pandas_to_dataset(test_df, shuffle=False, batch_size=batch_size)

Starting the model training process

Now you’re ready to start the model training process. Technically this isn’t a part of preprocessing, but running through this short section will allow you to see how the work you have done fits into the model training process itself.

You’ll start by building a model architecture:

model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(128, activation='relu'),
  layers.Dense(128, activation='relu'),
  layers.Dropout(.1),
  layers.Dense(1)
])

For demonstration purposes, you’ll build a simple two-layer deep learning perceptron model, which is a basic configuration of a feed-forward neural network.1 Notice that since this is a multilayer perceptron model, you’ll use the Sequential API. Inside this API, the first layer is feature_layer, which represents all the feature engineering logic and derived features, such as the age bins and the crosses that model feature interactions.

Compile it and set up the loss function for binary classification:

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

Then you can start the training. You’ll only train it for 10 epochs:

model.fit(train_ds,
          validation_data=val_ds,
          epochs=10)

You can expect an outcome similar to that pictured in Figure 2-5.

Example training outcome from survival prediction in Titanic dataset
Figure 2-5. Example training outcome from survival prediction in Titanic dataset

Summary

In this section, you have seen how to deal with tabular data that consists of multiple data types. You also saw that TensorFlow provides the feature_column API, which enables proper casting of data types, handling of categorical data, and feature crosses for potential interactions. This is very helpful in simplifying data and feature engineering tasks.

Preparing image data for processing

For images, you need to reshape or resample all the images to the same pixel dimensions; this is known as standardization. You also need to normalize the pixel values so they all fall within the same finite range, typically by scaling each channel’s RGB values (0 to 255) down to a smaller range such as 0 to 1.
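
As a minimal sketch of what this standardization looks like in code, here is one way to do it for a single image with tf.image (the file path is hypothetical; later in this section you’ll accomplish the same thing for whole directories with ImageDataGenerator):

import tensorflow as tf

# Decode a JPEG, resample it to a fixed 224 x 224 size, and scale the
# pixel values from [0, 255] down to [0, 1].
raw = tf.io.read_file('some_flower.jpg')   # hypothetical file path
img = tf.io.decode_jpeg(raw, channels=3)
img = tf.image.resize(img, [224, 224])
img = img / 255.0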

Image data comes with different file extensions, such as jpg, tiff, and bmp. These are not really problematic, as there are APIs in Python and TensorFlow that can read and parse images in all common formats. The tricky part about image data is handling its dimensions: its height and width, as measured in pixels, and its depth (if it is a color image encoded with RGB, the depth appears as three separate channels).

If all of the images in your dataset (including training, validation, and all the images during testing or deployment time) are expected to have the same dimensions and you are going to build your own model, then processing image data is not too much of a problem. However, if you wish to leverage pre-built models such as ResNet or Inception, then you have to observe and conform to their required image dimensions. As an example, ResNet requires each input image to be 224 by 224 pixels with three channels, presented as a numpy multidimensional array. This means that, in the preprocessing routine, you have to resample your images to conform to those dimensions.

Another situation that calls for resampling arises when you cannot reasonably expect all the images, especially during deployment, to have the same size. In this case, you need to decide on a proper image dimension as you build the model, then set up the preprocessing routine to ensure resampling is done properly.

In this section, you are going to use the flower dataset provided by TensorFlow. It consists of images of five types of flowers, with diverse image dimensions. This is a convenient dataset to use, since all of the images are already in jpg format. You are going to process this image data to train a model that parses each image and classifies it as one of the five classes of flowers.

As usual, import all necessary libraries:

import tensorflow as tf
import numpy as np
import matplotlib.pylab as plt
import pathlib

Now download the flower dataset from the URL:

data_dir = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)

This file is a compressed tgz file, so you need to set untar=True.

When using tf.keras.utils.get_file, by default, you will find the downloaded data in the ~/.keras/datasets directory.

In a Mac or Linux system’s Jupyter Notebook cell, execute:

!ls -lrt ~/.keras/datasets/flower_photos

You will find the flower dataset as shown in Figure 2-6:

Flower dataset folders
Figure 2-6. Flower dataset folders

Now let’s take a look at one of the types:

!ls -lrt ~/.keras/datasets/flower_photos/roses | head -10

You should see the first ten images now, as shown in Figure 2-7.

Ten example image files in rose directory
Figure 2-7. Ten example image files in rose directory

These images are all different sizes. You can verify this with a couple of images. Here is a helper function2 you can leverage in order to plot the image in its original size:

def display_image_in_actual_size(im_path):
    dpi = 100
    im_data = plt.imread(im_path)
    height, width, depth = im_data.shape
    # What size does the figure need to be in inches to fit the image?
    figsize = width / float(dpi), height / float(dpi)
    # Create a figure of the right size with one axis that takes up the full figure
    fig = plt.figure(figsize=figsize)
    ax = fig.add_axes([0, 0, 1, 1])
    # Hide spines, ticks, etc.
    ax.axis('off')
    # Display the image.
    ax.imshow(im_data, cmap='gray')
    plt.show()

Let’s use it to display an image (Figure 2-8):

IMAGE_PATH = "/root/.keras/datasets/flower_photos/roses/7409458444_0bfc9a0682_n.jpg"
display_image_in_actual_size(IMAGE_PATH)
Rose image sample 1
Figure 2-8. Rose image sample 1

Now try a different one, shown in Figure 2-9:

IMAGE_PATH = "/root/.keras/datasets/flower_photos/roses/5736328472_8f25e6f6e7.jpg"
display_image_in_actual_size(IMAGE_PATH)
Rose image sample 2
Figure 2-9. Rose image sample 2

Clearly, the dimensions and aspect ratios of these images are all different.
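
You can also confirm this programmatically by printing the shapes of a few files. Here is a quick sketch that assumes the default download location stored in data_dir:

import pathlib
import matplotlib.pylab as plt

# Print height, width, and channel count for the first three rose images.
for im_path in sorted(pathlib.Path(data_dir).glob('roses/*.jpg'))[:3]:
  print(im_path.name, plt.imread(str(im_path)).shape)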

Transforming images to a fixed specification

Now you’re ready to transform these images to a fixed specification. In this example, you will use the ResNet input image specification: 224 by 224 pixels with three color channels (RGB). It is also a best practice to use data streaming whenever possible. Therefore, your goal here is to resize these color images to 224 by 224 pixels and build a dataset from them that can be streamed into the training paradigm.

You will use two important TensorFlow APIs to accomplish this: ImageDataGenerator and flow_from_directory.

ImageDataGenerator is responsible for creating a generator object, which generates streaming data from the directory as specified by flow_from_directory.

In general, the coding pattern is:

my_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    **datagen_kwargs)
my_generator = my_datagen.flow_from_directory(
    data_dir, **dataflow_kwargs)

In both cases, keyword argument options, or kwargs, give these APIs great flexibility. (Keyword arguments are frequently seen in Python; they provide values for a function’s optional parameters.) As it turns out, ImageDataGenerator has two parameters relevant to your needs: rescale and validation_split. rescale normalizes pixel values into a finite range, and validation_split lets you subdivide a partition of the data, such as for cross-validation.

In flow_from_directory, there are three parameters that are useful for this example: target_size, batch_size, and interpolation. target_size specifies the desired dimensions of each image, and batch_size specifies the number of samples in a batch of images.

As for interpolation, remember how you need to interpolate or resample each image to a prescribed dimension, target_size? Supported methods for interpolation are nearest, bilinear, and bicubic. For this example, first try bilinear.

You can define these keyword arguments as follows. Later you’ll pass them into their respective function calls.

pixels = 224
BATCH_SIZE = 32
IMAGE_SIZE = (pixels, pixels)
datagen_kwargs = dict(rescale=1./255, validation_split=.20)
dataflow_kwargs = dict(target_size=IMAGE_SIZE, batch_size=BATCH_SIZE,
                   interpolation="bilinear")

Create a generator object:

valid_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    **datagen_kwargs)

Now you can specify the source directory from which this generator will stream the data. This generator will stream only 20% of the data, which is designated as the validation dataset:

valid_generator = valid_datagen.flow_from_directory(
    data_dir, subset="validation", shuffle=False, **dataflow_kwargs)

You can reuse the generator object for training data:

train_datagen = valid_datagen
train_generator = train_datagen.flow_from_directory(
    data_dir, subset="training", shuffle=True, **dataflow_kwargs)

Inspect the output of the generator:

for image_batch, labels_batch in train_generator:
  print(image_batch.shape)
  print(labels_batch.shape)
  break
(32, 224, 224, 3)
(32, 5)

The outputs are represented as numpy arrays. In the image batch, the sample size is 32, each image is 224 pixels in height and width, and there are three channels representing the RGB color space. The label batch likewise contains 32 samples, and each row is one-hot encoded to indicate which of the five classes the image belongs to.
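
For example, to check which class the first label in the batch encodes, take the argmax of its one-hot vector (the exact values will vary from run to run, since the generator shuffles the training data):

print(labels_batch[0])             # a one-hot row, for example [0. 0. 1. 0. 0.]
print(np.argmax(labels_batch[0]))  # index of the encoded class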

Another important step is to retrieve the dictionary that maps indices to labels. During inference, the model will output a probability for each of the five classes, and the only way to decode which class has the highest probability is with this lookup dictionary:

labels_idx = (train_generator.class_indices)
idx_labels = dict((v,k) for k,v in labels_idx.items())
print(idx_labels)
{0: 'daisy', 1: 'dandelion', 2: 'roses', 3: 'sunflowers', 4: 'tulips'}

A typical output from our classification model would be a numpy array similar to this:

(0.7, 0.1, 0.1, 0.05, 0.05)

The position with the highest probability value is the first element, index 0. Map that index through idx_labels to recover the class name, in this case ‘daisy’. This is how you capture the result of a prediction. Save the idx_labels dictionary for later use:

import pickle
with open('prediction_lookup.pickle', 'wb') as handle:
    pickle.dump(idx_labels, handle, protocol=pickle.HIGHEST_PROTOCOL)

This is how to load it back:

with open('prediction_lookup.pickle', 'rb') as handle:
    lookup = pickle.load(handle)
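
With the lookup dictionary in hand (either idx_labels or the reloaded lookup), decoding a prediction is just an argmax plus a dictionary lookup. Here is a quick sketch using the made-up probabilities from above:

# Hypothetical model output for one image: probabilities over the five classes.
probs = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
print(lookup[int(np.argmax(probs))])   # 'daisy'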

Training the model

Finally, for training, you’ll use a model built from a pre-trained ResNet feature vector. This technique is known as transfer learning. TensorFlow Hub hosts many pre-trained models for free. This is how to access one during your model construction process:

import tensorflow_hub as hub
NUM_CLASSES = 5
mdl = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=IMAGE_SIZE + (3,)),
    hub.KerasLayer("https://tfhub.dev/google/imagenet/resnet_v1_101/feature_vector/4",
                   trainable=False),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax', name='custom_class')
])
mdl.build([None, 224, 224, 3])

Notice the first layer, InputLayer. Remember that the expected input is 224 by 224 pixels with three channels? You’ll use the tuple addition trick to append the channel dimension to IMAGE_SIZE:

IMAGE_SIZE + (3,)

Now you have (224, 224, 3), which is a tuple that represents the dimension of an image as a numpy array.

The next layer is the pre-trained ResNet feature vector referenced by the URL to TensorFlow Hub. Let’s use it as-is so we don’t have to retrain it.

Next is the dense layer with five output nodes, one per class; each output is the probability of the image belonging to that class. Then you build the model skeleton with None as the first dimension. This means the first dimension, which represents the sample size of a batch, is not determined until runtime. This is how batch input is handled.

Inspect the model summary to make sure it’s what you expected:

mdl.summary()

The output is shown in Figure 2-10:

Image classification model summary
Figure 2-10. Image classification model summary

Compile the model with optimizer and the corresponding loss function:

mdl.compile(
  optimizer=tf.keras.optimizers.SGD(lr=0.005, momentum=0.9),
  loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False, label_smoothing=0.1),
  metrics=['accuracy'])

Train it:

steps_per_epoch = train_generator.samples // train_generator.batch_size
validation_steps = valid_generator.samples // valid_generator.batch_size
mdl.fit(
    train_generator,
    epochs=5, steps_per_epoch=steps_per_epoch,
    validation_data=valid_generator,
    validation_steps=validation_steps)

You may see output similar to that in Figure 2-11.

Output from training the image classification model
Figure 2-11. Output from training the image classification model

Summary

In this section, you have learned how to process image files. Specifically, you need to settle on a predetermined image size before you design the model. Once that standard is set, the next step is to resample the images to that size and normalize the pixel values into a smaller dynamic range. These routines are nearly universal. Also, streaming images into the training workflow is the most efficient method, and even a best practice, in cases where your working sample size approaches the limit of your Python runtime’s memory. Therefore, it is important to have a clear understanding of the image input workflow.

Preparing text data for processing

For text data, each word or character needs to be represented as an integer. This process is known as tokenization. Further, if the goal is classification, then the target needs to be encoded as classes. If the goal is something more complicated, such as translation, then the target language in the training data (such as the French in an English-to-French translation) also requires its own tokenization process, because the target is essentially a long string of text, just like the input. Likewise, you need to decide whether to tokenize the target at the word level or the character level.

Text data can be presented in many different formats. From a content organization perspective, it may be stored and organized as a table, with one column containing the body or string of text and another column containing labels, such as a binary sentiment indicator. It may be a free-form file, with lines of different lengths and a carriage return at the end of each line. It may be a manuscript in which the units of text are defined by paragraphs or sections.

There are many ways to determine the processing techniques and logic to use as you set up a natural language processing (NLP) machine learning problem; this section will cover some of the most frequently used techniques.

This example will use text from William Shakespeare’s tragedy Coriolanus, which is a simple public-domain example hosted on Google. You will build a model that will learn how to write in Shakespeare’s style. This is a text generation model.

Tokenizing text

Text is represented by strings of characters, and these characters need to be converted to integers for modeling tasks. Start by downloading the raw text of Coriolanus:

import tensorflow as tf
import numpy as np
import os
import time
FILE_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt'
FILE_NAME = 'shakespeare.txt'
path_to_file = tf.keras.utils.get_file(FILE_NAME, FILE_URL)

Open the file, decode it, and check its length in characters:

text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
print ('Length of text: {} characters'.format(len(text)))

Inspect this text by printing the first 400 characters:

print(text[:400])

The output is shown in Figure 2-12:

Sample of William Shakespeare’s Coriolanus
Figure 2-12. Sample of William Shakespeare’s Coriolanus

To tokenize each character in this file, a simple set operation will suffice. This operation will create a unique set of characters found in the text string:

vocabulary = sorted(set(text))
print ('There are {} unique characters'.format(len(vocabulary)))
There are 65 unique characters

A glimpse of the vocabulary is shown in Figure 2-13.

Part of the vocabulary list from Coriolanus
Figure 2-13. Part of the vocabulary list from Coriolanus

These tokens include punctuation as well as both uppercase and lowercase characters. It is not always necessary to keep both cases; if you don’t want to, you can convert every character to lowercase before performing the set operation. Since you sorted the token list, you can see that special characters are also being tokenized. In some cases these are not needed, and they can be removed manually. The code below converts all characters to lowercase and then performs the set operation:

vocabulary = sorted(set(text.lower()))
print ('There are {} unique characters'.format(len(vocabulary)))
There are 39 unique characters

You might be wondering whether it is also reasonable to tokenize the text at the word level instead of the character level. After all, the word is the fundamental unit of semantic meaning in a text string. While this reasoning is sound and logical, in practice it creates more work and more problems here, without really adding value to the training process or accuracy to the model. To illustrate why, let’s try to tokenize the text string by words. The first thing to recognize is that words are, of course, separated by spaces, so you need to split the text string on spaces:

vocabulary_word = sorted(set(text.lower().split(' ')))
print ('There are {} unique words'.format(len(vocabulary_word)))
There are 41623 unique words

Inspect the list vocabulary_word, shown in Figure 2-14:

Sample of tokenized words
Figure 2-14. Sample of tokenized words

With special characters and carriage returns embedded in many of the word tokens, this list is nearly unusable; it would require even more work to clean it up with regular expressions or more sophisticated logic. As you can see, in some cases punctuation marks (such as commas and apostrophes) are attached to words. Further, the list of word tokens is much larger than the character-level token list, which makes it much more complicated for the model to learn the patterns in the text. For these reasons, and given the lack of clear benefit, you won’t tokenize this text at the word level.
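
Just to illustrate how much extra work that cleanup would involve, here is a rough sketch (not something you’ll use in the rest of this chapter) that strips punctuation with a regular expression before splitting:

import re

# Lowercase the text, replace anything that is not a letter or whitespace
# with a space, then split on whitespace and count the unique words.
words = sorted(set(re.sub(r"[^a-z\s]", " ", text.lower()).split()))
print('There are {} unique words after cleanup'.format(len(words)))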

Creating a dictionary and reverse dictionary

Once you have a list of tokens that contains reasonably chosen characters, you’ll need to map each token to an integer. This mapping is known as a dictionary. Likewise, you need to create a reverse dictionary that maps each integer back to its token.

Generating the integers is easy with Python’s enumerate function. Its input is a list, and it returns an index integer paired with each element of the list. In this case, the list contains tokens:

for i, u in enumerate(vocabulary):
  print(i, u)

You can see a sample of this result in Figure 2-15:

Sample enumerated output of a token list
Figure 2-15. Sample enumerated output of a token list

Next, you need to make this into a dictionary. A dictionary is really a collection of key-value pairs used as a lookup table: when you give it a key, it returns the value corresponding to that key.

The notation to build a dictionary, with the key being the token and the value being the integer, is:

char_to_index = {u:i for i, u in enumerate(vocabulary)}

The output will look like Figure 2-16:

Sample of character to index dictionary
Figure 2-16. Sample of character-to-index dictionary

This dictionary is used to convert text into integers. At inference time, the model output is also in the format of integers. Therefore, if you want the output as text, then you’ll also need a reverse dictionary to map the output back to characters. Simply reverse the order of i and u:

index_to_char = {i:u for i, u in enumerate(vocabulary)}

Tokenization is the most basic and necessary step in most NLP problems. A text generation model does not generate plain text as its output; it generates a series of integers. For this series of indices to map back to letters (tokens), you need a lookup table, and index_to_char is built specifically for this purpose. Using index_to_char, you can look up each character (token) by key, where the key is an index from the model’s output. Without index_to_char, you would not be able to map model outputs back to a readable, plain-text format.
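
Here is a quick sketch of the round trip: encode a snippet of the lowercased text into integers with char_to_index, then map those integers back into characters with index_to_char:

# Encode the first 20 characters of the lowercased text as integers...
snippet = text.lower()[:20]
encoded = [char_to_index[c] for c in snippet]
print(encoded)

# ...and decode them back into characters with the reverse dictionary.
print(''.join(index_to_char[i] for i in encoded))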

Summary

In this chapter, you learned how to handle some of the most common data structures: tables, images, and text. Tabular datasets (the structured, CSV style of data) are very common; they are returned by typical database queries and are frequently used as training data. You learned how to deal with columns of different data types in such a structure, as well as how to model feature interactions by crossing columns of interest.

For image data, you learned that you need to standardize image size and pixel values before using the image collection as a whole to train a model. Further, you also need to keep track of image labels.

Text data is by far the most diverse data type, in terms of both format and use. Nevertheless, whether the data is for text classification, translation, or question-and-answer models, tokenization and dictionary construction processes are very common. The methods and approaches described in this chapter are by no means exhaustive or comprehensive; rather, they represent ‘table stakes’ when dealing with these data types.

1 For more on this, see Aurélien Géron, Neural Networks and Deep Learning (O’Reilly).

2 Answered on StackOverflow by user Joe Kington, January 13, 2016, https://stackoverflow.com/questions/34768717/matplotlib-unable-to-save-image-in-same-resolution-as-original-image/34769840, accessed October 23, 2020.
