Chapter 2. Data Storage and Ingestion

To envision how to set up a machine learning (ML) model to solve a problem, you have to start thinking about data structure patterns. In this chapter, we’ll look at some general patterns in storage, data formats, and data ingestion. Typically, once you understand your business problem and set it up as a data science problem, you have to think about how to get the data into a format or structure that your model training process can use. Data ingestion during the training process is fundamentally a data transformation pipeline. Without this transformation, you won’t be able to deliver and serve the model in an enterprise-driven or use-case-driven setting; it would remain nothing more than an exploration tool and could not scale to handle large amounts of data.

This chapter will show you how to design a data ingestion pipeline for two common data structures: tables and images. You will learn how to make the pipeline scalable by using TensorFlow’s APIs.

Data streaming is the means by which data is ingested in small batches into the model for training. Streaming in Python is not a new concept, but grasping it is fundamental to understanding how the more advanced data ingestion APIs in TensorFlow work. Thus, this chapter will start with Python generators.

Then we’ll look at how tabular data is stored, including how to indicate and track features and labels. We’ll then move to designing your data structure, and finish by discussing how to ingest data to your model for training and how to stream tabular data. The rest of the chapter covers how to organize image data for image classification and stream image data.

Streaming data with Python generators

There are times when the Python runtime’s memory is not big enough to load the entire dataset. In such cases, the recommended practice is to load the data in small batches, so that the data is streamed into the model during the training process.

As it turns out, sending data in small batches has other advantages as well. One is that the gradient descent algorithm is applied batch by batch: for each batch, the error (the difference between the model output and the ground truth) is calculated and the model’s weights and biases are gradually updated to make this error as small as possible. This is known as mini-batch gradient descent. Because the error calculation for one batch does not depend on another, the gradient calculation can also be parallelized. One pass over the entire training data is called an epoch; once an epoch is complete, training starts again for the next epoch with the newly updated weights and biases. This process repeats for a user-defined number of passes, known as the “number of epochs for training.”
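To make the mechanics concrete, here is a minimal NumPy sketch of mini-batch gradient descent for a simple linear model with a squared-error loss. The synthetic data, learning rate, and batch size are all illustrative choices of mine; TensorFlow performs the equivalent work for you during training.

import numpy as np

# Synthetic regression data: 1,000 samples, 3 features, known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)                      # model weights, initialized to zero
batch_size, learning_rate, num_epochs = 32, 0.1, 5

for epoch in range(num_epochs):      # one epoch = one pass over the training data
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        error = xb @ w - yb                   # model output minus ground truth
        grad = 2 * xb.T @ error / len(xb)     # gradient of the squared error
        w -= learning_rate * grad             # update the weights after each batch

print(w)   # should be close to [2.0, -1.0, 0.5]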

A Python generator is a function that returns an object that can be iterated over, producing its values one at a time. A trivial example of how it works is demonstrated in the code below:

Let’s start with the NumPy library for this simple demonstration of Python generators. I’ve created a simple function, my_generator, that accepts a NumPy array and yields two records at a time, advancing through the array one record per step.

import numpy as np
def my_generator(my_array):
    i = 0
    while True:
        yield my_array[i:i+2, :]   # yield a slice of two consecutive records
        i += 1                     # advance the starting index by one record

This is the test array I created, which will be passed into my_generator:

test_array = np.array([[10.0, 2.0],
                       [15, 6.0],
                       [3.2, -1.5],
                       [-3, -2]], np.float32)

This numpy array has four records, each consisting of two floating point values. Then I pass this array to my_generator:

output = my_generator(test_array)

To get output:

next(output)

The output should be:

array([[10.,  2.],
       [15.,  6.]], dtype=float32)

If you run next(output) again, the output will be different:

array([[15. ,  6. ],
       [ 3.2, -1.5]], dtype=float32)

Run it again:

next(output) 

The output is once again different:

array([[ 3.2, -1.5],
       [-3. , -2. ]], dtype=float32)

And again:

next(output)

Now the output is:

array([[-3., -2.]], dtype=float32)

Now that the last record has been shown, you have finished streaming this data. If you run it again, it will return an empty array:

array([], shape=(0, 2), dtype=float32)

As you can see, the my_generator function streams two records from the numpy array each time it is invoked. The unique aspect of a generator function is the use of yield instead of return. Unlike return, yield produces a sequence of values without storing the entire sequence in the Python runtime memory. The generator continues to produce values each time we invoke the next function, until the end of the array is reached.
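For contrast, here is a minimal, self-contained example (separate from the dataset above, with function names of my own choosing) showing the difference between return and yield:

# A function with return builds the whole result in memory at once.
def numbers_with_return(n):
    return list(range(n))

# A generator with yield produces one value at a time, on demand.
def numbers_with_yield(n):
    i = 0
    while i < n:
        yield i
        i += 1

print(numbers_with_return(3))              # [0, 1, 2]
gen = numbers_with_yield(3)
print(next(gen), next(gen), next(gen))     # 0 1 2
# One more next(gen) would raise StopIteration: the generator is exhausted.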

The my_generator example demonstrates how a subset of data can be produced by a generator function. In that example, however, the numpy array is created on the fly and therefore is held in the Python runtime memory. Let’s take a look at how to iterate over a dataset that is stored as a file.

Streaming file content with a generator

To understand how a file in storage can be streamed, you may find it easier to use a CSV file as an example. The file used here, the Pima Indians Diabetes open source dataset, is available for download. Download it and store it on your local machine.

This file does not contain a header. You will also need to download the column names and respective descriptions for this dataset.

Briefly, the columns in this file are:

['Pregnancies', 'Glucose', 'BloodPressure',
 'SkinThickness', 'Insulin', 'BMI',
 'DiabetesPedigree', 'Age', 'Outcome']

Let’s look at this file with the following lines of code:

import pandas as pd
file_path = 'working_data/'
file_name = 'pima-indians-diabetes.data.csv'
col_name = ['Pregnancies', 'Glucose', 'BloodPressure',
            'SkinThickness', 'Insulin', 'BMI',
            'DiabetesPedigree', 'Age', 'Outcome']
pd.read_csv(file_path + file_name, names=col_name)

The first few rows of the file are shown in Figure 1-1.

Figure 1-1. Pima Indians Diabetes dataset

Since we want to stream this dataset, it is more convenient to read it as a CSV and use the generator to output the rows, just like we did with the numpy array in the previous section. The way to do this is through the following code:

import csv
file_path = 'working_data/'
file_name = 'pima-indians-diabetes.data.csv'
with open(file_path + file_name, newline='\n') as csvfile:
    f = csv.reader(csvfile, delimiter=',')
    for row in f:
        print(','.join(row))

Let’s take a closer look at this code.

This time, you’ll need to use with open to create a file handle. csvfile is a file handle object that knows where the file is stored. Therefore, the next step is to pass it to the reader function in the csv library:

f = csv.reader(csvfile, delimiter=',')

f is a CSV reader object that iterates over the rows of the file; it does not load the entire file into the Python runtime memory at once. To inspect the contents, execute this short for loop:

for row in f:
    print(','.join(row))

The output of the first few rows looks like Figure 1-2:

Figure 1-2. Pima Indians Diabetes dataset CSV output.

Now that you understand how to use a file handle, let’s refactor the code above a bit so that we can use yield in a function, effectively making a generator to stream the content of the CSV:

def stream_file(file_handle):
    holder = []
    for row in file_handle:
        holder.append(row.rstrip("\n"))
        yield holder
        holder = []

with open(file_path + file_name, newline='\n') as handle:
    for part in stream_file(handle):
        print(part)

Recall that a Python generator is a function that uses yield to iterate through an iterable object. You use with open to acquire a file handle as usual, then pass handle to the generator function stream_file. stream_file contains a for loop that iterates through the file in handle row by row, strips the trailing newline character, and fills up a holder list. Each row is passed back to the calling loop’s print by yield. The output is shown in Figure 1-3:

Figure 1-3. Pima Indians Diabetes dataset output by Python generator.
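If you want to stream more than one row at a time, a small variant of stream_file can collect rows into batches before yielding them. This is a sketch of my own (the function name and the batch size n are arbitrary choices), reusing file_path and file_name from above:

def stream_file_in_batches(file_handle, n=2):
    holder = []
    for row in file_handle:
        holder.append(row.rstrip("\n"))
        if len(holder) == n:
            yield holder      # yield a full batch of n rows
            holder = []
    if holder:
        yield holder          # yield any leftover rows at the end of the file

with open(file_path + file_name, newline='\n') as handle:
    for part in stream_file_in_batches(handle, n=2):
        print(part)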

Now that you have a clear idea of how a dataset can be streamed, let’s look at how to apply this in TensorFlow. As it turns out, TensorFlow leverages this approach to build a framework for data ingestion. Streaming is usually the best way to ingest large amounts of data (such as hundreds of thousands of rows in one table, or data distributed across multiple tables).

Multiple CSV files as training data

Tabular data is a very common and convenient format for encoding features and labels for ML model training. CSV is probably the most common tabular data format. You can think of each field separated by the comma delimiter as a column. Each column is defined with a data type, such as numeric (integer or floating point) or string.

Tabular data is not the only data format that is well structured, by which I mean that every record follows the same convention and the order of fields in every record is the same. Another common data structure is JSON format. JSON is a structure built with nested, hierarchical key-value pairs. You can think of keys as column names, and values as the actual value of the data in that sample. JSON and CSV formats can be converted to one another. Sometimes the original data comes as a JSON and it is necessary to convert it to CSV, which is easier to display and inspect.

An example JSON record, with its key-value pairs, might look like this:

{
   "id": 1,
   "name": {
      "first": "Dan",
      "last": "Jones"
   },
   "rating": [
      8,
      7,
      9
   ]
},

Notice that the key “rating” is associated with an array value, [8, 7, 9].
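As a quick illustration of the JSON-to-CSV conversion mentioned above, pandas can flatten records like this into a table. The second record and the output file name below are invented for the example:

import pandas as pd

records = [
    {"id": 1, "name": {"first": "Dan", "last": "Jones"}, "rating": [8, 7, 9]},
    {"id": 2, "name": {"first": "Amy", "last": "Smith"}, "rating": [6, 9, 8]},
]

# Nested keys become columns such as name.first and name.last.
df = pd.json_normalize(records)
print(df)
df.to_csv('records.csv', index=False)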

There are plenty of examples of using a CSV file or a table as training data and ingesting it into the TensorFlow model-training process. Typically, the data is read into a pandas DataFrame. However, this strategy works only if all the data fits into the Python runtime memory. You can use streaming to handle larger datasets without being limited by the memory available to the Python runtime. Since you learned how a Python generator works in the previous section, you’re now ready to look at TensorFlow’s APIs, which operate on the same principle as a Python generator, and learn how to use TensorFlow’s adoption of the generator pattern.

Setting up a pattern for file names

When working with a set of files, you will encounter patterns in file-naming conventions. To simulate an enterprise environment where new data is continuously being generated and stored, we will take an open source CSV file, split it into multiple parts by row count, and rename each part with a fixed prefix. This approach is similar to how the Hadoop Distributed File System (HDFS) names the parts of a file.

Feel free to use your own CSV if you have one ready. If not, you can download the suggested CSV file for this example, as seen in Figure 1-4. (You may clone this repository if you wish.)

For now, all you need is owid-covid-data.csv.

Once it is downloaded, inspect the file and determine the number of rows:

wc -l owid-covid-data.csv

There are over 32,000 rows.

32788 owid-covid-data.csv

Next, inspect the first three lines of the CSV. This lets you see if there is a header. You can look at a few rows of data to see what they actually look like, and inspect what type of delimiter is used to separate the fields.

head -3 owid-covid-data.csv
iso_code,continent,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,new_deaths_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,tests_units,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy
AFG,Asia,Afghanistan,2019-12-31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,38928341.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83
AFG,Asia,Afghanistan,2020-01-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,0.0,38928341.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83

This file contains a header. You’ll see the header in each of the part files.

Splitting a single CSV into multiple CSV files

Now let’s split this file into multiple CSV files, each with 330 rows. You should end up with 100 CSV files, each including the header. If you use Linux or macOS, use the following command:

cat owid-covid-data.csv| parallel --header : --pipe -N330 'cat >owid-covid-data-part00{#}.csv'

You may need to install the parallel command on macOS:

brew install parallel
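If you prefer to stay in Python, or parallel is not available, a rough equivalent using pandas chunking might look like the sketch below. The zero-padding of the part numbers here is my own choice and differs slightly from the numbering produced by the parallel command:

import pandas as pd

# Read the source CSV in 330-row chunks and write each chunk out as its own
# part file; to_csv writes the header into every part by default.
reader = pd.read_csv('owid-covid-data.csv', chunksize=330)
for i, chunk in enumerate(reader, start=1):
    chunk.to_csv(f'owid-covid-data-part{i:04d}.csv', index=False)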

Here are some of the files created:

-rw-r--r--  1 mbp16  staff  54026 Jul 26 16:45 owid-covid-data-part0096.csv
-rw-r--r--  1 mbp16  staff  54246 Jul 26 16:45 owid-covid-data-part0097.csv
-rw-r--r--  1 mbp16  staff  51278 Jul 26 16:45 owid-covid-data-part0098.csv
-rw-r--r--  1 mbp16  staff  62622 Jul 26 16:45 owid-covid-data-part0099.csv
-rw-r--r--  1 mbp16  staff  15320 Jul 26 16:45 owid-covid-data-part00100.csv

This pattern represents a standard storage arrangement for data split across multiple CSV files. The naming convention is consistent, and either every file has the same header or none of them has a header at all.

It is a good idea to maintain a file-naming pattern; it will come in handy when you have tens or hundreds of files. When your naming pattern can be easily represented with wildcard notation, it is easier to create a reference, or file pattern object, that points to all the data in storage.

In the next section, we look at how to use the TensorFlow API to create a file pattern object, which we’ll use to create a streaming object for this dataset.

Creating a file pattern object using tf.io

TensorFlow provides the tf.io API for referencing a distributed dataset that contains files with a common naming pattern. This does not mean you read the distributed dataset right away: what you want first is a list of file paths for all the dataset files you wish to read. This is not a new idea; in Python, the glob library is a popular choice for retrieving a list of file names that match a naming convention. tf.io leverages the glob library to generate a list of file names that fit the pattern object:

import tensorflow as tf
base_pattern = 'dataset'
file_pattern = 'owid-covid-data-part*'
files = tf.io.gfile.glob(base_pattern + '/' + file_pattern)

files is a list that contains the paths of all of the CSV files that are parts of the original CSV, in no particular order:

['dataset/owid-covid-data-part0091.csv',
 'dataset/owid-covid-data-part0085.csv',
 'dataset/owid-covid-data-part0052.csv',
 'dataset/owid-covid-data-part0046.csv',
 'dataset/owid-covid-data-part0047.csv',
……]
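The order is not guaranteed, so if you prefer a deterministic ordering you can sort the list (optional; the next step does not require it):

files = sorted(files)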

This list will be the input to the next step, which is to create a streaming dataset object based on Python generators.

Creating a streaming dataset object

Now that you have your files list ready, you can use it as the input to create a streaming dataset object. Note that this code is only meant to demonstrate how to convert a list of CSV files into a TensorFlow dataset object. If you were really going to use this data to train a supervised ML model, you would perform data cleansing, normalization, and aggregation, all of which we’ll cover in chapter 8. For the purposes of this example, new_deaths is selected as the target column.

csv_dataset = tf.data.experimental.make_csv_dataset(files,
              header = True,
              batch_size = 5,
              label_name = 'new_deaths',
              num_epochs = 1,
              ignore_errors = True)

The code above specifies that each file in files contains a header. For convenience, as we inspect it, let’s set a small batch size of 5. We also designate a target column, label_name, as if we are going to use this data for training a supervised ML model. num_epochs is used to specify how many times you want to stream over the entire dataset.

In order to look at actual data, you’ll need to use the csv_dataset object to iterate through the data:

for features, target in csv_dataset.take(1):
    print("'Target': {}".format(target))
    print("'Features:'")
    for k, v in features.items():
        print("  {!r:20s}: {}".format(k, v))

This code uses the first batch of the dataset (take(1)), which contains five samples.

Since you specified a label_name, the other columns are all considered to be features. In the dataset, the contents are formatted as key-value pairs. Therefore, the output from the code above will be similar to this:

'Target': [ 0.  0. 16.  0.  0.]
'Features:'
  'iso_code'          : [b'SWZ' b'ESP' b'ECU' b'ISL' b'FRO']
  'continent'         : [b'Africa' b'Europe' b'South America' b'Europe' b'Europe']
  'location'          : [b'Swaziland' b'Spain' b'Ecuador' b'Iceland' b'Faeroe Islands']
  'date'              : [b'2020-04-04' b'2020-02-07' b'2020-07-13' b'2020-04-01' b'2020-06-11']
  'total_cases'       : [9.000e+00 1.000e+00 6.787e+04 1.135e+03 1.870e+02]
  'new_cases'         : [  0.   0. 661.  49.   0.]
  'total_deaths'      : [0.000e+00 0.000e+00 5.047e+03 2.000e+00 0.000e+00]
  'total_cases_per_million': [7.758000e+00 2.100000e-02 3.846838e+03 3.326007e+03 3.826870e+03]
  'new_cases_per_million': [  0.      0.     37.465 143.59    0.   ]
  'total_deaths_per_million': [  0.      0.    286.061   5.861   0.   ]
  'new_deaths_per_million': [0.    0.    0.907 0.    0.   ]
  'new_tests'         : [b'' b'' b'1331.0' b'1414.0' b'']
  'total_tests'       : [b'' b'' b'140602.0' b'20889.0' b'']
  'total_tests_per_thousand': [b'' b'' b'7.969' b'61.213' b'']
  'new_tests_per_thousand': [b'' b'' b'0.075' b'4.144' b'']
  'new_tests_smoothed': [b'' b'' b'1986.0' b'1188.0' b'']
  'new_tests_smoothed_per_thousand': [b'' b'' b'0.113' b'3.481' b'']
  'tests_units'       : [b'' b'' b'units unclear' b'tests performed' b'']
  'stringency_index'  : [89.81 11.11 82.41 53.7   0.  ]
  'population'        : [ 1160164. 46754784. 17643060.   341250.    48865.]
  'population_density': [79.492 93.105 66.939  3.404 35.308]
  'median_age'        : [21.5 45.5 28.1 37.3  0. ]
  'aged_65_older'     : [ 3.163 19.436  7.104 14.431  0.   ]
  'aged_70_older'     : [ 1.845 13.799  4.458  9.207  0.   ]
  'gdp_per_capita'    : [ 7738.975 34272.36  10581.936 46482.957     0.   ]
  'extreme_poverty'   : [b'' b'1.0' b'3.6' b'0.2' b'']
  'cardiovasc_death_rate': [333.436  99.403 140.448 117.992   0.   ]
  'diabetes_prevalence': [3.94 7.17 5.55 5.31 0.  ]
  'female_smokers'    : [b'1.7' b'27.4' b'2.0' b'14.3' b'']
  'male_smokers'      : [b'16.5' b'31.4' b'12.3' b'15.2' b'']
  'handwashing_facilities': [24.097  0.    80.635  0.     0.   ]
  'hospital_beds_per_thousand': [2.1  2.97 1.5  2.91 0.  ]
  'life_expectancy'   : [60.19 83.56 77.01 82.99 80.67]

This data is retrieved during runtime (lazy execution). As indicated by the batch size, each column contains five records. Next, let’s discuss how to stream this dataset.

Streaming CSV dataset

Now that a CSV dataset object has been created, you can easily iterate over it in batches with this line of code, which uses the iter function to make an iterator from the CSV dataset.

features, label = next(iter(csv_dataset))

The iter function makes an iterator from the dataset, and the next function returns the next item from that iterator. Remember that in this dataset there are two types of elements: features and targets. These elements are returned as a tuple (similar to a list of objects, except that the order and values of its elements cannot be changed or reassigned). You can unpack a tuple by assigning its elements to respective variables.
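Note that each call to iter(csv_dataset) starts a fresh iterator. If you want to pull several consecutive batches from the same pass over the data, keep a single iterator around, as in this small sketch of my own that reuses csv_dataset from above:

it = iter(csv_dataset)
for _ in range(3):
    features, label = next(it)
    print(label.numpy())    # the target values of each successive batch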

If you examine the features and label returned by the single next call above, you will see the content of the first batch.

label
<tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 0.,  0.,  1., 33., 29.], dtype=float32)>

If you execute the same command again, you will see a different batch. (Each call to iter creates a fresh iterator, and make_csv_dataset shuffles the data by default.)

features, label = next(iter(csv_dataset))

Let’s just take a look at label:

<tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 7., 15.,  1.,  0.,  6.], dtype=float32)>

Indeed, this batch contains different values from the first one. This is how a streaming CSV dataset is produced in a data ingestion pipeline. As each batch is sent to the model for training, the model computes a prediction in the forward pass by multiplying the input values by the current weights and adding the biases in each node of the neural network. It then compares the prediction with the label and calculates the loss function. In the backward pass, the model computes the gradients of this error with respect to the weights and biases in each node of the network and updates them accordingly. Then a new batch of data is sent to the model for training, and the process repeats. Next, we will look at how to organize image data for storage and stream it the way we streamed the structured data.

Organizing image data

Image classification tasks require organizing images in certain ways, because, unlike with CSV or tabular data, attaching a label to an image requires special techniques. A very straightforward and common pattern for organizing image files is the following directory structure:

<PROJECT_NAME>
	-train
		-class_1
			<FILENAME>.jpg
			<FILENAME>.jpg
			…
		-class_n
			<FILENAME>.jpg
			<FILENAME>.jpg
			…
	-validation
		-class_1
			<FILENAME>.jpg
			<FILENAME>.jpg
			…
		-class_n
			<FILENAME>.jpg
			<FILENAME>.jpg
	-test
		-class_1
			<FILENAME>.jpg
			<FILENAME>.jpg
			…
		-class_n
			<FILENAME>.jpg
			<FILENAME>.jpg
			…

The directory <PROJECT_NAME> is the base directory. The first level below contains training, validation, and test directories. Within each of these directories, there are directories named by the image labels, each of which contains the raw image files. This is shown in Figure 1-5.

Figure 1-5. File organization for image classification and partitioning for training work.

This structure is common because it makes it easy to keep track of labels and their respective images. By no means is this the only way to organize image data, though.

Let’s look at another structure for organizing images. It is very similar to the previous one, except that the images are not partitioned into training, validation, and test directories: immediately below the <PROJECT_NAME> base directory are the directories for the different image classes, as shown in Figure 1-6.

Figure 1-6. File organization for images based on labels.

Using TensorFlow image generator

Now let’s take a look at how to deal with images. Besides the nuances of file organization, working with images requires certain steps for standardization and normalization: the model architecture requires a fixed shape (fixed dimensions) for all images, and at the pixel level the values are normalized, typically to a range of [0, 1], by dividing each pixel value by 255.

For this example, you’ll use an open source image set of five different types of flowers (or feel free to use your own image set). Let’s assume you decide that images should be [224, 224] pixels, where the dimensions correspond to [height, width]; these are the expected input dimensions if you want to use a pre-trained ResNet as the image classifier.

Let’s look at the code:

import tensorflow as tf
data_dir = tf.keras.utils.get_file(
    'flower_photos',
'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz', untar=True)

The code above will download the images of five different types of flowers, all in different dimensions, and put these images in the aforementioned file structure as in Figure 1-6. We will refer to data_dir as the base directory. It should be similar to:

'/Users/XXXXX/.keras/datasets/flower_photos'

If you list the content from the base directory, you will see:

-rw-r-----    1 mbp16  staff  418049 Feb  8  2016 LICENSE.txt
drwx------  801 mbp16  staff   25632 Feb 10  2016 tulips
drwx------  701 mbp16  staff   22432 Feb 10  2016 sunflowers
drwx------  643 mbp16  staff   20576 Feb 10  2016 roses
drwx------  900 mbp16  staff   28800 Feb 10  2016 dandelion
drwx------  635 mbp16  staff   20320 Feb 10  2016 daisy

There are a few steps to streaming the images. Let’s look at each more closely.

  1. Create an ImageDataGenerator object. In this step, you specify the normalization parameters: use the rescale parameter to indicate the normalization scale, and specify that 20% of the data will be set aside for cross-validation.

    train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
        rescale = 1./255, validation_split = 0.20)
  2. A minor modification to the call above is to wrap rescale and validation_split in a dictionary of key-value pairs. This is a convenient way of reusing the same parameters and keeping multiple input arguments together (passing them to a function with ** is a Pythonic technique known as dictionary unpacking).

    datagen_kwargs = dict(rescale=1./255, validation_split=0.20)
    train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
        **datagen_kwargs)
  3. Connect the ImageDataGenerator to the data source. In this step, you specify the parameters for resizing every image to the fixed dimensions.

    IMAGE_SIZE = (224, 224)   # [height, width] expected by the model
    BATCH_SIZE = 32
    dataflow_kwargs = dict(target_size=IMAGE_SIZE, batch_size=BATCH_SIZE,
                           interpolation="bilinear")
    train_generator = train_datagen.flow_from_directory(
        data_dir, subset="training", shuffle=True, **dataflow_kwargs)
  4. Prepare a map for indexing the labels. In this step, you retrieve the index the generator has assigned to each label and create a dictionary that maps it back to the actual label name. The TensorFlow generator keeps track of labels internally, using the directory names below data_dir; they can be retrieved through train_generator.class_indices, which returns a dictionary of labels and their indices. You can take advantage of this when you deploy the model for scoring: the model will output an index, and to implement the reverse lookup to the label, you simply reverse the dictionary returned by train_generator.class_indices:

labels_idx = train_generator.class_indices
idx_labels = {v: k for k, v in labels_idx.items()}

These are the idx_labels:

{0: 'daisy', 1: 'dandelion', 2: 'roses', 3: 'sunflowers', 4: 'tulips'}

Now you can inspect the shape of the items generated by train_generator:

for image_batch, labels_batch in train_generator:
  print(image_batch.shape)
  print(labels_batch.shape)
  break

You should expect to see the following shapes for images and labels for the first batch yielded by the generator iterating through the base directory:

(32, 224, 224, 3)
(32, 5)

The first tuple indicates a batch of 32 images, each with dimensions [224, 224, 3] corresponding to [height, width, depth], where depth represents the three color channels (RGB). The second tuple indicates 32 labels, each corresponding to one of the five flower types and one-hot encoded according to idx_labels.
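For example, to recover the flower name for the first image in the batch, you can reverse the one-hot encoding with np.argmax and look it up in idx_labels (a small illustration of my own):

import numpy as np

first_label_index = int(np.argmax(labels_batch[0]))   # position of the 1 in the one-hot vector
print(idx_labels[first_label_index])                  # for example, 'sunflowers'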

Streaming cross-validation images

Recall that in creating the generator for streaming training data, you specified a validation_split parameter with a value of 0.2. If you don’t do this, validation_split defaults to 0. When validation_split is set to a nonzero fraction, you also have to specify subset as either training or validation when you invoke the flow_from_directory function. In the example above, it is subset="training".

You may be wondering how you would know which images belong to the training subset after creating the training generator. Well, you don’t have to know, as long as you reassign and reuse the training generator:

valid_datagen = train_datagen

valid_generator = valid_datagen.flow_from_directory(
    data_dir, subset="validation", shuffle=False, **dataflow_kwargs)

As you can see, a TensorFlow generator knows and keeps track of the training and validation subsets, so you can reuse the same generator to stream over different subsets. The dataflow_kwargs dictionary is also reused. This is a convenience feature provided by TensorFlow generators.

Because you reuse train_datagen, you can be sure that image rescaling is done the same way as for the training images. And in the valid_datagen.flow_from_directory function, you pass in the same dataflow_kwargs so that the image size used for cross-validation is the same as for the training images.

If you prefer to organize the images into training, validation, and testing yourself, what you learned above still applies, with two exceptions. First, your data_dir is at the level of the training, validation, or testing directory. Second, you don’t need to specify validation_split in ImageDataGenerator or subset in flow_from_directory, as sketched below.
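A sketch of that self-organized case might look like the following, where my_project/train and my_project/validation are hypothetical placeholders for your own directories:

datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255)

# The directory passed in is already a partition, so there is no subset argument.
train_generator = datagen.flow_from_directory(
    'my_project/train', shuffle=True, **dataflow_kwargs)
valid_generator = datagen.flow_from_directory(
    'my_project/validation', shuffle=False, **dataflow_kwargs)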

Inspecting resized images

Now let’s inspect the resized images coming off the generator. Below is the code snippet for iterating through a batch of data streamed by a generator:

import matplotlib.pyplot as plt
import numpy as np
image_batch, label_batch = next(iter(train_generator))
fig, axes = plt.subplots(8, 4, figsize=(10, 20))
axes = axes.flatten()
for img, lbl, ax in zip(image_batch, label_batch, axes):
    ax.imshow(img)
    label_ = np.argmax(lbl)
    label = idx_labels[label_]
    ax.set_title(label)
    ax.axis('off')
plt.show()

This code will produce 32 images from the first batch coming off the generator (Figure 1-7).

Figure 1-7. A batch of reshaped images.

Let’s examine the code.

image_batch, label_batch = next(iter(train_generator))

You need to iterate over the base directory with the generator. Apply the iter function to the generator, and leverage the next function to output the image batch and label batch as respective numpy arrays.

fig, axes = plt.subplots(8, 4, figsize=(10, 20))

In this line, you set up the number of subplots you expect, which is 32, your batch size.

axes = axes.flatten()
for img, lbl, ax in zip(image_batch, label_batch, axes):
    ax.imshow(img)
    label_ = np.argmax(lbl)
    label = idx_labels[label_]
    ax.set_title(label)
    ax.axis('off')
plt.show()

Then you flatten the figure axes and use a for loop to display the NumPy arrays as images together with their labels. As shown in Figure 1-7, all of the images have been resized to 224 × 224 pixel squares. Although the subplot holder is a rectangle with figsize=(10, 20), you can see that all of the images are squares. This means your code for resizing and normalizing images in the generator workflow works as expected.

Summary

In this chapter, you learned the fundamentals of streaming data with Python. This is a workhorse technique when working with large, distributed datasets. You also saw some common file organization patterns for tabular and image data.

In the section on tabular data, you learned how a good choice of file-naming conventions can make it easier to build a reference to all the files, regardless of how many there are. This means you now know how to build a scalable pipeline that can ingest as much data as needed into a Python runtime for any use (in this case, for TensorFlow to create a dataset).

You also learned how image files are usually organized in file storage and how to associate images with labels. In the next chapter, you will leverage what you’ve learned here about data organization and streaming to integrate it with the model-training process.
