Explaining the data preparation

Now let's get to the coding part of the data preprocessing. From here on, we will show you what we have changed relative to the original tensorflow/models repository. Basically, we take the code that processes the flowers dataset as the starting point and modify it to suit our needs.

In the download_and_convert_data.py file, we have added a new line at the beginning of the file:

from datasets import download_and_convert_diabetic 
and a new elif clause to handle the dataset_name "diabetic" at line 69:
  elif FLAGS.dataset_name == 'diabetic':
      download_and_convert_diabetic.run(FLAGS.dataset_dir)

With this code, we can call the run method of download_and_convert_diabetic.py in the datasets folder. This is a really simple approach to separating the preprocessing code of multiple datasets while still taking advantage of the other parts of the image classification library.
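
For context, the relevant part of the main function in download_and_convert_data.py now looks roughly like this (abridged from the tensorflow/models repository; the other dataset branches are omitted):

  def main(_):
    if not FLAGS.dataset_name:
      raise ValueError('You must supply the dataset name with --dataset_name')
    if not FLAGS.dataset_dir:
      raise ValueError('You must supply the dataset directory with --dataset_dir')

    if FLAGS.dataset_name == 'flowers':
      download_and_convert_flowers.run(FLAGS.dataset_dir)
    elif FLAGS.dataset_name == 'diabetic':
      download_and_convert_diabetic.run(FLAGS.dataset_dir)
    else:
      raise ValueError(
          'dataset_name [%s] was not recognized.' % FLAGS.dataset_name)

With this in place, the conversion can be launched with a command such as python download_and_convert_data.py --dataset_name=diabetic --dataset_dir=/path/to/dataset_dir (the path is, of course, yours to choose).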

The download_and_convert_diabetic.py file is a copy of the download_and_convert_flowers.py file with some modifications to prepare our diabetic dataset.

In the run method of the download_and_convert_diabetic.py file, we made changes as follows:

  def run(dataset_dir):
      """Runs the download and conversion operation.

      Args:
        dataset_dir: The dataset directory where the dataset is stored.
      """
      if not tf.gfile.Exists(dataset_dir):
          tf.gfile.MakeDirs(dataset_dir)

      if _dataset_exists(dataset_dir):
          print('Dataset files already exist. Exiting without re-creating them.')
          return

      # Pre-processing the images.
      data_utils.prepare_dr_dataset(dataset_dir)
      training_filenames, validation_filenames, class_names = \
          _get_filenames_and_classes(dataset_dir)
      class_names_to_ids = dict(zip(class_names, range(len(class_names))))

      # Convert the training and validation sets.
      _convert_dataset('train', training_filenames, class_names_to_ids,
                       dataset_dir)
      _convert_dataset('validation', validation_filenames,
                       class_names_to_ids, dataset_dir)

      # Finally, write the labels file:
      labels_to_class_names = dict(zip(range(len(class_names)), class_names))
      dataset_utils.write_label_file(labels_to_class_names, dataset_dir)

      print('\nFinished converting the Diabetic dataset!')

In this code, we use the prepare_dr_dataset function from the data_utils module, which lives in the root of this book's repository. We will look at that function later. Then, we changed the _get_filenames_and_classes method to return the training and validation filenames. The last few lines are the same as in the flowers dataset example:

  def _get_filenames_and_classes(dataset_dir):
      train_root = os.path.join(dataset_dir, 'processed_images', 'train')
      validation_root = os.path.join(dataset_dir, 'processed_images',
                                     'validation')

      class_names = []
      for filename in os.listdir(train_root):
          path = os.path.join(train_root, filename)
          if os.path.isdir(path):
              class_names.append(filename)

      train_filenames = []
      directories = [os.path.join(train_root, name) for name in class_names]
      for directory in directories:
          for filename in os.listdir(directory):
              path = os.path.join(directory, filename)
              train_filenames.append(path)

      validation_filenames = []
      directories = [os.path.join(validation_root, name)
                     for name in class_names]
      for directory in directories:
          for filename in os.listdir(directory):
              path = os.path.join(directory, filename)
              validation_filenames.append(path)

      return train_filenames, validation_filenames, sorted(class_names)

In the preceding method, we find all the filenames in the processed_images/train and processed_images/validation folders, which contain the images that were preprocessed in the data_utils.prepare_dr_dataset method.
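
Concretely, data_utils.prepare_dr_dataset (shown later) leaves a directory layout like the following, with one sub-folder per DR grade; since the folder name doubles as the class name, class_names ends up as ['0', '1', '2', '3', '4']:

  dataset_dir/
    processed_images/
      train/
        0/  1/  2/  3/  4/    # one folder per DR grade, each holding JPEGs
      validation/
        0/  1/  2/  3/  4/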

In the data_utils.py file, we have written the prepare_dr_dataset(dataset_dir) function, which is responsible for the entire preprocessing of the data.

Let's start by defining the necessary variables to link to our data:

num_of_processing_threads = 16 
dr_dataset_base_path = os.path.realpath(dataset_dir) 
unique_labels_file_path = os.path.join(dr_dataset_base_path, "unique_labels_file.txt") 
processed_images_folder = os.path.join(dr_dataset_base_path, "processed_images") 
num_of_processed_images = 35126 
train_processed_images_folder = os.path.join(processed_images_folder, "train") 
validation_processed_images_folder = os.path.join(processed_images_folder, "validation") 
num_of_training_images = 30000 
raw_images_folder = os.path.join(dr_dataset_base_path, "train") 
train_labels_csv_path = os.path.join(dr_dataset_base_path, "trainLabels.csv")

The num_of_processing_threads variable specifies the number of threads we want to use while preprocessing our dataset, as you may have already guessed. We will use a multi-threaded environment to preprocess our data faster. The remaining variables specify the directory paths that will hold the raw and processed data during preprocessing.

We will extract the raw images, preprocess them into a consistent format and size, and then generate the tfrecord files from the processed images with the _convert_dataset method in the download_and_convert_diabetic.py file. After that, we will feed these tfrecord files into the training and testing networks.

As we said in the previous section, we have already extracted the dataset files and the labels files. Now, as we have all of the data extracted and present inside our machine, we will process the images. A typical image from the DR dataset is a photograph of the retina surrounded by a large area of black background.

We want to remove this extra black space because it carries no useful information for our network; cropping it removes unnecessary content from the image. After this, we will scale the image into a 299x299 JPEG file, since 299x299 is the default input size of the Inception V3 network.

We will repeat this process for all of the training datasets.
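
Note that ndimage.imread and imresize, used in the batch-processing code further below, are deprecated in recent SciPy releases. As a rough sketch, the same crop-and-resize for a single image can be done with Pillow instead (hypothetical file names; it relies on the crop_black_borders function defined next):

  import numpy as np
  from PIL import Image

  input_path = '10_left.jpeg'              # hypothetical raw image
  output_path = '10_left_processed.jpeg'   # hypothetical output path

  image = np.asarray(Image.open(input_path))        # HxWx3 uint8 array
  cropped = crop_black_borders(image, threshold=10)
  # BICUBIC resampling approximates imresize(..., interp="bicubic").
  resized = Image.fromarray(cropped).resize((299, 299), Image.BICUBIC)
  resized.save(output_path)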

The function to crop the black image borders is as follows:

  def crop_black_borders(image, threshold=0):
      """Crops any edges below or equal to threshold.

      Crops a blank image to 1x1.

      Returns the cropped image.
      """
      if len(image.shape) == 3:
          flatImage = np.max(image, 2)
      else:
          flatImage = image
      assert len(flatImage.shape) == 2

      rows = np.where(np.max(flatImage, 0) > threshold)[0]
      if rows.size:
          cols = np.where(np.max(flatImage, 1) > threshold)[0]
          image = image[cols[0]: cols[-1] + 1, rows[0]: rows[-1] + 1]
      else:
          image = image[:1, :1]

      return image

This function takes an image and a grayscale threshold; it removes any border rows and columns whose maximum intensity is at or below that threshold, which strips the black frame around the retina.
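
As a quick sanity check, here is a minimal, self-contained example (a synthetic array, not an image from the dataset) showing the crop in action:

  import numpy as np

  # A synthetic 6x6 grayscale "image": a bright 2x2 patch on a black background.
  img = np.zeros((6, 6), dtype=np.uint8)
  img[2:4, 2:4] = 255

  cropped = crop_black_borders(img, threshold=10)
  print(cropped.shape)  # (2, 2): only the above-threshold region remains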

As we are doing all of this processing in a multithreaded environment, we will process the images in batches. To process an image batch, we will use the following function:

  def process_images_batch(thread_index, files, labels, subset):
      num_of_files = len(files)

      for index, file_and_label in enumerate(zip(files, labels)):
          file = file_and_label[0] + '.jpeg'
          label = file_and_label[1]

          input_file = os.path.join(raw_images_folder, file)
          output_file = os.path.join(processed_images_folder, subset,
                                     str(label), file)

          image = ndimage.imread(input_file)
          cropped_image = crop_black_borders(image, 10)
          resized_cropped_image = imresize(cropped_image, (299, 299, 3),
                                           interp="bicubic")
          imsave(output_file, resized_cropped_image)

          if index % 10 == 0:
              print("(Thread {}): Files processed {} out of {}".format(
                  thread_index, index, num_of_files))

The thread_index tells us the ID of the thread in which the function has been called. The threaded environment around processing the image batch is defined in the following function:

  def process_images(files, labels, subset):
      # Break all images into batches with
      # [ranges[i][0], ranges[i][1]].
      spacing = np.linspace(0, len(files),
                            num_of_processing_threads + 1).astype(np.int)
      ranges = []
      for i in xrange(len(spacing) - 1):
          ranges.append([spacing[i], spacing[i + 1]])

      # Create a mechanism for monitoring when all threads are finished.
      coord = tf.train.Coordinator()

      threads = []
      for thread_index in xrange(len(ranges)):
          args = (thread_index,
                  files[ranges[thread_index][0]:ranges[thread_index][1]],
                  labels[ranges[thread_index][0]:ranges[thread_index][1]],
                  subset)
          t = threading.Thread(target=process_images_batch, args=args)
          t.start()
          threads.append(t)

      # Wait for all the threads to terminate.
      coord.join(threads)

To coordinate all of the threads, we use the TensorFlow class tf.train.Coordinator(), whose join method blocks until all of the given threads have terminated.

For the threading, we use threading.Thread, in which the target argument specifies the function to be called and the args argument specifies the target function arguments.
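
To make the mechanics concrete, here is a minimal, standalone sketch of the same pattern (with a toy worker function, not the book's code):

  import threading

  import tensorflow as tf

  def worker(thread_index):
      print("Thread {} doing work".format(thread_index))

  coord = tf.train.Coordinator()
  threads = []
  for i in range(4):
      t = threading.Thread(target=worker, args=(i,))
      t.start()
      threads.append(t)

  # join() blocks until every thread in the list has terminated.
  coord.join(threads)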

Now, we will process the training images. The training dataset is divided into a train set (30,000 images) and a validation set (5,126 images); together they account for all 35,126 images, matching num_of_processed_images.

The total preprocessing is handled as follows:

def process_training_and_validation_images():
    train_files = []
    train_labels = []

    validation_files = []
    validation_labels = []

    with open(train_labels_csv_path) as csvfile:
        reader = csv.DictReader(csvfile)
        for index, row in enumerate(reader):
            if index < num_of_training_images:
                train_files.extend([row['image'].strip()])
                train_labels.extend([int(row['level'].strip())])
            else:
                validation_files.extend([row['image'].strip()])
                validation_labels.extend([int(row['level'].strip())])

    if not os.path.isdir(processed_images_folder):
        os.mkdir(processed_images_folder)

    if not os.path.isdir(train_processed_images_folder):
        os.mkdir(train_processed_images_folder)

    if not os.path.isdir(validation_processed_images_folder):
        os.mkdir(validation_processed_images_folder)

    for directory_index in range(5):
        train_directory_path = os.path.join(
            train_processed_images_folder, str(directory_index))
        valid_directory_path = os.path.join(
            validation_processed_images_folder, str(directory_index))

        if not os.path.isdir(train_directory_path):
            os.mkdir(train_directory_path)

        if not os.path.isdir(valid_directory_path):
            os.mkdir(valid_directory_path)

    print("Processing training files...")
    process_images(train_files, train_labels, "train")
    print("Done!")

    print("Processing validation files...")
    process_images(validation_files, validation_labels, "validation")
    print("Done!")

    print("Making unique labels file...")
    with open(unique_labels_file_path, 'w') as unique_labels_file:
        unique_labels = ""
        for index in range(5):
            unique_labels += "{} ".format(index)
        unique_labels_file.write(unique_labels)

    status = check_folder_status(
        processed_images_folder,
        num_of_processed_images,
        "All processed images are present in place",
        "Couldn't complete the image processing of training and "
        "validation files.")

    return status

Now, we will look at the last method for preparing the dataset, the _convert_dataset method that is called in the download_and_convert_diabetic.py file:

def _get_dataset_filename(dataset_dir, split_name, shard_id):
    output_filename = 'diabetic_%s_%05d-of-%05d.tfrecord' % (
        split_name, shard_id, _NUM_SHARDS)
    return os.path.join(dataset_dir, output_filename)

def _convert_dataset(split_name, filenames, class_names_to_ids, dataset_dir):
    """Converts the given filenames to a TFRecord dataset.

    Args:
      split_name: The name of the dataset, either 'train' or 'validation'.
      filenames: A list of absolute paths to png or jpg images.
      class_names_to_ids: A dictionary from class names (strings) to ids
        (integers).
      dataset_dir: The directory where the converted datasets are stored.
    """
    assert split_name in ['train', 'validation']

    num_per_shard = int(math.ceil(len(filenames) / float(_NUM_SHARDS)))

    with tf.Graph().as_default():
        image_reader = ImageReader()

        with tf.Session('') as sess:
            for shard_id in range(_NUM_SHARDS):
                output_filename = _get_dataset_filename(
                    dataset_dir, split_name, shard_id)

                with tf.python_io.TFRecordWriter(output_filename) as tfrecord_writer:
                    start_ndx = shard_id * num_per_shard
                    end_ndx = min((shard_id + 1) * num_per_shard,
                                  len(filenames))
                    for i in range(start_ndx, end_ndx):
                        sys.stdout.write('\r>> Converting image %d/%d shard %d' % (
                            i + 1, len(filenames), shard_id))
                        sys.stdout.flush()

                        # Read the filename:
                        image_data = tf.gfile.FastGFile(filenames[i], 'rb').read()
                        height, width = image_reader.read_image_dims(
                            sess, image_data)

                        class_name = os.path.basename(
                            os.path.dirname(filenames[i]))
                        class_id = class_names_to_ids[class_name]

                        example = dataset_utils.image_to_tfexample(
                            image_data, b'jpg', height, width, class_id)
                        tfrecord_writer.write(example.SerializeToString())

    sys.stdout.write('\n')
    sys.stdout.flush()

In the preceding function, we take the image filenames, read each image, and store it in tfrecord files. We also split the train and validation sets into multiple tfrecord shards instead of using only one file for each split.
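
For example, assuming _NUM_SHARDS = 5 (the value used in the original flowers script), the 30,000 training images are spread over five shards of 6,000 images each, with file names produced by _get_dataset_filename:

  import math

  _NUM_SHARDS = 5          # assumed, as in the flowers script
  num_train_files = 30000

  num_per_shard = int(math.ceil(num_train_files / float(_NUM_SHARDS)))
  print(num_per_shard)     # 6000

  for shard_id in range(_NUM_SHARDS):
      print('diabetic_train_%05d-of-%05d.tfrecord' % (shard_id, _NUM_SHARDS))
  # diabetic_train_00000-of-00005.tfrecord
  # ...
  # diabetic_train_00004-of-00005.tfrecord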

Now, with the data processing out of the way, we will formalize the dataset as an instance of slim.dataset.Dataset from TensorFlow Slim. In the datasets/diabetic.py file, you will see a method named get_split, as follows:

_FILE_PATTERN = 'diabetic_%s_*.tfrecord' 
SPLITS_TO_SIZES = {'train': 30000, 'validation': 5126} 
_NUM_CLASSES = 5 
_ITEMS_TO_DESCRIPTIONS = { 
    'image': 'A color image of varying size.', 
    'label': 'A single integer between 0 and 4', 
} 
def get_split(split_name, dataset_dir, file_pattern=None, reader=None): 
  """Gets a dataset tuple with instructions for reading flowers. 
  Args: 
    split_name: A train/validation split name. 
    dataset_dir: The base directory of the dataset sources. 
    file_pattern: The file pattern to use when matching the dataset sources. 
      It is assumed that the pattern contains a '%s' string so that the split 
      name can be inserted. 
    reader: The TensorFlow reader type. 
  Returns: 
    A `Dataset` namedtuple. 
  Raises: 
    ValueError: if `split_name` is not a valid train/validation split. 
  """ 
  if split_name not in SPLITS_TO_SIZES: 
    raise ValueError('split name %s was not recognized.' % split_name) 
 
  if not file_pattern: 
    file_pattern = _FILE_PATTERN 
  file_pattern = os.path.join(dataset_dir, file_pattern % split_name) 
 
  # Allowing None in the signature so that dataset_factory can use the default. 
  if reader is None: 
    reader = tf.TFRecordReader 
 
  keys_to_features = { 
      'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''), 
      'image/format': tf.FixedLenFeature((), tf.string, default_value='png'), 
      'image/class/label': tf.FixedLenFeature( 
          [], tf.int64, default_value=tf.zeros([], dtype=tf.int64)), 
  } 
  items_to_handlers = { 
      'image': slim.tfexample_decoder.Image(), 
      'label': slim.tfexample_decoder.Tensor('image/class/label'), 
  } 
  decoder = slim.tfexample_decoder.TFExampleDecoder( 
      keys_to_features, items_to_handlers) 
 
  labels_to_names = None 
  if dataset_utils.has_labels(dataset_dir): 
    labels_to_names = dataset_utils.read_label_file(dataset_dir) 
 
  return slim.dataset.Dataset( 
      data_sources=file_pattern, 
      reader=reader, 
      decoder=decoder, 
      num_samples=SPLITS_TO_SIZES[split_name], 
      items_to_descriptions=_ITEMS_TO_DESCRIPTIONS, 
      num_classes=_NUM_CLASSES, 
      labels_to_names=labels_to_names) 

The preceding method will be called during the training and evaluation routines. We create an instance of slim.dataset.Dataset holding the information about our tfrecord files so that it can automatically parse the binary files. Moreover, we can use slim.dataset.Dataset together with DatasetDataProvider from TensorFlow Slim to read the dataset in parallel, which speeds up the training and evaluation routines.
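
As a minimal sketch, a training script might consume this dataset through DatasetDataProvider like so (the dataset directory is a hypothetical path):

  import tensorflow as tf
  from datasets import diabetic

  slim = tf.contrib.slim

  # Hypothetical directory containing the converted tfrecord files.
  dataset = diabetic.get_split('train', '/path/to/dataset_dir')

  provider = slim.dataset_data_provider.DatasetDataProvider(
      dataset,
      num_readers=4,   # read several tfrecord shards in parallel
      shuffle=True)
  image, label = provider.get(['image', 'label'])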

Before we start training, we need to download the pre-trained model of Inception V3 from the Tensorflow Slim image classification library so we can leverage the performance of Inception V3 without training from scratch.

The pre-trained snapshot can be found here:

https://github.com/tensorflow/models/tree/master/research/slim#Pretrained

In this chapter, we will use Inception V3, so we need to download the inception_v3_2016_08_28.tar.gz file and extract it to have the checkpoint file named inception_v3.ckpt.
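
A small sketch of the download-and-extract step in Python, assuming the checkpoint URL listed on the TF-Slim pre-trained models page at the time of writing:

  import tarfile

  from six.moves import urllib  # works on both Python 2 and 3

  url = 'http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz'
  archive_path = 'inception_v3_2016_08_28.tar.gz'

  urllib.request.urlretrieve(url, archive_path)
  with tarfile.open(archive_path) as tar:
      tar.extractall('.')  # extracts inception_v3.ckpt into the current folder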
