Storing data in TFRecords

Let's start by considering the example of training a network for image classification. In this case, our data will be a collection of images, each with an associated label. One way we might store our data is in a directory structure with one folder per label, where each folder contains the images belonging to that label:

- Data
  - Person
    - im1.png
  - Cat
    - im2.png
  - Dog
    - im3.png
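
As a minimal sketch of how such a layout could be turned into (image, label) pairs, the snippet below walks a folder and takes each label from the name of the subfolder. The function name list_labelled_images and the assumption that all images are .png files are illustrative choices, not something the layout itself requires.

```python
from pathlib import Path


def list_labelled_images(data_dir):
    """Return (image_path, label) pairs, taking each label from its folder name."""
    pairs = []
    # Every subfolder of data_dir is treated as one label (hypothetical convention).
    for label_dir in sorted(Path(data_dir).iterdir()):
        if not label_dir.is_dir():
            continue
        # Assumes PNG images, as in the example tree; other extensions are ignored.
        for image_path in sorted(label_dir.glob("*.png")):
            pairs.append((image_path, label_dir.name))
    return pairs
```

Calling list_labelled_images("Data") on the tree above would give one pair per image, with folders visited in alphabetical order. Note that every image is still a separate file that has to be opened individually.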

Although this might seem a simple way to store our data, it has some major drawbacks once the dataset becomes large. The biggest one shows up when we start loading it.

Opening a file is a time-consuming operation, and having to open many millions of files multiple times adds a large overhead to training time. On top of this, because our data is split across many files, it will not sit in one contiguous block on disk. The hard drive has to do even more work locating and accessing each file.

What is the solution? We put all the data into a single file. The advantage of doing this is that the data is stored together and can be read sequentially, which speeds things up. Having everything in one file also means that we don't have to spend time opening many millions of files, which would be extremely slow and inefficient.
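
To make the single-file idea concrete, here is a small sketch that packs labelled byte payloads back to back into one file and reads them out again in a single sequential pass. The length-prefixed layout and the names pack_records and read_records are assumptions made purely for illustration; this is a toy format, not the actual TFRecord layout.

```python
import struct


def pack_records(records, out_path):
    """Write (label, payload) pairs back to back into a single file."""
    with open(out_path, "wb") as out:
        for label, payload in records:
            label_bytes = label.encode("utf-8")
            # Header: two little-endian 4-byte unsigned lengths (toy layout).
            out.write(struct.pack("<II", len(label_bytes), len(payload)))
            out.write(label_bytes)
            out.write(payload)


def read_records(path):
    """Yield (label, payload) pairs in write order with one sequential pass."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break  # End of file reached.
            label_len, payload_len = struct.unpack("<II", header)
            label = f.read(label_len).decode("utf-8")
            payload = f.read(payload_len)
            yield label, payload
```

The file is opened once however many records it holds, and the reader only ever moves forward through it, which is the access pattern that makes a single-file store cheap to load.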

There are several formats we could use to store our data this way, such as HDF5 or LMDB. However, as we are using TensorFlow, we will go ahead and use its own built-in format, TFRecords. It is a binary file format providing sequential access to its contents, and it is flexible enough that we can store complicated datasets and labels along with any metadata we might want as well.
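
As a rough sketch of what this looks like in practice, the snippet below writes encoded images and integer labels to a TFRecords file and reads them back. It assumes TensorFlow 2.x; the feature names "image" and "label" and the helper names are hypothetical choices for this example rather than anything the format mandates.

```python
import tensorflow as tf


def make_example(image_bytes, label):
    """Wrap one encoded image and its integer label in a tf.train.Example."""
    return tf.train.Example(features=tf.train.Features(feature={
        # "image" and "label" are illustrative feature names.
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))


def write_tfrecord(samples, path):
    """Serialize every (image_bytes, label) pair into a single TFRecords file."""
    with tf.io.TFRecordWriter(path) as writer:
        for image_bytes, label in samples:
            writer.write(make_example(image_bytes, label).SerializeToString())


# Describes how to decode each stored record; must match the names used when writing.
FEATURE_SPEC = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}


def read_tfrecord(path):
    """Return a dataset of parsed records, read sequentially from the file."""
    dataset = tf.data.TFRecordDataset(path)
    return dataset.map(lambda record: tf.io.parse_single_example(record, FEATURE_SPEC))
```

Each element of the dataset returned by read_tfrecord is a dictionary of tensors keyed by feature name. The image is stored here as raw encoded bytes, so it would still need decoding (for example with tf.io.decode_png) before being fed to a network.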
