Sharding

Although we said that it is best to keep all our data in one file, this is not entirely true. Because TFRecords are read sequentially, we cannot shuffle the order of the records if we use just one file. Every time you reach the end of the TFRecord after an epoch of training, you go back to the start of the dataset and, unfortunately, the data comes out in the same order on every pass through the file.

To allow us to shuffle the data, we can shard it: create multiple TFRecord files and spread the data across them. We can then shuffle the order in which we load the TFRecord files each epoch, so the data is effectively shuffled for us while we train. Around 1,000 shards for every million images (roughly 1,000 images per shard) is a good baseline to follow.
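The idea can be sketched in plain Python, without any TensorFlow calls, to show the bookkeeping involved. The helper names (shard_name, assign_to_shards, epoch_file_order), the "train" prefix, and the filename pattern are all hypothetical choices for illustration, not part of any library. In a real pipeline each shard would be written with something like tf.io.TFRecordWriter, and the per-epoch file shuffling could be left to tf.data.Dataset.list_files with shuffle=True; neither is shown here.

```python
import random


def shard_name(prefix, index, num_shards):
    """Build a shard filename such as 'train-00003-of-01000.tfrecord'.

    The naming pattern is an assumption made for this sketch.
    """
    return f"{prefix}-{index:05d}-of-{num_shards:05d}.tfrecord"


def assign_to_shards(examples, num_shards):
    """Spread examples across num_shards lists in round-robin order."""
    shards = [[] for _ in range(num_shards)]
    for i, example in enumerate(examples):
        shards[i % num_shards].append(example)
    return shards


def epoch_file_order(filenames, seed):
    """Return a shuffled copy of the shard filenames for one epoch."""
    order = list(filenames)
    random.Random(seed).shuffle(order)
    return order


# Toy data: ten "examples" spread over four shards.
num_shards = 4
examples = list(range(10))
shards = assign_to_shards(examples, num_shards)
filenames = [shard_name("train", i, num_shards) for i in range(num_shards)]

# Each epoch reads the same files, but in a freshly shuffled order.
for epoch in range(2):
    print(epoch, epoch_file_order(filenames, seed=epoch))
```

Note that shuffling only the file order leaves the records inside each shard in their written order, which is one reason to prefer many small shards over a few large ones.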

In the next section, we will see how to use our TFRecords to make efficient data feeding pipelines.
