Preparing the dataset

Now, let's start preparing the dataset of our network.

For this inception network, we'll use the TFRecord class to manage our dataset. The output dataset files after the preprocessing will be protofiles, which TFRecord can read, and it's just our data stored in a serialized format for faster reading speed. Each protofile has some information stored within it, which is information such as image size and format.

The reason we are doing this is that the size of the dataset is too large and we cannot load the entire dataset into memory (RAM) as it will take up a huge amount of space. Therefore, to manage efficient RAM usage, we have to load the images in batches and delete the previously loaded images that are not being used right now.

The input size the network will take is 299x299. So, we will find a way to first reduce the image size to 299x299 to have a dataset of consistent images.

After reducing the images, we will make protofiles that we can later feed into our network, which will get trained on our dataset.

You need to first download the five training ZIP files and the labels file from here:

https://www.kaggle.com/c/diabetic-retinopathy-detection/data

Unfortunately, Kaggle only lets you download the training ZIP files through an account, so this procedure of downloading the dataset files (as in the previous chapters) can't be made automatic.

Now, let's assume that you have downloaded all five training ZIP files and labels file and stored them in a folder named diabetic. The structure of the diabetic folder will look like this:

diabetic
- train.zip.001
- train.zip.002
- train.zip.003
- train.zip.004
- train.zip.005
- trainLabels.csv.zip

In order to simplify the project, we will do the extraction manually using the compression software. After the extraction is completed, the structure of the diabetic folder will look like this:

diabetic
- train
- 10_left.jpeg
- 10_right.jpeg
- ...
- trainLabels.csv
- train.zip.001
- train.zip.002
- train.zip.003
- train.zip.004
- train.zip.005
- trainLabels.csv.zip

In this case, the train folder contains all the images in the .zip files and trainLabels.csv contains the ground truth labels for each image.

The author of the models repository has provided some example code to work with some popular image classification datasets. Our diabetic problem can be solved with the same approach. Therefore, we can follow the code that works with other datasets such as flower or MNIST dataset. We have already provided the modification to work with diabetic in the repository of this book at https://github.com/mlwithtf/mlwithtf/.

You need to clone the repository and navigate to the chapter_08 folder. You can run the download_and_convert_data.py file as follows:

python download_and_convert_data.py --dataset_name diabetic --dataset_dir D:\datasets\diabetic

In this case, we will use dataset_name as diabetic and dataset_dir is the folder that contains the trainLabels.csv and train folder.

It should run without any issues, start preprocessing our dataset into a suitable (299x299) format, and create some TFRecord file in a newly created folder named tfrecords. The following figure shows the content of the tfrecords folder:

Table of Contents for Preparing the dataset

Create new playlist

Sign In

Sign Up

Table of Contents for
Preparing the dataset