Building our dataset

While we could use all 25,000 images and build some nice models on them, remember that our problem objective includes the added constraint of having a small number of images per category. Let's build our own dataset for this purpose. You can refer to the Datasets Builder.ipynb Jupyter Notebook if you want to run the examples yourself.

To start with, we load up the following dependencies, including a utility module called utils, which is available in the utils.py file in the code files for this chapter. It is used mainly to display a visual progress bar when we copy images into new folders:

import glob 
import numpy as np 
import os 
import shutil 
from utils import log_progress 
 
np.random.seed(42) 
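
If you don't have the chapter's utils.py at hand, the following is a minimal sketch of what a log_progress-style generator could look like. It assumes ipywidgets is installed and is only an illustration; the actual implementation shipped with the chapter's code files may differ:

# Hypothetical sketch of a log_progress-style generator (assumes ipywidgets);
# the real utils.py implementation may differ
from ipywidgets import IntProgress, HTML, VBox
from IPython.display import display

def log_progress(sequence, name='Items'):
    total = len(sequence)
    bar = IntProgress(min=0, max=total, value=0)
    label = HTML()
    display(VBox([label, bar]))
    for index, item in enumerate(sequence, 1):
        yield item              # hand each item back to the caller
        bar.value = index       # then advance the progress bar
        label.value = '{}: {} / {}'.format(name, index, total)
    bar.bar_style = 'success'   # turn the bar green when done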

Let's now load up all the images in our original training data folder. Each filename contains its label (cat or dog), so a simple substring check is enough to separate the two categories:

files = glob.glob('train/*') 
 
cat_files = [fn for fn in files if 'cat' in fn] 
dog_files = [fn for fn in files if 'dog' in fn] 
len(cat_files), len(dog_files) 
 
Out [3]: (12500, 12500) 

We can verify with the preceding output that we have 12,500 images for each category. Let's now build our smaller dataset so that we have 3,000 images for training, 1,000 images for validation, and 1,000 images for our test dataset (with equal representation for the two animal categories):

cat_train = np.random.choice(cat_files, size=1500, replace=False) 
dog_train = np.random.choice(dog_files, size=1500, replace=False) 
cat_files = list(set(cat_files) - set(cat_train)) 
dog_files = list(set(dog_files) - set(dog_train)) 
 
cat_val = np.random.choice(cat_files, size=500, replace=False) 
dog_val = np.random.choice(dog_files, size=500, replace=False) 
cat_files = list(set(cat_files) - set(cat_val)) 
dog_files = list(set(dog_files) - set(dog_val)) 
 
cat_test = np.random.choice(cat_files, size=500, replace=False) 
dog_test = np.random.choice(dog_files, size=500, replace=False) 
 
print('Cat datasets:', cat_train.shape, cat_val.shape, cat_test.shape) 
print('Dog datasets:', dog_train.shape, dog_val.shape, dog_test.shape) 
 
 
Cat datasets: (1500,) (500,) (500,) 
Dog datasets: (1500,) (500,) (500,) 
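
Since each draw uses replace=False and the drawn files are removed from the remaining pool before the next draw, the three splits should be mutually disjoint. As a quick sanity check (not part of the original notebook), we can verify this:

# No image should appear in more than one split
assert set(cat_train).isdisjoint(cat_val) and set(cat_train).isdisjoint(cat_test)
assert set(cat_val).isdisjoint(cat_test)
assert set(dog_train).isdisjoint(dog_val) and set(dog_train).isdisjoint(dog_test)
assert set(dog_val).isdisjoint(dog_test)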

Now that our datasets have been created, let's write them out to disk in separate folders, so that we can come back to them at any point in the future without having to worry about whether they are still in main memory:

train_dir = 'training_data' 
val_dir = 'validation_data' 
test_dir = 'test_data' 
 
train_files = np.concatenate([cat_train, dog_train]) 
validate_files = np.concatenate([cat_val, dog_val]) 
test_files = np.concatenate([cat_test, dog_test]) 
 
os.makedirs(train_dir, exist_ok=True)
os.makedirs(val_dir, exist_ok=True)
os.makedirs(test_dir, exist_ok=True)
 
for fn in log_progress(train_files, name='Training Images'): 
    shutil.copy(fn, train_dir) 
for fn in log_progress(validate_files, name='Validation Images'): 
    shutil.copy(fn, val_dir) 
for fn in log_progress(test_files, name='Test Images'): 
    shutil.copy(fn, test_dir) 

The progress bars rendered by log_progress turn green once all the images have been copied to their respective directories.
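
Once the copies finish, a quick file count in each folder (an optional check, not part of the original notebook) should match the split sizes we chose:

# Each directory should now hold the expected number of images
print(len(glob.glob(train_dir + '/*')))   # expected: 3000
print(len(glob.glob(val_dir + '/*')))     # expected: 1000
print(len(glob.glob(test_dir + '/*')))    # expected: 1000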
