Any custom dataset class, such as our DogsAndCatsDataset class, has to inherit from the PyTorch Dataset class. The custom class has to implement two main methods, namely __len__(self) and __getitem__(self, idx). Any custom class acting as a Dataset class should look like the following code snippet:
from torch.utils.data import Dataset

class DogsAndCatsDataset(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        pass

    def __getitem__(self, idx):
        pass
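To make the contract concrete, here is a minimal sketch of how Python calls the two methods. SquaresDataset is a hypothetical toy class that is not part of the original example; it keeps nothing on disk and only illustrates that len(dataset) calls __len__ and dataset[idx] calls __getitem__.

```python
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: item idx is the pair (idx, idx squared)."""

    def __init__(self, n):
        # One-time setup: remember how many items exist.
        self.n = n

    def __len__(self):
        # Called by len(dataset).
        return self.n

    def __getitem__(self, idx):
        # Called by dataset[idx]. Raising IndexError past the end lets a
        # plain for-loop over the dataset know when to stop.
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        return idx, idx ** 2


squares = SquaresDataset(5)
size = len(squares)   # goes through __len__
third = squares[2]    # goes through __getitem__
```

Because the class defines no __iter__ method, a for-loop over it falls back to calling __getitem__ with 0, 1, 2, and so on until an IndexError is raised.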
We do any initialization, if required, inside the __init__ method: for example, reading the index of a table or, in our case, reading the filenames of the images. The __len__(self) method returns the number of elements in our dataset. The __getitem__(self, idx) method returns the element at position idx every time it is called. The following code implements our DogsAndCatsDataset class:
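from glob import glob
import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class DogsAndCatsDataset(Dataset):
    def __init__(self, root_dir, size=(224, 224)):
        # root_dir is passed straight to glob, so it should be a pattern that matches the image files
        self.files = glob(root_dir)
        self.size = size

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        img = np.asarray(Image.open(self.files[idx]).resize(self.size))
        # The label is the name of the parent folder (assumes '/'-separated paths)
        label = self.files[idx].split('/')[-2]
        return img, label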
Once the DogsAndCatsDataset class is defined, we can create an object and iterate over it, as shown in the following code:
# dogsdset is an instance of DogsAndCatsDataset
for image, label in dogsdset:
    # Apply your DL on the dataset.
    pass
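For readers who want to try the loop end to end, the following is a rough, self-contained sketch. It makes several assumptions that are not in the original text: the images live in one folder per class (cat and dog here), the files are blank PNG placeholders created in a temporary directory, and the label is extracted with os.path rather than split('/') so that the sketch also runs on systems that use a different path separator. The file list is sorted only to make the iteration order predictable.

```python
import os
import tempfile
from glob import glob

import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class DogsAndCatsDataset(Dataset):
    def __init__(self, root_dir, size=(224, 224)):
        # root_dir is a glob pattern; sorting makes the order deterministic.
        self.files = sorted(glob(root_dir))
        self.size = size

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        img = np.asarray(Image.open(self.files[idx]).resize(self.size))
        # Portable variant of split('/')[-2]: the parent folder name.
        label = os.path.basename(os.path.dirname(self.files[idx]))
        return img, label


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical layout: <tmp_dir>/cat/*.png and <tmp_dir>/dog/*.png
    for name in ("cat", "dog"):
        os.makedirs(os.path.join(tmp_dir, name))
        for i in range(2):
            placeholder = Image.new("RGB", (64, 48))
            placeholder.save(os.path.join(tmp_dir, name, f"{i}.png"))

    dogsdset = DogsAndCatsDataset(os.path.join(tmp_dir, "*", "*.png"), size=(32, 32))

    # Iterating yields one (image, label) pair at a time.
    samples = [(image.shape, label) for image, label in dogsdset]
```

Each image arrives as a NumPy array of shape (height, width, channels), and each label is a plain string taken from the folder name.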
Applying a deep learning algorithm to one instance of data at a time is not optimal. We need batches of data, because modern GPUs perform best when they process a whole batch at once. The DataLoader class helps create batches and hides much of the complexity involved.
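As a small illustration of batching, the sketch below wraps a dataset in a DataLoader. RandomImageDataset is a hypothetical stand-in that returns random tensors with integer labels, chosen so that the example runs without any image files; the batch size of 4 and the image size are likewise arbitrary assumptions.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RandomImageDataset(Dataset):
    """Hypothetical stand-in for an image dataset: random tensors, integer labels."""

    def __init__(self, num_samples=10, size=(32, 32)):
        self.images = torch.rand(num_samples, 3, *size)
        self.labels = torch.arange(num_samples) % 2

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]


dataset = RandomImageDataset(num_samples=10, size=(32, 32))

# The DataLoader pulls items through __getitem__ and stacks them into batches.
loader = DataLoader(dataset, batch_size=4, shuffle=False)

batch_shapes = [(images.shape, labels.shape) for images, labels in loader]
```

With 10 samples and a batch size of 4, the loader produces three batches, and the last one holds the 2 leftover samples. Passing drop_last=True to DataLoader would discard that incomplete batch instead.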