Network training

First, we should create PyTorch data loader objects for the train and test datasets. The data loader object is responsible for sampling objects from the dataset and making mini-batches from them. This object can be configured as follows:

  1. First, we initialize the MNISTDataset type objects representing our datasets.
  2. Then, we use the torch::data::make_data_loader function to create a data loader object. This function takes the torch::data::DataLoaderOptions type object with configuration settings for the data loader. We set the mini-batch size equal to 256 items and set 8 parallel data loading threads. We should also configure the sampler type, but in this case, we'll leave the default one – the random sampler.

The following snippet shows how to initialize the train and test data loaders:

auto train_images = root_path / "train-images-idx3-ubyte";
auto train_labels = root_path / "train-labels-idx1-ubyte";
auto test_images = root_path / "t10k-images-idx3-ubyte";
auto test_labels = root_path / "t10k-labels-idx1-ubyte";

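// initialize train dataset
// ----------------------------------------------
MNISTDataset train_dataset(train_images.native(),
                           train_labels.native());

auto train_loader = torch::data::make_data_loader(
    train_dataset.map(torch::data::transforms::Stack<>()),
    torch::data::DataLoaderOptions().batch_size(256).workers(8));

// initialize test dataset
// ----------------------------------------------
MNISTDataset test_dataset(test_images.native(),
                          test_labels.native());

auto test_loader = torch::data::make_data_loader(
    test_dataset.map(torch::data::transforms::Stack<>()),
    torch::data::DataLoaderOptions().batch_size(256).workers(8));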

Notice that we didn't pass our dataset objects directly to the torch::data::make_data_loader function, but we applied the stacking transformation mapping to it. This transformation allows us to sample mini-batches in the form of the torch::Tensor object. If we skip this transformation, the mini-batches will be sampled as the C++ vector of tensors. Usually, this isn't very useful because we can't apply linear algebra operations to the whole batch in a vectorized manner.
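To make the roles of the random sampler and the stacking transformation more concrete, here is a minimal sketch in plain C++ that does not use LibTorch. The names Batch, stack_samples, and random_batches are hypothetical and exist only for this illustration. It assumes every sample is a flat vector of floats of the same length, and it only mimics the idea of shuffling indices, splitting them into mini-batches, and collating samples into one contiguous buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// A stacked mini-batch: `rows` samples of `cols` values each,
// stored row by row in one contiguous buffer.
struct Batch {
  std::size_t rows = 0;
  std::size_t cols = 0;
  std::vector<float> values;  // size is rows * cols
};

// Collate equally sized samples into a single contiguous batch.
Batch stack_samples(const std::vector<std::vector<float>>& samples) {
  Batch batch;
  batch.rows = samples.size();
  batch.cols = samples.empty() ? 0 : samples.front().size();
  batch.values.reserve(batch.rows * batch.cols);
  for (const auto& sample : samples) {
    assert(sample.size() == batch.cols);  // all samples must share one shape
    batch.values.insert(batch.values.end(), sample.begin(), sample.end());
  }
  return batch;
}

// Shuffle the sample indices and split them into mini-batches.
// Every index is used exactly once; the last batch may be smaller.
std::vector<std::vector<std::size_t>> random_batches(std::size_t dataset_size,
                                                     std::size_t batch_size,
                                                     unsigned seed) {
  assert(batch_size > 0);
  std::vector<std::size_t> indices(dataset_size);
  std::iota(indices.begin(), indices.end(), std::size_t{0});
  std::mt19937 rng(seed);
  std::shuffle(indices.begin(), indices.end(), rng);

  std::vector<std::vector<std::size_t>> batches;
  for (std::size_t start = 0; start < dataset_size; start += batch_size) {
    const std::size_t end = std::min(start + batch_size, dataset_size);
    batches.emplace_back(indices.begin() + start, indices.begin() + end);
  }
  return batches;
}
```

With a contiguous buffer like this, an operation can run over the whole batch in one pass instead of looping over separate per-sample objects, which is the benefit the stacking transformation provides for tensors.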

The next step is to initialize the neural network object of the LeNet5 type, which we defined previously. We'll move it to the GPU to improve training and evaluation performance:

LeNet5 model;
model->to(torch::kCUDA); // move the model parameters to the GPU

When the model of our neural network has been initialized, we can initialize an optimizer. We chose stochastic gradient descent with momentum optimization for this. It is implemented in the torch::optim::SGD class. The object of this class should be initialized with model (network) parameters and the torch::optim::SGDOptions type object. All torch::nn::Module type objects have the parameters() method, which returns the std::vector<Tensor> object containing all the parameters (weights) of the network. There is also the named_parameters method, which returns the dictionary of named parameters. Parameter names are created with the names we used in the register_module function call. This method is handy if we want to filter parameters and exclude some of them from the training process.

The torch::optim::SGDOptions object can be configured with the learning rate, the weight decay regularization factor, and the momentum factor:

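double learning_rate = 0.01;
double weight_decay = 0.0001; // regularization parameter
double momentum = 0.5;        // assumed value; not stated in the text
torch::optim::SGD optimizer(
    model->parameters(),
    torch::optim::SGDOptions(learning_rate).weight_decay(weight_decay).momentum(momentum));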
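
To show how the three settings interact, the following sketch implements one common formulation of the SGD update with momentum and weight decay in plain C++. It is an illustration under stated assumptions, not the LibTorch source: the SgdState type and the sgd_step function are hypothetical names, weight decay is treated as an L2 penalty added to the gradient, and the velocity buffer starts at zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hyperparameters plus the per-parameter momentum buffer.
struct SgdState {
  double learning_rate;
  double momentum;
  double weight_decay;
  std::vector<double> velocity;  // filled lazily on the first step
};

// One optimization step over a flat list of parameters:
//   g = grad + weight_decay * param
//   v = momentum * v + g
//   param = param - learning_rate * v
void sgd_step(SgdState& state, std::vector<double>& params,
              const std::vector<double>& grads) {
  assert(params.size() == grads.size());
  if (state.velocity.empty()) {
    state.velocity.assign(params.size(), 0.0);
  }
  assert(state.velocity.size() == params.size());
  for (std::size_t i = 0; i < params.size(); ++i) {
    const double g = grads[i] + state.weight_decay * params[i];
    state.velocity[i] = state.momentum * state.velocity[i] + g;
    params[i] -= state.learning_rate * state.velocity[i];
  }
}
```

For example, with a learning rate of 0.1, a momentum of 0.5, no weight decay, and a constant gradient of 2, a parameter that starts at 1 moves to 0.8 after the first step and to 0.5 after the second, because the velocity grows from 2 to 3.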

Now that we have our initialized data loaders, the network object, and the optimizer object, we are ready to start the training cycle. The following snippet shows the training cycle's implementation:

int epochs = 100;
for (int epoch = 0; epoch < epochs; ++epoch) {
  model->train(); // switch to the training mode

  // Iterate the data loader to get batches from the dataset
  int batch_index = 0;
  for (auto& batch : (*train_loader)) {
    // Clear gradients
    optimizer.zero_grad();

    // Execute the model on the input data
    torch::Tensor prediction = model->forward(batch.data.to(torch::kCUDA));

    // Compute a loss value to estimate error of our model
    // target should have size of [batch_size]
    torch::Tensor loss =
        torch::nll_loss(prediction, batch.target.view({-1}).to(torch::kCUDA));

    // Compute gradients of the loss with respect to the parameters of our model
    loss.backward();

    // Update the parameters based on the calculated gradients.
    optimizer.step();

    // Output the loss every 10 batches.
    if (++batch_index % 10 == 0) {
      std::cout << "Epoch: " << epoch << " | Batch: " << batch_index
                << " | Loss: " << loss.item<float>() << std::endl;
    }
  }
}

We've made a loop that repeats the training cycle for 100 epochs. At the beginning of the training cycle, we switched our network object to training mode with model->train(). For one epoch, we iterate over all the mini-batches provided by the data loader object:

for (auto& batch : (*train_loader)){

For every mini-batch, we performed the following training steps. We cleared the previous gradient values by calling the zero_grad method of the optimizer object, made a forward pass through the network with model->forward, and computed the loss value with the nll_loss function. This function computes the negative log-likelihood loss. It takes two parameters: a tensor in which each row holds the log-probabilities that a training sample belongs to each class, indexed by position, and a tensor of numeric class labels. Then, we called the backward method of the loss tensor, which backpropagates through the whole network and computes the gradients of the loss with respect to its parameters. Finally, we called the step method of the optimizer object, which used these gradients to update the parameters (weights). The step method updates only the parameters that were passed to the optimizer when it was initialized.
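
The negative log-likelihood computation can be sketched in plain C++ as follows. This is a simplified illustration and not the library implementation: the function names are hypothetical, the loss is averaged over the batch, and the inputs are assumed to be log-probabilities such as those produced by a log-softmax output layer.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Convert one row of raw scores into log-probabilities.
// Subtracting the maximum first keeps std::exp from overflowing.
std::vector<double> log_softmax(const std::vector<double>& logits) {
  assert(!logits.empty());
  const double max_logit = *std::max_element(logits.begin(), logits.end());
  double sum = 0.0;
  for (double v : logits) {
    sum += std::exp(v - max_logit);
  }
  const double log_sum = max_logit + std::log(sum);
  std::vector<double> result;
  result.reserve(logits.size());
  for (double v : logits) {
    result.push_back(v - log_sum);
  }
  return result;
}

// Mean negative log-likelihood over a batch: for each sample, take the
// log-probability at the position of its true class and negate it.
double nll_loss(const std::vector<std::vector<double>>& log_probs,
                const std::vector<std::size_t>& targets) {
  assert(!log_probs.empty());
  assert(log_probs.size() == targets.size());
  double total = 0.0;
  for (std::size_t i = 0; i < log_probs.size(); ++i) {
    assert(targets[i] < log_probs[i].size());
    total -= log_probs[i][targets[i]];
  }
  return total / static_cast<double>(log_probs.size());
}
```

A model that assigns equal scores to two classes yields a loss of ln 2 (about 0.693) per sample, and the loss approaches zero as the log-probability of the correct class approaches zero.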

It's common practice to use test or validation data to check the training process after each epoch. We can do this in the following way:

model->eval();  // switch to the evaluation mode
unsigned long total_correct = 0;
float avg_loss = 0.0;
for (auto& batch : (*test_loader)) {
  // Execute the model on the input data
  torch::Tensor prediction = model->forward(batch.data.to(torch::kCUDA));

  // Compute per-sample loss values to estimate error of our model
  torch::Tensor loss = torch::nll_loss(
      prediction, batch.target.view({-1}).to(torch::kCUDA),
      /*weight=*/{}, torch::Reduction::None);

  avg_loss += loss.sum().item<float>();
  auto pred = std::get<1>(prediction.detach_().max(1));
  total_correct += static_cast<unsigned long>(
      pred.eq(batch.target.to(torch::kCUDA).view_as(pred)).sum().item<long>());
}
avg_loss /= test_dataset.size().value();
double accuracy = (static_cast<double>(total_correct) / test_dataset.size().value());
std::cout << "Test Avg. Loss: " << avg_loss << " | Accuracy: " << accuracy << std::endl;

First, we switched the model to evaluation mode by calling the eval method. Then, we iterated over all the batches from the test data loader. For each batch, we performed a forward pass through the network and calculated the loss with the same nll_loss function that we used during training. To estimate the average loss (error) of the model, we accumulated the losses over all the batches and divided the total by the number of test samples. To get the total loss for a batch, we used loss.sum().item<float>(): this sums the losses of the samples in the batch and copies the result into a CPU floating-point variable with the item<float>() method.
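
The averaging step can be illustrated with a short plain C++ sketch. The average_loss function is a hypothetical helper written for this example, and it assumes that the per-sample losses of every batch are available as vectors. Summing all the per-sample losses and dividing by the sample count gives the true mean even when the last batch is smaller than the others, which a mean of per-batch means would not.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Average loss over a dataset, given the per-sample losses of each batch.
double average_loss(
    const std::vector<std::vector<double>>& per_sample_losses_by_batch) {
  double total = 0.0;
  std::size_t count = 0;
  for (const auto& batch_losses : per_sample_losses_by_batch) {
    // Total loss of this batch, like loss.sum() on a per-sample loss tensor.
    total += std::accumulate(batch_losses.begin(), batch_losses.end(), 0.0);
    count += batch_losses.size();
  }
  assert(count > 0);
  return total / static_cast<double>(count);
}
```

For two batches with losses {1, 3} and {5}, this returns 3, whereas averaging the two batch means (2 and 5) would give 3.5.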

Next, we calculate the accuracy value. This is the ratio of correctly classified samples to the total number of samples. Let's go through this calculation step by step. First, we determine the predicted class labels by using the max method of the tensor object:

auto pred = std::get<1>(prediction.detach_().max(1));

The max method returns a tuple of two tensors: the maximum values found along the given dimension (here, one per row of the input tensor) and the indices of those maximum values. We take the second element with std::get<1> because the index of the largest output is the predicted class label. Then, we compare the predicted labels with the target ones and calculate the number of correct answers:

total_correct += static_cast<unsigned long>(
    pred.eq(batch.target.to(torch::kCUDA).view_as(pred)).sum().item<long>());

We used the tensor's eq method for our comparison. This method returns a boolean tensor of the same size as the input, with values equal to 1 where the corresponding elements are equal and 0 where they're not. To perform the comparison, we made a view of the target labels tensor with the same dimensions as the predictions tensor; the view_as method does this reshaping. Then, we calculated the sum of the 1s and moved the value to a CPU variable with the item<long>() method.
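
The same accuracy calculation can be written without tensors, which may make each step easier to follow. The sketch below is plain C++ with hypothetical helper names: argmax plays the role of max(1) followed by std::get<1>, and count_correct combines the eq comparison with the sum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Index of the largest score in one row, that is, the predicted class label.
std::size_t argmax(const std::vector<float>& scores) {
  assert(!scores.empty());
  return static_cast<std::size_t>(std::distance(
      scores.begin(), std::max_element(scores.begin(), scores.end())));
}

// Number of rows whose predicted label matches the target label.
std::size_t count_correct(const std::vector<std::vector<float>>& predictions,
                          const std::vector<std::size_t>& labels) {
  assert(predictions.size() == labels.size());
  std::size_t correct = 0;
  for (std::size_t i = 0; i < predictions.size(); ++i) {
    if (argmax(predictions[i]) == labels[i]) {
      ++correct;
    }
  }
  return correct;
}

// Accuracy: correctly classified samples divided by all samples.
double accuracy(std::size_t correct, std::size_t total) {
  assert(total > 0);
  return static_cast<double>(correct) / static_cast<double>(total);
}
```

The tensor version performs the same work for a whole batch at once, which is why it is preferable on the GPU.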

By doing this, we can see that the specialized framework has more options we can configure and is more flexible for neural network development. It has more layer types and supports dynamic network graphs. It also has a powerful specialized linear algebra library that can be used to create new layers, as well as new loss and activation functions. It has powerful abstractions that enable us to work with big training data. One more important thing to note is that it has a C++ API very similar to the Python API, so we can easily port Python programs to C++ and vice versa.

