The classification loss is the last part of loss function.
This loss is the sum of squared error loss for classification. Again the term is 1 when there is a object on a cell, and 0 otherwise. The idea is that we don't take into account the classification error when there is on object.
The , terms serve to mask the loss on the case that we have an object on the ground-truth and have an object on the model output for a particular cell. Also the same is truth when the ground-truth and the model output doesn't match.
So for example when for a particular cell we don't have a match our loss will be:
And when we have a match:
During implementation in practice you will try to vectorize this loss and avoid for-loops and improve performance, this is specially true for libraries like Tensorflow.
Here's the TensorFlow implementation of YOLO loss:
def loss_layer(self, predicts, labels, scope='loss_layer'): with tf.variable_scope(scope): predict_classes = tf.reshape(predicts[:, :self.boundary1], [self.batch_size, self.cell_size, self.cell_size, self.num_class]) predict_scales = tf.reshape(predicts[:, self.boundary1:self.boundary2], [self.batch_size, self.cell_size, self.cell_size, self.boxes_per_cell]) predict_boxes = tf.reshape(predicts[:, self.boundary2:], [self.batch_size, self.cell_size, self.cell_size, self.boxes_per_cell, 4]) response = tf.reshape(labels[:, :, :, 0], [self.batch_size, self.cell_size, self.cell_size, 1]) boxes = tf.reshape(labels[:, :, :, 1:5], [self.batch_size, self.cell_size, self.cell_size, 1, 4]) boxes = tf.tile(boxes, [1, 1, 1, self.boxes_per_cell, 1]) / self.image_size classes = labels[:, :, :, 5:] offset = tf.constant(self.offset, dtype=tf.float32) offset = tf.reshape(offset, [1, self.cell_size, self.cell_size, self.boxes_per_cell]) offset = tf.tile(offset, [self.batch_size, 1, 1, 1]) predict_boxes_tran = tf.stack([(predict_boxes[:, :, :, :, 0] + offset) / self.cell_size, (predict_boxes[:, :, :, :, 1] + tf.transpose(offset, (0, 2, 1, 3))) / self.cell_size, tf.square(predict_boxes[:, :, :, :, 2]), tf.square(predict_boxes[:, :, :, :, 3])]) predict_boxes_tran = tf.transpose(predict_boxes_tran, [1, 2, 3, 4, 0]) iou_predict_truth = self.tf_iou_vectorized(predict_boxes_tran, boxes) # calculate I tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL] object_mask = tf.reduce_max(iou_predict_truth, 3, keep_dims=True) object_mask = tf.cast((iou_predict_truth >= object_mask), tf.float32) * response # calculate no_I tensor [CELL_SIZE, CELL_SIZE, BOXES_PER_CELL] noobject_mask = tf.ones_like(object_mask, dtype=tf.float32) - object_mask boxes_tran = tf.stack([boxes[:, :, :, :, 0] * self.cell_size - offset, boxes[:, :, :, :, 1] * self.cell_size - tf.transpose(offset, (0, 2, 1, 3)), tf.sqrt(boxes[:, :, :, :, 2]), tf.sqrt(boxes[:, :, :, :, 3])]) boxes_tran = tf.transpose(boxes_tran, [1, 2, 3, 4, 0]) # class_loss class_delta = response * (predict_classes - classes) class_loss = tf.reduce_mean(tf.reduce_sum(tf.square(class_delta), axis=[1, 2, 3]), name='class_loss') * self.class_scale # object_loss object_delta = object_mask * (predict_scales - iou_predict_truth) object_loss = tf.reduce_mean(tf.reduce_sum(tf.square(object_delta), axis=[1, 2, 3]), name='object_loss') * self.object_scale # noobject_loss noobject_delta = noobject_mask * predict_scales noobject_loss = tf.reduce_mean(tf.reduce_sum(tf.square(noobject_delta), axis=[1, 2, 3]), name='noobject_loss') * self.noobject_scale # coord_loss coord_mask = tf.expand_dims(object_mask, 4) boxes_delta = coord_mask * (predict_boxes - boxes_tran) coord_loss = tf.reduce_mean(tf.reduce_sum(tf.square(boxes_delta), axis=[1, 2, 3, 4]), name='coord_loss') * self.coord_scale