Capsule networks 

We have discussed how various CNN architectures have evolved, and we have looked at their successive improvements. Can we now apply CNNs to more advanced applications, such as Advanced Driver Assistance Systems (ADAS) and self-driving cars? Can we detect obstacles, pedestrians, and other overlapping objects on roads in real-world scenarios and in real time? Maybe not! We are still not quite there. In spite of the tremendous success of CNNs in ImageNet competitions, CNNs still have some severe limitations that restrict their applicability to more advanced, real-world problems. CNNs have poor translational invariance and lack information about orientation (or pose).

Pose information refers to three-dimensional orientation relative to the viewer, but also to lighting and color. CNNs do have trouble when objects are rotated or when lighting conditions are changed. According to Hinton, CNNs cannot do handedness detection at all; for instance, they can't tell a left shoe from a right shoe, even if they are trained on both. One reason for these limitations is the use of max pooling, which is a crude way to introduce invariance. By crude invariance, we mean that the output of max pooling does not change much if the image is slightly shifted or rotated. What we actually need is more than invariance; we need equivariance, that is, an internal representation that changes in a predictable way when the image is transformed, so that pose information is preserved rather than discarded.

The first layers of a CNN act as edge detectors and perform much the same function as the visual cortex in the human brain. The difference between the brain and a CNN appears at the higher levels. Efficiently routing low-level visual information to higher-level information, such as objects in various poses and colors, or at various scales and velocities, is believed to be done by cortical micro-columns, which Hinton calls capsules. This routing mechanism makes the human visual system more powerful than the CNN.

Capsule Networks (CapsNets) make two fundamental changes to the CNN architecture: firstly, they replace the scalar-output feature detectors of CNNs with vector-output capsules; secondly, they replace max pooling with routing-by-agreement. Here is a simple CapsNet architecture:

This is a shallow architecture to be trained on MNIST data (28 x 28 handwritten digit images). It has two convolutional layers. The first convolutional layer has 256 feature maps with 9 x 9 kernels (stride = 1) and ReLU activation, so each feature map is (28 - 9 + 1) x (28 - 9 + 1), or 20 x 20. The second convolutional layer again has 256 feature maps with 9 x 9 kernels, but with stride = 2 and ReLU activation. Here, each feature map is 6 x 6, since floor((20 - 9) / 2) + 1 = 6. The feature maps of this layer are then regrouped into 32 groups, with each group having 8 feature maps (256 = 8 x 32). The grouping process aims to create feature vectors, each of size 8; to represent pose, a vector is a more natural representation than a scalar. The grouped feature maps from the second layer are called the primary capsule layer. We have (32 x 6 x 6) eight-dimensional capsule vectors, where each capsule contains eight convolutional units with a 9 x 9 kernel and a stride of 2. The final capsule layer (DigitCaps) has one sixteen-dimensional capsule per digit class (10 classes), and each of these capsules receives input from all the capsules in the primary capsule layer.
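As a rough sketch of this front end (the use of tf.keras and the variable names here are our own choices, not taken from the reference implementation), the two convolutions and the regrouping into eight-dimensional primary capsules could look like this:

import tensorflow as tf
from tensorflow.keras import layers

# Sketch of the CapsNet front end, up to the primary capsule layer
inputs = tf.keras.Input(shape=(28, 28, 1))  # MNIST images

# Conv1: 256 feature maps, 9 x 9 kernels, stride 1 -> 20 x 20 x 256
conv1 = layers.Conv2D(256, kernel_size=9, strides=1, activation='relu')(inputs)

# Conv2: 256 feature maps, 9 x 9 kernels, stride 2 -> 6 x 6 x 256
conv2 = layers.Conv2D(256, kernel_size=9, strides=2, activation='relu')(conv1)

# Regroup the 256 maps into 32 groups of 8: 32 * 6 * 6 = 1152 capsules of dimension 8
primary_caps = layers.Reshape((32 * 6 * 6, 8))(conv2)

model = tf.keras.Model(inputs, primary_caps)
model.summary()  # final output shape: (None, 1152, 8)

In the full model, the squashing non-linearity described next would be applied to these 1152 capsule vectors before routing.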

The length of the output vector of a capsule represents the probability that the entity represented by the capsule is present in the current input. Because this length is used as a probability, it must lie between 0 and 1. A non-linear squashing function ensures this: short vectors get shrunk to almost zero length, and long vectors get shrunk to a length slightly below 1.

The squashing function is:

$$v_j = \mathrm{squash}(s_j) = \frac{\|s_j\|^2}{1 + \|s_j\|^2}\,\frac{s_j}{\|s_j\|}$$

Here, the scaling factor is x²/(1 + x²), where x = ||s_j|| is the norm of the vector and hence x > 0; as x grows, this factor rises from 0 towards 1, which produces exactly the shrinking behaviour described above.
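A minimal sketch of this squashing non-linearity in TensorFlow (our own code; the small eps is added only to avoid division by zero) could be:

import tensorflow as tf

def squash(s, axis=-1, eps=1e-7):
    """Shrink short vectors to almost zero and long vectors to a length just below 1."""
    squared_norm = tf.reduce_sum(tf.square(s), axis=axis, keepdims=True)
    scale = squared_norm / (1.0 + squared_norm)       # x^2 / (1 + x^2)
    return scale * s / tf.sqrt(squared_norm + eps)    # rescale the unit vector s / x

v = squash(tf.constant([[0.1, 0.2], [3.0, 4.0]]))
print(tf.norm(v, axis=-1))  # roughly 0.05 and 0.96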

Wij is a weight matrix between each ui, i ∈ (1, 32 x 6 x 6), in the primary capsules and vj, j ∈ (1, 10), in DigitCaps. Here, $\hat{u}_{j|i} = W_{ij} u_i$ is called the prediction vector and is like a transformed (rotated/translated) input capsule vector, ui. The total input to a capsule, sj, is a weighted sum over all prediction vectors from the capsules in the layer below, $s_j = \sum_i c_{ij} \hat{u}_{j|i}$. These weights, cij, add up to 1 and are called coupling coefficients by Hinton. At the beginning, it is assumed that the log prior probability bij that capsule i should be coupled with parent capsule j is the same for all i, j. The coupling coefficients can then be calculated by the well-known softmax transform:

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$$

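As an illustrative sketch (our own variable names, with shapes assuming the 1152 primary capsules and 10 digit capsules described above), the prediction vectors, coupling coefficients, and the weighted sum could be computed as follows:

import tensorflow as tf

batch = 2
u = tf.random.normal((batch, 32 * 6 * 6, 8))   # primary capsule outputs u_i
W = tf.random.normal((32 * 6 * 6, 10, 8, 16))  # weight matrices W_ij (trainable in practice)
b = tf.zeros((batch, 32 * 6 * 6, 10))          # initial logits b_ij (all equal)

# Prediction vectors u_hat_{j|i} = W_ij u_i -> shape (batch, 1152, 10, 16)
u_hat = tf.einsum('bip,ijpq->bijq', u, W)

# Coupling coefficients c_ij: softmax of b_ij over the parent capsules j (they sum to 1)
c = tf.nn.softmax(b, axis=-1)

# Total input to digit capsule j: s_j = sum_i c_ij * u_hat_{j|i} -> (batch, 10, 16)
s = tf.einsum('bij,bijq->bjq', c, u_hat)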
These coupling coefficients are iteratively updated, along with the weights of the network, by an algorithm called routing-by-agreement. In short, it does the following: if primary capsule i's prediction vector has a large scalar product with the output of a possible parent j, the routing logit bij is increased for that parent and decreased for the other parents, which in turn increases the coupling coefficient cij.

The full routing algorithm is given here: 

The left side of the figure shows how all the primary capsules are connected to a digit capsule with the weight matrices Wij. It also depicts how the coupling coefficients are calculated and how the sixteen-dimensional output of DigitCaps is computed by the non-linear squashing function. On the right side, suppose the primary capsules capture two basic shapes in the input image: a triangle and a rectangle. Aligning them after rotation gives either a house or a sailboat, depending on the amount of rotation applied. It is clear that, with minimal or almost no rotation, the two objects combine to form a sailboat; that is, the two primary capsules are more aligned to form a boat than a house. The routing algorithm should therefore increase the routing logit bi,boat for the boat capsule, which in turn increases the coupling coefficient ci,boat.

procedure routing (û_j|i, r, l):
    for all capsule i in layer l and capsule j in layer (l + 1): b_ij <- 0
    for r iterations do:
        for all capsule i in layer l: c_i <- softmax (b_i)
        for all capsule j in layer (l + 1): s_j <- Σ_i c_ij û_j|i
        for all capsule j in layer (l + 1): v_j <- squash (s_j)
        for all capsule i in layer l and capsule j in layer (l + 1): b_ij <- b_ij + û_j|i · v_j
    return v_j
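A compact, runnable sketch of this routing loop in TensorFlow (our own code, not the reference implementation; the shapes assume the 1152 primary capsules and 10 sixteen-dimensional digit capsules described earlier) might look like this:

import tensorflow as tf

def squash(s, axis=-1, eps=1e-7):
    squared_norm = tf.reduce_sum(tf.square(s), axis=axis, keepdims=True)
    return squared_norm / (1.0 + squared_norm) * s / tf.sqrt(squared_norm + eps)

def routing_by_agreement(u_hat, num_iterations=3):
    """u_hat: prediction vectors u_hat_{j|i}, shape (batch, 1152, 10, 16)."""
    b = tf.zeros_like(u_hat[..., 0])                  # b_ij <- 0, shape (batch, 1152, 10)
    for _ in range(num_iterations):                   # r iterations
        c = tf.nn.softmax(b, axis=2)                  # c_i <- softmax(b_i), over parents j
        s = tf.einsum('bij,bijq->bjq', c, u_hat)      # s_j <- sum_i c_ij u_hat_{j|i}
        v = squash(s)                                 # v_j <- squash(s_j)
        b = b + tf.einsum('bijq,bjq->bij', u_hat, v)  # b_ij <- b_ij + u_hat_{j|i} . v_j
    return v                                          # DigitCaps outputs, (batch, 10, 16)

# Example usage with random prediction vectors
u_hat = tf.random.normal((2, 32 * 6 * 6, 10, 16))
print(routing_by_agreement(u_hat).shape)              # (2, 10, 16)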

Finally, we need a proper loss function to train this network. Here, a margin loss for digit existence is used. It also handles the case of overlapping digits: a separate margin loss, Lk, is used for each digit capsule, k, to allow the detection of multiple overlapping digits. Lk looks at the length of the capsule vector; for a digit of class k, the kth capsule vector should be the longest one:

$$L_k = T_k \max(0,\, m^+ - \|v_k\|)^2 + \lambda\, (1 - T_k)\, \max(0,\, \|v_k\| - m^-)^2$$

Here, Tk = 1 if the kth digit is present, m+ = 0.9, and m- = 0.1. The λ is used for down-weighting the loss for absent digit classes. Along with the Lks, an image reconstruction error is used as a regularizer for the network. As depicted in the CapsNet architecture, the output of the digit capsules is fed into a decoder consisting of three fully connected layers. The sum of squared differences between the outputs of the logistic units and the original image pixel intensities is minimized. The reconstruction loss is scaled down by a factor of 0.0005 so that it does not dominate the margin loss during training.
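A minimal sketch of this margin loss (our own code and variable names; λ = 0.5 is the value used in the original CapsNet paper) could be:

import tensorflow as tf

def margin_loss(v_lengths, labels, m_plus=0.9, m_minus=0.1, lam=0.5):
    """v_lengths: lengths of the 10 digit capsule vectors, shape (batch, 10).
    labels: one-hot targets T_k, shape (batch, 10)."""
    present = labels * tf.square(tf.maximum(0.0, m_plus - v_lengths))
    absent = lam * (1.0 - labels) * tf.square(tf.maximum(0.0, v_lengths - m_minus))
    return tf.reduce_sum(present + absent, axis=-1)   # total loss per example = sum_k L_k

# Example usage: capsule lengths taken from a DigitCaps output v of shape (batch, 10, 16)
v = tf.random.normal((2, 10, 16))
lengths = tf.norm(v, axis=-1)
labels = tf.one_hot([3, 7], depth=10)
print(margin_loss(lengths, labels))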

The TensorFlow implementation of CapsNet is available here: https://github.com/naturomics/CapsNet-Tensorflow