Learning to beat the previous MNIST state-of-the-art results with Capsule Networks

Capsule Networks (or CapsNets) are a very recent and innovative type of deep learning network. The technique was introduced at the end of October 2017 in a seminal paper titled Dynamic Routing Between Capsules by Sara Sabour, Nicholas Frosst, and Geoffrey Hinton (https://arxiv.org/abs/1710.09829). Hinton is one of the fathers of Deep Learning and, therefore, the whole Deep Learning community is excited to see the progress made with capsules. Indeed, CapsNets are already beating the best CNNs at MNIST classification, which is... well, impressive!

So what is the problem with CNNs? In a CNN, each layer understands an image at a progressively higher level of abstraction. As we discussed in multiple recipes, the first layer will most likely recognize straight lines, simple curves, and edges, while subsequent layers will start to understand more complex shapes such as rectangles and, eventually, complex forms such as human faces.

Now, one critical operation used in CNNs is pooling. Pooling aims to create positional invariance and is generally used after each convolutional layer to keep the problem computationally tractable. However, pooling introduces a significant problem because it forces us to lose positional information. This is not good. Think about a face: it consists of two eyes, a mouth, and a nose, and what is important is the spatial relationship between these parts (the mouth is below the nose, which is typically below the eyes). Indeed, Hinton said:

The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.
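
To make this concrete, here is a tiny NumPy sketch (an illustration of the argument, not code from the paper): two images whose bright pixels sit at different positions inside each pooling window produce exactly the same max-pooled output, so downstream layers cannot tell them apart.

import numpy as np

# Two 4x4 "images": the bright pixel sits at a different position inside each 2x2 window
a = np.array([[9, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 9]])
b = np.array([[0, 0, 0, 0],
              [0, 9, 0, 0],
              [0, 0, 9, 0],
              [0, 0, 0, 0]])

def max_pool_2x2(x):
    # Max-pooling with a 2x2 window and stride 2
    return x.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(max_pool_2x2(a))  # [[9 0] [0 9]]
print(max_pool_2x2(b))  # [[9 0] [0 9]] (identical: the exact positions are gone)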

Technically, we do not need positional invariance; instead, we need equivariance. Equivariance is a fancy term indicating that we want the network to detect a rotation or a change of proportion in an image and to adapt its representation accordingly. In this way, the spatial positioning of the different components in an image is not lost.

So what is new with Capsule Networks? According to the authors, our brain has modules called capsules, and each capsule specializes in handling a particular type of information. In particular, there are capsules that work well for understanding the concept of position, the concept of size, the concept of orientation, the concept of deformation, textures, and so on. In addition to that, the authors suggest that our brain has particularly efficient mechanisms for dynamically routing each piece of information to the capsule that is considered best suited to handle it.

So, the main difference between a CNN and a CapsNet is that with a CNN you keep adding layers to create a deep network, while with a CapsNet you nest neural layers inside one another. A capsule is a group of neurons and introduces more structure into the net; it produces a vector to signal the existence of an entity in the image. In particular, Hinton uses the length of the activity vector to represent the probability that the entity exists, and its orientation to represent the instantiation parameters. For each possible parent, a capsule produces an additional prediction vector, and when multiple predictions agree, a higher-level capsule becomes active.
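
As a rough illustration of these two ideas (the dimensions and variable names below are assumptions chosen to match the 8D primary capsules and 16D digit capsules described later in this recipe), a lower-level capsule multiplies its activity vector by a learned transformation matrix to get one prediction vector per possible parent, and the length of an activity vector is read as a probability:

import numpy as np

u_i = np.random.randn(8)            # activity vector of one lower-level capsule (8D)
W_i = np.random.randn(10, 16, 8)    # one learned transformation matrix per possible parent

# Prediction vectors: what capsule i predicts each of the 10 parent capsules should output
u_hat = np.einsum('jkl,l->jk', W_i, u_i)    # shape (10, 16), one 16D prediction per parent

# The length of an activity vector is read as the probability that the entity is present;
# the squashing function (equation 1, below) keeps this length in the [0, 1) range
print(np.linalg.norm(u_i))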

Now a second innovation comes into place: dynamic routing across capsules replaces the raw idea of pooling. A lower-level capsule prefers to send its output to the higher-level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule. The parent with the largest scalar product between prediction and activity vector increases its bond with the capsule; all the other parents decrease their bond. In other words, the idea is that if a higher-level capsule agrees with a lower-level one, it will ask to receive more information of that type; if there is no agreement, it will ask to receive less of it. This dynamic routing by agreement is superior to current mechanisms such as max-pooling and, according to Hinton, routing is ultimately a way to parse the image. Indeed, max-pooling ignores anything but the largest value, while dynamic routing selectively propagates information according to the agreement between lower layers and upper layers.
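
The following minimal NumPy sketch shows how this routing loop can be expressed (the shapes, the number of iterations, and the helper names are illustrative assumptions, not the authors' code; the squash nonlinearity it uses is equation 1, introduced next):

import numpy as np

def squash(s, eps=1e-8):
    # Equation 1: shrink short vectors towards 0 and long vectors towards unit length
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def routing_by_agreement(u_hat, num_iterations=3):
    # u_hat: prediction vectors with shape (num_lower_capsules, num_parents, parent_dim)
    num_lower, num_parents, _ = u_hat.shape
    b = np.zeros((num_lower, num_parents))                     # routing logits, start neutral
    for _ in range(num_iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # coupling coefficients (softmax)
        s = np.einsum('ij,ijk->jk', c, u_hat)                  # weighted sum of predictions
        v = squash(s)                                          # parent activity vectors
        b += np.einsum('ijk,jk->ij', u_hat, v)                 # raise the bond where they agree
    return v

# For example, 1152 primary capsules routing to 10 digit capsules of dimension 16
v = routing_by_agreement(np.random.randn(1152, 10, 16))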

A third difference is that a new nonlinear activation function has been introduced. Instead of applying a nonlinearity to each individual neuron, as in a CNN, a CapsNet applies it to the output vector of each capsule. The nonlinear activation function is represented in the following figure and is called the squashing function (equation 1):

Squashing function as seen in Hinton's seminal paper
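
Written out (this is equation 1 from the paper, shown in the figure above), the squashing function is:

v_j = \frac{\lVert s_j \rVert^{2}}{1 + \lVert s_j \rVert^{2}} \, \frac{s_j}{\lVert s_j \rVert}

where s_j is the total input to capsule j and v_j is its vector output: the first factor shrinks short vectors to almost zero and long vectors to a length slightly below 1, so the length can be read as a probability, while the second factor preserves the orientation of s_j.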

Moreover, Hinton and others show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional network at recognizing highly overlapping digits.

The paper Dynamic Routing Between Capsules shows us a simple CapsNet architecture:

A simple CapsNet architecture

The architecture is shallow, with only two convolutional layers and one fully connected layer. Conv1 has 256 9 × 9 convolution kernels with a stride of 1 and ReLU activation. The role of this layer is to convert pixel intensities to the activities of local feature detectors that are then used as input to the primary capsules. PrimaryCapsules is a convolutional capsule layer with 32 channels; each primary capsule contains eight convolutional units with a 9 × 9 kernel and a stride of 2. In total, PrimaryCapsules has [32, 6, 6] capsule outputs (each output is an 8D vector), and the capsules in the [6, 6] grid share their weights with each other. The final layer (DigitCaps) has one 16D capsule per digit class, and each of these capsules receives input from all the capsules in the layer below. Routing happens only between two consecutive capsule layers (such as PrimaryCapsules and DigitCaps).
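
To get a feel for the shapes involved, here is a rough sketch of the first two layers using tf.keras (an illustrative assumption based on the dimensions above, not the authors' implementation):

import tensorflow as tf

inputs = tf.keras.Input(shape=(28, 28, 1))     # MNIST images

# Conv1: 256 feature maps, 9 x 9 kernels, stride 1, ReLU -> output shape (20, 20, 256)
conv1 = tf.keras.layers.Conv2D(256, kernel_size=9, strides=1, activation='relu')(inputs)

# PrimaryCapsules: 32 capsule channels x 8 convolutional units, 9 x 9 kernels, stride 2
# -> output shape (6, 6, 256), reshaped into 32 * 6 * 6 = 1152 capsules of dimension 8
primary = tf.keras.layers.Conv2D(32 * 8, kernel_size=9, strides=2)(conv1)
primary = tf.keras.layers.Reshape((32 * 6 * 6, 8))(primary)

# Each 8D vector would then be squashed, and routing by agreement (sketched earlier)
# would connect these 1152 primary capsules to the 10 x 16D DigitCaps layer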
