Whether you're new to deep learning or a seasoned researcher, you've surely encountered the term Convolutional Neural Networks (CNNs). They are among the most researched and best-performing architectures in the field. That said, CNNs have a few drawbacks when it comes to recognizing features of input data in different orientations. To address this problem, Geoffrey E. Hinton, together with Sara Sabour and Nicholas Frosst, came up with a new type of neural network: the Capsule Network.

In this article we’ll discuss the following topics to give an introduction to capsule networks:

  • Convolutional Neural Networks and the Orientation Problem
  • The Problem with Pooling in CNNs
  • How Capsule Networks Work
  • CapsNet Architecture
  • Final Notes


Convolutional Neural Networks and the Orientation Problem

Convolutional neural networks are one of the most popular deep learning architectures, widely used for computer vision applications. From image classification to object detection and segmentation, CNNs have achieved state-of-the-art results. That said, these networks come with their own complications. Let's start with their origins, then see how they're currently performing.

Yann LeCun proposed LeNet-5, one of the earliest practical CNNs, in 1998. With this small convolutional neural network, trained on the 60,000 examples of the MNIST dataset, he was able to detect handwritten digits. The idea is simple: train the network, identify the features in the images, and classify them. Then in 2019, EfficientNet-B7 achieved state-of-the-art performance in classifying images on the ImageNet dataset. Trained on over 1.2 million images, the network can identify the label of a given picture across 1,000 classes with 84.4% top-1 accuracy.


Looking at these results and progress, we can infer that convolutional approaches make it possible to learn many sophisticated features of the data with simple computations. By performing many matrix multiplications and summations on our input, we can arrive at a reliable answer to our question.

But CNNs are not perfect. When a CNN is fed images in sizes and orientations different from those it was trained on, it tends to fail.

Let me run through an example. Say you rotate a face upside down and then feed it to a CNN; it would not be able to identify features like the eyes, nose, or mouth. Similarly, if you rearrange specific regions of the face (say, swap the positions of the nose and eyes), the network will still recognize it as a face, even though it isn't really a face anymore. In short, CNNs learn the statistical patterns of images, but not how the parts of an object relate to one another spatially under different orientations.

Image Left: Pug: 0.8. Image Right: Pug: 0.2

The Problem with Pooling in CNNs

To understand pooling, one must first know how a CNN works. Convolutional layers are the building blocks of CNNs; they are responsible for identifying features in a given image, like curves, edges, sharpness, and color. At the end of the network, fully connected layers combine these very high-level features and produce classification predictions. You can see below what a basic convolutional network looks like.

Most CNN architectures use max or average pooling operations, or successive strided convolutional layers, throughout the network. The pooling operation discards information deemed unnecessary. Using this design, we can reduce the spatial size of the data flowing through the network, and thus increase the "field of view" of the neurons in higher layers, allowing them to detect higher-order features in a broader region of the input image. This is how max-pooling operations help CNNs achieve state-of-the-art performance. But we should not be fooled by that performance; CNNs work better than any model before them, yet max-pooling is nevertheless losing valuable information, in particular the exact position of each feature within the pooling window, as the sketch below shows.
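To make the information loss concrete, here is a minimal NumPy sketch; the array values are made up for the example. Max pooling keeps the strongest activation in each window but forgets where in the window it occurred:

```python
import numpy as np

# A toy 4x4 feature map: the strong activation (9) sits in the top-left window.
feature_map = np.array([
    [1, 9, 2, 0],
    [3, 4, 1, 1],
    [0, 2, 5, 1],
    [1, 0, 2, 3],
])

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: keeps the strongest activation in each
    window, discarding where in the window it occurred."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

print(max_pool_2x2(feature_map))
# [[9 2]
#  [2 5]]
# Moving the 9 to any other cell of its 2x2 window yields the same output,
# so the precise position of the feature is lost after pooling.
```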

Max pooling operation extracting higher-level features

Geoffrey Hinton stated in one of his lectures that,

"The pooling operation used in convolutional neural networks is a big mistake, and the fact that it works so well is a disaster!"

Now let’s see how Capsule Networks overcome this problem.

How Capsule Networks Work

To overcome the problem of rotational relationships in images, Hinton and Sabour drew inspiration from neuroscience. The hypothesis is that the brain is organized into modules, which they called capsules. With this in mind, they proposed capsule networks that use a dynamic routing algorithm, where capsules estimate the features of objects such as pose (position, size, orientation), deformation, velocity, albedo, hue, and texture. This research came out in 2017, in the paper titled Dynamic Routing Between Capsules.

Now let’s dig deeper into capsules and dynamic routing.

What are capsules in a Capsule Network? Capsules represent the various features of a particular entity present in an image. Unlike normal neurons, capsules perform some quite complicated internal computations on their inputs and then encapsulate the results into a small vector of highly informative outputs. They learn to recognize an implicitly defined visual entity over a limited domain of viewing conditions, such as precise pose, lighting, and deformation.


A capsule can be considered a replacement for an artificial neuron: a capsule deals with vectors, whereas an artificial neuron deals with scalars. If we recall the process an artificial neuron goes through, we can summarize it as follows:

1. Multiply the input scalars by the scalar weights of the connections between the neurons.
2. Compute the weighted sum of the input scalars.
3. Apply an activation function (scalar nonlinearity) to get the output.

A capsule, on the other hand, goes through the same steps, with an additional one to achieve an affine transformation (one preserving collinearity and ratios of distances) of the input. Here the process flow is as follows (a code sketch contrasting the two appears after the image below):

1. Multiply the input vectors with weight matrices (which encode spatial relationships between low-level features and high-level features) (matrix multiplication).
2. Multiply the above resultant outputs with weights.
3. Compute the weighted sum of the input vectors.
4. Apply an activation function (vector non-linearity) to get the output.

Image: Capsule vs Artificial Neuron (https://github.com/naturomics/CapsNet-Tensorflow/)
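Here is a minimal NumPy sketch of both process flows. The 8D inputs and 16D outputs mirror the dimensions used in the CapsNet paper, but the values are made up, and the routing coefficients are fixed by hand; in a real network they come from dynamic routing, discussed below:

```python
import numpy as np

def squash(s, eps=1e-9):
    """Vector non-linearity (explained in detail below): shrinks the length
    of s into [0, 1) while keeping its direction."""
    norm_sq = np.sum(s ** 2)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

# --- Classic artificial neuron: scalars in, scalar out ---
x = np.array([0.2, 0.5, 0.1])                 # input scalars
w = np.array([0.4, 0.3, 0.9])                 # scalar weights
neuron_out = np.maximum(0.0, np.dot(w, x))    # weighted sum + ReLU
print(neuron_out)                             # a single scalar

# --- Capsule: vectors in, vector out ---
u1, u2 = np.random.randn(8), np.random.randn(8)          # 8D inputs from lower capsules
W1, W2 = np.random.randn(16, 8), np.random.randn(16, 8)  # affine transformation matrices
u_hat1, u_hat2 = W1 @ u1, W2 @ u2    # step 1: matrix multiplication
c1, c2 = 0.5, 0.5                    # step 2: routing weights (hand-set here)
s = c1 * u_hat1 + c2 * u_hat2        # steps 2-3: weight and sum the vectors
v = squash(s)                        # step 4: vector non-linearity
print(v.shape, np.linalg.norm(v))    # a 16D vector with length < 1
```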

A detailed analysis of what each step does will give a better intuition of how a capsule works:

  • Multiply the input vectors with weight matrices (Affine transformation)

The input vectors represent either the actual input or the output sent up by the capsule layer below. These vectors are then multiplied by weight matrices. The length of an input vector encodes the probability that a feature has been detected, and its direction encodes the feature's internal state (its orientation). The weight matrix, as described previously, captures the spatial relationship between a low-level feature and a high-level one; it might encode, say, that one object is centered on another and that the two are proportional in size. The product of an input vector and a weight matrix is therefore a prediction about a high-level feature. For example, if the low-level features are the nose, mouth, left eye, and right eye, and the predictions of all four point to the same orientation and state of a face, then it is likely a face. This is what the high-level feature is.

Img: Prediction of a face (https://pechyonkin.me/capsules-2/)

  • Multiply the above resultant outputs with weights

In this step, the outputs obtained from the previous step are multiplied by the weights of the network. What are these weights? In a usual artificial neural network, the weights are adjusted by backpropagation, based on the error rate. That mechanism isn't applied to these weights in a Capsule Network. Instead, dynamic routing governs their modification: it defines the strategy for assigning weights to the connections between capsules.

A Capsule Network adjusts these weights so that a low-level capsule becomes strongly associated with the high-level capsules it is "close" to. The proximity measure builds on the affine transformation step discussed before. Each high-level capsule already receives a set of predictions from lower-level capsules, and similar predictions lie near each other, forming dense clusters. For a new prediction, the network computes the distance between the output of the affine transformation step and each of these clusters. The high-level capsule whose cluster lies at the minimum distance from the new prediction receives a higher weight, and the remaining capsules are assigned lower weights according to the distance metric.

Img: Dynamic Routing in a Capsule Network

In the above image, the weights would be assigned in the following order: middle > left > right. In a nutshell, the essence of the dynamic routing algorithm can be put as follows:

The lower-level capsule will send its input to the higher-level capsule that “agrees” with its input.

  • Compute the weighted sum of the input vectors

This step simply sums all the weighted output vectors obtained from the previous step.

  • Apply an activation function (vector non-linearity) to get the output

In a Capsule Network, the vector non-linearity is obtained by "squashing" the output vector (squashing is the activation function of a Capsule Network) so that its length lies between 0 and 1 while its direction is preserved. The non-linearity function is given by:

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \frac{s_j}{\|s_j\|}$$

s_j is the output obtained from the previous step, and v_j is the output obtained after applying the non-linearity.

The first factor performs the squashing: it scales the length of s_j into the range [0, 1), shrinking short vectors toward zero and saturating long vectors just below 1.

The second factor performs unit scaling: it divides s_j by its length, so the direction of the output vector is preserved.
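The equation translates directly into code. A small NumPy sketch showing both factors and their effect on short and long vectors (the input vectors are made up for the example):

```python
import numpy as np

def squash(s, eps=1e-9):
    """v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)"""
    norm = np.linalg.norm(s)
    scale = norm ** 2 / (1.0 + norm ** 2)  # first factor: squashes length into [0, 1)
    return scale * s / (norm + eps)        # second factor: unit-scales the direction

short = squash(np.array([0.1, 0.0]))   # length ~0.0099: shrunk almost to zero
long_ = squash(np.array([10.0, 0.0]))  # length ~0.9901: saturates just below 1
print(np.linalg.norm(short), np.linalg.norm(long_))
# The direction is preserved in both cases; only the length changes.
```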

On the whole, the dynamic routing algorithm that ties all these steps together for the weight update is as follows:

Line 1: This line defines the ROUTING procedure, which takes the affine-transformed input (û), the number of routing iterations (r), and the layer number (l) as inputs.

Line 2: b_ij is a temporary value, initialized to zero, from which the coupling coefficients c_i are eventually computed.

Line 3: The for loop iterates r times.

Line 4: The softmax function applied to b_i ensures a non-negative c_i whose components sum to 1.

Line 5: For every capsule in the succeeding layer, the weighted sum is computed.

Line 6: For every capsule in the succeeding layer, the weighted sum is squashed.

Line 7: The weights b_ij are updated here by adding the agreement û_{j|i} · v_j, where û_{j|i} denotes the prediction made by low-level capsule i for high-level capsule j, and v_j denotes the output of high-level capsule j.
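Putting lines 1 through 7 together, here is a minimal NumPy sketch of the routing procedure. The shapes follow the CapsNet paper (1152 primary capsules routed to ten 16D digit capsules), while the random predictions stand in for real û_{j|i} values:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    norm_sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def routing(u_hat, r=3):
    """Dynamic routing between two capsule layers.
    u_hat: predictions u_hat_{j|i}, shape (num_low, num_high, dim_high)."""
    num_low, num_high, dim_high = u_hat.shape
    b = np.zeros((num_low, num_high))           # line 2: initialize logits b_ij to 0
    for _ in range(r):                          # line 3: r routing iterations
        c = softmax(b, axis=1)                  # line 4: coupling coefficients c_i
        s = (c[..., None] * u_hat).sum(axis=0)  # line 5: weighted sum per high capsule
        v = squash(s)                           # line 6: squash -> (num_high, dim_high)
        b = b + (u_hat * v[None]).sum(axis=-1)  # line 7: b_ij += u_hat_{j|i} . v_j
    return v

# Toy example: 1152 primary capsules routing to 10 digit capsules (16D each).
u_hat = np.random.randn(1152, 10, 16)
v = routing(u_hat, r=3)
print(v.shape)  # (10, 16)
```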

CapsNet Architecture

The CapsNet architecture is made up of an Encoder and a Decoder, each consisting of a set of 3 layers.

  • The Encoder has a Convolutional layer, a PrimaryCaps layer, and a DigitCaps layer.
  • The Decoder has 3 Fully-Connected layers.

Note: this CapsNet architecture is described with reference to the MNIST dataset.

Let's now look at each of these networks.

-> Encoder Network

Img: CapsNet Encoder Architecture (https://arxiv.org/abs/1710.09829)

The encoder has two convolutional layers and one fully-connected layer. The Conv1 layer has 256 9x9 convolutional kernels with a stride of 1 and ReLU activation. This layer converts the pixel intensities into the activities of local feature detectors, which are then fed to the PrimaryCaps layer. Primary capsules perform inverse graphics, meaning they reverse-engineer the process by which the actual image was generated. The PrimaryCaps layer is a convolutional layer with 32 channels of convolutional 8D capsules (each capsule has 8 convolutional units with a 9x9 kernel and a stride of 2). Each capsule applies eight 9x9x256 kernels to the 20x20x256 input volume, giving a 6x6x8 output tensor; with 32 such 8D capsule channels, the output is of size 6x6x8x32. The DigitCaps layer has one 16D capsule per class, and each of these capsules receives input from all of the low-level capsules.


W_ij is the 8x16 weight matrix used for the affine transformation of each 8D capsule output. The routing mechanism discussed before always operates between two consecutive capsule layers (here, between PrimaryCaps and DigitCaps). A shape walkthrough of the encoder is sketched below.
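Here is that walkthrough as a PyTorch sketch under the dimensions above. It is a sketch only: the squash non-linearity on the primary capsules and the routing to DigitCaps are omitted (see the routing sketch earlier), and a real implementation would also permute the tensor before grouping it into capsules:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)                       # a single 28x28 grayscale digit

conv1 = nn.Conv2d(1, 256, kernel_size=9, stride=1)  # Conv1: 256 9x9 kernels
h = torch.relu(conv1(x))                            # -> (1, 256, 20, 20)

primary = nn.Conv2d(256, 32 * 8, kernel_size=9, stride=2)  # PrimaryCaps: 32 channels of 8D capsules
h = primary(h)                                      # -> (1, 256, 6, 6)
u = h.view(1, 32 * 6 * 6, 8)                        # -> 1152 primary capsules, 8D each
print(u.shape)                                      # torch.Size([1, 1152, 8])

# Each of the 1152 8D capsules is then multiplied by an 8x16 matrix W_ij and
# dynamically routed to the ten 16D capsules of the DigitCaps layer.
```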

For training, a margin loss is calculated for each training example against all the output classes; the total loss is the sum of the losses of all the digit capsules. (An additional reconstruction loss, discussed with the decoder below, is used to encourage the digit capsules to encode the instantiation parameters of the input.) The margin loss for digit capsule k is given by,

$$L_k = T_k \, \max(0,\, m^+ - \|v_k\|)^2 + \lambda \,(1 - T_k)\, \max(0,\, \|v_k\| - m^-)^2$$

where,

T_k = 1 if a digit of class k is present,

m+ = 0.9,

m- = 0.1,

λ = 0.5 (a down-weighting factor for the absent-digit term),

v_k = the vector obtained from the DigitCaps layer for class k.

The first term of the equation represents the loss for the correct DigitCap (its vector should be long when the digit is present), and the second term represents the loss for the incorrect DigitCaps (their vectors should be short).
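The margin loss translates almost directly into code. A minimal NumPy sketch using the constants above; the capsule lengths are made up for the example:

```python
import numpy as np

def margin_loss(v_lengths, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss summed over the digit capsules.
    v_lengths: ||v_k|| for each class, shape (num_classes,)
    targets:   one-hot vector, T_k = 1 if a digit of class k is present."""
    present = targets * np.maximum(0.0, m_plus - v_lengths) ** 2              # first term
    absent = lam * (1 - targets) * np.maximum(0.0, v_lengths - m_minus) ** 2  # second term
    return np.sum(present + absent)

# Toy example: the capsule for the true class (3) is long, the rest are short.
lengths = np.full(10, 0.05)
lengths[3] = 0.95
targets = np.zeros(10)
targets[3] = 1.0
print(margin_loss(lengths, targets))  # 0.0: confident and correct
```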

-> Decoder Network

Img: CapsNet Decoder Architecture (https://arxiv.org/abs/1710.09829)

The decoder takes the 16D vector of the correct digit capsule and decodes it into an image; all the incorrect digit capsules are masked out and not taken into consideration. The reconstruction loss is the squared Euclidean distance between the input image and the reconstructed image. The decoder has three fully-connected layers: the first FC layer has 512 neurons, the second has 1024 neurons, and the third has 784 neurons (producing the 28x28 reconstructed MNIST image).
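A minimal PyTorch sketch of the decoder under these dimensions. Note that many implementations feed the decoder the masked, flattened outputs of all ten digit capsules (160 values) rather than just the 16D winner; this sketch follows the simpler description above:

```python
import torch
import torch.nn as nn

# Three fully-connected layers that reconstruct the 28x28 digit
# from the 16D activity vector of the correct digit capsule.
decoder = nn.Sequential(
    nn.Linear(16, 512),    # FC 1
    nn.ReLU(),
    nn.Linear(512, 1024),  # FC 2
    nn.ReLU(),
    nn.Linear(1024, 784),  # FC 3: 784 = 28 x 28 pixels
    nn.Sigmoid(),          # pixel intensities in [0, 1]
)

digit_capsule = torch.randn(1, 16)                  # stand-in for the correct DigitCap
reconstruction = decoder(digit_capsule).view(1, 28, 28)

# Reconstruction loss: squared difference between input and reconstructed pixels,
# scaled down during training so it does not dominate the margin loss.
image = torch.rand(1, 28, 28)
recon_loss = ((reconstruction - image) ** 2).sum()
```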


Final Notes

A Capsule Network can be considered an attempt at a more faithful imitation of the human brain. Unlike convolutional neural networks, which do not evaluate the spatial relationships in the given data, Capsule Networks treat the orientation of features in an image as significant, and they examine hierarchical relationships to better identify images. The inverse-graphics mechanism, which our brains arguably use to build a hierarchical representation of an image and match it against what we've learned, is what drives a Capsule Network's remarkable results. Capsule Networks aren't yet computationally efficient, but their accuracy does seem beneficial for tackling real-world scenarios. Dynamic routing is what makes all of this possible: it employs an unusual strategy for updating the weights in a network and thereby avoids the pooling operation. As time passes, Capsule Networks should find their way into various other fields, making machines more human-like.