Autoencoders and Visual Similarity

In this follow-up article, we take a look at another beneficial use of autoencoders. We explore how an autoencoder's encoder can be used as a feature extractor, with the extracted features then compared using cosine similarity in order to find similar images.

2 years ago   •   11 min read

By Oreolorun Olu-Ipinlaye

Ever wondered how image search works, or how social media platforms are able to recommend similar images to those that you often like? In this article, we will be taking a look at another beneficial use of autoencoders, and attempting to explain their utility in computer vision recommendation systems.


We first need to import the relevant packages for the task today:

#  article dependencies
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import torchvision.datasets as Datasets
from torch.utils.data import Dataset, DataLoader
import numpy as np
import matplotlib.pyplot as plt
import cv2
from tqdm.notebook import tqdm
from tqdm import tqdm as tqdm_regular
import seaborn as sns
from torchvision.utils import make_grid
import random
import pandas as pd

We also check the machine for a GPU, and enable Torch to run on CUDA if one is available.

#  configuring device
if torch.cuda.is_available():
  device = torch.device('cuda:0')
  print('Running on the GPU')
else:
  device = torch.device('cpu')
  print('Running on the CPU')

Visual Similarity

In the context of human vision, we are able to compare images by perceiving their shapes and colors, using this information to assess how similar they may be. However, when it comes to computer vision, the features of images have to be extracted first in order to make sense of them. Thereafter, to compare how similar two images may be, their features need to be compared in some way so as to measure similarity in numerical terms.

The Role of Autoencoders

As we know, autoencoders are fantastic at representation learning. In fact, they learn representations well enough to be able to piece together pixels and derive the original image as it was.

Basically, an autoencoder's encoder serves as a feature extractor, with the extracted features then compressed into a vector representation in the bottleneck/code layer. The output of the bottleneck layer in this instance can be taken as the most salient features of an image, holding an encoding of its colors and edges. With this encoding of features, one can then proceed to compare two images in a bid to measure their similarity.

The Cosine Similarity Metric

In order to measure the similarity between the vector representations mentioned in the previous section, we need a metric which is specifically suited to this task. This is where cosine similarity comes in, a metric which measures the likeness of two vectors by comparing the angle between them in a vector space.

Unlike distance measures such as Euclidean distance, which are sensitive to the magnitudes of the vectors being compared, cosine similarity is only concerned with whether both vectors point in the same direction, a property which makes it quite desirable for measuring salient similarities.
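To make the contrast between the two measures concrete, here is a minimal pure-Python sketch. The helper names and the three small vectors are made up for illustration and are not part of the article's pipeline, which uses PyTorch's F.cosine_similarity later on.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# hypothetical feature vectors: v2 points the same way as v1 but is 10x longer
v1 = [1.0, 2.0, 3.0]
v2 = [10.0, 20.0, 30.0]
# v3 has the same length as v1 but points in a different direction
v3 = [3.0, 2.0, 1.0]

print(cosine_similarity(v1, v2), euclidean_distance(v1, v2))
print(cosine_similarity(v1, v3), euclidean_distance(v1, v3))
```

Euclidean distance ranks v3 as the closer vector to v1 (roughly 2.83 against 33.67), while cosine similarity ranks v2 as the better match (about 1.0 against roughly 0.71) because it ignores the difference in length.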

Mathematical formula for cosine similarity.
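For reference, the standard definition of cosine similarity between two n-dimensional vectors A and B can be written as follows. The symbols A, B, n and θ are chosen here for illustration.

```latex
\[
\text{similarity}(A, B) = \cos\theta
= \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```

The score ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction).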

Utilizing Autoencoders for Visual Similarity

In this section, we will train an autoencoder, then proceed to write a function for visual similarity using the autoencoder's encoder as a feature extractor and cosine similarity as the metric to assess similarity.


As is typical of articles in this autoencoder series, we will be using the CIFAR-10 dataset. This dataset contains 32 x 32 pixel images of objects such as frogs, horses, cars etc. The dataset can be loaded using the code cell below.

#  loading training data
training_set = Datasets.CIFAR10(root='./', download=True, train=True)

#  loading validation data
validation_set = Datasets.CIFAR10(root='./', download=True, train=False)

CIFAR-10 images.

Since we are training an autoencoder, which is basically unsupervised, we do not need class labels, meaning we can just extract the images themselves. For visualization's sake, we will also extract one image from each class so as to see how well the autoencoder does in reconstructing images in all classes.

def extract_each_class(dataset):
  """
  This function searches for and returns
  one image per class
  """
  images = []
  ITERATE = True
  j = 0

  while ITERATE:
    for i, label in enumerate(tqdm_regular(dataset.targets)):
      if label==j:
        #  storing the raw image array of the class being searched for
        images.append(dataset.data[i])
        print(f'class {j} found')
        #  moving on to the next class
        j+=1
        if j==10:
          ITERATE = False
          break

  return images

#  extracting training images
training_images = [x for x in training_set.data]

#  extracting validation images
validation_images = [x for x in validation_set.data]

#  extracting test images (one per class)
test_images = extract_each_class(validation_set)

Next, we need to define a PyTorch dataset class so as to be able to use our dataset in training a PyTorch model. This is done in the following code cell.

#  defining dataset class
class CustomCIFAR10(Dataset):
  def __init__(self, data, transforms=None):
    self.data = data
    self.transforms = transforms

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
    image = self.data[idx]

    #  applying preprocessing if any was supplied
    if self.transforms is not None:
      image = self.transforms(image)
    return image

#  creating pytorch datasets
training_data = CustomCIFAR10(training_images, transforms=transforms.Compose([transforms.ToTensor(),
                                                                              transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]))
validation_data = CustomCIFAR10(validation_images, transforms=transforms.Compose([transforms.ToTensor(),
                                                                                  transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]))
test_data = CustomCIFAR10(test_images, transforms=transforms.Compose([transforms.ToTensor(),
                                                                                  transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]))
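Each dataset above applies ToTensor followed by Normalize with a mean and standard deviation of 0.5 per channel. The sketch below mirrors that arithmetic on a few single pixel values in plain Python. The helper names are hypothetical and the snippet only illustrates the value ranges involved, not torchvision's implementation.

```python
def normalize(value, mean=0.5, std=0.5):
    """Per-channel rule used by Normalize: (x - mean) / std."""
    return (value - mean) / std

def denormalize(value, mean=0.5, std=0.5):
    """Inverse of normalize, useful before displaying an image."""
    return value * std + mean

# ToTensor scales 8-bit pixel values from [0, 255] to [0.0, 1.0]
raw_pixels = [0, 64, 128, 255]
scaled = [p / 255 for p in raw_pixels]

# Normalize with mean 0.5 and std 0.5 then maps [0.0, 1.0] to [-1.0, 1.0]
normalized = [normalize(s) for s in scaled]
print(normalized)
```

Any image passed to the trained model later on has to go through the same two steps, since the network only ever sees inputs in the [-1, 1] range.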

Autoencoder Architecture

The autoencoder architecture pictured above is implemented in the code block below and will be used for training purposes. This autoencoder is custom built just for illustration purposes and is specifically tailored to the CIFAR-10 dataset. A bottleneck size of 1000 is used for this particular article instead of 200.

#  defining encoder
class Encoder(nn.Module):
  def __init__(self, in_channels=3, out_channels=16, latent_dim=1000, act_fn=nn.ReLU()):
    super().__init__()

    #  act_fn follows every layer (assumed placement)
    self.net = nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 3, padding=1), # (32, 32)
        act_fn,
        nn.Conv2d(out_channels, out_channels, 3, padding=1), 
        act_fn,
        nn.Conv2d(out_channels, 2*out_channels, 3, padding=1, stride=2), # (16, 16)
        act_fn,
        nn.Conv2d(2*out_channels, 2*out_channels, 3, padding=1),
        act_fn,
        nn.Conv2d(2*out_channels, 4*out_channels, 3, padding=1, stride=2), # (8, 8)
        act_fn,
        nn.Conv2d(4*out_channels, 4*out_channels, 3, padding=1),
        act_fn,
        #  flattening feature maps ahead of the bottleneck
        nn.Flatten(),
        nn.Linear(4*out_channels*8*8, latent_dim),
        act_fn
    )

  def forward(self, x):
    x = x.view(-1, 3, 32, 32)
    output = self.net(x)
    return output

#  defining decoder
class Decoder(nn.Module):
  def __init__(self, in_channels=3, out_channels=16, latent_dim=1000, act_fn=nn.ReLU()):
    super().__init__()

    self.out_channels = out_channels

    self.linear = nn.Sequential(
        nn.Linear(latent_dim, 4*out_channels*8*8),
        act_fn
    )

    self.conv = nn.Sequential(
        nn.ConvTranspose2d(4*out_channels, 4*out_channels, 3, padding=1), # (8, 8)
        act_fn,
        nn.ConvTranspose2d(4*out_channels, 2*out_channels, 3, padding=1, 
                           stride=2, output_padding=1), # (16, 16)
        act_fn,
        nn.ConvTranspose2d(2*out_channels, 2*out_channels, 3, padding=1),
        act_fn,
        nn.ConvTranspose2d(2*out_channels, out_channels, 3, padding=1, 
                           stride=2, output_padding=1), # (32, 32)
        act_fn,
        nn.ConvTranspose2d(out_channels, out_channels, 3, padding=1),
        act_fn,
        nn.ConvTranspose2d(out_channels, in_channels, 3, padding=1)
    )

  def forward(self, x):
    output = self.linear(x)
    output = output.view(-1, 4*self.out_channels, 8, 8)
    output = self.conv(output)
    return output

#  defining autoencoder
class Autoencoder(nn.Module):
  def __init__(self, encoder, decoder):
    super().__init__()
    self.encoder = encoder
    self.encoder.to(device)

    self.decoder = decoder
    self.decoder.to(device)

  def forward(self, x):
    encoded = self.encoder(x)
    decoded = self.decoder(encoded)
    return decoded
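The spatial sizes noted in the comments of the encoder and decoder, (32, 32) to (16, 16) to (8, 8) and back, follow from the usual output-size formulas for convolution and transposed convolution layers with a dilation of 1. The sketch below checks that arithmetic in plain Python. The helper names are hypothetical, and the stride and output_padding sequences are read off the layer definitions above.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output side length of a convolution layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel=3, stride=1, padding=1, output_padding=0):
    """Output side length of a transposed convolution layer (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# encoder: strides of the six convolutions, in order
size = 32
encoder_sizes = []
for stride in (1, 1, 2, 1, 2, 1):
    size = conv_out(size, stride=stride)
    encoder_sizes.append(size)

# number of values entering the bottleneck's linear layer
out_channels = 16
flattened = 4 * out_channels * size * size

# decoder: (stride, output_padding) of the six transposed convolutions, in order
decoder_sizes = []
for stride, output_padding in ((1, 0), (2, 1), (1, 0), (2, 1), (1, 0), (1, 0)):
    size = conv_transpose_out(size, stride=stride, output_padding=output_padding)
    decoder_sizes.append(size)

print(encoder_sizes, flattened, decoder_sizes)
```

With the default of 16 output channels, the 8 x 8 feature maps flatten to 4096 values, which matches the 4*out_channels*8*8 input width of the linear layer feeding the 1000-element bottleneck.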

Convolutional Autoencoder Class

To neatly package model training and utilization into a single object, a convolutional autoencoder class is defined as seen below. This class has utility methods: autoencode, which runs the entire autoencoding process; encode, which runs the encoder and bottleneck and returns a 1000-element vector encoding; and decode, which takes a 1000-element vector as input and attempts to reconstruct an image.

class ConvolutionalAutoencoder():
  def __init__(self, autoencoder):
    self.network = autoencoder
    self.optimizer = torch.optim.Adam(self.network.parameters(), lr=1e-3)

  def train(self, loss_function, epochs, batch_size, 
            training_set, validation_set, test_set):
    #  creating log
    log_dict = {
        'training_loss_per_batch': [],
        'validation_loss_per_batch': [],
        'visualizations': []
    }

    #  defining weight initialization function
    #  (assumed initializer: Xavier-uniform weights, small constant bias)
    def init_weights(module):
      if isinstance(module, nn.Conv2d):
        nn.init.xavier_uniform_(module.weight)
        module.bias.data.fill_(0.01)
      elif isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        module.bias.data.fill_(0.01)

    #  initializing network weights
    self.network.apply(init_weights)

    #  creating dataloaders
    train_loader = DataLoader(training_set, batch_size)
    val_loader = DataLoader(validation_set, batch_size)
    test_loader = DataLoader(test_set, 10)

    #  setting convnet to training mode
    self.network.train()
    self.network.to(device)

    for epoch in range(epochs):
      print(f'Epoch {epoch+1}/{epochs}')
      train_losses = []

      #  TRAINING
      for images in tqdm(train_loader):
        #  zeroing gradients
        self.optimizer.zero_grad()
        #  sending images to device
        images = images.to(device)
        #  reconstructing images
        output = self.network(images)
        #  computing loss
        loss = loss_function(output, images.view(-1, 3, 32, 32))
        #  calculating gradients
        loss.backward()
        #  optimizing weights
        self.optimizer.step()

        # LOGGING
        log_dict['training_loss_per_batch'].append(loss.item())

      #  VALIDATION
      for val_images in tqdm(val_loader):
        with torch.no_grad():
          #  sending validation images to device
          val_images = val_images.to(device)
          #  reconstructing images
          output = self.network(val_images)
          #  computing validation loss
          val_loss = loss_function(output, val_images.view(-1, 3, 32, 32))

        # LOGGING
        log_dict['validation_loss_per_batch'].append(val_loss.item())

      print(f'training_loss: {round(loss.item(), 4)} validation_loss: {round(val_loss.item(), 4)}')

      for test_images in test_loader:
        #  sending test images to device
        test_images = test_images.to(device)
        with torch.no_grad():
          #  reconstructing test images
          reconstructed_imgs = self.network(test_images)
        #  sending reconstructed and images to cpu to allow for visualization
        reconstructed_imgs = reconstructed_imgs.cpu()
        test_images = test_images.cpu()

        #  visualisation (assumed layout: originals in the top row, reconstructions below)
        imgs = torch.stack([test_images.view(-1, 3, 32, 32), reconstructed_imgs], 
                           dim=0).flatten(0, 1)
        grid = make_grid(imgs, nrow=10, normalize=True, padding=1)
        grid = grid.permute(1, 2, 0)
        plt.figure(dpi=170)
        plt.title('Original/Reconstructed')
        plt.imshow(grid)
        plt.axis('off')
        plt.show()
        log_dict['visualizations'].append(grid)

    return log_dict

  def autoencode(self, x):
    return self.network(x)

  def encode(self, x):
    encoder = self.network.encoder
    return encoder(x)

  def decode(self, x):
    decoder = self.network.decoder
    return decoder(x)

With everything set up, the autoencoder can now be trained by instantiating it and calling the train method with parameters as seen below.

#  training model
model = ConvolutionalAutoencoder(Autoencoder(Encoder(), Decoder()))

log_dict = model.train(nn.MSELoss(), epochs=15, batch_size=64, 
                       training_set=training_data, validation_set=validation_data,
                       test_set=test_data)

After the first epoch, we can see that the autoencoder has begun to learn representations strong enough to put together input images, albeit without much detail.

Epoch 1.

However, by the 15th epoch the autoencoder has begun to put together input images in more detail, with accurate colors and better form.

Epoch 15.

Looking at the training and validation loss plots, both are trending downward, so the model would in fact benefit from more epochs of training. However, for this article, training for 15 epochs is deemed sufficient.
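The log returned by the train method stores one loss value per batch under 'training_loss_per_batch' and 'validation_loss_per_batch', so a trend is easier to read once the values are averaged per epoch. The following is a minimal sketch of that reduction. The mean_per_epoch helper and the loss values are made up for illustration, and it assumes every epoch logged the same number of batches.

```python
def mean_per_epoch(losses_per_batch, epochs):
    """Average a flat list of per-batch losses into one value per epoch.

    Assumes every epoch logged the same number of batches.
    """
    batches_per_epoch = len(losses_per_batch) // epochs
    averages = []
    for epoch in range(epochs):
        start = epoch * batches_per_epoch
        chunk = losses_per_batch[start:start + batches_per_epoch]
        averages.append(sum(chunk) / len(chunk))
    return averages

# made-up losses for 3 epochs of 4 batches each, for illustration only
dummy_losses = [0.9, 0.8, 0.7, 0.6, 0.5, 0.5, 0.4, 0.4, 0.3, 0.3, 0.3, 0.3]
epoch_means = mean_per_epoch(dummy_losses, epochs=3)
print(epoch_means)
```

With the real log, one would pass log_dict['training_loss_per_batch'] and the number of epochs trained, then plot the resulting list with matplotlib.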


Writing a Visual Similarity Function

Now that an autoencoder has been trained to reconstruct images of all 10 classes in the CIFAR-10 dataset, we can proceed to use the autoencoder's encoder as a feature extractor for any set of images and then compare the extracted features using cosine similarity.

In our case, let's write a function capable of receiving any image as input after which it looks through a set of images (we will be using the validation set for this purpose) for similar images. The function is defined below as described; care must be taken to preprocess the input image just as training images were preprocessed since this is what the model expects.

def visual_similarity(filepath, model, dataset, features):
  """
  This function replicates the visual similarity process
  as defined previously.
  """
  #  reading image
  image = cv2.imread(filepath)
  image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
  image = cv2.resize(image, (32, 32))

  #  converting image to tensor/preprocessing image
  my_transforms = transforms.Compose([transforms.ToTensor(),
                                      transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
  image = my_transforms(image)

  #  encoding image
  image = image.to(device)
  with torch.no_grad():
    image_encoding = model.encode(image)

  #  computing similarity scores
  similarity_scores = [F.cosine_similarity(image_encoding, x) for x in features]
  similarity_scores = [x.cpu().detach().item() for x in similarity_scores]
  similarity_scores = [round(x, 3) for x in similarity_scores]

  #  creating pandas series
  scores = pd.Series(similarity_scores)
  scores = scores.sort_values(ascending=False)

  #  deriving the most similar image
  idx = scores.index[0]
  #  moving the query image back to the cpu so both tensors share a device
  most_similar = [image.cpu(), dataset[idx]]

  #  visualization
  grid = make_grid(most_similar, normalize=True, padding=1)
  grid = grid.permute(1,2,0)
  plt.figure(dpi=100)
  plt.title('uploaded/most_similar')
  plt.axis('off')
  plt.imshow(grid)

  print(f'similarity score = {scores[idx]}')

Since we are going to be comparing the uploaded image to images in the validation set, we can save time by extracting features from all 10,000 validation images before using the function. This step could just as well have been written into the similarity function, but it would then be repeated on every call at the expense of compute time. This is done below.

#  extracting features from images in the validation set
with torch.no_grad():
  image_features = [model.encode(x.to(device)) for x in tqdm_regular(validation_data)]
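The visual similarity function reports only the single best match. The same ranking idea extends naturally to the k most similar images, as in the pure-Python sketch below. The top_k_similar helper and the toy 3-element vectors are hypothetical stand-ins for the 1000-element encodings, used only to show the ranking logic.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query, features, k=3):
    """Return (index, score) pairs for the k stored vectors most similar to query."""
    scored = [(i, cosine_similarity(query, f)) for i, f in enumerate(features)]
    # highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# toy stand-ins for precomputed encoder outputs
stored = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
query = [1.0, 0.2, 0.0]

ranking = top_k_similar(query, stored, k=2)
print(ranking)
```

Here the second stored vector ranks first because its direction is closest to the query, followed by the first stored vector. The returned indices could then be used to look up the corresponding images in the dataset.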

Computing Similarity

In this section, some images will be supplied to the visual similarity function in a bid to assess the results produced. It should be borne in mind, however, that only images in classes present in the training set will produce reasonable results.

Image 1

Consider the image of a German Shepherd on a white background as seen below. This dog has a predominantly golden coat with a black saddle and is observed to be standing at alert, facing the left.

Upon passing this image to the visual similarity function, a plot of the uploaded image against the most similar image in the validation set is produced. Note that the original image was downsized to 32 x 32 pixels as required by the model.

visual_similarity('image_1.jpg', model=model, 
                  dataset=validation_data, features=image_features)

From the result, a white background image of a seemingly dark coat dog standing at alert facing the left is returned with a similarity score of 92.2%. In this case, the model essentially finds an image which matches most of the details of the original which is exactly what we want.

Image 2

The image below is that of a generally brownish-looking frog in a prone position, facing right, on a white background. Again, passing the image through our visual similarity function produces a plot of the uploaded image against its most similar image.

visual_similarity('image_2.jpg', model=model, 
                  dataset=validation_data, features=image_features)

From the resulting plot, a somewhat gray looking frog in a similar position (prone) to our uploaded image is returned with a similarity score of about 91%. Notice that the image is also depicted on a white background.

Image 3

Lastly, below we have an image of another frog. This frog is of greenish coloration in a similarly prone position to the frog in the previous image but with distinctions of facing the leftward direction and being depicted on a textured background (sand in this case).

visual_similarity('image_3.jpg', model=model, 
                  dataset=validation_data, features=image_features)

Just like in the previous two sections, when the image is supplied to the visual similarity function a plot of the original image and the most similar image found in the validation set is returned. The most similar image in this case is that of a brownish looking frog in a prone position, facing the leftward direction, depicted on a textured background as well. A similarity score of approximately 90% is returned.

From the images used as examples in this section it can be seen that the visual similarity function works as it should. However, with more epochs of training or perhaps a better architecture, there is a possibility that better similarity recommendations will be made beyond the first few most similar images.

Final Remarks

In this article, we were able to look at another beneficial use of autoencoders, this time as a tool for visual similarity recommendation. Here we explored how an autoencoder's encoder can be used as a feature extractor with the extracted features then compared using cosine similarity in order to find similar images.

Basically, all the autoencoder does in this instance is extract features. Indeed, if you are conversant with convolutional neural networks, you will agree that autoencoders are not the only networks that can serve as feature extractors: networks trained for classification can also be used for feature extraction, which implies their utility for visual similarity tasks as well.
