Dimension Reduction - Autoencoders

This tutorial is part of a seven-part series on Dimension Reduction:
  1. Understanding Dimension Reduction with Principal Component Analysis (PCA)
  2. Diving Deeper into Dimension Reduction with Independent Components Analysis (ICA)
  3. Multi-Dimension Scaling (MDS)
  4. LLE
  5. t-SNE
  6. IsoMap
  7. Autoencoders

(This post assumes you have a working knowledge of neural networks. A notebook with the code is available at the GitHub repo.)

An autoencoder is a neural network whose primary purpose is to learn the underlying manifold, or feature space, of a dataset. It does so by trying to reconstruct its inputs at its outputs. Unlike other non-linear dimension reduction methods, autoencoders do not strive to preserve a single property such as distance (MDS) or topology (LLE). An autoencoder generally consists of two parts: an encoder, which transforms the input to a hidden code, and a decoder, which reconstructs the input from the hidden code. A simple example of an autoencoder is the neural network shown in the diagram below.


One might wonder: "What is the use of autoencoders if the output is the same as the input? How does feature learning or dimension reduction happen if the end result is the same as the input?"
The assumption behind autoencoders is that the transformation input --> hidden --> input forces the network to learn important properties of the dataset. Which properties it learns depends on the restrictions placed on the network.
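To make the input --> hidden --> input idea concrete, here is a minimal NumPy sketch of an untrained one-hidden-layer autoencoder. The layer sizes, the tanh activation, and the names encode and decode are illustrative assumptions, not something taken from this post's BigDL code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 8 input features, 3 hidden units.
input_size, hidden_size = 8, 3

# Randomly initialised (untrained) parameters.
W_enc = rng.normal(scale=0.1, size=(hidden_size, input_size))
b_enc = np.zeros(hidden_size)
W_dec = rng.normal(scale=0.1, size=(input_size, hidden_size))
b_dec = np.zeros(input_size)

def encode(x):
    """Encoder: maps the input to the hidden code."""
    return np.tanh(W_enc @ x + b_enc)

def decode(h):
    """Decoder: maps the hidden code back to input space."""
    return W_dec @ h + b_dec

x = rng.normal(size=input_size)
code = encode(x)      # compressed representation
x_hat = decode(code)  # reconstruction of x
reconstruction_error = np.mean((x - x_hat) ** 2)
```

Training adjusts the weights so that reconstruction_error becomes small; the restrictions discussed next decide what the hidden code ends up capturing.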

Types of AutoEncoders
Let's discuss a few popular types of autoencoders.

  1. Regularized Autoencoders: These autoencoders add regularization terms to their loss functions to achieve desired properties.
    Because the regularizer, rather than the layer width, constrains the network, the size of the hidden code can be greater than the input size.
    1.1 Sparse Autoencoders: A sparse autoencoder adds a sparsity penalty on the hidden layer. The regularization forces the hidden layer to activate only some of the hidden units per data sample. By activation, we mean that the jth hidden unit is activated if its value is close to 1 and deactivated otherwise. The output from a deactivated node to the next layer is zero. This restriction forces the network to condense and store only the important features of the data. The loss function of a sparse autoencoder can be represented as
    L(W, b) = J(W, b) + regularization term
    where J(W, b) is the reconstruction loss. In the figure, the middle layer is the hidden layer; the green and red nodes represent the deactivated and activated nodes respectively.
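The post does not say which regularization term is used, so the sketch below assumes one common choice: an L1 penalty on the hidden activations (a KL-divergence penalty on the average activation is another option). The function name and sparsity_weight are hypothetical.

```python
import numpy as np

def sparse_autoencoder_loss(x, x_hat, hidden, sparsity_weight=1e-3):
    """Reconstruction loss J plus an (assumed) L1 sparsity penalty."""
    reconstruction = np.mean((x - x_hat) ** 2)   # J(W, b)
    sparsity_penalty = np.mean(np.abs(hidden))   # regularization term
    return reconstruction + sparsity_weight * sparsity_penalty

x = np.array([[0.0, 1.0], [1.0, 0.0]])
x_hat = x.copy()  # pretend the reconstruction is perfect

# Two hidden codes for the same inputs: one sparse, one dense.
sparse_code = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
dense_code = np.array([[0.9, 0.8, 1.0], [1.0, 0.7, 0.9]])

loss_sparse = sparse_autoencoder_loss(x, x_hat, sparse_code)
loss_dense = sparse_autoencoder_loss(x, x_hat, dense_code)
```

With identical reconstructions, the code that activates fewer hidden units gets the lower loss, which is the pressure that drives the network toward sparse codes.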

    1.2 Denoising Autoencoders: In denoising autoencoders, random noise is deliberately added to the input and the network is forced to reconstruct the unadulterated input. The decoder function learns to resist small changes in the input. This pretraining results in a robust neural network that is immune to noise in the input up to a certain extent.
    Noise drawn from the standard normal distribution is used to produce the corrupted input.
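A minimal sketch of how the training pairs for a denoising autoencoder could be built. The helper name corrupt and the noise_std scaling factor are assumptions added for illustration.

```python
import numpy as np

def corrupt(x, noise_std=1.0, rng=None):
    """Add standard normal noise, scaled by noise_std, to the input."""
    rng = np.random.default_rng() if rng is None else rng
    return x + noise_std * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
clean = rng.random((4, 6))                      # 4 toy samples, 6 features each
noisy = corrupt(clean, noise_std=0.5, rng=rng)

# The network sees the noisy sample as input and the clean sample as target.
training_pairs = list(zip(noisy, clean))
```

The only change compared with a plain autoencoder is that the input and the target of each pair differ: the loss is still measured against the clean sample.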

    1.3 Contractive Autoencoders: Instead of adding noise to the input, contractive autoencoders add a penalty on large values of the derivative of the feature extraction function f(x). When this derivative is small, insignificant changes in the input produce a negligible change in the features. In contractive autoencoders the feature extraction (encoder) function is robust, while in denoising autoencoders the decoder function is robust.
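As a sketch of the derivative penalty, assume the feature extraction function is a single sigmoid layer, f(x) = sigmoid(Wx + b); the post does not fix a particular form. The penalty is then the squared Frobenius norm of the Jacobian of f, which has a closed form that the code checks against finite differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of the Jacobian of f(x) = sigmoid(W x + b)."""
    h = sigmoid(W @ x + b)
    # d h_j / d x_i = h_j * (1 - h_j) * W[j, i]
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

def numerical_penalty(x, W, b, eps=1e-6):
    """Same quantity via central finite differences, as a sanity check."""
    jac = np.zeros((W.shape[0], x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        jac[:, i] = (sigmoid(W @ (x + step) + b)
                     - sigmoid(W @ (x - step) + b)) / (2 * eps)
    return np.sum(jac ** 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
b = rng.normal(size=3)
x = rng.normal(size=5)

analytic = contractive_penalty(x, W, b)
numeric = numerical_penalty(x, W, b)
```

During training this penalty would be added, with some weight, to the reconstruction loss.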
  2. Variational Autoencoders: Variational autoencoders are based on nonlinear latent variable models. In a latent variable model, we assume that the observables x are generated from hidden variables y, which capture important properties of the data. These autoencoders consist of two neural networks: the first learns the latent variable distribution, and the second generates the observables from a random sample drawn from that distribution. Apart from minimizing the reconstruction loss, these autoencoders also minimize the difference between the assumed distribution of the latent variables and the distribution produced by the encoder. They are highly popular for generating images.
    A good choice for the latent variable distribution is a Gaussian. As shown in the image above, the encoder outputs the parameters of the assumed Gaussian. Next, a random sample is drawn from this Gaussian and the decoder reconstructs the input from the sample.
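The two Gaussian-specific pieces can be sketched as follows, assuming a diagonal Gaussian whose parameters (mean mu and log-variance log_var) come from the encoder and a standard normal as the assumed latent distribution. The function names are hypothetical; the KL formula is the standard closed form for these two Gaussians.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw y ~ N(mu, exp(log_var)) as mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, exp(log_var)) and N(0, I)."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

rng = np.random.default_rng(0)

# Pretend these came out of the encoder for one input.
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -0.5])

y_sample = sample_latent(mu, log_var, rng)  # what the decoder receives
kl = kl_to_standard_normal(mu, log_var)     # added to the reconstruction loss
kl_at_prior = kl_to_standard_normal(np.zeros(2), np.zeros(2))
```

The KL term is zero only when the encoder's distribution matches the assumed one, so minimizing it pulls the two together.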
  3. Undercomplete Autoencoders: In undercomplete autoencoders, the hidden layer is smaller than the input layer. By reducing the hidden layer size, we force the network to learn the important features of the dataset. Once the training phase is over, the decoder is discarded and the encoder is used to transform a data sample to the feature subspace. If the decoder transformation is linear and the loss function is MSE (mean squared error), the feature subspace is the same as that of PCA.
    For the network to learn something useful, the size of the hidden code should not be close to or greater than the input size. Also, a network with very high capacity (deep and highly nonlinear) may not be able to learn anything useful. Dimension reduction methods assume that the dimension of the data is artificially inflated and that its intrinsic dimension is much lower. As we increase the number of layers in an autoencoder, the size of the hidden layer has to decrease. If the hidden layer becomes smaller than the intrinsic dimension of the data, information is lost. Because such a network is deep and highly nonlinear, the decoder can then simply learn to map hidden codes to specific inputs, memorizing the training samples instead of capturing their structure.
[Image: a multilayer encoder and decoder] A simple autoencoder is shown below.

The loss function of the undercomplete autoencoder is given by:
L(x, g(f(x))) = (x - g(f(x)))^2
where f is the encoder and g is the decoder.
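To illustrate this loss together with the earlier remark about PCA, the sketch below builds a linear encoder f and decoder g directly from the top principal directions of a synthetic dataset and evaluates the squared-error loss. The data, the code size k, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples in 10 dimensions lying near a 3-dim subspace.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))
X = X - X.mean(axis=0)  # PCA assumes centred data

k = 3  # size of the hidden code
_, singular_values, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt[:k]  # top-k principal directions, shape (k, 10)

def f(x):
    """Linear encoder: project onto the principal directions."""
    return x @ components.T

def g(h):
    """Linear decoder: map the code back to input space."""
    return h @ components

loss = np.mean((X - g(f(X))) ** 2)

# The error equals the energy in the discarded singular values.
discarded = np.sum(singular_values[k:] ** 2) / X.size
```

A trained linear autoencoder with MSE loss can reach the same reconstruction error, which is the sense in which it spans the same subspace as PCA, although its weights need not equal the principal directions themselves.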

Since this post is about dimension reduction using autoencoders, we will implement an undercomplete autoencoder with PySpark.
There are a few open source deep learning libraries for Spark, e.g. BigDL from Intel, TensorFlowOnSpark from Yahoo, and Spark Deep Learning from Databricks.
We will be using Intel's BigDL.

Step 1. Install BigDL
If you have already installed Spark, run pip install --user bigdl --no-deps; otherwise run pip install --user bigdl. In the latter case, pip will install pyspark along with bigdl.

Step 2. Necessary imports

%matplotlib inline
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow

# some imports from bigdl
from bigdl.nn.layer import *
from bigdl.nn.criterion import *
from bigdl.optim.optimizer import *
from bigdl.util.common import *
from bigdl.dataset.transformer import *
from pyspark import SparkContext

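# initialize the bigdl library
# (assumed local setup: adjust the master URL to match your cluster)
sc = SparkContext.getOrCreate(conf=create_spark_conf().setMaster("local[4]"))
init_engine()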

Step 3. Load and prepare the data

# bigdl provides a nice function for 
# downloading and reading mnist dataset

from bigdl.dataset import mnist
mnist_path = "mnist"
images_train, labels_train = mnist.read_data_sets(mnist_path, "train")

# mean and stddev of the pixel values

mean = np.mean(images_train)
std = np.std(images_train)

# parallelize, center and scale the images_train
rdd_images =  (sc.parallelize(images_train).
                            map(lambda features: (features - mean)/std))

print("total number of images ",rdd_images.count())
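The centering and scaling step can be checked without Spark. The sketch below applies the same transformation in plain NumPy to a random stand-in for images_train; the array shape used here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for images_train: 100 fake 28x28x1 images with values in [0, 255].
fake_images = rng.integers(0, 256, size=(100, 28, 28, 1)).astype(np.float64)

# One scalar mean and stddev over all pixels of all images, as in the post.
mean = np.mean(fake_images)
std = np.std(fake_images)

standardized = (fake_images - mean) / std

# The transformation is invertible, which is handy when displaying outputs.
restored = standardized * std + mean
```

Note that a single global mean and standard deviation are used, not per-pixel statistics, so the relative brightness of pixels within an image is preserved.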

Step 4. Create the function for the model

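# Parameters for training
# (assumed values: the original values are not shown in this post)
NUM_EPOCHS = 2
BATCH_SIZE = 128

# Network Parameters
SIZE_HIDDEN = 32   # assumed size of the hidden code
# shape of the input data: 28 x 28 MNIST images, flattened
SIZE_INPUT = 784

# function for creating an autoencoder
def get_autoencoder(hidden_size, input_size):

    # Initialize a sequential type container
    module = Sequential()

    # create encoder layers
    module.add(Linear(input_size, hidden_size))

    # create decoder layers
    module.add(Linear(hidden_size, input_size))

    return module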

Step 5. Set up the deep learning graph

undercomplete_ae = get_autoencoder( SIZE_HIDDEN, SIZE_INPUT)

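# transform dataset to rdd(Sample) from rdd(ndarray).
# Sample represents a record in the dataset. A sample
# consists of two tensors: a features tensor and a label tensor.
# In our autoencoder the features and the label are the same.
# reshape(-1) flattens each image to a vector for the Linear layer
# (a no-op if the image is already flat).
train_data = rdd_images.map(
    lambda x: Sample.from_ndarray(x.reshape(-1), x.reshape(-1)))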

# Create an Optimizer
optimizer = Optimizer(
    model = undercomplete_ae,
    training_rdd = train_data,
    criterion = MSECriterion(),
    optim_method = Adam(),
    end_trigger = MaxEpoch(NUM_EPOCHS),
    batch_size = BATCH_SIZE)

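# write summary
# (app_name is an assumed value: its definition is not shown in this post)
app_name = 'autoencoder-' + dt.datetime.now().strftime("%Y%m%d-%H%M%S")
train_summary = TrainSummary(log_dir='/tmp/bigdl_summary',
                             app_name=app_name)
optimizer.set_train_summary(train_summary)

print("logs saved to", app_name)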

Step 6. Train the model

# run training process
trained_UAE = optimizer.optimize()
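What optimizer.optimize() does can be sketched at toy scale in plain NumPy. This is not BigDL's implementation: it uses plain mini-batch gradient descent instead of Adam, a bias-free linear encoder and decoder, and synthetic data, all of which are assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 256 samples in 12 dimensions with 3 intrinsic dimensions.
X = rng.normal(size=(256, 3)) @ rng.normal(size=(3, 12))
X = (X - X.mean()) / X.std()  # same global standardization as the post

n, d = X.shape
k = 3  # hidden code size
W_enc = rng.normal(scale=0.1, size=(d, k))
W_dec = rng.normal(scale=0.1, size=(k, d))

learning_rate, batch_size, num_epochs = 0.01, 32, 50
losses = []

for epoch in range(num_epochs):
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        batch = X[order[start:start + batch_size]]
        code = batch @ W_enc   # encoder forward pass
        recon = code @ W_dec   # decoder forward pass
        err = recon - batch
        # Gradient of the batch mean of per-sample squared error.
        grad_out = 2.0 * err / len(batch)
        grad_dec = code.T @ grad_out
        grad_enc = batch.T @ (grad_out @ W_dec.T)
        W_dec -= learning_rate * grad_dec
        W_enc -= learning_rate * grad_enc
    # Track the MSE over the full dataset after each epoch.
    losses.append(np.mean((X @ W_enc @ W_dec - X) ** 2))
```

Each epoch visits every sample once in shuffled mini-batches, which mirrors the roles of MaxEpoch and batch_size in the Optimizer above.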

Step 7. Model performance on test data

# let's check our model performance on the test data

(images, labels) = mnist.read_data_sets(mnist_path, "test")
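# standardize with the training mean and stddev, flatten, wrap in Sample
rdd_test = (sc.parallelize(images)
            .map(lambda features: ((features - mean) / std).reshape(-1))
            .map(lambda features: Sample.from_ndarray(features, features)))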
examples = trained_UAE.predict(rdd_test).take(10)
f, a = plt.subplots(2, 10, figsize=(10, 2))
for i in range(10):
    a[0][i].imshow(np.reshape(images[i], (28, 28)))
    a[1][i].imshow(np.reshape(examples[i], (28, 28)))

As we can see from the image, the reconstructions are very close to the original inputs.
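A visual check can be backed by a number. Because the model is trained on standardized pixels, its outputs are on the standardized scale, so the sketch below undoes the scaling before comparing with the raw images. The helper reconstruction_mse is hypothetical, and random arrays stand in for the MNIST images and the model predictions.

```python
import numpy as np

def reconstruction_mse(originals, reconstructions, mean, std):
    """Per-image MSE in original pixel units.

    reconstructions are assumed to be on the standardized scale.
    """
    restored = np.asarray(reconstructions) * std + mean
    flat_restored = restored.reshape(len(restored), -1)
    flat_originals = np.asarray(originals).reshape(len(flat_restored), -1)
    return np.mean((flat_originals - flat_restored) ** 2, axis=1)

rng = np.random.default_rng(0)

# Stand-ins for raw test images and for model outputs.
fake_images = rng.integers(0, 256, size=(5, 28, 28)).astype(np.float64)
mean, std = fake_images.mean(), fake_images.std()
perfect = (fake_images.reshape(5, -1) - mean) / std
noisy = perfect + 0.1 * rng.standard_normal(perfect.shape)

errors_perfect = reconstruction_mse(fake_images, perfect, mean, std)
errors_noisy = reconstruction_mse(fake_images, noisy, mean, std)
```

With the real model, the ndarrays returned by predict would take the place of the stand-in reconstructions.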
Conclusion: In this post, we discussed how autoencoders can be used for dimension reduction. We began with the different types of autoencoders and their purposes, and then implemented an undercomplete autoencoder using Intel's BigDL and PySpark. For more tutorials on BigDL, visit the bigdl tutorials.

This post concludes our series of posts on dimension reduction.
