This tutorial is from a 7 part series on Dimension Reduction:

  1. Understanding Dimension Reduction with Principal Component Analysis (PCA)
  2. Diving Deeper into Dimension Reduction with Independent Components Analysis (ICA)
  3. Multi-Dimension Scaling (MDS)
  4. LLE (Coming Soon!)
  5. t-SNE (Coming Soon!)
  6. IsoMap (Coming Soon!)
  7. Autoencoders (Coming Soon!)

(A Jupyter Notebook with math and code (python and pyspark) is available on github.)

Multi-Dimension Scaling is a distance-preserving manifold learning method. All manifold learning algorithms assume the dataset lies on a smooth, non linear manifold of low dimension and that a mapping f: RD -> Rd (D>>d) can be found by preserving one or more properties of the higher dimension space. Distance preserving methods assume that a manifold can be defined by the pairwise distances of its points. In distance preserving methods, a low dimensional embedding is obtained from the higher dimension in such a way that pairwise distances between the points remain same. Some distance preserving methods preserve spatial distances (MDS) while some preserve graph distances.

MDS is not a single method but a family of methods. MDS takes a dissimilarity matrix D where Dij represents the dissimilarity between points i and j and produces a mapping on a lower dimension, preserving the dissimilarities as closely as possible. The dissimilarity matrix could be observed or calculated from the given dataset. MDS has been widely popular and developed in the field of human sciences like sociology, anthropology and especially in psychometrics.

Let's understand it better with an example from the MDS book. The table below represents correlations between rates of different types of crimes from the US in 1970 and the image shows it MDS embedding. As the number of variables increases, it becomes harder for a human mind to identify the relationships among the variables.

The relative position of points on the plot depends on the dissimilarity between them in the table of correlations, i.e. crime rates which share high correlation are mapped close to each other, while crime rates which do not share high correlation are mapped farther apart. From the figure we can see that along horizontal dimension crime distribution can be interpreted as "violent crime vs property crime" whereas on vertical dimension distribution of crime can be interpreted as the "street crime vs hidden crime".

MDS can be divided into two categories:

  • Metric MDS - Metric MDS is used for quantitative data and tries to preserve the original dissimilarity metrics. Given a dissimilarity matrix D , a monotone function f, and p(number of dimension in subspace) metric MDS tries to find an optimal configuration X ⊂ Rp s.t. f(Dij) ≈ d ij = (xi − xj)2 . Another version of metric MDS is classical MDS(original MDS) formulation which provides closed form solution. Intead of trying to approximate the dissimilarity metrics in lower dimension, it uses eigenvalue decomposition for in the solution.

  • Non-Metric MDS - Non-metric MDS is used for ordinal data. It tries to keep the order of dissimialrity metrics intact. For example if Pij is dissimilarity between ith & jth and P32 > P89 , then non-metric mds creates a mapping s.t.d32 > d89.

We will implement metric MDS using SMACOF( scaling by majorization of complicated function) algorithm. Before diving into the implementation of metric MDS, we need to learn a bit about MM( Majorization- Minorization) algorithm for optimization.

MM for finding an optimum of a function


MM algorithm is an iterative algorithm for finding optimum of a complex function. Suppose we have a function f(x) for which we need to find a minimum. Instead of directly optimizing f(x) MM uses an approximating function g(x,xm) to find an optimum. If problem is to find a minimum of f(x) then, g(x,xm) is called majorizing function else minorizing function and xm is called support point.
If g(x,xm) is majorizing function of f(x) then, it has to satisfy following conditions:

  1. Optimizing g(x,xm) should be easier than f(x) .
  2. For any x , f(x)g(x,xm)
  3. f(xm)=g(xm,xm)

Steps of MM algorithm:

  1. choose a random support point xm
  2. find xmin = argminx g(x,xm)
  3. if f(xmin)−f(xm)≈ e break (where e is a very small number) else go to step 4
  4. set xm=xmin and go to step 2
    Let's understand the MM algorithm with an example.

    The plot in green is the function for which we need to find the minimum. Each image represents an iteration of the algorithm with first image as initial set up. Our initial pivot point is xm=9.00 , the minimum of g(x,9) is at xmin=6.49. Now if we move to the next image, we see that xm is now 6.49 and we have a new xmin=6.20 based on g(x,6.49). If we move to next iteration xm becomes 6.20 and xmin value changes to 5.84. Continuing to subsequent iterations, we see the minima moves towards the minimum of the green plot(as shown in the image below). The most important part of MM algorithm is to find a good approximating function.

    Now , let's move to SMACOF algorithm for metric MDS. As it was said earlier that metric MDS tries to appoximate the dissimilarity matrix and minimizes the stress function which is given by
    σ(X)=Σij wijij−dij(X))2 , where
    wij is weightage assigned to disimilarity between i and j
    δij is element of the given dissimilarity matrix
    dij(X) is the dissimilarity from the X which we need to find
    We aren't going to delve into derivation of the majorizing function for stress function. If you wish to follow please consult the excellent book (Topic - Majorizing Stress, page -187).

Steps of MDS

  1. Create a disimilarity matrix.
  2. choose a random point Xm as pivot.
  3. Find the minimum of stress majorizing functions.
  4. if σ(Xm) − σ(Xmin) < e break else set Xm=Xmin and go to step 2

Step One - Dissimilarity matrix:
We need a distance metric to calculate the distance between two points in the dataset. Let's go through few popular distance metrics quickly.
Euclidean Metric: This is the most popular metric used in distance measurement. It is equal to the straight line distance between two points.
Manhattan Metric: The distance between two points is measured along right angles to the axes. It is also known as rectilinear distance, taxi-cab metric or city block distance.
Cosine Metric: The cosine distance measures the cosine of the angle between two vectors. Smaller the value of cosine closer will the points.
Mahalanobis Metric: Mahalanobis metric is used to measure a point p's distance to a distribution D. It is particularly useful in detecting outliers.
Hamming Metric: Hamming metric counts the number of places at which entries in two vectors differ. It is an important metric in information theory.
We will be using Euclidean metric as dissimilarity measure.

from sklearn import datasets
import math as ma
import numpy as np
from pyspark.sql import types as t
from pyspark.sql import functions as f

digits = datasets.load_digits(n_class=6)

data = digits.data
# repartitioning the dataframe by id column will speed up the join operation 

df = spark.createDataFrame(sc.parallelize(data.tolist()).zipWithIndex()).toDF("features",
                   "id").repartition("id")
df.cache()

euclidean = lambda x,y:ma.sqrt(np.sum((np.array(x)-np.array(y))**2))
data_bc = sc.broadcast(df.sort("id").select("features").rdd.collect())

# create the distance metric
def pairwise_metric1(y):
    dist = []
    for x in data_bc.value:
        dist += [ma.sqrt(np.sum((np.array(x)-np.array(y))**2))]

    return(dist)

udf_dist1 = f.udf(pairwise_metric1, t.ArrayType(t.DoubleType()))

df = df.withColumn("D", udf_dist1("features"))

Step Two: SCAMOF algorithm:

n,p = data.shape
dim = 2
X = np.random.rand(n,dim)

# randomly initialize a solution for the pivot point.
dfrand = spark.createDataFrame(sc.parallelize(X.tolist()).zipWithIndex()).toDF("X", 
                     "id2").repartition("id2")
df = df.join(dfrand, df.id==dfrand.id2, "inner").drop("id1")

def pairwise_metric2(y):
    dist = []
    for x in X_bc.value:
        dist += [ma.sqrt(np.sum((np.array(x)-np.array(y))**2))]
    return(dist)

# create the matrix B
def B(id,x,y):

    y,x = np.array(y), np.array(x) 
    y[y==0.0] = np.inf
    z = -x/y

    z[id] = -(np.sum(z)-z[id])
    return(z.tolist())

# function for matrix multiplication using outer multiplication
def df_mult(df, col1, col2, n1, n2, matrix=True):

    udf_mult = f.udf(lambda x,y:np.outer(np.array(x), 
                  np.array(y)).flatten().tolist(),
                   t.ArrayType(t.DoubleType()))

    df = df.withColumn("mult", udf_mult(col1, col2))
    df = df.agg(f.array([f.sum(f.col("mult")[i]) 
             for i in range(n1*n2)])).toDF("mult")
    if not matrix:
        return(df)
    st = t.ArrayType(t.StructType(
                [t.StructField("id",t.LongType()),
                 t.StructField("row",t.ArrayType(
                 t.DoubleType()))]))
    udf_arange = (f.udf(lambda x:[(i,j.tolist()) 
                  for i,j in enumerate(np.array(x).
                       reshape(n1,n2)/n1)], st))

    df = (df.withColumn("mult", 
               udf_arange("mult")).select(
               f.explode("mult").alias("mult")))

    df = (df.select(f.col("mult.id").alias("id2"),
                      f.col("mult.row").
                      alias("X_min")).
                      repartition("id2"))
    return(df)



udf_B = f.udf(B, t.ArrayType(t.DoubleType()))
udf_sigma = (f.udf(lambda x,y: float(np.sum((
                 np.array(x)-np.array(y))**2)), 
                 t.DoubleType()))
sigma_old = np.inf
tol = 1e-4
max_iter = 1000


for i in range(max_iter):
    X_bc = sc.broadcast(df.sort("id").select("X").rdd.collect())
    def pairwise_metric2(y):
        dist = []
        for x in X_bc.value:
            dist += [ma.sqrt(np.sum((np.array(x)-np.array(y))**2))]
        return(dist)
    udf_dist2 = f.udf(pairwise_metric2, t.ArrayType(t.DoubleType()))
    df = df.withColumn("di", udf_dist2("X"))

    df = df.withColumn("sigma", udf_sigma("D","di"))
    sigma_new = df.agg({"sigma":"sum"}).collect()[0][0]
    print(sigma_old, sigma_new)
    sigma_old = sigma_new
    df = df.withColumn("B", udf_B("id","D","di")).drop("di")

    X_min = df_mult(df, "B", "X", n, dim)

    df = df.join(X_min, df.id==X_min.id2).select("id", "D", f.col("X_min").alias("X"))
    # cache action will prevent recreation of dataframe from base
    df.cache()

The plot of MDS embedding.

Though the clusters are not clearly distinct they can be identified easily.
Drawbacks of MDS:
MDS requires large computing power for calculating the dissimilarity matrix at every iteration. It is hard to embed the new data in MDS.

Conclusion: Like PCA, MDS is an old method. It has been well researched. It has few extensions like sammon mapping. With this post, we tried to develop an understanding of MDS and its working. We brushed up a few areas related to MDS and implemented a basic MDS in pyspark.
Next article in this series will be on LLE Sneak peek!