This tutorial is from a 7 part series on Dimension Reduction:
 Understanding Dimension Reduction with Principal Component Analysis (PCA)
 Diving Deeper into Dimension Reduction with Independent Components Analysis (ICA)
 MultiDimension Scaling (MDS)
 LLE (Coming Soon!)
 tSNE (Coming Soon!)
 IsoMap (Coming Soon!)
 Autoencoders (Coming Soon!)
(A Jupyter Notebook with math and code (python and pyspark) is available on github.)
MultiDimension Scaling is a distancepreserving manifold learning method. All manifold learning algorithms assume the dataset lies on a smooth, non linear manifold of low dimension and that a mapping f: R^{D} > R^{d} (D>>d) can be found by preserving one or more properties of the higher dimension space. Distance preserving methods assume that a manifold can be defined by the pairwise distances of its points. In distance preserving methods, a low dimensional embedding is obtained from the higher dimension in such a way that pairwise distances between the points remain same. Some distance preserving methods preserve spatial distances (MDS) while some preserve graph distances.
MDS is not a single method but a family of methods. MDS takes a dissimilarity matrix D where D_{ij} represents the dissimilarity between points i and j and produces a mapping on a lower dimension, preserving the dissimilarities as closely as possible. The dissimilarity matrix could be observed or calculated from the given dataset. MDS has been widely popular and developed in the field of human sciences like sociology, anthropology and especially in psychometrics.
Let's understand it better with an example from the MDS book. The table below represents correlations between rates of different types of crimes from the US in 1970 and the image shows it MDS embedding. As the number of variables increases, it becomes harder for a human mind to identify the relationships among the variables.
The relative position of points on the plot depends on the dissimilarity between them in the table of correlations, i.e. crime rates which share high correlation are mapped close to each other, while crime rates which do not share high correlation are mapped farther apart. From the figure we can see that along horizontal dimension crime distribution can be interpreted as "violent crime vs property crime" whereas on vertical dimension distribution of crime can be interpreted as the "street crime vs hidden crime".
MDS can be divided into two categories:

Metric MDS  Metric MDS is used for quantitative data and tries to preserve the original dissimilarity metrics. Given a dissimilarity matrix D , a monotone function f, and p(number of dimension in subspace) metric MDS tries to find an optimal configuration X âŠ‚ R^{p} s.t. f(D_{ij}) â‰ˆ d _{ij} = (x_{i} âˆ’ x_{j})^{2} . Another version of metric MDS is classical MDS(original MDS) formulation which provides closed form solution. Intead of trying to approximate the dissimilarity metrics in lower dimension, it uses eigenvalue decomposition for in the solution.

NonMetric MDS  Nonmetric MDS is used for ordinal data. It tries to keep the order of dissimialrity metrics intact. For example if P_{ij} is dissimilarity between i_{th} & j_{th} and P_{32} > P_{89} , then nonmetric mds creates a mapping s.t.d_{32} > d_{89}.
We will implement metric MDS using SMACOF( scaling by majorization of complicated function) algorithm. Before diving into the implementation of metric MDS, we need to learn a bit about MM( Majorization Minorization) algorithm for optimization.
MM for finding an optimum of a function
MM algorithm is an iterative algorithm for finding optimum of a complex function. Suppose we have a function f(x) for which we need to find a minimum. Instead of directly optimizing f(x) MM uses an approximating function g(x,x_{m}) to find an optimum. If problem is to find a minimum of f(x) then, g(x,x_{m}) is called majorizing function else minorizing function and x_{m} is called support point.
If g(x,x_{m}) is majorizing function of f(x) then, it has to satisfy following conditions:
 Optimizing g(x,x_{m}) should be easier than f(x) .
 For any x , f(x) â‰¤ g(x,x_{m})
 f(x_{m})=g(x_{m},x_{m})
Steps of MM algorithm:
 choose a random support point x_{m}
 find x_{min} = argmin_{x} g(x,x_{m})
 if f(x_{min})âˆ’f(x_{m})â‰ˆ e break (where e is a very small number) else go to step 4
 set x_{m}=x_{min} and go to step 2
Let's understand the MM algorithm with an example.
The plot in green is the function for which we need to find the minimum. Each image represents an iteration of the algorithm with first image as initial set up. Our initial pivot point is x_{m}=9.00 , the minimum of g(x,9) is at x_{min}=6.49. Now if we move to the next image, we see that x_{m} is now 6.49 and we have a new x_{min}=6.20 based on g(x,6.49). If we move to next iteration x_{m} becomes 6.20 and x_{min} value changes to 5.84. Continuing to subsequent iterations, we see the minima moves towards the minimum of the green plot(as shown in the image below). The most important part of MM algorithm is to find a good approximating function.
Now , let's move to SMACOF algorithm for metric MDS. As it was said earlier that metric MDS tries to appoximate the dissimilarity matrix and minimizes the stress function which is given by
Ïƒ(X)=Î£_{ij} w_{ij}(Î´_{ij}âˆ’d_{ij}(X))^{2} , where
w_{ij} is weightage assigned to disimilarity between i and j
Î´_{ij} is element of the given dissimilarity matrix
d_{ij}(X) is the dissimilarity from the X which we need to find
We aren't going to delve into derivation of the majorizing function for stress function. If you wish to follow please consult the excellent book (Topic  Majorizing Stress, page 187).
Steps of MDS
 Create a disimilarity matrix.
 choose a random point X_{m} as pivot.
 Find the minimum of stress majorizing functions.
 if Ïƒ(X_{m}) âˆ’ Ïƒ(X_{min}) < e break else set X_{m}=X_{min} and go to step 2
Step One  Dissimilarity matrix:
We need a distance metric to calculate the distance between two points in the dataset. Let's go through few popular distance metrics quickly.
Euclidean Metric: This is the most popular metric used in distance measurement. It is equal to the straight line distance between two points.
Manhattan Metric: The distance between two points is measured along right angles to the axes. It is also known as rectilinear distance, taxicab metric or city block distance.
Cosine Metric: The cosine distance measures the cosine of the angle between two vectors. Smaller the value of cosine closer will the points.
Mahalanobis Metric: Mahalanobis metric is used to measure a point p's distance to a distribution D. It is particularly useful in detecting outliers.
Hamming Metric: Hamming metric counts the number of places at which entries in two vectors differ. It is an important metric in information theory.
We will be using Euclidean metric as dissimilarity measure.
from sklearn import datasets
import math as ma
import numpy as np
from pyspark.sql import types as t
from pyspark.sql import functions as f
digits = datasets.load_digits(n_class=6)
data = digits.data
# repartitioning the dataframe by id column will speed up the join operation
df = spark.createDataFrame(sc.parallelize(data.tolist()).zipWithIndex()).toDF("features",
"id").repartition("id")
df.cache()
euclidean = lambda x,y:ma.sqrt(np.sum((np.array(x)np.array(y))**2))
data_bc = sc.broadcast(df.sort("id").select("features").rdd.collect())
# create the distance metric
def pairwise_metric1(y):
dist = []
for x in data_bc.value:
dist += [ma.sqrt(np.sum((np.array(x)np.array(y))**2))]
return(dist)
udf_dist1 = f.udf(pairwise_metric1, t.ArrayType(t.DoubleType()))
df = df.withColumn("D", udf_dist1("features"))
Step Two: SCAMOF algorithm:
n,p = data.shape
dim = 2
X = np.random.rand(n,dim)
# randomly initialize a solution for the pivot point.
dfrand = spark.createDataFrame(sc.parallelize(X.tolist()).zipWithIndex()).toDF("X",
"id2").repartition("id2")
df = df.join(dfrand, df.id==dfrand.id2, "inner").drop("id1")
def pairwise_metric2(y):
dist = []
for x in X_bc.value:
dist += [ma.sqrt(np.sum((np.array(x)np.array(y))**2))]
return(dist)
# create the matrix B
def B(id,x,y):
y,x = np.array(y), np.array(x)
y[y==0.0] = np.inf
z = x/y
z[id] = (np.sum(z)z[id])
return(z.tolist())
# function for matrix multiplication using outer multiplication
def df_mult(df, col1, col2, n1, n2, matrix=True):
udf_mult = f.udf(lambda x,y:np.outer(np.array(x),
np.array(y)).flatten().tolist(),
t.ArrayType(t.DoubleType()))
df = df.withColumn("mult", udf_mult(col1, col2))
df = df.agg(f.array([f.sum(f.col("mult")[i])
for i in range(n1*n2)])).toDF("mult")
if not matrix:
return(df)
st = t.ArrayType(t.StructType(
[t.StructField("id",t.LongType()),
t.StructField("row",t.ArrayType(
t.DoubleType()))]))
udf_arange = (f.udf(lambda x:[(i,j.tolist())
for i,j in enumerate(np.array(x).
reshape(n1,n2)/n1)], st))
df = (df.withColumn("mult",
udf_arange("mult")).select(
f.explode("mult").alias("mult")))
df = (df.select(f.col("mult.id").alias("id2"),
f.col("mult.row").
alias("X_min")).
repartition("id2"))
return(df)
udf_B = f.udf(B, t.ArrayType(t.DoubleType()))
udf_sigma = (f.udf(lambda x,y: float(np.sum((
np.array(x)np.array(y))**2)),
t.DoubleType()))
sigma_old = np.inf
tol = 1e4
max_iter = 1000
for i in range(max_iter):
X_bc = sc.broadcast(df.sort("id").select("X").rdd.collect())
def pairwise_metric2(y):
dist = []
for x in X_bc.value:
dist += [ma.sqrt(np.sum((np.array(x)np.array(y))**2))]
return(dist)
udf_dist2 = f.udf(pairwise_metric2, t.ArrayType(t.DoubleType()))
df = df.withColumn("di", udf_dist2("X"))
df = df.withColumn("sigma", udf_sigma("D","di"))
sigma_new = df.agg({"sigma":"sum"}).collect()[0][0]
print(sigma_old, sigma_new)
sigma_old = sigma_new
df = df.withColumn("B", udf_B("id","D","di")).drop("di")
X_min = df_mult(df, "B", "X", n, dim)
df = df.join(X_min, df.id==X_min.id2).select("id", "D", f.col("X_min").alias("X"))
# cache action will prevent recreation of dataframe from base
df.cache()
The plot of MDS embedding.
Though the clusters are not clearly distinct they can be identified easily.
Drawbacks of MDS:
MDS requires large computing power for calculating the dissimilarity matrix at every iteration. It is hard to embed the new data in MDS.
Conclusion: Like PCA, MDS is an old method. It has been well researched. It has few extensions like sammon mapping. With this post, we tried to develop an understanding of MDS and its working. We brushed up a few areas related to MDS and implemented a basic MDS in pyspark.
Next article in this series will be on LLE Sneak peek!