Dimension Reduction - t-SNE

Through a series of posts, learn how to implement dimension reduction algorithms using t-SNE.


By Ashwini Kumar Pal
This tutorial is from a 7 part series on Dimension Reduction:
  1. Understanding Dimension Reduction with Principal Component Analysis (PCA)
  2. Diving Deeper into Dimension Reduction with Independent Components Analysis (ICA)
  3. Multi-Dimension Scaling (MDS)
  4. LLE
  5. t-SNE
  6. IsoMap
  7. Autoencoders

(A more mathematical notebook with code is available in the GitHub repo.)

t-SNE is an award-winning technique for dimension reduction and data visualization. It not only captures the local structure of the high-dimensional data but also preserves its global structures, such as clusters, and it has a striking ability to produce well-defined, well-separated clusters. t-SNE is based on stochastic neighbor embedding (SNE) and was developed to address some of SNE's problems, so let's first get a basic understanding of SNE.
SNE: Stochastic neighbor embedding uses a probabilistic approach to embed a high-dimensional dataset into a lower dimension while preserving the neighborhood structure of the dataset. A Gaussian probability distribution centered on each point is defined over all of that point's potential neighbors. SNE aims to minimize the difference between these probability distributions in the higher and the lower dimension.
For each object i and its neighbor j, we compute p(j|i), which reflects the probability that j is a neighbor of i:
p(j|i) = exp(−d²ij) / Σ_{k≠i} exp(−d²ik),
where d²ij is the dissimilarity between elements i and j, either given as input or calculated from the dataset provided.
The dissimilarity between xi and xj can be calculated using the following formula:
d²ij = ||xi − xj||² / (2σ²i),
where σi is generally found through a binary search that matches the perplexity of the distribution centered at xi to a value chosen by hand. This method generates a probability matrix which is asymmetric.
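To make the formula concrete, here is a minimal NumPy sketch (not from the post's notebook; the function name and the assumption that the bandwidths σi are already known are mine) that computes p(j|i) for a small data matrix:

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """Compute p(j|i) for every pair using a Gaussian kernel.

    X      : (n, d) data matrix
    sigmas : (n,) bandwidth sigma_i of the Gaussian centered at each x_i
    """
    # squared Euclidean distances ||x_i - x_j||^2
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T

    # dissimilarities d^2_ij = ||x_i - x_j||^2 / (2 * sigma_i^2)
    d2_scaled = d2 / (2.0 * sigmas[:, None] ** 2)

    # p(j|i) = exp(-d^2_ij) / sum_{k != i} exp(-d^2_ik)
    P = np.exp(-d2_scaled)
    np.fill_diagonal(P, 0.0)           # a point is not its own neighbor
    P /= P.sum(axis=1, keepdims=True)  # row-normalize
    return P
```

Each row i of the resulting matrix is the conditional distribution over the neighbors of xi, which is why the matrix is asymmetric.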
Now, a random solution is chosen as the starting point for the low-dimensional embedding, and a probability distribution is defined on it in the same way as above, but with a constant σ = 0.5 for all points.
SNE tries to minimize the difference between these two distributions, which we can measure with the Kullback-Leibler divergence. For two discrete distributions P and Q, the KL divergence is given by DKL(P||Q) = Σi Pi log(Pi/Qi).
SNE defines a cost function based on the difference between pij and qij, which is given by
C = Σi Σj pij log(pij/qij)
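As a quick illustration, the cost can be evaluated directly from the two probability matrices. This is only a sketch (the helper name is mine), with a small epsilon to avoid taking log(0) on the zero diagonal entries:

```python
import numpy as np

def sne_cost(P, Q, eps=1e-12):
    """KL-divergence cost C = sum_i sum_j p_ij * log(p_ij / q_ij)."""
    P = np.maximum(P, eps)   # clamp zeros so the log is defined
    Q = np.maximum(Q, eps)
    return np.sum(P * np.log(P / Q))
```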
While embedding the dataset in the lower dimension, two kinds of error can occur: neighbors can be mapped as faraway points (pij is large while qij is small), and points which are far apart can be mapped as neighbors (pij is small while qij is large). Look closely at the cost function: the cost of the first kind of error, i.e. representing a large pij with a small qij, is greater than the cost of representing a small pij with a large qij. In other words, SNE heavily penalizes embeddings in which neighbors are mapped far away from each other.
The main shortcomings of the SNE approach are the asymmetric probability matrix P and the crowding problem. As pointed out earlier, the probability matrix P is asymmetric. Suppose a point xi is far away from all the other points; then its pij will be very small for every j, so it will have little effect on the cost function, and embedding it correctly in the lower dimension will be hard.
Crowding problem: in an n-dimensional Euclidean space, at most n+1 points can be mutually equidistant, never more. Now, when the intrinsic dimension of a dataset is high, say 20, and we reduce it from 100 dimensions to 2 or 3, the solution is affected by the crowding problem. The amount of space available to map close points in 10 or 15 dimensions will always be greater than the space available in 2 or 3 dimensions, so in order to map close points properly, moderately distant points get pushed too far away. This eats up the gaps between the original clusters, and the embedding ends up looking like one giant cluster.
We need to brush up on a few more topics before we move on to t-SNE.
Student-t distribution -- The Student-t distribution is a continuous, symmetric probability distribution with heavy tails. It has a single parameter, the degrees of freedom. As the degrees of freedom increase, it approaches the normal distribution. With one degree of freedom it becomes the Cauchy distribution, whose probability density function is
f(t) = 1 / (π(1 + t²))
Entropy: Entropy is a measure of the average information contained in a data source. For a variable x with distribution p(x), it is given by
H(x) = −Σi p(xi) log₂(p(xi))
Perplexity: In information theory, perplexity measures how well a probability distribution predicts a sample. A low perplexity indicates that the distribution is good at predicting the sample. It is given by
Perp(x) = 2^H(x), where H(x) is the entropy of the distribution.
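A tiny sketch ties the two definitions together (the helper name is illustrative, not from the post):

```python
import numpy as np

def entropy_and_perplexity(p):
    """Shannon entropy (base 2) and perplexity of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                         # treat 0 * log(0) as 0
    H = -np.sum(p * np.log2(p))          # H(x) = -sum_i p(x_i) log2 p(x_i)
    return H, 2.0 ** H                   # Perp(x) = 2^H(x)

# A uniform distribution over 4 outcomes has entropy 2 bits and perplexity 4.
print(entropy_and_perplexity([0.25, 0.25, 0.25, 0.25]))   # (2.0, 4.0)
```

For a uniform distribution over k outcomes the perplexity is exactly k, which is why perplexity is often read as an "effective number of choices" (or, in t-SNE, neighbors).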

t-SNE
t-SNE differs from SNE in two ways: first, it uses a Student-t distribution with one degree of freedom to measure the similarity between points yi and yj in the lower dimension; second, in the higher dimension it uses a symmetric probability distribution, so that pij = pji (obtained by averaging the two conditional probabilities, pij = (p(j|i) + p(i|j)) / 2n).
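The snippet below sketches both pieces: symmetrizing the high-dimensional matrix and computing the low-dimensional similarities with the heavy-tailed Student-t kernel. The function names are illustrative, and `conditional_probabilities` refers to the earlier sketch:

```python
import numpy as np

def symmetrize(P_conditional):
    """Symmetric high-dimensional probabilities: p_ij = (p(j|i) + p(i|j)) / (2n)."""
    n = P_conditional.shape[0]
    return (P_conditional + P_conditional.T) / (2.0 * n)

def low_dim_affinities(Y):
    """Low-dimensional similarities q_ij from a Student-t kernel with one degree of freedom."""
    sq_norms = np.sum(Y ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Y @ Y.T
    kernel = 1.0 / (1.0 + d2)             # (1 + ||y_i - y_j||^2)^-1
    np.fill_diagonal(kernel, 0.0)         # q_ii is not used
    return kernel / kernel.sum(), kernel  # normalized q_ij, plus the raw kernel (reused in the gradient)
```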

Steps of the t-SNE algorithm:

  1. Compute the pairwise similarity pij for every pair i and j.
  2. Make pij symmetric.
  3. Choose a random initial solution Y0.
  4. While not done:
         compute the pairwise similarities qij for the current embedding Y
         compute the gradient of the cost
         update the solution
         if i > max_iter, break; else i = i + 1
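The loop above can be fleshed out into a minimal NumPy sketch. This is not a production implementation: it relies on the illustrative helpers sketched in this post (`conditional_probabilities`, `symmetrize`, `low_dim_affinities`, and `find_sigmas` from the next section), uses plain gradient descent with momentum, and omits tricks such as early exaggeration.

```python
import numpy as np

def tsne(X, n_components=2, perplexity=30.0, learning_rate=100.0,
         momentum=0.8, max_iter=1000, seed=0):
    """Minimal t-SNE sketch following the steps above (no Barnes-Hut, no early exaggeration)."""
    n = X.shape[0]

    # steps 1-2: pairwise similarities in the high dimension, then symmetrize
    sigmas = find_sigmas(X, perplexity)          # perplexity binary search, sketched below
    P = symmetrize(conditional_probabilities(X, sigmas))

    # step 3: random initial solution Y0
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-4, size=(n, n_components))
    update = np.zeros_like(Y)

    # step 4: gradient descent with momentum
    for _ in range(max_iter):
        Q, kernel = low_dim_affinities(Y)        # Student-t similarities q_ij
        # gradient: dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) * (1 + ||y_i - y_j||^2)^-1
        W = (P - Q) * kernel
        grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y
        update = momentum * update - learning_rate * grad
        Y += update
    return Y
```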

Computing the probability distribution:
For computing the pairwise similarities we need to know the variance σi of the Gaussian centered at each xi. One might ask why we don't simply set a single value of σ for every xi. The density of the data is likely to vary: we need a smaller σi in dense regions and a larger σi where the points are far apart. The entropy of the Gaussian centered at xi increases as σi increases, so to find σi we perform a binary search such that the perplexity of the Gaussian centered at xi equals the perplexity specified by the user.
If you are wondering how perplexity fits into all this: you can think of perplexity as a smooth measure of the effective number of neighbors.
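A sketch of that binary search (the helper name `find_sigmas`, used in the loop above, is mine; squared Euclidean distances are assumed):

```python
import numpy as np

def find_sigmas(X, target_perplexity, tol=1e-5, max_steps=50):
    """Binary-search sigma_i so that the perplexity of the Gaussian at x_i matches the target."""
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    target_entropy = np.log2(target_perplexity)    # Perp = 2^H  =>  H = log2(Perp)
    sigmas = np.ones(X.shape[0])

    for i in range(X.shape[0]):
        lo, hi = 1e-10, 1e10
        for _ in range(max_steps):
            sigma = (lo + hi) / 2.0
            p = np.exp(-np.delete(d2[i], i) / (2.0 * sigma ** 2))
            p /= p.sum()
            H = -np.sum(p * np.log2(p + 1e-12))
            if abs(H - target_entropy) < tol:
                break
            # entropy grows with sigma: if it is too high, shrink sigma, else grow it
            if H > target_entropy:
                hi = sigma
            else:
                lo = sigma
        sigmas[i] = sigma
    return sigmas
```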

Impact of parameters on embedding:
For t-SNE to be meaningful we have to choose the right value of perplexity. Perplexity balances the local and global aspects of the dataset: a very high value will lead to the clusters merging into one big cluster, while a very low value will produce many small, tight clusters that are meaningless. The images below show the effect of perplexity on a t-SNE embedding of the iris dataset.

When K (the number of neighbors) = 5, t-SNE produces many small clusters, which creates problems when the number of classes is high. As the number of neighbors increases, clusters from the same class merge; at K = 25 and K = 50 we have well-defined clusters for the few classes. The clusters also become denser as K increases.
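If you want to reproduce this experiment, a short scikit-learn script along these lines should do it (assuming scikit-learn and matplotlib are installed; the exact figures will differ from the ones shown here because t-SNE is stochastic):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)

# one panel per perplexity value, colored by class label
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perplexity in zip(axes, [5, 25, 50]):
    Y = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
    ax.scatter(Y[:, 0], Y[:, 1], c=y, cmap="viridis", s=10)
    ax.set_title(f"perplexity = {perplexity}")
plt.show()
```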
A plot of a subset of the MNIST dataset after t-SNE embedding.

t-SNE produces a well-defined and separate cluster for each of the digits.
Drawbacks of t-SNE
Problems with t-SNE arise when the intrinsic dimensionality of the data is high, i.e. more than 2-3 dimensions. Like other gradient-descent-based algorithms, t-SNE has a tendency to get stuck in local optima. The basic t-SNE algorithm is also slow because of its nearest-neighbor search queries.

Conclusion: In this post we covered another dimension reduction and visualization method, t-SNE. We began with the important topics underlying t-SNE, and afterwards a basic t-SNE was implemented using pyspark (see the notebook in the GitHub repo). There are a few extensions of basic t-SNE that improve the time complexity of the algorithm; for those who wish to delve deeper into this topic, Barnes-Hut t-SNE is a good starting point.
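For reference, scikit-learn's TSNE already exposes the Barnes-Hut approximation, so switching between the exact algorithm described in this post and the faster approximate one is a one-argument change:

```python
from sklearn.manifold import TSNE

# "barnes_hut" (the default) approximates the gradient and brings the per-iteration
# cost down from O(n^2) to roughly O(n log n); "exact" runs the plain algorithm.
tsne_fast = TSNE(n_components=2, method="barnes_hut", angle=0.5)
tsne_exact = TSNE(n_components=2, method="exact")
```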

In the next post, we will learn about IsoMap.
