This tutorial is from a 7 part series on Dimension Reduction:
- Understanding Dimension Reduction with Principal Component Analysis (PCA)
- Diving Deeper into Dimension Reduction with Independent Components Analysis (ICA)
- Multi-Dimension Scaling (MDS)
- LLE (Coming Soon!)
- t-SNE (Coming Soon!)
- IsoMap (Coming Soon!)
- Autoencoders (Coming Soon!)
(An IPython notebook with math and code is available on github.)
Today, we will learn another dimensionality reduction method called ICA. ICA is a linear dimension reduction method, which transforms the dataset into columns of independent components. Blind Source Separation and the "cocktail party problem" are other names for it. ICA is an important tool in neuroimaging, fMRI, and EEG analysis that helps in separating normal signals from abnormal ones. So, what exactly is ICA?
ICA stands for Independent Components Analysis. It assumes that each sample of data is a mixture of independent components and it aims to find these independent components. At the heart of ICA is "Independence". We should try to understand this first.
What does independence mean in the context of ICA? When can we safely say that two variables are independent? How is it different from 'Correlation'? And lastly, How do you measure the degree of independence?
Suppose x, y are two random variables and their distribution functions are given by Px, Py respectively. If we receive some information about x and that doesn't change whatever knowledge we have about y, then we can safely say that x, y are independent variables. Now you will say "Hang on, this is the absence of correlation you are talking about". Yes, you are right but partly. Correlation is not the only means to measure the dependence between two variables. In fact, what correlation captures is the linear dependence. If two variables are independent then both linear and non-linear dependence will be zero. The absence of linear dependence does not imply independence since there might be a non-linear relationship. Let's take a small example to understand this.
Suppose x (-5,-4, -3, -2, -1, 0, 1, 2, 3, 4, 5) &
y = x2 which gives us y (25, 16, 9, 4,1,0, 1, 4, 9, 16, 25). Now, calculate the correlation between these two variables.
import numpy as np x = np.array([-5,-4,-2,-1,0,1,2,3,4,5]) y = np.array([25,16,9,4,1,0,1,4,9,16,25]) np.correlate(x,y) 0.0
As you see the correlation in the above case is 0 though they have a non-linear relationship. Thus, independence between two variables implies zero correlation but it is not true in reverse.
Let's come back to our topic for the day ICA. As earlier said ICA tries to find out the independent sources from which data is made of. We will start with a classic example to explain the ICA and its working principle.
In the image shown above Alice and Bob, both are speaking at the same time. Both mics receive inputs S1 & S2 from Alice and Bob respectively. ICA assumes that the mixing process is linear i.e. it can be represented as a matrix multiplication. Each mic mixes S1 & S2 according to its location and settings which is given by a matrix A. The matrix operation produces vector M as an output. Now, you wish to separate the S1 & S2 from M1 & M2. This is referred to as cocktail party problem or blind source separation.
The solution to this problem is trivial if matrix A is known. A simple matrix inversion of A followed by multiplication with M will give the answer. But in a real-world scenario matrix A is often unknown. The only information we have is the output of the mixing process.
ICA approach to this problem is based on three assumptions. These are:
1. Mixing process is linear.
2. All source signals are independent of each other.
3. All source signals have non-gaussian distribution.
We have already talked about first two assumptions. Let's talk a bit about the third assumption ICA makes: non-gaussianity of source signals.
The basis of this assumption comes from the central limit theorem. According to the Central Limit Theorem, the sum of independent random variables is more Gaussian than the independent variables. So to infer source variables we have to move away from the gaussianity. In case of Gaussian distribution, uncorrelated Gaussian variables are also independent, it is a unique property associated with Gaussian distribution.
Let's take a simple example to understand this concept. First, create four datasets - two from Gaussian distribution and two from the uniform distribution.
np.random.seed(100) U1 = np.random.uniform(-1, 1, 1000) U2 = np.random.uniform(-1, 1, 1000) G1 = np.random.randn(1000) G2 = np.random.randn(1000) %matplotlib inline # let's plot our signals from matplotlib import pyplot as plt fig = plt.figure() ax1 = fig.add_subplot(121, aspect = "equal") ax1.scatter(U1, U2, marker = ".") ax1.set_title("Uniform") ax2 = fig.add_subplot(122, aspect = "equal") ax2.scatter(G1, G2, marker = ".") ax2.set_title("Gaussian") plt.show()
Now, mix U1 & U2 and G1 & G2 to create outputs Umix and Gmix.
# now comes the mixing part. we can choose a random matrix for the mixing A = np.array([[1, 0], [1, 2]]) U_source = np.array([U1,U2]) U_mix = U_source.T.dot(A) G_source = np.array([G1, G2]) G_mix = G_source.T.dot(A) # plot of our dataset fig = plt.figure() ax1 = fig.add_subplot(121) ax1.set_title("Mixed Uniform ") ax1.scatter(U_mix[:, 0], U_mix[:,1], marker = ".") ax2 = fig.add_subplot(122) ax2.set_title("Mixed Gaussian ") ax2.scatter(G_mix[:, 0], G_mix[:, 1], marker = ".") plt.show()
Umix and Gmix are what we have in a real-world scenario. Remove the linear dependence from both the mixtures.
# PCA and whitening the dataset from sklearn.decomposition import PCA U_pca = PCA(whiten=True).fit_transform(U_mix) G_pca = PCA(whiten=True).fit_transform(G_mix) # let's plot the uncorrelated columns from the datasets fig = plt.figure() ax1 = fig.add_subplot(121) ax1.set_title("PCA Uniform ") ax1.scatter(U_pca[:, 0], U_pca[:,1], marker = ".") ax2 = fig.add_subplot(122) ax2.set_title("PCA Gaussian ") ax2.scatter(G_pca[:, 0], G_pca[:, 1], marker = ".")
Notice the differences between the uncorrelated(PCA uniform, PCA gaussian2) and source plots(Uniform, Gaussian). In case of Gaussian they look alike while uncorrelated Uniform needs a rotation to get there. By removing correlation in the gaussian case, we have achieved independence between variables. If the source variables are gaussian ICA is not required and PCA is sufficient.
How do we measure and remove the non-linear dependence between variables?
Non-linear dependence between the variables can be measured by mutual information among the variables. Higher the mutual information, higher will be the dependence.
Mutual information = sum of entropies of marginal distribution - entropy of the joint distribution
Entropy is a measure of uncertainty in a distribution. Entropy for a variable x is given by
H(x) = sum(log(P(x))*P(x)) for every possible of value of x.
The gaussian distribution has the highest entropy. A term closely related to entropy is negentropy which is formulated as
negentropy(x) = H(x_gaussian) - H(x). Here x_gaussian is gaussian random vector with same covariance as x. Thus, negentropy is always non-zero and equal to zero if x is a gaussian random variable. Also,
mutual information(y1,y2) = constant - sum(negentropy(yi))
Calculation of negentropy and mutual information requires knowledge of entropy. Entropy calculation requires probability distribution function which is not known. We can approximate negentropy with some suitable functions. Few popular examples are tanh(ay), -exp(-y2) and -y*exp(-y2).
G & g are the approximating function and its derivative respectively. X is the dataset. 1. Initialize W
2. X = PCA(X)
3. While W changes:
W = average(X*G(WX)) - average(g(WTX))W W = orthogonalize(W) 5. return S = WX
Orthogonalization is a process to make columns of a matrix orthogonal.
How many independent components should be selected? Which independent components should be selected?
ICA outputs a source matrix with columns as independent sources. It never tells us whether a component is significant or irrelevant. If the number of columns is less then checking every component is advisable. For large number of components, selection should be done at the PCA stage(2nd step). If you are unfamiliar with PCA check out the post-1 in this series.
Let's implement this algorithm in PySpark. We will create few signals and then mix them up to get a dataset suitable for ICA analysis.
%matplotlib inline import numpy as np import matplotlib.pyplot as plt from scipy import signal np.random.seed(0) num_rows = 3000 t = np.linaspace(0,10, n_samples) # create signals sources s1 = np.sin(3*t) # a sine wave s2 = np.sign(np.cos(6*time)) # a square wave s3 = signal.sawtooth(2 *t) # a sawtooth wave # combine single sources to create a numpy matrix S = np.c_[s1,s2,s3] # add a bit of random noise to each value S += 0.2 no.random.normal(size = S.shape) # create a mixing matrix A A = np.array([[1, 1.5, 0.5], [2.5, 1.0, 2.0], [1.0, 0.5, 4.0]]) X = S.dot(A.T) #plot the single sources and mixed signals plt.figure(figsize =(26,12) ) colors = ['red', 'blue', 'orange'] plt.subplot(2,1,1) plt.title('True Sources') for color, series in zip(colors, S.T): plt.plot(series, color) plt.subplot(2,1,2) plt.title('Observations(mixed signal)') for color, series in zip(colors, X.T): plt.plot(series, color)
Code for PCA and whitening the dataset.
from pyspark.mllib.linalg.distributed import IndexedRowMatrix, IndexedRow, BlockMatrix from pyspark.mllib.feature import StandardScaler from pyspark.mllib.linalg import Vectors, DenseMatrix, Matrix from sklearn import datasets # create the standardizer model for standardizing the dataset X_rdd = sc.parallelize(X).map(lambda x:Vectors.dense(x) ) scaler = StandardScaler(withMean = True, withStd = False).fit(iris_rdd) X_sc = scaler.transform(X_rdd) #create the IndexedRowMatrix from rdd X_rm = IndexedRowMatrix(X_sc.zipWithIndex().map(lambda x: (x, x))) # compute the svd factorization of the matrix. First the number of columns and second a boolean stating whether # to compute U or not. svd_o = X_rm.computeSVD(X_rm.numCols(), True) # svd_o.V is of shape n * k not k * n(as in sklearn) P_comps = svd_o.V.toArray().copy() num_rows = X_rm.numRows() # U is whitened and projected onto principal components subspace. S = svd_o.s.toArray() eig_vals = S**2 # change the ncomp to 3 for this tutorial #n_comp = np.argmax(np.cumsum(eig_vals)/eig_vals.sum() > 0.95)+1 n_comp = 3 U = svd_o.U.rows.map(lambda x:(x.index, (np.sqrt(num_rows-1)*x.vector).tolist()[0:n_comp])) # K is our transformation matrix to obtain projection on PC's subspace K = (U/S).T[:n_comp]
Now, the code for calculation of independent components.
import pyspark.sql.functions as f import pyspark.sql.types as t df = spark.createDataFrame(U).toDF("id", "features") # Approximating function g(y) = x*exp(-x**2/2) and its derivative def g(X): x = np.array(X) return(x * np.exp(-x**2/2.0)) def gprime(Y): y = np.array(Y) return((1-y**2)*np.exp(-y**2/2.0)) # function for calculating step 2 of the ICA algorithm def calc(df): function to calculate the appoximating function and its derivative def foo(x,y): y_arr = np.array(y) gy = g(y_arr) gp = gprime(y_arr) x_arr = np.array(x) res = np.outer(gy,x_arr) return([res.flatten().tolist(), gp.tolist()]) udf_foo = f.udf(foo, t.ArrayType(t.ArrayType(t.DoubleType()))) df2 = df.withColumn("vals", udf_foo("features","Y")) df2 = df2.select("id", f.col("vals").getItem(0).alias("gy"), f.col("vals").getItem(1).alias("gy_")) GY_ = np.array(df2.agg(f.array([f.sum(f.col("gy")[i]) for i in range(n_comp**2)])).collect()).reshape(n_comp,n_comp)/num_rows GY_AVG_V = np.array(df2.agg(f.array([f.avg(f.col("gy_")[i]) for i in range(n_comp)])).collect()).reshape(n_comp,1)*V return(GY_, GY_AVG_V) np.random.seed(101) # Initialization V = np.random.rand(n_comp, n_comp) # symmetric decorelation function def sym_decorrelation(V): U,D,VT = np.linalg.svd(V) Y = np.dot(np.dot(U,np.diag(1.0/D)),U.T) return np.dot(Y,V) numIters = 10 V = sym_decorrelation(v_init) tol =1e-3 V_bc = sc.broadcast(V) for i in range(numIters): # Y = V*X udf_mult = f.udf(lambda x: V_bc.value.dot(np.array(x)).tolist(), t.ArrayType(t.DoubleType())) df = df.withColumn("Y", udf_mult("features")) gy_x_mean, g_y_mean_V = calc(df) V_new = gy_x_mean - g_y_mean_V V_new = sym_decorrelation( V_new ) #condition for convergence lim = max(abs(abs(np.diag(V_new.dot(V.T)))-1)) V = V_new # V needs to be broadcasted after every change V_bc = sc.broadcast(V) print("i= ",i," lim = ",lim) if lim < tol: break elif i== numIters: print("Lower the tolerance or increase the number of iterations") #calculate the unmixing matrix for dataset W = V.dot(K) #now multiply U with V to get source signals S_ = df.withColumn("Y", udf_mult("features"))
Plot the result S_
plt.title('Recovered source Signals') for color, series in zip(colors, S_.T): plt.plot(series, color)
Drawbacks of ICA: ICA cannot uncover non-linear relationships of the dataset. ICA does not tell us anything about the order of independent components or how many of them are relevant.
Conclusion: In this post, we learned practical aspects of the Independent component analysis. We touched a few important topics related to the understanding of ICA like gaussianity and independence. Afterwards, ICA algorithm was implemented on pyspark and used on a toy dataset. If you wish to learn more about ICA and its applications try ICA paper on fMRI and EEG data.
Next article in this series will be on Multi-Dimension Scaling