Dimension Reduction Techniques
Presented by Jie (Jerry) Yu
Outline
• Problem Modeling
• Review of PCA and MDS
• Isomap
• Local Linear Embedding (LLE)
• Charting
Background
• Advances in data collection and storage capacities lead to information overload in many fields.
• Traditional statistical methods often break down because of the increase in the number of variables in each observation, that is, the dimension of the data.
• One of the most challenging problems is to reduce the dimension of the original data.
Problem Modeling
• Original high-dimensional data: X = (x_1, ..., x_p)^T, a p-dimensional multivariate random variable.
• Underlying/intrinsic low-dimensional data: Y = (y_1, ..., y_k)^T, a k (<< p)-dimensional multivariate random variable.
• The mean and covariance: μ = E(X) = (μ_1, ..., μ_p)^T and Σ_x = E{(X − μ)(X − μ)^T}.
• Problems: 1) find the mapping that best captures the most important features of the data in low dimension, and 2) find the appropriate k that best describes the data in low dimension.
State-of-the-art Techniques
• Dimension reduction techniques can be categorized into two major classes: linear and non-linear.
• Non-linear methods: Multidimensional Scaling (MDS), Principal Curves, Self-Organizing Map (SOM), Neural Networks, Isomap, Local Linear Embedding (LLE), and Charting.
• Linear methods: Principal Component Analysis (PCA), Factor Analysis, Projection Pursuit, and Independent Component Analysis (ICA).
Principal Component Analysis (PCA)
• Denote a linear projection as W = [w_1, ..., w_k], so that y_i = w_i^T X.
• In essence, PCA reduces the data dimension by finding a few orthogonal linear combinations of the original variables (the principal components, PCs) with the largest variance:
  W = arg max_W Σ_{i=1}^{k} var{y_i} = arg max_W Σ_{i=1}^{k} var{w_i^T X}
• This can be rewritten as W = arg max_W (W^T Σ_x W).
PCA
• Σ_x can be decomposed by eigendecomposition as Σ_x = U Λ U^T.
• Λ = diag(λ_1, ..., λ_p) is the diagonal matrix of eigenvalues sorted in descending order, and U is the orthogonal matrix containing the corresponding eigenvectors.
• It can be shown that the optimal projection matrix W consists of the first k eigenvectors (columns) of U.
PCA
• Property 1: The subspace spanned by the first k eigenvectors has the smallest mean-square deviation from X among all subspaces of dimension k.
• Property 2: The total variance is equal to the sum of the eigenvalues of the original covariance matrix.
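The following is a minimal NumPy sketch of PCA as described above (center the data, eigendecompose the sample covariance, project onto the top k eigenvectors); the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def pca(X, k):
    """Project an n x p data matrix X onto its first k principal components."""
    X_centered = X - X.mean(axis=0)               # remove the mean
    cov = np.cov(X_centered, rowvar=False)        # p x p sample covariance Sigma_x
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # re-sort in descending order
    W = eigvecs[:, order[:k]]                     # first k eigenvectors = projection matrix W
    Y = X_centered @ W                            # n x k low-dimensional representation
    explained = eigvals[order[:k]].sum() / eigvals.sum()  # share of total variance (Property 2)
    return Y, W, explained
```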
Multidimensional Scaling (MDS)
• Multidimensional Scaling (MDS) produces a low-dimensional representation of the data such that the distances in the new space reflect the proximities of the data in the original space.
• Denote the symmetric proximity matrix as Δ = {δ_ij, i, j = 1, ..., n}.
• MDS tries to find a mapping such that the distances d_ij = d(y_i, y_j) in the lower-dimensional space are as close as possible to a function f(δ_ij) of the corresponding proximities.
MDS
• Mapping cost function: Σ_{i,j} [f(δ_ij) − d_ij]^2 / scale_factor
• The scale_factor is often based on Σ_{i,j} f(δ_ij)^2 or Σ_{i,j} d_ij^2.
• Problem: find the optimal mapping that minimizes the cost function.
• If the proximity is a distance measure (L_1 or L_2), the method is called metric MDS.
• If the proximity uses only ordinal information about the data, it is called non-metric MDS.
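A minimal sketch of classical (metric) MDS, assuming the proximities are Euclidean distances and f is the identity: coordinates are recovered by double-centering the squared distance matrix and taking the top eigenvectors (names are illustrative).

```python
import numpy as np

def classical_mds(D, k):
    """Embed n points in k dimensions from an n x n distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered (Gram) matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]        # keep the k largest eigenvalues
    scales = np.sqrt(np.maximum(eigvals[order], 0))   # clip tiny negatives due to noise
    return eigvecs[:, order] * scales            # n x k coordinates Y
```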
Isomap
• Disadvantages of PCA and MDS: 1) both methods often fail to discover complicated nonlinear structure, and 2) both have difficulty detecting the intrinsic dimension of the data.
• Goal: combine the major algorithmic features of PCA and MDS (computational efficiency, global optimality, and asymptotic convergence guarantees) with the flexibility to learn nonlinear manifolds.
• Idea: introduce a geodesic distance that better describes the relations between data points.
Isomap
Illustration: points that are far apart on the underlying manifold, as measured by their geodesic distance, may appear close in the high-dimensional input space (the Swiss Roll data set).
Isomap
• In this approach the intrinsic geometry of the data is preserved by capturing the manifold distances between all data points.
• For neighboring points (within an ε-ball or among the k nearest neighbors), the Euclidean distance provides a good approximation to the geodesic distance.
• For faraway points, the geodesic distance can be approximated by adding up a sequence of "short hops" between neighboring points (e.g., with Floyd's algorithm).
Isomap Algorithm
• Step 1: Determine which points are neighbors on the manifold, based on the input distance matrix.
• Step 2: Estimate the geodesic distances d_G(i, j) between all pairs of points on the manifold M by computing their shortest-path distances in the neighborhood graph.
• Step 3: Apply MDS (or PCA) to the graph distance matrix D_G = {d_G(i, j)}.
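A compact sketch of the three steps, assuming a k-nearest-neighbor graph, a connected neighborhood graph, and SciPy's shortest-path routine in place of a hand-written Floyd/Dijkstra implementation (function and parameter names are illustrative).

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, k=2):
    """Steps 1-3: neighborhood graph, geodesic distances, MDS embedding."""
    D = cdist(X, X)                                       # Euclidean distances
    # Step 1: keep only distances to each point's n_neighbors nearest points
    graph = np.full_like(D, np.inf)                       # inf marks "no edge"
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    rows = np.repeat(np.arange(len(X)), n_neighbors)
    graph[rows, idx.ravel()] = D[rows, idx.ravel()]
    # Step 2: shortest-path (geodesic) distances D_G on the neighborhood graph
    D_G = shortest_path(graph, method='D', directed=False)
    # Step 3: classical MDS on the graph distance matrix
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D_G ** 2) @ J
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1][:k]
    return V[:, order] * np.sqrt(np.maximum(w[order], 0))
```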
The Swiss Roll Problem
Detect Intrinsic Dimension
• The intrinsic dimensionality of the data can be estimated from how fast the residual variance decreases as the dimensionality of Y increases.
• Residual variance is defined as 1 − R(D_M, D_y)^2, where R(·, ·) is the linear correlation coefficient, D_M is the estimated (geodesic) distance matrix in the original space, and D_y is the distance matrix in the projected space.
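A small sketch of this diagnostic, assuming D_M and D_y are given as full distance matrices (variable names are illustrative): compute 1 − R^2 for each candidate dimensionality and look for the elbow in the curve.

```python
import numpy as np

def residual_variance(D_M, D_y):
    """1 - R^2 between geodesic distances D_M and embedding distances D_y."""
    r = np.corrcoef(D_M.ravel(), D_y.ravel())[0, 1]   # linear correlation coefficient
    return 1.0 - r ** 2

# Typical use: embed for k = 1, 2, 3, ... and plot residual_variance(D_G, D_Yk);
# the intrinsic dimension is where the curve stops decreasing appreciably.
```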
Theoretical Analysis
• The main contribution of Isomap is to substitute the Euclidean distance with the geodesic distance, which better captures the nonlinear structure of a manifold.
• Given sufficient data, Isomap is guaranteed asymptotically to recover the true dimensionality and geometric structure of a nonlinear manifold.
Experiments
Experiment 1: Facial Images
Experiment 2: The hand-written 2's
Locally Linear Embedding (LLE)
• MDS and its variant Isomap try to preserve pairwise distances between data points.
• Locally Linear Embedding (LLE) is an unsupervised learning algorithm that recovers global nonlinear structure from locally linear fits.
• Assumption: each data point and its neighbors lie on or close to a locally linear patch of the manifold.
Local Linearity
LLE
• Idea: the local geometry is characterized by linear coefficients that reconstruct each data point from its neighbors.
• The reconstruction cost is defined as ε(W) = Σ_i | x_i − Σ_j w_ij x_j |^2.
• Two constraints: 1) each data point is reconstructed only from its neighbors, not from faraway points, and 2) the rows of the weight matrix sum to one (Σ_j w_ij = 1).
Linear reconstruction
LLE
• The weight matrix for any data point is invariant to rotations, rescalings, and translations.
• Although the global manifold may be nonlinear, for each locally linear neighborhood there exists a linear mapping (consisting of a translation, rotation, and rescaling) that projects the neighborhood to the low-dimensional space.
• The same weights that reconstruct the ith data point in D dimensions should also reconstruct its embedded coordinates on the manifold in d dimensions.
LLE
• W is found by minimizing the reconstruction cost function in the original space.
• To find the optimal global mapping to the lower-dimensional space, define an embedding cost function: Φ(Y) = Σ_i | y_i − Σ_j w_ij y_j |^2.
• Because W is now fixed, the problem becomes finding the optimal projection (X → Y) that minimizes the embedding cost.
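A condensed sketch of the two LLE stages described above, assuming k-nearest neighbors and a small regularizer on the local Gram matrices for numerical stability; the embedding is taken from the bottom eigenvectors of (I − W)^T (I − W), discarding the constant eigenvector (all names are illustrative).

```python
import numpy as np

def lle(X, n_neighbors=10, d=2, reg=1e-3):
    """Locally Linear Embedding: reconstruction weights, then bottom eigenvectors."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    neighbors = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbors[i]] - X[i]                         # neighbors shifted to the origin
        G = Z @ Z.T                                        # local Gram matrix
        G += reg * np.trace(G) * np.eye(n_neighbors)       # regularize near-singular G
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, neighbors[i]] = w / w.sum()                   # rows of W sum to one
    # Embedding cost: bottom eigenvectors of M = (I - W)^T (I - W); skip the constant one
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:d + 1]                             # n x d coordinates Y
```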
Theoretical Analysis
• 1) There is only one free parameter, K, and the transformation is deterministic.
• 2) Guaranteed to converge to the global optimum given sufficient data points.
• 3) LLE does not have to be rerun to compute higher-dimensional embeddings.
• 4) The intrinsic dimension d can be estimated by analyzing a reciprocal cost function that reconstructs X from Y.
Experiment 1: Facial Images
Experiment 2: Arranging words in semantic space
Charting
• Charting is the problem of assigning a low-dimensional coordinate system to data points in a high-dimensional sample space.
• Assume that the data lie on or near a low-dimensional manifold in the sample space and that there exists a one-to-one smooth nonlinear transform between the manifold and a low-dimensional vector space.
• Goal: find a mapping, expressed as a kernel-based mixture of linear projections, that minimizes information loss about the density and relative locations of the sample points.
Local Linear Scale and Intrinsic Dimensionality
• Local linear scale (r): at some scale r, the mapping from a neighborhood on the d-dimensional manifold M (in the original space) to the lower-dimensional space is linear.
• Consider a ball of radius r centered on a data point and containing n(r) data points. The count n(r) grows as r^d only at the locally linear scale.
Local Linear Scale and Intrinsic Dimensionality
• There are two other factors that may affect the data distribution at different scales: isotropic noise (at a smaller scale) and embedding curvature (at a larger scale).
• Define c(r) = log r / log n(r). At the noise scale, c(r) = 1/D < 1/d; at the locally linear scale, c(r) = 1/d; at the curvature scale, c(r) < 1/d.
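A rough sketch of this scale-based diagnostic, mirroring the c(r) definition on the slide and assuming a pairwise distance matrix plus a grid of radii chosen on the scale of the data (names are illustrative): a plateau of c(r) between the noise and curvature scales suggests an intrinsic dimension d ≈ 1/c.

```python
import numpy as np

def scale_curve(D, radii):
    """c(r) = log r / log n(r), averaged over points, for each radius r > 0."""
    cs = []
    for r in radii:
        n_r = (D < r).sum(axis=1)          # points inside a ball of radius r (self included)
        n_r = np.maximum(n_r, 2)           # guard against log(1) = 0 in the denominator
        cs.append(np.mean(np.log(r) / np.log(n_r)))
    return np.array(cs)

# Typical use: pick radii spanning the data's own scale, plot scale_curve(D, radii),
# and read the intrinsic dimension from the plateau between the noise and curvature regimes.
```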