1  Feature Selection: Linear Transformations

y_new = M x_old

3  Constraint Optimization (insertion)

Problem: Given an objective function f(x) to be optimized, let the constraints be given by h_k(x) = c_k. Moving the constants to the left gives g_k(x) := h_k(x) − c_k = 0. f(x) and the g_k(x) must have continuous first partial derivatives.

A solution: Lagrange multipliers

    0 = ∇_x f(x) + Σ_k λ_k ∇_x g_k(x)

or, starting from the Lagrangian

    L(x, λ) = f(x) + Σ_k λ_k g_k(x),   with   ∇_x L(x, λ) = 0.
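As a brief illustration (not from the slides; the function and constraint are chosen only for this example), the recipe applied to maximizing f(x_1, x_2) = x_1 x_2 subject to x_1 + x_2 = 1 looks as follows:

```latex
% Constraint in the form g(x) = 0:  g(x) = x_1 + x_2 - 1
% Build the Lagrangian, set its gradient to zero, then enforce the constraint.
\begin{align*}
L(x,\lambda) &= x_1 x_2 + \lambda\,(x_1 + x_2 - 1)\\
\nabla_x L = 0:&\quad x_2 + \lambda = 0,\;\; x_1 + \lambda = 0
  \;\Rightarrow\; x_1 = x_2 = -\lambda\\
g(x) = 0:&\quad -2\lambda - 1 = 0
  \;\Rightarrow\; \lambda = -\tfrac{1}{2},\;\; x_1 = x_2 = \tfrac{1}{2}.
\end{align*}
```

The same pattern (Lagrangian, gradient set to zero, constraint enforced) is used below to derive the KLT.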
4  The Covariance Matrix (insertion)

Definition
Let x = (x_1, ..., x_N) ∈ ℝ^N be a real-valued random variable (data vector) with expected value (mean) E[x] = μ.
We define the covariance matrix Σ_x of a random variable x as

    Σ_x := E[ (x − μ)(x − μ)^T ]

with matrix elements Σ_ij = E[ (x_i − μ_i)(x_j − μ_j) ].

Application: estimating E[x] and E[ (x − E[x])(x − E[x])^T ] from data.
We assume m samples of the random variable x = (x_1, ..., x_N) ∈ ℝ^N, i.e. we have a set of m vectors {x_1, ..., x_m} ⊂ ℝ^N, or, put into a data matrix, X ∈ ℝ^{N×m}.
The maximum-likelihood estimators for μ and Σ_x are:

    μ_ML = (1/m) Σ_{k=1}^m x_k
    Σ_ML = (1/m) Σ_{k=1}^m (x_k − μ_ML)(x_k − μ_ML)^T  =  (1/m) X X^T   (with X mean-free)

5  KLT/PCA Motivation

• Find meaningful "directions" in correlated data
• Linear dimensionality reduction
• Visualization of higher-dimensional data
• Compression / noise reduction
• PDF estimation
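A minimal numpy sketch of the two ML estimators on a column-wise data matrix (synthetic data and variable names are my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: m = 500 samples of an N = 3 dimensional random vector,
# stored column-wise in an N x m data matrix X, as on the slide.
N, m = 3, 500
X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                            cov=[[2.0, 0.8, 0.0],
                                 [0.8, 1.0, 0.3],
                                 [0.0, 0.3, 0.5]],
                            size=m).T                    # shape (N, m)

# Maximum-likelihood estimators from the slide.
mu_ml = X.mean(axis=1, keepdims=True)                    # (1/m) * sum_k x_k
Xc = X - mu_ml                                           # mean-free data
Sigma_ml = (Xc @ Xc.T) / m                               # (1/m) * sum_k (x_k - mu)(x_k - mu)^T

# np.cov normalizes by 1/(m-1) by default; bias=True gives the ML 1/m version.
assert np.allclose(Sigma_ml, np.cov(X, bias=True))
```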
7  Karhunen-Loève Transform: 1st Derivation

Problem
Let x = (x_1, ..., x_N) ∈ ℝ^N be a feature vector of zero-mean, real-valued random variables.
We seek the direction a_1 of maximum variance:

    y_1 = a_1^T x   such that E[y_1²] is maximum,

with the constraint a_1^T a_1 = 1.

This is a constrained optimization → use the Lagrangian (λ_1: Lagrange multiplier):

    L(a_1, λ_1) = E[ a_1^T x x^T a_1 ] − λ_1 (a_1^T a_1 − 1) = a_1^T Σ_x a_1 − λ_1 (a_1^T a_1 − 1)

8  Karhunen-Loève Transform

    L(a_1, λ_1) = a_1^T Σ_x a_1 − λ_1 (a_1^T a_1 − 1)

For E[y_1²] to be maximum:  ∂L(a_1, λ_1)/∂a_1 = 0

    => Σ_x a_1 − λ_1 a_1 = 0
    => a_1 must be an eigenvector of Σ_x with eigenvalue λ_1.

    E[y_1²] = a_1^T Σ_x a_1 = λ_1

=> for E[y_1²] to be maximum, λ_1 must be the largest eigenvalue.
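A short numerical sketch (synthetic data, not part of the lecture) that confirms both conclusions of this derivation: the projection onto the top eigenvector has variance λ_1, and no other unit direction achieves more:

```python
import numpy as np

rng = np.random.default_rng(1)

# Zero-mean samples x_k as the columns of X, and the ML estimate of Sigma_x.
N, m = 4, 2000
X = rng.normal(size=(N, N)) @ rng.normal(size=(N, m))    # correlated data
X -= X.mean(axis=1, keepdims=True)                       # make it mean-free
Sigma = (X @ X.T) / m

# Eigen-decomposition of the symmetric covariance matrix (ascending eigenvalues).
eigvals, eigvecs = np.linalg.eigh(Sigma)
a1, lam1 = eigvecs[:, -1], eigvals[-1]                   # direction of maximum variance

# The projection y1 = a1^T x has variance lambda_1 ...
y1 = a1 @ X
assert np.isclose(y1 @ y1 / m, lam1)

# ... and no other unit-length direction does better (spot check).
for _ in range(1000):
    a = rng.normal(size=N)
    a /= np.linalg.norm(a)
    assert a @ Sigma @ a <= lam1 + 1e-12
```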
9  Karhunen-Loève Transform

Now let's search for a second direction a_2 such that:

    y_2 = a_2^T x   with E[y_2²] maximum,
    and a_2^T a_1 = 0 and a_2^T a_2 = 1.

Similar derivation:

    L(a_2, λ_2) = a_2^T Σ_x a_2 − λ_2 (a_2^T a_2 − 1),   subject to a_2^T a_1 = 0

    => a_2 must be the eigenvector of Σ_x associated with the second largest eigenvalue λ_2.

We can derive N orthonormal directions that maximize the variance:

    A = [a_1, a_2, …, a_N]   and   y = A^T x

The resulting matrix A is known as the Principal Component Analysis (PCA) or Karhunen-Loève transform (KLT):

    y = A^T x,    x = Σ_{i=1}^N y_i a_i

10  Karhunen-Loève Transform: 2nd Derivation

Problem
Let x = (x_1, ..., x_N) ∈ ℝ^N be a feature vector of zero-mean, real-valued random variables.
We seek a transformation A of x that results in a new set of variables y = A^T x (feature vectors) which are uncorrelated (i.e. E[y_i y_j] = 0 for i ≠ j).

• Let y = A^T x; then, by definition of the correlation matrix:

    R_y = E[ y y^T ] = E[ A^T x x^T A ] = A^T R_x A

• R_x is symmetric => its eigenvectors are mutually orthogonal.
11  Karhunen-Loève Transform

i.e. if we choose A such that its columns a_i are orthonormal eigenvectors of R_x, we get:

    R_y = A^T R_x A = diag(λ_1, λ_2, …, λ_N)

• If we further assume R_x to be positive definite, the eigenvalues λ_i will be positive.

The resulting matrix A is known as the Karhunen-Loève transform (KLT):

    y = A^T x,    x = Σ_{i=1}^N y_i a_i

12  Karhunen-Loève Transform

The Karhunen-Loève transform (KLT):

    y = A^T x,    x = Σ_{i=1}^N y_i a_i

For mean-free vectors (e.g. replace x by x − E[x]) this process diagonalizes the covariance matrix, i.e. Σ_y is diagonal.
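A quick numpy check of the decorrelation property (again with synthetic data; not part of the slides): with A built from the orthonormal eigenvectors of the covariance matrix, Σ_y comes out diagonal and x is recovered from y exactly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Mean-free data and its (ML) covariance matrix.
N, m = 5, 5000
X = rng.normal(size=(N, N)) @ rng.normal(size=(N, m))
X -= X.mean(axis=1, keepdims=True)
Sigma_x = (X @ X.T) / m

# Columns of A = orthonormal eigenvectors of Sigma_x, sorted by descending eigenvalue.
eigvals, A = np.linalg.eigh(Sigma_x)
eigvals, A = eigvals[::-1], A[:, ::-1]

# KLT: y = A^T x decorrelates the components ...
Y = A.T @ X
Sigma_y = (Y @ Y.T) / m
assert np.allclose(Sigma_y, np.diag(eigvals))   # diagonal with lambda_1 .. lambda_N

# ... and x is recovered exactly from y, since A is orthogonal: x = sum_i y_i a_i.
assert np.allclose(A @ Y, X)
```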
13  KLT Properties: MSE-Approximation

We define a new vector x̂ in an m-dimensional subspace (m < N), using only m basis vectors:

    x̂ = Σ_{i=1}^m y_i a_i

i.e. the projection of x onto the subspace spanned by the m used (orthonormal) eigenvectors.
Now, what is the expected mean square error between x and its projection x̂?

    E[ ‖x − x̂‖² ] = E[ ‖ Σ_{i=m+1}^N y_i a_i ‖² ] = E[ Σ_{i=m+1}^N Σ_{j=m+1}^N (y_i a_i)^T (y_j a_j) ]

14  KLT Properties: MSE-Approximation

    E[ ‖x − x̂‖² ] = E[ Σ_{i=m+1}^N Σ_{j=m+1}^N (y_i a_i)^T (y_j a_j) ] = Σ_{i=m+1}^N E[ y_i² ] = Σ_{i=m+1}^N λ_i

(the cross terms vanish because a_i^T a_j = δ_ij). The error is minimized if we choose as basis those eigenvectors corresponding to the m largest eigenvalues of the correlation matrix.

• Among all possible orthogonal transforms, the KLT is the one leading to minimum MSE.

This form of the KLT (as presented here) is also referred to as Principal Component Analysis (PCA). The principal components are the eigenvectors, ordered (descending) by their respective eigenvalue magnitudes λ_i.
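The error formula can be verified empirically; a sketch with synthetic data (not from the slides) showing that the m-term reconstruction error equals the sum of the discarded eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(3)

# Mean-free data, covariance, and KLT basis A (descending eigenvalues), as above.
N, n_samples = 6, 20000
X = rng.normal(size=(N, N)) @ rng.normal(size=(N, n_samples))
X -= X.mean(axis=1, keepdims=True)
Sigma_x = (X @ X.T) / n_samples
eigvals, A = np.linalg.eigh(Sigma_x)
eigvals, A = eigvals[::-1], A[:, ::-1]

Y = A.T @ X                                   # KLT coefficients y = A^T x

for m in range(1, N):
    # m-term approximation x_hat = sum_{i<=m} y_i a_i (keep m coefficients, drop the rest).
    X_hat = A[:, :m] @ Y[:m, :]
    mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))
    # The mean squared error equals the sum of the discarded eigenvalues.
    assert np.isclose(mse, eigvals[m:].sum())
```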
15  KLT Properties: Total Variance

Let w.l.o.g. E[x] = 0 and y = A^T x the KLT (PCA) of x.

• From the previous definitions we get:

    σ²_{y_i} = E[ y_i² ] = λ_i

• i.e. the eigenvalues of the input covariance matrix are equal to the variances of the transformed coordinates.

Selecting those features corresponding to the m largest eigenvalues retains the maximum possible total variance (sum of component variances) associated with the original random variables x_i.

16  KLT Properties: Entropy

For a random vector y, the entropy H_y = −E[ ln p_y(y) ] is a measure of the randomness of the underlying process.

Example: for a zero-mean m-dimensional Gaussian:

    H_y = −E[ ln( (2π)^{−m/2} |Σ_y|^{−1/2} exp( −½ y^T Σ_y^{−1} y ) ) ]
        = (m/2) ln(2π) + ½ ln |Σ_y| + ½ E[ y^T Σ_y^{−1} y ]

with

    E[ y^T Σ_y^{−1} y ] = E[ trace( y^T Σ_y^{−1} y ) ] = E[ trace( Σ_y^{−1} y y^T ) ] = trace( Σ_y^{−1} E[ y y^T ] ) = trace( I_m ) = m

so that (with Σ_y diagonal after the KLT):

    H_y = (m/2) ln(2π) + m/2 + ½ Σ_{i=1}^m ln λ_i

Selecting those features corresponding to the m largest eigenvalues maximizes the entropy in the remaining features. No wonder: variance and randomness are directly related!
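The entropy argument can be illustrated with a small numeric sketch (the eigenvalues are synthetic, and the brute-force subset search is only for demonstration): for a Gaussian with diagonal covariance the differential entropy grows with the sum of the log-variances, so keeping the coordinates with the m largest eigenvalues is optimal.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)

# Eigenvalues (= variances of the KLT coordinates) of some covariance matrix.
eigvals = np.sort(rng.uniform(0.1, 5.0, size=6))[::-1]   # descending lambda_1..lambda_N
m = 3                                                     # number of features to keep

def gaussian_entropy(variances):
    """Differential entropy of a zero-mean Gaussian with diagonal covariance:
    H = (k/2) ln(2*pi) + k/2 + (1/2) * sum_i ln(lambda_i)."""
    k = len(variances)
    return 0.5 * k * np.log(2 * np.pi) + 0.5 * k + 0.5 * np.sum(np.log(variances))

# Entropy of every possible m-element subset of coordinates ...
entropies = {subset: gaussian_entropy(eigvals[list(subset)])
             for subset in combinations(range(len(eigvals)), m)}

# ... is maximized by keeping the coordinates with the m largest eigenvalues.
best = max(entropies, key=entropies.get)
assert best == tuple(range(m))
```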
17  Computing a PCA

Problem: Given mean-free data X, a set of n feature vectors x_i ∈ ℝ^m, compute the orthonormal eigenvectors a_i of the correlation matrix R_x.

There are many algorithms that can compute eigenvectors of a matrix very efficiently. However, most of these methods can be very unstable in certain special cases.

Here we present the SVD, a method that is in general not the most efficient one, but one that can be made numerically stable very easily.

18  Computing a PCA: Singular Value Decomposition

An excursus to linear algebra (without proofs).
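One way to connect the two, sketched in numpy with synthetic data (this is my own bridging example, not the lecture's reference implementation): the left singular vectors of the mean-free data matrix are orthonormal eigenvectors of R_x = (1/n) X X^T, with eigenvalues σ_i²/n.

```python
import numpy as np

rng = np.random.default_rng(5)

# n mean-free feature vectors x_i in R^m, stored as the columns of X (m x n).
m, n = 4, 300
X = rng.normal(size=(m, m)) @ rng.normal(size=(m, n))
X -= X.mean(axis=1, keepdims=True)

# PCA via SVD: X = U S V^T. The columns of U are orthonormal eigenvectors of
# R_x = (1/n) X X^T, since R_x = U (S^2 / n) U^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
eigvals = s**2 / n                                   # descending eigenvalues of R_x

# Cross-check: U diagonalizes R_x, with the eigenvalues on the diagonal.
R_x = (X @ X.T) / n
assert np.allclose(U.T @ R_x @ U, np.diag(eigvals))

# The KLT/PCA coefficients of the data are then y = U^T x.
Y = U.T @ X
```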
19  Singular Value Decomposition

SVD (reduced version): For matrices A ∈ ℝ^{m×n} with m ≥ n, there exist matrices U ∈ ℝ^{m×n} with orthonormal columns (U^T U = I), V ∈ ℝ^{n×n} orthogonal (V^T V = I), and Σ ∈ ℝ^{n×n} diagonal, such that

    A = U Σ V^T

[Figure: the m×n matrix A written as the product U (m×n) · Σ (n×n) · V^T (n×n).]

• The diagonal entries of Σ, (σ_1, σ_2, …, σ_n), are called the singular values.
• It is customary to sort them: σ_1 ≥ σ_2 ≥ … ≥ σ_n.

20  SVD Applications

SVD is an all-rounder! Once you have U, Σ, V, you can use them to:

- Solve linear systems A x = b:
  a) compute the matrix inverse, if A^{-1} exists
  b) for fewer equations than unknowns
  c) for more equations than unknowns
  d) if there is no solution: compute the x for which |A x − b| = min
  e) compute the rank (numerical rank) of a matrix
- …
- Compute a PCA / KLT
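A minimal numpy sketch of the reduced SVD and its stated properties (the matrix is synthetic; `full_matrices=False` selects the reduced form):

```python
import numpy as np

rng = np.random.default_rng(6)

# A tall matrix A in R^(m x n) with m >= n.
m, n = 8, 3
A = rng.normal(size=(m, n))

# Reduced SVD: full_matrices=False returns U (m x n), s (n,), Vt (n x n).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert np.allclose(U.T @ U, np.eye(n))          # U has orthonormal columns
assert np.allclose(Vt @ Vt.T, np.eye(n))        # V is orthogonal
assert np.all(np.diff(s) <= 0)                  # singular values sorted descending
assert np.allclose(U @ np.diag(s) @ Vt, A)      # A = U * Sigma * V^T
```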
21  SVD: Matrix Inverse A^{-1}

A x = b,  A = U Σ V^T  (U, Σ, V exist for every A).
If A is square (n × n) and not singular, then A^{-1} exists:

    A^{-1} = (U Σ V^T)^{-1} = V Σ^{-1} U^T,   with Σ^{-1} = diag(1/σ_1, …, 1/σ_n)

Computing A^{-1} for a singular A!? Since U, Σ, V all exist, the only problem can arise if some σ_i = 0 (or is numerically close to zero).
--> The singular values indicate whether A is singular or not!

22  SVD: Rank of a Matrix

- The rank of A is the number of non-zero singular values.
- If there are very small singular values σ_i, then A is close to being singular. We can set a threshold t and set σ_i = 0 if σ_i ≤ t; then

    numeric_rank(A) = #{ σ_i | σ_i > t }

[Figure: the decomposition A = U Σ V^T with the singular values σ_1, σ_2, …, σ_n on the diagonal of Σ.]
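A sketch of both ideas in numpy (synthetic matrix; the threshold t is chosen for illustration): the numerical rank from thresholded singular values, and the (pseudo-)inverse obtained by inverting only the reliable σ_i.

```python
import numpy as np

rng = np.random.default_rng(7)

# A rank-deficient 4 x 4 matrix: the third column is the sum of the first two,
# so the exact rank is 3 (the computed sigma_4 is tiny rather than exactly 0).
A = rng.normal(size=(4, 4))
A[:, 2] = A[:, 0] + A[:, 1]

U, s, Vt = np.linalg.svd(A)

# Numerical rank: choose a threshold t and count the singular values above it.
t = 1e-10 * s[0]
numeric_rank = int(np.sum(s > t))
assert numeric_rank == 3
assert numeric_rank == np.linalg.matrix_rank(A, tol=t)

# (Pseudo-)inverse via the SVD: invert only the reliable singular values and set
# 1/sigma_i := 0 where sigma_i <= t (this yields the Moore-Penrose pseudo-inverse).
s_inv = np.where(s > t, 1.0 / s, 0.0)
A_pinv = Vt.T @ np.diag(s_inv) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A, rcond=1e-10))
```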