The Covariance Matrix (insertion)


  1. (Slide 1) Feature Selection: Linear Transformations: y_new = M x_old.
     (Slide 3) Constraint Optimization (insertion). Problem: given an objective function f(x) to be optimized, let the constraints be h_k(x) = c_k; moving the constants to the left gives g_k(x) := h_k(x) - c_k = 0. Both f(x) and the g_k(x) must have continuous first partial derivatives. A solution: Lagrange multipliers, 0 = ∇_x f(x) + Σ_k λ_k ∇_x g_k(x), or, starting from the Lagrangian L(x, λ) = f(x) + Σ_k λ_k g_k(x), require ∇_x L(x, λ) = 0.
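As a quick illustration of this recipe (my own example, not from the slides; all names below are hypothetical), maximize f(x1, x2) = x1 + x2 on the unit circle:

```python
import sympy as sp

# Illustrative Lagrange-multiplier example: maximize f = x1 + x2
# subject to g = x1^2 + x2^2 - 1 = 0, using the Lagrangian L = f + lam * g.
x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
f = x1 + x2
g = x1**2 + x2**2 - 1
L = f + lam * g

# Stationarity of L in x, together with the constraint g = 0
stationary = sp.solve([sp.diff(L, x1), sp.diff(L, x2), g], [x1, x2, lam], dict=True)
print(stationary)
# Two candidates: (1/sqrt(2), 1/sqrt(2)) gives the maximum f = sqrt(2),
# (-1/sqrt(2), -1/sqrt(2)) gives the minimum.
```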

  2. (Slide 4) The Covariance Matrix (insertion). Definition: let x = (x_1, ..., x_N) ∈ ℝ^N be a real-valued random vector (data vector) with expected value E[x] = μ. The covariance matrix Σ_x of x is defined as Σ_x := E[(x - μ)(x - μ)^T], with matrix elements Σ_ij = E[(x_i - μ_i)(x_j - μ_j)]. Application: estimating E[x] and E[(x - E[x])(x - E[x])^T] from data. Assume m samples of the random vector x ∈ ℝ^N, i.e. a set of m vectors {x_1, ..., x_m} ⊂ ℝ^N, or, put into a data matrix, X ∈ ℝ^(N×m). The maximum-likelihood estimators for μ and Σ_x are μ_ML = (1/m) Σ_{k=1}^m x_k and Σ_ML = (1/m) Σ_{k=1}^m (x_k - μ_ML)(x_k - μ_ML)^T, which for mean-free data is (1/m) X X^T.
     (Slide 5) KLT/PCA Motivation: find meaningful “directions” in correlated data; linear dimensionality reduction; visualization of higher-dimensional data; compression / noise reduction; PDF estimation.
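A minimal numerical sketch of these two estimators (my own illustration; variable names are assumptions), with the data matrix X ∈ ℝ^(N×m) holding one sample per column:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 3, 500                          # feature dimension, number of samples
X = rng.normal(size=(N, m))            # data matrix, one sample x_k per column

mu_ml = X.mean(axis=1, keepdims=True)  # mu_ML = (1/m) * sum_k x_k
Xc = X - mu_ml                         # mean-free data
sigma_ml = (Xc @ Xc.T) / m             # Sigma_ML = (1/m) * sum_k (x_k - mu)(x_k - mu)^T

# Compare with numpy's estimator (bias=True gives the 1/m normalization)
assert np.allclose(sigma_ml, np.cov(X, bias=True))
```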

  3. (Slide 7) Karhunen-Loève Transform: 1st Derivation. Problem: let x = (x_1, ..., x_N) ∈ ℝ^N be a feature vector of zero-mean, real-valued random variables. We seek the direction a_1 of maximum variance, i.e. y_1 = a_1^T x such that E[y_1²] is maximal, under the constraint a_1^T a_1 = 1. This is a constrained optimization, so we use the Lagrangian L(a_1, λ_1) = E[a_1^T x x^T a_1] - λ_1 (a_1^T a_1 - 1) = a_1^T Σ_x a_1 - λ_1 (a_1^T a_1 - 1), with λ_1 the Lagrange multiplier.
     (Slide 8) Karhunen-Loève Transform. For E[y_1²] to be maximal we require ∂L(a_1, λ_1)/∂a_1 = 0, i.e. Σ_x a_1 - λ_1 a_1 = 0, so a_1 must be an eigenvector of Σ_x with eigenvalue λ_1. Then E[y_1²] = a_1^T Σ_x a_1 = λ_1, so for E[y_1²] to be maximal, λ_1 must be the largest eigenvalue.
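A small numerical check of this result (my own sketch, not from the slides): the variance along the leading eigenvector equals λ_1, and no random unit direction does better.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 2000))
X -= X.mean(axis=1, keepdims=True)        # zero-mean feature vectors
Sigma = (X @ X.T) / X.shape[1]

eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
a1, lam1 = eigvecs[:, -1], eigvals[-1]    # leading eigenvector / eigenvalue

print(np.var(a1 @ X), lam1)               # E[y1^2] along a_1 equals lambda_1

# No random unit direction a gives a larger variance than lambda_1
for _ in range(1000):
    a = rng.normal(size=3)
    a /= np.linalg.norm(a)
    assert np.var(a @ X) <= lam1 + 1e-9
```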

  4. (Slide 9) Karhunen-Loève Transform. Now we search for a second direction a_2, i.e. y_2 = a_2^T x such that E[y_2²] is maximal, with a_2^T a_1 = 0 and a_2^T a_2 = 1. A similar derivation, L(a_2, λ_2) = a_2^T Σ_x a_2 - λ_2 (a_2^T a_2 - 1) together with a_2^T a_1 = 0, shows that a_2 must be the eigenvector of Σ_x associated with the second-largest eigenvalue λ_2. In this way we can derive N orthonormal directions that maximize the variance: A = [a_1, a_2, ..., a_N], y = A^T x and x = Σ_{i=1}^N y_i a_i. The resulting matrix A is known as the Principal Component Analysis (PCA) or Karhunen-Loève transform (KLT).
     (Slide 10) Karhunen-Loève Transform: 2nd Derivation. Problem: let x = (x_1, ..., x_N) ∈ ℝ^N be a feature vector of zero-mean, real-valued random variables. We seek a transformation A of x that results in a new set of variables (feature vectors) y = A^T x which are uncorrelated, i.e. E[y_i y_j] = 0 for i ≠ j. Let y = A^T x; then, by definition of the correlation matrix, R_y = E[y y^T] = E[A^T x x^T A] = A^T R_x A. R_x is symmetric ⇒ its eigenvectors are mutually orthogonal.
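A sketch (my own, illustrative) of building the full transform A from the eigenvectors of Σ_x and checking that the directions are orthonormal and the transformed features uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(2)
# Samples of a zero-mean correlated Gaussian; re-centered empirically
X = rng.multivariate_normal(mean=np.zeros(3),
                            cov=[[4, 2, 0], [2, 3, 0], [0, 0, 1]],
                            size=5000).T        # shape (N, m)
X -= X.mean(axis=1, keepdims=True)
Sigma_x = (X @ X.T) / X.shape[1]

eigvals, A = np.linalg.eigh(Sigma_x)            # columns of A: orthonormal eigenvectors
A = A[:, ::-1]                                  # sort by descending eigenvalue

Y = A.T @ X                                     # y = A^T x, the KLT/PCA features
Sigma_y = (Y @ Y.T) / Y.shape[1]

# Off-diagonal correlations vanish: E[y_i y_j] = 0 for i != j
assert np.allclose(Sigma_y - np.diag(np.diag(Sigma_y)), 0, atol=1e-10)
# Second direction is orthogonal to the first (a_2^T a_1 = 0)
assert abs(A[:, 0] @ A[:, 1]) < 1e-12
```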

  5. (Slide 11) Karhunen-Loève Transform. That is, if we choose A such that its columns a_i are the orthonormal eigenvectors of R_x, we get R_y = A^T R_x A = diag(λ_1, ..., λ_N). If we further assume R_x to be positive definite, the eigenvalues λ_i will be positive. The resulting matrix A is known as the Karhunen-Loève transform (KLT): y = A^T x, x = Σ_{i=1}^N y_i a_i.
     (Slide 12) Karhunen-Loève Transform. The Karhunen-Loève transform (KLT): y = A^T x, x = Σ_{i=1}^N y_i a_i. For mean-free vectors (e.g. replace x by x - E[x]) this process diagonalizes the covariance matrix Σ_y.
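A brief sketch (my own) checking the synthesis formula x = Σ_i y_i a_i, i.e. that the orthonormal A inverts the transform, and that a positive-definite R_x yields positive eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(4, 1000))
X -= X.mean(axis=1, keepdims=True)
R_x = (X @ X.T) / X.shape[1]          # symmetric, positive definite for generic data

eigvals, A = np.linalg.eigh(R_x)
assert np.all(eigvals > 0)            # positive definite => all lambda_i > 0

Y = A.T @ X                           # analysis:  y = A^T x
X_rec = A @ Y                         # synthesis: x = sum_i y_i a_i
assert np.allclose(X_rec, X)          # orthonormal A => exact reconstruction
```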

  6. (Slide 13) KLT Properties: MSE Approximation. We define a new vector x̂ in an m-dimensional subspace (m < N) using only m basis vectors: x̂ = Σ_{i=1}^m y_i a_i, i.e. the projection of x into the subspace spanned by the m used (orthonormal) eigenvectors. What is the expected mean square error between x and its projection x̂? E[‖x - x̂‖²] = E[‖Σ_{i=m+1}^N y_i a_i‖²] = E[Σ_i Σ_j (y_i a_i)^T (y_j a_j)].
     (Slide 14) KLT Properties: MSE Approximation. Using the orthonormality of the a_i, E[‖x - x̂‖²] = E[Σ_{i=m+1}^N Σ_j (y_i a_i)^T (y_j a_j)] = Σ_{i=m+1}^N E[y_i²]. The error is minimized if we choose as basis those eigenvectors corresponding to the m largest eigenvalues of the correlation matrix. Among all possible orthogonal transforms, the KLT is the one leading to minimum MSE. This form of the KLT (as presented here) is also referred to as Principal Component Analysis (PCA); the principal components are the eigenvectors ordered (descending) by their eigenvalue magnitudes λ_i.
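A numerical sketch (illustrative, not from the slides) of the MSE property: the expected squared reconstruction error when keeping only the top m eigenvectors equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 10000))
X -= X.mean(axis=1, keepdims=True)
Sigma = (X @ X.T) / X.shape[1]

eigvals, A = np.linalg.eigh(Sigma)
A, eigvals = A[:, ::-1], eigvals[::-1]           # descending order

m_keep = 2
A_m = A[:, :m_keep]                              # keep the m leading eigenvectors
X_hat = A_m @ (A_m.T @ X)                        # projection onto the subspace

mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))  # E[||x - x_hat||^2]
print(mse, eigvals[m_keep:].sum())               # the two values agree
```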

  7. (Slide 15) KLT Properties: Total Variance. Let, w.l.o.g., E[x] = 0 and let y = A^T x be the KLT (PCA) of x. From the previous definitions we get σ²_{y_i} = E[y_i²] = λ_i, i.e. the eigenvalues of the input covariance matrix are equal to the variances of the transformed coordinates. ⇒ Selecting the features corresponding to the m largest eigenvalues retains the maximal possible total variance (sum of component variances) associated with the original random variables x_i.
     (Slide 16) KLT Properties: Entropy. For a random vector y the entropy H_y = -E[ln p_y(y)] is a measure of the randomness of the underlying process. Example: for a zero-mean (μ = 0) m-dimensional Gaussian, H_y = -E[ln((2π)^{-m/2} |Σ_y|^{-1/2} exp(-½ y^T Σ_y^{-1} y))] = (m/2) ln(2π) + (1/2) ln|Σ_y| + (1/2) E[y^T Σ_y^{-1} y]. Since E[y^T Σ_y^{-1} y] = E[trace(Σ_y^{-1} y y^T)] = trace(Σ_y^{-1} E[y y^T]) = trace(I_m) = m, this gives H_y = (m/2) ln(2π) + (1/2) Σ_{i=1}^m ln λ_i + m/2. ⇒ Selecting the features corresponding to the m largest eigenvalues maximizes the entropy in the remaining features. No wonder: variance and randomness are directly related!
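A quick check (my own sketch) that the component variances of y are the eigenvalues, so the retained total variance is the sum of the selected λ_i:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(4, 20000)) * np.array([[3.0], [2.0], [1.0], [0.5]])
X -= X.mean(axis=1, keepdims=True)
Sigma = (X @ X.T) / X.shape[1]

eigvals, A = np.linalg.eigh(Sigma)
Y = A.T @ X

# Variance of each transformed coordinate equals the corresponding eigenvalue
assert np.allclose(Y.var(axis=1), eigvals)
# Total variance is preserved: sum of sigma_{y_i}^2 = sum of lambda_i = trace(Sigma)
assert np.isclose(eigvals.sum(), np.trace(Sigma))
```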

  8. (Slide 17) Computing a PCA. Problem: given mean-free data X, a set of n feature vectors x_i ∈ ℝ^m, compute the orthonormal eigenvectors a_i of the correlation matrix R_x. There are many algorithms that compute eigenvectors of a matrix very efficiently; however, most of these methods can be very unstable in certain special cases. Here we present SVD, a method that is in general not the most efficient one, but which can be made numerically stable very easily.
     (Slide 18) Computing a PCA: Singular Value Decomposition, an excursus to linear algebra (without proofs).
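A sketch of how the SVD yields the PCA directly from the data, without forming R_x explicitly (my own illustration; assumes a mean-free data matrix with one sample per column and numpy conventions):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 1000))
X -= X.mean(axis=1, keepdims=True)               # mean-free data, shape (N, n)
n = X.shape[1]

# Reduced SVD of the data matrix: X = U S V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Columns of U are the orthonormal eigenvectors a_i of R_x = (1/n) X X^T,
# and the eigenvalues are lambda_i = s_i^2 / n
R_x = (X @ X.T) / n
lam = s**2 / n
assert np.allclose(U.T @ R_x @ U, np.diag(lam))

Y = U.T @ X                                      # KLT/PCA features y = A^T x
```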

  9. (Slide 19) Singular Value Decomposition. SVD (reduced version): for matrices A ∈ ℝ^(m×n) with m ≥ n, there exist matrices U ∈ ℝ^(m×n) with orthonormal columns (U^T U = I), an orthogonal V ∈ ℝ^(n×n) (V^T V = I), and a diagonal Σ ∈ ℝ^(n×n), such that A = U Σ V^T. The diagonal values of Σ (σ_1, σ_2, ..., σ_n) are called the singular values; it is customary to sort them: σ_1 ≥ σ_2 ≥ ... ≥ σ_n.
     (Slide 20) SVD Applications. SVD is an all-rounder! Once you have U, Σ, V, you can use them to solve linear systems A x = b: a) compute the matrix inverse if A^(-1) exists, b) for fewer equations than unknowns, c) for more equations than unknowns, d) if there is no solution, compute the x with |A x - b| = min, e) compute the rank (numerical rank) of a matrix, ..., and compute a PCA / KLT.
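A minimal numpy sketch (my own) of the reduced SVD and its orthogonality properties:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                       # m x n with m >= n

U, s, Vt = np.linalg.svd(A, full_matrices=False) # reduced SVD: U is m x n
S = np.diag(s)                                   # singular values, sorted descending

assert np.allclose(A, U @ S @ Vt)                # A = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(2))           # U has orthonormal columns
assert np.allclose(Vt @ Vt.T, np.eye(2))         # V is orthogonal
print(s)                                         # sigma_1 >= sigma_2 > 0
```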

  10. (Slide 21) SVD: Matrix Inverse A^(-1). For A x = b with A = U Σ V^T (U, Σ, V exist for every A): if A is square (n × n) and not singular, then A^(-1) exists and A^(-1) = V Σ^(-1) U^T with Σ^(-1) = diag(1/σ_1, ..., 1/σ_n). Computing A^(-1) for a singular A!? Since U, Σ, V always exist, the only problem can originate if some σ_i = 0 or is numerically close to zero ⇒ the singular values indicate whether A is singular or not.
      (Slide 22) SVD: Rank of a Matrix. The rank of A is the number of non-zero singular values. If there are very small singular values σ_i, then A is close to being singular. We can set a threshold t and set σ_i = 0 if σ_i ≤ t; then numeric_rank(A) = #{ σ_i | σ_i > t }.
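A sketch (illustrative, with an arbitrarily chosen threshold t) of using the singular values for the numerical rank and for a pseudoinverse of a singular matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],       # multiple of row 1 -> A is (numerically) singular
              [1.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A)
t = 1e-10                            # threshold for "numerically zero" singular values

numeric_rank = int(np.sum(s > t))    # numeric_rank(A) = #{ sigma_i | sigma_i > t }
print(numeric_rank)                  # 2 for this matrix

# Pseudoinverse: invert only the singular values above the threshold
s_inv = np.where(s > t, 1.0 / s, 0.0)
A_pinv = Vt.T @ np.diag(s_inv) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))
```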
