Discriminative Feature Extraction and Dimension Reduction - PCA & LDA
Berlin Chen, 2004
Introduction

• Goal: discover significant patterns or features from the input data
  – Salient feature selection or dimensionality reduction
  – Compute an input-output mapping based on some desirable properties

[Figure: a network with weight matrix W maps x in the input space to y in the feature space]
Introduction (cont.)

• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Heteroscedastic Discriminant Analysis (HDA)
Introduction (cont.)

• Formulation for discriminative feature extraction
  – Model-free (nonparametric)
    • Without prior information: e.g., PCA
    • With prior information: e.g., LDA
  – Model-dependent (parametric)
    • E.g., EM (Expectation-Maximization) or MCE (Minimum Classification Error) training
Principal Component Analysis (PCA) (Pearson, 1901)

• Also known as the Karhunen-Loève Transform (1947, 1963) or the Hotelling Transform (1933)
• A standard technique commonly used for data reduction in statistical pattern recognition and signal processing
• A transform by which the data set can be represented by a reduced number of effective features while still retaining most of its intrinsic information content
  – A small set of features is found that represents the data samples accurately
• Also called "subspace decomposition"
Principal Component Analysis (cont.)

[Figure: the patterns show a significant difference from each other along one of the transformed axes]
Principal Component Analysis (cont.)

• Suppose $\mathbf{x}$ is an $n$-dimensional zero-mean random vector, $E\{\mathbf{x}\} = \mathbf{0}$
  – If $\mathbf{x}$ is not zero-mean, we can subtract the mean before the following analysis
  – $\mathbf{x}$ can be represented without error by a summation of $n$ linearly independent vectors:

    $\mathbf{x} = \sum_{i=1}^{n} y_i \boldsymbol{\phi}_i = \boldsymbol{\Phi}\mathbf{y}$,
    where $\mathbf{y} = [\, y_1 \;\cdots\; y_i \;\cdots\; y_n \,]^T$ and $\boldsymbol{\Phi} = [\, \boldsymbol{\phi}_1 \;\cdots\; \boldsymbol{\phi}_i \;\cdots\; \boldsymbol{\phi}_n \,]$

  – $y_i$ is the $i$-th component in the feature (mapped) space; the $\boldsymbol{\phi}_i$ are the basis vectors
Principal Component Analysis (cont.)

• Further assume the column (basis) vectors of the matrix $\boldsymbol{\Phi}$ form an orthonormal set:

    $\boldsymbol{\phi}_i^T \boldsymbol{\phi}_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$

  – Such that $y_i$ is equal to the projection of $\mathbf{x}$ on $\boldsymbol{\phi}_i$:

    $y_i = \boldsymbol{\phi}_i^T \mathbf{x}, \quad \forall i$

    (geometrically, e.g., $y_1 = \boldsymbol{\phi}_1^T \mathbf{x} = \|\mathbf{x}\|\cos\theta$ with $\|\boldsymbol{\phi}_1\| = 1$)

• $y_i$ also has the following properties
  – Its mean is zero, too:

    $E\{y_i\} = E\{\boldsymbol{\phi}_i^T \mathbf{x}\} = \boldsymbol{\phi}_i^T E\{\mathbf{x}\} = 0$

  – Its variance is

    $\sigma_i^2 = E\{y_i^2\} = E\{\boldsymbol{\phi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\phi}_i\} = \boldsymbol{\phi}_i^T E\{\mathbf{x}\mathbf{x}^T\} \boldsymbol{\phi}_i = \boldsymbol{\phi}_i^T \mathbf{R}\, \boldsymbol{\phi}_i$

    where $\mathbf{R} = E\{\mathbf{x}\mathbf{x}^T\} \approx \frac{1}{N}\sum \mathbf{x}\mathbf{x}^T$ is the (auto-)correlation matrix of $\mathbf{x}$
Principal Component Analysis (cont.)

• The correlation between two projections $y_i$ and $y_j$ is

    $E\{y_i y_j\} = E\{(\boldsymbol{\phi}_i^T \mathbf{x})(\boldsymbol{\phi}_j^T \mathbf{x})^T\} = E\{\boldsymbol{\phi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\phi}_j\} = \boldsymbol{\phi}_i^T E\{\mathbf{x}\mathbf{x}^T\} \boldsymbol{\phi}_j = \boldsymbol{\phi}_i^T \mathbf{R}\, \boldsymbol{\phi}_j$
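A small NumPy sketch of these properties (the data, seed, and variable names are illustrative assumptions, not from the slides): it estimates R from zero-mean samples, projects them onto an orthonormal basis, and checks that each projection has zero mean and variance phi_i^T R phi_i.

    import numpy as np

    rng = np.random.default_rng(0)

    # Zero-mean samples x (n = 3 dimensions, N samples), stored as columns
    N, n = 10000, 3
    X = rng.multivariate_normal(np.zeros(n), [[4, 1, 0], [1, 2, 0], [0, 0, 1]], size=N).T
    X -= X.mean(axis=1, keepdims=True)      # subtract the mean first

    # Sample (auto-)correlation matrix R = E{x x^T} ~ (1/N) sum x x^T
    R = X @ X.T / N

    # Any orthonormal basis Phi (here from a QR decomposition of a random matrix)
    Phi, _ = np.linalg.qr(rng.standard_normal((n, n)))

    # Projections y_i = phi_i^T x for every sample
    Y = Phi.T @ X

    print(Y.mean(axis=1))                   # ~ [0, 0, 0]: zero-mean projections
    print(Y.var(axis=1))                    # matches the diagonal of Phi^T R Phi
    print(np.diag(Phi.T @ R @ Phi))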
Principal Component Analysis (cont.)

• Minimum mean-squared error criterion
  – We want to keep only $m$ of the $\boldsymbol{\phi}_i$'s such that $\mathbf{x}$ can still be approximated well under the mean-squared error criterion:

    $\mathbf{x} = \sum_{i=1}^{n} y_i \boldsymbol{\phi}_i = \sum_{i=1}^{m} y_i \boldsymbol{\phi}_i + \sum_{j=m+1}^{n} y_j \boldsymbol{\phi}_j, \qquad \hat{\mathbf{x}}(m) = \sum_{i=1}^{m} y_i \boldsymbol{\phi}_i$

    $\bar{\varepsilon}(m) = E\{\|\mathbf{x} - \hat{\mathbf{x}}(m)\|^2\}
      = E\Big\{\Big(\sum_{j=m+1}^{n} y_j \boldsymbol{\phi}_j\Big)^T \Big(\sum_{k=m+1}^{n} y_k \boldsymbol{\phi}_k\Big)\Big\}
      = E\Big\{\sum_{j=m+1}^{n}\sum_{k=m+1}^{n} y_j y_k\, \boldsymbol{\phi}_j^T \boldsymbol{\phi}_k\Big\}
      = \sum_{j=m+1}^{n} E\{y_j^2\} = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \boldsymbol{\phi}_j^T \mathbf{R}\, \boldsymbol{\phi}_j$

    (using $\boldsymbol{\phi}_j^T \boldsymbol{\phi}_k = \delta_{jk}$ and, since $E\{y_j\} = 0$, $\sigma_j^2 = E\{y_j^2\} - (E\{y_j\})^2 = E\{y_j^2\}$)

  – We should therefore discard the bases on which the projections have lower variances
Principal Component Analysis (cont.)

• Minimum mean-squared error criterion
  – If the orthonormal (basis) set $\{\boldsymbol{\phi}_i\}$ is selected to be the eigenvectors of the correlation matrix $\mathbf{R}$, with associated eigenvalues $\{\lambda_i\}$, they have the property

    $\mathbf{R}\,\boldsymbol{\phi}_j = \lambda_j \boldsymbol{\phi}_j$

    ($\mathbf{R}$ is real and symmetric, therefore its eigenvectors form an orthonormal set)

  – The mean-squared error mentioned above then becomes

    $\bar{\varepsilon}(m) = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \boldsymbol{\phi}_j^T \mathbf{R}\, \boldsymbol{\phi}_j = \sum_{j=m+1}^{n} \lambda_j\, \boldsymbol{\phi}_j^T \boldsymbol{\phi}_j = \sum_{j=m+1}^{n} \lambda_j$
Principal Component Analysis (cont.)

• Minimum mean-squared error criterion
  – If the $m$ eigenvectors associated with the $m$ largest eigenvalues are retained, the mean-squared error is

    $\bar{\varepsilon}_{\text{eigen}}(m) = \sum_{j=m+1}^{n} \lambda_j, \quad \text{where } \lambda_1 \ge \dots \ge \lambda_m \ge \dots \ge \lambda_n$

  – Any two projections $y_i$ and $y_j$ ($i \ne j$) will be mutually uncorrelated:

    $E\{y_i y_j\} = E\{(\boldsymbol{\phi}_i^T \mathbf{x})(\boldsymbol{\phi}_j^T \mathbf{x})^T\} = \boldsymbol{\phi}_i^T E\{\mathbf{x}\mathbf{x}^T\} \boldsymbol{\phi}_j = \boldsymbol{\phi}_i^T \mathbf{R}\, \boldsymbol{\phi}_j = \lambda_j\, \boldsymbol{\phi}_i^T \boldsymbol{\phi}_j = 0$

• Good news for most statistical modeling
  – Gaussians and diagonal (covariance) matrices
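A hedged NumPy sketch of the eigenvector choice (dimensions, seed, and variable names are my own): diagonalize R, keep the m eigenvectors with the largest eigenvalues, and verify that the mean-squared reconstruction error equals the sum of the discarded eigenvalues and that the retained projections are uncorrelated.

    import numpy as np

    rng = np.random.default_rng(1)
    N, n, m = 20000, 5, 2

    # Zero-mean data with unequal variances along random directions
    A = rng.standard_normal((n, n))
    X = A @ rng.standard_normal((n, N))
    X -= X.mean(axis=1, keepdims=True)
    R = X @ X.T / N                          # correlation matrix

    # Eigen-decomposition of the symmetric matrix R, sorted by descending eigenvalue
    eigval, eigvec = np.linalg.eigh(R)
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]

    Phi_m = eigvec[:, :m]                    # m eigenvectors with the largest eigenvalues
    Y = Phi_m.T @ X                          # projections
    X_hat = Phi_m @ Y                        # rank-m reconstruction

    mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))
    print(mse, eigval[m:].sum())             # ~ equal: the sum of the discarded eigenvalues
    print(np.round(Y @ Y.T / N, 3))          # ~ diagonal: projections are uncorrelated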
Principal Component Analysis (cont.)

• A two-dimensional example of principal component analysis
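As a concrete (hypothetical) illustration of such a two-dimensional case, the sketch below generates correlated 2-D points with a made-up covariance and recovers the principal axes, along one of which most of the variance lies.

    import numpy as np

    rng = np.random.default_rng(2)

    # Correlated 2-D cloud: most variance along the direction (1, 1)/sqrt(2)
    X = rng.multivariate_normal([0, 0], [[3.0, 2.5], [2.5, 3.0]], size=5000).T
    X -= X.mean(axis=1, keepdims=True)

    R = X @ X.T / X.shape[1]
    eigval, eigvec = np.linalg.eigh(R)
    print(eigval[::-1])        # ~ [5.5, 0.5]: the first principal axis dominates
    print(eigvec[:, ::-1])     # columns ~ (1,1)/sqrt(2) and (1,-1)/sqrt(2), up to sign

    Y = eigvec[:, ::-1].T @ X  # coordinates in the principal axes
    print(Y.var(axis=1))       # variances ~ the eigenvalues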
Principal Component Analysis (cont.)

• Minimum mean-squared error criterion
  – It can be proved that $\bar{\varepsilon}_{\text{eigen}}(m)$ is the optimal solution under the mean-squared error criterion

    Useful identities: $\dfrac{\partial\, \boldsymbol{\phi}^T \mathbf{R}\, \boldsymbol{\phi}}{\partial \boldsymbol{\phi}} = 2\mathbf{R}\boldsymbol{\phi}, \qquad \dfrac{\partial\, \mathbf{x}^T \mathbf{y}}{\partial \mathbf{x}} = \mathbf{y}$

    Define the objective to be minimized subject to the orthonormality constraints:

    $J = \sum_{j=m+1}^{n} \boldsymbol{\phi}_j^T \mathbf{R}\, \boldsymbol{\phi}_j - \sum_{j=m+1}^{n}\sum_{k=m+1}^{n} \mu_{jk}\left(\boldsymbol{\phi}_j^T \boldsymbol{\phi}_k - \delta_{jk}\right)$

    Taking derivatives, for all $m+1 \le j \le n$:

    $\dfrac{\partial J}{\partial \boldsymbol{\phi}_j} = 2\mathbf{R}\boldsymbol{\phi}_j - \sum_{k=m+1}^{n} \mu_{jk}\, \boldsymbol{\phi}_k = \mathbf{0}$

    $\Rightarrow \mathbf{R}\boldsymbol{\phi}_j = \boldsymbol{\Phi}_{n-m}\,\boldsymbol{\mu}_j, \quad \text{where } \boldsymbol{\Phi}_{n-m} = [\,\boldsymbol{\phi}_{m+1} \cdots \boldsymbol{\phi}_n\,],\ \boldsymbol{\mu}_j = \left[\tfrac{1}{2}\mu_{j,m+1}, \dots, \tfrac{1}{2}\mu_{jn}\right]^T$

    $\Rightarrow \mathbf{R}\,[\,\boldsymbol{\phi}_{m+1} \cdots \boldsymbol{\phi}_n\,] = \boldsymbol{\Phi}_{n-m}\,[\,\boldsymbol{\mu}_{m+1} \cdots \boldsymbol{\mu}_n\,]$

    $\Rightarrow \mathbf{R}\,\boldsymbol{\Phi}_{n-m} = \boldsymbol{\Phi}_{n-m}\,\mathbf{U}_{n-m}, \quad \text{where } \mathbf{U}_{n-m} = [\,\boldsymbol{\mu}_{m+1} \cdots \boldsymbol{\mu}_n\,]$

  – This has a particular solution when $\mathbf{U}_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \dots, \lambda_n$ of $\mathbf{R}$, with $\boldsymbol{\phi}_{m+1}, \dots, \boldsymbol{\phi}_n$ their corresponding eigenvectors
Principal Component Analysis (cont.)

• Given an input vector $\mathbf{x}$ of dimension $n$
  – Try to construct a linear transform $\boldsymbol{\Phi}'$ (an $n \times m$ matrix, $m < n$) such that the truncation result $\boldsymbol{\Phi}'^T \mathbf{x}$ is optimal under the mean-squared error criterion

    Encoder: $\mathbf{y} = \boldsymbol{\Phi}'^T \mathbf{x}$, with $\mathbf{y} = [\, y_1\; y_2\; \cdots\; y_m \,]^T$ and $\boldsymbol{\Phi}' = [\, \mathbf{e}_1\; \mathbf{e}_2\; \cdots\; \mathbf{e}_m \,]$

    Decoder: $\hat{\mathbf{x}} = \boldsymbol{\Phi}'\mathbf{y}$, with $\hat{\mathbf{x}} = [\, \hat{x}_1\; \hat{x}_2\; \cdots\; \hat{x}_n \,]^T$

    Minimize $E\{(\mathbf{x} - \hat{\mathbf{x}})^T(\mathbf{x} - \hat{\mathbf{x}})\}$
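A minimal encoder/decoder sketch of this truncation, assuming zero-mean data stored as columns (the function names pca_basis, encode, and decode and the toy data are my own, not from the slides):

    import numpy as np

    def pca_basis(X, m):
        """Phi' (n x m): the m leading eigenvectors of R = E{x x^T}; X holds zero-mean samples as columns."""
        R = X @ X.T / X.shape[1]
        eigval, eigvec = np.linalg.eigh(R)
        return eigvec[:, np.argsort(eigval)[::-1][:m]]

    def encode(Phi_m, x):
        return Phi_m.T @ x                 # y = Phi'^T x : m numbers instead of n

    def decode(Phi_m, y):
        return Phi_m @ y                   # x_hat = Phi' y

    # Usage: 8-dimensional vectors that lie close to a 3-dimensional subspace
    rng = np.random.default_rng(3)
    B = rng.standard_normal((8, 3))
    X = B @ rng.standard_normal((3, 4000)) + 0.01 * rng.standard_normal((8, 4000))
    X -= X.mean(axis=1, keepdims=True)

    Phi_m = pca_basis(X, m=3)
    x = X[:, 0]
    x_hat = decode(Phi_m, encode(Phi_m, x))
    print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))   # small: m = 3 coefficients keep most of the energy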
Principal Component Analysis (cont.)

• Data compression in communication
  – PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily optimal for classification tasks, such as speech recognition
  – PCA needs no prior information (e.g., class distributions) about the sample patterns
Principal Component Analysis (cont.)

• Example 1: principal components of some data points
Principal Component Analysis (cont.)

• Example 2: feature transformation and selection

[Figure: correlation matrix of the old feature dimensions, the new feature dimensions, and a threshold on the amount of information content retained]
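One common way to realize such a "threshold on the information content retained" is to keep the smallest m whose leading eigenvalues cover a chosen fraction of the total variance. The sketch below is an assumed recipe along those lines, not necessarily the one used in the slide's example; the toy correlation matrix and the 0.95 threshold are placeholders.

    import numpy as np

    def choose_m(R, threshold=0.95):
        """Smallest m such that the top-m eigenvalues of R carry at least `threshold` of their total."""
        eigval = np.sort(np.linalg.eigvalsh(R))[::-1]          # descending eigenvalues
        ratio = np.cumsum(eigval) / eigval.sum()               # cumulative information content
        return int(np.searchsorted(ratio, threshold) + 1)

    # Usage with a toy correlation matrix over 10 old feature dimensions
    rng = np.random.default_rng(4)
    A = rng.standard_normal((10, 10))
    R = A @ A.T / 10
    print(choose_m(R, threshold=0.95))                         # number of new feature dimensions to keep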
Principal Component Analysis (cont.)

• Example 3: image coding
  – 8 x 8 blocks of a 256 x 256 image
Principal Component Analysis (cont.)

• Example 3: image coding (cont.)
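A hedged sketch of block-based image coding in the spirit of this example (the synthetic "image", block size, and the choice of m are placeholders, not the slide's actual data): each 8 x 8 block of a 256 x 256 image is treated as a 64-dimensional vector and represented by its top-m PCA coefficients.

    import numpy as np

    def image_pca_code(img, block=8, m=8):
        """Encode/decode an image with per-block PCA, keeping m of block*block coefficients per block."""
        h, w = img.shape
        # Collect all non-overlapping block x block patches as 64-dimensional column vectors
        patches = np.stack([img[i:i+block, j:j+block].ravel()
                            for i in range(0, h, block)
                            for j in range(0, w, block)], axis=1).astype(float)
        mean = patches.mean(axis=1, keepdims=True)
        P = patches - mean
        R = P @ P.T / P.shape[1]                               # correlation matrix of the patches
        eigval, eigvec = np.linalg.eigh(R)
        Phi_m = eigvec[:, np.argsort(eigval)[::-1][:m]]        # top-m eigenvectors
        coeffs = Phi_m.T @ P                                   # m numbers stored per block
        recon = Phi_m @ coeffs + mean                          # decoded patches
        # Reassemble the decoded patches into an image
        out = np.empty_like(img, dtype=float)
        k = 0
        for i in range(0, h, block):
            for j in range(0, w, block):
                out[i:i+block, j:j+block] = recon[:, k].reshape(block, block)
                k += 1
        return out

    # Usage on a synthetic 256 x 256 image
    img = np.add.outer(np.arange(256), np.arange(256)) % 64
    print(np.abs(img - image_pca_code(img, block=8, m=8)).mean())   # mean reconstruction error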