Discriminative Feature Extraction and Dimension Reduction - PCA & LDA


  1. Discriminative Feature Extraction and Dimension Reduction - PCA & LDA (Berlin Chen, 2004)

  2. Introduction
  • Goal: discover significant patterns or features from the input data
    – Salient feature selection or dimensionality reduction
    – Compute an input-output mapping based on some desirable properties
  [Figure: a network W maps vectors x in the input space to vectors y in the feature space]

  3. Introduction (cont.)
  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)
  • Heteroscedastic Discriminant Analysis (HDA)

  4. Introduction (cont.)
  • Formulation for discriminative feature extraction
    – Model-free (nonparametric)
      • Without prior information: e.g., PCA
      • With prior information: e.g., LDA
    – Model-dependent (parametric)
      • E.g., EM (Expectation-Maximization) or MCE (Minimum Classification Error) training

  5. Principal Component Analysis (PCA) (Pearson, 1901)
  • Also known as the Karhunen-Loève Transform (1947, 1963) or the Hotelling Transform (1933)
  • A standard technique commonly used for data reduction in statistical pattern recognition and signal processing
  • A transform by which the data set can be represented by a reduced number of effective features while still retaining most of the intrinsic information content
    – A small set of features is found that represents the data samples accurately
  • Also called "Subspace Decomposition"

  6. Principal Component Analysis (cont.)
  [Figure: the patterns show a significant difference from each other along one of the transformed axes]

  7. Principal Component Analysis (cont.)
  • Suppose x is an n-dimensional zero-mean random vector, E\{x\} = 0
    – If x is not zero-mean, we can subtract the mean before carrying out the following analysis
    – x can be represented without error by a summation of n linearly independent vectors:
      x = \sum_{i=1}^{n} y_i \phi_i = \Phi y, \quad where \; y = [y_1 \; \cdots \; y_i \; \cdots \; y_n]^T \; and \; \Phi = [\phi_1 \; \cdots \; \phi_i \; \cdots \; \phi_n]
      y_i is the i-th component in the feature (mapped) space; the \phi_i are the basis vectors
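As a quick illustration of this representation, here is a minimal NumPy sketch (not part of the original slides): it builds an arbitrary orthonormal basis \Phi via a QR decomposition and checks that x = \Phi y with y = \Phi^T x recovers x exactly.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4

    # Any orthonormal basis works here; take Q from a QR decomposition of a random matrix.
    Phi, _ = np.linalg.qr(rng.standard_normal((n, n)))   # columns are phi_1 ... phi_n

    x = rng.standard_normal(n)               # an arbitrary n-dimensional vector
    y = Phi.T @ x                            # components y_i in the feature (mapped) space
    x_reconstructed = Phi @ y                # x = sum_i y_i * phi_i

    print(np.allclose(x, x_reconstructed))   # True: the representation is exact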

  8. Principal Component Analysis (cont.)
  • Further assume the column (basis) vectors of the matrix \Phi form an orthonormal set:
      \phi_i^T \phi_j = 1 \; if \; i = j, \quad 0 \; if \; i \neq j
    – Such that y_i is equal to the projection of x on \phi_i:
      y_i = \phi_i^T x, \; \forall i
  [Figure: projecting x onto orthonormal basis vectors \phi_1, \phi_2 gives components y_1 = \phi_1^T x = \|x\| \cos\theta and y_2]
  • y_i also has the following properties
    – Its mean is zero, too:
      E\{y_i\} = E\{\phi_i^T x\} = \phi_i^T E\{x\} = 0
    – Its variance is:
      \sigma_i^2 = E\{y_i^2\} = E\{\phi_i^T x x^T \phi_i\} = \phi_i^T E\{x x^T\} \phi_i = \phi_i^T R \phi_i,
      where R = E\{x x^T\} \approx \frac{1}{N} \sum x x^T is the (auto-)correlation matrix of x

  9. Principal Component Analysis (cont.)
  • (As on the previous slide, with an orthonormal basis each projection y_i has zero mean and variance \sigma_i^2 = \phi_i^T R \phi_i)
  • The correlation between two projections y_i and y_j is:
      E\{y_i y_j\} = E\{(\phi_i^T x)(\phi_j^T x)^T\} = \phi_i^T E\{x x^T\} \phi_j = \phi_i^T R \phi_j
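The zero-mean, variance, and cross-correlation properties can be checked numerically. The following sketch (my own illustration, not from the slides) estimates R from zero-mean samples and verifies that the second-order statistics of the projections equal \Phi^T R \Phi.

    import numpy as np

    rng = np.random.default_rng(1)
    n, N = 3, 100_000

    # Zero-mean samples with some correlation structure (columns are samples).
    X = rng.standard_normal((n, n)) @ rng.standard_normal((n, N))

    R = X @ X.T / N                       # sample (auto-)correlation matrix R = E{x x^T}

    Phi, _ = np.linalg.qr(rng.standard_normal((n, n)))   # an arbitrary orthonormal basis
    Y = Phi.T @ X                         # projections y_i = phi_i^T x for every sample

    # Empirical E{y_i y_j} (and, on the diagonal, the variances sigma_i^2)
    # equal phi_i^T R phi_j, i.e. Y Y^T / N = Phi^T R Phi.
    print(np.allclose(Y @ Y.T / N, Phi.T @ R @ Phi))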

  10. Principal Component Analysis (cont.)
  • Minimum Mean-Squared Error Criterion
    – We want to choose only m of the \phi_i's such that we can still approximate x well under the mean-squared error criterion:
      x = \sum_{i=1}^{n} y_i \phi_i = \sum_{i=1}^{m} y_i \phi_i + \sum_{j=m+1}^{n} y_j \phi_j, \qquad \hat{x}(m) = \sum_{i=1}^{m} y_i \phi_i
      \bar{\varepsilon}(m) = E\{(x - \hat{x}(m))^T (x - \hat{x}(m))\}
                 = E\Big\{\Big(\sum_{j=m+1}^{n} y_j \phi_j\Big)^T \Big(\sum_{k=m+1}^{n} y_k \phi_k\Big)\Big\}
                 = \sum_{j=m+1}^{n} \sum_{k=m+1}^{n} E\{y_j y_k\} \, \phi_j^T \phi_k
                 = \sum_{j=m+1}^{n} E\{y_j^2\} = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \phi_j^T R \phi_j
      (using \phi_j^T \phi_k = \delta_{jk}, \; E\{y_j\} = 0, \; and \; \sigma_j^2 = E\{y_j^2\} - (E\{y_j\})^2 = E\{y_j^2\})
    – We should discard the bases onto which the projections have lower variances
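A small numerical check of this identity (an illustration of mine, not from the slides): truncating to the first m components of an arbitrary orthonormal basis yields a mean-squared error equal to the sum of the discarded projection variances \phi_j^T R \phi_j.

    import numpy as np

    rng = np.random.default_rng(2)
    n, m, N = 5, 2, 200_000

    X = rng.standard_normal((n, n)) @ rng.standard_normal((n, N))   # zero-mean samples
    R = X @ X.T / N

    Phi, _ = np.linalg.qr(rng.standard_normal((n, n)))   # any orthonormal basis
    Y = Phi.T @ X
    X_hat = Phi[:, :m] @ Y[:m]            # keep only the first m components

    mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))                  # E{||x - x_hat(m)||^2}
    discarded = sum(Phi[:, j] @ R @ Phi[:, j] for j in range(m, n))  # sum of sigma_j^2
    print(np.isclose(mse, discarded))     # True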

  11. Principal Component Analysis (cont.)
  • Minimum Mean-Squared Error Criterion
    – If the orthonormal (basis) set of \phi_i's is selected to be the eigenvectors of the correlation matrix R, associated with the eigenvalues \lambda_i, they have the property
      R \phi_j = \lambda_j \phi_j
      (R is real and symmetric, therefore its eigenvectors form an orthonormal set)
    – Such that the mean-squared error mentioned above becomes
      \bar{\varepsilon}(m) = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \phi_j^T R \phi_j = \sum_{j=m+1}^{n} \lambda_j \phi_j^T \phi_j = \sum_{j=m+1}^{n} \lambda_j

  12. Principal Component Analysis (cont.)
  • Minimum Mean-Squared Error Criterion
    – If the eigenvectors retained are those associated with the m largest eigenvalues, the mean-squared error is
      \bar{\varepsilon}_{eigen}(m) = \sum_{j=m+1}^{n} \lambda_j, \quad where \; \lambda_1 \ge \dots \ge \lambda_m \ge \dots \ge \lambda_n
    – Any two projections y_i and y_j will be mutually uncorrelated:
      E\{y_i y_j\} = E\{(\phi_i^T x)(\phi_j^T x)^T\} = \phi_i^T E\{x x^T\} \phi_j = \phi_i^T R \phi_j = \lambda_j \phi_i^T \phi_j = 0
    • Good news for most statistical modeling: Gaussians and diagonal (covariance) matrices
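Putting slides 10-12 together, the following NumPy sketch (mine, not from the slides) performs the eigendecomposition of R, keeps the m eigenvectors with the largest eigenvalues, and confirms both claims: the reconstruction error equals the sum of the discarded eigenvalues, and the projections are mutually uncorrelated.

    import numpy as np

    rng = np.random.default_rng(3)
    n, m, N = 5, 2, 200_000

    X = rng.standard_normal((n, n)) @ rng.standard_normal((n, N))   # zero-mean samples
    R = X @ X.T / N

    # Eigendecomposition of the symmetric correlation matrix; sort eigenvalues descending.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, Phi = eigvals[order], eigvecs[:, order]

    Y = Phi.T @ X
    X_hat = Phi[:, :m] @ Y[:m]            # reconstruction from the m largest components

    mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))
    print(np.isclose(mse, eigvals[m:].sum()))   # error = sum of discarded eigenvalues

    # Projections onto distinct eigenvectors are uncorrelated: Y Y^T / N is diagonal.
    C = Y @ Y.T / N
    print(np.allclose(C, np.diag(np.diag(C))))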

  13. Principal Component Analysis (cont.)
  • A two-dimensional example of Principal Component Analysis

  14. Principal Component Analysis (cont.)
  • Minimum Mean-Squared Error Criterion
    – It can be proved that \bar{\varepsilon}_{eigen}(m) is the optimal solution under the mean-squared error criterion
    – Define the Lagrangian of the error to be minimized, subject to the orthonormality constraints:
      J = \sum_{j=m+1}^{n} \phi_j^T R \phi_j - \sum_{j=m+1}^{n} \sum_{k=m+1}^{n} \mu_{jk} (\phi_j^T \phi_k - \delta_{jk})
      (useful derivatives: \partial(\phi^T R \phi)/\partial \phi = 2 R \phi, \quad \partial(x^T y)/\partial x = y)
    – Take the derivative and set it to zero:
      \frac{\partial J}{\partial \phi_j} = 2 R \phi_j - 2 \sum_{k=m+1}^{n} \mu_{jk} \phi_k = 0, \; \forall\, m+1 \le j \le n, \quad where \; \mu_j = [\mu_{j,m+1}, \dots, \mu_{jn}]^T
      \Rightarrow R \phi_j = \Phi_{n-m} \mu_j, \; \forall\, m+1 \le j \le n, \quad where \; \Phi_{n-m} = [\phi_{m+1} \; \cdots \; \phi_n]
      \Rightarrow R [\phi_{m+1} \; \cdots \; \phi_n] = \Phi_{n-m} [\mu_{m+1} \; \cdots \; \mu_n]
      \Rightarrow R \Phi_{n-m} = \Phi_{n-m} U_{n-m}, \quad where \; U_{n-m} = [\mu_{m+1} \; \cdots \; \mu_n]
    – This has a particular solution when U_{n-m} is a diagonal matrix whose diagonal elements are the eigenvalues \lambda_{m+1}, \dots, \lambda_n of R, with \phi_{m+1}, \dots, \phi_n their corresponding eigenvectors
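The stationarity condition R \Phi_{n-m} = \Phi_{n-m} U_{n-m} with diagonal U_{n-m} can also be checked numerically; the short sketch below is my own illustration, not part of the slides.

    import numpy as np

    rng = np.random.default_rng(4)
    n, m = 5, 2

    # A symmetric, correlation-like matrix and its eigenvectors.
    A = rng.standard_normal((n, n))
    R = A @ A.T
    _, Phi = np.linalg.eigh(R)

    Phi_tail = Phi[:, :n - m]             # n - m eigenvector columns
    U = Phi_tail.T @ R @ Phi_tail         # the multiplier matrix U_{n-m}

    # With eigenvector columns, R Phi_{n-m} = Phi_{n-m} U_{n-m} and U_{n-m} is diagonal.
    print(np.allclose(R @ Phi_tail, Phi_tail @ U))
    print(np.allclose(U, np.diag(np.diag(U))))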

  15. Principal Component Analysis (cont.)
  • Given an n-dimensional input vector x
    – Try to construct a linear transform \Phi' (\Phi' is an n x m matrix, m < n) such that the truncation result, \Phi'^T x, is optimal under the mean-squared error criterion
    – Encoder: y = \Phi'^T x, \quad where \; y = [y_1 \; y_2 \; \cdots \; y_m]^T, \; x = [x_1 \; x_2 \; \cdots \; x_n]^T, \; and \; \Phi' = [e_1 \; e_2 \; \cdots \; e_m]
    – Decoder: \hat{x} = \Phi' y, \quad where \; \hat{x} = [\hat{x}_1 \; \hat{x}_2 \; \cdots \; \hat{x}_n]^T
    – Objective: minimize E\{(x - \hat{x})^T (x - \hat{x})\}
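A compact encoder/decoder sketch under these definitions (the helper name pca_encoder_decoder and the test data are mine, not from the slides): \Phi' holds the top-m eigenvectors of the sample correlation matrix, the encoder computes y = \Phi'^T x, and the decoder computes \hat{x} = \Phi' y.

    import numpy as np

    def pca_encoder_decoder(X, m):
        # Fit the rank-m PCA encoder/decoder pair from zero-mean samples X (columns = samples).
        n, N = X.shape
        R = X @ X.T / N                                       # correlation matrix
        eigvals, eigvecs = np.linalg.eigh(R)
        Phi_m = eigvecs[:, np.argsort(eigvals)[::-1][:m]]     # n x m, top-m eigenvectors
        encode = lambda x: Phi_m.T @ x                        # y = Phi'^T x   (m-dimensional)
        decode = lambda y: Phi_m @ y                          # x_hat = Phi' y (n-dimensional)
        return encode, decode

    rng = np.random.default_rng(5)
    X = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 10_000))  # rank-3 data in R^8
    encode, decode = pca_encoder_decoder(X, m=3)
    x = X[:, 0]
    print(np.linalg.norm(x - decode(encode(x))))   # near zero: rank-3 data, m = 3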

  16. Principal Component Analysis (cont.)
  • Data compression in communication
    – PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition
    – PCA needs no prior information (e.g., class distributions) about the sample patterns

  17. Principal Component Analysis (cont.)
  • Example 1: principal components of some data points

  18. Principal Component Analysis (cont.)
  • Example 2: feature transformation and selection
  [Figure: the correlation matrix of the old feature dimensions is transformed into new feature dimensions; a threshold on the information content retained determines how many dimensions to keep]
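One way to implement such a threshold, as a hedged sketch (the helper select_m and the example threshold are my own choices, not from the slides): keep the smallest m whose m largest eigenvalues account for at least the requested fraction of the total variance.

    import numpy as np

    def select_m(eigvals, threshold=0.95):
        # Smallest m whose m largest eigenvalues retain at least `threshold`
        # of the total variance (the "information content reserved").
        eigvals = np.sort(eigvals)[::-1]
        cumulative = np.cumsum(eigvals) / eigvals.sum()
        return int(np.searchsorted(cumulative, threshold) + 1)

    print(select_m(np.array([5.0, 3.0, 1.0, 0.5, 0.5]), threshold=0.9))   # -> 3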

  19. Principal Component Analysis (cont.)
  • Example 3: image coding
  [Figure: a 256 x 256 image coded in 8 x 8 blocks]

  20. Principal Component Analysis (cont.)
  • Example 3: image coding (cont.)
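In the same spirit as this example, here is a hypothetical block-based PCA image-coding sketch (function name and toy image are mine, not from the slides): each 8 x 8 patch of a 256 x 256 grayscale image is projected onto the top-m principal components of the patch set and reconstructed from those m coefficients.

    import numpy as np

    def block_pca_code(image, block=8, m=10):
        # Split a square grayscale image into block x block patches, learn the top-m
        # principal components of the patches, and rebuild each patch from m coefficients.
        h, w = image.shape
        patches = np.array([image[i:i + block, j:j + block].ravel()
                            for i in range(0, h, block)
                            for j in range(0, w, block)]).T      # (block*block) x n_patches
        mean = patches.mean(axis=1, keepdims=True)
        X = patches - mean                                       # zero-mean patch vectors
        eigvals, eigvecs = np.linalg.eigh(X @ X.T / X.shape[1])
        Phi_m = eigvecs[:, np.argsort(eigvals)[::-1][:m]]
        decoded = Phi_m @ (Phi_m.T @ X) + mean                   # code, then reconstruct
        out = np.zeros_like(image, dtype=float)
        k = 0
        for i in range(0, h, block):
            for j in range(0, w, block):
                out[i:i + block, j:j + block] = decoded[:, k].reshape(block, block)
                k += 1
        return out

    # Toy 256 x 256 "image"; each 8 x 8 block is coded with 10 of its 64 coefficients.
    img = np.add.outer(np.arange(256), np.arange(256)).astype(float)
    print(np.mean((img - block_pca_code(img)) ** 2))             # mean-squared coding error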
