Lecture 25: Autoencoders and Kernel PCA. Aykut Erdem, January 2017, Hacettepe University
Today • Motivation • PCA algorithms • Applications • PCA shortcomings • Autoencoders • Kernel PCA
Autoencoders
Relation to Neural Networks • PCA is closely related to a particular form of neural network • An autoencoder is a neural network whose target outputs are its own inputs • The goal is to minimize reconstruction error. slide by Sanja Fidler
Autoencoders • Define $z = f(Wx)$ and $\hat{x} = g(Vz)$ • Goal: $\min_{W,V} \frac{1}{2N} \sum_{n=1}^{N} \| x^{(n)} - \hat{x}^{(n)} \|^2$ • If $g$ and $f$ are linear: $\min_{W,V} \frac{1}{2N} \sum_{n=1}^{N} \| x^{(n)} - VW x^{(n)} \|^2$ • In other words, the optimal solution is PCA: the best linear autoencoder reconstructs through the subspace spanned by the top principal components. slide by Sanja Fidler
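To make the linear case concrete, here is a small numpy check (not from the lecture; the toy dataset and the choice k = 3 are illustrative assumptions): using the top-k principal directions as encoder/decoder weights attains the PCA reconstruction error, while an arbitrary rank-k projection does worse. The optimal W, V are not unique; any pair whose product VW projects onto the top-k principal subspace is equally good.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # toy data, 500 samples
Xc = X - X.mean(axis=0)                                     # center the data

# PCA via SVD: rows of Vt are the principal directions
k = 3
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
U_k = Vt[:k].T                                              # (10, k)

# Linear autoencoder with encoder W = U_k^T and decoder V = U_k
Z = Xc @ U_k                  # codes z = W x
X_hat = Z @ U_k.T             # reconstructions x_hat = V z
pca_err = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
print("reconstruction error with PCA weights:", pca_err)

# Any other rank-k choice of VW does no better, e.g. a random projection:
Q, _ = np.linalg.qr(rng.normal(size=(10, k)))
rand_err = np.mean(np.sum((Xc - Xc @ Q @ Q.T) ** 2, axis=1))
print("reconstruction error with random weights:", rand_err)  # larger
```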
Autoencoders: Nonlinear PCA • What if $g(\cdot)$ is not linear? • Then we are basically doing nonlinear PCA • There are some subtleties, but in general this is an accurate description. slide by Sanja Fidler
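Below is a minimal sketch of such a nonlinear autoencoder of the form $z = f(Wx)$, $\hat{x} = g(Vz)$ with $f = g = \tanh$, trained by plain gradient descent; the toy 2-D dataset, the 1-D code size, and the learning rate are assumptions for illustration, not the lecture's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data near a 1-D curve (an arc) embedded in 2-D
t = rng.uniform(0.0, np.pi, size=(500, 1))
X = np.hstack([np.cos(t), np.sin(t)]) + 0.05 * rng.normal(size=(500, 2))
N, D, K = X.shape[0], X.shape[1], 1      # samples, input dim, code dim

W = 0.1 * rng.normal(size=(D, K))        # encoder weights
V = 0.1 * rng.normal(size=(K, D))        # decoder weights
lr = 1.0

for step in range(5000):
    A1 = X @ W
    Z = np.tanh(A1)                      # z = f(Wx), f = tanh
    A2 = Z @ V
    X_hat = np.tanh(A2)                  # x_hat = g(Vz), g = tanh
    err = X_hat - X
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))

    # Backpropagation through the two layers
    dA2 = (err / N) * (1.0 - X_hat ** 2)   # tanh'(a) = 1 - tanh(a)^2
    dV = Z.T @ dA2
    dZ = dA2 @ V.T
    dA1 = dZ * (1.0 - Z ** 2)
    dW = X.T @ dA1

    W -= lr * dW
    V -= lr * dV

print("final reconstruction loss:", loss)
```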
Comparing Reconstructions [figure: real data alongside reconstructions from a 30-d deep autoencoder, 30-d logistic PCA, and 30-d PCA] slide by Sanja Fidler
Kernel PCA
Dimensionality Reduction • Data representation: inputs are real-valued vectors in a high-dimensional space. • Linear structure (PCA): does the data live in a low-dimensional subspace? • Nonlinear structure: does the data live on a low-dimensional submanifold? slide by Rita Osadchy
The “magic” of high dimensions • Given some problem, how do we know what classes of functions are capable of solving it? • VC (Vapnik-Chervonenkis) theory tells us that mappings into a space of higher dimension than the input space often provide greater classification power. slide by Rita Osadchy
Example in $\mathbb{R}^2$: these classes are linearly inseparable in the input space. [figure omitted] slide by Rita Osadchy
Example: High-Dimensional Mapping • We can make the problem linearly separable by a simple mapping $\Phi: \mathbb{R}^2 \to \mathbb{R}^3$, $(x_1, x_2) \mapsto (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2)$ slide by Rita Osadchy
Kernel Trick • High-dimensional mappings can seriously increase computation time. • Can we get around this problem and still get the benefit of high dimensions? • Yes! The kernel trick: $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ • Given any algorithm that can be expressed solely in terms of dot products, this trick allows us to construct different nonlinear versions of it. slide by Rita Osadchy
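A quick numerical illustration of this identity, using the explicit degree-2 map from the previous slide; pairing it with the homogeneous polynomial kernel $K(x, y) = (x^T y)^2$ is a standard choice stated here as an assumption, not taken from the slides.

```python
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^3 from the previous slide."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2.0) * x1 * x2])

def poly2_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel: K(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)

# The kernel computes the dot product in the 3-D feature space
# without ever forming phi(x) explicitly.
print(np.dot(phi(x), phi(y)))   # explicit mapping, then dot product
print(poly2_kernel(x, y))       # same value via the kernel trick
```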
Popular Kernels [table of common kernel functions omitted] slide by Rita Osadchy
Kernel Principal Component Analysis • Extends conventional principal component analysis (PCA) to a high-dimensional feature space using the “kernel trick”. • Can extract up to $n$ (the number of samples) nonlinear principal components without expensive computations. slide by Rita Osadchy
Making PCA Non-Linear • Suppose that instead of using the points $x_i$ we first map them to some nonlinear feature space $\phi(x_i)$ - E.g., using polar instead of Cartesian coordinates would help us deal with the circle. • Extract the principal components in that space (PCA) • The result will be non-linear in the original data space! slide by Rita Osadchy
Derivation • Suppose that the mean of the data in the feature space is $\mu = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i) = 0$ • Covariance: $C = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i)\,\phi(x_i)^T$ • Eigenvectors: $C v = \lambda v$ slide by Rita Osadchy
Derivation • Eigenvectors can be expressed as a linear combination of the features: $v = \sum_{i=1}^{n} \alpha_i \phi(x_i)$ • Proof: $C v = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i)\,\phi(x_i)^T v = \lambda v$, thus $v = \frac{1}{\lambda n}\sum_{i=1}^{n} \phi(x_i)\,\phi(x_i)^T v = \frac{1}{\lambda n}\sum_{i=1}^{n} (\phi(x_i) \cdot v)\, \phi(x_i)$ slide by Rita Osadchy
Showing that $x x^T v = (x \cdot v)\, x$ slide by Rita Osadchy
Derivation • So, from before we had $v = \frac{1}{\lambda n}\sum_{i=1}^{n} (\phi(x_i) \cdot v)\, \phi(x_i)$, where each $(\phi(x_i) \cdot v)$ is just a scalar • This means that all solutions $v$ with $\lambda \neq 0$ lie in the span of $\phi(x_1), \ldots, \phi(x_n)$, i.e., $v = \sum_{i=1}^{n} \alpha_i \phi(x_i)$ • Finding the eigenvectors is equivalent to finding the coefficients $\alpha_i$ slide by Rita Osadchy
Derivation • By substituting this back into the equation we get: $\frac{1}{n}\sum_{i=1}^{n} \phi(x_i)\,\phi(x_i)^T \sum_{l=1}^{n} \alpha_{jl}\,\phi(x_l) = \lambda_j \sum_{l=1}^{n} \alpha_{jl}\,\phi(x_l)$ • We can rewrite it as $\frac{1}{n}\sum_{i=1}^{n} \phi(x_i) \sum_{l=1}^{n} \alpha_{jl}\, K(x_i, x_l) = \lambda_j \sum_{l=1}^{n} \alpha_{jl}\,\phi(x_l)$ • Multiply this by $\phi(x_k)^T$ from the left: $\frac{1}{n}\sum_{i=1}^{n} \phi(x_k)^T \phi(x_i) \sum_{l=1}^{n} \alpha_{jl}\, K(x_i, x_l) = \lambda_j \sum_{l=1}^{n} \alpha_{jl}\, \phi(x_k)^T \phi(x_l)$ slide by Rita Osadchy
Derivation • By plugging in the kernel and rearranging we get: $K^2 \alpha_j = n \lambda_j K \alpha_j$ • We can remove a factor of $K$ from both sides (this only affects eigenvectors with zero eigenvalue, which will not be a principal component anyway): $K \alpha_j = n \lambda_j \alpha_j$ • We have a normalization condition for the $\alpha_j$ vectors: $v_j^T v_j = 1 \;\Rightarrow\; \sum_{k=1}^{n}\sum_{l=1}^{n} \alpha_{jl}\,\alpha_{jk}\, \phi(x_l)^T \phi(x_k) = 1 \;\Rightarrow\; \alpha_j^T K \alpha_j = 1$ slide by Rita Osadchy
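A small numerical check of this normalization condition, as a sketch: the explicit degree-2 feature map used earlier stands in for $\phi$ so that $v = \sum_i \alpha_i \phi(x_i)$ can be formed directly, and centering is ignored for simplicity (the identity $v^T v = \alpha^T K \alpha$ holds either way). These choices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

def phi(x):                                  # explicit degree-2 feature map (assumed stand-in)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

Phi = np.array([phi(x) for x in X])          # (n, 3) explicit features
K = Phi @ Phi.T                              # kernel matrix K_ij = phi(x_i)^T phi(x_j)

eigvals, eigvecs = np.linalg.eigh(K)
alpha = eigvecs[:, -1]                       # eigenvector for the largest eigenvalue

# Rescale so that alpha^T K alpha = 1
alpha = alpha / np.sqrt(alpha @ K @ alpha)

v = Phi.T @ alpha                            # v = sum_i alpha_i phi(x_i)
print(np.linalg.norm(v))                     # ~1.0, i.e. v^T v = 1
```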
Derivation • By multiplying $K \alpha_j = n \lambda_j \alpha_j$ by $\alpha_j^T$ and using the normalization condition we get: $n \lambda_j\, \alpha_j^T \alpha_j = 1, \quad \forall j$ • For a new point $x$, its projection onto the $j$-th principal component is: $\phi(x)^T v_j = \sum_{i=1}^{n} \alpha_{ji}\, \phi(x)^T \phi(x_i) = \sum_{i=1}^{n} \alpha_{ji}\, K(x, x_i)$ slide by Rita Osadchy
Normalizing the feature space • In general, $\phi(x_i)$ may not be zero mean. • Centered features: $\tilde{\phi}(x_i) = \phi(x_i) - \frac{1}{n}\sum_{k=1}^{n} \phi(x_k)$ • The corresponding kernel is: $\tilde{K}(x_i, x_j) = \tilde{\phi}(x_i)^T \tilde{\phi}(x_j) = \left(\phi(x_i) - \frac{1}{n}\sum_{k=1}^{n}\phi(x_k)\right)^T \left(\phi(x_j) - \frac{1}{n}\sum_{k=1}^{n}\phi(x_k)\right) = K(x_i, x_j) - \frac{1}{n}\sum_{k=1}^{n} K(x_i, x_k) - \frac{1}{n}\sum_{k=1}^{n} K(x_j, x_k) + \frac{1}{n^2}\sum_{l,k=1}^{n} K(x_l, x_k)$ slide by Rita Osadchy
Normalizing the feature space • $\tilde{K}(x_i, x_j) = K(x_i, x_j) - \frac{1}{n}\sum_{k=1}^{n} K(x_i, x_k) - \frac{1}{n}\sum_{k=1}^{n} K(x_j, x_k) + \frac{1}{n^2}\sum_{l,k=1}^{n} K(x_l, x_k)$ • In matrix form: $\tilde{K} = K - \mathbf{1}_{1/n} K - K\, \mathbf{1}_{1/n} + \mathbf{1}_{1/n} K\, \mathbf{1}_{1/n}$, where $\mathbf{1}_{1/n}$ is the $n \times n$ matrix with all elements equal to $1/n$. slide by Rita Osadchy
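A numpy sketch of this centering step, checked against explicitly centering the features; the degree-2 map is again used as an assumed stand-in feature space so the reference computation is possible.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

def phi(x):                                  # assumed explicit feature map for the check
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

Phi = np.array([phi(x) for x in X])
K = Phi @ Phi.T
n = K.shape[0]

# Matrix form of the centering: K_tilde = K - 1_{1/n} K - K 1_{1/n} + 1_{1/n} K 1_{1/n}
one_n = np.full((n, n), 1.0 / n)
K_tilde = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Reference: center the explicit features, then form the kernel matrix
Phi_c = Phi - Phi.mean(axis=0)
print(np.allclose(K_tilde, Phi_c @ Phi_c.T))   # True
```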
Summary of Kernel PCA • Pick a kernel • Construct the normalized (centered) kernel matrix of the data (dimension $n \times n$): $\tilde{K} = K - \mathbf{1}_{1/n} K - K\, \mathbf{1}_{1/n} + \mathbf{1}_{1/n} K\, \mathbf{1}_{1/n}$ • Solve the eigenvalue problem: $\tilde{K} \alpha_i = \lambda_i \alpha_i$ • For any data point (new or old), we can represent it as $y_j = \sum_{i=1}^{n} \alpha_{ji}\, K(x, x_i), \quad j = 1, \ldots, d$ slide by Rita Osadchy
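To tie the summary together, here is a compact numpy sketch of kernel PCA (a sketch, not the lecture's code); the Gaussian/RBF kernel, the bandwidth gamma, and the concentric-rings dataset are illustrative assumptions. It follows the steps above: build K, center it, solve the eigenvalue problem, normalize the coefficients so that $\alpha^T \tilde{K} \alpha = 1$, and project.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix K_ij = exp(-gamma * ||x_i - y_j||^2)."""
    sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(Y ** 2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def kernel_pca(X, d=2, gamma=1.0):
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)

    # 1. Center the kernel matrix
    one_n = np.full((n, n), 1.0 / n)
    K_tilde = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # 2. Solve the eigenvalue problem K_tilde alpha = lambda alpha
    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]       # largest eigenvalues first

    # 3. Normalize the coefficients: unit eigenvectors divided by sqrt(lambda)
    #    so that alpha^T K_tilde alpha = 1
    alphas = eigvecs[:, :d] / np.sqrt(eigvals[:d])

    # 4. Project the training points: y_j = sum_i alpha_ji K_tilde(x, x_i)
    return K_tilde @ alphas

# Example: three noisy concentric rings, projected onto two kernel principal components
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=300)
radii = np.repeat([1.0, 2.0, 3.0], 100) + 0.05 * rng.normal(size=300)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
Y = kernel_pca(X, d=2)
print(Y.shape)   # (300, 2)
```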
Input points before kernel PCA [figure omitted] slide by Rita Osadchy (image: http://en.wikipedia.org/wiki/Kernel_principal_component_analysis)
Output after kernel PCA: the three groups are distinguishable using the first component only. [figure omitted] slide by Rita Osadchy
Example: De-noising images [figure omitted] slide by Rita Osadchy
Properties of KPCA • Kernel PCA can give a good re-encoding of the data when it lies along a non-linear manifold. • The kernel matrix is $n \times n$, so kernel PCA will have difficulties if we have lots of data points. slide by Rita Osadchy