  1. Unsupervised Machine Learning and Data Mining
     DS 5230 / DS 4420 - Fall 2018
     Lecture 7
     Jan-Willem van de Meent

  2. DIMENSIONALITY REDUCTION
     Borrowing from: Percy Liang (Stanford)

  3. Dimensionality Reduction
     Goal: map high-dimensional data onto lower-dimensional data in a manner that preserves distances/similarities.
     [Figure: original data (4 dims) and its projection with PCA (2 dims)]
     Objective: the projection should "preserve" relative distances.
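As a concrete illustration of this goal, here is a minimal sketch using scikit-learn's PCA; the data here is synthetic and the variable names are illustrative, not from the slides:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative 4-dimensional data (n_samples x 4); any real dataset works the same way.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Map each 4-dimensional point to 2 dimensions while keeping as much of the
# variance (and hence as much of the relative-distance structure) as possible.
Z = PCA(n_components=2).fit_transform(X)
print(Z.shape)  # (200, 2)
```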

  4. Linear Dimensionality Reduction
     Idea: project a high-dimensional vector onto a lower-dimensional space, e.g.
     $x \in \mathbb{R}^{361} \;\mapsto\; z = U^\top x, \quad z \in \mathbb{R}^{10}$

  5. Problem Setup
     Given $n$ data points in $d$ dimensions: $x_1, \ldots, x_n \in \mathbb{R}^d$,
     $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$ (the transpose of the $X$ used in regression!)

  6. Problem Setup
     Given $n$ data points in $d$ dimensions: $x_1, \ldots, x_n \in \mathbb{R}^d$,
     $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$
     Want to reduce the dimensionality from $d$ to $k$.
     Choose $k$ directions $u_1, \ldots, u_k$: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$

  7. Problem Setup
     Given $n$ data points in $d$ dimensions: $x_1, \ldots, x_n \in \mathbb{R}^d$,
     $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$
     Want to reduce the dimensionality from $d$ to $k$.
     Choose $k$ directions $u_1, \ldots, u_k$: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$
     For each $u_j$, compute the "similarity" $z_j = u_j^\top x$

  8. Problem Setup
     Given $n$ data points in $d$ dimensions: $x_1, \ldots, x_n \in \mathbb{R}^d$,
     $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$
     Want to reduce the dimensionality from $d$ to $k$.
     Choose $k$ directions $u_1, \ldots, u_k$: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$
     For each $u_j$, compute the "similarity" $z_j = u_j^\top x$
     Project $x$ down to $z = (z_1, \ldots, z_k)^\top = U^\top x$. How to choose $U$?
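A minimal numpy sketch of this setup, assuming the directions in U have already been chosen (here a random orthonormal basis, just to make the shapes concrete):

```python
import numpy as np

d, n, k = 361, 100, 10
rng = np.random.default_rng(0)

X = rng.normal(size=(d, n))                    # data matrix, one column per point x_i
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # some orthonormal directions u_1, ..., u_k

Z = U.T @ X                                    # each column z_i = U^T x_i, now in R^k
print(Z.shape)  # (10, 100)
```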

  9. Principal Component Analysis
     [Figure: projections onto the top 2 components and onto the bottom 2 components]
     Data: three varieties of wheat: Kama, Rosa, Canadian
     Attributes: Area, Perimeter, Compactness, Length of Kernel, Width of Kernel, Asymmetry Coefficient, Length of Groove

  10. Principal Component Analysis
      $x \in \mathbb{R}^{361} \;\mapsto\; z = U^\top x, \quad z \in \mathbb{R}^{10}$
      Optimize two equivalent objectives:
      1. Minimize the reconstruction error
      2. Maximize the projected variance

  11. PCA Objective 1: Reconstruction Error
      $U$ serves two functions:
      • Encode: $z = U^\top x$, with $z_j = u_j^\top x$

  12. PCA Objective 1: Reconstruction Error
      $U$ serves two functions:
      • Encode: $z = U^\top x$, with $z_j = u_j^\top x$
      • Decode: $\tilde{x} = Uz = \sum_{j=1}^{k} z_j u_j$

  13. PCA Objective 1: Reconstruction Error
      $U$ serves two functions:
      • Encode: $z = U^\top x$, with $z_j = u_j^\top x$
      • Decode: $\tilde{x} = Uz = \sum_{j=1}^{k} z_j u_j$
      Want the reconstruction error $\|x - \tilde{x}\|$ to be small

  14. PCA Objective 1: Reconstruction Error
      $U$ serves two functions:
      • Encode: $z = U^\top x$, with $z_j = u_j^\top x$
      • Decode: $\tilde{x} = Uz = \sum_{j=1}^{k} z_j u_j$
      Want the reconstruction error $\|x - \tilde{x}\|$ to be small
      Objective: minimize the total squared reconstruction error
      $$\min_{U \in \mathbb{R}^{d \times k}} \; \sum_{i=1}^{n} \| x_i - U U^\top x_i \|^2$$
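A small sketch of this objective evaluated for one candidate U (orthonormal columns assumed; the data here is synthetic, not from the slides):

```python
import numpy as np

def total_reconstruction_error(X, U):
    """Sum over data points of ||x_i - U U^T x_i||^2, where the columns of X are the x_i."""
    X_hat = U @ (U.T @ X)            # decode(encode(x_i)) for every column at once
    return np.sum((X - X_hat) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 50))
U, _ = np.linalg.qr(rng.normal(size=(5, 2)))   # a candidate basis, not necessarily the optimum
print(total_reconstruction_error(X, U))
```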

  15. PCA Objective 2: Projected Variance
      Empirical distribution: uniform over $x_1, \ldots, x_n$
      Expectation (think sum over data points): $\hat{\mathbb{E}}[f(x)] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)$
      Variance (think sum of squares if centered): $\widehat{\mathrm{var}}[f(x)] + (\hat{\mathbb{E}}[f(x)])^2 = \hat{\mathbb{E}}[f(x)^2] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)^2$

  16. PCA Objective 2: Projected Variance
      Empirical distribution: uniform over $x_1, \ldots, x_n$
      Expectation (think sum over data points): $\hat{\mathbb{E}}[f(x)] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)$
      Variance (think sum of squares if centered): $\widehat{\mathrm{var}}[f(x)] + (\hat{\mathbb{E}}[f(x)])^2 = \hat{\mathbb{E}}[f(x)^2] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)^2$
      Assume the data is centered: $\hat{\mathbb{E}}[x] = 0$ (what's $\hat{\mathbb{E}}[U^\top x]$?)

  17. PCA Objective 2: Projected Variance
      Empirical distribution: uniform over $x_1, \ldots, x_n$
      Expectation (think sum over data points): $\hat{\mathbb{E}}[f(x)] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)$
      Variance (think sum of squares if centered): $\widehat{\mathrm{var}}[f(x)] + (\hat{\mathbb{E}}[f(x)])^2 = \hat{\mathbb{E}}[f(x)^2] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)^2$
      Assume the data is centered: $\hat{\mathbb{E}}[x] = 0$ (what's $\hat{\mathbb{E}}[U^\top x]$?)

  18. PCA Objective 2: Projected Variance
      Empirical distribution: uniform over $x_1, \ldots, x_n$
      Expectation (think sum over data points): $\hat{\mathbb{E}}[f(x)] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)$
      Variance (think sum of squares if centered): $\widehat{\mathrm{var}}[f(x)] + (\hat{\mathbb{E}}[f(x)])^2 = \hat{\mathbb{E}}[f(x)^2] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)^2$
      Assume the data is centered: $\hat{\mathbb{E}}[x] = 0$ (what's $\hat{\mathbb{E}}[U^\top x]$?)
      Objective: maximize the variance of the projected data
      $$\max_{U \in \mathbb{R}^{d \times k},\; U^\top U = I} \; \hat{\mathbb{E}}\big[\|U^\top x\|^2\big]$$

  19. PCA Objective 2: Projected Variance
      Empirical distribution: uniform over $x_1, \ldots, x_n$
      Expectation (think sum over data points): $\hat{\mathbb{E}}[f(x)] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)$
      Variance (think sum of squares if centered): $\widehat{\mathrm{var}}[f(x)] + (\hat{\mathbb{E}}[f(x)])^2 = \hat{\mathbb{E}}[f(x)^2] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)^2$
      Assume the data is centered: $\hat{\mathbb{E}}[x] = 0$ (what's $\hat{\mathbb{E}}[U^\top x]$?)
      Objective: maximize the variance of the projected data
      $$\max_{U \in \mathbb{R}^{d \times k},\; U^\top U = I} \; \hat{\mathbb{E}}\big[\|U^\top x\|^2\big]$$
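The same quantities in numpy form, as a sketch on synthetic data (center first, then compute the empirical variance of the projection):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))

X = X - X.mean(axis=1, keepdims=True)          # center: now E_hat[x] = 0, hence E_hat[U^T x] = 0
U, _ = np.linalg.qr(rng.normal(size=(5, 2)))   # any U with orthonormal columns

projected_variance = np.mean(np.sum((U.T @ X) ** 2, axis=0))   # E_hat[ ||U^T x||^2 ]
print(projected_variance)
```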

  20. Equivalence of two objectives
      Key intuition: variance of data (fixed) = captured variance (want large) + reconstruction error (want small)

  21. Equivalence of two objectives
      Key intuition: variance of data (fixed) = captured variance (want large) + reconstruction error (want small)
      Pythagorean decomposition: $x = UU^\top x + (I - UU^\top)x$, so $\|x\|^2 = \|UU^\top x\|^2 + \|(I - UU^\top)x\|^2$
      Take expectations; note that the rotation $U$ doesn't affect length:
      $$\hat{\mathbb{E}}[\|x\|^2] = \hat{\mathbb{E}}[\|U^\top x\|^2] + \hat{\mathbb{E}}[\|x - UU^\top x\|^2]$$

  22. Equivalence of two objectives
      Key intuition: variance of data (fixed) = captured variance (want large) + reconstruction error (want small)
      Pythagorean decomposition: $x = UU^\top x + (I - UU^\top)x$, so $\|x\|^2 = \|UU^\top x\|^2 + \|(I - UU^\top)x\|^2$
      Take expectations; note that the rotation $U$ doesn't affect length:
      $$\hat{\mathbb{E}}[\|x\|^2] = \hat{\mathbb{E}}[\|U^\top x\|^2] + \hat{\mathbb{E}}[\|x - UU^\top x\|^2]$$
      Minimize reconstruction error $\Leftrightarrow$ Maximize captured variance
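A quick numerical check of this decomposition on synthetic centered data, as a sketch (any U with orthonormal columns should satisfy it up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 500))
X = X - X.mean(axis=1, keepdims=True)

U, _ = np.linalg.qr(rng.normal(size=(5, 2)))

total_var   = np.mean(np.sum(X ** 2, axis=0))                    # E_hat[||x||^2]        (fixed)
captured    = np.mean(np.sum((U.T @ X) ** 2, axis=0))            # E_hat[||U^T x||^2]    (want large)
recon_error = np.mean(np.sum((X - U @ (U.T @ X)) ** 2, axis=0))  # E_hat[||x - UU^T x||^2] (want small)

assert np.isclose(total_var, captured + recon_error)
```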

  23. Changes of Basis
      Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$
      Orthonormal basis: $U = (u_1 \cdots u_d) \in \mathbb{R}^{d \times d}$

  24. Changes of Basis
      Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$
      Orthonormal basis: $U = (u_1 \cdots u_d) \in \mathbb{R}^{d \times d}$
      Change of basis: $z = U^\top x$, with $z_j = u_j^\top x$
      Inverse change of basis: $x = Uz = \sum_{j=1}^{d} z_j u_j$

  25. Principal Component Analysis
      Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$
      Orthonormal basis: $U = (u_1 \cdots u_d) \in \mathbb{R}^{d \times d}$, the eigenvectors of the covariance
      Eigendecomposition: eigenvalues $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$
      Claim: eigenvectors of a symmetric matrix are orthogonal
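A sketch of this step with numpy's symmetric eigensolver, including a numerical check of the orthogonality claim (synthetic data; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 300))
X = X - X.mean(axis=1, keepdims=True)

C = (X @ X.T) / X.shape[1]     # empirical covariance, a symmetric d x d matrix
lam, U = np.linalg.eigh(C)     # eigenvalues (ascending) and eigenvectors as columns of U

# Eigenvectors of a symmetric matrix are orthogonal: U^T U should be the identity.
assert np.allclose(U.T @ U, np.eye(U.shape[1]))
```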

  26. Principal Component Analysis
      [Figure: illustration of PCA (from Stack Exchange)]

  27. Principal Component Analysis
      Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$
      Orthonormal basis: $U = (u_1 \cdots u_d) \in \mathbb{R}^{d \times d}$, the eigenvectors of the covariance
      Eigendecomposition: eigenvalues $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$
      Idea: take the top-$k$ eigenvectors to maximize variance

  28. Principal Component Analysis
      Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$
      Truncated basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$, the top-$k$ eigenvectors of the covariance
      Truncated decomposition: $\Lambda^{(k)} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k)$
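Putting the pieces together, a minimal PCA-by-eigendecomposition sketch that keeps only the top-k eigenvectors (illustrative names; real code would also keep the mean around for decoding):

```python
import numpy as np

def pca_topk(X, k):
    """X is d x n with one data point per column; returns (U_k, Z) with Z = U_k^T X."""
    X = X - X.mean(axis=1, keepdims=True)   # center
    C = (X @ X.T) / X.shape[1]              # d x d covariance
    lam, U = np.linalg.eigh(C)              # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:k]         # indices of the k largest eigenvalues
    U_k = U[:, idx]                         # truncated basis, d x k
    return U_k, U_k.T @ X

rng = np.random.default_rng(0)
U_k, Z = pca_topk(rng.normal(size=(7, 300)), k=2)
print(U_k.shape, Z.shape)   # (7, 2) (2, 300)
```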

  29. PCA: Complexity
      Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$; truncated basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$
      Using the eigenvalue decomposition:
      • Computation of the covariance $C$: $O(nd^2)$
      • Eigenvalue decomposition: $O(d^3)$
      • Total complexity: $O(nd^2 + d^3)$

  30. PCA: Complexity
      Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$; truncated basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$
      Using the singular value decomposition:
      • Full decomposition: $O(\min\{nd^2, n^2 d\})$
      • Rank-$k$ decomposition (with the power method): $O(kdn \log n)$
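A sketch of the power method for the leading direction, the basic ingredient behind the fast rank-k routines mentioned here; the iteration count and variable names are illustrative, and real implementations add deflation or block/randomized variants for all k directions:

```python
import numpy as np

def leading_direction(X, n_iter=100, seed=0):
    """Power iteration for the top eigenvector of C = X X^T / n, without forming C."""
    d, n = X.shape
    rng = np.random.default_rng(seed)
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        v = X @ (X.T @ u) / n        # one multiplication by C costs O(n d), not O(d^2)
        u = v / np.linalg.norm(v)
    return u
```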


  31. Singular Value Decomposition
      Idea: decompose a $d \times d$ matrix $M$ into
      1. A change of basis $V$ (unitary matrix)
      2. A scaling $\Sigma$ (diagonal matrix)
      3. A change of basis $U$ (unitary matrix)

  32. Singular Value Decomposition
      Idea: decompose the $d \times n$ matrix $X$ into
      1. An $n \times n$ basis $V$ (unitary matrix)
      2. A $d \times n$ scaling $\Sigma$ (diagonal matrix)
      3. A $d \times d$ basis $U$ (unitary matrix)
      $$X = \underbrace{U}_{d \times d}\, \underbrace{\Sigma}_{d \times n}\, \underbrace{V^\top}_{n \times n}$$
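In numpy this decomposition is one call; a sketch showing the shapes and the exact reconstruction (full_matrices=True matches the d x d and n x n bases on this slide):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 10
X = rng.normal(size=(d, n))

U, s, Vt = np.linalg.svd(X, full_matrices=True)   # U: d x d, s: min(d, n) values, Vt: n x n

Sigma = np.zeros((d, n))
Sigma[:len(s), :len(s)] = np.diag(s)              # embed the singular values in a d x n matrix

assert np.allclose(X, U @ Sigma @ Vt)             # X = U Sigma V^T
```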

  33. Eigen-faces [Turk & Pentland 1991]
      • $d$ = number of pixels
      • Each $x_i \in \mathbb{R}^d$ is a face image
      • $x_{ji}$ = intensity of the $j$-th pixel in image $i$

  34. Eigen-faces [Turk & Pentland 1991]
      • $d$ = number of pixels
      • Each $x_i \in \mathbb{R}^d$ is a face image
      • $x_{ji}$ = intensity of the $j$-th pixel in image $i$
      $\underbrace{X}_{d \times n} \approx \underbrace{U}_{d \times k}\, \underbrace{Z}_{k \times n}$, with $U = (u_1 \cdots u_k)$ and $Z = (z_1 \cdots z_n)$

  35. Eigen-faces [Turk & Pentland 1991]
      • $d$ = number of pixels
      • Each $x_i \in \mathbb{R}^d$ is a face image
      • $x_{ji}$ = intensity of the $j$-th pixel in image $i$
      $\underbrace{X}_{d \times n} \approx \underbrace{U}_{d \times k}\, \underbrace{Z}_{k \times n}$, with $U = (u_1 \cdots u_k)$ and $Z = (z_1 \cdots z_n)$
      Idea: $z_i$ is a more "meaningful" representation of the $i$-th face than $x_i$
      Can use $z_i$ for nearest-neighbor classification

  36. Eigen-faces [Turk & Pentland 1991]
      • $d$ = number of pixels
      • Each $x_i \in \mathbb{R}^d$ is a face image
      • $x_{ji}$ = intensity of the $j$-th pixel in image $i$
      $\underbrace{X}_{d \times n} \approx \underbrace{U}_{d \times k}\, \underbrace{Z}_{k \times n}$, with $U = (u_1 \cdots u_k)$ and $Z = (z_1 \cdots z_n)$
      Idea: $z_i$ is a more "meaningful" representation of the $i$-th face than $x_i$
      Can use $z_i$ for nearest-neighbor classification
      Much faster: $O(dk + nk)$ time instead of $O(dn)$ when $n, d \gg k$
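A sketch of the nearest-neighbor step in the reduced space (synthetic "images" stand in for real face data, and the helper name is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 256, 200, 20                      # d pixels per image, n training faces

X_train = rng.normal(size=(d, n))           # stand-in for face images (one per column)
mean = X_train.mean(axis=1, keepdims=True)
Xc = X_train - mean

# Truncated basis from the covariance eigenvectors (the "eigenfaces").
lam, U = np.linalg.eigh((Xc @ Xc.T) / n)
U_k = U[:, np.argsort(lam)[::-1][:k]]

Z_train = U_k.T @ Xc                        # k x n codes, computed once

def nearest_neighbor(x_query):
    """Index of the training face closest to x_query in the k-dimensional code space."""
    z = U_k.T @ (x_query - mean[:, 0])                              # O(dk) to encode the query
    return np.argmin(np.sum((Z_train - z[:, None]) ** 2, axis=0))   # O(nk) to compare

print(nearest_neighbor(X_train[:, 0]))      # should print 0
```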
