

  1. Data Mining Techniques, CS 6220 - Section 3 - Fall 2016, Lecture 12. Jan-Willem van de Meent (credit: Yijun Zhao, Percy Liang)

  2. Dimensionality Reduction (borrowing from: Percy Liang, Stanford)

  3. Linear Dimensionality Reduction
     Idea: project a high-dimensional vector onto a lower-dimensional space, e.g. $x \in \mathbb{R}^{361}$, $z = U^\top x$, $z \in \mathbb{R}^{10}$.

  4. Problem Setup
     Given $n$ data points in $d$ dimensions: $x_1, \ldots, x_n \in \mathbb{R}^d$, collected as $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$.
     Want to reduce the dimensionality from $d$ to $k$.
     Choose $k$ directions $u_1, \ldots, u_k$ and form $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$.
     For each $u_j$, compute the "similarity" $z_j = u_j^\top x$.
     Project $x$ down to $z = (z_1, \ldots, z_k)^\top = U^\top x$.
     How to choose $U$?
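A toy numpy sketch of this setup (the sizes $d = 361$ and $k = 10$ are just the running example from the slides; the random data and the random orthonormal $U$ are illustrative assumptions, not part of the lecture):

```python
import numpy as np

# Running example from the slides: project R^361 down to R^10.
d, k, n = 361, 10, 100

# Illustrative data matrix X with one data point per column (d x n).
rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))

# Any d x k matrix with orthonormal columns can serve as U; here we take
# one from the QR decomposition of a random matrix.
U, _ = np.linalg.qr(rng.normal(size=(d, k)))

# Encode every data point at once: z_i = U^T x_i.
Z = U.T @ X        # shape (k, n)
print(Z.shape)     # (10, 100)
```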

  5. Principal Component Analysis
     $x \in \mathbb{R}^{361}$, $z = U^\top x$, $z \in \mathbb{R}^{10}$. How do we choose $U$?
     Two objectives:
     1. Minimize the reconstruction error
     2. Maximize the projected variance

  6. PCA Objective 1: Reconstruction Error
     $U$ serves two functions:
     • Encode: $z = U^\top x$, with $z_j = u_j^\top x$
     • Decode: $\tilde{x} = Uz = \sum_{j=1}^{k} z_j u_j$
     We want the reconstruction error $\|x - \tilde{x}\|$ to be small.
     Objective: minimize the total squared reconstruction error
     $\min_{U \in \mathbb{R}^{d \times k}} \sum_{i=1}^{n} \|x_i - U U^\top x_i\|^2$
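A minimal sketch of the encode/decode round trip and the total squared reconstruction error, reusing the illustrative X and orthonormal U from the snippet above:

```python
# Encode then decode each data point: x_tilde_i = U U^T x_i.
Z = U.T @ X          # encode, shape (k, n)
X_tilde = U @ Z      # decode, shape (d, n)

# Total squared reconstruction error: sum_i ||x_i - U U^T x_i||^2.
recon_error = np.sum((X - X_tilde) ** 2)
print(recon_error)
```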

  7. PCA Objective 2: Projected Variance
     Empirical distribution: uniform over $x_1, \ldots, x_n$.
     Expectation (think sum over data points): $\hat{E}[f(x)] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)$
     Variance (think sum of squares if centered): $\widehat{\mathrm{var}}[f(x)] + (\hat{E}[f(x)])^2 = \hat{E}[f(x)^2] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)^2$
     Assume the data is centered: $\hat{E}[x] = 0$ (what is $\hat{E}[U^\top x]$?)
     Objective: maximize the variance of the projected data
     $\max_{U \in \mathbb{R}^{d \times k},\, U^\top U = I} \hat{E}[\|U^\top x\|^2]$
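A sketch of the projected-variance objective on centered data, again reusing the illustrative X and U (centering subtracts the empirical mean of each coordinate, so $\hat{E}[x] = 0$ and hence $\hat{E}[U^\top x] = 0$):

```python
# Center the data so that E_hat[x] = 0.
X_centered = X - X.mean(axis=1, keepdims=True)
n = X_centered.shape[1]

# Projected variance: E_hat[||U^T x||^2] = (1/n) * sum_i ||U^T x_i||^2.
projected_variance = np.sum((U.T @ X_centered) ** 2) / n
print(projected_variance)
```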

  8. Equivalence of the Two Objectives
     Key intuition: variance of data (fixed) = captured variance (want large) + reconstruction error (want small).
     Pythagorean decomposition: $x = UU^\top x + (I - UU^\top)x$, where the two terms are orthogonal, with lengths $\|UU^\top x\|$ and $\|(I - UU^\top)x\|$.
     Take expectations; note that the rotation $U$ does not affect lengths:
     $\hat{E}[\|x\|^2] = \hat{E}[\|U^\top x\|^2] + \hat{E}[\|x - UU^\top x\|^2]$
     Minimizing the reconstruction error is therefore equivalent to maximizing the captured variance.
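This decomposition is easy to check numerically; a small sketch using the centered data and the orthonormal U from the snippets above (it holds for any U with orthonormal columns, up to floating-point error):

```python
total_variance    = np.sum(X_centered ** 2) / n                             # E_hat[||x||^2]
captured_variance = np.sum((U.T @ X_centered) ** 2) / n                     # E_hat[||U^T x||^2]
recon_error       = np.sum((X_centered - U @ (U.T @ X_centered)) ** 2) / n  # E_hat[||x - U U^T x||^2]

# variance of data = captured variance + reconstruction error
print(np.allclose(total_variance, captured_variance + recon_error))         # True
```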

  9. Finding One Principal Component
     Input data: $X = (x_1 \cdots x_n)$.
     Objective: maximize the variance of the projected data
     $\max_{\|u\|=1} \hat{E}[(u^\top x)^2]$
     $= \max_{\|u\|=1} \frac{1}{n} \sum_{i=1}^{n} (u^\top x_i)^2$
     $= \max_{\|u\|=1} \frac{1}{n} \|u^\top X\|^2$
     $= \max_{\|u\|=1} u^\top \left( \frac{1}{n} X X^\top \right) u$
     $=$ largest eigenvalue of $C \stackrel{\mathrm{def}}{=} \frac{1}{n} X X^\top$
     ($C$ is the covariance matrix of the data.)
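A sketch that finds the first principal component as the top eigenvector of $C$, using numpy's symmetric eigendecomposition on the centered illustrative data from above:

```python
# Covariance matrix of the (centered) data, shape (d, d).
C = (X_centered @ X_centered.T) / n

# eigh returns the eigenvalues of a symmetric matrix in ascending order,
# so the last column of eigvecs is the top eigenvector.
eigvals, eigvecs = np.linalg.eigh(C)
u1 = eigvals[-1] and eigvecs[:, -1]   # first principal component, ||u1|| = 1
u1 = eigvecs[:, -1]
max_projected_variance = eigvals[-1]  # largest eigenvalue of C
print(max_projected_variance)
```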

  10. How Many Components?
     • Similar to the question "How many clusters?"
     • The magnitude of the eigenvalues indicates the fraction of variance captured.
     • [Plot: eigenvalues $\lambda_i$ versus $i$ (for $i = 2, \ldots, 11$) on a face image dataset; the axis runs from roughly 287 to 1353.]
     • Eigenvalues typically drop off sharply, so not many components are needed.
     • Of course, variance isn't everything...
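One common heuristic, sketched below under the assumption that the eigenvalues come from the previous snippet: keep the smallest number of components whose eigenvalues cover a chosen fraction of the total variance (the 90% threshold is an arbitrary illustrative choice):

```python
# Eigenvalues in descending order and the cumulative fraction of
# variance captured by the first components.
lam = eigvals[::-1]
explained_fraction = np.cumsum(lam) / np.sum(lam)

# Smallest k capturing at least 90% of the variance.
k = int(np.searchsorted(explained_fraction, 0.90)) + 1
print(k, explained_fraction[k - 1])
```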

  11. Computing PCA
     Method 1: eigendecomposition.
     $U$ holds the eigenvectors of the covariance matrix $C = \frac{1}{n} X X^\top$.
     Computing $C$ already takes $O(nd^2)$ time (very expensive).
     Method 2: singular value decomposition (SVD).
     Find $X = U_{d \times d} \Sigma_{d \times n} V^\top_{n \times n}$, where $U^\top U = I_{d \times d}$, $V^\top V = I_{n \times n}$, and $\Sigma$ is diagonal.
     Computing the top $k$ singular vectors takes only $O(ndk)$ time.
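A sketch of the SVD route on the centered illustrative data, keeping the top $k = 10$ directions from the running example; the left singular vectors coincide with the eigenvectors of $C$, and the squared singular values divided by $n$ are its eigenvalues:

```python
k = 10

# Thin SVD of the centered data matrix (one data point per column).
U_svd, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Top-k principal directions and the variance captured by each.
U_k = U_svd[:, :k]                   # d x k, orthonormal columns
eigvals_from_svd = (s[:k] ** 2) / n

# Project the data onto the top-k subspace.
Z = U_k.T @ X_centered               # k x n

# Agrees with the eigendecomposition of C (descending order).
print(np.allclose(eigvals_from_svd, eigvals[-k:][::-1]))
```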
