Information-theoretically Optimal Sparse PCA


  1. Information-theoretically Optimal Sparse PCA. Yash Deshpande and Andrea Montanari, Stanford University. July 3rd, 2014.

  2–3. Problem Definition. $Y_\lambda = \sqrt{\lambda/n}\, xx^T + Z$, with $Y_\lambda, Z \in \mathbb{R}^{n \times n}$, $Z_{ij} = Z_{ji}$, $x_i \sim \mathrm{Bernoulli}(\varepsilon)$ and $Z_{ij} \sim \mathrm{Normal}(0,1)$ independent. Goal: estimate $X = xx^T$ from $Y_\lambda$.
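For concreteness, a minimal Python sketch of this generative model; the dimension, sparsity, and signal-strength values are illustrative, and the convention for the diagonal of $Z$ is my assumption (it does not matter asymptotically).

```python
import numpy as np

def generate_instance(n=2000, eps=0.1, lam=4.0, seed=0):
    """Draw (x, Y) from the spiked model Y = sqrt(lam/n) * x x^T + Z."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, eps, size=n).astype(float)  # x_i ~ Bernoulli(eps)
    G = rng.normal(size=(n, n))
    Z = (G + G.T) / np.sqrt(2)  # symmetric; off-diagonal entries ~ Normal(0, 1)
    Y = np.sqrt(lam / n) * np.outer(x, x) + Z
    return x, Y
```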

  4–5. An example: gene expression data [Baechler et al., 2003, PNAS]. • Genes × patients matrix • Blue: lupus patients, aqua: healthy controls • Black: a subset of immune-system-specific genes. This motivates a simple probabilistic model.

  6. Related work. Detection and estimation: $Y = X + \text{noise}$. • $X \in S \subset \{0,1\}^n$, a known set • Goal: hypothesis testing, support recovery • [Donoho, Jin 2004], [Addario-Berry et al. 2010], [Arias-Castro et al. 2011] . . .

  7. Related work. Machine learning: maximize $\langle v, Y_\lambda v \rangle$ subject to $\|v\|_2 \le 1$, $v$ sparse. • Goal: maximize "variance", support recovery • [d'Aspremont et al. 2004], [Moghaddam et al. 2005], [Zou et al. 2006], [Amini, Wainwright 2009], [Papailiopoulos et al. 2013] . . .

  8. Related work. Information theory: minimize $\|Y_\lambda - vv^T\|_F^2 + f(v)$. • Probabilistic model for $x$, $Y_\lambda$ • Propose approximate message passing algorithms • [Rangan, Fletcher 2012], [Kabashima et al. 2014]

  9–10. A first try: simple PCA. $Y_\lambda = \sqrt{\lambda/n}\, xx^T + Z$. Estimate $x$ using the scaled principal eigenvector $x_1(Y_\lambda)$.
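A minimal sketch of this estimator. The slide does not specify the scaling of $x_1(Y_\lambda)$; normalizing so that $\|\hat{x}\|^2 = n\varepsilon \approx \|x\|^2$ is my choice.

```python
import numpy as np

def pca_estimate(Y, eps):
    """Plain PCA: scaled principal eigenvector of Y / sqrt(n)."""
    n = Y.shape[0]
    w, V = np.linalg.eigh(Y / np.sqrt(n))
    v = V[:, np.argmax(w)]       # unit-norm principal eigenvector
    if v.sum() < 0:              # resolve the sign ambiguity (x is nonnegative)
        v = -v
    return np.sqrt(n * eps) * v  # scale so ||x_hat||^2 = n * eps (assumed convention)
```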

  11–14. Limitations of PCA. [Figures: limiting spectral density of $Y_\lambda/\sqrt{n}$ in each regime; axis ticks at $-2$ and $2$.] If $\lambda\varepsilon^2 > 1$: $\lim_{n\to\infty} \langle x_1(Y_\lambda), x \rangle / (\sqrt{n}\,\varepsilon) > 0$ a.s. If $\lambda\varepsilon^2 < 1$: $\lim_{n\to\infty} \langle x_1(Y_\lambda), x \rangle / (\sqrt{n}\,\varepsilon) = 0$ a.s. [Knowles, Yin, 2011]
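A quick empirical check of the $\lambda\varepsilon^2$ threshold, reusing `generate_instance` from above. Using a unit-norm eigenvector in the correlation is one reading of the slide's normalization, and the parameter values are illustrative.

```python
import numpy as np

def pca_correlation(lam, eps, n=2000, seed=0):
    """Correlation <x_1(Y), x> / (sqrt(n) * eps) with a unit-norm eigenvector."""
    x, Y = generate_instance(n=n, eps=eps, lam=lam, seed=seed)
    w, V = np.linalg.eigh(Y / np.sqrt(n))
    v = V[:, np.argmax(w)]
    return abs(v @ x) / (np.sqrt(n) * eps)

# pca_correlation(lam=4.0, eps=0.1)  # lam * eps^2 = 0.04 < 1: correlation near 0
# pca_correlation(lam=4.0, eps=0.6)  # lam * eps^2 = 1.44 > 1: bounded away from 0
```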

  15–17. Our contributions. • Poly-time algorithm that exploits sparsity • Provably optimal in terms of MSE when $\varepsilon > \varepsilon_c$ • "Single-letter" characterization of the MMSE

  18–22. Single-letter characterization. Original high-dimensional problem: $Y_\lambda = \sqrt{\lambda/n}\, xx^T + Z$, with $\text{M-mmse}(\lambda, n) \equiv \frac{1}{n^2}\mathbb{E}\{\|X - \mathbb{E}\{X \mid Y_\lambda\}\|_F^2\}$. Scalar problem: $Y_\lambda = \sqrt{\lambda}\, X_0 + Z$, with $\text{S-mmse}(\lambda) \equiv \mathbb{E}\{(X_0 - \mathbb{E}\{X_0 \mid Y_\lambda\})^2\}$. Here $X_0 \sim \mathrm{Bernoulli}(\varepsilon)$, $Z \sim \mathrm{Normal}(0,1)$.
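Both scalar quantities are easy to compute numerically. A sketch: the posterior-mean formula is just Bayes' rule for the two-point prior, and $\text{S-mmse}(\lambda) = \varepsilon - \mathbb{E}\{\mathbb{E}\{X_0 \mid Y_\lambda\}^2\}$ by the tower property; the quadrature grid is an illustrative choice.

```python
import numpy as np

def phi(z):
    """Standard normal density."""
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def posterior_mean(y, lam, eps):
    """E[X0 | sqrt(lam) X0 + Z = y] for X0 ~ Bernoulli(eps), Z ~ Normal(0, 1)."""
    p1 = eps * phi(y - np.sqrt(lam))
    return p1 / (p1 + (1 - eps) * phi(y))

def S_mmse(lam, eps):
    """S-mmse(lam) = eps - E[(E[X0 | Y])^2], via quadrature over the mixture density."""
    y = np.linspace(-12.0, 12.0 + np.sqrt(lam), 20001)  # covers both mixture components
    density = eps * phi(y - np.sqrt(lam)) + (1 - eps) * phi(y)
    eta = posterior_mean(y, lam, eps)
    return eps - np.sum(eta**2 * density) * (y[1] - y[0])
```

As a sanity check, $\text{S-mmse}(0) = \varepsilon(1 - \varepsilon)$, the prior variance.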

  23–24. Main result. Theorem (Deshpande, Montanari 2014). There exists an $\varepsilon_c < 1$ such that the following happens. For every $\varepsilon > \varepsilon_c$, $\lim_{n\to\infty} \text{M-mmse}(\lambda, n) = \varepsilon^2 - \tau_*^2$, where $\tau_* = \varepsilon - \text{S-mmse}(\lambda\tau_*)$. Further, there exists a polynomial-time algorithm that achieves this MSE. Here $\varepsilon_c \approx 0.05$ (the solution to a scalar non-linear equation).

  25–27. Making use of sparsity. The power iteration with $A = Y_\lambda/\sqrt{n}$: $x^{t+1} = A x^t$. Improvement: $x^{t+1} = A F_t(x^t)$, where $F_t(x^t) = (f_t(x^t_1), \ldots, f_t(x^t_n))^T$. Choose $f_t$ to exploit sparsity.
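A sketch of the nonlinear power iteration. The slides leave $f_t$ open at this point; the soft threshold below is one simple sparsity-exploiting stand-in, my illustrative choice rather than the optimal $f_t$ (which turns out to be the scalar posterior mean).

```python
import numpy as np

def soft_threshold(x, theta=0.5):
    """A simple sparsity-promoting denoiser (illustrative choice of f_t)."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def nonlinear_power_iteration(Y, f=soft_threshold, T=30, seed=0):
    """Iterate x^{t+1} = A F_t(x^t), with A = Y / sqrt(n) and f applied entrywise."""
    n = Y.shape[0]
    A = Y / np.sqrt(n)
    x = np.random.default_rng(seed).normal(size=n)  # illustrative initialization
    for _ in range(T):
        x = A @ f(x)
    return x
```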

  28–29. A heuristic analysis. Expanding the $i$-th entry of $x^{t+1}$: $x^{t+1}_i = \sqrt{\lambda}\, \frac{\langle x, F_t(x^t) \rangle}{n}\, x_i + \frac{1}{\sqrt{n}} \sum_j Z_{ij} f_t(x^t_j)$, where the first coefficient is $\approx \mu_t$ and the noise term is $\approx \mathrm{Normal}(0, \tau_t)$. Thus $x^{t+1} \stackrel{d}{\approx} \mu_t x + \sqrt{\tau_t}\, z$, where $z \sim \mathrm{Normal}(0, I_n)$.

  30–31. Approximate Message Passing (AMP). This analysis is obviously wrong, but... it is asymptotically exact for the modified iteration: $x^{t+1} = A\, \hat{x}^t - b_t\, \hat{x}^{t-1}$, where $\hat{x}^t = F_t(x^t)$. [Donoho, Maleki, Montanari 2009], [Bayati, Montanari 2011], [Rangan, Fletcher 2012].
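A minimal AMP sketch. The slide does not spell out the coefficient $b_t$; the usual Onsager term $b_t = \frac{1}{n}\sum_i f_t'(x^t_i)$ from the cited AMP papers is assumed here.

```python
import numpy as np

def amp(Y, f, fprime, T=30):
    """AMP: x^{t+1} = A f(x^t) - b_t f(x^{t-1}), with A = Y / sqrt(n) and the
    (assumed) Onsager coefficient b_t = mean of f'(x^t)."""
    n = Y.shape[0]
    A = Y / np.sqrt(n)
    x = np.ones(n)            # illustrative initialization
    xhat_prev = np.zeros(n)   # f(x^{-1}) taken to be 0
    for _ in range(T):
        xhat = f(x)
        b = fprime(x).mean()
        x, xhat_prev = A @ xhat - b * xhat_prev, xhat
    return xhat               # the estimate of x (up to the scaling mu_t)
```

Here `f` and `fprime` are a scalar denoiser and its derivative, applied entrywise, e.g. the posterior mean from the scalar channel evaluated at the current $(\mu_t, \tau_t)$.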

  32–36. Asymptotic behavior. [Figures: histograms of $x^t_i - \mu_t x_i$ for the power method and for AMP at $t = 2, 4, 8, 12, 16$. The power-method residuals spread out as $t$ grows, from roughly $[-2, 2]$ at $t = 2$ to the hundreds by $t = 16$, while the AMP residuals concentrate, shrinking to roughly $[-0.1, 0.15]$ by $t = 16$.]

  37. Asymptotic behavior: a lemma. Lemma. Let $f_t$ be a sequence of Lipschitz functions. For every fixed $t$ and uniformly random $i$: $(x_i, x^t_i) \stackrel{d}{\to} (X_0, \mu_t X_0 + \sqrt{\tau_t}\, Z)$ almost surely.

  38–39. State evolution. Deterministic recursions: $\mu_{t+1} = \sqrt{\lambda}\, \mathbb{E}\{X_0 f_t(\mu_t X_0 + \sqrt{\tau_t} Z)\}$ and $\tau_{t+1} = \mathbb{E}\{f_t(\mu_t X_0 + \sqrt{\tau_t} Z)^2\}$. With the optimal $f_t$: $\mu_{t+1} = \sqrt{\lambda}\, \tau_{t+1}$ and $\tau_{t+1} = \varepsilon - \text{S-mmse}(\lambda\tau_t)$.
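With the `S_mmse` sketch from the single-letter slides, state evolution under the optimal $f_t$ collapses to a one-dimensional iteration; a minimal sketch, with the initialization as my illustrative choice:

```python
def state_evolution(lam, eps, T=200):
    """Iterate tau_{t+1} = eps - S-mmse(lam * tau_t); returns an approximation of
    the fixed point tau_*. Reuses S_mmse from the earlier sketch."""
    tau = 0.0  # tau_1 = eps - S-mmse(0) = eps^2 > 0, so the iteration takes off
    for _ in range(T):
        tau = eps - S_mmse(lam * tau, eps)
    return tau

# Predicted limiting matrix MMSE (valid for eps > eps_c, per the main result):
# tau_star = state_evolution(lam=4.0, eps=0.3)
# print(0.3**2 - tau_star**2)
```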

  40–48. State evolution: an illustration. [Figures: cobweb plot of the iteration $\tau_{t+1} = \varepsilon - \text{S-mmse}(\lambda\tau_t)$, tracing $\tau_1, \tau_2, \tau_3, \ldots$ up the curve to the fixed point $\tau_*$, where $\text{M-mmse}(\lambda) = \varepsilon^2 - \tau_*^2$.]

  49–50. Proof sketch: MSE expression. Using the estimator $\hat{X}^t = \hat{x}^t(\hat{x}^t)^T$: $\text{mse}(\hat{X}^t, \lambda) = \frac{1}{n^2}\mathbb{E}\{\|\hat{x}^t(\hat{x}^t)^T - xx^T\|_F^2\} = \frac{1}{n^2}\mathbb{E}\{\|\hat{x}^t\|^4\} + \frac{1}{n^2}\mathbb{E}\{\|x\|^4\} - \frac{2}{n^2}\mathbb{E}\{\langle \hat{x}^t, x \rangle^2\} \to \varepsilon^2 - \tau^2_{t+1}$. Thus $\text{mse}_{\mathrm{AMP}}(\lambda) = \lim_{t\to\infty}\lim_{n\to\infty} \text{mse}(\hat{X}^t, \lambda) = \varepsilon^2 - \tau_*^2$.
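Filling in the step the slide leaves implicit, using the state-evolution lemma and the fact that the optimal $f_t$ is a posterior mean (so $\mathbb{E}\{X_0 f_t\} = \mathbb{E}\{f_t^2\} = \tau_{t+1}$):

$$\frac{1}{n^2}\mathbb{E}\{\|x\|^4\} \to \varepsilon^2, \qquad \frac{1}{n^2}\mathbb{E}\{\|\hat{x}^t\|^4\} \to \tau_{t+1}^2, \qquad \frac{1}{n^2}\mathbb{E}\{\langle \hat{x}^t, x \rangle^2\} \to \tau_{t+1}^2,$$

and the three terms combine to $\tau_{t+1}^2 + \varepsilon^2 - 2\tau_{t+1}^2 = \varepsilon^2 - \tau_{t+1}^2$.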

  51–53. Proof sketch: I-MMSE identity. Pointwise, $\text{M-mmse}(\lambda) \le \text{mse}_{\mathrm{AMP}}(\lambda)$. Integrating: $\frac{1}{4}\int_0^\infty \text{M-mmse}(\lambda)\,\mathrm{d}\lambda \le \frac{1}{4}\int_0^\infty \text{mse}_{\mathrm{AMP}}(\lambda)\,\mathrm{d}\lambda$, and by the I-MMSE identity the left-hand side equals $I(X; Y_\infty) - I(X; Y_0)$.
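As I read the sandwich the slides set up: the I-MMSE identity makes the left integral exactly this mutual-information difference, a direct computation via the state-evolution fixed point gives the same value for the right integral, and together with the pointwise inequality this forces equality almost everywhere:

$$\frac{1}{4}\int_0^\infty \text{M-mmse}(\lambda)\,\mathrm{d}\lambda = I(X; Y_\infty) - I(X; Y_0) = \frac{1}{4}\int_0^\infty \text{mse}_{\mathrm{AMP}}(\lambda)\,\mathrm{d}\lambda \implies \text{M-mmse}(\lambda) = \text{mse}_{\mathrm{AMP}}(\lambda) \text{ for a.e. } \lambda.$$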
