Information-theoretically Optimal Sparse PCA
Yash Deshpande and Andrea Montanari, Stanford University
July 3rd, 2014
Problem Definition

Y_λ = √(λ/n) xx^T + Z,

where Z_ij = Z_ji, x_i ∼ Bernoulli(ε), and Z_ij ∼ Normal(0, 1) independent.
Estimate X = xx^T from Y_λ.
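As a concrete reference, here is a minimal simulation sketch of this spiked-Wigner model; the parameter values (n, λ, ε) and the treatment of the diagonal of Z are illustrative choices, not specified in the talk.

```python
import numpy as np

def sample_instance(n, lam, eps, seed=0):
    """Sample (x, Y) from Y = sqrt(lam/n) * x x^T + Z with symmetric Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, eps, size=n).astype(float)   # x_i ~ Bernoulli(eps)
    G = rng.normal(size=(n, n))
    Z = (G + G.T) / np.sqrt(2)                       # symmetric; off-diagonal entries ~ N(0, 1)
    Y = np.sqrt(lam / n) * np.outer(x, x) + Z
    return x, Y

# illustrative parameters (not from the talk)
x, Y = sample_instance(n=2000, lam=4.0, eps=0.1)
```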
An example: gene expression data [Baechler et al., 2003, PNAS]
• Genes × patients matrix
• Blue: lupus patients; aqua: healthy controls
• Black: a subset of immune-system-specific genes
A simple probabilistic model
Related work: detection and estimation
Y = X + noise.
• X ∈ S ⊂ {0, 1}^n, a known set
• Goal: hypothesis testing, support recovery
• [Donoho, Jin 2004], [Addario-Berry et al. 2010], [Arias-Castro et al. 2011], ...
Related work: machine learning
maximize ⟨v, Y_λ v⟩
subject to: ‖v‖_2 ≤ 1, v is sparse.
• Goal: maximize "variance", support recovery
• [d'Aspremont et al. 2004], [Moghaddam et al. 2005], [Zou et al. 2006], [Amini, Wainwright 2009], [Papailiopoulos et al. 2013], ...
Related work: information theory
minimize ‖Y_λ − vv^T‖_F² + f(v).
• Probabilistic model for x, Y_λ
• Propose approximate message passing algorithms
• [Rangan, Fletcher 2012], [Kabashima et al. 2014]
A first try: simple PCA

Y_λ = √(λ/n) xx^T + Z.

Estimate x using the scaled principal eigenvector x_1(Y_λ).
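A minimal sketch of this baseline, continuing the simulation sketch above; the sign of an eigenvector is arbitrary, so the overlap below is reported in absolute value.

```python
import numpy as np

def pca_estimate(Y):
    """Principal eigenvector x_1(Y) of the symmetric matrix Y (unit norm)."""
    eigvals, eigvecs = np.linalg.eigh(Y)   # eigenvalues in ascending order
    return eigvecs[:, -1]

# overlap statistic used on the next slide: <x_1(Y), x> / (sqrt(n) * eps)
v = pca_estimate(Y)
overlap = abs(v @ x) / (np.sqrt(len(x)) * 0.1)   # eps = 0.1 in the sketch above
```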
Limitations of PCA

[Figure: limiting spectral density of Y_λ/√n, supported on [−2, 2], with and without an outlying eigenvalue.]

If λε² > 1:  lim_{n→∞} ⟨x_1(Y_λ), x⟩ / (√n ε) > 0 a.s.
If λε² < 1:  lim_{n→∞} ⟨x_1(Y_λ), x⟩ / (√n ε) = 0 a.s.

[Knowles, Yin 2011]
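A back-of-the-envelope check of this threshold (not spelled out on the slides): since ‖x‖² ≈ εn, the rescaled matrix Y_λ/√n is a Wigner matrix (bulk spectrum [−2, 2]) plus a rank-one spike of operator norm ≈ √λ ε, and the BBP-type transition for unit-variance Wigner noise occurs exactly when this spike strength exceeds 1, i.e. when λε² > 1:

```latex
\frac{Y_\lambda}{\sqrt{n}} \;=\; \sqrt{\lambda}\,\frac{x x^{\mathsf T}}{n} + \frac{Z}{\sqrt{n}},
\qquad
\Big\| \sqrt{\lambda}\,\tfrac{x x^{\mathsf T}}{n} \Big\|_{\mathrm{op}}
 \;=\; \sqrt{\lambda}\,\frac{\|x\|^2}{n} \;\approx\; \sqrt{\lambda}\,\varepsilon .
```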
Our contributions
• Poly-time algorithm that exploits sparsity
• Provably optimal in terms of MSE when ε > ε_c
• "Single-letter" characterization of MMSE
Single-letter characterization

Original high-dimensional problem:
Y_λ = √(λ/n) xx^T + Z,
M-mmse(λ, n) ≡ (1/n²) E{ ‖X − E{X | Y_λ}‖_F² }.

Scalar problem:
Y_λ = √λ X_0 + Z,
S-mmse(λ) ≡ E{ (X_0 − E{X_0 | Y_λ})² }.

Here X_0 ∼ Bernoulli(ε), Z ∼ Normal(0, 1).
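For this two-point prior the scalar posterior mean has a closed form (a standard Gaussian-channel computation, not shown on the slide), which is what gets plugged into S-mmse(λ) = E{(X_0 − E{X_0 | Y_λ})²}:

```latex
\mathbb{E}\{X_0 \mid Y_\lambda = y\}
 \;=\; \frac{\varepsilon\, e^{-(y-\sqrt{\lambda})^2/2}}
            {\varepsilon\, e^{-(y-\sqrt{\lambda})^2/2} + (1-\varepsilon)\, e^{-y^2/2}}
 \;=\; \frac{1}{1 + \tfrac{1-\varepsilon}{\varepsilon}\, e^{\lambda/2 - \sqrt{\lambda}\, y}} .
```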
Main result

Theorem (Deshpande, Montanari 2014). There exists an ε_c < 1 such that the following happens. For every ε > ε_c,

lim_{n→∞} M-mmse(λ, n) = ε² − τ_*²,

where τ_* = ε − S-mmse(λτ_*). Further, there exists a polynomial-time algorithm that achieves this MSE.

ε_c ≈ 0.05 (solution to a scalar non-linear equation)
Making use of sparsity

The power iteration with A = Y_λ/√n:
x^{t+1} = A x^t.

Improvement:
x^{t+1} = A F_t(x^t), where F_t(x^t) = (f_t(x^t_1), ..., f_t(x^t_n))^T.
Choose f_t to exploit sparsity.
A heuristic analysis

Expanding the i-th entry of x^{t+1}:

x^{t+1}_i = √λ (⟨x, F_t(x^t)⟩ / n) x_i + (1/√n) Σ_j Z_ij f_t(x^t_j),

where the coefficient √λ ⟨x, F_t(x^t)⟩/n ≈ µ_t and the second term ≈ Normal(0, τ_t). Thus:

x^{t+1} ≈_d µ_t x + √τ_t z,   where z ∼ Normal(0, I_n).
Approximate Message Passing (AMP)

This analysis is obviously wrong, but... it is asymptotically exact for the modified iteration:

x^{t+1} = A x̂^t − b_t x̂^{t−1},   x̂^t = F_t(x^t).

[Donoho, Maleki, Montanari 2009], [Bayati, Montanari 2011], [Rangan, Fletcher 2012].
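A minimal sketch of this AMP iteration, continuing the simulation sketch above. The denoiser f_t is the Bernoulli(ε) posterior mean for the effective scalar channel, the Onsager coefficient b_t = (1/n) Σ_i f_t'(x^t_i) follows the standard AMP recipe, and the way (µ_t, τ_t) are tracked here is one simple heuristic choice, not necessarily the calibration used in the talk.

```python
import numpy as np

def posterior_mean(y, mu, tau, eps):
    """E[X_0 | mu*X_0 + sqrt(tau)*Z = y] for X_0 ~ Bernoulli(eps)."""
    logit = np.log(eps / (1 - eps)) + (mu * y - mu**2 / 2) / tau
    return 1.0 / (1.0 + np.exp(-logit))

def amp(Y, lam, eps, t_max=30):
    """AMP sketch for Y = sqrt(lam/n) x x^T + Z: x^{t+1} = A f_t(x^t) - b_t f_{t-1}(x^{t-1})."""
    n = Y.shape[0]
    A = Y / np.sqrt(n)
    x_t = np.full(n, eps)                     # crude initialization
    f_prev = np.zeros(n)
    mu, tau = np.sqrt(lam) * eps, eps         # heuristic starting calibration
    for _ in range(t_max):
        f_t = posterior_mean(x_t, mu, tau, eps)
        b_t = np.mean((mu / tau) * f_t * (1.0 - f_t))   # (1/n) sum_i f_t'(x^t_i)
        x_t = A @ f_t - b_t * f_prev                    # Onsager-corrected power step
        f_prev = f_t
        tau = np.mean(f_t**2)                 # empirical tau_{t+1} = E{f_t^2}
        mu = np.sqrt(lam) * tau               # calibration mu_{t+1} = sqrt(lam) * tau_{t+1}
    return f_t                                # posterior-mean estimate of x

x_hat = amp(Y, lam=4.0, eps=0.1)              # illustrative parameters, as above
```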
Asymptotic behavior (t = 2, 4, 8, 12, 16)

[Figure: histograms of x^t_i − µ_t x_i for the power method (left) and AMP (right) at t = 2, 4, 8, 12, 16; the power-method residuals spread out as t grows, while the AMP residuals stay Gaussian-looking and shrink.]
Asymptotic behavior: a lemma

Lemma. Let f_t be a sequence of Lipschitz functions. For every fixed t and uniformly random i:

(x_i, x^t_i) →_d (X_0, µ_t X_0 + √τ_t Z) almost surely.
State evolution

Deterministic recursions:
µ_{t+1} = √λ E{ X_0 f_t(µ_t X_0 + √τ_t Z) },
τ_{t+1} = E{ f_t(µ_t X_0 + √τ_t Z)² }.

With the optimal f_t:
µ_{t+1} = √λ τ_{t+1},
τ_{t+1} = ε − S-mmse(λτ_t).
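Why the Bayes-optimal (posterior-mean) choice of f_t collapses the two recursions into the single equation above; this is a standard step not spelled out on the slide. With f_t(y) = E{X_0 | µ_t X_0 + √τ_t Z = y}, the tower property gives E{X_0 f_t} = E{f_t²}, hence µ_{t+1} = √λ τ_{t+1}; and since the effective scalar channel has signal-to-noise ratio µ_t²/τ_t = λτ_t,

```latex
\tau_{t+1} \;=\; \mathbb{E}\{f_t^2\}
 \;=\; \mathbb{E}\{X_0^2\} - \mathbb{E}\{(X_0 - f_t)^2\}
 \;=\; \varepsilon \;-\; \text{S-mmse}(\lambda\tau_t).
```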
State evolution: an illustration

[Figure: cobweb plot of the map τ_t ↦ ε − S-mmse(λτ_t); the iterates τ_1, τ_2, τ_3, ... converge to the fixed point τ_*, at which M-mmse(λ) = ε² − τ_*².]
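A small numerical sketch of this fixed-point computation (not from the talk): S-mmse is estimated by Monte Carlo over the scalar channel, and the map τ ↦ ε − S-mmse(λτ) is iterated starting from τ_0 = ε; the parameter values are illustrative.

```python
import numpy as np

def s_mmse(snr, eps, n_mc=200_000, seed=0):
    """Monte Carlo estimate of S-mmse(snr) for X_0 ~ Bernoulli(eps), Y = sqrt(snr)*X_0 + Z."""
    rng = np.random.default_rng(seed)
    x0 = rng.binomial(1, eps, n_mc).astype(float)
    y = np.sqrt(snr) * x0 + rng.normal(size=n_mc)
    # posterior mean for the two-point prior (see the closed form given earlier)
    logit = np.log(eps / (1 - eps)) + np.sqrt(snr) * y - snr / 2
    post = 1.0 / (1.0 + np.exp(-logit))
    return np.mean((x0 - post) ** 2)

def state_evolution_fixed_point(lam, eps, t_max=100, tol=1e-6):
    """Iterate tau -> eps - S-mmse(lam * tau) until it stabilizes."""
    tau = eps
    for _ in range(t_max):
        tau_new = eps - s_mmse(lam * tau, eps)
        if abs(tau_new - tau) < tol:
            break
        tau = tau_new
    return tau

# illustrative values (not from the talk)
lam, eps = 4.0, 0.1
tau_star = state_evolution_fixed_point(lam, eps)
print("predicted limiting M-mmse:", eps**2 - tau_star**2)
```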
Proof sketch: MSE expression

Using the estimator X̂^t = x̂^t (x̂^t)^T:

mse(X̂^t, λ) = (1/n²) E{ ‖x̂^t (x̂^t)^T − xx^T‖_F² }
            = (1/n²) E{ ‖x̂^t‖⁴ } + (1/n²) E{ ‖x‖⁴ } − (2/n²) E{ ⟨x̂^t, x⟩² }
            → ε² − τ²_{t+1}.

Thus
mse_AMP(λ) = lim_{t→∞} lim_{n→∞} mse(X̂^t, λ) = ε² − τ_*².
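The limits of the three terms follow from the lemma on the previous slides together with the posterior-mean identity E{X_0 f_t} = E{f_t²} (a step compressed on the slide):

```latex
\frac{\|x\|^2}{n} \to \varepsilon, \qquad
\frac{\|\hat{x}^t\|^2}{n} \to \tau_{t+1}, \qquad
\frac{\langle \hat{x}^t, x\rangle}{n} \to \tau_{t+1}
\quad\Longrightarrow\quad
\mathrm{mse} \;\to\; \tau_{t+1}^2 + \varepsilon^2 - 2\tau_{t+1}^2 \;=\; \varepsilon^2 - \tau_{t+1}^2 .
```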
Proof sketch: I-MMSE identity

M-mmse(λ) ≤ mse_AMP(λ) for every λ, and hence

(1/4) ∫_0^∞ M-mmse(λ) dλ ≤ (1/4) ∫_0^∞ mse_AMP(λ) dλ.

By the I-MMSE identity, the left-hand side equals (up to normalization) I(X; Y_∞) − I(X; Y_0). Showing that the right-hand side evaluates to the same mutual-information difference forces the pointwise inequality to be an equality for almost every λ.