Low-rank Matrix Estimation via Approximate Message Passing

Andrea Montanari (Stanford University) and Ramji Venkataramanan (University of Cambridge)

WoLA 2018
The Spiked Model

$$A = \sum_{i=1}^{k} \lambda_i v_i v_i^\top + W \in \mathbb{R}^{n \times n}$$

• $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_k$ are deterministic scalars
• $v_1, \dots, v_k \in \mathbb{R}^n$ are orthonormal vectors
• $W \sim \mathrm{GOE}(n)$, i.e., $W$ is symmetric with $(W_{ii})_{i \le n} \sim_{i.i.d.} \mathsf{N}(0, \tfrac{2}{n})$ and $(W_{ij})_{i < j \le n} \sim_{i.i.d.} \mathsf{N}(0, \tfrac{1}{n})$

GOAL: estimate the vectors $v_1, \dots, v_k$ from $A$
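A minimal numpy sketch (not from the talk) of how one can sample from this model; `spiked_matrix` is a hypothetical helper, and the $v_i$'s are drawn here as a uniformly random orthonormal frame:

```python
import numpy as np

def spiked_matrix(n, lambdas, rng):
    """Sample A = sum_i lambda_i v_i v_i^T + W with W ~ GOE(n)."""
    k = len(lambdas)
    # Random orthonormal v_1, ..., v_k via QR of a Gaussian matrix.
    V, _ = np.linalg.qr(rng.standard_normal((n, k)))
    # GOE(n): W symmetric, off-diagonal variance 1/n, diagonal variance 2/n.
    G = rng.standard_normal((n, n)) / np.sqrt(2 * n)
    W = G + G.T
    return V @ np.diag(lambdas) @ V.T + W, V
```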
Spectrum of the spiked matrix

$$A = \sum_{i=1}^{k} \lambda_i v_i v_i^\top + W$$

Random matrix theory and the 'BBP' phase transition:
• The bulk of the eigenvalues of $A$ lies in $[-2, 2]$, distributed according to Wigner's semicircle law
• Outlier eigenvalues appear for the $\lambda_i$'s with $|\lambda_i| > 1$: $z_i \to \lambda_i + \frac{1}{\lambda_i} > 2$
• The eigenvectors $\varphi_i$ corresponding to the outliers $z_i$ satisfy $|\langle \varphi_i, v_i \rangle| \to \sqrt{1 - \lambda_i^{-2}}$

[Baik, Ben Arous, Péché '05], [Baik, Silverstein '06], [Capitaine, Donati-Martin, Féral '09], [Benaych-Georges, Nadakuditi '11], ...
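A quick numerical check of the BBP predictions (a sketch assuming the `spiked_matrix` helper above; the limits hold only as $n \to \infty$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 4000, 1.5
A, V = spiked_matrix(n, [lam], rng)       # rank-one spike with lambda > 1
z, Phi = np.linalg.eigh(A)                # eigenvalues in increasing order
print(z[-1], lam + 1 / lam)               # outlier eigenvalue: z_1 -> lambda + 1/lambda > 2
print(abs(Phi[:, -1] @ V[:, 0]),          # eigenvector overlap:
      np.sqrt(1 - lam**-2))               # |<phi_1, v_1>| -> sqrt(1 - lambda^{-2})
```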
Structural information

$$A = \sum_{i=1}^{k} \lambda_i v_i v_i^\top + W$$

When the $v_i$'s are unstructured, e.g., drawn uniformly at random from the unit sphere:
• The best estimator of $v_i$ is the $i$-th eigenvector $\varphi_i$
• If $|\lambda_i| \ge 1$, then $|\langle v_i, \varphi_i \rangle| \to \sqrt{1 - \frac{1}{\lambda_i^2}}$

But we often have structural information about the $v_i$'s:
• For example, the $v_i$'s may be sparse, bounded, non-negative, etc.
• Relevant for many applications: sparse PCA, non-negative PCA, community detection under the stochastic block model, ...
• Such structure can be exploited to improve on spectral methods
Prior on eigenvectors

$$A = \sum_{i=1}^{k} \lambda_i v_i v_i^\top + W \equiv V \Lambda V^\top + W, \qquad V = [v_1 \; v_2 \; \dots \; v_k] \in \mathbb{R}^{n \times k}$$

If each row of $V$ is $\sim_{i.i.d.} P_V$, then the Bayes-optimal estimator (for squared error) is
$$\hat{V}_{\text{Bayes}} = \mathbb{E}[V \mid A]$$
• Generally not computable
• Closed-form expressions exist for the asymptotic Bayes error [Deshpande, Montanari '14], [Barbier et al. '16], [Lesieur et al. '17], [Miolane, Lelarge '16], ...
Computable estimators

$$A = \sum_{i=1}^{k} \lambda_i v_i v_i^\top + W \equiv V \Lambda V^\top + W$$

• Convex relaxations generally do not achieve the Bayes-optimal error [Javanmard, Montanari, Ricci-Tersenghi '16]
• MCMC can approximate the Bayes estimator, but can have very large mixing time and is hard to analyze

In this talk: an Approximate Message Passing (AMP) algorithm to estimate $V$
Rank-one spiked model

$$A = \frac{\lambda}{n} v v^\top + W, \qquad v \sim_{i.i.d.} P_V, \quad \mathbb{E}[V^2] = 1$$

Power iteration for the principal eigenvector: $x^{t+1} = A x^t$, with $x^0$ chosen at random.

AMP:
$$x^{t+1} = A f_t(x^t) - \mathsf{b}_t f_{t-1}(x^{t-1}), \qquad \mathsf{b}_t = \frac{1}{n} \sum_{i=1}^{n} f_t'(x_i^t)$$
• The non-linear function $f_t$ (applied entrywise) is chosen based on the structural information about $v$
• The memory term ensures a nice distributional property for the iterates in high dimensions
• The iteration can be derived via approximation of the belief propagation equations
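A sketch of the iteration in numpy (not the authors' code): `f` and `fprime` apply the denoiser $f_t$ and its derivative entrywise, with the convention $f_{-1} \equiv 0$:

```python
import numpy as np

def amp(A, f, fprime, x0, T):
    """Run x^{t+1} = A f_t(x^t) - b_t f_{t-1}(x^{t-1}) for T steps."""
    x_prev, x = None, x0
    for t in range(T):
        ft = f(x, t)
        bt = np.mean(fprime(x, t))        # Onsager/memory coefficient b_t
        x_next = A @ ft
        if x_prev is not None:            # convention: f_{-1} = 0
            x_next -= bt * f(x_prev, t - 1)
        x_prev, x = x, x_next
    return x
```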
State evolution

$$x^{t+1} = A f_t(x^t) - \mathsf{b}_t f_{t-1}(x^{t-1}), \qquad \mathsf{b}_t = \frac{1}{n} \sum_{i=1}^{n} f_t'(x_i^t)$$

If we initialize with $x^0$ independent of $A$, then as $n \to \infty$:
$$x^t \longrightarrow \mu_t v + \sigma_t g$$
• $g \sim_{i.i.d.} \mathsf{N}(0, 1)$, independent of $v \sim_{i.i.d.} P_V$
• The scalars $\mu_t, \sigma_t^2$ are recursively determined by
$$\mu_{t+1} = \lambda\, \mathbb{E}[V f_t(\mu_t V + \sigma_t G)], \qquad \sigma_{t+1}^2 = \mathbb{E}[f_t(\mu_t V + \sigma_t G)^2]$$
• Initialize with $\mu_0 = \frac{1}{n} |\mathbb{E}\langle x^0, v \rangle|$

[Bayati, Montanari '11], [Rangan, Fletcher '12], [Deshpande, Montanari '14]
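The scalar recursion can be tracked by Monte Carlo; a sketch (using, for concreteness, the Rademacher prior $P_V = \text{uniform}\{+1, -1\}$ that appears later in the talk):

```python
import numpy as np

def state_evolution(f, lam, mu0, sigma0, T, n_mc=200_000, seed=0):
    """Iterate mu_{t+1} = lam E[V f_t(mu_t V + s_t G)], s_{t+1}^2 = E[f_t(...)^2]."""
    rng = np.random.default_rng(seed)
    V = rng.choice([-1.0, 1.0], size=n_mc)   # samples from P_V (Rademacher here)
    G = rng.standard_normal(n_mc)
    mu, sigma = mu0, sigma0
    for t in range(T):
        ft = f(mu * V + sigma * G, t)
        mu, sigma = lam * np.mean(V * ft), np.sqrt(np.mean(ft**2))
    return mu, sigma
```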
Bayes-optimal AMP

Assuming $x^t = \mu_t v + \sigma_t g$, choose
$$f_t(y) = \mathbb{E}[V \mid \mu_t V + \sigma_t G = y]$$
State evolution then becomes
$$\gamma_{t+1} = \lambda^2 \left(1 - \mathrm{mmse}(\gamma_t)\right), \qquad \text{with } \mu_t = \sigma_t^2 = \gamma_t$$
[Figure: state evolution map for $P_V \sim \text{uniform}\{+1, -1\}$, $\lambda = \sqrt{2}$]

The initial value is $\gamma_0 \propto \frac{1}{n} |\mathbb{E}\langle x^0, v \rangle|$; what is $\lim_{t \to \infty} \gamma_t$?
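For the $\pm 1$ example above, $\mathrm{mmse}(\gamma) = 1 - \mathbb{E}[\tanh(\gamma + \sqrt{\gamma}\, G)]$, so the one-dimensional recursion can be iterated directly; a Monte Carlo sketch (the positive initialization $\gamma_0 = 0.01$ is an arbitrary stand-in for a small positive correlation):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal(1_000_000)

def mmse_pm1(gamma):
    # For P_V uniform{+1,-1}: mmse(gamma) = 1 - E[tanh(gamma + sqrt(gamma) G)]
    return 1.0 - np.mean(np.tanh(gamma + np.sqrt(gamma) * G))

lam, gamma = np.sqrt(2.0), 0.01          # lambda = sqrt(2), small gamma_0 > 0
for t in range(50):
    gamma = lam**2 * (1.0 - mmse_pm1(gamma))
print(gamma)                              # lim_t gamma_t
```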
Fixed points of state evolution

• If $\mathbb{E}\langle x^0, v \rangle = 0$, then $\gamma_t = 0$ is an (unstable) fixed point
• This is the case in problems where $v$ has zero mean, since $x^0$ is independent of $v$
Spectral Initialization

$$A = \frac{\lambda}{n} v v^\top + W, \qquad \lambda > 1$$

• Compute $\varphi_1$, the principal eigenvector of $A$
• Run AMP with the initialization $x^0 = \sqrt{n}\, \varphi_1$
• Then $\gamma_0 > 0$, since $\frac{1}{n} |\mathbb{E}\langle x^0, v \rangle| \to \sqrt{1 - \lambda^{-2}}$
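In code, the initialization is a single eigendecomposition (a sketch; `A` is a draw from the rank-one model above):

```python
import numpy as np

z, Phi = np.linalg.eigh(A)               # eigenvalues in increasing order
phi1 = Phi[:, -1]                        # principal eigenvector of A
x0 = np.sqrt(A.shape[0]) * phi1          # x^0 = sqrt(n) phi_1, so ||x^0||^2 = n
```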
AMP with spectral initialization

$$A = \frac{\lambda}{n} v v^\top + W, \qquad x^{t+1} = A f_t(x^t) - \mathsf{b}_t f_{t-1}(x^{t-1}), \qquad x^0 = \sqrt{n}\, \varphi_1$$

The existing AMP analysis does not apply, since the initialization $x^0$ is now correlated with $A$ (it is computed from $A$ itself).
AMP analysis with spectral initialization

$$A = \frac{\lambda}{n} v v^\top + W$$

Let $(\varphi_1, z_1)$ be the principal eigenvector and eigenvalue of $A$. Instead of $A$, we analyze AMP on
$$\tilde{A} = z_1 \varphi_1 \varphi_1^\top + P_\perp \left( \frac{\lambda}{n} v v^\top + \tilde{W} \right) P_\perp$$
• $P_\perp = I - \varphi_1 \varphi_1^\top$
• $\tilde{W} \sim \mathrm{GOE}(n)$ is independent of $W$
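A sketch of how one could sample from this conditional model (for simulation purposes only; in the analysis $\tilde{A}$ is a theoretical surrogate, not an algorithmic ingredient):

```python
import numpy as np

def conditional_model(A, v, lam, rng):
    """Sample A~ = z1 phi1 phi1^T + P_perp((lam/n) v v^T + W~) P_perp."""
    n = A.shape[0]
    z, Phi = np.linalg.eigh(A)
    z1, phi1 = z[-1], Phi[:, -1]                  # principal pair of A
    P = np.eye(n) - np.outer(phi1, phi1)          # P_perp = I - phi1 phi1^T
    G = rng.standard_normal((n, n)) / np.sqrt(2 * n)
    W_tilde = G + G.T                             # fresh GOE(n), independent of W
    return z1 * np.outer(phi1, phi1) + P @ ((lam / n) * np.outer(v, v) + W_tilde) @ P
```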
True vs. conditional model

$$A = \frac{\lambda}{n} v v^\top + W, \qquad \tilde{A} = z_1 \varphi_1 \varphi_1^\top + P_\perp \left( \frac{\lambda}{n} v v^\top + \tilde{W} \right) P_\perp$$

Lemma. For $(z_1, \varphi_1) \in \mathcal{E}_\varepsilon = \left\{ |z_1 - (\lambda + \lambda^{-1})| \le \varepsilon,\ \frac{1}{n} (\varphi_1^\top v)^2 \ge 1 - \lambda^{-2} - \varepsilon \right\}$, we have
$$\sup_{(z_1, \varphi_1) \in \mathcal{E}_\varepsilon} \left\| \mathbb{P}\left( A \in \cdot \mid z_1, \varphi_1 \right) - \mathbb{P}\left( \tilde{A} \in \cdot \mid z_1, \varphi_1 \right) \right\|_{\mathrm{TV}} \le \frac{1}{c(\varepsilon)}\, e^{-n c(\varepsilon)}$$
AMP on the conditional model

$$\tilde{A} = z_1 \varphi_1 \varphi_1^\top + P_\perp \left( \frac{\lambda}{n} v v^\top + \tilde{W} \right) P_\perp$$

AMP with $\tilde{A}$ instead of $A$:
$$\tilde{x}^{t+1} = \tilde{A} f_t(\tilde{x}^t) - \tilde{\mathsf{b}}_t f_{t-1}(\tilde{x}^{t-1}), \qquad \tilde{x}^0 = \sqrt{n}\, \varphi_1$$

Analyze using the existing AMP analysis plus results from random matrix theory.
Model assumptions

$$A = \frac{\lambda}{n} v v^\top + W$$

Let $v = v(n) \in \mathbb{R}^n$ be a sequence such that the empirical distribution of the entries of $v(n)$ converges weakly to $P_V$.

The performance of any estimator $\hat{v}$ is measured via a loss function $\psi : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$:
$$\psi(v, \hat{v}) = \frac{1}{n} \sum_{i=1}^{n} \psi(v_i, \hat{v}_i)$$
$\psi$ is assumed to be pseudo-Lipschitz:
$$|\psi(x) - \psi(y)| \le C \|x - y\|_2 \left(1 + \|x\|_2 + \|y\|_2\right) \qquad \forall x, y \in \mathbb{R}^2$$
Result for the rank-one case

$$A = \frac{\lambda}{n} v v^\top + W$$

Theorem. Let $\lambda > 1$. Consider the AMP iteration $x^{t+1} = A f_t(x^t) - \mathsf{b}_t f_{t-1}(x^{t-1})$.
• Assume $f_t : \mathbb{R} \to \mathbb{R}$ is Lipschitz continuous
• Initialize with $x^0 = \sqrt{n}\, \varphi_1$
Then for any pseudo-Lipschitz loss function $\psi$ and $t \ge 0$,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \psi(v_i, x_i^t) = \mathbb{E}\{\psi(V, \mu_t V + \sigma_t G)\} \quad \text{a.s.}$$
The state evolution parameters are recursively defined as
$$\mu_{t+1} = \lambda\, \mathbb{E}[V f_t(\mu_t V + \sigma_t G)], \qquad \sigma_{t+1}^2 = \mathbb{E}[f_t(\mu_t V + \sigma_t G)^2],$$
with $\mu_0 = \sqrt{1 - \lambda^{-2}}$ and $\sigma_0 = 1/\lambda$.
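An end-to-end numerical sketch of the theorem for the squared loss $\psi(v, x) = (v - x)^2$ (which is pseudo-Lipschitz), with the Rademacher prior and its Bayes denoiser $f_t(y) = \tanh(\mu_t y / \sigma_t^2)$; the model parameters here are illustrative, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, T = 5000, 1.5, 8
v = rng.choice([-1.0, 1.0], size=n)               # P_V uniform on {+1, -1}
G = rng.standard_normal((n, n)) / np.sqrt(2 * n)
A = (lam / n) * np.outer(v, v) + G + G.T

mu, s2 = np.sqrt(1 - lam**-2), lam**-2            # mu_0 and sigma_0^2
x = np.sqrt(n) * np.linalg.eigh(A)[1][:, -1]      # spectral initialization
x *= np.sign(x @ v)                               # fix the global sign (for the comparison only)
f_prev = np.zeros(n)
for t in range(T):
    f = np.tanh(mu / s2 * x)                      # E[V | mu V + sigma G = y] for the +-1 prior
    b = (mu / s2) * np.mean(1 - f**2)             # Onsager coefficient b_t
    x, f_prev = A @ f - b * f_prev, f
    Vm = rng.choice([-1.0, 1.0], size=200_000)    # Monte Carlo state evolution update
    Gm = rng.standard_normal(200_000)
    fm = np.tanh(mu / s2 * (mu * Vm + np.sqrt(s2) * Gm))
    mu, s2 = lam * np.mean(Vm * fm), np.mean(fm**2)

# Theorem: (1/n) sum_i (v_i - x_i^t)^2  ->  E[(V - mu_t V - sigma_t G)^2]
print(np.mean((v - x)**2), (1 - mu)**2 + s2)
```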
Bayes-optimal AMP

$$A = \frac{\lambda}{n} v v^\top + W, \qquad x^{t+1} = A f_t(x^t) - \mathsf{b}_t f_{t-1}(x^{t-1})$$

• Bayes-optimal choice: $f_t(y) = \lambda\, \mathbb{E}(V \mid \gamma_t V + \sqrt{\gamma_t}\, G = y)$
• State evolution: $\gamma_{t+1} = \lambda^2 \left(1 - \mathrm{mmse}(\gamma_t)\right)$, with $\gamma_0 = \lambda^2 - 1$, where
$$\mathrm{mmse}(\gamma) = \mathbb{E}\left[\left( V - \mathbb{E}(V \mid \sqrt{\gamma}\, V + G) \right)^2\right]$$
• $\mu_t = \sigma_t^2 = \gamma_t$
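A worked special case (not spelled out on the slide): for $P_V$ uniform on $\{+1, -1\}$, if $x_i^t = \gamma_t V + \sqrt{\gamma_t}\, G$, the posterior log-odds of $V = +1$ given $x_i^t = y$ equal $2y$, so
$$f_t(y) = \lambda\, \mathbb{E}(V \mid \gamma_t V + \sqrt{\gamma_t}\, G = y) = \lambda \tanh(y),$$
a time-independent denoiser, and $\mathrm{mmse}(\gamma) = 1 - \mathbb{E}[\tanh(\gamma + \sqrt{\gamma}\, G)]$.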
Bayes-optimal AMP

$$A = \frac{\lambda}{n} v v^\top + W$$

Let $\gamma_{\text{AMP}}(\lambda)$ be the smallest strictly positive solution of
$$\gamma = \lambda^2 \left[ 1 - \mathrm{mmse}(\gamma) \right]. \tag{1}$$
Then the AMP estimate $\hat{x}^t = f_t(x^t)$ achieves
$$\lim_{t \to \infty} \lim_{n \to \infty} \min_{s \in \{+1, -1\}} \frac{1}{n} \|\hat{x}^t - s\, v\|^2 = 1 - \frac{\gamma_{\text{AMP}}(\lambda)}{\lambda^2}$$
and the overlap
$$\lim_{t \to \infty} \lim_{n \to \infty} \frac{|\langle \hat{x}^t, v \rangle|}{\|\hat{x}^t\|_2 \|v\|_2} = \frac{\sqrt{\gamma_{\text{AMP}}(\lambda)}}{\lambda}$$
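The fixed point (1) and the resulting limits are easy to evaluate numerically; a sketch for the $\pm 1$ prior (illustrative $\lambda$, Monte Carlo mmse as in the earlier example):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal(1_000_000)
lam = 1.5
gamma = lam**2 - 1                        # spectral-initialization value gamma_0
for _ in range(100):
    # gamma <- lam^2 (1 - mmse(gamma)), with mmse(gamma) = 1 - E[tanh(gamma + sqrt(gamma) G)]
    gamma = lam**2 * np.mean(np.tanh(gamma + np.sqrt(gamma) * G))
print("asymptotic MSE    :", 1 - gamma / lam**2)
print("asymptotic overlap:", np.sqrt(gamma) / lam)
```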