On convergence of Approximate Message Passing


  1. On convergence of Approximate Message Passing Francesco Caltagirone (1) , Florent Krzakala (2) and Lenka Zdeborova (1) (1) Institut de Physique Théorique, CEA Saclay (2) LPS, Ecole Normale Supérieure, Paris

  2. Compressed Sensing. The signal x is an N-component vector of which only K < N components are non-zero. The measurement y is an M < N component vector obtained as y = F x + ξ, where F is an M×N random matrix with i.i.d. elements and ξ is white noise with variance ⟨ξ²⟩ = Δ. We define the density ρ = K/N and the measurement ratio α = M/N. GIVEN y and F, the goal is to RECONSTRUCT x.
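As an illustration, here is a minimal Python sketch of how such a noisy compressed-sensing instance can be generated; all parameter values (N, α, ρ, Δ) are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Minimal sketch: generate a compressed-sensing instance y = F x + xi
# (parameter values below are illustrative assumptions, not from the slides).
rng = np.random.default_rng(0)

N = 1000                      # signal dimension
alpha, rho = 0.5, 0.2
M = int(alpha * N)            # number of measurements, M = alpha*N < N
K = int(rho * N)              # number of non-zero components, K = rho*N
Delta = 1e-4                  # noise variance <xi^2> = Delta

# Sparse signal: K non-zero entries drawn from N(0, 1)
x = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
x[support] = rng.standard_normal(K)

# Measurement matrix with i.i.d. entries of variance 1/N
F = rng.standard_normal((M, N)) / np.sqrt(N)

# Noisy linear measurements
y = F @ x + np.sqrt(Delta) * rng.standard_normal(M)
```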

  3. Standard techniques. Minimization of the ℓ0 norm under the linear constraint: min_x ||x||_0 with F x = y, where ||x||_0 = number of non-zero elements; this norm is non-convex and the minimization is exponentially hard (Candès, Tao, Donoho). Minimization of the ℓ1 norm ||x||_1 = Σ_{i=1}^N |x_i|: a convex norm, easy to minimize. The ℓ1 norm well approximates the ℓ0 norm.
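A minimal sketch of ℓ1 reconstruction via iterative soft thresholding (ISTA), one standard way to solve the relaxed (LASSO) form of the ℓ1 problem; the step size and regularization strength below are illustrative assumptions, not values from the slides.

```python
import numpy as np

def ista_l1(F, y, lam=0.01, n_iter=500):
    """Minimal ISTA sketch for the relaxed l1 problem
    min_x 0.5*||y - F x||^2 + lam*||x||_1
    (lam is an illustrative choice; step size from the spectral norm of F)."""
    L = np.linalg.norm(F, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(F.shape[1])
    for _ in range(n_iter):
        grad = F.T @ (F @ x - y)           # gradient of the quadratic term
        z = x - grad / L                   # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x
```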

  4. The Donoho-Tanner line. For α = 1 the matrix is square and we can invert it; for α < 1 the matrix is rectangular and the system y = F x + ξ is under-determined. With K = ρN non-zero components and M = αN measurements, the information-theoretical limit is α = ρ. Let us consider the noiseless case with a measurement matrix with i.i.d. elements distributed according to a Gaussian with zero mean and a variance of order 1/N. [Phase diagram in the (ρ0, α) plane: the information-theoretical limit α = ρ0, the Donoho-Tanner ℓ1-minimization line α_ℓ1(ρ0), the Bayesian AMP line α_EM-BP(ρ0), and s-BP results for N = 10^3 and N = 10^4.]

  5. Setting and motivation. Bayesian setting. GOAL: reconstruct the signal, given the measurement vector, the measurement matrix, and prior knowledge of the (sparse) distribution of the signal elements. Approximate Message Passing (Donoho, Maleki, Montanari, 2009): a powerful algorithm, but with convergence issues.

  6. Setting and motivation. Measurements y = F x + ξ with a non-zero-mean matrix F_{μi} = γ/N + N(0,1)/√N and a Gauss-Bernoulli prior P(x) = (1 − ρ) δ(x) + ρ N(0,1). This is the simplest case in which Approximate Message Passing (AMP) has convergence problems: if the mean γ is sufficiently large, AMP displays violent divergences. Divergences of this kind are observed in many other cases and are the main obstacle to a wide use of AMP. In this simple case there are workarounds that ensure convergence, like a "mean-removal" procedure, BUT the case is interesting because we want to understand the origin of the non-convergence, which, we argue, is of the same nature in more complicated settings.
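A short sketch of this specific ensemble (the sizes and the value of γ are illustrative assumptions):

```python
import numpy as np

# Non-zero-mean ensemble studied here: F_mu_i = gamma/N + N(0,1)/sqrt(N),
# signal prior P(x) = (1-rho) delta(x) + rho N(0,1).
rng = np.random.default_rng(1)

N, alpha, rho = 2000, 0.5, 0.1
M = int(alpha * N)
gamma = 2.0                                 # mean parameter of the matrix (illustrative)

F = gamma / N + rng.standard_normal((M, N)) / np.sqrt(N)

x = rng.standard_normal(N) * (rng.random(N) < rho)   # Gauss-Bernoulli signal
Delta = 1e-10                               # (near-)noiseless measurements
y = F @ x + np.sqrt(Delta) * rng.standard_normal(M)
```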

  7. Bayesian Inference with Belief Propagation. Bayes formula: P(x|F, y) = P(x|F) P(y|F, x) / P(y|F). The conditional probability of the measurement vector is

P(y|F, x) = ∏_{μ=1}^M (1/√(2πΔ)) exp[ −(y_μ − Σ_{i=1}^N F_{μi} x_i)² / (2Δ) ],

so that the posterior reads

P(x|F, y) = (1/Z(y, F)) ∏_{i=1}^N [(1 − ρ) δ(x_i) + ρ φ(x_i)] ∏_{μ=1}^M (1/√(2πΔ)) exp[ −(y_μ − Σ_{i=1}^N F_{μi} x_i)² / (2Δ) ].

The MMSE estimator is x*_i = ∫ dx_i x_i ν_i(x_i), where ν_i(x_i) ≡ ∫ ∏_{j≠i} dx_j P(x|F, y) is the marginal of the posterior, and the error is measured by E = (1/N) Σ_{i=1}^N (x*_i − s_i)² with respect to the true signal s. Computing the marginals exactly takes exponential time and is unfeasible.

  8. Bayes optimal setting. If we know exactly the prior distribution of the signal elements and of the noise, we are in the so-called BAYES OPTIMAL setting. In the following we will consider that this is the case. When it is not, the prior can be learned efficiently by adding a learning step to the algorithm that I will present (I will not talk about this).

  9. Belief Propagation (cavity method). The factor graph has two kinds of nodes: factors (matrix rows) and variables (signal elements); a third kind of node can be introduced for the prior distribution P(x) on the signal elements, acting as a local field. Messages m_{i→μ}(x_i) and m_{μ→i}(x_i) are exchanged between variables and factors. Belief propagation works for locally tree-like graphs or for densely and weakly connected graphs. Messages represent an approximation to the marginal distribution of a variable, and they are updated according to a sequential or parallel schedule until convergence (fixed point).

  10. Belief Propagation, r-BP and AMP. BP uses O(N²) continuous messages m_{i→μ}(x_i), m_{μ→i}(x_i). Projecting the messages onto a few moments (justified for a dense matrix) gives r-BP, with O(N²) numbers. A further TAP-like step, which assumes parallel update, gives AMP with O(N²) operations per iteration; in this case fast matrix multiplication algorithms can be applied, reducing the complexity to O(N log N) operations. Donoho, Maleki, Montanari (2009); Krzakala et al. (2012).

  11. AMP Algorithm.

V_μ^{t+1} = Σ_i F_{μi}² v_i^t                                                         (1)
ω_μ^{t+1} = Σ_i F_{μi} a_i^t − [(y_μ − ω_μ^t)/(Δ + V_μ^t)] Σ_i F_{μi}² v_i^t           (2)
(Σ_i^{t+1})² = [ Σ_μ F_{μi}² / (Δ + V_μ^{t+1}) ]^{−1}                                  (3)
R_i^{t+1} = a_i^t + [ Σ_μ F_{μi} (y_μ − ω_μ^{t+1}) / (Δ + V_μ^{t+1}) ] / [ Σ_μ F_{μi}² / (Δ + V_μ^{t+1}) ]   (4)
a_i^{t+1} = f_1( (Σ_i^{t+1})², R_i^{t+1} )                                             (5)
v_i^{t+1} = f_2( (Σ_i^{t+1})², R_i^{t+1} )                                             (6)

Here f_k(Σ², R) is the k-th connected cumulant with respect to the measure Q(x) = (1/Z(Σ², R)) P(x) e^{−(x−R)²/(2Σ²)} / √(2πΣ²); a_i and v_i are the AMP estimators for the mean and variance of the i-th signal component. The performance of the algorithm can be evaluated through E^t = (1/N) Σ_{i=1}^N (s_i − a_i^t)².
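A minimal Python sketch of this iteration for the Gauss-Bernoulli prior P(x) = (1 − ρ) δ(x) + ρ N(0,1): the functions f1, f2 below are the posterior mean and variance that follow from that prior, and the parallel, damping-free loop mirrors equations (1)-(6). This is an illustrative implementation under those assumptions, not the authors' code; initialization choices are arbitrary.

```python
import numpy as np

def gauss(x, var):
    """Density of N(0, var) evaluated at x."""
    return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def f1_f2(Sigma2, R, rho):
    """Posterior mean f1 and variance f2 of the scalar measure
    Q(x) ~ [(1-rho) delta(x) + rho N(0,1)] * exp(-(x-R)^2 / (2 Sigma2))."""
    z_spike = (1 - rho) * gauss(R, Sigma2)          # weight of the delta part
    z_slab = rho * gauss(R, 1 + Sigma2)             # weight of the Gaussian part
    Z = z_spike + z_slab
    m = R / (1 + Sigma2)                            # slab posterior mean
    s2 = Sigma2 / (1 + Sigma2)                      # slab posterior variance
    f1 = z_slab * m / Z
    f2 = z_slab * (m**2 + s2) / Z - f1**2
    return f1, f2

def amp(F, y, rho, Delta, n_iter=100):
    """Sketch of AMP, equations (1)-(6), parallel updates, no damping."""
    M, N = F.shape
    F2 = F**2
    a = np.zeros(N)                                  # estimated means
    v = np.full(N, rho)                              # estimated variances (prior variance)
    omega = y.copy()                                 # illustrative initialization
    V = F2 @ v
    for _ in range(n_iter):
        V_new = F2 @ v                                               # (1)
        omega_new = F @ a - (y - omega) / (Delta + V) * V_new        # (2)
        Sigma2 = 1.0 / (F2.T @ (1.0 / (Delta + V_new)))              # (3)
        R = a + Sigma2 * (F.T @ ((y - omega_new) / (Delta + V_new))) # (4)
        a, v = f1_f2(Sigma2, R, rho)                                 # (5), (6)
        omega, V = omega_new, V_new
    return a, v
```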

  12. AMP Algorithm. The same update equations (1)-(6) as on the previous slide, now with the non-zero-mean matrix F_{μi} = γ/N + N(0,1)/√N. The AMP algorithm does NOT depend explicitly on the value of the mean of the matrix.

  13. Convergence. Bayes optimal case, a given (sufficiently high) measurement ratio, and very small or zero noise.

  14. State Evolution (infinite N). Bayati, Montanari '11 (rigorous in the zero-mean case); Krzakala et al. '12 (replicas in the zero-mean case); Caltagirone, Krzakala, Zdeborova '14 (replicas in the non-zero-mean case). State evolution is the asymptotic analysis of the average performance of the inference algorithm when the size of the signal goes to infinity. It gives a good indication of what happens in a practical situation if the size of the signal is sufficiently large. It can be obtained rigorously in simple cases and non-rigorously, with the replica method, in more involved cases.

  15. State Evolution (infinite N), continued. Define the order parameters

V^t = (1/N) Σ_i v_i,    D^t = (1/N) Σ_j (s_j − a_j^t),    E^t = (1/N) Σ_i (s_i − a_i^t)².

Their evolution is given by

V^{t+1} = ∫ ds P(s) ∫ Dz f_2( (Δ + V^t)/α , s + z A(E^t, D^t) + γ² D^t ),
E^{t+1} = ∫ ds P(s) ∫ Dz [ s − f_1( (Δ + V^t)/α , s + z A(E^t, D^t) + γ² D^t ) ]²,
D^{t+1} = ∫ ds P(s) ∫ Dz [ s − f_1( (Δ + V^t)/α , s + z A(E^t, D^t) + γ² D^t ) ],

with A(E^t, D^t) = √[ (E^t + Δ + γ² (D^t)²) / α ]. If the mean γ is zero, the density evolution does not depend on D.
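A sketch of how this (V, E, D) recursion can be iterated numerically for the Gauss-Bernoulli prior, using Monte Carlo estimates of the ∫ds P(s) ∫Dz integrals and the same f1/f2 denoisers as in the AMP sketch above; sample sizes, parameters, and the initialization are illustrative assumptions, and the recursion follows the equations as reconstructed here.

```python
import numpy as np

def f1_f2(Sigma2, R, rho):
    """Posterior mean/variance for the Gauss-Bernoulli prior (as in the AMP sketch)."""
    g = lambda x, var: np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    z_spike = (1 - rho) * g(R, Sigma2)
    z_slab = rho * g(R, 1 + Sigma2)
    Z = z_spike + z_slab
    m, s2 = R / (1 + Sigma2), Sigma2 / (1 + Sigma2)
    f1 = z_slab * m / Z
    f2 = z_slab * (m**2 + s2) / Z - f1**2
    return f1, f2

def state_evolution(alpha, rho, Delta, gamma, n_iter=50, n_samples=200_000, seed=0):
    """Monte Carlo iteration of the (V, E, D) state-evolution recursion."""
    rng = np.random.default_rng(seed)
    # samples of the signal s ~ P(s) and of the Gaussian z ~ N(0, 1)
    s = rng.standard_normal(n_samples) * (rng.random(n_samples) < rho)
    z = rng.standard_normal(n_samples)
    V, E, D = rho, rho, 0.0                     # illustrative initialization (on the NL)
    for _ in range(n_iter):
        A = np.sqrt((E + Delta + gamma**2 * D**2) / alpha)
        Sigma2 = (Delta + V) / alpha
        R = s + z * A + gamma**2 * D
        a, v = f1_f2(Sigma2, R, rho)
        V, E, D = v.mean(), ((s - a) ** 2).mean(), (s - a).mean()
    return V, E, D
```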

  16. The Nishimori Condition. In the Bayes optimal setting, D^t = 0 and E^t = V^t imply D^{t+1} = 0 and E^{t+1} = V^{t+1}. Therefore, analytically, if the evolution starts (exactly) on the Nishimori Line (NL) it stays on it until convergence. BUT what is the effect of small perturbations with respect to the NL?
• Very small fluctuations due to numerical precision in the density evolution (DE)
• Fluctuations due to finite size in the AMP algorithm

  17. Zero-mean case. [Plot: E versus V for a Gaussian signal with Gaussian inference, ρ = 0.2, no spinodal; the trajectory converges along the NL (Bayati, Montanari).] The non-zero mean adds a third dimension, D, to the phase space!

  18. Stability Analysis (I). Define K = V − E and consider perturbations δK, δD around the Nishimori Line. The evolution can be written as K^{t+1} = f_K(V^t, K^t, D^t), D^{t+1} = f_D(V^t, K^t, D^t). The NL is a "fixed line": (K* = 0, D* = 0).

  19. Stability Analysis (II). We linearize the equations around the fixed line, with δK^t = K^t − K* and δD^t = D^t − D*:

( δK^{t+1} , δD^{t+1} )^T = M · ( δK^t , δD^t )^T,

M = [ ∂_K f_K(V^t, 0, 0)   ∂_D f_K(V^t, 0, 0) ;
      ∂_K f_D(V^t, 0, 0)   ∂_D f_D(V^t, 0, 0) ]
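The matrix M can also be estimated numerically by finite differences of the one-step state-evolution map around a point (V, K = 0, D = 0) on the NL. The sketch below does this under the same assumptions as the earlier state-evolution sketch; the slide parameters Δ = 10^{-10}, ρ = 0.1, α = 0.3 are used, while the values of γ, V, the sample size, and the step eps are illustrative.

```python
import numpy as np

def se_map(V, K, D, alpha, rho, Delta, gamma, s, z):
    """One step of the (V, E, D) recursion written in the (V, K=V-E, D) variables,
    with the Gauss-Bernoulli f1/f2 inlined (same form as in the earlier sketches)."""
    def f1_f2(Sigma2, R):
        g = lambda x, var: np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        z_spike, z_slab = (1 - rho) * g(R, Sigma2), rho * g(R, 1 + Sigma2)
        Z = z_spike + z_slab
        m, s2 = R / (1 + Sigma2), Sigma2 / (1 + Sigma2)
        f1 = z_slab * m / Z
        return f1, z_slab * (m**2 + s2) / Z - f1**2
    E = V - K
    A = np.sqrt((E + Delta + gamma**2 * D**2) / alpha)
    a, v = f1_f2((Delta + V) / alpha, s + z * A + gamma**2 * D)
    V1, E1, D1 = v.mean(), ((s - a) ** 2).mean(), (s - a).mean()
    return V1, V1 - E1, D1

# Finite-difference estimate of M at a point on the NL (common random numbers
# keep the map deterministic across the three evaluations).
rng = np.random.default_rng(0)
n = 400_000
alpha, rho, Delta, gamma = 0.3, 0.1, 1e-10, 2.5     # gamma is an illustrative choice
s = rng.standard_normal(n) * (rng.random(n) < rho)
z = rng.standard_normal(n)

V, eps = 1e-3, 1e-6                                  # illustrative V and step size
_, K0, D0 = se_map(V, 0.0, 0.0, alpha, rho, Delta, gamma, s, z)
_, Kk, Dk = se_map(V, eps, 0.0, alpha, rho, Delta, gamma, s, z)
_, Kd, Dd = se_map(V, 0.0, eps, alpha, rho, Delta, gamma, s, z)
M = np.array([[(Kk - K0) / eps, (Kd - K0) / eps],
              [(Dk - D0) / eps, (Dd - D0) / eps]])
print("eigenvalues of M:", np.linalg.eigvals(M))
```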

  20. Stability Analysis (II). When the signal is Gauss-Bernoulli with zero mean, the off-diagonal terms of M vanish:

M = [ ∂_K f_K(V^t, 0, 0)   0 ;
      0   ∂_D f_D(V^t, 0, 0) ]

  21. Stability Analysis (II). On the NL the two eigenvalues are

λ_D = ∂_D f_D(V^t) = −[αγ²/(Δ + V^t)] ∫ ds P(s) ∫ Dz f_2(A², s + zA) = −αγ² V^{t+1}/(Δ + V^t),

λ_K = ∂_K f_K(V^t) = −[1/(2(Δ + V^t))] ∫ ds P(s) ∫ Dz { f_4(A², s + zA) + 2 [f_2(A², s + zA)]² + 2 [f_1(A², s + zA) − s] f_3(A², s + zA) }.

[Plot: λ_D versus log10 V for γ = 1.9, 2.5, 2.9, 3.6, with Δ = 10^{-10}, ρ = 0.1, α = 0.3.]
• For γ < γ_c^(1) the eigenvalue is always less than 1 in modulus.
• For γ_c^(1) < γ < γ_c^(2) the eigenvalue λ_D becomes larger than 1 in a limited region.
• For γ > γ_c^(2) the eigenvalue is larger than 1 in modulus down to the fixed point.
