Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau

Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. Dr. Marcelo Pereyra, http://www.macs.hw.ac.uk/~mp71/, Maxwell Institute for Mathematical Sciences, Heriot-Watt University. June 2017, Heriot-Watt, Edinburgh.


  1. Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. Dr. Marcelo Pereyra, http://www.macs.hw.ac.uk/~mp71/, Maxwell Institute for Mathematical Sciences, Heriot-Watt University. June 2017, Heriot-Watt, Edinburgh. Joint work with Alain Durmus and Eric Moulines.

  2. Outline: 1. Bayesian inference in imaging inverse problems; 2. Proximal Markov chain Monte Carlo; 3. Experiments; 4. Conclusion.

  3. Imaging inverse problems. We are interested in an unknown x ∈ R^d. We measure y, related to x by a statistical model p(y∣x). The recovery of x from y is ill-posed or ill-conditioned, resulting in significant uncertainty about x. For example, in many imaging problems y = Ax + w, for some rank-deficient operator A and additive noise w.
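A minimal sketch (not from the slides) of the forward model y = Ax + w, using an illustrative rank-deficient A that observes only every other coordinate; all dimensions and noise levels here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 64, 32                       # unknown dimension d, measurement dimension p < d
x_true = rng.standard_normal(d)     # stand-in ground truth

A = np.eye(d)[::2]                  # observe every other coordinate: rank(A) = p < d
sigma = 0.1                         # assumed noise level
w = sigma * rng.standard_normal(p)  # additive Gaussian noise
y = A @ x_true + w                  # measurements: many x are consistent with y
```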

  4. The Bayesian framework. We use priors to reduce uncertainty and deliver accurate results. Given the prior p(x), the posterior distribution of x given y, p(x∣y) = p(y∣x) p(x) / p(y), models our knowledge about x after observing y. In this talk we consider that p(x∣y) is log-concave; i.e., p(x∣y) = exp{−φ(x)}/Z, where φ(x) is a convex function and Z = ∫ exp{−φ(x)} dx.

  5. Inverse problems in mathematical imaging. More precisely, we consider models of the form p(x∣y) ∝ exp{−f(x) − g(x)} (1), where f(x) and g(x) are lower semicontinuous convex functions from R^d to (−∞, +∞] and f is L_f-Lipschitz differentiable. For example, f(x) = ∥y − Ax∥²/(2σ²) for some observation y ∈ R^p and linear operator A ∈ R^{p×n}, and g(x) = α∥Bx∥_† + 1_S(x) for some norm ∥⋅∥_†, dictionary B ∈ R^{n×n}, and convex set S. Often, g ∉ C¹.

  6. Maximum-a-posteriori (MAP) estimation. The predominant Bayesian approach in imaging is MAP estimation, x̂_MAP = argmax_{x ∈ R^d} p(x∣y) = argmin_{x ∈ R^d} f(x) + g(x) (2), which can be computed efficiently by "proximal" convex optimisation. For example, the proximal gradient algorithm x_{m+1} = prox_g^{1/L}{x_m − L^{-1} ∇f(x_m)}, with prox_g^λ(x) = argmin_{u ∈ R^d} g(u) + (1/2λ)∥u − x∥², converges at rate O(1/m). However, x̂_MAP provides very little information about p(x∣y).
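A minimal sketch (assumed, not the authors' code) of the proximal gradient iteration for the ℓ₁ example f(x) = ∥y − Ax∥²/(2σ²), g(x) = β∥x∥₁, whose prox is componentwise soft-thresholding; the names `soft_threshold` and `map_ista` are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """prox of t * ||.||_1 : componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def map_ista(y, A, sigma, beta, n_iter=500):
    """Proximal gradient (ISTA) for min_x ||y - A x||^2 / (2 sigma^2) + beta ||x||_1."""
    d = A.shape[1]
    L = np.linalg.norm(A, 2) ** 2 / sigma ** 2        # Lipschitz constant of grad f
    x = np.zeros(d)
    for _ in range(n_iter):
        grad_f = A.T @ (A @ x - y) / sigma ** 2       # gradient of the data-fidelity term
        x = soft_threshold(x - grad_f / L, beta / L)  # prox_{g}^{1/L} of the gradient step
    return x
```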

  7. Illustrative example: image resolution enhancement. Recover x ∈ R^d from low-resolution and noisy measurements y = Hx + w, where H is a circulant blurring matrix. We use the Bayesian model p(x∣y) ∝ exp(−∥y − Hx∥²/2σ² − β∥x∥₁) (3). Figure: Resolution enhancement of the Molecules image of size 256 × 256 pixels (panels: y, x̂_MAP, uncertainty estimates?).

  8. Illustrative example: tomographic image reconstruction. Recover x ∈ R^d from partially observed and noisy Fourier measurements y = ΦFx + w, where Φ is a mask and F is the 2D Fourier operator. We use the model p(x∣y) ∝ exp(−∥y − ΦFx∥²/2σ² − β∥∇_d x∥_{1−2}) (4), where ∇_d is the 2D discrete gradient operator and ∥⋅∥_{1−2} the ℓ₁–ℓ₂ norm. Figure: Tomographic reconstruction of the Shepp-Logan phantom image (panels: y, x̂_MAP, possible solution?).
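A minimal sketch (assumed) of the masked-Fourier forward operator y = ΦFx + w used in this example; the image size, sampling rate and `mask` pattern are illustrative choices, not the experiment's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((64, 64))              # stand-in image
mask = rng.random((64, 64)) < 0.3              # Phi: keep ~30% of Fourier coefficients
Fx = np.fft.fft2(x, norm="ortho")              # F: 2D Fourier operator
sigma = 0.05
w = sigma * (rng.standard_normal(mask.sum()) + 1j * rng.standard_normal(mask.sum()))
y = Fx[mask] + w                               # observed noisy Fourier samples
```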

  9. Modern Bayesian computation. Recent surveys on Bayesian computation: (i) 25th anniversary special issue on Bayesian computation: P. Green, K. Latuszynski, M. Pereyra, C. P. Robert, "Bayesian computation: a perspective on the current state, and sampling backwards and forwards", Statistics and Computing, vol. 25, no. 4, pp. 835-862, Jul. 2015. (ii) Special issue on "Stochastic simulation and optimisation in signal processing": M. Pereyra, P. Schniter, E. Chouzenoux, J.-C. Pesquet, J.-Y. Tourneret, A. Hero, and S. McLaughlin, "A Survey of Stochastic Simulation and Optimization Methods in Signal Processing", IEEE Journal of Selected Topics in Signal Processing, in press.

  10. Outline: 1. Bayesian inference in imaging inverse problems; 2. Proximal Markov chain Monte Carlo; 3. Experiments; 4. Conclusion.

  11. Inference by Markov chain Monte Carlo integration. Monte Carlo integration: given a set of samples X_1, ..., X_M distributed according to p(x∣y), we approximate posterior expectations and probabilities by (1/M) Σ_{m=1}^M h(X_m) → E{h(x)∣y} as M → ∞, with guarantees from central limit theorems, e.g., √M [(1/M) Σ_{m=1}^M h(X_m) − E{h(x)∣y}] ∼ N(0, Σ) asymptotically. Markov chain Monte Carlo: construct a Markov kernel X_{m+1}∣X_m ∼ K(⋅∣X_m) such that the Markov chain X_1, ..., X_M has p(x∣y) as its stationary distribution. MCMC simulation in high-dimensional spaces is very challenging.
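A minimal sketch of the Monte Carlo estimators above, assuming a `samples` array of shape (M, d) produced by any MCMC sampler; the function names and the `event` predicate are illustrative.

```python
import numpy as np

def posterior_mean(samples):
    """(1/M) sum_m X_m  ->  E{x | y}  as M grows."""
    return samples.mean(axis=0)

def posterior_probability(samples, event):
    """Estimate P[x in C | y] as the fraction of samples for which `event` holds."""
    return np.mean([event(x) for x in samples])

# Example usage with a hypothetical event:
# posterior_probability(samples, event=lambda x: np.linalg.norm(x) < 1.0)
```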

  12. Unadjusted Langevin algorithm. Suppose for now that p(x∣y) ∈ C¹. Then we can generate samples by mimicking a Langevin diffusion process that converges to p(x∣y) as t → ∞: X: dX_t = (1/2) ∇ log p(X_t∣y) dt + dW_t, 0 ≤ t ≤ T, X_0 = x_0, where W is the n-dimensional Brownian motion. Because solving for X_t exactly is generally not possible, we use an Euler-Maruyama approximation and obtain the "unadjusted Langevin algorithm": ULA: X_{m+1} = X_m + δ ∇ log p(X_m∣y) + √(2δ) Z_{m+1}, Z_{m+1} ∼ N(0, I_n). ULA is remarkably efficient when p(x∣y) is sufficiently regular.
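A minimal sketch of the ULA recursion above for a smooth target; `grad_log_post` is assumed to return ∇ log p(x∣y), and the step size δ trades off discretisation bias against mixing speed.

```python
import numpy as np

def ula(grad_log_post, x0, delta, n_samples, rng=None):
    """Unadjusted Langevin algorithm:
    X_{m+1} = X_m + delta * grad log p(X_m|y) + sqrt(2 delta) * Z_{m+1}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    samples = np.empty((n_samples,) + x.shape)
    for m in range(n_samples):
        x = x + delta * grad_log_post(x) + np.sqrt(2.0 * delta) * rng.standard_normal(x.shape)
        samples[m] = x
    return samples
```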

  13. Unadjusted Langevin algorithm. However, our interest is in high-dimensional models of the form p(x∣y) ∝ exp{−f(x) − g(x)} with f, g l.s.c. convex, ∇f L_f-Lipschitz continuous, and g ∉ C¹. Unfortunately, such models are beyond the scope of ULA, which may perform poorly if p(x∣y) is not Lipschitz differentiable. Idea: regularise p(x∣y) to enable efficient Langevin sampling.

  14. Approximation of p(x∣y). Moreau-Yosida approximation of p(x∣y) (Pereyra, 2015): let λ > 0. We propose to approximate p(x∣y) with the density p^λ(x∣y) = exp[−f(x) − g^λ(x)] / ∫_{R^d} exp[−f(u) − g^λ(u)] du, where g^λ is the Moreau-Yosida envelope of g given by g^λ(x) = inf_{u ∈ R^d} {g(u) + (2λ)^{-1} ∥u − x∥²}, and where λ controls the approximation error involved.
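A standard worked example (not from the slides): for g(x) = ∣x∣ the Moreau-Yosida envelope is the Huber function and the associated proximal operator is soft-thresholding, which makes the smoothing effect of λ explicit.

```latex
% Moreau-Yosida envelope of g(x) = |x| (Huber function) and its proximal operator
g^{\lambda}(x) = \inf_{u \in \mathbb{R}} \Big\{ |u| + \tfrac{1}{2\lambda}(u - x)^2 \Big\}
= \begin{cases}
    \dfrac{x^2}{2\lambda}, & |x| \le \lambda,\\[4pt]
    |x| - \dfrac{\lambda}{2}, & |x| > \lambda,
  \end{cases}
\qquad
\operatorname{prox}_{g}^{\lambda}(x) = \operatorname{sign}(x)\,\max(|x| - \lambda, 0).
```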

  15. Moreau-Yosida approximations. Key properties (Pereyra, 2015; Durmus et al., 2017): (1) For all λ > 0, p^λ defines a proper density of a probability measure on R^d. (2) Convexity and differentiability: p^λ is log-concave on R^d, and p^λ ∈ C¹ even if p is not differentiable, with ∇ log p^λ(x∣y) = −∇f(x) + {prox_g^λ(x) − x}/λ, where prox_g^λ(x) = argmin_{u ∈ R^d} g(u) + (1/2λ)∥u − x∥²; moreover, ∇ log p^λ is Lipschitz continuous with constant L ≤ L_f + λ^{-1}. (3) Approximation error between p^λ(x∣y) and p(x∣y): lim_{λ→0} ∥p^λ − p∥_TV = 0, and if g is L_g-Lipschitz, then ∥p^λ − p∥_TV ≤ λ L_g².

  16. Illustration. Examples of Moreau-Yosida approximations: p(x) ∝ exp(−x⁴), p(x) ∝ 1_{[−0.5, 0.5]}(x), p(x) ∝ exp(−∣x∣). Figure: True densities (solid blue) and approximations (dashed red).

  17. Proximal ULA. We approximate X with the "regularised" auxiliary Langevin diffusion X^λ: dX^λ_t = (1/2) ∇ log p^λ(X^λ_t∣y) dt + dW_t, 0 ≤ t ≤ T, X^λ_0 = x_0, which targets p^λ(x∣y). Remark: we can make X^λ arbitrarily close to X. Finally, an Euler-Maruyama discretisation of X^λ leads to the (Moreau-Yosida regularised) proximal ULA, MYULA: X_{m+1} = (1 − δ/λ) X_m − δ ∇f(X_m) + (δ/λ) prox_g^λ(X_m) + √(2δ) Z_{m+1}, where we used that ∇g^λ(x) = {x − prox_g^λ(x)}/λ.
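A minimal sketch of the MYULA recursion above (not the authors' implementation); `grad_f` and `prox_g` are assumed to be supplied by the user, with `prox_g(x, lam)` computing prox_g^λ(x).

```python
import numpy as np

def myula(grad_f, prox_g, x0, delta, lam, n_samples, rng=None):
    """MYULA: X_{m+1} = (1 - delta/lam) X_m - delta * grad f(X_m)
                        + (delta/lam) * prox_g^lam(X_m) + sqrt(2 delta) * Z_{m+1}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    samples = np.empty((n_samples,) + x.shape)
    for m in range(n_samples):
        x = ((1.0 - delta / lam) * x
             - delta * grad_f(x)
             + (delta / lam) * prox_g(x, lam)
             + np.sqrt(2.0 * delta) * rng.standard_normal(x.shape))
        samples[m] = x
    return samples
```

For the ℓ₁ examples above, prox_g(x, lam) would be componentwise soft-thresholding with threshold β·lam, and δ would be kept below (L_f + 1/λ)^{-1}, consistent with the step-size condition on the next slide.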

  18. Convergence results. Non-asymptotic estimation error bound. Theorem 2.1 (Durmus et al., 2017): let δ_max^λ = (L_f + 1/λ)^{-1} and assume that g is Lipschitz continuous. Then there exist δ_ε ∈ (0, δ_max^λ] and M_ε ∈ N such that for all δ < δ_ε and all M ≥ M_ε, ∥δ_{x_0} Q_δ^M − p∥_TV < ε + λ L_g², where Q_δ^M is the kernel associated with M iterations of MYULA with step δ. Note: δ_ε and M_ε are explicit and tractable. If f + g is strongly convex outside some ball, then M_ε scales with order O(d log(d)) (otherwise at worst O(d⁵)). See Durmus et al. (2017) for other convergence results.

  19. Outline: 1. Bayesian inference in imaging inverse problems; 2. Proximal Markov chain Monte Carlo; 3. Experiments; 4. Conclusion.

  20. Sparse image deblurring. Model: p(x∣y) ∝ exp(−∥y − Hx∥²/2σ² − β∥x∥₁), with Bayesian credible region C*_α = {x : p(x∣y) ≥ γ_α} such that P[x ∈ C*_α ∣ y] = 1 − α. Figure: Live-cell microscopy data (Zhu et al., 2012); panels: y, x̂_MAP, uncertainty estimates. Uncertainty analysis (±78 nm × ±125 nm) in close agreement with the experimental precision ±80 nm. Computing time 4 minutes, M = 10⁵ iterations, estimation error 0.2%.
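A minimal sketch (an assumption about the post-processing, not taken from the slides) of how a credible-region threshold could be estimated from MCMC samples: since C*_α = {x : p(x∣y) ≥ γ_α} is equivalently {x : f(x) + g(x) ≤ η_α}, one can take η_α as the (1 − α) empirical quantile of the negative log-posterior over the chain; the names `credible_region_threshold` and `neg_log_post` are illustrative.

```python
import numpy as np

def credible_region_threshold(samples, neg_log_post, alpha=0.1):
    """Return eta_alpha so that C_alpha ≈ {x : f(x) + g(x) <= eta_alpha},
    i.e. the region holds 1 - alpha of the posterior mass along the chain."""
    phi = np.array([neg_log_post(x) for x in samples])   # phi = f + g up to a constant
    return np.quantile(phi, 1.0 - alpha)                 # (1 - alpha) empirical quantile
```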
