Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau
Dr. Marcelo Pereyra, http://www.macs.hw.ac.uk/~mp71/
Maxwell Institute for Mathematical Sciences, Heriot-Watt University
June 2017, Heriot-Watt, Edinburgh.
Joint work with Alain Durmus and Eric Moulines
Outline
1 Bayesian inference in imaging inverse problems
2 Proximal Markov chain Monte Carlo
3 Experiments
4 Conclusion
Imaging inverse problems
We are interested in an unknown x ∈ R^d. We measure y, related to x by a statistical model p(y|x). The recovery of x from y is ill-posed or ill-conditioned, resulting in significant uncertainty about x.
For example, in many imaging problems y = Ax + w, for some operator A that is rank-deficient, and additive noise w.
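Not from the slides: a minimal sketch of such a forward model, assuming a 1D circular moving-average blur for A and Gaussian noise w (all names and values here are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

d = 128
x_true = np.zeros(d)
x_true[rng.choice(d, size=5, replace=False)] = rng.uniform(1.0, 3.0, size=5)  # sparse ground truth

# A: circular moving-average blur, an ill-conditioned (effectively rank-deficient) linear operator
kernel = np.ones(9) / 9.0
def A(x):
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(kernel, d)))

sigma = 0.05
y = A(x_true) + sigma * rng.standard_normal(d)  # y = Ax + w
```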
The Bayesian framework
We use priors to reduce uncertainty and deliver accurate results. Given the prior p(x), the posterior distribution of x given y,
p(x|y) = p(y|x) p(x) / p(y),
models our knowledge about x after observing y.
In this talk we consider that p(x|y) is log-concave; i.e.,
p(x|y) = exp{−φ(x)} / Z,
where φ(x) is a convex function and Z = ∫ exp{−φ(x)} dx.
Inverse problems in mathematical imaging
More precisely, we consider models of the form
p(x|y) ∝ exp{−f(x) − g(x)}   (1)
where f(x) and g(x) are lower semicontinuous convex functions from R^d → (−∞, +∞] and f is L_f-Lipschitz differentiable. For example,
f(x) = ‖y − Ax‖² / (2σ²)
for some observation y ∈ R^p and linear operator A ∈ R^{p×n}, and
g(x) = α‖Bx‖_† + 1_S(x)
for some norm ‖·‖_†, dictionary B ∈ R^{n×n}, and convex set S. Often, g ∉ C¹.
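A minimal sketch of these two terms, assuming B = I, ‖·‖_† = ‖·‖₁ and S = {x : x ≥ 0} (illustrative choices, not fixed by the slides).

```python
import numpy as np

def f(x, y, A, sigma):
    """Smooth data-fidelity term f(x) = ||y - A x||^2 / (2 sigma^2)."""
    r = A @ x - y
    return r @ r / (2 * sigma**2)

def grad_f(x, y, A, sigma):
    """Gradient of f; Lipschitz with constant L_f = ||A||^2 / sigma^2."""
    return A.T @ (A @ x - y) / sigma**2

def g(x, alpha):
    """Non-smooth term g(x) = alpha*||x||_1 + indicator of x >= 0 (with B = I, S = R^d_+)."""
    return alpha * np.abs(x).sum() if np.all(x >= 0) else np.inf
```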
Maximum-a-posteriori (MAP) estimation
The predominant Bayesian approach in imaging is MAP estimation
x̂_MAP = argmax_{x ∈ R^d} p(x|y)   (2)
       = argmin_{x ∈ R^d} f(x) + g(x),
which can be computed efficiently by “proximal” convex optimisation. For example, the proximal gradient algorithm
x_{m+1} = prox_g^{L^{-1}} { x_m − L^{-1} ∇f(x_m) },
with prox_g^λ(x) = argmin_{u ∈ R^d} g(u) + (1/(2λ)) ‖u − x‖²,
converges at rate O(1/m).
However, x̂_MAP provides very little information about p(x|y).
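A sketch of the proximal gradient iteration for this MAP problem, assuming g(x) = β‖x‖₁ so that the proximal operator is soft-thresholding (an illustrative choice).

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def map_proximal_gradient(y, A, sigma, beta, n_iter=500):
    """Proximal gradient iterations for argmin_x ||y - A x||^2/(2 sigma^2) + beta*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2 / sigma**2   # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y) / sigma**2
        x = soft_threshold(x - grad / L, beta / L)   # x_{m+1} = prox_{g}^{1/L}(x_m - (1/L) grad f(x_m))
    return x
```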
Illustrative example: image resolution enhancement
Recover x ∈ R^d from low-resolution and noisy measurements y = Hx + w, where H is a circulant blurring matrix. We use the Bayesian model
p(x|y) ∝ exp(−‖y − Hx‖² / (2σ²) − β‖x‖₁).   (3)
Figure: Resolution enhancement of the Molecules image of size 256 × 256 pixels (panels: y, x̂_MAP, uncertainty estimates?).
Illustrative example: tomographic image reconstruction
Recover x ∈ R^d from partially observed and noisy Fourier measurements y = ΦFx + w, where Φ is a mask and F is the 2D Fourier operator. We use the model
p(x|y) ∝ exp(−‖y − ΦFx‖² / (2σ²) − β‖∇_d x‖_{1−2}),   (4)
where ∇_d is the 2D discrete gradient operator and ‖·‖_{1−2} the ℓ1–ℓ2 norm.
Figure: Tomographic reconstruction of the Shepp-Logan phantom image (panels: y, x̂_MAP, possible solution?).
Modern Bayesian computation
Recent surveys on Bayesian computation:
25th anniversary special issue on Bayesian computation: P. Green, K. Latuszynski, M. Pereyra, C. P. Robert, “Bayesian computation: a perspective on the current state, and sampling backwards and forwards”, Statistics and Computing, vol. 25, no. 4, pp. 835-862, Jul. 2015.
Special issue on “Stochastic simulation and optimisation in signal processing”: M. Pereyra, P. Schniter, E. Chouzenoux, J.-C. Pesquet, J.-Y. Tourneret, A. Hero, and S. McLaughlin, “A Survey of Stochastic Simulation and Optimization Methods in Signal Processing”, IEEE Sel. Topics in Signal Processing, in press.
Outline
1 Bayesian inference in imaging inverse problems
2 Proximal Markov chain Monte Carlo
3 Experiments
4 Conclusion
Inference by Markov chain Monte Carlo integration
Monte Carlo integration: given a set of samples X_1, ..., X_M distributed according to p(x|y), we approximate posterior expectations and probabilities by
(1/M) ∑_{m=1}^{M} h(X_m) → E{h(x)|y}, as M → ∞,
with guarantees from CLTs, e.g., √M { (1/M) ∑_{m=1}^{M} h(X_m) − E{h(x)|y} } ∼ N(0, Σ) asymptotically.
Markov chain Monte Carlo: construct a Markov kernel X_{m+1}|X_m ∼ K(·|X_m) such that the Markov chain X_1, ..., X_M has p(x|y) as stationary distribution.
MCMC simulation in high-dimensional spaces is very challenging.
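A toy illustration of Monte Carlo integration, with i.i.d. samples standing in for an MCMC chain so the exact expectation is known (all values illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target: x | y ~ N(0, 1); in practice the samples would come from an MCMC kernel.
samples = rng.standard_normal(10_000)

h = lambda x: x**2                 # quantity of interest
estimate = np.mean(h(samples))     # (1/M) sum_m h(X_m)
print(estimate)                    # ≈ E{h(x)|y} = 1
```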
Unadjusted Langevin algorithm
Suppose for now that p(x|y) ∈ C¹. Then we can generate samples by mimicking a Langevin diffusion process that converges to p(x|y) as t → ∞,
X : dX_t = (1/2) ∇ log p(X_t|y) dt + dW_t,  0 ≤ t ≤ T,  X(0) = x_0,
where W is the d-dimensional Brownian motion. Because solving X_t exactly is generally not possible, we use an Euler-Maruyama approximation and obtain the “unadjusted Langevin algorithm”
ULA : X_{m+1} = X_m + δ ∇ log p(X_m|y) + √(2δ) Z_{m+1},  Z_{m+1} ∼ N(0, I_d).
ULA is remarkably efficient when p(x|y) is sufficiently regular.
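A minimal ULA sketch for a smooth toy target; the target, step size and chain length below are illustrative.

```python
import numpy as np

def ula_sample(grad_log_post, x0, delta, n_samples, rng=None):
    """ULA: X_{m+1} = X_m + delta * grad_log_post(X_m) + sqrt(2*delta) * Z_{m+1}."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.array(x0, dtype=float)
    chain = np.empty((n_samples, x.size))
    for m in range(n_samples):
        x = x + delta * grad_log_post(x) + np.sqrt(2 * delta) * rng.standard_normal(x.size)
        chain[m] = x
    return chain

# Toy smooth target: p(x|y) ∝ exp(-||x||^2 / 2), so grad log p(x|y) = -x.
chain = ula_sample(lambda x: -x, x0=np.zeros(2), delta=0.1, n_samples=5000)
```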
Unadjusted Langevin algorithm
However, our interest is in high-dimensional models of the form
p(x|y) ∝ exp{−f(x) − g(x)}
with f, g l.s.c. convex, ∇f L_f-Lipschitz continuous, and g ∉ C¹. Unfortunately, such models are beyond the scope of ULA, which may perform poorly if p(x|y) is not Lipschitz differentiable.
Idea: regularise p(x|y) to enable efficient Langevin sampling.
Approximation of p(x|y)
Moreau-Yosida approximation of p(x|y) (Pereyra, 2015): let λ > 0. We propose to approximate p(x|y) with the density
p^λ(x|y) = exp[−f(x) − g^λ(x)] / ∫_{R^d} exp[−f(x) − g^λ(x)] dx,
where g^λ is the Moreau-Yosida envelope of g, given by
g^λ(x) = inf_{u ∈ R^d} { g(u) + (2λ)^{-1} ‖u − x‖² },
and where λ controls the approximation error involved.
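A small numerical check of the envelope (not from the slides), using g(x) = |x|, whose Moreau-Yosida envelope has a known closed (Huber) form.

```python
import numpy as np

def moreau_yosida_envelope(g, x, lam, grid):
    """Evaluate g^lambda(x) = inf_u { g(u) + ||u - x||^2 / (2*lam) } by brute force over a 1D grid."""
    return np.min(g(grid) + (grid - x) ** 2 / (2 * lam))

lam = 0.5
xs = np.linspace(-3, 3, 7)
grid = np.linspace(-10, 10, 20001)

numeric = [moreau_yosida_envelope(np.abs, x, lam, grid) for x in xs]
huber = [x**2 / (2 * lam) if abs(x) <= lam else abs(x) - lam / 2 for x in xs]  # closed form for |.|
print(np.allclose(numeric, huber, atol=1e-3))  # True: the envelope of |x| is the Huber function
```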
Moreau-Yosida approximations
Key properties (Pereyra, 2015; Durmus et al., 2017):
1 For all λ > 0, p^λ defines a proper density of a probability measure on R^d.
2 Convexity and differentiability: p^λ is log-concave on R^d, and p^λ ∈ C¹ even if p is not differentiable, with
∇ log p^λ(x|y) = −∇f(x) + { prox_g^λ(x) − x } / λ,
where prox_g^λ(x) = argmin_{u ∈ R^d} g(u) + (1/(2λ)) ‖u − x‖².
Moreover, ∇ log p^λ is Lipschitz continuous with constant L ≤ L_f + λ^{-1}.
3 Approximation error between p^λ(x|y) and p(x|y): lim_{λ→0} ‖p^λ − p‖_TV = 0. If g is L_g-Lipschitz, then ‖p^λ − p‖_TV ≤ λ L_g².
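A sketch of the prox-based gradient in property 2, assuming g(x) = β‖x‖₁ (so prox_g^λ is soft-thresholding) and the quadratic data fidelity from the earlier slides; the parameter names are illustrative.

```python
import numpy as np

def prox_l1(x, lam, beta):
    """prox_g^lam with g = beta*||.||_1: soft-thresholding with threshold lam*beta."""
    return np.sign(x) * np.maximum(np.abs(x) - lam * beta, 0.0)

def grad_log_p_lambda(x, y, A, sigma, beta, lam):
    """grad log p^lambda(x|y) = -grad f(x) + (prox_g^lambda(x) - x) / lambda."""
    grad_f = A.T @ (A @ x - y) / sigma**2
    return -grad_f + (prox_l1(x, lam, beta) - x) / lam
```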
Illustration
Examples of Moreau-Yosida approximations:
p(x) ∝ exp(−x⁴),  p(x) ∝ 1_{[−0.5, 0.5]}(x),  p(x) ∝ exp(−|x|).
Figure: True densities (solid blue) and approximations (dashed red).
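A sketch of how such approximation curves could be reproduced numerically; the grid-based envelope and crude normalisation are illustration choices, and plotting is omitted.

```python
import numpy as np

def my_density(g, lam, grid):
    """Moreau-Yosida approximation exp(-g^lambda) on a grid, normalised to integrate to 1."""
    env = np.array([np.min(g(grid) + (grid - x) ** 2 / (2 * lam)) for x in grid])
    p = np.exp(-env)
    return p / (p.sum() * (grid[1] - grid[0]))

grid = np.linspace(-3.0, 3.0, 1201)
targets = {
    "exp(-x^4)": lambda x: x ** 4,
    "uniform on [-0.5, 0.5]": lambda x: np.where(np.abs(x) <= 0.5, 0.0, np.inf),
    "exp(-|x|)": np.abs,
}
approx = {name: my_density(g, lam=0.1, grid=grid) for name, g in targets.items()}
```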
Proximal ULA
We approximate X with the “regularised” auxiliary Langevin diffusion
X^λ : dX^λ_t = (1/2) ∇ log p^λ(X^λ_t|y) dt + dW_t,  0 ≤ t ≤ T,  X^λ(0) = x_0,
which targets p^λ(x|y). Remark: we can make X^λ arbitrarily close to X.
Finally, an Euler-Maruyama discretisation of X^λ leads to the (Moreau-Yosida regularised) proximal ULA
MYULA : X_{m+1} = (1 − δ/λ) X_m − δ ∇f(X_m) + (δ/λ) prox_g^λ(X_m) + √(2δ) Z_{m+1},
where we used that ∇g^λ(x) = { x − prox_g^λ(x) } / λ.
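A minimal MYULA sketch following the update above; the toy f, g, λ and δ below are illustrative, with δ chosen below 1/(L_f + 1/λ).

```python
import numpy as np

def myula_sample(grad_f, prox_g, lam, delta, x0, n_samples, rng=None):
    """MYULA: X_{m+1} = (1 - d/l) X_m - d*grad_f(X_m) + (d/l)*prox_g(X_m, l) + sqrt(2d)*Z."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.array(x0, dtype=float)
    chain = np.empty((n_samples, x.size))
    for m in range(n_samples):
        noise = np.sqrt(2 * delta) * rng.standard_normal(x.size)
        x = (1 - delta / lam) * x - delta * grad_f(x) + (delta / lam) * prox_g(x, lam) + noise
        chain[m] = x
    return chain

# Toy example: f(x) = ||x||^2 / 2 (L_f = 1) and g(x) = beta*||x||_1, whose prox is soft-thresholding.
beta = 1.0
grad_f = lambda x: x
prox_g = lambda x, lam: np.sign(x) * np.maximum(np.abs(x) - lam * beta, 0.0)
lam = 0.1                        # Moreau-Yosida parameter
delta = 0.5 / (1.0 + 1.0 / lam)  # step size below 1/(L_f + 1/lam)
chain = myula_sample(grad_f, prox_g, lam, delta, x0=np.zeros(2), n_samples=10_000)
```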
Convergence results
Non-asymptotic estimation error bound.
Theorem 2.1 (Durmus et al. (2017)). Let δ_max^λ = (L_f + 1/λ)^{-1}. Assume that g is Lipschitz continuous. Then there exist δ_ε ∈ (0, δ_max^λ] and M_ε ∈ N such that, for all δ < δ_ε and all M ≥ M_ε,
‖δ_{x_0} Q_δ^M − p‖_TV < ε + λ L_g²,
where Q_δ^M is the kernel associated with M iterations of MYULA with step δ.
Note: δ_ε and M_ε are explicit and tractable. If f + g is strongly convex outside some ball, then M_ε scales with order O(d log(d)) (otherwise at worst O(d⁵)). See Durmus et al. (2017) for other convergence results.
Outline
1 Bayesian inference in imaging inverse problems
2 Proximal Markov chain Monte Carlo
3 Experiments
4 Conclusion
Sparse image deblurring
p(x|y) ∝ exp(−‖y − Hx‖² / (2σ²) − β‖x‖₁), with Bayesian credible region C*_α = { x : p(x|y) ≥ γ_α } such that P[x ∈ C*_α | y] = 1 − α.
Figure: Live-cell microscopy data (Zhu et al., 2012); panels: y, x̂_MAP, uncertainty estimates. Uncertainty analysis (±78 nm × ±125 nm) in close agreement with the experimental precision ±80 nm. Computing time 4 minutes, M = 10⁵ iterations, estimation error 0.2%.
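A sketch of how uncertainty summaries could be read off MYULA samples; note this computes marginal pixel-wise credible intervals rather than the joint region C*_α, and the chain and credibility level are placeholders.

```python
import numpy as np

def pixelwise_credible_intervals(chain, alpha=0.1):
    """Marginal (1 - alpha) credible intervals per pixel from MCMC samples of shape (M, d)."""
    lower = np.quantile(chain, alpha / 2, axis=0)
    upper = np.quantile(chain, 1 - alpha / 2, axis=0)
    return lower, upper

# Example with the toy MYULA chain from the previous sketch:
# lower, upper = pixelwise_credible_intervals(chain, alpha=0.1)
```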