New Langevin based algorithms for MCMC in high dimensions


  1. New Langevin based algorithms for MCMC in high dimensions
     Alain Durmus, Département TSI, Telecom ParisTech
     Joint work with Gareth O. Roberts, Gilles Vilmart and Konstantinos Zygalakis.
     Sixièmes rencontres des jeunes statisticiens

  2. Main themes of this talk
     - Scaling limits of Metropolis-Hastings algorithms
     - A new MH algorithm with a new scaling

  3. Outlines

  4. Brief review of scaling results Outlines

  5. Brief review of scaling results ◮ Introduction Outlines

  6. Brief review of scaling results ◮ Introduction
     Motivation
     Let F : R^d → R and π a probability measure on R^d (with density π). Generic problem: estimate the expectation E_F := E_π[F], where
     - π is known only up to a multiplicative factor;
     - we do not know how to sample from π (no basic Monte Carlo estimator);
     - π is a high-dimensional density (the usual importance sampling and accept/reject methods are inefficient).
     A solution is to approximate E_F by n^{-1} ∑_{i=1}^{n} F(X_i), where (X_i)_{i≥0} is a Markov chain with invariant measure π.
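As a concrete picture of this estimator (not part of the slides), here is a minimal Python sketch; it assumes access to a routine `step` that draws one transition X_{k+1} ∼ P(X_k, ·) of such a chain, and all names are illustrative:

```python
import numpy as np

def mcmc_estimate(step, x0, F, n_samples):
    """Approximate E_pi[F] by the ergodic average n^{-1} sum_i F(X_i),
    where `step` draws X_{k+1} ~ P(X_k, .) for a Markov kernel P with
    invariant measure pi (hypothetical interface)."""
    x = np.asarray(x0, dtype=float)
    total = 0.0
    for _ in range(n_samples):
        x = step(x)          # one transition of the Markov chain
        total += F(x)
    return total / n_samples
```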

  7. Brief review of scaling results ◮ Introduction
     Markov chain theory
     Definition. Let P : R^d × B(R^d) → R_+. P is a Markov kernel if
     - for all x ∈ R^d, A ↦ P(x, A) is a probability measure on R^d,
     - for all A ∈ B(R^d), x ↦ P(x, A) is measurable from R^d to R.
     A transition density function q : R^d × R^d → R_+ is a measurable function such that for all x ∈ R^d, ∫_{R^d} q(x, y) dy = 1. Then P(x, A) = ∫_A q(x, y) dy is a Markov kernel on R^d with density q.
     A Markov chain associated with P is a stochastic process (X_k)_{k≥0} such that for all k ≥ 0, X_{k+1} ∼ P(X_k, ·).

  8. Brief review of scaling results ◮ Introduction
     Markov chain theory
     Some simple properties:
     - If P_1 and P_2 are two Markov kernels, we can define a new Markov kernel, denoted P_1 P_2, by, for x ∈ R^d and A ∈ B(R^d), P_1 P_2(x, A) = ∫_{R^d} P_1(x, dz) P_2(z, A).
     - If P is a Markov kernel and ν a probability measure on R^d, we can define a measure on R^d, denoted νP, by, for A ∈ B(R^d), νP(A) = ∫_{R^d} ν(dz) P(z, A).
     - Let P be a Markov kernel on R^d. For f : R^d → R_+ measurable, we can define a measurable function Pf : R^d → R̄_+ by Pf(x) = ∫_{R^d} P(x, dz) f(z).
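These operator notations are perhaps easiest to picture on a finite state space, where a Markov kernel is a row-stochastic matrix; the following small sketch (not from the talk, matrices chosen arbitrarily) shows the three operations as matrix products:

```python
import numpy as np

# On a finite state space, a Markov kernel is a row-stochastic matrix P,
# a measure nu is a row vector, and a function f is a column vector.
P1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])
P2 = np.array([[0.5, 0.5],
               [0.3, 0.7]])
nu = np.array([0.6, 0.4])
f  = np.array([1.0, 3.0])

P1P2 = P1 @ P2      # kernel composition P1 P2
nuP  = nu @ P1      # measure nu P
Pf   = P1 @ f       # function Pf(x) = E[f(X_1) | X_0 = x]
```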

  9. Brief review of scaling results ◮ Introduction
     Markov chain theory
     Invariant probability measure: π is said to be an invariant probability measure for the Markov kernel P if πP = π.
     Theorem (Meyn and Tweedie, 2003, Ergodic theorem). Under some conditions on P, we have for any F ∈ L^1(π),
     (1/n) ∑_{i=1}^{n} F(X_i) → ∫ F(x) π(x) dx    π-a.s.
     A simple condition for π to be an invariant measure for P is reversibility: π(dy) P(y, dx) = π(dx) P(x, dy).
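The last claim is a one-line check that the slide leaves implicit: integrating the reversibility identity over x gives invariance.

```latex
\begin{align*}
\pi P(A) &= \int_{\mathbb{R}^d} \pi(\mathrm{d}x)\, P(x, A)
          = \int_{\mathbb{R}^d} \int_{A} \pi(\mathrm{d}x)\, P(x, \mathrm{d}y) \\
         &= \int_{\mathbb{R}^d} \int_{A} \pi(\mathrm{d}y)\, P(y, \mathrm{d}x)
          = \int_{A} \pi(\mathrm{d}y)
            \underbrace{\int_{\mathbb{R}^d} P(y, \mathrm{d}x)}_{=\,1}
          = \pi(A).
\end{align*}
```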

  10. Brief review of scaling results ◮ Introduction
     MCMC: rationale
     To approximate E_F: find a kernel P with invariant measure π from which we can efficiently sample.
     Question: how do we find such a P? ⇒ the Metropolis-Hastings algorithm provides a way to build such a kernel.

  11. Brief review of scaling results ◮ The Metropolis-Hastings algorithm Outlines

  12. Brief review of scaling results ◮ The Metropolis-Hastings algorithm
     The Metropolis-Hastings algorithm (I)
     Initial data: the target density π, a transition density q, X_0 ∼ µ_0. For k ≥ 0, given X_k:
     1. Generate Y_{k+1} ∼ q(X_k, ·).
     2. Set X_{k+1} = Y_{k+1} with probability α(X_k, Y_{k+1}), and X_{k+1} = X_k with probability 1 − α(X_k, Y_{k+1}),
        where α(x, y) = 1 ∧ [π(y) q(y, x)] / [π(x) q(x, y)].
     The algorithm produces a Markov chain with a kernel P_MH reversible w.r.t. π.
     Note that X_{k+1} = X_k + ✶{U ≤ α(X_k, Y_{k+1})} (Y_{k+1} − X_k), where U ∼ U[0, 1].
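A minimal Python sketch of one transition of this generic algorithm, working with log-densities for numerical stability; the interface (log_pi, sample_q, log_q) is an illustrative assumption, not something prescribed by the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_step(x, log_pi, sample_q, log_q):
    """One Metropolis-Hastings transition (illustrative interface).

    log_pi   : log of the target density, up to an additive constant
    sample_q : draws a proposal Y ~ q(x, .)
    log_q    : evaluates log q(x, y)
    """
    y = sample_q(x)
    # log acceptance ratio: log[ pi(y) q(y, x) / (pi(x) q(x, y)) ]
    log_alpha = min(0.0, log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y))
    if np.log(rng.uniform()) <= log_alpha:
        return y          # accept the proposal
    return x              # reject: stay at the current state
```

Since π only enters through the ratio π(y)/π(x), knowing π up to a multiplicative factor is enough, as required on the Motivation slide.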

  13. Brief review of scaling results ◮ The Metropolis-Hastings algorithm
     The RWM and MALA
     Two well-known Metropolis-Hastings algorithms:
     1) The Random Walk Metropolis:
        Y_{k+1} = X_k + σ_d Z_{k+1}, with (Z_k)_{k≥0} an i.i.d. sequence of law N_d(0, Id_d),
        q(x, y) = φ_d((y − x)/σ_d), where φ_d is the Gaussian density on R^d,
        X_{k+1} = X_k + ✶{U ≤ α(X_k, Y_{k+1})} σ_d Z_{k+1}.
     2) The Metropolis Adjusted Langevin Algorithm: assume that log π is at least C^1 with gradient denoted by b.
        Y_{k+1} = X_k + σ_d^2 b(X_k)/2 + σ_d Z_{k+1},
        q(x, y) = φ_d((y − x − σ_d^2 b(x)/2)/σ_d),
        X_{k+1} = X_k + ✶{U ≤ α(X_k, Y_{k+1})} (σ_d^2 b(X_k)/2 + σ_d Z_{k+1}).
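A hedged sketch of these two samplers, with σ_d passed in as sigma and grad_log_pi an illustrative name for b = ∇ log π; for RWM the proposal is symmetric, so the q terms cancel in the acceptance ratio:

```python
import numpy as np

rng = np.random.default_rng(1)

def rwm_step(x, log_pi, sigma):
    """Random Walk Metropolis: symmetric Gaussian proposal, so only the
    target ratio appears in the acceptance probability."""
    y = x + sigma * rng.standard_normal(x.shape)
    log_alpha = min(0.0, log_pi(y) - log_pi(x))
    return y if np.log(rng.uniform()) <= log_alpha else x

def mala_step(x, log_pi, grad_log_pi, sigma):
    """MALA: Langevin (gradient) drift in the proposal, with the
    corresponding Gaussian proposal densities kept in the ratio."""
    def mean(z):
        return z + 0.5 * sigma**2 * grad_log_pi(z)
    y = mean(x) + sigma * rng.standard_normal(x.shape)
    # log q(x, y) and log q(y, x) for N(mean(.), sigma^2 Id), up to constants
    log_q_xy = -np.sum((y - mean(x))**2) / (2.0 * sigma**2)
    log_q_yx = -np.sum((x - mean(y))**2) / (2.0 * sigma**2)
    log_alpha = min(0.0, log_pi(y) + log_q_yx - log_pi(x) - log_q_xy)
    return y if np.log(rng.uniform()) <= log_alpha else x
```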

  14. Brief review of scaling results ◮ The Metropolis-Hastings algorithm
     Scaling problems and diffusion limits
     Scaling problems:
     - How should σ_d depend on the dimension d?
     - What does this tell us about the efficiency of the algorithm?
     - Can we optimize σ_d in a sensible way?
     - Can we characterize the optimal choice of σ_d by some intrinsic criterion independent of π?
     For the RWM and the MALA, we have diffusion limits which answer these questions.

  15. Brief review of scaling results ◮ The Metropolis-Hastings algorithm
     Efficiency of MH algorithms
     Let (X_k)_{k≥0} be a Markov chain with invariant measure π. Under some conditions we have an LLN and a CLT: for some F,
     (1/n) ∑_{i=1}^{n} F(X_i) → ∫ F(x) π(x) dx    a.s. as n → +∞,
     √n [ (1/n) ∑_{i=1}^{n} F(X_i) − ∫ F(x) π(x) dx ] ⇒ N(0, σ^2(F, P)) as n → +∞,
     where
     σ^2(F, P) = lim_{n→+∞} n Var_π[ (1/n) ∑_{i=1}^{n} F(X_i) ] = Var_π[F(X_0)] + 2 ∑_{i≥1} Cov_π[F(X_i), F(X_0)].
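The asymptotic variance σ^2(F, P) can be estimated from a single trajectory; as one illustration, a standard batch-means sketch (an illustrative choice, not an estimator discussed in the talk):

```python
import numpy as np

def batch_means_variance(f_values, n_batches=20):
    """Rough estimate of the asymptotic variance sigma^2(F, P) from the
    trajectory (F(X_1), ..., F(X_n)), via the batch-means method."""
    f = np.asarray(f_values, dtype=float)
    m = len(f) // n_batches                 # batch length
    f = f[: m * n_batches]
    batch_means = f.reshape(n_batches, m).mean(axis=1)
    # n * Var(empirical mean) is approximately m * Var(batch means)
    return m * batch_means.var(ddof=1)
```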

  16. Brief review of scaling results ◮ The Metropolis-Hastings algorithm
     Efficiency of MH algorithms
     • Given F, the CLT allows us to compare two Markov kernels P_1, P_2: σ^2(F, P_1) ≤ σ^2(F, P_2) ⇒ P_1 is more efficient than P_2.
     • For all i ≥ 1, Cov_π[F(X_i), F(X_0)] ≥ 0: therefore we cannot do better than i.i.d. samples.
     • However, there are no practical conditions ensuring σ^2(F, P_1) ≤ σ^2(F, P_2) for all F, which is the case for the Langevin diffusion, as we will see.

  17. Brief review of scaling results ◮ The Metropolis-Hastings algorithm
     Expected Square Jump Distance
     A common efficiency criterion: the ESJD, defined for a Markov chain in one dimension by ESJD = E_π[(X_1 − X_0)^2].
     One justification: at stationarity E_π[(X_1 − X_0)^2] = 2 Var_π[X_0] − 2 Cov_π[X_1, X_0], so maximizing the ESJD ⇔ minimizing Cov_π[F(X_1), F(X_0)] for F a linear function.
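An empirical counterpart of this criterion is just the average squared increment along a trajectory; a small sketch (assuming a one-dimensional chain stored as an array):

```python
import numpy as np

def esjd(chain):
    """Empirical Expected Square Jump Distance of a one-dimensional
    (stationary) chain: the average of (X_{k+1} - X_k)^2."""
    x = np.asarray(chain, dtype=float)
    return np.mean((x[1:] - x[:-1]) ** 2)
```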

  18. Brief review of scaling results ◮ Speed of Langevin diffusions Outlines

  19. Brief review of scaling results ◮ Speed of Langevin diffusions
     Langevin diffusion
     Let π be a probability measure on R^d with C^1 log-density, and set b(x) = ∇ log π(x). Consider the overdamped Langevin equation:
     dY_t = (b(Y_t)/2) dt + dB_t.
     Note that the proposal of MALA is just an Euler-Maruyama discretization of this SDE.
     Under some conditions, (Y_t)_{t≥0} is ergodic with respect to π, and we have an LLN and a CLT again:
     (1/t) ∫_0^t F(Y_s) ds → ∫ F(x) π(x) dx    a.s. as t → +∞,
     √t [ (1/t) ∫_0^t F(Y_s) ds − ∫ F(x) π(x) dx ] ⇒ N(0, σ^2(F, Y)) as t → +∞,
     where σ^2(F, Y) = lim_{t→+∞} t Var_π[ (1/t) ∫_0^t F(Y_s) ds ].
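For concreteness (and not as part of the talk's material), one Euler-Maruyama step of this SDE with step size h reads as follows; taking h = σ_d^2 recovers exactly the MALA proposal drift and noise from the earlier slide, and grad_log_pi is an illustrative name for b:

```python
import numpy as np

rng = np.random.default_rng(2)

def euler_maruyama_langevin(x, grad_log_pi, h):
    """One Euler-Maruyama step for dY_t = (b(Y_t)/2) dt + dB_t,
    i.e. Y_{k+1} = Y_k + (h/2) b(Y_k) + sqrt(h) Z_{k+1}."""
    return x + 0.5 * h * grad_log_pi(x) + np.sqrt(h) * rng.standard_normal(x.shape)
```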

  20. Brief review of scaling results ◮ Speed of Langevin diffusions
     Scaled Langevin equation
     Consider the following scaled Langevin equation:
     dY_t^c = (c b(Y_t^c)/2) dt + √c dB_t.    (1)
     Then a solution of (1) is given by (Y^1_{ct})_{t≥0}:
     Y^1_{ct} = Y^1_0 + ∫_0^{ct} (b(Y^1_s)/2) ds + B_{ct}
              = Y^1_0 + ∫_0^t (c b(Y^1_{cu})/2) du + √c B̃_t    (change of variable s = cu),
     with the Brownian motion B̃_t = c^{−1/2} B_{ct}.
