Limit theorems for adaptive MCMC algorithms

Gersende FORT, LTCI, CNRS - TELECOM ParisTech

In collaboration with Yves ATCHADE (Univ. Michigan, US), Eric MOULINES (TELECOM ParisTech) and Pierre PRIOURET (Univ. Paris 6).
Markov chain Monte Carlo algorithms (MCMC): algorithms to sample from a target density π
◮ in some applications: known up to a (normalizing) constant.
◮ complex, so that exact sampling from π is not possible.

Define a Markov chain {X_n, n ≥ 0} with transition kernel P:

E[f(X_{n+1}) | F_n] = ∫ f(y) P(X_n, dy)

so that
◮ for any bounded function f: lim_n E_x[f(X_n)] = π(f).
◮ for any function f increasing like ···: n^{-1} Σ_{k=1}^n f(X_k) → π(f) a.s.
◮ ···
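To make the ergodic averaging concrete, here is a minimal sketch with a hypothetical 3-state kernel; the matrix P and the function f below are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
# A hypothetical 3-state transition matrix P (each row sums to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
f = np.array([1.0, 2.0, 5.0])      # any bounded function f on {0, 1, 2}
x, n, total = 0, 200_000, 0.0
for _ in range(n):
    x = rng.choice(3, p=P[x])      # X_{k+1} ~ P(X_k, .)
    total += f[x]
print(total / n)                   # ergodic average n^{-1} sum_k f(X_k)

# pi(f), with pi the invariant probability (left eigenvector of P for eigenvalue 1)
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()
print(pi @ f)                      # the two printed values should be close
```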
I. Adaptive MCMC:
◮ why?
◮ does the process {X_n, n ≥ 0} approximate π?
1.1. Symmetric Random Walk Hastings-Metropolis algorithm

An example of transition kernel P is described by the algorithm:
◮ Choose: a proposal density q
◮ Iterate: starting from X_n,
◮ draw (an increment) Y_{n+1} ∼ q(·)
◮ compute the acceptance ratio α(X_n, X_n + Y_{n+1}) := 1 ∧ π(X_n + Y_{n+1}) / π(X_n)
◮ set X_{n+1} = X_n + Y_{n+1} with probability α(X_n, X_n + Y_{n+1}), and X_{n+1} = X_n with probability 1 − α(X_n, X_n + Y_{n+1})

The efficiency of the algorithm depends upon the proposal q.
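As an illustration, a minimal Python sketch of this sampler, assuming a one-dimensional target known up to a constant through its log density; the Gaussian increment proposal and all names are assumptions made for the example.

```python
import numpy as np

def srw_metropolis(log_target, x0, sigma, n_iter, rng=None):
    """Symmetric Random Walk Metropolis (minimal sketch).

    log_target: log of the (unnormalized) target density pi
    sigma: standard deviation of the Gaussian increment proposal q
    """
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    chain = np.empty(n_iter)
    for n in range(n_iter):
        y = rng.normal(0.0, sigma)              # increment Y_{n+1} ~ q
        # acceptance ratio alpha = 1 ∧ pi(x+y)/pi(x), computed in log scale
        log_alpha = min(0.0, log_target(x + y) - log_target(x))
        if np.log(rng.uniform()) < log_alpha:
            x = x + y                           # accept: move to x + y
        chain[n] = x                            # on rejection, stay at x
    return chain

# Usage: target pi proportional to exp(-x^2/2) (standard normal, unnormalized)
chain = srw_metropolis(lambda x: -0.5 * x**2, x0=0.0, sigma=2.38, n_iter=10_000)
print(chain.mean(), chain.var())                # should approach 0 and 1
```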
1.2. On the choice of the variance of the proposal distribution

For example, when q is Gaussian, how to choose its variance matrix Σ_q?
◮ When π ∼ N_d(μ_π, Σ_π), the optimal choice for the variance of q is Σ_q = (2.38)²/d · Σ_π.
Results obtained by the 'scaling' technique (see also 'fluid limit'). Generalizations exist (other MCMC; relaxing conditions on π): Roberts-Rosenthal (2001); Bédard (2007); Fort-Moulines-Priouret (2008).
◮ This suggests an adaptive procedure: learn Σ_π "on the fly" and modify the variance Σ_q continuously during the run of the algorithm.

Example: at each iteration, choose q equal to

0.95 N(0, (2.38)² d^{-1} Σ̂_n) + 0.05 N(0, (0.1)² d^{-1} I_d)

where

μ_n = μ_{n−1} + (1/n) (X_n − μ_{n−1})
Σ̂_n = Σ̂_{n−1} + (1/n) ({X_n − μ_n}{X_n − μ_n}^T − Σ̂_{n−1})

Haario et al. (2001); Roberts-Rosenthal (2006)
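A minimal sketch of this adaptive procedure, following the recursions above; the initializations μ_0 = x0 and Σ̂_0 = I_d and the log-scale acceptance step are choices made here, not specified on the slide.

```python
import numpy as np

def adaptive_metropolis(log_target, x0, n_iter, rng=None):
    """Adaptive random-walk Metropolis (sketch) with mixture proposal
    0.95 N(0, 2.38^2/d Sigma_hat_n) + 0.05 N(0, 0.1^2/d I_d)."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(x0)
    x = np.array(x0, dtype=float)
    mu = x.copy()                      # mu_0: an initialization choice
    sigma_hat = np.eye(d)              # Sigma_hat_0: idem
    chain = np.empty((n_iter, d))
    for n in range(1, n_iter + 1):
        # The fixed 5% component keeps the chain moving even while
        # Sigma_hat_n is a poor (possibly degenerate) estimate of Sigma_pi.
        if rng.uniform() < 0.95:
            y = rng.multivariate_normal(np.zeros(d), (2.38**2 / d) * sigma_hat)
        else:
            y = rng.normal(0.0, 0.1 / np.sqrt(d), size=d)
        # standard symmetric Metropolis accept/reject step, in log scale
        if np.log(rng.uniform()) < min(0.0, log_target(x + y) - log_target(x)):
            x = x + y
        # recursive estimates of mu_pi and Sigma_pi ("learn Sigma_pi on the fly")
        mu += (x - mu) / n
        dev = (x - mu)[:, None]
        sigma_hat += (dev @ dev.T - sigma_hat) / n
        chain[n - 1] = x
    return chain

# Usage on a hypothetical 2-d Gaussian target (unnormalized log density):
chain = adaptive_metropolis(lambda z: -0.5 * z @ z, x0=np.zeros(2), n_iter=5_000)
```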
1.3. Be careful with adaptation!

The previous example illustrates the general framework:
◮ Let {P_θ, θ ∈ Θ} be a family of Markov kernels s.t. πP_θ = π for any θ ∈ Θ.
◮ Define a process {(θ_n, X_n), n ≥ 0}:
◮ X_{n+1} ∼ P_{θ_n}(X_n, ·)
◮ update θ_{n+1} based on (θ_n, X_n, X_{n+1})   ("internal" adaptation)

Is it true that the marginal {X_n, n ≥ 0} approximates π?

Not always, unfortunately. On the state space {1, 2}, for θ ∈ ]0, 1[ set

P_θ = [ 1−θ   θ ; θ   1−θ ],    π = [1/2   1/2]^T;

each P_θ is doubly stochastic, so πP_θ = π for every θ. Let t_1, t_2 ∈ ]0, 1[, and set θ_k = t_i iff X_k = i. Then {X_n, n ≥ 0} is Markov with invariant probability π̃ ∝ [t_2   t_1]^T ≠ π (unless t_1 = t_2).
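A short simulation sketch of this counterexample; the values of t_1 and t_2 are arbitrary choices.

```python
import numpy as np

# Two-state counterexample: P_theta = [[1-theta, theta], [theta, 1-theta]]
# is doubly stochastic, so pi = (1/2, 1/2) is invariant for every theta.
# Adapting theta_n = t_i iff X_n = i destroys this invariance.
rng = np.random.default_rng(0)
t = {1: 0.9, 2: 0.1}                 # arbitrary t1 != t2 in ]0, 1[
x, counts = 1, {1: 0, 2: 0}
for _ in range(200_000):
    if rng.uniform() < t[x]:         # switch states with probability theta_n = t[x]
        x = 3 - x
    counts[x] += 1

total = sum(counts.values())
print(counts[1] / total, counts[2] / total)
# Empirical frequencies approach (t2, t1)/(t1 + t2) = (0.1, 0.9), not (1/2, 1/2).
```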
II. Sufficient conditions for convergence of adaptive schemes {(θ_n, X_n), n ≥ 0}
◮ convergence of the marginals {X_n, n ≥ 0}
◮ law of large numbers w.r.t. {X_n, n ≥ 0}
2.1. Convergence of the marginals: sufficient conditions

Let
◮ a family of Markov kernels {P_θ, θ ∈ Θ} s.t. P_θ has a unique invariant probability measure π_θ,
◮ a filtration F_n and a process {(X_n, θ_n), n ≥ 0} s.t. for any f ≥ 0,

E[f(X_{n+1}) | F_n] = ∫ f(y) P_{θ_n}(X_n, dy)   P-a.s.

Given a target density π⋆, which set of conditions will imply

lim_n sup_{f, |f|_∞ ≤ 1} |E[f(X_n)] − π⋆(f)| = 0 ?
Idea: for a fixed N,

E[f(X_n)] − π⋆(f) = E[E[f(X_n) | F_{n−N}]] − π⋆(f)
  = E[ E[f(X_n) | F_{n−N}] − P^N_{θ_{n−N}} f(X_{n−N}) ]
  + E[ P^N_{θ_{n−N}} f(X_{n−N}) − π_{θ_{n−N}}(f) ]
  + E[ π_{θ_{n−N}}(f) − π⋆(f) ]

i.e. conditions on
◮ (Diminishing Adaptation) the difference ‖P_{θ_n}(x, ·) − P_{θ_{n−1}}(x, ·)‖_TV
◮ (ergodicity of P_θ / Containment) the convergence of ‖P^N_θ(x, ·) − π_θ‖_TV as N → +∞
◮ (convergence of the stationary measures) the convergence of π_{θ_n}(f) − π⋆(f) as n → +∞.
Set M_ε(x, θ) := inf{n ≥ 1 : ‖P^n_θ(x, ·) − π_θ‖_TV ≤ ε}.

Theorem. Assume
(i) (Diminishing Adaptation) sup_x ‖P_{θ_n}(x, ·) − P_{θ_{n−1}}(x, ·)‖_TV →_P 0,
(ii) (Containment) ∀ε > 0, lim_M sup_n P(M_ε(X_n, θ_n) ≥ M) = 0,
(iii) π_θ = π⋆ for any θ ∈ Θ.
Then lim_n sup_{f, |f|_∞ ≤ 1} |E[f(X_n)] − π⋆(f)| = 0.
Condition (iii) can be weakened when the kernels do not all share the target π⋆:

Theorem. Assume (i) and (ii) as above, and
(iii') ∀ε > 0, sup_{f ∈ F} P(|π_{θ_n}(f) − π⋆(f)| > ε) → 0.
Then lim_n sup_{f ∈ F} |E[f(X_n)] − π⋆(f)| = 0.
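To make the containment condition concrete, here is a brute-force computation of M_ε(x, θ) for the two-state kernels of the earlier counterexample; the function name and the brute-force approach are illustrative assumptions.

```python
import numpy as np

def m_eps(P, pi, x, eps, n_max=100_000):
    # M_eps(x, theta) = inf{n >= 1 : ||P_theta^n(x, .) - pi_theta||_TV <= eps},
    # computed by iterating the kernel from the point mass at x.
    row = np.zeros(len(pi))
    row[x] = 1.0
    for n in range(1, n_max + 1):
        row = row @ P                              # one more step of P_theta
        if 0.5 * np.abs(row - pi).sum() <= eps:    # total-variation distance
            return n
    return float("inf")

pi = np.array([0.5, 0.5])
for theta in (0.3, 0.05, 0.005):
    P = np.array([[1 - theta, theta], [theta, 1 - theta]])
    print(theta, m_eps(P, pi, x=0, eps=0.01))
# Here ||P_theta^n(x, .) - pi||_TV = |1 - 2 theta|^n / 2, so M_eps(x, theta)
# blows up as theta -> 0 or 1: containment requires {theta_n} to stay away
# from the boundary of ]0, 1[.
```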