
Convergence and Efficiency of Adaptive Importance Sampling techniques with partial biasing



1. Convergence and Efficiency of Adaptive Importance Sampling techniques with partial biasing. Gersende Fort, Institut de Mathématiques de Toulouse, CNRS, France. Joint work with B. Jourdain, T. Lelièvre and G. Stoltz. Talk based on the paper: G. Fort, B. Jourdain, T. Lelièvre, G. Stoltz, Convergence and Efficiency of Adaptive Importance Sampling techniques with partial biasing, J. Stat. Phys. (2018).

2. The goal
Assumption: let π · dµ be a probability distribution on X ⊆ R^p, assumed to be highly metastable and (possibly) known up to a normalizing constant.
Question 1: how to design a MC sampler for an approximation of ∫_X f π dµ?
Question 2: how to compute the free energy −ln ∫_{X_i} π dµ, for X_i ⊂ X?
In this talk: an approach by Free Energy-based Adaptive Importance Sampling, a generalization of Wang-Landau, Self-Healing Umbrella Sampling and Well-tempered metadynamics.

3. The intuition (1/3) - a family of auxiliary distributions
π(x) = (1/Z) exp(−V(x))
◮ The auxiliary distribution: choose a partition X_1, ..., X_d of X.
[Figure: a metastable potential V and its partition into strata; axis values omitted.]
θ_{*,i} := ∫_{X_i} π dµ

4. The intuition (1/3) - a family of auxiliary distributions
π(x) = (1/Z) exp(−V(x))
◮ The auxiliary distribution: choose a partition X_1, ..., X_d of X and, for positive weights θ = (θ_1, ..., θ_d) with θ_i ∈ (0, 1) and Σ_{i=1}^d θ_i = 1, set
π_θ(x) ∝ Σ_{i=1}^d 1_{X_i}(x) exp(−V(x) − ln θ_i)
◮ Property 1: for all i ∈ {1, ..., d}, ∫_{X_i} π_θ dµ ∝ θ_{*,i} / θ_i, where θ_{*,i} := ∫_{X_i} π dµ.
◮ Property 2: for all i ∈ {1, ..., d}, ∫_{X_i} π_{θ_*} dµ = 1/d.
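As an illustration (not part of the slides), here is a minimal numerical sketch of this biased family on a hypothetical 1-D double-well potential: it evaluates π_θ on a grid by quadrature and checks Property 2. The potential V, the two strata and all numerical values are illustrative assumptions.

    # Illustrative sketch of the biased family pi_theta (assumed 1-D double-well example).
    import numpy as np

    V = lambda x: (x**2 - 1.0)**2 + 0.5 * x     # hypothetical asymmetric double-well potential
    x = np.linspace(-3.0, 3.0, 20001)           # quadrature grid standing in for X
    dx = x[1] - x[0]
    strata = [x < 0.0, x >= 0.0]                # partition X_1, X_2 of X (d = 2)

    pi = np.exp(-V(x)); pi /= pi.sum() * dx                        # target pi = exp(-V)/Z
    theta_star = np.array([(pi[s] * dx).sum() for s in strata])    # theta_{*,i} = int_{X_i} pi dmu

    def pi_theta(theta):
        # Biased density proportional to sum_i 1_{X_i}(x) exp(-V(x) - ln theta_i), normalized on the grid.
        w = np.zeros_like(x)
        for s, t in zip(strata, theta):
            w[s] = np.exp(-V(x[s]) - np.log(t))
        return w / (w.sum() * dx)

    # Property 2: under pi_{theta_*} every stratum has weight 1/d.
    p = pi_theta(theta_star)
    print(theta_star, [(p[s] * dx).sum() for s in strata])          # second list ~ [0.5, 0.5]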

5. Intuition (2/3) - How to choose θ?
π_θ(x) ∝ Σ_{i=1}^d 1_{X_i}(x) exp(−V(x) − ln θ_i),   θ_{*,i} := ∫_{X_i} π dµ
◮ If θ = θ_*:
Efficient exploration under π_{θ_*}: each subset X_i has the same weight under π_{θ_*}.
Poor ESS: the IS approximation becomes
∫_X f π dµ ≈ (d/N) Σ_{n=1}^N Σ_{i=1}^d 1_{X_i}(X_n) θ_{*,i} f(X_n)
◮ Choose ρ ∈ (0, 1) and set θ_*^ρ ∝ (θ_{*,1}^ρ, ..., θ_{*,d}^ρ):
∫_X f π dµ ≈ (1/N) Σ_{n=1}^N ( Σ_{i=1}^d 1_{X_i}(X_n) θ_{*,i}^ρ ) ( Σ_{i=1}^d θ_{*,i}^{1−ρ} ) f(X_n)
◮ But θ_* is unknown.
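Continuing the toy sketch above (again an illustration, not from the talk): assuming samples from π_{θ_*^ρ} can be drawn, here by inverse-transform sampling on the grid rather than by a Markov kernel, the partially biased IS estimator looks as follows.

    # Sketch of the partially biased IS estimator; reuses V, x, dx, pi, strata, theta_star, pi_theta above.
    rho = 0.5
    rng = np.random.default_rng(0)

    q = pi_theta(theta_star ** rho)                      # biased law pi_{theta_*^rho} on the grid
    p_cell = q * dx / (q * dx).sum()
    X = x[rng.choice(x.size, size=100_000, p=p_cell)]    # (approximate) samples from pi_{theta_*^rho}

    f = lambda y: y                                      # test function, here f(x) = x
    i_n = (X >= 0.0).astype(int)                         # stratum index of each sample
    w = theta_star[i_n] ** rho * (theta_star ** (1.0 - rho)).sum()    # IS weight per sample
    print(np.mean(w * f(X)), (pi * f(x) * dx).sum())     # IS estimate vs quadrature of int f pi dmu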

6. Intuition (3/3) - Estimation of the free energy
θ_{*,i} := ∫_{X_i} π dµ ≈ θ_{n,i} := C_{n,i} / Σ_{j=1}^d C_{n,j} = "normalized count of the visits to X_i"
◮ Exact sampling. If X_{n+1} ∼ π dµ: C_{n+1,i} = C_{n,i} + 1_{X_i}(X_{n+1}).
This yields, for all i = 1, ..., d,
C_{n+1,i} = Σ_{k=1}^{n+1} 1_{X_i}(X_k),   S_{n+1} := Σ_{i=1}^d C_{n+1,i} = n + 1 = O(n)
and
θ_{n+1,i} = (1/(n+1)) Σ_{k=1}^{n+1} 1_{X_i}(X_k) = θ_{n,i} + (1/(n+1)) ( 1_{X_i}(X_{n+1}) − θ_{n,i} ),
i.e. a Stochastic Approximation scheme with learning rate 1/S_{n+1} and limiting point θ_{*,i}.
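A tiny illustrative check (i.i.d. stratum labels assumed, not from the slides) that this recursion reproduces the normalized visit counts:

    # The recursion theta_{n+1} = theta_n + (1/(n+1)) (1_{X_i}(X_{n+1}) - theta_n)
    # equals the empirical frequencies C_{n,i}/S_n after the first step.
    import numpy as np

    rng = np.random.default_rng(1)
    labels = rng.choice(3, size=10_000, p=[0.7, 0.2, 0.1])   # assumed stratum index of X_1, X_2, ...

    theta = np.full(3, 1.0 / 3.0)                            # any positive initialization
    for n, i in enumerate(labels):
        e = np.zeros(3); e[i] = 1.0
        theta += (e - theta) / (n + 1.0)                     # learning rate 1/S_{n+1} = 1/(n+1)

    print(theta, np.bincount(labels) / labels.size)          # identical up to rounding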

7. Intuition (3/3) - Estimation of the free energy
θ_{*,i} := ∫_{X_i} π dµ ≈ θ_{n,i} := C_{n,i} / Σ_{j=1}^d C_{n,j} = "normalized count of the visits to X_i"
◮ Exact sampling. If X_{n+1} ∼ π dµ: C_{n+1,i} = C_{n,i} + 1_{X_i}(X_{n+1}).
◮ IS sampling. If X_{n+1} ∼ π_θ̂ dµ: C_{n+1,i} = C_{n,i} + γ θ̂_i 1_{X_i}(X_{n+1}).
This yields, for all i = 1, ..., d,
C_{n+1,i} = γ Σ_{k=1}^{n+1} θ̂_i 1_{X_i}(X_k),   S_{n+1} := Σ_{i=1}^d C_{n+1,i} = O_{w.p.1}(n)
and
θ_{n+1,i} = θ_{n,i} + (γ/S_{n+1}) H_i(θ_n, X_{n+1}) + O(1/n²),
i.e. a Stochastic Approximation scheme with learning rate 1/S_{n+1} and limiting point θ_{*,i}.

8. Intuition (3/3) - Estimation of the free energy
θ_{*,i} := ∫_{X_i} π dµ ≈ θ_{n,i} := C_{n,i} / Σ_{j=1}^d C_{n,j} = "normalized count of the visits to X_i"
◮ Exact sampling. If X_{n+1} ∼ π dµ: C_{n+1,i} = C_{n,i} + 1_{X_i}(X_{n+1}).
◮ IS sampling. If X_{n+1} ∼ π_θ̂ dµ: C_{n+1,i} = C_{n,i} + γ θ̂_i 1_{X_i}(X_{n+1}).
◮ IS sampling with a leverage effect. If X_{n+1} ∼ π_θ̂ dµ:
C_{n+1,i} = C_{n,i} + γ (S_n / g(S_n)) θ̂_i 1_{X_i}(X_{n+1}),   with lim_{s→+∞} g(s) = +∞ and lim inf_{s→+∞} s/g(s) > 0.
This yields S_{n+1} ↑ +∞,
(S_{n+1} − S_n)/S_n = (γ/g(S_n)) θ̂_i 1_{X_i}(X_{n+1})
and
θ_{n+1,i} = θ_{n,i} + (γ/g(S_n)) H_i(θ_n, X_{n+1}) + O(γ²/g²(S_n)),
i.e. a S.A. scheme with learning rate γ/g(S_n) and limiting point θ_{*,i}.

9. Intuition (3/3) - Estimation of the free energy
θ_{*,i} := ∫_{X_i} π dµ ≈ θ_{n,i} := C_{n,i} / Σ_{j=1}^d C_{n,j} = "normalized count of the visits to X_i"
◮ Exact sampling. If X_{n+1} ∼ π dµ: C_{n+1,i} = C_{n,i} + 1_{X_i}(X_{n+1}).
◮ IS sampling. If X_{n+1} ∼ π_θ̂ dµ: C_{n+1,i} = C_{n,i} + γ θ̂_i 1_{X_i}(X_{n+1}).
◮ IS sampling with a leverage effect. If X_{n+1} ∼ π_θ̂ dµ:
C_{n+1,i} = C_{n,i} + γ (S_n / g(S_n)) θ̂_i 1_{X_i}(X_{n+1}),   lim_{s→+∞} g(s) = +∞,   lim inf_{s→+∞} s/g(s) > 0.
If g(s) = (ln(1+s))^{α/(1−α)}, the learning rate is O(n^{−α}), as illustrated by the sketch below.
◮ Key property: if X_{n+1} ∈ X_i, then for any j ≠ i, π_{θ_{n+1}}(X_j) > π_{θ_n}(X_j): the probability of stratum #j increases.
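A small numeric sketch (illustrative, not from the talk) of why this choice of g gives a learning rate of order n^{−α}: the per-step relative growth of S_n is replaced by a constant c, a stand-in for the average of γ θ̂_i, which is an assumption here.

    # With g(s) = (ln(1+s))^(alpha/(1-alpha)), the rate gamma/g(S_n) decays like n^(-alpha).
    import numpy as np

    alpha, gamma, c = 0.75, 1.0, 0.4           # c: assumed average relative increment per step
    g = lambda s: np.log1p(s) ** (alpha / (1.0 - alpha))

    S = 1.0
    for n in range(1, 1_000_001):
        S *= 1.0 + c / g(S)                    # deterministic surrogate of S_{n+1} = S_n + c S_n/g(S_n)
        if n % 250_000 == 0:
            print(n, gamma / g(S) * n**alpha)  # approaches a finite positive constant (~0.70 here)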

10. The algorithm: Adaptive IS with partial biasing
◮ Fix: ρ ∈ (0, 1) and α ∈ (1/2, 1). Set g(s) := (ln(1+s))^{α/(1−α)}.
◮ Initialisation: X_0 ∈ X, a positive weight vector θ_0.
◮ Repeat, for n = 0, ..., N−1:
sample X_{n+1} ∼ P_{θ_n^ρ}(X_n, ·), a Markov kernel invariant w.r.t. π_{θ_n^ρ} dµ;
compute
C_{n+1,i} = C_{n,i} + γ (S_n / g(S_n)) θ_{n,i}^ρ 1_{X_i}(X_{n+1}),   S_{n+1} = Σ_{i=1}^d C_{n+1,i},   θ_{n+1,i} = C_{n+1,i} / S_{n+1}.
◮ Return: the sequence (θ_n)_n of estimates of θ_*, and the IS estimator
∫_X f π dµ ≈ (1/N) Σ_{n=1}^N ( Σ_{i=1}^d 1_{X_i}(X_n) θ_{n−1,i}^ρ ) ( Σ_{i=1}^d θ_{n−1,i}^{1−ρ} ) f(X_n).
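A minimal end-to-end sketch of this algorithm on the 1-D toy example used above (an illustration under assumptions, not the authors' code): the kernel P_{θ_n^ρ} is taken to be a Gaussian random-walk Metropolis kernel, and the potential, strata, step size, γ, ρ and α are all illustrative choices.

    # Minimal sketch of Adaptive IS with partial biasing (1-D toy example, illustrative choices).
    import numpy as np

    rng = np.random.default_rng(0)
    V = lambda x: (x**2 - 1.0)**2 + 0.5 * x        # hypothetical asymmetric double-well potential
    d = 2
    stratum = lambda x: 0 if x < 0.0 else 1        # partition X_1 = (-inf, 0), X_2 = [0, +inf)

    rho, alpha, gamma, step = 0.5, 0.75, 1.0, 0.5
    g = lambda s: np.log1p(s) ** (alpha / (1.0 - alpha))

    def mh_step(x, theta_rho):
        # One random-walk Metropolis move, invariant w.r.t. pi_{theta^rho} (known up to a constant).
        y = x + step * rng.normal()
        log_ratio = (-V(y) - np.log(theta_rho[stratum(y)])) - (-V(x) - np.log(theta_rho[stratum(x)]))
        return y if np.log(rng.random()) < log_ratio else x

    N = 200_000
    X = 1.0
    C = np.full(d, 1.0); S = C.sum(); theta = C / S   # positive weight vector theta_0
    is_sum = 0.0                                      # accumulator for the IS estimator, f(x) = x
    for n in range(N):
        theta_rho = theta ** rho
        X = mh_step(X, theta_rho)                     # X_{n+1} ~ P_{theta_n^rho}(X_n, .)
        i = stratum(X)
        is_sum += theta_rho[i] * (theta ** (1.0 - rho)).sum() * X
        C[i] += gamma * (S / g(S)) * theta_rho[i]     # leveraged, partially biased count update
        S = C.sum(); theta = C / S

    print("theta_N (estimate of theta_*):", theta)
    print("IS estimate of int f pi dmu:", is_sum / N)

With ρ = 1 and g(s) = s, the same loop reduces to the Self-Healing Umbrella Sampling update mentioned on slide 16.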

11. Convergence results
1. The limiting behavior of the estimates (θ_n)_n
2. The limiting distribution of X_n
3. The limiting behavior of the IS estimator

12. Assumptions
1. On the target density and the strata X_i: sup_X π < ∞ and min_{1≤i≤d} θ_{*,i} > 0.
2. On the kernels P_θ: Hastings-Metropolis kernel, with symmetric proposal q(x, y) dµ(y) such that inf_{X×X} q > 0; for any compact subset K, there exist C and λ ∈ (0, 1) s.t. sup_{θ∈K} ||P_θ^n(x, ·) − π_θ||_TV ≤ C λ^n.
3. ρ ∈ (0, 1).
4. g(s) = (ln(1+s))^{α/(1−α)} with α ∈ (1/2, 1).

13. Convergence results: on the sequence θ_n
◮ Recall
θ_{n+1} = θ_n + γ_{n+1} H(X_{n+1}, θ_n) + γ_{n+1}² Λ_{n+1},   γ_{n+1} = γ/g(S_n),
where (γ_n)_n is a positive random learning rate, sup_n ||Λ_{n+1}|| is bounded a.s., and ∫ H(·, θ) π_{θ^ρ} dµ = 0 iff θ = θ_*.
◮ Result 1:
lim_n γ_n n^α = (1−α)^α γ^{1−α} ( Σ_{j=1}^d θ_{*,j}^{1−ρ} )^α   a.s.
◮ Result 2: lim_n θ_n = θ_* a.s.
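A heuristic sketch (added here, not in the slides) of where the constant in Result 1 comes from, assuming θ_n ≈ θ_* and replacing the growth of S_n by its averaged dynamics:

    % Averaged relative growth of S_n under pi_{theta_*^rho} (heuristic, assumptions as stated above):
    \[
      \mathbb{E}\Big[\tfrac{S_{n+1}-S_n}{S_n}\,\Big|\,\mathcal{F}_n\Big]
        \approx \frac{\gamma}{g(S_n)} \sum_{i=1}^d \theta_{*,i}^{\rho}
                \frac{\theta_{*,i}^{1-\rho}}{\sum_{j} \theta_{*,j}^{1-\rho}}
        = \frac{\gamma}{g(S_n) \sum_{j} \theta_{*,j}^{1-\rho}},
      \qquad
      \frac{d}{dn} (\ln S)^{\frac{1}{1-\alpha}}
        \approx \frac{\gamma}{(1-\alpha) \sum_{j} \theta_{*,j}^{1-\rho}} .
    \]
    % Hence ln S_n grows like ( gamma n / ((1-alpha) sum_j theta_{*,j}^{1-rho}) )^{1-alpha}, so that
    \[
      \gamma_n = \frac{\gamma}{g(S_n)} = \gamma\, (\ln S_n)^{-\frac{\alpha}{1-\alpha}}
        \sim (1-\alpha)^{\alpha}\, \gamma^{1-\alpha}\, \Big( \sum_{j=1}^d \theta_{*,j}^{1-\rho} \Big)^{\alpha} n^{-\alpha}.
    \]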

14. Convergence results - on the samples X_n
◮ Recall: X_{n+1} ∼ P_{θ_n^ρ}(X_n, ·), with π_θ P_θ = π_θ.
◮ Result 1: for any bounded function f, lim_n E[f(X_n)] = ∫ f π_{θ_*^ρ} dµ.
◮ Result 2: for any bounded function f, lim_N (1/N) Σ_{n=1}^N f(X_n) = ∫ f π_{θ_*^ρ} dµ a.s.

15. Convergence results - on the IS estimator
◮ Result 1: for any bounded function f,
lim_N E[ (1/N) Σ_{n=1}^N ( Σ_{j=1}^d θ_{n−1,j}^{1−ρ} ) ( Σ_{j=1}^d 1_{X_j}(X_n) θ_{n−1,j}^ρ ) f(X_n) ] = ∫ f π dµ.
◮ Result 2: for any bounded function f, a.s.,
lim_N (1/N) Σ_{n=1}^N ( Σ_{j=1}^d θ_{n−1,j}^{1−ρ} ) ( Σ_{j=1}^d 1_{X_j}(X_n) θ_{n−1,j}^ρ ) f(X_n) = ∫ f π dµ.

16. Is it new?
◮ Theoretical contribution:
Self-Healing Umbrella Sampling: ρ = 1 (no biasing intensity), g(s) = s (also covered by the theory; not detailed here).
Well-tempered metadynamics: ρ ∈ (0, 1) (biasing intensity), g(s) = s^{1−ρ} (also covered by the theory; not detailed here).
◮ Methodological contribution: the introduction of a function g(s) in the updating scheme of the estimator θ_n, allowing a random learning rate γ_n ∼ O_{w.p.1}(n^{−α}) for α ∈ (1/2, 1).

17. Is there a gain in such a self-tuned and partially biasing algorithm?
[Figures: the toy metastable potential and sampled histograms for beta = 1 and beta = 5; axis values omitted.]
Make the metastability larger by increasing β.
