Convergence and Efficiency of Adaptive Importance Sampling techniques with partial biasing

Gersende Fort, Institut de Mathématiques de Toulouse, CNRS, France
Joint work with B. Jourdain, T. Lelièvre and G. Stoltz

Talk based on the paper: G. Fort, B. Jourdain, T. Lelièvre, G. Stoltz, Convergence and Efficiency of Adaptive Importance Sampling techniques with partial biasing, J. Stat. Phys. (2018)
The goal

Assumption: let π dµ be a probability distribution on X ⊆ R^p, assumed to be highly metastable and (possibly) known only up to a normalizing constant.

Question 1: How to design a Monte Carlo sampler for an approximation of
\[ \int_X f \, \pi \, d\mu \; ? \]

Question 2: How to compute the free energy
\[ -\ln \int_{X_i} \pi \, d\mu, \qquad X_i \subset X \; ? \]

In this talk: an approach by Free Energy-based Adaptive Importance Sampling, which generalizes Wang-Landau, Self-Healing Umbrella Sampling and Well-tempered metadynamics.
The intuition (1/3) - a family of auxiliary distributions

Target: π(x) = (1/Z) exp(−V(x)).

◮ The auxiliary distribution: choose a partition X_1, ..., X_d of X.

[Figure: surface plots of the potential and of the target density (only axis ticks survived extraction).]

\[ \theta_{*,i} \;\stackrel{\text{def}}{=}\; \int_{X_i} \pi \, d\mu \]
The intuition (1/3) - a family of auxiliary distributions

Target: π(x) = (1/Z) exp(−V(x)).

◮ The auxiliary distribution: choose a partition X_1, ..., X_d of X and, for a vector of positive weights θ = (θ_1, ..., θ_d) with θ_i ∈ (0,1) and Σ_{i=1}^d θ_i = 1, set
\[ \pi_\theta(x) \;\propto\; \sum_{i=1}^d \mathbf{1}_{X_i}(x)\, \exp\big(-V(x) - \ln \theta_i\big). \]

◮ Property 1: for all i ∈ {1, ..., d},
\[ \int_{X_i} \pi_\theta \, d\mu \;\propto\; \frac{\theta_{*,i}}{\theta_i}, \qquad \theta_{*,i} \stackrel{\text{def}}{=} \int_{X_i} \pi \, d\mu. \]

◮ Property 2: for all i ∈ {1, ..., d},
\[ \int_{X_i} \pi_{\theta_*} \, d\mu \;=\; \frac{1}{d}. \]
Intuition (2/3) - How to choose θ?

\[ \pi_\theta(x) \;\propto\; \sum_{i=1}^d \mathbf{1}_{X_i}(x)\, \exp\big(-V(x) - \ln \theta_i\big), \qquad \theta_{*,i} \stackrel{\text{def}}{=} \int_{X_i} \pi \, d\mu. \]

◮ If θ = θ_*:
- Efficient exploration under π_{θ_*}: each subset X_i has the same weight under π_{θ_*}.
- Poor ESS: the IS approximation becomes
\[ \int_X f\, \pi \, d\mu \;\approx\; \frac{d}{N} \sum_{n=1}^N \Big( \sum_{i=1}^d \mathbf{1}_{X_i}(X_n)\, \theta_{*,i} \Big) f(X_n). \]

◮ Choose ρ ∈ (0,1) and set θ_*^ρ ∝ (θ_{*,1}^ρ, ..., θ_{*,d}^ρ):
\[ \int_X f\, \pi \, d\mu \;\approx\; \frac{1}{N} \sum_{n=1}^N \Big( \sum_{i=1}^d \mathbf{1}_{X_i}(X_n)\, \theta_{*,i}^{\rho} \Big) \Big( \sum_{i=1}^d \theta_{*,i}^{1-\rho} \Big) f(X_n). \]

◮ But θ_* is unknown.
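The weights in these two approximations can be recovered from the definition of π_θ; the following is a reconstructed sketch of that computation (notation as above, not quoted from the paper). Since the normalizing constant of π_θ equals Z Σ_j θ_{*,j}/θ_j,
\[
\int_X f\,\pi\,d\mu \;=\; \int_X f\,\frac{\pi}{\pi_\theta}\,\pi_\theta\,d\mu,
\qquad
\frac{\pi(x)}{\pi_\theta(x)} \;=\; \Big(\sum_{j=1}^d \frac{\theta_{*,j}}{\theta_j}\Big) \sum_{i=1}^d \mathbf{1}_{X_i}(x)\,\theta_i,
\]
and this ratio is invariant under rescaling of θ. Taking θ = θ_* gives the weight d Σ_i 1_{X_i}(x) θ_{*,i} of the first display; taking θ ∝ θ_*^ρ gives the weight (Σ_j θ_{*,j}^{1−ρ}) Σ_i 1_{X_i}(x) θ_{*,i}^ρ of the second display.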
Intuition (3/3) - Estimation of the free energy

\[ \theta_{*,i} \stackrel{\text{def}}{=} \int_{X_i} \pi \, d\mu \;\approx\; \theta_{n,i} \stackrel{\text{def}}{=} \frac{C_{n,i}}{\sum_{j=1}^d C_{n,j}} = \text{"normalized count of the visits to } X_i\text{"}. \]

◮ Exact sampling. If X_{n+1} ∼ π dµ:
\[ C_{n+1,i} = C_{n,i} + \mathbf{1}_{X_i}(X_{n+1}). \]
This yields, for all i = 1, ..., d,
\[ C_{n+1,i} = \sum_{k=1}^{n+1} \mathbf{1}_{X_i}(X_k), \qquad S_{n+1} \stackrel{\text{def}}{=} \sum_{i=1}^d C_{n+1,i} = n+1 = O(n), \]
and
\[ \theta_{n+1,i} = \frac{1}{n+1} \sum_{k=1}^{n+1} \mathbf{1}_{X_i}(X_k) = \theta_{n,i} + \frac{1}{n+1}\big( \mathbf{1}_{X_i}(X_{n+1}) - \theta_{n,i} \big), \]
i.e. a Stochastic Approximation scheme with learning rate 1/S_{n+1} and limiting point θ_{*,i}.
Intuition (3/3) - Estimation of the free energy

\[ \theta_{*,i} \stackrel{\text{def}}{=} \int_{X_i} \pi \, d\mu \;\approx\; \theta_{n,i} \stackrel{\text{def}}{=} \frac{C_{n,i}}{\sum_{j=1}^d C_{n,j}} = \text{"normalized count of the visits to } X_i\text{"}. \]

◮ Exact sampling. If X_{n+1} ∼ π dµ:  C_{n+1,i} = C_{n,i} + 1_{X_i}(X_{n+1}).

◮ IS sampling. If X_{n+1} ∼ π_{θ̃} dµ:
\[ C_{n+1,i} = C_{n,i} + \gamma\, \tilde\theta_i\, \mathbf{1}_{X_i}(X_{n+1}). \]
This yields, for all i = 1, ..., d,
\[ C_{n+1,i} = \sum_{k=1}^{n+1} \gamma\, \tilde\theta_i\, \mathbf{1}_{X_i}(X_k), \qquad S_{n+1} \stackrel{\text{def}}{=} \sum_{i=1}^d C_{n+1,i} = O_{\text{w.p.1}}(n), \]
and
\[ \theta_{n+1,i} = \theta_{n,i} + \frac{\gamma}{S_{n+1}}\, H_i(\theta_n, X_{n+1}) + O\!\Big(\frac{1}{n^2}\Big), \]
i.e. a Stochastic Approximation scheme with learning rate ∝ 1/S_{n+1} and limiting point θ_{*,i}.
Intuition (3/3) - Estimation of the free energy

\[ \theta_{*,i} \stackrel{\text{def}}{=} \int_{X_i} \pi \, d\mu \;\approx\; \theta_{n,i} \stackrel{\text{def}}{=} \frac{C_{n,i}}{\sum_{j=1}^d C_{n,j}} = \text{"normalized count of the visits to } X_i\text{"}. \]

◮ Exact sampling. If X_{n+1} ∼ π dµ:  C_{n+1,i} = C_{n,i} + 1_{X_i}(X_{n+1}).

◮ IS sampling. If X_{n+1} ∼ π_{θ̃} dµ:  C_{n+1,i} = C_{n,i} + γ θ̃_i 1_{X_i}(X_{n+1}).

◮ IS sampling with a leverage effect. If X_{n+1} ∼ π_{θ̃} dµ:
\[ C_{n+1,i} = C_{n,i} + \gamma\, \frac{S_n}{g(S_n)}\, \tilde\theta_i\, \mathbf{1}_{X_i}(X_{n+1}), \qquad \lim_{s\to\infty} g(s) = +\infty, \quad \liminf_{s\to\infty} s/g(s) > 0. \]
This yields
\[ \frac{S_{n+1} - S_n}{S_n} = \frac{\gamma}{g(S_n)}\, \tilde\theta_i\, \mathbf{1}_{X_i}(X_{n+1}), \qquad S_{n+1} \uparrow +\infty, \]
and
\[ \theta_{n+1,i} = \theta_{n,i} + \frac{\gamma}{g(S_n)}\, H_i(\theta_n, X_{n+1}) + O\!\Big(\frac{\gamma^2}{g^2(S_n)}\Big), \]
i.e. a S.A. scheme with learning rate γ/g(S_n) and limiting point θ_{*,i}.
Intuition (3/3) - Estimation of the free energy

\[ \theta_{*,i} \stackrel{\text{def}}{=} \int_{X_i} \pi \, d\mu \;\approx\; \theta_{n,i} \stackrel{\text{def}}{=} \frac{C_{n,i}}{\sum_{j=1}^d C_{n,j}} = \text{"normalized count of the visits to } X_i\text{"}. \]

◮ Exact sampling. If X_{n+1} ∼ π dµ:  C_{n+1,i} = C_{n,i} + 1_{X_i}(X_{n+1}).

◮ IS sampling. If X_{n+1} ∼ π_{θ̃} dµ:  C_{n+1,i} = C_{n,i} + γ θ̃_i 1_{X_i}(X_{n+1}).

◮ IS sampling with a leverage effect. If X_{n+1} ∼ π_{θ̃} dµ:
\[ C_{n+1,i} = C_{n,i} + \gamma\, \frac{S_n}{g(S_n)}\, \tilde\theta_i\, \mathbf{1}_{X_i}(X_{n+1}), \qquad \lim_{s\to\infty} g(s) = +\infty, \quad \liminf_{s\to\infty} s/g(s) > 0. \]

◮ If g(s) = (ln(1+s))^{α/(1−α)}, the learning rate γ/g(S_n) is O(n^{−α}) (a heuristic check is sketched below).

◮ Key property: if X_{n+1} ∈ X_i, then for any j ≠ i,
\[ \pi_{\theta_{n+1}}(X_j) > \pi_{\theta_n}(X_j): \]
the probability of stratum #j increases.
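Why this choice of g yields a learning rate of order n^{−α}: an informal ODE-type check reconstructed from the update above (a heuristic, not the argument of the paper). The counter update gives
\[
\ln S_{n+1} - \ln S_n \;\approx\; \frac{\gamma\, c}{g(S_n)} \quad \text{for some } c > 0.
\]
Writing u_n := ln S_n and g(s) = (ln(1+s))^β with β = α/(1−α), this reads u_{n+1} − u_n ≈ γ c u_n^{−β}, so u_n^{1+β} ≈ (1+β) γ c n and
\[
\gamma_n = \frac{\gamma}{g(S_n)} \;\approx\; \gamma\, u_n^{-\beta} \;\approx\; \gamma\,\big((1+\beta)\,\gamma c\, n\big)^{-\beta/(1+\beta)} = O(n^{-\alpha}),
\]
since β/(1+β) = α.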
The algorithm: Adaptive IS with partial biasing

◮ Fix: ρ ∈ (0,1) and α ∈ (1/2, 1). Set g(s) := (ln(1+s))^{α/(1−α)}.

◮ Initialisation: X_0 ∈ X, a positive weight vector θ_0.

◮ Repeat, for n = 0, ..., N−1:
- sample X_{n+1} ∼ P_{θ_n^ρ}(X_n, ·), a Markov kernel invariant w.r.t. π_{θ_n^ρ} dµ;
- compute
\[ C_{n+1,i} = C_{n,i} + \gamma\, \frac{S_n}{g(S_n)}\, \theta_{n,i}^{\rho}\, \mathbf{1}_{X_i}(X_{n+1}), \qquad S_{n+1} = \sum_{i=1}^d C_{n+1,i}, \qquad \theta_{n+1,i} = \frac{C_{n+1,i}}{S_{n+1}}. \]

◮ Return (θ_n)_n, the sequence of estimates of θ_*, and the IS estimator (a Python sketch of the whole scheme is given below)
\[ \int_X f\, \pi\, d\mu \;\approx\; \frac{1}{N} \sum_{n=1}^N \Big( \sum_{i=1}^d \mathbf{1}_{X_i}(X_n)\, \theta_{n-1,i}^{\rho} \Big) \Big( \sum_{i=1}^d \theta_{n-1,i}^{1-\rho} \Big) f(X_n). \]
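To make the loop above concrete, here is a minimal Python sketch of one possible implementation, under illustrative assumptions that are not taken from the paper: a one-dimensional state space, strata cut at user-given breakpoints, a Gaussian random-walk Metropolis kernel playing the role of P_{θ^ρ}, and arbitrary default values for γ, ρ, α and the proposal step. All names (adaptive_is, stratum, V, breakpoints, ...) are hypothetical.

import numpy as np

def stratum(x, breakpoints):
    # index i of the stratum X_i containing x (strata cut at the breakpoints)
    return int(np.searchsorted(breakpoints, x))

def adaptive_is(V, breakpoints, N, rho=0.5, alpha=0.75, gamma=1.0,
                step=1.0, x0=0.0, seed=None):
    rng = np.random.default_rng(seed)
    d = len(breakpoints) + 1
    def g(s):
        return np.log1p(s) ** (alpha / (1.0 - alpha))
    C = np.ones(d)                 # positive initial counts, so theta_0 > 0
    S = C.sum()
    theta = C / S
    x = x0
    i = stratum(x, breakpoints)
    samples = np.empty(N)
    weights = np.empty(N)
    for n in range(N):
        # one Metropolis step targeting pi_{theta^rho} ∝ exp(-V(x) - rho*ln theta_i)
        y = x + step * rng.standard_normal()
        j = stratum(y, breakpoints)
        log_ratio = (V(x) + rho * np.log(theta[i])) - (V(y) + rho * np.log(theta[j]))
        if np.log(rng.uniform()) < log_ratio:
            x, i = y, j
        samples[n] = x
        # IS weight of the new sample w.r.t. pi, using theta_n (pre-update)
        weights[n] = theta[i] ** rho * np.sum(theta ** (1.0 - rho))
        # counter update with the leverage factor S_n / g(S_n)
        C[i] += gamma * (S / g(S)) * theta[i] ** rho
        S = C.sum()
        theta = C / S
    return samples, weights, theta

# Hypothetical usage: double-well potential, two strata split at x = 0.
# xs, ws, theta_hat = adaptive_is(lambda x: (x**2 - 1.0)**2, np.array([0.0]), 100_000)
# estimate of \int f pi dmu for a bounded f:  np.mean(ws * f(xs))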
Convergence results

1. The limiting behavior of the estimates (θ_n)_n
2. The limiting distribution of X_n
3. The limiting behavior of the IS estimator
Assumptions

1. On the target density and the strata X_i:
\[ \sup_X \pi < \infty, \qquad \min_{1 \le i \le d} \theta_{*,i} > 0. \]

2. On the kernels P_θ: a Hastings-Metropolis kernel with symmetric proposal q(x,y) dµ(y) such that inf_{X×X} q > 0; for any compact subset K, there exist C and λ ∈ (0,1) such that
\[ \sup_{\theta \in K} \| P_\theta^n(x, \cdot) - \pi_\theta \|_{\mathrm{TV}} \le C \lambda^n. \]

3. ρ ∈ (0,1).

4. g(s) = (ln(1+s))^{α/(1−α)} with α ∈ (1/2, 1).
Convergence results: on the sequence (θ_n)_n

◮ Recall:
\[ \theta_{n+1} = \theta_n + \gamma_{n+1} H(X_{n+1}, \theta_n) + \gamma_{n+1}^2 \Lambda_{n+1}, \qquad \gamma_{n+1} = \gamma / g(S_n), \]
where
- (γ_n)_n is a positive random learning rate,
- sup_n ‖Λ_{n+1}‖ is bounded a.s.,
- ∫ H(·, θ) π_{θ^ρ} dµ = 0 iff θ = θ_*.

◮ Result 1:
\[ \lim_n n^{\alpha} \gamma_n = (1-\alpha)^{\alpha}\, \gamma^{1-\alpha} \Big( \sum_{j=1}^d \theta_{*,j}^{1-\rho} \Big)^{\alpha} \quad \text{a.s.} \]

◮ Result 2: lim_n θ_n = θ_* a.s.
Convergence results - on the samples (X_n)_n

◮ Recall: X_{n+1} ∼ P_{θ_n^ρ}(X_n, ·), with π_θ P_θ = π_θ.

◮ Result 1: for any bounded function f,
\[ \lim_n \mathbb{E}\big[ f(X_n) \big] = \int f\, \pi_{\theta_*^\rho}\, d\mu. \]

◮ Result 2: for any bounded function f,
\[ \lim_N \frac{1}{N} \sum_{n=1}^N f(X_n) = \int f\, \pi_{\theta_*^\rho}\, d\mu \quad \text{a.s.} \]
Convergence results - on the IS estimator

◮ Result 1: for any bounded function f,
\[ \lim_N \mathbb{E}\Big[ \frac{1}{N} \sum_{n=1}^N f(X_n) \Big( \sum_{j=1}^d \theta_{n-1,j}^{\rho}\, \mathbf{1}_{X_j}(X_n) \Big) \Big( \sum_{j=1}^d \theta_{n-1,j}^{1-\rho} \Big) \Big] = \int f\, \pi\, d\mu. \]

◮ Result 2: for any bounded function f, a.s.,
\[ \lim_N \frac{1}{N} \sum_{n=1}^N f(X_n) \Big( \sum_{j=1}^d \theta_{n-1,j}^{\rho}\, \mathbf{1}_{X_j}(X_n) \Big) \Big( \sum_{j=1}^d \theta_{n-1,j}^{1-\rho} \Big) = \int f\, \pi\, d\mu. \]
Is it new?

◮ Theoretical contribution:
- Self-Healing Umbrella Sampling: ρ = 1 (no partial biasing), g(s) = s (also covered by the theory; not detailed here).
- Well-tempered metadynamics: ρ ∈ (0,1) (partial biasing intensity), g(s) = s^{1−ρ} (also covered by the theory; not detailed here).

◮ Methodological contribution: the introduction of a function g(s) in the updating scheme of the estimator θ_n, allowing a random learning rate γ_n ∼ O_{w.p.1}(n^{−α}) for α ∈ (1/2, 1).
Is there a gain in such a self-tuned and partially biasing algorithm?

[Figures: the two-dimensional example, panels "beta=1" and "beta=5" (only panel titles and axis ticks survived extraction).]

Make the metastability larger by increasing β.