Non-convex Learning via Replica Exchange Stochastic Gradient MCMC
A scalable parallel tempering algorithm for DNNs
Qi Feng*², Liyao Gao*¹, Wei Deng¹, Faming Liang¹, Guang Lin¹
¹ Purdue University   ² University of Southern California   * Equal contribution
July 27, 2020
Intro
Markov chain Monte Carlo

The increasing concern for AI safety problems draws our attention to Markov chain Monte Carlo (MCMC), which is known for
• Multi-modal sampling [Teh et al., 2016]
• Non-convex optimization [Zhang et al., 2017]
Acceleration strategies for MCMC

Popular strategies to accelerate MCMC:
• Simulated annealing [Kirkpatrick et al., 1983]
• Simulated tempering [Marinari and Parisi, 1992]
• Replica exchange MCMC [Swendsen and Wang, 1986]
Replica exchange stochastic gradient MCMC
Replica exchange Langevin diffusion

Consider two Langevin diffusion processes with temperatures τ₁ > τ₂:

d\beta^{(1)}_t = -\nabla U(\beta^{(1)}_t)\, dt + \sqrt{2\tau_1}\, dW^{(1)}_t
d\beta^{(2)}_t = -\nabla U(\beta^{(2)}_t)\, dt + \sqrt{2\tau_2}\, dW^{(2)}_t.

In other words, a jump process is included in the Markov process: the positions of the two particles swap with probability r S(\beta^{(1)}_t, \beta^{(2)}_t)\, dt, where

S(\beta^{(1)}_t, \beta^{(2)}_t) := e^{\left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\left(U(\beta^{(1)}_t) - U(\beta^{(2)}_t)\right)},

P\big(\beta_{t+dt} = (\beta^{(2)}_t, \beta^{(1)}_t) \mid \beta_t = (\beta^{(1)}_t, \beta^{(2)}_t)\big) = r S(\beta^{(1)}_t, \beta^{(2)}_t)\, dt,
P\big(\beta_{t+dt} = (\beta^{(1)}_t, \beta^{(2)}_t) \mid \beta_t = (\beta^{(1)}_t, \beta^{(2)}_t)\big) = 1 - r S(\beta^{(1)}_t, \beta^{(2)}_t)\, dt.
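For intuition, the following is a minimal NumPy sketch (not from the paper) that discretizes the two diffusions with Euler-Maruyama and realizes the jump process as a Bernoulli swap with probability r·S·dt per step. The double-well potential U(x) = (x² − 1)², the temperatures, the step size dt, and the swap intensity r are illustrative assumptions.

```python
import numpy as np

def U(x):
    # Illustrative double-well potential U(x) = (x^2 - 1)^2
    return (x ** 2 - 1.0) ** 2

def grad_U(x):
    # Gradient of the double-well potential
    return 4.0 * x * (x ** 2 - 1.0)

def replica_exchange_langevin(tau1=5.0, tau2=0.5, r=1.0, dt=1e-3, n_steps=10_000, seed=0):
    """Euler-Maruyama discretization of the two coupled Langevin diffusions.

    Chain 1 runs at the high temperature tau1 (exploration), chain 2 at the
    low temperature tau2 (exploitation); positions swap with probability
    r * S(beta1, beta2) * dt at each step.
    """
    rng = np.random.default_rng(seed)
    beta1, beta2 = -1.0, 1.0
    path = np.empty((n_steps, 2))
    for k in range(n_steps):
        beta1 += -grad_U(beta1) * dt + np.sqrt(2.0 * tau1 * dt) * rng.standard_normal()
        beta2 += -grad_U(beta2) * dt + np.sqrt(2.0 * tau2 * dt) * rng.standard_normal()
        # Jump process: S = exp((1/tau1 - 1/tau2) * (U(beta1) - U(beta2)))
        S = np.exp((1.0 / tau1 - 1.0 / tau2) * (U(beta1) - U(beta2)))
        if rng.uniform() < min(1.0, r * S * dt):
            beta1, beta2 = beta2, beta1
        path[k] = beta1, beta2
    return path
```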
A demo

Figure 1: Trajectory plot for replica exchange Langevin diffusion.
Why the naïve numerical algorithm fails

Consider the scalable stochastic gradient Langevin dynamics algorithm [Welling and Teh, 2011]:

\tilde\beta^{(1)}_{k+1} = \tilde\beta^{(1)}_k - \eta_k \nabla \tilde L(\tilde\beta^{(1)}_k) + \sqrt{2\eta_k \tau_1}\, \xi^{(1)}_k
\tilde\beta^{(2)}_{k+1} = \tilde\beta^{(2)}_k - \eta_k \nabla \tilde L(\tilde\beta^{(2)}_k) + \sqrt{2\eta_k \tau_2}\, \xi^{(2)}_k.   (1)

Swap the chains with a naïve swapping rate r \tilde S(\tilde\beta^{(1)}_{k+1}, \tilde\beta^{(2)}_{k+1})\, \eta_k §:

\tilde S(\tilde\beta^{(1)}_{k+1}, \tilde\beta^{(2)}_{k+1}) = e^{\left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\left(\tilde L(\tilde\beta^{(1)}_{k+1}) - \tilde L(\tilde\beta^{(2)}_{k+1})\right)}.

Exponentiating the unbiased estimators \tilde L(\tilde\beta^{(\cdot)}_{k+1}) leads to a large bias.

§ In the implementations, we fix rη_k = 1 by default.
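The size of this bias is easy to check: if each loss estimate carries independent Gaussian noise with variance σ², the naïve rate is inflated on average by the factor exp((1/τ₁ − 1/τ₂)² σ²), which grows exponentially in σ². A small Monte Carlo check of this (all numbers are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
tau1, tau2 = 10.0, 1.0            # illustrative temperatures (tau1 > tau2)
delta = 1.0 / tau1 - 1.0 / tau2   # factor multiplying the energy difference
L_diff = 2.0                      # "true" L(beta1) - L(beta2), illustrative
sigma = 1.0                       # std of each noisy loss estimate

true_S = np.exp(delta * L_diff)

# E[exp(delta * (L~1 - L~2))]: the difference of two noisy estimates has variance 2*sigma^2
noisy_diff = L_diff + np.sqrt(2.0) * sigma * rng.standard_normal(1_000_000)
naive_S = np.exp(delta * noisy_diff).mean()

print(f"true rate           : {true_S:.3f}")
print(f"naive rate (MC mean): {naive_S:.3f}")
print(f"analytic inflation  : {np.exp(delta ** 2 * sigma ** 2):.3f}x")
```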
A corrected algorithm

Assume \tilde L(\theta) \sim N(L(\theta), \sigma^2) and consider the geometric Brownian motion of \{\tilde S_t\}_{t\in[0,1]} in each swap as a martingale:

\tilde S_t = e^{\left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\left(\tilde L(\tilde\beta^{(1)}) - \tilde L(\tilde\beta^{(2)}) - \left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\sigma^2 t\right)}
          = e^{\left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\left(L(\tilde\beta^{(1)}) - L(\tilde\beta^{(2)}) - \left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\sigma^2 t + \sqrt{2}\sigma W_t\right)}.

Taking the derivative of \tilde S_t with respect to t and W_t, Itô's lemma gives

d\tilde S_t = \left(\frac{\partial \tilde S_t}{\partial t} + \frac{1}{2}\frac{\partial^2 \tilde S_t}{\partial W_t^2}\right) dt + \frac{\partial \tilde S_t}{\partial W_t}\, dW_t = \sqrt{2}\sigma\left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right) \tilde S_t\, dW_t.   (2)

By fixing t = 1 in (2), we have the suggested unbiased swapping rate

\tilde S_1 = e^{\left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\left(\tilde L(\tilde\beta^{(1)}) - \tilde L(\tilde\beta^{(2)}) - \left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\sigma^2\right)}.
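A minimal sketch of how the corrected rate could be applied in code, assuming for the moment that σ² is known (the adaptive estimate comes next); the function names are placeholders, not the authors' implementation.

```python
import numpy as np

def corrected_swap_rate(L1_hat, L2_hat, sigma2, tau1, tau2):
    """Variance-corrected swapping rate S~_1 (the t = 1 value of the martingale).

    L1_hat, L2_hat : noisy energy estimates L~(beta1), L~(beta2)
    sigma2         : variance sigma^2 of a single energy estimate
    """
    delta = 1.0 / tau1 - 1.0 / tau2
    return np.exp(delta * (L1_hat - L2_hat - delta * sigma2))

def maybe_swap(beta1, beta2, L1_hat, L2_hat, sigma2, tau1, tau2, rng):
    # Swap the chains with probability min(1, S~_1)
    if rng.uniform() < corrected_swap_rate(L1_hat, L2_hat, sigma2, tau1, tau2):
        return beta2, beta1
    return beta1, beta2
```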
Unknown corrections in practice

Figure 2: Unknown corrections on the CIFAR10 and CIFAR100 datasets.
An adaptive algorithm for unknown corrections

Sampling step:
\tilde\beta^{(1)}_{k+1} = \tilde\beta^{(1)}_k - \eta^{(1)}_k \nabla \tilde L(\tilde\beta^{(1)}_k) + \sqrt{2\eta^{(1)}_k \tau_1}\, \xi^{(1)}_k
\tilde\beta^{(2)}_{k+1} = \tilde\beta^{(2)}_k - \eta^{(2)}_k \nabla \tilde L(\tilde\beta^{(2)}_k) + \sqrt{2\eta^{(2)}_k \tau_2}\, \xi^{(2)}_k.

Stochastic approximation step: obtain an unbiased estimate \tilde\sigma^2_{m+1} for \sigma^2 and update
\hat\sigma^2_{m+1} = (1 - \gamma_m)\hat\sigma^2_m + \gamma_m \tilde\sigma^2_{m+1}.

Swapping step: generate a uniform random number u ∈ [0, 1] and compute
\hat S_1 = \exp\left\{\left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\left(\tilde L(\tilde\beta^{(1)}_{k+1}) - \tilde L(\tilde\beta^{(2)}_{k+1}) - \left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\hat\sigma^2_{m+1}\right)\right\}.
If u < \hat S_1: swap \tilde\beta^{(1)}_{k+1} and \tilde\beta^{(2)}_{k+1}.
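Putting the three steps together, here is a sketch of a single adaptive iteration; the helper callables grad_L, loss_L, var_L and all hyperparameters are placeholders/assumptions rather than the paper's actual code.

```python
import numpy as np

def adaptive_resgld_step(beta1, beta2, sigma2_hat, grad_L, loss_L, var_L,
                         eta1, eta2, tau1, tau2, gamma, rng):
    """One iteration of the adaptive scheme (sketch, not the paper's code).

    grad_L(beta) : stochastic gradient of the loss at beta
    loss_L(beta) : stochastic estimate of the loss at beta
    var_L()      : unbiased estimate sigma~^2_{m+1} of the loss-estimator variance
    gamma        : stochastic-approximation step size gamma_m
    """
    # Sampling step: one SGLD update per chain
    beta1 = beta1 - eta1 * grad_L(beta1) + np.sqrt(2.0 * eta1 * tau1) * rng.standard_normal(beta1.shape)
    beta2 = beta2 - eta2 * grad_L(beta2) + np.sqrt(2.0 * eta2 * tau2) * rng.standard_normal(beta2.shape)

    # Stochastic approximation step: smooth the variance estimate
    sigma2_hat = (1.0 - gamma) * sigma2_hat + gamma * var_L()

    # Swapping step: bias-corrected rate with the current variance estimate
    delta = 1.0 / tau1 - 1.0 / tau2
    S1_hat = np.exp(delta * (loss_L(beta1) - loss_L(beta2) - delta * sigma2_hat))
    if rng.uniform() < S1_hat:
        beta1, beta2 = beta2, beta1
    return beta1, beta2, sigma2_hat
```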
Convergence Analysis
Discretization Error

Replica exchange SGLD tracks the replica exchange Langevin diffusion in some sense.

Lemma (Discretization Error)
Given the smoothness and dissipativity assumptions in the appendix, and a small (fixed) learning rate η, we have

\sqrt{\mathbb{E}\big[\sup_{0 \le t \le T} \|\beta_t - \tilde\beta^{\eta}_t\|^2\big]} \le \tilde O\big(\eta + \max_i \mathbb{E}[\|\phi_i\|^2] + \max_i \mathbb{E}[|\psi_i|^2]\big),

where \tilde\beta^{\eta}_t is the continuous-time interpolation for reSGLD, \phi := \nabla\tilde U - \nabla U is the noise in the stochastic gradient, and \psi := \tilde S - S is the noise in the stochastic swapping rate.
Accelerated exponential decay of W_2

(i) Log-Sobolev inequality for Langevin diffusion [Cattiaux et al., 2010]:
• Lyapunov condition: for V(x_1, x_2) := e^{\frac{a}{4}\left(\frac{\|x_1\|^2}{\tau_1} + \frac{\|x_2\|^2}{\tau_2}\right)},
  \frac{\mathcal{L} V(x_1, x_2)}{V(x_1, x_2)} \le \kappa - \gamma\left(\|x_1\|^2 + \|x_2\|^2\right).
• Hessian lower bound: the smooth gradient condition gives \nabla^2 G \succeq -C I_{2d} for some constant C > 0.
• Poincaré inequality [Chen et al., 2019]: \chi^2(\nu_t \| \pi) \le c_P\, \mathcal{E}\!\left(\frac{d\nu_t}{d\pi}\right).

(ii) Comparison method: acceleration with a larger Dirichlet form

\mathcal{E}_S(f) = \mathcal{E}(f) + \underbrace{\frac{1}{2}\int S(x_1, x_2)\,\big(f(x_2, x_1) - f(x_1, x_2)\big)^2\, d\pi(x_1, x_2)}_{\text{acceleration}}.   (3)
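For completeness, a sketch of the standard comparison argument behind (ii), under the illustrative assumption that the swap term yields a quantitative gain \mathcal{E}_S(f) \ge (1 + \delta_S)\,\mathcal{E}(f) for some \delta_S \ge 0; combined with the Poincaré inequality above, it gives accelerated exponential decay in χ², which underlies the W_2 statement.

```latex
% Sketch under the assumption E_S(f) >= (1 + delta_S) E(f), delta_S >= 0.
% d/dt chi^2(nu_t || pi) = -2 E_S(d nu_t / d pi) for the reversible dynamics,
% and the Poincare inequality chi^2(nu || pi) <= c_P E(d nu / d pi) closes the bound.
\begin{align*}
  \frac{d}{dt}\,\chi^2(\nu_t \,\|\, \pi)
    &= -2\,\mathcal{E}_S\!\Big(\tfrac{d\nu_t}{d\pi}\Big)
     \le -2(1+\delta_S)\,\mathcal{E}\!\Big(\tfrac{d\nu_t}{d\pi}\Big)
     \le -\frac{2(1+\delta_S)}{c_P}\,\chi^2(\nu_t \,\|\, \pi), \\
  \chi^2(\nu_t \,\|\, \pi)
    &\le \exp\!\Big(-\frac{2(1+\delta_S)}{c_P}\,t\Big)\,\chi^2(\nu_0 \,\|\, \pi).
\end{align*}
```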