Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization
Thanh Huy Nguyen, Umut Şimşekli, Gaël Richard
LTCI, Télécom Paris, Institut Polytechnique de Paris, France
Introduction

Non-convex optimization problem: $\min f(x)$

Fractional Langevin Algorithm (FLA) (Simsekli, 2017):

$$W^{k+1} = W^{k} - \eta c_\alpha \nabla f(W^{k}) + \left(\frac{\eta}{\beta}\right)^{1/\alpha} \Delta L^{\alpha}_{k+1}$$

- $\{\Delta L^{\alpha}_{k}\}_{k \in \mathbb{N}_+}$: $\alpha$-stable random variables
- $\alpha \in (1, 2]$: the characteristic index; $c_\alpha$: a known constant

[Figure: α-stable distribution and α-stable Lévy motion, each plotted for α = 1.2, 1.6, 2.0]

- Generalizes Stochastic Gradient Langevin Dynamics ($\alpha = 2$) (Welling and Teh, 2011)
- Strong links with SGD for deep neural networks (Simsekli et al., 2019)

Our goal: analyze $\mathbb{E}[f(W^{k}) - f^*]$, where $f^* \triangleq \min f(x)$.
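The FLA update can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation. It assumes that the coordinates of $\Delta L^{\alpha}_{k+1}$ are independent standard symmetric α-stable draws, that these are generated with the Chambers-Mallows-Stuck method, and that `c_alpha` is supplied by the caller (the slides only say it is a known constant). The names `sample_sas` and `fla_step` are hypothetical.

```python
import math
import random


def sample_sas(alpha, rng):
    """Draw one standard symmetric alpha-stable variate, alpha in (1, 2].

    Uses the Chambers-Mallows-Stuck construction; for alpha = 2 this is a
    Gaussian with variance 2.
    """
    u = rng.uniform(-math.pi / 2, math.pi / 2)
    e = max(rng.expovariate(1.0), 1e-300)  # guard against an exact zero
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos(u - alpha * u) / e) ** ((1.0 - alpha) / alpha))


def fla_step(w, grad_f, eta, beta, alpha, c_alpha, rng):
    """One FLA iteration: W <- W - eta*c_alpha*grad f(W) + (eta/beta)^(1/alpha) * noise.

    Assumption: each coordinate receives an independent symmetric
    alpha-stable increment.
    """
    g = grad_f(w)
    scale = (eta / beta) ** (1.0 / alpha)
    return [wi - eta * c_alpha * gi + scale * sample_sas(alpha, rng)
            for wi, gi in zip(w, g)]
```

For example, `fla_step([5.0, -3.0], lambda x: list(x), 0.1, 1.0, 1.5, 1.0, random.Random(0))` performs one step on $f(x) = \tfrac{1}{2}\|x\|^2$. With `alpha = 2` the noise term reduces to Gaussian noise of variance $2\eta/\beta$ per coordinate, which matches the Langevin case mentioned above.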
Method of Analysis

Define three stochastic processes:

$$dX_1(t) = -c_\alpha \nabla f(X_1(t-))\,dt + \beta^{-1/\alpha}\,dL^{\alpha}(t),$$

$$dX_2(t) = -c_\alpha \sum_{j=0}^{\infty} \nabla f(X_2(j\eta))\,\mathbb{I}_{[j\eta,(j+1)\eta[}(t)\,dt + \beta^{-1/\alpha}\,dL^{\alpha}(t),$$

$$dX_3(t) = -\frac{\mathcal{D}^{\alpha-2}_{x_i}\left(\phi(X_3(t-))\,\dfrac{\partial f(X_3(t-))}{\partial x_i}\right)}{\phi(X_3(t-))}\,dt + \beta^{-1/\alpha}\,dL^{\alpha}(t).$$

- $\mathcal{D}$: Riesz fractional (directional) derivative
- $X_1$ is the continuous-time limit of the FLA algorithm
- $X_2$ is a linearly interpolated version of $W^{k}$: $X_2(k\eta) = W^{k}$, $\forall k \in \mathbb{N}_+$
- $X_3$ admits $\pi \propto \exp(-\beta f(x))\,dx$ as its unique invariant distribution

Decompose the error $\mathbb{E} f(W^{k}) - f^*$ as:

$$[\mathbb{E} f(X_2(k\eta)) - \mathbb{E} f(X_1(k\eta))] + [\mathbb{E} f(X_1(k\eta)) - \mathbb{E} f(X_3(k\eta))] + [\mathbb{E} f(X_3(k\eta)) - \mathbb{E} f(\hat{W})] + [\mathbb{E} f(\hat{W}) - f^*]$$

- $\hat{W} \sim \pi \propto \exp(-\beta f(x))\,dx$
- Relate these terms to the Wasserstein distance between the processes
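As a quick check that the four brackets do recover the error: because $X_2(k\eta) = W^{k}$, the intermediate expectations cancel in pairs. The short derivation below only restates that telescoping step and uses no assumption beyond the definitions on this slide.

```latex
\begin{align*}
\mathbb{E} f(W^{k}) - f^*
  &= \mathbb{E} f(X_2(k\eta)) - f^*
     && \text{since } X_2(k\eta) = W^{k} \\
  &= \bigl[\mathbb{E} f(X_2(k\eta)) - \mathbb{E} f(X_1(k\eta))\bigr]
   + \bigl[\mathbb{E} f(X_1(k\eta)) - \mathbb{E} f(X_3(k\eta))\bigr] \\
  &\quad + \bigl[\mathbb{E} f(X_3(k\eta)) - \mathbb{E} f(\hat{W})\bigr]
   + \bigl[\mathbb{E} f(\hat{W}) - f^*\bigr].
\end{align*}
```

Read left to right, the brackets isolate the discretization error, the gap between the two continuous-time dynamics, the distance of $X_3$ from its invariant distribution, and the gap between sampling from $\pi$ and attaining the minimum.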
Main Result

Main assumptions:

1) Hölder continuous gradients: $c_\alpha \|\nabla f(x) - \nabla f(y)\| \le M \|x - y\|^{\gamma}$
2) Dissipativity: $c_\alpha \langle x, \nabla f(x) \rangle \ge m \|x\|^{1+\gamma} - b$

Theorem. For $0 < \eta < m/M^2$, there exists $C > 0$ such that:

$$\begin{aligned}
\mathbb{E}[f(W^{k})] - f^* \le{} & C \Biggl\{ k^{1+\max\{\frac{1}{q},\,\gamma+\frac{\gamma}{q}\}}\,\eta^{\frac{1}{q}} + \frac{k^{1+\max\{\frac{1}{q},\,\gamma+\frac{\gamma}{q}\}}\,\eta^{\frac{1}{q}+\frac{\gamma}{\alpha q}}\,d}{\beta^{\frac{(q-1)\gamma}{\alpha q}}} + \frac{\beta b + d}{m}\exp\!\left(-\frac{\lambda_* k \eta}{\beta}\right) \Biggr\} \\
& + \frac{M c_\alpha^{-1}}{\beta^{\gamma+1}(1+\gamma)} + \frac{1}{\beta}\log\frac{\bigl(2e(b+\frac{d}{\beta})\bigr)^{\frac{d}{2}}\,\Gamma(\frac{d}{2}+1)\,\beta^{d}}{(dm)^{\frac{d}{2}}}.
\end{aligned}$$

- Worse dependency on $\eta$ and $k$ than in the case $\alpha = 2$
- Requires smaller $\eta$
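To make the two stated assumptions concrete, here is a worked check on the quadratic $f(x) = \tfrac{1}{2}\|x\|^2$. The choice of function is an assumption made purely for illustration. It covers only the two conditions listed on the slide and does not verify any further hypotheses the theorem may require.

```latex
\begin{align*}
\nabla f(x) &= x, \\
c_\alpha \|\nabla f(x) - \nabla f(y)\| &= c_\alpha \|x - y\|
  && \Rightarrow\ \gamma = 1,\ M = c_\alpha, \\
c_\alpha \langle x, \nabla f(x) \rangle &= c_\alpha \|x\|^{2}
  && \Rightarrow\ m = c_\alpha,\ b = 0, \\
0 < \eta < \frac{m}{M^{2}} &= \frac{1}{c_\alpha}.
\end{align*}
```

In this example the step-size condition reduces to $\eta < 1/c_\alpha$, so the admissible range of $\eta$ depends on $\alpha$ only through the constant $c_\alpha$.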
Additional Results

- Posterior sampling: sampling from $\pi \propto \exp(-\beta f(x))\,dx$
- Stochastic gradients: with $f(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} f^{(i)}(x)$, approximate $\nabla f \approx \nabla f_k(x) \triangleq \sum_{i \in \Omega_k} \nabla f^{(i)}(x) / n_s$
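The stochastic-gradient estimator above can be sketched as follows. This is a minimal illustration under stated assumptions: $\Omega_k$ is drawn uniformly without replacement with $|\Omega_k| = n_s$, $x$ is a scalar for brevity, and the names `minibatch_grad` and `full_grad` are hypothetical.

```python
import random


def full_grad(x, grad_fns):
    """Gradient of f(x) = (1/n) * sum_i f_i(x), given per-component gradients."""
    return sum(g(x) for g in grad_fns) / len(grad_fns)


def minibatch_grad(x, grad_fns, n_s, rng):
    """Estimate grad f(x) from a random index set Omega_k of size n_s.

    Assumption: Omega_k is sampled uniformly without replacement.
    """
    omega_k = rng.sample(range(len(grad_fns)), n_s)
    return sum(grad_fns[i](x) for i in omega_k) / n_s
```

When `n_s` equals the number of components, the estimate coincides with the full gradient; for smaller `n_s` it is a noisy estimate that can replace $\nabla f$ in the FLA update.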