Stochastic approximation-based algorithms, when the Monte Carlo bias does not vanish

Gersende Fort
Institut de Mathématiques de Toulouse
CNRS, Toulouse, France
Based on joint works with
Yves Atchadé (Univ. Michigan, USA), Eric Moulines (École Polytechnique, France), Edouard Ollier (ENS Lyon, France), Laurent Risser (IMT, France), Adeline Samson (Univ. Grenoble Alpes, France),
and published in the papers (or works in progress):
- Convergence of the Monte Carlo EM for curved exponential families (Ann. Stat., 2003)
- On Perturbed Proximal-Gradient algorithms (JMLR, 2017)
- Stochastic Proximal Gradient Algorithms for Penalized Mixed Models (Statistics and Computing, 2018)
- Stochastic FISTA algorithms: so fast? (IEEE workshop SSP, 2018)
The topic

This talk answers a computational issue:
◮ Find θ∗ ∈ argmin_{θ ∈ Θ} ( f(θ) + g(θ) )   (1)
where
- Θ ⊆ R^d (extension to any Hilbert space is possible; not done here);
- g is not smooth, but is convex, proper and lower semi-continuous (so its "prox" operator is well defined);
- f is not explicit / is intractable; ∇f exists but is not explicit / is intractable.
When proving results: f is convex and ∇f is Lipschitz.
◮ In this talk: numerical tools to solve (1) based on first-order methods; convergence analysis.
Outline

- The topic
- Applications in Statistical Learning
- A numerical solution: proximal-gradient based methods
- Case of Monte Carlo approximation
- Perturbed Proximal-Gradient algorithms and EM-based algorithms
Applications in Statistical Learning
Example 1: large scale learning

Minimization of a composite function:
- g = 0, or g is a penalty / regularization / constraint condition on the parameter θ;
- f is an (empirical) loss function associated to N examples,
  f(θ) = (1/N) ∑_{i=1}^N f_i(θ),
  when N is large.
For any i, f_i and ∇f_i can be evaluated at any point θ, but the computation of the sum over N terms is too expensive.
Remark that ∇f(θ) = E[∇f_I(θ)] where I is a r.v. uniform on {1, ..., N}.
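The remark above is the key to stochastic-gradient strategies: averaging ∇f_I over uniform draws of the index I recovers the full gradient. A minimal numerical sketch, with hypothetical quadratic losses f_i (synthetic data, not from the talk):

```python
import numpy as np

# Toy losses f_i(theta) = 0.5 * (a_i^T theta - b_i)^2, so that
# grad f(theta) = (1/N) sum_i grad f_i(theta) can be computed exactly
# and compared with the Monte Carlo estimate E[grad f_I(theta)].
rng = np.random.default_rng(0)
N, d = 1000, 3
A = rng.normal(size=(N, d))
b = rng.normal(size=N)

def grad_f(theta):
    # Full gradient: one pass over all N terms (too expensive when N is large).
    return (A @ theta - b) @ A / N

theta = np.zeros(d)
# Draw I uniformly on {0, ..., N-1} many times and average grad f_I(theta).
draws = rng.integers(0, N, size=200_000)
residuals = A[draws] @ theta - b[draws]
mc = (residuals[:, None] * A[draws]).mean(axis=0)

print(np.linalg.norm(mc - grad_f(theta)))  # small: the estimator is unbiased
```

With one single draw per iteration instead of a large average, the same identity underlies stochastic gradient descent: each step uses an unbiased but noisy gradient.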
Example 2: binary graphical model

Minimization of a composite function.
Observation y ∈ {−1, 1}^p (a binary vector of length p, collecting the binary values of p nodes), with statistical model
π_θ(y) ∝ exp( ∑_{i=1}^p θ_i y_i + ∑_{i=1}^p ∑_{j=i+1}^p θ_ij y_i y_j )
with an intractable normalizing constant Z_θ; θ collects the "weights".
f is the negative log-likelihood of N indep. observations:
f(θ) = log Z_θ − ∑_{i=1}^p θ_i ( N^{-1} ∑_{n=1}^N Y_i^{(n)} ) − ∑_{i=1}^p ∑_{j=i+1}^p θ_ij ( N^{-1} ∑_{n=1}^N Y_i^{(n)} Y_j^{(n)} )
In this model, ∇f(θ) = E_θ[H(X, θ)] where X ∼ π_θ.
g = 0, or g is a penalty / regularization / constraint condition on the parameter θ (the number of observations N ≪ p²/2).
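Why the gradient is an expectation under π_θ: differentiating log Z_θ in any component of θ yields the expectation of the corresponding sufficient statistic. A sketch checking this numerically for a tiny p, where Z_θ can be computed by enumerating all 2^p configurations (the parameter values below are arbitrary assumptions):

```python
import itertools
import numpy as np

# Small binary model on p = 4 nodes: enumerate all 2^p configurations to
# compute Z_theta exactly, then check that d log Z / d theta_0 equals
# E_theta[Y_0] (the sufficient statistic attached to theta_0).
p = 4
rng = np.random.default_rng(1)
theta_lin = rng.normal(size=p) * 0.3
theta_pair = rng.normal(size=(p, p)) * 0.3   # only the upper triangle is used

def energy(y):
    s = theta_lin @ y
    for i in range(p):
        for j in range(i + 1, p):
            s += theta_pair[i, j] * y[i] * y[j]
    return s

configs = [np.array(c) for c in itertools.product([-1, 1], repeat=p)]
weights = np.exp([energy(y) for y in configs])
probs = weights / weights.sum()

# E_theta[Y_0], computed exactly from the enumeration ...
e_y0 = sum(pr * y[0] for pr, y in zip(probs, configs))

# ... matches d log Z / d theta_0, estimated by central finite differences.
eps = 1e-6
theta_lin[0] += eps
z_plus = np.exp([energy(y) for y in configs]).sum()
theta_lin[0] -= 2 * eps
z_minus = np.exp([energy(y) for y in configs]).sum()
theta_lin[0] += eps
fd = (np.log(z_plus) - np.log(z_minus)) / (2 * eps)

print(abs(e_y0 - fd))  # close to 0
```

For realistic p the enumeration is impossible, which is exactly why ∇f(θ) = E_θ[H(X, θ)] must be approximated, e.g. by MCMC draws X ∼ π_θ.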
Example 3: Parametric inference in Latent variable models

Minimization of a composite function:
- g is a penalty function (e.g. for a sparsity condition on θ);
- f is the negative log-likelihood of the N observations,
  f(θ) = − log ∫_X h(x, Y_{1:N}; θ) ν(dx),
and the gradient is of the form
∇f(θ) = − ∫_X ∂_θ log h(x, Y_{1:N}; θ) · [ h(x, Y_{1:N}; θ) / ∫_X h(u, Y_{1:N}; θ) ν(du) ] ν(dx)
i.e. an expectation w.r.t. the a posteriori distribution (known up to a normalizing constant in these models).
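Since the posterior is known only up to its normalizing constant, one natural approximation of this gradient is self-normalized importance sampling. A sketch on a toy latent-variable model chosen for its closed-form posterior (the model and all names below are assumptions for illustration, not the talk's example): X ∼ N(0, 1), Y | X ∼ N(θX, 1), so h(x, y; θ) is the prior times the likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.7
y = 1.3                                       # a single observation, for clarity

def dtheta_log_h(x):
    # d/dtheta of log h(x, y; theta) = -x^2/2 - (y - theta*x)^2/2 + const
    return (y - theta * x) * x

# Proposal = prior N(0, 1); self-normalized weights proportional to the
# likelihood approximate the posterior, known up to its normalizing constant.
xs = rng.normal(size=200_000)
logw = -0.5 * (y - theta * xs) ** 2
w = np.exp(logw - logw.max())
w /= w.sum()
grad_est = -np.sum(w * dtheta_log_h(xs))      # grad f = -E_posterior[d/dtheta log h]

# Closed form for this Gaussian model: X | Y = y is
# N(theta*y / (1 + theta^2), 1 / (1 + theta^2)).
m = theta * y / (1 + theta**2)
v = 1 / (1 + theta**2)
exact = -(y * m - theta * (v + m**2))

print(abs(grad_est - exact))                  # small Monte Carlo error
```

In realistic latent-variable models the posterior draws come from MCMC rather than importance sampling, and the resulting gradient approximation is biased for any fixed chain length, which is the situation this talk addresses.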
A numerical solution: proximal-gradient based methods
Numerical solution: the ingredient

argmin_{θ ∈ Θ} F(θ) with F(θ) = f(θ) + g(θ), f smooth, g non-smooth.

The Proximal Gradient algorithm. Given a stepsize sequence {γ_n, n ≥ 0}, iterate
θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} ∇f(θ_n) )
where
Prox_{γ, g}(τ) := argmin_{θ ∈ Θ} ( g(θ) + (1/(2γ)) ‖θ − τ‖² ).

Proximal map: Moreau (1962). Proximal Gradient algorithm: Beck-Teboulle (2010); Combettes-Pesquet (2011); Parikh-Boyd (2013).
- A generalization of the gradient algorithm to a composite objective function.
- A Majorize-Minimize algorithm, built from a quadratic majorization of f (valid since ∇f is Lipschitz), which produces a sequence {θ_n, n ≥ 0} such that F(θ_{n+1}) ≤ F(θ_n).
In our frameworks, ∇f(θ) is not available.
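A minimal sketch of the proximal gradient iteration on a lasso-type problem, where the prox has a closed form (soft-thresholding); the data are synthetic assumptions, not from the talk:

```python
import numpy as np

# f(theta) = 0.5 * ||A theta - b||^2 (smooth), g(theta) = lam * ||theta||_1
# (non-smooth, with an explicit prox).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)
lam = 1.0

def grad_f(theta):
    return A.T @ (A @ theta - b)

def prox_g(tau, gamma):
    # Prox_{gamma, g}(tau) = argmin_theta g(theta) + ||theta - tau||^2 / (2*gamma)
    # = componentwise soft-thresholding at level gamma * lam.
    return np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam, 0.0)

L = np.linalg.norm(A.T @ A, 2)       # Lipschitz constant of grad f
gamma = 1.0 / L                      # constant stepsize in (0, 1/L]

obj = lambda t: 0.5 * np.sum((A @ t - b) ** 2) + lam * np.sum(np.abs(t))
theta = np.zeros(10)
values = [obj(theta)]
for _ in range(200):
    theta = prox_g(theta - gamma * grad_f(theta), gamma)
    values.append(obj(theta))

# Monotone decrease F(theta_{n+1}) <= F(theta_n), as stated on the slide.
monotone = all(v1 <= v0 + 1e-10 for v0, v1 in zip(values, values[1:]))
print(monotone)
```

With γ ∈ (0, 1/L], the quadratic majorization argument guarantees the monotone decrease checked in the last line.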
Numerical solution: a perturbed proximal-gradient algorithm

The Perturbed Proximal Gradient algorithm. Given a stepsize sequence {γ_n, n ≥ 0}, iterate
θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} H_{n+1} )
where H_{n+1} is an approximation of ∇f(θ_n).
Useful for the proof: observe
θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} ∇f(θ_n) − γ_{n+1} (H_{n+1} − ∇f(θ_n)) )
where H_{n+1} − ∇f(θ_n) is the perturbation.
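A sketch of the perturbed iteration on the same lasso-type toy problem: the exact gradient is replaced by H_{n+1} = ∇f(θ_n) + η_{n+1}, where η_{n+1} is a synthetic Gaussian perturbation whose variance shrinks as more draws are averaged (a stand-in for the Monte Carlo approximations of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)
lam = 1.0
grad_f = lambda t: A.T @ (A @ t - b)
prox_g = lambda tau, g: np.sign(tau) * np.maximum(np.abs(tau) - g * lam, 0.0)
L = np.linalg.norm(A.T @ A, 2)
obj = lambda t: 0.5 * np.sum((A @ t - b) ** 2) + lam * np.sum(np.abs(t))

# Reference: exact proximal gradient, run long enough to be near-optimal.
theta_exact = np.zeros(10)
for _ in range(500):
    theta_exact = prox_g(theta_exact - grad_f(theta_exact) / L, 1.0 / L)

# Perturbed proximal gradient: H_{n+1} = grad f(theta_n) + eta_{n+1},
# with eta_{n+1} an average of m_n standard Gaussian draws (growing effort,
# so the perturbation variance decays like 1/n).
theta = np.zeros(10)
for n in range(1, 501):
    m_n = n
    eta = rng.normal(size=(m_n, 10)).mean(axis=0)
    H = grad_f(theta) + eta
    theta = prox_g(theta - H / L, 1.0 / L)

gap = obj(theta) - obj(theta_exact)
print(gap)  # small optimality gap despite the noisy gradients
```

This illustrates the regime covered by the convergence theorem below: the perturbation need not vanish at any fixed iteration, but its accumulated weighted series must behave well.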
Convergence result: the assumptions (1/2)

argmin_{θ ∈ Θ} F(θ) with F(θ) = f(θ) + g(θ), where
- the function g: R^d → [0, ∞] is convex, non-smooth, not identically equal to +∞, and lower semi-continuous;
- the function f: R^d → R is a smooth convex function, i.e. f is continuously differentiable and there exists L > 0 such that
  ‖∇f(θ) − ∇f(θ′)‖ ≤ L ‖θ − θ′‖  for all θ, θ′ ∈ R^d;
- Θ ⊆ R^d is the domain of g: Θ = {θ ∈ R^d : g(θ) < ∞};
- the set argmin_Θ F is a non-empty subset of Θ.
Convergence result (2/2)

θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} H_{n+1} ) with H_{n+1} ≈ ∇f(θ_n).
Set L = argmin_Θ (f + g) and η_{n+1} = H_{n+1} − ∇f(θ_n).

Theorem (Atchadé, F., Moulines (2017)). Assume
- g convex, lower semi-continuous; f convex, C¹ and its gradient is Lipschitz with constant L; L is non-empty;
- ∑_n γ_n = +∞ and γ_n ∈ (0, 1/L];
- convergence of the series
  ∑_n γ²_{n+1} ‖η_{n+1}‖²,  ∑_n γ_{n+1} η_{n+1},  ∑_n γ_{n+1} ⟨T_n, η_{n+1}⟩,
  where T_n = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} ∇f(θ_n) ).
Then there exists θ⋆ ∈ L such that lim_n θ_n = θ⋆.
Sketch of proof

Its proof relies on:
1. A deterministic Lyapunov inequality:
‖θ_{n+1} − θ⋆‖² ≤ ‖θ_n − θ⋆‖² − 2γ_{n+1} ( F(θ_{n+1}) − min F ) − 2γ_{n+1} ⟨T_n − θ⋆, η_{n+1}⟩ + 2γ²_{n+1} ‖η_{n+1}‖²
where F(θ_{n+1}) − min F is non-negative and the inner-product term is a signed noise.
2. (An extension of) the Robbins-Siegmund lemma:
Let {v_n, n ≥ 0} and {χ_n, n ≥ 0} be non-negative sequences and {ξ_n, n ≥ 0} be such that ∑_n ξ_n exists. If for any n ≥ 0,
v_{n+1} ≤ v_n − χ_{n+1} + ξ_{n+1},
then ∑_n χ_n < ∞ and lim_n v_n exists.
Remark: a deterministic lemma, with signed noise.
What about Nesterov-based acceleration? (FISTA)

Let {t_n, n ≥ 0} be a positive sequence s.t. γ_{n+1} t_n (t_n − 1) ≤ γ_n t²_{n−1}.
Nesterov acceleration of the Proximal Gradient algorithm:
θ_{n+1} = Prox_{γ_{n+1}, g}( τ_n − γ_{n+1} ∇f(τ_n) )
τ_{n+1} = θ_{n+1} + ((t_n − 1)/t_{n+1}) (θ_{n+1} − θ_n)
Nesterov (2004); Tseng (2008); Beck-Teboulle (2009); Zhu-Orecchia (2015); Attouch-Peypouquet (2015); Bubeck-Lee-Singh (2015); Su-Boyd-Candes (2015).
- (deterministic) Proximal-gradient: F(θ_n) − min F = O(1/n)
- (deterministic) Accelerated Proximal-gradient: F(θ_n) − min F = O(1/n²)
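A sketch of the accelerated iteration next to the plain one, on the lasso-type toy problem used earlier; t_n follows the classical FISTA rule t_{n+1} = (1 + √(1 + 4t_n²))/2 with a constant stepsize γ = 1/L (satisfying the inequality above). Problem data are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)
lam = 1.0
grad_f = lambda t: A.T @ (A @ t - b)
prox_g = lambda tau, g: np.sign(tau) * np.maximum(np.abs(tau) - g * lam, 0.0)
L = np.linalg.norm(A.T @ A, 2)
gamma = 1.0 / L
obj = lambda t: 0.5 * np.sum((A @ t - b) ** 2) + lam * np.sum(np.abs(t))

n_iter = 100

# Plain proximal gradient.
theta_pg = np.zeros(10)
for _ in range(n_iter):
    theta_pg = prox_g(theta_pg - gamma * grad_f(theta_pg), gamma)

# FISTA: gradient step at the extrapolated point tau_n, then momentum update.
theta = np.zeros(10)
tau = theta.copy()
t = 1.0
for _ in range(n_iter):
    theta_new = prox_g(tau - gamma * grad_f(tau), gamma)
    t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    tau = theta_new + ((t - 1.0) / t_new) * (theta_new - theta)
    theta, t = theta_new, t_new

# Near-optimal reference value, from a long plain proximal-gradient run.
ref = np.zeros(10)
for _ in range(5000):
    ref = prox_g(ref - gamma * grad_f(ref), gamma)

gap_fista = obj(theta) - obj(ref)
gap_pg = obj(theta_pg) - obj(ref)
print(gap_pg, gap_fista)
```

On badly conditioned problems the O(1/n) vs O(1/n²) gap between the two objective curves is much more visible; here both methods converge, which is what the assertion checks.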
Convergence results for perturbed FISTA

When ∇f(τ_n) is replaced with H_{n+1} ≈ ∇f(τ_n):
Perturbed FISTA:
θ_{n+1} = Prox_{γ_{n+1}, g}( τ_n − γ_{n+1} H_{n+1} )
τ_{n+1} = θ_{n+1} + ((t_n − 1)/t_{n+1}) (θ_{n+1} − θ_n)
Under conditions on γ_n, t_n and on the perturbation η̃_{n+1} := H_{n+1} − ∇f(τ_n), such as convergence of the series
∑_n γ_{n+1} t_n ⟨z_n − θ∗, η̃_{n+1}⟩,
we have (F., Risser, Atchadé, Moulines; 2018): lim_n γ_{n+1} t²_n F(θ_n) exists, with an explicit control of this quantity.