Stochastic Forward-Backward Splitting
Silvia Villa, joint work with Lorenzo Rosasco and Bang Cong Vũ
Laboratory for Computational and Statistical Learning, IIT and MIT
http://lcsl.mit.edu/data/silviavilla
2015 Dagstuhl Seminar “Mathematical and Computational Foundations of Learning Theory”
Introduction: Problem setting

Given a separable Hilbert space $H$, we consider the problem
\[
\min_{w \in H} T(w), \qquad T(w) = F(w) + R(w),
\]
with
- $F : H \to \mathbb{R}$ convex and continuously differentiable, with Lipschitz continuous gradient, i.e., $\|\nabla F(w) - \nabla F(w')\| \le \beta \|w - w'\|$
- $R : H \to \mathbb{R} \cup \{+\infty\}$ proper, convex, and lower semicontinuous
Introduction: Statistical learning with regularization

Given: a Hilbert space $H$, a measure space $(\Omega, \mathcal{A}, P)$, and a loss function $\ell : H \times \Omega \to \mathbb{R}_+$.

Goal: approximate the infimum of
\[
T(w) = F(w) + R(w), \qquad F(w) = \int_\Omega \ell(w, \xi)\, dP(\xi),
\]
given a training set $\{\xi_1, \ldots, \xi_m\}$ of points sampled from $P$.

If, for every $\xi \in \Omega$:
- $\ell(\cdot, \xi)$ is convex
- $\nabla \ell(\cdot, \xi)$ is Lipschitz continuous (uniformly w.r.t. $\xi$)

then $F$ is convex and $\nabla F$ is Lipschitz continuous.
Introduction: Statistical learning with regularization, cont'd

A common strategy is to minimize
\[
T(w) = \frac{1}{m} \sum_{i=1}^m \ell(w, \xi_i) + R(w).
\]

Example. Given a Hilbert space $X$ (input space), $Y \subset \mathbb{R}$ (output space), and $L : \mathbb{R} \times Y \to \mathbb{R}_+$, set
\[
\Omega = X \times Y, \qquad \xi = (x, y), \qquad \ell(w, \xi) = L(\langle w, x \rangle, y).
\]
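To make the example concrete, here is a minimal Python sketch (not part of the slides) of the regularized empirical objective for the linear model above, assuming the square loss $L(u, y) = \tfrac{1}{2}(u - y)^2$ and the $\ell^1$ regularizer $R(w) = \lambda \|w\|_1$; the function and variable names are illustrative.

```python
import numpy as np

def empirical_objective(w, X, y, lam):
    """T(w) = (1/m) * sum_i L(<w, x_i>, y_i) + R(w), with the square loss and an l1 penalty."""
    residuals = X @ w - y                      # <w, x_i> - y_i for each sample
    data_term = 0.5 * np.mean(residuals ** 2)  # (1/m) * sum_i 0.5 * (<w, x_i> - y_i)^2
    reg_term = lam * np.linalg.norm(w, 1)      # R(w) = lam * ||w||_1
    return data_term + reg_term
```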
Introduction: Forward-backward splitting algorithm (proximal gradient algorithm)

Given $w_0 \in H$ and $\gamma_n \in [\epsilon, 2/\beta - \epsilon]$, define

(FB) $\quad w_{n+1} = \mathrm{prox}_{\gamma_n R}(w_n - \gamma_n \nabla F(w_n))$

with
\[
\mathrm{prox}_{\gamma R}(w) = \operatorname*{argmin}_{v \in H} \left\{ R(v) + \frac{1}{2\gamma} \|v - w\|^2 \right\}.
\]

See [Bauschke-Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2011].
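A minimal sketch (again not from the slides) of how the FB iteration can be implemented for the square-loss/$\ell^1$ instance above: `soft_threshold` is the standard closed form of $\mathrm{prox}_{t\|\cdot\|_1}$, the step size is the constant choice $\gamma = 1/\beta$, and all other names are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Closed form of prox_{t * ||.||_1}: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def forward_backward(X, y, lam, n_iter=1000):
    """FB: w_{n+1} = prox_{gamma R}(w_n - gamma * grad F(w_n)) for F(w) = (1/2m) ||Xw - y||^2."""
    m, d = X.shape
    w = np.zeros(d)
    beta = np.linalg.norm(X, 2) ** 2 / m       # Lipschitz constant of grad F
    gamma = 1.0 / beta                         # constant step size in (0, 2/beta)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / m           # grad F(w_n)
        w = soft_threshold(w - gamma * grad, gamma * lam)
    return w
```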
Algorithm and convergence results: Stochastic forward-backward splitting algorithm

Given $w_0$ such that $\mathbb{E}[\|w_0\|^2] < +\infty$, step sizes $\gamma_n > 0$, and relaxation parameters $\lambda_n \in [0, 1]$, define

(SFB)
\[
y_n = \mathrm{prox}_{\gamma_n R}(w_n - \gamma_n G_n), \qquad w_{n+1} = (1 - \lambda_n) w_n + \lambda_n y_n,
\]
where $G_n$ is a stochastic estimate of the gradient $\nabla F(w_n)$. For $\lambda_n \equiv 1$ this reduces to the unrelaxed update $w_{n+1} = \mathrm{prox}_{\gamma_n R}(w_n - \gamma_n G_n)$.
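A minimal, generic sketch of the relaxed SFB iteration above, assuming a user-supplied stochastic gradient oracle and proximity operator; the signatures `grad_estimate(w, n)` and `prox(v, gamma)` and the callables `gammas`, `lambdas` are illustrative choices, not a fixed API.

```python
import numpy as np

def stochastic_forward_backward(grad_estimate, prox, w0, gammas, lambdas, n_iter=1000):
    """Relaxed SFB sketch:
        y_n     = prox(w_n - gamma_n * G_n, gamma_n)
        w_{n+1} = (1 - lambda_n) * w_n + lambda_n * y_n
    where grad_estimate(w, n) returns a stochastic estimate G_n of grad F(w_n)."""
    w = np.array(w0, dtype=float)
    for n in range(n_iter):
        G = grad_estimate(w, n)                # stochastic gradient estimate G_n
        y = prox(w - gammas(n) * G, gammas(n))
        w = (1.0 - lambdas(n)) * w + lambdas(n) * y
    return w
```

For the square-loss/$\ell^1$ instance, `grad_estimate` could return the gradient of the loss on a single randomly drawn sample, and `prox` would be the soft-thresholding map from the FB sketch above.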
Algorithm and convergence results: Online learning point of view

Duchi-Singer 2009; Langford-Li-Zhang 2009; Shalev-Shwartz-Shamir-Srebro-Sridharan 2009; Kakade-Tewari 2008; Bottou-Bousquet 2008; Hazan-Kalai-Kale-Agarwal 2006; ...

Convergence analysis is often based on:
regret estimation + online-to-batch conversions [Cesa-Bianchi-Conconi-Gentile 2004],
which imply averaging of the iterates.
Algorithm and convergence results: Outline

- Convergence results for stochastic forward-backward for minimization problems
- Extension to monotone inclusions
- Primal-dual stochastic proximal methods
Algorithm and convergence results: Assumptions

Let $(\Omega, \mathcal{F}, P)$ be a probability space. Define the filtration $\mathcal{F}_n = \sigma(\{w_0, \ldots, w_n\})$ and assume
\[
\mathbb{E}[\|G_n\|^2] < +\infty, \qquad
\mathbb{E}[G_n \mid \mathcal{F}_n] = \nabla F(w_n), \qquad
\mathbb{E}[\|G_n - \nabla F(w_n)\|^2 \mid \mathcal{F}_n] \le \sigma^2 \big(1 + \|\nabla F(w_n)\|^2\big).
\]
Algorithm and convergence results: The case of statistical learning

Given a Hilbert space, the objective is to minimize
\[
T(w) = \int_\Omega \ell(w, \xi)\, dP(\xi) + R(w),
\]
given a sequence of i.i.d. samples $(\xi_i)_{i \in \mathbb{N}}$. Then $\mathcal{F}_n = \sigma(\xi_1, \ldots, \xi_n)$ and
\[
G_n = \nabla \ell(\cdot, \xi_n)(w_n) \implies \mathbb{E}[G_n \mid \mathcal{F}_n] = \nabla F(w_n).
\]
The condition $\mathbb{E}[\|G_n - \nabla F(w_n)\|^2 \mid \mathcal{F}_n] \le \sigma^2 (1 + \|\nabla F(w_n)\|^2)$ is a condition on the variance of the random variable $\xi \in \Omega \mapsto \nabla \ell(\cdot, \xi)(w)$.
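As a small illustration (not from the slides), the following sketch estimates this variance empirically for the square loss, taking a finite sample as the distribution and a single arbitrary query point $w$, and reports the smallest $\sigma^2$ that would satisfy the bound at that point; all data and names are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical check of E[||G - grad F(w)||^2] <= sigma^2 * (1 + ||grad F(w)||^2) at a fixed w,
# for the square loss on a finite sample (the empirical distribution plays the role of P).
m, d = 500, 10
X = rng.standard_normal((m, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(m)
w = rng.standard_normal(d)                     # an arbitrary query point

full_grad = X.T @ (X @ w - y) / m              # grad F(w) for the empirical objective
per_sample = (X @ w - y)[:, None] * X          # row i holds the single-sample gradient at w
variance = np.mean(np.sum((per_sample - full_grad) ** 2, axis=1))

print("variance at w:", variance)
print("smallest sigma^2 satisfying the bound at w:", variance / (1.0 + np.sum(full_grad ** 2)))
```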
Algorithm and convergence results: Statistical learning - Incremental FB algorithm

The goal is to minimize
\[
T(w) = \frac{1}{m} \sum_{i=1}^m \ell(w, \xi_i) + R(w).
\]
Let $i_n : \Omega \to \{1, \ldots, m\}$ be a sequence of independent random variables such that, for every $n$ and $i$, $P[i_n = i] = 1/m$. Then
\[
G_n = \nabla \ell(\cdot, \xi_{i_n})(w_n)
\]
is such that $\mathbb{E}[G_n \mid \mathcal{F}_n] = \mathbb{E}[G_n] = \frac{1}{m} \sum_{i=1}^m \nabla \ell(\cdot, \xi_i)(w_n)$.
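A minimal sketch (not from the slides) of such an incremental gradient oracle for the square loss: the index $i_n$ is drawn uniformly, and averaging the returned vectors over all indices recovers the full empirical gradient; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def incremental_gradient(w, X, y):
    """Draw i_n uniformly from {1, ..., m} and return the gradient of the square loss on
    that single sample; averaging over all i recovers (1/m) * X.T @ (X @ w - y)."""
    m = X.shape[0]
    i = rng.integers(m)                        # P[i_n = i] = 1/m
    return (X[i] @ w - y[i]) * X[i]            # grad_w of 0.5 * (<w, x_i> - y_i)^2
```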
Algorithm and convergence results: Comparison between the FB and SFB algorithms

The stochastic incremental FB algorithm becomes

(SFB) $\quad w_{n+1} = \mathrm{prox}_{\gamma_n R}\big(w_n - \gamma_n \nabla_w \ell(w_n, \xi_{i_n})\big)$

The FB algorithm is

(FB) $\quad w_{n+1} = \mathrm{prox}_{\gamma_n R}\Big(w_n - \gamma_n \frac{1}{m} \sum_{i=1}^m \nabla_w \ell(w_n, \xi_i)\Big)$
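The practical difference is the per-iteration cost: one sampled gradient versus the full empirical gradient. A minimal sketch of the two updates for the square-loss/$\ell^1$ instance (names illustrative; `soft_threshold` is repeated here to keep the snippet self-contained):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sfb_step(w, X, y, lam, gamma, i):
    """One SFB step: gradient of the loss on sample i only (cost O(d)), then the l1 prox."""
    grad_i = (X[i] @ w - y[i]) * X[i]
    return soft_threshold(w - gamma * grad_i, gamma * lam)

def fb_step(w, X, y, lam, gamma):
    """One FB step: full empirical gradient (cost O(m d)), then the l1 prox."""
    grad = X.T @ (X @ w - y) / X.shape[0]
    return soft_threshold(w - gamma * grad, gamma * lam)
```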
Algorithm and convergence results: Convergence results

Assume that a solution $\bar{w}$ exists. Given $w_0$ such that $\mathbb{E}[\|w_0\|^2] < +\infty$, step sizes $\gamma_n$, and gradient estimates $G_n$, consider

(SFB) $\quad w_{n+1} = \mathrm{prox}_{\gamma_n R}(w_n - \gamma_n G_n)$

Our contributions:
1. Convergence rates for $\mathbb{E}[\|w_n - \bar{w}\|^2]$
2. Almost sure convergence of $(w_n)$
Algorithm and convergence results: Convergence rates for $\mathbb{E}[\|w_n - \bar{w}\|^2]$

Main assumption: $F$ is $\mu$-strongly convex and $R$ is $\nu$-strongly convex, with $\mu + \nu > 0$.

Theorem. Let $\alpha > 0$ and $\theta \in \,]0, 1]$. Assume that $\gamma_n = \alpha / n^\theta$ and suppose that there exists $\epsilon > 0$ such that
\[
\gamma_n < \frac{2 - \epsilon}{(1 + 2\sigma^2)\beta}
\]
($\beta$ is the Lipschitz constant of $\nabla F$). Then, setting $c = 2\alpha(\nu + \mu\epsilon)/(1 + \nu)^2$,
\[
\mathbb{E}[\|w_n - \bar{w}\|^2] \le
\begin{cases}
O(1/n^\theta) & \text{if } \theta \in \,]0, 1[ \\
O(1/n^c) + O(1/n) & \text{if } \theta = 1
\end{cases}
\]
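As an illustration only (not from the slides), here is a sketch of SFB with the decaying step size $\gamma_n = \alpha/n^\theta$ on a synthetic strongly convex instance ($F$ is $\mu$-strongly convex through a ridge term, $R = \lambda\|\cdot\|_1$ so $\nu = 0$). The constants, the step-size cap used for stability, and the reference solution computed by deterministic FB are ad hoc choices, not the theorem's exact conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Toy instance: F(w) = (1/2m) ||Xw - y||^2 + (mu/2) ||w||^2 (mu-strongly convex),
# R(w) = lam * ||w||_1 (nu = 0), so mu + nu > 0 as required.
m, d, mu, lam = 200, 10, 0.1, 0.05
X = rng.standard_normal((m, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(m)

def grad_F(w):
    return X.T @ (X @ w - y) / m + mu * w

# Reference solution via many deterministic FB iterations.
beta = np.linalg.norm(X, 2) ** 2 / m + mu      # Lipschitz constant of grad F
w_bar = np.zeros(d)
for _ in range(5000):
    w_bar = soft_threshold(w_bar - grad_F(w_bar) / beta, lam / beta)

# SFB with gamma_n = alpha / n^theta (here theta = 1) and single-sample gradient estimates.
alpha, theta = 1.0 / mu, 1.0
w = np.zeros(d)
for n in range(1, 10001):
    i = rng.integers(m)
    G = (X[i] @ w - y[i]) * X[i] + mu * w      # stochastic estimate of grad F(w_n)
    gamma = min(alpha / n ** theta, 1.0 / beta)  # decaying step size, capped for stability
    w = soft_threshold(w - gamma * G, gamma * lam)

print("squared distance to the FB reference solution:", np.sum((w - w_bar) ** 2))
```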
Algorithm and convergence results: Remarks and related work

- $c$ can be made greater than 1 by properly choosing $\alpha$ (knowledge of $\mu$ and $\nu$ is required)
- the obtained rate of convergence is the same that can be obtained using “accelerated” methods (see e.g. [Kwok-Hu-Pan, NIPS 2009; Ghadimi-Lan 2012; Li-Chen-Peña 2014])
- the result is not asymptotic: an explicit estimate of the constants in the $O$ terms is available (Chung's lemma)