Stochastic Forward-Backward Splitting
Silvia Villa, joint work with Lorenzo Rosasco and Bang Cong Vũ
Laboratory for Computational and Statistical Learning, IIT and MIT
http://lcsl.mit.edu/data/silviavilla
2015 Dagstuhl Seminar “Mathematical and Computational Foundations of Learning Theory”
Introduction: Problem setting

Given a separable Hilbert space $H$, we consider the problem
\[
\min_{w \in H} T(w), \qquad T(w) = F(w) + R(w),
\]
with
- $F : H \to \mathbb{R}$ convex and continuously differentiable, with Lipschitz continuous gradient, i.e., $\|\nabla F(w) - \nabla F(w')\| \le \beta \|w - w'\|$
- $R : H \to \mathbb{R} \cup \{+\infty\}$ proper, convex, and lower semicontinuous
Introduction: Statistical learning with regularization

Given: a Hilbert space $H$, a measure space $(\Omega, \mathcal{A}, P)$, and a loss function $\ell : H \times \Omega \to \mathbb{R}_+$.

Goal: approximate the infimum of
\[
T(w) = F(w) + R(w), \qquad F(w) = \int_\Omega \ell(w, \xi)\, dP(\xi),
\]
given a training set $\{\xi_1, \ldots, \xi_m\}$ of points sampled from $P$.

If, for every $\xi \in \Omega$:
- $\ell(\cdot, \xi)$ is convex
- $\nabla \ell(\cdot, \xi)$ is Lipschitz continuous (uniformly w.r.t. $\xi$)

then $F$ is convex and $\nabla F$ is Lipschitz continuous.
Introduction: Statistical learning with regularization, cont'd

A common strategy is to minimize
\[
T(w) = \frac{1}{m} \sum_{i=1}^m \ell(w, \xi_i) + R(w).
\]

Example. Given a Hilbert space $X$ (input space), $Y \subset \mathbb{R}$ (output space), and $L : \mathbb{R} \times Y \to \mathbb{R}_+$, set
\[
\Omega = X \times Y, \qquad \xi = (x, y), \qquad \ell(w, \xi) = L(\langle w, x \rangle, y).
\]
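To make the example concrete, here is a minimal Python sketch (not part of the slides) of the regularized empirical objective for the linear model above, assuming the square loss $L(u, y) = \tfrac{1}{2}(u - y)^2$ and the $\ell^1$ regularizer $R(w) = \lambda \|w\|_1$; the function and variable names are illustrative.

```python
import numpy as np

def empirical_objective(w, X, y, lam):
    """T(w) = (1/m) * sum_i L(<w, x_i>, y_i) + R(w), with the square loss and an l1 penalty."""
    residuals = X @ w - y                      # <w, x_i> - y_i for each sample
    data_term = 0.5 * np.mean(residuals ** 2)  # (1/m) * sum_i 0.5 * (<w, x_i> - y_i)^2
    reg_term = lam * np.linalg.norm(w, 1)      # R(w) = lam * ||w||_1
    return data_term + reg_term
```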
Introduction: Forward-backward splitting algorithm (proximal gradient algorithm)

Given $w_0 \in H$ and $\gamma_n \in [\epsilon, 2/\beta - \epsilon]$, define

(FB) $\quad w_{n+1} = \mathrm{prox}_{\gamma_n R}(w_n - \gamma_n \nabla F(w_n))$

with
\[
\mathrm{prox}_{\gamma R}(w) = \operatorname*{argmin}_{v \in H} \left\{ R(v) + \frac{1}{2\gamma} \|v - w\|^2 \right\}.
\]

See [Bauschke-Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2011].
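A minimal sketch (again not from the slides) of how the FB iteration can be implemented for the square-loss/$\ell^1$ instance above: `soft_threshold` is the standard closed form of $\mathrm{prox}_{t\|\cdot\|_1}$, the step size is the constant choice $\gamma = 1/\beta$, and all other names are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Closed form of prox_{t * ||.||_1}: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def forward_backward(X, y, lam, n_iter=1000):
    """FB: w_{n+1} = prox_{gamma R}(w_n - gamma * grad F(w_n)) for F(w) = (1/2m) ||Xw - y||^2."""
    m, d = X.shape
    w = np.zeros(d)
    beta = np.linalg.norm(X, 2) ** 2 / m       # Lipschitz constant of grad F
    gamma = 1.0 / beta                         # constant step size in (0, 2/beta)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / m           # grad F(w_n)
        w = soft_threshold(w - gamma * grad, gamma * lam)
    return w
```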
Algorithm and convergence results: Stochastic forward-backward splitting algorithm

Given $w_0$ such that $\mathbb{E}[\|w_0\|^2] < +\infty$, step sizes $\gamma_n > 0$, and relaxation parameters $\lambda_n \in [0, 1]$, define

(SFB)
\[
y_n = \mathrm{prox}_{\gamma_n R}(w_n - \gamma_n G_n), \qquad w_{n+1} = (1 - \lambda_n) w_n + \lambda_n y_n,
\]
where $G_n$ is a stochastic estimate of the gradient $\nabla F(w_n)$. For $\lambda_n \equiv 1$ this reduces to the unrelaxed update $w_{n+1} = \mathrm{prox}_{\gamma_n R}(w_n - \gamma_n G_n)$.
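A minimal, generic sketch of the relaxed SFB iteration above, assuming a user-supplied stochastic gradient oracle and proximity operator; the signatures `grad_estimate(w, n)` and `prox(v, gamma)` and the callables `gammas`, `lambdas` are illustrative choices, not a fixed API.

```python
import numpy as np

def stochastic_forward_backward(grad_estimate, prox, w0, gammas, lambdas, n_iter=1000):
    """Relaxed SFB sketch:
        y_n     = prox(w_n - gamma_n * G_n, gamma_n)
        w_{n+1} = (1 - lambda_n) * w_n + lambda_n * y_n
    where grad_estimate(w, n) returns a stochastic estimate G_n of grad F(w_n)."""
    w = np.array(w0, dtype=float)
    for n in range(n_iter):
        G = grad_estimate(w, n)                # stochastic gradient estimate G_n
        y = prox(w - gammas(n) * G, gammas(n))
        w = (1.0 - lambdas(n)) * w + lambdas(n) * y
    return w
```

For the square-loss/$\ell^1$ instance, `grad_estimate` could return the gradient of the loss on a single randomly drawn sample, and `prox` would be the soft-thresholding map from the FB sketch above.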
Algorithm and convergence results: Online learning point of view

Duchi-Singer 2009; Langford-Li-Zhang 2009; Shalev-Shwartz-Shamir-Srebro-Sridharan 2009; Kakade-Tewari 2008; Bottou-Bousquet 2008; Hazan-Kalai-Kale-Agarwal 2006; ...

Convergence analysis is often based on:
regret estimation + online-to-batch conversions [Cesa-Bianchi-Conconi-Gentile 2004],
which imply averaging of the iterates.
Algorithm and convergence results: Outline

- Convergence results for stochastic forward-backward for minimization problems
- Extension to monotone inclusions
- Primal-dual stochastic proximal methods
Algorithm and convergence results: Assumptions

Let $(\Omega, \mathcal{F}, P)$ be a probability space. Define the filtration $\mathcal{F}_n = \sigma(\{w_0, \ldots, w_n\})$ and assume
\[
\mathbb{E}[\|G_n\|^2] < +\infty, \qquad
\mathbb{E}[G_n \mid \mathcal{F}_n] = \nabla F(w_n), \qquad
\mathbb{E}[\|G_n - \nabla F(w_n)\|^2 \mid \mathcal{F}_n] \le \sigma^2 \big(1 + \|\nabla F(w_n)\|^2\big).
\]
Algorithm and convergence results: The case of statistical learning

Given a Hilbert space, the objective is to minimize
\[
T(w) = \int_\Omega \ell(w, \xi)\, dP(\xi) + R(w),
\]
given a sequence of i.i.d. samples $(\xi_i)_{i \in \mathbb{N}}$. Then $\mathcal{F}_n = \sigma(\xi_1, \ldots, \xi_n)$ and
\[
G_n = \nabla \ell(\cdot, \xi_n)(w_n) \implies \mathbb{E}[G_n \mid \mathcal{F}_n] = \nabla F(w_n).
\]
The condition $\mathbb{E}[\|G_n - \nabla F(w_n)\|^2 \mid \mathcal{F}_n] \le \sigma^2 (1 + \|\nabla F(w_n)\|^2)$ is a condition on the variance of the random variable $\xi \in \Omega \mapsto \nabla \ell(\cdot, \xi)(w)$.
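As a small illustration (not from the slides), the following sketch estimates this variance empirically for the square loss, taking a finite sample as the distribution and a single arbitrary query point $w$, and reports the smallest $\sigma^2$ that would satisfy the bound at that point; all data and names are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical check of E[||G - grad F(w)||^2] <= sigma^2 * (1 + ||grad F(w)||^2) at a fixed w,
# for the square loss on a finite sample (the empirical distribution plays the role of P).
m, d = 500, 10
X = rng.standard_normal((m, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(m)
w = rng.standard_normal(d)                     # an arbitrary query point

full_grad = X.T @ (X @ w - y) / m              # grad F(w) for the empirical objective
per_sample = (X @ w - y)[:, None] * X          # row i holds the single-sample gradient at w
variance = np.mean(np.sum((per_sample - full_grad) ** 2, axis=1))

print("variance at w:", variance)
print("smallest sigma^2 satisfying the bound at w:", variance / (1.0 + np.sum(full_grad ** 2)))
```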
Algorithm and convergence results: Statistical learning - Incremental FB algorithm

The goal is to minimize
\[
T(w) = \frac{1}{m} \sum_{i=1}^m \ell(w, \xi_i) + R(w).
\]
Let $i_n : \Omega \to \{1, \ldots, m\}$ be a sequence of independent random variables such that, for every $n$ and $i$, $P[i_n = i] = 1/m$. Then
\[
G_n = \nabla \ell(\cdot, \xi_{i_n})(w_n)
\]
is such that $\mathbb{E}[G_n \mid \mathcal{F}_n] = \mathbb{E}[G_n] = \frac{1}{m} \sum_{i=1}^m \nabla \ell(\cdot, \xi_i)(w_n)$.
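A minimal sketch (not from the slides) of such an incremental gradient oracle for the square loss: the index $i_n$ is drawn uniformly, and averaging the returned vectors over all indices recovers the full empirical gradient; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def incremental_gradient(w, X, y):
    """Draw i_n uniformly from {1, ..., m} and return the gradient of the square loss on
    that single sample; averaging over all i recovers (1/m) * X.T @ (X @ w - y)."""
    m = X.shape[0]
    i = rng.integers(m)                        # P[i_n = i] = 1/m
    return (X[i] @ w - y[i]) * X[i]            # grad_w of 0.5 * (<w, x_i> - y_i)^2
```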
Algorithm and convergence results: Comparison between the FB and SFB algorithms

The stochastic incremental FB algorithm becomes

(SFB) $\quad w_{n+1} = \mathrm{prox}_{\gamma_n R}\big(w_n - \gamma_n \nabla_w \ell(w_n, \xi_{i_n})\big)$

The FB algorithm is

(FB) $\quad w_{n+1} = \mathrm{prox}_{\gamma_n R}\Big(w_n - \gamma_n \frac{1}{m} \sum_{i=1}^m \nabla_w \ell(w_n, \xi_i)\Big)$
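The practical difference is the per-iteration cost: one sampled gradient versus the full empirical gradient. A minimal sketch of the two updates for the square-loss/$\ell^1$ instance (names illustrative; `soft_threshold` is repeated here to keep the snippet self-contained):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sfb_step(w, X, y, lam, gamma, i):
    """One SFB step: gradient of the loss on sample i only (cost O(d)), then the l1 prox."""
    grad_i = (X[i] @ w - y[i]) * X[i]
    return soft_threshold(w - gamma * grad_i, gamma * lam)

def fb_step(w, X, y, lam, gamma):
    """One FB step: full empirical gradient (cost O(m d)), then the l1 prox."""
    grad = X.T @ (X @ w - y) / X.shape[0]
    return soft_threshold(w - gamma * grad, gamma * lam)
```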
Algorithm and convergence results: Convergence results

Assume that a solution $\bar{w}$ exists. Given $w_0$ such that $\mathbb{E}[\|w_0\|^2] < +\infty$, step sizes $\gamma_n$, and gradient estimates $G_n$, consider

(SFB) $\quad w_{n+1} = \mathrm{prox}_{\gamma_n R}(w_n - \gamma_n G_n)$

Our contributions:
1. Convergence rates for $\mathbb{E}[\|w_n - \bar{w}\|^2]$
2. Almost sure convergence of $(w_n)$
Algorithm and convergence results: Convergence rates for $\mathbb{E}[\|w_n - \bar{w}\|^2]$

Main assumption: $F$ is $\mu$-strongly convex and $R$ is $\nu$-strongly convex, with $\mu + \nu > 0$.

Theorem. Let $\alpha > 0$ and $\theta \in \,]0, 1]$. Assume that $\gamma_n = \alpha / n^\theta$ and suppose that there exists $\epsilon > 0$ such that
\[
\gamma_n < \frac{2 - \epsilon}{(1 + 2\sigma^2)\beta}
\]
($\beta$ is the Lipschitz constant of $\nabla F$). Then, setting $c = 2\alpha(\nu + \mu\epsilon)/(1 + \nu)^2$,
\[
\mathbb{E}[\|w_n - \bar{w}\|^2] \le
\begin{cases}
O(1/n^\theta) & \text{if } \theta \in \,]0, 1[ \\
O(1/n^c) + O(1/n) & \text{if } \theta = 1
\end{cases}
\]
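As an illustration only (not from the slides), here is a sketch of SFB with the decaying step size $\gamma_n = \alpha/n^\theta$ on a synthetic strongly convex instance ($F$ is $\mu$-strongly convex through a ridge term, $R = \lambda\|\cdot\|_1$ so $\nu = 0$). The constants, the step-size cap used for stability, and the reference solution computed by deterministic FB are ad hoc choices, not the theorem's exact conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Toy instance: F(w) = (1/2m) ||Xw - y||^2 + (mu/2) ||w||^2 (mu-strongly convex),
# R(w) = lam * ||w||_1 (nu = 0), so mu + nu > 0 as required.
m, d, mu, lam = 200, 10, 0.1, 0.05
X = rng.standard_normal((m, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(m)

def grad_F(w):
    return X.T @ (X @ w - y) / m + mu * w

# Reference solution via many deterministic FB iterations.
beta = np.linalg.norm(X, 2) ** 2 / m + mu      # Lipschitz constant of grad F
w_bar = np.zeros(d)
for _ in range(5000):
    w_bar = soft_threshold(w_bar - grad_F(w_bar) / beta, lam / beta)

# SFB with gamma_n = alpha / n^theta (here theta = 1) and single-sample gradient estimates.
alpha, theta = 1.0 / mu, 1.0
w = np.zeros(d)
for n in range(1, 10001):
    i = rng.integers(m)
    G = (X[i] @ w - y[i]) * X[i] + mu * w      # stochastic estimate of grad F(w_n)
    gamma = min(alpha / n ** theta, 1.0 / beta)  # decaying step size, capped for stability
    w = soft_threshold(w - gamma * G, gamma * lam)

print("squared distance to the FB reference solution:", np.sum((w - w_bar) ** 2))
```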
Algorithm and convergence results: Remarks and related work

- $c$ can be made greater than 1 by properly choosing $\alpha$ (knowledge of $\mu$ and $\nu$ is required)
- the obtained rate of convergence is the same that can be obtained using “accelerated” methods (see e.g. [Kwok-Hu-Pan, NIPS 2009; Ghadimi-Lan 2012; Li-Chen-Peña 2014])
- the result is not asymptotic: an explicit estimate of the constants in the $O$ terms is available (Chung's lemma)