
A Stochastic Gradient Method with an Exponential Convergence Rate - PowerPoint PPT Presentation



  1. A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets. Nicolas Le Roux¹٫², Mark Schmidt¹ and Francis Bach¹. ¹ Sierra project-team, INRIA - École Normale Supérieure, Paris. ² Now at Criteo. 4/12/12.

  2. Context: Machine Learning for “Big Data”. Large-scale machine learning: large n, large p. n: number of observations (inputs); p: number of parameters in the model.

  3. Context: Machine Learning for “Big Data”. Large-scale machine learning: large n, large p. n: number of observations (inputs); p: number of parameters in the model. Examples: vision, bioinformatics, speech, language, etc. Pascal large-scale datasets: n = 5·10^5, p = 10^3. ImageNet: n = 10^7. Industrial datasets: n > 10^8, p > 10^7.

  4. Context: Machine Learning for “Big Data”. Large-scale machine learning: large n, large p. n: number of observations (inputs); p: number of parameters in the model. Examples: vision, bioinformatics, speech, language, etc. Pascal large-scale datasets: n = 5·10^5, p = 10^3. ImageNet: n = 10^7. Industrial datasets: n > 10^8, p > 10^7. Main computational challenge: design algorithms for very large n and p.

  5. A standard machine learning optimization problem. We want to minimize the sum of a finite set of smooth functions: $\min_{\theta \in \mathbb{R}^p} g(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$.

  6. A standard machine learning optimization problem. We want to minimize the sum of a finite set of smooth functions: $\min_{\theta \in \mathbb{R}^p} g(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$. For instance, we may have $f_i(\theta) = \log\left(1 + \exp\left(-y_i x_i^\top \theta\right)\right) + \frac{\lambda}{2} \|\theta\|^2$.

  7. A standard machine learning optimization problem. We want to minimize the sum of a finite set of smooth functions: $\min_{\theta \in \mathbb{R}^p} g(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$. For instance, we may have $f_i(\theta) = \log\left(1 + \exp\left(-y_i x_i^\top \theta\right)\right) + \frac{\lambda}{2} \|\theta\|^2$. We will focus on strongly-convex functions g.
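A minimal NumPy sketch of this ℓ2-regularized logistic-regression objective, to make the setting concrete. The function names and the data layout (rows of X as examples, labels y in {−1, +1}, regularization constant lam) are illustrative assumptions, not part of the slides.

```python
import numpy as np

def f_i(theta, x_i, y_i, lam):
    # One component f_i(theta): logistic loss on example (x_i, y_i) plus the l2 term.
    return np.log1p(np.exp(-y_i * (x_i @ theta))) + 0.5 * lam * (theta @ theta)

def g(theta, X, y, lam):
    # Average objective g(theta) = (1/n) * sum_i f_i(theta).
    n = X.shape[0]
    return sum(f_i(theta, X[i], y[i], lam) for i in range(n)) / n
```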

  8. Deterministic methods. $\min_{\theta \in \mathbb{R}^p} g(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$. Gradient descent updates: $\theta_{k+1} = \theta_k - \alpha_k g'(\theta_k) = \theta_k - \frac{\alpha_k}{n} \sum_{i=1}^{n} f_i'(\theta_k)$. Iteration cost in O(n). Linear convergence rate in $O(C^k)$. Fancier methods exist, but still in O(n).
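For contrast with the stochastic methods on the next slides, a minimal sketch of one full (deterministic) gradient step, assuming a hypothetical helper grad_f_i(theta, i) that returns $f_i'(\theta)$ and a fixed step size alpha.

```python
import numpy as np

def full_gradient_step(theta, grad_f_i, n, alpha):
    # theta_{k+1} = theta_k - (alpha / n) * sum_i f_i'(theta_k);
    # every iteration costs n gradient evaluations.
    total = np.zeros_like(theta)
    for i in range(n):
        total += grad_f_i(theta, i)
    return theta - (alpha / n) * total
```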

  9. Stochastic methods. $\min_{\theta \in \mathbb{R}^p} g(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$. Stochastic gradient descent updates: $i(k) \sim U\{1, \dots, n\}$, $\theta_{k+1} = \theta_k - \alpha_k f_{i(k)}'(\theta_k)$. Iteration cost in O(1). Sublinear convergence rate in O(1/k). Bound on the test error valid for one pass.
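The stochastic gradient step, by comparison, touches a single randomly chosen example per iteration. Same hypothetical grad_f_i helper as above; rng is just a NumPy random generator passed in by the caller.

```python
import numpy as np

def sgd_step(theta, grad_f_i, n, alpha_k, rng):
    # Pick i(k) uniformly in {0, ..., n-1}, then take one O(1) step:
    # theta_{k+1} = theta_k - alpha_k * f_{i(k)}'(theta_k).
    i = rng.integers(n)
    return theta - alpha_k * grad_f_i(theta, i)
```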

  10. Hybrid methods. [Figure: log(excess cost) versus time for stochastic and deterministic methods.]

  11. Hybrid methods. Goal: a linear rate with an O(1) iteration cost. [Figure: log(excess cost) versus time for stochastic, deterministic, and hybrid methods.]

  12. Related work - Sublinear convergence rate. Stochastic versions of full gradient methods: Schraudolph (1999), Sunehag et al. (2009), Ghadimi and Lan (2010), Martens (2010), Xiao (2010). Momentum, gradient/iterate averaging: Polyak and Juditsky (1992), Tseng (1998), Nesterov (2009), Xiao (2010), Kushner and Yin (2003), Hazan and Kale (2011), Rakhlin et al. (2012). None of these methods improve on the O(1/k) rate.

  13. Related work - Linear convergence rate. Constant step-size SG, accelerated SG: Kesten (1958), Delyon and Juditsky (1993), Nedic and Bertsekas (2000); linear convergence, but only up to a fixed tolerance. Hybrid methods, incremental average gradient: Bertsekas (1997), Blatt et al. (2007), Friedlander and Schmidt (2012); linear rate, but iterations make full passes through the data. Stochastic methods in the dual: Shalev-Shwartz and Zhang (2012); linear rate, but limited choice for the f_i's.

  14. Stochastic Average Gradient Method. Full gradient update: $\theta_{k+1} = \theta_k - \frac{\alpha_k}{n} \sum_{i=1}^{n} f_i'(\theta_k)$.


  16. Stochastic Average Gradient Method. Stochastic average gradient update: $\theta_{k+1} = \theta_k - \frac{\alpha_k}{n} \sum_{i=1}^{n} y_i^k$. Memory: $y_i^k = f_i'(\theta_{k'})$ from the last iteration $k'$ where $i$ was selected. Random selection of $i(k)$ from $\{1, 2, \dots, n\}$. Only $f_{i(k)}'(\theta_k)$ is evaluated on each iteration.

  17. Stochastic Average Gradient Method. Stochastic average gradient update: $\theta_{k+1} = \theta_k - \frac{\alpha_k}{n} \sum_{i=1}^{n} y_i^k$. Memory: $y_i^k = f_i'(\theta_{k'})$ from the last iteration $k'$ where $i$ was selected. Random selection of $i(k)$ from $\{1, 2, \dots, n\}$. Only $f_{i(k)}'(\theta_k)$ is evaluated on each iteration. Stochastic variant of the incremental average gradient method [Blatt et al., 2007].
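A minimal sketch of the stochastic average gradient update as described on this slide: one stored gradient y_i per example plus a running sum of the memory, so that each iteration evaluates a single $f_{i(k)}'$. The bookkeeping (the y array and grad_sum) is an illustrative implementation choice, not prescribed by the slides.

```python
import numpy as np

def sag_step(theta, grad_f_i, y, grad_sum, n, alpha, rng):
    # theta_{k+1} = theta_k - (alpha / n) * sum_i y_i^k, where y_i^k is the
    # gradient of f_i stored the last time example i was selected.
    i = rng.integers(n)             # random selection of i(k)
    g_new = grad_f_i(theta, i)      # the only gradient evaluated this iteration
    grad_sum += g_new - y[i]        # keep sum_i y_i up to date in O(p)
    y[i] = g_new                    # refresh the memory for example i
    return theta - (alpha / n) * grad_sum

# Typical initialization: y = np.zeros((n, p)), grad_sum = np.zeros(p),
# then call sag_step repeatedly with a fixed step size alpha.
```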

  18. SAG convergence analysis. Assume each $f_i'$ is L-Lipschitz continuous and the average g is µ-strongly convex. With step size $\alpha_k = \frac{1}{2nL}$, SAG has a linear convergence rate. Linear convergence with an iteration cost independent of n.

  19. SAG convergence analysis. Assume each $f_i'$ is L-Lipschitz continuous and the average g is µ-strongly convex. With step size $\alpha_k = \frac{1}{2nL}$, SAG has a linear convergence rate. Linear convergence with an iteration cost independent of n. With step size $\alpha_k = \frac{1}{2n\mu}$, if $n \geq \frac{8L}{\mu}$, then $\mathbb{E}\left[g(\theta_k) - g(\theta^*)\right] \leq C \left(1 - \frac{1}{8n}\right)^k$. The rate is “independent” of the condition number. Constant error reduction after each pass: $\left(1 - \frac{1}{8n}\right)^n \leq \exp\left(-\frac{1}{8}\right) = 0.8825$.

  20. Comparison with full gradient methods. Assume L = 100, µ = 0.01 and n = 80000: Full gradient has rate $\left(1 - \frac{\mu}{L}\right)^2 = 0.9998$. Accelerated gradient has rate $1 - \sqrt{\frac{\mu}{L}} = 0.9900$. SAG (n iterations) multiplies the error by $\left(1 - \frac{1}{8n}\right)^n = 0.8825$. The fastest possible first-order method has rate $\left(\frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}\right)^2 = 0.9608$.

  21. Comparison with full gradient methods. Assume L = 100, µ = 0.01 and n = 80000: Full gradient has rate $\left(1 - \frac{\mu}{L}\right)^2 = 0.9998$. Accelerated gradient has rate $1 - \sqrt{\frac{\mu}{L}} = 0.9900$. SAG (n iterations) multiplies the error by $\left(1 - \frac{1}{8n}\right)^n = 0.8825$. The fastest possible first-order method has rate $\left(\frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}\right)^2 = 0.9608$. We beat two lower bounds (with additional assumptions): the stochastic gradient bound and the full gradient bound.
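The four numbers on this slide are easy to reproduce; a short script checking them for L = 100, µ = 0.01, n = 80000.

```python
import math

L, mu, n = 100.0, 0.01, 80000

full_grad    = (1 - mu / L) ** 2                      # full gradient: ~0.9998
accel_grad   = 1 - math.sqrt(mu / L)                  # accelerated gradient: 0.9900
sag_per_pass = (1 - 1 / (8 * n)) ** n                 # SAG over n iterations: ~0.8825
lower_bound  = ((math.sqrt(L) - math.sqrt(mu)) /
                (math.sqrt(L) + math.sqrt(mu))) ** 2  # first-order lower bound: ~0.9608

print(full_grad, accel_grad, sag_per_pass, lower_bound)
```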

  22. Experiments - Training cost. Quantum dataset (n = 50000, p = 78), ℓ2-regularized logistic regression. [Figure: objective minus optimum (log scale) versus effective passes for AFG, L-BFGS, pegasos, SAG-C, and SAG-LS.]

  23. Experiments - Training cost. RCV1 dataset (n = 20242, p = 47236), ℓ2-regularized logistic regression. [Figure: objective minus optimum (log scale) versus effective passes for AFG, L-BFGS, pegasos, SAG-C, and SAG-LS.]

  24. Experiments - Testing cost. Quantum dataset (n = 50000, p = 78), ℓ2-regularized logistic regression. [Figure: test logistic loss versus effective passes for AFG, L-BFGS, pegasos, SAG-C, and SAG-LS.]

  25. Experiments - Testing cost. RCV1 dataset (n = 20242, p = 47236), ℓ2-regularized logistic regression. [Figure: test logistic loss versus effective passes for AFG, L-BFGS, pegasos, SAG-C, and SAG-LS.]
