

Optimization
Aymeric DIEULEVEUT, EPFL, Lausanne. January 26, 2018, Journées YSP.

Outline
1. General context and examples.
2. What makes optimization hard?



What makes it hard: the set $\Theta$ and the complexity of $f$

a. The set $\Theta$ (assuming $\Theta$ is a convex set):
- It may be described implicitly (via equations): $\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$. $\Rightarrow$ Use the dual formulation of the problem.
- Projection onto $\Theta$ might be difficult or impossible. $\Rightarrow$ Use algorithms requiring a linear minimization oracle instead of a quadratic oracle (Frank-Wolfe).
- Even when $\Theta = \mathbb{R}^d$, $d$ might be very large (typically millions). $\Rightarrow$ Use only first-order methods.

b. The structure of $f$. If $f = \hat R(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$, computing a gradient has a cost proportional to $n$.
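To make the cost in b. concrete, here is a minimal numpy sketch, assuming the logistic loss $\ell(y, s) = \log(1 + e^{-ys})$ and identity features $\Phi(x) = x$ (both illustrative choices, not fixed by the slides): one gradient evaluation touches all $n$ examples.

```python
import numpy as np

def empirical_risk_grad(theta, X, y):
    """Gradient of R_hat(theta) = (1/n) sum_i l(y_i, <theta, x_i>)
    for the logistic loss l(y, s) = log(1 + exp(-y * s)).

    The matrix-vector products touch every row of X, so one call
    costs O(n * d): this is what motivates stochastic methods."""
    margins = y * (X @ theta)                 # y_i * <theta, x_i>, shape (n,)
    coefs = -y / (1.0 + np.exp(margins))      # l'(y_i, s_i) evaluated at s_i
    return X.T @ coefs / len(y)               # shape (d,)
```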


Optimization: take home

- We express problems as minimizing a function over a set.
- Most convex problems are solved.
- Difficulties come from non-convexity, lack of regularity, the complexity of the set $\Theta$ (or high dimension), and the complexity of computing gradients.

What happens for supervised machine learning?

Goals:
- present algorithms (convex problems, large dimension, high number of observations),
- show how rates depend on smoothness and strong convexity,
- show how we can use the structure,
- without forgetting the initial problem...!


Stochastic algorithms for ERM

$$\min_{\theta \in \mathbb{R}^d} \; \hat R(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big).$$

Two fundamental questions: (a) computing $\hat\theta$; (b) analyzing $\hat\theta$.

"Large scale" framework: the number of examples $n$ and the number of explanatory variables $d$ are both large.

1. High dimension $d$ $\Rightarrow$ first-order algorithms. Gradient Descent (GD): $\theta_k = \theta_{k-1} - \gamma_k \hat R'(\theta_{k-1})$.
   Problem: computing the gradient costs $O(dn)$ per iteration.
2. Large $n$ $\Rightarrow$ stochastic algorithms: Stochastic Gradient Descent (SGD).
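As a reference point, a bare-bones sketch of the GD loop above (the constant step size is an illustrative simplification of $\gamma_k$; `grad_R` stands for any routine returning $\hat R'$, such as the one sketched earlier):

```python
import numpy as np

def gradient_descent(grad_R, theta0, gamma=0.1, n_iters=1000):
    """GD: theta_k = theta_{k-1} - gamma * R_hat'(theta_{k-1}).
    Each iteration pays the full O(n * d) gradient cost."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_iters):
        theta -= gamma * grad_R(theta)
    return theta
```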


Stochastic Gradient Descent

- Goal: $\min_{\theta \in \mathbb{R}^d} f(\theta)$, given unbiased gradient estimates $f'_k$; $\theta_* := \operatorname{argmin}_{\mathbb{R}^d} f(\theta)$.
- Key algorithm: Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951):
  $$\theta_k = \theta_{k-1} - \gamma_k f'_k(\theta_{k-1})$$
- $\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = f'(\theta_{k-1})$ for a filtration $(\mathcal{F}_k)_{k \ge 0}$; $\theta_k$ is $\mathcal{F}_k$-measurable.

(Figure: iterates $\theta_0, \theta_1, \ldots, \theta_n$ converging toward $\theta_*$.)
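A minimal sketch of this recursion on a toy problem where the unbiasedness assumption holds by construction (the quadratic objective, noise level, and step-size schedule are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0])  # minimizer of f(theta) = 0.5 * ||theta - theta_star||^2

def noisy_grad(theta):
    """Unbiased estimate of f'(theta): the true gradient plus centered noise."""
    return (theta - theta_star) + 0.1 * rng.standard_normal(theta.shape)

theta = np.zeros(2)
for k in range(1, 5001):
    gamma_k = 1.0 / k                # a classical Robbins-Monro schedule
    theta -= gamma_k * noisy_grad(theta)
# theta is now close to theta_star
```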


SGD for ERM: $f = \hat R$

Loss for a single pair of observations, for any $j \le n$: $f_j(\theta) := \ell(y_j, \langle \theta, \Phi(x_j) \rangle)$.

One observation at each step $\Rightarrow$ complexity $O(d)$ per iteration.

For the empirical risk $\hat R(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$:
- At each step $k \in \mathbb{N}^*$, sample $I_k \sim \mathcal{U}\{1, \ldots, n\}$ and use $f'_{I_k}(\theta_{k-1}) = \ell'(y_{I_k}, \langle \theta_{k-1}, \Phi(x_{I_k}) \rangle)$. Then
$$\mathbb{E}[f'_{I_k}(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \frac{1}{n} \sum_{i=1}^{n} \ell'(y_i, \langle \theta_{k-1}, \Phi(x_i) \rangle) = \hat R'(\theta_{k-1}),$$
with $\mathcal{F}_k = \sigma\big((x_i, y_i)_{1 \le i \le n}, (I_i)_{1 \le i \le k}\big)$.
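A sketch of this sampling scheme, again with the logistic loss as an illustrative choice; each update reads a single row of the data matrix, hence the $O(d)$ per-iteration cost:

```python
import numpy as np

def sgd_erm(X, y, gamma0=1.0, n_iters=10000, seed=0):
    """SGD on the empirical risk: at step k, draw I_k ~ U{1,...,n} and use
    the single-example gradient, an unbiased estimate of R_hat'(theta_{k-1})."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for k in range(1, n_iters + 1):
        i = rng.integers(n)                             # I_k ~ U{1,...,n}
        margin = y[i] * (X[i] @ theta)
        grad_i = -y[i] / (1.0 + np.exp(margin)) * X[i]  # logistic l' times Phi(x_i)
        theta -= (gamma0 / np.sqrt(k)) * grad_i         # gamma_k proportional to k^{-1/2}
    return theta
```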


Analysis: behaviour of $(\theta_k)_{k \ge 0}$

$$\theta_k = \theta_{k-1} - \gamma_k f'_k(\theta_{k-1})$$

Importance of the learning rate $(\gamma_k)_{k \ge 0}$. For a smooth and strongly convex problem, $\theta_k \to \theta_*$ a.s. if
$$\sum_{k=1}^{\infty} \gamma_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \gamma_k^2 < \infty$$
(for example, $\gamma_k = \gamma_0 k^{-\alpha}$ with $\alpha \in (1/2, 1]$).

And asymptotic normality: $\sqrt{k}\,(\theta_k - \theta_*) \xrightarrow{d} \mathcal{N}(0, V)$, for $\gamma_k = \gamma_0/k$, $\gamma_0 \ge 1/\mu$.

- The limit variance scales as $1/\mu^2$.
- Very sensitive to ill-conditioned problems.
- $\mu$ is generally unknown...


Polyak-Ruppert averaging

Introduced by Polyak and Juditsky (1992) and Ruppert (1988):
$$\bar\theta_k = \frac{1}{k+1} \sum_{i=0}^{k} \theta_i.$$
- Offline averaging reduces the noise effect.
- Online computation: $\bar\theta_{k+1} = \frac{1}{k+2}\,\theta_{k+1} + \frac{k+1}{k+2}\,\bar\theta_k$.

(Figure: the iterates $\theta_0, \theta_1, \theta_2, \ldots$ oscillate around $\theta_*$ while the averages $\bar\theta_n$ converge.)
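The online form means the average costs only $O(d)$ extra work per step and no storage of past iterates. A sketch on top of a generic SGD loop (the running-mean recursion below is algebraically the displayed definition):

```python
import numpy as np

def averaged_sgd(stochastic_grad, theta0, step, n_iters):
    """SGD with Polyak-Ruppert averaging computed online.
    theta_bar is the running mean of theta_0, ..., theta_k, updated as
    theta_bar <- theta_bar + (theta_k - theta_bar) / (k + 1)."""
    theta = np.asarray(theta0, dtype=float).copy()
    theta_bar = theta.copy()                 # the average starts at theta_0
    for k in range(1, n_iters + 1):
        theta -= step(k) * stochastic_grad(theta)
        theta_bar += (theta - theta_bar) / (k + 1)
    return theta_bar
```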


Convex stochastic approximation: convergence

Known global minimax rates for non-smooth problems:
- Strongly convex: $O((\mu k)^{-1})$, attained by averaged stochastic gradient descent with $\gamma_k \propto (\mu k)^{-1}$.
- Non-strongly convex: $O(k^{-1/2})$, attained by averaged stochastic gradient descent with $\gamma_k \propto k^{-1/2}$.

For smooth problems:
- Strongly convex: $O((\mu k)^{-1})$ for $\gamma_k \propto k^{-1/2}$: adapts to strong convexity.


Convergence rate for $f(\tilde\theta_k) - f(\theta_*)$, smooth $f$.

                  min $\hat R$                                                          min $R$
                  SGD               GD                 SAG                               SGD
Convex            $O(1/\sqrt{k})$   $O(1/k)$           $O(1/k)$                          $O(1/\sqrt{k})$
Stgly-Cvx         $O(1/(\mu k))$    $O(e^{-\mu k})$    $O((1-(\mu \wedge \tfrac{1}{n}))^k)$   $O(1/(\mu k))$

⊖ A gradient descent update costs $n$ times as much as an SGD update.

Can we get the best of both worlds?


Methods for finite-sum minimization

- GD: at step $k$, use $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_k)$.
- SGD: at step $k$, sample $i_k \sim \mathcal{U}\{1, \ldots, n\}$, use $f'_{i_k}(\theta_k)$.
- SAG: at step $k$,
  - keep a "full gradient" $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta^k_i)$, with $\theta^k_i \in \{\theta_1, \ldots, \theta_k\}$;
  - sample $i_k \sim \mathcal{U}\{1, \ldots, n\}$, use
    $$\frac{1}{n} \Big( \sum_{i=1}^{n} f'_i(\theta^k_i) - f'_{i_k}(\theta^k_{i_k}) + f'_{i_k}(\theta_k) \Big);$$
  - ⊕ an update costs the same as an SGD update;
  - ⊖ needs to store all gradients $f'_i(\theta^k_i)$ at "points in the past" (see the sketch after this list).

Some references:
- SAG: Schmidt et al. (2013); SAGA: Defazio et al. (2014a)
- SVRG: Johnson and Zhang (2013) (reduces the memory cost, but two epochs...)
- FINITO: Defazio et al. (2014b)
- S2GD: Konečný and Richtárik (2013)...

And many others; see for example Niao He's lecture notes for a nice overview.
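A compact sketch of SAG under these notations (`grad_i(i, theta)`, standing for $f'_i(\theta)$, the step size, and the initialization are all illustrative assumptions). The table of stored gradients is exactly the $O(nd)$ memory cost flagged above:

```python
import numpy as np

def sag(grad_i, n, d, gamma=0.01, n_iters=10000, seed=0):
    """Stochastic Averaged Gradient (Schmidt et al., 2013), sketched.
    Stores the last gradient seen for each index i (memory O(n * d)) and
    steps along the average of the stored gradients: O(d) work per step."""
    rng = np.random.default_rng(seed)
    table = np.zeros((n, d))        # stored gradients f'_i(theta^k_i)
    avg = np.zeros(d)               # (1/n) * sum of the table rows
    theta = np.zeros(d)
    for _ in range(n_iters):
        i = rng.integers(n)
        g_new = grad_i(i, theta)
        avg += (g_new - table[i]) / n   # swap old f'_i for the new one, in O(d)
        table[i] = g_new
        theta -= gamma * avg
    return theta
```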


Convergence rate for $f(\tilde\theta_k) - f(\theta_*)$, smooth objective $f$.

                  min $\hat R$                                                          min $R$
                  SGD               GD                 SAG                               SGD
Convex            $O(1/\sqrt{k})$   $O(1/k)$           $O(1/k)$                          $O(1/\sqrt{k})$
Stgly-Cvx         $O(1/(\mu k))$    $O(e^{-\mu k})$    $O((1-(\mu \wedge \tfrac{1}{n}))^k)$   $O(1/(\mu k))$

(Figure: GD, SGD, SAG convergence curves, from Schmidt et al. (2013).)


Take home

Stochastic algorithms for empirical risk minimization:
- Rates depend on the regularity of the function.
- Several algorithms optimize the empirical risk; the most efficient ones are stochastic and rely on the finite-sum structure.
- These are stochastic algorithms used to optimize a deterministic function.


What about the generalization risk?

Initial problem: generalization guarantees.
- Uniform upper bound on $\sup_\theta |\hat R(\theta) - R(\theta)|$ (empirical process theory).
- More precise: localized complexities (Bartlett et al., 2002), stability (Bousquet and Elisseeff, 2002).

Problems for ERM:
- Choosing the regularization (risk of overfitting).
- How many iterations (i.e., passes over the data)?
- Generalization guarantees are generally of order $O(1/\sqrt{n})$: no need to be more precise than that.

Two important insights:
1. There is no need to optimize below the statistical error.
2. The generalization risk is more important than the empirical risk.

SGD can be used to minimize the generalization risk directly.


SGD for the generalization risk: $f = R$

SGD's key assumption: $\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = f'(\theta_{k-1})$.

For the risk $R(\theta) = \mathbb{E}_\rho[\ell(Y, \langle \theta, \Phi(X) \rangle)]$:
- At step $0 < k \le n$, use a new point, independent of $\theta_{k-1}$: $f'_k(\theta_{k-1}) = \ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle)$.
- For $0 \le k \le n$, take $\mathcal{F}_k = \sigma((x_i, y_i)_{1 \le i \le k})$. Then
$$\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \mathbb{E}_\rho\big[\ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle) \mid \mathcal{F}_{k-1}\big] = \mathbb{E}_\rho\big[\ell'(Y, \langle \theta_{k-1}, \Phi(X) \rangle)\big] = R'(\theta_{k-1}).$$
- Single pass through the data; running time $O(nd)$.
- "Automatic" regularization.
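A sketch of the single-pass variant: the only change from the ERM version is that each pair is visited exactly once, in order, so the conditional-expectation argument above applies (logistic loss again as an illustrative choice):

```python
import numpy as np

def single_pass_sgd(X, y, gamma0=1.0):
    """One pass of SGD over fresh data: (x_k, y_k) is independent of
    theta_{k-1}, so each gradient is an unbiased estimate of R'(theta_{k-1}).
    Total running time O(n * d); each example is used exactly once."""
    n, d = X.shape
    theta = np.zeros(d)
    for k in range(n):
        margin = y[k] * (X[k] @ theta)
        grad_k = -y[k] / (1.0 + np.exp(margin)) * X[k]
        theta -= (gamma0 / np.sqrt(k + 1)) * grad_k
    return theta
```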

SGD for the generalization risk: $f = R$

                      ERM minimization                            Gen. risk minimization
Number of steps       several passes: $0 \le k$                   one pass: $0 \le k \le n$
$(x_i, y_i)$ is       $\mathcal{F}_t$-measurable for any $t$      $\mathcal{F}_t$-measurable only for $t \ge i$

Convergence rate for $f(\tilde\theta_k) - f(\theta_*)$, smooth objective $f$.

                  min $\hat R$ (several passes: $0 \le k$)                              min $R$ (one pass: $0 \le k \le n$)
                  SGD               GD                 SAG                               SGD
Convex            $O(1/\sqrt{k})$   $O(1/k)$           $O(1/k)$                          $O(1/\sqrt{n})$
Stgly-Cvx         $O(1/(\mu k))$    $O(e^{-\mu k})$    $O((1-(\mu \wedge \tfrac{1}{n}))^k)$   $O(1/(\mu n))$

Lower bounds: the one-pass rates match the information-theoretic lower bounds from statistical theory (Tsybakov, 2003). For the generalization risk, the gradient is unknown (it does not even exist as a computable quantity): only stochastic estimates are available.


Least Mean Squares: a rate independent of $\mu$

Least-squares: $R(\theta) = \frac{1}{2} \mathbb{E}\big[(Y - \langle \Phi(X), \theta \rangle)^2\big]$.

Analysis for averaging and constant step size $\gamma = 1/(4R^2)$ (Bach and Moulines, 2013):
- Assume $\|\Phi(x_n)\| \le R$ and $|y_n - \langle \Phi(x_n), \theta_* \rangle| \le \sigma$.
- No assumption on the lowest eigenvalues of the Hessian.

$$\mathbb{E}\, R(\bar\theta_n) - R(\theta_*) \le \frac{4\sigma^2 d}{n} + \frac{\|\theta_0 - \theta_*\|^2}{\gamma n}.$$

- Matches the statistical lower bound (Tsybakov, 2003).
- Optimal rate with "large" step sizes.
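A sketch of this averaged, constant-step-size least-mean-squares recursion, under the assumptions above (`radius` is the assumed bound $R$ on $\|\Phi(x)\|$; identity features are an illustrative choice):

```python
import numpy as np

def averaged_lms(X, y, radius):
    """Averaged SGD for least-squares with constant step size
    gamma = 1 / (4 R^2), as analyzed by Bach and Moulines (2013).
    Single pass over the data; returns the Polyak-Ruppert average."""
    n, d = X.shape
    gamma = 1.0 / (4.0 * radius ** 2)
    theta = np.zeros(d)
    theta_bar = np.zeros(d)                  # running mean of theta_0, ..., theta_k
    for k in range(n):
        residual = X[k] @ theta - y[k]       # gradient of 0.5 * (y_k - <x_k, theta>)^2
        theta -= gamma * residual * X[k]
        theta_bar += (theta - theta_bar) / (k + 2)
    return theta_bar
```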
