What makes it hard: 3. Set Θ, complexity of f

a. Set Θ (assume Θ is a convex set):
◮ May be described implicitly (via equations): Θ = { θ ∈ ℝ^d : ‖θ‖₂ ≤ R and ⟨θ, 1⟩ = r }.
  → Use the dual formulation of the problem.
◮ Projection onto Θ might be difficult or impossible.
  → Use algorithms that require only a linear minimization oracle instead of a quadratic (projection) oracle (Frank–Wolfe, sketched below).
◮ Even when Θ = ℝ^d, d might be very large (typically millions).
  → Use only first-order methods.

b. Structure of f. If f = R̂(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, ⟨θ, Φ(x_i)⟩), computing a gradient has a cost proportional to n.
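As an illustration of the projection-free idea, here is a minimal Frank–Wolfe sketch; the function names and the ℓ₂-ball oracle are illustrative assumptions, not from the slides:

```python
import numpy as np

def frank_wolfe(grad, lmo, theta0, n_iters=100):
    """Projection-free minimization over a convex set Theta.

    Instead of projecting, each step calls a linear minimization
    oracle lmo(g) = argmin_{s in Theta} <g, s> and moves toward s.
    """
    theta = theta0.copy()
    for k in range(n_iters):
        g = grad(theta)
        s = lmo(g)                    # linear subproblem over Theta
        step = 2.0 / (k + 2.0)        # classical Frank-Wolfe step size
        theta = (1 - step) * theta + step * s
    return theta

# LMO for the l2 ball {||s||_2 <= R}: the minimizer of <g, s> is -R g / ||g||
def lmo_l2(g, R=1.0):
    return -R * g / np.linalg.norm(g)
```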
Optimization Take home

◮ We express problems as minimizing a function over a set.
◮ Most convex problems can be considered solved.
◮ Difficulties come from non-convexity, lack of regularity, complexity of the set Θ (or high dimension), and the cost of computing gradients.

What happens for supervised machine learning?

Goals:
◮ present algorithms (convex problems, large dimension, high number of observations)
◮ show how rates depend on smoothness and strong convexity
◮ show how we can use the structure of the problem
◮ not forgetting the initial problem...!
Stochastic algorithms for ERM

min_{θ ∈ ℝ^d} R̂(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, ⟨θ, Φ(x_i)⟩).

Two fundamental questions: (a) computing θ̂, (b) analyzing θ̂.

"Large scale" framework: the number of examples n and the number of explanatory variables d are both large.

1. High dimension d ⇒ first-order algorithms.
   Gradient Descent (GD): θ_k = θ_{k−1} − γ_k R̂′(θ_{k−1})
   Problem: computing the gradient costs O(dn) per iteration.
2. Large n ⇒ stochastic algorithms: Stochastic Gradient Descent (SGD).
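To make the O(dn) cost per iteration concrete, here is a minimal NumPy sketch of full-batch GD on the empirical risk (the interface and the least-squares loss are illustrative choices, not from the slides):

```python
import numpy as np

def erm_gd(Phi, y, loss_grad, gamma=0.1, n_iters=100):
    """Full-batch gradient descent on the empirical risk.

    Phi: (n, d) feature matrix, y: (n,) targets.
    loss_grad(y, pred) returns the derivative of the loss w.r.t. pred.
    Each iteration touches all n examples: O(n d) per step.
    """
    n, d = Phi.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        preds = Phi @ theta                       # O(n d)
        g = Phi.T @ loss_grad(y, preds) / n       # O(n d)
        theta -= gamma * g
    return theta

# Illustrative loss: least squares, l(y, p) = (y - p)^2 / 2
def sq_grad(y, p):
    return p - y
```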
Stochastic Gradient descent

[Figure: SGD iterates θ_0, θ_1, ..., θ_n approaching the minimizer θ_*]

◮ Goal: min_{θ ∈ ℝ^d} f(θ), given unbiased gradient estimates f′_k.
◮ θ_* := argmin_{ℝ^d} f(θ).
◮ Key algorithm: Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951):
  θ_k = θ_{k−1} − γ_k f′_k(θ_{k−1})
◮ E[f′_k(θ_{k−1}) | F_{k−1}] = f′(θ_{k−1}) for a filtration (F_k)_{k≥0}; θ_k is F_k-measurable.
SGD for ERM: f = R̂

Loss for a single pair of observations, for any j ≤ n: f_j(θ) := ℓ(y_j, ⟨θ, Φ(x_j)⟩).
One observation at each step ⇒ complexity O(d) per iteration.

For the empirical risk R̂(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, ⟨θ, Φ(x_i)⟩):
◮ At each step k ∈ ℕ*, sample I_k ∼ U{1, ..., n} and use f′_{I_k}(θ_{k−1}) = ℓ′(y_{I_k}, ⟨θ_{k−1}, Φ(x_{I_k})⟩).
◮ Then E[f′_{I_k}(θ_{k−1}) | F_{k−1}] = (1/n) ∑_{i=1}^n ℓ′(y_i, ⟨θ_{k−1}, Φ(x_i)⟩) = R̂′(θ_{k−1}),
  with F_k = σ((x_i, y_i)_{1≤i≤n}, (I_i)_{1≤i≤k}).
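A minimal sketch of this ERM-SGD loop (the γ₀/√k schedule and the interface are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def erm_sgd(Phi, y, loss_grad, gamma0=0.5, n_iters=10_000):
    """SGD on the empirical risk: one uniformly sampled example per
    step, so each update costs O(d) instead of O(n d)."""
    n, d = Phi.shape
    theta = np.zeros(d)
    for k in range(1, n_iters + 1):
        i = rng.integers(n)                   # I_k ~ U{1, ..., n}
        pred = Phi[i] @ theta
        g = loss_grad(y[i], pred) * Phi[i]    # unbiased estimate of grad R_hat
        theta -= (gamma0 / np.sqrt(k)) * g    # decaying step size
    return theta
```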
Analysis: behaviour of (θ_k)_{k≥0}

θ_k = θ_{k−1} − γ_k f′_k(θ_{k−1})

Importance of the learning rate (γ_k)_{k≥0}. For smooth and strongly convex problems, θ_k → θ_* a.s. if
∑_{k=1}^∞ γ_k = ∞ and ∑_{k=1}^∞ γ_k² < ∞.

And asymptotic normality: √k (θ_k − θ_*) →_d N(0, V) for γ_k = γ_0 / k, γ_0 ≥ 1/μ.

◮ The limit variance scales as 1/μ².
◮ Very sensitive to ill-conditioned problems.
◮ μ is generally unknown...
Polyak–Ruppert averaging

[Figure: SGD iterates θ_0, θ_1, θ_2, ..., θ_n and their average θ̄_n approaching θ_*]

Introduced by Polyak and Juditsky (1992) and Ruppert (1988):
θ̄_k = (1/(k+1)) ∑_{i=0}^k θ_i.

◮ Offline averaging reduces the effect of the noise.
◮ Online computation: θ̄_{k+1} = (1/(k+2)) θ_{k+1} + ((k+1)/(k+2)) θ̄_k.
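A sketch of SGD with this online average, assuming a user-supplied unbiased gradient oracle (hypothetical interface):

```python
import numpy as np

def averaged_sgd(grad_estimate, d, gamma, n_iters, rng=None):
    """SGD with online Polyak-Ruppert averaging.

    grad_estimate(theta, rng) returns an unbiased estimate of f'(theta);
    gamma(k) is the step-size schedule. The running average costs O(d)
    per step, so averaging is essentially free.
    """
    rng = rng or np.random.default_rng(0)
    theta = np.zeros(d)
    theta_bar = theta.copy()
    for k in range(n_iters):
        theta = theta - gamma(k + 1) * grad_estimate(theta, rng)
        # theta_bar_{k+1} = theta_bar_k + (theta_{k+1} - theta_bar_k)/(k+2)
        theta_bar += (theta - theta_bar) / (k + 2)
    return theta_bar
```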
Convex stochastic approximation: convergence

Known global minimax rates for non-smooth problems:
◮ Strongly convex: O((μk)^{−1}), attained by averaged stochastic gradient descent with γ_k ∝ (μk)^{−1}.
◮ Non-strongly convex: O(k^{−1/2}), attained by averaged stochastic gradient descent with γ_k ∝ k^{−1/2}.

For smooth problems:
◮ Strongly convex: O((μk)^{−1}) for γ_k ∝ k^{−1/2}: adapts to strong convexity.
Convergence rate for f(θ̃_k) − f(θ_*), smooth f.

                     min R̂                                     min R
             SGD         GD           SAG                      SGD
Convex       O(1/√k)     O(1/k)       O(1/k)                   O(1/√k)
Stgly-Cvx    O(1/(μk))   O(e^{−μk})   O((1 − (μ ∧ 1/n))^k)     O(1/(μk))

⊖ A gradient descent update costs n times as much as an SGD update.
Can we get the best of both worlds?
Methods for finite sum minimization

◮ GD: at step k, use (1/n) ∑_{i=1}^n f′_i(θ_k).
◮ SGD: at step k, sample i_k ∼ U{1, ..., n}, use f′_{i_k}(θ_k).
◮ SAG: at step k (see the sketch below),
  ◮ keep a "full gradient" (1/n) ∑_{i=1}^n f′_i(θ_{k_i}), with θ_{k_i} ∈ {θ_1, ..., θ_k};
  ◮ sample i_k ∼ U{1, ..., n}, use
    (1/n) [ ∑_{i=1}^n f′_i(θ_{k_i}) − f′_{i_k}(θ_{k_{i_k}}) + f′_{i_k}(θ_k) ].
⊕ an update costs the same as an SGD update
⊖ needs to store all gradients f′_i(θ_{k_i}) at "points in the past"

Some references:
◮ SAG Schmidt et al. (2013), SAGA Defazio et al. (2014a)
◮ SVRG Johnson and Zhang (2013) (reduces memory cost but 2 epochs...)
◮ FINITO Defazio et al. (2014b)
◮ S2GD Konečný and Richtárik (2013)... And many others...
See for example Niao He's lecture notes for a nice overview.
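A minimal sketch of the SAG update under an assumed gradient interface (initialization details vary across implementations):

```python
import numpy as np

def sag(grads, n, d, gamma, n_iters, rng=None):
    """Sketch of SAG: keep the last gradient seen for each example;
    each step refreshes one entry and moves along the table average.

    grads(i, theta) returns f'_i(theta). Memory is O(n d) for the
    table -- the price for SGD-cost updates with a linear rate.
    """
    rng = rng or np.random.default_rng(0)
    theta = np.zeros(d)
    table = np.zeros((n, d))        # stored f'_i at past iterates
    g_sum = np.zeros(d)             # sum of table rows, kept in sync
    for _ in range(n_iters):
        i = rng.integers(n)
        g_new = grads(i, theta)
        g_sum += g_new - table[i]   # swap in the fresh gradient
        table[i] = g_new
        theta = theta - gamma * g_sum / n
    return theta
```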
Convergence rate for f(θ̃_k) − f(θ_*), smooth objective f: same table as above.

[Figure: comparison of GD, SGD, and SAG objective curves, from Schmidt et al. (2013)]
Take home

Stochastic algorithms for Empirical Risk Minimization:
◮ Rates depend on the regularity of the function.
◮ Several algorithms can optimize the empirical risk; the most efficient ones are stochastic and rely on the finite-sum structure.
◮ These are stochastic algorithms used to optimize a deterministic function.
What about the generalization risk?

Initial problem: generalization guarantees.
◮ Uniform upper bound on sup_θ |R̂(θ) − R(θ)| (empirical process theory).
◮ More precise: localized complexities (Bartlett et al., 2002), stability (Bousquet and Elisseeff, 2002).

Problems for ERM:
◮ Choosing the regularization (risk of overfitting).
◮ How many iterations (i.e., passes over the data)?
◮ Generalization guarantees are generally of order O(1/√n), so there is no need to optimize more precisely than that.

2 important insights:
1. No need to optimize below the statistical error.
2. The generalization risk is more important than the empirical risk.

SGD can be used to minimize the generalization risk directly.
SGD for the generalization risk: f = R

SGD's key assumption: E[f′_k(θ_{k−1}) | F_{k−1}] = f′(θ_{k−1}).

For the risk R(θ) = E_ρ[ℓ(Y, ⟨θ, Φ(X)⟩)]:
◮ At step 0 < k ≤ n, use a new point, independent of θ_{k−1}: f′_k(θ_{k−1}) = ℓ′(y_k, ⟨θ_{k−1}, Φ(x_k)⟩).
◮ For 0 ≤ k ≤ n, with F_k = σ((x_i, y_i)_{1≤i≤k}):
  E[f′_k(θ_{k−1}) | F_{k−1}] = E_ρ[ℓ′(y_k, ⟨θ_{k−1}, Φ(x_k)⟩) | F_{k−1}] = E_ρ[ℓ′(Y, ⟨θ_{k−1}, Φ(X)⟩)] = R′(θ_{k−1}).
◮ Single pass through the data; running time = O(nd).
◮ "Automatic" regularization.
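A single-pass sketch (the streaming interface and the squared loss are illustrative assumptions; any loss with derivative ℓ′ would do):

```python
import numpy as np

def single_pass_sgd(stream, d, gamma):
    """One pass of SGD over fresh (Phi(x), y) pairs.

    Because (x_k, y_k) is independent of theta_{k-1}, each gradient is
    an unbiased estimate of the *generalization* risk gradient; averaging
    the iterates (Polyak-Ruppert) is the usual companion.
    """
    theta = np.zeros(d)
    theta_bar = theta.copy()
    for k, (phi_x, y) in enumerate(stream):          # each pair seen once
        g = (phi_x @ theta - y) * phi_x              # squared-loss gradient
        theta -= gamma(k + 1) * g
        theta_bar += (theta - theta_bar) / (k + 2)
    return theta_bar
```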
SGD for the generalization risk: f = R

                    ERM minimization            Gen. risk minimization
Number of steps     several passes: 0 ≤ k       one pass: 0 ≤ k ≤ n
(x_i, y_i) is       F_t-measurable for any t    F_t-measurable only for t ≥ i
Convergence rate for f(θ̃_k) − f(θ_*), smooth objective f.

                     min R̂ (0 ≤ k)                             min R (0 ≤ k ≤ n)
             SGD         GD           SAG                      SGD
Convex       O(1/√k)     O(1/k)       O(1/k)                   O(1/√n)
Stgly-Cvx    O(1/(μk))   O(e^{−μk})   O((1 − (μ ∧ 1/n))^k)     O(1/(μn))

◮ For min R̂, any number of steps is allowed (0 ≤ k); for min R, only a single pass (0 ≤ k ≤ n), and the gradient of R itself is unknown (only stochastic estimates are available).
◮ Lower bounds: the min-R column matches the information-theoretic lower bound from statistical theory (Tsybakov, 2003), a setting in which the gradient does not even exist.
Least Mean Squares: rate independent of μ

Least-squares: R(θ) = (1/2) E[(Y − ⟨Φ(X), θ⟩)²].

Analysis for averaging and constant step size γ = 1/(4R²) (Bach and Moulines, 2013):
◮ Assume ‖Φ(x_n)‖ ≤ R and |y_n − ⟨Φ(x_n), θ_*⟩| ≤ σ.
◮ No assumption on the lowest eigenvalue of the Hessian.

E R(θ̄_n) − R(θ_*) ≤ 4σ²d/n + ‖θ_0 − θ_*‖²/(γn)

◮ Matches the statistical lower bound (Tsybakov, 2003).
◮ Optimal rate with "large" step sizes.
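A sketch of this averaged constant-step-size LMS recursion (streaming interface assumed; R is a known bound on the feature norm):

```python
import numpy as np

def averaged_lms(stream, d, R):
    """Averaged SGD for least squares with the constant step
    gamma = 1 / (4 R^2), R a bound on ||Phi(x)||. The averaged
    iterate achieves the O(d/n) rate without mu in the step size.
    """
    gamma = 1.0 / (4.0 * R ** 2)
    theta = np.zeros(d)
    theta_bar = theta.copy()
    for k, (phi_x, y) in enumerate(stream):
        theta -= gamma * (phi_x @ theta - y) * phi_x   # LMS update
        theta_bar += (theta - theta_bar) / (k + 2)     # online average
    return theta_bar
```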