
AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes



  1. AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes
  Xiaoxia (Shirley) Wu, PhD Candidate, The University of Texas at Austin. June 11th, 2019.
  Joint work with Rachel Ward and Léon Bottou, at Facebook AI Research.

  2. Outline
  ◮ Motivations
  ◮ Theoretical Contributions: we provide a novel convergence result for AdaGrad-Norm to emphasize its robustness to hyper-parameter tuning over nonconvex landscapes.
  ◮ Practical Implications

  3. Outline
  ◮ Motivations
  ◮ Theoretical Contributions: we provide a novel convergence result for AdaGrad-Norm to emphasize its robustness to hyper-parameter tuning over nonconvex landscapes.
  ◮ Practical Implications

  4. Motivation
  Problem Setup: given a differentiable non-convex function F : R^d → R such that
  ◮ ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖ for all x, y ∈ R^d.

  5. Motivation
  Problem Setup: given a differentiable non-convex function F : R^d → R such that
  ◮ ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖ for all x, y ∈ R^d.
  Our desired goal ⇒ min_{x ∈ R^d} F(x)
  What we can achieve ⇒ ‖∇F(x)‖^2 ≤ ε

  6. Motivation
  Problem Setup: given a differentiable non-convex function F : R^d → R such that
  ◮ ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖ for all x, y ∈ R^d.
  Our desired goal ⇒ min_{x ∈ R^d} F(x)
  What we can achieve ⇒ ‖∇F(x)‖^2 ≤ ε
  Algorithm: Stochastic Gradient Descent (SGD). At the j-th iteration,
      x_{j+1} ← x_j − η_j G(x_j),          (1)
  where E[G(x_j)] = ∇F(x_j) and η_j > 0 is the stepsize.
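The update (1) on the slide above can be sketched in a few lines. This is a minimal illustration, not code from the talk; the quadratic objective and noise level are hypothetical choices made only so the loop has something to optimize.

```python
import numpy as np

def sgd(grad_oracle, x0, stepsizes):
    """Plain SGD: x_{j+1} = x_j - eta_j * G(x_j).

    grad_oracle(x) returns a stochastic gradient G(x) with E[G(x)] = grad F(x).
    """
    x = np.asarray(x0, dtype=float)
    for eta_j in stepsizes:
        x = x - eta_j * grad_oracle(x)
    return x

# Toy example: F(x) = 0.5*||x||^2, so grad F(x) = x; add small Gaussian noise.
rng = np.random.default_rng(0)
oracle = lambda x: x + 0.01 * rng.standard_normal(x.shape)
x_final = sgd(oracle, np.ones(3), [0.5] * 100)
```

With a constant stepsize the iterate contracts toward the minimizer but hovers at a noise floor, which is exactly why the choice of {η_j} matters in the slides that follow.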

  7. Motivation
  Algorithm: SGD. Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j).
  Q: How to set the sequence {η_j}_{j≥0}?

  8. Motivation
  Algorithm: SGD. Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j).
  Q: How to set the sequence {η_j}_{j≥0}?
  Difficulty in Choosing Stepsizes: the classical Robbins/Monro theory (Robbins and Monro, 1951) says that if
      Σ_{j=1}^∞ η_j = ∞  and  Σ_{j=1}^∞ η_j^2 < ∞,          (2)
  and the variance of the gradient is bounded [1], then lim_{j→∞} E[‖∇F(x_j)‖^2] = 0.
  [1] E[‖G(x) − ∇F(x)‖^2] ≤ σ^2.

  9. Motivation
  Algorithm: SGD. Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j).
  Q: How to set the sequence {η_j}_{j≥0}?
  Difficulty in Choosing Stepsizes: the classical Robbins/Monro theory (Robbins and Monro, 1951) says that if
      Σ_{j=1}^∞ η_j = ∞  and  Σ_{j=1}^∞ η_j^2 < ∞,          (2)
  and the variance of the gradient is bounded, then lim_{j→∞} E[‖∇F(x_j)‖^2] = 0.
  However, the rule is too general for practical applications.
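The two Robbins/Monro conditions can be checked numerically for concrete schedules. The snippet below (an illustration, not from the talk) compares η_j = 1/j, which satisfies both conditions (harmonic sum diverges, squared sum converges), against η_j = 1/j^2, whose plain sum already converges and thus fails the first condition.

```python
def partial_sums(stepsize, n):
    """Partial sums of eta_j and eta_j^2 for j = 1..n."""
    s = sq = 0.0
    for j in range(1, n + 1):
        e = stepsize(j)
        s += e
        sq += e * e
    return s, sq

# eta_j = 1/j: sum diverges (slowly, like log n), sum of squares -> pi^2/6.
s1, sq1 = partial_sums(lambda j: 1.0 / j, 10**5)
# eta_j = 1/j^2: both sums converge, so the first condition fails.
s2, sq2 = partial_sums(lambda j: 1.0 / j**2, 10**5)
```

Finite partial sums cannot prove divergence, of course; the point is only to make the two conditions concrete.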

  10. Motivation
  Algorithm: SGD. Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j).
  Possible Choice: Manual Tuning
      η_j = η      for j ≤ T_1
            α_1 η  for T_1 ≤ j ≤ T_2
            α_2 η  for T_2 ≤ j ≤ T_3
            ...

  11. Motivation
  Algorithm: SGD. Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j).
  Possible Choice: Manual Tuning
      η_j = η      for j ≤ T_1
            α_1 η  for T_1 ≤ j ≤ T_2
            α_2 η  for T_2 ≤ j ≤ T_3
            ...
  However, tuning η, α_1, α_2, T_1, T_2, ... is computationally costly. In particular, it requires η ≤ 2/L [2].
  [2] ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖ for all x, y ∈ R^d.
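The piecewise-constant schedule above is straightforward to code; what is costly is choosing its knobs. A minimal sketch, where the particular values of η, α_i and T_i are hypothetical and stand in for exactly the hyper-parameters the slide says must be tuned:

```python
def manual_schedule(eta, alphas, milestones):
    """Piecewise-constant stepsize: eta until T_1, alpha_1*eta until T_2, ...

    alphas and milestones are the manually tuned knobs; the theory also
    requires the base stepsize eta <= 2/L, which depends on the unknown L.
    """
    def eta_j(j):
        scale = 1.0
        for a, t in zip(alphas, milestones):
            if j >= t:
                scale = a  # the latest milestone passed wins
        return scale * eta
    return eta_j

# Hypothetical tuning: drop to 0.5*eta at step 100, to 0.1*eta at step 200.
sched = manual_schedule(0.1, alphas=[0.5, 0.1], milestones=[100, 200])
```

Every entry in `alphas`/`milestones` multiplies the search space of a grid search, which is the cost the slide is pointing at.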

  12. Motivation
  Algorithm: SGD with Adaptive Stepsize. Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, ..., d,
      [x_{j+1}]_ℓ ← [x_j]_ℓ − (η / [b_{j+1}]_ℓ) [G(x_j)]_ℓ
  Possible Choice: Adaptive Gradient Methods. Among many variants, one is AdaGrad:
      ([b_{j+1}]_ℓ)^2 = ([b_j]_ℓ)^2 + ([G(x_j)]_ℓ)^2

  13. Motivation
  Algorithm: SGD with Adaptive Stepsize. Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, ..., d,
      [x_{j+1}]_ℓ ← [x_j]_ℓ − (η / [b_{j+1}]_ℓ) [G(x_j)]_ℓ
  Possible Choice: Adaptive Gradient Methods. Among many variants, one is AdaGrad:
      ([b_{j+1}]_ℓ)^2 = ([b_j]_ℓ)^2 + ([G(x_j)]_ℓ)^2
  ◮ It helps with "increasing the stepsize for more sparse parameters and decreasing the stepsize for less sparse ones." (Duchi et al. 2011)

  14. Motivation
  Algorithm: SGD with Adaptive Stepsize. Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, ..., d,
      [x_{j+1}]_ℓ ← [x_j]_ℓ − (η / [b_{j+1}]_ℓ) [G(x_j)]_ℓ
  Possible Choice: Adaptive Gradient Methods. Among many variants, one is AdaGrad:
      ([b_{j+1}]_ℓ)^2 = ([b_j]_ℓ)^2 + ([G(x_j)]_ℓ)^2
  ◮ It helps with "increasing the stepsize for more sparse parameters and decreasing the stepsize for less sparse ones." (Duchi et al. 2011)
  ◮ However, coordinate-wise AdaGrad changes the optimization problem by introducing a bias in the solutions, leading to worse generalization. (Wilson et al. 2017)
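One coordinate-wise AdaGrad update, exactly as written on the slide, can be sketched as follows. This is a bare illustration; practical implementations additionally add a small epsilon inside the square root, which is omitted here to match the slide's formula.

```python
import numpy as np

def adagrad_step(x, g, b_sq, eta=1.0):
    """One coordinate-wise AdaGrad update:
    ([b_{j+1}]_l)^2 = ([b_j]_l)^2 + ([G(x_j)]_l)^2
    [x_{j+1}]_l     = [x_j]_l - (eta / [b_{j+1}]_l) * [G(x_j)]_l
    """
    b_sq = b_sq + g * g               # per-coordinate accumulator
    x = x - eta * g / np.sqrt(b_sq)   # per-coordinate effective stepsize
    return x, b_sq

# On the very first step b_{1,l} = |g_l|, so every coordinate moves by
# eta * sign(g_l): large and small gradient coordinates move equally far.
x, b_sq = adagrad_step(np.array([1.0, 1.0]), np.array([2.0, 0.1]),
                       np.zeros(2), eta=0.5)
```

The equal first-step movement across coordinates is the "sparse coordinates get relatively larger stepsizes" effect quoted from Duchi et al. above.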

  15. Motivation
  Algorithm: SGD with Adaptive Stepsize. Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, ..., d,
      [x_{j+1}]_ℓ ← [x_j]_ℓ − (η / b_{j+1}) [G(x_j)]_ℓ
  Possible Variant: Norm Version of AdaGrad
      b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2          (AdaGrad-Norm)

  16. Motivation
  Algorithm: SGD with Adaptive Stepsize. Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, ..., d,
      [x_{j+1}]_ℓ ← [x_j]_ℓ − (η / b_{j+1}) [G(x_j)]_ℓ
  Possible Variant: Norm Version of AdaGrad
      b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2          (AdaGrad-Norm)
  ◮ Auto-tuning property (Wu, Ward, and Bottou, 2018): robustness to the choices of hyper-parameters (b_0 and η); connection to Weight/Layer/Batch Normalization.
  ◮ Does not affect generalization.
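AdaGrad-Norm replaces the per-coordinate accumulators with a single scalar b_j. A minimal sketch of the loop, with a deliberately tiny (hence "badly tuned") b_0 on a hypothetical toy quadratic to illustrate the auto-tuning claim:

```python
import numpy as np

def adagrad_norm(grad_oracle, x0, eta=1.0, b0=0.1, T=100):
    """AdaGrad-Norm: one scalar accumulator scales the whole gradient.
    b_{j+1}^2 = b_j^2 + ||G(x_j)||^2 ;  x_{j+1} = x_j - (eta/b_{j+1}) G(x_j).
    """
    x = np.asarray(x0, dtype=float)
    b_sq = b0 * b0
    for _ in range(T):
        g = grad_oracle(x)
        b_sq += float(g @ g)                # scalar accumulator grows
        x = x - (eta / np.sqrt(b_sq)) * g   # effective stepsize eta/b_{j+1}
    return x

# Toy deterministic example: F(x) = 0.5*||x||^2, grad F(x) = x. Even with a
# very aggressive initialization b0 = 1e-3 (stepsize eta/b0 = 1000), the
# first gradient immediately inflates b and the iteration converges.
x_final = adagrad_norm(lambda x: x, np.ones(4), eta=1.0, b0=1e-3, T=500)
```

Because b_{j+1} is updated *before* the step, the first update is already scaled by roughly ‖G(x_0)‖ rather than b_0, which is one way to see the robustness to b_0 asserted on the slide.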

  17. Outline
  ◮ Motivations
  ◮ Theoretical Contributions: we provide a novel convergence result for AdaGrad-Norm to emphasize its robustness to hyper-parameter tuning over nonconvex landscapes.
  ◮ Practical Implications

  18. Theory
  Algorithm: SGD with Adaptive Stepsize
      b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2,   x_{j+1} ← x_j − (η / b_{j+1}) G(x_j)
  What is the convergence rate of AdaGrad-Norm?
  ◮ Intuition: if E[‖G(x_j)‖^2] ≤ γ^2, then the effective stepsize η/b_j satisfies
      E[η / b_j] ≥ η / sqrt(j γ^2 + b_0^2)

  19. Theory
  Algorithm: SGD with Adaptive Stepsize
      b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2,   x_{j+1} ← x_j − (η / b_{j+1}) G(x_j)
  What is the convergence rate of AdaGrad-Norm?
  ◮ Intuition: if E[‖G(x_j)‖^2] ≤ γ^2, then the effective stepsize η/b_j satisfies
      E[η / b_j] ≥ η / sqrt(j γ^2 + b_0^2)
  ◮ Convex Landscapes: O(1/√T) (Levy, 2018)

  20. Theory
  Algorithm: SGD with Adaptive Stepsize
      b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2,   x_{j+1} ← x_j − (η / b_{j+1}) G(x_j)
  What is the convergence rate of AdaGrad-Norm?
  ◮ Intuition: if E[‖G(x_j)‖^2] ≤ γ^2, then the effective stepsize η/b_j satisfies
      E[η / b_j] ≥ η / sqrt(j γ^2 + b_0^2)
  ◮ Convex Landscapes: O(1/√T) (Levy, 2018)
  ◮ Nonconvex Landscapes: O(log(T)/√T) (Ours, Theorem 2.1)
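The intuition above rests on a pathwise fact: since b_j^2 only accumulates terms of size at most γ^2, we have b_j^2 ≤ b_0^2 + j·γ^2, so the effective stepsize η/b_j decays no faster than η/√(jγ^2 + b_0^2). A quick numeric sanity check of that pathwise bound, with hypothetical constants chosen only for illustration:

```python
import math

# If every ||G(x_j)||^2 <= gamma^2, then b_j^2 <= b_0^2 + j*gamma^2, hence
# eta/b_j >= eta / sqrt(j*gamma^2 + b_0^2) along the whole trajectory.
eta, b0, gamma, T = 1.0, 0.5, 2.0, 1000
b_sq = b0 * b0
all_above = True
for j in range(1, T + 1):
    g_norm_sq = 0.7 * gamma * gamma        # any value <= gamma^2 works
    b_sq += g_norm_sq                      # accumulator update
    lower = eta / math.sqrt(j * gamma * gamma + b0 * b0)
    all_above = all_above and (eta / math.sqrt(b_sq) >= lower)
```

This is only the deterministic half of the intuition; the slide's E[η/b_j] statement additionally needs an argument (e.g. Jensen-type) to move the bound inside the expectation.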

  21. Theory
  Algorithm: SGD with Adaptive Stepsize
      (1) At the j-th iteration, generate ξ_j and G(x_j) = G(x_j, ξ_j);
      (2) x_{j+1} ← x_j − (η / b_{j+1}) G(x_j) with b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2.
  Theorem. Under the assumptions:
  1. The random vectors ξ_j, j = 0, 1, 2, ..., are mutually independent and also independent of x_j;
  2. Bounded variance [3]: E_{ξ_j}[‖G(x_j, ξ_j) − ∇F(x_j)‖^2] ≤ σ^2;
  3. Bounded gradient norm: ‖∇F(x_j)‖ ≤ γ uniformly.
  [3] E_{ξ_j} denotes the expectation with respect to ξ_j conditional on x_j.

  22. Theory
  Algorithm: SGD with Adaptive Stepsize
      (1) At the j-th iteration, generate ξ_j and G(x_j) = G(x_j, ξ_j);
      (2) x_{j+1} ← x_j − (η / b_{j+1}) G(x_j) with b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2.
  Theorem. Under the assumptions:
  1. The random vectors ξ_j, j = 0, 1, 2, ..., are mutually independent and also independent of x_j;
  2. Bounded variance [3]: E_{ξ_j}[‖G(x_j, ξ_j) − ∇F(x_j)‖^2] ≤ σ^2;
  3. Bounded gradient norm: ‖∇F(x_j)‖ ≤ γ uniformly;
  AdaGrad-Norm converges to a stationary point w.h.p. at the rate
      min_{ℓ=0,1,...,T−1} ‖∇F(x_ℓ)‖^2 ≤ C^2/T + σC/√T,
  where C = Õ(log(T/b_0 + 1)) and Õ hides η, L and F(x_0) − F*.
  [3] E_{ξ_j} denotes the expectation with respect to ξ_j conditional on x_j.

  23. Theory
  Algorithm: SGD with Adaptive Stepsize
      (1) At the j-th iteration, generate ξ_j and G(x_j) = G(x_j, ξ_j);
      (2) x_{j+1} ← x_j − (η / b_{j+1}) G(x_j) with b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2.
  Challenges in the proof: b_{j+1} is a random variable correlated with ∇F(x_j) and G(x_j).
  ◮ The L-Lipschitz continuous gradient [4] gives
      (F_{j+1} − F_j)/η ≤ − ‖∇F_j‖^2 / b_{j+1} + ⟨∇F_j, ∇F_j − G_j⟩ / b_{j+1} + η L ‖G_j‖^2 / (2 b_{j+1}^2),
  where ⟨∇F_j, ∇F_j − G_j⟩ / b_{j+1} is the key term.
  [4] We write F(x_j) = F_j, ∇F(x_j) = ∇F_j and G(x_j) = G_j.
