

1. Benefiting from Negative Curvature
Daniel P. Robinson
Johns Hopkins University, Department of Applied Mathematics and Statistics
Collaborator: Frank E. Curtis (Lehigh University)
US and Mexico Workshop on Optimization and Its Applications
Huatulco, Mexico, January 8, 2018

2. Outline
1. Motivation
2. Deterministic Setting
   - The Method
   - Convergence Results
   - Numerical Results
   - Comments
3. Stochastic Setting

3. Motivation (section outline)

4. Motivation: Problem of interest (deterministic setting)

   $\min_{x \in \mathbb{R}^n} f(x)$

- $f : \mathbb{R}^n \to \mathbb{R}$ is assumed to be twice-continuously differentiable
- $L$ denotes the Lipschitz constant for $\nabla f$; $\sigma$ denotes the Lipschitz constant for $\nabla^2 f$
- $f$ may be nonconvex

Notation: $g(x) := \nabla f(x)$ and $H(x) := \nabla^2 f(x)$.

5. Motivation: Much work has been done on convergence to second-order points:
- D. Goldfarb (1979) [6]: proves a convergence result to second-order optimal points (unconstrained); curvilinear search using a descent direction and a negative curvature direction
- D. Goldfarb, C. Mu, J. Wright, and C. Zhou (2017) [7]: consider equality constrained problems; prove a convergence result to second-order optimal points; extend the curvilinear search from the unconstrained case
- F. Facchinei and S. Lucidi (1998) [3]: consider inequality constrained problems; exact penalty function, directions of negative curvature, and line search
- P. Gill, V. Kungurtsev, and D. Robinson (2017) [4, 5]: consider inequality constrained problems; convergence to second-order optimal points under weak assumptions
- J. Moré and D. Sorensen (1979), A. Forsgren, P. Gill, and W. Murray (1995), and many more . . .

None consistently perform better by using directions of negative curvature!

6. Motivation: Others hope to avoid saddle points:
- J. Lee, M. Simchowitz, M. Jordan, and B. Recht (2016) [8]: gradient descent with random initialization converges to a local minimizer almost surely
- Y. Dauphin et al. (2016) [2]: present a saddle-free Newton method (a modified-Newton method) whose goal is to escape saddle points (move away when close)

These (and others) try to avoid the ill effects of negative curvature.

7. Motivation: Purpose of this research
- Design a method that consistently performs better by using directions of negative curvature.
- Do not try to avoid negative curvature. Use it!

8. Deterministic Setting (section outline)

9. Deterministic Setting: The Method (section outline)

10. The Method: Overview
- Compute a descent direction ($s_k$) and a negative curvature direction ($d_k$).
- Predict which step will make more progress in reducing the objective $f$.
- If the predicted decrease is not realized, adjust parameters.
- Iterate until an approximate second-order solution is obtained.

11. The Method: Requirements on the directions

Requirements on the descent direction $s_k$. Compute $s_k$ to satisfy

   $-g(x_k)^T s_k \ge \delta \|s_k\|_2 \|g(x_k)\|_2$   for some $\delta \in (0, 1]$

Examples:
- $s_k = -g(x_k)$
- $B_k s_k = -g_k$ with $B_k$ appropriately chosen

Requirements on the negative curvature direction $d_k$. Compute $d_k$ to satisfy

   $d_k^T H(x_k) d_k \le \gamma \lambda_k \|d_k\|_2^2 < 0$   for some $\gamma \in (0, 1]$
   $g(x_k)^T d_k \le 0$

Examples:
- $d_k = \pm v_k$ with $(\lambda_k, v_k)$ the left-most eigenpair of $H(x_k)$
- $d_k$ a sufficiently accurate estimate of $\pm v_k$
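
The conditions above leave considerable freedom; here is a minimal numpy sketch of the canonical choices named on the slide (steepest descent plus the left-most eigenpair), not the authors' implementation. The function name and the assumption that `g` and `H` are the current gradient vector and Hessian matrix are mine.

```python
import numpy as np

def compute_directions(g, H):
    """Return (s, d, lam_min) satisfying the slide's step conditions."""
    # Descent condition with delta = 1: -g^T s = ||g||_2^2 = ||s||_2 ||g||_2.
    s = -g

    # Left-most eigenpair of H; np.linalg.eigh returns eigenvalues ascending.
    lam, V = np.linalg.eigh(H)
    lam_min, v = lam[0], V[:, 0]

    d = None
    if lam_min < 0:
        # Curvature condition with gamma = 1: d^T H d = lam_min ||d||_2^2 < 0.
        # Flip the sign so that g^T d <= 0 as well.
        d = -v if g @ v > 0 else v
    return s, d, lam_min
```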

12. The Method: How to use $s_k$ and $d_k$?
- Use both in a curvilinear linesearch? Often taints good descent directions with "poorly scaled" directions of negative curvature. No consistent performance gains!
- Start using $d_k$ only once $\|g(x_k)\|$ is "small"? No consistent performance gains! Misses areas of the space in which a great decrease in $f$ is possible.
- Use $s_k$ when $\|g(x_k)\|$ is big relative to $|(\lambda_k)_-|$, and otherwise use $d_k$? Better, but still inconsistent performance gains!

We propose to use upper-bounding models. It works!

13. The Method: Predicted decrease along the descent direction $s_k$

If $L_k \ge L$, then

   $f(x_k + \alpha s_k) \le f(x_k) - m_{s,k}(\alpha)$ for all $\alpha$, with
   $m_{s,k}(\alpha) := -\alpha\, g(x_k)^T s_k - \tfrac{1}{2} L_k \alpha^2 \|s_k\|_2^2$

and define the quantity

   $\alpha_k := \dfrac{-g(x_k)^T s_k}{L_k \|s_k\|_2^2} = \operatorname{argmax}_{\alpha \ge 0}\, m_{s,k}(\alpha)$

Comments:
- $m_{s,k}(\alpha_k)$ is the best predicted decrease along $s_k$
- If $s_k = -g(x_k)$, then $\alpha_k = 1/L_k$
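
In code, the model and its closed-form maximizer are one-liners; a sketch with helper names of my choosing, continuing the conventions of the previous snippet:

```python
def m_s(g, s, L_k, alpha):
    # m_{s,k}(alpha) = -alpha g^T s - (L_k / 2) alpha^2 ||s||_2^2
    return -alpha * (g @ s) - 0.5 * L_k * alpha**2 * (s @ s)

def best_alpha(g, s, L_k):
    # argmax over alpha >= 0; reduces to 1/L_k when s = -g.
    return -(g @ s) / (L_k * (s @ s))
```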

14. The Method: Predicted decrease along the negative curvature direction $d_k$

If $\sigma_k \ge \sigma$, then

   $f(x_k + \beta d_k) \le f(x_k) - m_{d,k}(\beta)$ for all $\beta$, with
   $m_{d,k}(\beta) := -\beta\, g(x_k)^T d_k - \tfrac{1}{2} \beta^2 d_k^T H(x_k) d_k - \tfrac{\sigma_k}{6} \beta^3 \|d_k\|_2^3$

and define, with $c_k := d_k^T H(x_k) d_k$, the quantity

   $\beta_k := \dfrac{-c_k + \sqrt{c_k^2 - 2 \sigma_k \|d_k\|_2^3\, g(x_k)^T d_k}}{\sigma_k \|d_k\|_2^3} = \operatorname{argmax}_{\beta \ge 0}\, m_{d,k}(\beta)$

Comments:
- $m_{d,k}(\beta_k)$ is the best predicted decrease along $d_k$
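
Setting the derivative of the cubic model to zero gives a quadratic in $\beta$ whose positive root is $\beta_k$; a sketch in the same hypothetical helper style as above:

```python
import numpy as np

def m_d(g, d, H, sigma_k, beta):
    # m_{d,k}(beta) = -beta g^T d - (1/2) beta^2 d^T H d - (sigma_k/6) beta^3 ||d||^3
    c = d @ H @ d
    nd3 = np.linalg.norm(d) ** 3
    return -beta * (g @ d) - 0.5 * beta**2 * c - (sigma_k / 6.0) * beta**3 * nd3

def best_beta(g, d, H, sigma_k):
    # Positive root of (sigma_k ||d||^3 / 2) beta^2 + c_k beta + g^T d = 0.
    c = d @ H @ d
    nd3 = np.linalg.norm(d) ** 3
    disc = c**2 - 2.0 * sigma_k * nd3 * (g @ d)  # >= c^2 since g^T d <= 0
    return (-c + np.sqrt(disc)) / (sigma_k * nd3)
```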

15. The Method: Choose the step that predicts the largest decrease in $f$.
- If $m_{s,k}(\alpha_k) \ge m_{d,k}(\beta_k)$, then try the step $s_k$.
- If $m_{d,k}(\beta_k) > m_{s,k}(\alpha_k)$, then try the step $d_k$.

Question: Why "try" instead of "use"?
Answer: We do not know whether $L_k \ge L$ and $\sigma_k \ge \sigma$:
- If $L_k < L$, then it could be the case that $f(x_k + \alpha_k s_k) > f(x_k) - m_{s,k}(\alpha_k)$.
- If $\sigma_k < \sigma$, then it could be the case that $f(x_k + \beta_k d_k) > f(x_k) - m_{d,k}(\beta_k)$.

16. The Method: Dynamic Step-Size Algorithm

for $k \in \mathbb{N}$ do
    compute $s_k$ and $d_k$ satisfying the required step conditions
    loop
        compute $\alpha_k = \operatorname{argmax}_{\alpha \ge 0}\, m_{s,k}(\alpha)$ and $\beta_k = \operatorname{argmax}_{\beta \ge 0}\, m_{d,k}(\beta)$
        if $m_{s,k}(\alpha_k) \ge m_{d,k}(\beta_k)$ then
            if $f(x_k + \alpha_k s_k) \le f(x_k) - m_{s,k}(\alpha_k)$ then
                set $x_{k+1} \leftarrow x_k + \alpha_k s_k$ and then exit loop
            else
                set $L_k \leftarrow \rho L_k$   [$\rho \in (1, \infty)$]
        else
            if $f(x_k + \beta_k d_k) \le f(x_k) - m_{d,k}(\beta_k)$ then
                set $x_{k+1} \leftarrow x_k + \beta_k d_k$ and then exit loop
            else
                set $\sigma_k \leftarrow \rho \sigma_k$
    set $(L_{k+1}, \sigma_{k+1}) \in (L_{\min}, L_k] \times (\sigma_{\min}, \sigma_k]$
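
Putting the pieces together, a self-contained sketch of the loop above that reuses the helpers from the previous slides; `rho`, the initial `L` and `sigma`, and the stopping tolerances are illustrative choices of mine, not values from the talk:

```python
import numpy as np

def dynamic_step_size(f, grad, hess, x, L=1.0, sigma=1.0, rho=2.0,
                      eps_g=1e-6, eps_H=1e-6, max_iter=1000):
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        s, d, lam_min = compute_directions(g, H)
        if np.linalg.norm(g) <= eps_g and lam_min >= -eps_H:
            return x                      # approximate second-order point
        while True:
            a = best_alpha(g, s, L)
            ms = m_s(g, s, L, a)
            if d is not None:
                b = best_beta(g, d, H, sigma)
                md = m_d(g, d, H, sigma, b)
            else:
                b, md = 0.0, -np.inf      # no negative curvature available
            if ms >= md:
                if f(x + a * s) <= f(x) - ms:   # predicted decrease realized
                    x = x + a * s
                    break
                L = rho * L                     # L was too small; increase it
            else:
                if f(x + b * d) <= f(x) - md:
                    x = x + b * d
                    break
                sigma = rho * sigma             # sigma was too small; increase it
        # The algorithm allows (L, sigma) to be decreased between iterations;
        # this sketch simply carries the current values forward.
    return x
```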

17. Deterministic Setting: Convergence Results (section outline)

18. Convergence Results: Key decrease inequality. For all $k \in \mathbb{N}$ it holds that

   $f(x_k) - f(x_{k+1}) \ge \max\left\{ \dfrac{\delta^2}{2 L_k} \|g(x_k)\|_2^2,\; \dfrac{2\gamma^3}{3\sigma_k^2} |(\lambda_k)_-|^3 \right\}$

Comments:
- The first term in the max holds when $x_{k+1} = x_k + \alpha_k s_k$.
- The second term in the max holds when $x_{k+1} = x_k + \beta_k d_k$.
- The max holds because we choose whether to try $s_k$ or $d_k$ based on $m_{s,k}(\alpha_k) \ge m_{d,k}(\beta_k)$.
- One can prove that $\{L_k\}$ and $\{\sigma_k\}$ remain uniformly bounded.
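
For the first term, the bound follows by substituting $\alpha_k$ into $m_{s,k}$ and applying the descent condition $-g(x_k)^T s_k \ge \delta \|s_k\|_2 \|g(x_k)\|_2$; a short verification (the second term follows analogously from the cubic model):

```latex
m_{s,k}(\alpha_k)
  = \frac{(g(x_k)^T s_k)^2}{L_k \|s_k\|_2^2}
    - \frac{(g(x_k)^T s_k)^2}{2 L_k \|s_k\|_2^2}
  = \frac{(g(x_k)^T s_k)^2}{2 L_k \|s_k\|_2^2}
  \ge \frac{\delta^2 \|s_k\|_2^2 \|g(x_k)\|_2^2}{2 L_k \|s_k\|_2^2}
  = \frac{\delta^2}{2 L_k} \|g(x_k)\|_2^2 .
```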

19. Convergence Results

Theorem (Limit points satisfy second-order necessary conditions). The computed iterates satisfy

   $\lim_{k \to \infty} \|g(x_k)\|_2 = 0$ and $\liminf_{k \to \infty} \lambda_k \ge 0$.

Theorem (Complexity result). The number of iterations, function, and derivative (i.e., gradient and Hessian) evaluations required until some iteration $k \in \mathbb{N}$ is reached with $\|g(x_k)\|_2 \le \epsilon_g$ and $|(\lambda_k)_-| \le \epsilon_H$ is at most

   $O(\max\{\epsilon_g^{-2}, \epsilon_H^{-3}\})$.

20. Deterministic Setting: Numerical Results (section outline)

21. Numerical Results: Refined parameter increase strategy

   $\hat{L}_k \leftarrow L_k + \dfrac{2\left( f(x_k + \alpha_k s_k) - f(x_k) + m_{s,k}(\alpha_k) \right)}{\alpha_k^2 \|s_k\|_2^2}$

   $\hat{\sigma}_k \leftarrow \sigma_k + \dfrac{6\left( f(x_k + \beta_k d_k) - f(x_k) + m_{d,k}(\beta_k) \right)}{\beta_k^3 \|d_k\|_2^3}$

then, with $\rho \leftarrow 2$, use the updates

   $L_k \leftarrow \max\{\rho L_k, \min\{10^3 L_k, \hat{L}_k\}\}$
   $\sigma_k \leftarrow \max\{\rho \sigma_k, \min\{10^3 \sigma_k, \hat{\sigma}_k\}\}$

Refined parameter decrease strategy

   $L_{k+1} \leftarrow \max\{10^{-3}, 10^{-3} L_k, \hat{L}_k\}$ and $\sigma_{k+1} \leftarrow \sigma_k$ when $x_{k+1} \leftarrow x_k + \alpha_k s_k$
   $\sigma_{k+1} \leftarrow \max\{10^{-3}, 10^{-3} \sigma_k, \hat{\sigma}_k\}$ and $L_{k+1} \leftarrow L_k$ when $x_{k+1} \leftarrow x_k + \beta_k d_k$
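
The increase rule has a natural reading: $\hat{L}_k$ is the Lipschitz estimate that would make the upper-bounding model tight at the rejected trial point, safeguarded to lie between $\rho L_k$ and $10^3 L_k$. A sketch of the $L$ update (the $\sigma$ update mirrors it with the factor 6 and cubic powers); the function and argument names are mine:

```python
def refined_L_increase(f_trial, f_cur, ms_val, alpha, s, L, rho=2.0):
    # L_hat makes f(x_k) - m_{s,k}(alpha_k) match the observed f at the trial point.
    L_hat = L + 2.0 * (f_trial - f_cur + ms_val) / (alpha**2 * (s @ s))
    # Safeguard: increase by at least rho, but never by more than 10^3.
    return max(rho * L, min(1e3 * L, L_hat))
```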
