  1. First-order methods (OPTML++ Meeting 2), Suvrit Sra, Massachusetts Institute of Technology, OPTML++, Fall 2015

  2. Outline
– Lect 1: Recap on convexity
– Lect 1: Recap on duality, optimality
– First-order optimization algorithms
– Proximal methods, operator splitting

  3. Descent methods: min_x f(x)

  4. Descent methods: min_x f(x). (Figure: iterates x_k, x_{k+1}, ... converging to x*, where ∇f(x*) = 0.)

  5.–8. Descent methods (figure): from a point x, the negative gradient −∇f(x) gives a descent direction; the steps x − α∇f(x) and x − δ∇f(x) (different stepsizes), as well as a step along another descent direction d, move toward lower values of f.

  9. Algorithm: (1) Start with some guess x_0. (2) For each k = 0, 1, ..., set x_{k+1} ← x_k + α_k d_k; check when to stop (e.g., if ∇f(x_{k+1}) = 0).
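
A minimal Python sketch of the generic loop on this slide (x_{k+1} = x_k + α_k d_k with a stopping test on the gradient). The steepest-descent direction, the fixed stepsize, the tolerance, and the toy quadratic are illustrative assumptions, not from the slides.

```python
import numpy as np

def descent(grad_f, x0, stepsize=0.1, tol=1e-8, max_iter=1000):
    """Generic descent loop: x_{k+1} = x_k + alpha_k * d_k.

    Here d_k = -grad_f(x_k) (steepest descent) and alpha_k is a fixed
    stepsize; both choices are illustrative assumptions.
    """
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:   # stop when the gradient is (nearly) zero
            break
        d = -g                         # descent direction
        x = x + stepsize * d           # the update x_{k+1} = x_k + alpha_k d_k
    return x

# Example: f(x) = 0.5 * ||x||^2, so grad f(x) = x and the minimizer is 0.
x_star = descent(lambda x: x, x0=[1.0, -2.0])
```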

  10.–13. Gradient methods: x_{k+1} = x_k + α_k d_k, k = 0, 1, ...
Stepsize α_k ≥ 0 usually ensures f(x_{k+1}) < f(x_k).
Descent direction d_k satisfies ⟨∇f(x_k), d_k⟩ < 0.
Numerous ways to select α_k and d_k.
Usually methods seek monotonic descent f(x_{k+1}) < f(x_k).

  14.–15. Gradient methods – direction: x_{k+1} = x_k + α_k d_k, k = 0, 1, ...
◮ Different choices of direction d_k
◦ Scaled gradient: d_k = −D_k ∇f(x_k), D_k ≻ 0
◦ Newton's method: D_k = [∇²f(x_k)]^{−1}
◦ Quasi-Newton: D_k ≈ [∇²f(x_k)]^{−1}
◦ Steepest descent: D_k = I
◦ Diagonally scaled: D_k diagonal with (D_k)_ii ≈ [∂²f(x_k)/(∂x_i)²]^{−1}
◦ Discretized Newton: D_k = [H(x_k)]^{−1}, H via finite differences
◦ ...
Exercise: Verify that ⟨∇f(x_k), d_k⟩ < 0 for the above choices.
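
A small sketch of the exercise: for a scaled-gradient direction d_k = −D_k ∇f(x_k) with D_k ≻ 0, the inner product ⟨∇f(x_k), d_k⟩ = −∇f(x_k)ᵀ D_k ∇f(x_k) is negative whenever ∇f(x_k) ≠ 0. The quadratic test function and the particular scalings below are illustrative assumptions.

```python
import numpy as np

# Quadratic test problem f(x) = 0.5 x^T A x - b^T x (illustrative assumption).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_f = lambda x: A @ x - b

x = np.array([2.0, -1.0])
g = grad_f(x)

directions = {
    "steepest descent (D = I)": -g,
    "Newton (D = inverse Hessian)": -np.linalg.solve(A, g),
    "diagonally scaled": -(g / np.diag(A)),  # (D)_ii = 1 / (d^2 f / dx_i^2)
}

for name, d in directions.items():
    # Descent condition: <grad f(x), d> < 0 for each choice.
    print(name, float(g @ d) < 0)
```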

  16.–20. Gradient methods – stepsize
◮ Exact: α_k := argmin_{α ≥ 0} f(x_k + α d_k)
◮ Limited minimization: α_k := argmin_{0 ≤ α ≤ s} f(x_k + α d_k)
◮ Armijo rule: Given fixed scalars s, β, σ with 0 < β < 1 and 0 < σ < 1 (chosen experimentally), set α_k = β^{m_k} s, where we try β^m s for m = 0, 1, ... until sufficient descent: f(x_k) − f(x_k + β^m s d_k) ≥ −σ β^m s ⟨∇f(x_k), d_k⟩. If ⟨∇f(x_k), d_k⟩ < 0, such a stepsize is guaranteed to exist. Usually σ is small, in [10^{−5}, 0.1], while β ranges from 1/2 to 1/10 depending on how confident we are about the initial stepsize s.
◮ Constant: α_k = 1/L (for a suitable value of L)
◮ Diminishing: α_k → 0 but Σ_k α_k = ∞
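
A minimal sketch of the Armijo rule above, assuming a toy quadratic objective; the particular values s = 1, β = 0.5, σ = 0.01 are illustrative choices within the ranges just mentioned.

```python
import numpy as np

def armijo_stepsize(f, grad_f, x, d, s=1.0, beta=0.5, sigma=0.01, max_m=50):
    """Armijo rule: try beta^m * s for m = 0, 1, ... until
    f(x) - f(x + beta^m s d) >= -sigma * beta^m * s * <grad f(x), d>."""
    g_dot_d = grad_f(x) @ d          # should be < 0 for a descent direction
    alpha = s
    for _ in range(max_m):
        if f(x) - f(x + alpha * d) >= -sigma * alpha * g_dot_d:
            return alpha
        alpha *= beta                # backtrack
    return alpha

# Illustrative quadratic: f(x) = 0.5 ||x||^2, steepest-descent direction.
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([3.0, -4.0])
alpha = armijo_stepsize(f, grad_f, x, d=-grad_f(x))
```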

  21.–24. Gradient methods – nonmonotonic steps*
Stepsize computation can be expensive. Convergence analysis depends on monotonic descent.
Give up the search for stepsizes: use closed-form formulae for stepsizes. Don't insist on monotonic descent? (E.g., diminishing stepsizes do not give monotonic descent.)
Barzilai & Borwein stepsizes: x_{k+1} = x_k − α_k ∇f(x_k), k = 0, 1, ...
α_k = ‖u_k‖² / ⟨u_k, v_k⟩  or  α_k = ⟨u_k, v_k⟩ / ‖v_k‖²,  where u_k = x_k − x_{k−1}, v_k = ∇f(x_k) − ∇f(x_{k−1})
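
A minimal sketch of the gradient method with the first Barzilai–Borwein stepsize α_k = ‖u_k‖² / ⟨u_k, v_k⟩ on an assumed quadratic test problem; the initial stepsize and the safeguard on a tiny denominator are illustrative choices, not from the slides.

```python
import numpy as np

def gradient_bb(grad_f, x0, alpha0=1e-3, max_iter=200, tol=1e-8):
    """Gradient method x_{k+1} = x_k - alpha_k grad f(x_k) with the (first)
    Barzilai-Borwein stepsize alpha_k = <u,u> / <u,v>,
    u = x_k - x_{k-1}, v = grad f(x_k) - grad f(x_{k-1})."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad_f(x_prev)
    x = x_prev - alpha0 * g_prev            # one plain gradient step to start
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:
            break
        u, v = x - x_prev, g - g_prev
        denom = u @ v
        alpha = (u @ u) / denom if abs(denom) > 1e-12 else alpha0  # safeguard
        x_prev, g_prev = x, g
        x = x - alpha * g                   # note: no monotonic-descent guarantee
    return x

# Illustrative quadratic f(x) = 0.5 x^T A x - b^T x with SPD A.
A = np.array([[10.0, 2.0], [2.0, 1.0]])
b = np.array([1.0, 0.0])
x_star = gradient_bb(lambda x: A @ x - b, x0=np.zeros(2))
```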

  25. Least-squares

  26. Nonnegative least squares: min_{x ≥ 0} (1/2)‖Ax − b‖², where x models intensities, concentrations, frequencies, ...
Applications: machine learning, physics, statistics, bioinformatics, image processing, remote sensing, computer vision, engineering, medical imaging, inverse problems, astronomy, finance.

  27. NNLS: min ‖Ax − b‖² s.t. x ≥ 0
Unconstrained solution (solve ∇f(x) = 0): x_uc = (AᵀA)^{−1} Aᵀb.
Cannot just truncate: in general the NNLS solution x* differs from (x_uc)_+ (figure: x*, (x_uc)_+, x_uc).
The constraint x ≥ 0 makes the problem trickier as the problem size grows.
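
A small numerical sketch of the point above: clipping the unconstrained solution, (x_uc)_+, is generally not the NNLS solution. The tiny data matrix and the use of scipy.optimize.nnls as a reference solver are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import nnls

# A tiny instance where clipping the unconstrained solution fails
# (the data are an illustrative assumption).
A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
b = np.array([1.0, -1.0, 1.0])

x_uc = np.linalg.solve(A.T @ A, A.T @ b)   # unconstrained least-squares solution
x_clip = np.clip(x_uc, 0.0, None)          # naive truncation (x_uc)_+
x_nnls, _ = nnls(A, b)                     # actual NNLS solution

resid = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
print(x_uc, x_clip, x_nnls)
print(resid(x_clip), resid(x_nnls))        # clipping gives a worse objective
```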

  28.–29. Solving NNLS scalably: projected gradient iteration x ← (x − α∇f(x))_+ (figure: x*, x_uc).
A good choice of α is crucial:
◮ Backtracking line-search
◮ Armijo
◮ and many others
Too slow!
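
A minimal projected-gradient sketch for NNLS, x ← (x − α∇f(x))_+, using a constant stepsize α = 1/L with L = ‖AᵀA‖₂ rather than the line searches listed above; the fixed stepsize and the random test data are illustrative assumptions.

```python
import numpy as np

def nnls_projected_gradient(A, b, max_iter=500, tol=1e-8):
    """Projected gradient for min_{x >= 0} 0.5 ||Ax - b||^2:
    x <- (x - alpha * grad f(x))_+ with alpha = 1/L, L = ||A^T A||_2."""
    AtA, Atb = A.T @ A, A.T @ b
    alpha = 1.0 / np.linalg.norm(AtA, 2)          # constant 1/L stepsize
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        grad = AtA @ x - Atb
        x_new = np.maximum(x - alpha * grad, 0.0)  # gradient step + projection
        if np.linalg.norm(x_new - x) <= tol:
            break
        x = x_new
    return x

# Illustrative random instance.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)
x = nnls_projected_gradient(A, b)
```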

  30. NNLS: a long-studied problem

Method          Remarks              Scalability   Accuracy
NNLS (1976)     MATLAB default       poor          high
FNNLS (1989)    fast NNLS            poor          high
LBFGS-B (1997)  famous solver        fair          medium
TRON (1999)     TR Newton            poor          high
SPG (2000)      spectral proj.       fair+         medium
ASA (2006)      prev. state-of-art   fair+         medium
SBB (2011)      subspace BB steps    very good     medium

  31. Spectacular failure of projection (figure: objective function value vs. running time in seconds for naive BB + projection, x′ = (x − α∇f(x))_+).
