

  1. CS/ECE/ISyE 524 Introduction to Optimization, Spring 2017–18
     25. NLP algorithms
     - Overview
     - Local methods
     - Constrained optimization
     - Global methods
     - Black-box methods
     - Course wrap-up
     Laurent Lessard (www.laurentlessard.com)

  2. Review of algorithms. Studying Linear Programs, we talked about:
     - Simplex method: traverse the surface of the feasible polyhedron looking for the vertex with minimum cost. Only applicable to linear programs. Used by solvers such as Clp and CPLEX; hybrid versions are used by Gurobi and Mosek.
     - Interior-point methods: traverse the inside of the feasible polyhedron and move toward the boundary point with minimum cost. Applicable to many different types of optimization problems. Used by SCS, ECOS, and Ipopt.

  3. Review of algorithms. Studying Mixed Integer Programs, we talked about:
     - Cutting-plane methods: solve a sequence of LP relaxations and keep adding cuts (special extra linear constraints) until the solution is integral, and therefore optimal. Also applicable to more general convex problems.
     - Branch-and-bound methods: solve a sequence of LP relaxations to obtain bounds, and branch on fractional variables. Store subproblems in a tree and prune branches that aren't fruitful. Most optimization problems can be solved this way: you just need a way to branch (split the feasible set) and a way to bound (efficiently relax).
     - Variants of the methods above are used by all MIP solvers.

  4. Overview of NLP algorithms. To solve Nonlinear Programs with continuous variables, there is a wide variety of available algorithms. We'll assume the problem has the standard form:

        minimize (over x)   f_0(x)
        subject to:         f_i(x) ≤ 0   for i = 1, ..., m

     What works best depends on the kind of problem you're solving, so we need to talk about problem categories.

  5. Overview of NLP algorithms
     1. Are the functions differentiable? Can we efficiently compute gradients or second derivatives of the f_i?
     2. What problem size are we dealing with? A few variables and constraints? Hundreds? Thousands? Millions?
     3. Do we want to find local optima, or do we need the global optimum (more difficult!)?
     4. Does the objective function have a large number of local minima, or a relatively small number?
     Note: items 3 and 4 don't matter if the problem is convex. In that case, any local minimum is also a global minimum!

  6. Survey of NLP algorithms
     - Local methods using derivative information. This is what most NLP solvers use (and what most JuMP solvers use).
       - unconstrained case
       - constrained case
     - Global methods
     - Derivative-free methods

  7. Local methods using derivatives. Let's start with the unconstrained case:

        minimize (over x)   f(x)

     Many methods available! Roughly ordered from cheap/slow to expensive/fast:
     stochastic gradient descent, gradient descent, accelerated methods,
     conjugate gradient, quasi-Newton methods, Newton's method.

  8. Iterative methods. Local methods iteratively step through the space looking for a point where ∇f(x) = 0.
     1. Pick a starting point x_0.
     2. Choose a direction Δ_k to move in. This is the part where different algorithms do different things.
     3. Update your location: x_{k+1} = x_k + Δ_k.
     4. Repeat until you're happy with the function value or the algorithm has ceased to make progress.
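As a sketch (my own minimal Python, not from the slides), the four steps above look like this, with gradient descent supplying the direction Δ_k:

```python
import numpy as np

def iterative_descent(grad, x0, step=0.1, tol=1e-8, max_iters=10000):
    """Generic local method: step through the space until the gradient
    is (nearly) zero or the iteration budget runs out."""
    x = np.asarray(x0, dtype=float)      # 1. pick a starting point x_0
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # 4. stop once we're happy
            break
        delta = -step * g                # 2. choose a direction (here: gradient descent)
        x = x + delta                    # 3. update x_{k+1} = x_k + Delta_k
    return x

# minimize f(x) = (x1 - 1)^2 + (x2 + 2)^2, whose minimum is at (1, -2)
grad = lambda v: np.array([2*(v[0] - 1), 2*(v[1] + 2)])
x_star = iterative_descent(grad, [0.0, 0.0])
```

Swapping in a different direction rule at step 2 gives the other methods on the following slides.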

  9. Vector calculus. Suppose f : R^n → R is a twice-differentiable function.
     - The gradient of f is a function ∇f : R^n → R^n defined by (∇f)_i = ∂f/∂x_i. The vector ∇f(x) points in the direction of greatest increase of f at x.
     - The Hessian of f is a function ∇²f : R^n → R^{n×n} defined by (∇²f)_{ij} = ∂²f/(∂x_i ∂x_j). The matrix ∇²f(x) encodes the curvature of f at x.

  10. Vector calculus. Example: suppose f(x, y) = x² + 3xy + 5y² − 7x + 2.
     - ∇f = [∂f/∂x; ∂f/∂y] = [2x + 3y − 7; 3x + 10y]
     - ∇²f = [∂²f/∂x², ∂²f/∂x∂y; ∂²f/∂y∂x, ∂²f/∂y²] = [2, 3; 3, 10]
     Taylor's theorem in n dimensions:

        f(x) ≈ f(x_0) + ∇f(x_0)ᵀ(x − x_0) + ½(x − x_0)ᵀ∇²f(x_0)(x − x_0) + ···

     The first two terms give the best linear approximation; keeping the quadratic term as well gives the best quadratic approximation.
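As a quick sanity check (not on the slides), the gradient above can be verified numerically with central finite differences:

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + 3*x*y + 5*y**2 - 7*x + 2

def grad(v):
    x, y = v
    return np.array([2*x + 3*y - 7, 3*x + 10*y])

hessian = np.array([[2.0, 3.0], [3.0, 10.0]])  # constant, since f is quadratic

# central finite differences to check the analytic gradient at a test point
v0, h = np.array([1.0, -1.0]), 1e-6
num_grad = np.array([(f(v0 + h*e) - f(v0 - h*e)) / (2*h) for e in np.eye(2)])
```

At v0 = (1, −1) the analytic gradient is (−8, −7), and the finite-difference estimate agrees to several digits.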

  11. Gradient descent. The simplest of all iterative methods. It's a first-order method, which means it only uses gradient information:

        x_{k+1} = x_k − t_k ∇f(x_k)

     - −∇f(x_k) points in the direction of local steepest decrease of the function. We will move in this direction.
     - t_k is the stepsize. Many ways to choose it:
       - Pick a constant: t_k = t.
       - Pick a slowly decreasing stepsize, such as t_k = 1/√k.
       - Exact line search: t_k = argmin_t f(x_k − t∇f(x_k)).
       - A heuristic method (most common in practice). Example: backtracking line search.
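A minimal sketch of gradient descent with backtracking line search (my own code, not from the slides), applied to the quadratic from the vector-calculus slide; the constants alpha and beta are typical illustrative choices:

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha=0.3, beta=0.5, tol=1e-8, max_iters=5000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # backtracking: shrink t until the step gives sufficient decrease
        t = 1.0
        while f(x - t*g) > f(x) - alpha * t * (g @ g):
            t *= beta
        x = x - t*g
    return x

# f(x,y) = x^2 + 3xy + 5y^2 - 7x + 2; setting the gradient to zero
# gives the unique minimizer x* = (70/11, -21/11)
f    = lambda v: v[0]**2 + 3*v[0]*v[1] + 5*v[1]**2 - 7*v[0] + 2
grad = lambda v: np.array([2*v[0] + 3*v[1] - 7, 3*v[0] + 10*v[1]])
x_star = gradient_descent(f, grad, [0.0, 0.0])
```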

  12. Gradient descent. We can gain insight into the effectiveness of a method by seeing how it performs on a quadratic: f(x) = ½xᵀQx. The condition number κ := λ_max(Q)/λ_min(Q) determines convergence.
     [Figures: iterate trajectories and semilog plots of distance to the optimal point vs. number of iterations, for the optimal stepsize, a shorter stepsize, and an even shorter stepsize; with κ = 10, convergence is far slower than with κ = 1.2.]
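The dependence on κ can be reproduced numerically (a sketch under my own setup, not the slide's exact experiment): count how many iterations of optimal-stepsize gradient descent are needed to reach a fixed tolerance for a well-conditioned versus a badly conditioned Q.

```python
import numpy as np

def iters_to_tol(Q, x0, tol=1e-10):
    """Gradient descent with the optimal constant step on f(x) = 0.5*x'Qx,
    counting iterations until the distance to the optimum (x = 0) is below tol."""
    lams = np.linalg.eigvalsh(Q)           # ascending eigenvalues
    t = 2.0 / (lams[0] + lams[-1])         # optimal constant stepsize
    x, k = np.array(x0, dtype=float), 0
    while np.linalg.norm(x) > tol and k < 100000:
        x = x - t * (Q @ x)                # gradient of 0.5*x'Qx is Qx
        k += 1
    return k

well  = iters_to_tol(np.diag([1.0, 1.2]),  [1.0, 1.0])  # kappa = 1.2
badly = iters_to_tol(np.diag([1.0, 10.0]), [1.0, 1.0])  # kappa = 10
```

With the optimal step, the error contracts by (κ − 1)/(κ + 1) per iteration, so the κ = 10 problem needs many more iterations than the κ = 1.2 one.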

  13. Gradient descent.
     Advantages:
     - Simple to implement and cheap to execute.
     - Can be easily adjusted.
     - Robust in the presence of noise and uncertainty.
     Disadvantages:
     - Convergence is slow.
     - Sensitive to conditioning. Even rescaling a variable can have a substantial effect on performance!
     - Not always easy to tune the stepsize.
     Note: the idea of preconditioning (rescaling) before solving adds another layer of possible customizations and tradeoffs.

  14. Other first-order methods. Accelerated methods (momentum methods):
     - Still a first-order method, but makes use of past iterates to accelerate convergence. Example: the heavy-ball method:

        x_{k+1} = x_k − α_k ∇f(x_k) + β_k (x_k − x_{k−1})

       Other examples: Nesterov, Beck & Teboulle, and others.
     - Can achieve substantial improvement over gradient descent with only a moderate increase in computational cost.
     - Not as robust to noise as gradient descent, and can be more difficult to tune because there are more parameters.
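A heavy-ball sketch (my own code; the constant α and β values are hand-picked for this example, which illustrates the extra tuning burden mentioned above):

```python
import numpy as np

def heavy_ball(grad, x0, alpha=0.1, beta=0.5, iters=500):
    """x_{k+1} = x_k - alpha*grad(x_k) + beta*(x_k - x_{k-1})"""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x, x_prev = x - alpha*grad(x) + beta*(x - x_prev), x
    return x

# quadratic f(x,y) = (x-1)^2 + 5(y+2)^2, minimum at (1, -2)
grad = lambda v: np.array([2*(v[0] - 1), 10*(v[1] + 2)])
x_star = heavy_ball(grad, [0.0, 0.0])
```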

  15. Other first-order methods.
     Mini-batch stochastic gradient descent (SGD):
     - Useful if f(x) = Σ_{i=1}^N f_i(x). Use the direction Σ_{i∈S} ∇f_i(x_k), where S ⊆ {1, ..., N}. The size of S determines the "batch size": |S| = 1 is SGD and |S| = N is ordinary gradient descent.
     - Same pros and cons as gradient descent, but allows a further tradeoff of speed vs. computation.
     - Industry standard for big-data problems like deep learning.
     Nonlinear conjugate gradient:
     - Variant of the standard conjugate gradient algorithm for solving Ax = b, but adapted for use in general optimization.
     - Requires more computation than accelerated methods.
     - Converges exactly in a finite number of steps when applied to quadratic functions.
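A mini-batch SGD sketch (my own illustration; the least-squares objective and all names are made up): each f_i(x) = (a_iᵀx − b_i)², and each update uses the summed gradient over a random batch S:

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = sum_i (a_i' x - b_i)^2, so grad f_i(x) = 2 a_i (a_i' x - b_i)
N, n = 200, 3
A = rng.standard_normal((N, n))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true                           # consistent system: optimum is x_true

def minibatch_sgd(x0, batch_size=10, step=0.01, epochs=200):
    x = np.asarray(x0, dtype=float)
    for _ in range(epochs):
        # shuffle indices and split them into batches S of size batch_size
        for S in np.split(rng.permutation(N), N // batch_size):
            g = 2 * A[S].T @ (A[S] @ x - b[S])   # sum of grad f_i over the batch
            x = x - step * g
    return x

x_hat = minibatch_sgd(np.zeros(n))
```

Setting batch_size=1 gives plain SGD and batch_size=N gives ordinary gradient descent, the two extremes mentioned above.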

  16. Newton's method. Basic idea: approximate the function as a quadratic, move directly to the minimum of that quadratic, and repeat.
     - If we're at x_k, then by Taylor's theorem:

        f(x) ≈ f(x_k) + ∇f(x_k)ᵀ(x − x_k) + ½(x − x_k)ᵀ∇²f(x_k)(x − x_k)

     - If ∇²f(x_k) ≻ 0, the minimum of the quadratic occurs at:

        x_{k+1} := x_opt = x_k − ∇²f(x_k)⁻¹ ∇f(x_k)

     - Newton's method is a second-order method; it requires computing the Hessian (second derivatives).
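A sketch of one multivariate Newton step (my own code), using the earlier quadratic example; for a quadratic, a single step lands exactly on the optimum:

```python
import numpy as np

def newton_step(grad, hess, x):
    # x_{k+1} = x_k - [hess(x_k)]^{-1} grad(x_k); solve the linear system
    # rather than forming the inverse explicitly
    return x - np.linalg.solve(hess(x), grad(x))

# f(x,y) = x^2 + 3xy + 5y^2 - 7x + 2 from the vector-calculus slide
grad = lambda v: np.array([2*v[0] + 3*v[1] - 7, 3*v[0] + 10*v[1]])
hess = lambda v: np.array([[2.0, 3.0], [3.0, 10.0]])

x1 = newton_step(grad, hess, np.zeros(2))   # exact optimum in one step
```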

  17. Newton's method in 1D. Example: f(x) = log(e^{x+3} + e^{−2x+2}), starting at x_0 = 0.5.
     [Figure: the iterates (x_0, f_0), (x_1, f_1), (x_2, f_2) rapidly approach the minimum.]
     Example by L. El Ghaoui, UC Berkeley, EE127a.

  18. Newton's method in 1D. Same example, f(x) = log(e^{x+3} + e^{−2x+2}), but starting at x_0 = 1.5: divergent! x_2 = 2.3 × 10⁶ ...
     [Figure: the first step (x_1, f_1) lands far from the minimum, and the iterates diverge.]
     Example by L. El Ghaoui, UC Berkeley, EE127a.
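Both behaviors can be reproduced in a few lines (a sketch; the derivatives are computed by hand from the f on these slides, and the exact magnitude of the divergent iterates depends on floating-point details):

```python
import numpy as np

# f(x) = log(e^(x+3) + e^(-2x+2)); writing u = e^(x+3), v = e^(-2x+2),
# the derivatives are f'(x) = (u - 2v)/(u + v) and f''(x) = 9uv/(u + v)^2
def fp(x):
    u, v = np.exp(x + 3), np.exp(-2*x + 2)
    return (u - 2*v) / (u + v)

def fpp(x):
    u, v = np.exp(x + 3), np.exp(-2*x + 2)
    return 9*u*v / (u + v)**2

def newton_1d(x0, iters=10):
    x = x0
    for _ in range(iters):
        x = x - fp(x) / fpp(x)
    return x

# Setting f'(x) = 0 gives u = 2v, i.e. x* = (ln 2 - 1)/3 ≈ -0.102.
x_good = newton_1d(0.5)            # converges to x*
x_bad  = newton_1d(1.5, iters=2)   # diverges: |x| blows up after two steps
```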

  19. Newton's method.
     Advantages:
     - It's usually very fast. Converges to the exact optimum in one iteration if the objective is quadratic.
     - It's scale-invariant. The convergence rate is not affected by any linear scaling or transformation of the variables.
     Disadvantages:
     - If n is large, storing the Hessian (an n × n matrix) and computing ∇²f(x_k)⁻¹∇f(x_k) can be prohibitively expensive.
     - If ∇²f(x_k) ⊁ 0, Newton's method may converge to a local maximum or a saddle point.
     - May fail to converge at all if we start too far from the optimal point.

  20. Quasi-Newton methods.
     - An approximate Newton's method that doesn't require computing the Hessian.
     - Uses an approximation H_k ≈ ∇²f(x_k)⁻¹ that can be updated directly and is faster to compute than the full Hessian:

        x_{k+1} = x_k − H_k ∇f(x_k)
        H_{k+1} = g(H_k, ∇f(x_k), x_k)

     - Several popular update schemes for H_k:
       - DFP (Davidon–Fletcher–Powell)
       - BFGS (Broyden–Fletcher–Goldfarb–Shanno)
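A minimal BFGS sketch (my own code; production implementations are far more careful): the inverse-Hessian approximation H_k is updated directly from gradient differences, and a backtracking line search is added for robustness:

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iters=500):
    n = len(x0)
    x = np.asarray(x0, dtype=float)
    H = np.eye(n)                      # H_k approximates the inverse Hessian
    g = grad(x)
    for _ in range(max_iters):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g                     # quasi-Newton direction
        t = 1.0                        # backtracking (Armijo) line search
        while f(x + t*d) > f(x) + 1e-4 * t * (g @ d):
            t *= 0.5
        x_new = x + t*d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 1e-12:              # curvature condition; skip update otherwise
            rho = 1.0 / (y @ s)
            I = np.eye(n)
            # BFGS update of the inverse-Hessian approximation
            H = (I - rho*np.outer(s, y)) @ H @ (I - rho*np.outer(y, s)) \
                + rho*np.outer(s, s)
        x, g = x_new, g_new
    return x

# the quadratic example used earlier, with minimizer (70/11, -21/11)
f    = lambda v: v[0]**2 + 3*v[0]*v[1] + 5*v[1]**2 - 7*v[0] + 2
grad = lambda v: np.array([2*v[0] + 3*v[1] - 7, 3*v[0] + 10*v[1]])
x_star = bfgs(f, grad, np.zeros(2))
```

Only gradients are evaluated; the Hessian is never formed, which is the whole point of the method.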
