  1. MATH529 – Fundamentals of Optimization: Trust Region Algorithms
     Marco A. Montes de Oca, Mathematical Sciences, University of Delaware, USA

  2. Line Search vs. Trust Region
     Line Search:
     - Select a search (descent) direction p_k.
     - Select a step size α_k that ensures sufficient decrease of f(x_k + α_k p_k).
     - Move to the new point x_{k+1} = x_k + α_k p_k.
     Trust Region:
     - Build a model m_k of f at x_k (similar to Newton's method):
         m_k(p) = f_k + g_k^T p + (1/2) p^T B_k p.
     - Solve p_k = argmin_{p ∈ R^n} m_k(p)  s.t.  ||p|| ≤ ∆_k.
     - If the predicted decrease is good enough, then x_{k+1} = x_k + p_k. Otherwise, x_{k+1} = x_k and improve the model.
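A minimal sketch of the quadratic model in code (not part of the slides; f_k, g_k, and B_k are assumed to be supplied by the caller):

    import numpy as np

    def model(p, f_k, g_k, B_k):
        # Quadratic model m_k(p) = f_k + g_k^T p + (1/2) p^T B_k p
        return f_k + g_k @ p + 0.5 * p @ B_k @ p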

  3. Acceptance criterion
     To measure how well the predicted decrease matches the actual decrease, we use
         ρ_k = (f(x_k) − f(x_k + p_k)) / (m_k(0) − m_k(p_k)).
     Given that m_k(0) − m_k(p_k) > 0:
     - If ρ_k < 0, the predicted reduction is not obtained; the step is rejected and ∆_k is decreased.
     - If ρ_k ≈ 1, accept p_k and increase ∆_k.
     - If ρ_k > 0 but not ≈ 1, accept p_k and do not change ∆_k.
     - If ρ_k > 0 but ≈ 0, the step may or may not be accepted, and ∆_k is decreased.

  4. Algorithm
     Initialization: k = 0, ∆_0 > 0, and x_0 by educated guess. Set η_g ∈ (0, 1) (typically η_g = 0.9), η_a ∈ (0, η_g) (typically η_a = 0.1), γ_e ≥ 1 (typically γ_e = 2), and γ_s ∈ (0, 1) (typically γ_s = 0.5).
     Until convergence do:
     - Build the model m_k(p).
     - Solve the trust region subproblem (result: p_k).
     - Test the acceptance criterion (result: ρ_k).
     - If ρ_k ≥ η_g, then x_{k+1} = x_k + p_k and ∆_{k+1} = γ_e ∆_k.
     - Else if ρ_k ≥ η_a, then x_{k+1} = x_k + p_k.
     - Else (ρ_k < η_a), set ∆_{k+1} = γ_s ∆_k.
     - Increase k by one.
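A sketch of this outer loop in Python, following the update rules above. It assumes a routine solve_subproblem(g_k, B_k, delta) is available (for example, the Cauchy point or dogleg step of the later slides); that name and the keyword parameters are illustrative, not part of the slides.

    import numpy as np

    def trust_region(f, grad, hess, x0, delta0=1.0,
                     eta_g=0.9, eta_a=0.1, gamma_e=2.0, gamma_s=0.5,
                     tol=1e-8, max_iter=100):
        # Outer trust-region loop; solve_subproblem is assumed to return an
        # approximate minimizer of the model within radius delta.
        x, delta = np.asarray(x0, dtype=float), delta0
        for k in range(max_iter):
            g_k, B_k = grad(x), hess(x)
            if np.linalg.norm(g_k) < tol:
                break
            p_k = solve_subproblem(g_k, B_k, delta)           # trust-region subproblem
            predicted = -(g_k @ p_k + 0.5 * p_k @ B_k @ p_k)  # m_k(0) - m_k(p_k)
            actual = f(x) - f(x + p_k)
            rho = actual / predicted
            if rho >= eta_g:              # very good agreement: accept and expand
                x, delta = x + p_k, gamma_e * delta
            elif rho >= eta_a:            # acceptable: accept, keep delta
                x = x + p_k
            else:                         # poor agreement: reject and shrink
                delta = gamma_s * delta
        return x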

  5. Solving the trust region subproblem approximately
     We want to solve the subproblem as efficiently as possible, and we want a solution that decreases the model at least as much as steepest descent would, subject to the size of the trust region.

  6. Solving the trust region subproblem approximately
     [Figure] From Ruszczyński A., "Nonlinear Optimization", p. 268, Princeton University Press, 2006.

  7. Cauchy Point
     The Cauchy point can be found by minimizing the model along a line segment.
     Let p_k^s = −∆_k g_k / ||g_k|| (the point at the border of the trust region in the direction of steepest descent).
     The Cauchy point is p_k^C = τ_k p_k^s = −τ_k ∆_k g_k / ||g_k||. To find τ_k, consider
         g(τ) = m_k(τ p_k^s) = f_k + g_k^T (τ p_k^s) + (1/2)(τ p_k^s)^T B_k (τ p_k^s)
              = f_k + τ g_k^T p_k^s + (τ²/2)(p_k^s)^T B_k p_k^s.
     Differentiating with respect to τ: 0 = g'(τ) = g_k^T p_k^s + τ (p_k^s)^T B_k p_k^s, which means that

  8. Cauchy Point
         τ_k = − g_k^T p_k^s / ((p_k^s)^T B_k p_k^s).   (1)
     Substituting p_k^s = −∆_k g_k / ||g_k|| in (1):
         τ_k = − g_k^T (−∆_k g_k / ||g_k||) / ((−∆_k g_k / ||g_k||)^T B_k (−∆_k g_k / ||g_k||)) = ||g_k||³ / (∆_k g_k^T B_k g_k).
     However, there may be two problems: (a) τ_k > 1, so that τ_k p_k^s falls outside the trust region, or (b) g_k^T B_k g_k ≤ 0, that is, B_k is not positive definite. So, we define the Cauchy point as follows:
     Definition (Cauchy Point): p_k^C = τ_k p_k^s = −τ_k ∆_k g_k / ||g_k||, where τ_k = 1 if g_k^T B_k g_k ≤ 0, and τ_k = min{1, ||g_k||³ / (∆_k g_k^T B_k g_k)} otherwise.
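A direct translation of this definition into code (a sketch; g_k is assumed nonzero):

    import numpy as np

    def cauchy_point(g_k, B_k, delta_k):
        # Cauchy point p_k^C = -tau_k * delta_k * g_k / ||g_k|| (definition above)
        g_norm = np.linalg.norm(g_k)
        curvature = g_k @ B_k @ g_k          # g_k^T B_k g_k
        if curvature <= 0.0:
            tau = 1.0                        # model is not convex along -g_k
        else:
            tau = min(1.0, g_norm**3 / (delta_k * curvature))
        return -tau * delta_k * g_k / g_norm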

  9. Cauchy step is a baseline of performance
     - A reduction at least as good as the one obtained with the Cauchy step guarantees that the trust-region method is convergent.
     - The Cauchy step is just a steepest descent step with fixed length (∆_k). (Thus, it is inefficient.)
     - The direction of the Cauchy step does not depend directly on B_k, which means that curvature information is not exploited in its calculation.

  10. Improvements over the Cauchy step
     The main idea is to incorporate information provided by the "full step" (Newton step for the local model m_k): p_k^B = −B_k^{-1} g_k, whenever ||p_k^B|| ≤ ∆_k.
     Dogleg Method
     Let p_k^⋆ be the solution to the subproblem. If ∆_k ≥ ||p_k^B||, then p_k^⋆ = p_k^B. If, however, ∆_k ≪ ||p_k^B||, then p_k^⋆ ≈ p_k^s = −∆_k g_k / ||g_k||.
     The idea of the dogleg method is to combine these two directions and search for the minimum of the model along the resulting path p̃(τ):
         p̃(τ) = τ p_k^U                          for 0 ≤ τ ≤ 1,
         p̃(τ) = p_k^U + (τ − 1)(p_k^B − p_k^U)    for 1 < τ ≤ 2,
     where 0 ≤ τ ≤ 2 and p_k^U = −(g_k^T g_k / g_k^T B_k g_k) g_k, i.e., the steepest descent step with exact length (note that if ||p_k^C|| < ∆_k, then p_k^U = p_k^C).

  11. Dogleg Method
     [Figure] Adapted from Nocedal J. and Wright S., "Numerical Optimization", 2nd Ed., p. 74, Springer, 2006.

  12. Dogleg Method
     If B_k is positive definite, m_k(p̃(τ)) is a decreasing function of τ (Lemma 4.2, page 75). Therefore:
     - The minimum along p̃(τ) is attained at τ = 2 if ||p_k^B|| ≤ ∆_k.
     - If ||p_k^B|| > ∆_k, we need to find τ such that ||p̃(τ)|| = ∆_k.

  13. Dogleg Method
     Example: f(x, y) = x² + 10y²
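A sketch of the dogleg step, assuming B_k is positive definite; the boundary value of τ is obtained from the quadratic equation ||p_k^U + (τ − 1)(p_k^B − p_k^U)|| = ∆_k. The trailing lines apply it to the slide's example f(x, y) = x² + 10y²; the evaluation point (1, 1) and the radius 0.5 are illustrative choices, not from the slides.

    import numpy as np

    def dogleg_step(g_k, B_k, delta_k):
        # Dogleg step along the path p~(tau), assuming B_k positive definite.
        p_B = -np.linalg.solve(B_k, g_k)                    # full (Newton) step
        if np.linalg.norm(p_B) <= delta_k:
            return p_B
        p_U = -(g_k @ g_k) / (g_k @ B_k @ g_k) * g_k        # steepest-descent step with exact length
        if np.linalg.norm(p_U) >= delta_k:
            return -delta_k * g_k / np.linalg.norm(g_k)     # truncated steepest-descent step
        # Find tau in (1, 2] with ||p_U + (tau - 1)(p_B - p_U)|| = delta_k
        d = p_B - p_U
        a, b, c = d @ d, 2 * (p_U @ d), p_U @ p_U - delta_k**2
        t = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)      # positive root, t = tau - 1
        return p_U + t * d

    # Illustrative use on f(x, y) = x^2 + 10 y^2 at the point (1, 1):
    g = np.array([2.0, 20.0])                               # gradient at (1, 1)
    B = np.array([[2.0, 0.0], [0.0, 20.0]])                 # exact Hessian
    print(dogleg_step(g, B, delta_k=0.5))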

  14. 2D Subspace Minimization
     The dogleg path is completely contained in the plane spanned by p_k^U and p_k^B. Therefore, one may extend the search to the whole subspace spanned by p_k^U and p_k^B, span[p_k^U, p_k^B].

  15. 2D Subspace Minimization
     Given span[p_k^U, p_k^B] = { v | v = a p_k^U + b p_k^B, a, b ∈ R }, the subproblem is thus:
         min_{a,b ∈ R}  f_k + (a p_k^U + b p_k^B)^T ∇f_k + (1/2)(a p_k^U + b p_k^B)^T B_k (a p_k^U + b p_k^B)
         s.t.  ||a p_k^U + b p_k^B|| ≤ ∆_k,
     which can be solved using tools from constrained optimization. (To be discussed after the break.)
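A brute-force sketch of this subproblem, assuming B_k is positive definite and p_k^U, p_k^B are linearly independent: the unconstrained 2x2 reduced system is solved first, and if that solution violates the radius, the boundary of the elliptical constraint in (a, b)-space is scanned on a grid. In practice one would instead use the constrained-optimization tools mentioned on the slide; the function and parameter names below are illustrative.

    import numpy as np

    def two_d_subspace_step(g_k, B_k, delta_k, n_grid=360):
        p_B = -np.linalg.solve(B_k, g_k)
        p_U = -(g_k @ g_k) / (g_k @ B_k @ g_k) * g_k
        S = np.column_stack([p_U, p_B])                  # n x 2 basis of span[p_U, p_B]
        Gr, Br = S.T @ g_k, S.T @ B_k @ S                # reduced gradient and Hessian
        M = S.T @ S                                      # metric: ||S c||^2 = c^T M c
        c = np.linalg.solve(Br, -Gr)                     # unconstrained minimizer in the subspace
        if c @ M @ c <= delta_k**2:
            return S @ c
        # Otherwise scan the boundary ||S c|| = delta_k via c = delta_k * L^{-T} (cos t, sin t)
        L = np.linalg.cholesky(M)
        best, best_val = None, np.inf
        for t in np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False):
            c_t = delta_k * np.linalg.solve(L.T, np.array([np.cos(t), np.sin(t)]))
            val = Gr @ c_t + 0.5 * c_t @ Br @ c_t
            if val < best_val:
                best, best_val = c_t, val
        return S @ best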

  16. Issues

  17. Indefinite Hessians
     Problem: Newton's step may not be a descent direction.
     Example: Newton's step solves the system Hf_k p = −∇f_k. Now, with
         Hf_k = [ 10  0  0 ]
                [  0  3  0 ]
                [  0  0 −1 ]
     and ∇f_k = (1, −3, 2)^T, we solve Hf_k p = −(1, −3, 2)^T = (−1, 3, −2)^T, which gives p = (−1/10, 1, 2)^T. However, p^T ∇f_k > 0, thus p is not a descent direction.
     Solution approaches:
     - Replace negative eigenvalues by some small positive number.
     - Replace negative eigenvalues by their negatives (absolute values).

  18. Replace negative eigenvalues by some small positive number
     Now
         Hf_k = [ 10  0  0       ]
                [  0  3  0       ]
                [  0  0  10^(−6) ]
     so p^T ∇f_k < 0, but p = ?


  20. Replace negative eigenvalues by their negative
     Now
         Hf_k = [ 10  0  0 ]
                [  0  3  0 ]
                [  0  0  1 ]
     so p^T ∇f_k < 0, but p = ?
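A sketch of the two eigenvalue-repair strategies from slides 17-20, applied to the example Hessian and gradient above; the strategy names and the tolerance eps are illustrative choices. The printed steps answer the "p = ?" question on the slides and confirm the descent condition p^T ∇f_k < 0.

    import numpy as np

    def modify_hessian(H, strategy="small_positive", eps=1e-6):
        # Repair an indefinite (symmetric) Hessian by altering its negative eigenvalues.
        lam, Q = np.linalg.eigh(H)                    # H = Q diag(lam) Q^T
        if strategy == "small_positive":
            lam = np.where(lam <= 0.0, eps, lam)      # replace negatives by a small positive number
        else:
            lam = np.abs(lam)                         # replace negatives by their absolute value
        return Q @ np.diag(lam) @ Q.T

    H = np.diag([10.0, 3.0, -1.0])
    grad = np.array([1.0, -3.0, 2.0])
    for s in ("small_positive", "flip_sign"):
        p = np.linalg.solve(modify_hessian(H, s), -grad)
        print(s, p, "descent:", p @ grad < 0)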

  21. In practice
     Perturb B_k with βI such that:
     - (B_k + βI) p = −g_k,
     - β(∆_k − ||p||) = 0, and
     - B_k + βI is positive semidefinite,
     with β ∈ (−λ_1, −2λ_1], where λ_1 is the most negative eigenvalue of B_k.
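A minimal sketch of this perturbation. For simplicity β is set to −2λ_1, the upper end of the interval on the slide, when B_k is indefinite; a full implementation would additionally adjust β so that the complementarity condition β(∆_k − ||p||) = 0 holds (p on the boundary whenever β > 0).

    import numpy as np

    def shifted_newton_step(B_k, g_k, delta_k):
        # Perturb B_k with beta*I so that B_k + beta*I is positive definite,
        # then solve (B_k + beta*I) p = -g_k.  beta = -2*lambda_1 is a simple
        # choice from the slide's interval (-lambda_1, -2*lambda_1].
        lam_1 = np.linalg.eigvalsh(B_k)[0]            # most negative (smallest) eigenvalue
        beta = -2.0 * lam_1 if lam_1 < 0.0 else 0.0
        p = np.linalg.solve(B_k + beta * np.eye(B_k.shape[0]), -g_k)
        # Not done here: tune beta until beta * (delta_k - ||p||) = 0.
        return p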

  22. Further improvements
     - Iterative solution of the subproblem: to avoid direct Hessian manipulation.
     - Scaling: ||D p|| ≤ ∆_k. This creates elliptical trust regions, which reduce the problem of different scaling of some variables.

  23. Other methods
     - Conjugate Gradient Methods: A set of nonzero vectors {p_0, p_1, ..., p_n} is conjugate with respect to a symmetric positive definite matrix A if p_i^T A p_j = 0 for all i ≠ j.
     - Quasi-Newton Methods: Use changes in gradient information to estimate a model of the function in order to achieve superlinear convergence. Example: B_{k+1} α_k p_k = ∇f_{k+1} − ∇f_k (BFGS method).
     - Derivative-free methods.
     - Heuristic methods.
