Minimization Using Descent Information




  1. Minimization Using Descent Information
     • We will consider the minimization of unconstrained functions of several variables where we now assume we have some derivative information, such as the gradient vector or the Hessian matrix.
     • Recall that Powell's method used the powerful concept of conjugate directions and performed a series of line searches.
     • We will see how these conjugate directions are related to the gradient directions, and we will introduce a very powerful method called the conjugate-gradient technique.
     • Recall that Taylor's expansion uses such information (a small numerical check is sketched below):

          f(x + ∆x) ≅ f(x) + ∇f(x)^T ∆x + (1/2) ∆x^T H(x) ∆x

     • Methods using only first derivatives are called first-order methods.
     • Methods using second-order derivatives are called second-order methods.
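The following is a minimal sketch (not part of the original slides) that checks the second-order Taylor expansion numerically; the test function f(x) = x_1^4 + x_1 x_2 + x_2^2 and its derivatives are illustrative choices.

```python
# A minimal sketch (illustrative test function, not from the slides): compare
# f(x + dx) with the second-order Taylor model
#   f(x) + grad(x)^T dx + (1/2) dx^T H(x) dx
import numpy as np

def f(x):
    return x[0]**4 + x[0]*x[1] + x[1]**2

def grad(x):                              # gradient vector of f
    return np.array([4*x[0]**3 + x[1], x[0] + 2*x[1]])

def hess(x):                              # Hessian matrix of f
    return np.array([[12*x[0]**2, 1.0],
                     [1.0,        2.0]])

x  = np.array([1.0, -0.5])
dx = np.array([1e-2, 2e-2])

taylor = f(x) + grad(x) @ dx + 0.5 * dx @ hess(x) @ dx
print(f(x + dx), taylor)                  # the two values agree to O(||dx||^3)
```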

  2. The Gradient Vector Re-examined
     • Recall that the gradient vector of a function f(x), x ∈ R^n, is

          g(x) = ∇f(x) = [ ∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n ]^T

     • Consider a differential length

          dx = [ dx_1, dx_2, …, dx_n ]^T = u ds

       where ||u|| = 1 and u holds the directional information of the differential.
     • Now a change in the function f(x) along dx is given by

          df = Σ_{i=1}^{n} (∂f/∂x_i) dx_i = (∇f(x), dx) = (∇f(x), u) ds

       so that the rate of change of f(x) along the arbitrary direction u is given by

  3.
          df/ds = (∇f(x), u)

     • On the other hand, along a direction u the function f(x) is described as f(x + αu), and thus the rate of change along this direction can also be written as

          df/ds = (∇f(x), u) = (d/dα) f(x + αu)                                   (1)

     • We can examine (1) to see which direction gives the maximum rate of increase.
     • A well-known inequality in functional analysis, the Cauchy-Schwarz inequality, says

          (a, b) ≤ ||a|| ||b||                                                    (2)

       Applying this to (1) we find

          (∇f(x), u) ≤ ||∇f(x)|| ||u|| = ||∇f(x)||                                (3)

       and letting u = ∇f(x)/||∇f(x)|| we have

          ( ∇f(x), ∇f(x)/||∇f(x)|| ) = ||∇f(x)||^2 / ||∇f(x)|| = ||∇f(x)||        (4)

       which shows that for this choice of direction the directional derivative (∇f(x), u) reaches its maximum value.
     • Therefore we say that ∇f(x) is the direction of maximum increase and -∇f(x) is the direction of maximum decrease (also: steepest ascent and steepest descent); a small numerical check is sketched below.
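The following sketch (illustrative function and gradient, not from the slides) checks numerically that the directional derivative (∇f(x), u) is largest when u is the normalized gradient, and that no random unit direction exceeds the bound in (3).

```python
# A minimal sketch: the directional derivative (grad, u) is maximized by
# u = grad/||grad||, and for any unit u it is bounded by ||grad|| (eq. (3)).
import numpy as np

def grad(x):                                  # gradient of f(x) = x1^2 + 3*x2^2 + x1*x2
    return np.array([2*x[0] + x[1], x[0] + 6*x[1]])

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])
g = grad(x)
u_star = g / np.linalg.norm(g)                # steepest-ascent direction

best_random = max(g @ (u / np.linalg.norm(u))
                  for u in rng.standard_normal((1000, 2)))
print(g @ u_star, np.linalg.norm(g), best_random)   # first two are equal; third is smaller
```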

  4. Cauchy's Method (Steepest Descent)
     • A logical minimization strategy is to use the direction of steepest descent and perform a line search in that direction.
     • Assume we are at a point x_k and that we have calculated the gradient at this point, g_k = g(x_k) = ∇f(x_k).
     • Then we can minimize along g_k starting from x_k:

          α_k = arg min_α f(x_k + α g_k)                                          (5)

       and arrive at the new point

          x_{k+1} = x_k + α_k g_k.                                                (6)

     Algorithm: Cauchy's Method of Steepest Descent
       1. input: f(x), g(x), x_0, g_tol, k_max
       2. set: x = x_0, g = g(x_0), k = 0
       3. while k < k_max and ||g|| > g_tol
       4.    set: α = arg min_α f(x + α g)
       5.    set: x = x + α g
       6.    set: g = g(x), k = k + 1
       7. end

     • Note that we don't bother to take the negative of g in the line search since this is automatically accomplished by allowing negative values of α.
     • The iterations end when either a maximum number of iterations has been reached or the norm of the gradient at the current point is less than a user-defined value g_tol, which is a value close to zero. A sketch of the algorithm in code appears below.
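Below is a minimal Python sketch of the algorithm above. It is an assumed implementation: the exact one-dimensional search arg min_α f(x + αg) is approximated by SciPy's Brent line minimizer (which allows negative α, so g is not negated), and the quadratic test function in the usage lines is an illustrative choice.

```python
# A minimal sketch of Cauchy's method of steepest descent (steps 1-7 above).
import numpy as np
from scipy.optimize import minimize_scalar

def cauchy_descent(f, g, x0, g_tol=1e-6, k_max=1000):
    x = np.asarray(x0, dtype=float)                          # step 2: x = x_0
    grad, k = g(x), 0                                        #         g = g(x_0), k = 0
    while k < k_max and np.linalg.norm(grad) > g_tol:        # step 3
        alpha = minimize_scalar(lambda a: f(x + a * grad)).x # step 4: 1-D line search
        x = x + alpha * grad                                 # step 5
        grad, k = g(x), k + 1                                # step 6
    return x, k

# usage on an illustrative quadratic f(x) = x_1^2 + 10*x_2^2
f = lambda x: x[0]**2 + 10*x[1]**2
g = lambda x: np.array([2*x[0], 20*x[1]])
print(cauchy_descent(f, g, [5.0, 1.0]))
```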

  5.
     • The successive directions of minimization in the method of steepest descent are orthogonal to each other.
     • This can be shown as follows: assume that we are at a point x_k and we need to find

          α_k = arg min_α f(x_k + α g_k)                                          (7)

       in order to arrive at the new point

          x_{k+1} = x_k + α_k g_k.                                                (8)

     • We can find α_k by setting the derivative of F(α) = f(x_k + α g_k) with respect to α equal to zero, which from (1) we can write as

          dF(α)/dα = (∇f(x_{k+1}), g_k) = (g_{k+1}, g_k) = 0.                     (9)

       This immediately shows the orthogonality between the successive directions g_{k+1} and g_k; a numerical check is sketched below.
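The orthogonality in (9) can be checked numerically. The sketch below uses an illustrative quadratic, for which the exact line-search step along g_k has the closed form α_k = -(g_k, g_k)/(g_k, C g_k) (a standard fact for quadratics, not derived in the slides).

```python
# A minimal sketch: after one exact line search along g_k, the new gradient
# g_{k+1} is orthogonal to g_k, i.e. (g_{k+1}, g_k) = 0 as in (9).
import numpy as np

C = np.array([[4.0, 1.0], [1.0, 3.0]])        # symmetric positive definite
b = np.array([-1.0, 2.0])
g = lambda x: C @ x + b                        # gradient of 0.5*x^T C x + b^T x

x_k = np.array([2.0, -1.0])
g_k = g(x_k)
alpha_k = -(g_k @ g_k) / (g_k @ C @ g_k)       # exact minimizer of f(x_k + a*g_k)
g_next = g(x_k + alpha_k * g_k)
print(g_next @ g_k)                            # ~ 0 (machine precision)
```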

  6.
     • Now, although it may seem like a good idea to minimize along the direction of steepest descent, it turns out that this method is not very efficient.
     • For relatively complicated functions, this method will tend to zig-zag towards the minimum.

       [Figure: zig-zagging effect of Cauchy's method, showing the successive orthogonal search directions g_1, g_2 in the (x_1, x_2) plane.]

     • Note that since successive directions of descent are always orthogonal, in two dimensions the algorithm searches in only two directions.
     • This is what makes the algorithm slow to converge to the minimum, and the method is generally not recommended; the slow convergence is illustrated numerically below.
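To illustrate the slow, zig-zagging convergence, the short sketch below (an illustrative problem, not from the slides) runs exact-line-search steepest descent on the ill-conditioned 2-D quadratic f(x) = 0.5(x_1^2 + 100 x_2^2) and counts the iterations.

```python
# A minimal sketch: steepest descent with exact line searches zig-zags on an
# ill-conditioned quadratic and needs many hundreds of iterations even in 2-D.
import numpy as np

C = np.diag([1.0, 100.0])
g = lambda x: C @ x                            # gradient of 0.5 * x^T C x

x, k = np.array([100.0, 1.0]), 0
while np.linalg.norm(g(x)) > 1e-6 and k < 10_000:
    gk = g(x)
    alpha = -(gk @ gk) / (gk @ C @ gk)         # exact step length for a quadratic
    x, k = x + alpha * gk, k + 1
print(k)                                       # many hundreds of iterations
```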

  7. The Conjugate-Gradient Method
     • It turns out that if we have access to the gradient of the objective function, then we can determine conjugate directions relatively efficiently.
     • Recall that Powell's conjugate direction method requires n single-variable minimizations per iteration in order to determine one new conjugate direction at the end of the iteration. This results in approximately n^2 line minimizations to find the minimum of a quadratic function.
     • In the conjugate-gradient algorithm, access to the gradient of f(x), that is g(x) = ∇f(x), allows us to set up a new conjugate direction after every line minimization.
     • Consider the quadratic function

          Q(x) = a + b^T x + (1/2) x^T C x                                        (10)

       where x ∈ R^n.
     • We want to perform successive line minimizations along conjugate directions, say s(x_k), where x_k is the current search point, and we minimize to the next point as

          x_{k+1} = x_k + λ_k s(x_k).                                             (11)

     • Now, how do we find these conjugate directions s_k = s(x_k) given previous information? (The quadratic model (10) and the C-conjugacy condition are sketched in code below.)
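The sketch below writes out the quadratic model (10), its gradient g(x) = Cx + b (used on the next slide), and the C-conjugacy condition (s_i, C s_j) = 0; the particular C, b, and a are illustrative placeholders.

```python
# A minimal sketch of the quadratic model Q(x) = a + b^T x + (1/2) x^T C x
# and of what it means for two directions to be C-conjugate.
import numpy as np

C = np.array([[3.0, 1.0], [1.0, 2.0]])         # symmetric positive definite
b = np.array([1.0, -1.0])
a = 0.0

def Q(x):
    return a + b @ x + 0.5 * x @ C @ x

def grad_Q(x):                                 # g(x) = C x + b
    return C @ x + b

def are_conjugate(s_i, s_j, tol=1e-10):        # C-conjugacy: (s_i, C s_j) = 0
    return abs(s_i @ C @ s_j) < tol

# e.g. the eigenvectors of C are C-conjugate (they are orthogonal, and C only rescales them)
w, V = np.linalg.eigh(C)
print(are_conjugate(V[:, 0], V[:, 1]))         # True
```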

  8.
     • Expand the search directions in terms of the gradient at the current point, g_k = ∇f(x_k), and a linear combination of the previous search directions:

          s_k = -g_k + Σ_{i=0}^{k-1} γ_i s_i                                      (12)

       where we start with the initial search direction as the steepest-descent direction, s_0 = -g_0.
     • The coefficients of the expansion, γ_i, i = 0, 1, …, k-1, are to be chosen so that the s_k are C-conjugate.
     • Therefore, for k = 1 we have

          s_1 = -g_1 + γ_0 s_0 = -g_1 - γ_0 g_0                                   (13)

       and we require that

          (s_1, C s_0) = 0 = (-g_1 - γ_0 g_0, C s_0)                              (14)

     • But x_1 = x_0 + λ_0 s_0, which can be solved for s_0 as

          s_0 = (x_1 - x_0)/λ_0 = ∆x_0/λ_0                                        (15)

       where ∆x_0 = x_1 - x_0 is the forward difference. Using this in (14) we have

          ( -g_1 - γ_0 g_0, C ∆x_0/λ_0 ) = 0                                      (16)

       but g(x) = C x + b, which means that we can set

          ∆g_0 = g_1 - g_0 = C(x_1 - x_0) = C ∆x_0                                (17)

  9.
     • Substituting (17) into (16), we have

          ( -g_1 - γ_0 g_0, ∆g_0/λ_0 ) = 0
          -(g_1, ∆g_0) = γ_0 (g_0, ∆g_0)

       and finally

          γ_0 = -(∆g_0, g_1) / (∆g_0, g_0).                                       (18)

     • We can further reduce the numerator as follows:

          (∆g_0, g_1) = (g_1 - g_0, g_1) = (g_1, g_1) - (g_0, g_1) = ||g_1||^2    (19)

       since (g_0, g_1) = 0 (successive directions of minimization are orthogonal).
     • The denominator can also be rewritten as

          (∆g_0, g_0) = (g_1 - g_0, g_0) = (g_1, g_0) - (g_0, g_0) = -||g_0||^2   (20)

       and therefore the coefficient γ_0 can be calculated from

          γ_0 = ||g_1||^2 / ||g_0||^2.                                            (21)

     • The conjugate direction s_1 can then be calculated as

          s_1 = -g_1 + ( ||g_1||^2 / ||g_0||^2 ) s_0.                             (22)

       A numerical check of the C-conjugacy of s_1 and s_0 is sketched below.
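The sketch below checks (21)-(22) numerically on an illustrative quadratic: starting from s_0 = -g_0 and taking one exact line-search step (closed form for a quadratic, as in the earlier check), the resulting s_1 is C-conjugate to s_0.

```python
# A minimal sketch: with gamma_0 = ||g_1||^2/||g_0||^2 as in (21), the direction
# s_1 = -g_1 + gamma_0 * s_0 of (22) satisfies (s_1, C s_0) = 0.
import numpy as np

C = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([-1.0, 2.0])
g = lambda x: C @ x + b                        # gradient of the quadratic (10)

x0 = np.array([2.0, -1.0])
g0 = g(x0)
s0 = -g0                                       # initial steepest-descent direction
lam0 = -(g0 @ s0) / (s0 @ C @ s0)              # exact line-search step along s0
x1 = x0 + lam0 * s0
g1 = g(x1)

gamma0 = (g1 @ g1) / (g0 @ g0)                 # equation (21)
s1 = -g1 + gamma0 * s0                         # equation (22)
print(s1 @ C @ s0)                             # ~ 0: s_1 is C-conjugate to s_0
```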

  10.
     • Now, although we derived this for s_1, we could have derived it for any s_k, to get

          s_k = -g_k + ( ||g_k||^2 / ||g_{k-1}||^2 ) s_{k-1}                      (23)

     • This represents the Fletcher-Reeves scheme (1964) for determining conjugate search directions from the gradient at the current point, g_k, the previous search direction, s_{k-1}, and the magnitude of the gradient at the previous point, ||g_{k-1}||.
     • Two alternative update equations are the Hestenes-Stiefel method (1952):

          s_k = -g_k + [ (∆g_{k-1}, g_k) / (∆g_{k-1}, s_{k-1}) ] s_{k-1}          (24)

       and the Polak-Ribière method (1969):

          s_k = -g_k + [ (∆g_{k-1}, g_k) / ||g_{k-1}||^2 ] s_{k-1}                (25)

       where ∆g_{k-1} = g_k - g_{k-1}.
     • All three schemes are identical for quadratic functions but differ for non-quadratic functions.
     • For quadratic functions, these methods will find the exact minimum in n iterations.
     • For non-quadratic functions, the search direction is reset to the steepest-descent direction every n iterations. A sketch of the resulting algorithm appears below.
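Below is a minimal Python sketch of a conjugate-gradient minimizer using the Fletcher-Reeves update (23), Brent line searches in place of exact line minimizations, and a reset to steepest descent every n iterations as described above; the function names and the Rosenbrock test problem are illustrative choices, not from the slides.

```python
# A minimal sketch of the conjugate-gradient method (Fletcher-Reeves update).
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient(f, g, x0, g_tol=1e-6, k_max=1000):
    x = np.asarray(x0, dtype=float)
    n = x.size
    grad = g(x)
    s = -grad                                      # s_0 = -g_0 (steepest descent)
    for k in range(k_max):
        if np.linalg.norm(grad) <= g_tol:
            return x, k
        lam = minimize_scalar(lambda a: f(x + a * s)).x
        x = x + lam * s                            # x_{k+1} = x_k + lam_k * s_k  (11)
        grad_new = g(x)
        if (k + 1) % n == 0:                       # reset every n iterations
            s = -grad_new
        else:
            gamma = (grad_new @ grad_new) / (grad @ grad)   # Fletcher-Reeves (23)
            s = -grad_new + gamma * s
        grad = grad_new
    return x, k_max

# usage on the (non-quadratic) Rosenbrock function
f = lambda x: (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2
g = lambda x: np.array([-2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
                        200*(x[1] - x[0]**2)])
print(conjugate_gradient(f, g, [-1.2, 1.0]))
```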
