Conjugate Direction Minimization
Lectures for the PhD course on Numerical Optimization
Enrico Bertolazzi, DIMS, Università di Trento
November 21 to December 14, 2011
(slide 1/106)


Convergence rate of Steepest Descent iterative scheme / The steepest descent convergence rate

Proof (3/5). STEP 3: problem reduction. Using a Lagrange multiplier, the maxima and minima are the stationary points of

$g(\alpha_1,\ldots,\alpha_n,\mu) = h(\alpha_1,\ldots,\alpha_n) + \mu\Big(\sum_{k=1}^n \alpha_k^2 - 1\Big)$

Setting $A = \sum_{k=1}^n \alpha_k^2\lambda_k$ and $B = \sum_{k=1}^n \alpha_k^2\lambda_k^{-1}$ we have

$\frac{\partial g(\alpha_1,\ldots,\alpha_n,\mu)}{\partial\alpha_k} = 2\alpha_k\big(\lambda_k B + \lambda_k^{-1} A + \mu\big) = 0$

so that either $\alpha_k = 0$, or $\lambda_k$ is a root of the quadratic polynomial $\lambda^2 B + \lambda\mu + A$. In either case at most two of the coefficients $\alpha_k$ are nonzero.
(The argument should be refined in the case of multiple eigenvalues.)

(slide 20/106)

Convergence rate of Steepest Descent iterative scheme / The steepest descent convergence rate

Proof (4/5). STEP 4: problem reformulation. Say $\alpha_i$ and $\alpha_j$ are the only nonzero coefficients; then $\alpha_i^2 + \alpha_j^2 = 1$ and we can write

$h(\alpha_1,\ldots,\alpha_n) = \big(\alpha_i^2\lambda_i + \alpha_j^2\lambda_j\big)\big(\alpha_i^2\lambda_i^{-1} + \alpha_j^2\lambda_j^{-1}\big)$
$\qquad = \alpha_i^4 + \alpha_j^4 + \alpha_i^2\alpha_j^2\Big(\frac{\lambda_i}{\lambda_j} + \frac{\lambda_j}{\lambda_i}\Big)$
$\qquad = \alpha_i^2(1-\alpha_j^2) + \alpha_j^2(1-\alpha_i^2) + \alpha_i^2\alpha_j^2\Big(\frac{\lambda_i}{\lambda_j} + \frac{\lambda_j}{\lambda_i}\Big)$
$\qquad = 1 + \alpha_i^2\alpha_j^2\Big(\frac{\lambda_i}{\lambda_j} + \frac{\lambda_j}{\lambda_i} - 2\Big)$
$\qquad = 1 + \alpha_i^2(1-\alpha_i^2)\,\frac{(\lambda_i-\lambda_j)^2}{\lambda_i\lambda_j}$

(slide 21/106)

Convergence rate of Steepest Descent iterative scheme / The steepest descent convergence rate

Proof (5/5). STEP 5: bounding maxima and minima. Notice that $0 \le \beta(1-\beta) \le 1/4$ for all $\beta \in [0,1]$, hence

$1 \le 1 + \alpha_i^2(1-\alpha_i^2)\,\frac{(\lambda_i-\lambda_j)^2}{\lambda_i\lambda_j} \le 1 + \frac{(\lambda_i-\lambda_j)^2}{4\lambda_i\lambda_j} = \frac{(\lambda_i+\lambda_j)^2}{4\lambda_i\lambda_j}$

To bound $(\lambda_i+\lambda_j)^2/(4\lambda_i\lambda_j)$ consider the function $f(x) = (1+x)^2/x$, which is increasing for $x \ge 1$, so that

$\frac{(\lambda_i+\lambda_j)^2}{4\lambda_i\lambda_j} \le \frac{(M+m)^2}{4Mm}$

and finally

$1 \le h(\alpha_1,\ldots,\alpha_n) \le \frac{(M+m)^2}{4Mm}$

(slide 22/106)

Convergence rate of Steepest Descent iterative scheme / The steepest descent convergence rate

Convergence rate of Steepest Descent. The Kantorovich inequality allows us to prove:

Theorem (Convergence rate of Steepest Descent). Let $A \in \mathbb{R}^{n\times n}$ be an SPD matrix; then the steepest descent method

$x_{k+1} = x_k + \frac{r_k^T r_k}{r_k^T A r_k}\, r_k$

converges to the solution $x^\star = A^{-1}b$ with at least linear q-rate in the norm $\|\cdot\|_A$. Moreover we have the error estimate

$\|x_{k+1} - x^\star\|_A \le \frac{\kappa-1}{\kappa+1}\,\|x_k - x^\star\|_A$

where $\kappa = M/m$ is the condition number, $m = \lambda_1$ is the smallest eigenvalue of $A$ and $M = \lambda_n$ is the biggest eigenvalue of $A$.

(slide 23/106)
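A small numerical sketch (my own example, not part of the slides): it runs steepest descent with the exact line search above on a random SPD system and checks the one-step contraction of the theorem. The matrix, eigenvalue range and tolerances are arbitrary choices.

```python
import numpy as np

def a_norm(v, A):
    return np.sqrt(v @ (A @ v))

rng = np.random.default_rng(0)
n = 50
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.linspace(1.0, 50.0, n)              # eigenvalues: m = 1, M = 50
A = Q @ np.diag(lam) @ Q.T                   # SPD matrix with known spectrum
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)
kappa = lam[-1] / lam[0]
factor = (kappa - 1.0) / (kappa + 1.0)       # contraction factor of the theorem

x = np.zeros(n)
for k in range(200):
    r = b - A @ x
    alpha = (r @ r) / (r @ (A @ r))          # exact line search along the residual
    e_old = a_norm(x_star - x, A)
    x = x + alpha * r
    e_new = a_norm(x_star - x, A)
    assert e_new <= factor * e_old + 1e-12   # one-step contraction in the A-norm
```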

Convergence rate of Steepest Descent iterative scheme / The steepest descent convergence rate

Proof. Remember from slide 16 that

$\|x^\star - x_{k+1}\|_A^2 = \|x^\star - x_k\|_A^2 \left(1 - \frac{(r_k^T r_k)^2}{(r_k^T A^{-1} r_k)(r_k^T A r_k)}\right)$

From the Kantorovich inequality

$1 - \frac{(r_k^T r_k)^2}{(r_k^T A^{-1} r_k)(r_k^T A r_k)} \le 1 - \frac{4Mm}{(M+m)^2} = \frac{(M-m)^2}{(M+m)^2}$

so that

$\|x^\star - x_{k+1}\|_A \le \frac{M-m}{M+m}\,\|x^\star - x_k\|_A$

(slide 24/106)

Convergence rate of Steepest Descent iterative scheme / The steepest descent convergence rate

Remark (One-step convergence). The steepest descent method can converge in one iteration if $\kappa = 1$ or when $r_0 = u_k$, where $u_k$ is an eigenvector of $A$.
1. In the first case ($\kappa = 1$) we have $A = \beta I$ for some $\beta > 0$, so it is not interesting.
2. In the second case we have

$\frac{(u_k^T u_k)^2}{(u_k^T A^{-1} u_k)(u_k^T A u_k)} = \frac{(u_k^T u_k)^2}{\lambda_k^{-1}(u_k^T u_k)\,\lambda_k(u_k^T u_k)} = 1$

In both cases we have $r_1 = 0$, i.e. we have found the solution.

(slide 25/106)

Conjugate direction method / Outline

1. Introduction
2. Convergence rate of Steepest Descent iterative scheme
3. Conjugate direction method
4. Conjugate Gradient method
5. Conjugate Gradient convergence rate
6. Preconditioning the Conjugate Gradient method
7. Nonlinear Conjugate Gradient extension

(slide 26/106)

Conjugate direction method / Conjugate vectors

Definition (Conjugate vectors). Two vectors $p$ and $q$ in $\mathbb{R}^n$ are conjugate with respect to $A$ if they are orthogonal with respect to the scalar product induced by $A$, i.e.

$p^T A q = \sum_{i,j=1}^n A_{ij}\, p_i q_j = 0$

Clearly, $n$ vectors $p_1, p_2, \ldots, p_n \in \mathbb{R}^n$ that are pairwise conjugate with respect to $A$ form a basis of $\mathbb{R}^n$.

(slide 27/106)
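A minimal check of the definition (my own example, not from the slides): the eigenvectors of an SPD matrix are pairwise A-conjugate, so they provide one concrete conjugate basis.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                 # SPD by construction
_, P = np.linalg.eigh(A)                    # columns of P: eigenvectors of A

for i in range(n):
    for j in range(n):
        pAq = P[:, i] @ (A @ P[:, j])
        if i != j:
            assert abs(pAq) < 1e-10         # p_i^T A p_j = 0: conjugate pair
        else:
            assert pAq > 0                  # p_i^T A p_i > 0 since A is SPD
```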

Conjugate direction method / Conjugate vectors

Problem (Linear system). Finding the minimum of $q(x) = \frac12 x^T A x - b^T x + c$ is equivalent to solving the first order necessary condition, i.e. finding $x^\star \in \mathbb{R}^n$ such that $A x^\star = b$.

Observation. Consider $x_0 \in \mathbb{R}^n$ and decompose the error $e_0 = x^\star - x_0$ in the conjugate vectors $p_1, p_2, \ldots, p_n \in \mathbb{R}^n$:

$e_0 = x^\star - x_0 = \sigma_1 p_1 + \sigma_2 p_2 + \cdots + \sigma_n p_n$

Evaluating the coefficients $\sigma_1, \sigma_2, \ldots, \sigma_n \in \mathbb{R}$ is equivalent to solving the problem $A x^\star = b$, because knowing $e_0$ we have $x^\star = x_0 + e_0$.

(slide 28/106)

Conjugate direction method / Conjugate vectors

Observation. Using conjugacy, the coefficients $\sigma_1, \sigma_2, \ldots, \sigma_n \in \mathbb{R}$ can be computed as

$\sigma_i = \frac{p_i^T A e_0}{p_i^T A p_i}, \qquad i = 1, 2, \ldots, n$

In fact, for all $1 \le i \le n$ we have

$p_i^T A e_0 = p_i^T A (\sigma_1 p_1 + \sigma_2 p_2 + \cdots + \sigma_n p_n) = \sigma_1 p_i^T A p_1 + \sigma_2 p_i^T A p_2 + \cdots + \sigma_n p_i^T A p_n = \sigma_i\, p_i^T A p_i$

because $p_i^T A p_j = 0$ for $i \neq j$.

(slide 29/106)
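A sketch of this observation (my own example): since $A e_0 = b - A x_0 = r_0$, the numerators $p_i^T A e_0$ are computable as $p_i^T r_0$ without knowing $x^\star$ (a fact the slides exploit later, see slide 43); summing the terms $\sigma_i p_i$ then reconstructs the solution.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                 # SPD test matrix
b = rng.standard_normal(n)
_, P = np.linalg.eigh(A)                    # eigenvectors: one A-conjugate basis

x0 = rng.standard_normal(n)
r0 = b - A @ x0                             # A e_0 = r_0
x = x0.copy()
for i in range(n):
    p = P[:, i]
    sigma = (p @ r0) / (p @ (A @ p))        # sigma_i = p_i^T A e_0 / p_i^T A p_i
    x += sigma * p                          # accumulate x = x_0 + e_0

assert np.allclose(x, np.linalg.solve(A, b))
```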

Conjugate direction method / Conjugate vectors

The conjugate direction method evaluates the coefficients $\sigma_1, \sigma_2, \ldots, \sigma_n \in \mathbb{R}$ recursively in $n$ steps, solving for $k \ge 0$ the minimization problem:

Conjugate direction method
  Given $x_0$; $k \leftarrow 0$;
  repeat
    $k \leftarrow k + 1$;
    find $x_k \in x_0 + V_k$ such that $x_k = \arg\min_{x \in x_0 + V_k} \|x^\star - x\|_A$
  until $k = n$

where $V_k$ is the subspace of $\mathbb{R}^n$ generated by the first $k$ conjugate directions, i.e. $V_k = \mathrm{span}\{p_1, p_2, \ldots, p_k\}$.

(slide 30/106)

Conjugate direction method / First step

Step $x_0 \to x_1$. At the first step we consider the subspace $x_0 + \mathrm{span}\{p_1\}$, which consists of vectors of the form

$x(\alpha) = x_0 + \alpha p_1, \qquad \alpha \in \mathbb{R}$

The minimization problem becomes:

Minimization step $x_0 \to x_1$. Find $x_1 = x_0 + \alpha_1 p_1$ (i.e., find $\alpha_1$!) such that

$\|x^\star - x_1\|_A = \min_{\alpha \in \mathbb{R}} \|x^\star - (x_0 + \alpha p_1)\|_A$

(slide 31/106)

Conjugate direction method / First step

Solving the first step, method 1. The minimization problem is the minimum with respect to $\alpha$ of the quadratic

$\Phi(\alpha) = \|x^\star - (x_0 + \alpha p_1)\|_A^2 = (x^\star - (x_0 + \alpha p_1))^T A (x^\star - (x_0 + \alpha p_1)) = (e_0 - \alpha p_1)^T A (e_0 - \alpha p_1) = e_0^T A e_0 - 2\alpha\, p_1^T A e_0 + \alpha^2\, p_1^T A p_1$

The minimum is found by imposing

$\frac{d\Phi(\alpha)}{d\alpha} = -2\, p_1^T A e_0 + 2\alpha\, p_1^T A p_1 = 0 \quad\Rightarrow\quad \alpha_1 = \frac{p_1^T A e_0}{p_1^T A p_1}$

(slide 32/106)

Conjugate direction method / First step

Solving the first step, method 2 (1/2). Remember the error expansion

$x^\star - x_0 = \sigma_1 p_1 + \sigma_2 p_2 + \cdots + \sigma_n p_n$

Let $x(\alpha) = x_0 + \alpha p_1$; the difference $x^\star - x(\alpha)$ becomes

$x^\star - x(\alpha) = (\sigma_1 - \alpha) p_1 + \sigma_2 p_2 + \cdots + \sigma_n p_n$

Due to conjugacy the error $\|x^\star - x(\alpha)\|_A$ becomes

$\|x^\star - x(\alpha)\|_A^2 = \Big((\sigma_1-\alpha)p_1 + \sum_{i=2}^n \sigma_i p_i\Big)^T A \Big((\sigma_1-\alpha)p_1 + \sum_{j=2}^n \sigma_j p_j\Big) = (\sigma_1-\alpha)^2\, p_1^T A p_1 + \sum_{j=2}^n \sigma_j^2\, p_j^T A p_j$

(slide 33/106)

Conjugate direction method / First step

Solving the first step, method 2 (2/2). Because

$\|x^\star - x(\alpha)\|_A^2 = (\sigma_1-\alpha)^2 \|p_1\|_A^2 + \sum_{i=2}^n \sigma_i^2 \|p_i\|_A^2$

we have that

$\|x^\star - x(\alpha_1)\|_A^2 = \sum_{i=2}^n \sigma_i^2 \|p_i\|_A^2 \le \|x^\star - x(\alpha)\|_A^2 \qquad \text{for all } \alpha \neq \sigma_1$

so the minimum is found by imposing $\alpha_1 = \sigma_1$:

$\alpha_1 = \frac{p_1^T A e_0}{p_1^T A p_1}$

This argument can be generalized to all $k > 1$ (see the next slides).

(slide 34/106)

Conjugate direction method / k-th step

Step $x_{k-1} \to x_k$. For the step from $k-1$ to $k$ we consider the subspace of $\mathbb{R}^n$

$V_k = \mathrm{span}\{p_1, p_2, \ldots, p_k\}$

which contains vectors of the form

$x(\alpha^{(1)}, \alpha^{(2)}, \ldots, \alpha^{(k)}) = x_0 + \alpha^{(1)} p_1 + \alpha^{(2)} p_2 + \cdots + \alpha^{(k)} p_k$

The minimization problem becomes:

Minimization step $x_{k-1} \to x_k$. Find $x_k = x_0 + \alpha_1 p_1 + \alpha_2 p_2 + \cdots + \alpha_k p_k$ (i.e. $\alpha_1, \alpha_2, \ldots, \alpha_k$) such that

$\|x^\star - x_k\|_A = \min_{\alpha^{(1)},\alpha^{(2)},\ldots,\alpha^{(k)} \in \mathbb{R}} \|x^\star - x(\alpha^{(1)}, \alpha^{(2)}, \ldots, \alpha^{(k)})\|_A$

(slide 35/106)

Conjugate direction method / k-th step

Solving the k-th step: $x_{k-1} \to x_k$ (1/2). Remember the error expansion

$x^\star - x_0 = \sigma_1 p_1 + \sigma_2 p_2 + \cdots + \sigma_n p_n$

Consider a vector of the form $x(\alpha^{(1)}, \ldots, \alpha^{(k)}) = x_0 + \alpha^{(1)} p_1 + \cdots + \alpha^{(k)} p_k$; the error $x^\star - x(\alpha^{(1)}, \ldots, \alpha^{(k)})$ can be written as

$x^\star - x(\alpha^{(1)}, \ldots, \alpha^{(k)}) = x^\star - x_0 - \sum_{i=1}^k \alpha^{(i)} p_i = \sum_{i=1}^k \big(\sigma_i - \alpha^{(i)}\big) p_i + \sum_{i=k+1}^n \sigma_i p_i$

(slide 36/106)

Conjugate direction method / k-th step

Solving the k-th step: $x_{k-1} \to x_k$ (2/2). Using the conjugacy of the $p_i$ we obtain the norm of the error:

$\|x^\star - x(\alpha^{(1)}, \ldots, \alpha^{(k)})\|_A^2 = \sum_{i=1}^k \big(\sigma_i - \alpha^{(i)}\big)^2 \|p_i\|_A^2 + \sum_{i=k+1}^n \sigma_i^2 \|p_i\|_A^2$

So the minimum is found by imposing $\alpha_i = \sigma_i$ for $i = 1, 2, \ldots, k$:

$\alpha_i = \frac{p_i^T A e_0}{p_i^T A p_i}, \qquad i = 1, 2, \ldots, k$

(slide 37/106)

Conjugate direction method / Successive one dimensional minimization

Successive one dimensional minimization (1/3). Notice that $\alpha_i = \sigma_i$ and that

$x_k = x_0 + \alpha_1 p_1 + \cdots + \alpha_k p_k = x_{k-1} + \alpha_k p_k$

so that $x_{k-1}$ already contains the $k-1$ coefficients $\alpha_i$ of the minimization. If we consider the one dimensional minimization on the subspace $x_{k-1} + \mathrm{span}\{p_k\}$ we find again $x_k$!

(slide 38/106)

Conjugate direction method / Successive one dimensional minimization

Successive one dimensional minimization (2/3). Consider a vector of the form $x(\alpha) = x_{k-1} + \alpha p_k$ and remember that $x_{k-1} = x_0 + \alpha_1 p_1 + \cdots + \alpha_{k-1} p_{k-1}$, so that the error $x^\star - x(\alpha)$ can be written as

$x^\star - x(\alpha) = x^\star - x_0 - \sum_{i=1}^{k-1} \alpha_i p_i - \alpha p_k = \sum_{i=1}^{k-1} (\sigma_i - \alpha_i) p_i + (\sigma_k - \alpha) p_k + \sum_{i=k+1}^n \sigma_i p_i$

Due to the equality $\sigma_i = \alpha_i$, the first sum in the expression is $0$.

(slide 39/106)

Conjugate direction method / Successive one dimensional minimization

Successive one dimensional minimization (3/3). Using the conjugacy of the $p_i$ we obtain the norm of the error:

$\|x^\star - x(\alpha)\|_A^2 = (\sigma_k - \alpha)^2 \|p_k\|_A^2 + \sum_{i=k+1}^n \sigma_i^2 \|p_i\|_A^2$

So the minimum is found by imposing $\alpha = \sigma_k$:

$\alpha_k = \frac{p_k^T A e_0}{p_k^T A p_k}$

Remark. This observation permits performing the minimization on the $k$-dimensional space $x_0 + V_k$ as successive one dimensional minimizations along the conjugate directions $p_k$!

(slide 40/106)

Conjugate direction method / Successive one dimensional minimization

Problem (One dimensional successive minimization). Find $x_k = x_{k-1} + \alpha_k p_k$ such that

$\|x^\star - x_k\|_A = \min_{\alpha \in \mathbb{R}} \|x^\star - (x_{k-1} + \alpha p_k)\|_A$

The solution is the minimum with respect to $\alpha$ of the quadratic

$\Phi(\alpha) = (x^\star - (x_{k-1} + \alpha p_k))^T A (x^\star - (x_{k-1} + \alpha p_k)) = (e_{k-1} - \alpha p_k)^T A (e_{k-1} - \alpha p_k) = e_{k-1}^T A e_{k-1} - 2\alpha\, p_k^T A e_{k-1} + \alpha^2\, p_k^T A p_k$

The minimum is found by imposing

$\frac{d\Phi(\alpha)}{d\alpha} = -2\, p_k^T A e_{k-1} + 2\alpha\, p_k^T A p_k = 0 \quad\Rightarrow\quad \alpha_k = \frac{p_k^T A e_{k-1}}{p_k^T A p_k}$

(slide 41/106)

Conjugate direction method / Successive one dimensional minimization

In the case of minimization on the subspace $x_0 + V_k$ we have $\alpha_k = p_k^T A e_0 / p_k^T A p_k$. In the case of one dimensional minimization on the subspace $x_{k-1} + \mathrm{span}\{p_k\}$ we have $\alpha_k = p_k^T A e_{k-1} / p_k^T A p_k$. Apparently these are different results; however, by using the conjugacy of the vectors $p_i$ we have

$p_k^T A e_{k-1} = p_k^T A (x^\star - x_{k-1}) = p_k^T A \big(x^\star - (x_0 + \alpha_1 p_1 + \cdots + \alpha_{k-1} p_{k-1})\big) = p_k^T A e_0 - \alpha_1 p_k^T A p_1 - \cdots - \alpha_{k-1} p_k^T A p_{k-1} = p_k^T A e_0$

(slide 42/106)

Conjugate direction method / Successive one dimensional minimization

The one step minimization in the space $x_0 + V_n$ and the successive minimizations in the spaces $x_{k-1} + \mathrm{span}\{p_k\}$, $k = 1, 2, \ldots, n$, are equivalent if the $p_i$ are conjugate. The successive minimization is useful when the $p_i$ are not known in advance but must be computed as the minimization process proceeds.

The evaluation of $\alpha_k$ is apparently not computable because $e_{k-1}$ is not known. However, noticing that

$A e_k = A(x^\star - x_k) = b - A x_k = r_k$

we can write

$\alpha_k = \frac{p_k^T A e_{k-1}}{p_k^T A p_k} = \frac{p_k^T r_{k-1}}{p_k^T A p_k}$

Finally, the residual satisfies the recurrence

$r_k = b - A x_k = b - A(x_{k-1} + \alpha_k p_k) = r_{k-1} - \alpha_k A p_k$

(slide 43/106)

Conjugate direction method / Conjugate direction minimization

Algorithm (Conjugate direction minimization)
  $k \leftarrow 0$; $x_0$ assigned; $r_0 \leftarrow b - A x_0$;
  while not converged do
    $k \leftarrow k + 1$;
    $\alpha_k \leftarrow p_k^T r_{k-1} / p_k^T A p_k$;
    $x_k \leftarrow x_{k-1} + \alpha_k p_k$;
    $r_k \leftarrow r_{k-1} - \alpha_k A p_k$;
  end while

Observation (Computational cost). The conjugate direction minimization requires at each step one matrix-vector product for the evaluation of $\alpha_k$ and two AXPY updates, for $x_k$ and $r_k$.

(slide 44/106)
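A direct NumPy transcription of the algorithm above (a sketch, assuming the A-conjugate directions are supplied as the columns of a matrix P).

```python
import numpy as np

def conjugate_direction_min(A, b, x0, P):
    """Minimize 1/2 x^T A x - b^T x along the A-conjugate columns of P."""
    x = x0.copy()
    r = b - A @ x                      # r_0
    for k in range(P.shape[1]):
        p = P[:, k]
        Ap = A @ p                     # one matrix-vector product per step
        alpha = (p @ r) / (p @ Ap)     # alpha_k = p_k^T r_{k-1} / p_k^T A p_k
        x = x + alpha * p              # AXPY update of the iterate
        r = r - alpha * Ap             # AXPY update of the residual
    return x
```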

Conjugate direction method / Conjugate direction minimization

Remark (Monotonic behavior of the error). The energy norm of the error $\|e_k\|_A$ is monotonically decreasing in $k$. In fact

$e_k = x^\star - x_k = \sigma_{k+1} p_{k+1} + \cdots + \sigma_n p_n$

and by conjugacy

$\|e_k\|_A^2 = \|x^\star - x_k\|_A^2 = \sigma_{k+1}^2 \|p_{k+1}\|_A^2 + \cdots + \sigma_n^2 \|p_n\|_A^2$

Finally, from this relation we have $e_n = 0$.

(slide 45/106)

Conjugate Gradient method / Outline

1. Introduction
2. Convergence rate of Steepest Descent iterative scheme
3. Conjugate direction method
4. Conjugate Gradient method
5. Conjugate Gradient convergence rate
6. Preconditioning the Conjugate Gradient method
7. Nonlinear Conjugate Gradient extension

(slide 46/106)

Conjugate Gradient method

The Conjugate Gradient method combines the Conjugate Direction method with an orthogonalization process (like Gram-Schmidt) applied to the residuals to construct the conjugate directions. In fact, because $A$ defines a scalar product, in the next slides we prove that:
- each residual is orthogonal to the previous conjugate directions, and consequently linearly independent from the previous conjugate directions;
- if the residual is not null, it can be used to construct a new conjugate direction.

(slide 47/106)

Conjugate Gradient method

Orthogonality of the residual $r_k$ with respect to $V_k$. The residual $r_k$ is orthogonal to $p_1, p_2, \ldots, p_k$. In fact, from the error expansion

$e_k = \alpha_{k+1} p_{k+1} + \alpha_{k+2} p_{k+2} + \cdots + \alpha_n p_n$

and because $r_k = A e_k$, for $i = 1, 2, \ldots, k$ we have

$p_i^T r_k = p_i^T A e_k = p_i^T A \sum_{j=k+1}^n \alpha_j p_j = \sum_{j=k+1}^n \alpha_j\, p_i^T A p_j = 0$

(slide 48/106)

Conjugate Gradient method

Building a new conjugate direction (1/2). The conjugate direction method builds one new direction at each step. If $r_k \neq 0$ it can be used to build the new direction $p_{k+1}$ by a Gram-Schmidt orthogonalization process

$p_{k+1} = r_k + \beta_1^{(k+1)} p_1 + \beta_2^{(k+1)} p_2 + \cdots + \beta_k^{(k+1)} p_k$

where the $k$ coefficients $\beta_1^{(k+1)}, \beta_2^{(k+1)}, \ldots, \beta_k^{(k+1)}$ must satisfy

$p_i^T A p_{k+1} = 0, \qquad i = 1, 2, \ldots, k$

(slide 49/106)

Conjugate Gradient method

Building a new conjugate direction (2/2). Repeating from the previous slide,

$p_{k+1} = r_k + \beta_1^{(k+1)} p_1 + \beta_2^{(k+1)} p_2 + \cdots + \beta_k^{(k+1)} p_k$

and expanding the expression:

$0 = p_i^T A p_{k+1} = p_i^T A \big(r_k + \beta_1^{(k+1)} p_1 + \beta_2^{(k+1)} p_2 + \cdots + \beta_k^{(k+1)} p_k\big) = p_i^T A r_k + \beta_i^{(k+1)}\, p_i^T A p_i$

$\Rightarrow \quad \beta_i^{(k+1)} = -\frac{p_i^T A r_k}{p_i^T A p_i}, \qquad i = 1, 2, \ldots, k$

(slide 50/106)

Conjugate Gradient method

The choice of the residual $r_k \neq 0$ for the construction of the new conjugate direction $p_{k+1}$ has three important consequences:
1. simplification of the expression for $\alpha_k$;
2. orthogonality of the residual $r_k$ to the previous residuals $r_0, r_1, \ldots, r_{k-1}$;
3. a three point formula and simplification of the coefficients $\beta_i^{(k+1)}$.
These facts will be examined in the next slides.

(slide 51/106)

Conjugate Gradient method

Simplification of the expression for $\alpha_k$. Writing the expression for $p_k$ from the orthogonalization process

$p_k = r_{k-1} + \beta_1^{(k)} p_1 + \beta_2^{(k)} p_2 + \cdots + \beta_{k-1}^{(k)} p_{k-1}$

and using the orthogonality of $r_{k-1}$ and the vectors $p_1, p_2, \ldots, p_{k-1}$ (see slide 48) we have

$r_{k-1}^T p_k = r_{k-1}^T \big(r_{k-1} + \beta_1^{(k)} p_1 + \beta_2^{(k)} p_2 + \cdots + \beta_{k-1}^{(k)} p_{k-1}\big) = r_{k-1}^T r_{k-1}$

Recalling the definition of $\alpha_k$ it follows that

$\alpha_k = \frac{e_{k-1}^T A p_k}{p_k^T A p_k} = \frac{r_{k-1}^T p_k}{p_k^T A p_k} = \frac{r_{k-1}^T r_{k-1}}{p_k^T A p_k}$

(slide 52/106)

Conjugate Gradient method

Orthogonality of the residual $r_k$ to $r_0, r_1, \ldots, r_{k-1}$. From the definition of $p_{i+1}$ it follows that

$p_{i+1} = r_i + \beta_1^{(i+1)} p_1 + \beta_2^{(i+1)} p_2 + \cdots + \beta_i^{(i+1)} p_i \quad\Rightarrow\quad r_i \in \mathrm{span}\{p_1, p_2, \ldots, p_i, p_{i+1}\} = V_{i+1}$

Using the orthogonality of $r_k$ and the vectors $p_1, p_2, \ldots, p_k$ (see slide 48), for $i < k$ we have

$r_k^T r_i = r_k^T \Big(p_{i+1} - \sum_{j=1}^i \beta_j^{(i+1)} p_j\Big) = r_k^T p_{i+1} - \sum_{j=1}^i \beta_j^{(i+1)}\, r_k^T p_j = 0$

(slide 53/106)

Conjugate Gradient method

Three point formula and simplification of $\beta_i^{(k+1)}$. From the relation $r_k^T r_i = r_k^T (r_{i-1} - \alpha_i A p_i)$ we deduce

$r_k^T A p_i = \frac{r_k^T r_{i-1} - r_k^T r_i}{\alpha_i} = \begin{cases} -r_k^T r_k / \alpha_k & \text{if } i = k \\ 0 & \text{if } i < k \end{cases}$

Remembering that $\alpha_k = r_{k-1}^T r_{k-1} / p_k^T A p_k$ we obtain

$\beta_i^{(k+1)} = -\frac{p_i^T A r_k}{p_i^T A p_i} = \begin{cases} \dfrac{r_k^T r_k}{r_{k-1}^T r_{k-1}} & i = k \\ 0 & i < k \end{cases}$

i.e. there is only one nonzero coefficient, $\beta_k^{(k+1)}$, so we write $\beta_k = \beta_k^{(k+1)}$ and obtain the three point formula

$p_{k+1} = r_k + \beta_k p_k$

(slide 54/106)

Conjugate Gradient method

Conjugate gradient algorithm
  initial step: $k \leftarrow 0$; $x_0$ assigned; $r_0 \leftarrow b - A x_0$; $p_1 \leftarrow r_0$;
  while $\|r_k\| > \epsilon$ do
    $k \leftarrow k + 1$;
    Conjugate direction method:
      $\alpha_k \leftarrow r_{k-1}^T r_{k-1} / p_k^T A p_k$;
      $x_k \leftarrow x_{k-1} + \alpha_k p_k$;
      $r_k \leftarrow r_{k-1} - \alpha_k A p_k$;
    Residual orthogonalization:
      $\beta_k \leftarrow r_k^T r_k / r_{k-1}^T r_{k-1}$;
      $p_{k+1} \leftarrow r_k + \beta_k p_k$;
  end while

(slide 55/106)
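The same algorithm written as a NumPy sketch; the dense matrix-vector product and the stopping test on $\|r_k\|$ are my own implementation choices.

```python
import numpy as np

def conjugate_gradient(A, b, x0, eps=1e-10, max_iter=None):
    x = x0.copy()
    r = b - A @ x                      # r_0
    p = r.copy()                       # p_1 = r_0
    rho = r @ r                        # r_{k-1}^T r_{k-1}
    max_iter = max_iter or len(b)
    for _ in range(max_iter):
        if np.sqrt(rho) <= eps:
            break
        Ap = A @ p
        alpha = rho / (p @ Ap)         # alpha_k = r^T r / p^T A p
        x = x + alpha * p
        r = r - alpha * Ap
        rho_new = r @ r
        beta = rho_new / rho           # beta_k = r_k^T r_k / r_{k-1}^T r_{k-1}
        p = r + beta * p               # three point formula
        rho = rho_new
    return x
```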

Conjugate Gradient convergence rate / Outline

1. Introduction
2. Convergence rate of Steepest Descent iterative scheme
3. Conjugate direction method
4. Conjugate Gradient method
5. Conjugate Gradient convergence rate
6. Preconditioning the Conjugate Gradient method
7. Nonlinear Conjugate Gradient extension

(slide 56/106)

Conjugate Gradient convergence rate / Polynomial residual expansions

Polynomial residual expansions (1/5). From the Conjugate Gradient iterative scheme on slide 55 we have:

Lemma. There exist polynomials $P_k(x)$ and $Q_k(x)$ of degree $k$ such that

$r_k = P_k(A)\, r_0, \qquad k = 0, 1, \ldots, n$
$p_k = Q_{k-1}(A)\, r_0, \qquad k = 1, 2, \ldots, n$

Moreover $P_k(0) = 1$ for all $k$.

Proof (1/2). The proof is by induction. Base $k = 0$: $p_1 = r_0$, so that $P_0(x) = 1$ and $Q_0(x) = 1$.

(slide 57/106)

Conjugate Gradient convergence rate / Polynomial residual expansions

Polynomial residual expansions (2/5).

Proof (2/2). Let the expansion be valid for $k-1$. Consider the recursion for the residual:

$r_k = r_{k-1} - \alpha_k A p_k = P_{k-1}(A) r_0 - \alpha_k A\, Q_{k-1}(A) r_0 = \big(P_{k-1}(A) - \alpha_k A\, Q_{k-1}(A)\big) r_0$

then $P_k(x) = P_{k-1}(x) - \alpha_k x\, Q_{k-1}(x)$ and $P_k(0) = P_{k-1}(0) = 1$. Consider the recursion for the conjugate direction:

$p_{k+1} = r_k + \beta_k p_k = P_k(A) r_0 + \beta_k Q_{k-1}(A) r_0 = \big(P_k(A) + \beta_k Q_{k-1}(A)\big) r_0$

then $Q_k(x) = P_k(x) + \beta_k Q_{k-1}(x)$.

(slide 58/106)

Conjugate Gradient convergence rate / Polynomial residual expansions

Polynomial residual expansions (3/5). We have the following trivial equalities:

$V_k = \mathrm{span}\{p_1, p_2, \ldots, p_k\} = \mathrm{span}\{r_0, r_1, \ldots, r_{k-1}\} = \{\, q(A) r_0 \;|\; q \in \mathcal{P}_{k-1} \,\} = \{\, p(A) e_0 \;|\; p \in \mathcal{P}_k,\; p(0) = 0 \,\}$

In this way the optimality of the CG step can be written as

$\|x^\star - x_k\|_A \le \|x^\star - x\|_A, \qquad \forall\, x \in x_0 + V_k$
$\|x^\star - x_k\|_A \le \|x^\star - (x_0 + p(A) e_0)\|_A, \qquad \forall\, p \in \mathcal{P}_k,\; p(0) = 0$
$\|x^\star - x_k\|_A \le \|P(A) e_0\|_A, \qquad \forall\, P \in \mathcal{P}_k,\; P(0) = 1$

(slide 59/106)

Conjugate Gradient convergence rate / Polynomial residual expansions

Polynomial residual expansions (4/5). Recalling that $A^{-1} r_k = A^{-1}(b - A x_k) = x^\star - x_k = e_k$, we can write

$e_k = x^\star - x_k = A^{-1} r_k = A^{-1} P_k(A) r_0 = P_k(A) A^{-1} r_0 = P_k(A)(x^\star - x_0) = P_k(A)\, e_0$

Due to the optimality of the conjugate gradient we then have the estimate on the next slide.

(slide 60/106)

Conjugate Gradient convergence rate / Polynomial residual expansions

Polynomial residual expansions (5/5). Using the results of slides 59 and 60 we can write

$e_k = P_k(A)\, e_0$
$\|e_k\|_A = \|P_k(A) e_0\|_A \le \|P(A) e_0\|_A, \qquad \forall\, P \in \mathcal{P}_k,\; P(0) = 1$

and from this we have the estimate

$\|e_k\|_A \le \inf_{P \in \mathcal{P}_k,\, P(0)=1} \|P(A) e_0\|_A$

So an estimate of the form

$\inf_{P \in \mathcal{P}_k,\, P(0)=1} \|P(A) e_0\|_A \le C_k \|e_0\|_A$

can be used to prove a convergence rate theorem, as for the steepest descent algorithm.

(slide 61/106)

Conjugate Gradient convergence rate / Convergence rate calculation

Lemma. Let $A \in \mathbb{R}^{n\times n}$ be an SPD matrix and $p \in \mathcal{P}_k$ a polynomial; then

$\|p(A)\, x\|_A \le \|p(A)\|_2\, \|x\|_A$

Proof (1/2). The matrix $A$ is SPD, so we can write

$A = U^T \Lambda U, \qquad \Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$

where $U$ is an orthogonal matrix (i.e. $U^T U = I$) and $\Lambda \ge 0$ is diagonal. We can define the SPD matrix $A^{1/2}$ as

$A^{1/2} = U^T \Lambda^{1/2} U, \qquad \Lambda^{1/2} = \mathrm{diag}\{\lambda_1^{1/2}, \lambda_2^{1/2}, \ldots, \lambda_n^{1/2}\}$

and obviously $A^{1/2} A^{1/2} = A$.

(slide 62/106)

Conjugate Gradient convergence rate / Convergence rate calculation

Proof (2/2). Notice that

$\|x\|_A^2 = x^T A x = x^T A^{1/2} A^{1/2} x = \|A^{1/2} x\|_2^2$

so that

$\|p(A)\, x\|_A = \|A^{1/2} p(A)\, x\|_2 = \|p(A)\, A^{1/2} x\|_2 \le \|p(A)\|_2\, \|A^{1/2} x\|_2 = \|p(A)\|_2\, \|x\|_A$

(slide 63/106)

Conjugate Gradient convergence rate / Convergence rate calculation

Lemma. Let $A \in \mathbb{R}^{n\times n}$ be an SPD matrix and $p \in \mathcal{P}_k$ a polynomial; then

$\|p(A)\|_2 = \max_{\lambda \in \sigma(A)} |p(\lambda)|$

Proof. The matrix $p(A)$ is symmetric, and for a generic symmetric matrix $B$ we have

$\|B\|_2 = \max_{\lambda \in \sigma(B)} |\lambda|$

Observing that if $\lambda$ is an eigenvalue of $A$ then $p(\lambda)$ is an eigenvalue of $p(A)$, the thesis easily follows.

(slide 64/106)
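A quick numerical check of this lemma (my own example): for an arbitrary polynomial $p$ with $p(0) = 1$ and a random SPD matrix, the spectral norm of $p(A)$ coincides with the maximum of $|p(\lambda)|$ over the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
B = rng.standard_normal((n, n))
A = B @ B.T + np.eye(n)                     # SPD test matrix

p = lambda x: 1.0 - 2.0 * x + x ** 2        # p(x) = (1 - x)^2, an arbitrary choice
pA = np.eye(n) - 2.0 * A + A @ A            # p(A)
lhs = np.linalg.norm(pA, 2)                 # spectral norm of p(A)
rhs = np.max(np.abs(p(np.linalg.eigvalsh(A))))
assert np.isclose(lhs, rhs)
```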

Conjugate Gradient convergence rate / Convergence rate calculation

Starting from the error estimate

$\|e_k\|_A \le \inf_{P \in \mathcal{P}_k,\, P(0)=1} \|P(A) e_0\|_A$

and combining the last two lemmas we easily obtain the estimate

$\|e_k\|_A \le \Big( \inf_{P \in \mathcal{P}_k,\, P(0)=1}\; \max_{\lambda \in \sigma(A)} |P(\lambda)| \Big) \|e_0\|_A$

The convergence rate is estimated by bounding the constant

$\inf_{P \in \mathcal{P}_k,\, P(0)=1}\; \max_{\lambda \in \sigma(A)} |P(\lambda)|$

(slide 65/106)

Conjugate Gradient convergence rate / Finite termination of Conjugate Gradient

Theorem (Finite termination of Conjugate Gradient). Let $A \in \mathbb{R}^{n\times n}$ be an SPD matrix; then the Conjugate Gradient method applied to the linear system $A x = b$ terminates, finding the exact solution, in at most $n$ steps.

Proof. From the estimate

$\|e_k\|_A \le \Big( \inf_{P \in \mathcal{P}_k,\, P(0)=1}\; \max_{\lambda \in \sigma(A)} |P(\lambda)| \Big) \|e_0\|_A$

choosing

$P(x) = \prod_{\lambda \in \sigma(A)} (x - \lambda) \Big/ \prod_{\lambda \in \sigma(A)} (0 - \lambda)$

we have $\max_{\lambda \in \sigma(A)} |P(\lambda)| = 0$ and $\|e_n\|_A = 0$.

(slide 66/106)

Conjugate Gradient convergence rate / Convergence rate of Conjugate Gradient

1. The constant $\inf_{P \in \mathcal{P}_k,\, P(0)=1} \max_{\lambda \in \sigma(A)} |P(\lambda)|$ is not easy to evaluate.
2. The following bound is useful:

$\max_{\lambda \in \sigma(A)} |P(\lambda)| \le \max_{\lambda \in [\lambda_1, \lambda_n]} |P(\lambda)|$

3. In particular, the final estimate will be obtained from

$\inf_{P \in \mathcal{P}_k,\, P(0)=1}\; \max_{\lambda \in \sigma(A)} |P(\lambda)| \le \max_{\lambda \in [\lambda_1, \lambda_n]} |\bar{P}_k(\lambda)|$

where $\bar{P}_k(x)$ is a suitable $k$-degree polynomial for which $\bar{P}_k(0) = 1$ and $\max_{\lambda \in [\lambda_1, \lambda_n]} |\bar{P}_k(\lambda)|$ is easy to evaluate.

(slide 67/106)

Conjugate Gradient convergence rate / Chebyshev Polynomials

Chebyshev Polynomials (1/4).
1. The Chebyshev polynomials of the first kind are the right polynomials for this estimate. They have the following definition on the interval $[-1, 1]$:

$T_k(x) = \cos(k \arccos(x))$

2. Another equivalent definition, valid on $(-\infty, \infty)$, is

$T_k(x) = \frac12 \Big[ \big(x + \sqrt{x^2-1}\big)^k + \big(x - \sqrt{x^2-1}\big)^k \Big]$

3. In spite of these definitions, $T_k(x)$ is effectively a polynomial.

(slide 68/106)

Conjugate Gradient convergence rate / Chebyshev Polynomials

Chebyshev Polynomials (2/4). Some examples of Chebyshev polynomials.

[Figure: graphs of $T_1$, $T_2$, $T_3$, $T_4$, $T_{12}$ and $T_{20}$ on the interval $[-1, 1]$; all of them stay within $[-1, 1]$ there.]

(slide 69/106)

Conjugate Gradient convergence rate / Chebyshev Polynomials

Chebyshev Polynomials (3/4).
1. It is easy to show that $T_k(x)$ is a polynomial by using

$\cos(\alpha + \beta) = \cos\alpha \cos\beta - \sin\alpha \sin\beta$
$\cos(\alpha + \beta) + \cos(\alpha - \beta) = 2 \cos\alpha \cos\beta$

Let $\theta = \arccos(x)$:
  1. $T_0(x) = \cos(0\,\theta) = 1$;
  2. $T_1(x) = \cos(1\,\theta) = x$;
  3. $T_2(x) = \cos(2\theta) = \cos(\theta)^2 - \sin(\theta)^2 = 2\cos(\theta)^2 - 1 = 2x^2 - 1$;
  4. $T_{k+1}(x) + T_{k-1}(x) = \cos((k+1)\theta) + \cos((k-1)\theta) = 2\cos(k\theta)\cos(\theta) = 2x\, T_k(x)$.
2. In general we have the following recurrence:
  1. $T_0(x) = 1$;
  2. $T_1(x) = x$;
  3. $T_{k+1}(x) = 2x\, T_k(x) - T_{k-1}(x)$.

(slide 70/106)

Conjugate Gradient convergence rate / Chebyshev Polynomials

Chebyshev Polynomials (4/4). Solving the recurrence

$T_0(x) = 1, \qquad T_1(x) = x, \qquad T_{k+1}(x) = 2x\, T_k(x) - T_{k-1}(x)$

we obtain the explicit form of the Chebyshev polynomials

$T_k(x) = \frac12 \Big[ \big(x + \sqrt{x^2-1}\big)^k + \big(x - \sqrt{x^2-1}\big)^k \Big]$

The translated and scaled polynomial

$T_k(x; a, b) = T_k\Big(\frac{a + b - 2x}{b - a}\Big)$

is useful in the study of the conjugate gradient method; we have $|T_k(x; a, b)| \le 1$ for all $x \in [a, b]$.

(slide 71/106)
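A sketch of the three-term recurrence and of the translated and scaled polynomial $T_k(x; a, b)$; the degrees and the test interval below are arbitrary choices of mine.

```python
import numpy as np

def cheb_T(k, x):
    """Chebyshev polynomial of the first kind via T_{k+1} = 2x T_k - T_{k-1}."""
    x = np.asarray(x, dtype=float)
    t_prev, t = np.ones_like(x), x.copy()
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t

def cheb_T_shifted(k, x, a, b):
    """T_k(x; a, b) = T_k((a + b - 2x) / (b - a)), bounded by 1 in modulus on [a, b]."""
    return cheb_T(k, (a + b - 2.0 * np.asarray(x, dtype=float)) / (b - a))

ys = np.linspace(-1.0, 1.0, 201)
assert np.allclose(cheb_T(7, ys), np.cos(7 * np.arccos(ys)))      # cosine definition on [-1, 1]

xs = np.linspace(1.0, 10.0, 1001)                                 # plays the role of [m, M]
assert np.max(np.abs(cheb_T_shifted(12, xs, 1.0, 10.0))) <= 1.0 + 1e-12
```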

Conjugate Gradient convergence rate / Convergence rate of Conjugate Gradient method

Theorem (Convergence rate of Conjugate Gradient method). Let $A \in \mathbb{R}^{n\times n}$ be an SPD matrix; then the Conjugate Gradient method converges to the solution $x^\star = A^{-1} b$ with at least linear r-rate in the norm $\|\cdot\|_A$. Moreover we have the error estimate

$\|e_k\|_A \lesssim 2 \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \|e_0\|_A$

where $\kappa = M/m$ is the condition number, $m = \lambda_1$ is the smallest eigenvalue of $A$ and $M = \lambda_n$ is the biggest eigenvalue of $A$. The expression $a_k \lesssim b_k$ means that for all $\epsilon > 0$ there exists $k_0 > 0$ such that

$a_k \le (1 + \epsilon)\, b_k, \qquad \forall\, k > k_0$

(slide 72/106)

Conjugate Gradient convergence rate / Convergence rate of Conjugate Gradient method

Proof. From the estimate

$\|e_k\|_A \le \max_{\lambda \in [m, M]} |P(\lambda)|\; \|e_0\|_A, \qquad \forall\, P \in \mathcal{P}_k,\; P(0) = 1$

choosing $P(x) = T_k(x; m, M) / T_k(0; m, M)$ and using the fact that $|T_k(x; m, M)| \le 1$ for $x \in [m, M]$, we have

$\|e_k\|_A \le T_k(0; m, M)^{-1} \|e_0\|_A = T_k\Big(\frac{M+m}{M-m}\Big)^{-1} \|e_0\|_A$

Observe that $\frac{M+m}{M-m} = \frac{\kappa+1}{\kappa-1}$ and

$T_k\Big(\frac{\kappa+1}{\kappa-1}\Big)^{-1} = 2 \left[ \left(\frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}\right)^k + \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \right]^{-1}$

Finally, notice that $\big(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\big)^k \to 0$ as $k \to \infty$.

(slide 73/106)
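A numerical sketch of the theorem (my own test problem): it runs CG and checks the estimate $\|e_k\|_A \le 2\big((\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)\big)^k \|e_0\|_A$ at every iteration; the small slack added to the right-hand side only guards against rounding.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 80
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.linspace(1.0, 400.0, n)            # m = 1, M = 400
A = Q @ np.diag(lam) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

a_norm = lambda v: np.sqrt(v @ (A @ v))
kappa = lam[-1] / lam[0]
rate = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)

x = np.zeros(n)
r = b - A @ x
p = r.copy()
rho = r @ r
e0 = a_norm(x_star - x)
for k in range(1, n + 1):
    Ap = A @ p
    alpha = rho / (p @ Ap)
    x = x + alpha * p
    r = r - alpha * Ap
    rho_new = r @ r
    p = r + (rho_new / rho) * p
    rho = rho_new
    assert a_norm(x_star - x) <= 2.0 * rate ** k * e0 + 1e-9    # Chebyshev-based bound
```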

Preconditioning the Conjugate Gradient method / Outline

1. Introduction
2. Convergence rate of Steepest Descent iterative scheme
3. Conjugate direction method
4. Conjugate Gradient method
5. Conjugate Gradient convergence rate
6. Preconditioning the Conjugate Gradient method
7. Nonlinear Conjugate Gradient extension

(slide 74/106)

Preconditioning the Conjugate Gradient method / Preconditioning

Problem (Preconditioned linear system). Given $A, P \in \mathbb{R}^{n\times n}$, with $A$ an SPD matrix and $P$ a nonsingular matrix, and $b \in \mathbb{R}^n$, find $x^\star \in \mathbb{R}^n$ such that

$P^{-T} A x^\star = P^{-T} b$

A good choice for $P$ should be such that $M = P^T P \approx A$, where $\approx$ denotes that $M$ is an approximation of $A$ in some sense to be made precise later. Notice that:
- $P$ nonsingular implies $P^{-T}(b - Ax) = 0 \iff b - Ax = 0$;
- $A$ SPD implies that $\tilde{A} = P^{-T} A P^{-1}$ is also SPD (obvious proof).

(slide 75/106)

Preconditioning the Conjugate Gradient method / Preconditioning

Now we reformulate the preconditioned system.

Problem (Preconditioned linear system). Given $A, P \in \mathbb{R}^{n\times n}$, with $A$ an SPD matrix and $P$ a nonsingular matrix, and $b \in \mathbb{R}^n$, the preconditioned problem is the following: find $\tilde{x}^\star \in \mathbb{R}^n$ such that $\tilde{A} \tilde{x}^\star = \tilde{b}$, where

$\tilde{A} = P^{-T} A P^{-1}, \qquad \tilde{b} = P^{-T} b$

Notice that if $x^\star$ is the solution of the linear system $A x = b$, then $\tilde{x}^\star = P x^\star$ is the solution of the linear system $\tilde{A} \tilde{x} = \tilde{b}$.

(slide 76/106)

Preconditioning the Conjugate Gradient method / Preconditioning

PCG: preliminary version
  initial step: $k \leftarrow 0$; $x_0$ assigned; $\tilde{x}_0 \leftarrow P x_0$; $\tilde{r}_0 \leftarrow \tilde{b} - \tilde{A} \tilde{x}_0$; $\tilde{p}_1 \leftarrow \tilde{r}_0$;
  while $\|\tilde{r}_k\| > \epsilon$ do
    $k \leftarrow k + 1$;
    Conjugate direction method:
      $\tilde{\alpha}_k \leftarrow \tilde{r}_{k-1}^T \tilde{r}_{k-1} / \tilde{p}_k^T \tilde{A} \tilde{p}_k$;
      $\tilde{x}_k \leftarrow \tilde{x}_{k-1} + \tilde{\alpha}_k \tilde{p}_k$;
      $\tilde{r}_k \leftarrow \tilde{r}_{k-1} - \tilde{\alpha}_k \tilde{A} \tilde{p}_k$;
    Residual orthogonalization:
      $\tilde{\beta}_k \leftarrow \tilde{r}_k^T \tilde{r}_k / \tilde{r}_{k-1}^T \tilde{r}_{k-1}$;
      $\tilde{p}_{k+1} \leftarrow \tilde{r}_k + \tilde{\beta}_k \tilde{p}_k$;
  end while
  final step: $x_k \leftarrow P^{-1} \tilde{x}_k$;

(slide 77/106)

Preconditioning the Conjugate Gradient method / CG reformulation

The conjugate gradient algorithm applied to $\tilde{A} \tilde{x} = \tilde{b}$ requires the evaluation of quantities like

$\tilde{A} \tilde{p}_k = P^{-T} A P^{-1} \tilde{p}_k$

This can be done without evaluating the matrix $\tilde{A}$ directly, by the following operations:
1. solve $P s'_k = \tilde{p}_k$ for $s'_k = P^{-1} \tilde{p}_k$;
2. evaluate $s''_k = A s'_k$;
3. solve $P^T s'''_k = s''_k$ for $s'''_k = P^{-T} s''_k$.
Steps 1 and 3 require the solution of two auxiliary linear systems. This is not a big problem if $P$ and $P^T$ are triangular matrices (see e.g. incomplete Cholesky). A small sketch of these three steps follows below.

(slide 78/106)
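A sketch of steps 1-3 with a triangular $P$ (assumptions of mine: SciPy is available and $P$ plays the role of an incomplete-Cholesky factor; here $P$ is the exact Cholesky factor of $A$, which makes $\tilde{A}$ the identity and gives an easy check).

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(5)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)
P = cholesky(A, lower=False)                # A = P^T P, with P upper triangular

def apply_A_tilde(p_tilde):
    s1 = solve_triangular(P, p_tilde, lower=False)        # step 1: solve P s' = p~
    s2 = A @ s1                                            # step 2: s'' = A s'
    return solve_triangular(P.T, s2, lower=True)           # step 3: solve P^T s''' = s''

p = rng.standard_normal(n)
assert np.allclose(apply_A_tilde(p), p)     # with the exact Cholesky factor, A-tilde = I
```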

Preconditioning the Conjugate Gradient method / CG reformulation

However... we can reformulate the algorithm using only the matrices $A$ and $P$!

Definition. For all $k \ge 1$ we introduce the vector $q_k = P^{-1} \tilde{p}_k$.

Observation. If the vectors $\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_k$ for all $1 \le k \le n$ are $\tilde{A}$-conjugate, then the corresponding vectors $q_1, q_2, \ldots, q_k$ are $A$-conjugate. In fact,

$q_j^T A q_i = \tilde{p}_j^T P^{-T} A P^{-1} \tilde{p}_i = \tilde{p}_j^T \tilde{A} \tilde{p}_i = 0 \qquad \text{if } i \neq j$

which is a consequence of the $\tilde{A}$-conjugacy of the vectors $\tilde{p}_i$.

(slide 79/106)

Preconditioning the Conjugate Gradient method / CG reformulation

Definition. For all $k \ge 1$ we introduce the vectors $x_k = x_{k-1} + \tilde{\alpha}_k q_k$.

Observation. If we assume, by construction, $\tilde{x}_0 = P x_0$, then we have $\tilde{x}_k = P x_k$ for all $k$ with $1 \le k \le n$. In fact, if $\tilde{x}_{k-1} = P x_{k-1}$ (inductive hypothesis), then

$\tilde{x}_k = \tilde{x}_{k-1} + \tilde{\alpha}_k \tilde{p}_k$    [preconditioned CG]
$\quad\;\; = P x_{k-1} + \tilde{\alpha}_k P q_k$    [inductive hyp., def. of $q_k$]
$\quad\;\; = P (x_{k-1} + \tilde{\alpha}_k q_k)$    [obvious]
$\quad\;\; = P x_k$    [def. of $x_k$]

(slide 80/106)

Preconditioning the Conjugate Gradient method / CG reformulation

Observation. Because $\tilde{x}_k = P x_k$ for all $k \ge 0$, we have the following relation between the corresponding residuals $\tilde{r}_k = \tilde{b} - \tilde{A} \tilde{x}_k$ and $r_k = b - A x_k$:

$\tilde{r}_k = P^{-T} r_k$

In fact,

$\tilde{r}_k = \tilde{b} - \tilde{A} \tilde{x}_k$    [def. of $\tilde{r}_k$]
$\quad\;\; = P^{-T} b - P^{-T} A P^{-1} P x_k$    [defs. of $\tilde{b}$, $\tilde{A}$, $\tilde{x}_k$]
$\quad\;\; = P^{-T} (b - A x_k)$    [obvious]
$\quad\;\; = P^{-T} r_k$    [def. of $r_k$]

(slide 81/106)

Preconditioning the Conjugate Gradient method / CG reformulation

Definition. For all $k$ with $1 \le k \le n$, the vector $z_k$ is the solution of the linear system

$M z_k = r_k, \qquad \text{where } M = P^T P$

Formally, $z_k = M^{-1} r_k = P^{-1} P^{-T} r_k$.

Using the vectors $\{z_k\}$:
- we can express $\tilde{\alpha}_k$ and $\tilde{\beta}_k$ in terms of $A$, the residual $r_k$ and the conjugate direction $q_k$;
- we can build a recurrence relation for the $A$-conjugate directions $q_k$.

(slide 82/106)

Preconditioning the Conjugate Gradient method / CG reformulation

Observation.

$\tilde{\alpha}_k = \frac{\tilde{r}_{k-1}^T \tilde{r}_{k-1}}{\tilde{p}_k^T \tilde{A} \tilde{p}_k} = \frac{r_{k-1}^T P^{-1} P^{-T} r_{k-1}}{q_k^T P^T\, P^{-T} A P^{-1}\, P q_k} = \frac{r_{k-1}^T M^{-1} r_{k-1}}{q_k^T A q_k} = \frac{r_{k-1}^T z_{k-1}}{q_k^T A q_k}$

Observation.

$\tilde{\beta}_k = \frac{\tilde{r}_k^T \tilde{r}_k}{\tilde{r}_{k-1}^T \tilde{r}_{k-1}} = \frac{r_k^T P^{-1} P^{-T} r_k}{r_{k-1}^T P^{-1} P^{-T} r_{k-1}} = \frac{r_k^T M^{-1} r_k}{r_{k-1}^T M^{-1} r_{k-1}} = \frac{r_k^T z_k}{r_{k-1}^T z_{k-1}}$

(slide 83/106)

Preconditioning the Conjugate Gradient method / CG reformulation

Observation. Using the vector $z_k = M^{-1} r_k$, the following recurrence holds:

$q_{k+1} = z_k + \tilde{\beta}_k q_k$

In fact:

$\tilde{p}_{k+1} = \tilde{r}_k + \tilde{\beta}_k \tilde{p}_k$    [preconditioned CG]
$P^{-1} \tilde{p}_{k+1} = P^{-1} \tilde{r}_k + \tilde{\beta}_k P^{-1} \tilde{p}_k$    [left multiplication by $P^{-1}$]
$P^{-1} \tilde{p}_{k+1} = P^{-1} P^{-T} r_k + \tilde{\beta}_k P^{-1} \tilde{p}_k$    [$\tilde{r}_k = P^{-T} r_k$]
$P^{-1} \tilde{p}_{k+1} = M^{-1} r_k + \tilde{\beta}_k P^{-1} \tilde{p}_k$    [$M^{-1} = P^{-1} P^{-T}$]
$q_{k+1} = z_k + \tilde{\beta}_k q_k$    [$q_k = P^{-1} \tilde{p}_k$]

(slide 84/106)

Preconditioning the Conjugate Gradient method / CG reformulation

PCG: final version
  initial step: $k \leftarrow 0$; $x_0$ assigned; $r_0 \leftarrow b - A x_0$; $z_0 \leftarrow M^{-1} r_0$; $q_1 \leftarrow z_0$;
  while $\|z_k\| > \epsilon$ do
    $k \leftarrow k + 1$;
    Conjugate direction method:
      $\tilde{\alpha}_k \leftarrow r_{k-1}^T z_{k-1} / q_k^T A q_k$;
      $x_k \leftarrow x_{k-1} + \tilde{\alpha}_k q_k$;
      $r_k \leftarrow r_{k-1} - \tilde{\alpha}_k A q_k$;
    Preconditioning:
      $z_k \leftarrow M^{-1} r_k$;
    Residual orthogonalization:
      $\tilde{\beta}_k \leftarrow r_k^T z_k / r_{k-1}^T z_{k-1}$;
      $q_{k+1} \leftarrow z_k + \tilde{\beta}_k q_k$;
  end while

(slide 85/106)
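The final PCG algorithm as a NumPy sketch. The preconditioner here is the Jacobi (diagonal) choice $M = \mathrm{diag}(A)$, which is only one simple instance of $M = P^T P \approx A$; solving $M z = r$ then reduces to a componentwise division.

```python
import numpy as np

def pcg(A, b, x0, M_diag, eps=1e-10, max_iter=None):
    x = x0.copy()
    r = b - A @ x
    z = r / M_diag                     # z_0 = M^{-1} r_0 (diagonal preconditioner)
    q = z.copy()                       # q_1 = z_0
    rho = r @ z                        # r_{k-1}^T z_{k-1}
    max_iter = max_iter or len(b)
    for _ in range(max_iter):
        if np.linalg.norm(z) <= eps:
            break
        Aq = A @ q
        alpha = rho / (q @ Aq)         # alpha_k = r^T z / q^T A q
        x = x + alpha * q
        r = r - alpha * Aq
        z = r / M_diag                 # preconditioning step: solve M z = r
        rho_new = r @ z
        q = z + (rho_new / rho) * q    # q_{k+1} = z_k + beta_k q_k
        rho = rho_new
    return x
```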

Nonlinear Conjugate Gradient extension / Outline

1. Introduction
2. Convergence rate of Steepest Descent iterative scheme
3. Conjugate direction method
4. Conjugate Gradient method
5. Conjugate Gradient convergence rate
6. Preconditioning the Conjugate Gradient method
7. Nonlinear Conjugate Gradient extension

(slide 86/106)

Nonlinear Conjugate Gradient extension

1. The conjugate gradient algorithm can be extended to nonlinear minimization.
2. Fletcher and Reeves extended CG to the minimization of a general nonlinear function $f(x)$ as follows:
  1. substitute the evaluation of $\alpha_k$ by a line search;
  2. substitute the residual $r_k$ with the gradient $\nabla f(x_k)$.
3. We also translate the index of the search direction $p_k$ to be more consistent with the gradients.
The resulting algorithm is on the next slide.

(slide 87/106)

Nonlinear Conjugate Gradient extension / Fletcher and Reeves

Fletcher and Reeves Nonlinear Conjugate Gradient
  initial step: $k \leftarrow 0$; $x_0$ assigned; $f_0 \leftarrow f(x_0)$; $g_0 \leftarrow \nabla f(x_0)^T$; $p_0 \leftarrow -g_0$;
  while $\|g_k\| > \epsilon$ do
    $k \leftarrow k + 1$;
    Conjugate direction method:
      compute $\alpha_k$ by line search;
      $x_k \leftarrow x_{k-1} + \alpha_k p_{k-1}$;
      $g_k \leftarrow \nabla f(x_k)^T$;
    Residual orthogonalization:
      $\beta_k^{FR} \leftarrow g_k^T g_k / g_{k-1}^T g_{k-1}$;
      $p_k \leftarrow -g_k + \beta_k^{FR} p_{k-1}$;
  end while

(slide 88/106)
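A sketch of the Fletcher-Reeves iteration above. The backtracking line search enforces only the sufficient-decrease condition, not the full strong Wolfe conditions discussed on the next slide, so a steepest-descent restart is added as a safeguard; both simplifications and the Rosenbrock test function are my own choices.

```python
import numpy as np

def fletcher_reeves(f, grad, x0, eps=1e-6, max_iter=5000):
    x = np.asarray(x0, dtype=float).copy()
    g = grad(x)
    p = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        if g @ p >= 0:                       # safeguard (mine): restart on a non-descent direction
            p = -g
        slope = g @ p
        alpha = 1.0
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * slope and alpha > 1e-16:
            alpha *= 0.5                     # Armijo backtracking
        x = x + alpha * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves coefficient
        p = -g_new + beta * p
        g = g_new
    return x

rosenbrock = lambda x: (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2
rosen_grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
                                 200 * (x[1] - x[0] ** 2)])
print(fletcher_reeves(rosenbrock, rosen_grad, [-1.2, 1.0]))      # the minimizer is (1, 1)
```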

Nonlinear Conjugate Gradient extension / Fletcher and Reeves

1. To ensure convergence and apply the Zoutendijk global convergence theorem we need to ensure that $p_k$ is a descent direction.
2. $p_0$ is a descent direction by construction; for $p_k$ we have

$g_k^T p_k = -\|g_k\|^2 + \beta_k^{FR}\, g_k^T p_{k-1}$

If the line search is exact then $g_k^T p_{k-1} = 0$, because $p_{k-1}$ is the direction of the line search. So by induction $p_k$ is a descent direction.
3. Exact line search is expensive; however, if we use an inexact line search with the strong Wolfe conditions
  1. sufficient decrease: $f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)\, p_k$;
  2. curvature condition: $|\nabla f(x_k + \alpha_k p_k)\, p_k| \le c_2\, |\nabla f(x_k)\, p_k|$;
with $0 < c_1 < c_2 < 1/2$, then we can prove that $p_k$ is a descent direction. (A small checker for these two conditions is sketched below.)

(slide 89/106)
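A small helper of my own (not from the slides) that simply tests whether a given step length $\alpha_k$ satisfies the two strong Wolfe conditions written above, with default constants $c_1 = 10^{-4}$ and $c_2 = 0.4 < 1/2$.

```python
import numpy as np

def satisfies_strong_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.4):
    """Return True if alpha satisfies both strong Wolfe conditions at x along p."""
    f0 = f(x)
    g0p = grad(x) @ p                                    # directional derivative at x
    sufficient_decrease = f(x + alpha * p) <= f0 + c1 * alpha * g0p
    curvature = abs(grad(x + alpha * p) @ p) <= c2 * abs(g0p)
    return sufficient_decrease and curvature
```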

Nonlinear Conjugate Gradient extension / Convergence analysis

The previous considerations permit us to say that the Fletcher and Reeves nonlinear conjugate gradient method with strong Wolfe line search is globally convergent (*). To prove global convergence we need the following lemma:

Lemma (Descent direction bound). Suppose we apply the Fletcher and Reeves nonlinear conjugate gradient method to $f(x)$ with a strong Wolfe line search with $0 < c_2 < 1/2$. Then the method generates descent directions $p_k$ that satisfy the following inequality:

$-\frac{1}{1-c_2} \le \frac{g_k^T p_k}{\|g_k\|^2} \le -\frac{1-2c_2}{1-c_2}, \qquad k = 0, 1, 2, \ldots$

(*) Globally here means that a Zoutendijk-like theorem applies.

(slide 90/106)

Nonlinear Conjugate Gradient extension / Convergence analysis

Proof (1/3). The proof is by induction. First notice that the function

$t(\xi) = \frac{2\xi - 1}{1 - \xi}$

is monotonically increasing on the interval $[0, 1/2]$ and that $t(0) = -1$ and $t(1/2) = 0$. Hence, because $c_2 \in (0, 1/2)$, we have

$-1 < \frac{2c_2 - 1}{1 - c_2} < 0 \qquad (\star)$

Base of induction, $k = 0$: for $k = 0$ we have $p_0 = -g_0$, so that $g_0^T p_0 / \|g_0\|^2 = -1$. From $(\star)$ the lemma inequality is trivially satisfied.

(slide 91/106)

Nonlinear Conjugate Gradient extension / Convergence analysis

Proof (2/3). Using the update direction formula of the algorithm

$p_k = -g_k + \beta_k^{FR} p_{k-1}, \qquad \beta_k^{FR} = \frac{g_k^T g_k}{g_{k-1}^T g_{k-1}}$

we can write

$\frac{g_k^T p_k}{\|g_k\|^2} = -1 + \beta_k^{FR}\, \frac{g_k^T p_{k-1}}{\|g_k\|^2} = -1 + \frac{g_k^T p_{k-1}}{\|g_{k-1}\|^2}$

and by using the second strong Wolfe condition:

$-1 + c_2\, \frac{g_{k-1}^T p_{k-1}}{\|g_{k-1}\|^2} \le \frac{g_k^T p_k}{\|g_k\|^2} \le -1 - c_2\, \frac{g_{k-1}^T p_{k-1}}{\|g_{k-1}\|^2}$

(slide 92/106)

Nonlinear Conjugate Gradient extension / Convergence analysis

Proof (3/3). By induction we have

$\frac{1}{1-c_2} \ge -\frac{g_{k-1}^T p_{k-1}}{\|g_{k-1}\|^2} > 0$

so that

$\frac{g_k^T p_k}{\|g_k\|^2} \le -1 - c_2\, \frac{g_{k-1}^T p_{k-1}}{\|g_{k-1}\|^2} \le -1 + \frac{c_2}{1-c_2} = \frac{2c_2 - 1}{1 - c_2}$

and

$\frac{g_k^T p_k}{\|g_k\|^2} \ge -1 + c_2\, \frac{g_{k-1}^T p_{k-1}}{\|g_{k-1}\|^2} \ge -1 - \frac{c_2}{1-c_2} = -\frac{1}{1-c_2}$

(slide 93/106)

Nonlinear Conjugate Gradient extension / Convergence analysis

1. The inequality of the previous lemma can be written as

$\frac{1}{1-c_2}\, \frac{\|g_k\|}{\|p_k\|} \ge -\frac{g_k^T p_k}{\|g_k\|\, \|p_k\|} \ge \frac{1-2c_2}{1-c_2}\, \frac{\|g_k\|}{\|p_k\|} > 0$

2. Remembering the Zoutendijk theorem we have

$\sum_{k=1}^{\infty} (\cos\theta_k)^2 \|g_k\|^2 < \infty, \qquad \text{where } \cos\theta_k = -\frac{g_k^T p_k}{\|g_k\|\, \|p_k\|}$

3. So if $\|g_k\| / \|p_k\|$ is bounded from below, we have $\cos\theta_k \ge \delta$ for all $k$, and then from the Zoutendijk theorem the scheme converges.
4. Unfortunately this bound cannot be proved, so the Zoutendijk theorem cannot be applied directly. However, it is possible to prove a weaker result, i.e. that $\liminf_{k\to\infty} \|g_k\| = 0$!

(slide 94/106)

Nonlinear Conjugate Gradient extension / Convergence analysis

Convergence of the Fletcher and Reeves method.

Assumption (Regularity assumption). We assume $f \in C^1(\mathbb{R}^n)$ with Lipschitz continuous gradient, i.e. there exists $\gamma > 0$ such that

$\|\nabla f(x)^T - \nabla f(y)^T\| \le \gamma\, \|x - y\|, \qquad \forall\, x, y \in \mathbb{R}^n$

(slide 95/106)

Nonlinear Conjugate Gradient extension / Convergence analysis

Theorem (Convergence of the Fletcher and Reeves method). Suppose the method of Fletcher and Reeves is implemented with a strong Wolfe line search with $0 < c_1 < c_2 < 1/2$. If $f(x)$ and $x_0$ satisfy the previous regularity assumptions, then

$\liminf_{k\to\infty} \|g_k\| = 0$

Proof (1/4). From the previous lemma we have

$\cos\theta_k \ge \frac{1-2c_2}{1-c_2}\, \frac{\|g_k\|}{\|p_k\|}, \qquad k = 1, 2, \ldots$

Substituting in the Zoutendijk condition we have

$\sum_{k=1}^{\infty} \frac{\|g_k\|^4}{\|p_k\|^2} < \infty$

The proof is by contradiction: if the theorem is not true, then this series diverges. Next we want to bound $\|p_k\|$.

(slide 96/106)

Nonlinear Conjugate Gradient extension / Convergence analysis

Proof (bounding $\|p_k\|$) (2/4). Using the second Wolfe condition and the previous lemma,

$|g_k^T p_{k-1}| \le -c_2\, g_{k-1}^T p_{k-1} \le \frac{c_2}{1-c_2}\, \|g_{k-1}\|^2$

Using $p_k \leftarrow -g_k + \beta_k^{FR} p_{k-1}$ we have

$\|p_k\|^2 \le \|g_k\|^2 + 2\beta_k^{FR} |g_k^T p_{k-1}| + (\beta_k^{FR})^2 \|p_{k-1}\|^2 \le \|g_k\|^2 + \frac{2c_2}{1-c_2}\, \beta_k^{FR} \|g_{k-1}\|^2 + (\beta_k^{FR})^2 \|p_{k-1}\|^2$

Then, recalling that $\beta_k^{FR} = \|g_k\|^2 / \|g_{k-1}\|^2$,

$\|p_k\|^2 \le \frac{1+c_2}{1-c_2}\, \|g_k\|^2 + (\beta_k^{FR})^2 \|p_{k-1}\|^2$

(slide 97/106)

Nonlinear Conjugate Gradient extension / Convergence analysis

Proof (bounding $\|p_k\|$) (3/4). Setting $c_3 = \frac{1+c_2}{1-c_2}$ and using the last inequality repeatedly we obtain:

$\|p_k\|^2 \le c_3 \|g_k\|^2 + (\beta_k^{FR})^2 \big[\, c_3 \|g_{k-1}\|^2 + (\beta_{k-1}^{FR})^2 \|p_{k-2}\|^2 \,\big]$
$\qquad = c_3 \|g_k\|^4 \big[\, \|g_k\|^{-2} + \|g_{k-1}\|^{-2} \,\big] + \frac{\|g_k\|^4}{\|g_{k-2}\|^4}\, \|p_{k-2}\|^2$
$\qquad \le c_3 \|g_k\|^4 \big[\, \|g_k\|^{-2} + \|g_{k-1}\|^{-2} + \|g_{k-2}\|^{-2} \,\big] + \frac{\|g_k\|^4}{\|g_{k-3}\|^4}\, \|p_{k-3}\|^2$
$\qquad \le \cdots \le c_3 \|g_k\|^4 \sum_{j=1}^{k} \|g_j\|^{-2}$

(slide 98/106)

Nonlinear Conjugate Gradient extension / Convergence analysis

Proof (4/4). Suppose now, by contradiction, that there exists $\delta > 0$ such that $\|g_k\| \ge \delta$ (*). Using this in the bound for $\|p_k\|^2$ we have

$\|p_k\|^2 \le c_3 \|g_k\|^4 \sum_{j=1}^{k} \|g_j\|^{-2} \le c_3 \|g_k\|^4\, \delta^{-2} k$

Substituting in the Zoutendijk condition we get

$\infty > \sum_{k=1}^{\infty} \frac{\|g_k\|^4}{\|p_k\|^2} \ge \frac{\delta^2}{c_3} \sum_{k=1}^{\infty} \frac{1}{k} = \infty$

which contradicts the assumption.

(*) The correct assumption is that there exists $k_0$ such that $\|g_k\| \ge \delta$ for $k \ge k_0$, but this complicates the inequality a little without introducing new ideas.

(slide 99/106)

Nonlinear Conjugate Gradient extension / Weakness of the Fletcher and Reeves method

Suppose that $p_k$ is a bad search direction, i.e. $\cos\theta_k \approx 0$.
- From the descent direction bound lemma (see slide 90) we have

$\frac{1}{1-c_2}\, \frac{\|g_k\|}{\|p_k\|} \ge \cos\theta_k \ge \frac{1-2c_2}{1-c_2}\, \frac{\|g_k\|}{\|p_k\|} > 0$

so that to have $\cos\theta_k \approx 0$ we need $\|p_k\| \gg \|g_k\|$.
- Since $p_k$ is a bad direction, nearly orthogonal to $g_k$, it is likely that the step is small and $x_{k+1} \approx x_k$. If so, we also have $g_{k+1} \approx g_k$ and $\beta_{k+1}^{FR} \approx 1$.
- But remember that $p_{k+1} \leftarrow -g_{k+1} + \beta_{k+1}^{FR} p_k$, so that $p_{k+1} \approx p_k$.
This means that a long sequence of unproductive iterates will follow.

(slide 100/106)
