MATH 4211/6211 – Optimization
Quasi-Newton Method

Xiaojing Ye
Department of Mathematics & Statistics
Georgia State University
Quasi-Newton Method

Motivation: Approximate the inverse Hessian $(\nabla^2 f(x^{(k)}))^{-1}$ in Newton's method by some matrix $H_k$:

$$x^{(k+1)} = x^{(k)} - \alpha_k H_k g^{(k)}$$

Here $g^{(k)} = \nabla f(x^{(k)})$ is the gradient at $x^{(k)}$. That is, the search direction is set to $d^{(k)} = -H_k g^{(k)}$.

Based on $H_k$, $x^{(k)}$, and $g^{(k)}$, the quasi-Newton method generates the next $H_{k+1}$, and so on.
Proposition. If $f \in C^1$, $g^{(k)} \neq 0$, and $H_k \succ 0$, then $d^{(k)} = -H_k g^{(k)}$ is a descent direction.

Proof. Let $x^{(k+1)} = x^{(k)} - \alpha H_k g^{(k)}$ for some $\alpha > 0$. By Taylor's expansion,

$$f(x^{(k+1)}) = f(x^{(k)}) - \alpha\, g^{(k)\top} H_k g^{(k)} + o(\|H_k g^{(k)}\| \alpha) < f(x^{(k)})$$

for $\alpha$ sufficiently small, since $g^{(k)\top} H_k g^{(k)} > 0$ when $H_k \succ 0$ and $g^{(k)} \neq 0$.
Recall that for quadratic functions with $Q \succ 0$, the Hessian is $H^{(k)} = Q$ for all $k$, and

$$g^{(k+1)} - g^{(k)} = Q (x^{(k+1)} - x^{(k)})$$

For notational simplicity, we denote

$$\Delta x^{(k)} = x^{(k+1)} - x^{(k)} \quad \text{and} \quad \Delta g^{(k)} = g^{(k+1)} - g^{(k)}$$

Then we can write the identity above as

$$\Delta g^{(k)} = Q \Delta x^{(k)}$$

or equivalently

$$Q^{-1} \Delta g^{(k)} = \Delta x^{(k)}$$
In the quasi-Newton method, $H_k$ takes the place of $Q^{-1}$:

Newton: $x^{(k+1)} = x^{(k)} - \alpha_k Q^{-1} g^{(k)}$
Quasi-Newton: $x^{(k+1)} = x^{(k)} - \alpha_k H_k g^{(k)}$

Therefore we would like to have a sequence of $H_k$ with the same property as $Q^{-1}$:

$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k$$

for all $k = 0, 1, 2, \dots$.
If this is true, then at iteration $n$ we have

$$H_n \Delta g^{(0)} = \Delta x^{(0)}, \quad H_n \Delta g^{(1)} = \Delta x^{(1)}, \quad \dots, \quad H_n \Delta g^{(n-1)} = \Delta x^{(n-1)}$$

or

$$H_n [\Delta g^{(0)}, \dots, \Delta g^{(n-1)}] = [\Delta x^{(0)}, \dots, \Delta x^{(n-1)}].$$

On the other hand,

$$Q^{-1} [\Delta g^{(0)}, \dots, \Delta g^{(n-1)}] = [\Delta x^{(0)}, \dots, \Delta x^{(n-1)}].$$

If $[\Delta g^{(0)}, \dots, \Delta g^{(n-1)}]$ is invertible, then $H_n = Q^{-1}$.

Then at iteration $n + 1$ we have

$$x^{(n+1)} = x^{(n)} - \alpha_n H_n g^{(n)} = x^*$$

since this is the same as Newton's update.

Hence for quadratic functions, the quasi-Newton method converges in at most $n$ steps.
Quasi-Newton method

$$d^{(k)} = -H_k g^{(k)}$$
$$\alpha_k = \arg\min_{\alpha \ge 0} f(x^{(k)} + \alpha d^{(k)})$$
$$x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)}$$

where $H_0, H_1, \dots$ are symmetric.

Moreover, for quadratic functions of the form $f(x) = \frac{1}{2} x^\top Q x - b^\top x$, the matrices $H_0, H_1, \dots$ are required to satisfy

$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k$$
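For the quadratic objective above, the exact line search has a closed form, which may be convenient when implementing the method. The following is a short derivation; the symbol $\phi$ is introduced here only for illustration and is not part of the slides. Write $\phi(\alpha) = f(x^{(k)} + \alpha d^{(k)})$. Since $\nabla f(x^{(k)} + \alpha d^{(k)}) = g^{(k)} + \alpha Q d^{(k)}$,

$$\phi'(\alpha) = g^{(k)\top} d^{(k)} + \alpha\, d^{(k)\top} Q d^{(k)} = 0 \quad \Longrightarrow \quad \alpha_k = -\frac{g^{(k)\top} d^{(k)}}{d^{(k)\top} Q d^{(k)}} = \frac{g^{(k)\top} H_k g^{(k)}}{g^{(k)\top} H_k Q H_k g^{(k)}}$$

The last equality uses $d^{(k)} = -H_k g^{(k)}$ and the symmetry of $H_k$. When $H_k \succ 0$ and $g^{(k)} \neq 0$, this gives $\alpha_k > 0$.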
Theorem. Consider a quasi-Newton algorithm applied to a quadratic function with symmetric $Q \succ 0$, such that for all $k = 0, 1, \dots, n - 1$,

$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k$$

and the $H_k$ are all symmetric. If $\alpha_i \neq 0$ for $0 \le i \le k$, then $d^{(0)}, \dots, d^{(k+1)}$ are $Q$-conjugate.
Proof. We prove by induction. It is trivial to show $g^{(1)\top} d^{(0)} = 0$ (a consequence of the exact line search), which gives the base case by the same computation as below.

Assume the claim holds for some $k < n - 1$. Using $d^{(i)} = \Delta x^{(i)} / \alpha_i$, we have for $i \le k$ that

$$\begin{aligned}
d^{(k+1)\top} Q d^{(i)} &= -(H_{k+1} g^{(k+1)})^\top Q d^{(i)} \\
&= -g^{(k+1)\top} H_{k+1} \frac{Q \Delta x^{(i)}}{\alpha_i} \\
&= -g^{(k+1)\top} H_{k+1} \frac{\Delta g^{(i)}}{\alpha_i} \\
&= -g^{(k+1)\top} \frac{\Delta x^{(i)}}{\alpha_i} \\
&= -g^{(k+1)\top} d^{(i)}
\end{aligned}$$

Since $d^{(0)}, \dots, d^{(k)}$ are $Q$-conjugate, we know $g^{(k+1)\top} d^{(i)} = 0$ for all $i \le k$. Hence $d^{(0)}, \dots, d^{(k)}, d^{(k+1)}$ are $Q$-conjugate.

By induction the claim holds.
The theorem above also shows that the quasi-Newton method is a conjugate direction method, and hence converges in $n$ steps for quadratic objective functions.

In practice, there are various ways to generate $H_k$ such that

$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k$$

We now learn three algorithms that produce such $H_k$.
Rank one correction formula

Suppose we would like to update $H_k$ to $H_{k+1}$ by adding a rank one matrix $a_k z^{(k)} z^{(k)\top}$ for some $a_k \in \mathbb{R}$ and $z^{(k)} \in \mathbb{R}^n$:

$$H_{k+1} = H_k + a_k z^{(k)} z^{(k)\top}$$

Now let us derive what this $a_k z^{(k)} z^{(k)\top}$ should be.

Since we need $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$ for $i \le k$, we at least need $H_{k+1} \Delta g^{(k)} = \Delta x^{(k)}$. That is,

$$\begin{aligned}
\Delta x^{(k)} &= H_{k+1} \Delta g^{(k)} \\
&= (H_k + a_k z^{(k)} z^{(k)\top}) \Delta g^{(k)} \\
&= H_k \Delta g^{(k)} + a_k (z^{(k)\top} \Delta g^{(k)}) z^{(k)}
\end{aligned}$$
Therefore

$$z^{(k)} = \frac{\Delta x^{(k)} - H_k \Delta g^{(k)}}{a_k (z^{(k)\top} \Delta g^{(k)})}$$

and hence

$$H_{k+1} = H_k + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^\top}{a_k (z^{(k)\top} \Delta g^{(k)})^2}$$

On the other hand, multiplying both sides of $\Delta x^{(k)} - H_k \Delta g^{(k)} = a_k (z^{(k)\top} \Delta g^{(k)}) z^{(k)}$ on the left by $\Delta g^{(k)\top}$, we obtain

$$\Delta g^{(k)\top} (\Delta x^{(k)} - H_k \Delta g^{(k)}) = a_k (z^{(k)\top} \Delta g^{(k)})^2.$$

Hence

$$H_{k+1} = H_k + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^\top}{\Delta g^{(k)\top} (\Delta x^{(k)} - H_k \Delta g^{(k)})}$$

This is the rank one correction formula.
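The rank one correction formula can be tried on a small quadratic. The following is a minimal sketch in plain Python, not the author's implementation: the helper names (`matvec`, `dot`, `rank_one_update`), the test problem ($Q$, $b$, $x^{(0)}$, $H_0 = I$), and the skip-when-tiny safeguard on the denominator are all assumptions made for illustration. The step size uses the exact line search for a quadratic, $\alpha_k = -g^{(k)\top} d^{(k)} / (d^{(k)\top} Q d^{(k)})$.

```python
# Illustrative sketch; helper names and the test problem are hypothetical.

def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_one_update(H, dx, dg, tol=1e-12):
    """H_{k+1} = H_k + u u^T / (dg^T u), where u = dx - H_k dg."""
    u = [a - b for a, b in zip(dx, matvec(H, dg))]
    denom = dot(dg, u)
    # Assumed safeguard (not in the slides): keep H when the denominator is tiny.
    if abs(denom) <= tol * (dot(dg, dg) * dot(u, u)) ** 0.5:
        return H
    n = len(u)
    return [[H[i][j] + u[i] * u[j] / denom for j in range(n)] for i in range(n)]

# f(x) = 0.5 x^T Q x - b^T x with Q symmetric positive definite, n = 2.
Q = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = [0.0, 0.0]
H = [[1.0, 0.0], [0.0, 1.0]]  # H_0 = I

for k in range(2):  # n = 2 iterations
    g = [q - bi for q, bi in zip(matvec(Q, x), b)]   # gradient Qx - b
    d = [-h for h in matvec(H, g)]                   # d = -H g
    alpha = -dot(g, d) / dot(d, matvec(Q, d))        # exact line search
    x_new = [xi + alpha * di for xi, di in zip(x, d)]
    dx = [a - c for a, c in zip(x_new, x)]
    dg = matvec(Q, dx)                               # equals g_new - g for a quadratic
    H = rank_one_update(H, dx, dg)
    x = x_new

x_star = [1.0 / 11.0, 7.0 / 11.0]                    # solves Q x = b
Q_inv = [[3.0 / 11.0, -1.0 / 11.0], [-1.0 / 11.0, 4.0 / 11.0]]
```

On this example, after two iterations `x` agrees with `x_star` and `H` agrees with `Q_inv` up to rounding, consistent with the $n$-step property for quadratics.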
We obtained the formula by requiring $H_{k+1} \Delta g^{(k)} = \Delta x^{(k)}$. However, we also need $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$ for $i < k$. This turns out to be true automatically:

Theorem. For the rank one algorithm applied to quadratic functions with symmetric Hessian $Q$, we have

$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k$$

for $k = 0, 1, \dots, n - 1$.
Proof. We have shown $H_{k+1} \Delta g^{(k)} = \Delta x^{(k)}$ for all $k = 0, 1, 2, \dots$.

Assume the identities hold up to $k$, that is, $H_k \Delta g^{(i)} = \Delta x^{(i)}$ for $i < k$; we use induction to show they hold for $k + 1$. We only need to show $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$ for $i < k$:

$$\begin{aligned}
H_{k+1} \Delta g^{(i)} &= \left( H_k + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^\top}{\Delta g^{(k)\top} (\Delta x^{(k)} - H_k \Delta g^{(k)})} \right) \Delta g^{(i)} \\
&= \Delta x^{(i)} + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^\top \Delta g^{(i)}}{\Delta g^{(k)\top} (\Delta x^{(k)} - H_k \Delta g^{(k)})}
\end{aligned}$$

Note that

$$(H_k \Delta g^{(k)})^\top \Delta g^{(i)} = \Delta g^{(k)\top} H_k \Delta g^{(i)} = \Delta g^{(k)\top} \Delta x^{(i)} = \Delta x^{(k)\top} Q \Delta x^{(i)} = \Delta x^{(k)\top} \Delta g^{(i)}$$

so $(\Delta x^{(k)} - H_k \Delta g^{(k)})^\top \Delta g^{(i)} = 0$. Hence the second term on the right is zero, and we obtain

$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$$

This completes the proof.
Issues with the rank one correction formula:

• $H_{k+1}$ may not be positive definite even if $H_k$ is. Hence $-H_k g^{(k)}$ may not be a descent direction;

• the denominator in the rank one correction is $\Delta g^{(k)\top} (\Delta x^{(k)} - H_k \Delta g^{(k)})$, which can be close to 0 and make the computation unstable.
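The first issue can be seen on a hand-picked 2×2 example. The sketch below is illustrative only: the numbers are chosen for this note (they do not come from the slides), and `Q_example` is a hypothetical positive definite Hessian included to show that the pair $(\Delta x, \Delta g)$ is consistent with a convex quadratic.

```python
# Illustrative sketch; all numbers and names are hypothetical.

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_one_update(H, dx, dg):
    """H_{k+1} = H_k + u u^T / (dg^T u), where u = dx - H_k dg."""
    u = [a - b for a, b in zip(dx, matvec(H, dg))]
    denom = dot(dg, u)
    n = len(u)
    return [[H[i][j] + u[i] * u[j] / denom for j in range(n)] for i in range(n)]

H = [[1.0, 0.0], [0.0, 1.0]]      # H_k = I is positive definite
dx = [0.5, 1.0]
dg = [1.0, 0.0]

# A positive definite Q with Q dx = dg, so the pair comes from a convex quadratic.
Q_example = [[4.0, -1.0], [-1.0, 0.5]]
secant_ok = matvec(Q_example, dx) == dg

H_new = rank_one_update(H, dx, dg)
det_new = H_new[0][0] * H_new[1][1] - H_new[0][1] * H_new[1][0]

# With gradient g = (0, 1), the direction d = -H_new g is not a descent direction.
g = [0.0, 1.0]
d = [-h for h in matvec(H_new, g)]
slope = dot(g, d)                 # positive slope means f increases along d
```

Here `H_new` still satisfies $H_{k+1} \Delta g^{(k)} = \Delta x^{(k)}$, but its determinant is negative, so it is indefinite, and the slope $g^\top d$ is positive for the chosen $g$.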
We now study the DFP algorithm, which improves on the rank one correction formula by ensuring positive definiteness of $H_k$.

DFP algorithm [Davidon 1959, Fletcher and Powell 1963]

$$H_{k+1} = H_k + \frac{\Delta x^{(k)} \Delta x^{(k)\top}}{\Delta x^{(k)\top} \Delta g^{(k)}} - \frac{(H_k \Delta g^{(k)})(H_k \Delta g^{(k)})^\top}{\Delta g^{(k)\top} H_k \Delta g^{(k)}}$$
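The DFP update can be sketched the same way. This is a minimal plain-Python illustration, not a reference implementation: the helper names, the 2×2 test problem, and the `is_pd_2x2` check (valid only for symmetric 2×2 matrices) are assumptions made for this example. The step size is the exact line search for a quadratic, $\alpha_k = -g^{(k)\top} d^{(k)} / (d^{(k)\top} Q d^{(k)})$.

```python
# Illustrative sketch; helper names and the test problem are hypothetical.

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dfp_update(H, dx, dg):
    """H_{k+1} = H_k + dx dx^T / (dx^T dg) - (H dg)(H dg)^T / (dg^T H dg)."""
    Hdg = matvec(H, dg)
    dx_dg = dot(dx, dg)
    dg_H_dg = dot(dg, Hdg)
    n = len(dx)
    return [[H[i][j] + dx[i] * dx[j] / dx_dg - Hdg[i] * Hdg[j] / dg_H_dg
             for j in range(n)] for i in range(n)]

def is_pd_2x2(H):
    """Positive definiteness test for a symmetric 2x2 matrix (leading minors)."""
    return H[0][0] > 0 and H[0][0] * H[1][1] - H[0][1] * H[1][0] > 0

# f(x) = 0.5 x^T Q x - b^T x with Q symmetric positive definite, n = 2.
Q = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = [0.0, 0.0]
H = [[1.0, 0.0], [0.0, 1.0]]      # H_0 = I
history = [H]                      # H_0, H_1, H_2

for k in range(2):                 # n = 2 iterations
    g = [q - bi for q, bi in zip(matvec(Q, x), b)]   # gradient Qx - b
    d = [-h for h in matvec(H, g)]                   # d = -H g
    alpha = -dot(g, d) / dot(d, matvec(Q, d))        # exact line search
    x_new = [xi + alpha * di for xi, di in zip(x, d)]
    dx = [a - c for a, c in zip(x_new, x)]
    dg = matvec(Q, dx)                               # equals g_new - g for a quadratic
    H = dfp_update(H, dx, dg)
    history.append(H)
    x = x_new

x_star = [1.0 / 11.0, 7.0 / 11.0]                    # solves Q x = b
Q_inv = [[3.0 / 11.0, -1.0 / 11.0], [-1.0 / 11.0, 4.0 / 11.0]]
```

On this example every matrix in `history` passes the positive definiteness check, and after two iterations `x` agrees with `x_star` and `H` with `Q_inv` up to rounding.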
We first show that DFP is a quasi-Newton method.

Theorem. The DFP algorithm applied to quadratic functions satisfies

$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k$$

for all $k$.
Proof. We prove this by induction. It is trivial for $k = 0$.

Assume the claim is true for $k$, i.e., $H_k \Delta g^{(i)} = \Delta x^{(i)}$ for all $i \le k - 1$. First, $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$ holds for $i = k$ by direct computation.

For $i < k$, we have

$$H_{k+1} \Delta g^{(i)} = H_k \Delta g^{(i)} + \frac{\Delta x^{(k)} (\Delta x^{(k)\top} \Delta g^{(i)})}{\Delta x^{(k)\top} \Delta g^{(k)}} - \frac{(H_k \Delta g^{(k)})(H_k \Delta g^{(k)})^\top \Delta g^{(i)}}{\Delta g^{(k)\top} H_k \Delta g^{(k)}}$$

Note that, by the induction assumption, $d^{(0)}, \dots, d^{(k)}$ are $Q$-conjugate, and hence

$$\Delta x^{(k)\top} \Delta g^{(i)} = \Delta x^{(k)\top} Q \Delta x^{(i)} = \alpha_k \alpha_i\, d^{(k)\top} Q d^{(i)} = 0$$

Similarly,

$$\Delta g^{(k)\top} H_k \Delta g^{(i)} = \Delta g^{(k)\top} \Delta x^{(i)} = 0.$$

Both correction terms therefore vanish, so $H_{k+1} \Delta g^{(i)} = H_k \Delta g^{(i)} = \Delta x^{(i)}$. This completes the proof.
Next we show that $H_{k+1}$ inherits positive definiteness from $H_k$ in the DFP algorithm.

Theorem. Suppose $g^{(k)} \neq 0$. Then $H_k \succ 0$ implies $H_{k+1} \succ 0$ in DFP.

Proof. For any $x \in \mathbb{R}^n$, we have

$$x^\top H_{k+1} x = x^\top H_k x + \frac{(x^\top \Delta x^{(k)})^2}{\Delta x^{(k)\top} \Delta g^{(k)}} - \frac{(x^\top H_k \Delta g^{(k)})^2}{\Delta g^{(k)\top} H_k \Delta g^{(k)}}$$

For notational simplicity, we denote

$$a = H_k^{1/2} x \quad \text{and} \quad b = H_k^{1/2} \Delta g^{(k)}$$

where $H_k = H_k^{1/2} H_k^{1/2}$ (we know $H_k^{1/2}$ exists since $H_k$ is SPD).