Computational Optimization – Quasi-Newton Methods (2/22, NW Chapter 8)
Theorem 3.4  Suppose f is twice continuously differentiable and the steepest descent iterates converge to x* satisfying the second-order sufficient conditions (SOSC). Let
$r \in \left( \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1},\, 1 \right) = \left( \frac{\rho - 1}{\rho + 1},\, 1 \right)$,
where $0 < \lambda_1 \le \dots \le \lambda_n$ are the eigenvalues of $\nabla^2 f(x^*)$ and $\rho = \lambda_n / \lambda_1$. Then for all k sufficiently large,
$f(x_{k+1}) - f(x^*) \le r^2 \left[ f(x_k) - f(x^*) \right]$.
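As a quick illustration (not from the slides), plugging a hypothetical condition number ρ = 100 into this bound shows how slow steepest descent can become on ill-conditioned problems:

% Hypothetical values chosen only to illustrate the bound above.
\[
  r \approx \frac{\rho - 1}{\rho + 1} = \frac{99}{101} \approx 0.980,
  \qquad r^2 \approx 0.961,
\]
so each iteration removes only about $4\%$ of the remaining gap $f(x_k) - f(x^*)$, and roughly $58$ iterations are needed to reduce it by a factor of $10$ (since $0.961^{58} \approx 0.1$).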
Choosing better directions
Steepest descent – simple and cheap per iteration, but can converge very slowly if the problem is badly conditioned.
Modified Newton's method – expensive per iteration, but converges quickly.
Goal – first-order methods with Newton-like behavior.
Scaled Steepest Descent
Pick a positive definite scaling matrix $D_k$ (an approximation to the inverse of the Hessian) and iterate
$x_{k+1} = x_k - \alpha_k D_k \nabla f(x_k)$.
Let $S_k = D_k^{1/2}$ and do the change of variables $x = S y$. The problem becomes
$\min_y\ h(y) = f(S y)$.
Scaled Steepest Descent (continued)
Steepest descent in the y variables:
$y_{k+1} = y_k - \alpha_k \nabla h(y_k) = y_k - \alpha_k S \nabla f(S y_k)$.
Multiply through by S:
$S y_{k+1} = S y_k - \alpha_k S S \nabla f(S y_k)$
$S y_{k+1} = S y_k - \alpha_k D \nabla f(S y_k)$
$x_{k+1} = x_k - \alpha_k D \nabla f(x_k)$.
Thus the convergence rate of steepest descent applies in the y space. For the quadratic $f(x) = \tfrac12 x' Q x$,
$h(y) = f(S y) = \tfrac12\, y' S Q S\, y$.
Scaled Steepest Descent (continued)
The convergence rate is governed by the eigenvalues of $SQS$:
$\lambda_1$ = smallest eigenvalue of $SQS$, $\lambda_n$ = largest eigenvalue of $SQS$.
Choose $S$ close to $Q^{-1/2}$ to make $\lambda_n / \lambda_1$ close to 1.
(Note that $Q^{-1/2}\, Q\, Q^{-1/2} = I$.)
Cheap Newton Approximation
Use just the diagonal of the Hessian:
$S_k = D_k = \mathrm{diag}\!\left( \left(\frac{\partial^2 f}{\partial x_1^2}\right)^{-1},\ \left(\frac{\partial^2 f}{\partial x_2^2}\right)^{-1},\ \left(\frac{\partial^2 f}{\partial x_3^2}\right)^{-1} \right)$  (shown here for n = 3).
Linear storage and computation, and the inverse is trivial. Limited effectiveness.
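A minimal sketch (not from the slides) of this diagonally scaled steepest descent; the function name, step size, and test quadratic are illustrative assumptions:

import numpy as np

def diag_scaled_descent(grad, hess_diag, x0, alpha=1.0, tol=1e-8, max_iter=500):
    """Steepest descent scaled by the inverse of the Hessian diagonal:
    x_{k+1} = x_k - alpha * D_k * grad f(x_k),  D_k = diag(1 / hess_diag(x_k))."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        D = 1.0 / hess_diag(x)          # inverse of the diagonal, stored as a vector
        x = x - alpha * D * g           # elementwise product = D_k @ grad
    return x, k

# Illustrative badly scaled quadratic: f(x) = 1/2 x'Qx with Q = diag(1, 100).
Q = np.diag([1.0, 100.0])
x_min, iters = diag_scaled_descent(
    grad=lambda x: Q @ x,
    hess_diag=lambda x: np.diag(Q),     # exact diagonal here; an approximation in general
    x0=np.array([1.0, 1.0]),
)
print(iters, x_min)                     # one scaled step reaches the minimizer for this diagonal Q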
Quasi-Newton Methods
Newton's method solves
$\nabla^2 f(x_k)\, p_k = -\nabla f(x_k)$.
Instead, substitute an approximation $B_k$ and solve
$B_k\, p_k = -\nabla f(x_k)$
to get Newton-like directions.
Quasi-Newton Methods (continued)
Better yet – estimate the Newton inverse $H_k = B_k^{-1}$ directly, so that
$p_k = -B_k^{-1} \nabla f(x_k) = -H_k \nabla f(x_k)$.
1-dimensional case
In 1-d we might estimate the second derivative as (change in derivative) / (change in x):
$f''(x_k) \approx \dfrac{f'(x_k) - f'(x_{k-1})}{x_k - x_{k-1}}$.
Using this in Newton's method gives the secant method:
$x_{k+1} = x_k - f'(x_k)\, \dfrac{x_k - x_{k-1}}{f'(x_k) - f'(x_{k-1})}$.
[Figure: f' with the secant line through x_{k-1} and x_k crossing zero at x_{k+1}.]
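A minimal sketch (not from the slides) of this secant iteration applied to f'; the test function and starting points are illustrative assumptions:

def secant(fprime, x_prev, x_curr, tol=1e-10, max_iter=50):
    """Secant iteration on f': x_{k+1} = x_k - f'(x_k)*(x_k - x_{k-1})/(f'(x_k) - f'(x_{k-1}))."""
    g_prev = fprime(x_prev)
    for _ in range(max_iter):
        g_curr = fprime(x_curr)
        if abs(g_curr) < tol:
            break
        x_next = x_curr - g_curr * (x_curr - x_prev) / (g_curr - g_prev)
        x_prev, g_prev = x_curr, g_curr
        x_curr = x_next
    return x_curr

# Minimize f(x) = x^4 - 3x^2 + x by finding a root of f'(x) = 4x^3 - 6x + 1.
print(secant(lambda x: 4 * x**3 - 6 * x + 1, x_prev=1.0, x_curr=2.0))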
1-d convergence
The secant method converges superlinearly with rate
$r = \tfrac12\left(1 + \sqrt{5}\right)$  (the "golden ratio" again!).
But the secant method only applies in 1-d.
Secant Condition
The 1-d condition
$f''(x_k)\,(x_k - x_{k-1}) = f'(x_k) - f'(x_{k-1})$
generalizes to
$\nabla^2 f(x_k)\,(x_k - x_{k-1}) = \nabla f(x_k) - \nabla f(x_{k-1})$.
So we want
$B_k\,(x_k - x_{k-1}) = \nabla f(x_k) - \nabla f(x_{k-1})$.
Another way to think about it
Approximate f by the quadratic model
$m_k(p) = f(x_k) + \nabla f(x_k)'\, p + \tfrac12\, p' B_k\, p$.
Its gradient at $p = 0$ equals the gradient at the current iterate:
$\nabla m_k(0) = \nabla f(x_k)$.
We also want its gradient to match the gradient at the old iterate:
$\nabla m_k(x_{k-1} - x_k) = \nabla m_k(-\alpha_{k-1} p_{k-1}) = \nabla f(x_k) - \alpha_{k-1} B_k p_{k-1} = \nabla f(x_{k-1})$.
So
$\alpha_{k-1} B_k p_{k-1} = \nabla f(x_k) - \nabla f(x_{k-1})$.
Quadratic Case
For $\min\ \tfrac12 x'Qx - b'x$:
$\nabla f(x_k) - \nabla f(x_{k-1}) = (Qx_k - b) - (Qx_{k-1} - b) = Q\,(x_k - x_{k-1})$.
So $B_k$ should act like $Q$ along the direction $s_k = x_k - x_{k-1}$.
Let $y_k = \nabla f(x_k) - \nabla f(x_{k-1})$.
The quasi-Newton condition becomes
$B_{k+1}\, s_k = y_k$.
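A small numerical check (not from the slides) of the identity above; Q, b, and the two points are illustrative assumptions:

import numpy as np

# Illustrative quadratic f(x) = 1/2 x'Qx - b'x with an assumed Q and b.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: Q @ x - b

x_prev = np.array([0.0, 0.0])
x_curr = np.array([1.0, -1.0])
s = x_curr - x_prev                  # s_k = x_k - x_{k-1}
y = grad(x_curr) - grad(x_prev)      # y_k = grad f(x_k) - grad f(x_{k-1})

print(np.allclose(y, Q @ s))         # True: gradient differences reveal Q along s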
Choice of B
At each step we get information about Q along the direction $x_k - x_{k-1}$.
Use it to update our estimate of Q.
There are many ways to do this and still satisfy the quasi-Newton condition.
BFGS Update
Update by adding two (outer product) matrices:
$B_{k+1} = B_k + \alpha\, a a' + \beta\, b b'$.
We need
$B_{k+1} s_k = B_k s_k + \alpha\, a a' s_k + \beta\, b b' s_k = y_k$  (from the quasi-Newton condition).
So we make
$\alpha\, a a' s_k = y_k$ and $\beta\, b b' s_k = -B_k s_k$.
BFGS Update
To make $\beta\, b b' s_k = -B_k s_k$, define $b = B_k s_k$.
Then $\beta\, b b' s_k = \beta\, (B_k s_k)(B_k s_k)' s_k = \beta\, (s_k' B_k s_k)\, B_k s_k$,
so pick $\beta = -\dfrac{1}{s_k' B_k s_k}$.
BFGS Update
To make $\alpha\, a a' s_k = y_k$, define $a = y_k$.
Then $\alpha\, a a' s_k = \alpha\, (y_k' s_k)\, y_k$,
so pick $\alpha = \dfrac{1}{y_k' s_k}$.
BFGS Update
The final update is
$B_{k+1} = B_k - \dfrac{(B_k s_k)(B_k s_k)'}{s_k' B_k s_k} + \dfrac{y_k y_k'}{y_k' s_k}$.
This is the BFGS family update, named for Broyden, Fletcher, Goldfarb, and Shanno.
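A direct NumPy transcription of this update (my sketch, not from the slides); the random test data are an assumption, used only to confirm the quasi-Newton condition $B_{k+1} s_k = y_k$:

import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation:
    B_{k+1} = B - (Bs)(Bs)'/(s'Bs) + yy'/(y's).  Requires y's > 0."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Quick check of the quasi-Newton condition B_{k+1} s = y on random data.
rng = np.random.default_rng(0)
B = np.eye(3)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)
assert y @ s > 0                        # curvature condition holds for this data
B_new = bfgs_update(B, s, y)
print(np.allclose(B_new @ s, y))        # True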
Key Ideas
This is called a rank-two update since it adds two rank-one matrices.
We want $B_k$ to be symmetric and positive definite.
We want to solve $B_k p_k = -\nabla f(x_k)$ efficiently.
Two possible ways.
Descent directions
Need $B$ to be positive definite. A necessary condition is the curvature condition:
$B_{k+1} s_k = y_k \;\Rightarrow\; s_k' B_{k+1} s_k = s_k' y_k > 0$.
Enforce this for general functions by using the Wolfe or strong Wolfe conditions in the line search.
Wolfe Conditions
For $0 < c_1 < c_2 < 1$:
$f(x_{k+1}) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)' p_k$
$\nabla f(x_{k+1})' p_k \ge c_2 \nabla f(x_k)' p_k$.
The second condition implies
$\nabla f(x_{k+1})' s_k \ge c_2 \nabla f(x_k)' s_k$, so
$y_k' s_k = \left(\nabla f(x_{k+1}) - \nabla f(x_k)\right)' s_k \ge (c_2 - 1)\,\nabla f(x_k)' s_k = (c_2 - 1)\,\alpha_k \nabla f(x_k)' p_k > 0$,
since $c_2 < 1$ and $p_k$ is a descent direction.
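A small checker (my sketch, not from the slides) for the two Wolfe conditions; the default c1, c2 values and the test quadratic are assumptions:

import numpy as np

def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Return True if step length alpha satisfies both Wolfe conditions
    for the descent direction p at the point x."""
    g0 = grad(x) @ p                    # directional derivative, must be < 0
    sufficient_decrease = f(x + alpha * p) <= f(x) + c1 * alpha * g0
    curvature = grad(x + alpha * p) @ p >= c2 * g0
    return sufficient_decrease and curvature

# Illustrative quadratic f(x) = 1/2 x'Qx, steepest descent direction from x0.
Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
x0 = np.array([1.0, 1.0])
p = -grad(x0)
print([alpha for alpha in (0.01, 0.1, 0.5) if satisfies_wolfe(f, grad, x0, p, alpha)])
# -> [0.1]: 0.01 fails the curvature condition, 0.5 fails sufficient decrease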
Guaranteeing B p.d. and symmetric
Lemma 11.5 in Nash and Sofer: if $B_k$ is symmetric and positive definite, then $B_{k+1}$ is positive definite if and only if $y_k' s_k > 0$.
So enforce this condition in the line search procedure using the Wolfe conditions:
$\left[\nabla f(x_k) - \nabla f(x_{k-1})\right]' \left[x_k - x_{k-1}\right] > 0$.
Quasi-Newton Algorithm with BFGS update
Start with x_0 and B_0 (e.g., B_0 = I).
For k = 0, 1, ..., K:
  - If x_k is optimal, stop.
  - Solve $B_k p_k = -\nabla f(x_k)$ using a modified Cholesky factorization.
  - Perform a line search satisfying the Wolfe conditions: $x_{k+1} = x_k + \alpha_k p_k$.
  - Update $s_k = x_{k+1} - x_k$, $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$, and
    $B_{k+1} = B_k - \dfrac{(B_k s_k)(B_k s_k)'}{s_k' B_k s_k} + \dfrac{y_k y_k'}{y_k' s_k}$.
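A compact sketch of this loop (my illustration, not the authors' code); it uses a plain NumPy solve and SciPy's Wolfe line search in place of the modified Cholesky factorization described above, and the tolerance and fallback step are assumptions:

import numpy as np
from scipy.optimize import line_search            # Wolfe-condition line search

def bfgs_minimize(f, grad, x0, tol=1e-6, max_iter=100):
    """Quasi-Newton minimization with the BFGS update of B_k (sketch only)."""
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(B, -g)                # B_k p_k = -grad f(x_k)
        alpha = line_search(f, grad, x, p)[0]     # step satisfying the Wolfe conditions
        if alpha is None:                         # line search failed; crude fallback
            alpha = 1e-3
        x_new = x + alpha * p
        s, y = x_new - x, grad(x_new) - g
        if y @ s > 1e-12:                         # curvature condition; skip update otherwise
            Bs = B @ s
            B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
        x = x_new
    return x

# Illustrative use on the Rosenbrock function.
from scipy.optimize import rosen, rosen_der
print(bfgs_minimize(rosen, rosen_der, np.array([-1.2, 1.0])))   # approximately [1, 1]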
Add Wolfe Condition to Line Search
The Wolfe condition is an approximation to the optimality condition of the exact line search
$\min_\alpha\ f(x_k + \alpha p_k) = g(\alpha)$.
Optimality condition: $g'(\alpha) = p_k' \nabla f(x_k + \alpha p_k) = 0$.
We want $|p_k' \nabla f(x_k + \alpha p_k)| \le \eta\, |p_k' \nabla f(x_k)|$ for some $0 < \eta < 1$.
Used together with the Armijo search condition.
Theorem 8.5 – global convergence
Assume we start with a symmetric positive definite B_0, f is twice continuously differentiable, the level set of x_0 is convex, and the eigenvalues of the Hessian on that level set are bounded and strictly positive.
Then BFGS converges to the minimizer of f.
Theorem 8.6
Assume BFGS converges to x* and the Hessian is Lipschitz continuous in a neighborhood of x*.
Then the quasi-Newton BFGS algorithm converges superlinearly.
Easy update of the Cholesky factorization
We have $B_k = LL'$ and want $B_{k+1} = \hat L \hat L'$. No need to refactorize the whole matrix each time; we only factor a much simpler matrix.
$B_{k+1} = B_k - \dfrac{(B_k s_k)(B_k s_k)'}{s_k' B_k s_k} + \dfrac{y_k y_k'}{y_k' s_k} = LL' - \dfrac{(LL' s_k)(LL' s_k)'}{s_k' L L' s_k} + \dfrac{y_k y_k'}{y_k' s_k}$
$\phantom{B_{k+1}} = L\left( I - \dfrac{\hat s \hat s'}{\hat s' \hat s} + \dfrac{\hat y \hat y'}{y_k' s_k} \right) L'$, where $\hat s = L' s_k$ and $L \hat y = y_k$
$\phantom{B_{k+1}} = L \tilde L \tilde L' L'$, where $\tilde L$ is the Cholesky factor of the inner matrix, so $\hat L = L \tilde L$.  Cost: O(n²).
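A short numerical check (my sketch) of the factored form above; it uses a full Cholesky of the small inner matrix rather than the specialized O(n²) rank-two factor update, and the test data are assumptions:

import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
B = A @ A.T + n * np.eye(n)               # assumed s.p.d. current approximation B_k
s = rng.standard_normal(n)
y = B @ s + 0.1 * rng.standard_normal(n)  # chosen so the curvature condition holds here
assert y @ s > 0

# Standard BFGS update of B.
Bs = B @ s
B_new = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Factored form: B_new = L (I - s_hat s_hat'/(s_hat's_hat) + y_hat y_hat'/(y's)) L'.
L = np.linalg.cholesky(B)                 # B = L L'
s_hat = L.T @ s
y_hat = np.linalg.solve(L, y)             # L y_hat = y
M = np.eye(n) - np.outer(s_hat, s_hat) / (s_hat @ s_hat) + np.outer(y_hat, y_hat) / (y @ s)
L_tilde = np.linalg.cholesky(M)           # inner matrix is s.p.d. when y's > 0
L_new = L @ L_tilde
print(np.allclose(L_new @ L_new.T, B_new))   # True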
Practical considerations (see pages 200–201)
Line searches that don't satisfy the Wolfe conditions may not satisfy the curvature condition. Then the update may not yield a descent direction, so some kind of recovery strategy is needed (the book suggests damped Newton).
We can also eliminate solving the Newton equation (next slide).
Calculating H
Want $H_k = B_k^{-1}$ directly:
$H_{k+1} = (I - \rho_k s_k y_k')\, H_k\, (I - \rho_k y_k s_k') + \rho_k s_k s_k'$, where $\rho_k = \dfrac{1}{y_k' s_k}$.
The book shows how to derive this directly.
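The same update in code (my sketch, not from the slides); the random test data are assumptions, used only to confirm consistency with the B-form update from the earlier slide:

import numpy as np

def bfgs_update_inverse(H, s, y):
    """Inverse-Hessian BFGS update:
    H_{k+1} = (I - rho s y') H (I - rho y s') + rho s s',  rho = 1/(y's)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)

# Consistency check against the B-form update.
rng = np.random.default_rng(2)
B = np.eye(3)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)
assert y @ s > 0                          # curvature condition holds for this data
Bs = B @ s
B_new = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
H_new = bfgs_update_inverse(np.linalg.inv(B), s, y)
print(np.allclose(H_new, np.linalg.inv(B_new)))   # True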
Finding H
Want $H_{k+1} y_k = s_k$, with $H_{k+1}$ as close as possible to $H_k$:
$\min_H\ \| H - H_k \|$
subject to $H = H'$, $H y_k = s_k$.
We can go back and forth between H and B using the Sherman–Morrison–Woodbury formula (see page 605).