CS257 Linear and Convex Optimization Lecture 10 Bo Jiang John Hopcroft Center for Computer Science Shanghai Jiao Tong University November 9, 2020
Recap

Strong convexity. $f$ is $m$-strongly convex if $f(x) - \frac{m}{2}\|x\|^2$ is convex
• first-order condition: $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2}\|y - x\|^2$
• second-order condition: $\nabla^2 f(x) \succeq mI \iff \lambda_{\min}(\nabla^2 f(x)) \ge m$

Convergence. For $m$-strongly convex and $L$-smooth $f$ with minimum $x^*$, gradient descent with constant step size $t \in (0, \frac{1}{L}]$ satisfies
$$f(x_k) - f(x^*) \le \frac{L}{m}(1 - mt)^k \, [f(x_0) - f(x^*)]$$

Condition number. For $Q \succ O$, $\kappa(Q) = \lambda_{\max}(Q)/\lambda_{\min}(Q)$.
Well-/ill-conditioned if $\kappa(Q)$ is small/large $\implies$ fast/slow convergence.

1/24
Today • exact line search • backtracking line search • Newton’s method 2/24
Step Size

Gradient descent: $x_{k+1} = x_k - t_k \nabla f(x_k)$
• constant step size: $t_k = t$ for all $k$
• exact line search: optimal $t_k$ for each step,
$$t_k = \arg\min_s f(x_k - s \nabla f(x_k))$$
• backtracking line search (Armijo's rule): $t_k$ satisfies
$$f(x_k) - f(x_k - t_k \nabla f(x_k)) \ge \alpha t_k \|\nabla f(x_k)\|_2^2$$
for some given $\alpha \in (0, 1)$.

3/24
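As a quick sketch of the update rule above with a constant step size (the helper name and the test function are mine, not from the slides):

```python
import numpy as np

def gradient_descent(grad, x0, t, tol=1e-8, max_iter=10000):
    # constant-step gradient descent: x_{k+1} = x_k - t * grad(x_k),
    # stopped when the gradient norm falls below tol
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        x = x - t * g
    return x

# quadratic f(x) = 1/2 x^T Q x with Q = diag{0.5, 1}: here L = 1, so t = 1/L = 1
Q = np.diag([0.5, 1.0])
x_min = gradient_descent(lambda z: Q @ z, [2.0, 1.0], t=1.0)
```

With $t \in (0, 1/L]$ the iterates converge to the minimizer $x^* = 0$ of this quadratic.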
Exact Line Search

1: initialization: x ← x0 ∈ R^n
2: while ‖∇f(x)‖ > δ do
3:   t ← arg min_s f(x − s∇f(x))
4:   x ← x − t∇f(x)
5: end while
6: return x

[Figure: level curves of $f(x_1, x_2)$ with the descent ray along $-\nabla f$, and the one-dimensional function $s \mapsto f(x_k - s \nabla f(x_k))$ minimized at $s = t$.]

Note. Often impractical; used only if the inner minimization is cheap.

4/24
Exact Line Search for Quadratic Functions

$$f(x) = \frac{1}{2} x^T Q x + b^T x, \quad Q \succ O$$
• gradient at $x_k$: $g_k = \nabla f(x_k) = Q x_k + b$
• the second-order Taylor expansion is exact for quadratic functions:
$$h(t) = f(x_k - t g_k) = f(x_k) + \nabla f(x_k)^T (-t g_k) + \frac{1}{2} (-t g_k)^T \nabla^2 f(x_k) (-t g_k) = \frac{1}{2} g_k^T Q g_k \, t^2 - g_k^T g_k \, t + f(x_k)$$
• minimizing $h(t)$ yields the best step size
$$t_k = \frac{g_k^T g_k}{g_k^T Q g_k}$$
• update step
$$x_{k+1} = x_k - t_k g_k = x_k - \frac{g_k^T g_k}{g_k^T Q g_k} g_k$$

5/24
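The closed-form step above can be sketched in a few lines (a minimal illustration on an assumed diagonal $Q$; the function name is hypothetical):

```python
import numpy as np

def exact_ls_gd(Q, b, x0, delta=1e-10, max_iter=1000):
    # gradient descent on f(x) = 1/2 x^T Q x + b^T x with the closed-form
    # exact step size t_k = (g_k^T g_k) / (g_k^T Q g_k)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = Q @ x + b                  # gradient at the current iterate
        if np.linalg.norm(g) <= delta:
            break
        t = (g @ g) / (g @ Q @ g)      # minimizer of h(t)
        x = x - t * g
    return x

Q = np.diag([0.5, 1.0])
b = np.zeros(2)
x_min = exact_ls_gd(Q, b, [2.0, 1.0])  # true minimizer is -Q^{-1} b = 0
```

For a quadratic, each inner minimization is a single formula, so this is one of the rare cases where exact line search is cheap.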
Example

$$f(x_1, x_2) = \frac{1}{2} x^T Q x = \frac{\gamma}{2} x_1^2 + \frac{1}{2} x_2^2, \quad Q = \mathrm{diag}\{\gamma, 1\}$$

Well-conditioned. $\gamma = 0.5$, $x_0 = (2, 1)^T$

[Figure: iterates on the level curves of $f$ (left) and the error $f(x_k) - f(x^*)$ versus iteration $k$ on a log scale (right).]

Fast convergence.

Note. Successive gradient directions are always orthogonal, as
$$0 = h'(t_k) = -\nabla f(x_k - t_k \nabla f(x_k))^T \nabla f(x_k) = -\nabla f(x_{k+1})^T \nabla f(x_k)$$

6/24
Example (cont'd)

$$f(x_1, x_2) = \frac{1}{2} x^T Q x = \frac{\gamma}{2} x_1^2 + \frac{1}{2} x_2^2, \quad Q = \mathrm{diag}\{\gamma, 1\}$$

Ill-conditioned. $\gamma = 0.01$; the convergence rate depends on the initial point.

[Figure: iterates and error curves $f(x_k) - f(x^*)$ for two initial points. Left: $x_0 = (2, 0.3)$, fast convergence (on the order of 15 iterations). Right: $x_0 = (2, 0.02)$, slow convergence (hundreds of iterations).]

7/24
Convergence Analysis

Theorem. If $f$ is $m$-strongly convex and $L$-smooth, and $x^*$ is a minimum of $f$, then the sequence $\{x_k\}$ produced by gradient descent with exact line search satisfies
$$f(x_k) - f(x^*) \le \left(1 - \frac{m}{L}\right)^k [f(x_0) - f(x^*)]$$

Notes.
• $0 \le 1 - \frac{m}{L} < 1$, so $x_k \to x^*$ and $f(x_k) \to f(x^*)$ exponentially fast.
• The number of iterations to reach $f(x_k) - f(x^*) \le \epsilon$ is $O(\log \frac{1}{\epsilon})$. For $\epsilon = 10^{-p}$, $k = O(p)$, linear in the number of significant digits.
• The convergence rate depends on the condition number $L/m$ and can be slow if $L/m$ is large. When close to $x^*$, we can estimate $L/m$ by $\kappa(\nabla^2 f(x^*))$.

8/24
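A quick numerical sanity check of the theorem on the slides' quadratic example, reading $m = \gamma$ and $L = 1$ off $Q$ (a sketch; the variable names are mine):

```python
import numpy as np

gamma = 0.01                       # the slides' ill-conditioned example
Q = np.diag([gamma, 1.0])          # m = lambda_min = gamma, L = lambda_max = 1
m, L = gamma, 1.0
f = lambda z: 0.5 * z @ Q @ z      # minimum value f(x*) = 0 at x* = 0

x = np.array([2.0, 0.3])
errors = [f(x)]
for k in range(50):
    g = Q @ x
    x = x - (g @ g) / (g @ Q @ g) * g   # exact line-search step
    errors.append(f(x))

# theorem: f(x_k) - f(x*) <= (1 - m/L)^k [f(x_0) - f(x*)]
bounds = [errors[0] * (1 - m / L) ** k for k in range(len(errors))]
```

For this particular starting point the observed error decays much faster than the worst-case bound, consistent with the "fast convergence" panel on the previous slide.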
Proof

1. By the quadratic upper bound for $L$-smooth functions,
$$f(x_k - t \nabla f(x_k)) \le f(x_k) - t \|\nabla f(x_k)\|^2 + \frac{L t^2}{2} \|\nabla f(x_k)\|^2 \eqqcolon q(t)$$
2. Minimizing over $t$ in step 1,
$$f(x_{k+1}) = \min_t f(x_k - t \nabla f(x_k)) \le \min_t q(t) = q(1/L) = f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|^2$$
3. By $m$-strong convexity,
$$f(x) \ge f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{m}{2} \|x - x_k\|^2 \eqqcolon \hat{f}(x)$$
4. Minimizing over $x$ in step 3,
$$f(x^*) = \min_x f(x) \ge \min_x \hat{f}(x) = \hat{f}\Big(x_k - \frac{1}{m} \nabla f(x_k)\Big) = f(x_k) - \frac{1}{2m} \|\nabla f(x_k)\|^2$$
5. By 4, $\|\nabla f(x_k)\|^2 \ge 2m [f(x_k) - f(x^*)]$. Plugging into 2,
$$f(x_{k+1}) - f(x^*) \le \left(1 - \frac{m}{L}\right) [f(x_k) - f(x^*)]$$

9/24
Backtracking Line Search

Exact line search is often expensive and not worth it. It suffices to find a good enough step size. One way to do so is backtracking line search, aka Armijo's rule.

Gradient descent with backtracking line search
1: initialization: x ← x0 ∈ R^n
2: while ‖∇f(x)‖ > δ do
3:   t ← t0
4:   while f(x − t∇f(x)) > f(x) − αt‖∇f(x)‖²₂ do
5:     t ← βt
6:   end while
7:   x ← x − t∇f(x)
8: end while
9: return x

$\alpha \in (0, 1)$ and $\beta \in (0, 1)$ are constants. Armijo used $\alpha = \beta = 0.5$.
Values suggested in [BV]: $\alpha \in [0.01, 0.3]$, $\beta \in [0.1, 0.8]$.

Note. For a general descent direction $d$, use the condition $f(x + t d) > f(x) + \alpha t \nabla f(x)^T d$.

10/24
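The algorithm above can be sketched as follows (a minimal illustration with [BV]-style parameters $\alpha = 0.3$, $\beta = 0.5$; the function name and test problem are mine):

```python
import numpy as np

def backtracking_gd(f, grad, x0, t0=1.0, alpha=0.3, beta=0.5,
                    delta=1e-8, max_iter=10000):
    # gradient descent with Armijo backtracking line search
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        gg = g @ g
        if np.sqrt(gg) <= delta:
            break
        t = t0
        # shrink t geometrically until the sufficient-decrease condition holds
        while f(x - t * g) > f(x) - alpha * t * gg:
            t *= beta
        x = x - t * g
    return x

Q = np.diag([0.5, 1.0])
x_min = backtracking_gd(lambda z: 0.5 * z @ Q @ z, lambda z: Q @ z, [2.0, 1.0])
```

Each outer iteration costs one gradient evaluation plus a handful of function evaluations in the inner loop, which is typically far cheaper than an exact inner minimization.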
Backtracking Line Search (cont'd)

[Figure: $f(x_k + t d_k)$ versus $t$, together with the tangent line $f(x_k) + t \nabla f(x_k)^T d_k$ and the relaxed line $f(x_k) + \alpha t \nabla f(x_k)^T d_k$; candidate step sizes $t_0$, $t_1 = \beta t_0$, $t_2 = \beta^2 t_0$.]

• $\nabla f(x_k)^T d_k < 0$ for a descent direction $d_k$
• start from some "large" step size $t_0$ ([BV] uses $t_0 = 1$)
• reduce the step size geometrically until the decrease is "large enough":
$$\underbrace{f(x_k) - f(x_k + t d_k)}_{\text{actual decrease in function value}} \ge \alpha \times \underbrace{t \, |\nabla f(x_k)^T d_k|}_{\text{decrease along tangent line}}$$

11/24
Example

$$f(x_1, x_2) = \frac{1}{2} x^T Q x = \frac{\gamma}{2} x_1^2 + \frac{1}{2} x_2^2, \quad Q = \mathrm{diag}\{\gamma, 1\}$$

Well-conditioned. $\gamma = 0.5$, $x_0 = (2, 1)^T$

[Figure: iterates on the level curves of $f$ (left) and the error $f(x_k) - f(x^*)$ versus iteration $k$ on a log scale (right).]

Fast convergence.

12/24
Example (cont'd)

$$f(x_1, x_2) = \frac{1}{2} x^T Q x = \frac{\gamma}{2} x_1^2 + \frac{1}{2} x_2^2, \quad Q = \mathrm{diag}\{\gamma, 1\}$$

Ill-conditioned. $\gamma = 0.01$

[Figure: iterates and error curves $f(x_k) - f(x^*)$ for two initial points, both requiring hundreds of iterations. Left: $x_0 = (2, 0.3)$, slow convergence. Right: $x_0 = (2, 0.02)$, slow convergence.]

13/24
Convergence Analysis

Theorem. If $f$ is $m$-strongly convex and $L$-smooth, and $x^*$ is a minimum of $f$, then the sequence $\{x_k\}$ produced by gradient descent with backtracking line search satisfies
$$f(x_k) - f(x^*) \le c^k [f(x_0) - f(x^*)]$$
where
$$c = 1 - \min\left\{2 m \alpha t_0, \; \frac{4 m \beta \alpha (1 - \alpha)}{L}\right\}$$

Notes.
• $c \in (0, 1)$, as $\frac{4 m \beta \alpha (1 - \alpha)}{L} \le \frac{\beta m}{L} \le \beta < 1$, so $x_k \to x^*$ and $f(x_k) \to f(x^*)$ exponentially fast.
• The number of iterations to reach $f(x_k) - f(x^*) \le \epsilon$ is $O(\log \frac{1}{\epsilon})$. For $\epsilon = 10^{-p}$, $k = O(p)$, linear in the number of significant digits.

14/24
Proof

The inner loop terminates with a step size bounded from below.

1. By the quadratic upper bound for $L$-smooth functions,
$$f(x_k - t \nabla f(x_k)) \le f(x_k) - t \left(1 - \frac{L t}{2}\right) \|\nabla f(x_k)\|^2$$
2. The inner loop terminates for sure if
$$- t \left(1 - \frac{L t}{2}\right) \|\nabla f(x_k)\|^2 \le - \alpha t \|\nabla f(x_k)\|^2, \quad \text{i.e. whenever } t \le \frac{2(1 - \alpha)}{L}$$
3. The step size in backtracking line search satisfies
$$t_k \ge \eta \coloneqq \min\left\{t_0, \; \frac{2 \beta (1 - \alpha)}{L}\right\}$$
  ◮ $t_k = t_0$ if Armijo's condition is satisfied by $t_0$
  ◮ otherwise $\frac{t_k}{\beta} > \frac{2(1 - \alpha)}{L}$, since the inner loop did not terminate at $\frac{t_k}{\beta}$, so $t_k > \frac{2 \beta (1 - \alpha)}{L}$

15/24
Proof (cont'd)

Now we look at the outer loop.

4. By Armijo's condition in the inner loop,
$$f(x_{k+1}) = f(x_k - t_k \nabla f(x_k)) \le f(x_k) - \alpha t_k \|\nabla f(x_k)\|^2$$
5. By 3 and 4,
$$f(x_{k+1}) - f(x^*) \le f(x_k) - f(x^*) - \alpha \eta \|\nabla f(x_k)\|^2$$
6. By step 4 of slide 9, $\|\nabla f(x_k)\|^2 \ge 2m [f(x_k) - f(x^*)]$
7. By 5 and 6,
$$f(x_{k+1}) - f(x^*) \le (1 - 2 m \alpha \eta)[f(x_k) - f(x^*)] = c \, [f(x_k) - f(x^*)]$$
so $f(x_k) - f(x^*) \le c^k [f(x_0) - f(x^*)]$

16/24
Better Descent Direction

Gradient descent uses first-order information (i.e. the gradient):
$$x_{k+1} = x_k - t_k \nabla f(x_k)$$
Locally $-\nabla f(x_k)$ is the max-rate descending direction, but globally it may not be the "right" direction.

Example. For $f(x) = \frac{1}{2} x^T Q x$ with $Q = \mathrm{diag}\{0.01, 1\}$, the optimum is $x^* = 0$. The negative gradient is
$$-\nabla f(x) = -Qx = -(0.01 x_1, x_2)^T$$
quite different from the "right" descent direction $d = -x$. Note
$$d = -Q^{-1} \nabla f(x) = -[\nabla^2 f(x)]^{-1} \nabla f(x)$$
With second-order information (i.e. the Hessian), we hope to do better.

17/24
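The two directions in the example can be compared directly (a sketch; the current iterate $x$ is an arbitrary assumed point, not from the slides):

```python
import numpy as np

Q = np.diag([0.01, 1.0])           # Hessian of f(x) = 1/2 x^T Q x
x = np.array([2.0, 1.0])           # an arbitrary (hypothetical) iterate
g = Q @ x                          # gradient: (0.01*x1, x2)

d_gd = -g                          # steepest-descent direction
d_newton = -np.linalg.solve(Q, g)  # Newton direction: -Q^{-1} grad = -x
```

Here the Newton direction points straight at the optimum, so a single unit step $x + d_{\text{newton}}$ lands exactly on $x^* = 0$, while the steepest-descent direction is nearly orthogonal to it along the flat axis.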