CS257 Linear and Convex Optimization
Lecture 9

Bo Jiang
John Hopcroft Center for Computer Science
Shanghai Jiao Tong University

November 2, 2020

Recap: Gradient Descent, L-Lipschitz, L-smoothness

Gradient descent
1: initialization x ← x_0 ∈ ℝⁿ
2: while ‖∇f(x)‖ > δ do
3:     x ← x − t∇f(x)
4: end while
5: return x

L-Lipschitz:   ‖f(x) − f(y)‖ ≤ L‖x − y‖,  ∀x, y

L-smoothness:  ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖,  ∀x, y

A twice continuously differentiable function f : ℝⁿ → ℝ is L-smooth iff |λ| ≤ L for all eigenvalues λ of ∇²f(x) at all x. If f is convex, the condition becomes λ_max(∇²f(x)) ≤ L.
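
As a reading aid (not part of the original slides), here is a minimal Python sketch of the loop above; the gradient grad_f, step size t, tolerance delta, and iteration cap are assumed inputs.

    import numpy as np

    def gradient_descent(grad_f, x0, t, delta=1e-8, max_iter=100_000):
        """Gradient descent with constant step size t."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) <= delta:   # stopping test from line 2 of the pseudocode
                break
            x = x - t * g
        return x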

Recap: Consequences of L-smoothness

• Quadratic upper bound:

    f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖y − x‖²

• Gradient descent with constant step size t ∈ (0, 1/L] satisfies

    f(x_k) − f(x_{k+1}) ≥ (t/2)‖∇f(x_k)‖²

• If f* = inf f(x) is finite, then for all N,

    ∑_{k=0}^{N} ‖∇f(x_k)‖² ≤ (2/t)[f(x_0) − f*] < ∞

  so lim_{k→∞} ∇f(x_k) = 0.

Note. This asserts nothing about the convergence of f(x_k) or x_k.
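
To make the per-step guarantee concrete, here is a small numeric check (my own sketch; the test function f(x) = 6x², which reappears below and is 12-smooth, and the step size t = 0.05 ∈ (0, 1/12] are assumptions):

    # f(x) = 6x^2 is L-smooth with L = 12; pick t in (0, 1/L].
    f = lambda x: 6 * x**2
    grad_f = lambda x: 12 * x
    t, x = 0.05, 5.0
    for k in range(5):
        x_next = x - t * grad_f(x)
        # per-step decrease guaranteed by L-smoothness:
        assert f(x) - f(x_next) >= (t / 2) * grad_f(x)**2 - 1e-12
        x = x_next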

Today

• convergence analysis
• strong convexity
• condition number

Convergence Analysis

Theorem. If f is convex and L-smooth, and x* is a minimum of f, then for step size t ∈ (0, 1/L], the sequence {x_k} produced by the gradient descent algorithm satisfies

    f(x_k) − f(x*) ≤ ‖x_0 − x*‖² / (2tk)

Notes.
• f(x_k) ↓ f* as k → ∞.
• Any limit point of x_k is an optimal solution.
• The rate of convergence is O(1/k), i.e. the number of iterations needed to guarantee f(x_k) − f(x*) ≤ ε is O(1/ε). For ε = 10⁻ᵖ, k = O(10ᵖ), exponential in the number of significant digits!
• Convergence is faster with larger t; the best choice is t = 1/L, but L is usually unknown.
• A good initial guess helps.
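
The sketch below checks the theorem's bound along an actual run (my own verification; the test function f(x) = x⁴, which is convex and 12-smooth on |x| ≤ 1 where the iterates stay, and x_0 = 1 are assumptions):

    f = lambda x: x**4            # x* = 0, f(x*) = 0
    grad_f = lambda x: 4 * x**3
    t, x0 = 1 / 12, 1.0           # t = 1/L with L = 12
    x = x0
    for k in range(1, 1001):
        x = x - t * grad_f(x)
        # f(x_k) - f(x*) <= ||x_0 - x*||^2 / (2tk)
        assert f(x) <= x0**2 / (2 * t * k)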

Proof

1. By the basic gradient step x_{k+1} = x_k − t∇f(x_k),

    ‖x_{k+1} − x*‖² = ‖x_k − t∇f(x_k) − x*‖²
                    = ‖x_k − x*‖² + t²‖∇f(x_k)‖² + 2t∇f(x_k)ᵀ(x* − x_k)

2. By the first-order condition for convexity,

    ∇f(x_k)ᵀ(x* − x_k) ≤ f(x*) − f(x_k)

3. Plugging 2 into 1,

    ‖x_{k+1} − x*‖² ≤ ‖x_k − x*‖² + t²‖∇f(x_k)‖² + 2t[f(x*) − f(x_k)]

4. Plugging in (t/2)‖∇f(x_k)‖² ≤ f(x_k) − f(x_{k+1}) from the recap of L-smoothness above,

    ‖x_{k+1} − x*‖² ≤ ‖x_k − x*‖² + 2t[f(x*) − f(x_{k+1})]

Proof (cont’d)

5. Rearranging,

    f(x_{k+1}) − f(x*) ≤ [‖x_k − x*‖² − ‖x_{k+1} − x*‖²] / (2t)

6. Summing over k from 0 to N − 1, the sum telescopes:

    ∑_{k=0}^{N−1} [f(x_{k+1}) − f(x*)] ≤ [‖x_0 − x*‖² − ‖x_N − x*‖²] / (2t) ≤ ‖x_0 − x*‖² / (2t)

7. Recalling the descent property f(x_{k+1}) ≤ f(x_k),

    f(x_N) − f(x*) ≤ (1/N) ∑_{k=0}^{N−1} [f(x_{k+1}) − f(x*)] ≤ ‖x_0 − x*‖² / (2tN)

Fast Convergence

The following f is 12-smooth:

    f(x) = 6x²

[Figure: left, plot of f(x) = 6x²; right, log-scale plot of f(x_k) − f(x*) vs. iteration k, matching the closed form f(x_k) = 6(1 − 12t)^{2k} x_0².]

For small enough step size t (e.g. 0.1),

    f(x_k) = 6x_0²(1 − 12t)^{2k}

Need O(log(1/ε)) iterations to get within ε of optimal.
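
A sketch reproducing the right panel numerically, using the slide's step size t = 0.1 and an assumed start x_0 = 1:

    f = lambda x: 6 * x**2
    grad_f = lambda x: 12 * x
    t, x = 0.1, 1.0
    for k in range(1, 9):
        x = x - t * grad_f(x)
        gap = f(x)           # f(x_k) - f(x*), since f(x*) = 0
        print(k, gap)        # shrinks geometrically: 6 * (1 - 12t)^(2k)

After eight iterations the gap is already around 10⁻¹⁰, as in the plot.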

Slow Convergence

The following f is also 12-smooth:

    f(x) = { x⁴,        if |x| ≤ 1
           { 4|x| − 3,  if |x| ≥ 1

[Figure: left, plot of f; right, log-log plot of f(x_k) − f(x*) vs. iteration k together with the asymptote (8tk)⁻².]

For x_0 ∈ (0, 1), small enough step size t (e.g. 0.1), and large k,

    x_k ∼ 1/√(8tk),   f(x_k) ∼ 1/(8tk)²

Need O(1/√ε) iterations to get within ε of the optimal value (i.e. 0).
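
The same experiment for this flat-bottomed function (my own sketch; x_0 = 0.9 is an assumed start in (0, 1)) shows the gap closing only polynomially, tracking (8tk)⁻²:

    f = lambda x: x**4 if abs(x) <= 1 else 4 * abs(x) - 3
    grad_f = lambda x: 4 * x**3 if abs(x) <= 1 else (4 if x > 0 else -4)
    t, x = 0.1, 0.9
    for k in range(1, 10001):
        x = x - t * grad_f(x)
    print(f(x), (8 * t * 10000)**-2)   # both around 1.6e-8: gap ~ (8tk)^(-2)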

Strong Convexity

A function f is strongly convex with parameter m > 0, or simply m-strongly convex, if

    f̃(x) = f(x) − (m/2)‖x‖²

is convex.

Note. f(x) = (m/2)‖x‖² + f̃(x), i.e. f is (m/2)‖x‖² plus an extra convex term. Informally, "m-strongly convex" means at least as "convex" as (m/2)‖x‖².

Example. f(x) = (a/2)‖x‖² is m-strongly convex iff a ≥ m.

[Figure: parabolas f(x) = ½a_1x² with a_1 > m, f(x) = ½mx², and f(x) = ½a_2x² with a_2 < m; the first two are m-strongly convex, the last is not.]
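
A numeric illustration of the definition (my own sketch, not from the slides): convexity of the shifted function f̃ is checked via second differences on a grid.

    import numpy as np

    def is_m_strongly_convex_on_grid(f, m, xs):
        """Heuristic check: f_tilde(x) = f(x) - (m/2) x^2 should have
        nonnegative second differences on the grid xs."""
        ft = f(xs) - 0.5 * m * xs**2
        return np.all(ft[:-2] - 2 * ft[1:-1] + ft[2:] >= -1e-12)

    xs = np.linspace(-2, 2, 401)
    print(is_m_strongly_convex_on_grid(lambda x: 1.5 * x**2, 2.0, xs))  # True: a = 3 >= m = 2
    print(is_m_strongly_convex_on_grid(lambda x: x**4, 2.0, xs))        # False: fails near 0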

Strong Convexity (cont’d)

Example. f(x) = aᵀx is not m-strongly convex for any m > 0, as f̃(x) = aᵀx − (m/2)‖x‖² is concave.

Example. f(x) = x⁴ is not m-strongly convex for any m > 0, as f̃(x) = x⁴ − (m/2)x² is not convex:

    f̃″(x) = 12x² − m < 0  for |x| < √(m/12)

[Figure: plots of f(x) = x⁴ and f̃(x) = x⁴ − (m/2)x²; the latter is nonconvex near the origin.]

First-order Condition

A differentiable f is m-strongly convex iff

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖x − y‖²,  ∀x, y

[Figure: f(y) lies above the quadratic lower bound f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖², which in turn lies above the tangent line f(x) + ∇f(x)ᵀ(y − x) at (x, f(x)).]

• strong convexity ⟹ strict convexity ⟹ convexity
• m-strong convexity and L-smoothness together imply

    (m/2)‖x − y‖² ≤ f(y) − f(x) − ∇f(x)ᵀ(y − x) ≤ (L/2)‖x − y‖²

Proof

1. By definition, f is m-strongly convex ⟺ f̃(x) = f(x) − (m/2)‖x‖² is convex.

2. By the first-order condition for convexity,

    ⟺ f̃(y) ≥ f̃(x) + ∇f̃(x)ᵀ(y − x),  ∀x, y

3. Noting ∇f̃(x) = ∇f(x) − mx,

    ⟺ f(y) − (m/2)‖y‖² ≥ f(x) − (m/2)‖x‖² + (∇f(x) − mx)ᵀ(y − x),  ∀x, y

4. Rearranging and using yᵀy − xᵀx − 2xᵀ(y − x) = (y − x)ᵀ(y − x),

    ⟺ f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖x − y‖²,  ∀x, y

Second-order Condition

A twice continuously differentiable f is m-strongly convex iff

    ∇²f(x) ⪰ mI,  ∀x

or equivalently, the smallest eigenvalue of ∇²f(x) satisfies

    λ_min(∇²f(x)) ≥ m,  ∀x

Proof. f̃(x) = f(x) − (m/2)‖x‖² is convex iff ∇²f̃(x) = ∇²f(x) − mI ⪰ O.

Example. With Q = [1 0; 0 2] ≻ O, f(x) = ½xᵀQx = ½x_1² + x_2² is 1-strongly convex. More generally, f(x) = ½xᵀQx with Q ≻ O is λ_min(Q)-strongly convex, where λ_min(Q) is the smallest eigenvalue of Q.
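
Checking the example numerically (numpy's eigvalsh computes the eigenvalues of the symmetric Hessian):

    import numpy as np

    Q = np.array([[1.0, 0.0],
                  [0.0, 2.0]])       # Hessian of f(x) = (1/2) x^T Q x is Q itself
    m = np.linalg.eigvalsh(Q).min()  # strong convexity parameter lambda_min(Q)
    print(m)                         # 1.0, so f is 1-strongly convex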

Convergence: 1D Example

f(x) = ½mx² with m > 0 is both m-smooth and m-strongly convex.

Recall the gradient descent step is

    x_{k+1} = x_k − tf′(x_k) = (1 − mt)x_k

and x_k → x* = 0 iff t ∈ (0, 2/m).

If t = 1/m, it gets to x* in one step.

For t ∈ (0, 1/m) ∪ (1/m, 2/m),

    x_k = (1 − mt)^k x_0

so both x_k → x* and f(x_k) → f(x*) exponentially fast:

    |x_k − x*| = |1 − mt|^k · |x_0 − x*|
    |f(x_k) − f(x*)| = (m/2)(1 − mt)^{2k} |x_0 − x*|²
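
A two-line check of the one-step case (m = 4 and x_0 = 3.7 are arbitrary assumed values):

    m, t, x0 = 4.0, 1 / 4.0, 3.7   # f(x) = (m/2) x^2, step size t = 1/m
    print((1 - m * t) * x0)        # 0.0: x_1 = (1 - mt) x_0 hits x* = 0 in one step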

Convergence Analysis

Theorem. If f is m-strongly convex and L-smooth, and x* is a minimum of f, then for step size t ∈ (0, 1/L], the sequence {x_k} produced by the gradient descent algorithm satisfies

    f(x_k) − f(x*) ≤ (L/2)(1 − mt)^k ‖x_0 − x*‖²
    ‖x_k − x*‖² ≤ (1 − mt)^k ‖x_0 − x*‖²

Notes.
• 0 ≤ 1 − m/L ≤ 1 − mt < 1, so x_k → x* and f(x_k) → f(x*) exponentially fast.
• The number of iterations to reach f(x_k) − f(x*) ≤ ε is O(log(1/ε)). For ε = 10⁻ᵖ, k = O(p), linear in the number of significant digits!
• Since ∇f(x*) = 0, the sandwich bounds from the first-order condition (with x = x*, y = x_k) yield

    (m/2)‖x_k − x*‖² ≤ f(x_k) − f(x*) ≤ (L/2)‖x_k − x*‖²

  relating the bounds on ‖x_k − x*‖² to those on f(x_k) − f(x*).
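
A sketch verifying both bounds along a run (my own check; the test problem f(x) = ½xᵀQx with Q = diag(1, 10), so m = 1, L = 10, x* = 0, and x_0 = (1, 1) are assumptions):

    import numpy as np

    Q = np.diag([1.0, 10.0])
    m, L = 1.0, 10.0
    t = 1 / L
    x0 = np.array([1.0, 1.0])
    x = x0.copy()
    for k in range(1, 101):
        x = x - t * (Q @ x)                                   # gradient step
        assert 0.5 * (x @ Q @ x) <= 0.5 * L * (1 - m * t)**k * (x0 @ x0)  # f-gap bound
        assert (x @ x) <= (1 - m * t)**k * (x0 @ x0)          # distance bound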

Proof

Similar to the proof without strong convexity; the strong-convexity terms are the difference.

1. By the basic gradient step x_{k+1} = x_k − t∇f(x_k),

    ‖x_{k+1} − x*‖² = ‖x_k − t∇f(x_k) − x*‖²
                    = ‖x_k − x*‖² + t²‖∇f(x_k)‖² + 2t∇f(x_k)ᵀ(x* − x_k)

2. By m-strong convexity,

    ∇f(x_k)ᵀ(x* − x_k) ≤ f(x*) − f(x_k) − (m/2)‖x_k − x*‖²

3. Plugging 2 into 1,

    ‖x_{k+1} − x*‖² ≤ (1 − mt)‖x_k − x*‖² + t²‖∇f(x_k)‖² + 2t[f(x*) − f(x_k)]

4. Plugging in f(x_{k+1}) ≤ f(x_k) − (t/2)‖∇f(x_k)‖² from the recap of L-smoothness,

    ‖x_{k+1} − x*‖² ≤ (1 − mt)‖x_k − x*‖² + 2t[f(x*) − f(x_{k+1})]

5. Since f(x*) ≤ f(x_{k+1}),

    ‖x_{k+1} − x*‖² ≤ (1 − mt)‖x_k − x*‖²

Convergence: 2D Quadratic Function

    f(x) = ½xᵀQx,   Q = [m 0; 0 L]

where L > m > 0. f is L-smooth and m-strongly convex, and x* = 0.

The gradient descent step is

    x_{k+1} = x_k − t∇f(x_k) = (I − tQ)x_k

so

    x_k = (I − tQ)^k x_0 = ((1 − mt)^k x_{01}, (1 − Lt)^k x_{02})

and

    f(x_k) = (m/2)(1 − mt)^{2k} x_{01}² + (L/2)(1 − Lt)^{2k} x_{02}²

To ensure convergence, t < 2/L. The convergence rate is determined by the slower of (1 − mt)^{2k} and (1 − Lt)^{2k}.
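
A quick look at this effect with assumed values m = 1, L = 100, t = 1/L (so 1 − Lt = 0 and 1 − mt = 0.99) and x_0 = (1, 1):

    for k in (1, 10, 100, 1000):
        xk = ((1 - 1 / 100)**k, (1 - 100 / 100)**k)   # closed form (I - tQ)^k x_0
        print(k, xk)   # L-coordinate is 0 after one step; m-coordinate decays like 0.99^k

The slow factor is governed by the ratio L/m, the condition number previewed in today's outline.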