  1. Lecture: Fast Proximal Gradient Methods http://bicmr.pku.edu.cn/~wenzw/opt-2018-fall.html Acknowledgement: these slides are based on Prof. Lieven Vandenberghe's lecture notes 1/38

  2. Outline
     1. fast proximal gradient method (FISTA)
     2. FISTA with line search
     3. FISTA as descent method
     4. Nesterov's second method
     5. Proof by estimating sequence
     2/38

  3. Fast (proximal) gradient methods
     - Nesterov (1983, 1988, 2005): three projection methods with $1/k^2$ convergence rate
     - Beck & Teboulle (2008): FISTA, a proximal gradient version of Nesterov's 1983 method
     - Nesterov (2004 book), Tseng (2008): overview and unified analysis of fast gradient methods
     - several recent variations and extensions
     this lecture: FISTA and Nesterov's second method (1988) as presented by Tseng
     3/38

  4. FISTA (basic version)
     minimize $f(x) = g(x) + h(x)$
     - $g$ convex, differentiable, with $\operatorname{dom} g = \mathbb{R}^n$
     - $h$ closed, convex, with inexpensive prox operator $\operatorname{prox}_{th}$
     algorithm: choose any $x^{(0)} = x^{(-1)}$; for $k \ge 1$, repeat the steps
     $$y = x^{(k-1)} + \frac{k-2}{k+1}\left(x^{(k-1)} - x^{(k-2)}\right)$$
     $$x^{(k)} = \operatorname{prox}_{t_k h}\left(y - t_k \nabla g(y)\right)$$
     - step size $t_k$ fixed or determined by line search
     - the acronym stands for 'Fast Iterative Shrinkage-Thresholding Algorithm'
     4/38
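
A minimal Python sketch of the iteration above; the callables `grad_g` and `prox_h` and the fixed step size `t` are placeholders to be supplied by the user, not part of the slides.

```python
def fista(grad_g, prox_h, x0, t, n_iter=100):
    """Basic FISTA with fixed step size t (a sketch of this slide's iteration).

    grad_g : callable returning the gradient of the smooth part g
    prox_h : callable (w, t) -> prox_{t h}(w)
    """
    x_prev = x0.copy()   # plays x^{(k-2)}; x^{(0)} = x^{(-1)}
    x = x0.copy()        # plays x^{(k-1)}
    for k in range(1, n_iter + 1):
        y = x + (k - 2) / (k + 1) * (x - x_prev)   # extrapolation step
        x_prev = x
        x = prox_h(y - t * grad_g(y), t)           # proximal gradient step at y
    return x
```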

  5. Interpretation
     - first iteration ($k = 1$) is a proximal gradient step at $y = x^{(0)}$
     - next iterations are proximal gradient steps at extrapolated points $y$
     - note: $x^{(k)}$ is feasible (in $\operatorname{dom} h$); $y$ may be outside $\operatorname{dom} h$
     5/38

  6. Example
     minimize $\log \sum_{i=1}^{m} \exp(a_i^T x + b_i)$
     randomly generated data with $m = 2000$, $n = 1000$, same fixed step size
     6/38
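
For this example $h = 0$, so the prox step reduces to a plain gradient step. A sketch of the smooth objective and its gradient, using the standard max-subtraction trick for numerical stability; the matrix `A` (with rows $a_i^T$) and vector `b` are assumed given.

```python
import numpy as np

def logsumexp_obj_grad(A, b, x):
    """Objective log(sum_i exp(a_i^T x + b_i)) and its gradient.

    Shifting by the max before exponentiating avoids overflow.
    """
    z = A @ x + b
    zmax = z.max()
    w = np.exp(z - zmax)           # unnormalized weights
    obj = zmax + np.log(w.sum())
    grad = A.T @ (w / w.sum())     # gradient is A^T times softmax(z)
    return obj, grad
```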

  7. another instance: FISTA is not a descent method
     7/38

  8. Convergence of FISTA
     assumptions:
     - $g$ convex with $\operatorname{dom} g = \mathbb{R}^n$; $\nabla g$ Lipschitz continuous with constant $L$:
       $$\|\nabla g(x) - \nabla g(y)\|_2 \le L \|x - y\|_2 \quad \forall x, y$$
     - $h$ is closed and convex (so that $\operatorname{prox}_{th}(u)$ is well defined)
     - optimal value $f^\star$ is finite and attained at $x^\star$ (not necessarily unique)
     convergence result: $f(x^{(k)}) - f^\star$ decreases at least as fast as $1/k^2$
     - with fixed step size $t_k = 1/L$
     - with suitable line search
     8/38
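
As a concrete check of the Lipschitz assumption (my addition, not on the slide): for the least-squares term used later on page 15,

```latex
% For g(x) = (1/2)\|Ax - b\|_2^2 we have \nabla g(x) = A^T (Ax - b), so
\|\nabla g(x) - \nabla g(y)\|_2 = \|A^T A (x - y)\|_2
  \le \lambda_{\max}(A^T A)\, \|x - y\|_2 ,
% hence L = \lambda_{\max}(A^T A), the step-size constant used on page 15.
```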

  9. Reformulation of FISTA
     define $\theta_k = 2/(k+1)$ and introduce an intermediate variable $v^{(k)}$
     algorithm: choose $x^{(0)} = v^{(0)}$; for $k \ge 1$, repeat the steps
     $$y = (1 - \theta_k)\, x^{(k-1)} + \theta_k v^{(k-1)}$$
     $$x^{(k)} = \operatorname{prox}_{t_k h}\left(y - t_k \nabla g(y)\right)$$
     $$v^{(k)} = x^{(k-1)} + \frac{1}{\theta_k}\left(x^{(k)} - x^{(k-1)}\right)$$
     substituting the expression for $v^{(k)}$ in the formula for $y$ gives the FISTA of page 4
     9/38
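
A sketch of the reformulated iteration in Python; as before, `grad_g` and `prox_h` are user-supplied placeholders. With $\theta_k = 2/(k+1)$ this generates the same iterates $x^{(k)}$ as the basic version on page 4.

```python
def fista_theta(grad_g, prox_h, x0, t, n_iter=100):
    """FISTA in the (theta_k, v) form of this slide; equivalent to page 4."""
    x, v = x0.copy(), x0.copy()        # x^{(0)} = v^{(0)}
    for k in range(1, n_iter + 1):
        theta = 2.0 / (k + 1)
        y = (1 - theta) * x + theta * v
        x_new = prox_h(y - t * grad_g(y), t)
        v = x + (x_new - x) / theta    # intermediate variable v^{(k)}
        x = x_new
    return x
```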

  10. Important inequalities
     choice of $\theta_k$: the sequence $\theta_k = 2/(k+1)$ satisfies $\theta_1 = 1$ and
     $$\frac{1 - \theta_k}{\theta_k^2} \le \frac{1}{\theta_{k-1}^2}, \qquad k \ge 2$$
     upper bound on $g$ from the Lipschitz property:
     $$g(u) \le g(z) + \nabla g(z)^T (u - z) + \frac{L}{2} \|u - z\|_2^2 \quad \forall u, z$$
     upper bound on $h$ from the definition of the prox operator:
     $$h(u) \le h(z) + \frac{1}{t}(w - u)^T (u - z) \quad \forall w, z, \quad u = \operatorname{prox}_{th}(w)$$
     Note: $u = \operatorname{prox}_{th}(w)$ minimizes $t h(u) + \frac{1}{2}\|u - w\|_2^2$, which gives $0 \in t\,\partial h(u) + (u - w)$. Hence $\frac{1}{t}(w - u) \in \partial h(u)$.
     10/38
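
A one-line check of the first inequality (the slide states it without proof): with $\theta_k = 2/(k+1)$,

```latex
\frac{1-\theta_k}{\theta_k^2}
  = \frac{(k-1)/(k+1)}{4/(k+1)^2}
  = \frac{(k-1)(k+1)}{4}
  = \frac{k^2 - 1}{4}
  \le \frac{k^2}{4}
  = \frac{1}{\theta_{k-1}^2}.
```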

  11. Progress in one iteration
     define $x = x^{(i-1)}$, $x^+ = x^{(i)}$, $v = v^{(i-1)}$, $v^+ = v^{(i)}$, $t = t_i$, $\theta = \theta_i$
     upper bound from the Lipschitz property: if $0 < t \le 1/L$,
     $$g(x^+) \le g(y) + \nabla g(y)^T (x^+ - y) + \frac{1}{2t} \|x^+ - y\|_2^2 \qquad (1)$$
     upper bound from the definition of the prox operator:
     $$h(x^+) \le h(z) + \nabla g(y)^T (z - x^+) + \frac{1}{t}(x^+ - y)^T (z - x^+) \quad \forall z$$
     add the upper bounds and use convexity of $g$:
     $$f(x^+) \le f(z) + \frac{1}{t}(x^+ - y)^T (z - x^+) + \frac{1}{2t} \|x^+ - y\|_2^2 \quad \forall z$$
     11/38

  12. make a convex combination of the upper bounds for $z = x$ and $z = x^\star$:
     $$f(x^+) - f^\star - (1 - \theta)(f(x) - f^\star) = f(x^+) - \theta f^\star - (1 - \theta) f(x)$$
     $$\le \frac{1}{t}(x^+ - y)^T \left(\theta x^\star + (1 - \theta) x - x^+\right) + \frac{1}{2t} \|x^+ - y\|_2^2$$
     $$= \frac{1}{2t} \left( \|y - (1 - \theta) x - \theta x^\star\|_2^2 - \|x^+ - (1 - \theta) x - \theta x^\star\|_2^2 \right)$$
     $$= \frac{\theta^2}{2t} \left( \|v - x^\star\|_2^2 - \|v^+ - x^\star\|_2^2 \right)$$
     conclusion: if the inequality (1) holds at iteration $i$, then
     $$\frac{t_i}{\theta_i^2} \left( f(x^{(i)}) - f^\star \right) + \frac{1}{2} \|v^{(i)} - x^\star\|_2^2 \le \frac{(1 - \theta_i)\, t_i}{\theta_i^2} \left( f(x^{(i-1)}) - f^\star \right) + \frac{1}{2} \|v^{(i-1)} - x^\star\|_2^2 \qquad (2)$$
     12/38
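
The second equality above uses the following completing-the-square identity (a step the slide leaves implicit), applied with $z = (1-\theta)x + \theta x^\star$:

```latex
(x^+ - y)^T (z - x^+) + \tfrac{1}{2}\|x^+ - y\|_2^2
  = \tfrac{1}{2}\|y - z\|_2^2 - \tfrac{1}{2}\|x^+ - z\|_2^2,
% which follows from expanding \|y - z\|^2 = \|(x^+ - z) - (x^+ - y)\|^2;
% the last equality then uses y - (1-\theta)x - \theta x^\star = \theta(v - x^\star)
% and x^+ - (1-\theta)x - \theta x^\star = \theta(v^+ - x^\star).
```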

  13. Analysis for fixed step size
     take $t_i = t = 1/L$ and apply (2) recursively, using $(1 - \theta_i)/\theta_i^2 \le 1/\theta_{i-1}^2$:
     $$\frac{t}{\theta_k^2} \left( f(x^{(k)}) - f^\star \right) + \frac{1}{2} \|v^{(k)} - x^\star\|_2^2 \le \frac{(1 - \theta_1)\, t}{\theta_1^2} \left( f(x^{(0)}) - f^\star \right) + \frac{1}{2} \|v^{(0)} - x^\star\|_2^2 = \frac{1}{2} \|x^{(0)} - x^\star\|_2^2$$
     therefore
     $$f(x^{(k)}) - f^\star \le \frac{\theta_k^2}{2t} \|x^{(0)} - x^\star\|_2^2 = \frac{2L}{(k+1)^2} \|x^{(0)} - x^\star\|_2^2$$
     conclusion: reaches $f(x^{(k)}) - f^\star \le \epsilon$ after $O(1/\sqrt{\epsilon})$ iterations
     13/38
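
A small numerical sanity check of the bound (my addition, not on the slides): on an unconstrained least-squares problem, where $x^\star$ is known in closed form, the gap $f(x^{(k)}) - f^\star$ should stay below $2L\|x^{(0)} - x^\star\|_2^2/(k+1)^2$ for every $k$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 30
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

# g(x) = 0.5*||Ax - b||^2, h = 0, so the prox is the identity
L = np.linalg.eigvalsh(A.T @ A).max()
t = 1.0 / L
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
f_star = f(x_star)

x_prev = x = np.zeros(n)
R2 = np.linalg.norm(x - x_star) ** 2        # ||x^(0) - x*||^2
for k in range(1, 201):
    y = x + (k - 2) / (k + 1) * (x - x_prev)
    x_prev, x = x, y - t * (A.T @ (A @ y - b))
    assert f(x) - f_star <= 2 * L * R2 / (k + 1) ** 2 + 1e-9

print("1/k^2 bound holds for all 200 iterations")
```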

  14. Example: quadratic program with box constraints
     minimize $\frac{1}{2} x^T A x + b^T x$
     subject to $0 \le x \le 1$
     $n = 3000$; fixed step size $t = 1/\lambda_{\max}(A)$
     14/38
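
Here $h$ is the indicator of the box $[0,1]^n$, so $\operatorname{prox}_{th}$ is projection onto the box, independent of $t$. A sketch of the two pieces needed to run the `fista` sketch from page 4 on this problem (`A` symmetric positive semidefinite and `b` assumed given):

```python
import numpy as np

def prox_box(w, t):
    """Projection onto [0, 1]^n; the indicator's prox ignores t."""
    return np.clip(w, 0.0, 1.0)

def grad_quad(A, b, x):
    """Gradient of g(x) = 0.5 x^T A x + b^T x (A symmetric)."""
    return A @ x + b
```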

  15. 1-norm regularized least-squares
     minimize $\frac{1}{2} \|Ax - b\|_2^2 + \|x\|_1$
     randomly generated $A \in \mathbb{R}^{2000 \times 1000}$; step $t_k = 1/L$ with $L = \lambda_{\max}(A^T A)$
     15/38
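
For $h(x) = \|x\|_1$ the prox is elementwise soft-thresholding; a sketch that could be paired with the `fista` sketch from page 4 (the usage lines in the comment assume `A` and `b` are given):

```python
import numpy as np

def prox_l1(w, t):
    """prox_{t ||.||_1}(w): elementwise soft-thresholding with threshold t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

# usage with the earlier fista() sketch:
# L = np.linalg.eigvalsh(A.T @ A).max()
# x = fista(lambda y: A.T @ (A @ y - b), prox_l1, np.zeros(A.shape[1]), t=1/L)
```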

  16. Outline
     1. fast proximal gradient method (FISTA)
     2. FISTA with line search
     3. FISTA as descent method
     4. Nesterov's second method
     5. Proof by estimating sequence
     16/38

  17. Key steps in the analysis of FISTA
     the starting point (page 11) is the inequality
     $$g(x^+) \le g(y) + \nabla g(y)^T (x^+ - y) + \frac{1}{2t} \|x^+ - y\|_2^2 \qquad (1)$$
     this inequality is known to hold for $0 < t \le 1/L$
     if (1) holds, then the progress made in iteration $i$ is bounded by
     $$\frac{t_i}{\theta_i^2} \left( f(x^{(i)}) - f^\star \right) + \frac{1}{2} \|v^{(i)} - x^\star\|_2^2 \le \frac{(1 - \theta_i)\, t_i}{\theta_i^2} \left( f(x^{(i-1)}) - f^\star \right) + \frac{1}{2} \|v^{(i-1)} - x^\star\|_2^2 \qquad (2)$$
     to combine these inequalities recursively, we need
     $$\frac{(1 - \theta_i)\, t_i}{\theta_i^2} \le \frac{t_{i-1}}{\theta_{i-1}^2} \qquad (i \ge 2) \qquad (3)$$
     17/38

  18. if $\theta_1 = 1$, combining the inequalities (2) from $i = 1$ to $k$ gives the bound
     $$f(x^{(k)}) - f^\star \le \frac{\theta_k^2}{2 t_k} \|x^{(0)} - x^\star\|_2^2$$
     conclusion: rate $1/k^2$ convergence if (1) and (3) hold with $\theta_k^2 / t_k = O(1/k^2)$
     FISTA with fixed step size: $t_k = 1/L$, $\theta_k = \frac{2}{k+1}$; these values satisfy (1) and (3) with
     $$\frac{\theta_k^2}{t_k} = \frac{4L}{(k+1)^2}$$
     18/38

  19. FISTA with line search (method 1)
     replace the update of $x$ in iteration $k$ (page 9) with ($t := t_{k-1}$; define $t_0 = \hat{t} > 0$):
     $x := \operatorname{prox}_{th}(y - t \nabla g(y))$
     while $g(x) > g(y) + \nabla g(y)^T (x - y) + \frac{1}{2t} \|x - y\|_2^2$
         $t := \beta t$
         $x := \operatorname{prox}_{th}(y - t \nabla g(y))$
     end
     - inequality (1) holds trivially, by the backtracking exit condition
     - inequality (3) holds with $\theta_k = 2/(k+1)$ because $t_k \le t_{k-1}$
     - Lipschitz continuity of $\nabla g$ guarantees $t_k \ge t_{\min} = \min\{\hat{t}, \beta/L\}$
     - preserves the $1/k^2$ convergence rate because $\theta_k^2 / t_k = O(1/k^2)$:
     $$\frac{\theta_k^2}{t_k} \le \frac{4}{(k+1)^2\, t_{\min}}$$
     19/38
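
A sketch of this backtracking step in Python; `g`, `grad_g`, `prox_h`, the shrink factor `beta` in $(0,1)$, and the incoming step size `t` (the previous $t_{k-1}$) are assumptions supplied by the caller.

```python
def fista_step_ls(g, grad_g, prox_h, y, t, beta=0.5):
    """One FISTA update with backtracking (method 1): start from the
    previous step size and shrink t until the exit condition (1) holds."""
    gy, dy = g(y), grad_g(y)
    x = prox_h(y - t * dy, t)
    while g(x) > gy + dy @ (x - y) + (x - y) @ (x - y) / (2 * t):
        t *= beta                      # nonincreasing step sizes
        x = prox_h(y - t * dy, t)
    return x, t                        # t is reused as t_{k-1} next iteration
```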

  20. FISTA with line search (method 2)
     replace the update of $y$ and $x$ in iteration $k$ (page 9) with ($t := \hat{t} > 0$):
     $\theta :=$ positive root of $t_{k-1} \theta^2 = t\, \theta_{k-1}^2 (1 - \theta)$
     $y := (1 - \theta)\, x^{(k-1)} + \theta v^{(k-1)}$
     $x := \operatorname{prox}_{th}(y - t \nabla g(y))$
     while $g(x) > g(y) + \nabla g(y)^T (x - y) + \frac{1}{2t} \|x - y\|_2^2$
         $t := \beta t$
         $\theta :=$ positive root of $t_{k-1} \theta^2 = t\, \theta_{k-1}^2 (1 - \theta)$
         $y := (1 - \theta)\, x^{(k-1)} + \theta v^{(k-1)}$
         $x := \operatorname{prox}_{th}(y - t \nabla g(y))$
     end
     assume $t_0 = 0$ in the first iteration ($k = 1$), i.e., take $\theta_1 = 1$, $y = x^{(0)}$
     20/38
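
The positive root has a closed form: rewrite $t_{k-1}\theta^2 + t\,\theta_{k-1}^2\,\theta - t\,\theta_{k-1}^2 = 0$ as a quadratic in $\theta$ and apply the quadratic formula. A sketch (hypothetical helper, not from the slides):

```python
import math

def theta_next(t_prev, theta_prev, t):
    """Positive root of t_prev * theta^2 = t * theta_prev^2 * (1 - theta).

    Equivalent quadratic: t_prev*theta^2 + (t*theta_prev^2)*theta - t*theta_prev^2 = 0.
    By the slide's convention t_prev = 0 in the first iteration, forcing theta = 1.
    """
    if t_prev == 0:
        return 1.0
    a, b = t_prev, t * theta_prev ** 2           # here the constant term equals -b
    return (-b + math.sqrt(b * b + 4 * a * b)) / (2 * a)
```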

  21. discussion
     - inequality (1) holds trivially, by the backtracking exit condition
     - inequality (3) holds trivially, by construction of $\theta_k$
     - Lipschitz continuity of $\nabla g$ guarantees $t_k \ge t_{\min} = \min\{\hat{t}, \beta/L\}$
     - $\theta_i$ is defined as the positive root of $\theta_i^2 / t_i = (1 - \theta_i)\, \theta_{i-1}^2 / t_{i-1}$; hence
     $$\frac{\sqrt{t_{i-1}}}{\theta_{i-1}} = \frac{\sqrt{(1 - \theta_i)\, t_i}}{\theta_i} \le \frac{\sqrt{t_i}}{\theta_i} - \frac{\sqrt{t_i}}{2}$$
     - combine the inequalities from $i = 2$ to $k$ to get
     $$\sqrt{t_1} + \sum_{i=2}^{k} \frac{\sqrt{t_i}}{2} \le \frac{\sqrt{t_k}}{\theta_k}$$
     - rearranging shows that $\theta_k^2 / t_k = O(1/k^2)$:
     $$\frac{\theta_k^2}{t_k} \le \frac{1}{\left( \sqrt{t_1} + \frac{1}{2} \sum_{i=2}^{k} \sqrt{t_i} \right)^2} \le \frac{4}{(k+1)^2\, t_{\min}}$$
     21/38
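
The inequality in the fourth bullet rests on an elementary bound left implicit on the slide:

```latex
\sqrt{1-\theta} \;\le\; 1 - \frac{\theta}{2} \quad \text{for } 0 \le \theta \le 1,
% since squaring the right-hand side gives 1 - \theta + \theta^2/4 \ge 1 - \theta;
% multiplying by \sqrt{t_i}/\theta_i then yields
% \sqrt{(1-\theta_i) t_i}/\theta_i \le \sqrt{t_i}/\theta_i - \sqrt{t_i}/2.
```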

  22. Comparison of line search methods
     method 1:
     - uses nonincreasing step sizes (enforces $t_k \le t_{k-1}$)
     - one evaluation of $g(x)$, one $\operatorname{prox}_{th}$ evaluation per line search iteration
     method 2:
     - allows non-monotonic step sizes
     - one evaluation of $g(x)$, one evaluation of $g(y)$ and $\nabla g(y)$, one evaluation of $\operatorname{prox}_{th}$ per line search iteration
     the two strategies can be combined and extended in various ways
     22/38

  23. Outline
     1. fast proximal gradient method (FISTA)
     2. FISTA with line search
     3. FISTA as descent method
     4. Nesterov's second method
     5. Proof by estimating sequence
     23/38

  24. Descent version of FISTA
     choose $x^{(0)} = v^{(0)}$; for $k \ge 1$, repeat the steps
     $$y = (1 - \theta_k)\, x^{(k-1)} + \theta_k v^{(k-1)}$$
     $$u = \operatorname{prox}_{t_k h}\left(y - t_k \nabla g(y)\right)$$
     $$x^{(k)} = \begin{cases} u & f(u) \le f(x^{(k-1)}) \\ x^{(k-1)} & \text{otherwise} \end{cases}$$
     $$v^{(k)} = x^{(k-1)} + \frac{1}{\theta_k}\left(u - x^{(k-1)}\right)$$
     - step 3 implies $f(x^{(k)}) \le f(x^{(k-1)})$
     - use $\theta_k = 2/(k+1)$ and $t_k = 1/L$, or one of the line search methods
     - same iteration complexity as the original FISTA
     - changes on page 11: replace $x^+$ with $u$ and use $f(x^+) \le f(u)$
     24/38
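
A sketch of the descent variant in Python; as before, `f`, `grad_g`, and `prox_h` are user-supplied placeholders. Note that $v^{(k)}$ is always built from the prox candidate $u$, even when $x^{(k)}$ keeps the old iterate.

```python
def fista_descent(f, grad_g, prox_h, x0, t, n_iter=100):
    """Descent version of FISTA: keep the old iterate when the prox
    candidate u does not decrease f, but always update v using u."""
    x, v = x0.copy(), x0.copy()
    for k in range(1, n_iter + 1):
        theta = 2.0 / (k + 1)
        y = (1 - theta) * x + theta * v
        u = prox_h(y - t * grad_g(y), t)
        v = x + (u - x) / theta           # v update always uses u
        x = u if f(u) <= f(x) else x      # monotone x update
    return x
```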

  25. Example (from page 7) 25/38
