  1. Lecture: Fast Proximal Gradient Methods http://bicmr.pku.edu.cn/~wenzw/opt-2018-fall.html Acknowledgement: these slides are based on Prof. Lieven Vandenberghe's lecture notes 1/38

  2. Outline
     1. fast proximal gradient method (FISTA)
     2. FISTA with line search
     3. FISTA as descent method
     4. Nesterov's second method
     5. Proof by estimating sequence
     2/38

  3. Fast (proximal) gradient methods
     - Nesterov (1983, 1988, 2005): three projection methods with $1/k^2$ convergence rate
     - Beck & Teboulle (2008): FISTA, a proximal gradient version of Nesterov's 1983 method
     - Nesterov (2004 book), Tseng (2008): overview and unified analysis of fast gradient methods
     - several recent variations and extensions
     this lecture: FISTA and Nesterov's second method (1988) as presented by Tseng
     3/38

  4. FISTA (basic version)
     minimize $f(x) = g(x) + h(x)$
     - $g$ convex, differentiable, with $\operatorname{dom} g = \mathbb{R}^n$
     - $h$ closed, convex, with inexpensive prox operator $\operatorname{prox}_{th}$
     algorithm: choose any $x^{(0)} = x^{(-1)}$; for $k \ge 1$, repeat the steps
     $$y = x^{(k-1)} + \frac{k-2}{k+1}\left(x^{(k-1)} - x^{(k-2)}\right)$$
     $$x^{(k)} = \operatorname{prox}_{t_k h}\left(y - t_k \nabla g(y)\right)$$
     - step size $t_k$ fixed or determined by line search
     - the acronym stands for 'Fast Iterative Shrinkage-Thresholding Algorithm'
     4/38
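
A minimal Python sketch of the iteration above; the callables `grad_g` and `prox_h` and the fixed step size `t` are placeholders to be supplied by the user, not part of the slides.

```python
def fista(grad_g, prox_h, x0, t, n_iter=100):
    """Basic FISTA with fixed step size t (a sketch of this slide's iteration).

    grad_g : callable returning the gradient of the smooth part g
    prox_h : callable (w, t) -> prox_{t h}(w)
    """
    x_prev = x0.copy()   # plays x^{(k-2)}; x^{(0)} = x^{(-1)}
    x = x0.copy()        # plays x^{(k-1)}
    for k in range(1, n_iter + 1):
        y = x + (k - 2) / (k + 1) * (x - x_prev)   # extrapolation step
        x_prev = x
        x = prox_h(y - t * grad_g(y), t)           # proximal gradient step at y
    return x
```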

  5. Interpretation
     - first iteration ($k = 1$) is a proximal gradient step at $y = x^{(0)}$
     - next iterations are proximal gradient steps at extrapolated points $y$
     - note: $x^{(k)}$ is feasible (in $\operatorname{dom} h$); $y$ may be outside $\operatorname{dom} h$
     5/38

  6. Example
     minimize $\log \sum_{i=1}^{m} \exp(a_i^T x + b_i)$
     randomly generated data with $m = 2000$, $n = 1000$, same fixed step size
     6/38
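
For this example $h = 0$, so the prox step reduces to a plain gradient step. A sketch of the smooth objective and its gradient, using the standard max-subtraction trick for numerical stability; the matrix `A` (with rows $a_i^T$) and vector `b` are assumed given.

```python
import numpy as np

def logsumexp_obj_grad(A, b, x):
    """Objective log(sum_i exp(a_i^T x + b_i)) and its gradient.

    Shifting by the max before exponentiating avoids overflow.
    """
    z = A @ x + b
    zmax = z.max()
    w = np.exp(z - zmax)           # unnormalized weights
    obj = zmax + np.log(w.sum())
    grad = A.T @ (w / w.sum())     # gradient is A^T times softmax(z)
    return obj, grad
```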

  7. another instance: FISTA is not a descent method
     7/38

  8. Convergence of FISTA
     assumptions:
     - $g$ convex with $\operatorname{dom} g = \mathbb{R}^n$; $\nabla g$ Lipschitz continuous with constant $L$:
       $$\|\nabla g(x) - \nabla g(y)\|_2 \le L \|x - y\|_2 \quad \forall x, y$$
     - $h$ is closed and convex (so that $\operatorname{prox}_{th}(u)$ is well defined)
     - optimal value $f^\star$ is finite and attained at $x^\star$ (not necessarily unique)
     convergence result: $f(x^{(k)}) - f^\star$ decreases at least as fast as $1/k^2$
     - with fixed step size $t_k = 1/L$
     - with suitable line search
     8/38
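
As a concrete check of the Lipschitz assumption (my addition, not on the slide): for the least-squares term used later on page 15,

```latex
% For g(x) = (1/2)\|Ax - b\|_2^2 we have \nabla g(x) = A^T (Ax - b), so
\|\nabla g(x) - \nabla g(y)\|_2 = \|A^T A (x - y)\|_2
  \le \lambda_{\max}(A^T A)\, \|x - y\|_2 ,
% hence L = \lambda_{\max}(A^T A), the step-size constant used on page 15.
```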

  9. Reformulation of FISTA
     define $\theta_k = 2/(k+1)$ and introduce an intermediate variable $v^{(k)}$
     algorithm: choose $x^{(0)} = v^{(0)}$; for $k \ge 1$, repeat the steps
     $$y = (1 - \theta_k)\, x^{(k-1)} + \theta_k v^{(k-1)}$$
     $$x^{(k)} = \operatorname{prox}_{t_k h}\left(y - t_k \nabla g(y)\right)$$
     $$v^{(k)} = x^{(k-1)} + \frac{1}{\theta_k}\left(x^{(k)} - x^{(k-1)}\right)$$
     substituting the expression for $v^{(k)}$ in the formula for $y$ gives the FISTA of page 4
     9/38
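
A sketch of the reformulated iteration in Python; as before, `grad_g` and `prox_h` are user-supplied placeholders. With $\theta_k = 2/(k+1)$ this generates the same iterates $x^{(k)}$ as the basic version on page 4.

```python
def fista_theta(grad_g, prox_h, x0, t, n_iter=100):
    """FISTA in the (theta_k, v) form of this slide; equivalent to page 4."""
    x, v = x0.copy(), x0.copy()        # x^{(0)} = v^{(0)}
    for k in range(1, n_iter + 1):
        theta = 2.0 / (k + 1)
        y = (1 - theta) * x + theta * v
        x_new = prox_h(y - t * grad_g(y), t)
        v = x + (x_new - x) / theta    # intermediate variable v^{(k)}
        x = x_new
    return x
```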

  10. Important inequalities
     choice of $\theta_k$: the sequence $\theta_k = 2/(k+1)$ satisfies $\theta_1 = 1$ and
     $$\frac{1 - \theta_k}{\theta_k^2} \le \frac{1}{\theta_{k-1}^2}, \qquad k \ge 2$$
     upper bound on $g$ from the Lipschitz property:
     $$g(u) \le g(z) + \nabla g(z)^T (u - z) + \frac{L}{2} \|u - z\|_2^2 \quad \forall u, z$$
     upper bound on $h$ from the definition of the prox operator:
     $$h(u) \le h(z) + \frac{1}{t}(w - u)^T (u - z) \quad \forall w, z, \quad u = \operatorname{prox}_{th}(w)$$
     Note: $u = \operatorname{prox}_{th}(w)$ minimizes $t h(u) + \frac{1}{2}\|u - w\|_2^2$, which gives $0 \in t\,\partial h(u) + (u - w)$. Hence $\frac{1}{t}(w - u) \in \partial h(u)$.
     10/38
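
A one-line check of the first inequality (the slide states it without proof): with $\theta_k = 2/(k+1)$,

```latex
\frac{1-\theta_k}{\theta_k^2}
  = \frac{(k-1)/(k+1)}{4/(k+1)^2}
  = \frac{(k-1)(k+1)}{4}
  = \frac{k^2 - 1}{4}
  \le \frac{k^2}{4}
  = \frac{1}{\theta_{k-1}^2}.
```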

  11. Progress in one iteration
     define $x = x^{(i-1)}$, $x^+ = x^{(i)}$, $v = v^{(i-1)}$, $v^+ = v^{(i)}$, $t = t_i$, $\theta = \theta_i$
     upper bound from the Lipschitz property: if $0 < t \le 1/L$,
     $$g(x^+) \le g(y) + \nabla g(y)^T (x^+ - y) + \frac{1}{2t} \|x^+ - y\|_2^2 \qquad (1)$$
     upper bound from the definition of the prox operator:
     $$h(x^+) \le h(z) + \nabla g(y)^T (z - x^+) + \frac{1}{t}(x^+ - y)^T (z - x^+) \quad \forall z$$
     add the upper bounds and use convexity of $g$:
     $$f(x^+) \le f(z) + \frac{1}{t}(x^+ - y)^T (z - x^+) + \frac{1}{2t} \|x^+ - y\|_2^2 \quad \forall z$$
     11/38

  12. make a convex combination of the upper bounds for $z = x$ and $z = x^\star$:
     $$f(x^+) - f^\star - (1 - \theta)(f(x) - f^\star) = f(x^+) - \theta f^\star - (1 - \theta) f(x)$$
     $$\le \frac{1}{t}(x^+ - y)^T \left(\theta x^\star + (1 - \theta) x - x^+\right) + \frac{1}{2t} \|x^+ - y\|_2^2$$
     $$= \frac{1}{2t} \left( \|y - (1 - \theta) x - \theta x^\star\|_2^2 - \|x^+ - (1 - \theta) x - \theta x^\star\|_2^2 \right)$$
     $$= \frac{\theta^2}{2t} \left( \|v - x^\star\|_2^2 - \|v^+ - x^\star\|_2^2 \right)$$
     conclusion: if the inequality (1) holds at iteration $i$, then
     $$\frac{t_i}{\theta_i^2} \left( f(x^{(i)}) - f^\star \right) + \frac{1}{2} \|v^{(i)} - x^\star\|_2^2 \le \frac{(1 - \theta_i)\, t_i}{\theta_i^2} \left( f(x^{(i-1)}) - f^\star \right) + \frac{1}{2} \|v^{(i-1)} - x^\star\|_2^2 \qquad (2)$$
     12/38
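
The second equality above uses the following completing-the-square identity (a step the slide leaves implicit), applied with $z = (1-\theta)x + \theta x^\star$:

```latex
(x^+ - y)^T (z - x^+) + \tfrac{1}{2}\|x^+ - y\|_2^2
  = \tfrac{1}{2}\|y - z\|_2^2 - \tfrac{1}{2}\|x^+ - z\|_2^2,
% which follows from expanding \|y - z\|^2 = \|(x^+ - z) - (x^+ - y)\|^2;
% the last equality then uses y - (1-\theta)x - \theta x^\star = \theta(v - x^\star)
% and x^+ - (1-\theta)x - \theta x^\star = \theta(v^+ - x^\star).
```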

  13. Analysis for fixed step size
     take $t_i = t = 1/L$ and apply (2) recursively, using $(1 - \theta_i)/\theta_i^2 \le 1/\theta_{i-1}^2$:
     $$\frac{t}{\theta_k^2} \left( f(x^{(k)}) - f^\star \right) + \frac{1}{2} \|v^{(k)} - x^\star\|_2^2 \le \frac{(1 - \theta_1)\, t}{\theta_1^2} \left( f(x^{(0)}) - f^\star \right) + \frac{1}{2} \|v^{(0)} - x^\star\|_2^2 = \frac{1}{2} \|x^{(0)} - x^\star\|_2^2$$
     therefore
     $$f(x^{(k)}) - f^\star \le \frac{\theta_k^2}{2t} \|x^{(0)} - x^\star\|_2^2 = \frac{2L}{(k+1)^2} \|x^{(0)} - x^\star\|_2^2$$
     conclusion: reaches $f(x^{(k)}) - f^\star \le \epsilon$ after $O(1/\sqrt{\epsilon})$ iterations
     13/38
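
A small numerical sanity check of the bound (my addition, not on the slides): on an unconstrained least-squares problem, where $x^\star$ is known in closed form, the gap $f(x^{(k)}) - f^\star$ should stay below $2L\|x^{(0)} - x^\star\|_2^2/(k+1)^2$ for every $k$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 30
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

# g(x) = 0.5*||Ax - b||^2, h = 0, so the prox is the identity
L = np.linalg.eigvalsh(A.T @ A).max()
t = 1.0 / L
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
f_star = f(x_star)

x_prev = x = np.zeros(n)
R2 = np.linalg.norm(x - x_star) ** 2        # ||x^(0) - x*||^2
for k in range(1, 201):
    y = x + (k - 2) / (k + 1) * (x - x_prev)
    x_prev, x = x, y - t * (A.T @ (A @ y - b))
    assert f(x) - f_star <= 2 * L * R2 / (k + 1) ** 2 + 1e-9

print("1/k^2 bound holds for all 200 iterations")
```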

  14. Example: quadratic program with box constraints
     minimize $\frac{1}{2} x^T A x + b^T x$
     subject to $0 \le x \le 1$
     $n = 3000$; fixed step size $t = 1/\lambda_{\max}(A)$
     14/38
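
Here $h$ is the indicator of the box $[0,1]^n$, so $\operatorname{prox}_{th}$ is projection onto the box, independent of $t$. A sketch of the two pieces needed to run the `fista` sketch from page 4 on this problem (`A` symmetric positive semidefinite and `b` assumed given):

```python
import numpy as np

def prox_box(w, t):
    """Projection onto [0, 1]^n; the indicator's prox ignores t."""
    return np.clip(w, 0.0, 1.0)

def grad_quad(A, b, x):
    """Gradient of g(x) = 0.5 x^T A x + b^T x (A symmetric)."""
    return A @ x + b
```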

  15. 1-norm regularized least-squares
     minimize $\frac{1}{2} \|Ax - b\|_2^2 + \|x\|_1$
     randomly generated $A \in \mathbb{R}^{2000 \times 1000}$; step $t_k = 1/L$ with $L = \lambda_{\max}(A^T A)$
     15/38
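
For $h(x) = \|x\|_1$ the prox is elementwise soft-thresholding; a sketch that could be paired with the `fista` sketch from page 4 (the usage lines in the comment assume `A` and `b` are given):

```python
import numpy as np

def prox_l1(w, t):
    """prox_{t ||.||_1}(w): elementwise soft-thresholding with threshold t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

# usage with the earlier fista() sketch:
# L = np.linalg.eigvalsh(A.T @ A).max()
# x = fista(lambda y: A.T @ (A @ y - b), prox_l1, np.zeros(A.shape[1]), t=1/L)
```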

  16. Outline
     1. fast proximal gradient method (FISTA)
     2. FISTA with line search
     3. FISTA as descent method
     4. Nesterov's second method
     5. Proof by estimating sequence
     16/38

  17. Key steps in the analysis of FISTA
     the starting point (page 11) is the inequality
     $$g(x^+) \le g(y) + \nabla g(y)^T (x^+ - y) + \frac{1}{2t} \|x^+ - y\|_2^2 \qquad (1)$$
     this inequality is known to hold for $0 < t \le 1/L$
     if (1) holds, then the progress made in iteration $i$ is bounded by
     $$\frac{t_i}{\theta_i^2} \left( f(x^{(i)}) - f^\star \right) + \frac{1}{2} \|v^{(i)} - x^\star\|_2^2 \le \frac{(1 - \theta_i)\, t_i}{\theta_i^2} \left( f(x^{(i-1)}) - f^\star \right) + \frac{1}{2} \|v^{(i-1)} - x^\star\|_2^2 \qquad (2)$$
     to combine these inequalities recursively, we need
     $$\frac{(1 - \theta_i)\, t_i}{\theta_i^2} \le \frac{t_{i-1}}{\theta_{i-1}^2} \qquad (i \ge 2) \qquad (3)$$
     17/38

  18. if $\theta_1 = 1$, combining the inequalities (2) from $i = 1$ to $k$ gives the bound
     $$f(x^{(k)}) - f^\star \le \frac{\theta_k^2}{2 t_k} \|x^{(0)} - x^\star\|_2^2$$
     conclusion: rate $1/k^2$ convergence if (1) and (3) hold with $\theta_k^2 / t_k = O(1/k^2)$
     FISTA with fixed step size: $t_k = 1/L$, $\theta_k = \frac{2}{k+1}$; these values satisfy (1) and (3) with
     $$\frac{\theta_k^2}{t_k} = \frac{4L}{(k+1)^2}$$
     18/38

  19. FISTA with line search (method 1)
     replace the update of $x$ in iteration $k$ (page 9) with ($t := t_{k-1}$; define $t_0 = \hat{t} > 0$):
     $x := \operatorname{prox}_{th}(y - t \nabla g(y))$
     while $g(x) > g(y) + \nabla g(y)^T (x - y) + \frac{1}{2t} \|x - y\|_2^2$
         $t := \beta t$
         $x := \operatorname{prox}_{th}(y - t \nabla g(y))$
     end
     - inequality (1) holds trivially, by the backtracking exit condition
     - inequality (3) holds with $\theta_k = 2/(k+1)$ because $t_k \le t_{k-1}$
     - Lipschitz continuity of $\nabla g$ guarantees $t_k \ge t_{\min} = \min\{\hat{t}, \beta/L\}$
     - preserves the $1/k^2$ convergence rate because $\theta_k^2 / t_k = O(1/k^2)$:
     $$\frac{\theta_k^2}{t_k} \le \frac{4}{(k+1)^2\, t_{\min}}$$
     19/38
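
A sketch of this backtracking step in Python; `g`, `grad_g`, `prox_h`, the shrink factor `beta` in $(0,1)$, and the incoming step size `t` (the previous $t_{k-1}$) are assumptions supplied by the caller.

```python
def fista_step_ls(g, grad_g, prox_h, y, t, beta=0.5):
    """One FISTA update with backtracking (method 1): start from the
    previous step size and shrink t until the exit condition (1) holds."""
    gy, dy = g(y), grad_g(y)
    x = prox_h(y - t * dy, t)
    while g(x) > gy + dy @ (x - y) + (x - y) @ (x - y) / (2 * t):
        t *= beta                      # nonincreasing step sizes
        x = prox_h(y - t * dy, t)
    return x, t                        # t is reused as t_{k-1} next iteration
```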

  20. FISTA with line search (method 2)
     replace the update of $y$ and $x$ in iteration $k$ (page 9) with ($t := \hat{t} > 0$):
     $\theta :=$ positive root of $t_{k-1} \theta^2 = t\, \theta_{k-1}^2 (1 - \theta)$
     $y := (1 - \theta)\, x^{(k-1)} + \theta v^{(k-1)}$
     $x := \operatorname{prox}_{th}(y - t \nabla g(y))$
     while $g(x) > g(y) + \nabla g(y)^T (x - y) + \frac{1}{2t} \|x - y\|_2^2$
         $t := \beta t$
         $\theta :=$ positive root of $t_{k-1} \theta^2 = t\, \theta_{k-1}^2 (1 - \theta)$
         $y := (1 - \theta)\, x^{(k-1)} + \theta v^{(k-1)}$
         $x := \operatorname{prox}_{th}(y - t \nabla g(y))$
     end
     assume $t_0 = 0$ in the first iteration ($k = 1$), i.e., take $\theta_1 = 1$, $y = x^{(0)}$
     20/38
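
The positive root has a closed form: rewrite $t_{k-1}\theta^2 + t\,\theta_{k-1}^2\,\theta - t\,\theta_{k-1}^2 = 0$ as a quadratic in $\theta$ and apply the quadratic formula. A sketch (hypothetical helper, not from the slides):

```python
import math

def theta_next(t_prev, theta_prev, t):
    """Positive root of t_prev * theta^2 = t * theta_prev^2 * (1 - theta).

    Equivalent quadratic: t_prev*theta^2 + (t*theta_prev^2)*theta - t*theta_prev^2 = 0.
    By the slide's convention t_prev = 0 in the first iteration, forcing theta = 1.
    """
    if t_prev == 0:
        return 1.0
    a, b = t_prev, t * theta_prev ** 2           # here the constant term equals -b
    return (-b + math.sqrt(b * b + 4 * a * b)) / (2 * a)
```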

  21. discussion
     - inequality (1) holds trivially, by the backtracking exit condition
     - inequality (3) holds trivially, by construction of $\theta_k$
     - Lipschitz continuity of $\nabla g$ guarantees $t_k \ge t_{\min} = \min\{\hat{t}, \beta/L\}$
     - $\theta_i$ is defined as the positive root of $\theta_i^2 / t_i = (1 - \theta_i)\, \theta_{i-1}^2 / t_{i-1}$; hence
     $$\frac{\sqrt{t_{i-1}}}{\theta_{i-1}} = \frac{\sqrt{(1 - \theta_i)\, t_i}}{\theta_i} \le \frac{\sqrt{t_i}}{\theta_i} - \frac{\sqrt{t_i}}{2}$$
     - combine the inequalities from $i = 2$ to $k$ to get
     $$\sqrt{t_1} + \sum_{i=2}^{k} \frac{\sqrt{t_i}}{2} \le \frac{\sqrt{t_k}}{\theta_k}$$
     - rearranging shows that $\theta_k^2 / t_k = O(1/k^2)$:
     $$\frac{\theta_k^2}{t_k} \le \frac{1}{\left( \sqrt{t_1} + \frac{1}{2} \sum_{i=2}^{k} \sqrt{t_i} \right)^2} \le \frac{4}{(k+1)^2\, t_{\min}}$$
     21/38
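
The inequality in the fourth bullet rests on an elementary bound left implicit on the slide:

```latex
\sqrt{1-\theta} \;\le\; 1 - \frac{\theta}{2} \quad \text{for } 0 \le \theta \le 1,
% since squaring the right-hand side gives 1 - \theta + \theta^2/4 \ge 1 - \theta;
% multiplying by \sqrt{t_i}/\theta_i then yields
% \sqrt{(1-\theta_i) t_i}/\theta_i \le \sqrt{t_i}/\theta_i - \sqrt{t_i}/2.
```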

  22. Comparison of line search methods
     method 1:
     - uses nonincreasing step sizes (enforces $t_k \le t_{k-1}$)
     - one evaluation of $g(x)$, one $\operatorname{prox}_{th}$ evaluation per line search iteration
     method 2:
     - allows non-monotonic step sizes
     - one evaluation of $g(x)$, one evaluation of $g(y)$ and $\nabla g(y)$, one evaluation of $\operatorname{prox}_{th}$ per line search iteration
     the two strategies can be combined and extended in various ways
     22/38

  23. Outline
     1. fast proximal gradient method (FISTA)
     2. FISTA with line search
     3. FISTA as descent method
     4. Nesterov's second method
     5. Proof by estimating sequence
     23/38

  24. Descent version of FISTA
     choose $x^{(0)} = v^{(0)}$; for $k \ge 1$, repeat the steps
     $$y = (1 - \theta_k)\, x^{(k-1)} + \theta_k v^{(k-1)}$$
     $$u = \operatorname{prox}_{t_k h}\left(y - t_k \nabla g(y)\right)$$
     $$x^{(k)} = \begin{cases} u & f(u) \le f(x^{(k-1)}) \\ x^{(k-1)} & \text{otherwise} \end{cases}$$
     $$v^{(k)} = x^{(k-1)} + \frac{1}{\theta_k}\left(u - x^{(k-1)}\right)$$
     - step 3 implies $f(x^{(k)}) \le f(x^{(k-1)})$
     - use $\theta_k = 2/(k+1)$ and $t_k = 1/L$, or one of the line search methods
     - same iteration complexity as the original FISTA
     - changes on page 11: replace $x^+$ with $u$ and use $f(x^+) \le f(u)$
     24/38
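
A sketch of the descent variant in Python; as before, `f`, `grad_g`, and `prox_h` are user-supplied placeholders. Note that $v^{(k)}$ is always built from the prox candidate $u$, even when $x^{(k)}$ keeps the old iterate.

```python
def fista_descent(f, grad_g, prox_h, x0, t, n_iter=100):
    """Descent version of FISTA: keep the old iterate when the prox
    candidate u does not decrease f, but always update v using u."""
    x, v = x0.copy(), x0.copy()
    for k in range(1, n_iter + 1):
        theta = 2.0 / (k + 1)
        y = (1 - theta) * x + theta * v
        u = prox_h(y - t * grad_g(y), t)
        v = x + (u - x) / theta           # v update always uses u
        x = u if f(u) <= f(x) else x      # monotone x update
    return x
```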

  25. Example (from page 7) 25/38
