Spectral properties of steplength selections in gradient methods: from unconstrained to constrained optimization

L. Zanni
Department of Physics, Informatics and Mathematics, University of Modena and Reggio Emilia, Italy

Variational Methods and Optimization in Imaging
IHP - Paris, 4-8 February 2019

Joint work with:
S. Crisci, V. Ruggiero, University of Ferrara, Italy
F. Porta, University of Modena and Reggio Emilia, Italy
Outline

1. Gradient methods for unconstrained problems
   - Spectral properties of steplength selections
   - Design selection rules by exploiting spectral properties
   - From the quadratic case to general unconstrained problems
2. Gradient projection methods for box-constrained problems
   - Spectral properties of steplengths in the quadratic case
   - New steplength rules taking into account the constraints
3. Scaled gradient projection methods
   - Define the diagonal scaling
   - The steplengths in variable metric approaches
   - Practical behaviour in imaging
4. Conclusions
Motivation for the steplength analysis

Constrained optimization problems:

    \min_{x \in \Omega} f(x)    (1)

- f : R^N → R continuously differentiable function
- Ω ⊂ R^N, nonempty closed convex set defined by simple constraints

Gradient Projection (GP) methods for (1):

    x^{(k+1)} = x^{(k)} + \vartheta_k d^{(k)},
    d^{(k)} = P_\Omega\left( x^{(k)} - \alpha_k \nabla f(x^{(k)}) \right) - x^{(k)},

with \vartheta_k \in (0,1], \alpha_k > 0, and P_\Omega(x) = \mathrm{argmin}_{z \in \Omega} \|z - x\|.

Usually the updating rules for the steplength α_k are those exploited in the unconstrained case: is this a suitable choice?
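For concreteness, here is a minimal sketch of one GP iteration when Ω is a box {x : lo ≤ x ≤ hi}, so that P_Ω is a componentwise clipping; the function names and the default ϑ_k = 1 are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def gp_step(x, grad_f, alpha, lo, hi, theta=1.0):
    """One gradient projection step for the box lo <= x <= hi.

    P_Omega is a componentwise clip; theta in (0, 1] is the
    (possibly line-searched) steplength along the direction d.
    """
    g = grad_f(x)
    d = np.clip(x - alpha * g, lo, hi) - x   # d^(k) = P_Omega(x - alpha g) - x
    return x + theta * d
```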
Spectral analysis of steplength selections

➤ The unconstrained case
➤ The box-constrained case
➤ The Scaled Gradient Projection methods
Steplength selection: the unconstrained case

The recipe exploited by state-of-the-art selection rules:

- define steplengths by trying to capture, in an inexpensive way, some second order information
- design selection rules in the strictly convex quadratic case

      f(x) = \frac{1}{2} x^T A x - b^T x,  A symmetric positive definite

  where second order information ↔ spectral properties of A
- design selection rules that generalize, in an inexpensive way, to non-quadratic cases: ∇²f(x^{(k)}) depends on the iterations, but ∇²f(x^{(k)}) → ∇²f(x*)
A popular example: the Barzilai-Borwein (BB) selection rules

Consider the gradient method for the problem min f(x):

    x^{(k+1)} = x^{(k)} - \alpha_k \nabla f(x^{(k)}),  k = 0, 1, ...

Suggestion [Barzilai-Borwein, IMA J. Num. Anal. 1988]: force the matrix (\alpha_k I)^{-1} to approximate the Hessian ∇²f(x^{(k)}) by imposing quasi-Newton properties:

    \alpha_k^{BB1} = \mathrm{argmin}_{\alpha \in \mathbb{R}} \left\| (\alpha I)^{-1} s^{(k-1)} - z^{(k-1)} \right\| = \frac{s^{(k-1)T} s^{(k-1)}}{s^{(k-1)T} z^{(k-1)}}

or

    \alpha_k^{BB2} = \mathrm{argmin}_{\alpha \in \mathbb{R}} \left\| s^{(k-1)} - (\alpha I) z^{(k-1)} \right\| = \frac{s^{(k-1)T} z^{(k-1)}}{z^{(k-1)T} z^{(k-1)}}

where s^{(k-1)} = x^{(k)} - x^{(k-1)} and z^{(k-1)} = \nabla f(x^{(k)}) - \nabla f(x^{(k-1)}).
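Both rules reduce to inner products of the last step and gradient difference; a minimal sketch (the safeguard needed when s^T z ≤ 0, e.g. for nonconvex f, is omitted here):

```python
import numpy as np

def bb_steplengths(s, z):
    """BB1 and BB2 steplengths from s = x_k - x_{k-1}, z = g_k - g_{k-1}.

    BB1 = (s's)/(s'z) and BB2 = (s'z)/(z'z); for a strictly convex
    quadratic both lie in [1/lambda_max, 1/lambda_min].
    """
    sz = s @ z
    return (s @ s) / sz, sz / (z @ z)
```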
Spectral properties of the BB steplength rules

Consider a gradient method for the quadratic unconstrained case:

    \min f(x) \equiv \frac{1}{2} x^T A x - b^T x,  A = \mathrm{diag}(\lambda_1, ..., \lambda_N),  0 < \lambda_1 < \cdots < \lambda_N

    x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)},  g^{(k)} = \nabla f(x^{(k)}),  k = 0, 1, ...

➩  g_i^{(k+1)} = (1 - \alpha_k \lambda_i)\, g_i^{(k)},  i = 1, ..., N

- \alpha_k = 1/\lambda_i  ⇒  g_i^{(k+1)} = 0  ⇒  g_i^{(k+j)} = 0, j = 2, 3, ...
- \alpha_{k+i-1} = 1/\lambda_i, i = 1, ..., N  ⇒  g^{(k+N)} = 0  (finite termination)

⇒ α_k must aim at approximating the inverses of the eigenvalues of A
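The finite-termination property is easy to check numerically; a toy demo with an illustrative 4x4 diagonal matrix (not from the slides):

```python
import numpy as np

# Taking alpha_k = 1/lambda_{k+1} in turn kills one gradient component
# per iteration on a diagonal quadratic, so g^(N) = 0 after N steps.
lam = np.array([1.0, 10.0, 100.0, 1000.0])
A, b = np.diag(lam), np.ones(4)
x = np.zeros(4)
for lam_i in lam:
    g = A @ x - b
    x = x - (1.0 / lam_i) * g
print(np.linalg.norm(A @ x - b))   # 0.0 up to round-off
```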
BB rules in the quadratic case

    \frac{1}{\lambda_N} \le \alpha_k^{BB2} = \frac{g^{(k-1)T} A\, g^{(k-1)}}{g^{(k-1)T} A^2 g^{(k-1)}} \le \alpha_k^{BB1} = \frac{g^{(k-1)T} g^{(k-1)}}{g^{(k-1)T} A\, g^{(k-1)}} \le \frac{1}{\lambda_1}

Example:
- A = diag(λ_1, ..., λ_10), λ_i = 111 i − 110 (so λ_1 = 1, λ_10 = 1000)
- f(x) = \frac{1}{2} x^T A x − b^T x, b random vector with b_i ∈ [−10, 10]
- stopping rule: ‖g^{(k)}‖ ≤ 10^{−8} ‖g^{(0)}‖
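A sketch reproducing this toy experiment with the BB1 rule; the seed, the starting point and the initial steplength are assumptions, since the slides do not specify them:

```python
import numpy as np

rng = np.random.default_rng(0)            # seed: assumption
lam = (111 * np.arange(1, 11) - 110).astype(float)
A = np.diag(lam)
b = rng.uniform(-10, 10, size=10)

x = np.zeros(10)                          # starting point: assumption
g = A @ x - b
g0 = np.linalg.norm(g)
alpha, k = 1.0 / lam[-1], 0               # initial steplength: assumption
while np.linalg.norm(g) > 1e-8 * g0:
    x_new = x - alpha * g
    g_new = A @ x_new - b
    s, z = x_new - x, g_new - g
    alpha = (s @ s) / (s @ z)             # BB1 (s'z > 0 since A is SPD)
    x, g, k = x_new, g_new, k + 1
print(k, "iterations")
```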
Quadratic case: exploiting spectral properties

In the quadratic case (A = diag(λ_1, ..., λ_N), 0 < λ_1 < ⋯ < λ_N), we have:

• g_j^{(k+1)} = (1 - \alpha_k \lambda_j)\, g_j^{(k)},  j = 1, ..., N

• \alpha_k \approx 1/\lambda_i implies
  - |g_i^{(k+1)}| ≪ |g_i^{(k)}|   (very useful)
  - |g_j^{(k+1)}| < |g_j^{(k)}| if j < i   (useful)
  - |g_j^{(k+1)}| > |g_j^{(k)}| if j > i and λ_j > 2 λ_i   (dangerous)

• \alpha_k^{BB2} / \alpha_k^{BB1} = \cos^2 (g^{(k-1)}, A\, g^{(k-1)})  (verified numerically below)

Idea for improving the BB rules:
- force a sequence of small α_k^{BB2} to reduce |g_i| for large i, leading to gradients in which these components are not dominant;
- after such a sequence of small α_k, if α_k^{BB2}/α_k^{BB1} ≈ 1, exploit \alpha^{BB1} = \frac{g^T g}{g^T A g}, aiming at obtaining α^{BB1} ≈ 1/λ_i for small i.
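The ratio identity above is a one-liner to verify; a small numerical check on a random diagonal SPD matrix (sizes and value ranges are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = rng.uniform(1, 1000, size=50)
A = np.diag(lam)
g = rng.standard_normal(50)

bb1 = (g @ g) / (g @ A @ g)
bb2 = (g @ A @ g) / (g @ A @ A @ g)
cos2 = (g @ A @ g) ** 2 / ((g @ g) * (g @ A @ A @ g))   # cos^2(g, A g)
print(np.isclose(bb2 / bb1, cos2))                      # True
```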
Practical implementations of this idea: ABB and ABBmin rules

Alternate Barzilai-Borwein selection rule [Zhou-Gao-Dai, COAP (2006)]:

    \alpha_k^{ABB} = \begin{cases} \alpha_k^{BB2} & \text{if } \alpha_k^{BB2}/\alpha_k^{BB1} < \tau, \ \tau \in (0,1) \\ \alpha_k^{BB1} & \text{otherwise} \end{cases}

ABBmin rule [Frassoldati-Zanghirati-Zanni, JIMO (2008)]:

    \alpha_k^{ABBmin} = \begin{cases} \min\left\{ \alpha_j^{BB2} \mid j = \max\{1, k - M_\alpha\}, ..., k \right\} & \text{if } \alpha_k^{BB2}/\alpha_k^{BB1} < \tau \\ \alpha_k^{BB1} & \text{otherwise} \end{cases}

where M_α > 0 is a parameter.
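A possible implementation of the ABBmin rule as a stateful update; the window of recent BB2 values is kept in a deque, and the default τ is illustrative, not the value tuned in the talk:

```python
from collections import deque

def abbmin_step(s, z, bb2_hist, tau=0.8):
    """ABBmin steplength from s = x_k - x_{k-1}, z = g_k - g_{k-1}.

    bb2_hist: deque with maxlen = M_alpha + 1 holding recent BB2 values;
    the current BB2 is appended before taking the window minimum.
    """
    sz = s @ z
    bb1 = (s @ s) / sz
    bb2 = sz / (z @ z)
    bb2_hist.append(bb2)
    return min(bb2_hist) if bb2 / bb1 < tau else bb1

# usage: sliding window over the last M_alpha = 5 iterations
bb2_hist = deque(maxlen=6)
```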
ABB and ABBmin rules on the previous toy problem

[Figures: steplengths α_k vs iterations for ABB (≈ 140 iterations) and ABBmin (≈ 45 iterations); relative error ‖x^{(k)} − x*‖/‖x*‖ vs iterations for CSD, BB1, BB2, ABB and ABBmin]

Compared methods:
- Cauchy Steepest Descent (CSD): α_k = argmin_{α > 0} f(x^{(k)} − α g^{(k)})
- BB1: α_k = α_k^{BB1}
- BB2: α_k = α_k^{BB2}
- ABB: alternation
- ABBmin: modified alternation
Similar behaviour on randomly generated test problems

Quadratic test problems with N = 1000:

- spectrum 1: λ_1 = 1, λ_N = 10^4, λ_i log-spaced for i = 2, ..., N − 1
- spectrum 2: \lambda_i = \underline{\lambda} + (\bar{\lambda} - \underline{\lambda})\, s_i, i = 1, ..., N, with \underline{\lambda} = 1, \bar{\lambda} = 10^3 and
  s_i ∈ (0, 0.2) for i = 1, ..., N/2,  s_i ∈ (0.8, 1) for i = N/2 + 1, ..., N

[Di Serafino-Ruggiero-Toraldo-Z., AMC 2018]
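A sketch generating the two spectra; sampling the s_i uniformly on the stated intervals is an assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000

# spectrum 1: log-spaced eigenvalues in [1, 1e4]
lam1 = np.logspace(0, 4, N)

# spectrum 2: two clusters, lam_i = lo + (hi - lo) * s_i with
# s_i in (0, 0.2) for the first half and (0.8, 1) for the second
lo, hi = 1.0, 1e3
s = np.concatenate([rng.uniform(0.0, 0.2, N // 2),
                    rng.uniform(0.8, 1.0, N // 2)])
lam2 = lo + (hi - lo) * s
```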
Other efficient steplength rules based on spectral properties

[Pronzato-Zhigljavsky, Comput. Optim. Appl. 50 (2011)]
[Fletcher, Math. Program. Ser. A 135 (2012)]
[Pronzato-Zhigljavsky-Bukina, Acta Appl. Math. 127 (2013)]
[De Asmundis-Di Serafino-Riccio-Toraldo, IMA J. Numer. Anal. 33 (2013)]
[De Asmundis-Di Serafino-Hager-Toraldo-Zhang, Comput. Optim. Appl. 59 (2014)]
[Gonzaga-Schneider, Comput. Optim. Appl. 63 (2016)]
[Gonzaga, Math. Program. Ser. A 160 (2016)]

- aimed at breaking the well-known cycling behaviour of the Steepest Descent method
- they share an R-linear convergence rate in the quadratic case
- not all of these rules generalize easily to non-quadratic problems (BB-based rules do have this crucial property)
General unconstrained problems: min_{x ∈ R^N} f(x)

Gradient methods with nonmonotone linesearch:

    Init.: x^{(0)} ∈ R^N; 0 < α_min ≤ α_max; α_0 ∈ [α_min, α_max]; δ, σ ∈ (0,1); M ∈ N
    for k = 0, 1, ...
        f_ref = max{ f(x^{(k−j)}), 0 ≤ j ≤ min(k, M) };
        ν_k = α_k;
        while f(x^{(k)} − ν_k g^{(k)}) > f_ref − σ ν_k g^{(k)T} g^{(k)}   (line search)
            ν_k = δ ν_k;
        end
        x^{(k+1)} = x^{(k)} − ν_k g^{(k)};
        define a tentative steplength α_{k+1} ∈ [α_min, α_max]
    end

➤ tentative steplength: exploit effective steplength selections designed for the quadratic case and generalizable in an inexpensive way
➤ R-linear convergence of {f(x^{(k)})} when f is strongly convex with Lipschitz-continuous gradient ([Dai, JOTA 2002], [Dai-Liao, IMA J. Num. Anal. 2002])
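A runnable sketch of the scheme above; using BB1 as the tentative steplength and falling back to α_max when s^T z ≤ 0 are common choices assumed here, not prescribed by the slides:

```python
import numpy as np

def nonmonotone_gradient(f, grad, x0, alpha0=1.0, alpha_min=1e-10,
                         alpha_max=1e10, delta=0.5, sigma=1e-4, M=10,
                         max_iter=1000, tol=1e-8):
    """Gradient method with the nonmonotone (GLL-type) linesearch of the
    slide, using BB1 as tentative steplength (one possible choice)."""
    x, g = x0, grad(x0)
    f_hist = [f(x0)]
    alpha = alpha0
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        f_ref = max(f_hist[-(M + 1):])     # max over last min(k, M) + 1 values
        nu = alpha
        while f(x - nu * g) > f_ref - sigma * nu * (g @ g):
            nu *= delta                     # backtracking: nu_k = delta * nu_k
        x_new = x - nu * g
        g_new = grad(x_new)
        s, z = x_new - x, g_new - g
        sz = s @ z
        bb1 = (s @ s) / sz if sz > 0 else alpha_max   # safeguard for s'z <= 0
        alpha = min(max(bb1, alpha_min), alpha_max)   # keep in [alpha_min, alpha_max]
        x, g = x_new, g_new
        f_hist.append(f(x))
    return x

# usage on the earlier toy quadratic:
#   nonmonotone_gradient(lambda x: 0.5 * x @ A @ x - b @ x,
#                        lambda x: A @ x - b, np.zeros(10))
```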