An adaptive backtracking strategy for non-smooth composite optimisation problems
Luca Calatroni
Centre de Mathématiques Appliquées (CMAP), École Polytechnique, Palaiseau
joint work with: A. Chambolle
CMIPI 2018 Workshop, University of Insubria, DISAT, July 16-18 2018, Como, IT
Table of contents
1. Introduction
2. GFISTA with backtracking
3. Accelerated convergence rates
4. Imaging applications
5. Conclusions & outlook
Introduction
Gradient based methods: a review
(X, ‖·‖) Hilbert space. Given f : X → R convex, l.s.c., with x* ∈ arg min f, we want to solve:
min_{x ∈ X} f(x)
Gradient based methods: a review
(X, ‖·‖) Hilbert space. Given f : X → R convex, l.s.c., with x* ∈ arg min f, we want to solve:
min_{x ∈ X} f(x)
If f is differentiable with L_f-Lipschitz gradient, explicit gradient descent reads:
Algorithm 1 Gradient descent with fixed step.
Input: 0 < τ ≤ 2/L_f, x_0 ∈ X.
for k ≥ 0 do
  x_{k+1} = x_k − τ∇f(x_k)
end for
Quite restrictive smoothness assumption!
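As an illustration (not part of the slides), a minimal Python sketch of Algorithm 1 on a toy least-squares objective; the function names and the toy problem are assumptions made here for concreteness.

```python
import numpy as np

def gradient_descent(grad_f, x0, L_f, n_iter=500):
    # explicit gradient descent with fixed step tau (any 0 < tau <= 2/L_f is admissible)
    tau = 1.0 / L_f
    x = x0.copy()
    for _ in range(n_iter):
        x = x - tau * grad_f(x)
    return x

# toy smooth objective: f(x) = 0.5 * ||A x - b||^2, grad f(x) = A^T (A x - b), L_f = ||A^T A||_2
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A.T @ (A @ x - b)
L_f = np.linalg.norm(A.T @ A, 2)
x_star = gradient_descent(grad_f, np.zeros(2), L_f)
```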
Gradient based methods: a review
(X, ‖·‖) Hilbert space. Given f : X → R convex, l.s.c., with x* ∈ arg min f, we want to solve:
min_{x ∈ X} f(x)
No further assumptions on ∇f: use implicit gradient descent.
Algorithm 2 Implicit (proximal) gradient descent with fixed step.
Input: τ > 0, x_0 ∈ X.
for k ≥ 0 do
  x_{k+1} = prox_{τf}(x_k) (= x_k − τ∇f(x_{k+1}))
end for
Note: the iteration can be rewritten as:
x_{k+1} = x_k − τ∇f_τ(x_k), with f_τ(x_k) := min_{x ∈ X} f(x) + ‖x − x_k‖²/(2τ),
the Moreau-Yosida regularisation of f, whose gradient is 1/τ-Lipschitz ⇒ explicit gradient descent on f_τ. Same theory applies!
References: Brezis-Lions ('73, '78), Güler ('91), ...
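A hedged sketch of Algorithm 2 for a quadratic f, where prox_{τf} reduces to a linear solve; the helper names are illustrative, and the quadratic choice is an assumption made so the prox has a closed form.

```python
import numpy as np

def prox_quadratic(v, tau, A, b):
    # prox_{tau f}(v) for f(x) = 0.5 ||A x - b||^2: solve (I + tau A^T A) x = v + tau A^T b
    n = v.size
    return np.linalg.solve(np.eye(n) + tau * (A.T @ A), v + tau * (A.T @ b))

def implicit_gradient_descent(x0, tau, A, b, n_iter=200):
    # proximal point iterations x_{k+1} = prox_{tau f}(x_k); any tau > 0 is allowed
    x = x0.copy()
    for _ in range(n_iter):
        x = prox_quadratic(x, tau, A, b)
    return x
```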
Convergence rates
Theorem: O(1/k) rate
Let x_0 ∈ X and τ ≤ 2/L_f. Then, the sequence (x_k) of iterates of gradient descent converges to x* and satisfies:
f(x_k) − f(x*) ≤ ‖x* − x_0‖² / (2τk).
Convergence rates
Theorem: O(1/k) rate
Let x_0 ∈ X and τ ≤ 2/L_f. Then, the sequence (x_k) of iterates of gradient descent converges to x* and satisfies:
f(x_k) − f(x*) ≤ ‖x* − x_0‖² / (2τk).
Assume: f is µ_f-strongly convex, µ_f > 0:
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ_f/2)‖y − x‖², for all x, y ∈ X.
Theorem: Linear rate for strongly convex objectives
Let f be µ_f-strongly convex. Let x_0 ∈ X and τ ≤ 2/(L_f + µ_f). Then, the sequence (x_k) of iterates of gradient descent satisfies:
f(x_k) − f(x*) + ‖x_k − x*‖²/(2τ) ≤ ω^k ‖x* − x_0‖²/(2τ),
with ω = (1 − µ_f/L_f)/(1 + µ_f/L_f) < 1.
References: Bertsekas '15, Nesterov '04
Lower bounds¹
Theorem (Lower bounds)
Let x_0 ∈ R^n, L_f > 0 and k < n. Then, for any first-order method there exists a convex C¹ function f with L_f-Lipschitz gradient such that:
1. convex case:
   f(x_k) − f(x*) ≥ L_f ‖x* − x_0‖² / (8(k + 1)²).
2. strongly convex case:
   f(x_k) − f(x*) ≥ (µ_f/2) ‖x* − x_0‖² ((√q − 1)/(√q + 1))^{2k},
   where q = L_f/µ_f ≥ 1.
Remark: If k ≥ n we could use conjugate gradient! However, for imaging n ≫ 1! Usually k < n: can we improve convergence speed?
¹ Nesterov, '04
Nesterov acceleration for gradient descent²
To make it faster, build an extrapolated sequence (inertia).
Algorithm 3 Nesterov accelerated gradient descent with fixed step.
Input: 0 < τ ≤ 1/L_f, x_0 = x_{−1} = y_0 ∈ X, t_0 = 0.
for k ≥ 0 do
  t_{k+1} = (1 + √(1 + 4t_k²)) / 2
  y_k = x_k + ((t_k − 1)/t_{k+1}) (x_k − x_{k−1})
  x_{k+1} = y_k − τ∇f(y_k)
end for
² Nesterov '83, '04, Güler '92
Nesterov acceleration for gradient descent²
Algorithm 4 Nesterov accelerated gradient descent with fixed step.
Input: 0 < τ ≤ 1/L_f, x_0 = x_{−1} = y_0 ∈ X, t_0 = 0.
for k ≥ 0 do
  t_{k+1} = (1 + √(1 + 4t_k²)) / 2
  y_k = x_k + ((t_k − 1)/t_{k+1}) (x_k − x_{k−1})
  x_{k+1} = y_k − τ∇f(y_k)
end for
Theorem (Acceleration)
Let τ ≤ 1/L_f and (x_k) the sequence generated by the accelerated gradient descent algorithm. Then:
f(x_k) − f(x*) ≤ 2‖x_0 − x*‖² / (τ(k + 1)²).
² Nesterov '83, '04, Güler '92
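A minimal Python sketch of the accelerated scheme of Algorithms 3-4, assuming grad_f and L_f are available as in the earlier sketch; all names are illustrative.

```python
import numpy as np

def nesterov_gradient_descent(grad_f, x0, L_f, n_iter=500):
    # accelerated gradient descent: extrapolate with the t_k rule, then take a gradient step
    tau = 1.0 / L_f
    x_prev, x, t = x0.copy(), x0.copy(), 0.0
    for _ in range(n_iter):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)   # inertial/extrapolation step
        x_prev, x = x, y - tau * grad_f(y)            # gradient step at the extrapolated point
        t = t_next
    return x
```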
Standard problem in imaging: composite structure
Variational regularisation of ill-posed inverse problems
Compute a reconstructed version of a given degraded image f by solving:
min_{u ∈ X} { F(u) := R(u) + λ D(u, f) }, λ > 0
with non-smooth regularisation and smooth data fidelity.
Standard problem in imaging: composite structure
Variational regularisation of ill-posed inverse problems
Compute a reconstructed version of a given degraded image f by solving:
min_{u ∈ X} { F(u) := R(u) + λ D(u, f) }, λ > 0
with non-smooth regularisation and smooth data fidelity.
Examples in inverse problems/imaging:
• R(u) = TV, ICTV, TGV, ℓ1 (Rudin, Osher, Fatemi '92, Chambolle-Lions '97, Bredies '10)
• D(u, f) = ‖u − f‖²_2 (Gaussian noise; Rudin, Osher, Fatemi '92), D(u, f) = ‖u − f‖_{1,γ} (Laplace/impulse noise; Nikolova '04), D(u, f) = KL_γ(u, f) (Poisson noise; Burger, Sawatzky, Brune, Müller '09), ...
Composite optimisation
We want to solve:
min_{x ∈ X} { F(x) := f(x) + g(x) }
• f is smooth: differentiable, convex with Lipschitz gradient
  ‖∇f(y) − ∇f(x)‖ ≤ L_f ‖y − x‖, for any x, y ∈ X.
• g is convex, l.s.c., non-smooth, with easy proximal map.
³ Combettes, Wajs '05, Nesterov '13, ...
Composite optimisation
We want to solve:
min_{x ∈ X} { F(x) := f(x) + g(x) }
• f is smooth: differentiable, convex with Lipschitz gradient
  ‖∇f(y) − ∇f(x)‖ ≤ L_f ‖y − x‖, for any x, y ∈ X.
• g is convex, l.s.c., non-smooth, with easy proximal map.
Composite optimisation problem: Forward-Backward splitting³.
- forward gradient descent step in f;
- backward implicit gradient descent step in g.
Basic algorithm: take x_0 ∈ X, fix τ > 0 and for k ≥ 0 do:
x_{k+1} = prox_{τg}(x_k − τ∇f(x_k)) =: T_τ x_k.
³ Combettes, Wajs '05, Nesterov '13, ...
Composite optimisation
We want to solve:
min_{x ∈ X} { F(x) := f(x) + g(x) }
• f is smooth: differentiable, convex with Lipschitz gradient
  ‖∇f(y) − ∇f(x)‖ ≤ L_f ‖y − x‖, for any x, y ∈ X.
• g is convex, l.s.c., non-smooth, with easy proximal map.
Composite optimisation problem: Forward-Backward splitting³.
- forward gradient descent step in f;
- backward implicit gradient descent step in g.
Basic algorithm: take x_0 ∈ X, fix τ > 0 and for k ≥ 0 do:
x_{k+1} = prox_{τg}(x_k − τ∇f(x_k)) =: T_τ x_k.
Rate of convergence: O(1/k).
³ Combettes, Wajs '05, Nesterov '13, ...
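A sketch of the basic forward-backward iteration in Python on an ℓ1-regularised least-squares toy problem; the soft-thresholding prox and the problem data are assumptions introduced for illustration.

```python
import numpy as np

def soft_threshold(v, thresh):
    # prox of thresh * ||.||_1: componentwise soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def forward_backward(grad_f, prox_g, x0, tau, n_iter=500):
    # x_{k+1} = prox_{tau g}(x_k - tau * grad f(x_k)) =: T_tau x_k
    x = x0.copy()
    for _ in range(n_iter):
        x = prox_g(x - tau * grad_f(x), tau)
    return x

# toy problem: f(x) = 0.5 ||A x - b||^2 (smooth), g(x) = lam * ||x||_1 (non-smooth, easy prox)
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((20, 10)), rng.standard_normal(20), 0.1
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda v, tau: soft_threshold(v, tau * lam)
tau = 1.0 / np.linalg.norm(A.T @ A, 2)
x_hat = forward_backward(grad_f, prox_g, np.zeros(10), tau)
```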
Accelerated forward-backward, FISTA: previous work
In Nesterov '04 and Beck, Teboulle '09, accelerated O(1/k²) convergence is achieved by extrapolation (as above). Further properties:
- convergence of iterates (Chambolle, Dossal '15);
- monotone variants (Beck, Teboulle '09, Tseng '08, Tao, Boley, Zhang '15);
- acceleration for inexact evaluation of operators (Villa, Salzo, Baldassarre, Verri '13, Bonettini, Prato, Rebegoldi '18).
Accelerated forward-backward, FISTA: previous work
In Nesterov '04 and Beck, Teboulle '09, accelerated O(1/k²) convergence is achieved by extrapolation (as above). Further properties:
- convergence of iterates (Chambolle, Dossal '15);
- monotone variants (Beck, Teboulle '09, Tseng '08, Tao, Boley, Zhang '15);
- acceleration for inexact evaluation of operators (Villa, Salzo, Baldassarre, Verri '13, Bonettini, Prato, Rebegoldi '18).
Questions
1. Can we say more when f and/or g are strongly convex? Linear convergence?
2. Can we let the gradient step (proximal parameter) vary along the iterations AND preserve acceleration?
A strongly convex variant of FISTA (GFISTA)
Let µ_f, µ_g ≥ 0 and µ = µ_f + µ_g. For τ > 0 define:
q := τµ / (1 + τµ_g) ∈ [0, 1).
Algorithm 5 GFISTA⁴ (no backtracking)
Input: 0 < τ ≤ 1/L_f, x_0 = x_{−1} ∈ X and let t_0 ∈ R s.t. 0 ≤ t_0 ≤ 1/√q.
for k ≥ 0 do
  t_{k+1} = (1 − qt_k² + √((1 − qt_k²)² + 4t_k²)) / 2
  β_k = ((t_k − 1)/t_{k+1}) · (1 + τµ_g − t_{k+1}τµ)/(1 − τµ_f)
  y_k = x_k + β_k (x_k − x_{k−1})
  x_{k+1} = T_τ y_k = prox_{τg}(y_k − τ∇f(y_k))
end for
Remark: µ = q = 0 ⇒ standard FISTA.
⁴ Chambolle, Pock '16
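A minimal Python sketch of Algorithm 5 as written on this slide (no backtracking); prox_g(v, tau) is assumed to compute prox_{τg}(v), the names and defaults are illustrative assumptions, and τµ_f < 1 is assumed so that β_k is well defined.

```python
import numpy as np

def gfista(grad_f, prox_g, x0, L_f, mu_f=0.0, mu_g=0.0, t0=0.0, n_iter=500):
    # GFISTA without backtracking: strongly convex variant of FISTA (reduces to FISTA if mu = 0)
    tau = 1.0 / L_f
    mu = mu_f + mu_g
    q = tau * mu / (1.0 + tau * mu_g)                 # q in [0, 1)
    x_prev, x, t = x0.copy(), x0.copy(), t0
    for _ in range(n_iter):
        t_next = (1.0 - q * t * t + np.sqrt((1.0 - q * t * t) ** 2 + 4.0 * t * t)) / 2.0
        beta = ((t - 1.0) / t_next) * (1.0 + tau * mu_g - t_next * tau * mu) / (1.0 - tau * mu_f)
        y = x + beta * (x - x_prev)                   # inertial step with strong-convexity-aware beta
        x_prev, x = x, prox_g(y - tau * grad_f(y), tau)
        t = t_next
    return x
```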
GFISTA: acceleration results
Theorem [Chambolle, Pock '16]
Let τ ≤ 1/L_f and 0 ≤ t_0 √q ≤ 1. Then, the sequence (x_k) of iterates of GFISTA satisfies
F(x_k) − F(x*) ≤ r_k(q) ( t_0² (F(x_0) − F(x*)) + (1 + τµ_g)/(2τ) ‖x_0 − x*‖² ),
where x* is a minimiser of F and:
r_k(q) = min { 4/(k + 1)², (1 + √q)(1 − √q)^k, (1 − √q)^k / t_0² }.
Note: for µ = q = 0, t_0 = 0 this is the standard FISTA convergence result.
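To see how the bound interpolates between the O(1/k²) regime and the linear regime, here is a small helper (an illustration under the reconstruction of r_k(q) above, not from the slides) that evaluates the rate factor, dropping the t_0 term when t_0 = 0.

```python
import numpy as np

def rate_factor(k, q, t0):
    # r_k(q) = min{ 4/(k+1)^2, (1+sqrt(q))(1-sqrt(q))^k, (1-sqrt(q))^k / t0^2 }
    vals = [4.0 / (k + 1) ** 2, (1.0 + np.sqrt(q)) * (1.0 - np.sqrt(q)) ** k]
    if t0 > 0.0:
        vals.append((1.0 - np.sqrt(q)) ** k / t0 ** 2)   # term dropped (interpreted as +inf) if t0 = 0
    return min(vals)

# for q = 0 (no strong convexity) the O(1/k^2) term dominates;
# for q > 0 the geometric terms eventually take over
print(rate_factor(100, 0.0, 0.0), rate_factor(100, 0.01, 1.0))
```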