

1. Is Nesterov's acceleration really an acceleration? (L'accélération de Nesterov est-elle vraiment une accélération ?)
Jean-François Aujol¹, in collaboration with Vassilis Apidopoulos¹, Charles Dossal², and Aude Rondepierre².
¹ IMB, Université de Bordeaux. ² INSA Toulouse, IMT.
27 May 2019

2. Setting: minimize a differentiable function
Let F be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is $L$-Lipschitz, having at least one minimizer $x^*$. We want to build an efficient sequence to estimate
$$\arg\min_{x \in \mathbb{R}^n} F(x).$$

3. Gradient descent: explicit gradient descent
Let F be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is $L$-Lipschitz, having at least one minimizer $x^*$. Gradient descent: for $h < 2/L$,
$$x_{n+1} = x_n - h \nabla F(x_n).$$
The sequence $(x_n)_{n \in \mathbb{N}}$ converges to a minimizer of F, and
$$F(x_n) - F(x^*) \leq \frac{\|x_0 - x^*\|^2}{2hn}.$$
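To make the update rule concrete, here is a minimal NumPy sketch (added for illustration, not from the talk) of explicit gradient descent on an assumed least-squares test problem; the matrix `A`, the vector `b`, and the step size are placeholders chosen for the demo.

```python
import numpy as np

# Assumed demo problem: F(x) = 0.5 * ||A x - b||^2, whose gradient
# A^T (A x - b) is L-Lipschitz with L = lambda_max(A^T A).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)

def F(x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

def grad_F(x):
    return A.T @ (A @ x - b)

L = np.linalg.eigvalsh(A.T @ A).max()  # Lipschitz constant of grad F
h = 1.0 / L                            # any step size h < 2/L works

x = np.zeros(10)
for n in range(500):
    x = x - h * grad_F(x)              # x_{n+1} = x_n - h * grad F(x_n)

print(F(x))  # close to the minimal value of F
```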

4. Gradient descent: inertial gradient descent
Let F be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is $L$-Lipschitz, having at least one minimizer $x^*$. Inertial gradient descent: for $h < 1/L$,
$$y_n = x_n + \alpha\,(x_n - x_{n-1}), \qquad x_{n+1} = y_n - h \nabla F(y_n).$$
If $\alpha \in [0, 1]$, the sequence $(x_n)_{n \in \mathbb{N}}$ converges to a minimizer of F, and
$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n}\right).$$
(A numerical sketch of both inertial schemes appears after the next slide.)

5. Nesterov inertial scheme
A specific class of inertial gradient schemes. Nesterov scheme: for $h < 1/L$ and $\alpha \geq 3$,
$$y_n = x_n + \frac{n}{n+\alpha}\,(x_n - x_{n-1}), \qquad x_{n+1} = y_n - h \nabla F(y_n),$$
$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^2}\right).$$
Nesterov (84) proposes $\alpha = 3$.
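As an illustration of the two inertial schemes above (added; assumed diagonal quadratic test problem), the sketch below contrasts constant inertia $\alpha \in [0,1]$ with Nesterov's vanishing inertia $n/(n+\alpha)$; the problem data and iteration counts are placeholders.

```python
import numpy as np

# Assumed demo problem: F(x) = 0.5 * x^T diag(d) x, minimized at x* = 0.
d = np.array([0.05, 0.5, 1.0, 5.0])
grad = lambda x: d * x
L = d.max()   # Lipschitz constant of the gradient
h = 0.9 / L   # step size h < 1/L

def inertial_gd(n_iter, momentum):
    """Inertial gradient descent; momentum(n) is the inertia at step n."""
    x_prev = x = np.ones(4)
    for n in range(1, n_iter + 1):
        y = x + momentum(n) * (x - x_prev)   # extrapolation step
        x_prev, x = x, y - h * grad(y)       # gradient step at y
    return x

x_const = inertial_gd(300, lambda n: 0.8)            # constant inertia: O(1/n)
x_nest = inertial_gd(300, lambda n: n / (n + 3.0))   # Nesterov, alpha = 3: O(1/n^2)
print(0.5 * x_const @ (d * x_const), 0.5 * x_nest @ (d * x_nest))
```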

6. Introduction: non-smooth convex optimization
[Figure: (a) input y, motion blur + noise (σ = 2); (b) convergence profiles of ISTA vs. FISTA; (c) deconvolution with ISTA(300)+UDWT; (d) deconvolution with FISTA(300)+UDWT.]
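For context, here is a minimal sketch (added, not from the talk) of ISTA and its Nesterov-accelerated variant FISTA on an assumed ℓ1-regularized least-squares problem; the deconvolution experiment in the figure uses the same two algorithms, with a wavelet (UDWT) regularizer in place of this plain ℓ1 penalty.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, n_iter):
    """ISTA for min_x 0.5 ||A x - b||^2 + lam * ||x||_1 (assumed demo problem)."""
    L = np.linalg.eigvalsh(A.T @ A).max()  # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - (A.T @ (A @ x - b)) / L, lam / L)
    return x

def fista(A, b, lam, n_iter):
    """FISTA: same iteration as ISTA, plus Nesterov-type inertia."""
    L = np.linalg.eigvalsh(A.T @ A).max()
    x = y = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(n_iter):
        x_new = soft_threshold(y - (A.T @ (A @ y - b)) / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # inertial extrapolation
        x, t = x_new, t_new
    return x
```

For instance, `fista(A, b, 0.1, 300)` typically reaches a given accuracy in far fewer iterations than `ista(A, b, 0.1, 300)`, which is the gap shown in the convergence profiles above.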

7. Introduction: some questions
- How do we prove these decay rates?
- What is the role of the inertial parameter α?
- Can we get more accurate rates than $O(1/n^2)$ with more information on F (i.e., assuming more than convexity)?
- Are these bounds tight?
- Is the Nesterov scheme really an acceleration of gradient descent?

8. Outline
1. Introduction
2. Geometry assumptions
3. Geometry for Nesterov inertial scheme
4. ODEs and Lyapunov functions
5. The non-differentiable setting
6. Finite error analysis


10. Convex functions
Definition. F is a convex function if
$$\forall (x, y), \ \forall \lambda \in [0, 1], \quad F(\lambda x + (1-\lambda) y) \leq \lambda F(x) + (1-\lambda) F(y).$$
Properties of differentiable convex functions:
- F is minorized by its affine approximation:
$$\forall (x, y), \quad F(y) \geq F(x) + \langle \nabla F(x),\, y - x \rangle.$$
- If $x \mapsto \nabla F(x)$ is $L$-Lipschitz, F is majorized by its quadratic approximation:
$$\forall (x, y), \quad F(y) \leq F(x) + \langle \nabla F(x),\, y - x \rangle + \frac{L}{2}\|x - y\|^2.$$
In particular, $F(x) - F(x^*) \leq \frac{L}{2}\|x - x^*\|^2$.
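As a quick sanity check (added for illustration), both inequalities can be verified on the quadratic $F(x) = \frac{L}{2}\|x\|^2$, for which $\nabla F(x) = Lx$:

```latex
% Gap between F and its affine approximation at x, for F(x) = (L/2)||x||^2:
% it is nonnegative (affine lower bound holds) and coincides exactly with
% the quadratic term (L/2)||x - y||^2 (the upper bound holds with equality).
\[
  \frac{L}{2}\|y\|^2 - \frac{L}{2}\|x\|^2 - \langle Lx,\; y - x\rangle
  = \frac{L}{2}\,\|y - x\|^2 \;\geq\; 0 .
\]
```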

11. Classical geometric assumptions
Strong convexity:
$$F(y) \geq F(x) + \langle \nabla F(x),\, y - x \rangle + \frac{\mu}{2}\|x - y\|^2.$$
Strong minimizer ($x^*$ the minimizer of F):
$$\|x - x^*\|^2 \leq \frac{2}{\mu}\,\bigl(F(x) - F(x^*)\bigr).$$
In both cases, the minimizer is unique.
Remark. If F is strongly convex with $L$-Lipschitz gradient:
$$\frac{\mu}{2}\|x - x^*\|^2 \leq F(x) - F(x^*) \leq \frac{L}{2}\|x - x^*\|^2.$$

12. Refined geometric assumptions
Growth condition (sharpness). Let $X^*$ be the set of minimizers of F. A function F satisfies condition $L(p)$ if there exists $K > 0$ such that for all $x \in \mathbb{R}^n$,
$$d(x, X^*)^p \leq K\,\bigl(F(x) - F(x^*)\bigr).$$
The smaller p, the sharper F.
Remark. $L(2)$ together with uniqueness of the minimizer is equivalent to the strong-minimizer condition.
Remark. If F is convex with $L$-Lipschitz gradient and satisfies the growth condition $L(p)$ for some $p > 0$, then necessarily $p \geq 2$.
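A concrete instance (added for illustration): $F(x) = \|x - x^*\|^p$ has $X^* = \{x^*\}$ and satisfies $L(p)$ with $K = 1$:

```latex
% For F(x) = ||x - x*||^p, the distance to the minimizer set is
% d(x, X*) = ||x - x*||, and F(x*) = 0, so L(p) holds with K = 1:
\[
  d(x, X^*)^p = \|x - x^*\|^p = F(x) - F(x^*) .
\]
```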

13. Another geometric condition
Flatness condition. Let $X^*$ be the set of minimizers of F. F satisfies condition $H(\gamma)$ if for all $x \in \mathbb{R}^n$ and all $x^* \in X^*$,
$$F(x) - F(x^*) \leq \frac{1}{\gamma}\,\langle \nabla F(x),\, x - x^* \rangle.$$
Flatness properties:
- If $(F - F^*)^{1/\gamma}$ is convex, then F satisfies $H(\gamma)$.
- If F satisfies $H(\gamma)$, then there exists $K_2 > 0$ such that $F(x) - F(x^*) \leq K_2\, d(x, X^*)^{\gamma}$.
The hypothesis $H(\gamma)$ can be seen as a "flatness" condition on the function F, in the sense that it ensures that F is sufficiently flat (at least as flat as $x \mapsto |x|^\gamma$) in the neighborhood of its minimizers. The larger γ, the flatter F.
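For intuition (added), the one-dimensional model function $F(x) = |x|^\gamma$ with $\gamma > 1$ satisfies $H(\gamma)$ with equality, with $x^* = 0$:

```latex
% F(x) = |x|^gamma, F'(x) = gamma |x|^{gamma-1} sign(x), x* = 0:
\[
  \langle \nabla F(x),\, x - x^* \rangle
  = \gamma\,|x|^{\gamma-1}\operatorname{sign}(x)\, x
  = \gamma\,|x|^{\gamma}
  = \gamma\,\bigl(F(x) - F(x^*)\bigr),
\]
% so F(x) - F(x*) = (1/gamma) <grad F(x), x - x*>: H(gamma) is tight here.
```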

14. Flatness and growth geometric conditions
Flatness and growth properties:
- If $F(x) = \|x - x^*\|^r$ with $r > 1$, then F satisfies $H(\gamma)$ for all $\gamma \in [1, r]$, and $L(p)$ for all $p \geq r$.
- If F satisfies $H(\gamma)$ and $L(p)$, then $p \geq \gamma$.
- If F satisfies $L(2)$ and $\nabla F$ is $L$-Lipschitz, then F satisfies $H(\gamma)$ for some $\gamma > 1$.
Remark. For the explicit gradient descent, only the sharpness assumption $L(p)$ is used. For inertial methods, the flatness condition $H(\gamma)$ also plays a key role.

15. Łojasiewicz property and growth condition
Łojasiewicz property. A differentiable function $F : \mathbb{R}^n \to \mathbb{R}$ is said to have the Łojasiewicz property with exponent $\theta \in [0, 1)$ if, for any critical point $x^*$, there exist $c > 0$ and $\varepsilon > 0$ such that
$$\forall x \in B(x^*, \varepsilon), \quad \|\nabla F(x)\| \geq c\,|F(x) - F^*|^{\theta}.$$
Lemma. Let $F : \mathbb{R}^n \to \mathbb{R}$ be a convex differentiable function. Then F has the Łojasiewicz property with exponent $\theta \in [0, 1)$ if and only if F satisfies the growth condition $L(r)$ with $\theta = 1 - \frac{1}{r}$.
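A direct check of the lemma (added) on $F(x) = |x|^r$ with $r \geq 2$, which satisfies the growth condition $L(r)$ with $K = 1$ and has $F^* = 0$:

```latex
% |F'(x)| = r |x|^{r-1} = r (|x|^r)^{(r-1)/r} = r |F(x) - F*|^{1 - 1/r},
% i.e. the Lojasiewicz inequality holds globally with c = r and
% exponent theta = 1 - 1/r, exactly as the lemma predicts for L(r).
\[
  \|\nabla F(x)\| = r\,|x|^{r-1}
  = r\,\bigl|F(x) - F^*\bigr|^{\,1 - \frac{1}{r}} .
\]
```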

16. Gradient descent and geometry
Growth condition. A function F satisfies condition $L(p)$ if there exists $K > 0$ such that for all $x \in \mathbb{R}^n$,
$$d(x, X^*)^p \leq K\,\bigl(F(x) - F(x^*)\bigr).$$
Theorem (Garrigos et al., 2017; gradient descent).
- If F satisfies condition $L(p)$ with $p > 2$, then
$$F(x_n) - F(x^*) = O\!\left(n^{-\frac{p}{p-2}}\right).$$
- If F satisfies condition $L(2)$, then there exists $a > 0$ such that
$$F(x_n) - F(x^*) = O\!\left(e^{-an}\right).$$
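Both regimes can be observed numerically. Below is a sketch (added, with assumed one-dimensional test functions) running gradient descent on $F(x) = |x|^p$: for $p = 4$ the predicted decay is $n^{-p/(p-2)} = n^{-2}$, and for $p = 2$ the decay is geometric.

```python
import numpy as np

def run_gd(p, h, n_iter, x0=1.0):
    """Gradient descent on F(x) = |x|^p (assumed demo problem; x* = 0)."""
    x = x0
    values = []
    for _ in range(n_iter):
        values.append(abs(x) ** p)                    # F(x_n) - F(x*)
        x -= h * p * np.sign(x) * abs(x) ** (p - 1)   # x_{n+1} = x_n - h F'(x_n)
    return np.array(values)

# p = 4 > 2: F satisfies L(4); predicted rate O(n^{-p/(p-2)}) = O(n^{-2}).
v4 = run_gd(p=4.0, h=0.05, n_iter=10_000)
n = np.arange(1, len(v4) + 1)
slope = np.polyfit(np.log(n[100:]), np.log(v4[100:]), 1)[0]
print(f"observed log-log slope for p = 4: {slope:.2f} (theory: -2)")

# p = 2: F satisfies L(2); the decay is geometric (constant ratio < 1).
v2 = run_gd(p=2.0, h=0.1, n_iter=50)
print("ratio F(x_{n+1}) / F(x_n) for p = 2:", v2[1] / v2[0])
```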

17. Geometric convergence of GD under L(2)
We have both
$$F(x_n) - F(x^*) \leq \frac{\|x_0 - x^*\|^2}{2hn} \quad\text{and}\quad \|x - x^*\|^2 \leq K\,\bigl(F(x) - F(x^*)\bigr).$$
Since gradient descent has no memory, for all $j \leq n$,
$$F(x_n) - F(x^*) \leq \frac{\|x_{n-j} - x^*\|^2}{2hj} \leq \frac{K}{2hj}\,\bigl(F(x_{n-j}) - F(x^*)\bigr).$$
Since $\frac{K}{2hj} \leq \frac{1}{2} \iff j \geq \frac{K}{h}$, taking $j \geq \frac{K}{h}$ gives
$$F(x_n) - F(x^*) \leq \frac{1}{2}\,\bigl(F(x_{n-j}) - F(x^*)\bigr).$$
Conclusion: the decay is geometric.

18. Nesterov scheme for strongly convex functions
Nesterov inertial scheme: for $h < 1/L$ and $\alpha_n = \frac{n}{n+\alpha}$ with $\alpha \geq 3$,
$$y_n = x_n + \alpha_n\,(x_n - x_{n-1}), \qquad x_{n+1} = y_n - h \nabla F(y_n),$$
$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^2}\right).$$
Nesterov scheme for strongly convex functions: for $h < 1/L$ and $\rho = \mu/L$,
$$y_n = x_n + \frac{1 - \sqrt{\rho}}{1 + \sqrt{\rho}}\,(x_n - x_{n-1}), \qquad x_{n+1} = y_n - h \nabla F(y_n),$$
$$F(x_n) - F(x^*) = O\bigl((1 - \sqrt{\rho})^n\bigr).$$
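A minimal sketch (added; assumed diagonal quadratic test problem) of the strongly convex variant, where a constant momentum $(1-\sqrt{\rho})/(1+\sqrt{\rho})$ replaces the vanishing inertia $n/(n+\alpha)$:

```python
import numpy as np

# Assumed demo problem: F(x) = 0.5 * x^T diag(d) x, so mu = min(d), L = max(d).
d = np.array([0.01, 0.1, 1.0, 10.0])
mu, L = d.min(), d.max()
grad = lambda x: d * x

h = 0.99 / L                                         # step size h < 1/L
rho = mu / L
beta = (1.0 - np.sqrt(rho)) / (1.0 + np.sqrt(rho))   # constant momentum

x_prev = x = np.ones(4)
for n in range(200):
    y = x + beta * (x - x_prev)       # inertial extrapolation
    x_prev, x = x, y - h * grad(y)    # gradient step at y

print(0.5 * x @ (d * x))  # F(x_n) - F* decays like (1 - sqrt(rho))^n
```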


20. Back to the Nesterov scheme: state of the art
Nesterov scheme: for $h < 1/L$ and $\alpha \geq 3$,
$$x_{n+1} = x_n - h \nabla F\!\left(x_n + \frac{n}{n+\alpha}\,(x_n - x_{n-1})\right),$$
$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^2}\right).$$
Chambolle-Dossal (14) and Attouch-Peypouquet (15): if $\alpha > 3$, then the sequence $(x_n)_{n \geq 1}$ converges and $F(x_n) - F(x^*) = o\!\left(\frac{1}{n^2}\right)$.
If $\alpha \leq 3$ (Apidopoulos et al. and Attouch et al. (17)):
$$F(x_n) - F(x^*) = O\!\left(n^{-\frac{2\alpha}{3}}\right).$$

21. Nesterov, with strong convexity
Theorem (Su, Boyd, Candès (15); Attouch, Cabot (17)). If F satisfies L(2) and has a unique minimizer, then for all $\alpha > 0$,
$$F(x_n) - F(x^*) = O\!\left(n^{-\frac{2\alpha}{3}}\right).$$

22. Geometric conditions (recap)
Growth condition. A function F satisfies condition $L(p)$ if there exists $K > 0$ such that for all $x \in \mathbb{R}^n$,
$$d(x, X^*)^p \leq K\,\bigl(F(x) - F(x^*)\bigr).$$
Flatness condition. F satisfies condition $H(\gamma)$ if for all $x \in \mathbb{R}^n$ and all $x^* \in X^*$,
$$F(x) - F(x^*) \leq \frac{1}{\gamma}\,\langle \nabla F(x),\, x - x^* \rangle.$$

23. Nesterov: flatness may improve the convergence rate
Theorem (Aujol et al. (18)). Let F be a differentiable convex function whose gradient is $L$-Lipschitz and which satisfies $H(\gamma)$ with $\gamma > 1$.
1. If $\alpha \leq 1 + \frac{2}{\gamma}$, then
$$F(x_n) - F(x^*) = O\!\left(n^{-\frac{2\gamma\alpha}{\gamma+2}}\right).$$
2. If $\alpha > 1 + \frac{2}{\gamma}$ (in particular if $\alpha = 3$, since $\gamma > 1$ implies $1 + \frac{2}{\gamma} < 3$), then
$$F(x_n) - F(x^*) = o\!\left(\frac{1}{n^2}\right)$$
and the sequence $(x_n)_{n \geq 1}$ converges.
Moreover, if F satisfies $L(2)$, then there exists $\gamma > 1$ such that F satisfies $H(\gamma)$.
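An illustrative experiment (added; assumed one-dimensional test function): running the Nesterov scheme on the flat function $F(x) = |x|^\gamma$, which satisfies $H(\gamma)$, and comparing the measured log-log slope of $F(x_n) - F^*$ with the predicted exponent $2\gamma\alpha/(\gamma+2)$ in the regime $\alpha \leq 1 + 2/\gamma$.

```python
import numpy as np

# Assumed demo: F(x) = |x|^gamma with gamma = 4 (so 1 + 2/gamma = 1.5)
# and alpha = 1 <= 1.5; predicted exponent 2*gamma*alpha/(gamma+2) = 4/3.
gamma, alpha, h = 4.0, 1.0, 1e-3

F = lambda x: abs(x) ** gamma
grad = lambda x: gamma * np.sign(x) * abs(x) ** (gamma - 1)

x_prev = x = 1.0
values = []
for n in range(1, 200_000):
    y = x + (n / (n + alpha)) * (x - x_prev)   # vanishing Nesterov inertia
    x_prev, x = x, y - h * grad(y)
    values.append(F(x))

n_axis = np.arange(1, len(values) + 1)
tail = slice(len(values) // 10, None)          # discard the transient
slope = np.polyfit(np.log(n_axis[tail]), np.log(np.array(values)[tail]), 1)[0]
print(f"observed slope: {slope:.2f}")
print(f"predicted:     {-2 * gamma * alpha / (gamma + 2):.2f}")
# Inertial trajectories oscillate, so the fitted slope is only indicative.
```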

24. Nesterov: flatness may improve the convergence rate
[Figure: decay rate $r(\alpha, \gamma) = \frac{2\alpha\gamma}{\gamma+2}$ as a function of $\alpha$, for $\alpha \leq \frac{\gamma+2}{\gamma}$, when F satisfies $H(\gamma)$, for four values of $\gamma$: $\gamma_1 = 1.5$ (dashed line), $\gamma_2 = 2$ (solid line), $\gamma_3 = 3$ (dotted line), $\gamma_4 = 5$ (dash-dotted line).]
