  1. Proximal Method with Contractions for Smooth Convex Optimization
     Nikita Doikov, Yurii Nesterov
     Catholic University of Louvain, Belgium
     Grenoble, September 23, 2019

  2. Plan of the Talk
     1. Proximal Method with Contractions
     2. Application to Second-Order Methods
     3. Numerical Example
     2 / 19

  3. Plan of the Talk
     1. Proximal Method with Contractions
     2. Application to Second-Order Methods
     3. Numerical Example
     3 / 19

  4. Review: Proximal Method
     $f^* = \min_{x \in \mathbb{R}^n} f(x)$
     Proximal Method [Rockafellar, 1976]:
     $x_{k+1} = \operatorname*{argmin}_{y \in \mathbb{R}^n} \Big\{ f(y) + \frac{1}{2 a_{k+1}} \|y - x_k\|^2 \Big\}.$
     ◮ If $f$ is convex, the objective of the subproblem
       $h_{k+1}(y) = f(y) + \frac{1}{2 a_{k+1}} \|y - x_k\|^2$ is strongly convex.
     ◮ Let $f$ have a Lipschitz continuous gradient with constant $L_1$. The Gradient
       Method needs $\tilde{O}(a_{k+1} L_1)$ iterations to minimize $h_{k+1}$.
     ◮ It is enough to use for $x_{k+1}$ an inexact minimizer of $h_{k+1}$.
       [Solodov-Svaiter, 2001; Schmidt-Roux-Bach, 2011; Salzo-Villa, 2012]
     Set $a_{k+1} = \frac{1}{L_1}$. Then $f(\bar{x}_k) - f^* \le \frac{L_1 \|x_0 - x^*\|^2}{2k}$.
     4 / 19
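
     As a concrete illustration of the scheme above, here is a minimal Python sketch (not from the slides; the function names, the inner gradient-descent solver, and the toy least-squares problem are my own assumptions):

```python
import numpy as np

def proximal_point_method(f_grad, x0, a, outer_iters=50, inner_iters=200):
    """Proximal method: x_{k+1} ~ argmin_y f(y) + 1/(2a) ||y - x_k||^2.

    The strongly convex subproblem h_{k+1} is minimized inexactly by
    plain gradient descent (an inexact minimizer is enough, see above).
    """
    x = x0.copy()
    for _ in range(outer_iters):
        y = x.copy()
        step = a / 2.0          # safe step if a <= 1/L_1, since then L(h_{k+1}) <= 2/a
        for _ in range(inner_iters):
            y -= step * (f_grad(y) + (y - x) / a)   # gradient of h_{k+1}
        x = y
    return x

# toy usage: f(x) = 0.5 * ||Mx - b||^2, with a = 1/L_1
rng = np.random.default_rng(0)
M, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
L1 = np.linalg.norm(M.T @ M, 2)
x_hat = proximal_point_method(lambda x: M.T @ (M @ x - b), np.zeros(5), a=1.0 / L1)
```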

  5. Accelerated Proximal Method
     Denote $A_k \stackrel{\mathrm{def}}{=} \sum_{i=1}^{k} a_i$. Two sequences: $\{x_k\}_{k \ge 0}$ and $\{v_k\}_{k \ge 0}$. Initialization: $v_0 = x_0$.
     Iterations, $k \ge 0$:
     1. Put $y_{k+1} = \frac{a_{k+1} v_k + A_k x_k}{A_{k+1}}$.
     2. Compute $x_{k+1} = \operatorname*{argmin}_{y \in \mathbb{R}^n} \Big\{ f(y) + \frac{A_{k+1}}{2 a_{k+1}^2} \|y - y_{k+1}\|^2 \Big\}$.
     3. Put $v_{k+1} = x_{k+1} + \frac{A_k}{a_{k+1}} (x_{k+1} - x_k)$.
     Set $\frac{a_{k+1}^2}{A_{k+1}} = \frac{1}{L_1}$. Then
     $f(x_k) - f^* \le \frac{8 L_1 \|x_0 - x^*\|^2}{3 (k+1)^2}.$
     [Nesterov, 1983; Güler, 1992; Lin-Mairal-Harchaoui, 2015]
     ◮ A Universal Catalyst for First-Order Optimization.
     ◮ What about Second-Order Optimization?
     5 / 19
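
     For comparison, a sketch of this accelerated scheme in the same style (my own code, with the coefficient rule $\frac{a_{k+1}^2}{A_{k+1}} = \frac{1}{L_1}$ and the same inexact gradient-descent inner solver assumed):

```python
import numpy as np

def accelerated_proximal_method(f_grad, x0, L1, outer_iters=50, inner_iters=200):
    """Accelerated proximal method (Slide 5), with a_{k+1}^2 / A_{k+1} = 1/L1."""
    x, v, A = x0.copy(), x0.copy(), 0.0
    for _ in range(outer_iters):
        # a_{k+1} > 0 solving L1 * a^2 = A + a
        a = (1.0 + np.sqrt(1.0 + 4.0 * L1 * A)) / (2.0 * L1)
        A_next = A + a
        y = (a * v + A * x) / A_next
        # subproblem: f(z) + A_next/(2 a^2) ||z - y||^2, here A_next / a^2 = L1
        z = y.copy()
        for _ in range(inner_iters):
            z -= (f_grad(z) + L1 * (z - y)) / (2.0 * L1)
        v = z + (A / a) * (z - x)      # v_{k+1}
        x, A = z, A_next               # x_{k+1}, A_{k+1}
    return x

# usage on the same toy least-squares problem as before
rng = np.random.default_rng(0)
M, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
L1 = np.linalg.norm(M.T @ M, 2)
x_hat = accelerated_proximal_method(lambda x: M.T @ (M @ x - b), np.zeros(5), L1)
```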

  6. New Algorithm: Proximal Method with Contractions
     Iterations, $k \ge 0$:
     1. Compute $v_{k+1} = \operatorname*{argmin}_{y \in \mathbb{R}^n} \Big\{ A_{k+1} f\Big( \frac{a_{k+1} y + A_k x_k}{A_{k+1}} \Big) + \beta_d(v_k; y) \Big\}$.
     2. Put $x_{k+1} = \frac{a_{k+1} v_{k+1} + A_k x_k}{A_{k+1}}$.
     $\beta_d(x; y)$ is the Bregman Divergence. Basic setup: $\beta_d(x; y) = \frac{1}{2} \|y - x\|^2$. Then
     $A_{k+1} f\Big( \frac{a_{k+1} y + A_k x_k}{A_{k+1}} \Big) + \frac{1}{2} \|y - v_k\|^2 = A_{k+1} \Big( f(\tilde{y}) + \frac{A_{k+1}}{2 a_{k+1}^2} \|\tilde{y} - y_{k+1}\|^2 \Big),$
     where $\tilde{y} \equiv \frac{a_{k+1} y + A_k x_k}{A_{k+1}}$ and $y_{k+1} \equiv \frac{a_{k+1} v_k + A_k x_k}{A_{k+1}}$,
     since $y - v_k = \frac{A_{k+1}}{a_{k+1}} (\tilde{y} - y_{k+1})$.
     ◮ The same iteration as in the Accelerated Proximal Method.
     ◮ Generalization to an arbitrary prox-function $d(\cdot)$.
     6 / 19

  7. Bregman Divergence
     Let $d(y)$ be a convex differentiable function. Denote the Bregman Divergence of $d(\cdot)$, centered at $x$, as
     $\beta_d(x; y) \stackrel{\mathrm{def}}{=} d(y) - d(x) - \langle \nabla d(x), y - x \rangle \ge 0.$
     ◮ Mirror Descent [Nemirovski-Yudin, 1979]
     ◮ Gradient Methods with Relative Smoothness [Lu-Freund-Nesterov, 2016; Bauschke-Bolte-Teboulle, 2016]
     Consider regularization of a convex $g(\cdot)$ by the Bregman Divergence:
     $h(y) \equiv g(y) + \beta_d(v; y).$
     Main Lemma. Let $T = \operatorname*{argmin}_{y \in \mathbb{R}^n} h(y)$. Then
     $h(y) \ge h(T) + \beta_d(T; y).$
     7 / 19
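
     A small numerical check of the Main Lemma (my own illustration; the convex quadratic $g$, the prox-function $d(x) = \frac{1}{3}\|x\|^3$, and the use of scipy's generic solver are assumptions, not part of the talk):

```python
import numpy as np
from scipy.optimize import minimize

def bregman(d, grad_d, x, y):
    """beta_d(x; y) = d(y) - d(x) - <grad d(x), y - x>."""
    return d(y) - d(x) - grad_d(x) @ (y - x)

d = lambda x: np.linalg.norm(x) ** 3 / 3.0         # convex, differentiable
grad_d = lambda x: np.linalg.norm(x) * x

rng = np.random.default_rng(1)
Q = rng.standard_normal((4, 4)); Q = Q.T @ Q       # g: convex quadratic
g = lambda x: 0.5 * x @ Q @ x
v = rng.standard_normal(4)
h = lambda y: g(y) + bregman(d, grad_d, v, y)      # regularized objective

T = minimize(h, np.zeros(4)).x                     # T = argmin h (numerically)
y = rng.standard_normal(4)
assert h(y) >= h(T) + bregman(d, grad_d, T, y) - 1e-6   # Main Lemma holds
```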

  8. Proximal Method with Contractions: the Main Idea
     We want, for all $y \in \mathbb{R}^n$:
     $\beta_d(x_0; y) + A_k f(y) \ge \beta_d(v_k; y) + A_k f(x_k).$   ($)
     How to propagate it to $k+1$? Denote $a_{k+1} \stackrel{\mathrm{def}}{=} A_{k+1} - A_k > 0$.
     $\beta_d(x_0; y) + A_{k+1} f(y) \equiv \beta_d(x_0; y) + A_k f(y) + a_{k+1} f(y)$
     $\stackrel{(\$)}{\ge} \beta_d(v_k; y) + A_k f(x_k) + a_{k+1} f(y)$
     $\ge \beta_d(v_k; y) + A_{k+1} f\Big( \frac{a_{k+1} y + A_k x_k}{A_{k+1}} \Big) \equiv h_{k+1}(y),$
     where the last step uses convexity of $f$.
     Let $v_{k+1} = \operatorname*{argmin}_{y \in \mathbb{R}^n} h_{k+1}(y)$. Then, by the Main Lemma,
     $h_{k+1}(y) \ge h_{k+1}(v_{k+1}) + \beta_d(v_{k+1}; y) \ge A_{k+1} f\Big( \underbrace{\tfrac{a_{k+1} v_{k+1} + A_k x_k}{A_{k+1}}}_{\equiv\, x_{k+1}} \Big) + \beta_d(v_{k+1}; y).$
     8 / 19
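
     Taking $y = x^*$ in the invariant ($) immediately gives the rate stated on the next slide; a one-line derivation (my own filling-in of this step, using $\beta_d(v_k; x^*) \ge 0$):
     $\beta_d(x_0; x^*) + A_k f(x^*) \ge \beta_d(v_k; x^*) + A_k f(x_k) \ge A_k f(x_k)
       \;\Longrightarrow\; f(x_k) - f^* \le \frac{\beta_d(x_0; x^*)}{A_k}.$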

  9. Proximal Method with Contractions
     Iterations, $k \ge 0$:
     1. Compute $v_{k+1} = \operatorname*{argmin}_{y \in \mathbb{R}^n} \Big\{ A_{k+1} f\Big( \frac{a_{k+1} y + A_k x_k}{A_{k+1}} \Big) + \beta_d(v_k; y) \Big\}$.
     2. Put $x_{k+1} = \frac{a_{k+1} v_{k+1} + A_k x_k}{A_{k+1}}$.
     Rate of convergence:
     $f(x_k) - f^* \le \frac{\beta_d(x_0; x^*)}{A_k}.$
     Questions:
     ◮ How to choose $A_k$? Which prox-function $d(\cdot)$?
     ◮ How to compute $v_{k+1}$?
     9 / 19
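
     A compact sketch of the whole scheme with the basic Euclidean prox-function (my own code; the generic scipy solver for the subproblem and the linear schedule $A_k = k$ in the usage example are assumptions for illustration only):

```python
import numpy as np
from scipy.optimize import minimize

def contracting_proximal_method(f, x0, A_schedule, outer_iters=30):
    """Proximal Method with Contractions (Slide 9), Euclidean setup.

    Uses d(x) = 0.5 ||x - x0||^2, so beta_d(v; y) = 0.5 ||y - v||^2.
    A_schedule(k) returns A_k (interface of this sketch, not of the talk).
    """
    x, v = x0.copy(), x0.copy()
    A = A_schedule(0)
    for k in range(outer_iters):
        A_next = A_schedule(k + 1)
        a = A_next - A
        # subproblem: A_{k+1} f((a y + A x_k)/A_{k+1}) + 0.5 ||y - v_k||^2
        h = lambda y: A_next * f((a * y + A * x) / A_next) + 0.5 * np.sum((y - v) ** 2)
        v = minimize(h, v).x            # v_{k+1} (inexact is fine)
        x = (a * v + A * x) / A_next    # x_{k+1}
        A = A_next
    return x

# usage on a smooth convex toy objective
rng = np.random.default_rng(2)
M, b = rng.standard_normal((10, 4)), rng.standard_normal(10)
f = lambda x: 0.5 * np.sum((M @ x - b) ** 2)
x_hat = contracting_proximal_method(f, np.zeros(4), A_schedule=lambda k: float(k))
```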

  10. Plan of the Talk
     1. Proximal Method with Contractions
     2. Application to Second-Order Methods
     3. Numerical Example
     10 / 19

  11. Newton Method with Cubic Regularization
     $h^* = \min_{x \in \mathbb{R}^n} h(x)$
     $h$ is convex, with Lipschitz continuous Hessian:
     $\|\nabla^2 h(x) - \nabla^2 h(y)\| \le L_2 \|x - y\|.$
     Model of the objective:
     $\Omega_M(x; y) \stackrel{\mathrm{def}}{=} h(x) + \langle \nabla h(x), y - x \rangle + \frac{1}{2} \langle \nabla^2 h(x)(y - x), y - x \rangle + \frac{M}{6} \|y - x\|^3$
     Iterations:
     $z_{t+1} := \operatorname*{argmin}_{y \in \mathbb{R}^n} \Omega_M(z_t; y), \quad t \ge 0.$
     Newton method with Cubic regularization [Nesterov-Polyak, 2006]
     ◮ Global convergence:
       $h(z_t) - h^* \le O\Big( \frac{L_2 R^3}{t^2} \Big).$
     11 / 19
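
     A sketch of one cubic-regularized Newton step for the convex case (my own implementation via the standard one-dimensional reformulation $r = \|s\|$; not code from the talk):

```python
import numpy as np

def cubic_newton_step(grad, hess, M, tol=1e-10, max_iter=100):
    """Minimize <g, s> + 0.5 <H s, s> + (M/6) ||s||^3 over s (H assumed PSD).

    Stationarity gives s = -(H + (M r / 2) I)^{-1} g with r = ||s||; the
    scalar r is found by bisection on phi(r) = ||s(r)|| - r, which is
    strictly decreasing.
    """
    g, H = grad, hess
    n = g.shape[0]
    solve = lambda r: np.linalg.solve(H + 0.5 * M * r * np.eye(n), -g)
    lo, hi = 0.0, 1.0
    while np.linalg.norm(solve(hi)) > hi:      # bracket the root
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(solve(mid)) > mid:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return solve(hi)

# usage: one cubic step on h(z) = 0.5 z^T Q z - c^T z, starting from z = 0
rng = np.random.default_rng(3)
Q = rng.standard_normal((5, 5)); Q = Q.T @ Q
c = rng.standard_normal(5)
z = np.zeros(5)
z_next = z + cubic_newton_step(Q @ z - c, Q, M=1.0)
```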

  12. Computing inexact Proximal Step
     Apply Cubic Newton to compute the Proximal Step:
     $h_{k+1}(y) \equiv A_{k+1} f\Big( \frac{a_{k+1} y + A_k x_k}{A_{k+1}} \Big) + \beta_d(v_k; y) \to \min_{y \in \mathbb{R}^n}$
     ◮ Pick $d(x) = \frac{1}{3} \|x - x_0\|^3$.
     ◮ Uniformly convex objective: $\beta_h(x; y) \ge \frac{1}{6} \|y - x\|^3$.
     Linear rate of convergence for Cubic Newton:
     $h(z_t) - h^* \le \exp\Big( -O\Big( \frac{t}{\sqrt{L_2}} \Big) \Big) \, (h(z_0) - h^*).$
     ◮ Let $v_{k+1}$ be an inexact Proximal Step: $\|\nabla h_{k+1}(v_{k+1})\|_* \le \delta_{k+1}$.
     Theorem.
     $f(x_k) - f^* \le \frac{\big( 3^{-2/3} \|x_0 - x^*\|^2 + 6^{1/3} \sum_{i=1}^{k} \delta_i \big)^{3/2}}{A_k}$
     ◮ $O\big( \sqrt{L_2(h_{k+1})} \, \log \frac{1}{\delta_{k+1}} \big)$ iterations of Cubic Newton for step $k$.
     12 / 19

  13. The choice of $A_k$
     Contracted objective: $g_{k+1}(y) \equiv A_{k+1} f\Big( \frac{a_{k+1} y + A_k x_k}{A_{k+1}} \Big)$.
     Derivatives:
     1. $D g_{k+1}(y) = a_{k+1} \, D f\Big( \frac{a_{k+1} y + A_k x_k}{A_{k+1}} \Big)$,
     2. $D^2 g_{k+1}(y) = \frac{a_{k+1}^2}{A_{k+1}} \, D^2 f\Big( \frac{a_{k+1} y + A_k x_k}{A_{k+1}} \Big)$,
     3. $D^3 g_{k+1}(y) = \frac{a_{k+1}^3}{A_{k+1}^2} \, D^3 f\Big( \frac{a_{k+1} y + A_k x_k}{A_{k+1}} \Big)$,
     ...
     Notice: $D^{p+1} f \preceq L_p(f) \;\Rightarrow\; D^{p+1} g_{k+1} \preceq \frac{a_{k+1}^{p+1}}{A_{k+1}^p} L_p(f)$. Therefore,
     if we have $\frac{a_{k+1}^{p+1}}{A_{k+1}^p} \le \frac{1}{L_p(f)}$, then $L_p(g_{k+1}) \le 1$.
     ◮ For Cubic Newton ($p = 2$) set $A_k = \frac{k^3}{L_2(f)}$. We obtain the accelerated rate of convergence: $O\big( \frac{1}{k^3} \big)$.
     13 / 19
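
     These scalings follow from the chain rule; writing $c(y) = \frac{a_{k+1} y + A_k x_k}{A_{k+1}}$, so that $Dc(y) = \frac{a_{k+1}}{A_{k+1}} I$, gives (my own one-line justification of the list above):
     $D^j g_{k+1}(y) = A_{k+1} \Big( \frac{a_{k+1}}{A_{k+1}} \Big)^j D^j f(c(y)) = \frac{a_{k+1}^j}{A_{k+1}^{\,j-1}} \, D^j f(c(y)), \qquad j \ge 1.$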

  14. High-Order Proximal Accelerated Scheme
     Basic Method:
     $p = 1$: Gradient Method.
     $p = 2$: Newton method with Cubic regularization.
     $p = 3$: Third-order methods (admit an effective implementation) [Grapiglia-Nesterov, 2019]
     ...
     ◮ Prox-function: $d(x) = \frac{1}{p+1} \|x - x_0\|^{p+1}$. Set $A_k = \frac{k^{p+1}}{L_p(f)}$.
     ◮ Let $\delta_k = \frac{c}{k^2}$, so that $\sum_i \delta_i$ stays bounded.
     Theorem.
     $f(x_k) - f^* \le O\Big( \frac{L_p(f) \|x_0 - x^*\|^{p+1}}{k^{p+1}} \Big).$
     ◮ $O\big( \log \frac{1}{\delta_k} \big)$ steps of the Basic Method every iteration.
     14 / 19

  15. Plan of the Talk
     1. Proximal Method with Contractions
     2. Application to Second-Order Methods
     3. Numerical Example
     15 / 19

  16. Log-sum-exp
     $\min_{x \in \mathbb{R}^n} f(x), \qquad f(x) = \log\Big( \sum_{i=1}^{m} e^{\langle a_i, x \rangle} \Big).$
     ◮ $a_1, \ldots, a_m \in \mathbb{R}^n$ are the given data.
     ◮ Denote $B \equiv \sum_{i=1}^{m} a_i a_i^T \succeq 0$, and use $\|x\| \equiv \langle Bx, x \rangle^{1/2}$.
     ◮ We have $L_1 \le 1$, $L_2 \le 2$.
     16 / 19
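
     A small Python oracle for this test function (my own sketch, with a numerically stable soft-max; the sizes n = 10, m = 30 match the experiment on the next slide):

```python
import numpy as np

def logsumexp_oracle(A, x):
    """Objective, gradient, and Hessian of f(x) = log(sum_i exp(<a_i, x>)),
    where the vectors a_i are the rows of A."""
    z = A @ x
    z_max = z.max()
    w = np.exp(z - z_max)                # stable soft-max
    p = w / w.sum()
    f = z_max + np.log(w.sum())
    grad = A.T @ p
    hess = A.T @ (p[:, None] * A) - np.outer(grad, grad)
    return f, grad, hess

# B = sum_i a_i a_i^T defines the norm ||x|| = <Bx, x>^{1/2} used on the slide,
# relative to which L1 <= 1 and L2 <= 2
rng = np.random.default_rng(4)
A = rng.standard_normal((30, 10))        # m = 30, n = 10
B = A.T @ A
f0, g0, H0 = logsumexp_oracle(A, np.zeros(10))
```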

  17. Log-sum-exp: convergence
     [Plot: "Minimizing log-sum-exp, n=10, m=30"; squared gradient norm (log scale, $10^0$ down to $10^{-8}$) vs. iterations (0 to 100) for GD, AGD, APM (p=1), CN, ACN, and APM (p=2).]
     17 / 19

  18. Log-sum-exp: inner steps
     [Plot: "APM, p = 2"; number of inner iterations $t_k$ (0 to 7) vs. outer iterations $k$ (0 to 50).]
     18 / 19

  19. Conclusion
     Two ingredients:
     ◮ Bregman divergence $\beta_d(v_k; y)$.
     ◮ Contraction operator $f(y) \mapsto f\Big( \frac{a_{k+1} y + A_k x_k}{A_{k+1}} \Big)$.
     Direct acceleration vs. Proximal acceleration:
     ◮ The rates are $O\big( \frac{1}{k^{p+1}} \big)$ and $\tilde{O}\big( \frac{1}{k^{p+1}} \big)$, respectively, for methods of order $p \ge 1$.
     ◮ In practice, the number of inner steps is a constant.
     ◮ Proximal acceleration is more general: useful for stochastic and distributed optimization.
     Thank you for your attention!
     19 / 19
