Inexact Tensor Methods with Dynamic Accuracies
Nikita Doikov, Yurii Nesterov
UCLouvain, Belgium
ICML 2020
Plan of the talk
1. Introduction: Tensor Methods in Convex Optimization
2. Inexact Tensor Methods
3. Acceleration
4. Numerical Example
2 / 22
Gradient Method

Composite optimization problem:
    min_{x ∈ dom F} F(x) := f(x) + ψ(x),
◮ f is convex and smooth;
◮ ψ : R^n → R ∪ {+∞} is convex (possibly nonsmooth, but simple).

The Gradient Method:
    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (H/2) ‖y − x_k‖² + ψ(y) },   k ≥ 0.

◮ Gradient of f is Lipschitz continuous: ‖∇f(y) − ∇f(x)‖ ≤ L_1 ‖y − x‖  ⇒  H := L_1.
◮ Global sublinear convergence: F(x_k) − F* ≤ O(1/k).
4 / 22
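As a minimal illustration (not from the slides), here is one composite gradient step in Python for the hypothetical choice ψ(y) = λ‖y‖_1, whose proximal operator is soft-thresholding; `grad_f`, `H`, and `lam` are assumed inputs.

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1 (assumes psi = lam * l1-norm).
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def gradient_step(x, grad_f, H, lam):
    # One composite gradient step:
    #   x+ = argmin_y <grad_f(x), y - x> + (H/2)||y - x||^2 + lam*||y||_1,
    # which reduces to the prox of psi applied to the forward step x - grad_f(x)/H.
    return soft_threshold(x - grad_f(x) / H, lam / H)
```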
Newton Method with Cubic Regularization

◮ Hessian of f is Lipschitz continuous: ‖∇²f(y) − ∇²f(x)‖ ≤ L_2 ‖y − x‖.

Cubic Newton:
    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (1/2) ⟨∇²f(x_k)(y − x_k), y − x_k⟩ + (H/6) ‖y − x_k‖³ + ψ(y) },   k ≥ 0.

◮ H := 0 ⇒ Classical Newton.
◮ H := L_2 ⇒ global convergence F(x_k) − F* ≤ O(1/k²) [Nesterov-Polyak, 2006].
5 / 22
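A sketch of one cubically regularized step for the simplified case ψ ≡ 0, the Euclidean norm, and convex f (assumptions, not the slides' general setting): the stationarity condition ∇f(x) + ∇²f(x) h + (H/2)‖h‖ h = 0 is solved by a one-dimensional bisection over r = ‖h‖.

```python
import numpy as np

def cubic_newton_step(grad, hess, H, r_max=1e8, tol=1e-10):
    # Solve min_h <g, h> + 0.5*<A h, h> + (H/6)*||h||^3 with A = hess positive
    # semidefinite (convex f), psi = 0, Euclidean norm.
    # Stationarity: (A + (H/2) r I) h = -g with r = ||h||; find r by bisection.
    lam, Q = np.linalg.eigh(hess)        # A = Q diag(lam) Q^T
    g = Q.T @ grad

    def norm_h(r):
        # ||h(r)|| for h(r) = -(A + (H/2) r I)^{-1} g
        return np.linalg.norm(g / (lam + 0.5 * H * r))

    lo, hi = 0.0, r_max
    for _ in range(200):                 # bisection on phi(r) = ||h(r)|| - r
        mid = 0.5 * (lo + hi)
        if norm_h(mid) > mid:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    r = 0.5 * (lo + hi)
    return -Q @ (g / (lam + 0.5 * H * r))
```

One eigendecomposition gives the O(n³) part of the cost; each trial value of r then costs only O(n), which matches the per-step cost quoted later in the talk.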
Tensor Methods

Let x ∈ R^n be fixed; consider an arbitrary h ∈ R^n and the one-dimensional function φ(t) := f(x + t h), t ∈ R. Then
    φ(0) = f(x),   φ'(0) = ⟨∇f(x), h⟩,   φ''(0) = ⟨∇²f(x) h, h⟩.
Denote: D^p f(x)[h]^p := φ^(p)(0).

The model:
    Ω_H(x; y) := ∑_{i=1}^p (1/i!) D^i f(x)[y − x]^i + (H/(p+1)!) ‖y − x‖^{p+1} + ψ(y).

Tensor Method of order p ≥ 1:
    x_{k+1} = argmin_y Ω_H(x_k; y),   k ≥ 0.

◮ p-th derivative is Lipschitz continuous: ‖D^p f(y) − D^p f(x)‖ ≤ L_p ‖y − x‖.
◮ Global convergence: F(x_k) − F* ≤ O(1/k^p). [Baes, 2009]
6 / 22
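A small sanity check of the notation (my own illustration, on a hypothetical test function): D²f(x)[h]² = ⟨∇²f(x) h, h⟩ should match a finite-difference estimate of φ''(0) for φ(t) = f(x + t h).

```python
import numpy as np

# Hypothetical test function with explicit Hessian: f(x) = 0.5 x^T A x + sum(exp(x)).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); A = A @ A.T
f = lambda x: 0.5 * x @ A @ x + np.sum(np.exp(x))
hess = lambda x: A + np.diag(np.exp(x))

x, h, t = rng.standard_normal(5), rng.standard_normal(5), 1e-4
phi = lambda s: f(x + s * h)

d2_exact = h @ hess(x) @ h                          # D^2 f(x)[h]^2
d2_fd = (phi(t) - 2 * phi(0.0) + phi(-t)) / t**2    # finite-difference phi''(0)
print(abs(d2_exact - d2_fd))                        # small discrepancy
```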
Tensor Methods: Solving the Subproblem

At each iteration k ≥ 0, the subproblem is
    min_y Ω_H(x_k; y) := ∑_{i=1}^p (1/i!) D^i f(x_k)[y − x_k]^i + (H/(p+1)!) ‖y − x_k‖^{p+1} + ψ(y).

◮ H ≥ pL_p ⇒ Ω_H(x_k; y) is convex in y. [Nesterov, 2018]
◮ For p = 3: efficient implementation, using the Gradient Method with the relative smoothness condition [Van Nguyen, 2017; Bauschke-Bolte-Teboulle, 2016; Lu-Freund-Nesterov, 2018]. The cost of minimizing Ω_H(x_k; ·) is O(n³) + Õ(n).
7 / 22
Some Recent Results

◮ Accelerated Tensor Methods: F(x_k) − F* ≤ O(1/k^{p+1}) [Baes, 2009; Nesterov, 2018].
◮ Optimal Tensor Methods: F(x_k) − F* ≤ O(1/k^{(3p+1)/2}) [Gasnikov et al., 2019; Kamzolov-Gasnikov-Dvurechensky, 2020]. The oracle complexity matches the lower bound (up to a logarithmic factor) from [Arjevani-Shamir-Shiff, 2017].
◮ Universal Tensor Methods: [Grapiglia-Nesterov, 2019].
◮ Stochastic Tensor Methods: [Lucchi-Kohler, 2019].
◮ ...
8 / 22
Plan of the talk (next: 2. Inexact Tensor Methods) 9 / 22
Definition of Inexactness

Use a point T = T_{H,δ}(x_k) with a small residual in the function value of the model:
    Ω_H(x_k; T) − min_y Ω_H(x_k; y) ≤ δ.

◮ Easier to achieve by an inner method.
◮ Can be controlled in practice using the duality gap.

Set H := pL_p. We have F(T) ≤ F(x_k) + δ.
◮ The inexact step can be nonmonotone.
10 / 22
Monotone Inexact Tensor Methods

Initialization: choose x_0 ∈ dom F, set H := pL_p.
Iterations, k ≥ 0:
1. Choose δ_{k+1} ≥ 0.
2. Compute an inexact monotone tensor step T, such that
       Ω_H(x_k; T) − min_y Ω_H(x_k; y) ≤ δ_{k+1}   and   F(T) < F(x_k).
3. Set x_{k+1} := T.

Theorem 1. Set δ_k := c/k^{p+1}, for c ≥ 0. Then
    F(x_k) − F* ≤ O(1/k^p).
11 / 22
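A schematic outer loop for this method (my sketch, with hypothetical helpers: `inexact_tensor_step(x, H, delta)` returning an approximate minimizer of Ω_H(x; ·) to accuracy delta, and `F` the composite objective); the δ_k schedule follows Theorem 1.

```python
def monotone_inexact_tensor_method(x0, F, inexact_tensor_step, H, p, c=1.0, iters=100):
    # Monotone Inexact Tensor Method with the dynamic accuracy delta_k = c / k^(p+1).
    x = x0
    for k in range(1, iters + 1):
        delta = c / k ** (p + 1)                 # accuracy for the subproblem
        T = inexact_tensor_step(x, H, delta)     # Omega_H(x; T) - min_y Omega_H(x; y) <= delta
        if F(T) < F(x):                          # accept only monotone steps
            x = T
    return x
```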
Adaptive Strategy for Inner Accuracy

Let us set δ_k := c (F(x_{k−2}) − F(x_{k−1})).

Theorem 2 (general convex case).
    F(x_k) − F* ≤ O(1/k^p).

Theorem 3 (uniformly convex objective). Let
    F(y) ≥ F(x) + ⟨F'(x), y − x⟩ + (σ_{p+1}/(p+1)) ‖y − x‖^{p+1}.
Denote ω_p := max{ (p+1)² L_p / (p! σ_{p+1}), 1 }. Then we have the linear rate
    F(x_{k+1}) − F* ≤ ( 1 − p/(2(p+1)) · ω_p^{−1/p} ) (F(x_k) − F*).
◮ This works for the methods of every order p ≥ 1.

Theorem 4. For p ≥ 2 and a strongly convex objective, we have a local superlinear rate.
12 / 22
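The adaptive rule only needs the past objective values; a small sketch (my illustration; the fallback value `delta0` used before any progress has been observed is an assumption).

```python
def adaptive_delta(F_history, c=1.0, delta0=1e-2):
    # delta_k = c * (F(x_{k-2}) - F(x_{k-1})): the progress made at the previous step.
    if len(F_history) < 2:
        return delta0                      # assumed fallback for the first iterations
    return c * (F_history[-2] - F_history[-1])
```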
Plan of the talk (next: 3. Acceleration) 13 / 22
Contracting Proximal Scheme

◮ Fix a prox-function d(x). Bregman divergence: β_d(x; y) := d(y) − d(x) − ⟨∇d(x), y − x⟩.
◮ Two sequences of points {x_k}_{k≥0}, {v_k}_{k≥0}, v_0 = x_0.
◮ A sequence of positive coefficients {a_k}_{k≥1}, A_k := ∑_{i=1}^k a_i.

Iterations, k ≥ 0:
1. Compute
       v_{k+1} = argmin_y { A_{k+1} f( (a_{k+1} y + A_k x_k) / A_{k+1} ) + a_{k+1} ψ(y) + β_d(v_k; y) }.
2. Put x_{k+1} = (a_{k+1} v_{k+1} + A_k x_k) / A_{k+1}.

The rate of convergence: F(x_k) − F* ≤ β_d(x_0; x*) / A_k.
[Doikov-Nesterov, 2019]
14 / 22
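A structural sketch of these outer iterations, under simplifying assumptions that are mine, not the slides': the Euclidean prox-function d(x) = ½‖x − x_0‖², so β_d(v; y) = ½‖y − v‖², a smooth ψ, and scipy's BFGS standing in for the inexact inner solver (the slides use inexact tensor steps instead).

```python
import numpy as np
from scipy.optimize import minimize

def contracting_proximal(f, psi, x0, a_seq, n_iters):
    # Sketch of the Contracting Proximal scheme with d(x) = 0.5*||x - x0||^2.
    x, v, A = x0.copy(), x0.copy(), 0.0
    for k in range(n_iters):
        a = a_seq[k]
        A_next = A + a

        def h(y, x=x, v=v, a=a, A=A, A_next=A_next):
            z = (a * y + A * x) / A_next           # contracted argument
            return A_next * f(z) + a * psi(y) + 0.5 * np.dot(y - v, y - v)

        v = minimize(h, v, method="BFGS").x        # inexact minimizer v_{k+1}
        x = (a * v + A * x) / A_next               # x_{k+1}
        A = A_next
    return x
```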
Acceleration of Tensor Steps

For the Tensor Method of order p ≥ 1:
◮ Set d(x) := (1/(p+1)) ‖x − x_0‖^{p+1}.
◮ A_{k+1} := (k+1)^{p+1} / L_p.

For the contracted objective with regularization
    h_{k+1}(y) := A_{k+1} f( (a_{k+1} y + A_k x_k) / A_{k+1} ) + a_{k+1} ψ(y) + β_d(v_k; y),
we compute an inexact minimizer v_{k+1}:
    h_{k+1}(v_{k+1}) − h*_{k+1} ≤ c / (k+1)^{p+2}.
◮ It requires Õ(1) inexact Tensor Steps.

Theorem. For the outer iterations, we obtain the accelerated rate:
    F(x_k) − F* ≤ O(1/k^{p+1}).
15 / 22
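The coefficient and accuracy schedule packaged as a small helper (values as on the slide; the constant c and how the schedule is consumed by the inner solver are left abstract).

```python
def accelerated_schedule(k, L_p, p, c=1.0):
    # A_k = k^(p+1)/L_p, a_{k+1} = A_{k+1} - A_k, and inner accuracy c/(k+1)^(p+2).
    A_k = k ** (p + 1) / L_p
    A_next = (k + 1) ** (p + 1) / L_p
    a_next = A_next - A_k
    delta_next = c / (k + 1) ** (p + 2)
    return A_next, a_next, delta_next
```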
Plan of the talk (next: 4. Numerical Example) 16 / 22
Log-sum-exp

    min_{x ∈ R^n} f(x) := µ log( ∑_{i=1}^m exp( (⟨a_i, x⟩ − b_i) / µ ) )   (SoftMax).

◮ a_1, ..., a_m, b — given data.
◮ µ > 0 — smoothing parameter.
◮ Denote B := ∑_{i=1}^m a_i a_i^T ⪰ 0, and use ‖x‖ := ⟨Bx, x⟩^{1/2}.

We have L_1 ≤ 1/µ, L_2 ≤ 2/µ², L_3 ≤ 4/µ³.

◮ Cubic Newton (p = 2).
◮ Compute each step (inexactly) by the Fast Gradient Method.
17 / 22
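A sketch of the SoftMax oracle used in this experiment: function value, gradient, and a Hessian-vector product for f(x) = µ log Σ_i exp((⟨a_i, x⟩ − b_i)/µ). The data matrix A (rows a_i) and vector b are assumed given; the formulas are the standard ones for the smoothed log-sum-exp.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def softmax_oracle(A, b, mu):
    # f(x) = mu * log( sum_i exp((<a_i, x> - b_i)/mu) ), smoothing parameter mu > 0.
    def value(x):
        return mu * logsumexp((A @ x - b) / mu)

    def gradient(x):
        pi = softmax((A @ x - b) / mu)       # probability weights
        return A.T @ pi

    def hess_vec(x, v):
        # Hessian-vector product: (1/mu) * A^T ( pi*(A v) - pi*<pi, A v> ).
        pi = softmax((A @ x - b) / mu)
        Av = A @ v
        return (A.T @ (pi * Av) - (pi @ Av) * (A.T @ pi)) / mu

    return value, gradient, hess_vec
```

Hessian-vector products of this form are what the Fast Gradient Method consumes when solving the cubic subproblem, which is why the experiments report them on the horizontal axis.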
Log-sum-exp: Constant Strategies

◮ δ_k := const.

[Figure: Log-sum-exp, µ = 0.05, constant strategies. Functional residual vs. iterations (left) and vs. Hessian-vector products (right), for several constant accuracy levels.]
18 / 22
Log-sum-exp: Dynamic Strategies

◮ δ_k := 1/k^α.

[Figure: Log-sum-exp, µ = 0.05, dynamic strategies. Functional residual vs. iterations (left) and vs. Hessian-vector products (right), for δ_k = 1/k², 1/k³, 1/k⁴, 1/k⁵.]
19 / 22
Log-sum-exp: Adaptive Strategies

◮ δ_k := (F(x_{k−1}) − F(x_k))^α.

[Figure: Log-sum-exp, µ = 0.05, adaptive strategies. Functional residual vs. iterations (left) and vs. Hessian-vector products (right), for α = 1, 1.5, 2.]
20 / 22
Log-sum-exp: Cubic Newton vs. Tensor Method

[Figure: Log-sum-exp, µ = 0.1. Functional residual vs. time (s) for Cubic Newton (p = 2), Tensor (p = 3) with exact steps, and Tensor (p = 3) with the adaptive strategy.]

◮ H is fixed.
21 / 22
Conclusion

Inexact Tensor Methods of degree p ≥ 1:
    p = 1: Gradient Method.
    p = 2: Newton Method with Cubic regularization.
    p = 3: Third-order Tensor Method.

We allow the subproblem to be solved inexactly; δ_k is the accuracy in the functional residual of the subproblem.
◮ Dynamic strategy: δ_k := c/k^{p+1}.
◮ Adaptive strategy: δ_k := c (F(x_{k−2}) − F(x_{k−1})).

Global rate of convergence: F(x_k) − F* ≤ O(1/k^p).
◮ Using contracting proximal iterations, we obtain the accelerated O(1/k^{p+1}) rate.

Thank you for your attention!
22 / 22