Convex optimization based on global lower second-order models
Nikita Doikov, Yurii Nesterov
UCLouvain, Belgium
NeurIPS 2020
Problem

Composite convex optimization problem:

    min_x  F(x) := f(x) + ψ(x)

◮ f is convex, differentiable.
◮ ψ : R^n → R ∪ {+∞} is convex, simple.
◮ dom ψ is bounded, D := diam(dom ψ).

Example:

    ψ(x) = 0 if ‖x‖ ≤ D/2,   +∞ otherwise

⇒ the problem with ball-regularization:  min_{‖x‖ ≤ D/2} f(x).
Review: Gradient Methods

Let ∇f be Lipschitz continuous:  ‖∇f(y) − ∇f(x)‖_* ≤ L‖y − x‖.

The Gradient Method:

    x_{k+1} = argmin_y { f(x_k) + ⟨∇f(x_k), y − x_k⟩ + (L/2)‖y − x_k‖² + ψ(y) }.

◮ Global convergence: F(x_k) − F* ≤ O(1/k).

The Conditional Gradient Method [Frank-Wolfe, 1956]:

    v_{k+1} = argmin_y { f(x_k) + ⟨∇f(x_k), y − x_k⟩ + ψ(y) },
    x_{k+1} = γ_k v_{k+1} + (1 − γ_k) x_k.

◮ Set γ_k = 2/(k+2). Then F(x_k) − F* ≤ O(1/k).

Note: near-optimal for ‖·‖_∞-balls [Guzmán-Nemirovski, 2015].
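To make the two updates concrete, here is a minimal sketch for the ball-constrained example from the Problem slide (ψ the indicator of {‖y‖₂ ≤ R}), in the Euclidean norm; the names `grad_f` and `R` are illustrative assumptions, not code from the paper.

```python
import numpy as np

def gradient_step(x, grad_f, L, R):
    """Gradient Method step with psi = indicator of the ball {||y||_2 <= R}:
    the minimizer of the quadratic upper model over the ball is the Euclidean
    projection of x - grad_f(x)/L onto the ball."""
    y = x - grad_f(x) / L
    ny = np.linalg.norm(y)
    return y if ny <= R else (R / ny) * y

def frank_wolfe_step(x, grad_f, k, R):
    """Conditional Gradient step: minimize the linear model over the ball
    (its minimizer is -R g / ||g||), then take the convex combination with
    gamma_k = 2 / (k + 2)."""
    g = grad_f(x)
    v = -R * g / np.linalg.norm(g)   # assumes g != 0 (otherwise x is already optimal)
    gamma = 2.0 / (k + 2)
    return gamma * v + (1.0 - gamma) * x
```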
Review: Second-Order Methods

Let ∇²f be Lipschitz continuous:  ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖.

Newton Method:

    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (1/2)⟨∇²f(x_k)(y − x_k), y − x_k⟩ + ψ(y) }.

◮ Quadratic convergence (if ∇²f(x*) ≻ 0 and x_0 is close to x*).
◮ No global convergence. A heuristic: use line search in practice.

Newton Method with Cubic Regularization:

    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (1/2)⟨∇²f(x_k)(y − x_k), y − x_k⟩ + (L/6)‖y − x_k‖³ + ψ(y) }.

◮ Global rate: F(x_k) − F* ≤ O(1/k²)  [Nesterov-Polyak, 2006].
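As a reference point for the new method, the sketch below computes one cubically regularized Newton step in the unconstrained Euclidean case (ψ ≡ 0, f convex). It uses the standard characterization of the minimizer, (∇²f(x) + (M·r/2)·I) h = −∇f(x) with r = ‖h‖, and finds r by bisection; this is my own illustrative solver, not the implementation from the paper, and M stands in for the Lipschitz constant L.

```python
import numpy as np

def cubic_newton_step(g, H, M, tol=1e-10, max_iter=100):
    """One cubically regularized Newton step for convex f with psi = 0:
        minimize  <g, h> + 0.5 h^T H h + (M / 6) ||h||^3.
    Since H >= 0, the minimizer satisfies (H + 0.5*M*r*I) h = -g with r = ||h||,
    and phi(r) = ||(H + 0.5*M*r*I)^{-1} g|| - r is strictly decreasing,
    so the correct radius r is found by bisection."""
    n = len(g)
    I = np.eye(n)

    def h_of(r):
        return np.linalg.solve(H + 0.5 * M * r * I, -g)

    # Bracket the root: grow r_hi until ||h(r_hi)|| <= r_hi.
    r_lo, r_hi = 0.0, 1.0
    while np.linalg.norm(h_of(r_hi)) > r_hi:
        r_hi *= 2.0

    for _ in range(max_iter):
        r = 0.5 * (r_lo + r_hi)
        if np.linalg.norm(h_of(r)) > r:
            r_lo = r
        else:
            r_hi = r
        if r_hi - r_lo < tol:
            break
    return h_of(r_hi)   # the step h; the new iterate is x + h
```

With a nontrivial ψ or a non-Euclidean norm, this scalar search is replaced by a composite subproblem solver; only the cubic model itself is specific to the method.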
Overview of the Contributions

New second-order algorithms with global convergence proofs.
◮ The methods are universal (no unknown parameters).
◮ Affine-invariant (the norm is not fixed).

Stochastic methods (basic and with variance reduction).

Numerical experiments.
Second-Order Lower Model

1. f is convex:  f(y) ≥ f(x) + ⟨∇f(x), y − x⟩.
2. ∇²f is Lipschitz continuous:  ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖.

Convexity + Smoothness ⇒ tighter lower bound: for all t ∈ [0, 1],

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (t/2)⟨∇²f(x)(y − x), y − x⟩ − (t²L/6)‖y − x‖³.

[Figure: first-order vs. second-order lower models of a convex function.]
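The bound follows in two lines from the two assumptions; a short derivation (not spelled out on the slide), written for x_t = x + t(y − x) with t ∈ (0, 1]:

```latex
\begin{align*}
\text{Convexity at } x_t:\quad
  & f(x_t) \le (1-t)\,f(x) + t\,f(y)
    \;\Longrightarrow\;
    f(y) \ge f(x) + \tfrac{1}{t}\bigl(f(x_t) - f(x)\bigr). \\
\text{Lipschitz Hessian (Taylor lower bound):}\quad
  & f(x_t) \ge f(x) + t\,\langle \nabla f(x), y-x\rangle
    + \tfrac{t^2}{2}\,\langle \nabla^2 f(x)(y-x), y-x\rangle
    - \tfrac{L t^3}{6}\,\|y-x\|^3. \\
\text{Combining the two:}\quad
  & f(y) \ge f(x) + \langle \nabla f(x), y-x\rangle
    + \tfrac{t}{2}\,\langle \nabla^2 f(x)(y-x), y-x\rangle
    - \tfrac{t^2 L}{6}\,\|y-x\|^3.
\end{align*}
```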
New Algorithm

Contracting-Domain Newton Method:

    v_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (γ_k/2)⟨∇²f(x_k)(y − x_k), y − x_k⟩ + ψ(y) },
    x_{k+1} = γ_k v_{k+1} + (1 − γ_k) x_k.
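Below is a minimal sketch of the method for the ball-constrained example (ψ the indicator of {‖y‖₂ ≤ R}), where each subproblem is a convex quadratic over a ball, solved here by bisection on the Lagrange multiplier, with γ_k = 3/(k+3) as in Theorem 1 below. The oracle names and the subproblem solver are my own illustrative choices, not the authors' code.

```python
import numpy as np

def solve_ball_subproblem(g, H, gamma, x, R, tol=1e-10):
    """Minimize <g, y - x> + (gamma/2) <H (y - x), y - x> over {||y||_2 <= R}.
    For H >= 0 the constrained solution is y(lam) = (gamma*H + lam*I)^{-1}(gamma*H x - g)
    with lam >= 0 chosen so that ||y(lam)|| = R; ||y(lam)|| is nonincreasing in lam."""
    n = len(g)
    I = np.eye(n)
    b = gamma * H @ x - g
    # Try the unconstrained minimizer first (tiny ridge in case H is singular).
    y = np.linalg.solve(gamma * H + 1e-12 * I, b)
    if np.linalg.norm(y) <= R:
        return y
    # Otherwise bisect on the multiplier lam.
    lam_lo, lam_hi = 0.0, 1.0
    while np.linalg.norm(np.linalg.solve(gamma * H + lam_hi * I, b)) > R:
        lam_hi *= 2.0
    while lam_hi - lam_lo > tol * (1.0 + lam_hi):
        lam = 0.5 * (lam_lo + lam_hi)
        if np.linalg.norm(np.linalg.solve(gamma * H + lam * I, b)) > R:
            lam_lo = lam
        else:
            lam_hi = lam
    return np.linalg.solve(gamma * H + lam_hi * I, b)

def contracting_newton(x0, grad_f, hess_f, R, iters=100):
    """Contracting-Domain Newton Method with gamma_k = 3 / (k + 3)."""
    x = x0.copy()
    for k in range(iters):
        gamma = 3.0 / (k + 3)
        v = solve_ball_subproblem(grad_f(x), hess_f(x), gamma, x, R)
        x = gamma * v + (1.0 - gamma) * x   # contracted update
    return x
```

For a general simple ψ the ball solver would be swapped for the corresponding composite quadratic solver; only the γ_k-scaled Hessian term and the averaging step are specific to the method.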
Trust-Region Interpretation

Contracting-Domain Newton Method (reformulation):

    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (1/2)⟨∇²f(x_k)(y − x_k), y − x_k⟩ + γ_k ψ(x_k + (1/γ_k)(y − x_k)) }.

Regularization of the quadratic model by the asymmetric trust region.
Global Convergence

Let ∇²f be Lipschitz continuous:  ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖  (w.r.t. arbitrary norm).

Theorem 1. Set γ_k = 3/(k+3). Then

    F(x_k) − F* ≤ O( LD³/k² ).

Theorem 2. Let ψ be strongly convex with parameter µ > 0.
◮ Set γ_k = 5/(k+5). Then

    F(x_k) − F* ≤ O( (LD/µ) · LD³/k⁴ ).

◮ Set γ_k = 1/(1+ω), where ω := [LD/(2µ)]^{1/2}. Then

    F(x_k) − F* ≤ LD³ · exp( −(k−1) / (2(1+ω)) ).
Experiments: Logistic Regression

    min_{‖x‖₂ ≤ D/2}  Σ_{i=1}^{M} f_i(x),    f_i(x) = log(1 + exp⟨a_i, x⟩).

D plays the role of regularization parameter.

[Figure: function residual vs. iterations on w8a for D = 20 and D = 100, comparing Frank-Wolfe, Grad. Method, Fast Grad. Method, Contr. Newton and Aggr. Newton; wall-clock times are annotated next to the curves.]

For bigger D the problem becomes more ill-conditioned.
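For reference, a small sketch of the oracle used in such an experiment: value, gradient, and Hessian of f(x) = Σ_i log(1 + exp⟨a_i, x⟩) for a data matrix A with rows a_i. This is my own illustrative implementation, with labels assumed folded into a_i as on the slide; it can be plugged directly into the contracting_newton sketch above.

```python
import numpy as np

def logistic_oracle(A, x):
    """Value, gradient, and Hessian of f(x) = sum_i log(1 + exp(<a_i, x>)),
    where the rows of A are the (label-scaled) data vectors a_i."""
    z = A @ x                              # margins <a_i, x>
    f = np.sum(np.logaddexp(0.0, z))       # stable log(1 + exp(z))
    s = 1.0 / (1.0 + np.exp(-z))           # sigmoids = d/dz log(1 + exp(z))
    grad = A.T @ s
    w = s * (1.0 - s)                      # second derivatives in z
    hess = A.T @ (w[:, None] * A)
    return f, grad, hess
```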
Stochastic Methods for Logistic Regression

Approximate ∇f(x), ∇²f(x) by stochastic estimates.

[Figure: function residual vs. epochs on YearPredictionMSD, D = 20, comparing SGD, SVRG, SNewton and SVRNewton; wall-clock times of roughly 20s are annotated for each method.]

A problem with big dataset size (M = 463715) and small dimension (n = 90).
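One standard way to form such estimates is mini-batch subsampling of the finite sum, sketched below by reusing the logistic_oracle helper from the previous example; the batch sizes and the independent-sampling scheme are illustrative assumptions, not the exact estimators analyzed in the paper.

```python
import numpy as np

def subsampled_estimates(A, x, batch_grad, batch_hess, rng):
    """Mini-batch estimates of the gradient and Hessian of
    f(x) = sum_{i=1}^M log(1 + exp(<a_i, x>)), rescaled to the full sum.
    Reuses logistic_oracle(A, x) from the earlier sketch."""
    M = A.shape[0]
    idx_g = rng.choice(M, size=batch_grad, replace=False)
    idx_h = rng.choice(M, size=batch_hess, replace=False)
    _, g, _ = logistic_oracle(A[idx_g], x)
    _, _, H = logistic_oracle(A[idx_h], x)
    return (M / batch_grad) * g, (M / batch_hess) * H

# Example usage (hypothetical batch sizes):
# rng = np.random.default_rng(0)
# g_est, H_est = subsampled_estimates(A, x, batch_grad=1024, batch_hess=256, rng=rng)
```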
Conclusions

Second-order information helps in the case of
◮ ill-conditioning;
◮ small or moderate dimension (the subproblems are more expensive).

No need to tune the stepsize.

Can be preferable for solving problems over sets with non-Euclidean geometry.
Follow-Up Results

Nikita Doikov and Yurii Nesterov. "Affine-invariant contracting-point methods for Convex Optimization". arXiv:2009.08894 (2020).
◮ General framework of Contracting-Point Methods.
◮ Contracting-Point Tensor Methods of order p ≥ 1:  F(x_k) − F* ≤ O(1/k^p).
◮ Affine-invariant smoothness condition ⇒ affine-invariant analysis.

Thank you for your attention!