  1. Randomized Block Cubic Newton Method
Nikita Doikov (1) and Peter Richtárik (2,3,4)
(1) Higher School of Economics, Russia
(2) King Abdullah University of Science and Technology, Saudi Arabia
(3) The University of Edinburgh, United Kingdom
(4) Moscow Institute of Physics and Technology, Russia
International Conference on Machine Learning, Stockholm, July 12, 2018

  2. Plan of the Talk
1. Review: Gradient Descent and Cubic Newton methods
2. RBCN: Randomized Block Cubic Newton
3. Application: Empirical Risk Minimization

  4. Review: Classical Gradient Descent
Optimization problem: min_{x ∈ R^N} F(x).
◮ Main assumption: the gradient of F is Lipschitz-continuous:
    ‖∇F(x) − ∇F(y)‖ ≤ L ‖x − y‖,  ∀ x, y ∈ R^N.
◮ From this we get a global upper bound on the function:
    F(y) ≤ F(x) + ⟨∇F(x), y − x⟩ + (L/2) ‖y − x‖²,  ∀ x, y ∈ R^N.
◮ The Gradient Descent step:
    x⁺ = argmin_{y ∈ R^N} [ F(x) + ⟨∇F(x), y − x⟩ + (L/2) ‖y − x‖² ] = x − (1/L) ∇F(x).
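
A minimal Python sketch of this update (not part of the original slides; the quadratic objective below is an illustrative stand-in for F):

```python
import numpy as np

def gradient_descent(grad_F, x0, L, num_iters=100):
    """Gradient descent with the fixed step size 1/L, where L is the
    Lipschitz constant of the gradient of F."""
    x = x0.copy()
    for _ in range(num_iters):
        x = x - (1.0 / L) * grad_F(x)   # x+ = x - (1/L) * grad F(x)
    return x

# Illustrative example: F(x) = 1/2 <Ax, x> - <b, x>, so grad F(x) = A x - b
# and L = lambda_max(A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()
x_min = gradient_descent(lambda x: A @ x - b, np.zeros(2), L)
```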

  5. Review: Cubic Newton
Optimization problem: min_{x ∈ R^N} F(x).
◮ New assumption: the Hessian of F is Lipschitz-continuous:
    ‖∇²F(x) − ∇²F(y)‖ ≤ H ‖x − y‖,  ∀ x, y ∈ R^N.
◮ Corresponding global upper bound on the function: with
    Q(x; y) ≡ F(x) + ⟨∇F(x), y − x⟩ + (1/2) ⟨∇²F(x)(y − x), y − x⟩,
  we have
    F(y) ≤ Q(x; y) + (H/6) ‖y − x‖³,  ∀ x, y ∈ R^N.
◮ Newton method with cubic regularization¹:
    x⁺ = argmin_{y ∈ R^N} [ Q(x; y) + (H/6) ‖y − x‖³ ]
       = x − ( ∇²F(x) + (H/2) ‖x⁺ − x‖ · I )⁻¹ ∇F(x).
¹ Yurii Nesterov and Boris T. Polyak. “Cubic regularization of Newton’s method and its global performance”. In: Mathematical Programming 108.1 (2006), pp. 177–205.
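
A minimal Python sketch of one cubic-regularized step (not from the slides). It assumes F is convex, so the subproblem reduces to a one-dimensional bisection for r = ‖x⁺ − x‖ in the closed-form expression above; all helper names are illustrative:

```python
import numpy as np

def cubic_newton_step(grad, hess, H, r_max=1e6, tol=1e-10):
    """One step of Newton's method with cubic regularization for convex F:
    find r = ||x+ - x|| such that ||(hess + (H*r/2) I)^{-1} grad|| = r,
    then return the step x+ - x = -(hess + (H*r/2) I)^{-1} grad."""
    I = np.eye(len(grad))
    step_norm = lambda r: np.linalg.norm(np.linalg.solve(hess + 0.5 * H * r * I, grad))
    lo, hi = 0.0, r_max
    while hi - lo > tol:                 # bisection: step_norm(r) is decreasing in r
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if step_norm(mid) > mid else (lo, mid)
    r = 0.5 * (lo + hi)
    return -np.linalg.solve(hess + 0.5 * H * r * I, grad)

# Example on F(x) = 1/4 ||x||^4 (grad = ||x||^2 x, hess = ||x||^2 I + 2 x x^T;
# its Hessian is Lipschitz on bounded sets).
x = np.array([1.0, 2.0])
g = (x @ x) * x
Hs = (x @ x) * np.eye(2) + 2.0 * np.outer(x, x)
print(x + cubic_newton_step(g, Hs, H=10.0))
```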

  6. Gradient Descent vs. Cubic Newton
Optimization problem: F* = min_{x ∈ R^N} F(x).
◮ We want F(x_K) − F* ≤ ε. How large must K be?
◮ Let F be convex: F(y) ≥ F(x) + ⟨∇F(x), y − x⟩.
◮ Iteration complexity estimates:
    K = O(1/ε) for GD, and K = O(1/√ε) for CN (much better).
◮ But the cost of one iteration is O(N) for GD and O(N³) for CN.
  N is huge for modern applications; even O(N) is too much!

  7. Our Motivation
Recent advances in block coordinate methods:
1. Paul Tseng and Sangwoon Yun. “A coordinate gradient descent method for nonsmooth separable minimization”. In: Mathematical Programming 117.1-2 (2009), pp. 387–423.
2. Peter Richtárik and Martin Takáč. “Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function”. In: Mathematical Programming 144.1-2 (2014), pp. 1–38.
3. Zheng Qu et al. “SDNA: stochastic dual Newton ascent for empirical risk minimization”. In: International Conference on Machine Learning. 2016, pp. 1823–1832.
These methods take computationally cheap steps, yet converge like the corresponding full methods.
Aim: to create a second-order method with global complexity guarantees and a low cost per iteration.

  8. Plan of the Talk
1. Review: Gradient Descent and Cubic Newton methods
2. RBCN: Randomized Block Cubic Newton
3. Application: Empirical Risk Minimization

  9. Problem Structure
◮ Consider the following decomposition of F: R^N → R:
    F(x) ≡ φ(x) + g(x),
  where φ is twice differentiable and g is differentiable.
◮ For a given space decomposition R^N ≡ R^{N_1} × ··· × R^{N_n},
    x ≡ (x^(1), ..., x^(n)),  x^(i) ∈ R^{N_i},
  assume a block-separable structure of φ:
    φ(x) ≡ Σ_{i=1}^n φ_i(x^(i)).
◮ Block separability of g: R^N → R is not assumed.
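
A toy Python illustration of this block structure (not from the slides; the particular φ_i and g below are arbitrary smooth choices):

```python
import numpy as np

# Space decomposition R^N = R^{N_1} x ... x R^{N_n}: each block i is an index slice.
blocks = [np.arange(0, 3), np.arange(3, 5), np.arange(5, 9)]   # N_1 = 3, N_2 = 2, N_3 = 4

def phi_i(x_i):
    # block-separable, twice differentiable part (toy choice: softplus)
    return np.sum(np.logaddexp(0.0, x_i))

def g(x):
    # non-separable, differentiable part (toy choice: a coupling quadratic)
    return 0.5 * np.dot(x, x) + np.sum(x) ** 2

def F(x):
    return sum(phi_i(x[idx]) for idx in blocks) + g(x)

print(F(np.random.randn(9)))
```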

  10. Main Assumptions
Optimization problem: min_{x ∈ Q} F(x), where
    F(x) ≡ Σ_{i=1}^n φ_i(x^(i)) + g(x).
◮ Every φ_i, i ∈ {1, ..., n}, is twice differentiable and convex, with Lipschitz-continuous Hessian:
    ‖∇²φ_i(x) − ∇²φ_i(y)‖ ≤ H_i ‖x − y‖,  ∀ x, y ∈ R^{N_i}.
◮ g is differentiable, and for some fixed positive-semidefinite matrices A ⪰ G ⪰ 0 we have the bounds, for all x, y ∈ R^N:
    g(y) ≤ g(x) + ⟨∇g(x), y − x⟩ + (1/2) ⟨A(y − x), y − x⟩,
    g(y) ≥ g(x) + ⟨∇g(x), y − x⟩ + (1/2) ⟨G(y − x), y − x⟩.
◮ Q ⊂ R^N is a simple convex set.
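
One simple case where both curvature bounds on g hold is a convex quadratic; the snippet below (illustrative, not from the slides) checks that for g(x) = ½⟨Mx, x⟩ + ⟨b, x⟩ the two bounds are tight with A = G = M:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
M = B @ B.T                                   # a positive semidefinite matrix
b = rng.standard_normal(5)

g      = lambda x: 0.5 * x @ M @ x + b @ x
grad_g = lambda x: M @ x + b

x, y = rng.standard_normal(5), rng.standard_normal(5)
quad_model = g(x) + grad_g(x) @ (y - x) + 0.5 * (y - x) @ M @ (y - x)
assert abs(g(y) - quad_model) < 1e-9          # upper and lower bounds coincide for A = G = M
```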

  11. Model of the Objective
Objective: F(x) ≡ Σ_{i=1}^n φ_i(x^(i)) + g(x).
We want to build a model of F.
◮ Fix a subset of blocks S ⊂ {1, ..., n}.
◮ For y ∈ R^N, denote by y_[S] ∈ R^N the vector obtained from y by zeroing out the blocks i ∉ S.
◮ Define the model
    M_{H,S}(x; y) ≡ F(x) + ⟨∇φ(x), y_[S]⟩ + (1/2) ⟨∇²φ(x) y_[S], y_[S]⟩ + (H/6) ‖y_[S]‖³
                    + ⟨∇g(x), y_[S]⟩ + (1/2) ⟨A y_[S], y_[S]⟩.
◮ From smoothness:
    F(x + y) ≤ M_{H,S}(x; y),  ∀ x, y ∈ R^N,  for H ≥ Σ_{i ∈ S} H_i.
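
A Python sketch of evaluating this model (not from the slides; `blocks` is a list of index arrays as in the earlier snippet, and the derivative arguments are assumed precomputed at the current point x):

```python
import numpy as np

def restrict(y, blocks, S):
    """y_[S]: keep the coordinates of blocks i in S, zero out the rest."""
    y_S = np.zeros_like(y)
    for i in S:
        y_S[blocks[i]] = y[blocks[i]]
    return y_S

def model_M(F_x, grad_phi_x, hess_phi_x, grad_g_x, A, H, y_S):
    """M_{H,S}(x; y) evaluated at the restricted direction y_S = y_[S]."""
    return (F_x
            + grad_phi_x @ y_S + 0.5 * y_S @ hess_phi_x @ y_S
            + grad_g_x   @ y_S + 0.5 * y_S @ A          @ y_S
            + (H / 6.0) * np.linalg.norm(y_S) ** 3)
```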

  12. RBCN: Randomized Block Cubic Newton Method
◮ Method step:
    T_{H,S}(x) ≡ argmin_{y ∈ R^N_[S] : x + y ∈ Q} M_{H,S}(x; y).
◮ Algorithm:
  Initialization: choose x_0 ∈ R^N and a uniform random distribution of blocks Ŝ.
  Iterations, for k ≥ 0:
  1: Sample S_k ∼ Ŝ.
  2: Find H_k > 0 such that
       F(x_k + T_{H_k,S_k}(x_k)) ≤ M_{H_k,S_k}(x_k; T_{H_k,S_k}(x_k)).
  3: Make the step: x_{k+1} := x_k + T_{H_k,S_k}(x_k).
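
A high-level Python sketch of one RBCN iteration (not from the slides). The doubling search for H_k is one simple way to satisfy the decrease condition in step 2, not necessarily the rule used in the paper; the callables are assumed to be supplied by the user:

```python
def rbcn_step(x, F, sample_blocks, model_minimizer, model_value, H0=1.0):
    """One iteration of the randomized block cubic Newton scheme (a sketch).

    sample_blocks()            -- draws a random subset S_k of blocks
    model_minimizer(x, S, H)   -- returns T_{H,S}(x), the minimizer of the cubic model
    model_value(x, S, H, T)    -- returns M_{H,S}(x; T)
    """
    S = sample_blocks()                          # 1: sample S_k ~ S_hat
    H = H0
    while True:                                  # 2: find H_k with F(x + T) <= M_{H,S}(x; T)
        T = model_minimizer(x, S, H)
        if F(x + T) <= model_value(x, S, H, T):
            break
        H *= 2.0                                 #    simple doubling search (an assumption)
    return x + T, H                              # 3: x_{k+1} = x_k + T_{H_k,S_k}(x_k)
```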

  13. Convergence Results
We want to guarantee: P( F(x_K) − F* ≤ ε ) ≥ 1 − ρ,
where ε > 0 is the required accuracy and ρ ∈ (0, 1) is the confidence level.
Theorem 1 (general conditions):
    K = O( (n/τ) · (1/ε) · (1 + log(1/ρ)) ),   where τ ≡ E[ |Ŝ| ].
Theorem 2 (σ ∈ [0, 1] is a condition number, σ ≥ λ_min(G)/λ_max(A) > 0):
    K = O( (n/τ) · (1/√(σ ε)) · (1 + log(1/ρ)) ).
Theorem 3 (strongly convex case, µ ≡ λ_min(G) > 0):
    K = O( (n/τ) · (1/σ) · √( max{ HD/µ, 1 } ) · log( 1/(ε ρ) ) ),   where D ≥ ‖x_0 − x*‖.

  14. Plan of the Talk
1. Review: Gradient Descent and Cubic Newton methods
2. RBCN: Randomized Block Cubic Newton
3. Application: Empirical Risk Minimization

  15. Empirical Risk Minimization
ERM problem:
    min_{w ∈ R^d} [ P(w) ≡ Σ_{i=1}^n φ_i(b_iᵀ w) + g(w) ],
where the sum is the loss and g is the regularizer.
◮ SVM: φ_i(a) = max{0, 1 − y_i a},
◮ Logistic regression: φ_i(a) = log(1 + exp(−y_i a)),
◮ Regression: φ_i(a) = (a − y_i)² or φ_i(a) = |a − y_i|,
◮ Support vector regression: φ_i(a) = max{0, |a − y_i| − ν},
◮ Generalized linear models.
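
The scalar losses listed above written out in plain Python (illustrative toy code, not from the slides), together with the full ERM objective for an L2 regularizer:

```python
import numpy as np

# phi_i(a) with a = b_i^T w and label y_i:
hinge      = lambda a, y: max(0.0, 1.0 - y * a)                # SVM
logistic   = lambda a, y: np.logaddexp(0.0, -y * a)            # logistic regression
squared    = lambda a, y: (a - y) ** 2                         # regression
absolute   = lambda a, y: abs(a - y)                           # robust regression
eps_insens = lambda a, y, nu=0.1: max(0.0, abs(a - y) - nu)    # support vector regression

def P(w, B, y, lam=1.0, phi=logistic):
    """ERM objective P(w) = sum_i phi(b_i^T w, y_i) + (lam/2) ||w||^2."""
    return sum(phi(B[i] @ w, y[i]) for i in range(len(y))) + 0.5 * lam * w @ w
```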

  16. Constrained Problem Reformulation
    min_{w ∈ R^d} P(w) = min_{w ∈ R^d} [ Σ_{i=1}^n φ_i(b_iᵀ w) + g(w) ]
                       = min_{(w, µ) ∈ Q} [ Σ_{i=1}^n φ_i(µ_i) + g(w) ],
where the substitution µ_i ≡ b_iᵀ w makes the sum separable and twice differentiable, g stays differentiable, and
    Q ≡ { (w, µ) ∈ R^d × R^n | b_iᵀ w = µ_i, i = 1, ..., n }.
◮ Approximate each φ_i by a second-order model with cubic regularization;
◮ Treat g as a quadratic function;
◮ Project onto the simple constraint set Q.
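
Projecting onto Q amounts to projecting onto the null space of the linear map (w, µ) ↦ Bw − µ, where B stacks the rows b_iᵀ; a small illustrative helper (not from the slides):

```python
import numpy as np

def project_onto_Q(w0, mu0, B):
    """Euclidean projection of (w0, mu0) onto Q = {(w, mu) : B w = mu}.
    With M = [B, -I], Q is the null space of M, so the projection is
    z - M^T (M M^T)^{-1} M z for z = (w0, mu0)."""
    r = B @ w0 - mu0                                       # M z, the constraint residual
    lam = np.linalg.solve(B @ B.T + np.eye(len(mu0)), r)   # (M M^T)^{-1} M z
    return w0 - B.T @ lam, mu0 + lam
```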

  17. Proof of Concept: Does Second-Order Information Help?
[Figure: total computational time (s) vs. block size (10 to 8K) for Block Coordinate Gradient Descent and Block Coordinate Cubic Newton on the leukemia and duke breast-cancer datasets, tolerance = 1e-6.]
◮ Training Logistic Regression, d = 7129.
◮ Cubic Newton beats Gradient Descent for 10 ≤ |S| ≤ 50.
◮ Second-order information improves convergence.

  18. Maximization of the Dual Problem
Initial objective: P(w) ≡ Σ_{i=1}^n φ_i(b_iᵀ w) + g(w).
We have the primal and dual problems:
    min_{w ∈ R^d} P(w) ≥ max_{α ∈ R^n} D(α).
Introducing the Fenchel conjugate f*(s) ≡ sup_x [ ⟨s, x⟩ − f(x) ], we have
    D(α) ≡ −g*(−Bᵀα) − Σ_{i=1}^n φ_i*(α_i),
where −g*(−Bᵀα) is differentiable and the sum is separable and twice differentiable.
Solve the dual problem by our framework:
◮ Approximate each φ_i* by a second-order cubic model;
◮ Treat g* as a quadratic function;
◮ Project onto Q ≡ ⋂_{i=1}^n dom φ_i*.
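
A toy Python illustration of this dual for ridge regression, i.e. φ_i(a) = ½(a − y_i)² and g(w) = (λ/2)‖w‖² (the closed-form conjugates below are standard facts, not taken from the slides):

```python
import numpy as np

phi_conj = lambda s, y: 0.5 * s ** 2 + s * y      # conjugate of phi(a) = 1/2 (a - y)^2
g_conj   = lambda v, lam: (v @ v) / (2.0 * lam)   # conjugate of g(w) = (lam/2) ||w||^2

def D(alpha, B, y, lam):
    """Dual objective D(alpha) = -g*(-B^T alpha) - sum_i phi_i*(alpha_i)."""
    return -g_conj(-B.T @ alpha, lam) - sum(phi_conj(alpha[i], y[i]) for i in range(len(y)))
```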

  19. Training Poisson Regression
◮ Solving the dual of Poisson regression.
[Figure: duality gap vs. epoch on a synthetic dataset and on the Montreal bike lanes dataset, comparing Cubic (this work), SDNA and SDCA with block sizes 8, 32 and 256.]
SDNA: Zheng Qu et al. “SDNA: stochastic dual Newton ascent for empirical risk minimization”. In: International Conference on Machine Learning. 2016, pp. 1823–1832.
SDCA: Shai Shalev-Shwartz and Tong Zhang. “Stochastic dual coordinate ascent methods for regularized loss minimization”. In: Journal of Machine Learning Research 14.Feb (2013), pp. 567–599.
