
Characterization of Convex Objective Functions and Optimal Expected Convergence Rates of SGD



  1. Characterization of Convex Objective Functions and Optimal Expected Convergence Rates of SGD. Phuong Ha Nguyen (1), Marten van Dijk (1), Lam M. Nguyen (2), and Dzung T. Phan (2). (1) Secure Computation Laboratory, ECE, University of Connecticut; (2) IBM Research, Thomas J. Watson Research Center. International Conference on Machine Learning (ICML), Long Beach, California, 2019.

  2. Problem Setting
     - Solve $\min_{w \in \mathbb{R}^d} \{ F(w) = \mathbb{E}_{\xi}[f(w; \xi)] \}$.
     - Assumptions:
       - Convex: $f(w; \xi) - f(w'; \xi) \ge \langle \nabla f(w'; \xi), w - w' \rangle$
       - Smooth: $\|\nabla f(w; \xi) - \nabla f(w'; \xi)\| \le L \|w - w'\|$
     - Find a $w_t$ close to $W^* = \{ w^* \in \mathbb{R}^d : \forall w \in \mathbb{R}^d,\ F(w) \ge F(w^*) \}$.
     - Problem: characterize the expected convergence rates $\mathbb{E}[\inf_{w^* \in W^*} \|w_t - w^*\|^2]$ and $\mathbb{E}[F(w_t) - F(w^*)]$.


  5. Problem Setting: Stochastic Gradient Descent (SGD) for minimizing $F(w) = \mathbb{E}_{\xi}[f(w; \xi)]$ under the convexity and smoothness assumptions
     - Initialize: $w_0$
     - Iterate: for $t = 0, 1, 2, \dots$ do
       - Choose $\eta_t > 0$
       - Generate random $\xi_t$
       - Compute $\nabla f(w_t; \xi_t)$
       - Update $w_{t+1} = w_t - \eta_t \nabla f(w_t; \xi_t)$
     - end for
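The SGD loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the toy objective $f(w; \xi) = \tfrac{1}{2}\|w - \xi\|^2$, the Gaussian sampler, the stepsize $\eta_t = 1/(t+1)$, and all function names are assumptions chosen so the result is easy to check (the minimizer of $F$ is the mean of $\xi$).

```python
import random


def sgd(grad, w0, stepsize, sample, num_steps):
    """Run SGD: w_{t+1} = w_t - eta_t * grad f(w_t; xi_t)."""
    w = list(w0)
    for t in range(num_steps):
        eta = stepsize(t)      # choose eta_t > 0
        xi = sample()          # generate random xi_t
        g = grad(w, xi)        # compute grad f(w_t; xi_t)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


# Toy problem (assumption): f(w; xi) = 0.5 * ||w - xi||^2 with xi ~ N(target, I),
# so F(w) = E[f(w; xi)] is strongly convex and minimized at w = target.
rng = random.Random(0)
target = [1.0, -2.0]


def sample():
    return [m + rng.gauss(0.0, 1.0) for m in target]


def grad(w, xi):
    return [wi - xi_i for wi, xi_i in zip(w, xi)]


# Diminishing stepsize eta_t = 1 / (t + 1), a common choice for strongly convex F.
w_final = sgd(grad, [0.0, 0.0], lambda t: 1.0 / (t + 1), sample, 20000)
print(w_final)
```

With this particular stepsize the iterate equals the running average of the samples drawn so far, so `w_final` should land close to `target`.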


  7. Beyond convex and strongly convex functions
     - Plain convex: $F(w) - F(w^*) \ge 0$
     - Strongly convex: $F(w) - F(w^*) \ge \frac{\mu}{2} \|w - w^*\|^2$

  8. $\omega$-Convexity
     - Plain convex: $F(w) - F(w^*) \ge 0$
     - Strongly convex: $F(w) - F(w^*) \ge \frac{\mu}{2} \|w - w^*\|^2$
     - $\omega$-convex: $\omega(F(w) - F(w^*)) \ge \inf_{w^* \in W^*} \|w - w^*\|^2$, with $\omega' > 0$ and $\omega'' < 0$

  9. $\omega$-Convexity with curvature $h \in [0,1]$
     - $\omega$-convex: $\omega(F(w) - F(w^*)) \ge \inf_{w^* \in W^*} \|w - w^*\|^2$, with $\omega' > 0$ and $\omega'' < 0$
     - Curvature $h$: $[F(w) - F(w^*)]^h \ge \alpha \inf_{w^* \in W^*} \|w - w^*\|^2$
     - $h = 0$: plain convex, $F(w) - F(w^*) \ge 0$
     - $h \in (0,1)$: in between
     - $h = 1$: strongly convex, $F(w) - F(w^*) \ge \frac{\mu}{2} \|w - w^*\|^2$ (the curvature inequality with $\alpha = \mu/2$)

  10. $\omega$-Convexity with curvature $h \in [0,1]$, compared with HEB
     - Curvature $h$: $[F(w) - F(w^*)]^h \ge \alpha \inf_{w^* \in W^*} \|w - w^*\|^2$, with $h = 0$ plain convex, $h \in (0,1)$ in between, and $h = 1$ strongly convex
     - HEB (Hölderian Error Bound): $[F(w) - F(w^*)]^h \ge \alpha \inf_{w^* \in W^*} \|w - w^*\|^2$, where $h$ lies between 0 and 2
     - HEB and $\omega$-convexity are not subclasses of one another, but they do intersect for $h$ between 0 and 1.
     - [Bolte, J., Nguyen, T. P., Peypouquet, J., and Suter, B. W. From error bounds to the complexity of first order descent methods for convex functions. Mathematical Programming, 165(2):471–507, Oct 2017]

  11. Close to optimal stepsize
     - Setting: curvature $h \in [0,1]$, with $h = 0$ plain convex, $h \in (0,1)$ in between, and $h = 1$ strongly convex
     - SGD with a close to optimal stepsize: a diminishing $\eta_t$ of order $t^{-1/(2-h)}$

  12. Convergence Rate of SGD
     - Setting: curvature $h \in [0,1]$, SGD with a close to optimal stepsize
     - $\mathbb{E}[\inf_{w^* \in W^*} \|w_t - w^*\|^2] = O(t^{-h/(2-h)})$
     - $\frac{1}{t} \sum_{i=t+1}^{2t} \mathbb{E}[F(w_i) - F(w^*)] = O(t^{-1/(2-h)})$
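To see how the rates on this slide interpolate between the plain convex and strongly convex cases, the exponents can be tabulated as functions of the curvature $h$. The sketch below is illustrative only: the function name is hypothetical, and the stepsize exponent $1/(2-h)$ is an assumption read off the partly garbled stepsize slide rather than a confirmed formula.

```python
from fractions import Fraction


def sgd_exponents(h):
    """Return (stepsize, distance, objective) exponents for curvature h in [0, 1].

    Each value p means the quantity behaves like t**(-p):
      stepsize  eta_t                       ~ t^(-1/(2-h))   (assumed form)
      distance  E[inf ||w_t - w*||^2]        = O(t^(-h/(2-h)))
      objective averaged E[F(w_i) - F(w*)]  = O(t^(-1/(2-h)))
    """
    h = Fraction(h)
    if not 0 <= h <= 1:
        raise ValueError("curvature h must lie in [0, 1]")
    step = 1 / (2 - h)
    dist = h / (2 - h)
    obj = 1 / (2 - h)
    return step, dist, obj


for h in (Fraction(0), Fraction(1, 2), Fraction(1)):
    step, dist, obj = sgd_exponents(h)
    print(f"h={h}: stepsize t^-{step}, distance t^-{dist}, objective t^-{obj}")
```

At $h = 0$ the distance exponent is 0, so that bound carries no information, while at $h = 1$ both rates become $O(1/t)$.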

  13. Convergence Rate of SGD: usefulness of the bounds as the curvature $h$ ranges from 0 to 1
     - $\mathbb{E}[\inf_{w^* \in W^*} \|w_t - w^*\|^2] = O(t^{-h/(2-h)})$: useless at $h = 0$, useful at $h = 1$
     - $\frac{1}{t} \sum_{i=t+1}^{2t} \mathbb{E}[F(w_i) - F(w^*)] = O(t^{-1/(2-h)})$: useful at $h = 0$ and at $h = 1$

  14. Convergence Rate of SGD: an example with curvature $h = \frac{1}{2}$
     - $F(w) = H(w) + \lambda G(w)$, with $H(w)$ convex
     - $G(w) = \sum_{i=1}^{d} [e^{w_i} + e^{-w_i} - 2 - w_i^2]$
     - $\mathbb{E}[\inf_{w^* \in W^*} \|w_t - w^*\|^2] = O(t^{-h/(2-h)})$
     - $\frac{1}{t} \sum_{i=t+1}^{2t} \mathbb{E}[F(w_i) - F(w^*)] = O(t^{-1/(2-h)})$
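The regularizer $G(w)$ from this slide is simple to evaluate. The following sketch, with hypothetical function names, computes $G$ and its gradient $\partial G / \partial w_i = e^{w_i} - e^{-w_i} - 2 w_i$, and checks the gradient against a central finite difference.

```python
import math


def regularizer(w):
    """G(w) = sum_i [exp(w_i) + exp(-w_i) - 2 - w_i^2]."""
    return sum(math.exp(wi) + math.exp(-wi) - 2.0 - wi * wi for wi in w)


def regularizer_grad(w):
    """Gradient of G: dG/dw_i = exp(w_i) - exp(-w_i) - 2 * w_i."""
    return [math.exp(wi) - math.exp(-wi) - 2.0 * wi for wi in w]


def finite_difference_grad(func, w, eps=1e-6):
    """Central-difference approximation of the gradient, for checking only."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((func(up) - func(down)) / (2.0 * eps))
    return grad


w = [0.7, -1.3]
print(regularizer(w))
print(regularizer_grad(w))
print(finite_difference_grad(regularizer, w))
```

Since $e^{x} + e^{-x} - 2 - x^2 = x^4/12 + \dots$ near zero, $G$ is nonnegative but flat to fourth order at its minimizer, which is one way to see that it is convex without being strongly convex.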

  15. Experiment (four panels, with $f_i(w) = \log(1 + \exp(-y_i x_i^T w))$)
     - Curvature 0 (convex): $f_i(w)$
     - Curvature unknown: $f_i(w) + \lambda \|w\|$
     - Curvature ½: $f_i(w) + \lambda G(w)$, with $G(w) = \sum_{i=1}^{d} [e^{w_i} + e^{-w_i} - 2 - w_i^2]$
     - Curvature 1 (strongly convex): $f_i(w) + \frac{\lambda}{2} \|w\|^2$
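As a rough illustration of the per-sample objectives in the experiment, the sketch below implements the logistic loss $f_i(w) = \log(1 + \exp(-y_i x_i^T w))$ and its squared-norm regularized variant (the curvature 1 panel). The function names, the label convention $y_i \in \{-1, +1\}$, and the sample values are assumptions; the slides do not say which data set or value of $\lambda$ was used.

```python
import math


def logistic_loss(w, x, y):
    """f_i(w) = log(1 + exp(-y * <x, w>)), computed without overflow."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite log(1 + e^{-m}) as -m + log(1 + e^{m}).
    return -margin + math.log1p(math.exp(margin))


def ridge_logistic_loss(w, x, y, lam):
    """Strongly convex variant: f_i(w) + (lam / 2) * ||w||^2."""
    return logistic_loss(w, x, y) + 0.5 * lam * sum(wi * wi for wi in w)


# Hypothetical sample: features x, label y in {-1, +1}.
x, y = [1.0, 2.0], 1
print(logistic_loss([0.0, 0.0], x, y))             # equals log(2) at w = 0
print(ridge_logistic_loss([1.0, 0.0], x, y, 0.1))
```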

  16. Conclusion
     - $\omega$-convexity notion: plain convex, strongly convex, and something in between
     - SGD with $\omega$-convex objective functions
     - Thank you for your attention! https://arxiv.org/abs/1810.04100
     - Poster Number: #193, Pacific Ballroom, 06:30 to 09:00 PM, 06/11
