The importance of better models in stochastic optimization
John Duchi (based on joint work with Feng Ruan and Hilal Asi)
Stanford University, Les Houches 2019

Outline: Motivating experiments; Models in optimization; Stochastic optimization


  1. Models in stochastic optimization. Conditions on our models (convex case):
  i. Convex model: $y \mapsto f_x(y; s)$ is convex
  ii. Lower bound: $f_x(y; s) \le f(y; s)$
  iii. Local correctness: $f_x(x; s) = f(x; s)$ and $\partial f_x(x; s) \subset \partial f(x; s)$
  [D. & Ruan 17; Davis & Drusvyatskiy 18]

  2. Models in stochastic optimization. Conditions on our models ($\rho$-weakly convex case):
  i. Convex model: $y \mapsto f_x(y; s)$ is convex
  ii. Lower bound: $f_x(y; s) \le f(y; s) + \frac{\rho(s)}{2} \|x - y\|_2^2$
  iii. Local correctness: $f_x(x; s) = f(x; s)$ and $\partial f_x(x; s) \subset \partial f(x; s)$
  [D. & Ruan 17; Davis & Drusvyatskiy 18; Asi & D. 19]

  3. Modeling conditions: a model $f_x(y)$ of $f$ near $x$. [Figure: graph of $f$.]

  4. Modeling conditions: the linear model $f_{x_0}(y) = f(x_0) + \nabla f(x_0)^T (y - x_0)$, tangent to $f$ at $(x_0, f(x_0))$. [Figure: graph of $f$ with its tangent line at $x_0$.]

  5. Modeling conditions: the truncated model $f_{x_0}(y) = \left[ f(x_0) + \nabla f(x_0)^T (y - x_0) \right]_+$, which clips the linear model at zero while remaining tangent at $(x_0, f(x_0))$. [Figure: graph of $f$ with the linear and truncated models at $x_0$.]

  6. Models in stochastic optimization. [Figure: linear and truncated models at $x_0$ and $x_1$.]
  i. (Sub)gradient: $f_x(y) = f(x) + \langle f'(x), y - x \rangle$
  ii. Truncated: $f_x(y) = \left( f(x) + \langle f'(x), y - x \rangle \right) \vee \inf_x f(x)$
  iii. Bundle/multi-line: $f_x(y) = \max_i \left\{ f(x_i) + \langle f'(x_i), y - x_i \rangle \right\}$
  iv. Prox-linear: $f_x(y) = h\left( c(x) + \nabla c(x)^T (y - x) \right)$
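As a concrete check of conditions (i)–(iii), here is a minimal numerical sketch (illustrative code, not from the talk) of the linear and truncated models for a single least-squares loss, verifying the lower-bound and local-correctness properties:

```python
import numpy as np

# A minimal numerical check (illustrative, not the talk's code) of the
# model conditions for a single least-squares loss
# f(x; (a, b)) = 0.5 * (a @ x - b)^2, whose infimum over x is 0.

def f(x, a, b):
    return 0.5 * (a @ x - b) ** 2

def grad(x, a, b):
    return (a @ x - b) * a

def linear_model(y, x, a, b):
    """(Sub)gradient model: the first-order expansion of f at x."""
    return f(x, a, b) + grad(x, a, b) @ (y - x)

def truncated_model(y, x, a, b):
    """Truncated model: clip the linear model at inf_x f = 0."""
    return max(linear_model(y, x, a, b), 0.0)

rng = np.random.default_rng(0)
a, b = rng.normal(size=3), 1.0
x, y = rng.normal(size=3), rng.normal(size=3)

# Local correctness: both models agree with f at the base point x.
assert np.isclose(linear_model(x, x, a, b), f(x, a, b))
assert np.isclose(truncated_model(x, x, a, b), f(x, a, b))
# Lower bound: both models sit below f (convexity, plus f >= 0).
assert linear_model(y, x, a, b) <= f(y, a, b)
assert truncated_model(y, x, a, b) <= f(y, a, b)
```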

  7. The aProx family. Iterate:
  ◮ Sample $S_k \stackrel{iid}{\sim} P$
  ◮ Update by minimizing the model: $x_{k+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ f_{x_k}(x; S_k) + \frac{1}{2 \alpha_k} \|x - x_k\|^2 \right\}$
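When the model is the truncated one and $f(\cdot; s) \ge 0$ with infimum 0, this update has a simple closed form: a gradient step whose length is capped at the Polyak step $f(x_k)/\|g\|^2$. A hedged sketch with illustrative names (not the talk's code):

```python
import numpy as np

# Sketch (assuming f(.; s) >= 0 with known infimum 0) of one aProx update
# with the truncated model. Minimizing
#   max(f(x_k) + <g, y - x_k>, 0) + ||y - x_k||^2 / (2 * alpha)
# gives a gradient step capped at the Polyak step f(x_k) / ||g||^2,
# which is what makes the update hard to destabilize.

def aprox_truncated_step(x, fval, g, alpha):
    gnorm2 = g @ g
    if gnorm2 == 0.0:
        return x
    return x - min(alpha, fval / gnorm2) * g

# Toy usage on a single absolute-loss sample f(x) = |a @ x - b|:
a = np.array([1.0, 2.0])
b = 0.0
x = np.array([3.0, -1.0])
r = a @ x - b                      # residual = 1.0
x_new = aprox_truncated_step(x, abs(r), np.sign(r) * a, alpha=100.0)
# Even with alpha = 100 the cap is active, and the step lands exactly on
# this sample's zero set: a @ x_new == b.
```

With a small stepsize the cap is inactive and the update reduces to an ordinary subgradient step.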

  8. Outline
  ◮ Motivating experiments
  ◮ Models in optimization
  ◮ Stochastic optimization: stability is better; nothing gets worse; beyond convexity; adaptivity in easy problems
  ◮ Revisiting experimental results

  10. Divergence of a gradient method


  20. Stability guarantees (convex). Use the full stochastic-proximal method,
  $x_{k+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ f(x; S_k) + \frac{1}{2 \alpha_k} \|x - x_k\|^2 \right\}$.
  Theorem (Asi & D. 18). Assume $\mathcal{X}^\star = \operatorname{argmin}_{x \in \mathcal{X}} F(x)$ is non-empty and $\mathbb{E}[\|f'(x^\star; S)\|^2] \le \sigma^2$. Then
  $\mathbb{E}[\operatorname{dist}(x_k, \mathcal{X}^\star)^2] \le \operatorname{dist}(x_0, \mathcal{X}^\star)^2 + \sigma^2 \sum_{i=1}^k \alpha_i^2$.
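A sketch of why this stability matters in practice (illustrative code, not from the talk): for a least-squares sample the proximal step has a closed form that contracts the sampled residual by $1/(1 + \alpha \|a\|^2)$ for any stepsize, while the plain SGM step amplifies it once $\alpha \|a\|^2 > 2$.

```python
import numpy as np

# Illustrative sketch (not the talk's code): the full stochastic-proximal
# step for f(x; (a, b)) = 0.5 * (a @ x - b)^2 has a closed form that shrinks
# the sampled residual by 1 / (1 + alpha * ||a||^2) for ANY stepsize,
# whereas the SGM step multiplies it by (1 - alpha * ||a||^2).

def prox_step(x, a, b, alpha):
    r = a @ x - b
    return x - (alpha * r / (1.0 + alpha * (a @ a))) * a

def sgm_step(x, a, b, alpha):
    return x - alpha * (a @ x - b) * a

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 5))
xs = rng.normal(size=5)             # planted solution (noiseless data)
bvec = A @ xs

alpha = 100.0                       # far too large for SGM
x_prox = np.zeros(5)
x_sgm = np.zeros(5)
with np.errstate(over="ignore", invalid="ignore"):
    for _ in range(200):
        i = rng.integers(50)
        x_prox = prox_step(x_prox, A[i], bvec[i], alpha)
        x_sgm = sgm_step(x_sgm, A[i], bvec[i], alpha)

# The proximal iterates stay bounded and approach the solution;
# the SGM iterates blow up at this stepsize.
```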

  21. Stability guarantees (convex), continued.
  Theorem (Asi & D. 18). Under the same assumptions, $\sup_k \operatorname{dist}(x_k, \mathcal{X}^\star) < \infty$ and $\operatorname{dist}(x_k, \mathcal{X}^\star) \stackrel{a.s.}{\to} 0$.

  22. Stability guarantees (convex). Use any model with $f_x(y; s) \ge \inf_z f(z; s)$ (i.e. a good lower bound),
  $x_{k+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ f_{x_k}(x; S_k) + \frac{1}{2 \alpha_k} \|x - x_k\|^2 \right\}$.
  Theorem (Asi & D. 19). Assume $\mathcal{X}^\star = \operatorname{argmin}_{x \in \mathcal{X}} F(x)$ is non-empty and there exists $p < \infty$ such that $\mathbb{E}[\|f'(x; S)\|^2] \le C (1 + \operatorname{dist}(x, \mathcal{X}^\star)^p)$. Then
  $\sup_k \operatorname{dist}(x_k, \mathcal{X}^\star) < \infty$ and $\operatorname{dist}(x_k, \mathcal{X}^\star) \stackrel{a.s.}{\to} 0$.

  23. Example behaviors on the least-squares objective $F(x) = \frac{1}{2m} \sum_{i=1}^m (a_i^T x - b_i)^2$. [Figure: log-scale error vs. iteration (0–500) for SGM and Prox.]

  24. Classical asymptotic analysis.
  Theorem (Polyak & Juditsky 92). Let $F$ be convex and strongly convex in a neighborhood of $x^\star$, and assume that the $f(x; S)$ are globally smooth. For $x_k$ generated by the stochastic gradient method,
  $\frac{1}{\sqrt{k}} \sum_{i=1}^k (x_i - x^\star) \stackrel{d}{\to} N\!\left(0,\; \nabla^2 F(x^\star)^{-1} \operatorname{Cov}(\nabla f(x^\star; S)) \nabla^2 F(x^\star)^{-1}\right)$.

  26. New asymptotic analysis (convex case).
  Theorem (Asi & D. 18). Let $F$ be convex and strongly convex in a neighborhood of $x^\star$, and assume that the $f(x; S)$ are smooth near $x^\star$. If the $x_k$ remain bounded and the models $f_{x_k}(\cdot; S_k)$ satisfy our conditions, then
  $\frac{1}{\sqrt{k}} \sum_{i=1}^k (x_i - x^\star) \stackrel{d}{\to} N\!\left(0,\; \nabla^2 F(x^\star)^{-1} \operatorname{Cov}(\nabla f(x^\star; S)) \nabla^2 F(x^\star)^{-1}\right)$.
  ◮ Optimal by the local minimax theorem [Hájek 72; Le Cam 73; D. & Ruan 19]
  ◮ Key insight: subgradients of $f_{x_k}(\cdot; S_k)$ are close to $\nabla f(x_k; S_k)$
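A one-dimensional sketch of the averaging phenomenon behind these limit theorems (illustrative code under simplifying assumptions, not the talk's experiment): for $F(x) = \frac{1}{2} \mathbb{E}[(x - S)^2]$ with $S \sim N(0, 1)$, the Hessian and gradient covariance are both 1, so the averaged iterate is asymptotically $N(0, 1/k)$.

```python
import numpy as np

# One-dimensional sketch (illustrative, not the talk's experiment):
# F(x) = 0.5 * E[(x - S)^2] with S ~ N(0, 1), so x* = 0, Hessian(F) = 1,
# and Cov(grad f(x*; S)) = 1; the averaged SGM iterate is asymptotically
# N(0, 1/k), the optimal limiting covariance.

rng = np.random.default_rng(0)
x = 5.0                            # deliberately bad initialization
total = 0.0
n = 10_000
for k in range(1, n + 1):
    s = rng.normal()
    x -= k ** -0.6 * (x - s)       # stochastic gradient step, alpha_k = k^(-0.6)
    total += x
xbar = total / n                   # the Polyak-Juditsky averaged iterate

# xbar concentrates on the 1/sqrt(n) ~ 0.01 scale around x* = 0, while the
# final iterate x still fluctuates on the larger sqrt(alpha_n) scale.
```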

  28. Convergence to stationarity in weakly convex cases. Convergence requires the Moreau envelope [Davis & Drusvyatskiy 18]:
  $F_\lambda(x) := \inf_{y \in \mathcal{X}} \left\{ F(y) + \frac{\lambda}{2} \|y - x\|_2^2 \right\}$
  Important properties:
  ◮ Proximal mapping: $x_\lambda := \operatorname{prox}_{F/\lambda}(x) := \operatorname*{argmin}_{y \in \mathcal{X}} \left\{ F(y) + \frac{\lambda}{2} \|y - x\|_2^2 \right\}$ satisfies $\nabla F_\lambda(x) = \lambda (x - x_\lambda)$
  ◮ Near stationarity and decrease: $F(x_\lambda) \le F(x)$ and $\operatorname{dist}(0, \partial F(x_\lambda)) \le \|\nabla F_\lambda(x)\|_2$
  Convergence: say the iterates $x_k$ converge if $\nabla F_\lambda(x_k) \to 0$.

  29. Moreau envelope of the absolute value. For $F(x) = |x|$,
  $F_\lambda(x) = \begin{cases} \frac{\lambda}{2} x^2 & \text{if } |x| \le \lambda^{-1} \\ |x| - \frac{1}{2\lambda} & \text{if } |x| > \lambda^{-1} \end{cases}$
  ◮ $F'_\lambda(x) = \lambda x$ when $|x| \le \lambda^{-1}$
  ◮ so $|F'_\lambda(x)| = \lambda \operatorname{dist}(x, 0)$ there
  ◮ prox step: $x_\lambda = 0$ if $|x| \le 1/\lambda$
  [Figure: $F$ and its envelope $F_\lambda$.]
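The closed form above is easy to sanity-check against a direct grid minimization of $F(y) + \frac{\lambda}{2}(y - x)^2$ (illustrative code):

```python
import numpy as np

# Illustrative check of the closed-form Moreau envelope of F(x) = |x|
# against a direct grid minimization of |y| + (lam / 2) * (y - x)^2.

def envelope_closed_form(x, lam):
    if abs(x) <= 1.0 / lam:
        return 0.5 * lam * x ** 2
    return abs(x) - 1.0 / (2.0 * lam)

def envelope_grid(x, lam, ys):
    return np.min(np.abs(ys) + 0.5 * lam * (ys - x) ** 2)

ys = np.linspace(-5.0, 5.0, 200_001)   # fine grid containing the minimizers
lam = 2.0
for x in (-3.0, -0.2, 0.0, 0.1, 2.5):
    assert np.isclose(envelope_closed_form(x, lam),
                      envelope_grid(x, lam, ys), atol=1e-4)
```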

  31. Convergence in weakly convex cases. Use the regularized stochastic proximal-point method,
  $x_{k+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ f(x; S_k) + \frac{\rho(S_k)}{2} \|x - x_k\|_2^2 + \frac{1}{2 \alpha_k} \|x - x_k\|_2^2 \right\}$.
  Theorem (Asi & D. 19). Let the random $f$ be $\rho(s)$-weakly convex with $\mathbb{E}[\rho^2(S)] < \infty$. With the proximal-point iteration, the iterates $x_k$ satisfy
  $F_\lambda(x_k) \stackrel{a.s.}{\to} G$ and $\sum_{k=1}^\infty \alpha_k \|\nabla F_\lambda(x_k)\|_2^2 < \infty$.
  Proposition (Asi & D. 19). If the iterates $x_k$ remain bounded and the image of the stationary points has measure zero, then $\nabla F_\lambda(x_k) \stackrel{a.s.}{\to} 0$.

  32. What is an easy problem?
  ◮ Interpolation problems [Belkin, Hsu, Mitra 18; Ma, Bassily, Belkin 18]
  ◮ Overparameterized linear systems (Kaczmarz algorithms) [Strohmer & Vershynin 09; Needell, Srebro, Ward 14; Needell & Tropp 14]
  ◮ Random projections for linear constraints [Leventhal & Lewis 10]
  [Figure: (a) MNIST, (b) CIFAR-10, (c) SVHN (4 subsamples).]

  35. What is an easy problem?
  $\operatorname*{minimize}_x \; F(x) := \mathbb{E}[f(x; S)] = \int f(x; s) \, dP(s)$
  Definition: the problem is easy if there exists $x^\star$ such that $f(x^\star; S) = \inf_x f(x; S)$ with probability 1. [Schmidt & Le Roux 13; Ma, Bassily, Belkin 18; Belkin, Rakhlin, Tsybakov 18]
  One additional condition:
  iv. The models $f_x$ satisfy $f_x(y; s) \ge \inf_{x^\star \in \mathcal{X}} f(x^\star; s)$

  37. Easy strongly convex problems.
  Theorem (Asi & D. 18). Let the function $F$ satisfy the growth condition $F(x) \ge F(x^\star) + \frac{\lambda}{2} \operatorname{dist}(x, \mathcal{X}^\star)^2$, where $\mathcal{X}^\star = \operatorname{argmin}_x F(x)$, and be easy. Then
  $\mathbb{E}[\operatorname{dist}(x_k, \mathcal{X}^\star)^2] \le \max\left\{ \exp\left(-c \sum_{i=1}^k \alpha_i\right),\; \exp(-ck) \right\} \operatorname{dist}(x_1, \mathcal{X}^\star)^2$.
  ◮ Adaptive no matter the stepsizes
  ◮ Most other results (e.g. for SGM [Schmidt & Le Roux 13; Ma, Bassily, Belkin 18]) require careful stepsize choices

  42. Sharp convex problems.
  Definition: an objective $F$ is sharp if $F(x) \ge F(x^\star) + \lambda \operatorname{dist}(x, \mathcal{X}^\star)$ for $\mathcal{X}^\star = \operatorname{argmin} F(x)$. [Ferris 88; Burke & Ferris 95]
  ◮ Piecewise linear objectives
  ◮ Hinge loss $F(x) = \frac{1}{m} \sum_{i=1}^m \left[ 1 - a_i^T x \right]_+$
  ◮ Projection onto intersections: $F(x) = \frac{1}{m} \sum_{i=1}^m \operatorname{dist}(x, C_i)$
  Theorem (Asi & D. 18). Let $F$ have sharp growth and be easy. If $F$ is convex,
  $\mathbb{E}[\operatorname{dist}(x_{k+1}, \mathcal{X}^\star)^2] \le \max\left\{ \exp(-ck),\; \exp\left(-c \sum_{i=1}^k \alpha_i\right) \right\} \operatorname{dist}(x_1, \mathcal{X}^\star)^2$.

  44. Sharp weakly convex problems.
  Definition: an objective $F$ is sharp if $F(x) \ge F(x^\star) + \lambda \operatorname{dist}(x, \mathcal{X}^\star)$ for $\mathcal{X}^\star = \operatorname{argmin} F(x)$. [Ferris 88; Burke & Ferris 95]
  ◮ Phase retrieval $F(x) = \frac{1}{m} \left\| (Ax)^2 - (Ax^\star)^2 \right\|_1$
  ◮ Blind deconvolution [Charisopoulos et al. 19]
  Theorem (Asi & D. 19). Let $F$ have sharp growth and be easy. There exists $c \in (0, 1)$ such that on the event $x_k \to \mathcal{X}^\star$,
  $\limsup_k \frac{\operatorname{dist}(x_k, \mathcal{X}^\star)}{(1 - c)^k} < \infty$.

  45. Outline (section transition): Revisiting experimental results

  47. Methods. Iterate
  $x_{k+1} = \operatorname*{argmin}_x \left\{ f_{x_k}(x; S_k) + \frac{1}{2 \alpha_k} \|x - x_k\|_2^2 \right\}$
  ◮ Stochastic gradient: $f_{x_k}(x; S_k) = f(x_k; S_k) + \langle f'(x_k; S_k), x - x_k \rangle$
  ◮ Truncated gradient ($f \ge 0$): $f_{x_k}(x; S_k) = \left[ f(x_k; S_k) + \langle f'(x_k; S_k), x - x_k \rangle \right]_+$
  ◮ (Stochastic) proximal point: $f_{x_k}(x; S_k) = f(x; S_k)$

  48. Linear regression with low noise: $F(x) = \frac{1}{2m} \sum_{i=1}^m (a_i^T x - b_i)^2$. [Figure: time to $\epsilon$-accuracy vs. initial stepsize $\alpha_0$ for SGM, Truncated, and Prox.]

  49. Linear regression with no noise: $F(x) = \frac{1}{2m} \sum_{i=1}^m (a_i^T x - b_i)^2$. [Figure: time to $\epsilon$-accuracy vs. initial stepsize $\alpha_0$ for SGM, Truncated, and Prox.]
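A sketch of the experiment behind these plots (illustrative code and parameters, not the talk's exact setup): on noiseless least squares the truncated step is capped at the Polyak step, so its time to $\epsilon$-accuracy is nearly flat in $\alpha_0$.

```python
import numpy as np

# Illustrative sketch of the stepsize-robustness experiment on noiseless
# (hence "easy") least squares with per-sample losses
# f(x; (a_i, b_i)) = 0.5 * (a_i @ x - b_i)^2 and alpha_k = alpha_0 / sqrt(k).
# The truncated-model step is capped at the Polyak step f / ||g||^2, so its
# time to eps-accuracy barely depends on alpha_0.

def time_to_eps(A, b, alpha0, eps=1e-6, budget=20_000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(A.shape[1])
    for k in range(1, budget + 1):
        i = rng.integers(len(b))
        a, r = A[i], A[i] @ x - b[i]
        fval, g2 = 0.5 * r * r, r * r * (a @ a)        # f(x_k) and ||g||^2
        step = 0.0 if g2 == 0.0 else min(alpha0 / np.sqrt(k), fval / g2)
        x = x - step * r * a
        if np.linalg.norm(A @ x - b) <= eps * np.linalg.norm(b):
            return k
    return budget  # did not reach eps-accuracy within the budget

rng = np.random.default_rng(3)
A = rng.normal(size=(40, 4))
b = A @ rng.normal(size=4)          # noiseless targets: interpolation holds

times = [time_to_eps(A, b, alpha0) for alpha0 in (1.0, 100.0, 10_000.0)]
# The times are nearly identical even though alpha_0 spans four orders of
# magnitude; an uncapped SGM step diverges at the two larger stepsizes.
```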

  51. Linear regression with "poor" conditioning (accuracy $\epsilon = 0.055$). [Figure: time to accuracy vs. initial stepsize for Proximal, SGM, Truncated, and Bundle.] Poor conditioning? $\kappa(A) = 15$.

  52. Multiclass hinge loss, no noise: $f(x; (a, l)) = \max_{i \ne l} \left[ 1 + \langle a, x_i - x_l \rangle \right]_+$. [Figure: time to $\epsilon$-accuracy vs. initial stepsize $\alpha_0$ for SGM, Truncated, and Prox.]

  53. Multiclass hinge loss, small label flipping: $f(x; (a, l)) = \max_{i \ne l} \left[ 1 + \langle a, x_i - x_l \rangle \right]_+$. [Figure: time to $\epsilon$-accuracy vs. initial stepsize $\alpha_0$ for SGM, Truncated, and Prox.]

  54. Multiclass hinge loss, substantial label flipping: $f(x; (a, l)) = \max_{i \ne l} \left[ 1 + \langle a, x_i - x_l \rangle \right]_+$. [Figure: time to $\epsilon$-accuracy vs. initial stepsize $\alpha_0$ for SGM, Truncated, and Prox.]

  56. (Robust) phase retrieval [Candès, Li, Soltanolkotabi 15]. Observations (usually) $b_i = \langle a_i, x^\star \rangle^2$ yield the objective
  $F(x) = \frac{1}{m} \sum_{i=1}^m \left| \langle a_i, x \rangle^2 - b_i \right|$
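This objective fits the prox-linear framework with $h = |\cdot|$ and $c(x) = \langle a, x \rangle^2 - b$; since $c$ is quadratic, the weakly convex model conditions hold with $\rho(s) = 2\|a\|_2^2$. A numerical check (illustrative code, not from the talk):

```python
import numpy as np

# Illustrative check: the prox-linear model for a robust phase retrieval
# term f(x; (a, b)) = |<a, x>^2 - b| = h(c(x)) with h = |.| and
# c(x) = <a, x>^2 - b. Because c is quadratic, the model satisfies the
# weak-convexity lower bound with rho(s) = 2 * ||a||^2.

def f(x, a, b):
    return abs((a @ x) ** 2 - b)

def prox_linear_model(y, x, a, b):
    c = (a @ x) ** 2 - b
    grad_c = 2.0 * (a @ x) * a
    return abs(c + grad_c @ (y - x))

rng = np.random.default_rng(0)
a = rng.normal(size=4)
b = 2.0
rho = 2.0 * (a @ a)
for _ in range(100):
    x, y = rng.normal(size=4), rng.normal(size=4)
    # Local correctness at the base point x:
    assert np.isclose(prox_linear_model(x, x, a, b), f(x, a, b))
    # Weakly convex lower bound with rho = 2 * ||a||^2:
    assert prox_linear_model(y, x, a, b) <= \
        f(y, a, b) + 0.5 * rho * np.sum((x - y) ** 2) + 1e-9
```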

  57. Phase retrieval without noise: $F(x) = \frac{1}{m} \sum_{i=1}^m \left| \langle a_i, x \rangle^2 - b_i \right|$. [Figure: time to $\epsilon$-accuracy vs. initial stepsize $\alpha_0$ for Proximal, SGM, and Truncated.]
