Models in stochastic optimization

Conditions on our models (convex case):
i. Convex model: y ↦ f_x(y; s) is convex
ii. Lower bound: f_x(y; s) ≤ f(y; s)
iii. Local correctness: f_x(x; s) = f(x; s) and ∂f_x(x; s) ⊂ ∂f(x; s)

[D. & Ruan 17; Davis & Drusvyatskiy 18]
Models in stochastic optimization

Conditions on our models (ρ-weakly convex case):
i. Convex model: y ↦ f_x(y; s) is convex
ii. Lower bound: f_x(y; s) ≤ f(y; s) + (ρ(s)/2) ‖x − y‖²₂
iii. Local correctness: f_x(x; s) = f(x; s) and ∂f_x(x; s) ⊂ ∂f(x; s)

[D. & Ruan 17; Davis & Drusvyatskiy 18; Asi & D. 19]
Modeling conditions

Model f_x(y) of f near x. Two examples at a point x₀:
Linear: f_{x₀}(y) = f(x₀) + ∇f(x₀)ᵀ(y − x₀)
Truncated: f_{x₀}(y) = (f(x₀) + ∇f(x₀)ᵀ(y − x₀))₊

[Figure: f with its linear and truncated models at the point (x₀, f(x₀))]
Models in stochastic optimization

[Figure: steps produced from x₀ to x₁ by the linear and truncated models]

i. (Sub)gradient: f_x(y) = f(x) + ⟨f′(x), y − x⟩
ii. Truncated: f_x(y) = (f(x) + ⟨f′(x), y − x⟩) ∨ inf_x f(x)
iii. Bundle/multi-line: f_x(y) = max_i { f(x_i) + ⟨f′(x_i), y − x_i⟩ }
iv. Prox-linear: f_x(y) = h(c(x) + ∇c(x)ᵀ(y − x))
The aProx family

Iterate:
▶ Sample S_k ∼ P (i.i.d.)
▶ Update by minimizing the model:
  x_{k+1} = argmin_{x∈X} { f_{x_k}(x; S_k) + 1/(2α_k) ‖x − x_k‖² }
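The iteration above is well defined whenever each model minimization is tractable. A minimal sketch of the generic loop, assuming the caller supplies a routine that solves the model subproblem in closed form (the names `aprox`, `linear_model_step` are ours, not from the talk):

```python
import itertools
import numpy as np

def aprox(model_step, sample, x0, stepsizes):
    """Generic aProx iteration:
    x_{k+1} = argmin_x f_{x_k}(x; S_k) + ||x - x_k||^2 / (2 alpha_k),
    where model_step(x, s, alpha) returns that argmin for the chosen model."""
    x = np.asarray(x0, dtype=float)
    for alpha in stepsizes:
        x = model_step(x, sample(), alpha)
    return x

def linear_model_step(subgrad):
    """The linear model f_x(y) = f(x) + <f'(x), y - x> recovers plain SGM."""
    def step(x, s, alpha):
        return x - alpha * subgrad(x, s)
    return step

# Usage: noiseless least squares f(x; (a, b)) = (a@x - b)^2 / 2 in two dimensions.
samples = itertools.cycle([(np.array([1.0, 0.0]), 0.0),
                           (np.array([0.0, 1.0]), 0.0)])
step = linear_model_step(lambda x, s: (s[0] @ x - s[1]) * s[0])
x = aprox(step, lambda: next(samples), [1.0, 1.0], [0.5] * 50)
```

Swapping `model_step` between the linear, truncated, and proximal-point subproblem solvers gives the different members of the family without touching the loop.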
Outline

▶ Motivating experiments
▶ Models in optimization
▶ Stochastic optimization
▶ Stability is better
▶ Nothing gets worse
▶ Beyond convexity
▶ Adaptivity in easy problems
▶ Revisiting experimental results
Divergence of a gradient method

[Figure: animation of stochastic gradient iterates diverging]
Stability guarantees (convex)

Use the full stochastic proximal method,
  x_{k+1} = argmin_{x∈X} { f(x; S_k) + 1/(2α_k) ‖x − x_k‖² }.

Theorem (Asi & D. 18). Assume X⋆ = argmin_{x∈X} F(x) is non-empty and E[‖f′(x⋆; S)‖²] ≤ σ². Then
  E[dist(x_k, X⋆)²] ≤ dist(x₀, X⋆)² + σ² ∑_{i=1}^k α_i².

Theorem (Asi & D. 18). Under the same assumptions,
  sup_k dist(x_k, X⋆) < ∞ and dist(x_k, X⋆) → 0 almost surely.
Stability guarantees (convex)

Use any model with f_x(y; s) ≥ inf_z f(z; s) (i.e. a good lower bound):
  x_{k+1} = argmin_{x∈X} { f_{x_k}(x; S_k) + 1/(2α_k) ‖x − x_k‖² }.

Theorem (Asi & D. 19). Assume X⋆ = argmin_{x∈X} F(x) is non-empty and there exists p < ∞ such that
  E[‖f′(x; S)‖²] ≤ C (1 + dist(x, X⋆)^p).
Then
  sup_k dist(x_k, X⋆) < ∞ and dist(x_k, X⋆) → 0 almost surely.
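A one-dimensional illustration of why the lower bound buys stability (the numbers are ours, chosen for the example): on f(x) = x²/2 with stepsize α = 3, the plain subgradient step multiplies the iterate by 1 − α = −2, while the truncated model (f(x_k) + f′(x_k)(y − x_k))₊ caps the effective stepsize at f(x_k)/f′(x_k)² = 1/2.

```python
# Noiseless 1-D least squares f(x) = 0.5 * x**2, aggressive stepsize alpha = 3.
alpha = 3.0
x_sgm = x_trunc = 1.0
for _ in range(20):
    x_sgm -= alpha * x_sgm                # x <- (1 - alpha) x = -2x: diverges
    # truncated model caps the stepsize at f(x)/f'(x)^2 = 1/2: x <- x/2
    x_trunc -= min(alpha, 0.5) * x_trunc
```

After 20 steps the SGM iterate has blown up geometrically while the truncated-model iterate has contracted toward the minimizer, with the same nominal stepsize.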
Example behaviors

On the least-squares objective F(x) = (1/(2m)) ∑_{i=1}^m (a_iᵀx − b_i)²:

[Figure: objective value over 500 iterations for SGM and the proximal method]
Classical asymptotic analysis

Theorem (Polyak & Juditsky 92). Let F be convex and strongly convex in a neighborhood of x⋆, and assume the f(x; S) are globally smooth. For x_k generated by the stochastic gradient method,
  (1/√k) ∑_{i=1}^k (x_i − x⋆) →ᵈ N(0, ∇²F(x⋆)⁻¹ Cov(∇f(x⋆; S)) ∇²F(x⋆)⁻¹).
New asymptotic analysis (convex case)

Theorem (Asi & D. 18). Let F be convex and strongly convex in a neighborhood of x⋆, and assume the f(x; S) are smooth near x⋆. If the x_k remain bounded and the models f_{x_k}(·; S_k) satisfy our conditions, then
  (1/√k) ∑_{i=1}^k (x_i − x⋆) →ᵈ N(0, ∇²F(x⋆)⁻¹ Cov(∇f(x⋆; S)) ∇²F(x⋆)⁻¹).

▶ Optimal by the local minimax theorem [Hájek 72; Le Cam 73; D. & Ruan 19]
▶ Key insight: subgradients of f_{x_k}(·; S_k) are close to ∇f(x_k; S_k)
Convergence to stationarity in weakly convex cases

Convergence requires the Moreau envelope [Davis & Drusvyatskiy 18]:
  F_λ(x) := inf_{y∈X} { F(y) + (λ/2) ‖y − x‖²₂ }

Important properties:
▶ Proximal mapping:
  x_λ := prox_{F/λ}(x) := argmin_{y∈X} { F(y) + (λ/2) ‖y − x‖²₂ }
  satisfies ∇F_λ(x) = λ(x − x_λ)
▶ Near-stationarity and decrease:
  F(x_λ) ≤ F(x) and dist(0, ∂F(x_λ)) ≤ ‖∇F_λ(x)‖₂

Convergence: say the iterates x_k converge if ∇F_λ(x_k) → 0
Moreau envelope of the absolute value

For F(x) = |x|,
  F_λ(x) = (λ/2) x²  if |x| ≤ 1/λ,
  F_λ(x) = |x| − 1/(2λ)  if |x| > 1/λ.

▶ For |x| ≤ 1/λ the prox step gives x_λ = 0, so F′_λ(x) = λx and |F′_λ(x)| = λ dist(x, 0)

[Figure: F and its Moreau envelope F_λ]
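These formulas are easy to verify numerically, including the identity ∇F_λ(x) = λ(x − x_λ) from the previous slide. A small sketch (function names are ours):

```python
import numpy as np

def moreau_abs(x, lam):
    """Moreau envelope of F(y) = |y|: quadratic within 1/lam of 0, shifted linear outside."""
    return 0.5 * lam * x**2 if abs(x) <= 1.0 / lam else abs(x) - 1.0 / (2.0 * lam)

def prox_abs(x, lam):
    """x_lam = argmin_y |y| + (lam/2)(y - x)^2: soft-thresholding at 1/lam."""
    return np.sign(x) * max(abs(x) - 1.0 / lam, 0.0)

# Check grad F_lam(x) = lam * (x - x_lam) by a central finite difference.
lam, x, h = 2.0, 1.7, 1e-6
num_grad = (moreau_abs(x + h, lam) - moreau_abs(x - h, lam)) / (2 * h)
```

With λ = 2 the threshold is 1/λ = 0.5: points inside it are mapped to 0 by the prox, and outside it the envelope's slope is ±1, matching λ(x − x_λ).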
Convergence in weakly convex cases

Use the regularized stochastic proximal-point method,
  x_{k+1} = argmin_{x∈X} { f(x; S_k) + (ρ(S_k)/2) ‖x − x_k‖²₂ + 1/(2α_k) ‖x − x_k‖²₂ }.

Theorem (Asi & D. 19). Let the random f be ρ(s)-weakly convex with E[ρ²(S)] < ∞. With the proximal-point iteration, the iterates x_k satisfy
  F_λ(x_k) → G almost surely and ∑_{k=1}^∞ α_k ‖∇F_λ(x_k)‖²₂ < ∞.

Proposition (Asi & D. 19). If the iterates x_k remain bounded and the image of the stationary points has measure zero, then ∇F_λ(x_k) → 0 almost surely.
What is an easy problem?

▶ Interpolation problems [Belkin, Hsu, Mitra 18; Ma, Bassily, Belkin 18]
▶ Overparameterized linear systems (Kaczmarz algorithms) [Strohmer & Vershynin 09; Needell, Srebro, Ward 14; Needell & Tropp 14]
▶ Random projections for linear constraints [Leventhal & Lewis 10]

[Figure: results on (a) MNIST, (b) CIFAR-10, (c) SVHN (4 subsamples)]
What is an easy problem?

  minimize_x F(x) := E[f(x; S)] = ∫ f(x; s) dP(s)

Definition: the problem is easy if there exists x⋆ such that f(x⋆; S) = inf_x f(x; S) with probability 1. [Schmidt & Le Roux 13; Ma, Bassily, Belkin 18; Belkin, Rakhlin, Tsybakov 18]

One additional condition:
iv. The models f_x satisfy f_x(y; s) ≥ inf_{x⋆∈X} f(x⋆; s)
Easy strongly convex problems

Theorem (Asi & D. 18). Let F be easy and satisfy the growth condition
  F(x) ≥ F(x⋆) + (λ/2) dist(x, X⋆)², where X⋆ = argmin_x F(x).
Then
  E[dist(x_k, X⋆)²] ≤ max{ exp(−c ∑_{i=1}^k α_i), exp(−ck) } · dist(x₁, X⋆)².

▶ Adaptive no matter the stepsizes
▶ Most other results (e.g. for SGM [Schmidt & Le Roux 13; Ma, Bassily, Belkin 18]) require careful stepsize choices
Sharp convex problems

Definition: an objective F is sharp if F(x) ≥ F(x⋆) + λ dist(x, X⋆) for X⋆ = argmin F(x). [Ferris 88; Burke & Ferris 95]

▶ Piecewise linear objectives
▶ Hinge loss F(x) = (1/m) ∑_{i=1}^m [1 − a_iᵀx]₊
▶ Projection onto intersections: F(x) = (1/m) ∑_{i=1}^m dist(x, C_i)

Theorem (Asi & D. 18). Let F have sharp growth and be easy. If F is convex,
  E[dist(x_{k+1}, X⋆)²] ≤ max{ exp(−ck), exp(−c ∑_{i=1}^k α_i) } · dist(x₁, X⋆)².
Sharp weakly convex problems

Definition: an objective F is sharp if F(x) ≥ F(x⋆) + λ dist(x, X⋆) for X⋆ = argmin F(x). [Ferris 88; Burke & Ferris 95]

▶ Phase retrieval F(x) = (1/m) ‖(Ax)² − (Ax⋆)²‖₁
▶ Blind deconvolution [Charisopoulos et al. 19]

Theorem (Asi & D. 19). Let F have sharp growth and be easy. There exists c ∈ (0, 1) such that, on the event x_k → X⋆,
  lim sup_k dist(x_k, X⋆) / (1 − c)^k < ∞.
Outline

▶ Motivating experiments
▶ Models in optimization
▶ Stochastic optimization
▶ Stability is better
▶ Nothing gets worse
▶ Beyond convexity
▶ Adaptivity in easy problems
▶ Revisiting experimental results
Methods

Iterate
  x_{k+1} = argmin_x { f_{x_k}(x; S_k) + 1/(2α_k) ‖x − x_k‖²₂ }

▶ Stochastic gradient: f_{x_k}(x; S_k) = f(x_k; S_k) + ⟨f′(x_k; S_k), x − x_k⟩
▶ Truncated gradient (f ≥ 0): f_{x_k}(x; S_k) = [f(x_k; S_k) + ⟨f′(x_k; S_k), x − x_k⟩]₊
▶ (Stochastic) proximal point: f_{x_k}(x; S_k) = f(x; S_k)
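For least squares f(x; (a, b)) = (aᵀx − b)²/2 all three updates have closed forms, which makes the comparison in the following experiments concrete. A sketch (function names are ours; the derivations are standard):

```python
import numpy as np

def sgm_step(x, a, b, alpha):
    """Stochastic gradient step for f(x; (a, b)) = 0.5 * (a@x - b)**2."""
    return x - alpha * (a @ x - b) * a

def truncated_step(x, a, b, alpha):
    """Minimizes [f + <g, y - x>]_+ + ||y - x||^2/(2 alpha):
    same direction as SGM, but the step length is capped at f(x)/||g||^2."""
    r = a @ x - b
    f, g = 0.5 * r * r, r * a
    gg = g @ g
    return x if gg == 0.0 else x - min(alpha, f / gg) * g

def prox_point_step(x, a, b, alpha):
    """Exact proximal-point step: argmin_y f(y) + ||y - x||^2/(2 alpha)."""
    return x - (alpha / (1.0 + alpha * (a @ a))) * (a @ x - b) * a
```

All three move along the same direction −(aᵀx − b)a; they differ only in the effective stepsize, which is why the truncated and proximal variants stay stable at large α while SGM does not.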
Linear regression with low noise

  F(x) = (1/(2m)) ∑_{i=1}^m (a_iᵀx − b_i)²

[Figure: time to ε-accuracy vs. initial stepsize α₀ for SGM, truncated, and proximal methods]
Linear regression with no noise

  F(x) = (1/(2m)) ∑_{i=1}^m (a_iᵀx − b_i)²

[Figure: time to ε-accuracy vs. initial stepsize α₀ for SGM, truncated, and proximal methods]
Linear regression with “poor” conditioning

[Figure: time to accuracy ε = 0.055 vs. initial stepsize for proximal, SGM, truncated, and bundle methods]

Poor conditioning? κ(A) = 15
Multiclass hinge loss: no noise

  f(x; (a, l)) = max_{i≠l} [1 + ⟨a, x_i − x_l⟩]₊

[Figure: time to ε-accuracy vs. initial stepsize α₀ for SGM, truncated, and proximal methods]
Multiclass hinge loss: small label flipping

  f(x; (a, l)) = max_{i≠l} [1 + ⟨a, x_i − x_l⟩]₊

[Figure: time to ε-accuracy vs. initial stepsize α₀ for SGM, truncated, and proximal methods]
Multiclass hinge loss: substantial label flipping

  f(x; (a, l)) = max_{i≠l} [1 + ⟨a, x_i − x_l⟩]₊

[Figure: time to ε-accuracy vs. initial stepsize α₀ for SGM, truncated, and proximal methods]
(Robust) phase retrieval [Candès, Li, Soltanolkotabi 15]

Observations (usually) b_i = ⟨a_i, x⋆⟩² yield the objective
  f(x) = (1/m) ∑_{i=1}^m |⟨a_i, x⟩² − b_i|
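For this objective the stochastic prox-linear update has a closed form: the convex model f_x(y) = |⟨a, x⟩² − b + 2⟨a, x⟩⟨a, y − x⟩| plus the quadratic is minimized by a capped step along the linearization's gradient. A sketch (our naming, under the stated model):

```python
import numpy as np

def prox_linear_step(x, a, b, alpha):
    """One stochastic prox-linear step for f(x; (a, b)) = |<a,x>^2 - b|.
    Minimizes |c + <g, y - x>| + ||y - x||^2 / (2 alpha), where
    c = <a,x>^2 - b is the current residual and g = 2<a,x> a the
    gradient of the inner linearization."""
    c = (a @ x) ** 2 - b
    g = 2.0 * (a @ x) * a
    gg = g @ g
    if gg == 0.0:
        return x.copy()
    # Step length: alpha, unless the model reaches zero first at |c|/||g||^2.
    return x - np.sign(c) * min(alpha, abs(c) / gg) * g
```

When α is large the step lands exactly where the model hits zero, which is what drives the linear convergence on easy sharp problems from the earlier slides.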
Phase retrieval without noise

  F(x) = (1/m) ∑_{i=1}^m |⟨a_i, x⟩² − b_i|

[Figure: time to ε-accuracy vs. initial stepsize α₀ for proximal, SGM, and truncated methods]