Block stochastic gradient update method

1. Block stochastic gradient update method

   Yangyang Xu∗ and Wotao Yin†
   ∗ IMA, University of Minnesota
   † Department of Mathematics, UCLA
   November 1, 2015
   (This work was done while at Rice University.)

2. Stochastic gradient method

   Consider the stochastic program
       $\min_{x \in X} F(x) = \mathbb{E}_\xi f(x; \xi)$.
   Stochastic gradient update (SG):
       $x^{k+1} = \mathcal{P}_X\big(x^k - \alpha_k \tilde{g}^k\big)$
   • $\tilde{g}^k$ is a stochastic gradient, often $\mathbb{E}[\tilde{g}^k] \in \partial F(x^k)$
   • Originally for stochastic problems where the exact gradient is not available
   • Now also popular for deterministic problems where the exact gradient is expensive, e.g., $F(x) = \frac{1}{N}\sum_{i=1}^N f_i(x)$ with large $N$
   • Faster than deterministic gradient methods for reaching moderate accuracy
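
   A minimal Python/NumPy sketch of this update, assuming the constraint set X is a box so the projection is a simple clip (the box and the oracle interface are illustrative choices, not the talk's setup):

       import numpy as np

       def sg_step(x, g_tilde, alpha, lower=-1.0, upper=1.0):
           """One SG update x^{k+1} = P_X(x^k - alpha_k * g~^k), with X an illustrative box [lower, upper]^n."""
           return np.clip(x - alpha * g_tilde, lower, upper)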

3. Stochastic gradient method

   • First appears in [Robbins-Monro'51]; now a vast literature
   • $O(1/\sqrt{k})$ rate for weakly convex problems and $O(1/k)$ for strongly convex problems (e.g., [Nemirovski et al.'09])
   • For deterministic problems, linear convergence is possible if the exact gradient is allowed periodically [Xiao-Zhang'14]
   • Convergence in terms of a first-order optimality condition for nonconvex problems [Ghadimi-Lan'13]

4. Block gradient descent

   Consider the problem
       $\min_x F(x) = f(x_1, \ldots, x_s) + \sum_{i=1}^s r_i(x_i)$.
   • $f$ is smooth
   • $r_i$'s are possibly nonsmooth and extended-valued
   Block gradient update (BGD):
       $x_{i_k}^{k+1} = \arg\min_{x_{i_k}} \langle \nabla_{i_k} f(x^k), x_{i_k} - x_{i_k}^k \rangle + \frac{1}{2\alpha_k}\|x_{i_k} - x_{i_k}^k\|^2 + r_{i_k}(x_{i_k})$
   • Simpler than classic block coordinate descent, i.e., block minimization
   • Allows different ways to choose $i_k$: cyclically, greedily, or randomly
   • Low iteration complexity
   • Larger stepsize than full gradient and often faster convergence
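
   The argmin above is a proximal (prox-linear) step on one block. A minimal sketch, assuming $r_i = \lambda\|\cdot\|_1$ so the prox is soft-thresholding; the block layout and all names are illustrative:

       import numpy as np

       def bgd_step(x, blocks, i_k, grad_block, alpha, lam):
           """One BGD update on block i_k: x_{i_k} <- prox_{alpha r_{i_k}}(x_{i_k} - alpha * grad_{i_k} f(x))."""
           idx = blocks[i_k]                         # coordinates belonging to block i_k
           z = x[idx] - alpha * grad_block(x, i_k)   # gradient step on this block only
           x[idx] = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)  # soft-threshold (l1 prox)
           return x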

5. Block gradient descent

   • First appears in [Tseng-Yun'09]; famous since [Nesterov'12]
   • Both cyclic and randomized selection: $O(1/k)$ for weakly convex problems and linear convergence for strongly convex problems (e.g., [Hong et al.'15])
   • The cyclic version is harder to analyze than the random or greedy versions
   • Subsequence convergence for nonconvex problems, and whole-sequence convergence if a certain local property holds (e.g., [X.-Yin'14])

6. Stochastic programming with block structure

   Consider the problem
       $\min_x \Phi(x) = \mathbb{E}_\xi f(x_1, \ldots, x_s; \xi) + \sum_{i=1}^s r_i(x_i)$   (BSP)
   • Example: tensor regression [Zhou-Li-Zhu'13]
       $\min_{X_1, \ldots, X_s} \mathbb{E}[\ell(X_1 \circ \cdots \circ X_s; A, b)]$
   This talk presents an algorithm for (BSP) with the following properties:
   • Only requires stochastic block gradients
   • Simple update and low computational complexity
   • Guaranteed convergence
   • Optimal convergence rate if the problem is convex

7. How and why

   • Use stochastic partial gradients in BGD
     • exact partial gradients are unavailable or expensive
     • the plain stochastic gradient works but does not perform as well
   • Random cyclic selection, i.e., shuffle and then cycle
     • random shuffling gives faster convergence and more stable performance [Chang-Hsieh-Lin'08]
     • cyclic selection gives lower computational complexity, but the analysis is more difficult

8. Block stochastic gradient method

   At each iteration/cycle $k$:
   1. Sample one function or a batch of functions
   2. Randomly shuffle the blocks to $(k_1, \ldots, k_s)$
   3. For $i = 1$ through $s$, do
       $x_{k_i}^{k+1} = \arg\min_{x_{k_i}} \langle \tilde{g}_{k_i}^k, x_{k_i} - x_{k_i}^k \rangle + \frac{1}{2\alpha_{k_i}^k}\|x_{k_i} - x_{k_i}^k\|^2 + r_{k_i}(x_{k_i})$
   • $\tilde{g}_{k_i}^k$ is a stochastic partial gradient, depending on the sampled functions and the intermediate point $(x_{k_{<i}}^{k+1}, x_{k_{\ge i}}^k)$
   • possibly a biased estimate, i.e., $\mathbb{E}[\tilde{g}_{k_i}^k - \nabla_{k_i} F(x_{k_{<i}}^{k+1}, x_{k_{\ge i}}^k)] \neq 0$, where $F(x) = \mathbb{E}_\xi f(x; \xi)$
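
   A compact sketch of one such cycle, written against generic oracles; the oracle names (sample_batch, partial_grad, prox) and the single shared stepsize are illustrative choices, not the talk's exact interface:

       import numpy as np

       def bsg_cycle(x, blocks, sample_batch, partial_grad, prox, alpha, rng):
           """One iteration/cycle of block stochastic gradient.

           blocks[i]                   -- index array of the i-th block of x
           sample_batch(rng)           -- step 1: sample one function or a batch
           partial_grad(x, idx, batch) -- stochastic partial gradient for block idx at the
                                          current point, so later blocks see earlier updates
           prox(z, idx, alpha)         -- proximal map of alpha * r_i on block idx
           """
           batch = sample_batch(rng)                  # 1. sample
           for i in rng.permutation(len(blocks)):     # 2.-3. shuffle, then cycle through blocks
               idx = blocks[i]
               g = partial_grad(x, idx, batch)
               x[idx] = prox(x[idx] - alpha * g, idx, alpha)   # block prox-gradient step
           return x

   With $r \equiv 0$ the prox is the identity and the step reduces to a plain block stochastic gradient update, as in the least-squares experiment later.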

9. Pros and cons of cyclic selection

   Pros:
   • lower computational complexity, e.g., for $\Phi(x) = \mathbb{E}_{(a,b)}(a^\top x - b)^2$ with $x \in \mathbb{R}^n$:
     • cyclic selection takes about $2n$ operations to update all coordinates once
     • random selection takes $n$ operations to update a single coordinate
   • Gauss-Seidel-type fast convergence (see numerical results later)
   Cons:
   • biased stochastic partial gradients
   • which make the analysis more difficult

10. Literature

   Just a few papers so far:
   • [Liu-Wright, arXiv'14]: an asynchronous parallel randomized Kaczmarz algorithm
   • [Dang-Lan, SIOPT'15]: stochastic block mirror descent methods for nonsmooth and stochastic optimization
   • [Zhao et al., NIPS'14]: accelerated mini-batch randomized block coordinate descent method
   • [Wang-Banerjee, arXiv'14]: randomized block coordinate descent for online and stochastic optimization
   • [Hua-Kadomoto-Yamashita, OptOnline'15]: regret analysis of block coordinate gradient methods for online convex programming

11. Assumptions

   Recall $F(x) = \mathbb{E}_\xi f(x; \xi)$. Let $\delta_i^k = \tilde{g}_i^k - \nabla_{x_i} F(x_{<i}^{k+1}, x_{\ge i}^k)$.
   Error bound of the stochastic partial gradient:
       $\|\mathbb{E}[\delta_i^k \mid x^{k-1}]\| \le A \cdot \max_j \alpha_j^k, \quad \forall i, k \quad (A = 0 \text{ if unbiased})$
       $\mathbb{E}\|\delta_i^k\|^2 \le \sigma_k^2 \le \sigma^2, \quad \forall i, k.$
   Lipschitz continuous partial gradient:
       $\|\nabla_{x_i} F(x + (0, \ldots, d_i, \ldots, 0)) - \nabla_{x_i} F(x)\| \le L_i \|d_i\|, \quad \forall i, \ \forall x, d.$

12. Convergence of block stochastic gradient

   • Convex case: $F$ and the $r_i$'s are convex. Take $\alpha_i^k = \alpha_k = O(1/\sqrt{k})$, $\forall i, k$, and let
       $\tilde{x}^k = \dfrac{\sum_{\kappa=1}^k \alpha_\kappa x^{\kappa+1}}{\sum_{\kappa=1}^k \alpha_\kappa}$.
     Then
       $\mathbb{E}[\Phi(\tilde{x}^k) - \Phi(x^*)] \le O(\log k / \sqrt{k})$.
   • Can be improved to $O(1/\sqrt{k})$ if the number of iterations is known in advance, thus achieving the optimal order of rate
   • Strongly convex case: $\Phi$ is strongly convex. Take $\alpha_i^k = \alpha_k = O(1/k)$, $\forall i, k$. Then
       $\mathbb{E}\|x^k - x^*\|^2 \le O(1/k)$.
   • Again, the optimal order of rate is achieved
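
   The averaged iterate $\tilde{x}^k$ can be maintained incrementally without storing past iterates; a tiny sketch (names are illustrative):

       def update_weighted_average(x_tilde, weight_sum, x_new, alpha):
           """Running weighted average x~^k = (sum_K alpha_K x^{K+1}) / (sum_K alpha_K)."""
           weight_sum += alpha
           x_tilde = x_tilde + (alpha / weight_sum) * (x_new - x_tilde)
           return x_tilde, weight_sum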

13. Convergence of block stochastic gradient

   • Unconstrained smooth nonconvex case: If $\{\alpha_i^k\}$ is taken such that
       $\sum_{k=1}^\infty \alpha_i^k = \infty, \quad \sum_{k=1}^\infty (\alpha_i^k)^2 < \infty, \quad \forall i,$
     then
       $\lim_{k\to\infty} \mathbb{E}\|\nabla \Phi(x^k)\| = 0$.
   • Nonsmooth nonconvex case: If $\alpha_i^k < \frac{2}{L_i}$, $\forall i, k$, and $\sum_{k=1}^\infty \sigma_k^2 < \infty$, then
       $\lim_{k\to\infty} \mathbb{E}\big[\mathrm{dist}(0, \partial \Phi(x^k))\big] = 0$.

14. Numerical experiments

   Tested problems:
   • Stochastic least squares
   • Linear logistic regression
   • Bilinear logistic regression
   • Low-rank tensor recovery from Gaussian measurements
   Tested methods:
   • block stochastic gradient (BSG) [proposed]
   • block gradient descent (deterministic)
   • stochastic gradient method (SG)
   • stochastic block mirror descent (SBMD) [Dang-Lan'15]

15. Stochastic least squares

   Consider $\min_x \mathbb{E}_{(a,b)} \frac{1}{2}(a^\top x - b)^2$
   • $a \sim N(0, I)$, $b = a^\top \hat{x} + \eta$, and $\eta \sim N(0, 0.01)$
   • $\hat{x}$ is the optimal solution, and the minimum value is $0.005$
   • $\{(a_i, b_i)\}$ are observed sequentially from $i = 1$ to $N$
   • The deterministic (partial) gradient is unavailable
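
   A minimal NumPy sketch of BSG on this problem with one coordinate per block, matching the setup above; the $1/\sqrt{k}$ stepsize and the running-residual trick (which keeps one shuffled pass at roughly $2n$ operations, as noted on the cyclic-selection slide) are illustrative choices:

       import numpy as np

       rng = np.random.default_rng(0)
       n, N = 200, 10000
       x_hat = rng.standard_normal(n)                          # ground-truth solution x^
       A = rng.standard_normal((N, n))                         # rows a_i ~ N(0, I), read sequentially
       b = A @ x_hat + np.sqrt(0.01) * rng.standard_normal(N)  # b_i = a_i^T x^ + eta, eta ~ N(0, 0.01)

       x = np.zeros(n)
       for k in range(N):                                      # one sample per iteration/cycle
           a, bk = A[k], b[k]
           alpha = 1.0 / np.sqrt(k + 1)                        # O(1/sqrt(k)) stepsize (illustrative)
           r = a @ x - bk                                      # residual a^T x - b at the cycle start
           for j in rng.permutation(n):                        # shuffled cyclic pass over coordinates
               step = -alpha * r * a[j]                        # stochastic partial gradient is r * a_j
               x[j] += step                                    # no regularizer, so the prox is the identity
               r += a[j] * step                                # keep the residual consistent (Gauss-Seidel)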

16. Objective values by different methods

   N samples   BSG       SG        SBMD-10   SBMD-50   SBMD-100
   4000        6.45e-3   6.03e-3   67.49     4.79      1.03e-1
   6000        5.69e-3   5.79e-3   53.84     1.43      1.43e-2
   8000        5.57e-3   5.65e-3   42.98     4.92e-1   6.70e-3
   10000       5.53e-3   5.58e-3   35.71     2.09e-1   5.74e-3

   • One coordinate as one block
   • SBMD-t: SBMD with t coordinates selected at each update
   • Objective evaluated on another 100,000 samples
   • Each update of all methods costs $O(n)$
   Observation: it is better to update more coordinates, and BSG performs best.

17. Logistic regression

   $\min_{w, b} \frac{1}{N}\sum_{i=1}^N \log\big(1 + \exp(-y_i (x_i^\top w + b))\big)$   (LR)

   • Training samples $\{(x_i, y_i)\}_{i=1}^N$ with $y_i \in \{-1, +1\}$
   • Deterministic problem, but the exact gradient is expensive for large $N$
   • Stochastic gradient is faster than deterministic gradient for moderate accuracy
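
   A small sketch of the mini-batch stochastic gradient of (LR) that such methods would draw at each step (function and variable names are illustrative, and the batch arrays are assumed to be NumPy arrays):

       import numpy as np

       def lr_stoch_grad(w, b, X_batch, y_batch):
           """Mini-batch stochastic gradient of (1/N) sum_i log(1 + exp(-y_i (x_i^T w + b)))."""
           z = -y_batch * (X_batch @ w + b)                 # per-sample margin term
           s = -y_batch / (1.0 + np.exp(-z))                # derivative of log(1 + e^z) w.r.t. (x^T w + b)
           return X_batch.T @ s / len(y_batch), s.mean()    # (gradient w.r.t. w, gradient w.r.t. b)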

18. Performance of different methods on logistic regression

   [Figure: objective minus optimal value vs. epochs (log scale); (a) random dataset, comparing BSG, SBMD-10, SBMD-50, SBMD-100, and SG; (b) gisette dataset, comparing BSG, SBMD-1k, SBMD-3k, and SG]

   • Random dataset: 2,000 Gaussian random samples of dimension 200
   • gisette dataset: 6,000 samples of dimension 5,000, from the LIBSVM datasets
   Observation: BSG gives the best performance among the compared methods.

19. Low-rank tensor recovery from Gaussian measurements

   $\min_X \frac{1}{2N}\sum_{\ell=1}^N \big(\mathcal{A}_\ell(X_1 \circ X_2 \circ X_3) - b_\ell\big)^2$   (LRTR)

   • $b_\ell = \mathcal{A}_\ell(M) = \langle G_\ell, M \rangle$ with $G_\ell \sim N(0, I)$, $\forall \ell$
   • $G_\ell$'s are dense
   • For large $N$, all the $G_\ell$ may not fit in memory, even for medium-sized $G_\ell$
   • Deterministic problem, but the exact gradient is too expensive for large $N$
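
   A small NumPy sketch of the (LRTR) objective on a mini-batch, assuming $X_1 \circ X_2 \circ X_3$ denotes the outer-product (CP) tensor built from three factor matrices, as in the tensor-regression example; all names are illustrative:

       import numpy as np

       def cp_tensor(X1, X2, X3):
           """Rank-r outer-product tensor: sum_t X1[:, t] (x) X2[:, t] (x) X3[:, t]."""
           return np.einsum('it,jt,kt->ijk', X1, X2, X3)

       def lrtr_batch_objective(X1, X2, X3, G_batch, b_batch):
           """Mini-batch value of (1/(2m)) sum_l (<G_l, X1 o X2 o X3> - b_l)^2."""
           M = cp_tensor(X1, X2, X3)
           preds = np.einsum('lijk,ijk->l', G_batch, M)   # A_l(M) = <G_l, M> for each sampled G_l
           return 0.5 * np.mean((preds - b_batch) ** 2)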

20. Performance of block deterministic and stochastic gradient

   [Figure: objective vs. epochs (log scale) for BSG and BCGD; original (top) and recovered (bottom) tensors by BSG after 50 epochs]

   • BCGD: block deterministic gradient; BSG: block stochastic gradient
   • $G_\ell \in \mathbb{R}^{32 \times 32 \times 32}$ and $N = 15{,}000$
   • $G_\ell \in \mathbb{R}^{60 \times 60 \times 60}$ and $N = 40{,}000$
   Observation: BSG is faster, while BCGD gets trapped at a bad local solution.

21. Bilinear logistic regression

   $\min_{U, V, b} \frac{1}{N}\sum_{i=1}^N \log\big(1 + \exp(-y_i (\langle UV^\top, X_i \rangle + b))\big)$   (BLR)

   • Training samples $\{(X_i, y_i)\}_{i=1}^N$ with $y_i \in \{-1, +1\}$
   • Better than linear logistic regression for 2D datasets [Dyrholm et al.'07]
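
   A sketch of the mini-batch stochastic partial gradients of (BLR) over its natural blocks $U$, $V$, and $b$, which is what a block stochastic gradient method would cycle over here; the names and batch interface are illustrative, and the batch labels are assumed to be a NumPy array:

       import numpy as np

       def blr_stoch_partial_grads(U, V, b, X_batch, y_batch):
           """Mini-batch stochastic partial gradients of (BLR) w.r.t. the blocks U, V, b.

           Uses <UV^T, X_i> and the same per-sample weight s_i as in the (LR) sketch.
           """
           z = np.array([-y * (np.sum((U @ V.T) * X) + b) for X, y in zip(X_batch, y_batch)])
           s = -y_batch / (1.0 + np.exp(-z))                             # per-sample scalar weights
           gU = sum(si * X @ V for si, X in zip(s, X_batch)) / len(s)    # d<UV^T, X>/dU = X V
           gV = sum(si * X.T @ U for si, X in zip(s, X_batch)) / len(s)  # d<UV^T, X>/dV = X^T U
           return gU, gV, s.mean()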
