Block stochastic gradient update method

Yangyang Xu* and Wotao Yin†

*IMA, University of Minnesota
†Department of Mathematics, UCLA

November 1, 2015

This work was done while at Rice University.
Stochastic gradient method

Consider the stochastic program

    min_{x ∈ X} F(x) = E_ξ f(x; ξ).

Stochastic gradient update (SG):

    x^{k+1} = P_X( x^k − α_k g̃^k )

• g̃^k is a stochastic gradient, often with E[g̃^k] ∈ ∂F(x^k)
• Originally designed for stochastic problems where the exact gradient is not available
• Now also popular for deterministic problems where the exact gradient is expensive; e.g., F(x) = (1/N) Σ_{i=1}^N f_i(x) with large N
• Faster than deterministic gradient methods for reaching low-to-moderate accuracy
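A minimal sketch of one SG iteration in Python, assuming two hypothetical user-supplied helpers: `sample_grad` (returns a stochastic gradient at x) and `project_X` (Euclidean projection onto the feasible set X):

```python
import numpy as np

def sg_step(x, k, sample_grad, project_X, alpha0=1.0):
    """One projected stochastic gradient step with alpha_k = alpha0 / sqrt(k)."""
    alpha_k = alpha0 / np.sqrt(k)          # diminishing stepsize
    g = sample_grad(x)                     # stochastic gradient, E[g] in dF(x)
    return project_X(x - alpha_k * g)      # project back onto the feasible set X
```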
Stochastic gradient method

• First appears in [Robbins-Monro'51]; now a vast literature
• O(1/√k) rate for weakly convex problems and O(1/k) for strongly convex problems (e.g., [Nemirovski et al.'09])
• For deterministic problems, linear convergence is possible if the exact gradient is allowed periodically [Xiao-Zhang'14]
• Convergence in terms of a first-order optimality condition for nonconvex problems [Ghadimi-Lan'13]
Block gradient descent

Consider the problem

    min_x F(x) = f(x_1, ..., x_s) + Σ_{i=1}^s r_i(x_i).

• f is smooth
• r_i's possibly nonsmooth and extended-valued

Block gradient update (BGD):

    x^{k+1}_{i_k} = argmin_{x_{i_k}} ⟨∇_{i_k} f(x^k), x_{i_k} − x^k_{i_k}⟩ + (1/(2α_k)) ‖x_{i_k} − x^k_{i_k}‖² + r_{i_k}(x_{i_k})

• Simpler than classic block coordinate descent, i.e., block minimization
• Allows different ways to choose i_k: cyclically, greedily, or randomly
• Low iteration complexity
• Larger stepsize than full gradient descent and often faster convergence
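Since the subproblem is a prox-linear step on a single block, it has a closed form whenever the prox of r_i does. A minimal sketch, assuming blocks are stored as a list of NumPy arrays, `grad_block` is a user-supplied partial gradient of f, and r_i = λ‖·‖₁ is chosen only as an illustrative regularizer:

```python
import numpy as np

def soft_threshold(z, t):
    """Prox of t*||.||_1, used here as an example regularizer r_i."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def bgd_step(x_blocks, i_k, grad_block, alpha_k, lam=0.1):
    """One block gradient (prox-linear) update on block i_k.

    grad_block(x_blocks, i) is assumed to return the partial gradient
    of f with respect to block i at the current point.
    """
    g = grad_block(x_blocks, i_k)
    # closed-form solution of the block subproblem when r_i = lam*||.||_1
    x_blocks[i_k] = soft_threshold(x_blocks[i_k] - alpha_k * g, alpha_k * lam)
    return x_blocks
```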
Block gradient descent

• First appears in [Tseng-Yun'09]; popularized by [Nesterov'12]
• Both cyclic and randomized selection: O(1/k) for weakly convex problems and linear convergence for strongly convex problems (e.g., [Hong et al.'15])
• The cyclic version is harder to analyze than the random or greedy versions
• Subsequence convergence for nonconvex problems, and whole-sequence convergence if a certain local property holds (e.g., [X.-Yin'14])
Stochastic programming with block structure

Consider the problem

    min_x Φ(x) = E_ξ f(x_1, ..., x_s; ξ) + Σ_{i=1}^s r_i(x_i)    (BSP)

• Example: tensor regression [Zhou-Li-Zhu'13]

    min_{X_1,...,X_s} E[ ℓ(X_1 ∘ ⋯ ∘ X_s; A, b) ]

This talk presents an algorithm for (BSP) with the following properties:
• Only requires stochastic block gradients
• Simple updates and low computational complexity
• Guaranteed convergence
• Optimal convergence rate if the problem is convex
How and why

• Use stochastic partial gradients in BGD
  • exact partial gradients are unavailable or expensive
  • plain stochastic gradient works but does not perform as well
• Use random cyclic selection, i.e., shuffle and then cycle
  • random shuffling gives faster convergence and more stable performance [Chang-Hsieh-Lin'08]
  • cycling gives lower computational complexity, but the analysis is more difficult
Block stochastic gradient method

At each iteration/cycle k:
1. Sample one function or a batch of functions
2. Randomly shuffle the blocks to (k_1, ..., k_s)
3. For i = 1 through s, do

    x^{k+1}_{k_i} = argmin_{x_{k_i}} ⟨g̃^k_{k_i}, x_{k_i} − x^k_{k_i}⟩ + (1/(2α^k_{k_i})) ‖x_{k_i} − x^k_{k_i}‖² + r_{k_i}(x_{k_i})

• g̃^k_{k_i} is a stochastic partial gradient, dependent on the sampled functions and the intermediate point (x^{k+1}_{k_{<i}}, x^k_{k_{≥i}})
• It is a possibly biased estimate, i.e., E[g̃^k_{k_i} − ∇F(x^{k+1}_{k_{<i}}, x^k_{k_{≥i}})] ≠ 0, where F(x) = E_ξ f(x; ξ)
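A minimal sketch of one BSG cycle, assuming hypothetical user-supplied callables: `sample_batch` (draws a mini-batch of functions), `partial_grad` (a stochastic partial gradient of the smooth part at the current, partially updated point), and `prox` (the proximal operator of α·r_i):

```python
import numpy as np

def bsg_cycle(x_blocks, data, sample_batch, partial_grad, prox, alpha_k):
    """One cycle of block stochastic gradient (BSG).

    sample_batch(data) draws a mini-batch; partial_grad(batch, x_blocks, i)
    returns a stochastic partial gradient for block i at the *current*
    (partially updated) point; prox(i, z, alpha) applies the proximal
    operator of alpha * r_i. All three are assumed to be user-supplied.
    """
    batch = sample_batch(data)                      # step 1: sample functions
    order = np.random.permutation(len(x_blocks))    # step 2: shuffle the blocks
    for i in order:                                 # step 3: cycle through blocks
        g = partial_grad(batch, x_blocks, i)        # uses already-updated blocks
        x_blocks[i] = prox(i, x_blocks[i] - alpha_k * g, alpha_k)
    return x_blocks
```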
Pros and cons of cyclic selection

Pros:
• lower computational complexity, e.g., for Φ(x) = E_{(a,b)} (a^⊤ x − b)² with x ∈ R^n
  • cyclic selection takes about 2n operations to update all coordinates once
  • random selection takes about n operations to update one coordinate
  (see the sketch below)
• Gauss–Seidel-type fast convergence (see numerical results later)

Cons:
• biased stochastic partial gradient
• makes the analysis more difficult
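The gap comes from being able to maintain the residual a^⊤x − b across coordinate updates in cyclic order, which independent random picks do not allow. A rough sketch under that assumption (operation counts are approximate):

```python
import numpy as np

def cyclic_sweep(x, a, b, alpha):
    """Update all n coordinates on one sample (a, b): ~2n operations total,
    because the residual a@x - b is maintained incrementally."""
    r = a @ x - b                      # O(n) once per sweep
    for i in range(len(x)):
        step = alpha * a[i] * r        # partial gradient a_i * (a@x - b), O(1)
        x[i] -= step
        r -= a[i] * step               # keep the residual consistent, O(1)
    return x

def random_update(x, a, b, alpha):
    """Update one random coordinate: the residual must be recomputed, ~n ops."""
    i = np.random.randint(len(x))
    r = a @ x - b                      # O(n) for a single coordinate update
    x[i] -= alpha * a[i] * r
    return x
```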
Literature

Just a few papers so far:
• [Liu-Wright, arXiv14]: an asynchronous parallel randomized Kaczmarz algorithm
• [Dang-Lan, SIOPT15]: stochastic block mirror descent methods for nonsmooth and stochastic optimization
• [Zhao et al., NIPS14]: accelerated mini-batch randomized block coordinate descent method
• [Wang-Banerjee, arXiv14]: randomized block coordinate descent for online and stochastic optimization
• [Hua-Kadomoto-Yamashita, OptOnline15]: regret analysis of block coordinate gradient methods for online convex programming
Assumptions

Recall F(x) = E_ξ f(x; ξ). Let δ^k_i = g̃^k_i − ∇_{x_i} F(x^{k+1}_{<i}, x^k_{≥i}).

Error bound on the stochastic partial gradient:

    ‖E[δ^k_i | x^{k−1}]‖ ≤ A · max_j α^k_j,  ∀ i, k   (A = 0 if unbiased)

    E‖δ^k_i‖² ≤ σ_k² ≤ σ²,  ∀ i, k.

Lipschitz continuous partial gradient:

    ‖∇_{x_i} F(x + (0, ..., d_i, ..., 0)) − ∇_{x_i} F(x)‖ ≤ L_i ‖d_i‖,  ∀ i, ∀ x, d.
Convergence of block stochastic gradient

• Convex case: F and the r_i's are convex. Take α^k_i = α_k = O(1/√k), ∀ i, k, and let

      x̃^k = ( Σ_{κ=1}^k α_κ x^{κ+1} ) / ( Σ_{κ=1}^k α_κ ).

  Then

      E[Φ(x̃^k) − Φ(x*)] ≤ O(log k / √k).

  • The rate can be improved to O(1/√k) if the number of iterations is known in advance, and it thus achieves the optimal order of rate
  (a sketch of this stepsize and averaging scheme follows below)

• Strongly convex case: Φ is strongly convex. Take α^k_i = α_k = O(1/k), ∀ i, k. Then

      E‖x^k − x*‖² ≤ O(1/k).

  • Again, the optimal order of rate is achieved
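In code terms, the convex-case recipe is just a diminishing stepsize plus an α-weighted running average of the iterates. A minimal sketch, assuming a hypothetical `bsg_cycle_fn(x_blocks, alpha_k)` that performs one cycle of the method above:

```python
import numpy as np

def run_bsg_convex(x_blocks, num_iters, bsg_cycle_fn, alpha0=1.0):
    """Run BSG with alpha_k = alpha0/sqrt(k) and return the weighted average
    iterate x_tilde = sum_k alpha_k x^{k+1} / sum_k alpha_k, which carries
    the O(log k / sqrt(k)) guarantee in the convex case."""
    x_avg = [np.zeros_like(xb, dtype=float) for xb in x_blocks]
    alpha_sum = 0.0
    for k in range(1, num_iters + 1):
        alpha_k = alpha0 / np.sqrt(k)               # O(1/sqrt(k)) stepsize
        x_blocks = bsg_cycle_fn(x_blocks, alpha_k)  # one BSG cycle (hypothetical)
        alpha_sum += alpha_k
        for i, xb in enumerate(x_blocks):
            x_avg[i] += alpha_k * xb                # accumulate the weighted sum
    return [xa / alpha_sum for xa in x_avg]
```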
Convergence of block stochastic gradient

• Unconstrained smooth nonconvex case: If {α^k_i} is taken such that

      Σ_{k=1}^∞ α^k_i = ∞,   Σ_{k=1}^∞ (α^k_i)² < ∞,  ∀ i,

  (e.g., α^k_i = c/k satisfies both), then

      lim_{k→∞} E‖∇Φ(x^k)‖ = 0.

• Nonsmooth nonconvex case: If α^k_i < 2/L_i, ∀ i, k, and Σ_{k=1}^∞ σ_k² < ∞, then

      lim_{k→∞} E[ dist(0, ∂Φ(x^k)) ] = 0.
Numerical experiments

Tested problems:
• Stochastic least squares
• Linear logistic regression
• Bilinear logistic regression
• Low-rank tensor recovery from Gaussian measurements

Tested methods:
• block stochastic gradient (BSG) [proposed]
• block gradient descent (deterministic)
• stochastic gradient method (SG)
• stochastic block mirror descent (SBMD) [Dang-Lan'15]
Stochastic least squares

Consider  min_x E_{(a,b)} (1/2)(a^⊤ x − b)²

• a ~ N(0, I), b = a^⊤ x̂ + η, and η ~ N(0, 0.01)
• x̂ is the optimal solution, and the minimum value is 0.005
• {(a_i, b_i)} observed sequentially from i = 1 to N
• Deterministic (partial) gradients unavailable
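A sketch of how this data stream could be generated (names and the seed are illustrative; N(0, 0.01) is read as variance 0.01, i.e., standard deviation 0.1):

```python
import numpy as np

def make_stream(x_hat, N, noise_std=0.1, seed=0):
    """Generate the streaming least-squares data described above:
    a ~ N(0, I), b = a @ x_hat + eta with eta ~ N(0, 0.01)."""
    rng = np.random.default_rng(seed)
    n = len(x_hat)
    for _ in range(N):
        a = rng.standard_normal(n)
        b = a @ x_hat + noise_std * rng.standard_normal()
        yield a, b
```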
Objective values by different methods

# Samples | BSG     | SG      | SBMD-10 | SBMD-50 | SBMD-100
4000      | 6.45e-3 | 6.03e-3 | 67.49   | 4.79    | 1.03e-1
6000      | 5.69e-3 | 5.79e-3 | 53.84   | 1.43    | 1.43e-2
8000      | 5.57e-3 | 5.65e-3 | 42.98   | 4.92e-1 | 6.70e-3
10000     | 5.53e-3 | 5.58e-3 | 35.71   | 2.09e-1 | 5.74e-3

• One coordinate forms one block
• SBMD-t: SBMD with t coordinates selected at each update
• Objective evaluated on another 100,000 samples
• Each update of all methods costs O(n)

Observation: it is better to update more coordinates, and BSG performs best.
Logistic regression

    min_{w,b} (1/N) Σ_{i=1}^N log(1 + exp(−y_i (x_i^⊤ w + b)))    (LR)

• Training samples {(x_i, y_i)}_{i=1}^N with y_i ∈ {−1, +1}
• Deterministic problem, but the exact gradient is expensive for large N
• Stochastic gradient is faster than deterministic gradient for reaching moderate accuracy
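For reference, a mini-batch stochastic gradient of the (LR) objective might look like the following sketch, where `idx` is the sampled index set and X stacks the samples row-wise:

```python
import numpy as np

def lr_minibatch_grad(w, b, X, y, idx):
    """Stochastic gradient of the logistic loss (LR) on mini-batch `idx`.

    X: N x d sample matrix, y in {-1, +1}. Returns (grad_w, grad_b)."""
    Xb, yb = X[idx], y[idx]
    margins = yb * (Xb @ w + b)
    s = 1.0 / (1.0 + np.exp(margins))        # sigmoid(-margin)
    grad_w = -(Xb.T @ (yb * s)) / len(idx)
    grad_b = -np.mean(yb * s)
    return grad_w, grad_b
```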
Performance of different methods on logistic regression

[Figure: objective minus optimal value vs. epochs for BSG, SG, and SBMD variants. (a) random dataset; (b) gisette dataset.]

• Random dataset: 2,000 Gaussian random samples of dimension 200
• gisette dataset: 6,000 samples of dimension 5,000, from the LIBSVM datasets

Observation: BSG gives the best performance among the compared methods.
Low-rank tensor recovery from Gaussian measurements

    min_X (1/(2N)) Σ_{ℓ=1}^N ( A_ℓ(X_1 ∘ X_2 ∘ X_3) − b_ℓ )²    (LRTR)

• b_ℓ = A_ℓ(M) = ⟨G_ℓ, M⟩ with G_ℓ ~ N(0, I), ∀ ℓ
• G_ℓ's are dense
• For large N, reading all G_ℓ may run out of memory, even for medium-sized G_ℓ
• Deterministic problem, but the exact gradient is too expensive for large N
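A sketch of the measurement operator, under the assumption that the factors X_1, X_2, X_3 are stored as matrices with R matching columns, so that X_1 ∘ X_2 ∘ X_3 is the usual sum of rank-one (CP) terms:

```python
import numpy as np

def cp_tensor(X1, X2, X3):
    """X1 o X2 o X3: sum of rank-one tensors built from matching columns
    of the factor matrices (each X_i has shape n_i x R)."""
    return np.einsum('ir,jr,kr->ijk', X1, X2, X3)

def measurements(factors, G_list):
    """A_l(X1 o X2 o X3) = <G_l, X1 o X2 o X3> for each Gaussian tensor G_l."""
    M = cp_tensor(*factors)
    return np.array([np.sum(G * M) for G in G_list])
```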
Performance of block deterministic and stochastic gradient

[Figure: objective vs. epochs for BSG and BCGD; G_ℓ ∈ R^{32×32×32} and N = 15,000]

• BCGD: block deterministic gradient
• BSG: block stochastic gradient

Observation: BSG is faster, and BCGD gets trapped at a bad local solution.

[Figure: original (top) and recovered (bottom) by BSG with 50 epochs; G_ℓ ∈ R^{60×60×60} and N = 40,000]
Bilinear logistic regression

    min_{U,V,b} (1/N) Σ_{i=1}^N log(1 + exp(−y_i (⟨U V^⊤, X_i⟩ + b)))    (BLR)

• Training samples {(X_i, y_i)}_{i=1}^N with y_i ∈ {−1, +1}
• Better than linear logistic regression for 2-D datasets [Dyrholm et al.'07]
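Treating U, V, and b as the blocks, the stochastic partial gradients on a single sample can be written as in the sketch below (the sign convention follows the (BLR) loss above; a BSG cycle would apply them one block at a time):

```python
import numpy as np

def blr_sample_grads(U, V, b, X, y):
    """Stochastic partial gradients of the (BLR) loss on one sample (X, y).

    U: m x r, V: n x r, X: m x n, y in {-1, +1}. Returns (gU, gV, gb)."""
    score = np.sum((U @ V.T) * X) + b        # <U V^T, X> + b
    c = -y / (1.0 + np.exp(y * score))       # derivative of log(1 + exp(-y*score))
    return c * (X @ V), c * (X.T @ U), c
```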