Block stochastic gradient update method

Yangyang Xu* and Wotao Yin†

*IMA, University of Minnesota
†Department of Mathematics, UCLA

November 1, 2015

This work was done while at Rice University.
Stochastic gradient method

Consider the stochastic program

    min_{x ∈ X} F(x) = E_ξ f(x; ξ).

Stochastic gradient update (SG):

    x^{k+1} = P_X( x^k − α_k g̃^k )

• g̃^k is a stochastic gradient, often with E[g̃^k] ∈ ∂F(x^k)
• Originally designed for stochastic problems where the exact gradient is not available
• Now also popular for deterministic problems where the exact gradient is expensive; e.g., F(x) = (1/N) Σ_{i=1}^N f_i(x) with large N
• Faster than deterministic gradient methods for reaching low-to-moderate accuracy
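A minimal sketch of one SG iteration in Python, assuming two hypothetical user-supplied helpers: `sample_grad` (returns a stochastic gradient at x) and `project_X` (Euclidean projection onto the feasible set X):

```python
import numpy as np

def sg_step(x, k, sample_grad, project_X, alpha0=1.0):
    """One projected stochastic gradient step with alpha_k = alpha0 / sqrt(k)."""
    alpha_k = alpha0 / np.sqrt(k)          # diminishing stepsize
    g = sample_grad(x)                     # stochastic gradient, E[g] in dF(x)
    return project_X(x - alpha_k * g)      # project back onto the feasible set X
```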
Stochastic gradient method

• First appears in [Robbins-Monro'51]; now a vast literature
• O(1/√k) rate for weakly convex problems and O(1/k) for strongly convex problems (e.g., [Nemirovski et al.'09])
• For deterministic problems, linear convergence is possible if the exact gradient is allowed periodically [Xiao-Zhang'14]
• Convergence in terms of a first-order optimality condition for nonconvex problems [Ghadimi-Lan'13]
Block gradient descent

Consider the problem

    min_x F(x) = f(x_1, ..., x_s) + Σ_{i=1}^s r_i(x_i).

• f is smooth
• r_i's possibly nonsmooth and extended-valued

Block gradient update (BGD):

    x^{k+1}_{i_k} = argmin_{x_{i_k}} ⟨∇_{i_k} f(x^k), x_{i_k} − x^k_{i_k}⟩ + (1/(2α_k)) ‖x_{i_k} − x^k_{i_k}‖² + r_{i_k}(x_{i_k})

• Simpler than classic block coordinate descent, i.e., block minimization
• Allows different ways to choose i_k: cyclically, greedily, or randomly
• Low iteration complexity
• Larger stepsize than full gradient descent and often faster convergence
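Since the subproblem is a prox-linear step on a single block, it has a closed form whenever the prox of r_i does. A minimal sketch, assuming blocks are stored as a list of NumPy arrays, `grad_block` is a user-supplied partial gradient of f, and r_i = λ‖·‖₁ is chosen only as an illustrative regularizer:

```python
import numpy as np

def soft_threshold(z, t):
    """Prox of t*||.||_1, used here as an example regularizer r_i."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def bgd_step(x_blocks, i_k, grad_block, alpha_k, lam=0.1):
    """One block gradient (prox-linear) update on block i_k.

    grad_block(x_blocks, i) is assumed to return the partial gradient
    of f with respect to block i at the current point.
    """
    g = grad_block(x_blocks, i_k)
    # closed-form solution of the block subproblem when r_i = lam*||.||_1
    x_blocks[i_k] = soft_threshold(x_blocks[i_k] - alpha_k * g, alpha_k * lam)
    return x_blocks
```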
Block gradient descent

• First appears in [Tseng-Yun'09]; popularized by [Nesterov'12]
• Both cyclic and randomized selection: O(1/k) for weakly convex problems and linear convergence for strongly convex problems (e.g., [Hong et al.'15])
• The cyclic version is harder to analyze than the random or greedy versions
• Subsequence convergence for nonconvex problems, and whole-sequence convergence if a certain local property holds (e.g., [X.-Yin'14])
Stochastic programming with block structure

Consider the problem

    min_x Φ(x) = E_ξ f(x_1, ..., x_s; ξ) + Σ_{i=1}^s r_i(x_i)    (BSP)

• Example: tensor regression [Zhou-Li-Zhu'13]

    min_{X_1,...,X_s} E[ ℓ(X_1 ∘ ⋯ ∘ X_s; A, b) ]

This talk presents an algorithm for (BSP) with the following properties:
• Only requires stochastic block gradients
• Simple updates and low computational complexity
• Guaranteed convergence
• Optimal convergence rate if the problem is convex
How and why

• Use stochastic partial gradients in BGD
  • exact partial gradients are unavailable or expensive
  • plain stochastic gradient works but does not perform as well
• Use random cyclic selection, i.e., shuffle and then cycle
  • random shuffling gives faster convergence and more stable performance [Chang-Hsieh-Lin'08]
  • cycling gives lower computational complexity, but the analysis is more difficult
Block stochastic gradient method

At each iteration/cycle k:
1. Sample one function or a batch of functions
2. Randomly shuffle the blocks to (k_1, ..., k_s)
3. For i = 1 through s, do

    x^{k+1}_{k_i} = argmin_{x_{k_i}} ⟨g̃^k_{k_i}, x_{k_i} − x^k_{k_i}⟩ + (1/(2α^k_{k_i})) ‖x_{k_i} − x^k_{k_i}‖² + r_{k_i}(x_{k_i})

• g̃^k_{k_i} is a stochastic partial gradient, dependent on the sampled functions and the intermediate point (x^{k+1}_{k_{<i}}, x^k_{k_{≥i}})
• It is a possibly biased estimate, i.e., E[g̃^k_{k_i} − ∇F(x^{k+1}_{k_{<i}}, x^k_{k_{≥i}})] ≠ 0, where F(x) = E_ξ f(x; ξ)
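A minimal sketch of one BSG cycle, assuming hypothetical user-supplied callables: `sample_batch` (draws a mini-batch of functions), `partial_grad` (a stochastic partial gradient of the smooth part at the current, partially updated point), and `prox` (the proximal operator of α·r_i):

```python
import numpy as np

def bsg_cycle(x_blocks, data, sample_batch, partial_grad, prox, alpha_k):
    """One cycle of block stochastic gradient (BSG).

    sample_batch(data) draws a mini-batch; partial_grad(batch, x_blocks, i)
    returns a stochastic partial gradient for block i at the *current*
    (partially updated) point; prox(i, z, alpha) applies the proximal
    operator of alpha * r_i. All three are assumed to be user-supplied.
    """
    batch = sample_batch(data)                      # step 1: sample functions
    order = np.random.permutation(len(x_blocks))    # step 2: shuffle the blocks
    for i in order:                                 # step 3: cycle through blocks
        g = partial_grad(batch, x_blocks, i)        # uses already-updated blocks
        x_blocks[i] = prox(i, x_blocks[i] - alpha_k * g, alpha_k)
    return x_blocks
```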
Pros and cons of cyclic selection

Pros:
• lower computational complexity, e.g., for Φ(x) = E_{(a,b)} (a^⊤ x − b)² with x ∈ R^n
  • cyclic selection takes about 2n operations to update all coordinates once
  • random selection takes about n operations to update one coordinate
  (see the sketch below)
• Gauss–Seidel-type fast convergence (see numerical results later)

Cons:
• biased stochastic partial gradient
• makes the analysis more difficult
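The gap comes from being able to maintain the residual a^⊤x − b across coordinate updates in cyclic order, which independent random picks do not allow. A rough sketch under that assumption (operation counts are approximate):

```python
import numpy as np

def cyclic_sweep(x, a, b, alpha):
    """Update all n coordinates on one sample (a, b): ~2n operations total,
    because the residual a@x - b is maintained incrementally."""
    r = a @ x - b                      # O(n) once per sweep
    for i in range(len(x)):
        step = alpha * a[i] * r        # partial gradient a_i * (a@x - b), O(1)
        x[i] -= step
        r -= a[i] * step               # keep the residual consistent, O(1)
    return x

def random_update(x, a, b, alpha):
    """Update one random coordinate: the residual must be recomputed, ~n ops."""
    i = np.random.randint(len(x))
    r = a @ x - b                      # O(n) for a single coordinate update
    x[i] -= alpha * a[i] * r
    return x
```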
Literature

Just a few papers so far:
• [Liu-Wright, arXiv14]: an asynchronous parallel randomized Kaczmarz algorithm
• [Dang-Lan, SIOPT15]: stochastic block mirror descent methods for nonsmooth and stochastic optimization
• [Zhao et al., NIPS14]: accelerated mini-batch randomized block coordinate descent method
• [Wang-Banerjee, arXiv14]: randomized block coordinate descent for online and stochastic optimization
• [Hua-Kadomoto-Yamashita, OptOnline15]: regret analysis of block coordinate gradient methods for online convex programming
Assumptions

Recall F(x) = E_ξ f(x; ξ). Let δ^k_i = g̃^k_i − ∇_{x_i} F(x^{k+1}_{<i}, x^k_{≥i}).

Error bound on the stochastic partial gradient:

    ‖E[δ^k_i | x^{k−1}]‖ ≤ A · max_j α^k_j,  ∀ i, k   (A = 0 if unbiased)

    E‖δ^k_i‖² ≤ σ_k² ≤ σ²,  ∀ i, k.

Lipschitz continuous partial gradient:

    ‖∇_{x_i} F(x + (0, ..., d_i, ..., 0)) − ∇_{x_i} F(x)‖ ≤ L_i ‖d_i‖,  ∀ i, ∀ x, d.
Convergence of block stochastic gradient

• Convex case: F and the r_i's are convex. Take α^k_i = α_k = O(1/√k), ∀ i, k, and let

      x̃^k = ( Σ_{κ=1}^k α_κ x^{κ+1} ) / ( Σ_{κ=1}^k α_κ ).

  Then

      E[Φ(x̃^k) − Φ(x*)] ≤ O(log k / √k).

  • The rate can be improved to O(1/√k) if the number of iterations is known in advance, and it thus achieves the optimal order of rate
  (a sketch of this stepsize and averaging scheme follows below)

• Strongly convex case: Φ is strongly convex. Take α^k_i = α_k = O(1/k), ∀ i, k. Then

      E‖x^k − x*‖² ≤ O(1/k).

  • Again, the optimal order of rate is achieved
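In code terms, the convex-case recipe is just a diminishing stepsize plus an α-weighted running average of the iterates. A minimal sketch, assuming a hypothetical `bsg_cycle_fn(x_blocks, alpha_k)` that performs one cycle of the method above:

```python
import numpy as np

def run_bsg_convex(x_blocks, num_iters, bsg_cycle_fn, alpha0=1.0):
    """Run BSG with alpha_k = alpha0/sqrt(k) and return the weighted average
    iterate x_tilde = sum_k alpha_k x^{k+1} / sum_k alpha_k, which carries
    the O(log k / sqrt(k)) guarantee in the convex case."""
    x_avg = [np.zeros_like(xb, dtype=float) for xb in x_blocks]
    alpha_sum = 0.0
    for k in range(1, num_iters + 1):
        alpha_k = alpha0 / np.sqrt(k)               # O(1/sqrt(k)) stepsize
        x_blocks = bsg_cycle_fn(x_blocks, alpha_k)  # one BSG cycle (hypothetical)
        alpha_sum += alpha_k
        for i, xb in enumerate(x_blocks):
            x_avg[i] += alpha_k * xb                # accumulate the weighted sum
    return [xa / alpha_sum for xa in x_avg]
```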
Convergence of block stochastic gradient

• Unconstrained smooth nonconvex case: If {α^k_i} is taken such that

      Σ_{k=1}^∞ α^k_i = ∞,   Σ_{k=1}^∞ (α^k_i)² < ∞,  ∀ i,

  (e.g., α^k_i = c/k satisfies both), then

      lim_{k→∞} E‖∇Φ(x^k)‖ = 0.

• Nonsmooth nonconvex case: If α^k_i < 2/L_i, ∀ i, k, and Σ_{k=1}^∞ σ_k² < ∞, then

      lim_{k→∞} E[ dist(0, ∂Φ(x^k)) ] = 0.
Numerical experiments

Tested problems:
• Stochastic least squares
• Linear logistic regression
• Bilinear logistic regression
• Low-rank tensor recovery from Gaussian measurements

Tested methods:
• block stochastic gradient (BSG) [proposed]
• block gradient descent (deterministic)
• stochastic gradient method (SG)
• stochastic block mirror descent (SBMD) [Dang-Lan'15]
Stochastic least squares

Consider  min_x E_{(a,b)} (1/2)(a^⊤ x − b)²

• a ~ N(0, I), b = a^⊤ x̂ + η, and η ~ N(0, 0.01)
• x̂ is the optimal solution, and the minimum value is 0.005
• {(a_i, b_i)} observed sequentially from i = 1 to N
• Deterministic (partial) gradients unavailable
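A sketch of how this data stream could be generated (names and the seed are illustrative; N(0, 0.01) is read as variance 0.01, i.e., standard deviation 0.1):

```python
import numpy as np

def make_stream(x_hat, N, noise_std=0.1, seed=0):
    """Generate the streaming least-squares data described above:
    a ~ N(0, I), b = a @ x_hat + eta with eta ~ N(0, 0.01)."""
    rng = np.random.default_rng(seed)
    n = len(x_hat)
    for _ in range(N):
        a = rng.standard_normal(n)
        b = a @ x_hat + noise_std * rng.standard_normal()
        yield a, b
```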
Objective values by different methods

# Samples | BSG     | SG      | SBMD-10 | SBMD-50 | SBMD-100
4000      | 6.45e-3 | 6.03e-3 | 67.49   | 4.79    | 1.03e-1
6000      | 5.69e-3 | 5.79e-3 | 53.84   | 1.43    | 1.43e-2
8000      | 5.57e-3 | 5.65e-3 | 42.98   | 4.92e-1 | 6.70e-3
10000     | 5.53e-3 | 5.58e-3 | 35.71   | 2.09e-1 | 5.74e-3

• One coordinate forms one block
• SBMD-t: SBMD with t coordinates selected at each update
• Objective evaluated on another 100,000 samples
• Each update of all methods costs O(n)

Observation: it is better to update more coordinates, and BSG performs best.
Logistic regression

    min_{w,b} (1/N) Σ_{i=1}^N log(1 + exp(−y_i (x_i^⊤ w + b)))    (LR)

• Training samples {(x_i, y_i)}_{i=1}^N with y_i ∈ {−1, +1}
• Deterministic problem, but the exact gradient is expensive for large N
• Stochastic gradient is faster than deterministic gradient for reaching moderate accuracy
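For reference, a mini-batch stochastic gradient of the (LR) objective might look like the following sketch, where `idx` is the sampled index set and X stacks the samples row-wise:

```python
import numpy as np

def lr_minibatch_grad(w, b, X, y, idx):
    """Stochastic gradient of the logistic loss (LR) on mini-batch `idx`.

    X: N x d sample matrix, y in {-1, +1}. Returns (grad_w, grad_b)."""
    Xb, yb = X[idx], y[idx]
    margins = yb * (Xb @ w + b)
    s = 1.0 / (1.0 + np.exp(margins))        # sigmoid(-margin)
    grad_w = -(Xb.T @ (yb * s)) / len(idx)
    grad_b = -np.mean(yb * s)
    return grad_w, grad_b
```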
Performance of different methods on logistic regression

[Figure: objective minus optimal value vs. epochs for BSG, SG, and SBMD variants. (a) random dataset; (b) gisette dataset.]

• Random dataset: 2,000 Gaussian random samples of dimension 200
• gisette dataset: 6,000 samples of dimension 5,000, from the LIBSVM datasets

Observation: BSG gives the best performance among the compared methods.
Low-rank tensor recovery from Gaussian measurements

    min_X (1/(2N)) Σ_{ℓ=1}^N ( A_ℓ(X_1 ∘ X_2 ∘ X_3) − b_ℓ )²    (LRTR)

• b_ℓ = A_ℓ(M) = ⟨G_ℓ, M⟩ with G_ℓ ~ N(0, I), ∀ ℓ
• G_ℓ's are dense
• For large N, reading all G_ℓ may run out of memory, even for medium-sized G_ℓ
• Deterministic problem, but the exact gradient is too expensive for large N
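A sketch of the measurement operator, under the assumption that the factors X_1, X_2, X_3 are stored as matrices with R matching columns, so that X_1 ∘ X_2 ∘ X_3 is the usual sum of rank-one (CP) terms:

```python
import numpy as np

def cp_tensor(X1, X2, X3):
    """X1 o X2 o X3: sum of rank-one tensors built from matching columns
    of the factor matrices (each X_i has shape n_i x R)."""
    return np.einsum('ir,jr,kr->ijk', X1, X2, X3)

def measurements(factors, G_list):
    """A_l(X1 o X2 o X3) = <G_l, X1 o X2 o X3> for each Gaussian tensor G_l."""
    M = cp_tensor(*factors)
    return np.array([np.sum(G * M) for G in G_list])
```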
Performance of block deterministic and stochastic gradient

[Figure: objective vs. epochs for BSG and BCGD; G_ℓ ∈ R^{32×32×32} and N = 15,000]

• BCGD: block deterministic gradient
• BSG: block stochastic gradient

Observation: BSG is faster, and BCGD gets trapped at a bad local solution.

[Figure: original (top) and recovered (bottom) by BSG with 50 epochs; G_ℓ ∈ R^{60×60×60} and N = 40,000]
Bilinear logistic regression

    min_{U,V,b} (1/N) Σ_{i=1}^N log(1 + exp(−y_i (⟨U V^⊤, X_i⟩ + b)))    (BLR)

• Training samples {(X_i, y_i)}_{i=1}^N with y_i ∈ {−1, +1}
• Better than linear logistic regression for 2-D datasets [Dyrholm et al.'07]
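Treating U, V, and b as the blocks, the stochastic partial gradients on a single sample can be written as in the sketch below (the sign convention follows the (BLR) loss above; a BSG cycle would apply them one block at a time):

```python
import numpy as np

def blr_sample_grads(U, V, b, X, y):
    """Stochastic partial gradients of the (BLR) loss on one sample (X, y).

    U: m x r, V: n x r, X: m x n, y in {-1, +1}. Returns (gU, gV, gb)."""
    score = np.sum((U @ V.T) * X) + b        # <U V^T, X> + b
    c = -y / (1.0 + np.exp(y * score))       # derivative of log(1 + exp(-y*score))
    return c * (X @ V), c * (X.T @ U), c
```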