Part 7.5: Stochastic Gradient Descent and Stochastic Newton
Wolfgang Bangerth
Background

In many practical applications, the objective function is a large sum:

  $f(x) = \sum_{i=1}^{N} f_i(x)$

Issues and questions:
● Evaluating gradients/Hessians is expensive.
● Do all of these $f_i$ really provide complementary information?
● Can we exploit the sum structure somehow to make the algorithm cheaper?
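A concrete instance of this structure, added here purely as an illustrative sketch (the least-squares model $m(x;\,t)$ and the data $(t_i, y_i)$ are assumptions, not part of the slide): in data fitting, each measurement contributes one term to the sum.

```latex
% Hypothetical example of the sum structure f(x) = \sum_i f_i(x):
% each measurement (t_i, y_i) contributes one misfit term.
f(x) = \sum_{i=1}^{N} f_i(x),
\qquad
f_i(x) = \tfrac{1}{2}\bigl(m(x;\,t_i) - y_i\bigr)^2 .
```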
Stochastic gradient descent

Approach: Let's use gradient descent (steepest descent), but instead of using the full gradient

  $p_k = -\alpha_k g_k = -\alpha_k \nabla f(x_k),$

try to approximate it somehow in each step, using only a subset of the functions $f_i$:

  $p_k = -\alpha_k \tilde g_k$

Note: In many practical applications, the step lengths $\alpha_k$ are chosen a priori, based on knowledge of the application.
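A minimal sketch of this iteration in Python, assuming the caller supplies an `approx_gradient(x, k)` function that returns $\tilde g_k$ and an a priori schedule of step lengths (these names and the fixed schedule are illustrative assumptions, not from the slide):

```python
import numpy as np

def stochastic_descent(x0, approx_gradient, step_lengths, num_iterations):
    """Generic stochastic descent loop: x_{k+1} = x_k - alpha_k * g_tilde_k.

    approx_gradient(x, k) is assumed to return some approximation of
    grad f(x) in iteration k; step_lengths[k] is the a priori chosen alpha_k.
    """
    x = np.asarray(x0, dtype=float)
    for k in range(num_iterations):
        g_tilde = approx_gradient(x, k)        # approximate gradient ~g_k
        x = x - step_lengths[k] * g_tilde      # steepest-descent-style step
    return x
```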
Stochastic gradient descent

Idea 1: Use only one $f_i$ at a time when evaluating the gradient:
● In iteration 1, approximate $g_1 = \nabla f(x_1) \approx \nabla f_1(x_1) =: \tilde g_1$
● In iteration 2, approximate $g_2 = \nabla f(x_2) \approx \nabla f_2(x_2) =: \tilde g_2$
● …
● After iteration $N$, start over: $g_{N+1} = \nabla f(x_{N+1}) \approx \nabla f_1(x_{N+1}) =: \tilde g_{N+1}$
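A sketch of this cyclic selection, assuming a hypothetical helper `grad_f_i(i, x)` that evaluates $\nabla f_i(x)$ (not defined on the slide):

```python
def cyclic_gradient(x, k, grad_f_i, N):
    """Idea 1: in iteration k (1-based), use f_1, f_2, ..., f_N in turn,
    then start over after N iterations."""
    i = (k - 1) % N + 1            # cycles through 1, 2, ..., N
    return grad_f_i(i, x)          # ~g_k = grad f_i(x_k)
```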
Stochastic gradient descent

Idea 2: Use only one $f_i$ at a time, randomly chosen:
● In iteration 1, approximate $g_1 = \nabla f(x_1) \approx \nabla f_{r_1}(x_1) =: \tilde g_1$
● In iteration 2, approximate $g_2 = \nabla f(x_2) \approx \nabla f_{r_2}(x_2) =: \tilde g_2$
● …
Here, the $r_i$ are randomly chosen numbers between 1 and $N$.
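A sketch of the random choice, again using the hypothetical helper `grad_f_i(i, x)` for $\nabla f_i(x)$:

```python
import numpy as np

rng = np.random.default_rng()

def random_single_gradient(x, grad_f_i, N):
    """Idea 2: pick one index r uniformly at random from {1, ..., N}
    and use grad f_r(x) as the approximate gradient."""
    r = rng.integers(1, N + 1)     # random index between 1 and N
    return grad_f_i(r, x)
```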
Stochastic gradient descent

Idea 3: Use a subset of the $f_i$ at a time, randomly chosen:
● In iteration 1, approximate $g_1 = \nabla f(x_1) \approx \sum_{i \in S_1} \nabla f_i(x_1) =: \tilde g_1$
● In iteration 2, approximate $g_2 = \nabla f(x_2) \approx \sum_{i \in S_2} \nabla f_i(x_2) =: \tilde g_2$
● …
Here, the $S_i$ are randomly chosen subsets of $\{1,\dots,N\}$ of a fixed, but relatively small, size $M \ll N$.
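A sketch of this mini-batch variant, with the same hypothetical `grad_f_i(i, x)` helper; the batch size `M` is assumed to be fixed up front:

```python
import numpy as np

rng = np.random.default_rng()

def minibatch_gradient(x, grad_f_i, N, M):
    """Idea 3: draw a random subset S of {1, ..., N} of fixed size M << N
    and sum the corresponding partial gradients."""
    S = rng.choice(np.arange(1, N + 1), size=M, replace=False)
    return sum(grad_f_i(i, x) for i in S)
```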
Stochastic gradient descent

Analysis: Why might anything like this work at all?
● The approximate gradient direction in each step is wrong.
● The search direction might not even be a descent direction.
● The sum of each block of $N$ partial gradients equals one exact gradient, so there does not seem to be any savings.
But:
● On average, the search direction will be correct (see the computation below).
● In many practical cases, the functions $f_i$ are not truly independent, but have redundancy.
Consequence: Far fewer than $N$ stochastic steps are needed to make the same progress as one exact gradient step!
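A short calculation supporting the "on average correct" claim, written out here as a sketch (the slide states the result but not the computation): if the index $r_k$ is drawn uniformly from $\{1,\dots,N\}$, the expected approximate gradient is the exact gradient up to the constant factor $1/N$, which only rescales the step length.

```latex
% Expectation of the randomly chosen partial gradient (Idea 2),
% with r_k uniform on {1, ..., N}:
\mathbb{E}\bigl[\tilde g_k\bigr]
  = \mathbb{E}\bigl[\nabla f_{r_k}(x_k)\bigr]
  = \sum_{i=1}^{N} \frac{1}{N}\,\nabla f_i(x_k)
  = \frac{1}{N}\,\nabla f(x_k).
```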
Stochastic Newton

Idea: The same principle can be applied to Newton's method. Either select a single $f_{r_k}$ in each iteration and approximate

  $g_k = \nabla f(x_k) \approx \nabla f_{r_k}(x_k) =: \tilde g_k$
  $H_k = \nabla^2 f(x_k) \approx \nabla^2 f_{r_k}(x_k) =: \tilde H_k$

or use a small subset:

  $g_k = \nabla f(x_k) \approx \sum_{i \in S_k} \nabla f_i(x_k) =: \tilde g_k$
  $H_k = \nabla^2 f(x_k) \approx \sum_{i \in S_k} \nabla^2 f_i(x_k) =: \tilde H_k$
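A sketch of one subsampled Newton step under these definitions; the helpers `grad_f_i(i, x)` and `hess_f_i(i, x)` are hypothetical, and the plain linear solve ignores the safeguards (Hessian modification, step-length control) a practical method would need:

```python
import numpy as np

rng = np.random.default_rng()

def stochastic_newton_step(x, grad_f_i, hess_f_i, N, M, alpha):
    """One stochastic Newton iteration: build ~g_k and ~H_k from a random
    subset S_k of size M, then step along -~H_k^{-1} ~g_k."""
    S = rng.choice(np.arange(1, N + 1), size=M, replace=False)
    g_tilde = sum(grad_f_i(i, x) for i in S)       # subsampled gradient
    H_tilde = sum(hess_f_i(i, x) for i in S)       # subsampled Hessian
    p = -np.linalg.solve(H_tilde, g_tilde)         # approximate Newton direction
    return x + alpha * p
```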
Summary

Redundancy: In many practical cases, the functions $f_i$ are not truly independent, but have redundancy.

Stochastic methods:
● exploit this by evaluating only a small subset of these functions in each iteration;
● can be shown to converge under certain conditions;
● are often faster than the original method because
  – they require vastly fewer function evaluations in each iteration,
  – even though they require more iterations.