Variance-based Stochastic Gradient Descent (vSGD): No More Pesky Learning Rates (Schaul et al., ICML 2013)
The idea
- Remove the need for hand-tuned learning rates by computing, per parameter, the rate that is optimal given running estimates of the gradient, the gradient variance, and the diagonal Hessian.
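A minimal sketch of that per-parameter rule, assuming a fixed decay for the running averages (the paper adapts each parameter's memory size) and a diagonal Hessian estimate supplied from outside (the paper estimates it with a bbprop-style procedure); all names here are illustrative:

```python
import numpy as np

def vsgd_step(theta, grad, hess_diag, g_bar, v_bar, h_bar, decay=0.9):
    """Simplified vSGD step: adapt each parameter's learning rate from
    running estimates of the gradient, squared gradient, and curvature."""
    g_bar = decay * g_bar + (1 - decay) * grad               # E[g]
    v_bar = decay * v_bar + (1 - decay) * grad ** 2          # E[g^2] = variance + mean^2
    h_bar = decay * h_bar + (1 - decay) * np.abs(hess_diag)  # diagonal curvature estimate
    eta = g_bar ** 2 / (h_bar * v_bar + 1e-12)               # per-parameter optimal rate
    return theta - eta * grad, g_bar, v_bar, h_bar
```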
ADAM: A Method for Stochastic Optimization (Kingma & Ba, arXiv 2014)
The idea
- Keep exponential moving averages of the gradient and the squared gradient (first and second moments) and step by their ratio; the step size then acts like a trust region within which the current gradient estimate is assumed to hold.
- Aims to combine AdaGrad's robustness to sparse gradients with RMSProp's robustness to non-stationary objectives.
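A sketch of the update rule, using the default hyperparameters reported in the paper; the function and variable names are my own:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step: update biased moment estimates, correct the bias
    from zero initialization, then take a scaled step."""
    m = beta1 * m + (1 - beta1) * grad         # first moment (moving average of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment (moving average of squared gradients)
    m_hat = m / (1 - beta1 ** t)               # bias correction; t counts steps from 1
    v_hat = v / (1 - beta2 ** t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v
```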
Alternative form: AdaMax
- In ADAM the second moment is an exponential average of squared gradients (an L2 quantity), and its square root scales the update.
- Replacing the power of two with a power of p and letting p go to infinity yields AdaMax, which scales by an exponentially weighted infinity norm (a decayed running max of gradient magnitudes).
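A sketch of the variant, with alpha = 0.002 as suggested in the paper; the small eps is my own safeguard (the paper notes none is strictly needed, since the running max stays away from zero once gradients are nonzero):

```python
import numpy as np

def adamax_step(theta, grad, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaMax step: ADAM's L2-based second moment is replaced by an
    exponentially weighted infinity norm (a decayed running max)."""
    m = beta1 * m + (1 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))    # infinity-norm analogue of ADAM's v
    return theta - (alpha / (1 - beta1 ** t)) * m / (u + eps), m, u
```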
Results
AdaGrad: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (Duchi et al., COLT 2010)
The idea
- Shrink each parameter's step over time by dividing by the square root of its accumulated sum of squared gradients, so parameters that have received large or frequent gradients are penalized with smaller updates.
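A minimal sketch of the diagonal (per-parameter) form; the names and the epsilon safeguard are illustrative:

```python
import numpy as np

def adagrad_step(theta, grad, accum, alpha=0.01, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients and scale each
    parameter's step by the inverse square root of its accumulator."""
    accum = accum + grad ** 2                  # only ever grows
    return theta - alpha * grad / (np.sqrt(accum) + eps), accum
```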
The problem
- Because the accumulator only grows, the learning rate only ever decreases.
- Complex or non-stationary problems may need more freedom than a monotonically shrinking rate allows.
Precursor to
- AdaDelta (Zeiler, arXiv 2012), sketched after this list
  - Uses the square root of an exponential moving average of squared gradients instead of an ever-growing accumulation.
  - Approximates a Hessian correction with the same kind of moving average taken over the squared parameter updates.
  - Removes the need for a learning rate.
- AdaSecant (Gulcehre et al., arXiv 2014)
  - Uses expected values (moving averages) to reduce the variance of its secant-based curvature estimates.
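A sketch of the AdaDelta update referenced above, using the decay and epsilon values reported in Zeiler's paper; the function name is my own:

```python
import numpy as np

def adadelta_step(theta, grad, eg2, edx2, rho=0.95, eps=1e-6):
    """One AdaDelta step: RMS of recent gradients in the denominator and
    RMS of recent updates in the numerator, so no global learning rate."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                    # moving average of squared gradients
    delta = -(np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps)) * grad
    edx2 = rho * edx2 + (1 - rho) * delta ** 2                 # moving average of squared updates
    return theta + delta, eg2, edx2
```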
Comparisons
- https://cs.stanford.edu/people/karpathy/convnetjs/demo/trainers.html
- Doesn't have ADAM in the default run, but ADAM is implemented and can be added.
- Doesn't have Batch Normalization, vSGD, AdaMax, or AdaSecant.
Questions?