The Effect of Network Width on Stochastic Gradient Descent and Generalization

Daniel S. Park
Google
ICML 2019
Work with Jascha Sohl-Dickstein, Quoc V. Le and Samuel L. Smith.
Motivation

Let us assume that
• we have found hyperparameters that maximize test set accuracy for a given network,
• but now we want to make the network bigger by widening all of its channels by a factor w.

What do we do with the hyperparameters for the new network?
Main Result

We find a rule that governs how the hyperparameters that maximize test accuracy change when the network width is varied: the optimal value of the normalized noise scale (a function of the SGD hyperparameters) scales proportionally with the width of the network.
The Normalized Noise Scale

• The normalized noise scale

    ḡ = ε / (B (1 − m) · σ²_init)

  governs how noisy SGD is, where ε is the learning rate, B the batch size, m the momentum coefficient, and σ²_init the variance of the weight initialization.
• ḡ determines the generalization performance.*

*Mandt et al. (2017); Chaudhari & Soatto (2017); Jastrzebski et al. (2017); Smith & Le (2017).
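As a rough illustration (not from the slides), the normalized noise scale can be computed directly from the SGD hyperparameters; the function name and the numerical values below are hypothetical.

```python
def normalized_noise_scale(lr, batch_size, momentum, init_variance):
    """Normalized noise scale g_bar = lr / (batch_size * (1 - momentum) * init_variance).

    lr: SGD learning rate (epsilon), batch_size: B, momentum: m,
    init_variance: variance of the weight initialization (sigma^2_init).
    """
    return lr / (batch_size * (1.0 - momentum) * init_variance)

# Illustrative settings only.
g_bar = normalized_noise_scale(lr=0.1, batch_size=128, momentum=0.9,
                               init_variance=2.0 / 256)
print(g_bar)  # 1.0 for these made-up numbers
```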
Rule for Hyperparameter Selection

• There exists a simple rule for hyperparameter selection: increase ḡ proportionally with w.
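One minimal way to apply this rule (a sketch under stated assumptions, not the authors' code): compute ḡ for the tuned narrow network, then choose a new learning rate so that ḡ grows by the widening factor w, with batch size and momentum held fixed. The function and all numbers below are hypothetical, and the 1/w shrinkage of the init variance is an assumption that depends on the init scheme (e.g. He/Glorot-style fan-in scaling).

```python
def rescale_lr_for_width(lr, batch_size, momentum, init_var_old, init_var_new, w):
    """Return a learning rate whose normalized noise scale is w times the old one.

    Solves  lr_new / (B (1 - m) init_var_new) = w * lr / (B (1 - m) init_var_old)
    for lr_new, with batch size B and momentum m held fixed.
    """
    g_bar_old = lr / (batch_size * (1.0 - momentum) * init_var_old)
    g_bar_target = w * g_bar_old
    return g_bar_target * batch_size * (1.0 - momentum) * init_var_new

# Widen channels by w = 2; with fan-in-scaled init the per-weight variance
# shrinks roughly as 1 / w (an assumption).
lr_new = rescale_lr_for_width(lr=0.1, batch_size=128, momentum=0.9,
                              init_var_old=2.0 / 256, init_var_new=2.0 / 512, w=2)
print(lr_new)  # 0.1: here the smaller init variance already doubles g_bar on its own
```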
Wider networks require smaller batch sizes

• To maximize generalization performance, wide networks (eventually) need to be trained with small batch sizes:

    B_opt ≤ (constant) / w
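A toy reading of this bound (the constant and batch sizes below are made up): once the constant is fixed from one width, the admissible optimal batch size shrinks as the width grows.

```python
# Hypothetical: suppose the optimal batch size observed at width factor w = 1 fixes the constant.
constant = 256.0  # made-up value, equal to B_opt at w = 1

for w in (1, 2, 4, 8):
    print(f"width factor {w}: B_opt <= {constant / w:.0f}")
# width factor 1: B_opt <= 256
# width factor 2: B_opt <= 128
# width factor 4: B_opt <= 64
# width factor 8: B_opt <= 32
```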
Bigger networks perform better due to noise resistance

• Bigger networks have better peak test set performance, which is reached at higher noise scales.
Visit our poster (Pacific Ballroom #55) to learn more. Thank you!