Gradient Descent Finds Global Minima of Deep Neural Networks
Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai
Empirical Observations on Empirical Risk
• Zhang et al., 2017, Understanding Deep Learning Requires Rethinking Generalization.
• Randomization test: replace the true labels with random labels.
• Observation: the empirical risk goes to 0 for both true labels and random labels.
• Conjecture: this happens because neural networks are over-parameterized.
• Open problem: why can gradient descent find a neural network that fits all labels?
Setup
• Training data: $\{x_i, y_i\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
• A model. Fully connected neural network:
  $f(\theta, x) = W_L\, \sigma(W_{L-1} \cdots W_2\, \sigma(W_1 x) \cdots)$
• A loss function. Quadratic loss:
  $R(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( f(\theta, x_i) - y_i \right)^2$
• An optimization algorithm. Gradient descent:
  $\theta(t+1) \leftarrow \theta(t) - \eta\, \frac{\partial R(\theta(t))}{\partial \theta(t)}$
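The setup above can be made concrete with a short sketch. Below is a minimal JAX version assuming tanh as the smooth activation, a width-$m$ Gaussian initialization, and scalar outputs; the names `init_params`, `f`, `risk`, and `gd_step` are illustrative and not taken from the paper's code.

```python
# Minimal sketch of the setup in JAX (illustrative, not the paper's code).
import jax
import jax.numpy as jnp

def init_params(key, d, m, L):
    """Gaussian initialization of an L-layer fully connected network."""
    dims = [d] + [m] * (L - 1) + [1]          # input -> (L-1) hidden layers -> scalar output
    keys = jax.random.split(key, L)
    return [jax.random.normal(k, (dims[i + 1], dims[i])) / jnp.sqrt(dims[i])
            for i, k in enumerate(keys)]

def f(params, x):
    """f(theta, x) = W_L sigma(W_{L-1} ... sigma(W_1 x) ...) with a smooth activation."""
    h = x
    for W in params[:-1]:
        h = jnp.tanh(W @ h)                   # tanh as a stand-in smooth activation
    return (params[-1] @ h)[0]

def risk(params, X, y):
    """Quadratic loss R(theta) = 1/(2n) sum_i (f(theta, x_i) - y_i)^2."""
    u = jax.vmap(lambda x: f(params, x))(X)
    return 0.5 * jnp.mean((u - y) ** 2)

@jax.jit
def gd_step(params, X, y, eta):
    """One gradient descent step: theta <- theta - eta * dR(theta)/dtheta."""
    grads = jax.grad(risk)(params, X, y)
    return [W - eta * g for W, g in zip(params, grads)]
```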
Trajectory-based Analysis
$\theta(t+1) \leftarrow \theta(t) - \eta\, \frac{\partial R(\theta(t))}{\partial \theta(t)}$
• Trajectory of parameters: $\theta(0), \theta(1), \theta(2), \ldots$
• Predictions: $u_i(t) \triangleq f(\theta(t), x_i)$, $u(t) \triangleq (u_1(t), \ldots, u_n(t))^{\top} \in \mathbb{R}^n$
• Trajectory of predictions: $u(0), u(1), u(2), \ldots$
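Building on the setup sketch above, the trajectory-based view only needs the sequence of prediction vectors $u(0), u(1), \ldots$ recorded along the gradient descent path; `predictions` and `run_gd` are hypothetical helper names.

```python
# Record the prediction trajectory u(0), u(1), ... during gradient descent
# (reuses f and gd_step from the setup sketch above).
def predictions(params, X):
    """u(t) = (f(theta(t), x_1), ..., f(theta(t), x_n))^T in R^n."""
    return jax.vmap(lambda x: f(params, x))(X)

def run_gd(params, X, y, eta, steps):
    """Run gradient descent and keep the trajectory of predictions."""
    traj_u = [predictions(params, X)]
    for _ in range(steps):
        params = gd_step(params, X, y, eta)
        traj_u.append(predictions(params, X))
    return params, traj_u                      # the analysis tracks traj_u, not the parameters
```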
Proof Sketch
• Simplified form (continuous time):
  $\frac{du(t)}{dt} = \sum_{\ell=1}^{L} H^{\ell}(t)\,\big(y - u(t)\big), \qquad H^{\ell}_{ij}(t) = \frac{1}{n}\left\langle \frac{\partial u_i(t)}{\partial W_\ell(t)},\, \frac{\partial u_j(t)}{\partial W_\ell(t)} \right\rangle$
• Random initialization + concentration + perturbation analysis:
  $\lim_{m \to \infty} \sum_{\ell=1}^{L} H^{\ell}(0) = H^{\infty}, \qquad \lim_{m \to \infty} \sum_{\ell=1}^{L} H^{\ell}(t) = \sum_{\ell=1}^{L} H^{\ell}(0) \quad \forall\, t \ge 0$
• Linear ODE theory:
  $\| u(t) - y \|_2^2 \le \exp(-\lambda_0 t)\, \| u(0) - y \|_2^2, \qquad \lambda_0 = \lambda_{\min}(H^{\infty})$
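At finite width the Gram matrices $H^{\ell}(t)$ can be formed directly from per-example Jacobians, as in the following sketch. It reuses `predictions` from the previous block; `gram_matrices` is an illustrative name, and the closing comment restates the linear ODE bound rather than proving it.

```python
# Layer-wise Gram matrices H^l(t) built from per-example Jacobians
# (a finite-width, discrete-time stand-in for the continuous-time argument).
def gram_matrices(params, X):
    """H^l_ij = (1/n) <du_i/dW_l, du_j/dW_l>; returns one n x n matrix per layer."""
    n = X.shape[0]
    jac = jax.jacobian(lambda p: predictions(p, X))(params)   # list: one Jacobian per W_l
    Hs = []
    for J in jac:                                  # J has shape (n, *W_l.shape)
        Jf = J.reshape(n, -1)
        Hs.append(Jf @ Jf.T / n)
    return Hs

# If sum_l H^l(t) stays close to a fixed PSD matrix H_inf with smallest eigenvalue
# lambda_0 > 0, the linear ODE du/dt = H(t)(y - u) gives
#     ||u(t) - y||_2^2 <= exp(-lambda_0 * t) * ||u(0) - y||_2^2.
```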
Main Results
Theorem 1. For a fully connected neural network with smooth activation, if the width satisfies $m = \mathrm{poly}(n, 2^{O(L)}, 1/\lambda_0)$ and the step size satisfies $\eta = O\!\left(\frac{\lambda_0}{n^2\, 2^{O(L)}}\right)$, then with high probability over the random initialization, for $k = 1, 2, \ldots$
  $R(\theta(k)) \le \left(1 - \eta \lambda_0\right)^{k} R(\theta(0)).$
• First global linear convergence guarantee for deep neural networks.
• The exponential dependence on the depth comes from error propagation across layers.
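As a sanity check (not part of the paper), the linear rate in Theorem 1 can be compared numerically against the actual risk along the trajectory, using $\lambda_{\min}$ of the initial Gram matrix $\sum_{\ell} H^{\ell}(0)$ as a stand-in for $\lambda_0 = \lambda_{\min}(H^{\infty})$; all sizes and the step size below are arbitrary choices for illustration.

```python
# Hypothetical numerical check of the linear-rate bound (arbitrary sizes),
# using lambda_min of sum_l H^l(0) as a proxy for lambda_0 = lambda_min(H_inf).
key = jax.random.PRNGKey(0)
d, m, L, n, eta, steps = 5, 512, 3, 20, 0.5, 200
kx, ky, kp = jax.random.split(key, 3)
X = jax.random.normal(kx, (n, d))
X = X / jnp.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
y = jax.random.normal(ky, (n,))

params0 = init_params(kp, d, m, L)
lam0 = float(jnp.linalg.eigvalsh(sum(gram_matrices(params0, X)))[0])

R0 = float(risk(params0, X, y))
params_k, _ = run_gd(params0, X, y, eta, steps)
Rk = float(risk(params_k, X, y))
bound = (1.0 - eta * lam0) ** steps * R0
print(f"R(theta(k)) = {Rk:.3e}   vs. rate bound (1 - eta*lam0)^k R(theta(0)) = {bound:.3e}")
```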
Main Results (Cont'd)
Theorem 2. For a ResNet or convolutional ResNet with smooth activation, if the width satisfies $m = \mathrm{poly}(n, L, 1/\lambda_0)$ and the step size satisfies $\eta = O\!\left(\frac{\lambda_0}{\mathrm{poly}(n, L)}\right)$, then with high probability over the random initialization, for $k = 1, 2, \ldots$
  $R(\theta(k)) \le \left(1 - \eta \lambda_0\right)^{k} R(\theta(0)).$
• The ResNet architecture makes the error propagation more stable, giving an exponential improvement over fully connected networks: the required width and step size depend only polynomially on the depth $L$.
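For contrast with the fully connected forward pass in the setup sketch, a generic residual block looks as follows. This is a schematic residual parameterization, not necessarily the exact one analyzed in the paper, but it shows why the skip connection keeps each layer's output close to its input.

```python
# Schematic residual-block forward pass (not necessarily the paper's exact
# parameterization): the identity skip connection keeps each layer's output
# close to its input, which is what tames the depth-wise error propagation.
def f_res(params, x, scale=0.1):
    """params has the same shapes as in init_params: input layer, residual blocks, output layer."""
    h = params[0] @ x                       # input layer: R^d -> R^m
    for W in params[1:-1]:
        h = h + scale * jnp.tanh(W @ h)     # residual block with a smooth activation
    return (params[-1] @ h)[0]              # output layer: R^m -> R
```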
Learn more @ Pacific Ballroom #80