Fitting Neural Networks: Gradient Descent and Stochastic Gradient Descent
CS109A Introduction to Data Science
Pavlos Protopapas, Kevin Rader and Chris Tanner
New requirement for the final project: For the first time ever, researchers who submit papers to NeurIPS or other conferences must now state the "potential broader impact of their work" on society. The CS109A final project will also include the same requirement: a "potential broader impact of your work" statement.
A guide to writing the impact statement: https://medium.com/@BrentH/suggestions-for-writing-neurips-2020-broader-impacts-statements-121da1b765bf
Outline
• Gradient Descent
• Stochastic Gradient Descent
Considerations
• We still need to calculate the derivatives.
• We need to know what the learning rate is, or how to set it.
• Local vs. global minima.
• The full likelihood function sums up all the individual 'errors'. Unless you are a statistician, this can mean hundreds of thousands of examples.
Calculate the Derivatives
Can we do it? Wolfram Alpha can do it for us! But we need a formalism to deal with these derivatives.
Example: logistic regression derivatives.
Chain Rule
Chain rule for computing gradients.

Scalar case: for $z = f(x)$ and $u = g(z) = g(f(x))$,
$$\frac{du}{dx} = \frac{du}{dz}\,\frac{dz}{dx}$$

Vector case: for $\mathbf{z} = f(\mathbf{x})$ and $u = g(\mathbf{z}) = g(f(\mathbf{x}))$,
$$\frac{\partial u}{\partial x_i} = \sum_j \frac{\partial u}{\partial z_j}\,\frac{\partial z_j}{\partial x_i}$$

For longer chains:
$$\frac{\partial z}{\partial x_i} = \sum_{j_1}\cdots\sum_{j_m} \frac{\partial z}{\partial y_{j_m}}\cdots\frac{\partial y_{j_1}}{\partial x_i}$$
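As a sanity check, the scalar chain rule can be verified numerically. The functions below are illustrative choices (not from the slides): a hand-computed chain-rule gradient is compared with a finite-difference estimate.

```python
import math

def f(x):          # inner function: z = f(x) = x^2
    return x ** 2

def g(z):          # outer function: u = g(z) = sin(z)
    return math.sin(z)

def grad_chain(x):
    """du/dx via the chain rule: (du/dz) * (dz/dx)."""
    z = f(x)
    du_dz = math.cos(z)   # g'(z)
    dz_dx = 2 * x         # f'(x)
    return du_dz * dz_dx

def grad_numeric(x, h=1e-6):
    """Central finite-difference estimate of d g(f(x)) / dx."""
    return (g(f(x + h)) - g(f(x - h))) / (2 * h)

x = 1.3
assert abs(grad_chain(x) - grad_numeric(x)) < 1e-5
```

The same pattern generalizes to the long chains used for the logistic regression derivatives below: multiply the local derivatives along the chain.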
Logistic Regression Derivatives
For logistic regression, the negative log-likelihood is:
$$\mathcal{L} = \sum_i \mathcal{L}_i = -\sum_i \log L_i = -\sum_i \left[ z_i \log q_i + (1 - z_i)\log(1 - q_i) \right]$$
where $q_i = \dfrac{1}{1 + e^{-W^T X_i}}$, so
$$\mathcal{L}_i = -z_i \log \frac{1}{1 + e^{-W^T X_i}} - (1 - z_i)\log\left(1 - \frac{1}{1 + e^{-W^T X_i}}\right)$$
To simplify the analysis, let us split it into two parts:
$$\mathcal{L}_i = \mathcal{L}_i^A + \mathcal{L}_i^B$$
So the derivative with respect to $W$ is:
$$\frac{\partial \mathcal{L}}{\partial W} = \sum_i \frac{\partial \mathcal{L}_i}{\partial W} = \sum_i \left( \frac{\partial \mathcal{L}_i^A}{\partial W} + \frac{\partial \mathcal{L}_i^B}{\partial W} \right)$$
First part (dropping the example subscript $i$ for brevity):
$$\mathcal{L}^A = -z \log \frac{1}{1 + e^{-W^T X}}$$

Intermediate variables and their partial derivatives:
$$o_1 = -W^T X, \qquad \frac{\partial o_1}{\partial W} = -X$$
$$o_2 = e^{o_1} = e^{-W^T X}, \qquad \frac{\partial o_2}{\partial o_1} = e^{o_1}$$
$$o_3 = 1 + o_2 = 1 + e^{-W^T X}, \qquad \frac{\partial o_3}{\partial o_2} = 1$$
$$o_4 = \frac{1}{o_3} = \frac{1}{1 + e^{-W^T X}} = q, \qquad \frac{\partial o_4}{\partial o_3} = -\frac{1}{o_3^2}$$
$$o_5 = \log o_4 = \log q, \qquad \frac{\partial o_5}{\partial o_4} = \frac{1}{o_4} = 1 + e^{-W^T X}$$
$$\mathcal{L}^A = -z\,o_5, \qquad \frac{\partial \mathcal{L}^A}{\partial o_5} = -z$$

Chaining them together:
$$\frac{\partial \mathcal{L}^A}{\partial W} = \frac{\partial \mathcal{L}^A}{\partial o_5}\,\frac{\partial o_5}{\partial o_4}\,\frac{\partial o_4}{\partial o_3}\,\frac{\partial o_3}{\partial o_2}\,\frac{\partial o_2}{\partial o_1}\,\frac{\partial o_1}{\partial W} = -z\,X\,\frac{e^{-W^T X}}{1 + e^{-W^T X}}$$
Second part:
$$\mathcal{L}^B = -(1 - z) \log\left[1 - \frac{1}{1 + e^{-W^T X}}\right]$$

The variables $o_1$ through $o_4$ are as before; then:
$$o_5 = 1 - o_4 = 1 - q = \frac{e^{-W^T X}}{1 + e^{-W^T X}}, \qquad \frac{\partial o_5}{\partial o_4} = -1$$
$$o_6 = \log o_5 = \log(1 - q), \qquad \frac{\partial o_6}{\partial o_5} = \frac{1}{o_5} = \frac{1 + e^{-W^T X}}{e^{-W^T X}}$$
$$\mathcal{L}^B = -(1 - z)\,o_6, \qquad \frac{\partial \mathcal{L}^B}{\partial o_6} = -(1 - z)$$

Chaining them together:
$$\frac{\partial \mathcal{L}^B}{\partial W} = \frac{\partial \mathcal{L}^B}{\partial o_6}\,\frac{\partial o_6}{\partial o_5}\,\frac{\partial o_5}{\partial o_4}\,\frac{\partial o_4}{\partial o_3}\,\frac{\partial o_3}{\partial o_2}\,\frac{\partial o_2}{\partial o_1}\,\frac{\partial o_1}{\partial W} = (1 - z)\,X\,\frac{1}{1 + e^{-W^T X}}$$
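Summing the two parts collapses to the familiar logistic-regression gradient, $\partial\mathcal{L}_i/\partial W = (q_i - z_i)\,X_i$. A quick numerical sketch (toy values, illustrative) verifies this closed form against finite differences:

```python
import math

def q(W, X):
    """Predicted probability 1 / (1 + exp(-W^T X))."""
    s = sum(w * x for w, x in zip(W, X))
    return 1.0 / (1.0 + math.exp(-s))

def loss(W, X, z):
    """Per-example negative log-likelihood."""
    p = q(W, X)
    return -(z * math.log(p) + (1 - z) * math.log(1 - p))

def grad_analytic(W, X, z):
    """The combined result of the two chain-rule derivations: (q - z) X."""
    return [(q(W, X) - z) * x for x in X]

def grad_numeric(W, X, z, h=1e-6):
    """Central finite differences on each weight."""
    g = []
    for j in range(len(W)):
        Wp = W[:]; Wp[j] += h
        Wm = W[:]; Wm[j] -= h
        g.append((loss(Wp, X, z) - loss(Wm, X, z)) / (2 * h))
    return g

W, X, z = [0.5, -1.0], [1.0, 2.0], 1   # toy values
for a, n in zip(grad_analytic(W, X, z), grad_numeric(W, X, z)):
    assert abs(a - n) < 1e-5
```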
Learning Rate
Our choice of the learning rate 𝜃 has a significant impact on the performance of gradient descent.
• When 𝜃 is too large, the algorithm may overshoot the minimum and oscillate wildly.
• When 𝜃 is appropriate, the algorithm will find the minimum: it converges!
• When 𝜃 is too small, the algorithm makes very little progress.
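These three regimes can be reproduced on a toy quadratic loss $f(w) = w^2$ (an illustrative sketch; `eta` here stands for the learning rate 𝜃, and all values are made up for the example):

```python
def gradient_descent(eta, w0=1.0, steps=50):
    """Run plain gradient descent on f(w) = w**2, whose gradient is 2w."""
    w = w0
    for _ in range(steps):
        w -= eta * 2 * w
        if abs(w) > 1e6:   # stop early if the iterates blow up
            break
    return w

small = gradient_descent(eta=0.001)   # too small: barely moves from w0
good  = gradient_descent(eta=0.1)     # appropriate: converges near 0
large = gradient_descent(eta=1.1)     # too large: overshoots, oscillates outward

assert abs(good) < 1e-3 < abs(small) < abs(large)
```

The update factor is $(1 - 2\eta)$ per step, so any $\eta > 1$ flips the sign each iteration and grows the magnitude, which is exactly the overshooting oscillation described above.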
How can we tell when gradient descent is converging?
We visualize the loss function at each step of gradient descent. This is called the trace plot.
• In one case, while the loss is decreasing throughout training, it does not look like the descent has hit the bottom: the loss mostly oscillates between values rather than converging.
• In the other, the loss has decreased significantly during training; towards the end it stabilizes and cannot decrease further.
Learning Rate
There are many alternative methods that address how to set or adjust the learning rate, using first or second derivatives and/or momentum. More on this later.
Local vs. Global Minima
If we choose 𝜃 correctly, then gradient descent will converge to a stationary point. But will this point be a global minimum?
If the function is convex, then the stationary point will be a global minimum.
Local vs. Global Minima
In general, there is no guarantee that we get the global minimum.
Question: What would be a good strategy?
• Random restarts
• Add noise to the loss function
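A minimal random-restarts sketch (the function $f(x) = x^4 - 3x^2 + x$ and all values here are illustrative assumptions, not from the slides): $f$ has a shallow local minimum near $x \approx 1.1$ and the global minimum near $x \approx -1.3$, so descending from several random starting points and keeping the best result usually recovers the global one.

```python
import random

def f(x):
    """Non-convex toy objective with two minima."""
    return x**4 - 3 * x**2 + x

def fprime(x):
    return 4 * x**3 - 6 * x + 1

def descend(x, eta=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        x -= eta * fprime(x)
    return x

random.seed(0)
starts = [random.uniform(-2, 2) for _ in range(10)]   # random restarts
best = min((descend(x0) for x0 in starts), key=f)     # keep the best result
assert best < 0   # the global minimum lies on the negative side
```

A single descent from an unlucky start (e.g. $x_0 = 1.4$) would get stuck in the local minimum; the restarts make that failure mode unlikely.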
Batch and Stochastic Gradient Descent
$$\mathcal{L} = -\sum_i \left[ z_i \log q_i + (1 - z_i) \log(1 - q_i) \right]$$
Instead of using all the examples for every step, use a subset of them (a batch). For each iteration $k$, use the following loss function to derive the derivatives:
$$\mathcal{L}_k = -\sum_{i \in M_k} \left[ z_i \log q_i + (1 - z_i) \log(1 - q_i) \right]$$
where $M_k$ is the set of examples in batch $k$. This is an approximation to the full loss function.
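A minimal minibatch-SGD sketch for this logistic loss (the 1-D toy dataset, batch size, learning rate, and epoch count below are all illustrative choices, not from the slides): each step uses the gradient of the batch loss $\mathcal{L}_k$ in place of the full loss, and the data are reshuffled once per epoch.

```python
import math
import random

random.seed(1)
# Toy 1-D dataset: feature x in [-5, 4.9], label z = 1 when x > 0.
data = [(x / 10.0, 1 if x > 0 else 0) for x in range(-50, 50)]

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

w, b = 0.0, 0.0
eta, batch_size = 0.5, 10
for epoch in range(200):
    random.shuffle(data)                    # reshuffle once per epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        gw = gb = 0.0
        for x, z in batch:                  # gradient of the batch loss:
            q = sigmoid(w * x + b)          # (q - z) * x for the weight,
            gw += (q - z) * x               # (q - z) for the bias
            gb += (q - z)
        w -= eta * gw / len(batch)
        b -= eta * gb / len(batch)

# The fitted classifier should separate the two classes almost perfectly.
correct = sum((sigmoid(w * x + b) > 0.5) == (z == 1) for x, z in data)
assert correct >= 95
```

Each update is cheap (10 examples rather than 100 here), at the cost of a noisier gradient, which is the trade-off the slide describes.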
[Diagram: one epoch of stochastic gradient descent. For each batch of the data in turn, calculate the loss $\mathcal{L}$, compute $\partial\mathcal{L}/\partial W$, and update $W \leftarrow W - \theta\,\partial\mathcal{L}/\partial W$. Once all the data have been used, one epoch is complete; reshuffle the data and repeat.]
Batch and Stochastic Gradient Descent
[Sequence of figures: the descent path under the full likelihood $\mathcal{L}$ compared with the noisier path under the batch likelihood $\mathcal{L}_k$.]