Fitting Neural Networks: Gradient Descent and Stochastic Gradient Descent
CS109A Introduction to Data Science
Pavlos Protopapas, Kevin Rader and Chris Tanner
New requirement for the final project: For the first time ever, researchers who submit papers to NeurIPS or other conferences must now state the "potential broader impact of their work" on society. The CS109A final project will also include the same requirement: a "potential broader impact of your work" statement.
A guide to writing the impact statement: https://medium.com/@BrentH/suggestions-for-writing-neurips-2020-broader-impacts-statements-121da1b765bf
Outline
• Gradient Descent
• Stochastic Gradient Descent
Considerations
• We still need to calculate the derivatives.
• We need to know what the learning rate is, or how to set it.
• Local vs. global minima.
• The full likelihood function sums up all the individual 'errors'. Unless you are a statistician, this can mean hundreds of thousands of examples.
Calculate the Derivatives
Can we do it? Wolfram Alpha can do it for us! But we need a formalism to deal with these derivatives.
Example: logistic regression derivatives.
Chain Rule
Chain rule for computing gradients.

Scalar case: for $z = f(x)$ and $u = g(z) = g(f(x))$,
$$\frac{du}{dx} = \frac{du}{dz}\,\frac{dz}{dx}$$

Vector case: for $\mathbf{z} = f(\mathbf{x})$ and $u = g(\mathbf{z}) = g(f(\mathbf{x}))$,
$$\frac{\partial u}{\partial x_i} = \sum_j \frac{\partial u}{\partial z_j}\,\frac{\partial z_j}{\partial x_i}$$

For longer chains:
$$\frac{\partial z}{\partial x_i} = \sum_{j_1}\cdots\sum_{j_m} \frac{\partial z}{\partial y_{j_m}}\cdots\frac{\partial y_{j_1}}{\partial x_i}$$
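As a sanity check, the scalar chain rule can be verified numerically. The functions below are illustrative choices (not from the slides): a hand-computed chain-rule gradient is compared with a finite-difference estimate.

```python
import math

def f(x):          # inner function: z = f(x) = x^2
    return x ** 2

def g(z):          # outer function: u = g(z) = sin(z)
    return math.sin(z)

def grad_chain(x):
    """du/dx via the chain rule: (du/dz) * (dz/dx)."""
    z = f(x)
    du_dz = math.cos(z)   # g'(z)
    dz_dx = 2 * x         # f'(x)
    return du_dz * dz_dx

def grad_numeric(x, h=1e-6):
    """Central finite-difference estimate of d g(f(x)) / dx."""
    return (g(f(x + h)) - g(f(x - h))) / (2 * h)

x = 1.3
assert abs(grad_chain(x) - grad_numeric(x)) < 1e-5
```

The same pattern generalizes to the long chains used for the logistic regression derivatives below: multiply the local derivatives along the chain.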
Logistic Regression Derivatives
For logistic regression, the negative log-likelihood is:
$$\mathcal{L} = \sum_i \mathcal{L}_i = -\sum_i \log L_i = -\sum_i \left[ z_i \log q_i + (1 - z_i)\log(1 - q_i) \right]$$
where $q_i = \dfrac{1}{1 + e^{-W^T X_i}}$, so
$$\mathcal{L}_i = -z_i \log \frac{1}{1 + e^{-W^T X_i}} - (1 - z_i)\log\left(1 - \frac{1}{1 + e^{-W^T X_i}}\right)$$
To simplify the analysis, let us split it into two parts:
$$\mathcal{L}_i = \mathcal{L}_i^A + \mathcal{L}_i^B$$
So the derivative with respect to $W$ is:
$$\frac{\partial \mathcal{L}}{\partial W} = \sum_i \frac{\partial \mathcal{L}_i}{\partial W} = \sum_i \left( \frac{\partial \mathcal{L}_i^A}{\partial W} + \frac{\partial \mathcal{L}_i^B}{\partial W} \right)$$
First part (dropping the example subscript $i$ for brevity):
$$\mathcal{L}^A = -z \log \frac{1}{1 + e^{-W^T X}}$$

Intermediate variables and their partial derivatives:
$$o_1 = -W^T X, \qquad \frac{\partial o_1}{\partial W} = -X$$
$$o_2 = e^{o_1} = e^{-W^T X}, \qquad \frac{\partial o_2}{\partial o_1} = e^{o_1}$$
$$o_3 = 1 + o_2 = 1 + e^{-W^T X}, \qquad \frac{\partial o_3}{\partial o_2} = 1$$
$$o_4 = \frac{1}{o_3} = \frac{1}{1 + e^{-W^T X}} = q, \qquad \frac{\partial o_4}{\partial o_3} = -\frac{1}{o_3^2}$$
$$o_5 = \log o_4 = \log q, \qquad \frac{\partial o_5}{\partial o_4} = \frac{1}{o_4} = 1 + e^{-W^T X}$$
$$\mathcal{L}^A = -z\,o_5, \qquad \frac{\partial \mathcal{L}^A}{\partial o_5} = -z$$

Chaining them together:
$$\frac{\partial \mathcal{L}^A}{\partial W} = \frac{\partial \mathcal{L}^A}{\partial o_5}\,\frac{\partial o_5}{\partial o_4}\,\frac{\partial o_4}{\partial o_3}\,\frac{\partial o_3}{\partial o_2}\,\frac{\partial o_2}{\partial o_1}\,\frac{\partial o_1}{\partial W} = -z\,X\,\frac{e^{-W^T X}}{1 + e^{-W^T X}}$$
Second part:
$$\mathcal{L}^B = -(1 - z) \log\left[1 - \frac{1}{1 + e^{-W^T X}}\right]$$

The variables $o_1$ through $o_4$ are as before; then:
$$o_5 = 1 - o_4 = 1 - q = \frac{e^{-W^T X}}{1 + e^{-W^T X}}, \qquad \frac{\partial o_5}{\partial o_4} = -1$$
$$o_6 = \log o_5 = \log(1 - q), \qquad \frac{\partial o_6}{\partial o_5} = \frac{1}{o_5} = \frac{1 + e^{-W^T X}}{e^{-W^T X}}$$
$$\mathcal{L}^B = -(1 - z)\,o_6, \qquad \frac{\partial \mathcal{L}^B}{\partial o_6} = -(1 - z)$$

Chaining them together:
$$\frac{\partial \mathcal{L}^B}{\partial W} = \frac{\partial \mathcal{L}^B}{\partial o_6}\,\frac{\partial o_6}{\partial o_5}\,\frac{\partial o_5}{\partial o_4}\,\frac{\partial o_4}{\partial o_3}\,\frac{\partial o_3}{\partial o_2}\,\frac{\partial o_2}{\partial o_1}\,\frac{\partial o_1}{\partial W} = (1 - z)\,X\,\frac{1}{1 + e^{-W^T X}}$$
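Summing the two parts collapses to the familiar logistic-regression gradient, $\partial\mathcal{L}_i/\partial W = (q_i - z_i)\,X_i$. A quick numerical sketch (toy values, illustrative) verifies this closed form against finite differences:

```python
import math

def q(W, X):
    """Predicted probability 1 / (1 + exp(-W^T X))."""
    s = sum(w * x for w, x in zip(W, X))
    return 1.0 / (1.0 + math.exp(-s))

def loss(W, X, z):
    """Per-example negative log-likelihood."""
    p = q(W, X)
    return -(z * math.log(p) + (1 - z) * math.log(1 - p))

def grad_analytic(W, X, z):
    """The combined result of the two chain-rule derivations: (q - z) X."""
    return [(q(W, X) - z) * x for x in X]

def grad_numeric(W, X, z, h=1e-6):
    """Central finite differences on each weight."""
    g = []
    for j in range(len(W)):
        Wp = W[:]; Wp[j] += h
        Wm = W[:]; Wm[j] -= h
        g.append((loss(Wp, X, z) - loss(Wm, X, z)) / (2 * h))
    return g

W, X, z = [0.5, -1.0], [1.0, 2.0], 1   # toy values
for a, n in zip(grad_analytic(W, X, z), grad_numeric(W, X, z)):
    assert abs(a - n) < 1e-5
```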
Learning Rate
Our choice of the learning rate 𝜃 has a significant impact on the performance of gradient descent.
• When 𝜃 is too large, the algorithm may overshoot the minimum and oscillate wildly.
• When 𝜃 is appropriate, the algorithm will find the minimum: it converges!
• When 𝜃 is too small, the algorithm makes very little progress.
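These three regimes can be reproduced on a toy quadratic loss $f(w) = w^2$ (an illustrative sketch; `eta` here stands for the learning rate 𝜃, and all values are made up for the example):

```python
def gradient_descent(eta, w0=1.0, steps=50):
    """Run plain gradient descent on f(w) = w**2, whose gradient is 2w."""
    w = w0
    for _ in range(steps):
        w -= eta * 2 * w
        if abs(w) > 1e6:   # stop early if the iterates blow up
            break
    return w

small = gradient_descent(eta=0.001)   # too small: barely moves from w0
good  = gradient_descent(eta=0.1)     # appropriate: converges near 0
large = gradient_descent(eta=1.1)     # too large: overshoots, oscillates outward

assert abs(good) < 1e-3 < abs(small) < abs(large)
```

The update factor is $(1 - 2\eta)$ per step, so any $\eta > 1$ flips the sign each iteration and grows the magnitude, which is exactly the overshooting oscillation described above.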
How can we tell when gradient descent is converging?
We visualize the loss function at each step of gradient descent. This is called the trace plot.
• In one case, while the loss is decreasing throughout training, it does not look like the descent has hit the bottom: the loss mostly oscillates between values rather than converging.
• In the other, the loss has decreased significantly during training; towards the end it stabilizes and cannot decrease further.
Learning Rate
There are many alternative methods that address how to set or adjust the learning rate, using first or second derivatives and/or momentum. More on this later.
Local vs. Global Minima
If we choose 𝜃 correctly, then gradient descent will converge to a stationary point. But will this point be a global minimum?
If the function is convex, then the stationary point will be a global minimum.
Local vs. Global Minima
In general, there is no guarantee that we get the global minimum.
Question: What would be a good strategy?
• Random restarts
• Add noise to the loss function
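A minimal random-restarts sketch (the function $f(x) = x^4 - 3x^2 + x$ and all values here are illustrative assumptions, not from the slides): $f$ has a shallow local minimum near $x \approx 1.1$ and the global minimum near $x \approx -1.3$, so descending from several random starting points and keeping the best result usually recovers the global one.

```python
import random

def f(x):
    """Non-convex toy objective with two minima."""
    return x**4 - 3 * x**2 + x

def fprime(x):
    return 4 * x**3 - 6 * x + 1

def descend(x, eta=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        x -= eta * fprime(x)
    return x

random.seed(0)
starts = [random.uniform(-2, 2) for _ in range(10)]   # random restarts
best = min((descend(x0) for x0 in starts), key=f)     # keep the best result
assert best < 0   # the global minimum lies on the negative side
```

A single descent from an unlucky start (e.g. $x_0 = 1.4$) would get stuck in the local minimum; the restarts make that failure mode unlikely.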
Batch and Stochastic Gradient Descent
$$\mathcal{L} = -\sum_i \left[ z_i \log q_i + (1 - z_i) \log(1 - q_i) \right]$$
Instead of using all the examples for every step, use a subset of them (a batch). For each iteration $k$, use the following loss function to derive the derivatives:
$$\mathcal{L}_k = -\sum_{i \in M_k} \left[ z_i \log q_i + (1 - z_i) \log(1 - q_i) \right]$$
where $M_k$ is the set of examples in batch $k$. This is an approximation to the full loss function.
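A minimal minibatch-SGD sketch for this logistic loss (the 1-D toy dataset, batch size, learning rate, and epoch count below are all illustrative choices, not from the slides): each step uses the gradient of the batch loss $\mathcal{L}_k$ in place of the full loss, and the data are reshuffled once per epoch.

```python
import math
import random

random.seed(1)
# Toy 1-D dataset: feature x in [-5, 4.9], label z = 1 when x > 0.
data = [(x / 10.0, 1 if x > 0 else 0) for x in range(-50, 50)]

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

w, b = 0.0, 0.0
eta, batch_size = 0.5, 10
for epoch in range(200):
    random.shuffle(data)                    # reshuffle once per epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        gw = gb = 0.0
        for x, z in batch:                  # gradient of the batch loss:
            q = sigmoid(w * x + b)          # (q - z) * x for the weight,
            gw += (q - z) * x               # (q - z) for the bias
            gb += (q - z)
        w -= eta * gw / len(batch)
        b -= eta * gb / len(batch)

# The fitted classifier should separate the two classes almost perfectly.
correct = sum((sigmoid(w * x + b) > 0.5) == (z == 1) for x, z in data)
assert correct >= 95
```

Each update is cheap (10 examples rather than 100 here), at the cost of a noisier gradient, which is the trade-off the slide describes.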
[Diagram: one epoch of stochastic gradient descent. For each batch of the data in turn, calculate the loss $\mathcal{L}$, compute $\partial\mathcal{L}/\partial W$, and update $W \leftarrow W - \theta\,\partial\mathcal{L}/\partial W$. Once all the data have been used, one epoch is complete; reshuffle the data and repeat.]
Batch and Stochastic Gradient Descent
[Sequence of figures: the descent path under the full likelihood $\mathcal{L}$ compared with the noisier path under the batch likelihood $\mathcal{L}_k$.]