Fitting Neural Networks: Gradient Descent and Stochastic Gradient Descent


  1. Fitting Neural Networks: Gradient Descent and Stochastic Gradient Descent. CS109A Introduction to Data Science. Pavlos Protopapas, Kevin Rader and Chris Tanner.

  2. New requirement for the final project. For the first time ever, researchers who submit papers to NeurIPS or other conferences must now state the "potential broader impact of their work" on society. The CS109A final project will also include the same requirement: the "potential broader impact of your work". A guide to writing the impact statement: https://medium.com/@BrentH/suggestions-for-writing-neurips-2020-broader-impacts-statements-121da1b765bf

  3. (Figure slide; no text content.)

  4. Outline
     • Gradient Descent
     • Stochastic Gradient Descent

  5. Considerations
     • We still need to calculate the derivatives.
     • We need to know what the learning rate is, or how to set it.
     • Local vs. global minima.
     • The full likelihood function involves summing up all the individual 'errors'; unless you are a statistician, this can mean hundreds of thousands of examples.

  6. Considerations
     • We still need to calculate the derivatives.
     • We need to know what the learning rate is, or how to set it.
     • Local vs. global minima.
     • The full likelihood function involves summing up all the individual 'errors'; unless you are a statistician, this can mean hundreds of thousands of examples.

  7. Calculate the Derivatives. Can we do it? Wolfram Alpha can do it for us! But we need a formalism to deal with these derivatives. Example: logistic regression derivatives.
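As a quick illustration of letting software compute the derivative for us, here is a minimal sketch (my own example, not from the slides) that differentiates the single-observation logistic loss symbolically with SymPy; the variable names w, x, y and the one-feature setup are assumptions made for the illustration.

```python
# Symbolic differentiation of the logistic-regression loss for one observation
# with a single feature. Assumes SymPy is installed; names are illustrative.
import sympy as sp

w, x, y = sp.symbols('w x y', real=True)
p = 1 / (1 + sp.exp(-w * x))                        # predicted probability
loss = -(y * sp.log(p) + (1 - y) * sp.log(1 - p))   # negative log-likelihood

dloss_dw = sp.simplify(sp.diff(loss, w))
print(dloss_dw)   # mathematically equal to x*(p - y), though SymPy may print another form
```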

  8. Chain Rule
     Chain rule for computing gradients. For scalars, if $z = h(x)$ and $u = g(z) = g(h(x))$, then
     $$\frac{\partial u}{\partial x} = \frac{\partial u}{\partial z}\,\frac{\partial z}{\partial x}.$$
     For vectors, if $\mathbf{z} = h(\mathbf{x})$ and $u = g(\mathbf{z}) = g(h(\mathbf{x}))$, then
     $$\frac{\partial u}{\partial x_i} = \sum_j \frac{\partial u}{\partial z_j}\,\frac{\partial z_j}{\partial x_i}.$$
     For longer chains:
     $$\frac{\partial z}{\partial x_i} = \sum_{j_1} \cdots \sum_{j_m} \frac{\partial z}{\partial y_{j_m}} \cdots \frac{\partial y_{j_1}}{\partial x_i}.$$
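The chain rule is easy to sanity-check numerically. Below is a minimal sketch (my own example, not from the slides) comparing the analytic chain-rule derivative of u = g(h(x)) with a central finite difference; the choices of h, g, and the evaluation point are illustrative.

```python
# Verify the chain rule numerically for u = g(h(x)) with h(x) = x**2, g(z) = sin(z).
import math

def h(x): return x * x
def g(z): return math.sin(z)

def dh_dx(x): return 2 * x
def dg_dz(z): return math.cos(z)

x = 1.3
analytic = dg_dz(h(x)) * dh_dx(x)                    # chain rule: du/dx = g'(h(x)) * h'(x)

eps = 1e-6                                            # central finite difference
numeric = (g(h(x + eps)) - g(h(x - eps))) / (2 * eps)

print(analytic, numeric)                              # the two values agree closely
```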

  9. Logistic Regression Derivatives
     For logistic regression, the negative log of the likelihood is
     $$\mathcal{L} = \sum_i \mathcal{L}_i = -\sum_i \log L_i = -\sum_i \big[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \big],$$
     where $p_i = \dfrac{1}{1 + e^{-W^T X_i}}$, so
     $$\mathcal{L}_i = -y_i \log \frac{1}{1 + e^{-W^T X_i}} - (1 - y_i)\log\Big(1 - \frac{1}{1 + e^{-W^T X_i}}\Big).$$
     To simplify the analysis, we split each term into two parts, $\mathcal{L}_i = \mathcal{L}_i^{(1)} + \mathcal{L}_i^{(2)}$, so the derivative with respect to $W$ is
     $$\frac{\partial \mathcal{L}}{\partial W} = \sum_i \frac{\partial \mathcal{L}_i}{\partial W} = \sum_i \Big( \frac{\partial \mathcal{L}_i^{(1)}}{\partial W} + \frac{\partial \mathcal{L}_i^{(2)}}{\partial W} \Big).$$

  10. First part: $\mathcal{L}^{(1)} = -y \log \dfrac{1}{1 + e^{-W^T X}}$. Define intermediate variables and take the partial derivative of each step:
      • $\xi_1 = -W^T X$,  $\partial \xi_1 / \partial W = -X$
      • $\xi_2 = e^{\xi_1} = e^{-W^T X}$,  $\partial \xi_2 / \partial \xi_1 = e^{\xi_1}$
      • $\xi_3 = 1 + \xi_2 = 1 + e^{-W^T X}$,  $\partial \xi_3 / \partial \xi_2 = 1$
      • $\xi_4 = 1/\xi_3 = \dfrac{1}{1 + e^{-W^T X}} = p$,  $\partial \xi_4 / \partial \xi_3 = -1/\xi_3^2$
      • $\xi_5 = \log \xi_4 = \log p$,  $\partial \xi_5 / \partial \xi_4 = 1/\xi_4$
      • $\mathcal{L}^{(1)} = -y\,\xi_5$,  $\partial \mathcal{L}^{(1)} / \partial \xi_5 = -y$
      Chaining the steps together:
      $$\frac{\partial \mathcal{L}^{(1)}}{\partial W} = \frac{\partial \mathcal{L}^{(1)}}{\partial \xi_5} \frac{\partial \xi_5}{\partial \xi_4} \frac{\partial \xi_4}{\partial \xi_3} \frac{\partial \xi_3}{\partial \xi_2} \frac{\partial \xi_2}{\partial \xi_1} \frac{\partial \xi_1}{\partial W} = -y\,X\,\frac{e^{-W^T X}}{1 + e^{-W^T X}}.$$
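A minimal sketch (my own illustration, scalar case with a single feature) that multiplies the per-step derivatives above and checks the result against a finite difference; the variable names and test values are assumptions, not from the slides.

```python
# Compute dL1/dw for one example by chaining the per-step derivatives,
# then compare with a central finite difference.
import math

def L1(w, x, y):
    return -y * math.log(1.0 / (1.0 + math.exp(-w * x)))

def dL1_dw(w, x, y):
    xi1 = -w * x                     # xi1 = -w*x,      d(xi1)/dw = -x
    xi2 = math.exp(xi1)              # xi2 = e^{xi1},   d(xi2)/d(xi1) = xi2
    xi3 = 1.0 + xi2                  # xi3 = 1 + xi2,   d(xi3)/d(xi2) = 1
    xi4 = 1.0 / xi3                  # xi4 = p,         d(xi4)/d(xi3) = -1/xi3**2
    # xi5 = log(xi4),                                   d(xi5)/d(xi4) = 1/xi4
    # L1  = -y*xi5,                                     dL1/d(xi5)    = -y
    return (-y) * (1.0 / xi4) * (-1.0 / xi3**2) * 1.0 * xi2 * (-x)

w, x, y = 0.4, 2.0, 1.0
eps = 1e-6
numeric = (L1(w + eps, x, y) - L1(w - eps, x, y)) / (2 * eps)
print(dL1_dw(w, x, y), numeric)      # the two values should agree closely
```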

  11. Second part: $\mathcal{L}^{(2)} = -(1 - y)\log\Big[1 - \dfrac{1}{1 + e^{-W^T X}}\Big]$. Again, define intermediate variables and take the partial derivative of each step with respect to the previous one (and $W$):
      • $\xi_1 = -W^T X$,  $\partial \xi_1 / \partial W = -X$
      • $\xi_2 = e^{\xi_1} = e^{-W^T X}$,  $\partial \xi_2 / \partial \xi_1 = e^{\xi_1}$
      • $\xi_3 = 1 + \xi_2 = 1 + e^{-W^T X}$,  $\partial \xi_3 / \partial \xi_2 = 1$
      • $\xi_4 = 1/\xi_3 = \dfrac{1}{1 + e^{-W^T X}} = p$,  $\partial \xi_4 / \partial \xi_3 = -1/\xi_3^2$
      • $\xi_5 = 1 - \xi_4 = 1 - p$,  $\partial \xi_5 / \partial \xi_4 = -1$
      • $\xi_6 = \log \xi_5 = \log(1 - p)$,  $\partial \xi_6 / \partial \xi_5 = 1/\xi_5$
      • $\mathcal{L}^{(2)} = -(1 - y)\,\xi_6$,  $\partial \mathcal{L}^{(2)} / \partial \xi_6 = -(1 - y)$
      Chaining the steps together:
      $$\frac{\partial \mathcal{L}^{(2)}}{\partial W} = \frac{\partial \mathcal{L}^{(2)}}{\partial \xi_6} \frac{\partial \xi_6}{\partial \xi_5} \frac{\partial \xi_5}{\partial \xi_4} \frac{\partial \xi_4}{\partial \xi_3} \frac{\partial \xi_3}{\partial \xi_2} \frac{\partial \xi_2}{\partial \xi_1} \frac{\partial \xi_1}{\partial W} = (1 - y)\,X\,\frac{1}{1 + e^{-W^T X}}.$$
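Adding the two parts gives the familiar logistic-regression gradient $\partial \mathcal{L}_i / \partial W = X_i\,(p_i - y_i)$. The sketch below (my own illustration, not from the slides) computes this gradient over a small synthetic dataset with NumPy and checks one coordinate against a finite difference; the data, seed, and variable names are assumptions.

```python
# Full-data gradient of the logistic negative log-likelihood: sum_i X_i*(p_i - y_i).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 examples, 3 features
y = (rng.random(100) < 0.5).astype(float)          # synthetic 0/1 labels
W = np.zeros(3)

def nll(W):
    p = 1.0 / (1.0 + np.exp(-X @ W))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(W):
    p = 1.0 / (1.0 + np.exp(-X @ W))
    return X.T @ (p - y)                           # sum_i X_i (p_i - y_i)

# finite-difference check of the first coordinate of the gradient
eps = 1e-6
e0 = np.array([eps, 0.0, 0.0])
print(grad(W)[0], (nll(W + e0) - nll(W - e0)) / (2 * eps))
```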

  12. Considerations
      • We still need to calculate the derivatives.
      • We need to know what the learning rate is, or how to set it.
      • Local vs. global minima.
      • The full likelihood function involves summing up all the individual 'errors'; unless you are a statistician, this can mean hundreds of thousands of examples.

  13. Learning Rate
      Our choice of the learning rate $\theta$ has a significant impact on the performance of gradient descent. When $\theta$ is too small, the algorithm makes very little progress. When $\theta$ is appropriate, the algorithm finds the minimum: it converges. When $\theta$ is too large, the algorithm may overshoot the minimum and oscillate wildly.
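A minimal sketch (my own illustration) of these three regimes: gradient descent on the toy loss f(w) = w², run with a learning rate that is too small, appropriate, and too large. The function, starting point, and learning-rate values are assumptions chosen to make the behaviour visible.

```python
# Gradient descent on f(w) = w**2 with three learning rates.
def gradient_descent(lr, w0=5.0, steps=25):
    w = w0
    for _ in range(steps):
        grad = 2 * w          # f'(w) = 2w
        w = w - lr * grad
    return w

for lr in (0.001, 0.1, 1.05):
    print(f"lr={lr:<6} final w = {gradient_descent(lr):.4f}")
# lr=0.001 -> barely moves from the start at 5.0 (too small);
# lr=0.1   -> ends close to the minimum at 0 (appropriate);
# lr=1.05  -> each step overshoots, so |w| grows instead of shrinking (too large).
```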

  14. How can we tell when gradient descent is converging? We visualize the loss function at each step of gradient descent; this is called the trace plot. On one trace, although the loss is decreasing throughout training, it does not look like descent has hit the bottom: the loss is mostly oscillating between values rather than converging. On the other trace, the loss has decreased significantly during training, and towards the end it stabilizes and cannot decrease further.
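A minimal sketch (my own illustration, not the course's code) of drawing a trace plot: record the loss at every step and plot it against the iteration number. It assumes Matplotlib is installed; the toy loss f(w) = |w| is used here because a fixed step size then visibly keeps bouncing between two loss values, mimicking the oscillating trace described on the slide.

```python
# Record the loss at every gradient-descent step and draw a trace plot.
import matplotlib.pyplot as plt

def run(lr, w0=5.3, steps=60):
    """Gradient descent on f(w) = |w|; returns the loss at every step."""
    w, losses = w0, []
    for _ in range(steps):
        losses.append(abs(w))
        grad = 1.0 if w > 0 else (-1.0 if w < 0 else 0.0)   # subgradient of |w|
        w -= lr * grad
    return losses

plt.plot(run(0.1), label="small steps: loss settles near the bottom")
plt.plot(run(0.5), label="large steps: loss oscillates between values")
plt.xlabel("iteration"); plt.ylabel("loss"); plt.legend(); plt.title("Trace plots")
plt.show()
```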

  15. Learning Rate
      There are many alternative methods that address how to set or adjust the learning rate, using the first derivative, second derivatives, and/or momentum. More on this later.
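To make the momentum idea concrete, here is a minimal sketch (my own illustration, not the method the course develops later): a gradient-descent update that accumulates a velocity from past gradients. The gradient function, learning rate, and momentum coefficient are assumptions for the example.

```python
# Gradient descent with momentum: the update direction is a running average of gradients.
def gd_with_momentum(grad_fn, w0, lr=0.1, beta=0.9, steps=100):
    w, velocity = w0, 0.0
    for _ in range(steps):
        velocity = beta * velocity - lr * grad_fn(w)  # accumulate past gradients
        w = w + velocity                              # move along the velocity
    return w

# Example: minimize f(w) = (w - 3)**2, whose gradient is 2*(w - 3).
print(gd_with_momentum(lambda w: 2 * (w - 3), w0=0.0))   # approaches 3
```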

  16. Considerations
      • We still need to calculate the derivatives.
      • We need to know what the learning rate is, or how to set it.
      • Local vs. global minima.
      • The full likelihood function involves summing up all the individual 'errors'; unless you are a statistician, this can mean hundreds of thousands of examples.

  17. Local vs. Global Minima
      If we choose $\theta$ correctly, then gradient descent will converge to a stationary point. But will this point be a global minimum? If the function is convex, then the stationary point will be a global minimum.

  18. Local vs. Global Minima
      There is no guarantee that we get the global minimum. Question: what would be a good strategy?
      • Random restarts (see the sketch below)
      • Add noise to the loss function
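A minimal sketch (my own illustration) of the random-restarts strategy on a non-convex one-dimensional function: run gradient descent from several random starting points and keep the result with the lowest loss. The function, learning rate, and number of restarts are assumptions chosen for the example.

```python
# Random restarts: run gradient descent from several random starts, keep the best.
import math
import random

def f(w):      return math.sin(3 * w) + 0.1 * w ** 2      # non-convex, many local minima
def grad_f(w): return 3 * math.cos(3 * w) + 0.2 * w

def gradient_descent(w0, lr=0.01, steps=500):
    w = w0
    for _ in range(steps):
        w -= lr * grad_f(w)
    return w

random.seed(0)
candidates = [gradient_descent(random.uniform(-5, 5)) for _ in range(10)]
best = min(candidates, key=f)                              # keep the lowest loss found
print(best, f(best))
```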

  19. Considerations
      • We still need to calculate the derivatives.
      • We need to know what the learning rate is, or how to set it.
      • Local vs. global minima.
      • The full likelihood function involves summing up all the individual 'errors'; unless you are a statistician, this can mean hundreds of thousands of examples.

  20. Batch and Stochastic Gradient Descent
      The full loss is
      $$\mathcal{L} = -\sum_i \big[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \big].$$
      Instead of using all the examples for every step, use a subset of them (a batch). For each iteration $k$, use the following loss function to derive the derivatives:
      $$\mathcal{L}_k = -\sum_{i \in B_k} \big[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \big],$$
      which is an approximation to the full loss function.
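A minimal sketch (my own illustration) of the mini-batch gradient that replaces the full gradient: sum X_i(p_i − y_i) only over the examples in the batch. NumPy, the synthetic data, the batch size of 32, and the random seed are all assumptions made for the example.

```python
# Gradient of the batch loss L_k, summed only over the examples in one batch.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient(W, X, y, batch_idx):
    Xb, yb = X[batch_idx], y[batch_idx]
    p = sigmoid(Xb @ W)
    return Xb.T @ (p - yb)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.5).astype(float)
W = np.zeros(3)

idx = rng.choice(len(X), size=32, replace=False)   # one random batch of 32 examples
print(batch_gradient(W, X, y, idx))                # an approximation to the full gradient
```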

  21. (Figure slide; no text content.)

  22. (Figure slide; no text content.)

  23. (Handwritten sketch of one pass of stochastic gradient descent.) Split the data into batches. For each batch $k$: calculate the batch loss $\mathcal{L}_k$, compute $\partial \mathcal{L}_k / \partial W$, and update $W \leftarrow W - \eta\,\partial \mathcal{L}_k / \partial W$. Once every batch has been used, one epoch is complete: reshuffle the data and repeat.
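A minimal sketch (my own illustration) of the loop in the handwritten slide: reshuffle the data each epoch, sweep over it in mini-batches, and update W after every batch. It reuses the synthetic NumPy setup of the earlier sketches; the batch size, learning rate, number of epochs, and seed are assumptions.

```python
# Mini-batch SGD for logistic regression: shuffle, sweep in batches, update, repeat.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
true_W = np.array([1.5, -2.0, 0.5])
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-X @ true_W))).astype(float)

W = np.zeros(3)
lr, batch_size, n_epochs = 0.01, 32, 20

for epoch in range(n_epochs):
    order = rng.permutation(len(X))              # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # one mini-batch
        p = 1.0 / (1.0 + np.exp(-X[idx] @ W))
        grad = X[idx].T @ (p - y[idx])           # gradient of the batch loss L_k
        W -= lr * grad                           # W <- W - eta * dL_k/dW
    # one epoch complete: every example has been used once

print(W)                                         # should move toward true_W
```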

  24. (Figure slide; no text content.)

  25. (Figure slide; no text content.)

  26–32. Batch and Stochastic Gradient Descent (sequence of figure slides). Full likelihood: $\mathcal{L}$. Batch likelihood: $\mathcal{L}_k$.
