  1. Lecture 21: Optimization and Regularization CS109A Introduction to Data Science Pavlos Protopapas, Kevin Rader and Chris Tanner

  2. ANNOUNCEMENTS • Homework 7 OH: • For conceptual questions: Kevin and Chris will continue their office hours. • If you have problems with TensorFlow, please let us know on Ed. We will arrange special OH to help if necessary. • Project: Milestone 3 due on Wednesday (EDA and base model).

  3. Outline • Optimization • Regularization of NN

  4. Outline Optimization • Challenges in Optimization • Momentum • Adaptive Learning Rate • Parameter Initialization • Batch Normalization Regularization of NN § Norm Penalties § Early Stopping § Data Augmentation § Sparse Representation § Dropout


  6. Learning vs. Optimization Goal of learning: minimize the generalization error, i.e. the expected loss $\mathcal{L}(W) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\,[\,L(f(x; W), y)\,]$, where $f$ is the neural network. In practice we do empirical risk minimization: $\mathcal{L}(W) = \sum_i L(f(x_i; W), y_i)$. The quantity optimized is different from the quantity we care about.
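As a concrete picture of empirical risk minimization, here is a minimal NumPy sketch (not from the slides) that averages a per-example loss over a training set, which differs from the slide's sum only by the constant factor $1/m$. The linear model `f`, the squared-error loss, and the random data are hypothetical stand-ins for the network and its loss.

```python
import numpy as np

def empirical_risk(f, W, X, y, loss):
    """Average the per-example loss over the training set; this is the
    quantity actually optimized, as a proxy for the generalization error."""
    return np.mean([loss(f(x_i, W), y_i) for x_i, y_i in zip(X, y)])

# Toy illustration with a linear model and squared-error loss.
f = lambda x, W: W @ x
squared_error = lambda pred, target: np.sum((pred - target) ** 2)
X = np.random.randn(100, 3)          # 100 examples, 3 features
y = np.random.randn(100, 2)          # 2 outputs per example
W = np.zeros((2, 3))
print(empirical_risk(f, W, X, y, squared_error))
```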

  7. Local Minima (figure: Goodfellow et al., 2016)

  8. Critical Points Points with zero gradient. The 2nd derivative (Hessian) determines the curvature. (figure: Goodfellow et al., 2016)

  9. Local Minima Old view: local minima are a major problem in neural network training. Recent view: • For sufficiently large neural networks, most local minima incur low cost • It is not important to find the true global minimum

  10. Saddle Points A saddle point is both a local min and a local max (along different directions). Recent studies indicate that in high dimensions, saddle points are more likely than local minima. The gradient can be very small near saddle points. (figure: Goodfellow et al., 2016)

  11. Poor Conditioning A poorly conditioned Hessian matrix means high curvature: small steps lead to a huge increase in the loss. Learning is slow despite strong gradients, and oscillations slow down progress.

  12. No Critical Points Some cost functions do not have critical points, in particular for classification. WHY?

  13. Exploding and Vanishing Gradients With a linear activation: $h_1 = W x$ and $h_i = W h_{i-1}$ for $i = 2, \dots, n$. (deeplearning.ai)

  14. Exploding and Vanishing Gradients Suppose $W = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}$. Then $h_1 = W x = \begin{pmatrix} a x_1 \\ b x_2 \end{pmatrix}$ and $h_n = W^n x = \begin{pmatrix} a^n x_1 \\ b^n x_2 \end{pmatrix}$.

  15. Exploding and Vanishing Gradients Suppose $x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$. Case 1: $a = 1$, $b = 2$: $y \to \begin{pmatrix} 1 \\ 2^n \end{pmatrix}$, $\nabla y \to \begin{pmatrix} n \\ n\, 2^{n-1} \end{pmatrix}$. Explodes! Case 2: $a = 0.5$, $b = 0.9$: $y \to 0$, $\nabla y \to \begin{pmatrix} 0 \\ 0 \end{pmatrix}$. Vanishes!
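To see this numerically, here is a small NumPy sketch (not from the slides) that pushes $x = (1, 1)$ through $n$ identical linear layers with $W = \mathrm{diag}(a, b)$; the depth $n = 50$ is an arbitrary illustrative choice.

```python
import numpy as np

def deep_linear_forward(a, b, n):
    """Pass x = (1, 1) through n identical linear layers with W = diag(a, b),
    so h_n = W**n @ x."""
    W = np.diag([float(a), float(b)])
    h = np.array([1.0, 1.0])
    for _ in range(n):
        h = W @ h
    return h

print(deep_linear_forward(1.0, 2.0, 50))   # second component ~ 2**50: explodes
print(deep_linear_forward(0.5, 0.9, 50))   # both components shrink toward 0: vanishes
```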

  16. Exploding and Vanishing Gradients Exploding gradients lead to cliffs. This can be mitigated using gradient clipping: if $\|g\| > u$, then $g \leftarrow \dfrac{g\, u}{\|g\|}$. (figure: Goodfellow et al., 2016)
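A minimal sketch of the clipping rule above, assuming NumPy and the Euclidean norm; the gradient values and the threshold are arbitrary.

```python
import numpy as np

def clip_gradient(g, u):
    """If the gradient norm exceeds the threshold u, rescale g to have norm u."""
    norm = np.linalg.norm(g)
    if norm > u:
        g = g * (u / norm)
    return g

g = np.array([30.0, -40.0])       # norm 50
print(clip_gradient(g, u=5.0))    # rescaled to [3., -4.], norm 5
```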

  17. Outline Optimization • Challenges in Optimization • Momentum • Adaptive Learning Rate • Parameter Initialization • Batch Normalization Regularization of NN § Norm Penalties § Early Stopping § Data Augmentation § Sparse Representation § Dropout

  18. Momentum Oscillations occur because the updates do not exploit curvature information. (figure: contours of the loss $L(W)$ over parameters $W_1$ and $W_2$, with an oscillating gradient-descent path) The average gradient presents a faster path to the optimum: the vertical components cancel out.

  19. Momentum Question: why not this? (same loss-contour figure)

  20. Momentum Let us figure out an algorithm which will lead us to the minimum faster. (same loss-contour figure)

  21. Momentum Look at each component at a time. (same loss-contour figure)

  22.–25. Momentum Let us figure out an algorithm. (animation frames stepping through the descent path on the same loss-contour figure)

  26. Momentum ($f$ is the neural network) Old gradient descent: $g = \frac{1}{m} \sum_i \nabla_W L(f(x_i; W), y_i)$, $\quad W^* = W - \lambda g$. New gradient descent with momentum: $v = \alpha v + (1 - \alpha)\, g$ (an average of $g$ and the gradients from before), $\quad W^* = W - \lambda v$, where $\alpha \in [0, 1)$ controls how quickly the effect of past gradients decays.
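Below is a minimal NumPy sketch of this momentum update applied to a toy, poorly conditioned quadratic loss; the quadratic, the step size `lam`, and the decay `alpha` are illustrative choices, not values from the lecture. The running average damps the oscillating vertical component, as the earlier figure suggests.

```python
import numpy as np

def gd_momentum(grad_fn, W, lam=0.1, alpha=0.9, n_steps=200):
    """Gradient descent with momentum: v = alpha*v + (1 - alpha)*g, W = W - lam*v."""
    v = np.zeros_like(W)
    for _ in range(n_steps):
        g = grad_fn(W)
        v = alpha * v + (1 - alpha) * g   # running average of past gradients
        W = W - lam * v
    return W

# Toy poorly conditioned quadratic: L(W) = 0.5*(W1**2 + 25*W2**2); grad_fn is its gradient.
grad_fn = lambda W: np.array([W[0], 25.0 * W[1]])
print(gd_momentum(grad_fn, W=np.array([5.0, 5.0])))   # approaches the minimum at (0, 0)
```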

  27. Nesterov Momentum Apply an interim update: $\tilde{W} = W + \alpha v$. Perform a correction based on the gradient at the interim point: $g = \frac{1}{m} \sum_i \nabla_W L(f(x_i; \tilde{W}), y_i)$, $\quad v = \alpha v - \varepsilon g$, $\quad W^* = W + v$. Momentum based on the look-ahead slope.
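A corresponding NumPy sketch of Nesterov momentum on the same toy quadratic; the look-ahead point $W + \alpha v$, the step size `eps`, and the decay `alpha` follow the standard textbook form and are used here only for illustration.

```python
import numpy as np

def nesterov_momentum(grad_fn, W, eps=0.05, alpha=0.9, n_steps=200):
    """Nesterov momentum: evaluate the gradient at the look-ahead point W + alpha*v,
    then v = alpha*v - eps*g and W = W + v."""
    v = np.zeros_like(W)
    for _ in range(n_steps):
        g = grad_fn(W + alpha * v)   # gradient at the interim (look-ahead) point
        v = alpha * v - eps * g
        W = W + v
    return W

grad_fn = lambda W: np.array([W[0], 25.0 * W[1]])   # same toy quadratic as above
print(nesterov_momentum(grad_fn, W=np.array([5.0, 5.0])))
```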

  28. (no extracted slide text)

  29. Outline Optimization • Challenges in Optimization • Momentum • Adaptive Learning Rate • Parameter Initialization • Batch Normalization Regularization of NN § Norm Penalties § Early Stopping § Data Augmentation § Sparse Representation § Dropout

  30. Adaptive Learning Rates (loss-contour figure over $W_1$ and $W_2$) Oscillations along the vertical direction: learning must be slower along parameter 2. Use a different learning rate for each parameter?

  31.–34. Adaptive Learning Rates (animation frames repeating the same figure and text)

  35. AdaGrad • Accumulate squared gradients ($g$ is the gradient): $r_i = r_i + g_i^2$ • Update each parameter with a step inversely proportional to its cumulative squared gradient • Greater progress along gently sloped directions

  36. AdaGrad ($\delta$ is a small number that keeps the step from becoming too large) Old gradient descent: $g = \frac{1}{m} \sum_i \nabla_W L(f(x_i; W), y_i)$, $\quad W^* = W - \lambda g$. We would like the learning rates $\lambda_i$ not to all be the same, but inversely proportional to $|g_i|$: $\lambda_i \propto \frac{1}{\delta + |g_i|}$, $\quad W_i^* = W_i - \lambda_i g_i$. New gradient descent with an adaptive learning rate: $r_i = r_i + g_i^2$, $\quad W_i^* = W_i - \frac{\lambda}{\delta + \sqrt{r_i}}\, g_i$.
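Here is a minimal NumPy sketch of the AdaGrad update above on the same toy quadratic; `lam`, `delta`, and the number of steps are arbitrary illustrative choices.

```python
import numpy as np

def adagrad(grad_fn, W, lam=0.5, delta=1e-8, n_steps=200):
    """AdaGrad: accumulate squared gradients in r and divide each parameter's
    step by delta + sqrt(r_i), so steep directions get smaller steps."""
    r = np.zeros_like(W)
    for _ in range(n_steps):
        g = grad_fn(W)
        r = r + g ** 2                          # per-parameter accumulation
        W = W - lam * g / (delta + np.sqrt(r))  # per-parameter learning rate
    return W

grad_fn = lambda W: np.array([W[0], 25.0 * W[1]])   # toy quadratic as above
print(adagrad(grad_fn, W=np.array([5.0, 5.0])))
```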

  37. RMSProp • For non-convex problems, AdaGrad can prematurely decrease the learning rate • Use an exponentially weighted average for the gradient accumulation: $r_i = \rho\, r_i + (1 - \rho)\, g_i^2$
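The same sketch with RMSProp's exponentially weighted accumulation; the hyperparameter values are assumed for illustration, not taken from the lecture.

```python
import numpy as np

def rmsprop(grad_fn, W, lam=0.05, rho=0.9, delta=1e-8, n_steps=200):
    """RMSProp: like AdaGrad, but r is an exponentially weighted average of
    squared gradients, so the effective learning rate does not shrink forever."""
    r = np.zeros_like(W)
    for _ in range(n_steps):
        g = grad_fn(W)
        r = rho * r + (1 - rho) * g ** 2
        W = W - lam * g / (delta + np.sqrt(r))
    return W

grad_fn = lambda W: np.array([W[0], 25.0 * W[1]])
print(rmsprop(grad_fn, W=np.array([5.0, 5.0])))
```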

  38. Adam • RMSProp + Momentum • Estimate the first moment: $v_i = \rho_1 v_i + (1 - \rho_1)\, g_i$ • Estimate the second moment: $r_i = \rho_2 r_i + (1 - \rho_2)\, g_i^2$ • Also applies bias correction to $v$ and $r$ • Update the parameters using the bias-corrected moments • Works well in practice and is fairly robust to hyperparameters
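And a sketch combining the two ideas into Adam, including the bias correction mentioned on the slide; the hyperparameter values are common defaults, used here only for illustration.

```python
import numpy as np

def adam(grad_fn, W, lam=0.1, rho1=0.9, rho2=0.999, delta=1e-8, n_steps=500):
    """Adam: momentum-style first moment v plus RMSProp-style second moment r,
    both bias-corrected before the parameter update."""
    v = np.zeros_like(W)   # first-moment estimate
    r = np.zeros_like(W)   # second-moment estimate
    for t in range(1, n_steps + 1):
        g = grad_fn(W)
        v = rho1 * v + (1 - rho1) * g
        r = rho2 * r + (1 - rho2) * g ** 2
        v_hat = v / (1 - rho1 ** t)   # correct the bias from zero initialization
        r_hat = r / (1 - rho2 ** t)
        W = W - lam * v_hat / (delta + np.sqrt(r_hat))
    return W

grad_fn = lambda W: np.array([W[0], 25.0 * W[1]])   # toy quadratic as above
print(adam(grad_fn, W=np.array([5.0, 5.0])))
```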

  39. Outline Optimization • Challenges in Optimization • Momentum • Adaptive Learning Rate • Parameter Initialization • Batch Normalization Regularization of NN § Norm Penalties § Early Stopping § Data Augmentation § Sparse Representation § Dropout
