Lecture 21: Optimization and Regularization CS109A Introduction to Data Science Pavlos Protopapas, Kevin Rader and Chris Tanner
ANNOUNCEMENTS
• Homework 7 OH:
  • For conceptual questions: Kevin and Chris will continue their office hours.
  • If you have problems with TensorFlow, please let us know on Ed. We will arrange special OH to help if necessary.
• Project:
  • Milestone 3 (EDA and base model) due on Wednesday.
Outline
• Optimization
• Regularization of NN
Outline
Optimization
• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization
Regularization of NN
§ Norm Penalties
§ Early Stopping
§ Data Augmentation
§ Sparse Representation
§ Dropout
Learning vs. Optimization
Goal of learning: minimize the generalization error
$J^*(\theta) = \mathbb{E}_{(x, y) \sim p_{\text{data}}}\, L(f(x; \theta), y)$
where $f$ is the neural network.
In practice, we do empirical risk minimization:
$J(\theta) = \frac{1}{m} \sum_i L(f(x_i; \theta), y_i)$
The quantity optimized is different from the quantity we care about.
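As a concrete illustration of empirical risk, here is a minimal NumPy sketch; the squared-error loss, the linear model, and all names are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def empirical_risk(f, theta, X, y):
    """(1/m) * sum_i L(f(x_i; theta), y_i), with L taken as squared error for illustration."""
    preds = f(X, theta)
    return np.mean((preds - y) ** 2)

def linear_model(X, theta):
    # Hypothetical choice of f: a linear model f(x; theta) = x @ theta.
    return X @ theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)
print(empirical_risk(linear_model, np.zeros(3), X, y))
```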
Local Minima
(Figure: Goodfellow et al., 2016)
Critical Points
Points with zero gradient. The second derivative (Hessian) determines the curvature. (Figure: Goodfellow et al., 2016)
Local Minima
Old view: local minima are a major problem in neural network training.
Recent view:
• For sufficiently large neural networks, most local minima incur low cost.
• It is not important to find the true global minimum.
Saddle Points
Recent studies indicate that in high dimensions, saddle points are more likely than local minima. The gradient can be very small near saddle points. (Figure: Goodfellow et al., 2016; a saddle point is a local minimum along some directions and a local maximum along others.)
Poor Conditioning
A poorly conditioned Hessian matrix (high curvature) means that small steps can lead to a huge increase in the loss. Learning is slow despite strong gradients, and oscillations slow down progress.
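To see poor conditioning in action, the toy sketch below (the quadratic loss, curvatures, and learning rate are made-up values for illustration) runs gradient descent on a quadratic whose curvature along one parameter is 100 times larger than along the other: the step size tolerated by the steep direction leaves the flat direction crawling, while a larger step makes the steep direction oscillate.

```python
import numpy as np

# Toy ill-conditioned quadratic: L(w) = 0.5 * (1 * w1^2 + 100 * w2^2).
# Curvature along w2 is 100x that along w1, so no single learning rate suits both.
grad = lambda w: np.array([1.0 * w[0], 100.0 * w[1]])

w = np.array([1.0, 1.0])
lr = 0.019                      # close to the stability limit 2/100 of the steep direction
for t in range(5):
    w = w - lr * grad(w)
    print(t, w)                 # w2 flips sign every step (oscillates); w1 barely moves
```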
No Critical Points
Some cost functions do not have critical points, in particular in classification. Why?
Exploding and Vanishing Gradients
With a linear activation: $h_1 = W x$ and $h_i = W h_{i-1}$ for $i = 2, \dots, n$. (Source: deeplearning.ai)
Exploding and Vanishing Gradients
Suppose $W = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}$. Then
$\begin{pmatrix} h_{1,1} \\ h_{1,2} \end{pmatrix} = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ and $\begin{pmatrix} h_{n,1} \\ h_{n,2} \end{pmatrix} = \begin{pmatrix} a^n & 0 \\ 0 & b^n \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$.
Exploding and Vanishing Gradients
Suppose $x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$.
Case 1: $a = 1,\ b = 2$: $y \to \begin{pmatrix} 1 \\ 2^n \end{pmatrix}$, $\nabla y \to \begin{pmatrix} n \\ n\, 2^{n-1} \end{pmatrix}$. Explodes!
Case 2: $a = 0.5,\ b = 0.9$: $y \to 0$, $\nabla y \to \begin{pmatrix} 0 \\ 0 \end{pmatrix}$. Vanishes!
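A quick numerical check of this example (a short NumPy sketch; the helper name `forward` is ours, not from the slides):

```python
import numpy as np

def forward(a, b, n, x=np.array([1.0, 1.0])):
    """Apply h_i = W h_{i-1} n times for the diagonal W = [[a, 0], [0, b]]."""
    W = np.diag([a, b])
    h = x
    for _ in range(n):
        h = W @ h
    return h

print(forward(1.0, 2.0, 50))   # second component grows like 2^50: explodes
print(forward(0.5, 0.9, 50))   # both components shrink toward 0: vanishes
```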
Exploding and Vanishing Gradients
Exploding gradients lead to cliffs in the loss surface. This can be mitigated using gradient clipping: if $\lVert g \rVert > u$, set $g \leftarrow \dfrac{g\, u}{\lVert g \rVert}$. (Figure: Goodfellow et al., 2016)
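A minimal sketch of norm-based gradient clipping as written above; the function name and the example threshold are illustrative assumptions.

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale the gradient if its norm exceeds the threshold: g <- g * u / ||g||."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * threshold / norm
    return g

g = np.array([30.0, -40.0])      # ||g|| = 50
print(clip_gradient(g, 5.0))     # rescaled to norm 5: [ 3. -4.]
```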
Momentum
Oscillations occur because the updates do not exploit curvature information. The average gradient presents a faster path to the optimum: the vertical components cancel out. (Figure: contours of $L(\theta)$ over parameters $\theta_1, \theta_2$, with an oscillating gradient-descent path.)
Momentum
Question: why not this? (Figure: contours of $L(\theta)$ with an alternative update path.)
Momentum
Let us figure out an algorithm that will lead us to the minimum faster. (Figure: contours of $L(\theta)$.)
Momentum
Look at one component at a time. (Figure: contours of $L(\theta)$.)
Momentum
$f$ is the neural network.
Old gradient descent:
$\theta^* = \theta - \lambda\, \frac{1}{m} \sum_i \nabla_\theta L(f(x_i; \theta), y_i)$
New gradient descent with momentum:
$v = \alpha v + (1 - \alpha)\, \frac{1}{m} \sum_i \nabla_\theta L(f(x_i; \theta), y_i)$  (the average from before)
$\theta^* = \theta - \lambda v$
$\alpha \in [0, 1)$ controls how quickly the effect of past gradients decays.
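A minimal NumPy sketch of this momentum update; the function and argument names are illustrative, and `grad_fn` stands for the averaged mini-batch gradient above.

```python
import numpy as np

def sgd_momentum(theta, grad_fn, lr=0.01, alpha=0.9, steps=100):
    """Gradient descent with momentum; grad_fn(theta) returns the averaged batch gradient."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)                 # (1/m) * sum_i grad L(f(x_i; theta), y_i)
        v = alpha * v + (1 - alpha) * g    # exponentially weighted average of past gradients
        theta = theta - lr * v             # step along the smoothed direction
    return theta

# Example on the toy ill-conditioned quadratic from earlier: gradient is [w1, 100 * w2].
print(sgd_momentum(np.array([1.0, 1.0]), lambda w: np.array([w[0], 100.0 * w[1]])))
```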
Nesterov Momentum
Apply an interim update: $\tilde{\theta} = \theta + \alpha v$
Perform a correction based on the gradient at the interim point:
$g = \frac{1}{m} \sum_i \nabla_{\tilde{\theta}} L(f(x_i; \tilde{\theta}), y_i)$
$v = \alpha v - \varepsilon g$
Momentum based on the look-ahead slope.
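A corresponding sketch for Nesterov momentum, which evaluates the gradient at the look-ahead point before updating the velocity (again, the names are illustrative assumptions).

```python
import numpy as np

def nesterov_momentum(theta, grad_fn, lr=0.01, alpha=0.9, steps=100):
    """Nesterov momentum sketch: gradient is taken at the interim point theta + alpha*v."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        theta_interim = theta + alpha * v   # look ahead along the current velocity
        g = grad_fn(theta_interim)          # gradient at the interim point
        v = alpha * v - lr * g              # velocity update: v = alpha*v - eps*g
        theta = theta + v                   # take the corrected step
    return theta
```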
Adaptive Learning Rates
Oscillations along the vertical direction: learning must be slower along parameter $\theta_2$. Use a different learning rate for each parameter? (Figure: contours of $L(\theta)$ over parameters $\theta_1, \theta_2$.)
AdaGrad
• Accumulate squared gradients: $r_i = r_i + g_i^2$, where $g_i$ is the gradient for parameter $i$.
• Update each parameter with a step size inversely proportional to its cumulative squared gradient.
• Greater progress along gently sloped directions.
AdaGrad
$\varepsilon$ is a small number that keeps the adaptive step from becoming too large.
Old gradient descent:
$\theta^* = \theta - \lambda\, \frac{1}{m} \sum_i \nabla_\theta L(f(x_i; \theta), y_i)$
We would like the learning rate $\lambda_i$ not to be the same for every parameter, and to be inversely proportional to $|g_i|$:
$\lambda_i \propto \frac{1}{|g_i|}, \qquad \theta_i^* = \theta_i - \lambda_i g_i, \qquad \lambda_i = \frac{\lambda}{\varepsilon + |g_i|}$
New gradient descent with an adaptive learning rate:
$r_i = r_i + g_i^2, \qquad \theta_i^* = \theta_i - \frac{\lambda}{\varepsilon + \sqrt{r_i}}\, g_i$
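A minimal AdaGrad sketch following the update above (illustrative names; `eps` plays the role of $\varepsilon$).

```python
import numpy as np

def adagrad(theta, grad_fn, lr=0.1, eps=1e-8, steps=100):
    """AdaGrad sketch: per-parameter steps shrink as squared gradients accumulate in r."""
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        r = r + g ** 2                                # accumulate squared gradients
        theta = theta - lr * g / (eps + np.sqrt(r))   # per-parameter adaptive step
    return theta
```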
RMSProp
• For non-convex problems, AdaGrad can prematurely decrease the learning rate.
• Use an exponentially weighted average for the gradient accumulation:
$r_i = \rho\, r_i + (1 - \rho)\, g_i^2$
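Relative to AdaGrad, the only change is the exponentially weighted accumulator; a hedged sketch with illustrative default values follows.

```python
import numpy as np

def rmsprop(theta, grad_fn, lr=0.01, rho=0.9, eps=1e-8, steps=100):
    """RMSProp sketch: like AdaGrad, but old squared gradients decay with rate rho."""
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        r = rho * r + (1 - rho) * g ** 2              # exponentially weighted average
        theta = theta - lr * g / (eps + np.sqrt(r))
    return theta
```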
Adam
• RMSProp + Momentum
• Estimate the first moment: $v_i = \rho_1 v_i + (1 - \rho_1)\, g_i$
• Estimate the second moment: $r_i = \rho_2 r_i + (1 - \rho_2)\, g_i^2$
• Also applies bias correction to $v$ and $r$
• Update parameters using the bias-corrected moments: $\theta_i^* = \theta_i - \lambda\, \hat{v}_i / (\varepsilon + \sqrt{\hat{r}_i})$
• Works well in practice and is fairly robust to hyperparameters.
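Putting the pieces together, a minimal Adam sketch with bias correction; the defaults rho1 = 0.9 and rho2 = 0.999 are assumptions (the slide does not specify them), and all names are illustrative.

```python
import numpy as np

def adam(theta, grad_fn, lr=0.001, rho1=0.9, rho2=0.999, eps=1e-8, steps=100):
    """Adam sketch: momentum-style first moment v, RMSProp-style second moment r."""
    v = np.zeros_like(theta)
    r = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        v = rho1 * v + (1 - rho1) * g             # first-moment estimate
        r = rho2 * r + (1 - rho2) * g ** 2        # second-moment estimate
        v_hat = v / (1 - rho1 ** t)               # bias correction for zero initialization
        r_hat = r / (1 - rho2 ** t)
        theta = theta - lr * v_hat / (eps + np.sqrt(r_hat))
    return theta
```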