Lecture 21: Optimization and Regularization CS109A Introduction to Data Science Pavlos Protopapas, Kevin Rader and Chris Tanner
ANNOUNCEMENTS
• Homework 7 OH:
  • For conceptual questions: Kevin and Chris will continue their office hours.
  • If you have problems with TensorFlow, please let us know on Ed. We will arrange special OH to help if necessary.
• Project:
  • Milestone 3 (EDA and base model) due on Wednesday.
Outline
• Optimization
• Regularization of NN
Outline
Optimization
• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization
Regularization of NN
§ Norm Penalties
§ Early Stopping
§ Data Augmentation
§ Sparse Representation
§ Dropout
Learning vs. Optimization
Goal of learning: minimize the generalization error
$J^*(\theta) = \mathbb{E}_{(x, y) \sim p_{\text{data}}}\, L(f(x; \theta), y)$
where $f$ is the neural network.
In practice, we do empirical risk minimization:
$J(\theta) = \frac{1}{m} \sum_i L(f(x_i; \theta), y_i)$
The quantity optimized is different from the quantity we care about.
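As a concrete illustration of empirical risk, here is a minimal NumPy sketch; the squared-error loss, the linear model, and all names are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def empirical_risk(f, theta, X, y):
    """(1/m) * sum_i L(f(x_i; theta), y_i), with L taken as squared error for illustration."""
    preds = f(X, theta)
    return np.mean((preds - y) ** 2)

def linear_model(X, theta):
    # Hypothetical choice of f: a linear model f(x; theta) = x @ theta.
    return X @ theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)
print(empirical_risk(linear_model, np.zeros(3), X, y))
```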
Local Minima
(Figure: Goodfellow et al., 2016)
Critical Points
Points with zero gradient. The second derivative (Hessian) determines the curvature. (Figure: Goodfellow et al., 2016)
Local Minima
Old view: local minima are a major problem in neural network training.
Recent view:
• For sufficiently large neural networks, most local minima incur low cost.
• It is not important to find the true global minimum.
Saddle Points
Recent studies indicate that in high dimensions, saddle points are more likely than local minima. The gradient can be very small near saddle points. (Figure: Goodfellow et al., 2016; a saddle point is a local minimum along some directions and a local maximum along others.)
Poor Conditioning
A poorly conditioned Hessian matrix (high curvature) means that small steps can lead to a huge increase in the loss. Learning is slow despite strong gradients, and oscillations slow down progress.
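To see poor conditioning in action, the toy sketch below (the quadratic loss, curvatures, and learning rate are made-up values for illustration) runs gradient descent on a quadratic whose curvature along one parameter is 100 times larger than along the other: the step size tolerated by the steep direction leaves the flat direction crawling, while a larger step makes the steep direction oscillate.

```python
import numpy as np

# Toy ill-conditioned quadratic: L(w) = 0.5 * (1 * w1^2 + 100 * w2^2).
# Curvature along w2 is 100x that along w1, so no single learning rate suits both.
grad = lambda w: np.array([1.0 * w[0], 100.0 * w[1]])

w = np.array([1.0, 1.0])
lr = 0.019                      # close to the stability limit 2/100 of the steep direction
for t in range(5):
    w = w - lr * grad(w)
    print(t, w)                 # w2 flips sign every step (oscillates); w1 barely moves
```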
No Critical Points
Some cost functions do not have critical points, in particular in classification. Why?
Exploding and Vanishing Gradients
With a linear activation: $h_1 = W x$ and $h_i = W h_{i-1}$ for $i = 2, \dots, n$. (Source: deeplearning.ai)
Exploding and Vanishing Gradients
Suppose $W = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}$. Then
$\begin{pmatrix} h_{1,1} \\ h_{1,2} \end{pmatrix} = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ and $\begin{pmatrix} h_{n,1} \\ h_{n,2} \end{pmatrix} = \begin{pmatrix} a^n & 0 \\ 0 & b^n \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$.
Exploding and Vanishing Gradients
Suppose $x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$.
Case 1: $a = 1,\ b = 2$: $y \to \begin{pmatrix} 1 \\ 2^n \end{pmatrix}$, $\nabla y \to \begin{pmatrix} n \\ n\, 2^{n-1} \end{pmatrix}$. Explodes!
Case 2: $a = 0.5,\ b = 0.9$: $y \to 0$, $\nabla y \to \begin{pmatrix} 0 \\ 0 \end{pmatrix}$. Vanishes!
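A quick numerical check of this example (a short NumPy sketch; the helper name `forward` is ours, not from the slides):

```python
import numpy as np

def forward(a, b, n, x=np.array([1.0, 1.0])):
    """Apply h_i = W h_{i-1} n times for the diagonal W = [[a, 0], [0, b]]."""
    W = np.diag([a, b])
    h = x
    for _ in range(n):
        h = W @ h
    return h

print(forward(1.0, 2.0, 50))   # second component grows like 2^50: explodes
print(forward(0.5, 0.9, 50))   # both components shrink toward 0: vanishes
```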
Exploding and Vanishing Gradients
Exploding gradients lead to cliffs in the loss surface. This can be mitigated using gradient clipping: if $\lVert g \rVert > u$, set $g \leftarrow \dfrac{g\, u}{\lVert g \rVert}$. (Figure: Goodfellow et al., 2016)
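A minimal sketch of norm-based gradient clipping as written above; the function name and the example threshold are illustrative assumptions.

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale the gradient if its norm exceeds the threshold: g <- g * u / ||g||."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * threshold / norm
    return g

g = np.array([30.0, -40.0])      # ||g|| = 50
print(clip_gradient(g, 5.0))     # rescaled to norm 5: [ 3. -4.]
```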
Momentum
Oscillations occur because the updates do not exploit curvature information. The average gradient presents a faster path to the optimum: the vertical components cancel out. (Figure: contours of $L(\theta)$ over parameters $\theta_1, \theta_2$, with an oscillating gradient-descent path.)
Momentum
Question: why not this? (Figure: contours of $L(\theta)$ with an alternative update path.)
Momentum
Let us figure out an algorithm that will lead us to the minimum faster. (Figure: contours of $L(\theta)$.)
Momentum
Look at one component at a time. (Figure: contours of $L(\theta)$.)
Momentum
$f$ is the neural network.
Old gradient descent:
$\theta^* = \theta - \lambda\, \frac{1}{m} \sum_i \nabla_\theta L(f(x_i; \theta), y_i)$
New gradient descent with momentum:
$v = \alpha v + (1 - \alpha)\, \frac{1}{m} \sum_i \nabla_\theta L(f(x_i; \theta), y_i)$  (the average from before)
$\theta^* = \theta - \lambda v$
$\alpha \in [0, 1)$ controls how quickly the effect of past gradients decays.
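A minimal NumPy sketch of this momentum update; the function and argument names are illustrative, and `grad_fn` stands for the averaged mini-batch gradient above.

```python
import numpy as np

def sgd_momentum(theta, grad_fn, lr=0.01, alpha=0.9, steps=100):
    """Gradient descent with momentum; grad_fn(theta) returns the averaged batch gradient."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)                 # (1/m) * sum_i grad L(f(x_i; theta), y_i)
        v = alpha * v + (1 - alpha) * g    # exponentially weighted average of past gradients
        theta = theta - lr * v             # step along the smoothed direction
    return theta

# Example on the toy ill-conditioned quadratic from earlier: gradient is [w1, 100 * w2].
print(sgd_momentum(np.array([1.0, 1.0]), lambda w: np.array([w[0], 100.0 * w[1]])))
```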
Nesterov Momentum
Apply an interim update: $\tilde{\theta} = \theta + \alpha v$
Perform a correction based on the gradient at the interim point:
$g = \frac{1}{m} \sum_i \nabla_{\tilde{\theta}} L(f(x_i; \tilde{\theta}), y_i)$
$v = \alpha v - \varepsilon g$
Momentum based on the look-ahead slope.
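A corresponding sketch for Nesterov momentum, which evaluates the gradient at the look-ahead point before updating the velocity (again, the names are illustrative assumptions).

```python
import numpy as np

def nesterov_momentum(theta, grad_fn, lr=0.01, alpha=0.9, steps=100):
    """Nesterov momentum sketch: gradient is taken at the interim point theta + alpha*v."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        theta_interim = theta + alpha * v   # look ahead along the current velocity
        g = grad_fn(theta_interim)          # gradient at the interim point
        v = alpha * v - lr * g              # velocity update: v = alpha*v - eps*g
        theta = theta + v                   # take the corrected step
    return theta
```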
Adaptive Learning Rates
Oscillations along the vertical direction: learning must be slower along parameter $\theta_2$. Use a different learning rate for each parameter? (Figure: contours of $L(\theta)$ over parameters $\theta_1, \theta_2$.)
AdaGrad
• Accumulate squared gradients: $r_i = r_i + g_i^2$, where $g_i$ is the gradient for parameter $i$.
• Update each parameter with a step size inversely proportional to its cumulative squared gradient.
• Greater progress along gently sloped directions.
AdaGrad
$\varepsilon$ is a small number that keeps the adaptive step from becoming too large.
Old gradient descent:
$\theta^* = \theta - \lambda\, \frac{1}{m} \sum_i \nabla_\theta L(f(x_i; \theta), y_i)$
We would like the learning rate $\lambda_i$ not to be the same for every parameter, and to be inversely proportional to $|g_i|$:
$\lambda_i \propto \frac{1}{|g_i|}, \qquad \theta_i^* = \theta_i - \lambda_i g_i, \qquad \lambda_i = \frac{\lambda}{\varepsilon + |g_i|}$
New gradient descent with an adaptive learning rate:
$r_i = r_i + g_i^2, \qquad \theta_i^* = \theta_i - \frac{\lambda}{\varepsilon + \sqrt{r_i}}\, g_i$
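A minimal AdaGrad sketch following the update above (illustrative names; `eps` plays the role of $\varepsilon$).

```python
import numpy as np

def adagrad(theta, grad_fn, lr=0.1, eps=1e-8, steps=100):
    """AdaGrad sketch: per-parameter steps shrink as squared gradients accumulate in r."""
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        r = r + g ** 2                                # accumulate squared gradients
        theta = theta - lr * g / (eps + np.sqrt(r))   # per-parameter adaptive step
    return theta
```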
RMSProp
• For non-convex problems, AdaGrad can prematurely decrease the learning rate.
• Use an exponentially weighted average for the gradient accumulation:
$r_i = \rho\, r_i + (1 - \rho)\, g_i^2$
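Relative to AdaGrad, the only change is the exponentially weighted accumulator; a hedged sketch with illustrative default values follows.

```python
import numpy as np

def rmsprop(theta, grad_fn, lr=0.01, rho=0.9, eps=1e-8, steps=100):
    """RMSProp sketch: like AdaGrad, but old squared gradients decay with rate rho."""
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        r = rho * r + (1 - rho) * g ** 2              # exponentially weighted average
        theta = theta - lr * g / (eps + np.sqrt(r))
    return theta
```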
Adam
• RMSProp + Momentum
• Estimate the first moment: $v_i = \rho_1 v_i + (1 - \rho_1)\, g_i$
• Estimate the second moment: $r_i = \rho_2 r_i + (1 - \rho_2)\, g_i^2$
• Also applies bias correction to $v$ and $r$
• Update parameters using the bias-corrected moments: $\theta_i^* = \theta_i - \lambda\, \hat{v}_i / (\varepsilon + \sqrt{\hat{r}_i})$
• Works well in practice and is fairly robust to hyperparameters.
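Putting the pieces together, a minimal Adam sketch with bias correction; the defaults rho1 = 0.9 and rho2 = 0.999 are assumptions (the slide does not specify them), and all names are illustrative.

```python
import numpy as np

def adam(theta, grad_fn, lr=0.001, rho1=0.9, rho2=0.999, eps=1e-8, steps=100):
    """Adam sketch: momentum-style first moment v, RMSProp-style second moment r."""
    v = np.zeros_like(theta)
    r = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        v = rho1 * v + (1 - rho1) * g             # first-moment estimate
        r = rho2 * r + (1 - rho2) * g ** 2        # second-moment estimate
        v_hat = v / (1 - rho1 ** t)               # bias correction for zero initialization
        r_hat = r / (1 - rho2 ** t)
        theta = theta - lr * v_hat / (eps + np.sqrt(r_hat))
    return theta
```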