Lecture 19 Additional Material: Optimization
CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader
Outline: Optimization
• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization
Learning vs. Optimization
Goal of learning: minimize the generalization error
$$J(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\big[L(f(x;\theta),\, y)\big]$$
In practice, we perform empirical risk minimization:
$$\hat{J}(\theta) = \frac{1}{m}\sum_{i=1}^{m} L\big(f(x^{(i)};\theta),\, y^{(i)}\big)$$
The quantity optimized differs from the quantity we actually care about
Batch vs. Stochastic Algorithms
Batch algorithms
• Optimize the empirical risk using exact gradients
$$\nabla J(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\big[\nabla_{\theta} L(f(x;\theta),\, y)\big]$$
Stochastic algorithms
• Estimate the gradient from a small random sample (mini-batch)
Large mini-batch: gradient computation is expensive
Small mini-batch: greater variance in the estimate, more steps needed for convergence
Critical Points
Points with zero gradient
The second derivative (Hessian) determines the curvature
Goodfellow et al. (2016)
Stochastic Gradient Descent
Take small steps in the direction of the negative gradient
Sample a mini-batch of m examples from the training set and compute:
$$g = \frac{1}{m}\sum_{i} \nabla_{\theta} L\big(f(x^{(i)};\theta),\, y^{(i)}\big)$$
Update the parameters:
$$\theta = \theta - \epsilon_k\, g$$
In practice: shuffle the training set once and pass through it multiple times
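A minimal NumPy sketch of this procedure (not from the slides; the function names and the least-squares toy problem below are illustrative):

```python
import numpy as np

def sgd(loss_grad, theta, X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD: shuffle once, then repeatedly step against the averaged gradient."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])                 # shuffle the training set once
    for _ in range(epochs):                           # pass through it multiple times
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            g = loss_grad(theta, X[batch], y[batch])  # mini-batch gradient estimate
            theta = theta - lr * g                    # theta <- theta - eps * g
    return theta

# Toy usage: least-squares loss, gradient = X^T (X theta - y) / m
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
grad = lambda th, Xb, yb: Xb.T @ (Xb @ th - yb) / len(yb)
print(sgd(grad, np.zeros(3), X, y, lr=0.1, epochs=50))   # approaches [1, -2, 0.5]
```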
Stochastic Gradient Descent
[Figure: SGD trajectory on the cost surface J(θ), Goodfellow et al. (2016)]
Oscillations occur because the updates do not exploit curvature information
Outline: Challenges in Optimization
Local Minima
[Figure from Goodfellow et al. (2016)]
Local Minima
Old view: local minima are a major problem in neural network training
Recent view:
• For sufficiently large neural networks, most local minima incur low cost
• It is not important to find the true global minimum
Saddle Points
A saddle point is a local minimum along some directions and a local maximum along others
Recent studies indicate that in high dimensions, saddle points are more likely than local minima
The gradient can be very small near saddle points
Goodfellow et al. (2016)
Saddle Points
SGD is seen to escape saddle points
– It moves downhill and uses noisy gradients
Second-order methods can get stuck
– They solve for a point with zero gradient
Goodfellow et al. (2016)
Poor Conditioning
Poorly conditioned Hessian matrix
– High curvature: even small steps can lead to a large increase in the cost
Learning is slow despite strong gradients
Oscillations slow down progress
No Critical Points
Some cost functions do not have critical points. This is common in classification: the cross-entropy loss, for example, can keep decreasing as the weights grow, so there is no minimum to reach.
No Critical Points
Gradient norm increases, but validation error decreases
[Figure: convolutional nets for object detection, Goodfellow et al. (2016)]
Exploding and Vanishing Gradients
Consider a deep network with linear activations:
$$h^{1} = W x, \qquad h^{i} = W h^{i-1}, \quad i = 2, \dots, n$$
$$y = \sigma\big(h^{n}_{1} + h^{n}_{2}\big), \quad \text{where } \sigma(s) = \frac{1}{1 + e^{-s}}$$
(deeplearning.ai)
Exploding and Vanishing Gradients
Suppose $W = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}$. Then
$$h^{1} = \begin{pmatrix} a x_1 \\ b x_2 \end{pmatrix}, \qquad h^{n} = \begin{pmatrix} a^{n} x_1 \\ b^{n} x_2 \end{pmatrix}$$
$$y = \sigma\big(a^{n} x_1 + b^{n} x_2\big), \qquad \nabla y = \sigma'\big(a^{n} x_1 + b^{n} x_2\big)\begin{pmatrix} n a^{n-1} x_1 \\ n b^{n-1} x_2 \end{pmatrix}$$
(the gradient is taken with respect to the parameters $a$ and $b$)
Exploding and Vanishing Gradients
Suppose $x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$
Case 1: $a = 1,\ b = 2$: the pre-activation $1 + 2^{n}$ blows up ($y \to 1$), and the gradient factor $\begin{pmatrix} n \\ n\,2^{\,n-1} \end{pmatrix}$ grows exponentially with depth: exploding gradients!
Case 2: $a = 0.5,\ b = 0.9$: the activations $a^{n}, b^{n} \to 0$ and $\nabla y \to \begin{pmatrix} 0 \\ 0 \end{pmatrix}$: vanishing gradients!
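A quick numeric check of this example, written as an illustrative sketch: repeatedly applying W = diag(a, b) to x = (1, 1) shows the hidden activations blowing up in Case 1 and shrinking toward zero in Case 2.

```python
import numpy as np

def repeated_linear(a, b, n):
    """Apply W = diag(a, b) n times to x = (1, 1); returns h^n = (a**n, b**n)."""
    W = np.diag([a, b])
    h = np.array([1.0, 1.0])
    for _ in range(n):
        h = W @ h
    return h

print(repeated_linear(1.0, 2.0, 30))   # second component ~1e9: activations explode
print(repeated_linear(0.5, 0.9, 30))   # both components shrink toward 0: activations vanish
```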
Exploding and Vanishing Gradients
Exploding gradients lead to cliffs in the cost surface
They can be mitigated using gradient clipping
Goodfellow et al. (2016)
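Gradient clipping is commonly implemented by rescaling the gradient whenever its norm exceeds a threshold. The sketch below shows one such scheme; the function name and threshold value are illustrative assumptions, not a specific library API.

```python
import numpy as np

def clip_by_norm(g, max_norm=1.0):
    """Rescale g so its L2 norm never exceeds max_norm, keeping its direction."""
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g

g = np.array([30.0, -40.0])            # norm 50
print(clip_by_norm(g, max_norm=5.0))   # [3., -4.], norm 5
```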
Outline: Momentum
Momentum
SGD is slow when there is high curvature
Averaging the gradients gives a faster path to the optimum:
– the oscillating (vertical) components cancel out
Momentum
Uses past gradients in the update
Maintains a new quantity, the 'velocity': an exponentially decaying average of gradients
$$v = \alpha v + (-\epsilon g)$$
Here $-\epsilon g$ is the current gradient update, and $\alpha$ controls how quickly the effect of past gradients decays
Momentum
Compute the gradient estimate:
$$g = \frac{1}{m}\sum_{i} \nabla_{\theta} L\big(f(x^{(i)};\theta),\, y^{(i)}\big)$$
Update the velocity:
$$v = \alpha v - \epsilon g$$
Update the parameters:
$$\theta = \theta + v$$
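A minimal sketch of these three steps (the quadratic toy problem and hyperparameter values are illustrative assumptions):

```python
import numpy as np

def momentum_step(theta, v, g, lr=0.01, alpha=0.9):
    """One momentum update: v <- alpha*v - eps*g, then theta <- theta + v."""
    v = alpha * v - lr * g
    theta = theta + v
    return theta, v

# Toy usage on an ill-conditioned quadratic f(theta) = 0.5 * theta^T A theta
A = np.diag([1.0, 25.0])
theta, v = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(100):
    theta, v = momentum_step(theta, v, A @ theta, lr=0.02, alpha=0.9)
print(theta)   # approaches the minimum at the origin
```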
Momentum
Damped oscillations: gradients in opposite directions get cancelled out
[Figure from Goodfellow et al. (2016)]
Nesterov Momentum
Apply an interim update:
$$\tilde{\theta} = \theta + v$$
Perform a correction based on the gradient at the interim point:
$$g = \frac{1}{m}\sum_{i} \nabla_{\theta} L\big(f(x^{(i)};\tilde{\theta}),\, y^{(i)}\big)$$
$$v = \alpha v - \epsilon g$$
$$\theta = \theta + v$$
Momentum based on the look-ahead slope
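A sketch of the same loop with the look-ahead gradient, following the slide's interim update θ + v; the toy problem and hyperparameters are again illustrative:

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.01, alpha=0.9):
    """Nesterov momentum: evaluate the gradient at the interim point theta + v."""
    g = grad_fn(theta + v)     # gradient at the look-ahead point
    v = alpha * v - lr * g
    theta = theta + v
    return theta, v

# Same toy quadratic as before
A = np.diag([1.0, 25.0])
theta, v = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(100):
    theta, v = nesterov_step(theta, v, lambda t: A @ t, lr=0.02, alpha=0.9)
print(theta)   # approaches the origin
```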
Outline: Adaptive Learning Rate
Adaptive Learning Rates
Oscillations along the vertical direction
– Learning must be slower along parameter 2
Use a different learning rate for each parameter?
AdaGrad
• Accumulate squared gradients:
$$r_i = r_i + g_i^{2}$$
• Update each parameter with a step inversely proportional to the square root of its cumulative squared gradient:
$$\theta_i = \theta_i - \frac{\epsilon}{\delta + \sqrt{r_i}}\, g_i$$
• Greater progress along gently sloped directions
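A one-step sketch of this update (the function name and hyperparameter values are assumptions):

```python
import numpy as np

def adagrad_step(theta, r, g, lr=0.1, delta=1e-7):
    """AdaGrad: accumulate squared gradients; scale each parameter's step by 1/(delta + sqrt(r))."""
    r = r + g ** 2                                   # per-parameter accumulator r_i
    theta = theta - lr * g / (delta + np.sqrt(r))    # smaller steps where gradients have been large
    return theta, r
```

A steeply sloped parameter accumulates a large r_i quickly, so its effective learning rate shrinks faster than that of gently sloped parameters.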
RMSProp
• For non-convex problems, AdaGrad can prematurely decrease the learning rate
• Use an exponentially weighted average for the gradient accumulation:
$$r_i = \rho\, r_i + (1-\rho)\, g_i^{2}$$
$$\theta_i = \theta_i - \frac{\epsilon}{\delta + \sqrt{r_i}}\, g_i$$
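The corresponding sketch, differing from AdaGrad only in how r is accumulated (names and values are illustrative):

```python
import numpy as np

def rmsprop_step(theta, r, g, lr=0.001, rho=0.9, delta=1e-6):
    """RMSProp: exponentially weighted average of squared gradients instead of a running sum."""
    r = rho * r + (1 - rho) * g ** 2
    theta = theta - lr * g / (delta + np.sqrt(r))
    return theta, r
```

Because old squared gradients decay at rate ρ, the effective learning rate can recover later in training instead of shrinking monotonically.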
Adam
• RMSProp + Momentum
• Estimate the first moment:
$$v_i = \rho_1 v_i + (1-\rho_1)\, g_i$$
• Estimate the second moment:
$$r_i = \rho_2 r_i + (1-\rho_2)\, g_i^{2}$$
• Apply bias correction to both $v$ and $r$
• Update the parameters:
$$\theta_i = \theta_i - \frac{\epsilon}{\delta + \sqrt{r_i}}\, v_i$$
• Works well in practice and is fairly robust to hyperparameters
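A sketch combining both moment estimates with bias correction (the function name is an assumption; ρ1 = 0.9, ρ2 = 0.999, and δ around 1e-8 are commonly used defaults):

```python
import numpy as np

def adam_step(theta, v, r, g, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam: momentum-style first moment + RMSProp-style second moment, with bias correction."""
    v = rho1 * v + (1 - rho1) * g          # first-moment estimate
    r = rho2 * r + (1 - rho2) * g ** 2     # second-moment estimate
    v_hat = v / (1 - rho1 ** t)            # bias corrections (t is the step count, starting at 1)
    r_hat = r / (1 - rho2 ** t)
    theta = theta - lr * v_hat / (delta + np.sqrt(r_hat))
    return theta, v, r
```

The bias correction matters most in the first few steps, when v and r are still close to their zero initialization.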
Outline: Parameter Initialization
Parameter Initialization
• Goal: break symmetry between units
• so that each unit computes a different function
• Initialize all weights (but not the biases) randomly
• Gaussian or uniform distribution
• Scale of initialization?
• Too large -> gradient explosion; too small -> gradient vanishing
Xavier Initialization
• Heuristic for all outputs to have unit variance
• For a fully-connected layer with m inputs:
$$W_{ij} \sim N\!\left(0,\ \tfrac{1}{m}\right)$$
• For ReLU units, it is recommended:
$$W_{ij} \sim N\!\left(0,\ \tfrac{2}{m}\right)$$
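A sketch of drawing such weights (the shape convention and function name are assumptions):

```python
import numpy as np

def xavier_normal(m, n_out, relu=False, seed=0):
    """Weights ~ N(0, 1/m) for a layer with m inputs, or N(0, 2/m) for ReLU units."""
    rng = np.random.default_rng(seed)
    var = 2.0 / m if relu else 1.0 / m
    return rng.normal(0.0, np.sqrt(var), size=(m, n_out))

W = xavier_normal(256, 128, relu=True)
print(W.var(), 2.0 / 256)   # empirical variance vs. the 2/m target
```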
Normalized Initialization
• For a fully-connected layer with m inputs and n outputs:
$$W_{ij} \sim U\!\left(-\sqrt{\tfrac{6}{m+n}},\ \sqrt{\tfrac{6}{m+n}}\right)$$
• Heuristic that trades off between initializing all layers to have the same activation variance and the same gradient variance
• Sparse variant when m is large:
– Initialize k nonzero weights in each unit
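A sketch of the uniform variant; the variance of U(-a, a) is a²/3, which here equals 2/(m+n) (function name and shape convention are assumptions):

```python
import numpy as np

def normalized_init(m, n_out, seed=0):
    """Weights ~ U(-sqrt(6/(m+n)), +sqrt(6/(m+n))) for a layer with m inputs and n outputs."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (m + n_out))
    return rng.uniform(-limit, limit, size=(m, n_out))

W = normalized_init(256, 128)
print(W.var(), 2.0 / (256 + 128))   # empirical variance vs. the 2/(m+n) target
```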
Bias Initialization
• Output unit bias
– Set from the marginal statistics of the output in the training set
• Hidden unit bias
– Avoid saturation at initialization
– E.g., for ReLU units, initialize the bias to 0.1 instead of 0
• Units controlling the participation of other units
– Set the bias so that these units participate at initialization
Outline: Batch Normalization
Feature Normalization
It is good practice to normalize features before applying a learning algorithm:
$$\tilde{x} = \frac{x - \mu}{\sigma}$$
where $x$ is the feature vector, $\mu$ is the vector of mean feature values, and $\sigma$ is the vector of feature standard deviations
All features are then on the same scale: mean 0 and variance 1
– Speeds up learning
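A sketch of this standardization that computes μ and σ on the training set and reuses them on held-out data (the function name and the small ε guard are assumptions):

```python
import numpy as np

def normalize_features(X_train, X_test=None, eps=1e-8):
    """Standardize each column to mean 0 and variance 1 using training-set statistics."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + eps              # guard against constant features
    if X_test is None:
        return (X_train - mu) / sigma
    return (X_train - mu) / sigma, (X_test - mu) / sigma   # reuse training mu and sigma

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
print(normalize_features(X))   # each column now has mean 0 and unit (population) variance
```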
Feature Normalization
[Figure: data before normalization vs. after normalization]