Key Points in Batch Normalization

- The original network parameters and the newly introduced $\gamma$ and $\beta$ are all trained jointly.
- At inference time, population statistics estimated over the training data replace the per-batch estimates:
  $E[x] \leftarrow E_B[\mu_B]$, $\mathrm{Var}[x] \leftarrow \frac{m}{m-1}\, E_B[\sigma_B^2]$
- In convolutional layers, all locations of a feature map are normalized in the same way: the effective mini-batch is $m' = |B| = m \cdot pq$, with one pair $\gamma^{(k)}, \beta^{(k)}$ per feature map. (A short NumPy sketch of both points follows below.)
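The sketch below is a minimal NumPy illustration of these two points, not the paper's implementation; the function and variable names are assumptions chosen for clarity.

```python
import numpy as np

def population_stats(batch_means, batch_vars, m):
    """Combine per-batch moments into inference-time population statistics:
    E[x] <- E_B[mu_B],  Var[x] <- m/(m-1) * E_B[sigma_B^2].
    For convolutional layers, m would be the effective size m' = m * p * q."""
    pop_mean = np.mean(batch_means, axis=0)
    pop_var = m / (m - 1.0) * np.mean(batch_vars, axis=0)
    return pop_mean, pop_var

def bn_conv_inference(x, pop_mean, pop_var, gamma, beta, eps=1e-5):
    """Inference-time BN for a conv activation x of shape (N, C, H, W):
    one (gamma, beta) and one set of statistics per feature map (channel),
    shared across all spatial locations."""
    shape = (1, -1, 1, 1)  # broadcast per-channel quantities over N, H, W
    x_hat = (x - pop_mean.reshape(shape)) / np.sqrt(pop_var.reshape(shape) + eps)
    return gamma.reshape(shape) * x_hat + beta.reshape(shape)
```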
Key Points in Batch Normalization

- Higher learning rates are allowed, because the normalized output is invariant to the scale of the layer's weights:
  $BN(Wu) = BN((aW)u)$
  $\frac{\partial BN((aW)u)}{\partial u} = \frac{\partial BN(Wu)}{\partial u}, \qquad \frac{\partial BN((aW)u)}{\partial (aW)} = \frac{1}{a} \cdot \frac{\partial BN(Wu)}{\partial W}$
- Batch Normalization also regularizes the model, reducing overfitting.
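As a quick illustrative check of the scale-invariance identity (a sketch with an arbitrary mini-batch and a positive scale factor, not code from the paper):

```python
import numpy as np

# Numerically verify BN(Wu) == BN((aW)u) on a random mini-batch.
def batch_norm(z, eps=1e-8):
    # Normalize each feature over the mini-batch dimension (axis 0).
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(64, 10))   # mini-batch of inputs
W = rng.normal(size=(10, 5))    # layer weights
a = 7.3                         # arbitrary positive scale factor

assert np.allclose(batch_norm(u @ W), batch_norm(u @ (a * W)), atol=1e-6)
print("BN output is invariant to scaling the weights")
```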
Overview

1. Batch Normalization
   - Internal Covariate Shift
   - Mini-Batch Normalization
   - Key Points in Batch Normalization
   - Experiments and Results
2. Importance of Initialization and Momentum
   - Overview of first-order methods
   - Momentum & Nesterov's Accelerated Gradient (NAG)
   - Deep Autoencoders & RNN - Echo-State Networks
Activations over time

- Batch Normalization helps the network train faster and achieve higher accuracy.
  (figure credit: reference paper)
Activations over time

- Batch Normalization makes the distribution of inputs to each layer more stable over the course of training.
  (figure credit: reference paper)
Accelerating Batch Normalization Networks

Tricks to follow:
- Increase the learning rate
- Remove or reduce Dropout
- Reduce $\ell_2$ weight regularization
- Accelerate the learning rate decay
- Remove Local Response Normalization
- Shuffle training examples more thoroughly
- Reduce the photometric distortions

A hypothetical training configuration reflecting these tweaks is sketched after this list.
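The sketch below shows how such tweaks might look in a PyTorch-style setup. The toy model, the base learning rate (0.0015), the 5x multiplier, and the decay schedule are illustrative assumptions, not the paper's exact Inception configuration.

```python
import torch
import torch.nn as nn

# Toy model: BN after the convolution replaces Local Response Normalization,
# and Dropout is removed (p=0.0) or reduced.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.Flatten(),
    nn.Dropout(p=0.0),
    nn.Linear(16 * 32 * 32, 10),
)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=5 * 0.0015,        # increased learning rate (e.g. 5x a baseline)
    momentum=0.9,
    weight_decay=1e-5,    # reduced L2 weight regularization
)
# Accelerated learning-rate decay: decay more often / more steeply.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.94)

# More thorough shuffling would be handled in the DataLoader (shuffle=True),
# and photometric distortions reduced in the data-augmentation pipeline.
```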
Network Comparisons

- Networks compared: Inception, BN-Baseline, BN-x5, BN-x30, BN-x5-Sigmoid
  (figure credit: reference paper)
Ensemble Classification

- The ensemble of batch-normalized networks reaches a top-5 validation error of 4.9% and a test error of 4.82%, exceeding the estimated accuracy of human raters.
  (figure credit: reference paper)
Challenges to be solved

- Reference paper: "On the importance of initialization and momentum in deep learning"
- It has been difficult for first-order methods to reach the performance previously achievable only with second-order methods such as Hessian-Free optimization.
- With a well-designed random initialization and a slowly increasing schedule for the momentum parameter, there is no need for sophisticated second-order methods.
Overview of first-order methods

- Vanilla Stochastic Gradient Descent (SGD)
- SGD + Momentum
- Nesterov's Accelerated Gradient (NAG)
- AdaGrad
- Adam
- Rprop
- RMSProp
- AdaDelta

(slide credit: Ishan Misra)
Several First-order Methods

Notation: $\theta$ - parameters of the network, $f$ - objective function, $\epsilon$ - learning rate, $\nabla f$ - gradient of $f$, $v$ - velocity vector, $\mu$ - momentum coefficient.

Vanilla SGD:
$v_{t+1} = \epsilon \nabla f(\theta_t)$
$\theta_{t+1} = \theta_t - v_{t+1}$

(slide credit: Ishan Misra)
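A minimal sketch of the vanilla SGD update, on an assumed toy objective $f(\theta) = \tfrac{1}{2}\|\theta\|^2$ (so $\nabla f(\theta) = \theta$); the step count and learning rate are illustrative.

```python
import numpy as np

def sgd(grad_f, theta0, eps=0.1, steps=100):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        v = eps * grad_f(theta)   # v_{t+1} = eps * grad f(theta_t)
        theta = theta - v         # theta_{t+1} = theta_t - v_{t+1}
    return theta

print(sgd(lambda th: th, theta0=[3.0, -2.0]))  # converges toward the origin
```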
Several First-order Methods

Rprop update (per parameter):
- if $\nabla f_t \, \nabla f_{t-1} > 0$: $v_t = \eta^{+} v_{t-1}$
- else if $\nabla f_t \, \nabla f_{t-1} < 0$: $v_t = \eta^{-} v_{t-1}$
- else: $v_t = v_{t-1}$

$\theta_{t+1} = \theta_t - v_t$, where $0 < \eta^{-} < 1 < \eta^{+}$

(slide credit: Ishan Misra)
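A per-parameter sketch of this rule on a toy quadratic. Standard Rprop moves each parameter by the sign of its gradient times its step size; the slide's compact notation folds that sign into $v_t$. Hyperparameter values here are illustrative assumptions.

```python
import numpy as np

def rprop(grad_f, theta0, steps=50, eta_plus=1.2, eta_minus=0.5, v0=0.1):
    theta = np.asarray(theta0, dtype=float)
    v = np.full_like(theta, v0)          # per-parameter step sizes
    prev_grad = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_f(theta)
        same_sign = g * prev_grad
        # grow step sizes where the gradient keeps its sign, shrink where it flips
        v = np.where(same_sign > 0, eta_plus * v,
            np.where(same_sign < 0, eta_minus * v, v))
        theta = theta - np.sign(g) * v   # theta_{t+1} = theta_t - sign(g) * v_t
        prev_grad = g
    return theta

print(rprop(lambda th: th, theta0=[3.0, -2.0]))  # toy objective: grad f = theta
```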
Several First-order Methods

AdaGrad:
$r_t = \nabla f(\theta_t)^2 + r_{t-1}$
$v_{t+1} = \frac{\alpha}{\sqrt{r_t}} \nabla f(\theta_t)$
$\theta_{t+1} = \theta_t - v_{t+1}$

RMSProp = Rprop + SGD:
$r_t = (1-\gamma)\,\nabla f(\theta_t)^2 + \gamma\, r_{t-1}$
$v_{t+1} = \frac{\alpha}{\sqrt{r_t}} \nabla f(\theta_t)$
$\theta_{t+1} = \theta_t - v_{t+1}$

(slide credit: Ishan Misra)
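A sketch of both updates on the same assumed toy objective ($\nabla f(\theta) = \theta$); the values of $\alpha$, $\gamma$, and the small `eps` added to the denominator for numerical stability are illustrative choices.

```python
import numpy as np

def adagrad(grad_f, theta0, alpha=0.5, steps=100, eps=1e-8):
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_f(theta)
        r = r + g ** 2                                # accumulate squared gradients
        theta = theta - alpha / (np.sqrt(r) + eps) * g
    return theta

def rmsprop(grad_f, theta0, alpha=0.1, gamma=0.9, steps=100, eps=1e-8):
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_f(theta)
        r = (1 - gamma) * g ** 2 + gamma * r          # exponential moving average
        theta = theta - alpha / (np.sqrt(r) + eps) * g
    return theta

print(adagrad(lambda th: th, [3.0, -2.0]), rmsprop(lambda th: th, [3.0, -2.0]))
```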
Several First-order Methods

AdaDelta (units argument): a second-order step $v_{t+1} = H^{-1}\nabla f \propto \frac{f'}{f''}$ has units $\frac{1/\text{units of }\theta}{(1/\text{units of }\theta)^2} \propto \text{units of }\theta$, i.e. the update has the same units as the parameters themselves.

Adam:
$r_t = (1-\gamma_1)\nabla f(\theta_t) + \gamma_1 r_{t-1}$
$p_t = (1-\gamma_2)\nabla f(\theta_t)^2 + \gamma_2 p_{t-1}$
$\hat{r}_t = \dfrac{r_t}{1 - \gamma_1^t}, \qquad \hat{p}_t = \dfrac{p_t}{1 - \gamma_2^t}$
$v_t = \alpha \dfrac{\hat{r}_t}{\sqrt{\hat{p}_t}}$
$\theta_{t+1} = \theta_t - v_t$

(slide credit: Ishan Misra)
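A sketch of the Adam update as written above; in the usual $(\beta_1, \beta_2)$ parameterization, $\gamma_1$ and $\gamma_2$ play the roles of $\beta_1$ and $\beta_2$. The toy objective, hyperparameter values, and the small `eps` in the denominator are illustrative assumptions.

```python
import numpy as np

def adam(grad_f, theta0, alpha=0.1, gamma1=0.9, gamma2=0.999, steps=200, eps=1e-8):
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)   # first-moment estimate
    p = np.zeros_like(theta)   # second-moment estimate
    for t in range(1, steps + 1):
        g = grad_f(theta)
        r = (1 - gamma1) * g + gamma1 * r
        p = (1 - gamma2) * g ** 2 + gamma2 * p
        r_hat = r / (1 - gamma1 ** t)   # bias correction for the zero init
        p_hat = p / (1 - gamma2 ** t)
        theta = theta - alpha * r_hat / (np.sqrt(p_hat) + eps)
    return theta

print(adam(lambda th: th, [3.0, -2.0]))  # toy objective: grad f = theta
```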
Momentum and NAG

Notation: $\theta$ - parameters of the network, $f$ - objective function, $\epsilon$ - learning rate, $\nabla f$ - gradient of $f$, $v$ - velocity vector, $\mu$ - momentum coefficient.

Classical Momentum (CM):
$v_{t+1} = \mu v_t - \epsilon \nabla f(\theta_t)$
$\theta_{t+1} = \theta_t + v_{t+1}$

Nesterov's Accelerated Gradient (NAG):
$v_{t+1} = \mu v_t - \epsilon \nabla f(\theta_t + \mu v_t)$
$\theta_{t+1} = \theta_t + v_{t+1}$
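A side-by-side sketch of the two updates on an assumed toy objective ($\nabla f(\theta) = \theta$); hyperparameters and step count are illustrative.

```python
import numpy as np

def cm_step(theta, v, grad_f, eps=0.01, mu=0.9):
    v_next = mu * v - eps * grad_f(theta)
    return theta + v_next, v_next

def nag_step(theta, v, grad_f, eps=0.01, mu=0.9):
    # The only difference: the gradient is evaluated at the partial update
    # theta + mu * v instead of at theta itself.
    v_next = mu * v - eps * grad_f(theta + mu * v)
    return theta + v_next, v_next

grad = lambda th: th
for step_fn in (cm_step, nag_step):
    theta, v = np.array([3.0, -2.0]), np.zeros(2)
    for _ in range(500):
        theta, v = step_fn(theta, v, grad)
    print(step_fn.__name__, theta)
```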
Relationship between CM and NAG

- NAG computes its gradient correction at $\theta_t + \mu v_t$, a partial update to $\theta_t$ that is missing only the as-yet-unknown correction. Thus when the addition of $\mu v_t$ results in an immediate undesirable increase in the objective $f$, the gradient $\nabla f(\theta_t + \mu v_t)$ pushes back toward $\theta_t$ more strongly than $\nabla f(\theta_t)$ does, so NAG corrects the velocity more quickly and responsively than CM.
  (figure credit: reference paper)
Relationship between CM and NAG

Apply CM and NAG to a positive-definite quadratic objective $q(x) = x^T A x / 2 + b^T x$. Along each eigen-direction of $A$, the two methods differ only in their effective momentum coefficient:
- Classical Momentum: $\mu$
- NAG: $\mu(1 - \lambda\epsilon)$, where $\lambda$ is the corresponding eigenvalue of $A$.

When $\epsilon$ is small, CM and NAG are equivalent; when $\epsilon$ is large, NAG's smaller effective momentum $\mu(1 - \lambda_i\epsilon)$ along high-curvature directions suppresses oscillations. A short derivation follows below.
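The derivation below restates the slide's claim for a single eigen-direction with curvature $\lambda$ (it is a reconstruction, not copied verbatim from the paper):

```latex
% One eigen-direction: f(\theta) = \tfrac{\lambda}{2}\theta^2, so \nabla f(\theta) = \lambda\theta.
\begin{aligned}
\text{NAG:}\quad
v_{t+1} &= \mu v_t - \epsilon \nabla f(\theta_t + \mu v_t)
         = \mu v_t - \epsilon \lambda (\theta_t + \mu v_t)
         = \mu (1 - \lambda \epsilon)\, v_t - \epsilon \lambda \theta_t ,\\
\text{CM:}\quad
v_{t+1} &= \mu v_t - \epsilon \lambda \theta_t .
\end{aligned}
% Hence NAG behaves like CM with effective momentum \mu(1-\lambda\epsilon):
% the two coincide for small \epsilon\lambda, while for large \epsilon\lambda the
% reduced effective momentum damps oscillations along high-curvature directions.
```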
Deep Autoencoders

- Structure of a deep autoencoder: a symmetric encoder-decoder network trained to reconstruct its input.
  (figure credit: http://deeplearning4j.org/deepautoencoder.html)
Deep Autoencoders

- Sparse initialization: each unit is connected to 15 randomly chosen units in the previous layer, with those weights drawn from a unit Gaussian (the remaining weights are zero).
- Schedule for the momentum coefficient:
  $\mu_t = \min\!\left(1 - 2^{-1-\log_2(\lfloor t/250 \rfloor + 1)},\ \mu_{\max}\right)$
  - $\mu_t = 1 - 3/(t+5)$ for not strongly convex objectives - Nesterov (1983)
  - constant $\mu_t$ for strongly convex objectives - Nesterov (2003)
- A sketch of both ingredients follows below.
  (table credit: reference paper)
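A minimal sketch of the two ingredients above; the function names, the default $\mu_{\max} = 0.99$, and the toy shapes are illustrative assumptions.

```python
import numpy as np

def sparse_init(fan_in, fan_out, num_connections=15, seed=0):
    """Each output unit gets num_connections incoming weights drawn from a
    unit Gaussian; all other incoming weights are exactly zero."""
    rng = np.random.default_rng(seed)
    W = np.zeros((fan_out, fan_in))
    for j in range(fan_out):
        idx = rng.choice(fan_in, size=min(num_connections, fan_in), replace=False)
        W[j, idx] = rng.normal(size=idx.size)
    return W

def momentum_schedule(t, mu_max=0.99):
    # mu_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), mu_max)
    return min(1.0 - 2.0 ** (-1 - np.log2(t // 250 + 1)), mu_max)

print(sparse_init(100, 4).shape)
print([round(momentum_schedule(t), 3) for t in (0, 250, 1000, 10000)])
```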
RNN - Echo-State Networks

Echo-State Networks (a family of RNNs):
- Hidden-to-output connections are learned from data.
- Recurrent (hidden-to-hidden) connections are fixed to a random draw from a specific distribution.
  (figure credit: Mantas Lukoševičius)
RNN - Echo-State Networks

ESN-based Initialization:
- Spectral radius of the hidden-to-hidden matrix around 1.1.
- The initial scale of the input-to-hidden connections plays an important role (a Gaussian draw with a standard deviation of 0.001 achieves a good balance, but this is task dependent).
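A minimal sketch consistent with these two points, not the paper's exact recipe: a dense random recurrent matrix rescaled to the target spectral radius and small Gaussian input weights. The function name, matrix sizes, and dense connectivity are illustrative assumptions.

```python
import numpy as np

def esn_init(n_hidden, n_input, spectral_radius=1.1, input_std=0.001, seed=0):
    rng = np.random.default_rng(seed)
    W_hh = rng.normal(size=(n_hidden, n_hidden))
    # Rescale so the largest eigenvalue magnitude equals the target spectral radius.
    W_hh *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W_hh)))
    # Small input-to-hidden weights; the right scale is task dependent.
    W_ih = rng.normal(scale=input_std, size=(n_hidden, n_input))
    return W_hh, W_ih

W_hh, W_ih = esn_init(200, 10)
print(np.max(np.abs(np.linalg.eigvals(W_hh))))  # ~1.1
```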