Hyper-parameters/Tweaking
Yufeng Ma, Chris Dusold (Virginia Tech)
November 17, 2015

Overview
  1. Batch Normalization: Internal Covariate Shift; Mini-Batch Normalization; Key Points in Batch Normalization; Experiments and Results
  2. Importance of Initialization and Momentum: Overview of first-order methods; Momentum & Nesterov's Accelerated Gradient (NAG); Deep Autoencoders & RNN - Echo-State Networks


Slide 13: Key Points in Batch Normalization
- The original network parameters and the newly introduced γ and β are trained jointly.
- At inference time, population statistics estimated over the training mini-batches replace the per-mini-batch estimates:
    E[x] ← E_B[μ_B],   Var[x] ← (m / (m − 1)) · E_B[σ_B^2]
- In convolutional layers, all locations of a feature map are normalized in the same way: the effective mini-batch size is m′ = |B| = m · pq, with one pair γ^(k), β^(k) per feature map.
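A minimal NumPy sketch of the training-time vs. inference-time normalization summarized above; the function and variable names are mine, not from the paper:

    import numpy as np

    def bn_train(x, gamma, beta, eps=1e-5):
        # x: (m, d) mini-batch. Normalize each feature with the mini-batch
        # statistics, then scale and shift with the learned gamma and beta.
        mu_B = x.mean(axis=0)
        var_B = x.var(axis=0)
        x_hat = (x - mu_B) / np.sqrt(var_B + eps)
        return gamma * x_hat + beta, mu_B, var_B

    def bn_inference(x, gamma, beta, batch_means, batch_vars, m, eps=1e-5):
        # Population statistics over the stored training mini-batch statistics:
        #   E[x] = E_B[mu_B],  Var[x] = m/(m-1) * E_B[sigma_B^2]
        pop_mean = np.mean(batch_means, axis=0)
        pop_var = m / (m - 1) * np.mean(batch_vars, axis=0)
        x_hat = (x - pop_mean) / np.sqrt(pop_var + eps)
        return gamma * x_hat + beta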

Slide 14: Key Points in Batch Normalization (continued)
- Higher learning rates can be used: Batch Normalization makes the layer output invariant to the scale of its weights,
    BN(Wu) = BN((aW)u),
  and the gradients satisfy
    ∂BN((aW)u)/∂u = ∂BN(Wu)/∂u,   ∂BN((aW)u)/∂(aW) = (1/a) · ∂BN(Wu)/∂W,
  so larger weights lead to smaller weight gradients rather than exploding updates.
- Batch Normalization also regularizes the model, reducing overfitting.
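The scale-invariance claim is easy to check numerically; a small sketch (the bn helper below just normalizes each column of a mini-batch without the learned scale/shift; names are mine):

    import numpy as np

    def bn(z, eps=1e-5):
        # Column-wise batch normalization of the pre-activations.
        return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

    rng = np.random.default_rng(0)
    u = rng.normal(size=(64, 10))   # mini-batch of layer inputs
    W = rng.normal(size=(10, 5))    # weight matrix
    a = 7.3                         # arbitrary scalar

    print(np.allclose(bn(u @ W), bn(u @ (a * W))))  # True: BN(Wu) == BN((aW)u)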

Slide 15: Overview (outline recap)

Slide 16: Activations over time
Batch Normalization helps the network train faster and reach higher accuracy. (figure credit: reference paper)

Slide 17: Activations over time (continued)
Batch Normalization keeps the input distribution of each layer more stable over the course of training. (figure credit: reference paper)

Slide 18: Accelerating Batch Normalization Networks
Tricks that go with Batch Normalization (a hypothetical configuration sketch follows this list):
- Increase the learning rate
- Remove or reduce Dropout
- Reduce the L2 weight regularization
- Accelerate the learning rate decay
- Remove Local Response Normalization
- Shuffle training examples more thoroughly
- Reduce the photometric distortions
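Purely as an illustration of how these tweaks translate into hyper-parameters (the dictionary keys and every value below are hypothetical, not the paper's actual configuration):

    # Hypothetical baseline vs. a BN-x5-style configuration.
    baseline = {
        "learning_rate": 0.0015,
        "dropout": 0.4,
        "l2_weight_decay": 1e-4,
        "lr_decay_steps": 60000,
        "use_lrn": True,
        "photometric_distortion": 1.0,
    }

    bn_x5 = dict(
        baseline,
        learning_rate=5 * baseline["learning_rate"],      # increase the learning rate
        dropout=0.0,                                      # remove dropout
        l2_weight_decay=baseline["l2_weight_decay"] / 5,  # reduce L2 regularization
        lr_decay_steps=baseline["lr_decay_steps"] // 6,   # faster learning-rate decay
        use_lrn=False,                                    # drop Local Response Normalization
        photometric_distortion=0.5,                       # milder photometric distortions
    )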

Slide 19: Network Comparisons
Networks compared: Inception (baseline), BN-Baseline, BN-x5, BN-x30, and BN-x5-Sigmoid. (figure credit: reference paper)

Slide 20: Ensemble Classification
An ensemble of batch-normalized networks reaches a top-5 validation error of 4.9% and a test error of 4.82%, exceeding the estimated accuracy of human raters. (figure credit: reference paper)

Slide 21: Overview (outline recap)

Slide 22: Challenges to be solved
Reference paper: "On the importance of initialization and momentum in deep learning" (Sutskever et al., 2013).
It had been difficult for first-order methods to reach the performance previously achievable only with second-order methods such as Hessian-Free optimization. With two ingredients,
- a well-designed random initialization, and
- a slowly increasing schedule for the momentum parameter,
first-order methods match that performance, and sophisticated second-order methods are no longer needed.

Slide 23: Overview of first-order methods
First-order methods include: vanilla Stochastic Gradient Descent (SGD), SGD + Momentum, Nesterov's Accelerated Gradient (NAG), AdaGrad, Adam, Rprop, RMSProp, and AdaDelta. (slide credit: Ishan Misra)

Slide 24: Overview (outline recap)

Slide 25: Several First-order Methods
Notation: θ - network parameters, f - objective function, ε - learning rate, ∇f - gradient of f, v - velocity vector, μ - momentum coefficient.
Vanilla SGD:
    v_{t+1} = ε ∇f(θ_t)
    θ_{t+1} = θ_t − v_{t+1}
(slide credit: Ishan Misra)
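As code, the vanilla SGD update is a one-liner per step; a minimal sketch in the slide's notation (f_grad stands in for ∇f):

    import numpy as np

    def sgd_step(theta, f_grad, eps=0.1):
        # v_{t+1} = eps * grad f(theta_t);  theta_{t+1} = theta_t - v_{t+1}
        v = eps * f_grad(theta)
        return theta - v

    # Example: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta.
    theta = np.ones(3)
    for _ in range(100):
        theta = sgd_step(theta, f_grad=lambda th: th)
    print(theta)  # close to the minimizer at 0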

Slide 26: Several First-order Methods (Rprop)
Rprop update:
    if ∇f_t · ∇f_{t−1} > 0:  v_t = η+ v_{t−1}
    else if ∇f_t · ∇f_{t−1} < 0:  v_t = η− v_{t−1}
    else:  v_t = v_{t−1}
    θ_{t+1} = θ_t − v_t,   where 0 < η− < 1 < η+
(slide credit: Ishan Misra)
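A per-parameter sketch of the rule above. The slide writes the update as θ_{t+1} = θ_t − v_t; the common Rprop implementation keeps v_t as a positive step size and moves against the sign of the gradient, which is the (assumed) form used below:

    import numpy as np

    def rprop_step(theta, v, grad, grad_prev, eta_plus=1.2, eta_minus=0.5):
        # Grow the step where the gradient kept its sign, shrink it where it flipped.
        same_sign = grad * grad_prev > 0
        flipped = grad * grad_prev < 0
        v = np.where(same_sign, eta_plus * v, np.where(flipped, eta_minus * v, v))
        # Move each parameter against its gradient by the per-parameter step v.
        theta = theta - np.sign(grad) * v
        return theta, v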

Slide 27: Several First-order Methods (AdaGrad, RMSProp)
AdaGrad accumulates squared gradients to scale the step per parameter:
    r_t = ∇f(θ_t)^2 + r_{t−1}
    v_{t+1} = (α / √r_t) ∇f(θ_t)
    θ_{t+1} = θ_t − v_{t+1}
RMSProp (≈ Rprop + SGD) replaces the running sum with an exponential moving average:
    r_t = (1 − γ) ∇f(θ_t)^2 + γ r_{t−1}
    v_{t+1} = (α / √r_t) ∇f(θ_t)
    θ_{t+1} = θ_t − v_{t+1}
(slide credit: Ishan Misra)
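A NumPy sketch of both accumulator rules (the small eps added to the denominator is the usual numerical-stability constant, not shown on the slide):

    import numpy as np

    def adagrad_step(theta, r, grad, alpha=0.01, eps=1e-8):
        # r_t accumulates squared gradients over all steps.
        r = r + grad ** 2
        theta = theta - alpha * grad / (np.sqrt(r) + eps)
        return theta, r

    def rmsprop_step(theta, r, grad, alpha=0.01, gamma=0.9, eps=1e-8):
        # Same scaling, but r_t is an exponential moving average of squared gradients.
        r = (1 - gamma) * grad ** 2 + gamma * r
        theta = theta - alpha * grad / (np.sqrt(r) + eps)
        return theta, r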

Slide 28: Several First-order Methods (AdaDelta, Adam)
AdaDelta is motivated by a units argument: the plain gradient step ∝ ∇f has units of 1/θ, whereas a Newton-like step
    v_{t+1} = H^{-1} ∇f  ∝  f′ / f″  ∝  (1/units of θ) / (1/units of θ)^2  =  units of θ,
so AdaDelta rescales the update to have the same units as the parameters.
Adam keeps moving averages of the gradient and of the squared gradient, with bias correction:
    r_t = (1 − γ_1) ∇f(θ_t) + γ_1 r_{t−1}
    p_t = (1 − γ_2) ∇f(θ_t)^2 + γ_2 p_{t−1}
    r̂_t = r_t / (1 − γ_1^t),   p̂_t = p_t / (1 − γ_2^t)
    v_t = α r̂_t / √p̂_t
    θ_{t+1} = θ_t − v_t
(slide credit: Ishan Misra)
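A sketch of the Adam update in the slide's parameterization (γ_1, γ_2 here play the role of β_1, β_2 in the Adam paper; t starts at 1):

    import numpy as np

    def adam_step(theta, r, p, grad, t, alpha=0.001, gamma1=0.9, gamma2=0.999, eps=1e-8):
        # Moving averages of the gradient (r) and of the squared gradient (p).
        r = (1 - gamma1) * grad + gamma1 * r
        p = (1 - gamma2) * grad ** 2 + gamma2 * p
        # Bias correction for the zero initialization of r and p.
        r_hat = r / (1 - gamma1 ** t)
        p_hat = p / (1 - gamma2 ** t)
        theta = theta - alpha * r_hat / (np.sqrt(p_hat) + eps)
        return theta, r, p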

Slide 29: Overview (outline recap)

Slide 30: Momentum and NAG
Notation: θ - network parameters, f - objective function, ε - learning rate, ∇f - gradient of f, v - velocity vector, μ - momentum coefficient.
Classical Momentum (CM):
    v_{t+1} = μ v_t − ε ∇f(θ_t)
    θ_{t+1} = θ_t + v_{t+1}
Nesterov's Accelerated Gradient (NAG):
    v_{t+1} = μ v_t − ε ∇f(θ_t + μ v_t)
    θ_{t+1} = θ_t + v_{t+1}
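Side by side in code, the only difference is where the gradient is evaluated; a minimal sketch with f_grad standing in for ∇f:

    def cm_step(theta, v, f_grad, eps=0.01, mu=0.9):
        # Classical momentum: gradient at the current parameters.
        v = mu * v - eps * f_grad(theta)
        return theta + v, v

    def nag_step(theta, v, f_grad, eps=0.01, mu=0.9):
        # Nesterov: gradient after the momentum "look-ahead" step theta + mu*v.
        v = mu * v - eps * f_grad(theta + mu * v)
        return theta + v, v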

Slide 31: Relationship between CM and NAG
Both methods first take the momentum step μ v_t; NAG then evaluates the gradient at θ_t + μ v_t, i.e. after the momentum step but before the still-unknown gradient correction. Thus, when adding μ v_t causes an immediate undesirable increase in the objective f, the NAG gradient ∇f(θ_t + μ v_t) pushes back toward θ_t more strongly than ∇f(θ_t) does, so NAG corrects the velocity more quickly than CM. (figure credit: reference paper)

Slide 32: Relationship between CM and NAG (continued)
Apply CM and NAG to a positive-definite quadratic objective q(x) = x^T A x / 2 + b^T x. Along each eigen-direction of A they differ only in the effective momentum coefficient:
- Classical Momentum: μ
- NAG: μ (1 − λ ε), where λ is the corresponding eigenvalue of A.
When ε is small (λ ε ≪ 1), CM and NAG are nearly equivalent; when ε is large, NAG's smaller effective momentum μ (1 − λ_i ε) in the high-curvature directions damps the oscillations that CM exhibits. A small numerical check follows.
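A 1-D numerical check of the effective-momentum argument; the curvature λ, learning rate ε, and momentum μ below are arbitrary, chosen so that λε is close to 1:

    lam, eps, mu = 110.0, 0.01, 0.9          # curvature, learning rate, momentum
    grad = lambda x: lam * x                 # gradient of q(x) = lam * x^2 / 2

    def run(nesterov, steps=50):
        x, v, tail = 1.0, 0.0, []
        for t in range(steps):
            g = grad(x + mu * v) if nesterov else grad(x)
            v = mu * v - eps * g
            x = x + v
            if t >= steps - 10:
                tail.append(abs(x))
        return max(tail)                     # worst |x| over the last 10 steps

    print("CM :", run(False))   # oscillates and decays slowly (effective momentum mu)
    print("NAG:", run(True))    # effective momentum mu*(1 - lam*eps): orders of magnitude closer to 0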

Slide 33: Overview (outline recap)

Slide 34: Deep Autoencoders
Structure of a deep autoencoder: a symmetric encoder-decoder network trained to reconstruct its input. (figure credit: http://deeplearning4j.org/deepautoencoder.html)

Slide 35: Deep Autoencoders (continued)
Sparse initialization: each unit is connected to 15 randomly chosen units in the previous layer, with those weights drawn from a unit Gaussian (a code sketch follows this slide).
Schedule for the momentum coefficient:
    μ_t = min(1 − 2^(−1 − log_2(⌊t/250⌋ + 1)), μ_max)
Related theory: μ_t = 1 − 3/(t + 5) for objectives that are not strongly convex (Nesterov, 1983); a constant μ_t for strongly convex objectives (Nesterov, 2003). (table credit: reference paper)
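A sketch of both recipes above (function names are mine; μ_max is a per-problem hyper-parameter in the paper, so the default below is only a placeholder):

    import numpy as np

    def sparse_init(n_in, n_out, n_connections=15, rng=None):
        # Each unit gets exactly n_connections incoming weights drawn from a unit
        # Gaussian; all other incoming weights are zero.
        if rng is None:
            rng = np.random.default_rng()
        W = np.zeros((n_in, n_out))
        for j in range(n_out):
            idx = rng.choice(n_in, size=min(n_connections, n_in), replace=False)
            W[idx, j] = rng.standard_normal(len(idx))
        return W

    def momentum_schedule(t, mu_max=0.995):
        # mu_t = min(1 - 2^(-1 - log2(floor(t / 250) + 1)), mu_max)
        return min(1.0 - 2.0 ** (-1.0 - np.log2(t // 250 + 1)), mu_max)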

Slide 36: RNN - Echo-State Networks
Echo-State Networks are a family of RNNs in which the hidden-to-output connections are learned from data, while the recurrent (hidden-to-hidden) connections are fixed to a random draw from a specific distribution. (figure credit: Mantas Lukoševičius)

Slide 37: RNN - Echo-State Networks (continued)
ESN-based initialization for training RNNs with momentum:
- Spectral radius of the hidden-to-hidden matrix around 1.1.
- The initial scale of the input-to-hidden connections plays an important role: a Gaussian draw with standard deviation 0.001 achieves a good balance, but the best value is task dependent.
A sketch of this initialization follows.
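A sketch of the initialization just described, using dense Gaussian matrices for brevity (all names here are mine, not the paper's code):

    import numpy as np

    def esn_style_init(n_hidden, n_in, spectral_radius=1.1, input_scale=0.001, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        # Hidden-to-hidden weights: rescale a random matrix so that its largest
        # eigenvalue magnitude (spectral radius) is about 1.1.
        W_hh = rng.standard_normal((n_hidden, n_hidden))
        W_hh *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W_hh)))
        # Input-to-hidden weights: small Gaussian; a standard deviation of 0.001
        # worked well in the paper's experiments but the right scale is task dependent.
        W_xh = rng.standard_normal((n_hidden, n_in)) * input_scale
        return W_hh, W_xh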
