Normalization Techniques in Training of Deep Neural Networks Lei Huang (黄雷) State Key Laboratory of Software Development Environment, Beihang University Mail: huanglei@nlsde.buaa.edu.cn August 17th, 2017
Outline • Introduction to Deep Neural Networks (DNNs) • Training DNNs: Optimization • Batch Normalization • Other Normalization Techniques • Centered Weight Normalization
Machine learning • Dataset D = {X, Y}: input X, output Y • Learning: Y = F(X) or P(Y|X) • Goal: fitting and generalization • Types (from the model's view): – Non-parametric model: Y = F(X; x_1, x_2, …, x_n) – Parametric model: Y = F(X; θ)
Neural network • Neural network: Y = F(X) = f_T(f_{T−1}(… f_1(X))), where each layer is f_i(x) = σ(Wx + b) • Nonlinear activation – sigmoid – ReLU
Deep neural network • Why deep? – Powerful representation capacity
Key properties of Deep learning • End-to-end learning – no distinction between feature extractor and classifier • “Deep” architectures: – Hierarchy of simpler non-linear modules
Applications and techniques of DNNs • Successful applications in a range of domains – Speech – Computer Vision – Natural Language Processing • Main techniques in using deep neural networks – Design the architecture • Module selection and module connection • Loss function – Train the model based on optimization • Initialize the parameters • Search direction in parameter space • Learning rate schedule • Regularization techniques • …
Outline • Introduction to Deep Neural Networks (DNNs) • Training DNNs: Optimization • Batch Normalization • Other Normalization Techniques • Centered Weight Normalization
Training of Neural Networks • Multi-layer perceptron (example): input x, one hidden layer h^(2), output ŷ, target y
1. Forward, calculate ŷ:
– a^(2) = W^(2)·x,  h^(2) = σ(a^(2))
– a^(3) = W^(3)·h^(2),  ŷ = σ(a^(3))
– MSE loss: L = (ŷ − y)²
2. Backward, calculate dL/da:
– dL/dŷ = 2(ŷ − y)
– dL/da^(3) = dL/dŷ · σ(a^(3)) · (1 − σ(a^(3)))
– dL/dh^(2) = dL/da^(3) · W^(3)
– dL/da^(2) = dL/dh^(2) · σ(a^(2)) · (1 − σ(a^(2)))
– dL/dx = dL/da^(2) · W^(2)
3. Calculate the weight gradients dL/dW:
– dL/dW^(3) = dL/da^(3) · h^(2)
– dL/dW^(2) = dL/da^(2) · x
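To make the forward/backward computation on this slide concrete, here is a minimal NumPy sketch of the two-layer sigmoid MLP with MSE loss; the layer sizes, random seed, and input/target values are illustrative assumptions, not values from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative sizes (assumptions): 3 inputs, 4 hidden units, 3 outputs.
rng = np.random.default_rng(0)
W2 = rng.normal(0, 0.1, size=(4, 3))   # hidden weights W^(2)
W3 = rng.normal(0, 0.1, size=(3, 4))   # output weights W^(3)

x = np.array([1.0, 0.5, -0.5])
y = np.array([1.0, 0.0, 0.0])          # target

# 1. Forward
a2 = W2 @ x                    # a^(2) = W^(2) x
h2 = sigmoid(a2)               # h^(2) = sigmoid(a^(2))
a3 = W3 @ h2                   # a^(3) = W^(3) h^(2)
y_hat = sigmoid(a3)            # prediction
L = np.sum((y_hat - y) ** 2)   # MSE loss

# 2. Backward (chain rule, following the slide's order)
dL_dyhat = 2 * (y_hat - y)
dL_da3 = dL_dyhat * y_hat * (1 - y_hat)   # sigmoid'(a) = s(a)(1 - s(a))
dL_dh2 = W3.T @ dL_da3
dL_da2 = dL_dh2 * h2 * (1 - h2)

# 3. Weight gradients (outer products)
dL_dW3 = np.outer(dL_da3, h2)
dL_dW2 = np.outer(dL_da2, x)
```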
Optimization in Deep Model • Goal: minimize the empirical loss, θ* = argmin_θ (1/N)·Σ_i L(f(x_i; θ), y_i) • Update iteratively: θ_{t+1} = θ_t − η_t · (update direction) • Challenges: – Non-convex, with local optima – Saddle points – Severe correlation between dimensions and a highly non-isotropic (ill-shaped) parameter space
First order optimization • First order stochastic gradient descent (SGD): – Follows the direction of the (negative) gradient – The gradient is averaged over the sampled mini-batch examples – Disadvantages: • Over-aggressive steps on ridges • Too small steps on plateaus • Slow convergence • Non-robust performance (Figure: zig-zag iteration path of SGD)
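A minimal sketch of the mini-batch SGD update described above, assuming a generic `grad_fn(theta, batch)` that returns the gradient averaged over the batch; the function name, step size, and batch size are illustrative, not from the slides.

```python
import numpy as np

def sgd(theta, grad_fn, data, lr=0.1, batch_size=32, steps=1000, seed=0):
    """Plain mini-batch SGD: theta <- theta - lr * averaged gradient."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        g = grad_fn(theta, data[idx])   # gradient averaged over the mini-batch
        theta = theta - lr * g
    return theta
```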
Advanced Optimization • Estimate curvature or scale – Quadratic optimization • Newton or quasi-Newton: uses the inverse of the Hessian • Natural gradient: uses the inverse of the Fisher information matrix (FIM) – Estimate the scale • AdaGrad • RMSProp • Adam (Figure: iteration paths of SGD (red) and NGD (green)) • Normalize input/activation – Intuition: the landscape of the cost L = ℓ(f(x; θ), y) w.r.t. the parameters is controlled by the input/activation – Method: stabilize the distribution of the input/activation • Normalize the input explicitly • Normalize the input implicitly (by constraining the weights)
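As an example of the "estimate the scale" family, here is a hedged RMSProp-style update in NumPy; the decay rate and epsilon are common defaults chosen for illustration, not values taken from the slides.

```python
import numpy as np

def rmsprop_update(theta, grad, state, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSProp step: scale each coordinate by a running RMS of its gradients."""
    state = decay * state + (1 - decay) * grad ** 2   # running average of squared gradients
    theta = theta - lr * grad / (np.sqrt(state) + eps)
    return theta, state

# Usage sketch: the state starts at zeros with the same shape as theta.
theta = np.zeros(5)
state = np.zeros_like(theta)
grad = np.ones(5)                  # placeholder gradient
theta, state = rmsprop_update(theta, grad, state)
```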
Some intuitions of normalization for optimization • How does normalizing activations affect the optimization? – Linear model: ŷ = w_1·x_1 + w_2·x_2 + b, loss L = (ŷ − y)² – If the inputs have very different ranges, e.g. 0 < x_1 < 2 and 0 < x_2 < 0.5, the loss surface L(w_1, w_2) is elongated (ill-conditioned) – Rescaling x_1' = x_1/2 and x_2' = 2·x_2 puts both inputs in (0, 1), and the loss surface L(w_1, w_2) becomes far more isotropic (Figure: contour plots of L(w_1, w_2) before and after rescaling the inputs)
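A small NumPy check of this intuition, under the assumption of a quadratic MSE loss in the two weights: for the loss averaged over the data, the Hessian is 2·E[x·xᵀ], so the condition number of the input second-moment matrix directly measures how elongated the loss surface is before and after rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.uniform(0, 2.0, n)     # 0 < x1 < 2
x2 = rng.uniform(0, 0.5, n)     # 0 < x2 < 0.5

def condition_number(x1, x2):
    X = np.stack([x1, x2], axis=1)
    H = 2 * (X.T @ X) / n        # Hessian of the average squared error w.r.t. (w1, w2)
    return np.linalg.cond(H)

print("before rescaling:", condition_number(x1, x2))          # badly conditioned
print("after  rescaling:", condition_number(x1 / 2, x2 * 2))  # noticeably better conditioned
```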
Outline • Introduction to Deep Neural Networks (DNNs) • Training DNNs: Optimization • Batch Normalization • Other Normalization Techniques • Centered Weight Normalization
Batch Normalization--motivation • Solving internal covariate shift – e.g. a two-layer chain h_1 = w_1·x, h_2 = w_2·h_1: the distribution of h_1 shifts whenever w_1 is updated • Whitening the input benefits optimization (LeCun et al., 1998, Efficient BackProp), e.g. for y = Wx with MSE loss: – Centering – Decorrelating – Stretching (Figure: effect of centering, decorrelating and stretching on the input distribution)
Batch Normalization--method • Only standardize the input (decorrelating is expensive): – Centering – Stretching • How to do it? – x̂ = (x − E[x]) / std(x) (Figure: centering and stretching of the input distribution)
Batch Normalization--training • Forward (over a mini-batch B = {x_1, …, x_m}): – mini-batch mean: μ_B = (1/m)·Σ_i x_i – mini-batch variance: σ_B² = (1/m)·Σ_i (x_i − μ_B)² – normalize: x̂_i = (x_i − μ_B) / √(σ_B² + ε) – scale and shift: y_i = γ·x̂_i + β
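A minimal NumPy sketch of the batch-normalization forward pass above, for a fully-connected layer whose activations have shape (batch, features); the epsilon value is the usual small constant and is an illustrative choice.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch. Returns the BN output plus a cache for the backward pass."""
    mu = x.mean(axis=0)                       # per-feature mini-batch mean
    var = x.var(axis=0)                       # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # standardize
    y = gamma * x_hat + beta                  # scale and shift (learnable)
    cache = (x, x_hat, mu, var, gamma, eps)
    return y, cache
```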
Batch Normalization--training • Backward (chain rule, as in the BN paper): – ∂L/∂x̂_i = ∂L/∂y_i · γ – ∂L/∂σ_B² = Σ_i ∂L/∂x̂_i · (x_i − μ_B) · (−1/2)·(σ_B² + ε)^(−3/2) – ∂L/∂μ_B = −Σ_i ∂L/∂x̂_i / √(σ_B² + ε) + ∂L/∂σ_B² · (−2/m)·Σ_i (x_i − μ_B) – ∂L/∂x_i = ∂L/∂x̂_i / √(σ_B² + ε) + ∂L/∂σ_B² · 2(x_i − μ_B)/m + ∂L/∂μ_B / m – ∂L/∂γ = Σ_i ∂L/∂y_i · x̂_i,  ∂L/∂β = Σ_i ∂L/∂y_i
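A matching NumPy sketch of the backward pass, using the cache returned by the forward function above; it follows the standard BN gradient derivation rather than any code from the slides.

```python
import numpy as np

def batchnorm_backward(dy, cache):
    """dy: upstream gradient dL/dy of shape (m, d). Returns (dx, dgamma, dbeta)."""
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    std_inv = 1.0 / np.sqrt(var + eps)

    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)

    dx_hat = dy * gamma
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * std_inv**3, axis=0)
    dmu = np.sum(-dx_hat * std_inv, axis=0) + dvar * np.mean(-2.0 * (x - mu), axis=0)
    dx = dx_hat * std_inv + dvar * 2.0 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta
```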
Batch Normalization--Inference • Inference (in the paper): use population statistics estimated after training – E[x] = E_B[μ_B],  Var[x] = (m/(m−1))·E_B[σ_B²] • Inference (in practice): running averages maintained during training – E[x] ← α·μ_B + (1 − α)·E[x] – Var[x] ← α·σ_B² + (1 − α)·Var[x]
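A hedged sketch of inference-time BN with running statistics; the momentum value is illustrative, and `running_mean`/`running_var` are assumed to have been updated during training as in the formulas above.

```python
import numpy as np

def batchnorm_update_running(mu_b, var_b, running_mean, running_var, alpha=0.1):
    """During training: exponential moving average of the mini-batch statistics."""
    running_mean = alpha * mu_b + (1 - alpha) * running_mean
    running_var = alpha * var_b + (1 - alpha) * running_var
    return running_mean, running_var

def batchnorm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """At test time, normalize with the fixed running statistics."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```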
Batch Normalization — how to use • Convolution layer: normalize per channel, over the batch and spatial locations • Wrapped as a module – Before or after the nonlinearity? • For shallow models (fewer than ~11 layers), after the nonlinearity • For deep models, before the nonlinearity – Advantages of placing it before the nonlinearity: • For ReLU, about half of the units are activated • For sigmoid, it avoids the saturated region – Advantage of placing it after the nonlinearity: • Matches the intuition of whitening the layer input (A placement sketch is given below.)
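For concreteness, a short sketch of the two placements using PyTorch modules (an assumption; the slides do not specify a framework): the "before nonlinearity" variant is Conv→BN→ReLU and the "after nonlinearity" variant is Conv→ReLU→BN.

```python
import torch.nn as nn

channels = 64  # illustrative channel count

# BN before the nonlinearity (common choice for deep CNNs)
block_bn_before = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(channels),
    nn.ReLU(inplace=True),
)

# BN after the nonlinearity (closer to the "whiten the layer input" intuition)
block_bn_after = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.BatchNorm2d(channels),
)
```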
Batch Normalization — how to use • Example: residual block (ResNet, CVPR 2016) vs. pre-activation residual block (ECCV 2016)
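A hedged PyTorch sketch of the pre-activation ordering; the channel count is illustrative and the projection/downsampling details of real ResNets are omitted.

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation ordering (ECCV 2016): BN -> ReLU -> Conv, twice, plus identity."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + x   # identity shortcut
```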
Batch Normalization — characteristics • For accelerating training: – Weight-scale invariant: not sensitive to the weight initialization – Adjustable learning rate; allows a large learning rate – Better conditioning (LeCun et al., 1998) • For generalization: – Stochastic (batch statistics), works like Dropout
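A quick NumPy check of the weight-scale invariance claim: multiplying the weights by a positive constant leaves the batch-normalized output unchanged (up to epsilon). The helper here uses gamma = 1 and beta = 0 purely for brevity.

```python
import numpy as np

def bn(z, eps=1e-5):
    # standardize each feature over the mini-batch (gamma = 1, beta = 0 for simplicity)
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 10))   # mini-batch of inputs
W = rng.normal(size=(10, 5))     # layer weights

y1 = bn(x @ W)                   # BN(Wx)
y2 = bn(x @ (10.0 * W))          # BN(10 * Wx)

print(np.allclose(y1, y2, atol=1e-4))  # True: the BN output ignores the weight scale
```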
Batch Normalization • A routine component in deep feed-forward neural networks, especially CNNs • Weaknesses: – Cannot be used for online learning – Unstable for small mini-batch sizes – Should be used in RNNs with caution
Batch Normalization – for RNN • Extra problems that need to be considered: – Where should the BN module be placed? – How to handle sequence data? • 2016, ICASSP, Batch Normalized Recurrent Neural Networks – Where to put the BN module – Sequence data problem: • Frame-wise normalization • Sequence-wise normalization (a sketch of the two statistics choices follows below)
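A hedged NumPy sketch of the two options for sequence data, assuming hidden states stored as an array of shape (T, N, D) (time, batch, features): frame-wise normalization keeps separate statistics per time step, while sequence-wise normalization pools statistics over all time steps; padding/masking of variable-length sequences is ignored here for brevity.

```python
import numpy as np

def frame_wise_norm(h, eps=1e-5):
    """Separate mean/variance for each time step t, computed over the batch."""
    mu = h.mean(axis=1, keepdims=True)       # shape (T, 1, D)
    var = h.var(axis=1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def sequence_wise_norm(h, eps=1e-5):
    """One mean/variance shared across all time steps, computed over time and batch."""
    mu = h.mean(axis=(0, 1), keepdims=True)  # shape (1, 1, D)
    var = h.var(axis=(0, 1), keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

h = np.random.default_rng(0).normal(size=(20, 8, 32))  # (T, N, D), illustrative sizes
print(frame_wise_norm(h).shape, sequence_wise_norm(h).shape)
```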
Batch Normalization for RNN • 2017, ICLR, Recurrent Batch Normalization – Where to put the BN module – Sequence data problem: • Frame-wise normalization (per-time-step statistics, up to T_max) • Which option is better? It depends…
Outline • Introduction to Deep Neural Networks (DNNs) • Training DNNs: Optimization • Batch Normalization • Other Normalization Techniques • Centered Weight Normalization
Norm-propagation (2016, ICML) • Targets BN's drawbacks: – Cannot be used for online learning – Unstable for small mini-batch sizes • Data-independent parametric estimates of mean and variance: – Normalize the input: zero mean and unit variance – Assume W is (approximately) orthogonal – Derive how the nonlinearity changes the statistics, e.g. for ReLU (see the sketch below)
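To illustrate the "derive the statistics through the nonlinearity" step: if a pre-activation is approximately standard normal, the post-ReLU mean and variance have closed forms, E[ReLU(z)] = 1/√(2π) and Var[ReLU(z)] = (1/2)(1 − 1/π). The NumPy check below verifies these constants by Monte Carlo; it is an illustration of the idea, not code from the Norm-Propagation paper.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)      # pre-activations ~ N(0, 1)
a = np.maximum(z, 0.0)                  # ReLU

mean_closed = 1.0 / np.sqrt(2 * np.pi)      # ≈ 0.3989
var_closed = 0.5 * (1.0 - 1.0 / np.pi)      # ≈ 0.3408

print(a.mean(), mean_closed)    # Monte Carlo vs closed form
print(a.var(), var_closed)

# Norm-propagation-style normalization of the post-ReLU activation:
a_normalized = (a - mean_closed) / np.sqrt(var_closed)
```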
Layer Normalization (2016, arXiv) • Targets BN's drawbacks: – Cannot be used for online learning – Unstable for small mini-batch sizes – Hard to apply to RNNs • Normalizes each example separately, over its feature dimensions (Figure: BN normalizes over the batch dimension, LN over the feature dimensions)
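A minimal NumPy sketch contrasting the two normalization axes for activations of shape (batch, features); the learnable gain and bias of Layer Normalization are set to 1 and 0 here for brevity.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature across the mini-batch (axis 0)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    """Normalize each example across its features (axis 1); no batch statistics needed."""
    return (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

x = np.random.default_rng(0).normal(size=(4, 16))   # works even with a tiny batch
print(layer_norm(x).mean(axis=1))                   # ~0 per example
```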
Natural Neural Network (2015, NIPS) • How about decorrelating the activations? • Canonical model (MLP): h_i = f_i(W_i·h_{i−1} + b_i) • Natural neural network: reparameterize each layer with a whitening transform, h_i = f_i(V_i·U_{i−1}·(h_{i−1} − c_{i−1}) + d_i) • Model parameters: Ω = {V_1, d_1, …, V_L, d_L} • Whitening coefficients: Φ = {U_0, c_0, …, U_{L−1}, c_{L−1}}
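To make "decorrelate the activations" concrete, here is a hedged NumPy sketch of ZCA whitening of a batch of activations, the kind of transform the whitening coefficients (U, c) represent; it is a generic illustration, not the paper's update schedule itself.

```python
import numpy as np

def zca_whiten(h, eps=1e-5):
    """Center and decorrelate a batch of activations h of shape (n, d)."""
    c = h.mean(axis=0)                       # centering coefficient
    hc = h - c
    cov = hc.T @ hc / h.shape[0]             # d x d covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    U = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T   # ZCA whitening matrix
    return hc @ U, U, c

h = np.random.default_rng(0).normal(size=(1000, 8)) @ np.random.default_rng(1).normal(size=(8, 8))
h_white, U, c = zca_whiten(h)
print(np.allclose(np.cov(h_white, rowvar=False), np.eye(8), atol=1e-1))  # ≈ identity covariance
```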
Weight Normalization (2016, NIPS) • Targets BN's drawbacks: – Cannot be used for online learning – Unstable for small mini-batch sizes – Hard to apply to RNNs • Reparameterize the weights: w = (g / ‖v‖)·v • Decouples the length (g) and the direction (v/‖v‖) of each weight vector
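A small NumPy sketch of the reparameterization w = (g/‖v‖)·v for one layer's weight matrix, with one (g, v) pair per output unit; in practice g and v are the trainable parameters and gradients are taken with respect to them, which is omitted here.

```python
import numpy as np

def weight_norm(v, g):
    """v: (out, in) direction parameters; g: (out,) length parameters."""
    norm = np.linalg.norm(v, axis=1, keepdims=True)   # per-output-unit norm of v
    return g[:, None] * v / norm                      # w = (g / ||v||) * v

rng = np.random.default_rng(0)
v = rng.normal(size=(5, 10))
g = np.ones(5)
w = weight_norm(v, g)
print(np.linalg.norm(w, axis=1))   # each row of w has length g (here, all ones)
```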
Reference • Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, ICML 2015 • Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks, ICML 2016 • Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, NIPS 2016 • Layer Normalization, arXiv:1607.06450, 2016 • Recurrent Batch Normalization, ICLR 2017 • Batch Normalized Recurrent Neural Networks, ICASSP 2016 • Natural Neural Networks, NIPS 2015 • Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes, ICLR 2017 • Batch Renormalization, arXiv:1702.03275, 2017 • Mean-Normalized Stochastic Gradient for Large-Scale Deep Learning, ICASSP 2014 • Deep Learning Made Easier by Linear Transformations in Perceptrons, AISTATS 2012
Outline • Introduction to Deep Neural Networks (DNNs) • Training DNNs: Optimization • Batch Normalization • Other Normalization Techniques • Centered Weight Normalization
Centered Weight Normalization in Accelerating Training of Deep Neural Networks Lei Huang, Xianglong Liu, Yang Liu, Bo Lang, Dacheng Tao International Conference on Computer Vision (ICCV) 2017
Motivation • Stable distributions in the hidden layers • Initialization methods: – Random init (LeCun, 1998): zero mean, stable variance – Xavier init (Glorot & Bengio, 2010) – He init (He et al., 2015): W ~ N(0, 2/n), where n = out·H·W is the fan of the (convolutional) layer • Keep the desired characteristics during training
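A hedged NumPy sketch of the three initialization schemes for a fully-connected weight matrix of shape (fan_out, fan_in); the exact fan convention (fan-in vs. fan-out) varies between papers and implementations, so treat the choice below as illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128

# LeCun-style: zero mean, variance 1/fan_in
w_lecun = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))

# Xavier/Glorot: variance 2/(fan_in + fan_out)
w_xavier = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

# He: variance 2/n, designed for ReLU networks
w_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

print(w_lecun.std(), w_xavier.std(), w_he.std())
```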