  1. Normalization Techniques in Training of Deep Neural Networks. Lei Huang (黄雷), State Key Laboratory of Software Development Environment, Beihang University. Mail: huanglei@nlsde.buaa.edu.cn. August 17th, 2017

  2. Outline • Introduction to Deep Neural Networks (DNNs) • Training DNNs: Optimization • Batch Normalization • Other Normalization Techniques • Centered Weight Normalization

  3. Machine learning • Dataset D = {X, Y}: input X, output Y • Learning: Y = F(X) or P(Y|X) • Goals: fitting and generalization • Types (view of models): – Non-parametric model: Y = F(X; x_1, x_2, ..., x_n), parameterized by the training samples – Parametric model: Y = F(X; θ)

  4. Neural network • Neural network – Y = F(X) = f_T(f_{T-1}(... f_1(X))) – f_i(x) = g(Wx + b) • Nonlinear activation – sigmoid – ReLU
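A minimal NumPy sketch of this composition of layers; the layer sizes and the choice of ReLU for the hidden layer and sigmoid for the output are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def mlp_forward(x, weights, biases):
    """Compute Y = f_T(f_{T-1}(...f_1(X))) with f_i(h) = g(W_i h + b_i)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)          # hidden layers: ReLU activation
    W, b = weights[-1], biases[-1]
    return sigmoid(W @ h + b)        # output layer: sigmoid activation

# toy usage: 4-dim input, one hidden layer of 8 units, 1 output
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((8, 4)), rng.standard_normal((1, 8))]
bs = [np.zeros(8), np.zeros(1)]
y = mlp_forward(rng.standard_normal(4), Ws, bs)
```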

  5. Deep neural network • Why deep? – Powerful representation capacity

  6. Key properties of Deep learning • End-to-end learning – No distinction between feature extractor and classifier • “Deep” architectures: – Hierarchy of simpler non-linear modules

  7. Applications and techniques of DNNs • Successful applications in a range of domains – Speech – Computer Vision – Natural Language Processing • Main techniques in using deep neural networks – Design the architecture • Module selection and module connection • Loss function – Train the model based on optimization • Initialize the parameters • Search direction in parameter space • Learning rate schedule • Regularization techniques • …

  8. Outline • Introduction to Deep Neural Networks (DNNs) • Training DNNs: Optimization • Batch Normalization • Other Normalization Techniques • Centered Weight Normalization

  9. Training of Neural Networks • Multi-layer perceptron (example), MSE loss L = (ŷ − y)² • ① Forward, compute ŷ: a^(2) = W^(2)·x, h^(2) = σ(a^(2)); a^(3) = W^(3)·h^(2), ŷ = σ(a^(3)) • ② Backward, compute dL: dL/dŷ = 2(ŷ − y); dL/da^(3) = dL/dŷ · σ(a^(3))·(1 − σ(a^(3))); dL/dh^(2) = dL/da^(3) · W^(3); dL/da^(2) = dL/dh^(2) · σ(a^(2))·(1 − σ(a^(2))); dL/dx = dL/da^(2) · W^(2) • ③ Compute the weight gradients: dL/dW^(3) = dL/da^(3) · (h^(2))ᵀ; dL/dW^(2) = dL/da^(2) · xᵀ [Figure: input, hidden layer and output of the example MLP, with input x = (1, 0, 0)ᵀ]
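A compact NumPy sketch of these three steps for the two-layer sigmoid MLP with MSE loss; the hidden width, the absence of bias terms, and the target value are simplifying assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_backward(x, y, W2, W3):
    """One forward/backward pass of a 2-layer sigmoid MLP with MSE loss."""
    # 1) forward
    a2 = W2 @ x;  h2 = sigmoid(a2)
    a3 = W3 @ h2; y_hat = sigmoid(a3)
    L = np.sum((y_hat - y) ** 2)
    # 2) backward: propagate dL through activations and layers
    d_a3 = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)   # dL/da3
    d_h2 = W3.T @ d_a3                                  # dL/dh2
    d_a2 = d_h2 * h2 * (1.0 - h2)                       # dL/da2
    # 3) gradients w.r.t. the weights
    dW3 = np.outer(d_a3, h2)
    dW2 = np.outer(d_a2, x)
    return L, dW2, dW3

# toy usage with the 3-dim input from the slide
x = np.array([1.0, 0.0, 0.0]); y = np.array([1.0])
W2 = np.random.randn(4, 3) * 0.1; W3 = np.random.randn(1, 4) * 0.1
loss, dW2, dW3 = forward_backward(x, y, W2, W3)
```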

  10. Optimization in Deep Model • Goal: minimize the training loss, min_θ L(θ) = E_{(x,y)∈D}[ℓ(f(x; θ), y)] • Update iteratively: θ_{t+1} = θ_t − η_t · d_t, where d_t is the search direction • Challenges: – Non-convex, with local optima – Saddle points – Severe correlation between dimensions and a highly non-isotropic (ill-shaped) parameter space

  11. First order optimization • First-order stochastic gradient descent (SGD): – Direction: the (negative) gradient – The gradient is averaged over the sampled examples – Disadvantages • Over-aggressive steps on ridges • Too small steps on plateaus • Slow convergence • Non-robust performance [Figure: zig-zag iteration path of SGD]

  12. Advanced Optimization • Estimate curvature or scale – Quadratic optimization: Newton or quasi-Newton (inverse of the Hessian), Natural Gradient (inverse of the FIM) [Figure: iteration paths of SGD (red) and NGD (green)] – Estimate the scale: AdaGrad, RMSProp, Adam (see the sketch after this slide) • Normalize the input/activation – Intuition: the landscape of the cost L = ℓ(f(x; θ), y) w.r.t. the parameters is controlled by the input/activations – Method: stabilize the distribution of the input/activations • Normalize the input explicitly • Normalize the input implicitly (constrain the weights)
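A minimal sketch of the per-dimension scaling that AdaGrad, RMSProp and Adam share, shown here as an AdaGrad-style update; the learning rate, epsilon and the toy quadratic loss are illustrative assumptions:

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad-style update: scale each dimension by its accumulated gradient magnitude."""
    accum += grad ** 2                              # running sum of squared gradients
    theta -= lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

# toy usage on an ill-scaled quadratic loss L = 0.5 * theta^T A theta
A = np.diag([100.0, 1.0])
theta = np.array([1.0, 1.0]); accum = np.zeros(2)
for _ in range(100):
    grad = A @ theta
    theta, accum = adagrad_step(theta, grad, accum)
```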

  13. Some intuitions of normalization for optimization • How does normalizing the activations affect the optimization? – Linear model: ŷ = w_1·x_1 + w_2·x_2 + b, loss L = (ŷ − y)² – If the inputs have very different ranges, e.g. 0 < x_1 < 2 and 0 < x_2 < 0.5, the loss surface L(w_1, w_2) is elongated and ill-conditioned – Rescaling to a common range, x_1' = x_1/2 < 1 and x_2' = 2·x_2 < 1, makes L(w_1, w_2) closer to isotropic and easier to optimize [Figure: contours of L(w_1, w_2) before and after rescaling]
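A small NumPy check of this intuition, with the synthetic input ranges taken from the slide as assumptions: for the squared loss of a linear model the Hessian w.r.t. the weights is 2·E[x xᵀ], so rescaling the inputs directly changes its condition number.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000
x1 = rng.uniform(0.0, 2.0, n)     # 0 < x1 < 2
x2 = rng.uniform(0.0, 0.5, n)     # 0 < x2 < 0.5
X = np.stack([x1, x2], axis=1)

def condition_number(X):
    H = 2.0 * (X.T @ X) / len(X)   # Hessian of the average squared loss w.r.t. (w1, w2)
    return np.linalg.cond(H)

print(condition_number(X))                     # elongated loss surface
X_rescaled = X * np.array([0.5, 2.0])          # x1' = x1/2, x2' = 2*x2
print(condition_number(X_rescaled))            # noticeably better conditioned
```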

  14. Outline • Introduction to Deep Neural Networks (DNNs) • Training DNNs: Optimization • Batch Normalization • Other Normalization Techniques • Centered Weight Normalization

  15. Batch Normalization--motivation • Solving internal covariate shift: in a stack h_1 = w_1·x, h_2 = w_2·h_1, the distribution of each layer's input changes whenever the earlier weights change • Whitening the input benefits optimization (LeCun et al., 1998, Efficient BackProp) – Centering – Decorrelating – Stretching to unit variance [Figure: effect of centering, decorrelating and stretching on the loss surface of y = Wx with MSE loss]

  16. Batch Normalization--method • Only standardize the input: full decorrelation (whitening) is expensive – Centering – Stretching to unit variance [Figure: centering and stretching of the input distribution] • How to do it? – x̂ = (x − E[x]) / std(x), with E[x] and std(x) estimated over the mini-batch

  17. Batch Normalization--training • Forward (per mini-batch B = {x_1, ..., x_m}): μ_B = (1/m) Σ_i x_i; σ_B² = (1/m) Σ_i (x_i − μ_B)²; x̂_i = (x_i − μ_B) / sqrt(σ_B² + ε); y_i = γ·x̂_i + β
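A minimal NumPy sketch of this training-time forward pass; the value of ε and the 2-D (N, D) layout are assumptions:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time BN forward: standardize over the mini-batch, then scale and shift.

    x: (N, D) mini-batch; gamma, beta: (D,) learnable scale and shift.
    """
    mu = x.mean(axis=0)                    # per-dimension mini-batch mean
    var = x.var(axis=0)                    # per-dimension mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardized activations
    out = gamma * x_hat + beta
    cache = (x_hat, var, gamma, eps)       # saved for the backward pass
    return out, cache
```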

  18. Batch Normalization--training • Backward
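A matching NumPy sketch of the backward pass, obtained by differentiating the standardization above (it uses the cache layout of the forward sketch and follows the usual derivation rather than a formula printed on the slide):

```python
import numpy as np

def batchnorm_backward(dout, cache):
    """Backward pass of BN: gradients w.r.t. the input, gamma and beta."""
    x_hat, var, gamma, eps = cache
    N = dout.shape[0]
    dgamma = np.sum(dout * x_hat, axis=0)
    dbeta = np.sum(dout, axis=0)
    dx_hat = dout * gamma
    # gradient through the standardization (the mean and variance both depend on x)
    inv_std = 1.0 / np.sqrt(var + eps)
    dx = (inv_std / N) * (N * dx_hat
                          - np.sum(dx_hat, axis=0)
                          - x_hat * np.sum(dx_hat * x_hat, axis=0))
    return dx, dgamma, dbeta
```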

  19. Batch Normalization--Inference • Inference (in the paper): use population statistics E[x] and Var[x] estimated over the training set • Inference (in practice): keep running averages during training – E[x] ← α·μ_B + (1 − α)·E[x] – Var[x] ← α·σ_B² + (1 − α)·Var[x]
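A sketch of the running-average update and the inference path that pairs with the forward sketch above; the values of α and ε are assumptions:

```python
import numpy as np

def update_running_stats(mu_b, var_b, running_mu, running_var, alpha=0.1):
    """Running-average update from the slide: new = alpha * batch + (1 - alpha) * old."""
    running_mu = alpha * mu_b + (1.0 - alpha) * running_mu
    running_var = alpha * var_b + (1.0 - alpha) * running_var
    return running_mu, running_var

def batchnorm_inference(x, gamma, beta, running_mu, running_var, eps=1e-5):
    """Inference-time BN: use the fixed running statistics, not the batch statistics."""
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```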

  20. Batch Normalization — how to use • Convolution layer: one mean/variance per channel, computed over the mini-batch and the spatial dimensions • Wrapped as a module – Before or after the nonlinearity? (see the sketch after this slide) • For shallow models (fewer than ~11 layers), after the nonlinearity • For deep models, before the nonlinearity – Advantages of placing it before the nonlinearity • For ReLU, roughly half of the units are activated • For sigmoid, it avoids the saturated region – Advantage of placing it after the nonlinearity • Matches the intuition of whitening the layer input
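A sketch of the "before the nonlinearity" placement for a convolutional layer, assuming PyTorch is available; the channel counts and kernel size are illustrative:

```python
import torch.nn as nn

# Conv -> BN -> ReLU: BN is applied to the pre-activation, one (gamma, beta) per channel
conv_bn_relu = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),  # bias is redundant with BN's beta
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)
```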

  21. Batch Normalization — how to use • Example: the residual block (He et al., CVPR 2016) and the pre-activation residual block (He et al., ECCV 2016) [Figure: structure of the two blocks]
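A rough PyTorch illustration of the pre-activation ordering (BN → ReLU → conv) from the ECCV 2016 block; the channel count is an assumption, and downsampling and projection shortcuts are omitted:

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block: BN and ReLU come before each convolution."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + x     # identity shortcut

block = PreActBlock(64)
y = block(torch.randn(2, 64, 32, 32))
```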

  22. Batch Normalization — characteristics • For accelerating training: – Weight-scale invariant: not sensitive to weight initialization – Adjustable learning rate – Allows a large learning rate • Better conditioning (LeCun et al., 1998) • For generalization – Stochastic: works like Dropout

  23. Batch Normalization • A routine component in deep feed-forward neural networks, especially CNNs • Weaknesses – Cannot be used for online learning – Unstable for small mini-batch sizes – Must be used in RNNs with caution

  24. Batch Normalization – for RNN • Extra problems that need to be considered: – Where should the BN module be placed? – How to handle sequence data • 2016, ICASSP, Batch Normalized Recurrent Neural Networks – Where to put the BN module – Sequence data: frame-wise normalization vs. sequence-wise normalization
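A hedged sketch of the two ways of computing statistics for sequence data as I read them: frame-wise keeps separate statistics per time step, while sequence-wise pools all frames of the mini-batch; the (batch, time, features) tensor layout is an assumption:

```python
import numpy as np

def framewise_stats(x):
    """Separate mean/variance per time step: x has shape (N, T, D)."""
    mu = x.mean(axis=0)            # (T, D): one estimate per frame position
    var = x.var(axis=0)
    return mu, var

def sequencewise_stats(x):
    """One mean/variance pooled over batch and time."""
    mu = x.mean(axis=(0, 1))       # (D,)
    var = x.var(axis=(0, 1))
    return mu, var
```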

  25. Batch Normalization for RNN • 2017, ICLR, Recurrent Batch Normalization – Where to put the BN module – Sequence data: frame-wise normalization up to a maximum length T_max (later frames reuse the statistics of frame T_max) • Which scheme works best depends on the task and data

  26. Outline • Introduction to Deep Neural Networks (DNNs) • Training DNNs: Optimization • Batch Normalization • Other Normalization Techniques • Centered Weight Normalization

  27. Norm-propagation (2016, ICML) • Targets BN's drawbacks: – Cannot be used for online learning – Unstable for small mini-batch sizes • Data-independent parametric estimate of the mean and variance – Normalize the input: zero mean and unit variance – Assume W is orthogonal – Derive how the nonlinearity transforms the distribution • ReLU: closed-form mean/variance correction (see the sketch below)
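The ReLU formula on the slide is in the figure; as a hedged reconstruction of the Norm-Prop idea, for a pre-activation z ~ N(0, 1) the post-ReLU mean and variance have closed forms, which the layer then subtracts and divides out:

```python
import numpy as np

# For z ~ N(0, 1): E[ReLU(z)] = 1/sqrt(2*pi), Var[ReLU(z)] = 1/2 - 1/(2*pi)
RELU_MEAN = 1.0 / np.sqrt(2.0 * np.pi)
RELU_STD = np.sqrt(0.5 - 1.0 / (2.0 * np.pi))

def normprop_relu(z):
    """Apply ReLU, then re-standardize with the closed-form statistics
    (valid under the assumption that z is approximately N(0, 1))."""
    return (np.maximum(z, 0.0) - RELU_MEAN) / RELU_STD
```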

  28. Layer Normalization (2016, Arxiv) • Targets BN's drawbacks: – Cannot be used for online learning – Unstable for small mini-batch sizes – RNNs • Normalize each example over its dimensions, instead of each dimension over the mini-batch [Figure: normalization axes of BN vs. LN]
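A minimal NumPy sketch of the per-example normalization; ε and the learnable gain/bias follow the usual formulation, and the exact symbols are assumptions:

```python
import numpy as np

def layernorm(x, gain, bias, eps=1e-5):
    """Layer normalization: statistics are computed per example, over the feature axis.

    x: (N, D); gain, bias: (D,). Works for a single example (N = 1), so it is
    independent of the mini-batch size.
    """
    mu = x.mean(axis=1, keepdims=True)     # (N, 1): one mean per example
    var = x.var(axis=1, keepdims=True)     # (N, 1): one variance per example
    return gain * (x - mu) / np.sqrt(var + eps) + bias
```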

  29. Natural Neural Network (2015, NIPS) • What if we also decorrelate the activations? • Canonical model (MLP): h_i = f_i(W_i h_{i−1} + b_i) • Natural neural network: h_i = f_i(V_i U_{i−1}(h_{i−1} − c_{i−1}) + d_i), where U_{i−1} and c_{i−1} whiten the previous layer's activations • Model parameters: Ω = {V_1, d_1, ..., V_M, d_M} • Whitening coefficients: Φ = {U_0, c_0, ..., U_{M−1}, c_{M−1}}
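A sketch of the activation whitening this builds on: estimating a centering vector c and a ZCA-style whitening matrix U from a batch of activations; the regularization constant and the toy data are assumptions:

```python
import numpy as np

def estimate_whitening(h, eps=1e-5):
    """Estimate c and U such that U @ (h - c) has roughly zero mean and identity covariance."""
    c = h.mean(axis=0)                          # (D,) centering vector
    cov = np.cov(h - c, rowvar=False)           # (D, D) covariance of the activations
    eigvals, eigvecs = np.linalg.eigh(cov)
    U = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T   # ZCA whitening
    return U, c

# usage: whiten a batch of hidden activations h of shape (N, D)
h = np.random.randn(256, 32) @ np.random.randn(32, 32)
U, c = estimate_whitening(h)
h_white = (h - c) @ U.T
```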

  30. Weight Normalization (2016, NIPS) • Targets BN's drawbacks: – Cannot be used for online learning – Unstable for small mini-batch sizes – RNNs • Re-parameterize the weights: w = g · v / ||v|| • Decouples the direction and the length of each weight vector
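A minimal sketch of this reparameterization for one fully connected layer; gradients are left to an autodiff framework, and the names are illustrative:

```python
import numpy as np

def weightnorm_linear(x, v, g, b):
    """Weight-normalized linear layer: w_i = g_i * v_i / ||v_i||, y = W x + b.

    v: (D_out, D_in) unconstrained direction parameters; g: (D_out,) lengths.
    """
    norms = np.linalg.norm(v, axis=1, keepdims=True)   # (D_out, 1)
    W = (g[:, None] / norms) * v                       # each row of W has length g_i
    return W @ x + b
```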

  31. References • Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, ICML 2015 • Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks, ICML 2016 • Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, NIPS 2016 • Layer Normalization, arXiv:1607.06450, 2016 • Recurrent Batch Normalization, ICLR 2017 • Batch Normalized Recurrent Neural Networks, ICASSP 2016 • Natural Neural Networks, NIPS 2015 • Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes, ICLR 2017 • Batch Renormalization, arXiv:1702.03275, 2017 • Mean-Normalized Stochastic Gradient for Large-Scale Deep Learning, ICASSP 2014 • Deep Learning Made Easier by Linear Transformations in Perceptrons, AISTATS 2012

  32. Outline • Introduction to Deep Neural Networks (DNNs) • Training DNNs: Optimization • Batch Normalization • Other Normalization Techniques • Centered Weight Normalization

  33. Centered Weight Normalization in Accelerating Training of Deep Neural Networks Lei Huang, Xianglong Liu, Yang Liu, Bo Lang, Dacheng Tao International Conference on Computer Vision (ICCV) 2017

  34. Motivation • Stable distributions in the hidden layers • Initialization methods – Random init (LeCun, 1998): zero mean, stable variance – Xavier init (Glorot & Bengio, 2010) – He init (He et al., 2015): W ~ N(0, 2/n), with n the number of incoming connections of the layer • Goal: keep these desired characteristics during training, not only at initialization (see the sketch below)
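As I understand the paper, Centered Weight Normalization centers each weight vector before normalizing it; a hedged NumPy sketch of this reparameterization for a fully connected layer, with an optional learnable scale g as an assumption:

```python
import numpy as np

def centered_weightnorm_linear(x, v, g, b):
    """Centered weight normalization: w_i = g_i * (v_i - mean(v_i)) / ||v_i - mean(v_i)||.

    Centering gives each weight vector zero mean; normalizing gives it unit length,
    so the properties targeted by the init schemes above are kept during training.
    """
    v_centered = v - v.mean(axis=1, keepdims=True)               # zero-mean rows
    norms = np.linalg.norm(v_centered, axis=1, keepdims=True)    # row lengths
    W = (g[:, None] / norms) * v_centered
    return W @ x + b
```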
