Understanding Normalization in Deep Learning
Speaker: Wenqi Shao
Email: Weqish@link.cuhk.edu.hk
Outline
➢ Introduction
➢ Various Normalizers: IN, BN, LN, GN, SN, SSN
➢ A Unified Representation: Meta Norm (MN)
   Back-propagation & Geometric Interpretation
➢ Why Batch Normalization? Optimization & Generalization
➢ Normalization in Various Computer Vision Tasks
Introduction
⚫ Normalization is a well-known technique in deep learning.
⚫ The first normalization method, Batch Normalization (BN), achieves the same accuracy with 14 times fewer training steps.
⚫ Normalization improves both the optimization and the generalization of a DNN.
⚫ Various normalizers for different tasks and network architectures:
— Batch Normalization (BN): image classification [1]
— Instance Normalization (IN): image style transfer [2]
— Layer Normalization (LN): recurrent neural networks (RNNs) [3]
— Group Normalization (GN): robust to batch size; image classification, object detection [4]
Normalization methods have become a foundation of various state-of-the-art computer vision tasks.
Introduction
⚫ Object of a normalization method: a 4-D tensor h = {h_ncij} ∈ ℝ^{N×C×H×W}
   N: minibatch size (the number of samples)
   C: number of channels
   H: height of a channel
   W: width of a channel
⚫ A very common building block: Conv + Norm + ReLU
   [Figure: Input → Convolution → Normalization → ReLU → Convolution → Normalization → ... → Output]
⚫ Normalizers work by standardizing the activations within a specific scope.
⚫ Two statistics: mean μ and variance σ²
⚫ Two learnable parameters: scale parameter γ and shift parameter β
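To make the standardize-then-affine step above concrete, here is a minimal NumPy sketch (the function name and signature are illustrative, not taken from the slides):

```python
import numpy as np

def normalize(h, mean, var, gamma, beta, eps=1e-5):
    """Standardize activations with the given statistics, then apply the
    learnable scale (gamma) and shift (beta)."""
    h_hat = (h - mean) / np.sqrt(var + eps)   # standardization within the chosen scope
    return gamma * h_hat + beta               # affine transformation with learnable parameters
```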
Various Normalizers: IN, BN, LN and GN
Calculating the mean μ and variance σ² over different scopes produces different normalizers. Given a feature map h_{ncij} ∈ ℝ^{N×C×H×W}:

IN:
$$\mu_{IN} = \frac{1}{HW}\sum_{i,j=1}^{H,W} h_{ncij}, \qquad \sigma^2_{IN} = \frac{1}{HW}\sum_{i,j=1}^{H,W}\left(h_{ncij} - \mu_{IN}\right)^2$$

BN:
$$\mu_{BN} = \frac{1}{NHW}\sum_{n,i,j=1}^{N,H,W} h_{ncij}, \qquad \sigma^2_{BN} = \frac{1}{NHW}\sum_{n,i,j=1}^{N,H,W}\left(h_{ncij} - \mu_{BN}\right)^2$$

LN:
$$\mu_{LN} = \frac{1}{CHW}\sum_{c,i,j=1}^{C,H,W} h_{ncij}, \qquad \sigma^2_{LN} = \frac{1}{CHW}\sum_{c,i,j=1}^{C,H,W}\left(h_{ncij} - \mu_{LN}\right)^2$$

GN: divides the channels into groups and computes the mean and variance within each group for normalization.
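The four scopes above differ only in which axes the statistics are averaged over. A minimal NumPy sketch, with illustrative shapes, assuming the (N, C, H, W) layout defined earlier:

```python
import numpy as np

# h has shape (N, C, H, W): minibatch size, channels, height, width.
h = np.random.randn(8, 32, 16, 16)

# IN: one (mean, variance) per sample and per channel -> shape (8, 32, 1, 1)
mu_in  = h.mean(axis=(2, 3), keepdims=True)
var_in = h.var(axis=(2, 3), keepdims=True)

# BN: one (mean, variance) per channel, shared across the batch -> shape (1, 32, 1, 1)
mu_bn  = h.mean(axis=(0, 2, 3), keepdims=True)
var_bn = h.var(axis=(0, 2, 3), keepdims=True)

# LN: one (mean, variance) per sample, shared across channels -> shape (8, 1, 1, 1)
mu_ln  = h.mean(axis=(1, 2, 3), keepdims=True)
var_ln = h.var(axis=(1, 2, 3), keepdims=True)

# GN: split the C channels into G groups and compute statistics within each group.
G = 4
h_g = h.reshape(8, G, 32 // G, 16, 16)
mu_gn  = h_g.mean(axis=(2, 3, 4), keepdims=True)   # shape (8, 4, 1, 1, 1)
var_gn = h_g.var(axis=(2, 3, 4), keepdims=True)
```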
Various Normalizers: SN and SSN
The above-mentioned normalization methods use the same normalizer in every normalization layer. Switchable Normalization (SN) is able to learn a different normalizer for each normalization layer [5]:

$$\mu_{SN} = p_1\mu_{IN} + p_2\mu_{BN} + p_3\mu_{LN}, \qquad \sigma^2_{SN} = p_1\sigma^2_{IN} + p_2\sigma^2_{BN} + p_3\sigma^2_{LN}$$

where $(p_1, p_2, p_3) = \mathrm{softmax}(\lambda_1, \lambda_2, \lambda_3)$ and $\lambda_1, \lambda_2, \lambda_3$ are learnable parameters. The $\lambda_1, \lambda_2, \lambda_3$ learned by SGD can differ from layer to layer.
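A minimal NumPy sketch of the SN mixture above, following the slide's formula with a single set of softmax weights shared by the mean and the variance (function and argument names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sn_statistics(h, lam):
    """Switchable Normalization statistics: a softmax-weighted mixture of the IN,
    BN and LN statistics, with lam = (lambda_1, lambda_2, lambda_3) learnable."""
    p = softmax(np.asarray(lam))
    mu_in,  var_in  = h.mean(axis=(2, 3), keepdims=True),    h.var(axis=(2, 3), keepdims=True)
    mu_bn,  var_bn  = h.mean(axis=(0, 2, 3), keepdims=True), h.var(axis=(0, 2, 3), keepdims=True)
    mu_ln,  var_ln  = h.mean(axis=(1, 2, 3), keepdims=True), h.var(axis=(1, 2, 3), keepdims=True)
    mu  = p[0] * mu_in  + p[1] * mu_bn  + p[2] * mu_ln
    var = p[0] * var_in + p[1] * var_bn + p[2] * var_ln
    return mu, var
```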
Various Normalizers: SN and SSN
However, SN suffers from overfitting and redundant computation:
— overfitting: λ₁, λ₂, λ₃ are optimized without any constraint;
— redundant computation: all of the IN, BN and LN statistics must be computed at the inference stage.
Sparse Switchable Normalization (SSN) learns only one normalizer for each normalization layer [6]. Statistics in SSN:

$$\mu_{SN} = p_1\mu_{IN} + p_2\mu_{BN} + p_3\mu_{LN}, \qquad \sigma^2_{SN} = p_1\sigma^2_{IN} + p_2\sigma^2_{BN} + p_3\sigma^2_{LN}$$

such that $p_1 + p_2 + p_3 = 1$ and $p_i \in \{0, 1\}$.
SSN is achieved by a novel transformation, SparsestMax, which substitutes for the softmax in SN.
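Because the converged SSN weights are one-hot, inference only needs the statistics of the selected normalizer. A minimal sketch of that end state, not of the SparsestMax transformation itself (function name and layout are illustrative):

```python
import numpy as np

def ssn_inference_statistics(h, p):
    """p is the converged SSN weight vector over (IN, BN, LN): one-hot by construction,
    so only the statistics of the selected normalizer are ever computed."""
    assert np.isclose(p.sum(), 1.0) and set(np.unique(p)) <= {0.0, 1.0}
    axes = [(2, 3), (0, 2, 3), (1, 2, 3)][int(np.argmax(p))]   # IN, BN or LN scope
    return h.mean(axis=axes, keepdims=True), h.var(axis=axes, keepdims=True)
```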
A Unified Representation: Meta Normalization [7]
Question: is there a universal normalization that can include IN, BN, LN, etc.?
To answer this question, consider the relation between μ_IN and μ_BN, μ_LN. Arrange the IN means as a matrix μ_IN ∈ ℝ^{N×C} with entries μ_nc:

$$\mu_{IN} = \begin{pmatrix}\mu_{11} & \cdots & \mu_{1C}\\ \vdots & \ddots & \vdots\\ \mu_{N1} & \cdots & \mu_{NC}\end{pmatrix}$$

— μ_BN ∈ ℝ^C: take the average of μ_IN over each column (the batch dimension) and duplicate it over the N rows, i.e. $\frac{1}{N}\mathbf{1}\mathbf{1}^{\mathsf T}\mu_{IN}$.
— μ_LN ∈ ℝ^N: take the average of μ_IN over each row (the channel dimension) and duplicate it over the C columns, i.e. $\frac{1}{C}\mu_{IN}\mathbf{1}\mathbf{1}^{\mathsf T}$.
A Unified Representation: Meta Normalization
MN. We can design a universal normalization by constructing binary matrices U and V as follows:

$$\mu_{MN} = \frac{1}{a_U a_V}\, U\,\mu_{IN}\,V, \qquad \sigma^2_{MN} = \frac{1}{a_U a_V}\, U\,\sigma^2_{IN}\,V$$

where a_U and a_V are normalizing factors, and U ∈ ℝ^{N×N} and V ∈ ℝ^{C×C} are binary matrices whose elements are either 0 or 1.

Representation capacity. In MN, V aggregates the statistics across the channels, while U aggregates them across the samples in a batch. Therefore, different choices of U and V represent different normalization approaches.
◆ Let U = I and V = I; then MN represents IN.
◆ Let U = 𝟏𝟏ᵀ (with a_U = N) and V = I; then MN turns into BN.
◆ Let U = I and V = 𝟏𝟏ᵀ (with a_V = C); then MN represents LN.
◆ Let U = I and V = the block-diagonal matrix $\begin{pmatrix}\mathbf{1}\mathbf{1}^{\mathsf T} & \mathbf{0}\\ \mathbf{0} & \mathbf{1}\mathbf{1}^{\mathsf T}\end{pmatrix}$ (with a_V = C/2); then MN represents GN with a group number of 2.
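A small NumPy check of the aggregation above (names and shapes are illustrative): it verifies that the identity and all-ones choices of U and V recover the IN, BN and LN means.

```python
import numpy as np

N, C, H, W = 4, 6, 8, 8
h = np.random.randn(N, C, H, W)
mu_in = h.mean(axis=(2, 3))             # N x C matrix of IN means

def mn_mean(mu_in, U, V):
    """Meta-Norm mean: aggregate the IN statistics with binary matrices
    U (over the batch) and V (over the channels)."""
    a_U = U.sum(axis=1, keepdims=True)  # normalizing factors
    a_V = V.sum(axis=0, keepdims=True)
    return (U @ mu_in @ V) / (a_U * a_V)

I_N, I_C = np.eye(N), np.eye(C)
ones_N, ones_C = np.ones((N, N)), np.ones((C, C))

assert np.allclose(mn_mean(mu_in, I_N, I_C), mu_in)                                 # IN
assert np.allclose(mn_mean(mu_in, ones_N, I_C), h.mean(axis=(0, 2, 3)))             # BN
assert np.allclose(mn_mean(mu_in, I_N, ones_C), h.mean(axis=(1, 2, 3))[:, None])    # LN
```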
Back-propagation of MN
MN. Let h̃_{ncij} denote the neuron after normalization, which is then transformed to h̄_{ncij} by the scale and shift parameters.
Back-propagation. What we care about most is back-propagating the gradient of the output, ∂L/∂h̄_{ncij}, to the gradient of the input, ∂L/∂h_{ncij}.
Back-propagation of MN
Back-propagation. Define
$$\frac{\partial L}{\partial \tilde h_{ncij}} \triangleq \tilde d_{ncij}, \qquad \frac{\partial L}{\partial h_{ncij}} \triangleq d_{ncij} \qquad (*)$$
For MN, the gradient with respect to the input takes the form of a projection (written for the vector of neurons sharing the same statistics, with scope size a_U a_V HW):
$$d_{n(c)} = \frac{1}{\sigma_{MN}}\left(I - \frac{\mathbf{1}\mathbf{1}^{\mathsf T} + \tilde h_{n(c)}\tilde h_{n(c)}^{\mathsf T}}{a_U a_V HW}\right)\tilde d_{n(c)}$$
Geometric view of BN: let U = 𝟏𝟏ᵀ and V = I (scope size NHW).
Geometric view of LN: let U = I and V = 𝟏𝟏ᵀ (scope size CHW).
Geometric view of GN with group number G: let U = I and V be the block-diagonal matrix with G all-ones diagonal sub-matrices (scope size CHW/G).
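A numerical sanity check of the projection form above for the BN case (one channel, scope size m = NHW, no ε, unit γ; all names are illustrative): the finite-difference gradient of L = ⟨d̃, h̃(h)⟩ matches (1/σ)(I − P)d̃.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 64                                    # scope size, e.g. N*H*W for one BN channel
x = rng.normal(size=m)                    # pre-normalization activations h
g = rng.normal(size=m)                    # upstream gradient d_tilde = dL/dh_tilde

def standardize(x):
    return (x - x.mean()) / x.std()

# Projection form: d = (1/sigma) * (I - (1 1^T + h_tilde h_tilde^T) / m) @ d_tilde
x_tilde = standardize(x)
P = (np.ones((m, m)) + np.outer(x_tilde, x_tilde)) / m
d_analytic = (np.eye(m) - P) @ g / x.std()

# Finite-difference gradient of L(x) = <g, standardize(x)> for comparison.
eps = 1e-6
d_numeric = np.array([
    (g @ standardize(x + eps * np.eye(m)[i]) - g @ standardize(x - eps * np.eye(m)[i])) / (2 * eps)
    for i in range(m)
])
assert np.allclose(d_analytic, d_numeric, atol=1e-5)
```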
Geometric Interpretation
Projection matrix. Given a matrix A, the projection matrix is $P = A(A^{\mathsf T}A)^{-1}A^{\mathsf T}$. If the columns of A form a basis for some subspace W, then (I − P) is the projection matrix for the orthogonal complement of W. Given a vector y, Py lies in the subspace W and (I − P)y lies in the orthogonal complement of W.
[Figure: y decomposed into Py and (I − P)y.]
Take BN as an example. Let $A = [\mathbf{1}, \tilde h_c]$; then
$$A^{\mathsf T}A = \begin{pmatrix}\mathbf{1}^{\mathsf T}\mathbf{1} & 0\\ 0 & \tilde h_c^{\mathsf T}\tilde h_c\end{pmatrix} = NHW \cdot I$$
Therefore, the projection matrix corresponding to A is exactly
$$P = \frac{1}{NHW}\left(\mathbf{1}\mathbf{1}^{\mathsf T} + \tilde h_c\tilde h_c^{\mathsf T}\right)$$
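A small NumPy check of this construction (illustrative, for a single BN channel): with A = [𝟏, h̃], AᵀA = m·I, so P reduces to (𝟏𝟏ᵀ + h̃h̃ᵀ)/m and behaves as a projection.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 32
x = rng.normal(size=m)
x_tilde = (x - x.mean()) / x.std()        # standardized: 1^T x_tilde = 0, x_tilde^T x_tilde = m

A = np.column_stack([np.ones(m), x_tilde])            # columns span the subspace W
assert np.allclose(A.T @ A, m * np.eye(2))            # orthogonal columns: A^T A = m I

P = A @ np.linalg.inv(A.T @ A) @ A.T                  # P = A (A^T A)^{-1} A^T
assert np.allclose(P, (np.ones((m, m)) + np.outer(x_tilde, x_tilde)) / m)

y = rng.normal(size=m)
assert np.allclose(P @ P, P)                          # idempotent, as a projection should be
assert np.allclose(P @ y + (np.eye(m) - P) @ y, y)    # y = Py + (I - P)y
```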
Why Batch Normalization?
BN has been an indispensable component in various network architectures. The effectiveness of BN has been uncovered from two aspects: optimization and generalization.
A more fundamental impact of BatchNorm on the training process is that it makes the optimization landscape significantly smoother [8].
[Figure from [8], measured along the gradient direction at each training step: the variation (shaded region) in loss as we move in the gradient direction; the ℓ2 changes in the gradient; and the maximum difference (ℓ2 norm) in gradient over the distance moved in that direction.]
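A hedged sketch of this kind of measurement (a simplification of the protocol in [8]; `loss_and_grad` and all names are assumptions, not the exact procedure of the paper):

```python
import numpy as np

def landscape_probe(loss_and_grad, w, step_sizes):
    """Probe the loss landscape along the gradient direction, in the spirit of [8]:
    record the loss range and the l2 change of the gradient at several step sizes.
    loss_and_grad(w) -> (loss, grad) is assumed to be supplied by the user."""
    loss0, g0 = loss_and_grad(w)
    direction = g0 / (np.linalg.norm(g0) + 1e-12)
    losses, grad_changes = [loss0], []
    for eta in step_sizes:
        loss_eta, g_eta = loss_and_grad(w - eta * direction)
        losses.append(loss_eta)                           # variation in loss along the direction
        grad_changes.append(np.linalg.norm(g_eta - g0))   # "effective" beta-smoothness proxy
    return min(losses), max(losses), grad_changes

# Toy usage with a quadratic loss (illustrative only).
A = np.diag([1.0, 10.0])
quadratic = lambda w: (0.5 * w @ A @ w, A @ w)
print(landscape_probe(quadratic, np.array([1.0, 1.0]), step_sizes=[0.05, 0.1, 0.2]))
```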
Lipschitzness of the Loss
BN causes the landscape to be more well-behaved, inducing favorable Lipschitz-continuity properties.
Let us first consider the optimization landscape with respect to the activations.
[Annotated bound from [8]: the squared gradient of the loss with respect to the activations in the BN network is the corresponding gradient magnitude (which empirically grows quadratically in the dimension) scaled by γ²/σ² and reduced by terms that are bounded away from zero and less than 1; the bound captures the Lipschitzness of the loss.]
Lipschitzness of the Loss
Let us now turn to the optimization landscape with respect to the weights.
Regularization in BN
Batch normalization implicitly discourages reliance on any single channel, suggesting an alternative regularization mechanism by which batch normalization may encourage good generalization performance. BN makes the channels more equal so that they play a homogeneous role in representing the prediction function.
How can this conclusion be verified empirically? [9] Measure the networks' robustness to cumulative ablation of channels: networks trained with batch normalization are more robust to these ablations than those trained without batch normalization.
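A hedged sketch of the cumulative channel-ablation protocol described above; the trained model and accuracy evaluation are assumed to exist and are not shown, so the snippet only illustrates the masking step on random features:

```python
import numpy as np

def ablate_channels(features, k, rng):
    """Zero out k randomly chosen channels of a feature map of shape (N, C, H, W).
    Cumulative ablation means repeating this with increasing k on the same layer."""
    out = features.copy()
    idx = rng.choice(features.shape[1], size=k, replace=False)
    out[:, idx] = 0.0
    return out

# Illustration on random features; in the real protocol the masked features are fed
# through the rest of the (trained) network and test accuracy is recorded for each k.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 64, 8, 8))
for k in (0, 8, 16, 32):
    masked = ablate_channels(feats, k, rng)
    print(k, "channels ablated -> fraction of zeroed activations:", (masked == 0).mean())
```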
Regularization in BN
We explore an explicit regularization expression in BN by analyzing a building block of a deep network.
BN also induces Gaussian priors for the batch mean μ_B and the batch standard deviation σ_B [10]. These priors tell us that μ_B and σ_B behave like Gaussian noise. Taking the expectation over this noise yields an explicit regularization expression for BN [11]:
◆ the regularization strength ζ is inversely proportional to the batch size M;
◆ μ_B and σ_B produce two different regularization strengths;
◆ μ_B penalizes the expectation of the activation, implying that a neuron with larger output is exposed to larger regularization.
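A small simulation of why the strength weakens with batch size: the batch mean and standard deviation fluctuate around the population values with variance roughly proportional to 1/M (illustrative, assuming i.i.d. Gaussian activations; not the derivation of [11]):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=0.0, scale=1.0, size=1_000_000)   # stand-in for a neuron's activations

for M in (8, 32, 128, 512):
    # Draw many minibatches and measure how much the batch statistics fluctuate.
    batches = rng.choice(population, size=(2000, M))
    mu_B, sigma_B = batches.mean(axis=1), batches.std(axis=1)
    print(f"M={M:4d}  var(mu_B)={mu_B.var():.5f}  (1/M={1/M:.5f})  var(sigma_B)={sigma_B.var():.5f}")
```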