Norm matters: efficient and accurate normalization schemes in deep networks (Spotlight, NeurIPS 2018)


  1. Norm matters: efficient and accurate normalization schemes in deep networks
     Elad Hoffer*, Ron Banner*, Itay Golan*, Daniel Soudry (*equal contribution)
     Spotlight, NeurIPS 2018 – Poster #27

  2. Batch normalization – shortcomings:
     • Assumes independence between samples (a problem when modeling time series, RL, GANs, metric learning, etc.)
     • Why does it work? Through interaction with other regularization.
     • Significant computational and memory impact, with data-bound operations – up to 25% of computation time in current models (Gitman, 2017).
     • Requires high-precision operations ($\sum_j x_j^2$); numerically unstable in low precision.

  3. Batch-norm leads to norm invariance
     The key observation:
     • Given an input $x$ and a weight vector $w$, its direction is $\hat{w} = w / \|w\|$.
     • Batch-norm is norm invariant: $\mathrm{BN}(w^\top x) = \mathrm{BN}(\hat{w}^\top x)$.
     • The weight norm only affects the effective learning rate; e.g. in SGD the step taken on the direction $\hat{w}$ scales as $\eta / \|w\|^2$.
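     A minimal numerical check of the invariance claim above (a sketch assuming PyTorch; the bn helper is an affine-free stand-in for a batch-norm layer, not the paper's code):

        import torch

        torch.manual_seed(0)
        x = torch.randn(256, 64)           # a batch of inputs
        w = torch.randn(10, 64)            # weights of a linear layer

        def bn(z):
            # plain batch normalization over the batch dimension, no affine terms
            return (z - z.mean(0)) / z.std(0, unbiased=False)

        out_w = bn(x @ w.t())              # BN(w^T x)
        out_cw = bn(x @ (5.0 * w).t())     # BN((5w)^T x): same weights, rescaled
        print(torch.allclose(out_w, out_cw, atol=1e-5))  # True: only the direction of w matters

     Because the output depends only on $\hat{w}$, an SGD step of size $\eta$ on $w$ moves the direction $\hat{w}$ by roughly $\eta / \|w\|^2$, which is the effective learning rate mentioned above.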

  4. Weight decay before BN is redundant
     • Weight decay is equivalent to learning-rate scaling.
     • Its effect can be mimicked by a learning-rate correction (see the sketch below).
     [Figure: training curves with WD, without WD, and without WD + LR correction]
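     A hedged sketch of the learning-rate correction idea: without weight decay the weight norm grows, so the effective step $\eta / \|w\|^2$ shrinks; one can compensate by rescaling each layer's learning rate by the norm growth. The correction factor below is my illustration of that reasoning, not necessarily the exact scheme used in the paper:

        import torch

        def corrected_lr(base_lr, weight, init_norm):
            # rescale this layer's learning rate by (||w_t|| / ||w_0||)^2 so the
            # effective step on the weight direction matches a run trained with WD
            return base_lr * (weight.norm() / init_norm) ** 2

        # usage: record init_norm = layer.weight.norm().item() at initialization,
        # then update each parameter group's lr with corrected_lr(...) before every step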

  5. Improving weight-norm
     Weight normalization, for a channel $i$: $w_i = g_i \frac{v_i}{\|v_i\|}$
     Bounded weight normalization: $w_i = \rho \frac{v_i}{\|v_i\|}$, with $\rho$ a constant determined from the chosen initialization.
     This can help make weight-norm work for large-scale models.
     [Figure: ResNet-50, ImageNet]
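     A minimal sketch of the bounded variant above, assuming a PyTorch-style reparametrization; the per-channel norm $\rho$ is frozen at its initial value instead of being a learned gain (the paper's exact parametrization may differ in details):

        import torch
        import torch.nn as nn

        class BoundedWeightNorm(nn.Module):
            """Reparametrize a weight so every output channel keeps a fixed norm rho."""

            def __init__(self, weight):
                super().__init__()
                self.v = nn.Parameter(weight.detach().clone())
                # rho: per-channel norm taken from the initialization and kept constant
                self.register_buffer("rho", weight.detach().flatten(1).norm(dim=1).clone())

            def forward(self):
                v = self.v.flatten(1)
                w = self.rho.unsqueeze(1) * v / v.norm(dim=1, keepdim=True)
                return w.view_as(self.v)

        # usage: wn = BoundedWeightNorm(conv.weight); y = F.conv2d(x, wn(), bias=conv.bias)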

  6. Replacing batch-norm – switching norms
     • Batch normalization is just a scaled $L_2$ normalization: $\hat{x}_j = \frac{x_j - \mu}{\frac{1}{\sqrt{n}}\|x - \mu\|_2}$
     • More numerically stable norms: $\|x\|_1 = \sum_j |x_j|$ and $\|x\|_\infty = \max_j |x_j|$.
     • We use additional scaling constants so these norms behave similarly to $L_2$, assuming the neural input is Gaussian, e.g. $\sqrt{\frac{\pi}{2}} \cdot \frac{1}{n}\|x-\mu\|_1 \approx \frac{1}{\sqrt{n}}\|x-\mu\|_2$.
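     A minimal sketch of the $L_1$ variant above for activations of shape (batch, features), without the learned affine parameters (PyTorch assumed):

        import math
        import torch

        def l1_batch_norm(x, eps=1e-5):
            # replace the L2-based standard deviation with the scaled mean absolute
            # deviation sqrt(pi/2) * (1/n) * ||x - mu||_1, which matches the std in
            # expectation when the inputs are Gaussian
            mu = x.mean(dim=0, keepdim=True)
            centered = x - mu
            sigma_l1 = math.sqrt(math.pi / 2) * centered.abs().mean(dim=0, keepdim=True)
            return centered / (sigma_l1 + eps)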

  7. $L_1$ batch-norm (ImageNet, ResNet)
     [Figure: ImageNet ResNet results with $L_1$ batch-norm]

  8. Low-precision batch-norm
     • $L_1$ batch-norm alleviates the low-precision difficulties of batch-norm.
     • We can now train ResNet-50 with batch-norm in FP16 without issues: regular BN in FP16 fails, while $L_1$ BN in FP16 works as well as $L_2$ BN in FP32.
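     A small numeric illustration (my example, not from the slides) of why the $L_2$ statistic is fragile in half precision: squaring even moderate activations exceeds FP16's maximum of roughly 65504, while the $L_1$ statistic stays in range.

        import torch

        x = torch.full((64,), 300.0, dtype=torch.float16)
        print((x * x).sum())    # inf: 300^2 = 90000 already overflows fp16 element-wise
        print(x.abs().mean())   # 300.0: the L1 statistic stays well inside fp16 range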

  9. With a few more tricks…
     • We can now train ResNet-18 on ImageNet with bottleneck operations in INT8.
     [Figure: 8-bit vs. full-precision training curves]
     Also at NeurIPS 2018: “Scalable Methods for 8-bit Training of Neural Networks”, Ron Banner*, Itay Hubara*, Elad Hoffer*, Daniel Soudry.
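     For context, a generic, hedged sketch of symmetric 8-bit quantization of a tensor; the actual scheme in “Scalable Methods for 8-bit Training of Neural Networks” is more involved, and this only illustrates the kind of INT8 arithmetic the slide refers to:

        import torch

        def quantize_int8(x):
            # map x to int8 with a single symmetric scale (no zero-point)
            scale = x.abs().max().clamp(min=1e-8) / 127.0
            q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
            return q, scale

        def dequantize_int8(q, scale):
            return q.float() * scale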

  10. Thank you for your time! Come visit us at poster #27.
