A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent

Yongqiang Cai¹, Qianxiao Li¹,², Zuowei Shen¹
¹ Department of Mathematics, National University of Singapore, Singapore
² Institute of High Performance Computing, A*STAR, Singapore

ICML, 9-15 June 2019, Long Beach, CA, USA
Batch Normalization

A vanilla fully-connected layer: z = σ(Wu + b).

With batch normalization (Ioffe & Szegedy, 2015):

    z = σ(γ N(Wu) + β),   N(ξ) := (ξ − E[ξ]) / √Var[ξ].

Batch normalization works well in practice: it allows stable training with large learning rates and works well in high dimensions and on ill-conditioned problems.

Related work on BN: Ma & Klabjan (2017); Kohler et al. (2018); Arora et al. (2019).

Question: Can we quantify the precise effect of BN on gradient descent (GD)?
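To make the transform concrete, here is a minimal numpy sketch of a batch-normalized fully-connected layer as written above. The function name bn_layer, the tanh nonlinearity, and the small eps added for numerical stability are illustrative assumptions, not the authors' code.

```python
import numpy as np

def bn_layer(U, W, gamma, beta, sigma=np.tanh, eps=1e-8):
    """Forward pass of a fully-connected layer with batch normalization.

    U: (batch, d_in) mini-batch of inputs, W: (d_in, d_out) weights.
    N(.) standardizes each pre-activation over the batch; gamma and beta
    are the learned scale and shift, sigma the nonlinearity.
    (Illustrative sketch; eps avoids division by zero.)
    """
    s = U @ W                                                     # pre-activations Wu
    s_hat = (s - s.mean(axis=0)) / np.sqrt(s.var(axis=0) + eps)   # N(Wu)
    return sigma(gamma * s_hat + beta)                            # z = sigma(gamma N(Wu) + beta)

# usage: a batch of 32 inputs through a 10 -> 5 layer
rng = np.random.default_rng(0)
U = rng.normal(size=(32, 10))
W = rng.normal(size=(10, 5))
z = bn_layer(U, W, gamma=np.ones(5), beta=np.zeros(5))
```

The bias b is dropped in the normalized layer: the mean subtraction in N(·) cancels any constant shift, and the learned β plays its role, which is why the slide writes N(Wu) rather than N(Wu + b).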
Batch Normalization on Ordinary Least Squares

Linear regression model: y = xᵀw* + noise, with input x ∈ ℝᵈ and label y ∈ ℝ.

OLS regression without BN

Optimization problem:

    min_w J₀(w) := E_{x,y}[ (1/2)(y − xᵀw)² ]

Gradient descent dynamics:

    w_{k+1} = w_k − ε ∇_w J₀(w_k) = w_k + ε (g − H w_k),

where H := E[xxᵀ], g := E[xy], c := E[y²]. The error contracts as (I − εH)(w_k − w̄), with w̄ the minimizer, so plain GD converges only for small ε, namely ε < 2/λ_max(H).
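A minimal sketch of this GD iteration on synthetic OLS data. The problem size, noise level, and step-size choice below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 5000
w_star = rng.normal(size=d)
x = rng.normal(size=(n, d))
y = x @ w_star + 0.1 * rng.normal(size=n)

H = x.T @ x / n          # H := E[x x^T]
g = x.T @ y / n          # g := E[x y]

eps = 1.0 / np.linalg.eigvalsh(H).max()   # safe step size: GD needs eps < 2 / lambda_max(H)
w = np.zeros(d)
for k in range(500):
    w = w + eps * (g - H @ w)             # w_{k+1} = w_k + eps (g - H w_k)
```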
OLS regression with BN

Optimization problem:

    min_{a,w} J(a, w) := E_{x,y}[ (1/2)(y − a N(xᵀw))² ]

Gradient descent dynamics:

    a_{k+1} = a_k − ε_a ∇_a J(a_k, w_k) = a_k + ε_a ( w_kᵀg / √(w_kᵀHw_k) − a_k ),

    w_{k+1} = w_k − ε ∇_w J(a_k, w_k) = w_k + (ε a_k / √(w_kᵀHw_k)) ( g − (w_kᵀg / (w_kᵀHw_k)) Hw_k ).

How does this compare with the GD case?

    w_{k+1} = w_k − ε ∇_w J₀(w_k) = w_k + ε (g − Hw_k)

Properties of interest: convergence, robustness.
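A corresponding sketch of the BNGD iteration above, given H and g. The helper name bngd_ols, the initialization, and the default step counts are assumptions for illustration.

```python
import numpy as np

def bngd_ols(H, g, eps, eps_a, steps=1000, seed=0):
    """Sketch of batch-normalized gradient descent on the OLS objective above.

    H = E[x x^T], g = E[x y]; eps and eps_a are the learning rates for w and a.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=H.shape[0])   # nonzero init so w^T H w > 0
    a = 0.0
    for _ in range(steps):
        wHw = w @ H @ w
        wg = w @ g
        a_new = a + eps_a * (wg / np.sqrt(wHw) - a)
        w = w + eps * a / np.sqrt(wHw) * (g - (wg / wHw) * H @ w)
        a = a_new
    return a, w
```

Note that both updates are evaluated at the same iterate (a_k, w_k), so the new a is only assigned after the w-step.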
Summary of Theoretical Results

Property                       | Gradient Descent   | Gradient Descent with BN
Convergence                    | only for small ε   | arbitrary ε, provided ε_a ≤ 1
Convergence rate               | linear             | linear (can be faster)
Robustness to learning rates   | small range of ε   | wide range of ε
Robustness to dimensions       | no effect          | the higher the better

• These properties are also observed in neural network experiments.

[Figures: (a) Loss of GD and BNGD (d = 100); (b) Effect of dimension on BNGD; (c) Accuracy of BNGD on MNIST]
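The robustness claims in the table can be illustrated by scanning learning rates for both iterations on the same ill-conditioned OLS problem and checking which runs stay bounded. The problem sizes, iteration counts, and divergence threshold below are assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 10000
x = rng.normal(size=(n, d)) * rng.uniform(0.1, 10.0, size=d)  # ill-conditioned features
y = x @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
H, g = x.T @ x / n, x.T @ y / n

for eps in [1e-4, 1e-2, 1.0, 1e2]:
    # plain GD on J_0
    w = np.zeros(d)
    gd_ok = True
    for _ in range(2000):
        w = w + eps * (g - H @ w)
        if np.linalg.norm(w) > 1e12:   # treat as diverged
            gd_ok = False
            break

    # BNGD with eps_a = 1 (within the condition eps_a <= 1)
    wb, a = rng.normal(size=d), 0.0
    for _ in range(2000):
        wHw, wg = wb @ H @ wb, wb @ g
        a_new = a + (wg / np.sqrt(wHw) - a)
        wb = wb + eps * a / np.sqrt(wHw) * (g - (wg / wHw) * H @ wb)
        a = a_new
    bn_ok = np.isfinite(wb).all() and np.linalg.norm(wb) < 1e12

    print(f"eps={eps:g}: GD bounded={gd_ok}, BNGD bounded={bn_ok}")
```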
Poster: Pacific Ballroom #54