Landscape Connectivity and Dropout Stability of SGD Solutions for Over-parameterized Neural Networks
Alexander Shevchenko, Marco Mondelli
Neural Network Training
From a theoretical perspective, training neural networks is difficult (NP-hardness, local/disconnected minima, ...), but in practice it works remarkably well!
Two key ingredients of success:
• Over-parameterization
• (Stochastic) gradient descent
Training Landscape is indeed NICE
• SGD minima are connected via a piecewise linear path with constant loss [Garipov et al., 2018; Draxler et al., 2018]
• Mode connectivity proved assuming properties of well-trained networks (dropout/noise stability) [Kuditipudi et al., 2019]
What do we show?
Theorem (Informal). As the neural network grows wider, the solutions obtained via SGD become increasingly more dropout stable, and the barriers between local optima disappear.
Mean-field view:
• Two layers [Mei et al., 2019]
• Multiple layers [Araujo et al., 2019]
Quantitative bounds:
• independent of the input dimension for two-layer networks, scale linearly with it for multiple layers
• change in loss scales with the network width as $1/\mathrm{width}$
• the number of training samples is only required to scale faster than $\log(\mathrm{width})$
Related Work
• Local minima are globally optimal for deep linear networks and for networks with more neurons than training samples
• Connected landscape if the number of neurons grows large (two-layer networks, energy gap exponential in the input dimension)
Strong assumptions on the model and poor scaling of the parameters 😟
Warm-up: Two-Layer Networks
Data: $\{(x_1, y_1), \ldots, (x_n, y_n)\} \overset{\mathrm{i.i.d.}}{\sim} \mathbb{P}(\mathbb{R}^d \times \mathbb{R})$
Model: $\hat{y}_N(x, \theta) = \frac{1}{N} \sum_{i=1}^{N} a_i \, \sigma(x; w_i)$
Goal: Minimize the loss $L_N(\theta) = \mathbb{E}\bigl\{\bigl(y - \frac{1}{N}\sum_{i=1}^{N} a_i \,\sigma(x; w_i)\bigr)^2\bigr\}$, where $\theta = (w, a)$
Online SGD: $\theta^{k+1} = \theta^k - \alpha N \, \nabla_{\theta^k}\bigl(y_k - \frac{1}{N}\sum_{i=1}^{N} a_i^k \,\sigma(x_k; w_i^k)\bigr)^2$
Assumptions:
• $y$ bounded, $\nabla_w \sigma(x, w)$ sub-Gaussian
• $\sigma$ bounded and differentiable, $\nabla \sigma$ bounded and Lipschitz
• initialization of $a_i$ with bounded support
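A minimal numpy sketch of this setup and of one online SGD step (the tanh activation, Gaussian inputs, and all dimensions below are illustrative assumptions, not the paper's exact choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, alpha = 20, 1000, 0.01          # input dim, width, step size

# parameters theta = (w, a): first-layer weights and output weights
w = rng.normal(size=(N, d))
a = rng.uniform(-1.0, 1.0, size=N)    # bounded-support initialization

def sigma(x, w):
    """Neuron response sigma(x; w_i); tanh is an illustrative bounded,
    differentiable choice with bounded Lipschitz gradient."""
    return np.tanh(w @ x)

def y_hat(x):
    # mean-field scaling: (1/N) * sum_i a_i * sigma(x; w_i)
    return np.mean(a * sigma(x, w))

def sgd_step(x_k, y_k):
    """One online-SGD step on the squared loss, with the N-scaled rate alpha*N."""
    global w, a
    s = sigma(x_k, w)                  # shape (N,)
    err = y_k - y_hat(x_k)             # scalar residual
    # gradients of (y_k - y_hat)^2 w.r.t. a_i and w_i (each carries a 1/N factor)
    grad_a = -2.0 * err * s / N
    grad_w = -2.0 * err * (a * (1.0 - s**2))[:, None] * x_k[None, :] / N
    a -= alpha * N * grad_a
    w -= alpha * N * grad_w

# one step on a synthetic sample (illustrative data, not the paper's distribution)
x_k = rng.normal(size=d)
y_k = 1.0
sgd_step(x_k, y_k)
```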
Recap: Dropout Stability and Connectivity
$L_M(\theta) = \mathbb{E}\bigl\{\bigl(y - \frac{1}{M}\sum_{i=1}^{M} a_i \,\sigma(x; w_i)\bigr)^2\bigr\}$
$\theta$ is $\varepsilon_D$-dropout stable if $|L_N(\theta) - L_M(\theta)| \le \varepsilon_D$
$\theta$ and $\theta'$ are $\varepsilon_C$-connected if there exists a continuous path connecting them along which the loss does not increase by more than $\varepsilon_C$
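A short sketch of how the dropout-stability gap can be estimated empirically for a two-layer network: keep the first M of the N neurons, replace the 1/N normalization by 1/M, and compare the two empirical losses (the weights and data below are random stand-ins for a trained network and a held-out test set):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, M, n_test = 20, 1000, 500, 2000

# stand-ins for a trained network and held-out data
w = rng.normal(size=(N, d))
a = rng.uniform(-1.0, 1.0, size=N)
X = rng.normal(size=(n_test, d))
y = np.sin(X[:, 0])                      # illustrative targets

def loss_keep(K):
    """Empirical squared loss when keeping the first K neurons with 1/K scaling."""
    preds = (np.tanh(X @ w[:K].T) * a[:K]).mean(axis=1)
    return np.mean((y - preds) ** 2)

L_N, L_M = loss_keep(N), loss_keep(M)
eps_D = abs(L_N - L_M)                   # empirical dropout-stability gap
print(f"L_N={L_N:.4f}  L_M={L_M:.4f}  |L_N - L_M|={eps_D:.4f}")
```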
Main Results: Dropout Stability
• Change in loss scales as $\alpha(D + \log N) + \frac{\log M}{M}$
• Loss change vanishes as $\alpha \ll (D + \log N)^{-1}$ and $M \gg 1$
• $M$ does not need to scale with $N$ or $D$
Main Results: Connectivity
• Change in loss scales as $\frac{\log N}{N} + \alpha(D + \log N)$
• Can connect SGD solutions obtained from different training data (but the same data distribution) and different initializations
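One way to probe such barriers numerically is to evaluate the loss along a path between two solutions. The sketch below uses a plain linear interpolation as a generic probe; the paper's result is for a specific piecewise linear path, and the toy loss here is only a placeholder:

```python
import numpy as np

def path_barrier(theta_a, theta_b, loss_fn, n_points=21):
    """Maximum increase of loss_fn along a straight line from theta_a to
    theta_b, relative to the larger of the two endpoint losses."""
    ts = np.linspace(0.0, 1.0, n_points)
    losses = np.array([loss_fn((1 - t) * theta_a + t * theta_b) for t in ts])
    return losses.max() - max(losses[0], losses[-1])

# illustrative usage with a toy quadratic "loss" on flattened parameters
rng = np.random.default_rng(2)
theta_a, theta_b = rng.normal(size=100), rng.normal(size=100)
toy_loss = lambda th: float(np.sum(th ** 2))
print("barrier along linear path:", path_barrier(theta_a, theta_b, toy_loss))
```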
Proof Idea
Discrete dynamics of SGD ≈ continuous dynamics of gradient flow
• $\theta^k$ is close to $N$ i.i.d. particles that evolve with gradient flow
• $L_N(\theta)$ and $L_M(\theta)$ concentrate to the same limit
• Dropout stability with $M = N/2$ ⇒ connectivity
Multilayer Case: Setup
Data: $\{(x_1, y_1), \ldots, (x_n, y_n)\} \overset{\mathrm{i.i.d.}}{\sim} \mathbb{P}(\mathbb{R}^{d_x} \times \mathbb{R}^{d_y})$
Model: $\hat{y}_N(x, \theta) = \frac{1}{N} W^{L+1} \sigma_L\bigl(\cdots \bigl(\frac{1}{N} W^2 \sigma_1(W^1 x)\bigr) \cdots\bigr)$
Goal: Minimize the loss $L_N(\theta) = \mathbb{E}\bigl\{\|y - \hat{y}_N(x, \theta)\|_2^2\bigr\}$
Online SGD: $\theta^{k+1} = \theta^k + \alpha N^2 \,\nabla_{\theta^k}\hat{y}_N(x_k, \theta^k)^{\top}\bigl(y_k - \hat{y}_N(x_k, \theta^k)\bigr)$
Assumptions:
• $y$ bounded
• $\sigma_\ell$ bounded and differentiable, $\nabla \sigma_\ell$ bounded and Lipschitz
• initialization with bounded support
• $W^1$ and $W^{L+1}$ stay fixed during training (random features)
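A sketch of this mean-field parameterization with its 1/N factors (the widths, tanh activations, and Gaussian weights are placeholder assumptions; as stated above, W^1 and W^{L+1} would stay fixed during training):

```python
import numpy as np

rng = np.random.default_rng(3)
d_x, d_y, N, L = 20, 5, 500, 3            # input/output dims, width, hidden layers

# W^1: (N, d_x), W^2..W^L: (N, N), W^{L+1}: (d_y, N)
W = [rng.normal(size=(N, d_x))] \
    + [rng.normal(size=(N, N)) for _ in range(L - 1)] \
    + [rng.normal(size=(d_y, N))]

def forward(x):
    """y_hat_N(x): every layer after the first carries a 1/N factor,
    matching the mean-field scaling of the model above."""
    h = np.tanh(W[0] @ x)                 # sigma_1(W^1 x)
    for ell in range(1, L):
        h = np.tanh(W[ell] @ h / N)       # sigma_{ell+1}((1/N) W^{ell+1} h)
    return W[L] @ h / N                   # (1/N) W^{L+1} sigma_L(...)

x = rng.normal(size=d_x)
print(forward(x).shape)                   # (d_y,)
```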
Multilayer Case: Dropout Stability and Connectivity
Dropout stability: the loss does not change much if we remove part of the neurons from each layer (and suitably rescale the remaining neurons).
$L_M(\theta) :=$ loss when we keep at most $M$ neurons per layer
$\theta$ is $\varepsilon_D$-dropout stable if $|L_N(\theta) - L_M(\theta)| \le \varepsilon_D$
$\theta$ and $\theta'$ are $\varepsilon_C$-connected if there exists a continuous path connecting them along which the loss does not increase by more than $\varepsilon_C$
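The per-layer dropout operation described above, sketched in numpy: keep M of the N neurons in every hidden layer and replace the 1/N factors by 1/M, which rescales the surviving neurons by N/M (the weights below are random placeholders rather than SGD solutions):

```python
import numpy as np

def forward_keep(W, x, keep):
    """Forward pass keeping only the neuron indices in `keep` at every hidden
    layer; the 1/N factors become 1/M, i.e. the surviving neurons are rescaled."""
    M = len(keep)
    h = np.tanh(W[0][keep] @ x)                    # keep rows of W^1
    for ell in range(1, len(W) - 1):
        h = np.tanh(W[ell][np.ix_(keep, keep)] @ h / M)
    return W[-1][:, keep] @ h / M                  # keep columns of W^{L+1}

# illustrative comparison between the full network and the half-width one
rng = np.random.default_rng(4)
d_x, d_y, N, L = 20, 5, 500, 3                     # placeholder dimensions
W = [rng.normal(size=(N, d_x))] \
    + [rng.normal(size=(N, N)) for _ in range(L - 1)] \
    + [rng.normal(size=(d_y, N))]
x, y = rng.normal(size=d_x), rng.normal(size=d_y)
full = forward_keep(W, x, np.arange(N))            # keeping all neurons recovers y_hat_N
half = forward_keep(W, x, np.arange(N // 2))       # keep the first N/2 neurons
print(np.sum((y - full) ** 2), np.sum((y - half) ** 2))
```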
Multilayer Case: Results
Proof Challenges
Discrete dynamics of SGD ≈ continuous dynamics of gradient flow, but:
• Ideal particles are no longer independent (weights in different layers are correlated)
• Need a bound on the norm of the weights during training
• Need to bound the maximum distance between SGD and ideal particles ([Araujo et al., 2019] bounds only the average distance)
Numerical Results
• CIFAR-10 dataset
• Pretrained VGG-16 features
• # layers = 3
• Keep half of the neurons
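A hedged sketch of this experimental pipeline, assuming the VGG-16 features have already been extracted (the random stand-in features, width 1024, cross-entropy loss, and training schedule below are placeholder assumptions, not the paper's exact protocol):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_feat, width, n_classes = 512, 1024, 10          # placeholder dimensions

# stand-ins for precomputed VGG-16 features of CIFAR-10 images and their labels
X_tr, y_tr = torch.randn(5000, d_feat), torch.randint(0, n_classes, (5000,))
X_te, y_te = torch.randn(1000, d_feat), torch.randint(0, n_classes, (1000,))

def make_net(h):
    # 3 hidden layers of width h on top of the (fixed) feature extractor
    return nn.Sequential(
        nn.Linear(d_feat, h), nn.ReLU(),
        nn.Linear(h, h), nn.ReLU(),
        nn.Linear(h, h), nn.ReLU(),
        nn.Linear(h, n_classes),
    )

net = make_net(width)
opt = torch.optim.SGD(net.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()                    # placeholder choice of loss
for step in range(200):                            # short illustrative training run
    idx = torch.randint(0, X_tr.shape[0], (128,))
    opt.zero_grad()
    loss_fn(net(X_tr[idx]), y_tr[idx]).backward()
    opt.step()

# keep the first half of the neurons in every hidden layer and rescale by N/M = 2
half = width // 2
small = make_net(half)
with torch.no_grad():
    big_lins = [m for m in net if isinstance(m, nn.Linear)]
    small_lins = [m for m in small if isinstance(m, nn.Linear)]
    for i, (big, sm) in enumerate(zip(big_lins, small_lins)):
        rows = slice(0, half) if i < 3 else slice(None)   # output layer keeps all rows
        cols = slice(None) if i == 0 else slice(0, half)  # input layer keeps all columns
        scale = 1.0 if i == 0 else 2.0                    # compensate for dropped inputs
        sm.weight.copy_(big.weight[rows, cols] * scale)
        sm.bias.copy_(big.bias[rows])
    L_full = loss_fn(net(X_te), y_te).item()
    L_half = loss_fn(small(X_te), y_te).item()
print(f"full-width test loss {L_full:.3f} | half-width (rescaled) test loss {L_half:.3f}")
```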
Conclusion
Thank You for Your Attention