Collapse of Deep and Narrow ReLU Neural Nets
Lu Lu, Yeonjong Shin, Yanhui Su, George Karniadakis
Division of Applied Mathematics, Brown University
Scientific Machine Learning, ICERM, January 28, 2019
Overview
1. Introduction
2. Examples
3. Theoretical analysis
4. Asymmetric initialization (Shin)
Introduction

Shallow NNs (single hidden layer)
- Universal approximation theorem

Deep (& narrow) NNs
- Better than shallow NNs of comparable size [Mhaskar & Poggio, 2016]:
  $$\frac{\text{size}_{\text{shallow}}}{\text{size}_{\text{deep}}} \approx \epsilon^{-d}$$
- ⇒ Deep & narrow

ReLU(x) := max(x, 0)
- Width limit? For continuous functions [0, 1]^{d_in} → R^{d_out} [Hanin & Sellke, 2017]:
  d_in + 1 ≤ minimal width ≤ d_in + d_out
- Depth limit?
Introduction

Training of NNs
- NP-hard [Sima, 2002]
- Local minima [Fukumizu & Amari, 2002]
- Bad saddle points [Kawaguchi, 2016]

ReLU
- Dying ReLU neuron: stuck on the negative side (always outputs 0, so it receives no gradient)

Deep ReLU nets?
- Dying ReLU network: the NN is a constant function after initialization
- Collapse: the NN converges to the “mean” state of the target function during training
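As a concrete check of the “born dead” notion, the following minimal sketch (not the authors' code; the depth, width-2 layers, and the He-style symmetric initialization are illustrative assumptions) initializes a deep, narrow ReLU network and tests whether its output is already constant on [−1, 1]:

```python
import numpy as np

def init_net(widths, rng):
    """Zero-mean (symmetric) Gaussian weights, zero biases."""
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)),
             np.zeros(n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(params, x):
    """Feed-forward ReLU network; ReLU after every layer except the last."""
    h = x
    for i, (W, b) in enumerate(params):
        h = W @ h + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)   # ReLU
    return h

rng = np.random.default_rng(0)
widths = [1] + [2] * 10 + [1]        # 10 hidden layers of width 2, d_in = d_out = 1
params = init_net(widths, rng)

xs = np.linspace(-1.0, 1.0, 201)
ys = np.array([forward(params, np.array([x]))[0] for x in xs])
print("born dead (constant output)?", bool(np.allclose(ys, ys[0])))
```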
Overview
1. Introduction
2. Examples  ← next
3. Theoretical analysis
4. Asymmetric initialization (Shin)
1D Examples

f(x) = |x|
$$|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x) = \begin{bmatrix} 1 & 1 \end{bmatrix} \mathrm{ReLU}\!\left(\begin{bmatrix} 1 \\ -1 \end{bmatrix} x\right)$$
i.e., a 2-layer ReLU network of width 2.

Train a 10-layer ReLU NN of width 2 (MSE loss, any optimizer):
- Collapse to the mean value (A): ∼ 93% of runs
- Collapse partially (B)

[Figure: panels A and B, the target y = |x| vs. the trained NN on x ∈ [−1.5, 1.5].]
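The width-2 representation of |x| above can be verified numerically (a hypothetical snippet, not part of the slides):

```python
import numpy as np

W1 = np.array([[1.0], [-1.0]])   # first layer: 2 x 1
W2 = np.array([[1.0, 1.0]])      # second layer: 1 x 2

x = np.linspace(-2.0, 2.0, 101).reshape(1, -1)
nn = W2 @ np.maximum(W1 @ x, 0.0)        # [1 1] ReLU([1; -1] x)
assert np.allclose(nn, np.abs(x))        # exact representation of |x|
```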
1D Examples

f(x) = x sin(5x)
[Figure: panels A–D, the target y = x sin(5x) vs. the trained NN on x ∈ [−1, 1].]

f(x) = 1_{x>0} + 0.2 sin(5x)
[Figure: panels A–D, the target vs. the trained NN on x ∈ [−1, 1].]
2D Examples

$$f(x) = \begin{bmatrix} |x_1 + x_2| \\ |x_1 - x_2| \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix} \mathrm{ReLU}\!\left(\begin{bmatrix} 1 & 1 \\ -1 & -1 \\ 1 & -1 \\ -1 & 1 \end{bmatrix} x\right)$$
i.e., a 2-layer ReLU network of width 4.

[Figure: panels A and B, surface plots of y_1 = |x_1 + x_2| vs. the trained NN over (x_1, x_2).]
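The same kind of numerical check works for the 2D identity; note that the zero entries of the second-layer matrix are reconstructed here from |a| = ReLU(a) + ReLU(−a), so treat the exact matrices as an assumption:

```python
import numpy as np

W1 = np.array([[ 1.0,  1.0],
               [-1.0, -1.0],
               [ 1.0, -1.0],
               [-1.0,  1.0]])                 # first layer: 4 x 2
W2 = np.array([[1.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 1.0]])         # second layer: 2 x 4

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(2, 1000))    # random points in [-1, 1]^2
F = W2 @ np.maximum(W1 @ X, 0.0)
target = np.vstack([np.abs(X[0] + X[1]), np.abs(X[0] - X[1])])
assert np.allclose(F, target)
```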
Loss

- Mean squared error (MSE) ⇒ mean
- Mean absolute error (MAE) ⇒ median

[Figure: panels A–C, the targets y = |x|, y = x sin(5x), y = 1_{x>0} + 0.2 sin(5x) vs. the NN trained with MSE and with MAE.]
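The mean/median correspondence is the standard fact that these are the best constant predictors; a short derivation (not spelled out on the slide):

```latex
% Best constant under MSE: set the derivative to zero.
\frac{d}{dc}\,\mathbb{E}\big[(c - f(x))^2\big] = 2\big(c - \mathbb{E}[f(x)]\big) = 0
\quad\Longrightarrow\quad c^\ast = \mathbb{E}[f(x)].
% Best constant under MAE: the (sub)derivative vanishes where the two tail
% probabilities balance, i.e. at any median of f(x).
\frac{d}{dc}\,\mathbb{E}\big[\,|c - f(x)|\,\big] = P\big(f(x) < c\big) - P\big(f(x) > c\big) = 0
\quad\Longrightarrow\quad c^\ast = \operatorname{median}\big(f(x)\big).
```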
Overview
1. Introduction
2. Examples
3. Theoretical analysis  ← next
4. Asymmetric initialization (Shin)
Setup

Feed-forward ReLU neural network N_L : R^{d_in} → R^{d_out} with L layers.

In layer ℓ:
- N_ℓ neurons (N_0 = d_in, N_L = d_out)
- Weight W_ℓ: an N_ℓ × N_{ℓ−1} matrix
- Bias b_ℓ ∈ R^{N_ℓ}

Input: x ∈ R^{d_in}. Neural activity in layer ℓ: N_ℓ(x) ∈ R^{N_ℓ},
$$N_\ell(x) = W_\ell\, \phi(N_{\ell-1}(x)) + b_\ell \in \mathbb{R}^{N_\ell}, \quad 2 \le \ell \le L,$$
with N_1(x) = W_1 x + b_1 and φ the ReLU activation applied elementwise.
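In code, the layer recursion reads as the following minimal NumPy sketch (an illustration, not the authors' implementation; `weights` and `biases` are assumed lists of per-layer parameters):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)                  # phi(z) = max(z, 0), elementwise

def layer_activities(weights, biases, x):
    """Return [N_1(x), ..., N_L(x)] with N_l(x) = W_l phi(N_{l-1}(x)) + b_l."""
    acts = [weights[0] @ x + biases[0]]        # N_1(x) = W_1 x + b_1
    for W, b in zip(weights[1:], biases[1:]):
        acts.append(W @ relu(acts[-1]) + b)    # N_l(x), 2 <= l <= L
    return acts
```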
Setup

Training data
$$T = \{(x_i, f(x_i))\}_{1 \le i \le M} \subset D \equiv B_r(0) = \{x \in \mathbb{R}^{d_{in}} : \|x\|_2 \le r\}$$

Loss function
$$L(\theta, T) = \sum_{i=1}^{M} \ell\big(N_L(x_i; \theta), f(x_i)\big), \quad \text{where } \theta = \{W_\ell, b_\ell\}_{1 \le \ell \le L}.$$
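The empirical loss L(θ, T) is then a sum of per-sample losses over the training pairs; a sketch with the squared-error choice ℓ(y, y′) = ‖y − y′‖², reusing `layer_activities` from the previous snippet:

```python
import numpy as np

def mse_loss(weights, biases, xs, ys):
    """L(theta, T) = sum_i ||N_L(x_i; theta) - f(x_i)||^2 over the training pairs."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = layer_activities(weights, biases, x)[-1]   # N_L(x_i; theta)
        total += float(np.sum((pred - y) ** 2))
    return total
```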
N_L will eventually die in probability as L → ∞

Theorem 1
Let N_L(x) be a ReLU NN with L layers whose ℓ-th layer has N_ℓ neurons. Suppose
1. the weights are independently initialized from a distribution symmetric around 0, and
2. the biases are either drawn from a symmetric distribution or set to zero.
Then
$$P(N_L(x) \text{ dies}) \le 1 - \prod_{\ell=1}^{L-1}\Big(1 - (1/2)^{N_\ell}\Big).$$
Furthermore, assuming N_ℓ = N for all ℓ,
$$\lim_{L \to \infty} P(N_L(x) \text{ dies}) = 1, \qquad \lim_{N \to \infty} P(N_L(x) \text{ dies}) = 0.$$
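To see the depth dependence numerically, here is a rough Monte Carlo sketch (not from the paper; the widths, initialization, and input grid are illustrative, and constancy is only checked on finitely many inputs) that estimates how often a width-N, depth-L network is born dead and compares with 1 − (1 − 2^{−N})^{L−1}:

```python
import numpy as np

def born_dead_rate(L, N, trials=500, seed=0):
    """Fraction of random inits whose output is constant on a 1D input sample."""
    rng = np.random.default_rng(seed)
    xs = np.linspace(-1.0, 1.0, 64).reshape(1, -1)     # d_in = 1, 64 sample points
    widths = [1] + [N] * (L - 1) + [1]                 # L weight layers, d_out = 1
    dead = 0
    for _ in range(trials):
        h = xs
        for l, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
            W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
            h = W @ h                                   # zero biases
            if l < len(widths) - 2:
                h = np.maximum(h, 0.0)                  # ReLU on hidden layers
        dead += bool(np.allclose(h, h[:, :1]))          # constant over the sample?
    return dead / trials

N = 2
for L in (3, 10, 30):
    bound = 1.0 - (1.0 - 0.5 ** N) ** (L - 1)
    print(f"L={L:2d}  empirical ~{born_dead_rate(L, N):.2f}  bound {bound:.2f}")
```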
Proof

Lemma 1
Let N_L(x) be a ReLU NN of L layers. Suppose the weights are drawn independently from distributions satisfying P(W^ℓ_j z = 0) = 0 for any nonzero z ∈ R^{N_{ℓ−1}} and any j-th row W^ℓ_j of W^ℓ. Then
$$P(N_L(x) \text{ dies}) = P\big(\exists\, \ell \in \{1, \ldots, L-1\} \text{ s.t. } \phi(N_\ell(x)) = 0 \ \forall x \in D\big).$$

For a given x,
$$P\Big(\tilde{W}^j_s\, \phi(N_{j-1}(x)) + b^j_s < 0 \ \Big|\ \tilde{A}^c_{j-1,x}\Big) = \frac{1}{2},
\quad \text{where } \tilde{A}^c_{\ell,x} = \{\forall\, 1 \le j < \ell,\ \phi(N_j(x)) \ne 0\}.$$
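From this conditional probability, the step to the bound in Theorem 1 goes as follows (a hedged sketch of the omitted computation, assuming the rows of each W^j and the biases are mutually independent):

```latex
% Conditioned on \tilde{A}^c_{j-1,x}, the activity \phi(N_{j-1}(x)) is determined by the
% earlier layers and the N_j rows of layer j are independent, so the probability that
% every neuron of layer j is non-positive at x is
P\big(\phi(N_j(x)) = 0 \,\big|\, \tilde{A}^c_{j-1,x}\big) = (1/2)^{N_j}.
% Multiplying the survival probabilities of layers 1, ..., L-1 gives, for this fixed x,
P\big(\exists\, j \in \{1,\ldots,L-1\}:\ \phi(N_j(x)) = 0\big)
  = 1 - \prod_{j=1}^{L-1}\big(1 - (1/2)^{N_j}\big),
% and since dying on all of D implies dying at this particular x, this is exactly the
% upper bound stated in Theorem 1.
```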
Dead Networks would Collapse

Theorem 2
Suppose the ReLU NN dies. Then for any loss L, the network is optimized to a constant function by any gradient-based method.

Proof
- Lemma 1 ⇒ ∃ ℓ ∈ {1, ..., L−1} s.t. φ(N_ℓ(x)) = 0 ∀ x ∈ D
- The gradients of L w.r.t. the weights/biases in layers 1, ..., ℓ vanish
- Assuming the training data are i.i.d. from P_D, the optimized network is
$$N_L(x; \theta^*) = \operatorname*{argmin}_{c \in \mathbb{R}^{N_L}} \mathbb{E}_{x \sim P_D}\big[\ell(c, f(x))\big]$$
- MSE / L² ⇒ E[f(x)];  MAE / L¹ ⇒ median of f(x)
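A tiny numerical illustration (hypothetical, not from the slides): once the network has collapsed to a trainable constant c, gradient descent on the empirical MSE drives c to the sample mean of the targets, matching the collapsed states seen in the 1D examples:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)
y = np.abs(x)                          # targets f(x) = |x|

c = 0.0                                # the dead network's constant output
for _ in range(2000):
    grad = 2.0 * np.mean(c - y)        # d/dc of mean (c - y_i)^2
    c -= 0.1 * grad
print(c, y.mean())                     # c converges to the sample mean of f(x)
```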