Collapse of Deep and Narrow ReLU Neural Nets
Lu Lu, Yeonjong Shin, Yanhui Su, George Karniadakis
Division of Applied Mathematics, Brown University
Scientific Machine Learning, ICERM, January 28, 2019
Overview
1. Introduction
2. Examples
3. Theoretical analysis
4. Asymmetric initialization (Shin)
Introduction

Shallow NNs (single hidden layer)
- Universal approximation theorem

Deep (& narrow) NNs
- Better than shallow NNs of comparable size [Mhaskar & Poggio, 2016]:
  $$\frac{\text{size}_{\text{shallow}}}{\text{size}_{\text{deep}}} \approx \epsilon^{-d}$$
- ⇒ Deep & narrow

ReLU(x) := max(x, 0)
- Width limit? For continuous functions [0, 1]^{d_in} → R^{d_out} [Hanin & Sellke, 2017]:
  d_in + 1 ≤ minimal width ≤ d_in + d_out
- Depth limit?
Introduction

Training of NNs
- NP-hard [Sima, 2002]
- Local minima [Fukumizu & Amari, 2002]
- Bad saddle points [Kawaguchi, 2016]

ReLU
- Dying ReLU neuron: stuck on the negative side (always outputs 0, so it receives no gradient)

Deep ReLU nets?
- Dying ReLU network: the NN is a constant function after initialization
- Collapse: the NN converges to the “mean” state of the target function during training
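As a concrete check of the “born dead” notion, the following minimal sketch (not the authors' code; the depth, width-2 layers, and the He-style symmetric initialization are illustrative assumptions) initializes a deep, narrow ReLU network and tests whether its output is already constant on [−1, 1]:

```python
import numpy as np

def init_net(widths, rng):
    """Zero-mean (symmetric) Gaussian weights, zero biases."""
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)),
             np.zeros(n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(params, x):
    """Feed-forward ReLU network; ReLU after every layer except the last."""
    h = x
    for i, (W, b) in enumerate(params):
        h = W @ h + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)   # ReLU
    return h

rng = np.random.default_rng(0)
widths = [1] + [2] * 10 + [1]        # 10 hidden layers of width 2, d_in = d_out = 1
params = init_net(widths, rng)

xs = np.linspace(-1.0, 1.0, 201)
ys = np.array([forward(params, np.array([x]))[0] for x in xs])
print("born dead (constant output)?", bool(np.allclose(ys, ys[0])))
```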
Overview
1. Introduction
2. Examples  ← next
3. Theoretical analysis
4. Asymmetric initialization (Shin)
1D Examples

f(x) = |x|
$$|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x) = \begin{bmatrix} 1 & 1 \end{bmatrix} \mathrm{ReLU}\!\left(\begin{bmatrix} 1 \\ -1 \end{bmatrix} x\right)$$
i.e., a 2-layer ReLU network of width 2.

Train a 10-layer ReLU NN of width 2 (MSE loss, any optimizer):
- Collapse to the mean value (A): ∼ 93% of runs
- Collapse partially (B)

[Figure: panels A and B, the target y = |x| vs. the trained NN on x ∈ [−1.5, 1.5].]
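The width-2 representation of |x| above can be verified numerically (a hypothetical snippet, not part of the slides):

```python
import numpy as np

W1 = np.array([[1.0], [-1.0]])   # first layer: 2 x 1
W2 = np.array([[1.0, 1.0]])      # second layer: 1 x 2

x = np.linspace(-2.0, 2.0, 101).reshape(1, -1)
nn = W2 @ np.maximum(W1 @ x, 0.0)        # [1 1] ReLU([1; -1] x)
assert np.allclose(nn, np.abs(x))        # exact representation of |x|
```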
1D Examples

f(x) = x sin(5x)
[Figure: panels A–D, the target y = x sin(5x) vs. the trained NN on x ∈ [−1, 1].]

f(x) = 1_{x>0} + 0.2 sin(5x)
[Figure: panels A–D, the target vs. the trained NN on x ∈ [−1, 1].]
2D Examples

$$f(x) = \begin{bmatrix} |x_1 + x_2| \\ |x_1 - x_2| \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix} \mathrm{ReLU}\!\left(\begin{bmatrix} 1 & 1 \\ -1 & -1 \\ 1 & -1 \\ -1 & 1 \end{bmatrix} x\right)$$
i.e., a 2-layer ReLU network of width 4.

[Figure: panels A and B, surface plots of y_1 = |x_1 + x_2| vs. the trained NN over (x_1, x_2).]
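The same kind of numerical check works for the 2D identity; note that the zero entries of the second-layer matrix are reconstructed here from |a| = ReLU(a) + ReLU(−a), so treat the exact matrices as an assumption:

```python
import numpy as np

W1 = np.array([[ 1.0,  1.0],
               [-1.0, -1.0],
               [ 1.0, -1.0],
               [-1.0,  1.0]])                 # first layer: 4 x 2
W2 = np.array([[1.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 1.0]])         # second layer: 2 x 4

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(2, 1000))    # random points in [-1, 1]^2
F = W2 @ np.maximum(W1 @ X, 0.0)
target = np.vstack([np.abs(X[0] + X[1]), np.abs(X[0] - X[1])])
assert np.allclose(F, target)
```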
Loss

- Mean squared error (MSE) ⇒ mean
- Mean absolute error (MAE) ⇒ median

[Figure: panels A–C, the targets y = |x|, y = x sin(5x), y = 1_{x>0} + 0.2 sin(5x) vs. the NN trained with MSE and with MAE.]
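The mean/median correspondence is the standard fact that these are the best constant predictors; a short derivation (not spelled out on the slide):

```latex
% Best constant under MSE: set the derivative to zero.
\frac{d}{dc}\,\mathbb{E}\big[(c - f(x))^2\big] = 2\big(c - \mathbb{E}[f(x)]\big) = 0
\quad\Longrightarrow\quad c^\ast = \mathbb{E}[f(x)].
% Best constant under MAE: the (sub)derivative vanishes where the two tail
% probabilities balance, i.e. at any median of f(x).
\frac{d}{dc}\,\mathbb{E}\big[\,|c - f(x)|\,\big] = P\big(f(x) < c\big) - P\big(f(x) > c\big) = 0
\quad\Longrightarrow\quad c^\ast = \operatorname{median}\big(f(x)\big).
```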
Overview
1. Introduction
2. Examples
3. Theoretical analysis  ← next
4. Asymmetric initialization (Shin)
Setup

Feed-forward ReLU neural network N_L : R^{d_in} → R^{d_out} with L layers.

In layer ℓ:
- N_ℓ neurons (N_0 = d_in, N_L = d_out)
- Weight W_ℓ: an N_ℓ × N_{ℓ−1} matrix
- Bias b_ℓ ∈ R^{N_ℓ}

Input: x ∈ R^{d_in}. Neural activity in layer ℓ: N_ℓ(x) ∈ R^{N_ℓ},
$$N_\ell(x) = W_\ell\, \phi(N_{\ell-1}(x)) + b_\ell \in \mathbb{R}^{N_\ell}, \quad 2 \le \ell \le L,$$
with N_1(x) = W_1 x + b_1 and φ the ReLU activation applied elementwise.
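In code, the layer recursion reads as the following minimal NumPy sketch (an illustration, not the authors' implementation; `weights` and `biases` are assumed lists of per-layer parameters):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)                  # phi(z) = max(z, 0), elementwise

def layer_activities(weights, biases, x):
    """Return [N_1(x), ..., N_L(x)] with N_l(x) = W_l phi(N_{l-1}(x)) + b_l."""
    acts = [weights[0] @ x + biases[0]]        # N_1(x) = W_1 x + b_1
    for W, b in zip(weights[1:], biases[1:]):
        acts.append(W @ relu(acts[-1]) + b)    # N_l(x), 2 <= l <= L
    return acts
```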
Setup

Training data
$$T = \{(x_i, f(x_i))\}_{1 \le i \le M} \subset D \equiv B_r(0) = \{x \in \mathbb{R}^{d_{in}} : \|x\|_2 \le r\}$$

Loss function
$$L(\theta, T) = \sum_{i=1}^{M} \ell\big(N_L(x_i; \theta), f(x_i)\big), \quad \text{where } \theta = \{W_\ell, b_\ell\}_{1 \le \ell \le L}.$$
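The empirical loss L(θ, T) is then a sum of per-sample losses over the training pairs; a sketch with the squared-error choice ℓ(y, y′) = ‖y − y′‖², reusing `layer_activities` from the previous snippet:

```python
import numpy as np

def mse_loss(weights, biases, xs, ys):
    """L(theta, T) = sum_i ||N_L(x_i; theta) - f(x_i)||^2 over the training pairs."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = layer_activities(weights, biases, x)[-1]   # N_L(x_i; theta)
        total += float(np.sum((pred - y) ** 2))
    return total
```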
N_L will eventually die in probability as L → ∞

Theorem 1
Let N_L(x) be a ReLU NN with L layers whose ℓ-th layer has N_ℓ neurons. Suppose
1. the weights are independently initialized from a distribution symmetric around 0, and
2. the biases are either drawn from a symmetric distribution or set to zero.
Then
$$P(N_L(x) \text{ dies}) \le 1 - \prod_{\ell=1}^{L-1}\Big(1 - (1/2)^{N_\ell}\Big).$$
Furthermore, assuming N_ℓ = N for all ℓ,
$$\lim_{L \to \infty} P(N_L(x) \text{ dies}) = 1, \qquad \lim_{N \to \infty} P(N_L(x) \text{ dies}) = 0.$$
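To see the depth dependence numerically, here is a rough Monte Carlo sketch (not from the paper; the widths, initialization, and input grid are illustrative, and constancy is only checked on finitely many inputs) that estimates how often a width-N, depth-L network is born dead and compares with 1 − (1 − 2^{−N})^{L−1}:

```python
import numpy as np

def born_dead_rate(L, N, trials=500, seed=0):
    """Fraction of random inits whose output is constant on a 1D input sample."""
    rng = np.random.default_rng(seed)
    xs = np.linspace(-1.0, 1.0, 64).reshape(1, -1)     # d_in = 1, 64 sample points
    widths = [1] + [N] * (L - 1) + [1]                 # L weight layers, d_out = 1
    dead = 0
    for _ in range(trials):
        h = xs
        for l, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
            W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
            h = W @ h                                   # zero biases
            if l < len(widths) - 2:
                h = np.maximum(h, 0.0)                  # ReLU on hidden layers
        dead += bool(np.allclose(h, h[:, :1]))          # constant over the sample?
    return dead / trials

N = 2
for L in (3, 10, 30):
    bound = 1.0 - (1.0 - 0.5 ** N) ** (L - 1)
    print(f"L={L:2d}  empirical ~{born_dead_rate(L, N):.2f}  bound {bound:.2f}")
```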
Proof

Lemma 1
Let N_L(x) be a ReLU NN of L layers. Suppose the weights are drawn independently from distributions satisfying P(W^ℓ_j z = 0) = 0 for any nonzero z ∈ R^{N_{ℓ−1}} and any j-th row W^ℓ_j of W^ℓ. Then
$$P(N_L(x) \text{ dies}) = P\big(\exists\, \ell \in \{1, \ldots, L-1\} \text{ s.t. } \phi(N_\ell(x)) = 0 \ \forall x \in D\big).$$

For a given x,
$$P\Big(\tilde{W}^j_s\, \phi(N_{j-1}(x)) + b^j_s < 0 \ \Big|\ \tilde{A}^c_{j-1,x}\Big) = \frac{1}{2},
\quad \text{where } \tilde{A}^c_{\ell,x} = \{\forall\, 1 \le j < \ell,\ \phi(N_j(x)) \ne 0\}.$$
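From this conditional probability, the step to the bound in Theorem 1 goes as follows (a hedged sketch of the omitted computation, assuming the rows of each W^j and the biases are mutually independent):

```latex
% Conditioned on \tilde{A}^c_{j-1,x}, the activity \phi(N_{j-1}(x)) is determined by the
% earlier layers and the N_j rows of layer j are independent, so the probability that
% every neuron of layer j is non-positive at x is
P\big(\phi(N_j(x)) = 0 \,\big|\, \tilde{A}^c_{j-1,x}\big) = (1/2)^{N_j}.
% Multiplying the survival probabilities of layers 1, ..., L-1 gives, for this fixed x,
P\big(\exists\, j \in \{1,\ldots,L-1\}:\ \phi(N_j(x)) = 0\big)
  = 1 - \prod_{j=1}^{L-1}\big(1 - (1/2)^{N_j}\big),
% and since dying on all of D implies dying at this particular x, this is exactly the
% upper bound stated in Theorem 1.
```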
Dead Networks would Collapse

Theorem 2
Suppose the ReLU NN dies. Then for any loss L, the network is optimized to a constant function by any gradient-based method.

Proof
- Lemma 1 ⇒ ∃ ℓ ∈ {1, ..., L−1} s.t. φ(N_ℓ(x)) = 0 ∀ x ∈ D
- The gradients of L w.r.t. the weights/biases in layers 1, ..., ℓ vanish
- Assuming the training data are i.i.d. from P_D, the optimized network is
$$N_L(x; \theta^*) = \operatorname*{argmin}_{c \in \mathbb{R}^{N_L}} \mathbb{E}_{x \sim P_D}\big[\ell(c, f(x))\big]$$
- MSE / L² ⇒ E[f(x)];  MAE / L¹ ⇒ median of f(x)
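A tiny numerical illustration (hypothetical, not from the slides): once the network has collapsed to a trainable constant c, gradient descent on the empirical MSE drives c to the sample mean of the targets, matching the collapsed states seen in the 1D examples:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)
y = np.abs(x)                          # targets f(x) = |x|

c = 0.0                                # the dead network's constant output
for _ in range(2000):
    grad = 2.0 * np.mean(c - y)        # d/dc of mean (c - y_i)^2
    c -= 0.1 * grad
print(c, y.mean())                     # c converges to the sample mean of f(x)
```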