
Collapse of Deep and Narrow ReLU Neural Nets. Lu Lu, Yeonjong Shin, Yanhui Su, George Karniadakis. (PowerPoint presentation)



  1. Collapse of Deep and Narrow ReLU Neural Nets. Lu Lu, Yeonjong Shin, Yanhui Su, George Karniadakis. Division of Applied Mathematics, Brown University. Scientific Machine Learning, ICERM, January 28, 2019.

  2. Overview: 1. Introduction; 2. Examples; 3. Theoretical analysis; 4. Asymmetric initialization (Shin).

  3. Introduction. Shallow NNs (single hidden layer): universal approximation theorem. Deep (and narrow) NNs: better than shallow NNs of comparable size, with $\mathrm{size}_{\mathrm{shallow}} / \mathrm{size}_{\mathrm{deep}} \approx \epsilon^{-d}$ [Mhaskar & Poggio, 2016] ⇒ go deep and narrow. ReLU(x) := max(x, 0). Width limit? For continuous functions $[0,1]^{d_{\mathrm{in}}} \to \mathbb{R}^{d_{\mathrm{out}}}$ [Hanin & Sellke, 2017]: $d_{\mathrm{in}} + 1 \le \text{minimal width} \le d_{\mathrm{in}} + d_{\mathrm{out}}$. Depth limit?

  4. Introduction. Training of NNs: NP-hard [Sima, 2002]; local minima [Fukumizu & Amari, 2002]; bad saddle points [Kawaguchi, 2016]. ReLU: a dying ReLU neuron is stuck on the negative side, outputting zero for every input. What happens in deep ReLU nets? Dying ReLU network: the NN is a constant function already after initialization. Collapse: the NN converges to the "mean" state of the target function during training.
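
The "born dead" event is easy to probe empirically. Below is a minimal sketch (assuming NumPy and a He-style symmetric initialization with zero biases; the 10-layer, width-2 architecture mirrors the later 1D example, and the script is not the authors' code) that estimates how often a freshly initialized network has a hidden layer whose ReLU output is zero for every sampled input, which makes the whole network constant.

```python
# Minimal sketch (not the authors' code): estimate how often a randomly
# initialized deep, narrow ReLU net is "born dead", i.e. some hidden layer
# outputs exactly zero for every test input, making the network constant.
import numpy as np

def is_dead(widths, xs, rng):
    """widths = [d_in, N_1, ..., N_{L-1}]; xs has shape (num_points, d_in)."""
    h = xs
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))  # symmetric (He-style) init
        b = np.zeros(n_out)                                           # zero biases, as allowed in Theorem 1
        h = np.maximum(W @ h.T + b[:, None], 0.0).T                   # ReLU activation of this layer
        if np.all(h == 0.0):       # this layer kills every input -> constant network
            return True
    return False

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=(1000, 1))           # test inputs in D = [-1, 1]
trials = 2000
dead = sum(is_dead([1] + [2] * 9, xs, rng) for _ in range(trials))
print(f"estimated fraction of networks born dead: {dead / trials:.2f}")
```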

  5. Overview: 1. Introduction; 2. Examples; 3. Theoretical analysis; 4. Asymmetric initialization (Shin).

  6. 1D Examples. Target $f(x) = |x|$. Since $|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x) = \begin{pmatrix} 1 & 1 \end{pmatrix} \mathrm{ReLU}\!\left( \begin{pmatrix} 1 \\ -1 \end{pmatrix} x \right)$, a 2-layer ReLU net of width 2 represents $f$ exactly. Train a 10-layer ReLU NN with width 2 (MSE loss, any optimizer). Outcomes: collapse to the mean value (panel A) in about 93% of cases; partial collapse (panel B). [Figure: panels A and B compare $y = |x|$ with the trained NN on $x \in [-1.5, 1.5]$.]
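
A rough PyTorch sketch of this experiment follows (the optimizer, learning rate, number of steps, and sampling grid are assumptions, not taken from the slides; this is an illustration, not the authors' code).

```python
# Sketch of the |x| experiment (assumed hyperparameters): a 10-layer, width-2
# ReLU net trained with MSE often collapses to the mean of |x|.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1.0, 1.0, 200).unsqueeze(1)   # training inputs
y = x.abs()                                       # target f(x) = |x|

layers = [nn.Linear(1, 2), nn.ReLU()]
for _ in range(8):                                # 8 more hidden layers of width 2
    layers += [nn.Linear(2, 2), nn.ReLU()]
layers += [nn.Linear(2, 1)]                       # 10 linear layers in total
net = nn.Sequential(*layers)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(5000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)      # MSE loss
    loss.backward()
    opt.step()

# A collapsed run predicts roughly the constant 0.5, the mean of |x| on [-1, 1].
print(net(torch.tensor([[-1.0], [0.0], [1.0]])).detach().squeeze())
```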

  7. 1D Examples. Targets $f(x) = x\sin(5x)$ and $f(x) = 1_{\{x>0\}} + 0.2\sin(5x)$. [Figure: for each target, four panels (A-D) compare the target with the trained NN on $x \in [-1, 1]$.]

  8. 2D Examples. Target $f(x) = \begin{pmatrix} |x_1 + x_2| \\ |x_1 - x_2| \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix} \mathrm{ReLU}\!\left( \begin{pmatrix} 1 & 1 \\ -1 & -1 \\ 1 & -1 \\ -1 & 1 \end{pmatrix} x \right)$, an exact 2-layer ReLU representation of width 4. [Figure: panels A and B show surface plots of $y_1 = |x_1 + x_2|$ and the trained NN over $(x_1, x_2) \in [-1, 1]^2$.]
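
The width-4 identity above can be verified numerically; a quick NumPy check, using the matrices as reconstructed here (a sanity check, not the authors' code):

```python
# Numeric check of the exact width-4 ReLU representation of (|x1+x2|, |x1-x2|).
import numpy as np

W_in = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)   # first-layer weights
W_out = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)          # combining layer

x = np.random.default_rng(0).uniform(-1, 1, size=(2, 1000))          # random test points
f_net = W_out @ np.maximum(W_in @ x, 0.0)                            # 2-layer ReLU net
f_true = np.vstack([np.abs(x[0] + x[1]), np.abs(x[0] - x[1])])
print(np.max(np.abs(f_net - f_true)))                                # ~0: the identity is exact
```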

  9. Loss. Mean squared error (MSE) ⇒ collapse to the mean; mean absolute error (MAE) ⇒ collapse to the median. [Figure: panels A-C for the targets $y = |x|$, $y = x\sin(5x)$, and $y = 1_{\{x>0\}} + 0.2\sin(5x)$, each comparing the MSE- and MAE-trained NNs on $x \in [-1.5, 1.5]$.]
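
The mean-versus-median behaviour is just a statement about which constant minimizes each loss. A quick brute-force illustration follows (NumPy, with a grid over candidate constants; the grid and sampling are assumptions, not from the slides).

```python
# For a dead (constant) network, the best constant under MSE is the sample mean,
# and under MAE it is the sample median. Brute-force check over a grid of constants.
import numpy as np

x = np.linspace(-1.5, 1.5, 301)
f = (x > 0).astype(float) + 0.2 * np.sin(5 * x)   # one of the slide's targets

cs = np.linspace(f.min(), f.max(), 2001)          # candidate constants
mse = [np.mean((c - f) ** 2) for c in cs]
mae = [np.mean(np.abs(c - f)) for c in cs]

print("MSE-optimal constant:", cs[np.argmin(mse)], "vs mean:  ", f.mean())
print("MAE-optimal constant:", cs[np.argmin(mae)], "vs median:", np.median(f))
```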

  10. Overview: 1. Introduction; 2. Examples; 3. Theoretical analysis; 4. Asymmetric initialization (Shin).

  11. Setup. Feed-forward ReLU neural network $N^L: \mathbb{R}^{d_{\mathrm{in}}} \to \mathbb{R}^{d_{\mathrm{out}}}$ with $L$ layers. In layer $\ell$: $N_\ell$ neurons ($N_0 = d_{\mathrm{in}}$, $N_L = d_{\mathrm{out}}$), weight matrix $W^\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$, bias $b^\ell \in \mathbb{R}^{N_\ell}$. For an input $x \in \mathbb{R}^{d_{\mathrm{in}}}$, the neural activity in layer $\ell$ is $N^\ell(x) \in \mathbb{R}^{N_\ell}$, given by $N^1(x) = W^1 x + b^1$ and $N^\ell(x) = W^\ell \phi(N^{\ell-1}(x)) + b^\ell$ for $2 \le \ell \le L$, where $\phi$ is the ReLU.
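
Read concretely, the recursion is a plain forward pass. A minimal NumPy sketch (the parameter container and function name are illustrative, not the authors' code):

```python
# Minimal forward pass implementing N^1(x) = W^1 x + b^1 and
# N^l(x) = W^l phi(N^{l-1}(x)) + b^l for 2 <= l <= L, with phi = ReLU.
import numpy as np

def forward(params, x):
    """params: list of (W_l, b_l) pairs; x: vector of length d_in. Returns N^L(x)."""
    W1, b1 = params[0]
    n = W1 @ x + b1                      # N^1(x): affine map of the input
    for W, b in params[1:]:
        n = W @ np.maximum(n, 0.0) + b   # N^l(x) = W^l ReLU(N^{l-1}(x)) + b^l
    return n                             # network output N^L(x)
```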

  12. Setup. Training data $T = \{(x_i, f(x_i))\}_{1 \le i \le M}$ with $x_i \in D \equiv B_r(0) = \{x \in \mathbb{R}^{d_{\mathrm{in}}} : \|x\|_2 \le r\}$. Loss function $L(\theta, T) = \sum_{i=1}^{M} \ell(N^L(x_i; \theta), f(x_i))$, where $\theta = \{W^\ell, b^\ell\}_{1 \le \ell \le L}$.
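
The training loss is then a sum of per-sample losses; for example, with squared error as $\ell$ (a sketch; the choice of $\ell$ and the function name are illustrative):

```python
# Empirical loss L(theta, T) = sum_i ell(N^L(x_i; theta), f(x_i)),
# with ell = squared error as an example; net is any callable x -> N^L(x).
import numpy as np

def empirical_loss(net, xs, ys):
    """xs: inputs x_i; ys: targets f(x_i); net: the network N^L( . ; theta)."""
    return sum(np.sum((net(x) - y) ** 2) for x, y in zip(xs, ys))
```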

  13. $N^L$ will eventually die in probability as $L \to \infty$. Theorem 1. Let $N^L(x)$ be a ReLU NN with $L$ layers of widths $N_1, \dots, N_L$. Suppose (1) the weights are independently initialized from a distribution symmetric around 0, and (2) the biases are either drawn from a symmetric distribution or set to zero. Then $P(N^L(x) \text{ dies}) \le 1 - \prod_{\ell=1}^{L-1} \left(1 - (1/2)^{N_\ell}\right)$. Furthermore, assuming $N_\ell = N$ for all $\ell$, $\lim_{L \to \infty} P(N^L(x) \text{ dies}) = 1$ and $\lim_{N \to \infty} P(N^L(x) \text{ dies}) = 0$.
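
The right-hand side of the bound is easy to evaluate; for instance (the widths and depths below are chosen for illustration only, not taken from the slides):

```python
# Evaluate the right-hand side of the Theorem 1 bound,
# 1 - prod_{l=1}^{L-1} (1 - (1/2)^{N_l}), for constant width N_l = N.
def bound_value(L, N):
    # (L - 1) hidden layers, each of width N
    return 1.0 - (1.0 - 0.5 ** N) ** (L - 1)

for L in (3, 10, 30, 100):
    print(f"N = 2, L = {L:3d}: bound value = {bound_value(L, 2):.3f}")
```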

  14. Proof. Lemma 1. Let $N^L(x)$ be a ReLU NN with $L$ layers. Suppose the weights are drawn independently from distributions satisfying $P(W^\ell_j z = 0) = 0$ for any nonzero $z \in \mathbb{R}^{N_{\ell-1}}$ and any $j$-th row $W^\ell_j$ of $W^\ell$. Then $P(N^L(x) \text{ dies}) = P(\exists\, \ell \in \{1, \dots, L-1\} \text{ s.t. } \phi(N^\ell(x)) = 0 \ \forall x \in D)$. For a given $x$, $P\big(\tilde{W}^j_s\, \phi(N^{j-1}(x)) + b^j_s < 0 \,\big|\, \tilde{A}^c_{j-1,x}\big) = \tfrac{1}{2}$, where $\tilde{A}^c_{\ell,x} = \{\forall\, 1 \le j < \ell,\ \phi(N^j(x)) \ne 0\}$.

  15. Dead Networks Would Collapse. Theorem 2. Suppose the ReLU NN dies. Then for any loss $L$, the network is optimized to a constant function by any gradient-based method. Proof. By Lemma 1, $\exists\, \ell \in \{1, \dots, L-1\}$ such that $\phi(N^\ell(x)) = 0$ for all $x \in D$. Hence the gradients of $L$ with respect to the weights and biases in layers $1, \dots, \ell$ vanish. Assuming the training data are i.i.d. from $P_D$, the optimized network is $N^L(x; \theta^*) = \operatorname{argmin}_{c \in \mathbb{R}^{N_L}} \mathbb{E}_{x \sim P_D}[\ell(c, f(x))]$. MSE/$L^2$ loss ⇒ $\mathbb{E}[f(x)]$; MAE/$L^1$ loss ⇒ the median of $f(x)$.
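
For the MSE case with a scalar output, the optimal constant follows from the first-order condition (a standard derivation, not specific to the slides):

```latex
% Optimal constant under the squared loss: set the derivative in c to zero.
\frac{d}{dc}\,\mathbb{E}_{x \sim P_D}\!\left[(c - f(x))^2\right]
  = 2\left(c - \mathbb{E}_{x \sim P_D}[f(x)]\right) = 0
  \quad\Longrightarrow\quad
  c^{*} = \mathbb{E}_{x \sim P_D}[f(x)].
```

For the MAE case, the analogous subgradient condition $\mathbb{E}_{x \sim P_D}[\operatorname{sign}(c - f(x))] = 0$ forces $c^{*}$ to be a median of $f(x)$.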
