Towards Demystifying Overparameterization in Deep Learning


  1. Towards Demystifying Overparameterization in Deep Learning. Mahdi Soltanolkotabi, Department of Electrical and Computer Engineering. Mathematics of Imaging Workshop #3, Henri Poincaré Institute, April 4, 2019.

  2. Collaborators: Samet Oymak and Mingchen Li

  3. Motivation (Theory)

  4. Many success stories: neural networks are very effective at learning from data.

  5. Lots of hype

  6. Some failures

  7. Need a more principled understanding: deep learning-based AI is increasingly used in human-facing services. Challenges: Optimization: why can they fit? Generalization: why can they predict? Architecture: which neural nets?

  8. This talk: overparameterization without overfitting. The mystery: # of parameters >> # of training data points.

  9. Surprising experiment I (stolen from B. Recht): p parameters, n = 50,000 training samples, d = 3072 feature size, and 10 classes.

  10. Surprising experiment II: overfitting to corruption. Add corruption: corrupt a fraction of the training labels by replacing each with another random label; no corruption on the test labels.

  11. Surprising experiment III: robustness. Repeat the same experiment but stop early.

  12. Surprising experiment III: robustness (continued). Repeat the same experiment but stop early.

  13. Benefits of overparameterization for neural networks. Benefit I: tractable nonconvex optimization. Benefit II: robustness to corruption with early stopping.

  14. Benefit I: Tractable nonconvex optimization

  15. One-hidden-layer network: $y_i = v^T \phi(W x_i)$.

  16. Theory for smooth activations. Data set $\{(x_i, y_i)\}_{i=1}^n$ with $\|x_i\|_{\ell_2} = 1$. Objective: $\min_W \; L(W) := \frac{1}{2}\sum_{i=1}^{n}\left(v^T\phi(W x_i) - y_i\right)^2$. [Figure: plot of the activation $\phi$ over $[-6, 6]$.]

  17. Theory for smooth activations. Same setup: data set $\{(x_i, y_i)\}_{i=1}^n$ with $\|x_i\|_{\ell_2} = 1$ and objective $\min_W L(W) := \frac{1}{2}\sum_{i=1}^{n}(v^T\phi(W x_i) - y_i)^2$. Set $v$ at random or balanced (half $+$, half $-$). Run gradient descent $W_{\tau+1} = W_\tau - \mu_\tau \nabla L(W_\tau)$ with random initialization. Theorem (Oymak and Soltanolkotabi 2019): Assume smooth activations with $|\phi'(z)| \le B$ and $|\phi''(z)| \le B$, overparameterization $\sqrt{kd} \gtrsim \kappa(X)\, n$, and initialization with i.i.d. $N(0,1)$ entries. Then, with high probability: zero training error, $L(W_\tau) \le \left(1 - c\,\frac{d}{n}\right)^{2\tau} L(W_0)$; and the iterates remain close to initialization, $\|W_\tau - W_0\|_F \lesssim \sqrt{\frac{n}{kd}}\,\|W_0\|_F$.
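As a concrete illustration of this setup, here is a minimal NumPy sketch (my own illustration, not the authors' code) of gradient descent on the one-hidden-layer least-squares objective, with softplus standing in for a generic smooth activation. The sizes, step size, and iteration count are arbitrary choices; on a typical draw the training loss should be driven to essentially zero while $W$ moves only a modest relative distance from $W_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 30, 20, 1000                     # kd >> n: many more parameters than samples (illustrative sizes)

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)          # unit-norm inputs, ||x_i|| = 1
y = rng.normal(size=n)

# Balanced output layer v (half +, half -), kept fixed during training.
v = np.concatenate([np.ones(k // 2), -np.ones(k // 2)]) / np.sqrt(k)

def phi(z):                                # softplus: smooth, with bounded phi' and phi''
    return np.logaddexp(0.0, z)

def dphi(z):                               # derivative of softplus = sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def loss(W):
    r = phi(X @ W.T) @ v - y               # residuals f(W) - y
    return 0.5 * np.dot(r, r)

def grad(W):
    Z = X @ W.T                            # (n, k) pre-activations
    r = phi(Z) @ v - y                     # (n,) residuals
    return (dphi(Z) * v).T @ (r[:, None] * X)   # dL/dW, shape (k, d)

W0 = rng.normal(size=(k, d))               # i.i.d. N(0, 1) initialization
W, mu = W0.copy(), 0.5
for _ in range(5000):
    W -= mu * grad(W)

print("training loss:", loss(W))
print("||W - W0||_F / ||W0||_F:", np.linalg.norm(W - W0) / np.linalg.norm(W0))
```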

  18. Dependence on data? Diversity of the input data is important. Stack the data as $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$ and define $\kappa(X) := \sqrt{\frac{d}{n}}\,\frac{\|X\|}{\lambda(X)}$. Definition (neural network covariance matrix and eigenvalue): the neural net covariance matrix is $\Sigma(X) := \frac{1}{k}\,\mathbb{E}_{W_0}\!\left[J(W_0)J^T(W_0)\right] = \mathbb{E}_{w}\!\left[\left(XX^T\right) \odot \left(\phi'(Xw)\,\phi'(Xw)^T\right)\right]$, and its eigenvalue is $\lambda(X) := \lambda_{\min}\!\left(\Sigma(X)\right)$.
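A small Monte-Carlo sketch (my own, not from the slides) of this definition: estimate $\Sigma(X)$ by averaging $(XX^T) \odot (\phi'(Xw)\phi'(Xw)^T)$ over random Gaussian draws of $w$ and read off $\lambda(X)$. The sigmoid below stands in for $\phi'$ of a smooth activation, and the sizes and sample count are arbitrary (the eigenvalue estimate is noisy for a finite number of draws).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, num_mc = 40, 25, 50000

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm rows x_i

def dphi(z):                                    # phi' of softplus, i.e. the sigmoid
    return 1.0 / (1.0 + np.exp(-z))

# Sigma(X) = E_w[(X X^T) hadamard (phi'(Xw) phi'(Xw)^T)], w ~ N(0, I_d), estimated by Monte Carlo.
W = rng.normal(size=(num_mc, d))
A = dphi(X @ W.T)                               # column m holds phi'(X w_m)
Sigma = (X @ X.T) * (A @ A.T) / num_mc          # average of the Hadamard products

lam = np.linalg.eigvalsh(Sigma)[0]              # lambda(X) = lambda_min(Sigma(X))
print("estimated lambda(X):", lam)
```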

  19. Hermite expansion. Lemma: let $\{\mu_r(\phi')\}_{r=0}^{\infty}$ be the Hermite coefficients of $\phi'$. Then $\Sigma(X) = \sum_{r=0}^{+\infty} \mu_r^2(\phi')\,\underbrace{(XX^T)\odot\cdots\odot(XX^T)}_{r+1\ \text{times}} \;\succeq\; \left(\mathbb{E}[\phi''(g)]\right)^2 (XX^T)\odot(XX^T)$ (arbitrary activation $\Leftrightarrow$ quadratic activation). Conclusion: for generic data, e.g. $x_i$ i.i.d. uniform on the unit sphere, $\kappa(X)$ scales like a constant.
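The scalar identity behind this lemma can be sanity-checked numerically. The sketch below (my own check, with the sigmoid standing in for $\phi'$) compares $\mathbb{E}_w[\phi'(u)\phi'(v)]$ for correlated Gaussians $u, v$ with correlation $\rho$ against the truncated Hermite series $\sum_r \mu_r^2(\phi')\,\rho^r$; each entry of $\Sigma(X)$ then just carries an extra factor $x_i^T x_j$, which is where the extra Hadamard factor of $XX^T$ comes from.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

rng = np.random.default_rng(2)

def dphi(z):                                    # phi' of softplus = sigmoid
    return 1.0 / (1.0 + np.exp(-z))

# Hermite coefficients mu_r(phi') via Gauss-Hermite quadrature (probabilists' weight exp(-x^2/2)).
t, w = He.hermegauss(80)
w = w / np.sqrt(2 * np.pi)                      # now sum(w * f(t)) approximates E[f(g)], g ~ N(0, 1)
R = 12
mu = np.array([np.sum(w * dphi(t) * He.hermeval(t, np.eye(R)[r]))
               / np.sqrt(math.factorial(r)) for r in range(R)])

rho = 0.3                                       # correlation <x_i, x_j> between two unit-norm inputs
g1 = rng.normal(size=1_000_000)
g2 = rho * g1 + np.sqrt(1 - rho**2) * rng.normal(size=g1.size)
lhs = np.mean(dphi(g1) * dphi(g2))              # E[phi'(u) phi'(v)] by Monte Carlo
rhs = np.sum(mu**2 * rho ** np.arange(R))       # truncated Hermite (Mehler) series

print(lhs, rhs)                                 # the two should agree to about three decimal places
```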

  20. Theory for ReLU activations. Same setup: data set $\{(x_i, y_i)\}_{i=1}^n$ with $\|x_i\|_{\ell_2} = 1$ and objective $\min_W L(W) := \frac{1}{2}\sum_{i=1}^{n}(v^T\phi(W x_i) - y_i)^2$. Set $v$ at random or balanced (half $+$, half $-$). Run gradient descent $W_{\tau+1} = W_\tau - \mu_\tau \nabla L(W_\tau)$ with random initialization. Theorem (Oymak and Soltanolkotabi 2019): Assume the ReLU activation $\phi(z) = \mathrm{ReLU}(z) = \max(0, z)$, overparameterization $\sqrt{kd} \gtrsim \kappa^3(X)\, n$, and initialization with i.i.d. $N(0,1)$ entries. Then, with high probability: zero training error, $L(W_\tau) \le \left(1 - c\,\frac{d}{n}\right)^{2\tau} L(W_0)$; and the iterates remain close to initialization, $\|W_\tau - W_0\|_F \lesssim \sqrt{\frac{n}{kd}}\,\|W_0\|_F$.
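The gradient-descent sketch shown after slide 17 carries over to this case by swapping in the ReLU and its almost-everywhere derivative; only these two helper functions change (this substitution is my own illustration, not the authors' code).

```python
import numpy as np

def phi(z):                     # ReLU activation
    return np.maximum(z, 0.0)

def dphi(z):                    # a.e. derivative of ReLU (using the value 0 at z = 0)
    return (z > 0).astype(float)
```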

  21. Theory for SGD. Data set $\{(x_i, y_i)\}_{i=1}^n$ with $\|x_i\|_{\ell_2} = 1$ and objective $\min_W L(W) := \frac{1}{2}\sum_{i=1}^{n}(v^T\phi(W x_i) - y_i)^2$. Set $v$ at random or balanced (half $+$, half $-$). Run stochastic gradient descent, $W_{\tau+1} = W_\tau - \mu_\tau \nabla L(W_\tau)$ with the gradient evaluated on a randomly sampled training point, starting from a random initialization. Theorem (Oymak and Soltanolkotabi 2019): Assume smooth activations with $|\phi'(z)| \le B$ and $|\phi''(z)| \le B$, overparameterization $\sqrt{kd} \gtrsim \kappa(X)\, n$, and initialization with i.i.d. $N(0,1)$ entries. Then, with high probability: zero training error, $\mathbb{E}[L(W_\tau)] \le \left(1 - c\,\frac{d}{n^2}\right)^{2\tau} L(W_0)$; and the iterates remain close to initialization, $\|W_\tau - W_0\|_F \lesssim \sqrt{\frac{n}{kd}}\,\|W_0\|_F$.
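A minimal sketch of the stochastic update the theorem refers to: at each iteration sample one training example uniformly at random and step along its gradient. The helper below is my own illustration (the objective and notation follow the slide; the single-sample sampling scheme and the function name are my choices).

```python
import numpy as np

def sgd_step(W, X, y, v, phi, dphi, mu, rng):
    """One SGD step on a single randomly chosen example i:
    W <- W - mu * grad_W of (1/2) * (v^T phi(W x_i) - y_i)^2."""
    i = rng.integers(len(y))
    x_i = X[i]                              # (d,)
    z = W @ x_i                             # (k,) pre-activations on example i
    r = v @ phi(z) - y[i]                   # scalar residual on example i
    return W - mu * r * np.outer(v * dphi(z), x_i)
```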

  22. Proof Sketch

  23. Prelude: over-parametrized linear least-squares. $\min_{\theta \in \mathbb{R}^p} L(\theta) := \frac{1}{2}\|X\theta - y\|_{\ell_2}^2$ with $X \in \mathbb{R}^{n \times p}$ and $n \le p$. Gradient descent starting from $\theta_0$ has three properties: global convergence; it converges to the global optimum which is closest to $\theta_0$; and the total gradient path length is relatively short.
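These properties are easy to see numerically in the linear case. The sketch below (my own, with arbitrary sizes) runs gradient descent from $\theta_0$ on an under-determined least-squares problem and checks that it lands on the particular global optimum closest to $\theta_0$, namely $\theta_0 + X^{\dagger}(y - X\theta_0)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 100                                   # over-parameterized: n <= p
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

theta0 = rng.normal(size=p)
theta = theta0.copy()
eta = 1.0 / np.linalg.norm(X, 2) ** 2            # safe step size for (1/2)||X theta - y||^2
for _ in range(2000):
    theta -= eta * X.T @ (X @ theta - y)         # gradient descent

# The global optimum closest to theta0 (iterates never leave theta0 + row space of X).
closest = theta0 + np.linalg.pinv(X) @ (y - X @ theta0)

print("training residual:", np.linalg.norm(X @ theta - y))
print("distance to closest optimum:", np.linalg.norm(theta - closest))
```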

  24. Over-parametrized nonlinear least-squares. $\min_{\theta \in \mathbb{R}^p} L(\theta) := \frac{1}{2}\|f(\theta) - y\|_{\ell_2}^2$, where $y := [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^n$, $f(\theta) := [f(x_1;\theta), f(x_2;\theta), \ldots, f(x_n;\theta)]^T \in \mathbb{R}^n$, and $n \le p$. Gradient descent: start from some initial parameter $\theta_0$ and run $\theta_{\tau+1} = \theta_\tau - \eta_\tau \nabla L(\theta_\tau)$, with $\nabla L(\theta) = J(\theta)^T\left(f(\theta) - y\right)$. Here, $J(\theta) \in \mathbb{R}^{n \times p}$ is the Jacobian matrix with entries $J_{ij} = \frac{\partial f(x_i;\theta)}{\partial \theta_j}$.
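The gradient formula $\nabla L(\theta) = J(\theta)^T(f(\theta) - y)$ can be checked on any toy nonlinear map. The sketch below uses $f(\theta) = \tanh(A\theta)$ for a random matrix $A$ (my own toy choice, not the network from the talk) and compares the formula, with a finite-difference Jacobian, against a direct finite-difference gradient of the loss.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 5, 8
A = rng.normal(size=(n, p))
y = rng.normal(size=n)

def f(theta):                                  # a toy smooth map R^p -> R^n
    return np.tanh(A @ theta)

def loss(theta):
    return 0.5 * np.sum((f(theta) - y) ** 2)

def jacobian_fd(theta, eps=1e-6):              # J_ij = d f_i / d theta_j by central differences
    cols = [(f(theta + eps * e) - f(theta - eps * e)) / (2 * eps) for e in np.eye(p)]
    return np.stack(cols, axis=1)              # shape (n, p)

theta = rng.normal(size=p)
grad_via_J = jacobian_fd(theta).T @ (f(theta) - y)      # gradient formula from the slide

eps = 1e-6
grad_fd = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                    for e in np.eye(p)])                # direct finite-difference gradient

print("max discrepancy:", np.max(np.abs(grad_via_J - grad_fd)))   # should be tiny
```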

  25. Key lemma. Lemma: make the following assumptions on the ball $B(\theta_0, R)$ with $R := \frac{4\|f(\theta_0) - y\|_{\ell_2}}{\alpha}$: Jacobian at initialization, $\sigma_{\min}(J(\theta_0)) \ge 2\alpha$; bounded Jacobian spectrum, $\|J(\theta)\| \le \beta$; Lipschitz Jacobian, $\|J(\tilde\theta) - J(\theta)\| \le L\,\|\tilde\theta - \theta\|_F$; small initial residual, $\|f(\theta_0) - y\|_{\ell_2} \le \frac{\alpha^2}{4L}$. Then, using step size $\eta \le \frac{1}{2\beta^2}$: global geometric convergence, $\|f(\theta_\tau) - y\|_{\ell_2}^2 \le \left(1 - \frac{\eta\alpha^2}{2}\right)^{\tau}\|f(\theta_0) - y\|_{\ell_2}^2$; iterates stay close to init., $\|\theta_\tau - \theta_0\|_{\ell_2} \le \frac{4}{\alpha}\|f(\theta_0) - y\|_{\ell_2} \le \frac{4\beta}{\alpha}\|\theta^* - \theta_0\|_{\ell_2}$; total gradient path bounded, $\sum_{\tau=0}^{\infty}\|\theta_{\tau+1} - \theta_\tau\|_{\ell_2} \le \frac{4}{\alpha}\|f(\theta_0) - y\|_{\ell_2}$. Key idea: track the dynamics of $V_\tau := \|r_\tau\|_{\ell_2} + \frac{1 - \eta\beta^2}{2}\sum_{t=0}^{\tau-1}\|\theta_{t+1} - \theta_t\|_{\ell_2}$.
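The lemma's conclusions can be observed on a small realizable problem. The sketch below (my own toy instance, again with $f(\theta) = \tanh(A\theta)$) takes $\alpha$ from $\sigma_{\min}(J(\theta_0))$ and $\beta$ from $\|J(\theta_0)\|$, runs gradient descent with step size $\eta = \frac{1}{2\beta^2}$, and compares the accumulated gradient path length against the $\frac{4}{\alpha}\|f(\theta_0) - y\|_{\ell_2}$ bound.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 5, 100                                    # heavily over-parameterized toy problem
A = rng.normal(size=(n, p)) / np.sqrt(p)
y = np.tanh(A @ rng.normal(size=p))              # realizable labels, so a zero-residual solution exists

def f(theta):
    return np.tanh(A @ theta)

def jac(theta):                                  # J(theta) = diag(1 - tanh^2(A theta)) A
    return (1.0 - np.tanh(A @ theta) ** 2)[:, None] * A

theta0 = rng.normal(size=p)
alpha = 0.5 * np.linalg.svd(jac(theta0), compute_uv=False)[-1]   # so sigma_min(J(theta0)) = 2 * alpha
beta = np.linalg.norm(jac(theta0), 2)            # rough stand-in for sup ||J(theta)|| on the ball

theta, eta, path = theta0.copy(), 1.0 / (2 * beta**2), 0.0
r0 = np.linalg.norm(f(theta0) - y)
for _ in range(5000):
    step = eta * jac(theta).T @ (f(theta) - y)   # gradient of (1/2) ||f(theta) - y||^2
    path += np.linalg.norm(step)
    theta -= step

print("final residual:", np.linalg.norm(f(theta) - y))
print("path length:", path, " bound 4*||r0||/alpha:", 4 * r0 / alpha)
```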

  26. Proof sketch (SGD). Challenge: show that SGD remains in the local neighborhood. Attempt I: show that $\|\theta_\tau - \theta_0\|_{\ell_2}$ is a supermartingale (see also [Tan and Vershynin 2017]). Attempt II: show that $\|f(\theta_\tau) - y\|_{\ell_2} + \lambda\|\theta_\tau - \theta_0\|_{\ell_2}$ is a supermartingale. Final attempt: show that $\frac{1}{K}\sum_{i=1}^{K}\|\theta_\tau - v_i\|_{\ell_2} + \frac{3\eta}{n}\left\|J^T(\theta_\tau)\left(f(\theta_\tau) - y\right)\right\|_{\ell_2}$ is a supermartingale. Here, $\{v_i\}_{i=1}^{K}$ is a very fine cover of $B(\theta_0, R)$.

  27. Over-parametrized nonlinear least-squares for neural nets. $\min_{W \in \mathbb{R}^{k \times d}} L(W) := \frac{1}{2}\|f(W) - y\|_{\ell_2}^2$, where $y := [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^n$, $f(W) := [f(W, x_1), f(W, x_2), \ldots, f(W, x_n)]^T \in \mathbb{R}^n$, and $n \le kd$. Linearization via the Jacobian: $J(W) = \left(\phi'(XW^T)\,\mathrm{diag}(v)\right) * X$, where $*$ denotes the row-wise Khatri-Rao product.

  28. Key techniques. Hadamard product: $J(W)J^T(W) = \left(\phi'(XW^T)\,\phi'(WX^T)\right) \odot \left(XX^T\right)$. Theorem (Schur 1913): for two PSD matrices $A, B \in \mathbb{R}^{n \times n}$, $\lambda_{\min}(A \odot B) \ge \left(\min_i B_{ii}\right)\lambda_{\min}(A)$ and $\lambda_{\max}(A \odot B) \le \left(\max_i B_{ii}\right)\lambda_{\max}(A)$. Random matrix theory: $J(W)J^T(W) = \sum_{\ell=1}^{k}\left(\phi'(Xw_\ell)\,\phi'(Xw_\ell)^T\right) \odot \left(XX^T\right)$.
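Both displays are easy to verify on a tiny instance. The sketch below (my own check, with ±1 output weights so that $\mathrm{diag}(v)^2 = I$, and arbitrary small sizes) builds the Jacobian of $f_i(W) = \sum_\ell v_\ell\,\phi(w_\ell^T x_i)$ explicitly, confirms the Hadamard-product identity for $J(W)J(W)^T$, and checks Schur's lower bound on $\lambda_{\min}$ of the Hadamard product.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, k = 4, 6, 5                         # tiny sizes; n <= d keeps X X^T positive definite

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.normal(size=(k, d))
v = rng.choice([-1.0, 1.0], size=k)       # +-1 output weights, so diag(v)^2 = I

def dphi(z):                              # phi' of softplus
    return 1.0 / (1.0 + np.exp(-z))

Z = X @ W.T                               # (n, k) pre-activations

# Explicit Jacobian of f_i(W) = sum_l v_l phi(w_l . x_i) w.r.t. vec(W): J[i, (l, j)] = v_l phi'(z_il) x_ij
J = np.einsum('il,ij->ilj', dphi(Z) * v, X).reshape(n, k * d)

lhs = J @ J.T
rhs = (dphi(Z) @ dphi(Z).T) * (X @ X.T)   # Hadamard-product form from the slide
print("identity error:", np.max(np.abs(lhs - rhs)))

# Schur (1913): lambda_min(A . B) >= (min_i B_ii) * lambda_min(A) for PSD A, B.
A, B = X @ X.T, dphi(Z) @ dphi(Z).T
print(np.linalg.eigvalsh(A * B)[0], ">=", B.diagonal().min() * np.linalg.eigvalsh(A)[0])
```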

  29. Side corollary: nonconvex matrix recovery. Features: $A_1, A_2, \ldots, A_n \in \mathbb{R}^{d \times d}$; labels: $y_1, y_2, \ldots, y_n$. Solve the nonconvex matrix factorization problem $\min_{U \in \mathbb{R}^{d \times r}} \frac{1}{2}\sum_{i=1}^{n}\left(y_i - \langle A_i, UU^T\rangle\right)^2$. Theorem (Oymak and Soltanolkotabi 2018): assume i.i.d. Gaussian $A_i$, any labels $y_i$, and initialization at a well-conditioned matrix $U_0$. Then the gradient descent iterates $U_\tau$ converge at a geometric rate to a nearby global optimum as soon as $n \le dr$. Burer-Monteiro and many others require $r \gtrsim \sqrt{n}$; for Gaussian $A_i$ we allow $r \gtrsim \frac{n}{d}$. When $n \approx d r_0$: Burer-Monteiro needs $r \gtrsim \sqrt{d r_0}$, ours needs $r \gtrsim r_0$.
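A minimal gradient-descent sketch of this nonconvex factorization problem (my own illustration: the sizes, step size, and iteration count are arbitrary, and plain gradient descent from a well-conditioned $U_0$ stands in for the setting of the theorem; on a typical draw the misfit is driven toward zero).

```python
import numpy as np

rng = np.random.default_rng(7)
d, r, n = 10, 4, 30                              # n <= d*r parameters in the factor U

A = rng.normal(size=(n, d, d))                   # i.i.d. Gaussian measurement matrices A_i
y = rng.normal(size=n)                           # arbitrary labels y_i

def residual(U):
    return np.einsum('nij,ij->n', A, U @ U.T) - y    # <A_i, U U^T> - y_i

def grad(U):
    res = residual(U)
    S = np.einsum('n,nij->ij', res, A)           # sum_i res_i * A_i
    return (S + S.T) @ U                         # gradient of (1/2) sum_i (<A_i, UU^T> - y_i)^2

U = np.linalg.qr(rng.normal(size=(d, r)))[0]     # well-conditioned initialization U_0
eta = 1e-3
for _ in range(20000):
    U -= eta * grad(U)

print("final misfit ||<A_i, UU^T> - y||:", np.linalg.norm(residual(U)))
```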

  30. Previous work. Unrealistic quadratic activations: [Soltanolkotabi, Javanmard, Lee 2018] and [Venturi, Bandeira, Bruna, ...]. Smooth activations: [Du, Lee, Li, Wang, Zhai 2018]: $kd \gtrsim n^2$ versus $k \gtrsim n^4$. ReLU activation: [Du et al. 2018]: $k \gtrsim \frac{n^4}{d^3}$ versus $k \gtrsim n^6$. Separation: [Li and Liang 2018], [Allen-Zhu, Li, Song 2018]: $k \gtrsim \frac{n^{12}}{\delta^4}$ versus $k \gtrsim n^{25}$ (????). Begin to move beyond "lazy training" [Chizat & Bach, 2018]; faster convergence rates. Deep networks: [Du, Lee, Li, Wang, Zhai 2018] and [Allen-Zhu, Li, Song 2018]. Mean field analysis for infinitely wide networks: [Mei et al., 2018]; [Chizat & Bach, 2018]; [Sirignano & Spiliopoulos, 2018]; [Rotskoff & Vanden-Eijnden, 2018]; [Wei et al., 2018].
