Towards Demystifying Overparameterization in Deep Learning
Mahdi Soltanolkotabi, Department of Electrical and Computer Engineering
Mathematics of Imaging Workshop #3, Henri Poincaré Institute, April 4, 2019
Collaborators: Samet Oymak and Mingchen Li
Motivation (Theory)
Many success stories: neural networks are very effective at learning from data
Lots of hype
Some failures
Need a more principled understanding: deep learning-based AI is increasingly used in human-facing services. Challenges: Optimization: why can they fit? Generalization: why can they predict? Architecture: which neural nets?
This talk: Overparameterization without overfitting. The mystery: # of parameters >> # of training data
Surprising experiment I (stolen from B. Recht): p parameters, n = 50,000 training samples, d = 3072 feature size, and 10 classes
Surprising experiment II - Overfitting to corruption. Add corruption: corrupt a fraction of the training labels by replacing each with another random label; test labels are left uncorrupted.
Surprising experiment III - Robustness: repeat the same experiment but stop early.
Benefits of overparameterization for neural networks. Benefit I: tractable nonconvex optimization. Benefit II: robustness to corruption with early stopping.
Benefit I: Tractable nonconvex optimization
One-hidden-layer model: $y_i = v^T \phi(W x_i)$
Theory for smooth activations

Data set $\{(x_i, y_i)\}_{i=1}^n$ with $\|x_i\|_{\ell_2} = 1$:
$$\min_W \mathcal{L}(W) := \frac{1}{2}\sum_{i=1}^n \left(v^T \phi(W x_i) - y_i\right)^2$$
(Figure: plot of the smooth activation $\phi$ over $[-6, 6]$.)
Set $v$ at random or balanced (half $+$, half $-$); only $W$ is trained. Run gradient descent $W_{\tau+1} = W_\tau - \mu_\tau \nabla\mathcal{L}(W_\tau)$ with random initialization.

Theorem (Oymak and Soltanolkotabi 2019). Assume
- Smooth activations: $|\phi'(z)| \le B$ and $|\phi''(z)| \le B$
- Overparameterization: $\sqrt{kd} \gtrsim \kappa(X)\, n$
- Initialization with i.i.d. $\mathcal{N}(0,1)$ entries

Then, with high probability,
- Zero training error: $\mathcal{L}(W_\tau) \le \left(1 - c\,\frac{d}{n}\right)^{2\tau} \mathcal{L}(W_0)$
- Iterates remain close to initialization: $\frac{\|W_\tau - W_0\|_F}{\|W_0\|_F} \lesssim \sqrt{\frac{n}{kd}}$
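The setting above is easy to simulate. Below is a minimal numpy sketch (not the talk's code; the sizes, the tanh activation, and the step size are illustrative assumptions) that trains a wide one-hidden-layer network with fixed balanced $v$ by full-batch gradient descent, then prints the final training loss and the relative distance of $W$ from its initialization. With $kd \gg n$ the loss should be driven toward zero while $W$ moves only a modest relative distance from $W_0$.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 50, 20, 2000                 # illustrative sizes with k*d >> n
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm features, ||x_i||_2 = 1
    y = rng.normal(size=n)                 # arbitrary labels

    v = np.concatenate([np.ones(k // 2), -np.ones(k // 2)]) / np.sqrt(k)  # balanced output weights, fixed
    W0 = rng.normal(size=(k, d))           # i.i.d. N(0,1) initialization
    W = W0.copy()

    phi = np.tanh                          # a smooth activation with bounded phi' and phi''
    dphi = lambda z: 1.0 - np.tanh(z) ** 2

    eta = 0.2                              # illustrative step size
    for _ in range(5000):                  # full-batch gradient descent on L(W)
        Z = X @ W.T                        # pre-activations, shape (n, k)
        r = phi(Z) @ v - y                 # residual f(W) - y
        W -= eta * v[:, None] * ((dphi(Z) * r[:, None]).T @ X)   # dL/dW step, shape (k, d)

    r_final = phi(X @ W.T) @ v - y
    print("final training loss:", 0.5 * np.sum(r_final ** 2))
    print("relative move ||W - W0||_F / ||W0||_F:", np.linalg.norm(W - W0) / np.linalg.norm(W0))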
Dependence on data? Diversity of the input data is important...
$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} \in \mathbb{R}^{n \times d}, \qquad \kappa(X) := \frac{\|X\|}{\lambda(X)}$$
Definition (Neural network covariance matrix and eigenvalue). Neural net covariance matrix
$$\Sigma(X) := \frac{1}{k}\,\mathbb{E}_{W_0}\!\left[\mathcal{J}(W_0)\,\mathcal{J}^T(W_0)\right] = \mathbb{E}_{w}\!\left[\left(XX^T\right)\odot\left(\phi'(Xw)\,\phi'(Xw)^T\right)\right].$$
Eigenvalue: $\lambda(X) := \lambda_{\min}\left(\Sigma(X)\right)$
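To make $\Sigma(X)$, $\lambda(X)$, and $\kappa(X)$ concrete, here is a small Monte Carlo sketch (illustrative sizes; tanh is an assumed stand-in for a generic smooth activation) that estimates the neural net covariance matrix and the two quantities entering the overparameterization condition.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 40, 60                                     # illustrative sizes (d >= n keeps XX^T well conditioned)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    dphi = lambda z: 1.0 - np.tanh(z) ** 2            # phi' for an assumed tanh activation

    # Sigma(X) = E_w[(X X^T) Hadamard (phi'(Xw) phi'(Xw)^T)], estimated with draws w ~ N(0, I_d)
    num_w = 20000
    Wsamp = rng.normal(size=(d, num_w))               # each column is one draw of w
    A = dphi(X @ Wsamp)                               # phi'(Xw) for every draw, shape (n, num_w)
    Sigma = (X @ X.T) * (A @ A.T / num_w)             # elementwise (Hadamard) product with XX^T

    lam = np.linalg.eigvalsh(Sigma)[0]                # lambda(X) = smallest eigenvalue of Sigma(X)
    kappa = np.linalg.norm(X, 2) / lam                # kappa(X) = ||X|| / lambda(X), per the definition above
    print("lambda(X) ~", lam, "  kappa(X) ~", kappa)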
Hermite expansion

Lemma. Let $\{\mu_r(\phi')\}_{r=0}^{\infty}$ be the Hermite coefficients of $\phi'$. Then
$$\Sigma(X) = \sum_{r=0}^{+\infty} \mu_r^2(\phi')\,\underbrace{\left(XX^T\right)\odot\cdots\odot\left(XX^T\right)}_{r+1} \succeq \underbrace{\mu_2^2(\phi')}_{\left(\mathbb{E}[\phi''(g)]\right)^2}\,\left(XX^T\right)\odot\left(XX^T\right)$$
(arbitrary activation ⇔ quadratic activation)

Conclusion: for generic data, e.g. $x_i$ i.i.d. uniform on the unit sphere, $\kappa(X)$ scales like a constant.
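As a quick sanity check on the expansion, for the quadratic activation $\phi(z) = z^2$ we have $\phi'(z) = 2z$, so only one Hermite term survives and $\Sigma(X)$ reduces to $4\,(XX^T)\odot(XX^T)$. The sketch below (illustrative sizes, not from the talk) compares the Monte Carlo estimate of $\Sigma(X)$ with this closed form.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 30, 40
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    # For phi(z) = z^2, phi'(z) = 2z, hence
    #   Sigma(X) = E_w[(XX^T) o ((2Xw)(2Xw)^T)] = 4 (XX^T) o (XX^T)   (o = Hadamard product)
    num_w = 200000
    Wsamp = rng.normal(size=(d, num_w))
    A = 2.0 * (X @ Wsamp)                              # phi'(Xw) for each draw
    Sigma_mc = (X @ X.T) * (A @ A.T / num_w)           # Monte Carlo estimate of Sigma(X)

    Sigma_exact = 4.0 * (X @ X.T) ** 2                 # elementwise square = (XX^T) o (XX^T)
    err = np.linalg.norm(Sigma_mc - Sigma_exact) / np.linalg.norm(Sigma_exact)
    print("relative Monte Carlo error:", err)          # shrinks like 1/sqrt(num_w)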
Theory for ReLU activations

Data set $\{(x_i, y_i)\}_{i=1}^n$ with $\|x_i\|_{\ell_2} = 1$:
$$\min_W \mathcal{L}(W) := \frac{1}{2}\sum_{i=1}^n \left(v^T \phi(W x_i) - y_i\right)^2$$
(Figure: plot of the ReLU activation over $[-6, 6]$.)
Set $v$ at random or balanced (half $+$, half $-$). Run gradient descent $W_{\tau+1} = W_\tau - \mu_\tau \nabla\mathcal{L}(W_\tau)$ with random initialization.

Theorem (Oymak and Soltanolkotabi 2019). Assume
- ReLU activation: $\phi(z) = \mathrm{ReLU}(z) = \max(0, z)$
- Overparameterization: $\sqrt{kd} \gtrsim \kappa^3(X)\, n$
- Initialization with i.i.d. $\mathcal{N}(0,1)$ entries

Then, with high probability,
- Zero training error: $\mathcal{L}(W_\tau) \le \left(1 - c\,\frac{d}{n}\right)^{2\tau} \mathcal{L}(W_0)$
- Iterates remain close to initialization: $\frac{\|W_\tau - W_0\|_F}{\|W_0\|_F} \lesssim \sqrt{\frac{n}{kd}}$
Theory for SGD

Data set $\{(x_i, y_i)\}_{i=1}^n$ with $\|x_i\|_{\ell_2} = 1$:
$$\min_W \mathcal{L}(W) := \frac{1}{2}\sum_{i=1}^n \left(v^T \phi(W x_i) - y_i\right)^2$$
Set $v$ at random or balanced (half $+$, half $-$). Run stochastic gradient descent, updating with one randomly sampled training example per iteration, starting from a random initialization.

Theorem (Oymak and Soltanolkotabi 2019). Assume
- Smooth activations: $|\phi'(z)| \le B$ and $|\phi''(z)| \le B$
- Overparameterization: $\sqrt{kd} \gtrsim \kappa(X)\, n$
- Initialization with i.i.d. $\mathcal{N}(0,1)$ entries

Then, with high probability,
- Zero training error: $\mathbb{E}\left[\mathcal{L}(W_\tau)\right] \le \left(1 - c\,\frac{d}{n^2}\right)^{2\tau} \mathcal{L}(W_0)$
- Iterates remain close to initialization: $\frac{\|W_\tau - W_0\|_F}{\|W_0\|_F} \lesssim \sqrt{\frac{n}{kd}}$
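The same toy setup can be trained with single-sample stochastic gradient updates. A hedged sketch is below (illustrative sizes and step size; a tanh activation is an assumed smooth choice, and this is not the talk's code); the training loss should be driven toward zero in expectation.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d, k = 40, 40, 400                              # illustrative sizes with k*d >> n
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.normal(size=n)
    v = np.concatenate([np.ones(k // 2), -np.ones(k // 2)]) / np.sqrt(k)   # balanced, fixed
    W = rng.normal(size=(k, d))                        # i.i.d. N(0,1) initialization

    phi = np.tanh
    dphi = lambda z: 1.0 - np.tanh(z) ** 2

    eta = 0.5                                          # illustrative step size
    for _ in range(100000):                            # SGD: one randomly sampled example per step
        i = rng.integers(n)
        z = W @ X[i]                                   # pre-activations for example i, shape (k,)
        r_i = phi(z) @ v - y[i]                        # scalar residual on that example
        W -= eta * r_i * np.outer(v * dphi(z), X[i])   # single-sample gradient step

    r = phi(X @ W.T) @ v - y
    print("training loss after SGD:", 0.5 * np.sum(r ** 2))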
Proof Sketch
Prelude: over-parametrized linear least-squares
$$\min_{\theta\in\mathbb{R}^p}\ \mathcal{L}(\theta) := \frac{1}{2}\|X\theta - y\|_{\ell_2}^2 \qquad \text{with } X\in\mathbb{R}^{n\times p} \text{ and } n \le p.$$
Gradient descent starting from $\theta_0$ has three properties:
- Global convergence
- Converges to the global optimum which is closest to $\theta_0$
- Total gradient path length is relatively short
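A compact numpy check of these three properties (toy sizes chosen for illustration; the comparison point uses the pseudo-inverse, which gives the global optimum nearest to $\theta_0$):

    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 20, 100                                    # underdetermined: a whole affine set of global optima
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)

    theta0 = rng.normal(size=p)
    theta = theta0.copy()
    eta = 1.0 / np.linalg.norm(X, 2) ** 2             # conservative step size
    path_length = 0.0
    for _ in range(2000):                             # gradient descent on (1/2)||X theta - y||^2
        step = eta * X.T @ (X @ theta - y)
        theta -= step
        path_length += np.linalg.norm(step)

    # The global optimum closest to theta0 is theta0 + X^+(y - X theta0)
    closest = theta0 + np.linalg.pinv(X) @ (y - X @ theta0)
    print("residual ||X theta - y||:", np.linalg.norm(X @ theta - y))       # ~0: global convergence
    print("distance to closest optimum:", np.linalg.norm(theta - closest))  # ~0: lands at the nearest optimum
    print("path length vs direct distance:", path_length, np.linalg.norm(theta - theta0))  # relatively short path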
Over-parametrized nonlinear least-squares
$$\min_{\theta\in\mathbb{R}^p}\ \mathcal{L}(\theta) := \frac{1}{2}\|f(\theta) - y\|_{\ell_2}^2,$$
where
$$y := \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n\end{bmatrix}\in\mathbb{R}^n,\qquad f(\theta) := \begin{bmatrix} f(x_1;\theta)\\ f(x_2;\theta)\\ \vdots\\ f(x_n;\theta)\end{bmatrix}\in\mathbb{R}^n,\qquad\text{and } n \le p.$$
Gradient descent: start from some initial parameter $\theta_0$ and run
$$\theta_{\tau+1} = \theta_\tau - \eta_\tau\,\nabla\mathcal{L}(\theta_\tau),\qquad \nabla\mathcal{L}(\theta) = \mathcal{J}(\theta)^T\left(f(\theta) - y\right).$$
Here, $\mathcal{J}(\theta)\in\mathbb{R}^{n\times p}$ is the Jacobian matrix with entries $\mathcal{J}_{ij} = \frac{\partial f(x_i;\theta)}{\partial\theta_j}$.
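To make the update concrete on something other than a neural network, here is a toy overparameterized nonlinear least-squares problem (the model $f(x_i;\theta)=\sum_j \sin(x_{ij}\theta_j)$ and all sizes are made up for illustration) where the gradient is assembled exactly as $\mathcal{J}(\theta)^T(f(\theta)-y)$.

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 15, 200                                    # n <= p: overparameterized
    X = rng.normal(size=(n, p)) / np.sqrt(p)
    y = rng.normal(size=n)

    def f(theta):                                     # f(x_i; theta) = sum_j sin(x_ij * theta_j)
        return np.sin(X * theta).sum(axis=1)

    def jacobian(theta):                              # J_ij = d f(x_i; theta)/d theta_j = x_ij * cos(x_ij * theta_j)
        return X * np.cos(X * theta)

    theta = rng.normal(size=p)                        # random initialization theta_0
    eta = 0.5                                         # illustrative step size
    for _ in range(1000):
        r = f(theta) - y                              # residual f(theta) - y
        theta -= eta * jacobian(theta).T @ r          # grad L(theta) = J(theta)^T (f(theta) - y)

    print("final residual norm:", np.linalg.norm(f(theta) - y))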
Key lemma

Lemma. Suppose the following assumptions hold on $\mathcal{B}(\theta_0, R)$ with $R := \frac{4\,\|f(\theta_0)-y\|_{\ell_2}}{\alpha}$:
- Jacobian at initialization: $\sigma_{\min}\left(\mathcal{J}(\theta_0)\right) \ge 2\alpha$
- Bounded Jacobian spectrum: $\|\mathcal{J}(\theta)\| \le \beta$
- Lipschitz Jacobian: $\|\mathcal{J}(\theta) - \mathcal{J}(\widetilde{\theta})\| \le L\,\|\theta - \widetilde{\theta}\|_F$
- Small initial residual: $\|f(\theta_0)-y\|_{\ell_2} \le \frac{\alpha^2}{4L}$

Then, using step size $\eta \le \frac{1}{2\beta^2}$:
- Global geometric convergence: $\|f(\theta_\tau)-y\|_{\ell_2}^2 \le \left(1 - \frac{\eta\alpha^2}{2}\right)^{\tau}\|f(\theta_0)-y\|_{\ell_2}^2$
- Iterates stay close to init.: $\|\theta_\tau - \theta_0\|_{\ell_2} \le \frac{4}{\alpha}\|f(\theta_0)-y\|_{\ell_2} \le \frac{4\beta}{\alpha}\|\theta^* - \theta_0\|_{\ell_2}$
- Total gradient path bounded: $\sum_{\tau=0}^{\infty}\|\theta_{\tau+1}-\theta_\tau\|_{\ell_2} \le \frac{4}{\alpha}\|f(\theta_0)-y\|_{\ell_2}$

Key idea: track the dynamics of
$$V_\tau := \|r_\tau\|_{\ell_2} + \frac{1}{1 - \frac{\eta\beta^2}{2}}\sum_{t=0}^{\tau-1}\|\theta_{t+1}-\theta_t\|_{\ell_2}.$$
Proof sketch (SGD)

Challenge: show that SGD remains in the local neighborhood.
- Attempt I: show that $\|\theta_\tau - \theta_0\|_{\ell_2}$ is a supermartingale (see also [Tan and Vershynin 2017])
- Attempt II: show that $\|f(\theta_\tau)-y\|_{\ell_2} + \lambda\,\|\theta_\tau - \theta_0\|_{\ell_2}$ is a supermartingale
- Final attempt: show that
$$\frac{1}{K}\sum_{j=1}^{K}\|\theta_\tau - v_j\|_{\ell_2} + \frac{3\eta}{n}\,\left\|\mathcal{J}^T(\theta_\tau)\left(f(\theta_\tau)-y\right)\right\|_{\ell_2}$$
is a supermartingale. Here, $\{v_j\}_{j=1}^{K}$ is a very fine cover of $\mathcal{B}(\theta_0, R)$.
Over-parametrized nonlinear least-squares for neural nets
$$\min_{W\in\mathbb{R}^{k\times d}}\ \mathcal{L}(W) := \frac{1}{2}\|f(W) - y\|_{\ell_2}^2,$$
where
$$f(W) := \begin{bmatrix} f(W, x_1)\\ f(W, x_2)\\ \vdots\\ f(W, x_n)\end{bmatrix}\in\mathbb{R}^n,\qquad y := \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n\end{bmatrix}\in\mathbb{R}^n,\qquad\text{and } n \le kd.$$
Linearization via the Jacobian:
$$\mathcal{J}(W) = \left(\phi'\!\left(XW^T\right)\operatorname{diag}(v)\right) * X,$$
where $*$ denotes the (row-wise) Khatri-Rao product.
Key techniques

Hadamard product:
$$\mathcal{J}(W)\,\mathcal{J}^T(W) = \left(\phi'(XW^T)\,\phi'(WX^T)\right)\odot\left(XX^T\right)$$
Theorem (Schur 1913). For two PSD matrices $A, B\in\mathbb{R}^{n\times n}$,
$$\lambda_{\min}(A\odot B) \ge \left(\min_i B_{ii}\right)\lambda_{\min}(A), \qquad \lambda_{\max}(A\odot B) \le \left(\max_i B_{ii}\right)\lambda_{\max}(A).$$
Random matrix theory:
$$\mathcal{J}(W)\,\mathcal{J}^T(W) = \sum_{\ell=1}^{k}\left(\phi'(Xw_\ell)\,\phi'(Xw_\ell)^T\right)\odot\left(XX^T\right)$$
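Both identities are easy to verify numerically. The sketch below (small illustrative sizes, $\pm 1$ output weights so that $\operatorname{diag}(v)^2 = I$, and a tanh activation as an assumed smooth $\phi$) builds the Jacobian of the one-hidden-layer network explicitly, checks the Hadamard factorization of $\mathcal{J}\mathcal{J}^T$, and checks Schur's eigenvalue bounds. With these $\pm 1$ weights, averaging $\mathcal{J}\mathcal{J}^T/k$ over random $W$ recovers the Monte Carlo estimator of $\Sigma(X)$ sketched earlier.

    import numpy as np

    rng = np.random.default_rng(6)
    n, d, k = 10, 12, 30                               # small illustrative sizes (d >= n so XX^T has full rank)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    v = np.concatenate([np.ones(k // 2), -np.ones(k // 2)])   # +/-1 output weights, so diag(v)^2 = I
    W = rng.normal(size=(k, d))
    dphi = lambda z: 1.0 - np.tanh(z) ** 2             # phi' for an assumed tanh activation

    # Row i of J(W) is [v_1 phi'(w_1.x_i) x_i^T, ..., v_k phi'(w_k.x_i) x_i^T]  (shape n x kd)
    D = dphi(X @ W.T)                                  # phi'(X W^T), shape (n, k)
    J = ((D * v)[:, :, None] * X[:, None, :]).reshape(n, k * d)

    lhs = J @ J.T
    rhs = (D @ D.T) * (X @ X.T)                        # (phi'(XW^T) phi'(WX^T)) Hadamard (XX^T)
    print("factorization error:", np.linalg.norm(lhs - rhs))   # ~0 up to float round-off

    # Schur (1913) bounds for the Hadamard product of PSD matrices
    A, B = X @ X.T, D @ D.T
    eig_AB, eig_A = np.linalg.eigvalsh(A * B), np.linalg.eigvalsh(A)
    print(eig_AB[0] >= np.diag(B).min() * eig_A[0] - 1e-9)     # lambda_min bound holds
    print(eig_AB[-1] <= np.diag(B).max() * eig_A[-1] + 1e-9)   # lambda_max bound holds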
Side corollary: nonconvex matrix recovery

Features: $A_1, A_2, \ldots, A_n \in\mathbb{R}^{d\times d}$. Labels: $y_1, y_2, \ldots, y_n$. Solve the nonconvex matrix factorization problem
$$\min_{U\in\mathbb{R}^{d\times r}}\ \frac{1}{2}\sum_{i=1}^{n}\left(y_i - \langle A_i, UU^T\rangle\right)^2$$
Theorem (Oymak and Soltanolkotabi 2018). Assume
- i.i.d. Gaussian $A_i$
- any labels $y_i$
- initialization at a well-conditioned matrix $U_0$

Then, gradient descent iterates $U_\tau$ converge at a geometric rate to a nearby global optimum as soon as $n \le dr$.

Burer-Monteiro and many others require $r \gtrsim \sqrt{n}$; for Gaussian $A_i$ we allow $r \ge \frac{n}{d}$. When $n \approx dr_0$: Burer-Monteiro: $r \gtrsim \sqrt{dr_0}$; ours: $r \gtrsim r_0$.
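A minimal numpy sketch of this corollary's setting (all sizes, the step size, and the orthonormal initialization are illustrative assumptions, not the paper's experiments): gradient descent on the factorized objective with i.i.d. Gaussian $A_i$ in the regime $n \le dr$, driving the fitting loss down from its initial value.

    import numpy as np

    rng = np.random.default_rng(7)
    d, n = 20, 80                                      # illustrative sizes
    r = 8                                              # factorization rank with n <= d*r
    A = rng.normal(size=(n, d, d))                     # i.i.d. Gaussian measurement matrices
    M_star = rng.normal(size=(d, 2)) / np.sqrt(d)      # a planted low-rank matrix (labels could be anything)
    y = np.einsum('nij,ij->n', A, M_star @ M_star.T)   # y_i = <A_i, M_star M_star^T>

    U = np.linalg.qr(rng.normal(size=(d, r)))[0]       # well-conditioned (orthonormal) initialization U_0

    def loss(U):
        return 0.5 * np.sum((y - np.einsum('nij,ij->n', A, U @ U.T)) ** 2)

    print("initial loss:", loss(U))
    eta = 1e-4                                         # illustrative step size
    for _ in range(30000):                             # gradient descent on the factorized objective
        res = y - np.einsum('nij,ij->n', A, U @ U.T)   # residuals r_i = y_i - <A_i, U U^T>
        S = np.einsum('n,nij->ij', res, A)             # sum_i r_i A_i
        U += eta * (S + S.T) @ U                       # minus the gradient is sum_i r_i (A_i + A_i^T) U
    print("final loss:", loss(U))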
Previous work
- Unrealistic quadratic activation: [Soltanolkotabi, Javanmard, Lee 2018] and [Venturi, Bandeira, Bruna, ...]
- Smooth activations: [Du, Lee, Li, Wang, Zhai 2018] — $kd \gtrsim n^2$ (this work) versus $k \gtrsim n^4$
- ReLU activation: [Du et al. 2018] — $k \gtrsim \frac{n^4}{d^3}$ (this work) versus $k \gtrsim n^6$
- Separation: [Li and Liang 2018], [Allen-Zhu, Li, Song 2018] — $k \gtrsim \frac{n^{12}}{\delta^4}$ (this work) versus $k \gtrsim n^{25}$ (????)
- Begin to move beyond "lazy training" [Chizat & Bach, 2018]; faster convergence rate
- Deep networks: [Du, Lee, Li, Wang, Zhai 2018] and [Allen-Zhu, Li, Song 2018]
- Mean-field analysis for infinitely wide networks: [Mei et al., 2018]; [Chizat & Bach, 2018]; [Sirignano & Spiliopoulos, 2018]; [Rotskoff & Vanden-Eijnden, 2018]; [Wei et al., 2018]