Recovery Guarantees for One-hidden-layer Neural Networks
Kai Zhong (UT-Austin)
Joint work with Zhao Song (UT-Austin), Prateek Jain (MSR India), Peter L. Bartlett (UC Berkeley), Inderjit S. Dhillon (UT-Austin)
Learning Neural Networks is Hard
The objective functions of neural networks are highly non-convex.
Gradient-descent-based methods are only guaranteed to reach local optima.
Learning Neural Networks is Hard
Good News: When the size of the network is very large, there is no need to worry about bad local minima; every local minimum is a global minimum or close to one. [Choromanska et al. '15, Nguyen & Hein '17, etc.]
Bad News: Such results typically require over-parameterization, which may lead to overfitting!
Can we learn a neural net without over-parameterization?
Recover A Neural Network
Assume the data follows a specified neural network model. Try to recover this model.
Model: One-hidden-layer Neural Network
Assume $n$ samples $S = \{(x_j, y_j)\}_{j=1,\dots,n} \subset \mathbb{R}^d \times \mathbb{R}$ are sampled i.i.d. from the distribution
$D: \quad x \sim \mathcal{N}(0, I), \qquad y = \sum_{i=1}^{k} v_i^* \cdot \phi(w_i^{*\top} x),$
where $\phi(z)$ is the activation function, $k$ is the number of hidden nodes, and $\{w_i^*, v_i^*\}_{i=1,\dots,k}$ are the underlying ground-truth parameters.
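To make the generative model concrete, here is a minimal sketch (not part of the talk) that samples data from $D$; the dimensions, the ReLU choice for $\phi$, and the random ground-truth parameters are illustrative assumptions.

```python
# Illustrative sketch: draw n i.i.d. samples from the one-hidden-layer model D.
# d, k, n, the ReLU activation, and the ground-truth parameters are assumptions
# made for this example, not values from the talk.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 3, 10000                    # input dimension, hidden nodes, samples

W_star = rng.normal(size=(k, d))          # rows are the ground-truth w_i^*
v_star = rng.choice([-1.0, 1.0], size=k)  # ground-truth output weights v_i^*

def phi(z):
    """Activation function; ReLU is one example satisfying the properties discussed later."""
    return np.maximum(z, 0.0)

X = rng.normal(size=(n, d))               # x ~ N(0, I_d)
y = phi(X @ W_star.T) @ v_star            # y = sum_i v_i^* phi(w_i^*T x), noiseless
```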
General Issues and Our Contribution
Can we recover the model? Yes, by gradient descent following tensor-method initialization.
How many samples are required? (Sample complexity) $|S| > d \cdot \log(1/\epsilon) \cdot \mathrm{poly}(k, \lambda)$, where $\epsilon$ is the precision and $\lambda$ is a condition number of $W^*$.
How much time is needed? (Computational complexity) $|S| \cdot d \cdot \mathrm{poly}(k, \lambda)$.
This is the first recovery guarantee with both sample complexity and computational complexity linear in the input dimension and logarithmic in the precision.
Objective Function
Given $v_i^*$ and a sample set $S$, consider the L2 loss
$f_S(W) = \frac{1}{2|S|} \sum_{(x, y) \in S} \Big( \sum_{i=1}^{k} v_i^* \phi(w_i^\top x) - y \Big)^2.$
We show it is locally strongly convex near the ground truth!
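A minimal sketch (assuming the ReLU model and reusing `X`, `y`, `W_star`, `v_star` from the sampling example above) of evaluating $f_S(W)$ and running plain gradient descent from a warm start; the tensor-method initialization is not implemented here, so the warm start is simulated by perturbing $W^*$.

```python
# Sketch of the empirical loss f_S(W) and gradient descent from a warm start.
# Reuses X, y, W_star, v_star from the sampling example; ReLU activation assumed.
# The warm start below stands in for the tensor-method initialization (not shown).
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z > 0).astype(z.dtype)

def loss(W, X, y, v_star):
    resid = relu(X @ W.T) @ v_star - y          # residual per sample, shape (n,)
    return 0.5 * np.mean(resid ** 2)

def grad(W, X, y, v_star):
    pre = X @ W.T                               # pre-activations, shape (n, k)
    resid = relu(pre) @ v_star - y
    # d f_S / d w_i = mean over samples of resid * v_i^* * phi'(w_i^T x) * x
    return (resid[:, None] * v_star[None, :] * relu_grad(pre)).T @ X / len(y)

W = W_star + 0.1 * np.random.default_rng(1).normal(size=W_star.shape)  # warm start
for _ in range(500):
    W -= 0.2 * grad(W, X, y, v_star)            # fixed step size, illustrative only

print(loss(W, X, y, v_star))                    # decreases toward 0 on this toy instance
```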
Approach
Local Strong Convexity (LSC)
$\nabla^2 f(W)$ is positive definite (p.d.) for $W \in A$ $\Rightarrow$ $f(W)$ is LSC in the area $A$.
Consider the minimal eigenvalue of the expected Hessian at the ground truth,
$\lambda_{\min}\big(\nabla^2 f_D(W^*)\big) = \min_{\sum_j \|a_j\|^2 = 1} \mathbb{E}\Big[\Big(\sum_j \phi'(w_j^{*\top} x)\, x^\top a_j\Big)^2\Big],$
where $f_D$ is the expected risk.
$\lambda_{\min}\big(\nabla^2 f_D(W^*)\big) \ge 0$ always holds.
Does $\lambda_{\min}\big(\nabla^2 f_D(W^*)\big) > 0$ always hold? No
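A Monte-Carlo sketch (my own illustration, not the talk's code) of this minimum eigenvalue: the quadratic form above equals $a^\top H a$ for the $kd \times kd$ matrix $H$ whose $(j, l)$ block is $\mathbb{E}[\phi'(w_j^{*\top} x)\,\phi'(w_l^{*\top} x)\, x x^\top]$, so we estimate $H$ from samples and take its smallest eigenvalue. The ReLU activation and the sizes $d = 5$, $k = 3$ are assumptions.

```python
# Monte-Carlo estimate of lambda_min(grad^2 f_D(W^*)) for a ReLU network.
# The minimization min_{sum_j ||a_j||^2 = 1} E[(sum_j phi'(w_j^*T x) x^T a_j)^2]
# equals the smallest eigenvalue of the block matrix H built below.
import numpy as np

rng = np.random.default_rng(0)
d, k, m = 5, 3, 200000                   # illustrative sizes; m Monte-Carlo samples
W_star = rng.normal(size=(k, d))

def phi_prime(z):                        # ReLU derivative
    return (z > 0).astype(float)

X = rng.normal(size=(m, d))              # x ~ N(0, I_d)
D = phi_prime(X @ W_star.T)              # entries phi'(w_j^*T x), shape (m, k)

H = np.zeros((k * d, k * d))
for j in range(k):
    for l in range(k):
        # (j, l) block: E[phi'_j(x) phi'_l(x) x x^T], estimated from samples
        H[j*d:(j+1)*d, l*d:(l+1)*d] = (X * D[:, [j]]).T @ (X * D[:, [l]]) / m

print(np.linalg.eigvalsh(H).min())       # strictly positive for ReLU (with high probability)
```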
Two Examples when LSC doesn't Hold
Set $v_i^* = 1$ and $W^* = I$ ($k = d$).
1. When $\phi(z) = z$,
$\lambda_{\min}\big(\nabla^2 f_D(W^*)\big) = \min_{\sum_j \|a_j\|^2 = 1} \mathbb{E}\Big[\Big(\sum_j x^\top a_j\Big)^2\Big] = 0.$
The minimum is achieved when $\sum_j a_j = 0$.
Two Examples when LSC doesn't Hold
Set $v_i^* = 1$ and $W^* = I$ ($k = d$).
2. When $\phi(z) = z^2$,
$\lambda_{\min}\big(\nabla^2 f_D(W^*)\big) = 4 \min_{\sum_j \|a_j\|^2 = 1} \mathbb{E}\big[\langle x x^\top, A\rangle^2\big] = 0,$
where $A = [a_1, a_2, \cdots, a_d] \in \mathbb{R}^{d \times d}$. The minimum is achieved when $A = -A^\top$.
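A quick numerical check (illustrative, with $v_i^* = 1$ and $W^* = I$ as in the slides) that the two directions above indeed make the quadratic form vanish: columns of $A$ summing to zero for the linear activation, and a skew-symmetric $A$ for the quadratic one.

```python
# Verify the two degenerate directions numerically (illustrative sketch).
# Setting: v_i^* = 1, W^* = I, k = d, x ~ N(0, I_d).
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 100000
X = rng.normal(size=(m, d))

# phi(z) = z: phi'(w_j^*T x) = 1, so the form is E[(sum_j x^T a_j)^2].
A = rng.normal(size=(d, d))              # columns are the a_j
A -= A.mean(axis=1, keepdims=True)       # now sum_j a_j = 0
A /= np.linalg.norm(A)                   # normalize sum_j ||a_j||^2 = 1
print(np.mean((X @ A).sum(axis=1) ** 2))       # ~ 0 (up to roundoff)

# phi(z) = z^2: the form is 4 E[<x x^T, A>^2], which vanishes when A = -A^T.
B = rng.normal(size=(d, d))
B = (B - B.T) / 2                        # skew-symmetric
B /= np.linalg.norm(B)
print(4 * np.mean(np.einsum('si,ij,sj->s', X, B, X) ** 2))  # ~ 0
```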
When LSC Holds
$\phi(z)$ satisfies three properties.
P1 (Non-negative and homogeneously bounded derivative): $0 \le \phi'(z) \le L_1 |z|^p$ for some constants $L_1 > 0$ and $p \ge 0$.
Figure: activations satisfying P1: $\max(z, 0)$, $\tanh(z)$, $\max(z, 0.1z)$.
Figure: activations not satisfying P1: $z^2$, $e^z$, $\max(-z, 0)$.
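A tiny sanity check of P1 on a grid of points (illustrative): ReLU and leaky ReLU satisfy $0 \le \phi'(z) \le L_1 |z|^p$ with $L_1 = 1$, $p = 0$, while $\phi(z) = z^2$ fails because its derivative is negative for $z < 0$.

```python
# Check P1 (0 <= phi'(z) <= L1 |z|^p) on a grid; L1 = 1, p = 0 for the first two.
import numpy as np

z = np.linspace(-5.0, 5.0, 1001)

relu_prime = (z > 0).astype(float)           # phi(z) = max(z, 0)
leaky_prime = np.where(z > 0, 1.0, 0.1)      # phi(z) = max(z, 0.1 z)
square_prime = 2.0 * z                       # phi(z) = z^2

print((relu_prime >= 0).all() and (relu_prime <= 1).all())    # True
print((leaky_prime >= 0).all() and (leaky_prime <= 1).all())  # True
print((square_prime >= 0).all())                              # False: P1 fails
```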
When LSC Holds
$\phi(z)$ satisfies three properties.
P2 ("Non-linearity"¹): for any $\sigma > 0$, we have $\rho(\sigma) > 0$, where
$\rho(\sigma) := \min\{\alpha_{2,0} - \alpha_{1,0}^2 - \alpha_{1,1}^2,\ \alpha_{2,2} - \alpha_{1,1}^2 - \alpha_{1,2}^2,\ \alpha_{1,0}\alpha_{1,2} - \alpha_{1,1}^2\}$
and $\alpha_{i,j} := \mathbb{E}_{z \sim \mathcal{N}(0,1)}[(\phi'(\sigma z))^i z^j]$.

          ReLU    leaky ReLU   squared ReLU   erf      tanh     linear   quadratic
  ρ(0.1)                                      1.9E-4   1.8E-4
  ρ(1)    0.091   0.089        0.27σ²         5.2E-2   4.9E-2   0        0
  ρ(10)                                       2.5E-5   5.1E-5

¹ Best name we can find... we still need a better understanding of ρ(σ).
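A Monte-Carlo sketch (my own check, assuming the $\alpha_{i,j}$ definition above) of $\rho(\sigma)$: ReLU gives roughly 0.091 independently of $\sigma$, while a linear activation gives 0, matching the table.

```python
# Monte-Carlo estimate of rho(sigma) from its definition (illustrative sketch).
import numpy as np

def rho(phi_prime, sigma, m=2_000_000, seed=0):
    z = np.random.default_rng(seed).normal(size=m)
    g = phi_prime(sigma * z)
    a = lambda i, j: np.mean(g**i * z**j)              # alpha_{i,j}
    return min(a(2, 0) - a(1, 0)**2 - a(1, 1)**2,
               a(2, 2) - a(1, 1)**2 - a(1, 2)**2,
               a(1, 0) * a(1, 2) - a(1, 1)**2)

relu_prime = lambda t: (t > 0).astype(float)
linear_prime = lambda t: np.ones_like(t)

print(rho(relu_prime, 1.0))      # ~ 0.091
print(rho(relu_prime, 10.0))     # ~ 0.091 (ReLU's derivative is scale-invariant)
print(rho(linear_prime, 1.0))    # ~ 0, so P2 (and hence LSC) fails for linear activation
```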
When LSC Holds
$\phi(z)$ satisfies three properties.
P3: $\phi''(z)$ satisfies one of the following two properties,
(a) Smoothness: $|\phi''(z)| \le L_2$ for all $z$, for some constant $L_2$, or
(b) Piece-wise linearity: $\phi''(z) = 0$ except at $e$ points ($e$ is a finite constant).
Activations satisfying P3: $\max(z, 0)^2$, $\max(z, 0)$, $\max(z, 0.1z)$, $\tanh(z)$.
Activations not satisfying P3: $\phi(z) = 0$ if $z < 0$, $z^4 + 4z$ otherwise; $e^z$.