Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks - PowerPoint PPT Presentation



  1. Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks. Yuan Cao and Quanquan Gu, Computer Science Department.

  2. Learning Over-parameterized DNNs. Empirical observation on extremely wide deep neural networks (Zhang et al. 2017; Bartlett et al. 2017; Neyshabur et al. 2018; Arora et al. 2019).

  3. Learning Over-parameterized DNNs (cont.)
     ◮ Why can extremely wide neural networks generalize?
     ◮ What data can be learned by deep and wide neural networks?

  4. Learning Over-parameterized DNNs
     ◮ Fully connected neural network with width $m$: $f_W(x) = \sqrt{m} \cdot W_L \sigma(W_{L-1} \cdots \sigma(W_1 x) \cdots)$.
     ◮ $\sigma(\cdot)$ is the ReLU activation function: $\sigma(t) = \max(0, t)$.
     ◮ $L_{(x_i, y_i)}(W) = \ell[y_i \cdot f_W(x_i)]$, with $\ell(z) = \log(1 + \exp(-z))$.

  5. Learning Over-parameterized DNNs (cont.)
     Algorithm: SGD for DNNs starting at Gaussian initialization.
       Initialize $W_l^{(0)} \sim N(0, 2/m)$ for $l \in [L-1]$, and $W_L^{(0)} \sim N(0, 1/m)$.
       for $i = 1, 2, \ldots, n$ do
         Draw $(x_i, y_i)$ from $\mathcal{D}$.
         Update $W^{(i)} = W^{(i-1)} - \eta \cdot \nabla_W L_{(x_i, y_i)}(W^{(i-1)})$.
       end for
       Output: choose $\widehat{W}$ uniformly at random from $\{W^{(0)}, \ldots, W^{(n-1)}\}$.
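
The following is a minimal NumPy sketch of the setup on slides 4-5: the width-$m$ fully connected ReLU network with $\sqrt{m}$ output scaling, the logistic loss $\ell(z) = \log(1 + \exp(-z))$, the Gaussian initialization, and one-pass SGD that outputs a uniformly random iterate. The dimensions, synthetic data stream, and hyperparameters (d, m, L, n, eta) are illustrative assumptions, not values from the paper.

```python
# Sketch of the network, loss, initialization, and one-pass SGD from slides 4-5.
# All concrete sizes and the learning rate are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
d, m, L, n, eta = 10, 256, 3, 100, 0.01   # assumed toy values

def init_weights():
    # W_l^(0) ~ N(0, 2/m) for l in [L-1], W_L^(0) ~ N(0, 1/m)
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))]
    Ws += [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L - 2)]
    Ws += [rng.normal(0.0, np.sqrt(1.0 / m), size=(1, m))]
    return Ws

def forward(Ws, x):
    # f_W(x) = sqrt(m) * W_L sigma(W_{L-1} ... sigma(W_1 x)); cache activations for backprop.
    acts = [x]
    h = x
    for W in Ws[:-1]:
        h = np.maximum(W @ h, 0.0)          # ReLU
        acts.append(h)
    out = np.sqrt(m) * (Ws[-1] @ h).item()  # sqrt(m) output scaling
    return out, acts

def loss_grads(Ws, x, y):
    # Gradients of l(y * f_W(x)) with l(z) = log(1 + exp(-z)), via manual backprop.
    out, acts = forward(Ws, x)
    dl_dout = -y / (1.0 + np.exp(y * out))          # d/d(out) of l(y * out)
    grads = [None] * L
    grads[-1] = dl_dout * np.sqrt(m) * acts[-1].reshape(1, -1)   # grad w.r.t. W_L
    delta = dl_dout * np.sqrt(m) * Ws[-1].ravel()                # grad w.r.t. last hidden layer
    for l in range(L - 2, -1, -1):
        delta = delta * (acts[l + 1] > 0)            # ReLU derivative
        grads[l] = np.outer(delta, acts[l])
        if l > 0:
            delta = Ws[l].T @ delta
    return grads

# Synthetic data stream (x_i, y_i) drawn i.i.d., labels in {-1, +1}.
X = rng.normal(size=(n, d))
Y = np.sign(rng.normal(size=n))

Ws = init_weights()
iterates = [[W.copy() for W in Ws]]                  # W^(0)
for i in range(n):
    grads = loss_grads(Ws, X[i], Y[i])
    Ws = [W - eta * g for W, g in zip(Ws, grads)]    # W^(i) = W^(i-1) - eta * grad
    if i < n - 1:
        iterates.append([W.copy() for W in Ws])

# Output: W_hat chosen uniformly from {W^(0), ..., W^(n-1)}.
W_hat = iterates[rng.integers(n)]
```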

  6. Generalization Bounds for DNNs
     Theorem. For any $R > 0$, if $m \ge \widetilde{\Omega}(\mathrm{poly}(R, L, n))$, then with high probability, SGD returns $\widehat{W}$ that satisfies
       $\mathbb{E}\big[L_{\mathcal{D}}^{0\text{-}1}(\widehat{W})\big] \le \inf_{f \in \mathcal{F}(W^{(0)}, R)} \frac{4}{n} \sum_{i=1}^{n} \ell\big[y_i \cdot f(x_i)\big] + \widetilde{O}\bigg(\frac{LR}{\sqrt{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\bigg),$
     where $\mathcal{F}(W^{(0)}, R) = \big\{ f_{W^{(0)}}(\cdot) + \langle \nabla_W f_{W^{(0)}}(\cdot), W \rangle : \|W_l\|_F \le R \cdot m^{-1/2},\ l \in [L] \big\}$.

  7. Generalization Bounds for DNNs (cont.)
     The reference function class $\mathcal{F}(W^{(0)}, R)$ in the theorem above is the Neural Tangent Random Feature (NTRF) model.
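
Below is a hedged sketch of what a member of the NTRF class $\mathcal{F}(W^{(0)}, R)$ looks like, specialized to a two-layer version of the network ($L = 2$) so the gradient formulas stay short: the network at initialization plus a linear function of its gradient at initialization, with each layer's displacement bounded in Frobenius norm by $R \cdot m^{-1/2}$. The concrete sizes and the random direction are illustrative choices, not part of the paper.

```python
# Sketch of the NTRF model for the two-layer case f_W(x) = sqrt(m) * w2^T relu(W1 x).
import numpy as np

rng = np.random.default_rng(1)
d, m, R = 10, 512, 1.0                       # assumed toy values

W1_0 = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))   # W_1^(0) ~ N(0, 2/m)
w2_0 = rng.normal(0.0, np.sqrt(1.0 / m), size=m)        # W_2^(0) ~ N(0, 1/m)

def f0(x):
    """Network output at initialization, f_{W^(0)}(x)."""
    return np.sqrt(m) * w2_0 @ np.maximum(W1_0 @ x, 0.0)

def grad_f0(x):
    """Gradients of f_W(x) with respect to (W1, w2), evaluated at W^(0)."""
    h = np.maximum(W1_0 @ x, 0.0)
    g_W1 = np.sqrt(m) * np.outer(w2_0 * (h > 0), x)      # d f / d W1
    g_w2 = np.sqrt(m) * h                                # d f / d w2
    return g_W1, g_w2

def ntrf(x, V1, v2):
    """NTRF function: f_{W^(0)}(x) + <grad_W f_{W^(0)}(x), V>."""
    g_W1, g_w2 = grad_f0(x)
    return f0(x) + np.sum(g_W1 * V1) + g_w2 @ v2

# One member of F(W^(0), R): a random direction rescaled so each layer
# satisfies ||V_l||_F <= R * m^{-1/2}.
V1 = rng.normal(size=(m, d))
V1 *= (R / np.sqrt(m)) / np.linalg.norm(V1)
v2 = rng.normal(size=m)
v2 *= (R / np.sqrt(m)) / np.linalg.norm(v2)

x = rng.normal(size=d)
print(f0(x), ntrf(x, V1, v2))
```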

  8. Generalization Bounds for DNNs
     Corollary. Let $y = (y_1, \ldots, y_n)^\top$ and $\lambda_0 = \lambda_{\min}(\Theta^{(L)})$. If $m \ge \widetilde{\Omega}(\mathrm{poly}(L, n, \lambda_0^{-1}))$, then with high probability, SGD returns $\widehat{W}$ that satisfies
       $\mathbb{E}\big[L_{\mathcal{D}}^{0\text{-}1}(\widehat{W})\big] \le \widetilde{O}\Bigg( L \cdot \sqrt{\frac{\inf_{\tilde{y}:\, \tilde{y}_i y_i \ge 1} \tilde{y}^\top \big(\Theta^{(L)}\big)^{-1} \tilde{y}}{n}} \Bigg) + O\Bigg(\sqrt{\frac{\log(1/\delta)}{n}}\Bigg),$
     where $\Theta^{(L)}$ is the neural tangent kernel (Jacot et al. 2018) Gram matrix, $\Theta^{(L)}_{i,j} := \lim_{m \to \infty} m^{-1} \big\langle \nabla_W f_{W^{(0)}}(x_i), \nabla_W f_{W^{(0)}}(x_j) \big\rangle$.

  9. Generalization Bounds for DNNs (cont.)
     The "classifiability" of the underlying data distribution $\mathcal{D}$ can also be measured by the quantity $\inf_{\tilde{y}:\, \tilde{y}_i y_i \ge 1} \tilde{y}^\top \big(\Theta^{(L)}\big)^{-1} \tilde{y}$.
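
The sketch below builds a finite-width surrogate of the NTK Gram matrix, $\widehat{\Theta}_{i,j} = m^{-1} \langle \nabla_W f_{W^{(0)}}(x_i), \nabla_W f_{W^{(0)}}(x_j) \rangle$, again for a two-layer network, and evaluates $y^\top \widehat{\Theta}^{-1} y$. Since $\tilde{y} = y$ satisfies the constraint $\tilde{y}_i y_i \ge 1$ for $\pm 1$ labels, this value upper-bounds the classifiability quantity on the slide. The data, width, and sample size are synthetic placeholders.

```python
# Finite-width surrogate for the NTK Gram matrix Theta^(L) and a simple
# upper bound on the "classifiability" quantity (two-layer case, toy data).
import numpy as np

rng = np.random.default_rng(2)
d, m, n = 10, 2048, 50                      # assumed toy values

W1_0 = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))
w2_0 = rng.normal(0.0, np.sqrt(1.0 / m), size=m)

def grad_f0_flat(x):
    """Flattened gradient of f_W(x) = sqrt(m) * w2^T relu(W1 x) at W^(0)."""
    h = np.maximum(W1_0 @ x, 0.0)
    g_W1 = np.sqrt(m) * np.outer(w2_0 * (h > 0), x)
    g_w2 = np.sqrt(m) * h
    return np.concatenate([g_W1.ravel(), g_w2])

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
y = np.sign(X[:, 0])                            # toy +/-1 labels

G = np.stack([grad_f0_flat(x) for x in X])      # n x (#params) gradient features
Theta_hat = (G @ G.T) / m                       # m^{-1} <grad f(x_i), grad f(x_j)>

# y is feasible for the constraint y~_i y_i >= 1, so y^T Theta^{-1} y
# upper-bounds the infimum that measures classifiability.
classifiability = y @ np.linalg.solve(Theta_hat, y)
print(classifiability, np.linalg.eigvalsh(Theta_hat).min())   # quantity and lambda_0 estimate
```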

  10. Overview of the Proof
      Key observations:
      ◮ Deep ReLU networks are almost linear in their parameters in a small neighbourhood around random initialization: $f_{W'}(x_i) \approx f_W(x_i) + \langle \nabla f_W(x_i), W' - W \rangle$; a numerical sketch of this appears after slide 12.
      ◮ $L_{(x_i, y_i)}(W)$ is Lipschitz continuous and almost convex: $\|\nabla_{W_l} L_{(x_i, y_i)}(W)\|_F \le O(\sqrt{m})$ for $l \in [L]$, and $L_{(x_i, y_i)}(W') \gtrsim L_{(x_i, y_i)}(W) + \langle \nabla_W L_{(x_i, y_i)}(W), W' - W \rangle$.

  11. Overview of the Proof (cont.)
      Optimization for Lipschitz and (almost) convex functions + online-to-batch conversion.

  12. Overview of the Proof (cont.)
      Applicable to general loss functions: if $\ell(\cdot)$ is convex/Lipschitz/smooth, then $L_{(x_i, y_i)}(W)$ is (almost) convex/Lipschitz/smooth.
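
The following numerical sketch illustrates the first key observation from slide 10: for a perturbation $W'$ in a small neighbourhood of a random initialization $W$, the network output is close to its first-order Taylor expansion $f_W(x) + \langle \nabla f_W(x), W' - W \rangle$. It again uses a two-layer network; the perturbation radius tau and all sizes are arbitrary illustrative choices, and the gap should shrink as $m$ grows.

```python
# Near-linearity of a wide ReLU network around random initialization (two-layer case).
import numpy as np

rng = np.random.default_rng(3)
d, m, tau = 10, 4096, 0.05                  # assumed toy values

W1 = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))
w2 = rng.normal(0.0, np.sqrt(1.0 / m), size=m)

def f(W1_, w2_, x):
    return np.sqrt(m) * w2_ @ np.maximum(W1_ @ x, 0.0)

def grad_f(W1_, w2_, x):
    h = np.maximum(W1_ @ x, 0.0)
    return np.sqrt(m) * np.outer(w2_ * (h > 0), x), np.sqrt(m) * h

x = rng.normal(size=d)
x /= np.linalg.norm(x)

# Random perturbation with ||Delta_l||_F = tau * m^{-1/2} for each layer.
D1 = rng.normal(size=(m, d))
D1 *= (tau / np.sqrt(m)) / np.linalg.norm(D1)
d2 = rng.normal(size=m)
d2 *= (tau / np.sqrt(m)) / np.linalg.norm(d2)

g1, g2 = grad_f(W1, w2, x)
exact = f(W1 + D1, w2 + d2, x)
linear = f(W1, w2, x) + np.sum(g1 * D1) + g2 @ d2
print(exact, linear, abs(exact - linear))    # the gap is small and shrinks with larger m
```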

  13. Summary
      ◮ Generalization bounds for wide DNNs that do not increase with the network width.
      ◮ A random feature model (the NTRF model) that naturally connects over-parameterized DNNs with the NTK.
      ◮ A quantification of the "classifiability" of data: $\inf_{\tilde{y}:\, \tilde{y}_i y_i \ge 1} \tilde{y}^\top \big(\Theta^{(L)}\big)^{-1} \tilde{y}$.
      ◮ A clean and simple proof framework for neural networks in the "NTK regime" that is applicable to various problem settings.

  14. Summary (cont.)
      Thank you! Poster #141
