Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

Yuan Cao and Quanquan Gu
Computer Science Department
Learning Over-parameterized DNNs

Empirical observations on extremely wide deep neural networks (Zhang et al. 2017; Bartlett et al. 2017; Neyshabur et al. 2018; Arora et al. 2019) motivate two questions:
- Why can extremely wide neural networks generalize?
- What data can be learned by deep and wide neural networks?
Learning Over-parameterized DNNs

- Fully connected neural network with width m:
  f_W(x) = √m · W_L σ(W_{L−1} ⋯ σ(W_1 x) ⋯).
- σ(·) is the ReLU activation function: σ(t) = max(0, t).
- Per-example logistic loss: L_{(x_i, y_i)}(W) = ℓ[y_i · f_W(x_i)], where ℓ(z) = log(1 + exp(−z)).

Algorithm: SGD for DNNs starting at Gaussian initialization
  Initialize W_l^{(0)} ∼ N(0, 2/m) for l ∈ [L−1], and W_L^{(0)} ∼ N(0, 1/m).
  for i = 1, 2, …, n do
    Draw (x_i, y_i) from D.
    Update W^{(i)} = W^{(i−1)} − η · ∇_W L_{(x_i, y_i)}(W^{(i−1)}).
  end for
  Output: Choose Ŵ uniformly at random from {W^{(0)}, …, W^{(n−1)}}.
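To make the setup concrete, here is a minimal runnable sketch (not the authors' code) of this slide in JAX: the width-m fully connected ReLU network with the √m output scaling, the Gaussian initialization, and the one-pass SGD loop on the logistic loss. The depth, width, step size, and the synthetic data stream standing in for draws from D are illustrative placeholders, not values from the paper.

```python
import jax
import jax.numpy as jnp

L, m, d, n, eta = 3, 256, 10, 100, 0.1  # depth, width, input dim, SGD steps, step size (illustrative)

def init_params(key):
    keys = jax.random.split(key, L)
    Ws = [jax.random.normal(keys[0], (m, d)) * jnp.sqrt(2.0 / m)]                 # W_1 ~ N(0, 2/m)
    Ws += [jax.random.normal(k, (m, m)) * jnp.sqrt(2.0 / m) for k in keys[1:-1]]  # W_2, ..., W_{L-1}
    Ws += [jax.random.normal(keys[-1], (1, m)) * jnp.sqrt(1.0 / m)]               # W_L ~ N(0, 1/m)
    return Ws

def f(Ws, x):
    h = x
    for W in Ws[:-1]:
        h = jax.nn.relu(W @ h)                 # sigma(t) = max(0, t)
    return jnp.sqrt(m) * (Ws[-1] @ h)[0]       # sqrt(m) * W_L sigma(...)

def loss(Ws, x, y):
    return jnp.log1p(jnp.exp(-y * f(Ws, x)))   # l(z) = log(1 + exp(-z)) at z = y * f_W(x)

key = jax.random.PRNGKey(0)
Ws0 = init_params(key)                          # W^(0): Gaussian initialization
Ws, iterates = Ws0, [Ws0]
for i in range(n):
    xk, yk = jax.random.split(jax.random.fold_in(key, i + 1))
    x = jax.random.normal(xk, (d,))
    x = x / jnp.linalg.norm(x)                  # synthetic unit-norm example
    y = jnp.where(jax.random.bernoulli(yk), 1.0, -1.0)
    grads = jax.grad(loss)(Ws, x, y)            # gradient of the per-example loss
    Ws = [W - eta * g for W, g in zip(Ws, grads)]
    iterates.append(Ws)
idx = int(jax.random.randint(jax.random.fold_in(key, n + 1), (), 0, n))
W_hat = iterates[idx]                           # W-hat drawn uniformly from {W^(0), ..., W^(n-1)}
```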
Generalization Bounds for DNNs

Theorem. For any R > 0, if m ≥ Ω̃(poly(R, L, n)), then with high probability, SGD returns Ŵ that satisfies
  E[L_D^{0-1}(Ŵ)] ≤ inf_{f ∈ F(W^{(0)}, R)} (4/n) Σ_{i=1}^n ℓ[y_i · f(x_i)] + O(LR/√n) + O(√(log(1/δ)/n)),
where
  F(W^{(0)}, R) = { f_{W^{(0)}}(·) + ⟨∇_W f_{W^{(0)}}(·), W⟩ : ‖W_l‖_F ≤ R · m^{−1/2}, l ∈ [L] }
is the Neural Tangent Random Feature (NTRF) model.
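The NTRF class consists of first-order Taylor expansions of the network around its initialization. Below is a minimal sketch of evaluating such a function and enforcing the norm constraint, reusing f, Ws0, and m from the earlier code block; ntrf_predict and project_to_ball are hypothetical helper names, and the Jacobian-vector product is used only as a convenient way to form the inner product ⟨∇_W f_{W^(0)}(x), W⟩.

```python
import jax
import jax.numpy as jnp

def ntrf_predict(Ws0, W_dir, x):
    # f_{W^(0)}(x) + <grad_W f_{W^(0)}(x), W_dir>, computed with a JVP so the
    # full parameter gradient never has to be materialized.
    f0, lin = jax.jvp(lambda Ws: f(Ws, x), (Ws0,), (W_dir,))
    return f0 + lin

def project_to_ball(W_dir, R):
    # Enforce the NTRF constraint ||W_l||_F <= R * m^{-1/2} layer by layer.
    limit = R / jnp.sqrt(m)
    return [W * jnp.minimum(1.0, limit / jnp.linalg.norm(W)) for W in W_dir]
```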
Generalization Bounds for DNNs

Corollary. Let y = (y_1, …, y_n)^⊤ and λ_0 = λ_min(Θ^{(L)}). If m ≥ Ω̃(poly(L, n, λ_0^{−1})), then with high probability, SGD returns Ŵ that satisfies
  E[L_D^{0-1}(Ŵ)] ≤ Õ( L · inf_{ỹ: ỹ_i y_i ≥ 1} √(ỹ^⊤ (Θ^{(L)})^{−1} ỹ / n) ) + O(√(log(1/δ)/n)),
where Θ^{(L)} is the neural tangent kernel (Jacot et al. 2018) Gram matrix:
  Θ^{(L)}_{i,j} := lim_{m→∞} m^{−1} ⟨∇_W f_{W^{(0)}}(x_i), ∇_W f_{W^{(0)}}(x_j)⟩.

The "classifiability" of the underlying data distribution D can also be measured by the quantity inf_{ỹ: ỹ_i y_i ≥ 1} √(ỹ^⊤ (Θ^{(L)})^{−1} ỹ).
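At finite width, Θ^{(L)} can be approximated by the normalized inner products of parameter gradients at initialization. The minimal sketch below (again reusing f, Ws0, m, and d from the first code block, with synthetic placeholder data X, y) forms this empirical kernel and evaluates the feasible choice ỹ = y, which upper-bounds the classifiability quantity; the exact infimum over {ỹ : ỹ_i y_i ≥ 1} is a quadratic program and is not solved here.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def empirical_ntk(Ws0, X):
    # Finite-width analogue of Theta^(L)_{ij} = lim_{m->inf} m^{-1} <grad f(x_i), grad f(x_j)>.
    grads = [ravel_pytree(jax.grad(lambda Ws: f(Ws, x))(Ws0))[0] for x in X]
    G = jnp.stack(grads)                # one flattened gradient per example
    return (G @ G.T) / m

X = jax.random.normal(jax.random.PRNGKey(1), (20, d))
X = X / jnp.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
y = jnp.where(X[:, 0] >= 0, 1.0, -1.0)              # placeholder labels
Theta = empirical_ntk(Ws0, X)
proxy = y @ jnp.linalg.solve(Theta, y)  # y~ = y is feasible, so this upper-bounds inf y~^T Theta^{-1} y~
```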
Overview of the Proof

Key observations:
- Deep ReLU networks are almost linear in terms of their parameters in a small neighbourhood around the random initialization:
  f_{W′}(x_i) ≈ f_W(x_i) + ⟨∇_W f_W(x_i), W′ − W⟩.
- L_{(x_i, y_i)}(W) is Lipschitz continuous and almost convex:
  ‖∇_{W_l} L_{(x_i, y_i)}(W)‖_F ≤ O(√m), l ∈ [L],
  L_{(x_i, y_i)}(W′) ≳ L_{(x_i, y_i)}(W) + ⟨∇_W L_{(x_i, y_i)}(W), W′ − W⟩.

Optimization for Lipschitz and (almost) convex functions + online-to-batch conversion.

Applicable to general loss functions: ℓ(·) is convex/Lipschitz/smooth ⇒ L_{(x_i, y_i)}(W) is (almost) convex/Lipschitz/smooth.
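The first key observation can be checked numerically: perturb the initialization by matrices of small Frobenius norm and compare the network value with its linearization around W^{(0)}. A minimal sketch, again assuming f, Ws0, and d from the first code block; the perturbation size 1e-2 is an arbitrary illustrative choice.

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(2)
x = jax.random.normal(key, (d,))
x = x / jnp.linalg.norm(x)

# Move each layer by a matrix of small Frobenius norm (the "small neighbourhood").
deltas = []
for i, W in enumerate(Ws0):
    D = jax.random.normal(jax.random.fold_in(key, i), W.shape)
    deltas.append(1e-2 * D / jnp.linalg.norm(D))
Ws_prime = [W + D for W, D in zip(Ws0, deltas)]

# Compare f_{W'}(x) with f_{W^(0)}(x) + <grad_W f_{W^(0)}(x), W' - W^(0)>.
f0, lin = jax.jvp(lambda Ws: f(Ws, x), (Ws0,), (deltas,))
print("network value :", float(f(Ws_prime, x)))
print("linearization :", float(f0 + lin))
```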
Summary

- Generalization bounds for wide DNNs that do not increase with network width.
- A random feature model (the NTRF model) that naturally connects over-parameterized DNNs with the NTK.
- A quantification of the "classifiability" of data: inf_{ỹ: ỹ_i y_i ≥ 1} √(ỹ^⊤ (Θ^{(L)})^{−1} ỹ).
- A clean and simple proof framework for neural networks in the "NTK regime" that is applicable to various problem settings.

Thank you! Poster #141