Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?
Zhiyuan Li, joint work with Sanjeev Arora and Yi Zhang
Princeton University
August 19, 2020 @ IJTCS
Table of Contents
1. Introduction
2. Intuition and Warm-up Example
3. Identifying Algorithmic Equivariance
4. Lower Bound for Equivariant Algorithms
Introduction

CNNs (convolutional neural networks) often perform better than their fully-connected counterparts (FC nets), especially on vision tasks.

This is not an issue of expressiveness: FC nets easily reach full training accuracy, yet still generalize poorly.

The gap is often explained by a “better inductive bias”. Example: over-parametrized linear regression has multiple solutions, and GD (gradient descent) initialized from 0 picks the one with minimum ℓ2 norm (a small numerical sketch of this follows the slide).

Question: Can we justify this rigorously by showing a sample-complexity separation?
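As a quick illustration of the warm-up example above (my own numerical sketch, not taken from the talk): for an over-parametrized least-squares problem, gradient descent started at 0 converges to the minimum-ℓ2-norm interpolating solution, i.e. the pseudo-inverse solution.

```python
# Sketch (assumed toy setup, not from the slides): GD from 0 on over-parametrized
# linear regression recovers the minimum-l2-norm interpolant X^+ y.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                         # fewer samples than dimensions
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                        # initialize at 0
lr = 2e-3
for _ in range(10_000):                # GD on the squared loss 0.5 * ||Xw - y||^2
    w -= lr * X.T @ (X @ w - y)

w_min_norm = np.linalg.pinv(X) @ y     # minimum-norm interpolating solution
print(np.linalg.norm(X @ w - y))       # ~0: GD interpolates the data
print(np.linalg.norm(w - w_min_norm))  # ~0: and it found the min-norm solution
```

The point of the example: the loss has infinitely many global minimizers, and the optimizer (not the loss) decides which one is returned; this is the kind of inductive bias the talk aims to compare between FC nets and CNNs.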
Introduction (continued)

Since ultra-wide FC nets can simulate any CNN, the hurdle is to show that (S)GD on an FC net does not learn those CNNs with good generalization.

This work: a single distribution plus a single target function that a CNN can learn with a constant number of samples, while SGD on FC nets of any depth and width requires Ω(d²) samples.
Setting

Binary classification: label space Y = {−1, 1}, data domain X = R^d.

Joint distribution P supported on X × Y = R^d × {−1, 1}. In this talk, P_{Y|X} is always a deterministic function h* : R^d → {−1, 1}, i.e. P = P_X ⋄ h*.

A learning algorithm A maps a sequence of training data {(x_i, y_i)}_{i=1}^n ∈ (X × Y)^n to a hypothesis A({(x_i, y_i)}_{i=1}^n) ∈ Y^X. A may also be randomized.

Two examples, kernel regression and ERM (empirical risk minimization), sketched in code after this slide:

    REG_K({(x_i, y_i)}_{i=1}^n)(x) := 1[ K(x, X_n) · K(X_n, X_n)^† y ≥ 0 ],

    ERM_H({(x_i, y_i)}_{i=1}^n) := argmin_{h ∈ H} Σ_{i=1}^n 1[h(x_i) ≠ y_i].¹

¹ Strictly speaking, ERM_H is not a well-defined algorithm; in this talk we consider the worst performance over all empirical risk minimizers in H.
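A minimal runnable sketch of these two example algorithms (my own illustration; the RBF kernel and the tiny hypothesis class of coordinate signs are arbitrary choices, not from the talk):

```python
# REG_K: fit kernel regression on the +/-1 labels, then threshold at 0.
# ERM_H: return a hypothesis in a (here: finite) class H with minimal training error.
import numpy as np

def reg_k(X_train, y_train, kernel):
    """Classifier x -> 1[ K(x, X_n) K(X_n, X_n)^+ y >= 0 ], reported as a label in {-1, +1}."""
    alpha = np.linalg.pinv(kernel(X_train, X_train)) @ y_train
    def h(x):
        k_x = kernel(x[None, :], X_train)          # row vector K(x, X_n)
        return 1.0 if (k_x @ alpha)[0] >= 0 else -1.0
    return h

def erm(X_train, y_train, hypotheses):
    """One empirical risk minimizer: fewest training mistakes, ties broken arbitrarily."""
    return min(hypotheses, key=lambda h: sum(h(x) != y for x, y in zip(X_train, y_train)))

# Usage with an RBF kernel and the hypothesis class {x -> sign(x_j) : j = 1..d}.
rbf = lambda A, B: np.exp(-0.5 * np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1) ** 2)
rng = np.random.default_rng(0)
d = 5
X = rng.standard_normal((10, d))
y = np.sign(X[:, 0])                               # target h*(x) = sign(x_1)
h_reg = reg_k(X, y, rbf)
h_erm = erm(X, y, [lambda x, j=j: np.sign(x[j]) for j in range(d)])
print(h_reg(X[0]), h_erm(X[0]), y[0])
```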
Setting (continued)

Define err_P(h) := P_{(X,Y) ∼ P}[h(X) ≠ Y].

Sample complexity for a single joint distribution P:
The (ε, δ)-sample complexity, denoted N(A, P, ε, δ), is the smallest number n such that with probability 1 − δ over the randomness of {(x_i, y_i)}_{i=1}^n, err_P(A({(x_i, y_i)}_{i=1}^n)) ≤ ε.
We also define the ε-expected sample complexity, N*(A, P, ε), as the smallest number n such that E_{(x_i, y_i) ∼ P}[ err_P(A({(x_i, y_i)}_{i=1}^n)) ] ≤ ε.

Sample complexity for a family of distributions 𝒫:
N(A, 𝒫, ε, δ) = max_{P ∈ 𝒫} N(A, P, ε, δ);  N*(A, 𝒫, ε) = max_{P ∈ 𝒫} N*(A, P, ε).

Fact: N*(A, 𝒫, ε + δ) ≤ N(A, 𝒫, ε, δ) ≤ N*(A, 𝒫, εδ), for all ε, δ ∈ [0, 1].
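A short justification of the Fact, as I understand it (my own sketch; the slide states it without proof). Both directions use that err_P ∈ [0, 1], plus Markov's inequality, and it suffices to argue for each fixed P:

```latex
% Sketch: why  N^*(A,P,\varepsilon+\delta) \le N(A,P,\varepsilon,\delta) \le N^*(A,P,\varepsilon\delta).
\begin{align*}
\text{Left, at } n = N(A,P,\varepsilon,\delta):\quad
  \mathbb{E}[\mathrm{err}_P]
   &\le \varepsilon \cdot \Pr[\mathrm{err}_P \le \varepsilon]
      + 1 \cdot \Pr[\mathrm{err}_P > \varepsilon]
    \le \varepsilon + \delta,\\
\text{Right, at } n = N^*(A,P,\varepsilon\delta):\quad
  \Pr[\mathrm{err}_P > \varepsilon]
   &\le \frac{\mathbb{E}[\mathrm{err}_P]}{\varepsilon}
    \le \frac{\varepsilon\delta}{\varepsilon} = \delta
   \qquad \text{(Markov's inequality).}
\end{align*}
```

The first bound says the n achieving the (ε, δ) guarantee also achieves expected error ε + δ; the second says the n achieving expected error εδ also achieves the (ε, δ) guarantee. Taking the maximum over P ∈ 𝒫 preserves both inequalities.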
Parametric Models

A parametric model M : 𝒲 → Y^X is a map from a weight W in the weight space 𝒲 to a hypothesis M(W) : X → Y.
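A tiny concrete instance (my own illustration; the one-hidden-layer ReLU architecture is just an example, not the specific model used later in the talk):

```python
# A parametric model as a map from weights W to a hypothesis M(W): X -> Y.
import numpy as np

def model(W):
    """W = (W1, w2)  ->  the hypothesis  x |-> sign(w2 . relu(W1 x))  in {-1, +1}."""
    W1, w2 = W
    def hypothesis(x):
        hidden = np.maximum(W1 @ x, 0.0)           # ReLU hidden layer
        return 1.0 if w2 @ hidden >= 0 else -1.0   # output label in Y = {-1, +1}
    return hypothesis

# Usage: different weights give different hypotheses over the same input space X = R^d.
rng = np.random.default_rng(0)
d, width = 5, 8
W = (rng.standard_normal((width, d)), rng.standard_normal(width))
h = model(W)
print(h(rng.standard_normal(d)))
```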