Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?
Zhiyuan Li, joint work with Sanjeev Arora and Yi Zhang
Princeton University
August 19, 2020 @ IJTCS
Table of Contents
1. Introduction
2. Intuition and Warm-up Example
3. Identifying Algorithmic Equivariance
4. Lower Bound for Equivariant Algorithms
Introduction

CNNs (convolutional neural networks) often perform better than their fully-connected counterparts (FC nets), especially on vision tasks.

This is not an issue of expressiveness: FC nets easily reach full training accuracy, yet still generalize poorly.

The gap is often explained by a “better inductive bias”. Example: over-parametrized linear regression has multiple solutions, and GD (gradient descent) initialized from 0 picks the one with minimum ℓ2 norm (a small numerical sketch of this follows the slide).

Question: Can we justify this rigorously by showing a sample-complexity separation?
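As a quick illustration of the warm-up example above (my own numerical sketch, not taken from the talk): for an over-parametrized least-squares problem, gradient descent started at 0 converges to the minimum-ℓ2-norm interpolating solution, i.e. the pseudo-inverse solution.

```python
# Sketch (assumed toy setup, not from the slides): GD from 0 on over-parametrized
# linear regression recovers the minimum-l2-norm interpolant X^+ y.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                         # fewer samples than dimensions
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                        # initialize at 0
lr = 2e-3
for _ in range(10_000):                # GD on the squared loss 0.5 * ||Xw - y||^2
    w -= lr * X.T @ (X @ w - y)

w_min_norm = np.linalg.pinv(X) @ y     # minimum-norm interpolating solution
print(np.linalg.norm(X @ w - y))       # ~0: GD interpolates the data
print(np.linalg.norm(w - w_min_norm))  # ~0: and it found the min-norm solution
```

The point of the example: the loss has infinitely many global minimizers, and the optimizer (not the loss) decides which one is returned; this is the kind of inductive bias the talk aims to compare between FC nets and CNNs.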
Introduction (continued)

Since ultra-wide FC nets can simulate any CNN, the hurdle is to show that (S)GD on an FC net does not learn those CNNs with good generalization.

This work: a single distribution plus a single target function that a CNN can learn with a constant number of samples, while SGD on FC nets of any depth and width requires Ω(d²) samples.
Setting

Binary classification: label space Y = {−1, 1}, data domain X = R^d.

Joint distribution P supported on X × Y = R^d × {−1, 1}. In this talk, P_{Y|X} is always a deterministic function h* : R^d → {−1, 1}, i.e. P = P_X ⋄ h*.

A learning algorithm A maps a sequence of training data {(x_i, y_i)}_{i=1}^n ∈ (X × Y)^n to a hypothesis A({(x_i, y_i)}_{i=1}^n) ∈ Y^X. A may also be randomized.

Two examples, kernel regression and ERM (empirical risk minimization), sketched in code after this slide:

    REG_K({(x_i, y_i)}_{i=1}^n)(x) := 1[ K(x, X_n) · K(X_n, X_n)^† y ≥ 0 ],

    ERM_H({(x_i, y_i)}_{i=1}^n) := argmin_{h ∈ H} Σ_{i=1}^n 1[h(x_i) ≠ y_i].¹

¹ Strictly speaking, ERM_H is not a well-defined algorithm; in this talk we consider the worst performance over all empirical risk minimizers in H.
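A minimal runnable sketch of these two example algorithms (my own illustration; the RBF kernel and the tiny hypothesis class of coordinate signs are arbitrary choices, not from the talk):

```python
# REG_K: fit kernel regression on the +/-1 labels, then threshold at 0.
# ERM_H: return a hypothesis in a (here: finite) class H with minimal training error.
import numpy as np

def reg_k(X_train, y_train, kernel):
    """Classifier x -> 1[ K(x, X_n) K(X_n, X_n)^+ y >= 0 ], reported as a label in {-1, +1}."""
    alpha = np.linalg.pinv(kernel(X_train, X_train)) @ y_train
    def h(x):
        k_x = kernel(x[None, :], X_train)          # row vector K(x, X_n)
        return 1.0 if (k_x @ alpha)[0] >= 0 else -1.0
    return h

def erm(X_train, y_train, hypotheses):
    """One empirical risk minimizer: fewest training mistakes, ties broken arbitrarily."""
    return min(hypotheses, key=lambda h: sum(h(x) != y for x, y in zip(X_train, y_train)))

# Usage with an RBF kernel and the hypothesis class {x -> sign(x_j) : j = 1..d}.
rbf = lambda A, B: np.exp(-0.5 * np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1) ** 2)
rng = np.random.default_rng(0)
d = 5
X = rng.standard_normal((10, d))
y = np.sign(X[:, 0])                               # target h*(x) = sign(x_1)
h_reg = reg_k(X, y, rbf)
h_erm = erm(X, y, [lambda x, j=j: np.sign(x[j]) for j in range(d)])
print(h_reg(X[0]), h_erm(X[0]), y[0])
```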
Setting (continued)

Define err_P(h) := P_{(X,Y) ∼ P}[h(X) ≠ Y].

Sample complexity for a single joint distribution P:
The (ε, δ)-sample complexity, denoted N(A, P, ε, δ), is the smallest number n such that with probability 1 − δ over the randomness of {(x_i, y_i)}_{i=1}^n, err_P(A({(x_i, y_i)}_{i=1}^n)) ≤ ε.
We also define the ε-expected sample complexity, N*(A, P, ε), as the smallest number n such that E_{(x_i, y_i) ∼ P}[ err_P(A({(x_i, y_i)}_{i=1}^n)) ] ≤ ε.

Sample complexity for a family of distributions 𝒫:
N(A, 𝒫, ε, δ) = max_{P ∈ 𝒫} N(A, P, ε, δ);  N*(A, 𝒫, ε) = max_{P ∈ 𝒫} N*(A, P, ε).

Fact: N*(A, 𝒫, ε + δ) ≤ N(A, 𝒫, ε, δ) ≤ N*(A, 𝒫, εδ), for all ε, δ ∈ [0, 1].
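A short justification of the Fact, as I understand it (my own sketch; the slide states it without proof). Both directions use that err_P ∈ [0, 1], plus Markov's inequality, and it suffices to argue for each fixed P:

```latex
% Sketch: why  N^*(A,P,\varepsilon+\delta) \le N(A,P,\varepsilon,\delta) \le N^*(A,P,\varepsilon\delta).
\begin{align*}
\text{Left, at } n = N(A,P,\varepsilon,\delta):\quad
  \mathbb{E}[\mathrm{err}_P]
   &\le \varepsilon \cdot \Pr[\mathrm{err}_P \le \varepsilon]
      + 1 \cdot \Pr[\mathrm{err}_P > \varepsilon]
    \le \varepsilon + \delta,\\
\text{Right, at } n = N^*(A,P,\varepsilon\delta):\quad
  \Pr[\mathrm{err}_P > \varepsilon]
   &\le \frac{\mathbb{E}[\mathrm{err}_P]}{\varepsilon}
    \le \frac{\varepsilon\delta}{\varepsilon} = \delta
   \qquad \text{(Markov's inequality).}
\end{align*}
```

The first bound says the n achieving the (ε, δ) guarantee also achieves expected error ε + δ; the second says the n achieving expected error εδ also achieves the (ε, δ) guarantee. Taking the maximum over P ∈ 𝒫 preserves both inequalities.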
Parametric Models

A parametric model M : 𝒲 → Y^X is a map from a weight W in the weight space 𝒲 to a hypothesis M(W) : X → Y.
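A tiny concrete instance (my own illustration; the one-hidden-layer ReLU architecture is just an example, not the specific model used later in the talk):

```python
# A parametric model as a map from weights W to a hypothesis M(W): X -> Y.
import numpy as np

def model(W):
    """W = (W1, w2)  ->  the hypothesis  x |-> sign(w2 . relu(W1 x))  in {-1, +1}."""
    W1, w2 = W
    def hypothesis(x):
        hidden = np.maximum(W1 @ x, 0.0)           # ReLU hidden layer
        return 1.0 if w2 @ hidden >= 0 else -1.0   # output label in Y = {-1, +1}
    return hypothesis

# Usage: different weights give different hypotheses over the same input space X = R^d.
rng = np.random.default_rng(0)
d, width = 5, 8
W = (rng.standard_normal((width, d)), rng.standard_normal(width))
h = model(W)
print(h(rng.standard_normal(d)))
```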