Nearly-tight VC-dimension bounds for piecewise linear neural networks
Nicholas J. A. Harvey, Christopher Liaw, Abbas Mehrabian
University of British Columbia
COLT '17, July 10, 2017
Neural networks

sigma(x) = max{x, 0} (ReLU); each unit computes sigma(w . x + b).

[Figure: a feed-forward network with inputs x_1, ..., x_5, hidden layers 1 through L of sigma units, and an identity output unit.]
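The architecture on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk; the weights below are made-up toy values.

```python
import numpy as np

def relu(z):
    # ReLU activation: sigma(z) = max{z, 0}, applied elementwise
    return np.maximum(z, 0.0)

def forward(x, layers):
    """Evaluate a feed-forward ReLU network.

    `layers` is a list of (W, b) pairs; every layer but the last applies
    the ReLU nonlinearity, and the output unit is the identity (linear).
    """
    h = x
    for W, b in layers[:-1]:
        h = relu(W @ h + b)
    W, b = layers[-1]
    return W @ h + b

# A toy 3-2-1 network with hypothetical weights.
layers = [
    (np.array([[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]), np.array([0.0, 0.1])),
    (np.array([[1.0, -1.0]]), np.array([-0.2])),
]
y = forward(np.array([1.0, 0.5, -1.0]), layers)  # a single linear output
```

The parameter count W in the slides counts every entry of the weight matrices and bias vectors, and L counts the layers.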
VC-dimension

Defn: If F is a family of functions, then VCdim(F) >= m iff there exists S = {x_1, ..., x_m} s.t. F achieves all 2^m sign patterns, i.e. {(sign(f(x_1)), ..., sign(f(x_m))) : f in F} = {0,1}^m.

e.g. Hyperplanes in R^d have VC-dimension d+1.

[Figure: point configurations illustrating a set that can be shattered and a set that is impossible to shatter.]
Thm [Fund. thm. of learning]: F is learnable iff VCdim(F) < infinity. Moreover, the sample complexity is Theta(VCdim(F)).
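The shattering definition can be checked by brute force on small examples. The sketch below (my illustration, not part of the talk) searches a grid of halfplanes in R^2 and verifies that 3 points in general position are shattered while 4 points are not, matching VC-dimension d+1 = 3.

```python
import itertools
import numpy as np

def shatters(points, classifier_family):
    """Brute-force check whether a family of classifiers shatters `points`.

    `classifier_family` is an iterable of functions point -> {0, 1}.
    Returns True iff all 2^m labelings of the m points are achieved.
    """
    m = len(points)
    achieved = {tuple(f(p) for p in points) for f in classifier_family}
    return len(achieved) == 2 ** m

def halfplanes(resolution=9):
    # Halfplane classifiers x -> sign(w . x + b), over a grid of (w, b).
    vals = np.linspace(-2, 2, resolution)
    for w1, w2, b in itertools.product(vals, vals, vals):
        yield lambda p, w=np.array([w1, w2]), b=b: int(w @ p + b > 0)

three = [np.array(p) for p in [(0, 0), (1, 0), (0, 1)]]
four = [np.array(p) for p in [(0, 0), (1, 0), (0, 1), (1, 1)]]
print(shatters(three, halfplanes()))  # True: all 8 labelings achieved
print(shatters(four, halfplanes()))   # False: the XOR labeling is unreachable
```

The finite grid suffices here because each desired labeling is achieved by some halfplane with small integer-ish coefficients; in general one would search or solve for a separating hyperplane per labeling.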
VC-dimension of NNs

W = # parameters/edges, L = # layers

Known upper bounds:
- O(WL log W + WL^2) [BMM '98]
- O(W^2) [GJ '95]

Known lower bounds:
- Omega(WL) [BMM '98]
- Omega(W log W) [M '94]

Main Thm [HLM '17]: For a ReLU NN with W params and L layers,
  Omega(WL log(W/L)) <= VCdim <= O(WL log W).

The lower bound means there exists an NN with this VCdim. The upper bound was independently proved by Bartlett '17.

Recently, lots of work on "power of depth" for expressiveness of NNs [T '16, ES '16, Y '16, LS '16, SS '16, CSS '16, LGMRA '17, D '17].
Lower bound (refinement of [BMM '98])

- Shattered set: S = {x_i}_{i in [n]} x {e_j}_{j in [m]}
- Encode h with weights a_i = 0.a_{i,1} ... a_{i,m} (binary), where a_{i,j} = h(x_i, e_j)
- Given x_i, easy to extract a_i
- Design a bit extractor to extract a_{i,j}
- [BMM '98] do this 1 bit per layer => Omega(WL)
- More efficient: log(W/L) bits per layer => Omega(WL log(W/L))

[Figure: an NN block extracts the bits of a_i; on input (x_i, e_j), a selector picks out bit j, which feeds the rest of the network.]

Thm [HLM '17]: Suppose a ReLU NN with W params and L layers extracts the m-th bit of its input. Then m <= O(L log(W/L)).
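The encoding step above can be illustrated arithmetically. This sketch is not the paper's ReLU construction (which extracts log(W/L) bits per layer with threshold gadgets); it only shows that packing a row of h's values into the binary expansion of one weight is lossless, using `floor` as a stand-in for the comparisons a network would implement.

```python
import math

def encode(bits):
    # Pack a row of bits h(x_i, e_1..e_m) into one weight a_i = 0.b1 b2 ... bm (binary).
    a = 0.0
    for j, b in enumerate(bits, start=1):
        a += b * 2.0 ** (-j)
    return a

def extract_bit(a, j):
    # Recover bit j of a = 0.b1 b2 ...: shift left by j places, test the parity.
    return math.floor(a * 2 ** j) % 2

row = [1, 0, 1, 1, 0]           # hypothetical values h(x_i, e_j) for j = 1..5
a = encode(row)                  # a single real weight encoding the whole row
recovered = [extract_bit(a, j) for j in range(1, 6)]
print(recovered)                 # [1, 0, 1, 1, 0]
```

Since each of the n weights a_i stores m bits, shattering the n*m points S requires only that the network can decode this expansion; the theorem on this slide caps how many bits depth-L ReLU networks can decode.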
Upper bound (refinement of [BMM '98] for ReLU)

- Fix a shattered set S = {x_1, ..., x_m}
- Partition parameter space s.t. the input to the 1st hidden layer has constant sign
  - can replace sigma with 0 (if < 0) or the identity (if > 0)!
- Size of the partition is small, i.e. <= (Cm)^W [Warren '68]
- Repeat the procedure for each layer to get a partition of size <= (CmL)^{O(WL)}
- In each piece, the output is a polynomial of degree L, so the total # of sign patterns is <= (CmL)^{O(WL)}
- Since S is shattered, we need 2^m <= (CmL)^{O(WL)}, which implies m = O(WL log W)

* C > 1 is some constant
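The final counting step can be checked numerically: the largest m satisfying 2^m <= (CmL)^{cWL} indeed grows like WL log W. The constants C and c below are illustrative placeholders, not the ones hidden in the slide's O(.) notation.

```python
import math

def max_shatterable(W, L, C=2.0, c=1.0):
    """Largest m with 2^m <= (C*m*L)^(c*W*L), found by linear scan.

    Taking logs, the condition is m*ln(2) <= c*W*L*ln(C*m*L); the left side
    is linear in m and the right side logarithmic, so the scan terminates.
    """
    m = 1
    while m * math.log(2) <= c * W * L * math.log(C * m * L):
        m += 1
    return m - 1

for W, L in [(10, 2), (100, 5), (1000, 10)]:
    m = max_shatterable(W, L)
    print(W, L, m, m / (W * L * math.log2(W)))  # last ratio stays bounded
```

The printed ratio m / (WL log2 W) staying bounded as W and L grow is exactly the m = O(WL log W) conclusion of the slide.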