Mad Max: Affine Spline Insights into Deep Learning
Richard Baraniuk
[Figure: expectations over time]
Greek questions for the Babylonians
• Why is deep learning so effective?
• Can we derive deep learning systems from first principles?
• When and why does deep learning fail?
• How can deep learning systems be improved and extended in a principled fashion?
• Where is the foundational framework for theory?
See also Mallat, Soatto, Arora, Poggio, Tishby, [growing community] …
splines and deep learning
R. Balestriero & R. Baraniuk:
• “A Spline Theory of Deep Networks,” ICML 2018
• “Mad Max: Affine Spline Insights into Deep Learning,” arxiv.org/abs/1805.06576, 2018
• “From Hard to Soft: Understanding Deep Network Nonlinearities…,” ICLR 2019
• “A Max-Affine Spline Perspective of RNNs,” ICLR 2019 (w/ J. Wang)
prediction problem
• Unknown function/operator $f$ mapping data $x$ (signal, image, video, …) to labels $y$: $y = f(x)$
• Goal: Learn an approximation to $f$ using training data $\{(x_i, y_i)\}_{i=1}^{n}$: $\hat{y} = f_\Theta(x)$
deep nets approximate
• Deep nets solve a function approx problem (black box): $\hat{y} = f_\Theta(x)$
deep nets approximate
• Deep nets solve a function approx problem hierarchically (e.g., layer 1: conv | ReLU; layer 2: conv | ReLU | max-pool; layer 3: …)
  $\hat{y} = f_\Theta(x) = \left( f^{(L)}_{\theta^{(L)}} \circ \cdots \circ f^{(3)}_{\theta^{(3)}} \circ f^{(2)}_{\theta^{(2)}} \circ f^{(1)}_{\theta^{(1)}} \right)(x)$
deep nets and splines
• Deep nets solve a function approx problem hierarchically using a very special family of splines (e.g., layer 1: conv | ReLU; layer 2: conv | ReLU | max-pool; layer 3: …)
  $\hat{y} = f_\Theta(x) = \left( f^{(L)}_{\theta^{(L)}} \circ \cdots \circ f^{(3)}_{\theta^{(3)}} \circ f^{(2)}_{\theta^{(2)}} \circ f^{(1)}_{\theta^{(1)}} \right)(x)$
deep nets and splines
spline approximation
• A spline function approximation consists of
  – a partition Ω of the independent variable (input space)
  – a (simple) local mapping on each region of the partition
    (our focus: piecewise-affine mappings)
spline approximation
• A spline function approximation consists of
  – a partition Ω of the independent variable (input space)
  – a (simple) local mapping on each region of the partition
• Powerful splines
  – free, unconstrained partition Ω (ex: “free-knot” splines)
  – jointly optimize both the partition and local mappings
    (highly nonlinear, computationally intractable)
• Easy splines
  – fixed partition (ex: uniform grid, dyadic grid)
  – need only optimize the local mappings
max-affine spline (MAS)   [Magnani & Boyd, 2009; Hannah & Dunson, 2013]
• Consider piecewise-affine approximation of a convex function over $R$ regions
  – Affine functions: $a_r^T x + b_r,\ r = 1, \ldots, R$
  – Convex approximation: $z(x) = \max_{r=1,\ldots,R}\ a_r^T x + b_r$
  [Figure: $R = 4$ affine pieces $(a_1, b_1), \ldots, (a_4, b_4)$ forming a convex fit]
max-affine spline (MAS)   [Magnani & Boyd, 2009; Hannah & Dunson, 2013]
• Key: Any set of affine parameters $(a_r, b_r),\ r = 1, \ldots, R$ implicitly determines a spline partition
  – Affine functions: $a_r^T x + b_r,\ r = 1, \ldots, R$
  – Convex approximation: $z(x) = \max_{r=1,\ldots,R}\ a_r^T x + b_r$
scale + bias | ReLU is a MAS
• Scale $x$ by $a$ + bias $b$ | ReLU: $z(x) = \max(0, ax + b)$
  – Affine functions: $(a_1, b_1) = (0, 0),\ (a_2, b_2) = (a, b)$
  – Convex approximation: $z(x) = \max_{r=1,2}\ a_r^T x + b_r$ with $R = 2$
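A minimal NumPy sketch of this special case: maxing over the two affine pieces $(0,0)$ and $(a,b)$ reproduces the scale + bias | ReLU unit exactly (the values of `a` and `b` below are arbitrary illustrations).

```python
# Minimal sketch (NumPy): scale + bias | ReLU as a max-affine spline with R = 2
# affine pieces (a_1, b_1) = (0, 0) and (a_2, b_2) = (a, b).
import numpy as np

def mas(x, A, B):
    """Max-affine spline: z(x) = max_r a_r * x + b_r (scalar x)."""
    return np.max(A * x + B)

a, b = 1.7, -0.5                      # arbitrary scale and bias
A = np.array([0.0, a])                # (a_1, a_2)
B = np.array([0.0, b])                # (b_1, b_2)

for x in np.linspace(-3, 3, 13):
    assert np.isclose(mas(x, A, B), max(0.0, a * x + b))
```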
max-affine spline operator (MASO)
• A MAS for $x \in \mathbb{R}^D$ has affine parameters $a_r \in \mathbb{R}^D$, $b_r \in \mathbb{R}$
• A MASO is simply a concatenation of $K$ MASs, the $k$-th with parameters $[A]_{k,i,r}$, $[b]_{k,r}$, mapping $x \in \mathbb{R}^D$ to $z \in \mathbb{R}^K$
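A minimal sketch of a MASO forward pass, assuming the parameter layout $A \in \mathbb{R}^{K \times R \times D}$, $b \in \mathbb{R}^{K \times R}$; it also checks that a fully connected layer followed by ReLU is a MASO with $R = 2$ pieces per output (the helper names and sizes are illustrative, not from the papers).

```python
# Minimal sketch (NumPy) of a max-affine spline operator (MASO):
# K independent MASs, the k-th with R affine pieces A[k, r, :] and b[k, r].
import numpy as np

def maso(x, A, b):
    """x: (D,), A: (K, R, D), b: (K, R) -> z: (K,), z[k] = max_r A[k, r] . x + b[k, r]."""
    return np.max(A @ x + b, axis=1)

# A fully connected layer followed by ReLU is a MASO with R = 2 pieces per output:
# piece 1 = (row of W, bias entry), piece 2 = (0, 0).
D, K = 5, 3
rng = np.random.default_rng(0)
W, c = rng.normal(size=(K, D)), rng.normal(size=K)

A = np.stack([W, np.zeros((K, D))], axis=1)   # (K, 2, D)
b = np.stack([c, np.zeros(K)], axis=1)        # (K, 2)

x = rng.normal(size=D)
assert np.allclose(maso(x, A, b), np.maximum(W @ x + c, 0.0))
```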
modern deep nets
• Focus: The lion’s share of today’s deep net architectures (convnets, resnets, skip-connection nets, inception nets, recurrent nets, …) employ piecewise linear (affine) layers (fully connected, conv; (leaky) ReLU, abs value; max/mean/channel-pooling)
  $\hat{y} = f_\Theta(x) = \left( f^{(L)}_{\theta^{(L)}} \circ \cdots \circ f^{(3)}_{\theta^{(3)}} \circ f^{(2)}_{\theta^{(2)}} \circ f^{(1)}_{\theta^{(1)}} \right)(x)$
theorems
• Each deep net layer is a MASO
  – convex wrt each output dimension, piecewise-affine operator
theorems   (WLOG ignore output softmax)
• Each deep net layer is a MASO
  – convex, piecewise-affine operator
• A deep net is a composition of MASOs
  – non-convex, piecewise-affine spline operator
theorems
• A deep net is a composition of MASOs
  – non-convex, piecewise-affine spline operator
• A deep net is a convex MASO iff the convolution/fully connected weights in all but the first layer are nonnegative and the intermediate nonlinearities are nondecreasing
MASO spline partition
• The parameters of each deep net layer (MASO) induce a partition of its input space with convex regions
  – vector quantization (info theory)
  – k-means (statistics)
  – Voronoi tiling (geometry)
MASO spline partition
• The L layer-partitions of an L-layer deep net combine to form the global input signal space partition
  – affine spline operator
  – non-convex regions
• Toy example: 3-layer “deep net”
  – Input x: 2-D (4 classes)
  – Fully connected | ReLU (45-D output)
  – Fully connected | ReLU (3-D output)
  – Fully connected | (softmax) (4-D output)
  – Output y: 4-D
MASO spline partition
• The L layer-partitions of an L-layer deep net combine to form the global input signal space partition
  – affine spline operator
  – non-convex regions
• Toy example: 3-layer “deep net” (as above: 2-D input, FC | ReLU 45-D, FC | ReLU 3-D, FC | softmax 4-D output)
  [Figure: toy-example data in the input plane, axes x[1], x[2]]
MASO spline partition
• Toy example: 3-layer “deep net” (as above: 2-D input, FC | ReLU 45-D, FC | ReLU 3-D, FC | softmax 4-D output)
• VQ partition of layer 1 depicted in the input space (see the sketch below)
  – convex regions
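One way to reproduce such a layer-1 partition picture is to evaluate the ReLU activation pattern on a grid of 2-D inputs: each distinct on/off pattern is one convex region. A minimal sketch for the toy example's 2-D → 45-D first layer, with illustrative random weights:

```python
# Minimal sketch (NumPy): the VQ partition induced by one FC | ReLU layer,
# visualized in a 2-D input space. Each distinct ReLU on/off pattern is one
# (convex) partition region. Weights and grid are illustrative.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(45, 2)), rng.normal(size=45)    # layer 1: 2-D -> 45-D

xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
X = np.stack([xx.ravel(), yy.ravel()], axis=1)             # grid of 2-D inputs

codes = (X @ W1.T + b1 > 0)                                # ReLU activation pattern Q(x)
_, region_id = np.unique(codes, axis=0, return_inverse=True)
print("layer-1 regions hit by the grid:", region_id.max() + 1)
# Displaying region_id.reshape(xx.shape) (e.g. with matplotlib) shows the
# convex layer-1 partition regions in the input plane.
```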
MASO spline partition
• Toy example: 3-layer “deep net” (as above: 2-D input, FC | ReLU 45-D, FC | ReLU 3-D, FC | softmax 4-D output)
• Given the partition region $Q(x)$ containing $x$, the layer input/output mapping is affine:
  $z(x) = A_{Q(x)}\, x + b_{Q(x)}$
MASO spline partition
• Toy example: 3-layer “deep net” (as above: 2-D input, FC | ReLU 45-D, FC | ReLU 3-D, FC | softmax 4-D output)
• VQ partition of layer 2 depicted in the input space
  – non-convex regions due to visualization in the input space
MASO spline partition
• Toy example: 3-layer “deep net” (as above: 2-D input, FC | ReLU 45-D, FC | ReLU 3-D, FC | softmax 4-D output)
• Given the partition region $Q(x)$ containing $x$, the layer input/output mapping is affine:
  $z(x) = A_{Q(x)}\, x + b_{Q(x)}$
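For ReLU layers the region code is exactly the set of active units, so $A_{Q(x)}$ and $b_{Q(x)}$ can be read off by masking the weights with that code. A minimal sketch for the first two toy-example layers, with illustrative random weights:

```python
# Minimal sketch (NumPy): for a given x, recover the local affine parameters
# A_Q(x), b_Q(x) of the first two FC | ReLU layers by masking with the
# ReLU activation pattern of the region containing x. Shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(45, 2)), rng.normal(size=45)
W2, b2 = rng.normal(size=(3, 45)), rng.normal(size=3)

x = rng.normal(size=2)

q1 = (W1 @ x + b1 > 0).astype(float)          # layer-1 region code Q(x)
A1, c1 = q1[:, None] * W1, q1 * b1            # layer 1 on this region: A1 x + c1

q2 = (W2 @ (A1 @ x + c1) + b2 > 0).astype(float)
A2, c2 = q2[:, None] * W2, q2 * b2            # layer 2 on this region: A2 (.) + c2

A_Q = A2 @ A1                                  # composed local affine map
b_Q = A2 @ c1 + c2
assert np.allclose(A_Q @ x + b_Q,
                   np.maximum(W2 @ np.maximum(W1 @ x + b1, 0) + b2, 0))
```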
MASO spline partition
• Toy example: 3-layer “deep net” (as above: 2-D input, FC | ReLU 45-D, FC | ReLU 3-D, FC | softmax 4-D output)
• VQ partition of layers 1 & 2 depicted in the input space
  – non-convex regions
learning layers 1 & 2
[Figure: partition evolution over learning epochs (time)]
local affine mapping – CNN
(WLOG ignore output softmax)
local affine mapping – CNN
• Fixed but different $A_{Q(x)}$, $b_{Q(x)}$ in each partition region
matched filters
deep nets are matched filterbanks
  $z^{(L)}(x) = A_{Q(x)}\, x + b_{Q(x)}$
• Row $c$ of $A_{Q(x)}$ is a vectorized signal/image corresponding to class $c$
• Entry $c$ of the deep net output $z^{(L)}(x)$ = inner product between row $c$ and the signal
• For classification, select the largest output: a matched filter!
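A minimal sketch of this matched-filter reading, using a small FC | ReLU net with a linear classifier head (all weights and sizes are illustrative): each row of the end-to-end $A_{Q(x)}$ plays the role of a class template matched against the input.

```python
# Minimal sketch (NumPy): reading a piecewise-affine net as a matched filterbank.
# Two FC | ReLU layers plus a 4-class linear head; weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(45, 2)), rng.normal(size=45)
W2, b2 = rng.normal(size=(3, 45)), rng.normal(size=3)
W3, b3 = rng.normal(size=(4, 3)), rng.normal(size=4)

def local_affine(x):
    """Return (A_Q(x), b_Q(x)) of the whole net on the region containing x."""
    q1 = (W1 @ x + b1 > 0).astype(float)
    A, b = q1[:, None] * W1, q1 * b1
    q2 = (W2 @ (A @ x + b) + b2 > 0).astype(float)
    A, b = (q2[:, None] * W2) @ A, q2 * (W2 @ b + b2)
    return W3 @ A, W3 @ b + b3                   # pre-softmax logits: A_Q x + b_Q

x = rng.normal(size=2)
A_Q, b_Q = local_affine(x)
logits = A_Q @ x + b_Q       # logit c = <template row c, x> + bias: a matched filter
print("predicted class:", np.argmax(logits))
```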
data memorization
orthogonal deep nets
partition-based signal distance
additional directions
• Study the geometry of deep nets and signals via the VQ partition
• The affine input/output formula enables explicit calculation of the Lipschitz constant of a deep net, for the analysis of stability, adversarial examples, … (see the sketch below)
• Theory covers many recurrent neural networks (RNNs)
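For instance, since the net is exactly affine on the region containing $x$, its local Lipschitz constant (in $\ell_2$) is simply the spectral norm of $A_{Q(x)}$. A minimal sketch with an illustrative one-hidden-layer ReLU net:

```python
# Minimal sketch (NumPy): on the region containing x a ReLU net is exactly affine,
# so its local Lipschitz constant (in l2) is the spectral norm of A_Q(x).
# W1, b1, W2, b2 are illustrative one-hidden-layer weights.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(45, 2)), rng.normal(size=45)
W2, b2 = rng.normal(size=(4, 45)), rng.normal(size=4)

x = np.array([0.3, -1.2])
q1 = (W1 @ x + b1 > 0).astype(float)           # region code Q(x)
A_Q = W2 @ (q1[:, None] * W1)                  # end-to-end local affine slope
print("local Lipschitz constant:", np.linalg.norm(A_Q, ord=2))
```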
additional directions
• Theory extends to non-piecewise-affine operators (ex: sigmoid) by replacing the “hard VQ” of a MASO with a “soft VQ”
  – soft VQ can generate new nonlinearities (ex: swish)
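A minimal sketch of the hard-to-soft replacement for the ReLU MAS: weighting the two affine pieces $0$ and $x$ by a softmax (instead of taking their max) gives a sigmoid gate on $x$, i.e. the swish unit $x\,\sigma(x)$. The temperature `tau` below is an illustrative knob, not notation from the papers.

```python
# Minimal sketch (NumPy): replacing the hard VQ (max over affine pieces) of a MAS
# with a soft VQ (softmax-weighted pieces). With the two ReLU pieces 0 and x,
# the soft selection weight on the piece x is sigmoid(x), yielding swish.
import numpy as np

def soft_mas(x, A, B, tau=1.0):
    """Softmax-weighted combination of the affine pieces a_r * x + b_r (scalar x)."""
    pieces = A * x + B
    w = np.exp(pieces / tau)
    return np.sum(w / w.sum() * pieces)

A, B = np.array([0.0, 1.0]), np.array([0.0, 0.0])    # ReLU pieces: 0 and x

for x in np.linspace(-4, 4, 9):
    swish = x / (1.0 + np.exp(-x))                    # x * sigmoid(x)
    assert np.isclose(soft_mas(x, A, B), swish)
```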