

  1. Mehler’s Formula, Branching Processes, and Compositional Kernels of Deep Neural Networks. Tengyuan Liang, Hai Tran-Bach. May 26, 2020.

  2. Motivations & Questions ◮ DNN and Kernels (Rahimi, Recht ’08; Belkin et al ’18; Jacot et al ’19) . What role do the activation functions play in the connections between DNN and Kernels? H. Tran-Bach Compositional Kernels of Deep Neural Networks 2

  3. Motivations & Questions ◮ DNN and Kernels (Rahimi, Recht ’08; Belkin et al ’18; Jacot et al ’19) . What role do the activation functions play in the connections between DNN and Kernels? ◮ Interpolation and Generalization (Zhang et al ’17; Belkin et al ’18; Liang, Rakhlin ’18) . How does the activation function interplay with depth, sample size, and input dimensionality in terms of memorization capacity? H. Tran-Bach Compositional Kernels of Deep Neural Networks 2

  4. Motivations & Questions ◮ DNN and Kernels (Rahimi, Recht ’08; Belkin et al ’18; Jacot et al ’19) . What role do the activation functions play in the connections between DNN and Kernels? ◮ Interpolation and Generalization (Zhang et al ’17; Belkin et al ’18; Liang, Rakhlin ’18) . How does the activation function interplay with depth, sample size, and input dimensionality in terms of memorization capacity? ◮ Is there hope to design activation functions such that we can ”compress” multiple layers? H. Tran-Bach Compositional Kernels of Deep Neural Networks 2

  3. Multi-Layer Perceptron with Random Weights (Neal ’96; Rahimi, Recht ’08; Daniely et al ’07)
Input: $x^{(0)} := x \in \mathbb{R}^d$
Hidden Layers: $x^{(\ell+1)} := \sigma\big(W^{(\ell)} x^{(\ell)} / \|x^{(\ell)}\|\big) \in \mathbb{R}^{d_{\ell+1}}$, for $0 \le \ell < L$
Random Weights: $W^{(\ell)} \in \mathbb{R}^{d_{\ell+1} \times d_\ell}$, $W^{(\ell)} \sim \mathcal{MN}(0, I_{d_{\ell+1}} \otimes I_{d_\ell})$
Regime: $d_1, \ldots, d_L \to \infty$
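As a concrete reference, here is a minimal NumPy sketch of this random-weight forward pass; the function name, the choice of ReLU, and the widths are illustrative and not taken from the paper.

    import numpy as np

    def random_weight_mlp(x, widths, sigma, rng):
        """Forward pass x^(l+1) = sigma(W^(l) x^(l) / ||x^(l)||) with iid N(0,1) weights."""
        h = np.asarray(x, dtype=float)
        for d_next in widths:                              # widths = [d_1, ..., d_L]
            W = rng.standard_normal((d_next, h.shape[0]))  # W^(l) ~ MN(0, I_{d_{l+1}} ⊗ I_{d_l})
            h = sigma(W @ (h / np.linalg.norm(h)))
        return h

    rng = np.random.default_rng(0)
    x = rng.standard_normal(32)                            # input in R^d with d = 32
    out = random_weight_mlp(x, [512, 512, 512], lambda z: np.maximum(z, 0.0), rng)
    print(out.shape)                                       # (512,)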

  4. Duality: Activation and Kernel
Activation (Hermite polynomials): $\sigma(x) = \sum_{k=0}^{\infty} \alpha_k h_k(x)$, with $\sum_{k=0}^{\infty} \alpha_k^2 = 1$.
[Figure: the Hermite polynomials $h_0, \ldots, h_5$ plotted on $[-2, 2]$.]
Dual Kernel: $K(x_i, x_j) := \mathbb{E}_{w \sim \mathcal{N}(0, I_d)}\big[\sigma(w^\top x_i/\|x_i\|)\,\sigma(w^\top x_j/\|x_j\|)\big] = \sum_{k=0}^{\infty} \alpha_k^2 \rho_{ij}^k =: G(\rho_{ij})$, where $\rho_{ij} := \langle x_i/\|x_i\|,\, x_j/\|x_j\| \rangle$.
Compositional Kernel: $K^{(L)}(x_i, x_j) = \underbrace{G \circ G \circ \cdots \circ G}_{L \text{ times}}(\rho_{ij}) =: G^{(L)}(\rho_{ij})$.
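A quick numerical sanity check of this duality, using a toy activation built from the first two orthonormal Hermite polynomials, so that $G(\rho) = \alpha_1^2 \rho + \alpha_2^2 \rho^2$ exactly; the coefficients and dimensions below are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 50
    a1, a2 = np.sqrt(0.7), np.sqrt(0.3)            # alpha_1^2 + alpha_2^2 = 1

    def sigma(x):
        # sigma = a1*h1 + a2*h2 with h1(x) = x and h2(x) = (x^2 - 1)/sqrt(2)
        return a1 * x + a2 * (x**2 - 1.0) / np.sqrt(2.0)

    xi, xj = rng.standard_normal(d), rng.standard_normal(d)
    xi, xj = xi / np.linalg.norm(xi), xj / np.linalg.norm(xj)
    rho = xi @ xj

    W = rng.standard_normal((200_000, d))          # Monte Carlo draws of w ~ N(0, I_d)
    K_mc = np.mean(sigma(W @ xi) * sigma(W @ xj))  # E_w[sigma(w.xi) sigma(w.xj)]
    K_series = a1**2 * rho + a2**2 * rho**2        # G(rho) = sum_k alpha_k^2 rho^k
    print(K_mc, K_series)                          # agree up to Monte Carlo error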

  5. Branching Process and Compositional Kernels
Offspring distribution: $Y$, with $P(Y = k) = \alpha_k^2$ and probability generating function (PGF) $G$.
Galton-Watson process: $Z^{(L)}$, with offspring distribution $Y$ and PGFs $G^{(L)}$.
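A small simulation illustrating this correspondence, with a hypothetical offspring law $(\alpha_0^2, \alpha_1^2, \alpha_2^2) = (0.2, 0.5, 0.3)$: the empirical generating function of $Z^{(L)}$ should match the $L$-fold composition $G^{(L)}$.

    import numpy as np

    rng = np.random.default_rng(1)
    pmf = np.array([0.2, 0.5, 0.3])               # P(Y = k) = alpha_k^2 (illustrative values)

    def G(s):                                     # PGF of the offspring distribution Y
        return sum(p * s**k for k, p in enumerate(pmf))

    def sample_Z(L):                              # Galton-Watson population after L generations, Z^(0) = 1
        z = 1
        for _ in range(L):
            z = int(rng.choice(len(pmf), size=z, p=pmf).sum()) if z > 0 else 0
        return z

    L, s = 4, 0.7
    emp = np.mean([s ** sample_Z(L) for _ in range(50_000)])   # empirical E[s^{Z^(L)}]
    comp = s
    for _ in range(L):
        comp = G(comp)                            # G^(L)(s) = G ∘ ... ∘ G (L times)
    print(emp, comp)                              # the two agree up to simulation error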

  6. Rescaled Limit: Phase Transition
Theorem (Liang, Tran-Bach ’20). Define $\mu := \sum_{k \ge 0} k\,\alpha_k^2$ and $\mu_\star := \sum_{k > 2} \alpha_k^2\, k \log k$. Then, for all $t > 0$:
1. If $\mu \le 1$: $\lim_{L \to \infty} K^{(L)}(e^{-t}) = 1$ if $\alpha_1 \ne 1$, and $= e^{-t}$ if $\alpha_1 = 1$.
2. If $\mu > 1$: $\lim_{L \to \infty} K^{(L)}(e^{-t/\mu^L}) = \xi + (1-\xi)\,\mathbb{E}[e^{-tW}]$ if $\mu_\star < \infty$, and $= 0$ if $\mu_\star = \infty$.
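A toy illustration (not from the slides): take an activation whose only nonzero Hermite coefficients are $\alpha_0$ and $\alpha_2$, so $G(\rho) = \alpha_0^2 + \alpha_2^2 \rho^2$ and $\mu = 2\alpha_2^2$. If $\alpha_2^2 \le 1/2$ then $\mu \le 1$ and the iterates $G^{(L)}(e^{-t})$ increase to $1$, matching case 1 (here $\alpha_1 = 0 \ne 1$). If $\alpha_2^2 > 1/2$ then $\mu > 1$ and the unscaled iterates instead converge to the smallest fixed point $\xi = \alpha_0^2/\alpha_2^2$, the extinction probability of the associated Galton-Watson process, so it is the rescaled iterates $G^{(L)}(e^{-t/\mu^L})$ that retain a nondegenerate limit, as in case 2.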

  7. Kernel Limits. Example: centered ReLU
[Figure: unscaled limit $K^{(L)}(t)$ for the centered ReLU, $L = 1, 3, 5, 7$, plotted over $t \in [-1, 1]$, with the value $1/\mu^L$ marked in each panel.]
[Figure: rescaled limit $K^{(L)}(e^{-t/\mu^L})$ for $L = 1, 3, 5, 7$, plotted over $t \in [0, 1]$.]
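To get a feel for these curves numerically, one can iterate the centered ReLU’s dual kernel. The sketch below assumes the standard arc-cosine identity $\mathbb{E}[\mathrm{relu}(u)\,\mathrm{relu}(v)] = (\sqrt{1-\rho^2} + \rho(\pi - \arccos\rho))/(2\pi)$ for standard normals with correlation $\rho$, then centers and variance-normalizes; the slides’ exact normalization of “centered ReLU” may differ.

    import numpy as np

    m2 = 1.0 / (2.0 * np.pi)                      # E[relu(Z)]^2
    var = 0.5 - m2                                # Var[relu(Z)]

    def G(rho):
        raw = (np.sqrt(1.0 - rho**2) + rho * (np.pi - np.arccos(rho))) / (2.0 * np.pi)
        return (raw - m2) / var                   # centered and normalized: G(1) = 1, G(0) = 0

    mu = 0.5 / var                                # G'(1) = sum_k k*alpha_k^2, roughly 1.47 here
    print("mu ≈", round(mu, 3))

    rho_grid = np.linspace(-1.0, 1.0, 9)          # unscaled iterates K^(L)(rho)
    t_grid = np.linspace(0.0, 1.0, 5)             # rescaled iterates K^(L)(exp(-t/mu^L))
    for L in (1, 3, 5, 7):
        u, r = rho_grid.copy(), np.exp(-t_grid / mu**L)
        for _ in range(L):
            u, r = G(u), G(r)
        print(f"L={L}", np.round(u, 3), np.round(r, 3))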

  8. Memorization Capacity
◮ “small correlation” ($\sup_{ij} |\rho_{ij}| \approx 0$): $x_1, \ldots, x_n \overset{\mathrm{iid}}{\sim} \mathrm{Unif}(S^{d-1})$ and $\log(n)/d \to 0$.
◮ “large correlation” ($\sup_{ij} |\rho_{ij}| \approx 1$): $x_1, \ldots, x_n$ a maximal packing of $S^{d-1}$ and $\log(n)/d \to \infty$.
[Figure: schematic point configurations on the sphere for the small-correlation and large-correlation regimes.]
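A quick numerical check of the small-correlation regime; the values of $n$ and $d$ are illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 2000, 4000                              # log(n)/d ≈ 0.002
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # x_1, ..., x_n iid uniform on S^{d-1}
    R = X @ X.T
    np.fill_diagonal(R, 0.0)
    print(np.abs(R).max())                         # sup_ij |rho_ij|: on the order of sqrt(log(n)/d), well below 1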

  9. Memorization Capacity Theorem
Theorem (Liang & Tran-Bach ’20). It suffices to take
$L \gtrsim \dfrac{\log(n\kappa^{-1}) + \log\frac{\log n}{d}}{\log \alpha_1^{-2}}$ (small correlation),
$L \gtrsim \dfrac{\exp\!\big(\frac{2\log n}{d}\big)\,\log(n\kappa^{-1})}{\mu - 1}$ (large correlation),
to memorize the data in the sense that $1 - \kappa \le \lambda_i \le 1 + \kappa$, where the $\lambda_i$ are the eigenvalues of $K := \{K(x_i, x_j)\}_{ij}$.
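A numerical illustration of the small-correlation statement, using a hypothetical centered activation with $\alpha_1^2 = \alpha_2^2 = 1/2$, so $G(\rho) = \tfrac{1}{2}\rho + \tfrac{1}{2}\rho^2$; the sizes $n$ and $d$ are illustrative.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 500, 1000
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    K = X @ X.T                                   # rho_ij off the diagonal, exactly 1 on the diagonal

    def G(R):
        return 0.5 * R + 0.5 * R**2               # dual kernel applied entrywise; G(1) = 1 keeps the diagonal at 1

    for L in range(1, 9):
        K = G(K)                                  # K^(L) = G^(L)(rho_ij) entrywise
        lam = np.linalg.eigvalsh(K)
        print(f"L={L}: eigenvalues in [{lam.min():.3f}, {lam.max():.3f}]")
    # The spectrum tightens around 1 as L grows, i.e. 1 - kappa <= lambda_i <= 1 + kappa with shrinking kappa.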

  10. New Random Features Algorithm

    Kernels                                   Activation    Sampling
    shift-invariant (Rahimi, Recht ’08)       cos, sin      ≈
    inner-product (Liang, Tran-Bach ’20)      ≈             Gaussian
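The slide does not spell out the sampling scheme, but given the dual-kernel definition $K(x_i, x_j) = \mathbb{E}_{w \sim \mathcal{N}(0, I_d)}[\sigma(w^\top x_i/\|x_i\|)\,\sigma(w^\top x_j/\|x_j\|)]$, a natural Gaussian-sampled random-features map is sketched below; the normalized ReLU and the sizes are placeholders, not the paper’s designed activation.

    import numpy as np

    rng = np.random.default_rng(4)

    def random_features(X, m, sigma, rng):
        """phi(x) = sigma(W x / ||x||) / sqrt(m), W with iid N(0,1) entries, so phi(x_i)·phi(x_j) ≈ K(x_i, x_j)."""
        W = rng.standard_normal((m, X.shape[1]))
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        return sigma(Xn @ W.T) / np.sqrt(m)

    n, d, m = 200, 30, 20_000
    X = rng.standard_normal((n, d))
    sigma = lambda z: np.sqrt(2.0) * np.maximum(z, 0.0)   # normalized ReLU, so E[sigma(Z)^2] = 1
    Phi = random_features(X, m, sigma, rng)
    K_hat = Phi @ Phi.T                                   # Monte Carlo estimate of the dual-kernel Gram matrix
    print(K_hat.shape, np.round(np.diag(K_hat)[:3], 3))   # diagonal entries concentrate near G(1) = 1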

  11. Experiment: MNIST & CIFAR10

    Activation    ReLU    GeLU    Sigmoid    Swish
    µ             0.95    1.08    0.15       1.07
    α_1²          0.50    0.59    0.15       0.80

[Figure: MNIST (L = 1, 2, 3) and CIFAR10 (L = 1, 2, 3) performance curves for ReLU, GeLU, Sigmoid, and Swish activations.]

  12. Conclusions
1. Additional Results:
