

  1. Mehler’s Formula, Branching Processes, and Compositional Kernels of Deep Neural Networks. Tengyuan Liang, Hai Tran-Bach. May 26, 2020.

  2. Motivations & Questions ◮ DNN and Kernels (Rahimi, Recht ’08; Belkin et al ’18; Jacot et al ’19) . What role do the activation functions play in the connections between DNN and Kernels? H. Tran-Bach Compositional Kernels of Deep Neural Networks 2

  3. Motivations & Questions ◮ DNN and Kernels (Rahimi, Recht ’08; Belkin et al ’18; Jacot et al ’19) . What role do the activation functions play in the connections between DNN and Kernels? ◮ Interpolation and Generalization (Zhang et al ’17; Belkin et al ’18; Liang, Rakhlin ’18) . How does the activation function interplay with depth, sample size, and input dimensionality in terms of memorization capacity? H. Tran-Bach Compositional Kernels of Deep Neural Networks 2

  4. Motivations & Questions ◮ DNN and Kernels (Rahimi, Recht ’08; Belkin et al ’18; Jacot et al ’19) . What role do the activation functions play in the connections between DNN and Kernels? ◮ Interpolation and Generalization (Zhang et al ’17; Belkin et al ’18; Liang, Rakhlin ’18) . How does the activation function interplay with depth, sample size, and input dimensionality in terms of memorization capacity? ◮ Is there hope to design activation functions such that we can ”compress” multiple layers? H. Tran-Bach Compositional Kernels of Deep Neural Networks 2

  3. Multi-Layer Perceptron with Random Weights (Neal ’96; Rahimi, Recht ’08; Daniely et al ’07)
Input: $x^{(0)} := x \in \mathbb{R}^d$
Hidden Layers: $x^{(\ell+1)} := \sigma\big(W^{(\ell)} x^{(\ell)} / \|x^{(\ell)}\|\big) \in \mathbb{R}^{d_{\ell+1}}$, for $0 \le \ell < L$
Random Weights: $W^{(\ell)} \in \mathbb{R}^{d_{\ell+1} \times d_\ell}$, $W^{(\ell)} \sim \mathcal{MN}(0, I_{d_{\ell+1}} \otimes I_{d_\ell})$
Regime: $d_1, \ldots, d_L \to \infty$
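As a concrete reference, here is a minimal NumPy sketch of this random-weight forward pass; the function name, the choice of ReLU, and the widths are illustrative and not taken from the paper.

    import numpy as np

    def random_weight_mlp(x, widths, sigma, rng):
        """Forward pass x^(l+1) = sigma(W^(l) x^(l) / ||x^(l)||) with iid N(0,1) weights."""
        h = np.asarray(x, dtype=float)
        for d_next in widths:                              # widths = [d_1, ..., d_L]
            W = rng.standard_normal((d_next, h.shape[0]))  # W^(l) ~ MN(0, I_{d_{l+1}} ⊗ I_{d_l})
            h = sigma(W @ (h / np.linalg.norm(h)))
        return h

    rng = np.random.default_rng(0)
    x = rng.standard_normal(32)                            # input in R^d with d = 32
    out = random_weight_mlp(x, [512, 512, 512], lambda z: np.maximum(z, 0.0), rng)
    print(out.shape)                                       # (512,)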

  4. Duality: Activation and Kernel
Activation (Hermite polynomials): $\sigma(x) = \sum_{k=0}^{\infty} \alpha_k h_k(x)$, with $\sum_{k=0}^{\infty} \alpha_k^2 = 1$.
[Figure: the Hermite polynomials $h_0, \ldots, h_5$ plotted on $[-2, 2]$.]
Dual Kernel: $K(x_i, x_j) := \mathbb{E}_{w \sim \mathcal{N}(0, I_d)}\big[\sigma(w^\top x_i/\|x_i\|)\,\sigma(w^\top x_j/\|x_j\|)\big] = \sum_{k=0}^{\infty} \alpha_k^2 \rho_{ij}^k =: G(\rho_{ij})$, where $\rho_{ij} := \langle x_i/\|x_i\|,\, x_j/\|x_j\| \rangle$.
Compositional Kernel: $K^{(L)}(x_i, x_j) = \underbrace{G \circ G \circ \cdots \circ G}_{L \text{ times}}(\rho_{ij}) =: G^{(L)}(\rho_{ij})$.
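A quick numerical sanity check of this duality, using a toy activation built from the first two orthonormal Hermite polynomials, so that $G(\rho) = \alpha_1^2 \rho + \alpha_2^2 \rho^2$ exactly; the coefficients and dimensions below are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 50
    a1, a2 = np.sqrt(0.7), np.sqrt(0.3)            # alpha_1^2 + alpha_2^2 = 1

    def sigma(x):
        # sigma = a1*h1 + a2*h2 with h1(x) = x and h2(x) = (x^2 - 1)/sqrt(2)
        return a1 * x + a2 * (x**2 - 1.0) / np.sqrt(2.0)

    xi, xj = rng.standard_normal(d), rng.standard_normal(d)
    xi, xj = xi / np.linalg.norm(xi), xj / np.linalg.norm(xj)
    rho = xi @ xj

    W = rng.standard_normal((200_000, d))          # Monte Carlo draws of w ~ N(0, I_d)
    K_mc = np.mean(sigma(W @ xi) * sigma(W @ xj))  # E_w[sigma(w.xi) sigma(w.xj)]
    K_series = a1**2 * rho + a2**2 * rho**2        # G(rho) = sum_k alpha_k^2 rho^k
    print(K_mc, K_series)                          # agree up to Monte Carlo error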

  5. Branching Process and Compositional Kernels
Offspring distribution: $Y$, with $P(Y = k) = \alpha_k^2$ and probability generating function (PGF) $G$.
Galton-Watson process: $Z^{(L)}$, with offspring distribution $Y$ and PGFs $G^{(L)}$.
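A small simulation illustrating this correspondence, with a hypothetical offspring law $(\alpha_0^2, \alpha_1^2, \alpha_2^2) = (0.2, 0.5, 0.3)$: the empirical generating function of $Z^{(L)}$ should match the $L$-fold composition $G^{(L)}$.

    import numpy as np

    rng = np.random.default_rng(1)
    pmf = np.array([0.2, 0.5, 0.3])               # P(Y = k) = alpha_k^2 (illustrative values)

    def G(s):                                     # PGF of the offspring distribution Y
        return sum(p * s**k for k, p in enumerate(pmf))

    def sample_Z(L):                              # Galton-Watson population after L generations, Z^(0) = 1
        z = 1
        for _ in range(L):
            z = int(rng.choice(len(pmf), size=z, p=pmf).sum()) if z > 0 else 0
        return z

    L, s = 4, 0.7
    emp = np.mean([s ** sample_Z(L) for _ in range(50_000)])   # empirical E[s^{Z^(L)}]
    comp = s
    for _ in range(L):
        comp = G(comp)                            # G^(L)(s) = G ∘ ... ∘ G (L times)
    print(emp, comp)                              # the two agree up to simulation error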

  6. Rescaled Limit: Phase Transition
Theorem (Liang, Tran-Bach ’20). Define $\mu := \sum_{k \ge 0} k\,\alpha_k^2$ and $\mu_\star := \sum_{k > 2} \alpha_k^2\, k \log k$. Then, for all $t > 0$:
1. If $\mu \le 1$: $\lim_{L \to \infty} K^{(L)}(e^{-t}) = 1$ if $\alpha_1 \ne 1$, and $= e^{-t}$ if $\alpha_1 = 1$.
2. If $\mu > 1$: $\lim_{L \to \infty} K^{(L)}(e^{-t/\mu^L}) = \xi + (1-\xi)\,\mathbb{E}[e^{-tW}]$ if $\mu_\star < \infty$, and $= 0$ if $\mu_\star = \infty$.
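A toy illustration (not from the slides): take an activation whose only nonzero Hermite coefficients are $\alpha_0$ and $\alpha_2$, so $G(\rho) = \alpha_0^2 + \alpha_2^2 \rho^2$ and $\mu = 2\alpha_2^2$. If $\alpha_2^2 \le 1/2$ then $\mu \le 1$ and the iterates $G^{(L)}(e^{-t})$ increase to $1$, matching case 1 (here $\alpha_1 = 0 \ne 1$). If $\alpha_2^2 > 1/2$ then $\mu > 1$ and the unscaled iterates instead converge to the smallest fixed point $\xi = \alpha_0^2/\alpha_2^2$, the extinction probability of the associated Galton-Watson process, so it is the rescaled iterates $G^{(L)}(e^{-t/\mu^L})$ that retain a nondegenerate limit, as in case 2.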

  7. Kernel Limits. Example: centered ReLU
[Figure: unscaled limit $K^{(L)}(t)$ for the centered ReLU, $L = 1, 3, 5, 7$, plotted over $t \in [-1, 1]$, with the value $1/\mu^L$ marked in each panel.]
[Figure: rescaled limit $K^{(L)}(e^{-t/\mu^L})$ for $L = 1, 3, 5, 7$, plotted over $t \in [0, 1]$.]
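To get a feel for these curves numerically, one can iterate the centered ReLU’s dual kernel. The sketch below assumes the standard arc-cosine identity $\mathbb{E}[\mathrm{relu}(u)\,\mathrm{relu}(v)] = (\sqrt{1-\rho^2} + \rho(\pi - \arccos\rho))/(2\pi)$ for standard normals with correlation $\rho$, then centers and variance-normalizes; the slides’ exact normalization of “centered ReLU” may differ.

    import numpy as np

    m2 = 1.0 / (2.0 * np.pi)                      # E[relu(Z)]^2
    var = 0.5 - m2                                # Var[relu(Z)]

    def G(rho):
        raw = (np.sqrt(1.0 - rho**2) + rho * (np.pi - np.arccos(rho))) / (2.0 * np.pi)
        return (raw - m2) / var                   # centered and normalized: G(1) = 1, G(0) = 0

    mu = 0.5 / var                                # G'(1) = sum_k k*alpha_k^2, roughly 1.47 here
    print("mu ≈", round(mu, 3))

    rho_grid = np.linspace(-1.0, 1.0, 9)          # unscaled iterates K^(L)(rho)
    t_grid = np.linspace(0.0, 1.0, 5)             # rescaled iterates K^(L)(exp(-t/mu^L))
    for L in (1, 3, 5, 7):
        u, r = rho_grid.copy(), np.exp(-t_grid / mu**L)
        for _ in range(L):
            u, r = G(u), G(r)
        print(f"L={L}", np.round(u, 3), np.round(r, 3))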

  8. Memorization Capacity
◮ “small correlation” ($\sup_{ij} |\rho_{ij}| \approx 0$): $x_1, \ldots, x_n \overset{\mathrm{iid}}{\sim} \mathrm{Unif}(S^{d-1})$ and $\log(n)/d \to 0$.
◮ “large correlation” ($\sup_{ij} |\rho_{ij}| \approx 1$): $x_1, \ldots, x_n$ a maximal packing of $S^{d-1}$ and $\log(n)/d \to \infty$.
[Figure: schematic point configurations on the sphere for the small-correlation and large-correlation regimes.]
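A quick numerical check of the small-correlation regime; the values of $n$ and $d$ are illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 2000, 4000                              # log(n)/d ≈ 0.002
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # x_1, ..., x_n iid uniform on S^{d-1}
    R = X @ X.T
    np.fill_diagonal(R, 0.0)
    print(np.abs(R).max())                         # sup_ij |rho_ij|: on the order of sqrt(log(n)/d), well below 1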

  9. Memorization Capacity Theorem
Theorem (Liang & Tran-Bach ’20). It suffices to take
$L \gtrsim \dfrac{\log(n\kappa^{-1}) + \log\frac{\log n}{d}}{\log \alpha_1^{-2}}$ (small correlation),
$L \gtrsim \dfrac{\exp\!\big(\frac{2\log n}{d}\big)\,\log(n\kappa^{-1})}{\mu - 1}$ (large correlation),
to memorize the data in the sense that $1 - \kappa \le \lambda_i \le 1 + \kappa$, where the $\lambda_i$ are the eigenvalues of $K := \{K(x_i, x_j)\}_{ij}$.
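A numerical illustration of the small-correlation statement, using a hypothetical centered activation with $\alpha_1^2 = \alpha_2^2 = 1/2$, so $G(\rho) = \tfrac{1}{2}\rho + \tfrac{1}{2}\rho^2$; the sizes $n$ and $d$ are illustrative.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 500, 1000
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    K = X @ X.T                                   # rho_ij off the diagonal, exactly 1 on the diagonal

    def G(R):
        return 0.5 * R + 0.5 * R**2               # dual kernel applied entrywise; G(1) = 1 keeps the diagonal at 1

    for L in range(1, 9):
        K = G(K)                                  # K^(L) = G^(L)(rho_ij) entrywise
        lam = np.linalg.eigvalsh(K)
        print(f"L={L}: eigenvalues in [{lam.min():.3f}, {lam.max():.3f}]")
    # The spectrum tightens around 1 as L grows, i.e. 1 - kappa <= lambda_i <= 1 + kappa with shrinking kappa.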

  10. New Random Features Algorithm

    Kernels                                   Activation    Sampling
    shift-invariant (Rahimi, Recht ’08)       cos, sin      ≈
    inner-product (Liang, Tran-Bach ’20)      ≈             Gaussian
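The slide does not spell out the sampling scheme, but given the dual-kernel definition $K(x_i, x_j) = \mathbb{E}_{w \sim \mathcal{N}(0, I_d)}[\sigma(w^\top x_i/\|x_i\|)\,\sigma(w^\top x_j/\|x_j\|)]$, a natural Gaussian-sampled random-features map is sketched below; the normalized ReLU and the sizes are placeholders, not the paper’s designed activation.

    import numpy as np

    rng = np.random.default_rng(4)

    def random_features(X, m, sigma, rng):
        """phi(x) = sigma(W x / ||x||) / sqrt(m), W with iid N(0,1) entries, so phi(x_i)·phi(x_j) ≈ K(x_i, x_j)."""
        W = rng.standard_normal((m, X.shape[1]))
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        return sigma(Xn @ W.T) / np.sqrt(m)

    n, d, m = 200, 30, 20_000
    X = rng.standard_normal((n, d))
    sigma = lambda z: np.sqrt(2.0) * np.maximum(z, 0.0)   # normalized ReLU, so E[sigma(Z)^2] = 1
    Phi = random_features(X, m, sigma, rng)
    K_hat = Phi @ Phi.T                                   # Monte Carlo estimate of the dual-kernel Gram matrix
    print(K_hat.shape, np.round(np.diag(K_hat)[:3], 3))   # diagonal entries concentrate near G(1) = 1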

  11. Experiment: MNIST & CIFAR10

    Activation    ReLU    GeLU    Sigmoid    Swish
    µ             0.95    1.08    0.15       1.07
    α_1²          0.50    0.59    0.15       0.80

[Figure: MNIST (L = 1, 2, 3) and CIFAR10 (L = 1, 2, 3) performance curves for ReLU, GeLU, Sigmoid, and Swish activations.]

  12. Conclusions
1. Additional Results:
