1. Random Matrix Theory Proves that Deep Learning Representations of GAN-data Behave as Gaussian Mixtures (ICML 2020)
M.E.A. Seddik¹²*, C. Louart¹³, M. Tamaazousti¹, R. Couillet²³
¹ CEA List, France; ² CentraleSupélec, L2S, France; ³ GIPSA Lab, Grenoble-Alpes University, France
* http://melaseddik.github.io/
June 8, 2020

2. Abstract

Context:
◮ Study of large Gram matrices of concentrated data.

Motivation:
◮ Gram matrices are at the core of various ML algorithms.
◮ RMT predicts their performance under Gaussian assumptions on the data.
◮ BUT real data are unlikely to be close to Gaussian vectors.

Results:
◮ GAN data (≈ real data) fall within the class of concentrated vectors.
◮ Universality result: only the first- and second-order statistics of concentrated data matter to describe the behavior of their Gram matrices.

3. Concentrated Vectors / Notion of Concentrated Vectors

Definition (Concentrated Vectors). Given a normed space $(E, \|\cdot\|_E)$ and $q \in \mathbb{R}$, a random vector $Z \in E$ is $q$-exponentially concentrated if for any 1-Lipschitz¹ function $\mathcal{F} : E \to \mathbb{R}$, there exist $C, c > 0$ such that
$$\forall t > 0, \quad \mathbb{P}\{ |\mathcal{F}(Z) - \mathbb{E}\,\mathcal{F}(Z)| \geq t \} \leq C e^{-(t/c)^q}, \quad \text{denoted } Z \in \mathcal{E}_q(c).$$
If $c$ is independent of $\dim(E)$, we write $Z \in \mathcal{E}_q(1)$.

Concentrated vectors enjoy:
(P1) If $X \sim \mathcal{N}(0, I_p)$ then $X \in \mathcal{E}_2(1)$: "Gaussian vectors are concentrated vectors."
(P2) If $X \in \mathcal{E}_q(1)$ and $\mathcal{G}$ is a $\lambda_{\mathcal{G}}$-Lipschitz map, then $\mathcal{G}(X) \in \mathcal{E}_q(\lambda_{\mathcal{G}})$: "Concentrated vectors are stable through Lipschitz maps."

¹ Reminder: $\mathcal{F} : E \to F$ is $\lambda_{\mathcal{F}}$-Lipschitz if $\forall (x, y) \in E^2$: $\|\mathcal{F}(x) - \mathcal{F}(y)\|_F \leq \lambda_{\mathcal{F}} \|x - y\|_E$.
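As a sanity check on (P1), here is a minimal numpy sketch (not from the slides; the statistic $\mathcal{F}(x) = \|x\|$ is an arbitrary 1-Lipschitz choice) showing that the fluctuations of a 1-Lipschitz function of a standard Gaussian vector stay of order one as the dimension $p$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# F(x) = ||x|| is 1-Lipschitz (reverse triangle inequality), so by (P1) its
# fluctuations around its mean should stay O(1), whatever the dimension p.
for p in [10, 100, 1000, 10000]:
    samples = np.array([np.linalg.norm(rng.standard_normal(p)) for _ in range(2000)])
    print(f"p={p:6d}  mean~sqrt(p)={samples.mean():8.2f}  std={samples.std():.3f}")
```

The mean grows like $\sqrt{p}$ while the standard deviation remains roughly constant, which is the dimension-free concentration the definition captures.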

4. GAN Data: An Example of Concentrated Vectors / Why Concentrated Vectors?

Figure: Images artificially generated using the BigGAN model [Brock et al., ICLR'19].

$$\text{Real Data} \approx \text{GAN Data} = \underbrace{\mathcal{F}_L \circ \mathcal{F}_{L-1} \circ \cdots \circ \mathcal{F}_1}_{\mathcal{G}}(\text{Gaussian})$$

where the $\mathcal{F}_i$'s correspond to fully connected layers, convolutional layers, sub-sampling, pooling and activation functions, residual connections or batch normalisation.
⇒ The $\mathcal{F}_i$'s are essentially Lipschitz operations.

5. GAN Data: An Example of Concentrated Vectors / Why Concentrated Vectors?

◮ Fully connected layers and convolutional layers are affine operations:
$$\mathcal{F}_i(x) = W_i x + b_i, \qquad \|\mathcal{F}_i\|_{\mathrm{lip}} = \sup_{u \neq 0} \frac{\|W_i u\|_p}{\|u\|_p}, \text{ for any } p\text{-norm}.$$
◮ Pooling layers and activation functions are 1-Lipschitz operations with respect to any $p$-norm (e.g., ReLU and max-pooling).
◮ Residual connections:
$$\mathcal{F}_i(x) = x + \mathcal{F}_i^{(\ell)} \circ \cdots \circ \mathcal{F}_i^{(1)}(x)$$
where the $\mathcal{F}_i^{(j)}$'s are Lipschitz operations, thus $\mathcal{F}_i$ is a Lipschitz operation with Lipschitz constant bounded by $1 + \prod_{j=1}^{\ell} \|\mathcal{F}_i^{(j)}\|_{\mathrm{lip}}$.
◮ ...

By (P1) and (P2):
⇒ GAN data are concentrated vectors by design.
Remark: We still need to control $\lambda_{\mathcal{G}}$.
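To make the Lipschitz-composition argument concrete, here is a small numpy sketch (a hypothetical three-layer ReLU generator, not any architecture from the slides) bounding $\lambda_{\mathcal{G}}$ by the product of the layers' spectral norms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical three-layer generator: affine maps (biases omitted) + ReLU.
layer_dims = [(64, 128), (128, 256), (256, 512)]      # (d_in, d_out) per layer
weights = [rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
           for d_in, d_out in layer_dims]

def generator(z):
    x = z
    for W in weights:
        x = np.maximum(W @ x, 0.0)                     # affine map + ReLU (1-Lipschitz)
    return x

# Lipschitz constants multiply under composition, so the product of the layers'
# largest singular values upper-bounds lambda_G for this generator.
lambda_G_bound = np.prod([np.linalg.norm(W, 2) for W in weights])
print("sample output dim:", generator(rng.standard_normal(64)).shape)
print("upper bound on lambda_G:", lambda_G_bound)
```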

6. GAN Data: An Example of Concentrated Vectors / Control of $\lambda_{\mathcal{G}}$ with Spectral Normalization

Let $\sigma_* > 0$ and let $\mathcal{G}$ be a neural network composed of $N$ affine layers, each of input dimension $d_{i-1}$ and output dimension $d_i$ for $i \in [N]$, with 1-Lipschitz activation functions. Consider the following dynamics with learning rate $\eta$:
$$W \leftarrow W - \eta E, \quad \text{with } E_{i,j} \sim \mathcal{N}(0, 1)$$
$$W \leftarrow W - \max(0, \sigma_1(W) - \sigma_*)\, u_1(W)\, v_1(W)^\intercal.$$
The Lipschitz constant of $\mathcal{G}$ is bounded at convergence with high probability as:
$$\lambda_{\mathcal{G}} \leq \varepsilon + \prod_{i=1}^{N} \sqrt{\sigma_*^2 + \eta^2 d_i d_{i-1}}.$$

Figure: Largest singular value $\sigma_1$ across training iterations, without spectral normalization and with spectral normalization at $\sigma_* \in \{2, 3, 4\}$, together with the theoretical bound. Parameters $N = 1$, $d_0 = d_1 = 100$ and $\eta = 1/d_0$.
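Below is a minimal numpy sketch (not the authors' code; toy dimensions chosen to match the figure) of the two-step dynamics on this slide: a noisy weight update followed by clipping the largest singular value back toward $\sigma_*$.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_clip(W, sigma_star):
    """One spectral-normalization step: shrink the largest singular value of W
    toward sigma_star by subtracting the excess along the top singular pair."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    excess = max(0.0, s[0] - sigma_star)
    return W - excess * np.outer(U[:, 0], Vt[0, :])

# Toy dynamics mimicking the slide: noisy "gradient" step, then spectral clipping.
d0 = d1 = 100
eta, sigma_star = 1.0 / d0, 2.0
W = rng.standard_normal((d1, d0)) / np.sqrt(d0)
for _ in range(1000):
    W = W - eta * rng.standard_normal((d1, d0))      # W <- W - eta * E
    W = spectral_clip(W, sigma_star)                 # W <- W - max(0, s1 - s*) u1 v1^T
print("largest singular value after training:",
      np.linalg.svd(W, compute_uv=False)[0])
```

At convergence the largest singular value hovers slightly above $\sigma_*$, consistent with the bound stated above.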

7. GAN Data: An Example of Concentrated Vectors / Model & Assumptions

(A1) Data matrix (distributed in $k$ classes $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_k$):
$$X = \big[\underbrace{x_1, \ldots, x_{n_1}}_{\in\, \mathcal{E}_{q_1}(1)}, \underbrace{x_{n_1+1}, \ldots, x_{n_2}}_{\in\, \mathcal{E}_{q_2}(1)}, \ldots, \underbrace{x_{n-n_k+1}, \ldots, x_n}_{\in\, \mathcal{E}_{q_k}(1)}\big] \in \mathbb{R}^{p \times n}$$
Model statistics: $\mu_\ell = \mathbb{E}_{x_i \in \mathcal{C}_\ell}[x_i]$, $\quad C_\ell = \mathbb{E}_{x_i \in \mathcal{C}_\ell}[x_i x_i^\intercal]$.

(A2) Growth rate assumptions: as $p \to \infty$,
1. $p/n \to c \in (0, \infty)$.
2. The number of classes $k$ is bounded.
3. For any $\ell \in [k]$, $\|\mu_\ell\| = O(\sqrt{p})$.

Gram matrix and its resolvent:
$$G = \frac{1}{p} X^\intercal X, \qquad Q(z) = (G + z I_n)^{-1},$$
$$m_L(z) = \frac{1}{n} \operatorname{tr}(Q(-z)), \qquad U U^\intercal = -\frac{1}{2\pi i} \oint_\gamma Q(-z)\, dz.$$
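A quick numpy sketch (with toy Gaussian data standing in for actual concentrated representations) of the objects defined here: the Gram matrix $G$, its resolvent $Q(z)$, and the empirical Stieltjes transform $m_L(z)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for concentrated data: n samples of dimension p.
p, n = 200, 400
X = rng.standard_normal((p, n))

# Gram matrix and its resolvent, as defined on the slide.
G = X.T @ X / p                                # G = X^T X / p, of size n x n
def Q(z):
    return np.linalg.inv(G + z * np.eye(n))    # Q(z) = (G + z I_n)^{-1}

# Empirical Stieltjes transform m_L(z) = (1/n) tr Q(-z), evaluated off the real
# axis so that G - z I_n is safely invertible.
z = 1.5 + 0.1j
m_L = np.trace(Q(-z)) / n
print("m_L(z) =", m_L)
```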

8. Behavior of the Gram Matrix for Concentrated Vectors / Main Result

Theorem. Under Assumptions (A1) and (A2), we have $Q(z) \in \mathcal{E}_q(p^{-1/2})$. Furthermore,
$$\big\| \mathbb{E}[Q(z)] - \tilde{Q}(z) \big\| = O\!\left(\sqrt{\frac{\log p}{p}}\right), \quad \text{where} \quad \tilde{Q}(z) = \frac{1}{z} \Lambda(z) + \frac{1}{pz}\, J\, \Omega(z)\, J^\intercal$$
with $\Lambda(z) = \operatorname{diag}\Big\{\frac{\mathbb{1}_{n_\ell}}{1 + \delta_\ell(z)}\Big\}_{\ell=1}^{k}$ and $\Omega(z) = \operatorname{diag}\big\{\mu_\ell^\intercal \tilde{R}(z)\, \mu_\ell\big\}_{\ell=1}^{k}$,
$$\tilde{R}(z) = \left(\frac{1}{k} \sum_{\ell=1}^{k} \frac{C_\ell}{1 + \delta_\ell(z)} + z I_p\right)^{-1}$$
where $\delta(z) = [\delta_1(z), \ldots, \delta_k(z)]$ is the unique fixed point of the system of equations
$$\delta_\ell(z) = \operatorname{tr}\left(C_\ell \left(\frac{1}{k} \sum_{j=1}^{k} \frac{C_j}{1 + \delta_j(z)} + z I_p\right)^{-1}\right) \quad \text{for each } \ell \in [k].$$

9. Behavior of the Gram Matrix for Concentrated Vectors / Main Result

(Same theorem as on the previous slide.)

Key Observation: Only first- and second-order statistics matter!
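The deterministic equivalent above only requires solving the fixed-point system for $\delta(z)$. Below is a small numpy sketch (toy covariances, plain fixed-point iteration with arbitrary convergence settings) that iterates the system as reconstructed on the previous slide; it is an illustration, not the authors' implementation.

```python
import numpy as np

def fixed_point_delta(C_list, z, n_iter=200, tol=1e-10):
    """Iterate delta_l = tr(C_l ((1/k) sum_j C_j/(1+delta_j) + z I_p)^{-1})
    by simple fixed-point iteration; also return the matrix R_tilde(z)."""
    k = len(C_list)
    p = C_list[0].shape[0]
    delta = np.zeros(k)
    for _ in range(n_iter):
        R = np.linalg.inv(sum(C / (1 + d) for C, d in zip(C_list, delta)) / k
                          + z * np.eye(p))
        new_delta = np.array([np.trace(C @ R) for C in C_list])
        if np.max(np.abs(new_delta - delta)) < tol:
            delta = new_delta
            break
        delta = new_delta
    return delta, R

# Toy example: k = 2 classes with distinct (hypothetical) covariances.
rng = np.random.default_rng(0)
p = 50
A = rng.standard_normal((p, p))
C_list = [np.eye(p), A @ A.T / p + np.eye(p)]
delta, R_tilde = fixed_point_delta(C_list, z=1.0)
print("delta(z=1):", delta)
```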

10. Application to CNN Representations of GAN Images

Figure: Pipeline schematic (Generator → Discriminator → Real/Fake; Generator → Representation Network → Concentrated Vectors; both the generator and the representation network are Lipschitz operations).

◮ CNN representations correspond to the penultimate layer.
◮ Popular architectures considered in practice are: ResNet, VGG, DenseNet.
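For illustration, a minimal PyTorch/torchvision sketch (not the authors' pipeline; ResNet-50 and the weight tag are arbitrary choices among the architectures listed) of extracting penultimate-layer CNN representations:

```python
import torch
from torchvision import models

# Load an ImageNet-pretrained ResNet-50 (torchvision >= 0.13 weights API) and
# replace its final classification layer by the identity, so the forward pass
# returns the penultimate-layer representation.
net = models.resnet50(weights="IMAGENET1K_V1")
net.fc = torch.nn.Identity()
net.eval()

# Hypothetical batch of already-preprocessed 224x224 RGB images (GAN or real).
images = torch.randn(8, 3, 224, 224)
with torch.no_grad():
    reps = net(images)
print(reps.shape)    # (8, 2048): one 2048-dimensional representation per image
```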

11. Application to CNN Representations of GAN Images

Figure (GAN images vs. real images): k = 3 classes, n = 3000 images.

12. Application to CNN Representations of GAN Images

[Figure: GAN images vs. real images.]

13. Application to CNN Representations of GAN Images

[Figure: GAN images vs. real images.]

14. Application to CNN Representations of GAN Images

[Figure: GAN images vs. real images.]

15. Application to CNN Representations of GAN Images / Performance of a linear SVM classifier

[Figure: GAN images.]

16. Application to CNN Representations of GAN Images / Performance of a linear SVM classifier

[Figure: Real images.]
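For reference, a short scikit-learn sketch (toy two-class Gaussian-mixture features standing in for the CNN representations; hyperparameters arbitrary) of training and scoring a linear SVM on representation-like features:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in: in practice the features would be the CNN
# representations of GAN or real images, with labels given by the k classes.
p, n_per_class = 512, 1500
mu = rng.standard_normal(p) / np.sqrt(p)
X = np.vstack([rng.standard_normal((n_per_class, p)) - mu,
               rng.standard_normal((n_per_class, p)) + mu])
y = np.repeat([0, 1], n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LinearSVC(C=1.0, max_iter=10000).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```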

17. Application to CNN Representations of GAN Images / Take-away messages

◮ Concentrated vectors seem appropriate for realistic data modelling.
◮ Universality of linear classifiers regardless of the data distribution.
◮ RMT can anticipate the performance of standard classifiers on DL representations of GAN images.
◮ Universality supports the Gaussianity assumption on data representations as considered in the literature, e.g., the FID metric
$$d^2\big((\mu, C), (\mu_w, C_w)\big) = \|\mu - \mu_w\|^2 + \operatorname{tr}\!\left(C + C_w - 2\,(C C_w)^{1/2}\right).$$
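For completeness, a small numpy/scipy sketch of the Fréchet distance underlying the FID formula above (toy statistics; not tied to any particular representation network):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu, C, mu_w, C_w):
    """Squared Frechet distance between N(mu, C) and N(mu_w, C_w),
    i.e. the quantity in the FID formula on the slide."""
    covmean = sqrtm(C @ C_w).real      # discard tiny imaginary parts from sqrtm
    return float(np.sum((mu - mu_w) ** 2) + np.trace(C + C_w - 2.0 * covmean))

# Toy check with hypothetical statistics of two representation sets.
rng = np.random.default_rng(0)
p = 64
A, B = rng.standard_normal((p, p)), rng.standard_normal((p, p))
C, C_w = A @ A.T / p, B @ B.T / p
mu, mu_w = rng.standard_normal(p) / 10, rng.standard_normal(p) / 10
print("d^2 =", frechet_distance(mu, C, mu_w, C_w))
```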
