Random Matrix Advances in Machine Learning

(Imaging and Machine Learning) Mathematics Workshop #3, Institut Henri Poincaré
Romain COUILLET, CentraleSupélec, L2S, University of Paris-Saclay, France
GSTATS IDEX DataScience Chair, GIPSA-lab


Spiked Models (Basics of Random Matrix Theory)

Small-rank perturbation: C_p = I_p + P, with P of low rank.

Figure: Eigenvalues of (1/n) Y_p Y_p^T for p/n = 1 (p = 500), eig(C_p) = {1, ..., 1, 2, 3, 4, 5}, the eigenvalue 1 having multiplicity p − 4.

Figure: Eigenvalues of (1/n) Y_p Y_p^T for p/n = 2 (p = 500), eig(C_p) = {1, ..., 1, 2, 3, 4, 5}, the eigenvalue 1 having multiplicity p − 4.

Spiked Models (Basics of Random Matrix Theory)

Theorem (Eigenvalues) [Baik, Silverstein '06]. Let Y_p = C_p^{1/2} X_p, with
- X_p having i.i.d. zero-mean, unit-variance entries with E[|X_p|_{ij}^4] < ∞;
- C_p = I_p + P, P = U Ω U*, where, for K fixed, Ω = diag(ω_1, ..., ω_K) ∈ R^{K×K}, with ω_1 ≥ ... ≥ ω_K > 0.

Then, as p, n → ∞ with p/n → c ∈ (0, ∞), denoting λ_m = λ_m((1/n) Y_p Y_p^*) (λ_m > λ_{m+1}),

  λ_m →(a.s.) 1 + ω_m + c (1 + ω_m)/ω_m > (1 + √c)²,  if ω_m > √c,
  λ_m →(a.s.) (1 + √c)²,                              if ω_m ∈ (0, √c].
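As a quick numerical check of this phase transition (a minimal sketch, not from the slides; the dimensions and spike values are arbitrary choices), one can sample such a spiked model and compare the isolated sample eigenvalues with the predicted limits 1 + ω_m + c(1 + ω_m)/ω_m and the bulk edge (1 + √c)²:

```python
import numpy as np

# Illustrative sketch (not from the slides): spiked covariance C = I_p + sum_i w_i u_i u_i^T
p, n = 500, 1000                          # so c = p/n = 1/2
c = p / n
omegas = np.array([5.0, 4.0, 3.0, 0.3])   # spikes above and below the threshold sqrt(c)

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((p, len(omegas))))  # orthonormal spike directions
C = np.eye(p) + (U * omegas) @ U.T

# Y = C^{1/2} X with i.i.d. standard entries; sample covariance (1/n) Y Y^T
C_half = np.linalg.cholesky(C)
X = rng.standard_normal((p, n))
Y = C_half @ X
sample_eigs = np.linalg.eigvalsh(Y @ Y.T / n)[::-1]          # descending order

bulk_edge = (1 + np.sqrt(c)) ** 2
for m, w in enumerate(np.sort(omegas)[::-1]):
    predicted = 1 + w + c * (1 + w) / w if w > np.sqrt(c) else bulk_edge
    print(f"omega={w:4.1f}  lambda_{m+1}={sample_eigs[m]:6.3f}  predicted={predicted:6.3f}")
print(f"bulk edge (1+sqrt(c))^2 = {bulk_edge:.3f}")
```

The sub-critical spike (ω = 0.3 < √c) produces no isolated eigenvalue: the corresponding sample eigenvalue sits at the bulk edge, as the theorem predicts.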

Spiked Models (Basics of Random Matrix Theory)

Theorem (Eigenvectors) [Paul '07]. Let Y_p = C_p^{1/2} X_p, with
- X_p having i.i.d. zero-mean, unit-variance entries with E[|X_p|_{ij}^4] < ∞;
- C_p = I_p + P, P = U Ω U* = Σ_{i=1}^K ω_i u_i u_i*, with ω_1 > ... > ω_K > 0.

Then, as p, n → ∞ with p/n → c ∈ (0, ∞), for a, b ∈ C^p deterministic and û_i the eigenvector associated with λ_i((1/n) Y_p Y_p^*),

  a* û_i û_i* b − (1 − c ω_i^{-2})/(1 + c ω_i^{-1}) · a* u_i u_i* b · 1_{ω_i > √c} →(a.s.) 0.

In particular,

  |û_i* u_i|² →(a.s.) (1 − c ω_i^{-2})/(1 + c ω_i^{-1}) · 1_{ω_i > √c}.

Spiked Models (Basics of Random Matrix Theory)

Figure: Simulated versus limiting |û_1^T u_1|² for Y_p = C_p^{1/2} X_p, C_p = I_p + ω_1 u_1 u_1^T, p/n = 1/3, as a function of the population spike ω_1; empirical curves for p = 100, 200, 400 against the limit (1 − c/ω_1²)/(1 + c/ω_1).
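A small simulation in the spirit of this figure (an illustrative sketch with arbitrary parameter choices, not the code behind the slide) compares the empirical alignment |û_1^T u_1|² with the limit (1 − c/ω_1²)/(1 + c/ω_1) · 1_{ω_1 > √c}:

```python
import numpy as np

# Illustrative sketch: alignment of the top sample eigenvector with the population spike
p, c = 400, 1.0 / 3.0
n = int(p / c)
rng = np.random.default_rng(1)
u1 = np.zeros(p); u1[0] = 1.0                      # population spike direction

for omega1 in [0.25, 0.5, 1.0, 2.0, 4.0]:
    # Y = C^{1/2} X with C = I + omega1 * u1 u1^T; here C^{1/2} scales the first row by sqrt(1+omega1)
    X = rng.standard_normal((p, n))
    Y = X.copy()
    Y[0, :] *= np.sqrt(1.0 + omega1)
    _, eigvecs = np.linalg.eigh(Y @ Y.T / n)
    u1_hat = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
    empirical = (u1_hat @ u1) ** 2
    limit = (1 - c / omega1**2) / (1 + c / omega1) if omega1 > np.sqrt(c) else 0.0
    print(f"omega1={omega1:4.2f}  |u1_hat.u1|^2={empirical:5.3f}  limit={limit:5.3f}")
```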

Other Spiked Models (Basics of Random Matrix Theory)

Similar results hold for multiple matrix models:
- Y_p = (1/n) (I + P)^{1/2} X_p X_p^* (I + P)^{1/2}
- Y_p = (1/n) X_p X_p^* + P
- Y_p = (1/n) X_p^* (I + P) X_p
- Y_p = (1/n) (X_p + P)^* (X_p + P)
- etc.

Outline

- Basics of Random Matrix Theory
  - Motivation: Large Sample Covariance Matrices
  - Spiked Models
- Application to Machine Learning

An adventurous venue (Application to Machine Learning)

Machine Learning is not "Simple Linear Statistics":
- data are data... and are not easily modeled;
- machine learning algorithms involve non-linear functions, difficult to analyze;
- recent trends go towards highly complex, computer-science-oriented methods: deep neural nets.

What can we say about those?
- Much more than we think, and actually much more than has been said so far!
- Key observation 1: In "non-trivial" (not so) large-dimensional settings, machine learning intuitions break down!
- Key observation 2: In these "non-trivial" settings, RMT explains a lot of things and can improve algorithms!
- Key observation 3: Universality goes a long way: RMT findings are compliant with real-data observations!

Takeaway Message 1: "RMT Explains Why Machine Learning Intuitions Collapse in Large Dimensions"

The curse of dimensionality and its consequences (Application to Machine Learning)

Clustering setting in (not so) large n, p:
- GMM setting: x_1^(a), ..., x_{n_a}^(a) ~ N(μ_a, C_a), a = 1, ..., k.
- Non-trivial task: ‖μ_a − μ_b‖ = O(1), tr(C_a − C_b) = O(√p), tr[(C_a − C_b)²] = O(p)
  (non-trivial because otherwise too easy or too hard).

Classical method: spectral clustering.
- Extract and cluster the dominant eigenvectors of K = {κ(x_i, x_j)}_{i,j=1}^n, with κ(x_i, x_j) = f((1/p)‖x_i − x_j‖²).
- Why? Finite-dimensional intuition.

The curse of dimensionality and its consequences (2) (Application to Machine Learning)

In reality, here is what happens...

Figure: Kernel K_ij = exp(−(1/(2p))‖x_i − x_j‖²) and second eigenvector v_2, for x_i ~ N(±μ, I_p), μ = (2, 0, ..., 0)^T ∈ R^p.

Key observation: Under growth-rate assumptions,

  max_{1 ≤ i ≠ j ≤ n} | (1/p)‖x_i − x_j‖² − τ | →(a.s.) 0,   where τ = (2/p) Σ_{a=1}^k (n_a/n) tr C_a.

- This suggests K ≃ f(τ) 1_n 1_n^T!
- More importantly, in non-trivial settings, data are neither close nor far!
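This concentration is easy to observe numerically (an illustrative sketch reproducing the two-class setting of the figure; the sample sizes and seed are arbitrary):

```python
import numpy as np

# Illustrative sketch: in the non-trivial regime, all pairwise distances
# (1/p)||x_i - x_j||^2 concentrate around a common value tau.
p, n = 400, 1000
rng = np.random.default_rng(2)
mu = np.zeros(p); mu[0] = 2.0
labels = rng.integers(0, 2, size=n)                        # two classes N(+mu, I_p), N(-mu, I_p)
X = rng.standard_normal((n, p)) + np.where(labels[:, None] == 0, mu, -mu)

norms = (X ** 2).sum(axis=1)
sq_dists = (norms[:, None] + norms[None, :] - 2 * X @ X.T) / p
np.fill_diagonal(sq_dists, 0.0)

tau = 2.0                                                  # (2/p) * sum_a (n_a/n) tr(C_a) with C_a = I_p
off_diag = sq_dists[~np.eye(n, dtype=bool)]
print("max |(1/p)||x_i - x_j||^2 - tau| =", np.max(np.abs(off_diag - tau)))  # small compared to tau

# Hence the Gaussian kernel K is spectrally dominated by the rank-one matrix f(tau) 1_n 1_n^T
K = np.exp(-sq_dists / 2.0)
rank_one = np.exp(-tau / 2.0) * np.ones((n, n))
rel = np.linalg.norm(K - rank_one, 2) / np.linalg.norm(K, 2)
print("||K - f(tau) 1 1^T|| / ||K|| =", rel)               # much smaller than 1
```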

The curse of dimensionality and its consequences (3) (Application to Machine Learning)

(Major) consequences:
- Most machine learning intuitions collapse.
- But luckily, concentration of distances allows for Taylor expansion, linearization...
- This is where RMT kicks back in!

Theorem ([C-Benaych'16], Asymptotic Kernel Behavior). Under growth-rate assumptions, as p, n → ∞,

  ‖K − K̂‖ →(a.s.) 0,   where K̂ = (1/p) Z Z^T + J A J^T + ∗,

with J = [j_1, ..., j_k] ∈ R^{n×k}, j_a = (0, 1_{n_a}, 0)^T (the clusters!) and A ∈ R^{k×k} a function of:
- f(τ), f'(τ), f''(τ);
- ‖μ_a − μ_b‖, tr(C_a − C_b), tr((C_a − C_b)²), for a, b ∈ {1, ..., k}.

➫ This is a spiked model! We can study it fully!
➫ RMT can explain tools ML engineers use every day.
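Since the informative part of K is the low-rank term J A J^T, the dominant non-trivial eigenvectors of K align with the class-indicator vectors and can be clustered directly. The sketch below (illustrative only; it builds the plain Gaussian kernel, not the exact equivalent matrix K̂ of the theorem) continues the same two-class mixture:

```python
import numpy as np

# Illustrative sketch: the top eigenvector of K is essentially the non-informative
# f(tau) 1_n direction, while the next one carries the class structure (the "spike").
p, n = 400, 1000
rng = np.random.default_rng(3)
mu = np.zeros(p); mu[0] = 2.0
labels = rng.integers(0, 2, size=n)
X = rng.standard_normal((n, p)) + np.where(labels[:, None] == 0, mu, -mu)

norms = (X ** 2).sum(axis=1)
K = np.exp(-((norms[:, None] + norms[None, :] - 2 * X @ X.T) / p) / 2.0)

eigvals, eigvecs = np.linalg.eigh(K)            # ascending order
v1, v2 = eigvecs[:, -1], eigvecs[:, -2]         # v1 ~ 1_n / sqrt(n); v2 carries the classes

j = np.where(labels == 0, 1.0, -1.0)            # centered class indicator
print("|<v1, 1/sqrt(n)>| =", abs(v1 @ (np.ones(n) / np.sqrt(n))))
print("|corr(v2, class indicator)| =", abs(np.corrcoef(v2, j)[0, 1]))

pred = (v2 > np.median(v2)).astype(int)         # 1-D spectral clustering on v2
acc = max(np.mean(pred == labels), 1 - np.mean(pred == labels))
print("clustering accuracy:", acc)
```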

Theoretical Findings versus MNIST (Application to Machine Learning)

Figure: Eigenvalues of K (red) and of the equivalent-Gaussian-model K̂ (white), MNIST data, p = 784, n = 192.

Theoretical Findings versus MNIST (Application to Machine Learning)

Figure: Leading four eigenvectors of K for MNIST data (red) and theoretical findings (blue).

Theoretical Findings versus MNIST (Application to Machine Learning)

Figure: 2D representation of the eigenvectors of K (Eigenvector 2 versus Eigenvector 1; Eigenvector 3 versus Eigenvector 2) for the MNIST dataset. Theoretical means and 1- and 2-standard deviations in blue; Class 1 in red, Class 2 in black, Class 3 in green.

Takeaway Message 2: "RMT Reassesses and Improves Data Processing"

Improving Kernel Spectral Clustering (Application to Machine Learning)

Thanks to [C-Benaych'16], it is possible to improve kernels:
- by "focusing" kernels on the most discriminative statistics: tune f'(τ), f''(τ);
- by "killing" non-discriminative feature directions.

Example: covariance-based discrimination, kernel f(t) = exp(−t/2) versus f(t) = (t − τ)² (think about the surprising kernel shape!).
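A hedged illustration of this example (the two-class covariance model and all parameters below are my own choices, not the slide's exact setting): the kernel f(t) = (t − τ)² cancels f'(τ), which the analysis above suggests is helpful when only the covariances discriminate the classes; the sketch simply compares the two kernels on the same data:

```python
import numpy as np

# Hedged comparison (arbitrary model): two classes with identical means and identical
# covariance trace but different covariance "shapes", clustered from the second
# dominant eigenvector of a kernel matrix f((1/p)||x_i - x_j||^2).
def second_eigvec_accuracy(K, labels):
    _, vecs = np.linalg.eigh(K)
    v2 = vecs[:, -2]                                   # first non-trivial eigenvector
    pred = (v2 > np.median(v2)).astype(int)
    return max(np.mean(pred == labels), 1 - np.mean(pred == labels))

p, n = 400, 1000
rng = np.random.default_rng(4)
labels = rng.integers(0, 2, size=n)
scales = np.ones(p)
scales[: p // 2], scales[p // 2 :] = np.sqrt(1.5), np.sqrt(0.5)   # tr(C_1) = tr(C_0) = p
Z = rng.standard_normal((n, p))
X = np.where(labels[:, None] == 1, Z * scales, Z)      # class 0: C = I_p; class 1: C = diag(scales^2)

norms = (X ** 2).sum(axis=1)
D = (norms[:, None] + norms[None, :] - 2 * X @ X.T) / p
tau = np.mean(D[~np.eye(n, dtype=bool)])               # empirical proxy for tau

# f(t) = exp(-t/2) has f'(tau) != 0 (it keeps the (1/p)ZZ^T noise term);
# f(t) = (t - tau)^2 has f'(tau) = 0 and relies on f''(tau) only.
print("Gaussian kernel accuracy   :", second_eigvec_accuracy(np.exp(-D / 2.0), labels))
print("(t - tau)^2 kernel accuracy:", second_eigvec_accuracy((D - tau) ** 2, labels))
```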

Another, more striking, example: Semi-supervised Learning (Application to Machine Learning)

Semi-supervised learning: a great idea that never worked!

- Setting: assume now
  - x_1^(a), ..., x_{n_a,[l]}^(a) already labelled (few),
  - x_{n_a,[l]+1}^(a), ..., x_{n_a}^(a) unlabelled (a lot).

- Machine Learning original idea: find "scores" F_ia for x_i to belong to class a,

    F = argmin_{F ∈ R^{n×k}} Σ_{a=1}^k Σ_{i,j} K_ij (F_ia − F_ja)²,   subject to F_ia^[l] = δ_{x_i ∈ C_a}.

- Explicit solution:

    F_[u] = (I_{n_[u]} − D_[u]^{-1} K_[uu])^{-1} D_[u]^{-1} K_[ul] F_[l],

  where D = diag(K 1_n) (the degree matrix) and [ul], [uu], ... denote the blocks of labeled/unlabeled data.
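A direct implementation of this explicit solution (a minimal sketch on a balanced two-class Gaussian mixture with a Gaussian kernel; all parameters are illustrative) also reproduces the pathology shown on the next slides: the raw scores F_[u] come out nearly constant, while the centered scores F_ia − (1/k) Σ_b F_ib still carry the class structure:

```python
import numpy as np

# Minimal sketch of F_[u] = (I - D_u^{-1} K_[uu])^{-1} D_u^{-1} K_[ul] F_[l]
# on a balanced two-class Gaussian mixture (illustrative parameters).
p, n = 400, 1000
n_lab_per_class = 32
rng = np.random.default_rng(5)
mu = np.zeros(p); mu[0] = 2.0
labels = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, p)) + np.where(labels[:, None] == 0, mu, -mu)

norms = (X ** 2).sum(axis=1)
K = np.exp(-((norms[:, None] + norms[None, :] - 2 * X @ X.T) / p) / 2.0)

lab = np.r_[np.arange(n_lab_per_class), n // 2 + np.arange(n_lab_per_class)]
unlab = np.setdiff1d(np.arange(n), lab)
F_l = np.eye(2)[labels[lab]]                    # one-hot labelled scores F_[l]

d_u = K[unlab].sum(axis=1)                      # D = diag(K 1_n), restricted to unlabelled rows
A = np.eye(len(unlab)) - K[np.ix_(unlab, unlab)] / d_u[:, None]
b = K[np.ix_(unlab, lab)] @ F_l / d_u[:, None]
F_u = np.linalg.solve(A, b)                     # unlabelled scores F_[u]

print("raw score range per column:", F_u.min(axis=0), F_u.max(axis=0))   # nearly constant
F_centered = F_u - F_u.mean(axis=1, keepdims=True)                        # F_ia - (1/k) sum_b F_ib
acc = np.mean(F_centered.argmax(axis=1) == labels[unlab])
print("accuracy from centered scores:", acc)
```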

The finite-dimensional intuition: What we expect (Application to Machine Learning)

(Illustrative figures only.)

The reality: What we see! (Application to Machine Learning)

Setting: p = 400, n = 1000, x_i ~ N(±μ, I_p), kernel K_ij = exp(−(1/(2p))‖x_i − x_j‖²).
Display: scores F_ik (left) and F°_ik = F_ik − (1/2)(F_i1 + F_i2) (right).

➫ Scores are almost all identical... and do not follow the labelled data!

MNIST Data Example (Application to Machine Learning)

Figure: Vectors [F(u)]_{·,a}, a = 1, 2, 3 (zeros, ones, twos), for 3-class MNIST data, n = 192, p = 784, n_l/n = 1/16, Gaussian kernel.

Exploiting RMT to resurrect SSL (Application to Machine Learning)

Consequences of the finite-dimensional "mismatch":
- A priori, the algorithm should not work.
- Indeed, "in general" it does not!
- But, luckily, after some (not clearly motivated) renormalization, it works again...
- BUT it does not use the unlabelled data efficiently!

Chapelle, Schölkopf, Zien, "Semi-Supervised Learning", Chapter 4, 2009:
"Our concern is this: it is frequently the case that we would be better off just discarding the unlabeled data and employing a supervised method, rather than taking a semi-supervised route. Thus we worry about the embarrassing situation where the addition of unlabeled data degrades the performance of a classifier."

What RMT can do about it:
- Asymptotic performance analysis: clear understanding of what we see!
