Spiked Models

Small-rank perturbation: $C_p = I_p + P$, with $P$ of low rank.

Figure: Eigenvalues of $\frac{1}{n} Y_p Y_p^T$ for $p = 500$ and $p/n \in \{1/4,\, 1/2,\, 1,\, 2\}$ (one panel per ratio), with $\mathrm{eig}(C_p) = \{\underbrace{1, \ldots, 1}_{p-4}, 2, 3, 4, 5\}$.
Spiked Models

Theorem (Eigenvalues [Baik, Silverstein '06])
Let $Y_p = C_p^{1/2} X_p$, with
◮ $X_p$ with i.i.d. zero-mean, unit-variance entries, $E[|X_{p,ij}|^4] < \infty$,
◮ $C_p = I_p + P$, $P = U \Omega U^*$, where, for $K$ fixed, $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_K) \in \mathbb{R}^{K \times K}$, with $\omega_1 \ge \ldots \ge \omega_K > 0$.
Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, denoting $\lambda_m = \lambda_m\left(\frac{1}{n} Y_p Y_p^*\right)$ ($\lambda_m > \lambda_{m+1}$),
$$ \lambda_m \xrightarrow{a.s.} \begin{cases} 1 + \omega_m + c\,\dfrac{1 + \omega_m}{\omega_m} \;>\; (1 + \sqrt{c})^2, & \omega_m > \sqrt{c}, \\[4pt] (1 + \sqrt{c})^2, & \omega_m \in (0, \sqrt{c}]. \end{cases} $$
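A minimal simulation sketch (not from the original slides; the choice $p = 500$, $n = 2000$ and the rank-one spike $u = e_1$ are illustrative assumptions) to check the eigenvalue phase transition numerically:

```python
# Empirical check of the BBP-type phase transition for C_p = I_p + omega * u u^T.
import numpy as np

rng = np.random.default_rng(0)
p, n = 500, 2000                     # c = p/n = 1/4, sqrt(c) = 0.5
c = p / n

for omega in [0.3, 1.0, 3.0]:        # spikes below and above sqrt(c)
    u = np.zeros(p); u[0] = 1.0
    C = np.eye(p) + omega * np.outer(u, u)
    Y = np.linalg.cholesky(C) @ rng.standard_normal((p, n))   # Y = C^{1/2} X
    lam_max = np.linalg.eigvalsh(Y @ Y.T / n)[-1]             # largest sample eigenvalue
    limit = (1 + omega + c * (1 + omega) / omega if omega > np.sqrt(c)
             else (1 + np.sqrt(c)) ** 2)
    print(f"omega={omega}: largest eigenvalue {lam_max:.3f}, predicted limit {limit:.3f}")
```

For $\omega < \sqrt{c}$ the largest eigenvalue should stick to the bulk edge $(1+\sqrt{c})^2$, while above the threshold it isolates near the predicted value.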
Spiked Models

Theorem (Eigenvectors [Paul '07])
Let $Y_p = C_p^{1/2} X_p$, with
◮ $X_p$ with i.i.d. zero-mean, unit-variance entries, $E[|X_{p,ij}|^4] < \infty$,
◮ $C_p = I_p + P$, $P = U \Omega U^* = \sum_{i=1}^{K} \omega_i u_i u_i^*$, with $\omega_1 > \ldots > \omega_K > 0$.
Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, for $a, b \in \mathbb{C}^p$ deterministic and $\hat{u}_i$ the eigenvector associated with $\lambda_i\left(\frac{1}{n} Y_p Y_p^*\right)$,
$$ a^* \hat{u}_i \hat{u}_i^* b \;-\; \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}}\; a^* u_i u_i^* b \cdot 1_{\omega_i > \sqrt{c}} \;\xrightarrow{a.s.}\; 0. $$
In particular,
$$ |\hat{u}_i^* u_i|^2 \;\xrightarrow{a.s.}\; \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}} \cdot 1_{\omega_i > \sqrt{c}}. $$
Spiked Models

Figure: Simulated versus limiting $|\hat{u}_1^T u_1|^2$ for $Y_p = C_p^{1/2} X_p$, $C_p = I_p + \omega_1 u_1 u_1^T$, $p/n = 1/3$, varying the population spike $\omega_1$; simulations for $p = 100, 200, 400$ against the limit $\frac{1 - c/\omega_1^2}{1 + c/\omega_1}$.
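The figure above can be reproduced with a short simulation; the sketch below (illustrative, with $p = 400$ and a diagonal square root of $C_p$) compares the empirical alignment with the limiting formula:

```python
# Empirical |u_hat_1^T u_1|^2 versus the limit (1 - c/omega^2)/(1 + c/omega).
import numpy as np

rng = np.random.default_rng(1)
p, n = 400, 1200                         # c = p/n = 1/3, as in the figure
c = p / n
u1 = np.zeros(p); u1[0] = 1.0

for omega in [0.4, 1.0, 2.0, 4.0]:
    C_sqrt = np.eye(p)
    C_sqrt[0, 0] = np.sqrt(1 + omega)    # C = I + omega * u1 u1^T is diagonal here
    Y = C_sqrt @ rng.standard_normal((p, n))
    u_hat = np.linalg.eigh(Y @ Y.T / n)[1][:, -1]   # eigenvector of the top eigenvalue
    align = (u_hat @ u1) ** 2
    limit = (1 - c / omega**2) / (1 + c / omega) if omega > np.sqrt(c) else 0.0
    print(f"omega={omega}: |u_hat^T u1|^2 = {align:.3f}, limit = {limit:.3f}")
```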
Other Spiked Models

Similar results hold for several other matrix models:
◮ $Y_p = \frac{1}{n} (I + P)^{1/2} X_p X_p^* (I + P)^{1/2}$
◮ $Y_p = \frac{1}{n} X_p X_p^* + P$
◮ $Y_p = \frac{1}{n} X_p^* (I + P) X_p$
◮ $Y_p = \frac{1}{n} (X_p + P)^* (X_p + P)$
◮ etc.
Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models
Application to Machine Learning
Takeaway Message 1 “RMT Explains Why Machine Learning Intuitions Collapse in Large Dimensions”
The curse of dimensionality and its consequences

Clustering setting in (not so) large $n, p$:
◮ GMM setting: $x_1^{(a)}, \ldots, x_{n_a}^{(a)} \sim \mathcal{N}(\mu_a, C_a)$, $a = 1, \ldots, k$.
◮ Non-trivial task:
$$ \|\mu_a - \mu_b\| = O(1), \quad \mathrm{tr}(C_a - C_b) = O(\sqrt{p}), \quad \mathrm{tr}[(C_a - C_b)^2] = O(p). $$

Classical method: spectral clustering
◮ Extract and cluster the dominant eigenvectors of
$$ K = \{\kappa(x_i, x_j)\}_{i,j=1}^{n}, \qquad \kappa(x_i, x_j) = f\left(\frac{1}{p}\|x_i - x_j\|^2\right). $$
◮ Why? Finite-dimensional intuition (a code sketch follows below).
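A minimal kernel spectral clustering sketch (not from the slides; the parameters, the Gaussian bandwidth and the use of the second dominant eigenvector are illustrative assumptions that may need tuning):

```python
# Kernel spectral clustering of a two-class GMM with the Gaussian kernel f(t) = exp(-t/2).
import numpy as np

rng = np.random.default_rng(2)
p, n_a = 100, 100
mu = np.zeros(p); mu[0] = 2.0
X = np.vstack([rng.standard_normal((n_a, p)) + mu,
               rng.standard_normal((n_a, p)) - mu])   # rows x_i, classes +/- mu
n = 2 * n_a

G = X @ X.T
d = np.diag(G)
t = (d[:, None] + d[None, :] - 2 * G) / p             # t_ij = ||x_i - x_j||^2 / p
K = np.exp(-t / 2)

v2 = np.linalg.eigh(K)[1][:, -2]                      # second dominant eigenvector
labels = (v2 > np.median(v2)).astype(float)
truth = np.r_[np.zeros(n_a), np.ones(n_a)]
acc = max((labels == truth).mean(), 1 - (labels == truth).mean())
print(f"accuracy from the second eigenvector of K: {acc:.2f}")
```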
The curse of dimensionality and its consequences (2)

In reality, here is what happens...

Figure: Kernel $K_{ij} = \exp(-\frac{1}{2p}\|x_i - x_j\|^2)$ and its second eigenvector $v_2$, for $x_i \sim \mathcal{N}(\pm\mu, I_p)$, $\mu = (2, 0, \ldots, 0)^T \in \mathbb{R}^p$.

Key observation: under growth-rate assumptions,
$$ \max_{1 \le i \ne j \le n} \left| \frac{1}{p}\|x_i - x_j\|^2 - \tau \right| \xrightarrow{a.s.} 0, \qquad \tau = \frac{2}{p} \sum_{a=1}^{k} \frac{n_a}{n}\, \mathrm{tr}\, C_a. $$
◮ This suggests $K \simeq f(\tau) 1_n 1_n^T$! (A numerical check follows below.)
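A quick numerical check of the distance concentration (a sketch, not from the slides; the model $x_i \sim \mathcal{N}(\pm\mu, I_p)$ with $\mu = (2, 0, \ldots, 0)^T$ and the sizes are illustrative assumptions):

```python
# max_{i != j} | ||x_i - x_j||^2 / p - tau | shrinks as p grows.
import numpy as np

rng = np.random.default_rng(3)
for p in [50, 200, 800]:
    n = 2 * p
    mu = np.zeros(p); mu[0] = 2.0
    X = np.vstack([rng.standard_normal((n // 2, p)) + mu,
                   rng.standard_normal((n // 2, p)) - mu])
    G = X @ X.T
    d = np.diag(G)
    t = (d[:, None] + d[None, :] - 2 * G) / p
    tau = 2.0                            # (2/p) * sum_a (n_a/n) tr C_a, with C_a = I_p here
    off = t[~np.eye(n, dtype=bool)]      # off-diagonal pairs only
    print(f"p={p}: max_(i!=j) |t_ij - tau| = {np.abs(off - tau).max():.3f}")
```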
The curse of dimensionality and its consequences (3)

(Major) consequences:
◮ Most machine learning intuitions collapse.
◮ But luckily, concentration of distances allows for Taylor expansion, linearization...

Theorem ([C-Benaych '16] Asymptotic Kernel Behavior)
Under growth-rate assumptions, as $p, n \to \infty$,
$$ \left\| K - \hat{K} \right\| \xrightarrow{a.s.} 0, \qquad \hat{K} \simeq \underbrace{f(\tau)\, 1_n 1_n^T}_{O_{\|\cdot\|}(n)} + \frac{1}{p} Z Z^T + J A J^T + *, $$
with $J = [j_1, \ldots, j_k] \in \mathbb{R}^{n \times k}$, $j_a = (0, 1_{n_a}, 0)^T$ (the clusters!) and $A \in \mathbb{R}^{k \times k}$ a function of:
◮ $f(\tau)$, $f'(\tau)$, $f''(\tau)$,
◮ $\|\mu_a - \mu_b\|$, $\mathrm{tr}(C_a - C_b)$, $\mathrm{tr}((C_a - C_b)^2)$, for $a, b \in \{1, \ldots, k\}$.
➫ This is a spiked model! We can study it fully!
Theoretical Findings versus MNIST

Figure: Eigenvalues of $K$ (red) and of the equivalent-Gaussian-model $\hat{K}$, "as if Gaussian" (white), MNIST data, $p = 784$, $n = 192$.
Theoretical Findings versus MNIST

Figure: Leading four eigenvectors of $K$ for MNIST data (red) and theoretical findings (blue).
Theoretical Findings versus MNIST

Figure: 2D representation of the eigenvectors of $K$ for the MNIST dataset (left: eigenvector 2 versus eigenvector 1; right: eigenvector 3 versus eigenvector 2). Theoretical means and 1- and 2-standard deviations in blue. Class 1 in red, Class 2 in black, Class 3 in green.
Takeaway Message 2 “RMT Reassesses and Improves Data Processing”
Improving Kernel Spectral Clustering

Thanks to [C-Benaych '16], it is possible to improve kernels:
◮ by "focusing" kernels on the most discriminative statistics: tune $f'(\tau)$, $f''(\tau)$;
◮ by "killing" non-discriminative feature directions.

Example: covariance-based discrimination, kernel $f(t) = \exp(-\frac{1}{2}t)$ versus $f(t) = (t - \tau)^2$ (think about the surprising kernel shape!). A comparison sketch follows below.
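A minimal comparison sketch (not from the slides): two zero-mean classes differing only through their covariances, clustered with $f(t) = \exp(-t/2)$ versus $f(t) = (t - \tau)^2$. The covariance model, the empirical estimate of $\tau$, and the eigenvector/median decision rule are all illustrative assumptions; which kernel wins, and by how much, depends on the class statistics.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n_a = 200, 200
n = 2 * n_a
C2_sqrt = np.sqrt(1 + 0.5 * (-1.0) ** np.arange(p))       # C_2 diagonal, tr(C_2 - I_p) = 0
X = np.vstack([rng.standard_normal((n_a, p)),              # class 1: N(0, I_p)
               rng.standard_normal((n_a, p)) * C2_sqrt])   # class 2: N(0, C_2)

G = X @ X.T
d = np.diag(G)
t = (d[:, None] + d[None, :] - 2 * G) / p
tau = np.sum(t[~np.eye(n, dtype=bool)]) / (n * (n - 1))    # empirical proxy for tau

truth = np.r_[np.zeros(n_a), np.ones(n_a)]
for name, K in [("exp(-t/2)", np.exp(-t / 2)), ("(t - tau)^2", (t - tau) ** 2)]:
    vecs = np.linalg.eigh(K)[1]
    best = 0.0
    for v in (vecs[:, -1], vecs[:, -2]):       # informative direction may be 1st or 2nd
        labels = (v > np.median(v)).astype(float)
        best = max(best, (labels == truth).mean(), 1 - (labels == truth).mean())
    print(f"f(t) = {name}: best accuracy over the top two eigenvectors = {best:.2f}")
```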
Another, more striking, example: Semi-Supervised Learning

Semi-supervised learning: a great idea that never worked!
◮ Setting: assume now
  ◮ $x_1^{(a)}, \ldots, x_{n_{a,[l]}}^{(a)}$ already labelled (few),
  ◮ $x_{n_{a,[l]}+1}^{(a)}, \ldots, x_{n_a}^{(a)}$ unlabelled (a lot).
◮ Machine learning original idea: find "scores" $F_{ia}$ for $x_i$ to belong to class $a$:
$$ F = \underset{F \in \mathbb{R}^{n \times k}}{\mathrm{argmin}} \sum_{a=1}^{k} \sum_{i,j} K_{ij} \left( F_{ia} D_{ii}^{\alpha} - F_{ja} D_{jj}^{\alpha} \right)^2, \qquad F_{ia}^{[l]} = \delta_{\{x_i \in \mathcal{C}_a\}}. $$
◮ Explicit solution (implemented in the sketch below):
$$ F_{[u]} = \left( I_{n_{[u]}} - D_{[u]}^{-1-\alpha} K_{[uu]} D_{[u]}^{\alpha} \right)^{-1} D_{[u]}^{-1-\alpha} K_{[ul]} D_{[l]}^{\alpha} F_{[l]}, $$
where $D = \mathrm{diag}(K 1_n)$ (degree matrix) and $[ul]$, $[uu]$, ... denote the blocks of labelled/unlabelled data.
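A direct implementation sketch of the explicit solution above (the Gaussian data, the balanced labelled sets, the Gaussian kernel and $\alpha = -1$ are illustrative assumptions; the plain argmax decision works here because the labelled classes are balanced):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n_a, n_lab = 100, 100, 10            # n_lab labelled points per class
mu = np.zeros(p); mu[0] = 2.0
X = np.vstack([rng.standard_normal((n_a, p)) + mu,
               rng.standard_normal((n_a, p)) - mu])
n, k, alpha = 2 * n_a, 2, -1.0

lab = np.r_[np.arange(n_lab), n_a + np.arange(n_lab)]   # labelled indices, both classes
unl = np.setdiff1d(np.arange(n), lab)
idx = np.r_[lab, unl]                                    # labelled block first
X = X[idx]
y = np.r_[np.zeros(n_a, dtype=int), np.ones(n_a, dtype=int)][idx]
nl = len(lab)

G = X @ X.T
d = np.diag(G)
K = np.exp(-(d[:, None] + d[None, :] - 2 * G) / (2 * p))   # Gaussian kernel
D = K.sum(axis=1)                                           # degrees, D = diag(K 1_n)

F_l = np.eye(k)[y[:nl]]                                     # one-hot labelled scores F_[l]
Dl, Du = D[:nl], D[nl:]
A = (Du ** (-1 - alpha))[:, None] * K[nl:, nl:] * (Du ** alpha)[None, :]
B = (Du ** (-1 - alpha))[:, None] * K[nl:, :nl] * (Dl ** alpha)[None, :]
F_u = np.linalg.solve(np.eye(n - nl) - A, B @ F_l)          # explicit SSL solution

pred = F_u.argmax(axis=1)
print(f"accuracy on unlabelled points: {(pred == y[nl:]).mean():.2f}")
```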
The finite-dimensional case: What we expect

Figure: Outcome $F$ of the Laplacian algorithm ($\alpha = -1$) for $\mathcal{N}(\pm\mu, I_p)$ with $p = 1$; columns $[F]_{\cdot,1}$ (scores for $\mathcal{C}_1$) and $[F]_{\cdot,2}$ (scores for $\mathcal{C}_2$) shown over the labelled and unlabelled points of both classes.
The reality: What we see!

Figure: Outcome $F$ of the Laplacian algorithm ($\alpha = -1$) for $\mathcal{N}(\pm\mu, I_p)$ with $p = 80$; columns $[F]_{\cdot,1}$ (scores for $\mathcal{C}_1$) and $[F]_{\cdot,2}$ (scores for $\mathcal{C}_2$) shown over the labelled and unlabelled points of both classes.
The reality: What we see! (on MNIST)

Figure: Vectors $[F^{(u)}]_{\cdot,a}$, $a = 1, 2, 3$ (zeros, ones, twos), for 3-class MNIST data, $n = 192$, $p = 784$, $n_l/n = 1/16$, Gaussian kernel.
Exploiting RMT to resurrect SSL

Consequences of the finite-dimensional "mismatch":
◮ A priori, the algorithm should not work.
◮ Indeed, "in general" it does not!
◮ But, luckily, after some (not clearly motivated) renormalization (e.g., $\alpha = -1$, $F_{i\cdot} \leftarrow F_{i\cdot}/n_{[l],i}$), it works again...
◮ BUT it does not use the unlabelled data efficiently!

Chapelle, Schölkopf, Zien, "Semi-Supervised Learning", Chapter 4, 2009:
"Our concern is this: it is frequently the case that we would be better off just discarding the unlabeled data and employing a supervised method, rather than taking a semi-supervised route. Thus we worry about the embarrassing situation where the addition of unlabeled data degrades the performance of a classifier."
Asymptotic Performance Analysis

Theorem ([Mai, C '18] Asymptotic Performance of SSL)
For $x_i \in \mathcal{C}_b$ unlabelled, the score vector $F_{i,\cdot} \in \mathbb{R}^k$ satisfies
$$ F_{i,\cdot} - G_b \to 0, \qquad G_b \sim \mathcal{N}(m_b, \Sigma_b), $$
with $m_b \in \mathbb{R}^k$, $\Sigma_b \in \mathbb{R}^{k \times k}$ functions of
◮ $f(\tau)$, $f'(\tau)$, $f''(\tau)$, $\mu_1, \ldots, \mu_k$, $C_1, \ldots, C_k$,
◮ and, among the sample sizes, only $n_l$.

Figure: Accuracy of Laplacian regularization as a function of $n_{[u]}/p$ with $n_{[l]}/p = 2$, $c_1 = c_2$, $p = 100$, $-\mu_1 = \mu_2 = [1; 0_{p-1}]$, $[C]_{i,j} = 0.1^{|i-j|}$. Graph constructed with $K_{ij} = e^{-\|x_i - x_j\|^2/p}$.