Spiked Models

Small-rank perturbation: $C_p = I_p + P$, with $P$ of low rank.

Figure: Eigenvalues of $\frac{1}{n} Y_p Y_p^T$ for $p = 500$ and $p/n \in \{1/4,\, 1/2,\, 1,\, 2\}$ (one panel per ratio), with $\mathrm{eig}(C_p) = \{\underbrace{1, \ldots, 1}_{p-4}, 2, 3, 4, 5\}$.
Spiked Models

Theorem (Eigenvalues [Baik, Silverstein '06])
Let $Y_p = C_p^{1/2} X_p$, with
◮ $X_p$ with i.i.d. zero-mean, unit-variance entries, $E[|X_{p,ij}|^4] < \infty$,
◮ $C_p = I_p + P$, $P = U \Omega U^*$, where, for $K$ fixed, $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_K) \in \mathbb{R}^{K \times K}$, with $\omega_1 \ge \ldots \ge \omega_K > 0$.
Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, denoting $\lambda_m = \lambda_m\left(\frac{1}{n} Y_p Y_p^*\right)$ ($\lambda_m > \lambda_{m+1}$),
$$ \lambda_m \xrightarrow{a.s.} \begin{cases} 1 + \omega_m + c\,\dfrac{1 + \omega_m}{\omega_m} \;>\; (1 + \sqrt{c})^2, & \omega_m > \sqrt{c}, \\[4pt] (1 + \sqrt{c})^2, & \omega_m \in (0, \sqrt{c}]. \end{cases} $$
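A minimal simulation sketch (not from the original slides; the choice $p = 500$, $n = 2000$ and the rank-one spike $u = e_1$ are illustrative assumptions) to check the eigenvalue phase transition numerically:

```python
# Empirical check of the BBP-type phase transition for C_p = I_p + omega * u u^T.
import numpy as np

rng = np.random.default_rng(0)
p, n = 500, 2000                     # c = p/n = 1/4, sqrt(c) = 0.5
c = p / n

for omega in [0.3, 1.0, 3.0]:        # spikes below and above sqrt(c)
    u = np.zeros(p); u[0] = 1.0
    C = np.eye(p) + omega * np.outer(u, u)
    Y = np.linalg.cholesky(C) @ rng.standard_normal((p, n))   # Y = C^{1/2} X
    lam_max = np.linalg.eigvalsh(Y @ Y.T / n)[-1]             # largest sample eigenvalue
    limit = (1 + omega + c * (1 + omega) / omega if omega > np.sqrt(c)
             else (1 + np.sqrt(c)) ** 2)
    print(f"omega={omega}: largest eigenvalue {lam_max:.3f}, predicted limit {limit:.3f}")
```

For $\omega < \sqrt{c}$ the largest eigenvalue should stick to the bulk edge $(1+\sqrt{c})^2$, while above the threshold it isolates near the predicted value.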
Spiked Models

Theorem (Eigenvectors [Paul '07])
Let $Y_p = C_p^{1/2} X_p$, with
◮ $X_p$ with i.i.d. zero-mean, unit-variance entries, $E[|X_{p,ij}|^4] < \infty$,
◮ $C_p = I_p + P$, $P = U \Omega U^* = \sum_{i=1}^{K} \omega_i u_i u_i^*$, with $\omega_1 > \ldots > \omega_K > 0$.
Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, for $a, b \in \mathbb{C}^p$ deterministic and $\hat{u}_i$ the eigenvector associated with $\lambda_i\left(\frac{1}{n} Y_p Y_p^*\right)$,
$$ a^* \hat{u}_i \hat{u}_i^* b \;-\; \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}}\; a^* u_i u_i^* b \cdot 1_{\omega_i > \sqrt{c}} \;\xrightarrow{a.s.}\; 0. $$
In particular,
$$ |\hat{u}_i^* u_i|^2 \;\xrightarrow{a.s.}\; \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}} \cdot 1_{\omega_i > \sqrt{c}}. $$
Spiked Models

Figure: Simulated versus limiting $|\hat{u}_1^T u_1|^2$ for $Y_p = C_p^{1/2} X_p$, $C_p = I_p + \omega_1 u_1 u_1^T$, $p/n = 1/3$, varying the population spike $\omega_1$; simulations for $p = 100, 200, 400$ against the limit $\frac{1 - c/\omega_1^2}{1 + c/\omega_1}$.
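The figure above can be reproduced with a short simulation; the sketch below (illustrative, with $p = 400$ and a diagonal square root of $C_p$) compares the empirical alignment with the limiting formula:

```python
# Empirical |u_hat_1^T u_1|^2 versus the limit (1 - c/omega^2)/(1 + c/omega).
import numpy as np

rng = np.random.default_rng(1)
p, n = 400, 1200                         # c = p/n = 1/3, as in the figure
c = p / n
u1 = np.zeros(p); u1[0] = 1.0

for omega in [0.4, 1.0, 2.0, 4.0]:
    C_sqrt = np.eye(p)
    C_sqrt[0, 0] = np.sqrt(1 + omega)    # C = I + omega * u1 u1^T is diagonal here
    Y = C_sqrt @ rng.standard_normal((p, n))
    u_hat = np.linalg.eigh(Y @ Y.T / n)[1][:, -1]   # eigenvector of the top eigenvalue
    align = (u_hat @ u1) ** 2
    limit = (1 - c / omega**2) / (1 + c / omega) if omega > np.sqrt(c) else 0.0
    print(f"omega={omega}: |u_hat^T u1|^2 = {align:.3f}, limit = {limit:.3f}")
```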
Other Spiked Models

Similar results hold for several other matrix models:
◮ $Y_p = \frac{1}{n} (I + P)^{1/2} X_p X_p^* (I + P)^{1/2}$
◮ $Y_p = \frac{1}{n} X_p X_p^* + P$
◮ $Y_p = \frac{1}{n} X_p^* (I + P) X_p$
◮ $Y_p = \frac{1}{n} (X_p + P)^* (X_p + P)$
◮ etc.
Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models
Application to Machine Learning
Takeaway Message 1 “RMT Explains Why Machine Learning Intuitions Collapse in Large Dimensions”
The curse of dimensionality and its consequences

Clustering setting in (not so) large $n, p$:
◮ GMM setting: $x_1^{(a)}, \ldots, x_{n_a}^{(a)} \sim \mathcal{N}(\mu_a, C_a)$, $a = 1, \ldots, k$.
◮ Non-trivial task:
$$ \|\mu_a - \mu_b\| = O(1), \quad \mathrm{tr}(C_a - C_b) = O(\sqrt{p}), \quad \mathrm{tr}[(C_a - C_b)^2] = O(p). $$

Classical method: spectral clustering
◮ Extract and cluster the dominant eigenvectors of
$$ K = \{\kappa(x_i, x_j)\}_{i,j=1}^{n}, \qquad \kappa(x_i, x_j) = f\left(\frac{1}{p}\|x_i - x_j\|^2\right). $$
◮ Why? Finite-dimensional intuition (a code sketch follows below).
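A minimal kernel spectral clustering sketch (not from the slides; the parameters, the Gaussian bandwidth and the use of the second dominant eigenvector are illustrative assumptions that may need tuning):

```python
# Kernel spectral clustering of a two-class GMM with the Gaussian kernel f(t) = exp(-t/2).
import numpy as np

rng = np.random.default_rng(2)
p, n_a = 100, 100
mu = np.zeros(p); mu[0] = 2.0
X = np.vstack([rng.standard_normal((n_a, p)) + mu,
               rng.standard_normal((n_a, p)) - mu])   # rows x_i, classes +/- mu
n = 2 * n_a

G = X @ X.T
d = np.diag(G)
t = (d[:, None] + d[None, :] - 2 * G) / p             # t_ij = ||x_i - x_j||^2 / p
K = np.exp(-t / 2)

v2 = np.linalg.eigh(K)[1][:, -2]                      # second dominant eigenvector
labels = (v2 > np.median(v2)).astype(float)
truth = np.r_[np.zeros(n_a), np.ones(n_a)]
acc = max((labels == truth).mean(), 1 - (labels == truth).mean())
print(f"accuracy from the second eigenvector of K: {acc:.2f}")
```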
The curse of dimensionality and its consequences (2)

In reality, here is what happens...

Figure: Kernel $K_{ij} = \exp(-\frac{1}{2p}\|x_i - x_j\|^2)$ and its second eigenvector $v_2$, for $x_i \sim \mathcal{N}(\pm\mu, I_p)$, $\mu = (2, 0, \ldots, 0)^T \in \mathbb{R}^p$.

Key observation: under growth-rate assumptions,
$$ \max_{1 \le i \ne j \le n} \left| \frac{1}{p}\|x_i - x_j\|^2 - \tau \right| \xrightarrow{a.s.} 0, \qquad \tau = \frac{2}{p} \sum_{a=1}^{k} \frac{n_a}{n}\, \mathrm{tr}\, C_a. $$
◮ This suggests $K \simeq f(\tau) 1_n 1_n^T$! (A numerical check follows below.)
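A quick numerical check of the distance concentration (a sketch, not from the slides; the model $x_i \sim \mathcal{N}(\pm\mu, I_p)$ with $\mu = (2, 0, \ldots, 0)^T$ and the sizes are illustrative assumptions):

```python
# max_{i != j} | ||x_i - x_j||^2 / p - tau | shrinks as p grows.
import numpy as np

rng = np.random.default_rng(3)
for p in [50, 200, 800]:
    n = 2 * p
    mu = np.zeros(p); mu[0] = 2.0
    X = np.vstack([rng.standard_normal((n // 2, p)) + mu,
                   rng.standard_normal((n // 2, p)) - mu])
    G = X @ X.T
    d = np.diag(G)
    t = (d[:, None] + d[None, :] - 2 * G) / p
    tau = 2.0                            # (2/p) * sum_a (n_a/n) tr C_a, with C_a = I_p here
    off = t[~np.eye(n, dtype=bool)]      # off-diagonal pairs only
    print(f"p={p}: max_(i!=j) |t_ij - tau| = {np.abs(off - tau).max():.3f}")
```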
The curse of dimensionality and its consequences (3)

(Major) consequences:
◮ Most machine learning intuitions collapse.
◮ But luckily, concentration of distances allows for Taylor expansion, linearization...

Theorem ([C-Benaych '16] Asymptotic Kernel Behavior)
Under growth-rate assumptions, as $p, n \to \infty$,
$$ \left\| K - \hat{K} \right\| \xrightarrow{a.s.} 0, \qquad \hat{K} \simeq \underbrace{f(\tau)\, 1_n 1_n^T}_{O_{\|\cdot\|}(n)} + \frac{1}{p} Z Z^T + J A J^T + *, $$
with $J = [j_1, \ldots, j_k] \in \mathbb{R}^{n \times k}$, $j_a = (0, 1_{n_a}, 0)^T$ (the clusters!) and $A \in \mathbb{R}^{k \times k}$ a function of:
◮ $f(\tau)$, $f'(\tau)$, $f''(\tau)$,
◮ $\|\mu_a - \mu_b\|$, $\mathrm{tr}(C_a - C_b)$, $\mathrm{tr}((C_a - C_b)^2)$, for $a, b \in \{1, \ldots, k\}$.
➫ This is a spiked model! We can study it fully!
Theoretical Findings versus MNIST

Figure: Eigenvalues of $K$ (red) and of the equivalent-Gaussian-model $\hat{K}$, "as if Gaussian" (white), MNIST data, $p = 784$, $n = 192$.
Theoretical Findings versus MNIST

Figure: Leading four eigenvectors of $K$ for MNIST data (red) and theoretical findings (blue).
Theoretical Findings versus MNIST

Figure: 2D representation of the eigenvectors of $K$ for the MNIST dataset (left: eigenvector 2 versus eigenvector 1; right: eigenvector 3 versus eigenvector 2). Theoretical means and 1- and 2-standard deviations in blue. Class 1 in red, Class 2 in black, Class 3 in green.
Takeaway Message 2 “RMT Reassesses and Improves Data Processing”
Improving Kernel Spectral Clustering

Thanks to [C-Benaych '16], it is possible to improve kernels:
◮ by "focusing" kernels on the most discriminative statistics: tune $f'(\tau)$, $f''(\tau)$;
◮ by "killing" non-discriminative feature directions.

Example: covariance-based discrimination, kernel $f(t) = \exp(-\frac{1}{2}t)$ versus $f(t) = (t - \tau)^2$ (think about the surprising kernel shape!). A comparison sketch follows below.
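A minimal comparison sketch (not from the slides): two zero-mean classes differing only through their covariances, clustered with $f(t) = \exp(-t/2)$ versus $f(t) = (t - \tau)^2$. The covariance model, the empirical estimate of $\tau$, and the eigenvector/median decision rule are all illustrative assumptions; which kernel wins, and by how much, depends on the class statistics.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n_a = 200, 200
n = 2 * n_a
C2_sqrt = np.sqrt(1 + 0.5 * (-1.0) ** np.arange(p))       # C_2 diagonal, tr(C_2 - I_p) = 0
X = np.vstack([rng.standard_normal((n_a, p)),              # class 1: N(0, I_p)
               rng.standard_normal((n_a, p)) * C2_sqrt])   # class 2: N(0, C_2)

G = X @ X.T
d = np.diag(G)
t = (d[:, None] + d[None, :] - 2 * G) / p
tau = np.sum(t[~np.eye(n, dtype=bool)]) / (n * (n - 1))    # empirical proxy for tau

truth = np.r_[np.zeros(n_a), np.ones(n_a)]
for name, K in [("exp(-t/2)", np.exp(-t / 2)), ("(t - tau)^2", (t - tau) ** 2)]:
    vecs = np.linalg.eigh(K)[1]
    best = 0.0
    for v in (vecs[:, -1], vecs[:, -2]):       # informative direction may be 1st or 2nd
        labels = (v > np.median(v)).astype(float)
        best = max(best, (labels == truth).mean(), 1 - (labels == truth).mean())
    print(f"f(t) = {name}: best accuracy over the top two eigenvectors = {best:.2f}")
```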
Another, more striking, example: Semi-Supervised Learning

Semi-supervised learning: a great idea that never worked!
◮ Setting: assume now
  ◮ $x_1^{(a)}, \ldots, x_{n_{a,[l]}}^{(a)}$ already labelled (few),
  ◮ $x_{n_{a,[l]}+1}^{(a)}, \ldots, x_{n_a}^{(a)}$ unlabelled (a lot).
◮ Machine learning original idea: find "scores" $F_{ia}$ for $x_i$ to belong to class $a$:
$$ F = \underset{F \in \mathbb{R}^{n \times k}}{\mathrm{argmin}} \sum_{a=1}^{k} \sum_{i,j} K_{ij} \left( F_{ia} D_{ii}^{\alpha} - F_{ja} D_{jj}^{\alpha} \right)^2, \qquad F_{ia}^{[l]} = \delta_{\{x_i \in \mathcal{C}_a\}}. $$
◮ Explicit solution (implemented in the sketch below):
$$ F_{[u]} = \left( I_{n_{[u]}} - D_{[u]}^{-1-\alpha} K_{[uu]} D_{[u]}^{\alpha} \right)^{-1} D_{[u]}^{-1-\alpha} K_{[ul]} D_{[l]}^{\alpha} F_{[l]}, $$
where $D = \mathrm{diag}(K 1_n)$ (degree matrix) and $[ul]$, $[uu]$, ... denote the blocks of labelled/unlabelled data.
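A direct implementation sketch of the explicit solution above (the Gaussian data, the balanced labelled sets, the Gaussian kernel and $\alpha = -1$ are illustrative assumptions; the plain argmax decision works here because the labelled classes are balanced):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n_a, n_lab = 100, 100, 10            # n_lab labelled points per class
mu = np.zeros(p); mu[0] = 2.0
X = np.vstack([rng.standard_normal((n_a, p)) + mu,
               rng.standard_normal((n_a, p)) - mu])
n, k, alpha = 2 * n_a, 2, -1.0

lab = np.r_[np.arange(n_lab), n_a + np.arange(n_lab)]   # labelled indices, both classes
unl = np.setdiff1d(np.arange(n), lab)
idx = np.r_[lab, unl]                                    # labelled block first
X = X[idx]
y = np.r_[np.zeros(n_a, dtype=int), np.ones(n_a, dtype=int)][idx]
nl = len(lab)

G = X @ X.T
d = np.diag(G)
K = np.exp(-(d[:, None] + d[None, :] - 2 * G) / (2 * p))   # Gaussian kernel
D = K.sum(axis=1)                                           # degrees, D = diag(K 1_n)

F_l = np.eye(k)[y[:nl]]                                     # one-hot labelled scores F_[l]
Dl, Du = D[:nl], D[nl:]
A = (Du ** (-1 - alpha))[:, None] * K[nl:, nl:] * (Du ** alpha)[None, :]
B = (Du ** (-1 - alpha))[:, None] * K[nl:, :nl] * (Dl ** alpha)[None, :]
F_u = np.linalg.solve(np.eye(n - nl) - A, B @ F_l)          # explicit SSL solution

pred = F_u.argmax(axis=1)
print(f"accuracy on unlabelled points: {(pred == y[nl:]).mean():.2f}")
```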
The finite-dimensional case: What we expect

Figure: Outcome $F$ of the Laplacian algorithm ($\alpha = -1$) for $\mathcal{N}(\pm\mu, I_p)$ with $p = 1$; columns $[F]_{\cdot,1}$ (scores for $\mathcal{C}_1$) and $[F]_{\cdot,2}$ (scores for $\mathcal{C}_2$) shown over the labelled and unlabelled points of both classes.
The reality: What we see!

Figure: Outcome $F$ of the Laplacian algorithm ($\alpha = -1$) for $\mathcal{N}(\pm\mu, I_p)$ with $p = 80$; columns $[F]_{\cdot,1}$ (scores for $\mathcal{C}_1$) and $[F]_{\cdot,2}$ (scores for $\mathcal{C}_2$) shown over the labelled and unlabelled points of both classes.
The reality: What we see! (on MNIST)

Figure: Vectors $[F^{(u)}]_{\cdot,a}$, $a = 1, 2, 3$ (zeros, ones, twos), for 3-class MNIST data, $n = 192$, $p = 784$, $n_l/n = 1/16$, Gaussian kernel.
Exploiting RMT to resurrect SSL

Consequences of the finite-dimensional "mismatch":
◮ A priori, the algorithm should not work.
◮ Indeed, "in general" it does not!
◮ But, luckily, after some (not clearly motivated) renormalization (e.g., $\alpha = -1$, $F_{i\cdot} \leftarrow F_{i\cdot}/n_{[l],i}$), it works again...
◮ BUT it does not use the unlabelled data efficiently!

Chapelle, Schölkopf, Zien, "Semi-Supervised Learning", Chapter 4, 2009:
"Our concern is this: it is frequently the case that we would be better off just discarding the unlabeled data and employing a supervised method, rather than taking a semi-supervised route. Thus we worry about the embarrassing situation where the addition of unlabeled data degrades the performance of a classifier."
Asymptotic Performance Analysis

Theorem ([Mai, C '18] Asymptotic Performance of SSL)
For $x_i \in \mathcal{C}_b$ unlabelled, the score vector $F_{i,\cdot} \in \mathbb{R}^k$ satisfies
$$ F_{i,\cdot} - G_b \to 0, \qquad G_b \sim \mathcal{N}(m_b, \Sigma_b), $$
with $m_b \in \mathbb{R}^k$, $\Sigma_b \in \mathbb{R}^{k \times k}$ functions of
◮ $f(\tau)$, $f'(\tau)$, $f''(\tau)$, $\mu_1, \ldots, \mu_k$, $C_1, \ldots, C_k$,
◮ and, among the sample sizes, only $n_l$.

Figure: Accuracy of Laplacian regularization as a function of $n_{[u]}/p$ with $n_{[l]}/p = 2$, $c_1 = c_2$, $p = 100$, $-\mu_1 = \mu_2 = [1; 0_{p-1}]$, $[C]_{i,j} = 0.1^{|i-j|}$. Graph constructed with $K_{ij} = e^{-\|x_i - x_j\|^2/p}$.