  1. Machine learning on the symmetric group Jean-Philippe Vert

  2.–5. ML [image-only slides, no text content]

  6. What if inputs are permutations? A permutation is a bijection σ : [1, N] → [1, N], with σ(i) = rank of item i. Composition: (σ1 σ2)(i) = σ1(σ2(i)). S_N is the symmetric group, with |S_N| = N!
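
A minimal NumPy sketch (not from the slides) of this encoding: a permutation is stored as a 0-indexed array with sigma[i] = rank of item i, and composition is just array indexing.

```python
import numpy as np
from math import factorial

# A permutation sigma of [1, N], stored 0-indexed: sigma[i] = rank of item i.
sigma1 = np.array([2, 0, 3, 1])   # item 0 has rank 2, item 1 has rank 0, ...
sigma2 = np.array([0, 2, 3, 1])

# Composition (sigma1 sigma2)(i) = sigma1(sigma2(i)) is fancy indexing.
composed = sigma1[sigma2]
print(composed)                   # [2 3 1 0]

# |S_N| = N! permutations of N items.
print(factorial(len(sigma1)))     # 24
```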

  7. Examples: ranking data; ranks extracted from data (histogram equalization, quantile normalization, ...).

  8. Examples: batch effects, calibration of experimental measures.

  9. Learning from permutations. Assume your data are permutations and you want to learn f : S_N → R. One solution: embed S_N into a Euclidean (or Hilbert) space via Φ : S_N → R^p and learn a linear function f_β(σ) = β⊤ Φ(σ). The corresponding kernel is K(σ1, σ2) = Φ(σ1)⊤ Φ(σ2).

  10. How to define the embedding Φ : S_N → R^p? It should encode interesting features, lead to efficient algorithms, and be invariant to renaming of the items, i.e., the kernel should be right-invariant: ∀ σ1, σ2, π ∈ S_N, K(σ1 π, σ2 π) = K(σ1, σ2).

  11. Some attempts: SUQUAN (Le Morvan and Vert, 2017) and the Kendall embedding (Jiao and Vert, 2015, 2017, 2018).

  12. SUQUAN embedding (Le Morvan and Vert, 2017). Let Φ(σ) = Π_σ, the permutation representation (Serre, 1977): [Π_σ]_ij = 1 if σ(j) = i, 0 otherwise. It is right-invariant:
      ⟨Φ(σ), Φ(σ′)⟩ = Tr(Π_σ Π_σ′⊤) = Tr(Π_σ Π_σ′⁻¹) = Tr(Π_{σσ′⁻¹}),
  which depends on σ and σ′ only through σσ′⁻¹.
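
A small numerical check (not part of the slides) of this embedding and of its right-invariance; perm_matrix is a hypothetical helper and everything is 0-indexed.

```python
import numpy as np

def perm_matrix(sigma):
    # Permutation representation: [Pi_sigma]_{ij} = 1 if sigma(j) = i (0-indexed).
    n = len(sigma)
    P = np.zeros((n, n))
    P[sigma, np.arange(n)] = 1.0
    return P

rng = np.random.default_rng(0)
n = 6
s1, s2, pi = rng.permutation(n), rng.permutation(n), rng.permutation(n)

# Inner product of SUQUAN embeddings: <Phi(s1), Phi(s2)> = Tr(Pi_s1 Pi_s2^T).
k = np.trace(perm_matrix(s1) @ perm_matrix(s2).T)

# Right-invariance: relabeling the items (composing both with pi) leaves it unchanged.
k_relabeled = np.trace(perm_matrix(s1[pi]) @ perm_matrix(s2[pi]).T)
assert np.isclose(k, k_relabeled)
```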

  13. Link with quantile normalization (QN). Take σ(x) = rank(x) with x ∈ R^N and fix a target quantile f ∈ R^N. "Keep the order of x, change the values to f":
      [Ψ_f(x)]_i = f_σ(x)(i), i.e., Ψ_f(x) = Π_σ(x)⊤ f
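
As an illustration (not from the slides), quantile normalization toward a target f can be written in a few lines; rank_perm and quantile_normalize are hypothetical helpers using 0-indexed ranks.

```python
import numpy as np

def rank_perm(x):
    # sigma(x): rank of each coordinate of x, 0-indexed (no ties assumed).
    return np.argsort(np.argsort(x))

def quantile_normalize(x, f):
    # Psi_f(x): keep the order of x, replace its values by the target quantiles f.
    return np.sort(f)[rank_perm(x)]     # [Psi_f(x)]_i = f_{sigma(x)(i)}

x = np.array([3.2, -1.0, 0.5, 7.1])
f = np.linspace(0.0, 1.0, 4)            # e.g. a uniform target distribution
print(quantile_normalize(x, f))         # [0.667 0.    0.333 1.   ]
```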

  14. How to choose a "good" target distribution?

  15. Supervised QN (SUQUAN). Standard QN: (1) fix f arbitrarily; (2) QN all samples to get Ψ_f(x_1), ..., Ψ_f(x_N); (3) learn a model on the normalized data, e.g.
      min_{w,b} (1/N) Σ_{i=1}^N ℓ_i(w⊤ Ψ_f(x_i) + b) + λ Ω(w)
  SUQUAN: jointly learn f and the model:
      min_{w,b,f} (1/N) Σ_{i=1}^N ℓ_i(w⊤ Ψ_f(x_i) + b) + λ Ω(w) + γ Ω_2(f)

  16. SUQUAN as rank-1 matrix regression over Φ(σ). Linear SUQUAN therefore solves
      min_{w,b,f} (1/N) Σ_{i=1}^N ℓ_i(w⊤ Ψ_f(x_i) + b) + λ Ω(w) + γ Ω_2(f)
      = min_{w,b,f} (1/N) Σ_{i=1}^N ℓ_i(w⊤ Π_σ(x_i)⊤ f + b) + λ Ω(w) + γ Ω_2(f)
      = min_{w,b,f} (1/N) Σ_{i=1}^N ℓ_i(⟨Π_σ(x_i), f w⊤⟩_Frobenius + b) + λ Ω(w) + γ Ω_2(f)
  This is a particular linear model to estimate a rank-1 matrix M = f w⊤. Each sample σ ∈ S_N is represented by the matrix Π_σ ∈ R^{N×N}. The problem is non-convex, but alternating optimization of f and w is easy.
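
A sketch of the alternating scheme, under simplifying assumptions: squared loss, no intercept, and ridge penalties for Ω and Ω_2 (the slides keep the loss ℓ_i and the regularizers generic); suquan_alternate and its arguments are illustrative names, not the authors' code.

```python
import numpy as np

def suquan_alternate(Pis, y, lam=1e-2, gamma=1e-2, n_iter=20):
    # Alternating minimization of a ridge-regression variant of SUQUAN:
    #   min_{w,f} (1/N) sum_i (y_i - <Pi_i, f w^T>_F)^2 + lam ||w||^2 + gamma ||f||^2
    # Since <Pi_i, f w^T>_F = w^T (Pi_i^T f) = f^T (Pi_i w), each step is a ridge regression.
    N, n = len(Pis), Pis[0].shape[0]
    f = np.linspace(0.0, 1.0, n)                      # initial target quantile function
    w = np.zeros(n)
    for _ in range(n_iter):
        Xw = np.stack([P.T @ f for P in Pis])         # features for the w-step
        w = np.linalg.solve(Xw.T @ Xw / N + lam * np.eye(n), Xw.T @ y / N)
        Xf = np.stack([P @ w for P in Pis])           # features for the f-step
        f = np.linalg.solve(Xf.T @ Xf / N + gamma * np.eye(n), Xf.T @ y / N)
    return w, f
```

Here each entry of Pis would be the permutation matrix of the ranks of a sample x_i, e.g. perm_matrix(rank_perm(x_i)) with the helpers sketched earlier.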

  17. Experiments: CIFAR-10. Image classification into 10 classes (45 binary problems), N = 5,000 per class, p = 1,024 pixels, linear logistic regression on raw pixels. [Figure: test AUC of SUQUAN BND vs. the median baseline on the 45 problems, and AUC for different target distributions: cauchy, exponential, uniform, gaussian, median, SUQUAN SVD, SUQUAN BND, SUQUAN SPAV.]

  18. Experiments: CIFAR-10. Example: horse vs. plane. Different methods learn different quantile functions. [Figure: quantile functions learned by median, SVD and SUQUAN BND, compared to the original.]

  19. Limits of the SUQUAN embedding. A linear model on Φ(σ) = Π_σ ∈ R^{N×N} captures first-order information of the form "i-th feature ranked at the j-th position". What about higher-order information such as "feature i larger than feature j"?

  20. The Kendall embedding (Jiao and Vert, 2015, 2017): Φ_{i,j}(σ) = 1 if σ(i) < σ(j), 0 otherwise.

  21. Geometry of the embedding. For any two permutations σ, σ′ ∈ S_N:
      Inner product: Φ(σ)⊤ Φ(σ′) = Σ_{1≤i≠j≤n} 1_{σ(i)<σ(j)} 1_{σ′(i)<σ′(j)} = n_c(σ, σ′), the number of concordant pairs.
      Distance: ‖Φ(σ) − Φ(σ′)‖² = Σ_{1≤i,j≤n} (1_{σ(i)<σ(j)} − 1_{σ′(i)<σ′(j)})² = 2 n_d(σ, σ′), where n_d is the number of discordant pairs.

  22. Kendall and Mallows kernels. The Kendall kernel is K_τ(σ, σ′) = n_c(σ, σ′). The Mallows kernel is K_M^λ(σ, σ′) = e^{−λ n_d(σ, σ′)} for λ ≥ 0. Theorem (Jiao and Vert, 2015, 2017): the Kendall and Mallows kernels are positive definite right-invariant kernels and can be evaluated in O(N log N) time. The kernel trick is useful with few samples in large dimensions.
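
A naive O(N²) sketch of both kernels (not the O(N log N) algorithm referred to in the theorem), counting concordant and discordant pairs directly; function names are illustrative.

```python
import numpy as np
from itertools import combinations

def kendall_kernel(s1, s2):
    # K_tau(s1, s2) = n_c, the number of concordant pairs.
    return sum((s1[i] < s1[j]) == (s2[i] < s2[j])
               for i, j in combinations(range(len(s1)), 2))

def mallows_kernel(s1, s2, lam=1.0):
    # K_M^lambda(s1, s2) = exp(-lam * n_d), n_d = number of discordant pairs.
    n_pairs = len(s1) * (len(s1) - 1) // 2
    return np.exp(-lam * (n_pairs - kendall_kernel(s1, s2)))

sigma = np.array([0, 2, 1, 3])
sigma_prime = np.array([1, 0, 2, 3])
print(kendall_kernel(sigma, sigma_prime))    # 4 concordant pairs
print(mallows_kernel(sigma, sigma_prime))    # exp(-2) since 2 pairs are discordant
```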

  23. Remark. Kondor and Barbosa (2010) proposed the diffusion kernel on the Cayley graph of the symmetric group generated by adjacent transpositions; it is computationally intensive (O(N 2^N)). The Mallows kernel is written as K_M^λ(σ, σ′) = e^{−λ n_d(σ, σ′)}, where n_d(σ, σ′) is the shortest-path distance on that Cayley graph, and it can be computed in O(N log N). [Figure: Cayley graph of S_4.]

  24. Applications. Average performance on 10 microarray classification problems (Jiao and Vert, 2017). [Figure: accuracy (0.4 to 1.0) of SVMkdtALL, SVMlinearTOP, SVMlinearALL, SVMkdtTOP, SVMpolyALL, KFDkdtALL, kTSP, SVMpolyTOP, KFDlinearALL, KFDpolyALL, TSP, SVMrbfALL, KFDrbfALL, APMV.]

  25. Extension: weighted Kendall kernel? Can we weight pairs differently based on their ranks? We still want a right-invariant kernel, i.e., the overall geometry should not change if we relabel the items: ∀ σ1, σ2, π ∈ S_N, K(σ1 π, σ2 π) = K(σ1, σ2).

  26. Related work. Given a weight function w : [1, n]² → R, many weighted versions of Kendall's τ have been proposed:
      Shieh (1998): Σ_{1≤i≠j≤n} w(σ(i), σ(j)) 1_{σ(i)<σ(j)} 1_{σ′(i)<σ′(j)}
      Kumar and Vassilvitskii (2010): Σ_{1≤i≠j≤n} w(σ(i), σ(j)) [(p_σ(i) − p_σ′(i)) / (σ(i) − σ′(i))] [(p_σ(j) − p_σ′(j)) / (σ(j) − σ′(j))] 1_{σ(i)<σ(j)} 1_{σ′(i)<σ′(j)}
      Vigna (2015): Σ_{1≤i≠j≤n} w(i, j) 1_{σ(i)<σ(j)} 1_{σ′(i)<σ′(j)}
  However, they are either not symmetric (1st and 2nd) or not right-invariant (3rd).

  27. A right-invariant weighted Kendall kernel (Jiao and Vert, 2018). Theorem: for any matrix U ∈ R^{n×n},
      K_U(σ, σ′) = Σ_{1≤i≠j≤n} U_{σ(i),σ(j)} U_{σ′(i),σ′(j)} 1_{σ(i)<σ(j)} 1_{σ′(i)<σ′(j)}
  is a right-invariant p.d. kernel on S_N.

  28. Examples. U_{a,b} corresponds to the weight of (items ranked at) positions a and b in a permutation. Interesting choices include:
      Top-k: for some k ∈ [1, n], U_{a,b} = 1 if a ≤ k and b ≤ k, 0 otherwise.
      Additive: for some u ∈ R^n, take U_{ij} = u_i + u_j.
      Multiplicative: for some u ∈ R^n, take U_{ij} = u_i u_j.
  Theorem (kernel trick): the weighted Kendall kernel can be computed in O(n ln(n)) for the top-k, additive or multiplicative weights.
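
A naive quadratic-time sketch of K_U for these weights (the O(n ln n) algorithms from the theorem are not shown); weighted_kendall is an illustrative name, ranks are 0-indexed, so the top-3 weight puts mass on positions 0..2, and the decreasing weight u is only an example.

```python
import numpy as np

def weighted_kendall(s1, s2, U):
    # K_U(s1, s2) = sum over i != j of U[s1(i), s1(j)] * U[s2(i), s2(j)]
    #               * 1{s1(i) < s1(j)} * 1{s2(i) < s2(j)}
    n, val = len(s1), 0.0
    for i in range(n):
        for j in range(n):
            if i != j and s1[i] < s1[j] and s2[i] < s2[j]:
                val += U[s1[i], s1[j]] * U[s2[i], s2[j]]
    return val

n = 5
U_topk = np.zeros((n, n)); U_topk[:3, :3] = 1.0   # top-3 weight (0-indexed positions)
u = 1.0 / (np.arange(n) + 1.0)                    # a decreasing position weight
U_add = u[:, None] + u[None, :]                   # additive weight U_ij = u_i + u_j
U_mult = np.outer(u, u)                           # multiplicative weight U_ij = u_i u_j

rng = np.random.default_rng(0)
s1, s2 = rng.permutation(n), rng.permutation(n)
print(weighted_kendall(s1, s2, U_topk),
      weighted_kendall(s1, s2, U_add),
      weighted_kendall(s1, s2, U_mult))
```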

  29. Learning the weights (1/2). K_U can be written as K_U(σ, σ′) = Φ_U(σ)⊤ Φ_U(σ′) with
      Φ_U(σ) = ( U_{σ(i),σ(j)} 1_{σ(i)<σ(j)} )_{1≤i≠j≤n}
  Interesting fact: for any upper triangular matrix U ∈ R^{n×n}, Φ_U(σ) = Π_σ⊤ U Π_σ with (Π_σ)_{ij} = 1_{i=σ(j)}. Hence a linear model on Φ_U can be rewritten as
      f_{β,U}(σ) = ⟨β, Φ_U(σ)⟩_{Frobenius (n×n)} = ⟨β, Π_σ⊤ U Π_σ⟩_{Frobenius (n×n)} = ⟨Π_σ ⊗ Π_σ, vec(U) ⊗ vec(β)⊤⟩_{Frobenius (n²×n²)}
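
A quick numerical check (illustrative, not from the slides) of the identity Φ_U(σ) = Π_σ⊤ U Π_σ and of the Kronecker rewriting; perm_matrix is the same hypothetical helper as above, and U is taken strictly upper triangular so that the indicator 1_{σ(i)<σ(j)} is absorbed by U.

```python
import numpy as np

def perm_matrix(sigma):
    n = len(sigma)
    P = np.zeros((n, n))
    P[sigma, np.arange(n)] = 1.0      # [Pi_sigma]_{ij} = 1 if sigma(j) = i
    return P

rng = np.random.default_rng(1)
n = 5
sigma = rng.permutation(n)
U = np.triu(rng.random((n, n)), k=1)  # strictly upper triangular weights

# Direct embedding: Phi_U(sigma)_{ij} = U[sigma(i), sigma(j)] * 1{sigma(i) < sigma(j)}.
Phi = np.array([[U[sigma[i], sigma[j]] * (sigma[i] < sigma[j]) for j in range(n)]
                for i in range(n)])

# Identity from the slide: Phi_U(sigma) = Pi_sigma^T U Pi_sigma.
P = perm_matrix(sigma)
assert np.allclose(Phi, P.T @ U @ P)

# Kronecker rewriting: <beta, Phi_U(sigma)>_F = <Pi ⊗ Pi, vec(U) vec(beta)^T>_F.
beta = rng.random((n, n))
lhs = np.sum(beta * Phi)
rhs = np.sum(np.kron(P, P) * np.outer(U.ravel(order="F"), beta.ravel(order="F")))
assert np.isclose(lhs, rhs)
```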

  30. Learning the weights (2/2). The model f_{β,U}(σ) = ⟨Π_σ ⊗ Π_σ, vec(U) ⊗ vec(β)⊤⟩_{Frobenius (n²×n²)} is symmetric in U and β. Instead of fixing the weights U and optimizing β, we can jointly optimize β and U to learn the weights. This is the same as SUQUAN, with Π_σ ⊗ Π_σ instead of Π_σ.

  31. Experiments. Eurobarometer data (Christensen, 2010): >12k individuals rank 6 sources of information. Binary classification problem: predict age from the ranking (>40y vs. <40y). [Figure: accuracy of the weighted Kendall kernels: standard (or top-6), top-5, top-4, top-3, top-2, average, add weight (hb), mult weight (hb), add weight (log), mult weight (log), learned weight (svd), learned weight (opt).]

  32. Towards higher-order representations.
      f_{β,U}(σ) = ⟨Π_σ ⊗ Π_σ, vec(U) ⊗ vec(β)⊤⟩_{Frobenius (n²×n²)}
  is a particular rank-1 linear model for the embedding Σ_σ = Π_σ ⊗ Π_σ ∈ {0,1}^{n²×n²}. Σ is the direct sum of the second-order and first-order permutation representations: Σ ≅ τ_(n−2,1,1) ⊕ τ_(n−1,1). This generalizes SUQUAN, which considers the first-order representation Π_σ only: h_{β,w}(σ) = ⟨Π_σ, w ⊗ β⊤⟩_{Frobenius (n×n)}. Generalization to higher-order information is possible by using higher-order linear representations of the symmetric group, which are the good basis for right-invariant kernels (Bochner theorem)...
