Clustering Rankings in the Fourier Domain
Stéphan Clémençon and Romaric Gaudel and Jérémie Jakubowicz
LTCI, Telecom Paristech (TSI), UMR Institut Telecom/CNRS No. 5141
ECML PKDD, September 2011

Distributions on rankings
Many applications consider ranked data / distributions on rankings (uniform distribution over the rankings that satisfy the constraints)
◮ Top-k lists
  ⋆ Rank of the k most preferred objects: 3 > 2 > 5 > ...
◮ Preference data
  ⋆ Preferences on k (randomly) picked objects: ... > 3 > ... > 2 > ... > 5 > ... ("sushi" dataset)
◮ Bucket order
  ⋆ Preferences on groups of objects: 3, 2 > 5, 1, 7 > 4, 6, 8

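To make the "uniform distribution with respect to constraints" idea concrete, here is a minimal sketch for a top-k list. The function name `topk_distribution` and the encoding of a ranking as a tuple of items listed from most to least preferred are assumptions made for illustration; they are not taken from the slides.

```python
from itertools import permutations

def topk_distribution(n, top):
    """Uniform distribution over the rankings of 1..n that start with `top`.

    A ranking is a tuple of items ordered from most to least preferred.
    Returns a dict mapping each consistent ranking to its probability.
    """
    top = tuple(top)
    consistent = [p for p in permutations(range(1, n + 1))
                  if p[:len(top)] == top]
    weight = 1.0 / len(consistent)
    return {p: weight for p in consistent}

# Top-3 list "3 > 2 > 5 > ..." over n = 5 objects:
# only the order of the two remaining objects is free.
dist = topk_distribution(5, [3, 2, 5])
```

With n = 5 and three positions fixed, two rankings remain and each receives probability 0.5. Enumerating all n! rankings is only practical for small n.
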
Representation for distributions on rankings
Probability table
◮ n! (factorial n) coefficients
Fourier representation [Diaconis, 1989; Kondor & Barbosa, 2010]
◮ n! coefficients
◮ Few relevant coefficients in practice
Parametric models
◮ Mallows [Mallows, 1957]
◮ Plackett-Luce [Luce, 1959; Plackett, 1975]

Contributions
Clustering of rankings through sparse Fourier representation
Position
◮ Clustering of distributions on rankings
  ⋆ Gather ranking distributions with similar shapes
Proposed approach
◮ Work in the Fourier representation
  ⋆ Sparse representation of one distribution ⇒ sparse difference between the representations of two distributions

Outline
1 Sparsity in the Fourier Representation
2 Sparse Clustering of Rankings
3 Numerical Experiments

Fourier representation
For functions on the real line
Functions are decomposed on the sinusoidal basis
$$f(x) = 1.1 + 2.1\cos(x) + 3.2\cos(2x) + 1.5\cos(3x) + 0.2\cos(4x) + 0.01\cos(5x) + \dots$$
[Figure: f drawn as the sum of its sinusoidal components]
The information is contained in a few (low-frequency) coefficients ⇒ reduced storage/transfer/computation costs

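The effect of dropping high-frequency terms can be checked numerically. The sketch below uses only the six coefficients shown on the slide (the trailing "..." is ignored, which is an assumption) and measures the worst-case pointwise error of a truncated sum on a grid over [0, 2π). The names `partial_sum` and `max_truncation_error` are illustrative.

```python
import math

# Coefficients of cos(k x) for k = 0..5, as shown on the slide.
COEFFS = [1.1, 2.1, 3.2, 1.5, 0.2, 0.01]

def partial_sum(x, m):
    """Sum of the first m terms a_k * cos(k x)."""
    return sum(a * math.cos(k * x) for k, a in enumerate(COEFFS[:m]))

def max_truncation_error(m, grid_size=1000):
    """Largest |f(x) - f_m(x)| on a regular grid of [0, 2*pi)."""
    xs = [2 * math.pi * i / grid_size for i in range(grid_size)]
    full = len(COEFFS)
    return max(abs(partial_sum(x, full) - partial_sum(x, m)) for x in xs)

error_with_four_terms = max_truncation_error(4)
```

Keeping four of the six terms leaves an error of at most 0.2 + 0.01 = 0.21, reached at x = 0, while the retained coefficients range from 1.1 to 3.2.
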
Fourier representation
For functions on $S_n$ [Diaconis, 1989]
There is no simple basis (corresponding to eigenspaces of dimension 1)
⇒ Fourier coefficients are matrices indexed by the set $\mathcal{R}_n$ of all integer partitions of $n$
[Figure: $\mathcal{F}f$ drawn as a sequence of square matrices of varying sizes]
$$\mathcal{R}_n = \Bigl\{ \xi = (n_1, \dots, n_k) \in \mathbb{N}^{*k} : n_1 \ge \dots \ge n_k,\ \sum_{i=1}^{k} n_i = n,\ 1 \le k \le n \Bigr\}$$
"Low-frequency" coefficients are related to low-order summaries ($\mathbb{P}[\sigma(i,j) = (k,\ell)]$)

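The index set $\mathcal{R}_n$ and the matrix sizes $d_\xi$ can be enumerated directly. The sketch below is illustrative only: the function names are hypothetical, and the dimension $d_\xi$ is computed with the hook length formula, a standard result that the slide does not state.

```python
from math import factorial

def partitions(n, largest=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def dimension(xi):
    """Size d_xi of the Fourier coefficient matrix indexed by xi,
    computed with the hook length formula: n! / product of hook lengths."""
    hooks = 1
    for i, row in enumerate(xi):
        for j in range(row):
            arm = row - j - 1                            # cells to the right
            leg = sum(1 for r in xi[i + 1:] if r > j)    # cells below
            hooks *= arm + leg + 1
    return factorial(sum(xi)) // hooks

R5 = list(partitions(5))
dims = [dimension(xi) for xi in R5]
total_coefficients = sum(d * d for d in dims)
```

For n = 5 this gives 7 partitions with matrix sizes 1, 4, 5, 6, 5, 4, 1, and the squared sizes add up to 120 = 5!, which matches the n! coefficients of the probability table.
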
Example: Mallows($S_5$)
Exponential distribution on rankings, γ = 0.1
[Figure: two panels plotting coefficient value against coefficient index (0 to 120). Left panel: "Temporal" coefficients. Right panel: Fourier coefficients. Legend: [3 2 4 1 5], [3 5 4 2 1], [1 2 4 5 3].]
Remark:
◮ A few relevant parameters when using the Fourier representation

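A probability table of this kind can be built as follows. The slide only says "exponential distribution on rankings" with parameter γ, so the sketch assumes the common Mallows form P(σ) ∝ exp(−γ · d(σ, σ0)) with d the Kendall tau distance; both the choice of distance and the function names are assumptions.

```python
from itertools import combinations, permutations
from math import exp

def kendall_tau(s, t):
    """Number of item pairs that s and t order differently."""
    pos_s = {item: rank for rank, item in enumerate(s)}
    pos_t = {item: rank for rank, item in enumerate(t)}
    return sum(1 for a, b in combinations(s, 2)
               if (pos_s[a] - pos_s[b]) * (pos_t[a] - pos_t[b]) < 0)

def mallows(center, gamma):
    """Probability table of a Mallows model centered at `center`."""
    perms = list(permutations(sorted(center)))
    weights = [exp(-gamma * kendall_tau(p, center)) for p in perms]
    z = sum(weights)  # normalizing constant
    return {p: w / z for p, w in zip(perms, weights)}

table = mallows((3, 2, 4, 1, 5), gamma=0.1)
```

The table has 5! = 120 entries, all nonzero, with the largest probability at the center ranking.
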
Uncertainty principle
Balancing sparsity
Theorem (inspired by [Donoho & Stark, 1989])
Let $f \in \mathbb{C}[S_n]$ with Fourier transform $\mathcal{F}f$. Denote by $\operatorname{supp}(f) = \{\sigma \in S_n : f(\sigma) \neq 0\}$ and by $\operatorname{supp}(\mathcal{F}f) = \{\xi \in \mathcal{R}_n : \mathcal{F}f(\xi) \neq 0\}$ the support of $f$ and that of its Fourier transform respectively. Then, we have:
$$\#\operatorname{supp}(f) \cdot \sum_{\xi \in \operatorname{supp}(\mathcal{F}f)} d_\xi^2 \ \ge\ n!.$$
Direct consequence
◮ Both representations cannot be simultaneously sparse
[Figure: Distortion with Mallows($S_5$). Distortion (0 to 1) against the number of used coefficients (0 to 120), for γ = 10, γ = 1 and γ = 0.1.]

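Two extreme cases show that the bound can be attained. The worked check below relies on standard facts that the slide does not state: the Fourier transform of a point mass at $\sigma_0$ is an invertible representation matrix (hence nonzero for every $\xi$), the uniform distribution has a nonzero coefficient only at the trivial partition $(n)$, for which $d_{(n)} = 1$, and $\sum_{\xi \in \mathcal{R}_n} d_\xi^2 = n!$.

```latex
\begin{align*}
f = \delta_{\sigma_0}:&\quad \#\operatorname{supp}(f) = 1,\quad
  \operatorname{supp}(\mathcal{F}f) = \mathcal{R}_n
  &&\Rightarrow\quad 1 \cdot \sum_{\xi \in \mathcal{R}_n} d_\xi^2 = n! \\
f \equiv \tfrac{1}{n!}:&\quad \#\operatorname{supp}(f) = n!,\quad
  \operatorname{supp}(\mathcal{F}f) = \{(n)\}
  &&\Rightarrow\quad n! \cdot d_{(n)}^2 = n!
\end{align*}
```

In both cases one representation is as sparse as possible and the other is fully dense.
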
Clustering of rankings
Aim
◮ Gather distributions on rankings with similar shape
Objective function
◮ Minimize (over all partitions $\mathcal{C}$)
$$\widehat{M}(\mathcal{C}) = \sum_{l=1}^{L} \sum_{1 \le i,j \le N} \|f_i - f_j\|^2 \cdot \mathbb{I}\{(f_i, f_j) \in \mathcal{C}_l^2\}$$
$$= \frac{1}{n!} \sum_{\xi \in \mathcal{R}_n} d_\xi \sum_{l=1}^{L} \sum_{\substack{1 \le i,j \le N: \\ (f_i, f_j) \in \mathcal{C}_l^2}} \|\mathcal{F}f_i(\xi) - \mathcal{F}f_j(\xi)\|^2_{HS(d_\xi)}$$
with $d_\xi \times d_\xi$ the dimension of the matrix indexed by $\xi$

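The first form of the objective can be evaluated directly on probability tables. In the sketch below, each distribution is a dict from rankings to probabilities and a partition is given as one cluster label per distribution; this encoding and the function names are assumptions made for illustration. The sum runs over ordered pairs (i, j), as in the formula.

```python
def squared_distance(f, g):
    """Squared Euclidean distance between two probability tables."""
    keys = set(f) | set(g)
    return sum((f.get(s, 0.0) - g.get(s, 0.0)) ** 2 for s in keys)

def dispersion(dists, labels):
    """Sum of ||f_i - f_j||^2 over ordered pairs (i, j) in the same cluster."""
    total = 0.0
    for i, f in enumerate(dists):
        for j, g in enumerate(dists):
            if labels[i] == labels[j]:
                total += squared_distance(f, g)
    return total

# Three toy distributions on S_2.
f1 = {(1, 2): 1.0}
f2 = {(1, 2): 0.75, (2, 1): 0.25}
f3 = {(2, 1): 1.0}
toy = [f1, f2, f3]

tight = dispersion(toy, [0, 0, 1])   # f1 and f2 together
loose = dispersion(toy, [0, 1, 1])   # f2 and f3 together
```

Grouping f1 with f2 gives 0.25, grouping f2 with f3 gives 2.25, so the first partition is preferred by the criterion.
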
Managing sparsity
Aim
◮ Gather distributions on rankings with similar shape
◮ Use few Fourier coefficients
New objective function [Witten & Tibshirani, 2010]
◮ Minimize (over all partitions $\mathcal{C}$ and all weight vectors $\omega$)
$$\widehat{M}_\omega(\mathcal{C}) = \sum_{\xi \in \mathcal{R}_n} \frac{\omega_\xi d_\xi}{n!} \sum_{l=1}^{L} \sum_{\substack{1 \le i,j \le N: \\ (f_i, f_j) \in \mathcal{C}_l^2}} \|\mathcal{F}f_i(\xi) - \mathcal{F}f_j(\xi)\|^2_{HS(d_\xi)}$$
with $\omega = (\omega_\xi)_{\xi \in \mathcal{R}_n} \in \mathbb{R}_+^{\#\mathcal{R}_n}$, $\|\omega\|_{\ell_2}^2 \le 1$ and $\|\omega\|_{\ell_1} \le \lambda$
Remark:
◮ Fixing $\omega = (1/\sqrt{\#\mathcal{R}_n}, \dots, 1/\sqrt{\#\mathcal{R}_n})$ leads to the initial optimization problem (without $\omega$), up to a constant factor

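Once the per-coefficient dispersions are available, the weighted objective and its constraints are simple to evaluate. In the sketch below, `D` stands for the values $D_\xi = \frac{d_\xi}{n!} \sum_l \sum_{i,j} \|\mathcal{F}f_i(\xi) - \mathcal{F}f_j(\xi)\|^2_{HS(d_\xi)}$ for a fixed partition; the numbers are hypothetical and the function names are illustrative.

```python
import math

def is_feasible(weights, lam, tol=1e-12):
    """Check w >= 0, ||w||_2^2 <= 1 and ||w||_1 <= lam."""
    return (all(w >= 0 for w in weights)
            and sum(w * w for w in weights) <= 1 + tol
            and sum(weights) <= lam + tol)

def weighted_objective(per_coefficient, weights):
    """Sum over xi of w_xi * D_xi."""
    return sum(w * d for w, d in zip(weights, per_coefficient))

D = [4.0, 1.0, 0.5, 0.25]          # hypothetical per-coefficient dispersions
K = len(D)
uniform = [1 / math.sqrt(K)] * K   # the weights used in the remark above

unweighted = sum(D)
with_uniform_weights = weighted_objective(D, uniform)
```

With uniform weights the value is the unweighted objective multiplied by $1/\sqrt{K}$ (here 5.75 × 0.5 = 2.875), so both problems have the same minimizing partition. The uniform vector has $\ell_1$ norm $\sqrt{K}$, so it is feasible only when λ ≥ √K; smaller values of λ force some weights toward zero.
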
Algorithm
Initialize $\omega = (1/\sqrt{\#\mathcal{R}_n}, \dots, 1/\sqrt{\#\mathcal{R}_n})$
Until convergence, iterate steps 1 and 2
1 Fixing the weight vector $\omega$, minimize $\widehat{M}_\omega(\mathcal{C})$ over the partition $\mathcal{C}$
2 Fixing the partition $\mathcal{C}$, minimize $\widehat{M}_\omega(\mathcal{C})$ over $\omega$
Remarks
◮ Step 1 is performed by a standard clustering algorithm
◮ Step 2 admits a closed-form solution [Witten & Tibshirani, 2010]

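The slide does not spell out the closed form of step 2. As a hedged illustration, the sketch below implements the weight update from Witten & Tibshirani's sparse clustering framework, which they state as maximizing a weighted sum of non-negative per-feature scores under the same two norm constraints: soft-threshold the scores, then rescale to unit $\ell_2$ norm, with the threshold found by bisection. How the scores are derived from the per-coefficient dispersions is left to the caller, and the function names are hypothetical. The sketch assumes λ ≥ 1 and a unique largest score.

```python
import math

def soft_threshold(values, delta):
    """Shrink each value toward zero by delta, clipping at zero."""
    return [max(v - delta, 0.0) for v in values]

def l2_normalize(values):
    """Rescale to unit l2 norm (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else list(values)

def update_weights(scores, lam, iterations=60):
    """Maximize sum(w * scores) subject to w >= 0, ||w||_2 <= 1, ||w||_1 <= lam."""
    positive = [max(s, 0.0) for s in scores]
    w = l2_normalize(positive)
    if sum(w) <= lam:
        return w                      # l1 constraint inactive: no threshold
    low, high = 0.0, max(positive)
    for _ in range(iterations):       # bisection on the threshold
        mid = (low + high) / 2
        w = l2_normalize(soft_threshold(positive, mid))
        if sum(w) > lam:
            low = mid
        else:
            high = mid
    return l2_normalize(soft_threshold(positive, high))

scores = [4.0, 1.0, 0.5, 0.25]        # hypothetical per-coefficient scores
sparse_w = update_weights(scores, lam=1.2)
dense_w = update_weights(scores, lam=10.0)
```

With λ = 1.2 the smallest score is thresholded to a zero weight, while with λ = 10 the $\ell_1$ constraint is inactive and every coefficient keeps a positive weight.
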
Experiments
Aim
◮ Recover clustering information
◮ Use few coefficients
Datasets
◮ Mallows (synthetic)
  ⋆ Exponential distribution on rankings
◮ Top-k lists (synthetic)
  ⋆ Uniform distribution on rankings
◮ E-commerce dataset
  ⋆ List of purchased products (ordered by date)

Mallows($S_7$), γ = 1
[Figure: two panels, each laying out the same ten rankings of seven items. Left panel: "Temporal" representation (3 coefficients selected). Right panel: Fourier representation (54 coefficients selected).]
Remarks:
◮ The Fourier representation uses few coefficients (compared to n! = 5,040)
◮ The Fourier representation recovers the clustering information