PAC-Bayesian Analysis of Co-clustering, Graph Clustering and Pairwise Clustering
Yevgeny Seldin
Motivation Example
• Clustering cannot be analyzed without specifying what it will be used for!
Example
• Cluster then pack
• Clustering by shape is preferable
• Evaluate the amount of time saved
How to define a clustering problem?
• Common pitfall: the goal is defined in terms of the solution
  – Graph cut
  – Spectral clustering
  – Information-theoretic approaches
• Which one to choose? How to compare?
• Our goal: suggest a problem formulation that is independent of the solution method
Outline
• Two problems behind co-clustering
  – Discriminative prediction
  – Density estimation
• PAC-Bayesian analysis of discriminative prediction with co-clustering
• PAC-Bayesian analysis of graph clustering
Discriminative Prediction with Co-clustering
• Example: collaborative filtering
  [Figure: ratings matrix with viewers X1 along the rows, movies X2 along the columns, and entries Y]
• Goal: find a discriminative prediction rule q(Y | X1, X2)
• Evaluation:
  L(q) = E_{p(X1,X2,Y)} E_{q(Y'|X1,X2)} l(Y, Y')
  (outer expectation w.r.t. the true distribution p(X1, X2, Y); inner expectation w.r.t. the classifier q(Y' | X1, X2); l(Y, Y') is the given loss)
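As a minimal illustration (not part of the talk), the evaluation criterion can be computed exactly when X1, X2 and Y are discrete and the true joint distribution is available as a table; in practice p is unknown and L(q) is estimated from a sample. All names below are illustrative:

```python
import numpy as np

def expected_loss(p_joint, q_pred, loss):
    """L(q) = E_{p(X1,X2,Y)} E_{q(Y'|X1,X2)} l(Y, Y').

    p_joint : (n1, n2, nY) array, true joint distribution p(X1, X2, Y)
    q_pred  : (n1, n2, nY) array, prediction rule q(Y' | X1, X2)
    loss    : (nY, nY) array, loss[y, yp] = l(Y = y, Y' = yp)
    """
    # Inner expectation: E_{q(Y'|x1,x2)} l(y, Y') for every (x1, x2, y).
    inner = np.einsum('abp,yp->aby', q_pred, loss)
    # Outer expectation over the true joint distribution.
    return float(np.sum(p_joint * inner))
```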
Co-occurrence Data Analysis
• Example: words-documents co-occurrence data
  [Figure: co-occurrence matrix with documents X1 along the rows and words X2 along the columns]
• Goal: find an estimator q(X1, X2) for the joint distribution p(X1, X2)
• Evaluation:
  L(q) = −E_{p(X1,X2)} ln q(X1, X2)
  (expectation w.r.t. the true distribution p(X1, X2))
Outline
• Two problems behind co-clustering
  – Discriminative prediction
  – Density estimation
• PAC-Bayesian analysis of discriminative prediction with co-clustering
• PAC-Bayesian analysis of graph clustering
Discriminative prediction based on co-clustering
• Model: q(Y | X1, X2) = Σ_{C1,C2} q(Y | C1, C2) q(C1 | X1) q(C2 | X2)
  [Figure: graphical model with Y depending on C1 and C2, and each C_i depending on X_i]
• Denote: Q = {q(C1 | X1), q(C2 | X2), q(Y | C1, C2)}
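A short sketch of how this factored prediction rule can be evaluated as a tensor contraction, assuming the three factors of Q are stored as arrays (names are illustrative):

```python
import numpy as np

def coclustering_rule(q_c1, q_c2, q_y):
    """q(Y|X1,X2) = sum_{C1,C2} q(Y|C1,C2) q(C1|X1) q(C2|X2).

    q_c1 : (n1, k1) array, rows are q(C1 | X1 = x1)
    q_c2 : (n2, k2) array, rows are q(C2 | X2 = x2)
    q_y  : (k1, k2, nY) array, q(Y | C1, C2)
    Returns an (n1, n2, nY) array of label distributions.
    """
    return np.einsum('ac,bd,cdy->aby', q_c1, q_c2, q_y)
```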
Model: Q = {q(Y|C1,C2), q(C1|X1), q(C2|X2)},  q(Y|X1,X2) = Σ_{C1,C2} q(Y|C1,C2) q(C1|X1) q(C2|X2)
Generalization Bound
• With probability ≥ 1−δ:
  kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i;C_i) + K) / N
  where kl(L̂(Q) || L(Q)) = L̂(Q) ln(L̂(Q)/L(Q)) + (1−L̂(Q)) ln((1−L̂(Q))/(1−L(Q)))
• A looser, but simpler form of the bound:
  L(Q) ≤ L̂(Q) + √(2 L̂(Q) (Σ_i |X_i| I(X_i;C_i) + K) / N) + 2 (Σ_i |X_i| I(X_i;C_i) + K) / N
Model: Q = {q(Y|C1,C2), q(C1|X1), q(C2|X2)},  q(Y|X1,X2) = Σ_{C1,C2} q(Y|C1,C2) q(C1|X1) q(C2|X2)
Generalization Bound
• With probability ≥ 1−δ:
  kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i;C_i) + K) / N
  K = Σ_i |C_i| ln|X_i| + (Π_i |C_i|) ln|Y| + ln(4N)/2 − ln δ
  (the first term is logarithmic in |X_i|, the second counts the partition cells, and the rest is the PAC-Bayesian bound part)
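The bound can be evaluated numerically by inverting the kl divergence with a binary search. A hedged sketch, assuming K as reconstructed on this slide and illustrative function names:

```python
import numpy as np

def kl_bernoulli(p_hat, p, eps=1e-12):
    """kl(p_hat || p) between two Bernoulli distributions (in nats)."""
    return (p_hat * np.log((p_hat + eps) / (p + eps))
            + (1 - p_hat) * np.log((1 - p_hat + eps) / (1 - p + eps)))

def kl_inverse_upper(p_hat, budget, tol=1e-9):
    """Largest p >= p_hat with kl(p_hat || p) <= budget (binary search)."""
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if kl_bernoulli(p_hat, mid) <= budget else (lo, mid)
    return lo

def loss_bound(emp_loss, mi, sizes_x, sizes_c, n_y, N, delta):
    """Upper bound on L(Q) from kl(L_hat(Q) || L(Q)) <= (sum_i |X_i| I(X_i;C_i) + K) / N.

    mi is the list of mutual informations I(X_i;C_i); K follows the slide's definition.
    """
    K = (sum(c * np.log(x) for c, x in zip(sizes_c, sizes_x))
         + np.prod(sizes_c) * np.log(n_y)
         + 0.5 * np.log(4 * N) - np.log(delta))
    budget = (sum(x * i for x, i in zip(sizes_x, mi)) + K) / N
    return kl_inverse_upper(emp_loss, budget)
```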
Model: Q = {q(Y|C1,C2), q(C1|X1), q(C2|X2)},  q(Y|X1,X2) = Σ_{C1,C2} q(Y|C1,C2) q(C1|X1) q(C2|X2)
Generalization Bound
• With probability ≥ 1−δ:
  kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i;C_i) + K) / N
• High complexity: I(X_i;C_i) = ln|X_i|  (each x_i in its own cluster)
• Low complexity: I(X_i;C_i) = 0  (all x_i in a single cluster)
Model: Q = {q(Y|C1,C2), q(C1|X1), q(C2|X2)},  q(Y|X1,X2) = Σ_{C1,C2} q(Y|C1,C2) q(C1|X1) q(C2|X2)
Generalization Bound
• With probability ≥ 1−δ:
  kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i;C_i) + K) / N
• Optimization tradeoff: empirical loss vs. "effective" partition complexity (lower complexity means higher empirical loss, and vice versa)
Practice
• With probability ≥ 1−δ:
  kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i;C_i) + K) / N
• Replace with a trade-off:
  F(Q) = β N L̂(Q) + Σ_i |X_i| I(X_i;C_i)
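A sketch of this trade-off objective, assuming the mutual informations I(X_i;C_i) are computed from the soft assignments q(C_i|X_i) and the empirical marginals p(X_i); all names are illustrative:

```python
import numpy as np

def mutual_information(q_c_given_x, p_x, eps=1e-12):
    """I(X;C) in nats for a soft assignment q(C|X) and a marginal p(X).

    q_c_given_x : (n, k) array, rows sum to 1
    p_x         : (n,) array, marginal distribution of X
    """
    p_c = p_x @ q_c_given_x                        # cluster marginal q(C)
    log_ratio = np.log((q_c_given_x + eps) / (p_c + eps))
    return float(np.sum(p_x[:, None] * q_c_given_x * log_ratio))

def trade_off(emp_loss, beta, N, q_list, p_list, sizes_x):
    """F(Q) = beta * N * L_hat(Q) + sum_i |X_i| I(X_i; C_i)."""
    reg = sum(x * mutual_information(q, p)
              for q, p, x in zip(q_list, p_list, sizes_x))
    return beta * N * emp_loss + reg
```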
Application
• MovieLens dataset
  – 100,000 ratings on a 5-star scale
  – 80,000 train ratings, 20,000 test ratings
  – 943 viewers × 1682 movies
  – State-of-the-art Mean Absolute Error (0.72)
  – The optimal performance is achieved even with a 300×300 cluster space
13×6 Clusters
  F(Q) = β N L̂(Q) + Σ_i |X_i| I(X_i;C_i)
  [Figure: bound and test MAE (Mean Absolute Error) as functions of β]

50×50 Clusters
  F(Q) = β N L̂(Q) + Σ_i |X_i| I(X_i;C_i)
  [Figure: bound and test MAE as functions of β]

283×283 Clusters
  F(Q) = β N L̂(Q) + Σ_i |X_i| I(X_i;C_i)
  [Figure: bound and test MAE as functions of β]
Weighted Graph Clustering
• The weights of the edges w_ij are generated by an unknown distribution p(w_ij | x_i, x_j)
• Given a sample of N edge weights
• Build a model q(w | x1, x2) such that E_{p(x1,x2,w)} E_{q(w'|x1,x2)} l(w, w') is minimized
Other problems
• Pairwise clustering = clustering of a weighted graph
  – Edge weights = pairwise relations
• Clustering of an unweighted graph
  – Present edges = weight 1
  – Absent edges = weight 0
Weighted Graph Clustering
• The weights of the links are generated according to:
  q(w_ij | X_i, X_j) = Σ_{Ca,Cb} q(w_ij | Ca, Cb) q(Ca | X_i) q(Cb | X_j)
• This is co-clustering with a shared q(C | X)
  – The same bounds and (almost the same) algorithms apply
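A sketch of the shared-assignment edge model; the only change from the co-clustering rule above is that the same q(C|X) appears on both endpoints of the edge (names illustrative):

```python
import numpy as np

def edge_model(q_c, q_w, i, j):
    """q(w | X_i, X_j) = sum_{Ca,Cb} q(w | Ca, Cb) q(Ca | X_i) q(Cb | X_j).

    q_c : (n, k) array, shared soft assignment q(C | X)
    q_w : (k, k, nW) array, cluster-level edge model q(w | Ca, Cb)
    Returns the (nW,) distribution over weights of the edge (i, j).
    """
    return np.einsum('a,b,abw->w', q_c[i], q_c[j], q_w)
```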
Application
• Optimize the trade-off:
  F(Q) = β N L̂(Q) + |X| I(X;C)
• Kings dataset
  – Edge weights = exponentiated negative distance between DNS servers
  – |X| = 1740
  – Number of edges = 1,512,930
Graph Clustering Application
  [Figure: empirical loss L̂(Q) and its bound (left axis) together with the information I(X;C) in nats (right axis), plotted against the number of clusters |C|]
Relation with Matrix Factorization
• Co-clustering:
  – g(X1, X2) = Σ_{C1,C2} q(C1 | X1) g(C1, C2) q(C2 | X2)
  – M ≈ Q1^T G Q2
• Graph clustering:
  – g(X1, X2) = Σ_{C1,C2} q(C1 | X1) g(C1, C2) q(C2 | X2)
  – M ≈ Q^T G Q (the two sides share the same Q)
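An illustrative numpy rendering of this factorization view. Note one assumed convention change: with the assignment matrices stored as objects × clusters (rather than the slide's clusters × objects), M ≈ Q1^T G Q2 becomes M = Q1 G Q2^T; graph clustering then corresponds to Q2 = Q1 with a square G. All sizes and names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 6 row objects, 5 column objects, 3 x 2 clusters.
Q1 = rng.dirichlet(np.ones(3), size=6)   # q(C1|X1): each row sums to 1
Q2 = rng.dirichlet(np.ones(2), size=5)   # q(C2|X2)
G = rng.random((3, 2))                   # g(C1, C2): cluster-level matrix

# Co-clustering reconstruction of the data matrix:
# M[a, b] = sum_{c, d} Q1[a, c] * G[c, d] * Q2[b, d]
M = Q1 @ G @ Q2.T
```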
Summary of main contributions
• Formulation of co-clustering and graph clustering (unsupervised learning) as prediction problems
• PAC-Bayesian analysis of co-clustering and graph clustering
  – Regularization terms
• Encouraging empirical results
Future Directions
• Practice:
  – More applications
• Theory:
  – Continuous domains
  – Multidimensional matrices

References
• Co-clustering: Seldin & Tishby, JMLR 2010 (submitted, available online)
• Graph clustering: Seldin, Social Analytics 2010