PAC-Bayesian Analysis of Co-clustering, Graph Clustering and Pairwise Clustering
Yevgeny Seldin
Motivation Example
• Clustering cannot be analyzed without specifying what it will be used for!
Example
• Cluster then pack
• Clustering by shape is preferable
• Evaluate the amount of time saved
How to define a clustering problem?
• Common pitfall: the goal is defined in terms of the solution
  – Graph cut
  – Spectral clustering
  – Information-theoretic approaches
• Which one to choose? How to compare?
• Our goal: suggest a problem formulation that is independent of the solution method
Outline
• Two problems behind co-clustering
  – Discriminative prediction
  – Density estimation
• PAC-Bayesian analysis of discriminative prediction with co-clustering
• PAC-Bayesian analysis of graph clustering
Discriminative Prediction with Co-clustering
• Example: collaborative filtering
  [Figure: ratings matrix with viewers X1 along the rows, movies X2 along the columns, and entries Y]
• Goal: find a discriminative prediction rule q(Y | X1, X2)
• Evaluation:
  L(q) = E_{p(X1,X2,Y)} E_{q(Y'|X1,X2)} l(Y, Y')
  (outer expectation w.r.t. the true distribution p(X1, X2, Y); inner expectation w.r.t. the classifier q(Y' | X1, X2); l(Y, Y') is the given loss)
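As a minimal illustration (not part of the talk), the evaluation criterion can be computed exactly when X1, X2 and Y are discrete and the true joint distribution is available as a table; in practice p is unknown and L(q) is estimated from a sample. All names below are illustrative:

```python
import numpy as np

def expected_loss(p_joint, q_pred, loss):
    """L(q) = E_{p(X1,X2,Y)} E_{q(Y'|X1,X2)} l(Y, Y').

    p_joint : (n1, n2, nY) array, true joint distribution p(X1, X2, Y)
    q_pred  : (n1, n2, nY) array, prediction rule q(Y' | X1, X2)
    loss    : (nY, nY) array, loss[y, yp] = l(Y = y, Y' = yp)
    """
    # Inner expectation: E_{q(Y'|x1,x2)} l(y, Y') for every (x1, x2, y).
    inner = np.einsum('abp,yp->aby', q_pred, loss)
    # Outer expectation over the true joint distribution.
    return float(np.sum(p_joint * inner))
```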
Co-occurrence Data Analysis
• Example: words-documents co-occurrence data
  [Figure: co-occurrence matrix with documents X1 along the rows and words X2 along the columns]
• Goal: find an estimator q(X1, X2) for the joint distribution p(X1, X2)
• Evaluation:
  L(q) = −E_{p(X1,X2)} ln q(X1, X2)
  (expectation w.r.t. the true distribution p(X1, X2))
Outline
• Two problems behind co-clustering
  – Discriminative prediction
  – Density estimation
• PAC-Bayesian analysis of discriminative prediction with co-clustering
• PAC-Bayesian analysis of graph clustering
Discriminative prediction based on co-clustering
• Model: q(Y | X1, X2) = Σ_{C1,C2} q(Y | C1, C2) q(C1 | X1) q(C2 | X2)
  [Figure: graphical model with Y depending on C1 and C2, and each C_i depending on X_i]
• Denote: Q = {q(C1 | X1), q(C2 | X2), q(Y | C1, C2)}
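A short sketch of how this factored prediction rule can be evaluated as a tensor contraction, assuming the three factors of Q are stored as arrays (names are illustrative):

```python
import numpy as np

def coclustering_rule(q_c1, q_c2, q_y):
    """q(Y|X1,X2) = sum_{C1,C2} q(Y|C1,C2) q(C1|X1) q(C2|X2).

    q_c1 : (n1, k1) array, rows are q(C1 | X1 = x1)
    q_c2 : (n2, k2) array, rows are q(C2 | X2 = x2)
    q_y  : (k1, k2, nY) array, q(Y | C1, C2)
    Returns an (n1, n2, nY) array of label distributions.
    """
    return np.einsum('ac,bd,cdy->aby', q_c1, q_c2, q_y)
```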
Model: Q = {q(Y|C1,C2), q(C1|X1), q(C2|X2)},  q(Y|X1,X2) = Σ_{C1,C2} q(Y|C1,C2) q(C1|X1) q(C2|X2)
Generalization Bound
• With probability ≥ 1−δ:
  kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i;C_i) + K) / N
  where kl(L̂(Q) || L(Q)) = L̂(Q) ln(L̂(Q)/L(Q)) + (1−L̂(Q)) ln((1−L̂(Q))/(1−L(Q)))
• A looser, but simpler form of the bound:
  L(Q) ≤ L̂(Q) + √(2 L̂(Q) (Σ_i |X_i| I(X_i;C_i) + K) / N) + 2 (Σ_i |X_i| I(X_i;C_i) + K) / N
Model: Q = {q(Y|C1,C2), q(C1|X1), q(C2|X2)},  q(Y|X1,X2) = Σ_{C1,C2} q(Y|C1,C2) q(C1|X1) q(C2|X2)
Generalization Bound
• With probability ≥ 1−δ:
  kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i;C_i) + K) / N
  K = Σ_i |C_i| ln|X_i| + (Π_i |C_i|) ln|Y| + ln(4N)/2 − ln δ
  (the first term is logarithmic in |X_i|, the second counts the partition cells, and the rest is the PAC-Bayesian bound part)
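The bound can be evaluated numerically by inverting the kl divergence with a binary search. A hedged sketch, assuming K as reconstructed on this slide and illustrative function names:

```python
import numpy as np

def kl_bernoulli(p_hat, p, eps=1e-12):
    """kl(p_hat || p) between two Bernoulli distributions (in nats)."""
    return (p_hat * np.log((p_hat + eps) / (p + eps))
            + (1 - p_hat) * np.log((1 - p_hat + eps) / (1 - p + eps)))

def kl_inverse_upper(p_hat, budget, tol=1e-9):
    """Largest p >= p_hat with kl(p_hat || p) <= budget (binary search)."""
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if kl_bernoulli(p_hat, mid) <= budget else (lo, mid)
    return lo

def loss_bound(emp_loss, mi, sizes_x, sizes_c, n_y, N, delta):
    """Upper bound on L(Q) from kl(L_hat(Q) || L(Q)) <= (sum_i |X_i| I(X_i;C_i) + K) / N.

    mi is the list of mutual informations I(X_i;C_i); K follows the slide's definition.
    """
    K = (sum(c * np.log(x) for c, x in zip(sizes_c, sizes_x))
         + np.prod(sizes_c) * np.log(n_y)
         + 0.5 * np.log(4 * N) - np.log(delta))
    budget = (sum(x * i for x, i in zip(sizes_x, mi)) + K) / N
    return kl_inverse_upper(emp_loss, budget)
```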
Model: Q = {q(Y|C1,C2), q(C1|X1), q(C2|X2)},  q(Y|X1,X2) = Σ_{C1,C2} q(Y|C1,C2) q(C1|X1) q(C2|X2)
Generalization Bound
• With probability ≥ 1−δ:
  kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i;C_i) + K) / N
• High complexity: I(X_i;C_i) = ln|X_i|  (each x_i in its own cluster)
• Low complexity: I(X_i;C_i) = 0  (all x_i in a single cluster)
Model: Q = {q(Y|C1,C2), q(C1|X1), q(C2|X2)},  q(Y|X1,X2) = Σ_{C1,C2} q(Y|C1,C2) q(C1|X1) q(C2|X2)
Generalization Bound
• With probability ≥ 1−δ:
  kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i;C_i) + K) / N
• Optimization tradeoff: empirical loss vs. "effective" partition complexity (lower complexity means higher empirical loss, and vice versa)
Practice
• With probability ≥ 1−δ:
  kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i;C_i) + K) / N
• Replace with a trade-off:
  F(Q) = β N L̂(Q) + Σ_i |X_i| I(X_i;C_i)
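A sketch of this trade-off objective, assuming the mutual informations I(X_i;C_i) are computed from the soft assignments q(C_i|X_i) and the empirical marginals p(X_i); all names are illustrative:

```python
import numpy as np

def mutual_information(q_c_given_x, p_x, eps=1e-12):
    """I(X;C) in nats for a soft assignment q(C|X) and a marginal p(X).

    q_c_given_x : (n, k) array, rows sum to 1
    p_x         : (n,) array, marginal distribution of X
    """
    p_c = p_x @ q_c_given_x                        # cluster marginal q(C)
    log_ratio = np.log((q_c_given_x + eps) / (p_c + eps))
    return float(np.sum(p_x[:, None] * q_c_given_x * log_ratio))

def trade_off(emp_loss, beta, N, q_list, p_list, sizes_x):
    """F(Q) = beta * N * L_hat(Q) + sum_i |X_i| I(X_i; C_i)."""
    reg = sum(x * mutual_information(q, p)
              for q, p, x in zip(q_list, p_list, sizes_x))
    return beta * N * emp_loss + reg
```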
Application
• MovieLens dataset
  – 100,000 ratings on a 5-star scale
  – 80,000 train ratings, 20,000 test ratings
  – 943 viewers × 1682 movies
  – State-of-the-art Mean Absolute Error (0.72)
  – The optimal performance is achieved even with a 300×300 cluster space
13×6 Clusters
  F(Q) = β N L̂(Q) + Σ_i |X_i| I(X_i;C_i)
  [Figure: bound and test MAE (Mean Absolute Error) as functions of β]

50×50 Clusters
  F(Q) = β N L̂(Q) + Σ_i |X_i| I(X_i;C_i)
  [Figure: bound and test MAE as functions of β]

283×283 Clusters
  F(Q) = β N L̂(Q) + Σ_i |X_i| I(X_i;C_i)
  [Figure: bound and test MAE as functions of β]
Weighted Graph Clustering
• The weights of the edges w_ij are generated by an unknown distribution p(w_ij | x_i, x_j)
• Given a sample of N edge weights
• Build a model q(w | x1, x2) such that E_{p(x1,x2,w)} E_{q(w'|x1,x2)} l(w, w') is minimized
Other problems
• Pairwise clustering = clustering of a weighted graph
  – Edge weights = pairwise relations
• Clustering of an unweighted graph
  – Present edges = weight 1
  – Absent edges = weight 0
Weighted Graph Clustering
• The weights of the links are generated according to:
  q(w_ij | X_i, X_j) = Σ_{Ca,Cb} q(w_ij | Ca, Cb) q(Ca | X_i) q(Cb | X_j)
• This is co-clustering with a shared q(C | X)
  – The same bounds and (almost the same) algorithms apply
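A sketch of the shared-assignment edge model; the only change from the co-clustering rule above is that the same q(C|X) appears on both endpoints of the edge (names illustrative):

```python
import numpy as np

def edge_model(q_c, q_w, i, j):
    """q(w | X_i, X_j) = sum_{Ca,Cb} q(w | Ca, Cb) q(Ca | X_i) q(Cb | X_j).

    q_c : (n, k) array, shared soft assignment q(C | X)
    q_w : (k, k, nW) array, cluster-level edge model q(w | Ca, Cb)
    Returns the (nW,) distribution over weights of the edge (i, j).
    """
    return np.einsum('a,b,abw->w', q_c[i], q_c[j], q_w)
```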
Application
• Optimize the trade-off:
  F(Q) = β N L̂(Q) + |X| I(X;C)
• Kings dataset
  – Edge weights = exponentiated negative distance between DNS servers
  – |X| = 1740
  – Number of edges = 1,512,930
Graph Clustering Application
  [Figure: empirical loss L̂(Q) and its bound (left axis) together with the information I(X;C) in nats (right axis), plotted against the number of clusters |C|]
Relation with Matrix Factorization
• Co-clustering:
  – g(X1, X2) = Σ_{C1,C2} q(C1 | X1) g(C1, C2) q(C2 | X2)
  – M ≈ Q1^T G Q2
• Graph clustering:
  – g(X1, X2) = Σ_{C1,C2} q(C1 | X1) g(C1, C2) q(C2 | X2)
  – M ≈ Q^T G Q (the two sides share the same Q)
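An illustrative numpy rendering of this factorization view. Note one assumed convention change: with the assignment matrices stored as objects × clusters (rather than the slide's clusters × objects), M ≈ Q1^T G Q2 becomes M = Q1 G Q2^T; graph clustering then corresponds to Q2 = Q1 with a square G. All sizes and names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 6 row objects, 5 column objects, 3 x 2 clusters.
Q1 = rng.dirichlet(np.ones(3), size=6)   # q(C1|X1): each row sums to 1
Q2 = rng.dirichlet(np.ones(2), size=5)   # q(C2|X2)
G = rng.random((3, 2))                   # g(C1, C2): cluster-level matrix

# Co-clustering reconstruction of the data matrix:
# M[a, b] = sum_{c, d} Q1[a, c] * G[c, d] * Q2[b, d]
M = Q1 @ G @ Q2.T
```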
Summary of main contributions
• Formulation of co-clustering and graph clustering (unsupervised learning) as prediction problems
• PAC-Bayesian analysis of co-clustering and graph clustering
  – Regularization terms
• Encouraging empirical results
Future Directions
• Practice:
  – More applications
• Theory:
  – Continuous domains
  – Multidimensional matrices

References
• Co-clustering: Seldin & Tishby, JMLR 2010 (submitted, available online)
• Graph clustering: Seldin, Social Analytics 2010