Clustering and Ranking in Heterogeneous Information Networks via Gamma-Poisson Model
Junxiang Chen, Wei Dai, Yizhou Sun, Jennifer Dy
Northeastern University
May 1, 2015

Information Network
◮ Information networks are often used to represent objects and their interactions.
  ◮ Objects are represented by vertices.
  ◮ Relationships are represented by edges.
◮ Homogeneous information networks have been well studied.
  ◮ They assume there is only one type of vertex and one type of edge.
  ◮ A friendship network is an example.
[Figure: a friendship network among Sophia, William, Emma, Jacob, and Mason]

Heterogeneous Network
◮ In the real world, objects of multiple types are usually related to each other.
◮ Such data can be represented by a heterogeneous information network.
  ◮ It involves vertices of multiple types and edges of multiple types.
◮ For example, DBLP is a computer science bibliographic database.
[Figure: DBLP as a heterogeneous network with vertex types Author (Sophia, William, Mason), Word (“database”, “data”, “mining”), and Venue (VLDB, SDM), and edge types co-author, publish, use, and appear]

Related Work
◮ Clustering and ranking are prominent techniques to analyze information networks.
◮ They are usually regarded as orthogonal techniques.

Related Work: Clustering Methods
◮ Homogeneous networks:
  ◮ Spectral clustering [Shi and Malik, 2000]
  ◮ Affinity propagation [Frey and Dueck, 2007]
  ◮ Stochastic blockmodel [Snijders and Nowicki, 1997]
◮ Heterogeneous networks:
  ◮ Multi-type spectral clustering [Long et al., 2006]

Related Work: Ranking Methods
◮ Homogeneous networks:
  ◮ HITS [Kleinberg, 1999]
  ◮ PageRank [Page et al., 1998]
◮ Heterogeneous networks:
  ◮ PopRank [Nie et al., 2005]

Related Work (cont.)
◮ Combining clustering and ranking usually achieves better results.
◮ Sun et al. [2009a] propose the RankClus model for bi-typed networks.
◮ Sun et al. [2009b] introduce the NetClus model for networks with a star-network schema.
[Figure: taxonomy of clustering and ranking methods for homogeneous and heterogeneous networks; RankClus and NetClus both cluster and rank, on networks with a specified schema]

Contributions
◮ We develop a Gamma-Poisson generative model called GPNRankClus (Gamma-Poisson Network Model for Ranking and Clustering).
[Figure: taxonomy of clustering and ranking methods for homogeneous and heterogeneous networks, with GPNRankClus placed among the methods that both cluster and rank in heterogeneous networks]

Ranking Scores
◮ We want to achieve ranking and clustering simultaneously.
◮ We assign each vertex $v_n^{(T_m)}$ a ranking score $r_{nk}^{(T_m)}$ for each cluster $k$ that represents the importance of the vertex in this cluster, such that
  $v_n^{(T_m)} \in C_k \iff k = \arg\max_l r_{nl}^{(T_m)}$  (1)
  $\mathrm{rank}_k(v_i^{(T_m)}) < \mathrm{rank}_k(v_j^{(T_m)}) \iff r_{ik}^{(T_m)} > r_{jk}^{(T_m)}$  (2)
[Figure: matrix of ranking scores for the N objects of type $T_m$ and the K clusters, from which the ranking results and the clustering results are read]

Ranking Scores (cont.)
◮ Since $r_{nk}^{(T_m)}$ is a positive real number, we model it with a gamma distribution: $r_{nk}^{(T_m)} \sim \mathrm{Gamma}(\alpha_r, \beta_r)$.
[Figure: graphical model so far, with hyperparameters $\theta_r$ and the $N \times K$ ranking scores $r_{ik}^{(T_a)}$ and $r_{jk}^{(T_b)}$]
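The two conditions on ranking scores can be illustrated with a small sketch. It is not the authors' code: the score values are made up, the helper names (`assign_clusters`, `rank_within_cluster`, `sample_scores`) are hypothetical, and reading $\mathrm{Gamma}(\alpha_r, \beta_r)$ as shape and rate is an assumption, since the slides do not state the parameterization.

```python
import random


def assign_clusters(scores):
    """Condition (1): each object joins the cluster where its score is largest."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in scores]


def rank_within_cluster(scores, k):
    """Condition (2): a higher score in cluster k means a smaller rank number."""
    order = sorted(range(len(scores)), key=lambda n: scores[n][k], reverse=True)
    ranks = [0] * len(scores)
    for position, n in enumerate(order, start=1):
        ranks[n] = position
    return ranks


def sample_scores(num_objects, num_clusters, shape, rate, rng):
    """Draw positive scores from a gamma prior (shape/rate reading is assumed).

    random.gammavariate takes a scale parameter, so the rate is inverted.
    """
    return [[rng.gammavariate(shape, 1.0 / rate) for _ in range(num_clusters)]
            for _ in range(num_objects)]


# Hypothetical scores for N = 4 objects of one type and K = 2 clusters.
scores = [
    [2.5, 0.1],
    [0.3, 1.7],
    [1.2, 0.4],
    [0.2, 0.9],
]
clusters = assign_clusters(scores)
ranks_in_cluster_0 = rank_within_cluster(scores, 0)

# Scores drawn from the prior are always positive, as the model requires.
prior_scores = sample_scores(5, 2, shape=1.0, rate=2.0, rng=random.Random(0))
```

A single N × K matrix therefore carries both results: comparing along a row gives the cluster of an object, and comparing down a column gives the ranking inside a cluster.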

Intensity of Edge Type
◮ In heterogeneous networks, different edge types have different intensities.
  ◮ Some edge types tend to generate more connections.
◮ We model the intensity of each edge type with a positive real number: $\lambda^{(T_a,T_b)} \sim \mathrm{Gamma}(\alpha_\lambda, \beta_\lambda)$.
[Figure: graphical model extended with hyperparameters $\theta_\lambda$ and the edge-type intensities $\lambda^{(T_a,T_b)}$]

Number of Edges
◮ There can be multiple edges between two vertices.
◮ Connections between vertices are treated as counts of repeated events:
  $W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\big(\lambda^{(T_a,T_b)} (r_i^{(T_a)} \cdot r_j^{(T_b)})\big)$
  where $W_{ij}^{(T_a,T_b)}$ is the number of edges, $\lambda^{(T_a,T_b)}$ is the intensity of the edge type, and $r_i^{(T_a)} \cdot r_j^{(T_b)}$ is the dot product of the ranking scores.
[Figure: graphical model with hyperparameters $\theta_r$ and $\theta_\lambda$, the $N \times K$ ranking scores, the edge-type intensities, and the edge counts $W_{ij}^{(T_a,T_b)}$]

Why Dot Product?
  $W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\big(\lambda^{(T_a,T_b)} (r_i^{(T_a)} \cdot r_j^{(T_b)})\big)$
  (number of edges; intensity of edge type; dot product of ranking scores)
◮ The dot product can be expressed as
  $r_i^{(T_a)} \cdot r_j^{(T_b)} = \cos\theta \times \|r_i^{(T_a)}\| \times \|r_j^{(T_b)}\|$
  where $\theta$ is the angle between the two ranking-score vectors.
◮ To obtain a large $W_{ij}^{(T_a,T_b)}$, that is, a large Poisson rate, we need
  ◮ a large $\lambda^{(T_a,T_b)}$
  ◮ a large $\cos\theta$
  ◮ large $\|r_i^{(T_a)}\|$ and $\|r_j^{(T_b)}\|$
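A small numeric check may make the decomposition concrete. The vectors and the intensity value below are made up for illustration, and the helper names are hypothetical; the sketch only verifies the identity and compares the Poisson rate for a pair of vertices that are strong in the same cluster against a pair that are strong in different clusters.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))


def poisson_rate(intensity, r_i, r_j):
    """Rate of the edge-count distribution: intensity times dot product."""
    return intensity * dot(r_i, r_j)


# Hypothetical ranking-score vectors over K = 2 clusters.
r_i = [3.0, 0.1]   # important in cluster 0
r_j = [2.0, 0.2]   # also important in cluster 0
r_l = [0.1, 2.0]   # important in cluster 1

intensity = 1.0    # assumed value of the edge-type intensity
same_cluster_rate = poisson_rate(intensity, r_i, r_j)
cross_cluster_rate = poisson_rate(intensity, r_i, r_l)
```

Here `r_j` and `r_l` have almost the same norm, so the gap between the two rates comes mainly from the cosine term: vectors pointing toward the same cluster are nearly aligned, while vectors pointing toward different clusters are nearly orthogonal.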

Summary of the Model
◮ For each vertex $n$ and each cluster $k$,
  draw $r_{nk}^{(T_m)} \sim \mathrm{Gamma}(\alpha_r, \beta_r)$.
◮ For each non-zero edge type $(T_a, T_b)$,
  draw $\lambda^{(T_a,T_b)} \sim \mathrm{Gamma}(\alpha_\lambda, \beta_\lambda)$.
◮ For each pair of different vertices $(v_i^{(T_a)}, v_j^{(T_b)})$,
  draw $W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\big(\lambda^{(T_a,T_b)} (r_i^{(T_a)} \cdot r_j^{(T_b)})\big)$.
[Figure: graphical model with hyperparameters $\theta_r$ and $\theta_\lambda$, the $N \times K$ ranking scores, the edge-type intensities, and the edge counts $W_{ij}^{(T_a,T_b)}$]
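The three sampling steps can be sketched as a forward simulation. This is an illustration under stated assumptions rather than the authors' implementation: the gamma parameters are read as shape and rate, same-type pairs are sampled in both orders (i, j) and (j, i), the hyperparameter values and the type names are made up, and `sample_poisson` and `sample_network` are hypothetical helpers.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_network(type_sizes, edge_types, num_clusters,
                   alpha_r=1.0, beta_r=2.0, alpha_lam=2.0, beta_lam=2.0, seed=0):
    rng = random.Random(seed)

    def gamma(shape, rate):
        # random.gammavariate expects a scale, so the assumed rate is inverted.
        return rng.gammavariate(shape, 1.0 / rate)

    # Step 1: one positive ranking score per vertex and cluster.
    scores = {t: [[gamma(alpha_r, beta_r) for _ in range(num_clusters)]
                  for _ in range(n)]
              for t, n in type_sizes.items()}

    # Step 2: one intensity per edge type.
    intensity = {edge: gamma(alpha_lam, beta_lam) for edge in edge_types}

    # Step 3: a Poisson edge count for every pair of different vertices.
    counts = {}
    for (ta, tb) in edge_types:
        for i, r_i in enumerate(scores[ta]):
            for j, r_j in enumerate(scores[tb]):
                if ta == tb and i == j:
                    continue  # skip a vertex paired with itself
                rate = intensity[(ta, tb)] * sum(a * b for a, b in zip(r_i, r_j))
                counts[(ta, tb, i, j)] = sample_poisson(rate, rng)
    return scores, intensity, counts


# Hypothetical toy schema: 3 authors, 2 venues, K = 2 clusters.
type_sizes = {"author": 3, "venue": 2}
edge_types = [("author", "author"), ("author", "venue")]
scores, intensity, counts = sample_network(type_sizes, edge_types, num_clusters=2)
```

Inference runs in the opposite direction: the edge counts are observed, and the ranking scores and intensities are the quantities to be estimated.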

Inference
◮ It is computationally intractable to evaluate the posterior distributions directly.
◮ We use mean-field variational inference to approximate these distributions.
◮ Ranking and clustering results are given by comparing the expected values of the ranking scores:
  $v_n^{(T_m)} \in C_k$, where $k = \arg\max_l \mathbb{E}[r_{nl}^{(T_m)}]$.
  $\mathrm{rank}_k(v_n^{(T_m)}) = \mathrm{argsort}_i(\mathbb{E}[r_{ik}^{(T_m)}])$.
◮ We introduce seeds.
  ◮ Existing models use seeds to guide the clustering process.
  ◮ We select one representative object for each cluster.
  ◮ We assign a special prior distribution to these seeds.
[Figure: matrix of ranking scores for the N objects of type $T_m$ and the K clusters, from which the ranking results and the clustering results are read]

Synthetic Data
◮ We generate synthetic data with
  ◮ 400 data points
  ◮ 4 different types
  ◮ 2 clusters
◮ We add noise at different levels.
[Figure: three panels titled Low noise level, Medium noise level, and High noise level]

Real Data
◮ We test the performance of the model on two real heterogeneous network datasets:
  ◮ DBLP dataset
  ◮ YELP dataset
◮ We compare GPNRankClus with state-of-the-art algorithms:
  ◮ NetClus, a clustering and ranking method for heterogeneous networks that follow a star-network schema.
  ◮ GNetMine, a transductive classification method for heterogeneous networks.
  ◮ RankClass, a ranking-based classification method for heterogeneous networks.

DBLP Dataset
◮ The dataset includes conferences from Database (DB), Data Mining (DM), Machine Learning (ML), and Information Retrieval (IR).
[Figure: DBLP network schema with vertex types Author, Word, and Venue and edge types co-author, publish, use, and appear]
Classification Accuracy on Authors
            GPNRankClus   NetClus    GNetMine   RankClass
  Accuracy  92.28%        76.11%‡    80.67%     91.12%
Classification Accuracy on Conferences
            GPNRankClus   NetClus    GNetMine   RankClass
  Accuracy  100%          85%‡       100%       100%
‡ We test NetClus on the star-schema version of the DBLP dataset.
Top-5 Words in Each Cluster
     DB          DM               ML          IR
  1  data        data             learning    web
  2  database    mining           knowledge   retrieval
  3  databases   learning         system      information
  4  query       clustering       reasoning   search
  5  system      classification   model       text
Top-5 Conferences in Each Cluster
     DB        DM       ML      IR
  1  VLDB      KDD      IJCAI   SIGIR
  2  ICDE      PAKDD    AAAI    WWW
  3  SIGMOD    ICDM     ICML    CIKM
  4  PODS      PKDD     CVPR    ECIR
  5  EDBT      SDM      ECML    AAAI

YELP Dataset
◮ We examine a subset of the YELP dataset for 3 different clustering tasks:
  ◮ 4 Level-1 categories
  ◮ 6 Restaurant categories
  ◮ 6 Shopping categories
[Figure: YELP network schema with vertex types User, Business, Review, and Word and edge types given by, given to, and contains]
Classification Accuracy on Businesses
              GPNRankClus   NetClus   GNetMine   RankClass
  Level 1     56.25%        17.78%    47.16%     37.19%
  Restaurant  66.81%        15.31%    49.36%     57.11%
  Shopping    64.62%        13.28%    64.45%     32.58%
Normalized Mutual Information (NMI) on Businesses
              GPNRankClus   NetClus   GNetMine   RankClass
  Level 1     0.5590        0.0168    0.1387     0.1579
  Restaurant  0.6606        0.0187    0.2346     0.3044
  Shopping    0.4721        0.0313    0.3617     0.2335
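For readers unfamiliar with NMI, the following sketch shows one common way to compute it from two label lists. The slides do not say which normalization was used; dividing by the geometric mean of the two entropies is an assumption here, as is the convention chosen for the degenerate case where a labeling has a single cluster. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same objects."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum((c / n) * math.log((c * n) / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())


def nmi(a, b):
    """NMI normalized by sqrt(H(a) * H(b)); this normalization is assumed."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)


# NMI ignores how clusters are named, so a relabeled copy scores 1.
identical_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
# Two labelings that carry no information about each other score 0.
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because NMI is invariant to the naming of clusters, it can be computed directly from predicted cluster indices and ground-truth categories without first matching clusters to categories.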

Conclusions
◮ We introduce a new concept of ranking score that conveys both ranking and clustering information.
◮ Based on this concept, we propose a generative model called GPNRankClus.
  ◮ We model the ranking score of each vertex in each cluster with a gamma distribution.
  ◮ We model the number of edges with a Poisson distribution.
◮ We test our model on DBLP and YELP data.
  ◮ GPNRankClus matches or outperforms the state-of-the-art baselines (NetClus, GNetMine, RankClass) in the reported accuracy and NMI.
