Clustering and Ranking in Heterogeneous Information Networks via Gamma-Poisson Model
Junxiang Chen, Wei Dai, Yizhou Sun, Jennifer Dy
Northeastern University
May 1, 2015
Information Network
◮ Information networks are often used to represent objects and their interactions.
◮ Objects are represented by vertices.
◮ Relationships are represented by edges.
◮ Homogeneous information networks have been well studied.
◮ They assume only one type of vertex and one type of edge.
◮ A friendship network is an example.
[Figure: a friendship network among Sophia, William, Emma, Jacob, and Mason.]
Heterogeneous Network
◮ In the real world, objects of multiple types are usually related to each other.
◮ Such data can be represented by a heterogeneous information network.
◮ It involves vertices of multiple types and edges of multiple types.
◮ For example, DBLP is a computer science bibliographic database.
[Figure: an example DBLP network with authors (Sophia, William, Mason), words ("database", "data", "mining"), and venues (VLDB, SDM). Vertex types: Author, Word, Venue. Edge types: co-author, publish, use, appear.]
Related Work ◮ Clustering and ranking are prominent techniques to analyze information networks. ◮ They are usually regarded as orthogonal techniques.
Clustering methods
◮ Homogeneous networks: spectral clustering [Shi and Malik, 2000], affinity propagation [Frey and Dueck, 2007], stochastic blockmodel [Snijders and Nowicki, 1997]
◮ Heterogeneous networks: multi-type spectral clustering [Long et al., 2006]
Ranking methods
◮ Homogeneous networks: HITS [Kleinberg, 1999], PageRank [Page et al., 1998]
◮ Heterogeneous networks: PopRank [Nie et al., 2005]
Related Work (cont.)
◮ Combining clustering and ranking usually achieves better results than applying them separately.
◮ Sun et al. [2009a] propose the RankClus model for bi-typed networks.
◮ Sun et al. [2009b] introduce the NetClus model for the star-network schema.
[Figure: diagram of clustering and ranking methods for homogeneous and heterogeneous networks. RankClus and NetClus sit in the overlap of clustering and ranking methods, within the region of networks with a specified schema.]
Contributions
◮ We develop a Gamma-Poisson generative model, called GPNRankClus (Gamma-Poisson Network Model for Ranking and Clustering).
[Figure: the same diagram of clustering and ranking methods, with GPNRankClus added in the overlap of clustering and ranking methods for heterogeneous networks.]
Ranking Scores
◮ We want to simultaneously achieve ranking and clustering.
◮ We assign each vertex $v_n^{(T_m)}$ a ranking score $r_{nk}^{(T_m)}$ for each cluster $k$ that represents the importance of the vertex in this cluster, such that
$$v_n^{(T_m)} \in C_k \iff k = \arg\max_l \, r_{nl}^{(T_m)} \tag{1}$$
$$\mathrm{rank}_k\big(v_i^{(T_m)}\big) < \mathrm{rank}_k\big(v_j^{(T_m)}\big) \iff r_{ik}^{(T_m)} > r_{jk}^{(T_m)} \tag{2}$$
[Figure: ranking scores of the N objects of type $T_m$ over the K clusters, from which the ranking results and the clustering results are read.]
Ranking Scores (cont.)
◮ Since $r_{nk}^{(T_m)}$ is a positive real number, we model it as
$$r_{nk}^{(T_m)} \sim \mathrm{Gamma}(\alpha_r, \beta_r).$$
[Figure: plate diagram with node $\theta_r$ and the ranking scores $r_{ik}^{(T_a)}$ and $r_{jk}^{(T_b)}$, each in a plate of size N × K.]
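Rules (1) and (2) can be illustrated with a small NumPy sketch. The score matrix below is made up for illustration, and the names `scores`, `cluster_of`, and `ranking_in_cluster` are hypothetical, not taken from the slides.

```python
import numpy as np

# Hypothetical ranking scores for N = 4 vertices of one type and K = 2 clusters.
# Entry scores[n, k] plays the role of r_nk; the numbers are invented.
scores = np.array([
    [0.9, 0.1],
    [0.2, 1.5],
    [2.0, 0.3],
    [0.1, 0.4],
])

# Rule (1): each vertex belongs to the cluster in which its score is largest.
cluster_of = scores.argmax(axis=1)


def ranking_in_cluster(scores, k):
    """Rule (2): vertex indices ordered from highest to lowest score in cluster k."""
    return np.argsort(-scores[:, k])


print(cluster_of)                     # cluster index per vertex
print(ranking_in_cluster(scores, 0))  # vertices ordered by importance in cluster 0
```

A single matrix of positive scores therefore carries both results: comparing a vertex's scores across clusters gives its cluster, and comparing vertices within one cluster gives the ranking.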
Intensity of Edge Type
◮ In heterogeneous networks, the intensity differs across edge types.
◮ Some edge types tend to generate more connections.
◮ We model the intensity of each edge type using a positive real number:
$$\lambda^{(T_a,T_b)} \sim \mathrm{Gamma}(\alpha_\lambda, \beta_\lambda)$$
[Figure: the plate diagram extended with node $\theta_\lambda$ and the edge-type intensity $\lambda^{(T_a,T_b)}$ in a plate of size $M^2$.]
Number of Edges
◮ Multiple edges may exist between two vertices.
◮ Connections between vertices are treated as counts of repeated events:
$$W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\Big(\lambda^{(T_a,T_b)} \big(r_i^{(T_a)} \cdot r_j^{(T_b)}\big)\Big)$$
where $W_{ij}^{(T_a,T_b)}$ is the number of edges, $\lambda^{(T_a,T_b)}$ is the intensity of the edge type, and $r_i^{(T_a)} \cdot r_j^{(T_b)}$ is the dot product of the ranking scores.
[Figure: the plate diagram extended with the number of edges $W_{ij}^{(T_a,T_b)}$ in a plate of size $N^2$.]
Why Dot Product?
$$W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\Big(\lambda^{(T_a,T_b)} \big(r_i^{(T_a)} \cdot r_j^{(T_b)}\big)\Big)$$
($W_{ij}^{(T_a,T_b)}$: number of edges; $\lambda^{(T_a,T_b)}$: intensity of the edge type; $r_i^{(T_a)} \cdot r_j^{(T_b)}$: dot product of the ranking scores.)
◮ The dot product can be expressed as
$$r_i^{(T_a)} \cdot r_j^{(T_b)} = \cos\theta \times \big\|r_i^{(T_a)}\big\| \times \big\|r_j^{(T_b)}\big\|$$
◮ In order to have a large $W_{ij}^{(T_a,T_b)}$, we need
  ◮ a large $\lambda^{(T_a,T_b)}$,
  ◮ a large $\cos\theta$,
  ◮ large $\|r_i^{(T_a)}\|$ and $\|r_j^{(T_b)}\|$.
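The decomposition above can be checked numerically. In this sketch the three score vectors are invented: `r_i` and `r_j` are both strong in the first cluster, while `r_l` is strong in the second, so the pair (i, j) should get a larger dot product than (i, l).

```python
import numpy as np

# Hypothetical ranking-score vectors over K = 2 clusters (values are made up).
r_i = np.array([2.0, 0.1])  # important in cluster 0
r_j = np.array([1.5, 0.2])  # also important in cluster 0
r_l = np.array([0.1, 1.8])  # important in cluster 1


def decompose(u, v):
    """Return (cos_theta, ||u||, ||v||) so that u . v = cos_theta * ||u|| * ||v||."""
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    cos_theta = (u @ v) / (norm_u * norm_v)
    return cos_theta, norm_u, norm_v


cos_ij, n_i, n_j = decompose(r_i, r_j)
cos_il, _, n_l = decompose(r_i, r_l)

print(r_i @ r_j, cos_ij * n_i * n_j)  # same value computed two ways
print(r_i @ r_l, cos_il * n_i * n_l)  # much smaller: the vectors point to different clusters
```

Vertices whose score vectors point toward the same cluster (large cosine) and are both important (large norms) get a larger Poisson rate, and so are expected to share more edges.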
Summary of the Model
◮ For each vertex $n$ and each cluster $k$,
  draw $r_{nk}^{(T_m)} \sim \mathrm{Gamma}(\alpha_r, \beta_r)$.
◮ For each non-zero edge type $(T_a, T_b)$,
  draw $\lambda^{(T_a,T_b)} \sim \mathrm{Gamma}(\alpha_\lambda, \beta_\lambda)$.
◮ For each pair of different vertices $\big(v_i^{(T_a)}, v_j^{(T_b)}\big)$,
  draw $W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\Big(\lambda^{(T_a,T_b)} \big(r_i^{(T_a)} \cdot r_j^{(T_b)}\big)\Big)$.
[Figure: the complete plate diagram with $\theta_r$, $\theta_\lambda$, the ranking scores (plates of size N × K), the edge-type intensity (plate of size $M^2$), and the number of edges (plate of size $N^2$).]
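The generative process summarized above can be sketched for a single edge type between two vertex types. This is a minimal illustration, not the authors' code: the function name `sample_network` and the default hyperparameter values are hypothetical, and it assumes that the second Gamma parameter is a rate (NumPy expects a scale, hence the `1.0 / beta`).

```python
import numpy as np


def sample_network(n_a, n_b, n_clusters,
                   alpha_r=1.0, beta_r=1.0,
                   alpha_lam=1.0, beta_lam=1.0, seed=0):
    """Sample ranking scores, one edge-type intensity, and edge counts.

    Assumption: beta_r and beta_lam are rate parameters, so scale = 1 / beta.
    """
    rng = np.random.default_rng(seed)

    # Ranking scores r_nk ~ Gamma(alpha_r, beta_r), one row per vertex.
    r_a = rng.gamma(alpha_r, 1.0 / beta_r, size=(n_a, n_clusters))
    r_b = rng.gamma(alpha_r, 1.0 / beta_r, size=(n_b, n_clusters))

    # Intensity of the edge type (T_a, T_b): lambda ~ Gamma(alpha_lam, beta_lam).
    lam = rng.gamma(alpha_lam, 1.0 / beta_lam)

    # Edge counts W_ij ~ Pois(lambda * (r_i . r_j)) for every vertex pair.
    rate = lam * (r_a @ r_b.T)
    w = rng.poisson(rate)
    return r_a, r_b, lam, w


r_a, r_b, lam, w = sample_network(n_a=5, n_b=7, n_clusters=3)
print(w)  # 5 x 7 matrix of non-negative edge counts
```

Extending the sketch to several vertex types would mean drawing one score matrix per type and one intensity per non-zero edge type.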
Inference
◮ It is computationally intractable to directly evaluate the posterior distributions.
◮ We use mean-field variational inference to approximate these distributions.
◮ Ranking and clustering results are given by comparing the expected values of the ranking scores:
$$v_n^{(T_m)} \in C_k, \quad \text{where } k = \arg\max_l \, \mathbb{E}\big[r_{nl}^{(T_m)}\big],$$
$$\mathrm{rank}_k\big(v_n^{(T_m)}\big) = \operatorname{argsort}_i \, \mathbb{E}\big[r_{ik}^{(T_m)}\big].$$
[Figure: expected ranking scores of the N objects of type $T_m$ over the K clusters, from which the ranking results and the clustering results are read.]
◮ We introduce seeds.
◮ Existing models use seeds to guide the clustering process.
◮ We select 1 representative object for each cluster.
◮ We assign a special prior distribution to these seeds.
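For reference, the joint distribution implied by the model summary, and the general shape of a mean-field approximation to its posterior, can be written as below. The fully factorized form of $q$ is an assumption made here for illustration; the slides do not state which factorization or which variational families are used, and the special prior for seeds is not shown.

```latex
\begin{align*}
p(W, r, \lambda)
  &= \prod_{m} \prod_{n,k}
       \mathrm{Gamma}\big(r^{(T_m)}_{nk} \mid \alpha_r, \beta_r\big)
     \prod_{(T_a,T_b)}
       \mathrm{Gamma}\big(\lambda^{(T_a,T_b)} \mid \alpha_\lambda, \beta_\lambda\big) \\
  &\quad \times \prod_{(T_a,T_b)} \prod_{i,j}
       \mathrm{Pois}\Big(W^{(T_a,T_b)}_{ij} \,\Big|\,
         \lambda^{(T_a,T_b)} \big(r^{(T_a)}_i \cdot r^{(T_b)}_j\big)\Big), \\
q(r, \lambda)
  &= \prod_{m} \prod_{n,k} q\big(r^{(T_m)}_{nk}\big)
     \prod_{(T_a,T_b)} q\big(\lambda^{(T_a,T_b)}\big)
  \;\approx\; p(r, \lambda \mid W).
\end{align*}
```

Under such an approximation, the expectations $\mathbb{E}[r^{(T_m)}_{nk}]$ used for clustering and ranking are taken with respect to $q$ rather than the exact posterior.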
Synthetic Data
◮ We generate synthetic data with
  ◮ 400 data points,
  ◮ 4 different types,
  ◮ 2 clusters.
◮ We add noise of different levels.
[Figure: three panels showing the synthetic data at low, medium, and high noise levels.]
Real Data
◮ We test the performance of the model on two real heterogeneous network datasets:
  ◮ DBLP dataset
  ◮ YELP dataset
◮ We compare GPNRankClus with state-of-the-art algorithms:
  ◮ NetClus, a clustering and ranking method for heterogeneous networks that follow a star-network schema.
  ◮ GNetMine, a transductive classification method for heterogeneous networks.
  ◮ RankClass, a ranking-based classification method for heterogeneous networks.
DBLP Dataset
◮ The dataset includes conferences from Database (DB), Data Mining (DM), Machine Learning (ML), and Information Retrieval (IR).
[Figure: DBLP network schema with vertex types Author, Word, and Venue and edge types co-author, publish, use, and appear.]

Classification Accuracy on Authors
            GPNRankClus   NetClus    GNetMine   RankClass
Accuracy    92.28%        76.11%‡    80.67%     91.12%

Classification Accuracy on Conferences
            GPNRankClus   NetClus    GNetMine   RankClass
Accuracy    100%          85%‡       100%       100%

‡ We test NetClus on the star-schema version of the DBLP dataset.

Top-5 Words in Each Cluster
     DB          DM               ML          IR
1    data        data             learning    web
2    database    mining           knowledge   retrieval
3    databases   learning         system      information
4    query       clustering       reasoning   search
5    system      classification   model       text

Top-5 Conferences in Each Cluster
     DB        DM       ML      IR
1    VLDB      KDD      IJCAI   SIGIR
2    ICDE      PAKDD    AAAI    WWW
3    SIGMOD    ICDM     ICML    CIKM
4    PODS      PKDD     CVPR    ECIR
5    EDBT      SDM      ECML    AAAI
YELP Dataset
◮ We examine a subset of the YELP dataset for 3 different clustering tasks:
  ◮ 4 Level-1 categories
  ◮ 6 Restaurant categories
  ◮ 6 Shopping categories
[Figure: YELP network schema with vertex types User, Review, Business, and Word and edge types given by, given to, and contains.]

Classification accuracy on businesses
              GPNRankClus   NetClus   GNetMine   RankClass
Level 1       56.25%        17.78%    47.16%     37.19%
Restaurant    66.81%        15.31%    49.36%     57.11%
Shopping      64.62%        13.28%    64.45%     32.58%

Normalized Mutual Information (NMI) on businesses
              GPNRankClus   NetClus   GNetMine   RankClass
Level 1       0.5590        0.0168    0.1387     0.1579
Restaurant    0.6606        0.0187    0.2346     0.3044
Shopping      0.4721        0.0313    0.3617     0.2335
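The two metrics reported in the tables above can be computed as in the following sketch. It rests on two assumptions the slides do not confirm: accuracy is computed after cluster indices have already been aligned with class labels, and NMI is normalized by the geometric mean of the two entropies (other normalizations exist). The function names are hypothetical.

```python
import numpy as np


def accuracy(true_labels, pred_labels):
    """Fraction of objects whose predicted label equals the true label."""
    true_labels = np.asarray(true_labels).ravel()
    pred_labels = np.asarray(pred_labels).ravel()
    return float(np.mean(true_labels == pred_labels))


def nmi(true_labels, pred_labels):
    """Normalized mutual information, normalized by sqrt(H(true) * H(pred))."""
    true_labels = np.asarray(true_labels).ravel()
    pred_labels = np.asarray(pred_labels).ravel()
    n = true_labels.size

    # Map arbitrary labels to 0..C-1 and build the contingency table.
    _, t_idx = np.unique(true_labels, return_inverse=True)
    _, p_idx = np.unique(pred_labels, return_inverse=True)
    contingency = np.zeros((t_idx.max() + 1, p_idx.max() + 1))
    np.add.at(contingency, (t_idx, p_idx), 1)

    joint = contingency / n
    p_t = joint.sum(axis=1)  # marginal over true classes
    p_p = joint.sum(axis=0)  # marginal over predicted clusters

    nonzero = joint > 0
    mi = np.sum(joint[nonzero] * np.log(joint[nonzero] / np.outer(p_t, p_p)[nonzero]))
    h_t = -np.sum(p_t * np.log(p_t))
    h_p = -np.sum(p_p * np.log(p_p))

    denom = np.sqrt(h_t * h_p)
    # Convention chosen here: return 0 when either labeling has a single group.
    return float(mi / denom) if denom > 0 else 0.0


print(accuracy([0, 1, 1], [0, 1, 0]))
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions up to relabeling
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # independent partitions
```

Unlike accuracy, NMI does not depend on how clusters are named, which is why the two tables can rank the methods differently.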
Conclusions
◮ We introduce a new concept of ranking score that conveys both ranking and clustering information.
◮ Based on this concept, we propose a generative model, called GPNRankClus.
  ◮ We model the ranking score of each vertex in each cluster with a gamma distribution.
  ◮ We model the number of edges with a Poisson distribution.
◮ We test our model on DBLP and YELP data.
◮ GPNRankClus matches or outperforms the compared baselines (NetClus, GNetMine, RankClass) in classification accuracy on DBLP, and outperforms them in both accuracy and NMI on all three YELP tasks.