
Clustering and Ranking in Heterogeneous Information Networks via Gamma-Poisson Model


  1. Clustering and Ranking in Heterogeneous Information Networks via Gamma-Poisson Model
     Junxiang Chen, Wei Dai, Yizhou Sun, Jennifer Dy
     Northeastern University
     May 1, 2015

  2. Information Network
     ◮ Information networks are often used to represent objects and their interactions.
     ◮ Objects are represented by vertices.
     ◮ Relationships are represented by edges.
     ◮ Homogeneous information networks have been well studied.
     ◮ They are assumed to contain only one type of vertex and one type of edge.
     ◮ A friendship network is an example.
     [Figure: a friendship network among Sophia, William, Emma, Jacob, and Mason]

  3. Heterogeneous Network
     ◮ In the real world, objects of multiple types are usually related to each other.
     ◮ Such data can be represented by a heterogeneous information network.
     ◮ It involves vertices of multiple types and edges of multiple types.
     ◮ For example, DBLP is a computer science bibliographic database.
     [Figure: the DBLP network, with vertex types Author (Sophia, William, Mason), Word ("database", "data", "mining"), and Venue (VLDB, SDM), and edge types co-author, publish, use, and appear]

  4. Related Work ◮ Clustering and ranking are prominent techniques to analyze information networks. ◮ They are usually regarded as orthogonal techniques.

  5. Related Work
     ◮ Clustering and ranking are prominent techniques to analyze information networks.
     ◮ They are usually regarded as orthogonal techniques.
     [Figure: clustering methods by network type]
     Clustering methods, homogeneous networks: spectral clustering [Shi and Malik, 2000], affinity propagation [Frey and Dueck, 2007], stochastic blockmodel [Snijders and Nowicki, 1997]
     Clustering methods, heterogeneous networks: multi-type spectral clustering [Long et al., 2006]

  6. Related Work
     ◮ Clustering and ranking are prominent techniques to analyze information networks.
     ◮ They are usually regarded as orthogonal techniques.
     [Figure: clustering and ranking methods by network type]
     Clustering methods, homogeneous networks: spectral clustering, affinity propagation, stochastic blockmodel
     Clustering methods, heterogeneous networks: multi-type spectral clustering
     Ranking methods, homogeneous networks: HITS [Kleinberg, 1999], PageRank [Page et al., 1998]
     Ranking methods, heterogeneous networks: PopRank [Nie et al., 2005]

  7. Related Work (cont.)
     ◮ Combining clustering and ranking usually achieves better results.
     ◮ Sun et al. [2009a] propose the RankClus model for bi-typed networks.
     ◮ Sun et al. [2009b] introduce the NetClus model for star-network schemas.
     [Figure: clustering and ranking methods by network type]
     Clustering methods, homogeneous networks: spectral clustering, affinity propagation, stochastic blockmodel
     Clustering methods, heterogeneous networks: multi-type spectral clustering
     Ranking methods, homogeneous networks: HITS, PageRank
     Ranking methods, heterogeneous networks: PopRank
     Clustering and ranking methods, networks with a specified schema: RankClus, NetClus

  8. Contributions
     ◮ We develop a Gamma-Poisson generative model called GPNRankClus (Gamma-Poisson Network Model for Ranking and Clustering).
     [Figure: clustering and ranking methods by network type]
     Clustering methods, homogeneous networks: spectral clustering, affinity propagation, stochastic blockmodel
     Clustering methods, heterogeneous networks: multi-type spectral clustering
     Ranking methods, homogeneous networks: HITS, PageRank
     Ranking methods, heterogeneous networks: PopRank
     Clustering and ranking methods, networks with a specified schema: RankClus, NetClus
     Clustering and ranking methods, heterogeneous networks: GPNRankClus

  9. Ranking Scores
     ◮ We want to simultaneously achieve ranking and clustering.
     ◮ We assign each vertex $v_n^{(T_m)}$ a ranking score $r_{nk}^{(T_m)}$ for each cluster that represents the importance of the vertex in this cluster, s.t.
        $v_n^{(T_m)} \in C_k \Leftrightarrow k = \arg\max_l \left( r_{nl}^{(T_m)} \right)$   (1)
        $\mathrm{rank}_k(v_i^{(T_m)}) < \mathrm{rank}_k(v_j^{(T_m)}) \Leftrightarrow r_{ik}^{(T_m)} > r_{jk}^{(T_m)}$   (2)
     [Figure: the ranking scores for N objects of type $T_m$ and K clusters, annotated with ranking results and clustering results]

  10. Ranking Scores
     ◮ We want to simultaneously achieve ranking and clustering.
     ◮ We assign each vertex $v_n^{(T_m)}$ a ranking score $r_{nk}^{(T_m)}$ for each cluster that represents the importance of the vertex in this cluster, s.t.
        $v_n^{(T_m)} \in C_k \Leftrightarrow k = \arg\max_l \left( r_{nl}^{(T_m)} \right)$
        $\mathrm{rank}_k(v_i^{(T_m)}) < \mathrm{rank}_k(v_j^{(T_m)}) \Leftrightarrow r_{ik}^{(T_m)} > r_{jk}^{(T_m)}$
     ◮ Since $r_{nk}^{(T_m)}$ is a positive real number,
        $r_{nk}^{(T_m)} \sim \mathrm{Gamma}(\alpha_r, \beta_r)$.
     [Figure: graphical model with hyperparameters $\theta_r$ and the ranking scores $r_{ik}^{(T_a)}$ and $r_{jk}^{(T_b)}$, each in an N × K plate]
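
The two rules on the Ranking Scores slides (assign a vertex to the cluster where its score is largest, and rank vertices within a cluster by decreasing score) can be illustrated with a minimal sketch. The score values and the helper names `assign_clusters` and `rank_within_cluster` are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
# Hypothetical ranking scores for N = 4 vertices of one type and K = 2 clusters.
# scores[n][k] plays the role of r_nk; the numbers are made up.
scores = [
    [0.9, 0.1],
    [0.2, 0.7],
    [0.5, 0.3],
    [0.1, 1.2],
]


def assign_clusters(scores):
    """Argmax rule: vertex n belongs to the cluster where its score is largest."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in scores]


def rank_within_cluster(scores, k):
    """Ordering rule: vertex indices sorted by decreasing score in cluster k."""
    return sorted(range(len(scores)), key=lambda n: scores[n][k], reverse=True)


clusters = assign_clusters(scores)
ranking_0 = rank_within_cluster(scores, 0)
ranking_1 = rank_within_cluster(scores, 1)
print(clusters)    # [0, 1, 0, 1]
print(ranking_0)   # [0, 2, 1, 3]
print(ranking_1)   # [3, 1, 2, 0]
```

A single score matrix therefore yields both outputs: a cluster label per vertex and a ranked list per cluster.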

  11. Intensity of Edge Type
     ◮ In heterogeneous networks, the intensity differs across edge types.
     ◮ Some edge types tend to generate more connections.
     ◮ We model the intensity of each edge type using a positive real number:
        $\lambda^{(T_a,T_b)} \sim \mathrm{Gamma}(\alpha_\lambda, \beta_\lambda)$
     [Figure: graphical model extended with the edge-type intensity $\lambda^{(T_a,T_b)}$ and its hyperparameters $\theta_\lambda$, alongside the ranking scores $r_{ik}^{(T_a)}$ and $r_{jk}^{(T_b)}$]

  12. Number of Edges
     ◮ There may be multiple edges between two vertices.
     ◮ Connections between vertices are treated as counts of repeated events:
        $W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\left( \lambda^{(T_a,T_b)} \left( r_i^{(T_a)} \cdot r_j^{(T_b)} \right) \right)$
        where $W_{ij}^{(T_a,T_b)}$ is the number of edges, $\lambda^{(T_a,T_b)}$ is the intensity of the edge type, and $r_i^{(T_a)} \cdot r_j^{(T_b)}$ is the dot product of the ranking scores.
     [Figure: graphical model linking the ranking scores, the edge-type intensity $\lambda^{(T_a,T_b)}$ (hyperparameters $\theta_\lambda$), and the number of edges $W_{ij}^{(T_a,T_b)}$]

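  13. Why Dot Product?
        $W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\left( \lambda^{(T_a,T_b)} \left( r_i^{(T_a)} \cdot r_j^{(T_b)} \right) \right)$
        where $W_{ij}^{(T_a,T_b)}$ is the number of edges, $\lambda^{(T_a,T_b)}$ is the intensity of the edge type, and $r_i^{(T_a)} \cdot r_j^{(T_b)}$ is the dot product of the ranking scores.
     ◮ The dot product can be expressed as
        $r_i^{(T_a)} \cdot r_j^{(T_b)} = \cos\theta \times \| r_i^{(T_a)} \| \times \| r_j^{(T_b)} \|$
     ◮ In order to have a large $W_{ij}^{(T_a,T_b)}$ we need
        ◮ Large $\lambda^{(T_a,T_b)}$
        ◮ Large $\cos\theta$
        ◮ Large $\| r_i^{(T_a)} \|$ and $\| r_j^{(T_b)} \|$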
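
As a small numerical illustration of the dot-product argument, the sketch below computes the Poisson rate for two vertex pairs and checks the cosine factorization. The score vectors, the intensity value, and the function names are hypothetical, chosen only to show that two vertices important in the same cluster get a higher expected edge count than two vertices important in different clusters.

```python
import math


def dot(u, v):
    """Dot product of two equal-length score vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm of a score vector."""
    return math.sqrt(dot(u, u))


def edge_rate(lam, r_i, r_j):
    """Poisson rate lam * (r_i . r_j) for the edge count between two vertices."""
    return lam * dot(r_i, r_j)


lam = 2.0            # hypothetical intensity of one edge type
r_a = [3.0, 1.0]     # mostly important in cluster 0
r_b = [2.0, 0.5]     # mostly important in cluster 0
r_c = [0.2, 2.0]     # mostly important in cluster 1

same_cluster_rate = edge_rate(lam, r_a, r_b)
cross_cluster_rate = edge_rate(lam, r_a, r_c)

# The dot product factors into the cosine of the angle and the two norms.
cos_ab = dot(r_a, r_b) / (norm(r_a) * norm(r_b))
reconstructed = cos_ab * norm(r_a) * norm(r_b)

print(same_cluster_rate, cross_cluster_rate, cos_ab)
```

Vectors pointing in similar directions (large cosine) and with large norms (high importance) both push the rate up, which matches the three conditions listed on the slide.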

  14. Summary of the Model
     ◮ For each vertex n and each cluster k,
        draw $r_{nk}^{(T_m)} \sim \mathrm{Gamma}(\alpha_r, \beta_r)$
     ◮ For each non-zero edge type $(T_a, T_b)$,
        draw $\lambda^{(T_a,T_b)} \sim \mathrm{Gamma}(\alpha_\lambda, \beta_\lambda)$
     ◮ For each pair of different vertices $(v_i^{(T_a)}, v_j^{(T_b)})$,
        draw $W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\left( \lambda^{(T_a,T_b)} \left( r_i^{(T_a)} \cdot r_j^{(T_b)} \right) \right)$
     [Figure: graphical model linking the ranking scores (hyperparameters $\theta_r$), the edge-type intensity $\lambda^{(T_a,T_b)}$ (hyperparameters $\theta_\lambda$), and the number of edges $W_{ij}^{(T_a,T_b)}$]
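
The generative process summarized above can be sketched with the Python standard library alone. This is an illustrative forward sampler under stated assumptions, not the authors' implementation: it treats β as a rate parameter (the slides do not say whether Gamma(α, β) uses rate or scale), it samples each ordered vertex pair independently, and the function names, vertex types, and hyperparameter values are hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_network(sizes, edge_types, K, alpha_r, beta_r, alpha_lam, beta_lam, seed=0):
    rng = random.Random(seed)
    # Ranking scores: r[t][n][k] ~ Gamma(alpha_r, beta_r).
    # random.gammavariate takes (shape, scale), so scale = 1 / rate (assumption).
    r = {
        t: [[rng.gammavariate(alpha_r, 1.0 / beta_r) for _ in range(K)]
            for _ in range(n)]
        for t, n in sizes.items()
    }
    # One intensity per edge type: lam[(a, b)] ~ Gamma(alpha_lam, beta_lam).
    lam = {e: rng.gammavariate(alpha_lam, 1.0 / beta_lam) for e in edge_types}
    # Edge counts: W ~ Pois(lam * (r_i . r_j)) for each pair of different vertices.
    W = {}
    for a, b in edge_types:
        for i in range(sizes[a]):
            for j in range(sizes[b]):
                if a == b and i == j:
                    continue  # skip a vertex paired with itself
                rate = lam[(a, b)] * sum(x * y for x, y in zip(r[a][i], r[b][j]))
                W[(a, b, i, j)] = sample_poisson(rate, rng)
    return r, lam, W


# Hypothetical toy schema: 3 authors, 2 venues, 2 clusters.
sizes = {"author": 3, "venue": 2}
edge_types = [("author", "author"), ("author", "venue")]
r, lam, W = sample_network(sizes, edge_types, K=2,
                           alpha_r=1.0, beta_r=2.0, alpha_lam=1.0, beta_lam=1.0)
print(len(W), sum(W.values()))
```

Inference runs this process in reverse: given the observed counts W, it estimates the ranking scores and edge-type intensities.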

  15. Inference
     ◮ It is computationally intractable to directly evaluate the posterior distributions.
     ◮ We use mean-field variational inference to approximate these distributions.
     ◮ Ranking and clustering results are given by comparing the expected values of the ranking scores:
        $v_n^{(T_m)} \in C_k$, where $k = \arg\max_l \left( E[ r_{nl}^{(T_m)} ] \right)$.
        $\mathrm{rank}_k(v_n^{(T_m)}) = \mathrm{argsort}_i \left( E[ r_{ik}^{(T_m)} ] \right)$.
     ◮ We introduce seeds.
     ◮ Existing models use seeds to guide the clustering process.
     ◮ We select 1 representative object for each cluster.
     ◮ We assign a special prior distribution for these seeds.
     [Figure: the ranking scores for N objects of type $T_m$ and K clusters, annotated with ranking results and clustering results]

  16. Synthetic Data
     ◮ We generate synthetic data
        ◮ 400 data points
        ◮ 4 different types
        ◮ 2 clusters
     ◮ We add noise of different levels.
     [Figure: three panels showing the synthetic data at low, medium, and high noise levels]

  17. Real Data
     ◮ We test the performance of the model on two real heterogeneous network datasets:
        ◮ DBLP dataset
        ◮ YELP dataset
     ◮ We compare GPNRankClus with state-of-the-art algorithms:
        ◮ NetClus, a clustering and ranking method for heterogeneous networks that follow a star-network schema.
        ◮ GNetMine, a transductive classification method in heterogeneous networks.
        ◮ RankClass, a ranking-based classification method in heterogeneous networks.

  18. DBLP Dataset
     ◮ The dataset includes conferences from Database (DB), Data Mining (DM), Machine Learning (ML), and Information Retrieval (IR).
     [Figure: the DBLP network schema, with vertex types Author, Word, and Venue, and edge types co-author, publish, use, and appear]

     Classification accuracy on authors
                 GPNRankClus   NetClus    GNetMine   RankClass
     Accuracy    92.28%        76.11%‡    80.67%     91.12%

     Classification accuracy on conferences
                 GPNRankClus   NetClus    GNetMine   RankClass
     Accuracy    100%          85%‡       100%       100%

     ‡ We test NetClus on the star-schema version of the DBLP dataset.

     Top-5 words in each cluster
          DB          DM               ML          IR
     1    data        data             learning    web
     2    database    mining           knowledge   retrieval
     3    databases   learning         system      information
     4    query       clustering       reasoning   search
     5    system      classification   model       text

     Top-5 conferences in each cluster
          DB       DM      ML      IR
     1    VLDB     KDD     IJCAI   SIGIR
     2    ICDE     PAKDD   AAAI    WWW
     3    SIGMOD   ICDM    ICML    CIKM
     4    PODS     PKDD    CVPR    ECIR
     5    EDBT     SDM     ECML    AAAI

  19. YELP Dataset
     ◮ We examine a subset of the YELP dataset for 3 different clustering tasks:
        ◮ 4 Level-1 categories
        ◮ 6 Restaurant categories
        ◮ 6 Shopping categories
     [Figure: the YELP network schema, with vertex types User, Business, Review, and Word, and edge types "given by", "given to", and "contains"]

     Classification accuracy on businesses
                  GPNRankClus   NetClus   GNetMine   RankClass
     Level 1      56.25%        17.78%    47.16%     37.19%
     Restaurant   66.81%        15.31%    49.36%     57.11%
     Shopping     64.62%        13.28%    64.45%     32.58%

     Normalized Mutual Information (NMI) on businesses
                  GPNRankClus   NetClus   GNetMine   RankClass
     Level 1      0.5590        0.0168    0.1387     0.1579
     Restaurant   0.6606        0.0187    0.2346     0.3044
     Shopping     0.4721        0.0313    0.3617     0.2335
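
For readers unfamiliar with the NMI metric reported above, here is a minimal sketch of how normalized mutual information between true categories and predicted clusters can be computed. The slides do not state which normalization was used; this sketch assumes the arithmetic mean of the two entropies, and the label lists are made-up toy data rather than YELP results.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(truth, pred):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    count_t, count_p = Counter(truth), Counter(pred)
    return sum(
        (c / n) * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )


def nmi(truth, pred):
    """NMI with arithmetic-mean normalization (an assumption, see lead-in)."""
    h_t, h_p = entropy(truth), entropy(pred)
    if h_t == 0.0 and h_p == 0.0:
        return 1.0  # both labelings are constant, hence identical partitions
    return mutual_information(truth, pred) / ((h_t + h_p) / 2.0)


truth = ["food", "food", "shop", "shop"]
perfect = [1, 1, 0, 0]       # same partition, different label names
unrelated = [0, 1, 0, 1]     # statistically independent of the truth

print(nmi(truth, perfect))
print(nmi(truth, unrelated))
```

Because NMI compares partitions rather than label names, a clustering that matches the true categories up to renaming scores 1, while a clustering independent of the categories scores 0.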

  20. Conclusions
     ◮ We introduce a new concept of ranking score that conveys both ranking and clustering information.
     ◮ Based on this concept, we propose a generative model called GPNRankClus.
        ◮ We model the ranking score of each vertex in each cluster with a gamma distribution.
        ◮ We model the number of edges with a Poisson distribution.
     ◮ We test our model on DBLP and YELP data.
     ◮ GPNRankClus outperforms state-of-the-art baselines.
