GIN: A Clustering Model for Capturing Dual Heterogeneity in Networked Data Jialu Liu Chi Wang Jing Gao Quanquan Gu Charu Aggarwal Lance Kaplan Jiawei Han May 1, 2015 amss
Outline
1 Heterogeneity in Networked Data
2 GIN–the Proposed Network Clustering Algorithm
    Modeling Subnetworks
    Unified Model
3 Experiments
Networked Data
Many real-world data sets can be represented as a network (or graph) composed of nodes that are interconnected via meaningful links.
Node Heterogeneity
In real networks, there will likely be multiple types of nodes.
Link Heterogeneity
Meanwhile, links can be categorized into different types.
Figure: Binary/unweighted links and weighted links.
Besides being weighted or unweighted, links can be directed or undirected.
Dual Heterogeneity
In this work, we study heterogeneous networks that contain interconnected multi-typed nodes and links. Specifically, links are undirected but are allowed to be either binary or weighted.
Figure: Four example networks (a)-(d) over authors (A), papers (P) and venues (V). Dashed lines denote binary links; solid lines denote weighted links.
Task and Novelty
Network Clustering: Given a general heterogeneous network, we aim to find a clustering solution in which each cluster consists of multiple types of nodes and links.
Novelty compared with previous works:
- We consider heterogeneity in both nodes and links;
- The algorithm imposes no requirement on the network schema;
- The algorithm shows that sampling unobserved links (negative sampling) improves performance.
Subnetworks
A subnetwork of a heterogeneous network is either a homogeneous network or a bipartite network. A network whose number of object types is T = 1 is called a homogeneous network. It is called a bipartite network when T = 2 and links exist only between the two object types.
Symbols
We use $G$ to denote a heterogeneous network and $G^{(uv)}$ to represent its subnetwork, which is homogeneous if object type $u$ equals $v$ and bipartite otherwise. $G^{(uv)}$ can be either unweighted or weighted. That is to say, the link $e_{ij}^{(uv)}$ between nodes $x_i^{(u)}$ and $x_j^{(v)}$, with weight $W_{ij}^{(uv)}$, can be binary or take any non-negative value.
Subnetworks with Binary Links
Suppose the probability of a link between nodes $x_i^{(u)}$ and $x_j^{(v)}$ is $P(e_{ij}^{(uv)} = 1)$. Specifically, we factorize $P(e_{ij}^{(uv)} = 1)$ into $\sum_{k=1}^{K} \theta_{ik}^{(u)} \theta_{jk}^{(v)}$, where $\{\theta_{ik}^{(u)}\}_{k=1}^{K}$ is a vector of length $K$ indicating the cluster membership of node $x_i^{(u)}$.
This factorization implies that two nodes get connected more easily if they share the same cluster distribution.
Example: for $\theta_i^{(u)} = (0.1, 0, 0, 0.6, 0, 0.1, 0.2)$ and $\theta_j^{(v)} = (0, 0.1, 0, 0.7, 0, 0.2, 0)$, the link probability is $0.44$.
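The factorization above is an inner product of two membership vectors, which can be checked numerically. The following is a minimal sketch; the function name `link_probability` is a hypothetical helper, and the two vectors are the ones shown on this slide.

```python
import numpy as np

def link_probability(theta_i, theta_j):
    """P(e_ij = 1) = sum_k theta_ik * theta_jk for two membership vectors."""
    theta_i = np.asarray(theta_i, dtype=float)
    theta_j = np.asarray(theta_j, dtype=float)
    return float(theta_i @ theta_j)

# Membership vectors taken from the slide's example (K = 7 clusters).
theta_i = [0.1, 0.0, 0.0, 0.6, 0.0, 0.1, 0.2]
theta_j = [0.0, 0.1, 0.0, 0.7, 0.0, 0.2, 0.0]

p = link_probability(theta_i, theta_j)
print(round(p, 2))  # 0.6*0.7 + 0.1*0.2 = 0.44
```

Only clusters on which both nodes place mass contribute to the sum, which is why the shared fourth cluster dominates the result.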
The underlying generative process for link $e_{ij}^{(uv)}$ is as follows:
$$e_{ij}^{(uv)} \sim \mathrm{Bernoulli}\Big(\sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big).$$
For the whole set of binary links $E^{(uv)}$, the following likelihood can be derived to estimate the parameters:
$$\prod_{i<j} P\big(e_{ij}^{(uv)} = 1\big)^{W_{ij}^{(uv)}} \underbrace{P\big(e_{ij}^{(uv)} = 0\big)^{1 - W_{ij}^{(uv)}}}_{\text{Unobserved Links}} \tag{1}$$
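To make the Bernoulli likelihood concrete, here is a minimal sketch that evaluates the logarithm of Eq. (1). It assumes a single homogeneous binary subnetwork stored as a symmetric 0/1 adjacency matrix; the function name, the clipping constant `eps`, and the toy data are illustrative assumptions, not part of the slides.

```python
import numpy as np

def binary_log_likelihood(W, theta, eps=1e-12):
    """Log of Eq. (1) for one homogeneous binary subnetwork.

    W:     symmetric 0/1 adjacency matrix, shape (n, n)
    theta: cluster memberships, shape (n, K), rows summing to 1
    """
    W = np.asarray(W, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # P(e_ij = 1) for every pair, clipped to keep the logs finite.
    P = np.clip(theta @ theta.T, eps, 1.0 - eps)
    iu = np.triu_indices(W.shape[0], k=1)  # pairs with i < j
    w, p = W[iu], P[iu]
    return float(np.sum(w * np.log(p) + (1.0 - w) * np.log(1.0 - p)))

# Toy network: nodes {0, 1} are linked and nodes {2, 3} are linked.
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
theta_good = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
theta_bad = np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]])

ll_good = binary_log_likelihood(W, theta_good)
ll_bad = binary_log_likelihood(W, theta_bad)
print(ll_good > ll_bad)
```

Memberships that agree with the link structure score higher, which is the signal the estimation procedure exploits.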
Subnetworks with Weighted Links
Similar to the Bernoulli setting in the previous subsection, we first model the existence of a link between a given pair of nodes. In addition to the cluster membership vector $\theta_i^{(u)}$, we incorporate a scale parameter $\sigma_i^{(u)}$ for each node $x_i^{(u)}$ in consideration of the weighted setting. Then we can come up with the following generative process for weighted links:
$$\begin{aligned}
&\text{(a)}\quad e_{ij}^{(uv)} \sim \mathrm{Bernoulli}\Big(\sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big) \\
&\text{(b)}\quad \text{If } e_{ij}^{(uv)} = 1,\quad \omega_{ij}^{(uv)} \sim \mathrm{Poisson}\Big(\sigma_i^{(u)} \sigma_j^{(v)} \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)
\end{aligned} \tag{2}$$
where the discrete random variable $\omega_{ij}^{(uv)}$ is the weight of the link.
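The two-step generative process can be simulated directly. The sketch below is illustrative only: the function name, the parameter values, and the choice to report a weight of 0 when no link is drawn are assumptions made for this example.

```python
import numpy as np

def sample_weighted_link(theta_i, theta_j, sigma_i, sigma_j, rng):
    """Draw (e, omega) for one node pair following steps (a) and (b)."""
    s = float(np.dot(theta_i, theta_j))
    e = int(rng.binomial(1, s))  # step (a): does the link exist?
    if e == 0:
        return 0, 0  # no link, so no weight is drawn (assumed convention)
    omega = int(rng.poisson(sigma_i * sigma_j * s))  # step (b): link weight
    return e, omega

rng = np.random.default_rng(0)
theta_i = np.array([0.7, 0.2, 0.1])
theta_j = np.array([0.6, 0.3, 0.1])

draws = [sample_weighted_link(theta_i, theta_j, 3.0, 2.0, rng)
         for _ in range(1000)]
link_rate = sum(e for e, _ in draws) / len(draws)
print(link_rate)  # should be close to theta_i . theta_j = 0.49
```

The scale parameters only affect the weight in step (b), so two high-scale nodes in the same cluster tend to share heavier links without being more likely to be linked.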
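$$\prod_{W_{ij}^{(uv)} = 0} \underbrace{\Big(1 - \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)}_{\text{Unobserved Links}} \times \prod_{W_{ij}^{(uv)} > 0} \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)} \, \frac{\Big(\sigma_i^{(u)} \sigma_j^{(v)} \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)^{W_{ij}^{(uv)}}}{W_{ij}^{(uv)}!} \times e^{-\sigma_i^{(u)} \sigma_j^{(v)} \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}}. \tag{3}$$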
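The weighted likelihood in Eq. (3) can be evaluated in log space as in the following minimal sketch. It assumes one homogeneous weighted subnetwork with integer weights and iterates over pairs with i < j; the function name and the toy example are hypothetical.

```python
import math
import numpy as np

def weighted_log_likelihood(W, theta, sigma, eps=1e-12):
    """Log of Eq. (3) for one homogeneous subnetwork with integer weights."""
    W = np.asarray(W)
    theta = np.asarray(theta, dtype=float)
    n = W.shape[0]
    ll = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # Link-existence probability, clipped to keep the logs finite.
            s = min(max(float(theta[i] @ theta[j]), eps), 1.0 - eps)
            w = int(W[i, j])
            if w == 0:
                ll += math.log(1.0 - s)  # unobserved link
            else:
                rate = sigma[i] * sigma[j] * s
                # Bernoulli term plus Poisson log-pmf; lgamma(w + 1) = log(w!)
                ll += (math.log(s) + w * math.log(rate)
                       - math.lgamma(w + 1) - rate)
    return ll

# Two nodes joined by a link of weight 2, with uniform memberships.
W = np.array([[0, 2], [2, 0]])
theta = np.array([[0.5, 0.5], [0.5, 0.5]])
sigma = [2.0, 2.0]
ll = weighted_log_likelihood(W, theta, sigma)
print(ll)
```

For this toy pair, s = 0.5 and the Poisson rate is 2, so the log-likelihood is log 0.5 + 2 log 2 - log 2! - 2 = -2.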
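Objective Function
We first define two sets of subnetworks belonging to the same heterogeneous network $G$: $\mathcal{B}$ and $\mathcal{W}$. They represent subnetworks having binary and weighted links respectively, satisfying $\mathcal{B} \cup \mathcal{W} = G$ and $\mathcal{B} \cap \mathcal{W} = \emptyset$. The joint likelihood over all subnetworks is:
$$\begin{aligned}
&\prod_{G^{(uv)} \in \mathcal{B}} \prod_{i<j} \Big(\sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)^{W_{ij}^{(uv)}} \Big(1 - \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)^{1 - W_{ij}^{(uv)}} \\
&\times \prod_{G^{(uv)} \in \mathcal{W}} \Bigg[ \prod_{W_{ij}^{(uv)} = 0} \Big(1 - \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big) \\
&\qquad \times \prod_{W_{ij}^{(uv)} > 0} \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)} \, \frac{\Big(\sigma_i^{(u)} \sigma_j^{(v)} \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)^{W_{ij}^{(uv)}}}{W_{ij}^{(uv)}!} \times e^{-\sigma_i^{(u)} \sigma_j^{(v)} \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}} \Bigg].
\end{aligned} \tag{4}$$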
Complete Log-likelihood
Directly optimizing the previous expression is difficult. We apply the EM algorithm, using $\phi_{ijk_1k_2}^{(uv)}$ to denote the posterior probability that an unobserved link is generated from different cluster assignments of its two end nodes, i.e., $k_1 \neq k_2$. Meanwhile, we use $\psi_{ijk}^{(uv)}$ to denote the posterior probability that a link results from the same cluster assignment of its two end nodes.
$$\begin{aligned}
L(\Theta, \Sigma) ={}& \sum_{G^{(uv)} \in \mathcal{B}} \sum_{W_{ij}^{(uv)} = 1} \sum_k \psi_{ijk}^{(uv)} \log \theta_{ik}^{(u)} \theta_{jk}^{(v)} \\
&+ \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} \sum_k \big(W_{ij}^{(uv)} + 1\big) \psi_{ijk}^{(uv)} \log \theta_{ik}^{(u)} \theta_{jk}^{(v)} \\
&+ \sum_{G^{(uv)} \in G} \sum_{W_{ij}^{(uv)} = 0} \sum_{k_1 \neq k_2} \phi_{ijk_1k_2}^{(uv)} \log \theta_{ik_1}^{(u)} \theta_{jk_2}^{(v)} \\
&+ \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} W_{ij}^{(uv)} \log \sigma_i^{(u)} \sigma_j^{(v)}.
\end{aligned} \tag{5}$$
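Update Functions
Expectation Step:
$$\phi_{ijk_1k_2}^{(uv)} = \frac{\theta_{ik_1}^{(u)} \theta_{jk_2}^{(v)}}{\sum_{l_1 \neq l_2} \theta_{il_1}^{(u)} \theta_{jl_2}^{(v)}}, \qquad \psi_{ijk}^{(uv)} = \frac{\theta_{ik}^{(u)} \theta_{jk}^{(v)}}{\sum_l \theta_{il}^{(u)} \theta_{jl}^{(v)}}.$$
Maximization Step:
$$\theta_{ik}^{(u)} \propto \sum_{G^{(uv)} \in \mathcal{B}} \sum_{W_{ij}^{(uv)} = 1} \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} \big(W_{ij}^{(uv)} + 1\big) \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in G} \sum_{W_{ij}^{(uv)} = 0} \sum_{l \neq k} \phi_{ijkl}^{(uv)}.$$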
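The two E-step posteriors for a single node pair can be sketched as follows. This is an illustration under the assumption that both membership vectors are strictly positive and sum to one; `e_step_pair` is a hypothetical helper name.

```python
import numpy as np

def e_step_pair(theta_i, theta_j):
    """Return (psi, phi) for one node pair.

    psi[k]      : posterior that an observed link comes from shared cluster k
    phi[k1, k2] : posterior that an unobserved link comes from the
                  differing assignments k1 != k2 (the diagonal is zero)
    """
    outer = np.outer(theta_i, theta_j)  # outer[k1, k2] = theta_ik1 * theta_jk2
    same = np.diag(outer)               # terms with k1 == k2
    psi = same / same.sum()
    off = outer - np.diag(same)         # keep only k1 != k2 terms
    phi = off / off.sum()
    return psi, phi

theta_i = np.array([0.7, 0.2, 0.1])
theta_j = np.array([0.6, 0.3, 0.1])
psi, phi = e_step_pair(theta_i, theta_j)
print(psi.sum(), phi.sum())  # each posterior is normalized separately
```

Materializing the full K x K matrix `phi` is what makes the naive E-step quadratic in the number of clusters.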
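Efficiency Issue
Computing every $\phi_{ijk_1k_2}^{(uv)}$ costs $O(K^2)$ per node pair:
$$\phi_{ijk_1k_2}^{(uv)} = \frac{\theta_{ik_1}^{(u)} \theta_{jk_2}^{(v)}}{\sum_{l_1 \neq l_2} \theta_{il_1}^{(u)} \theta_{jl_2}^{(v)}}.$$
The M-step only needs the sum over $l \neq k$. Because each membership vector sums to one, this sum has a closed form that costs $O(K)$:
$$\sum_{l \neq k} \phi_{ijkl}^{(uv)} = \frac{\sum_{l \neq k} \theta_{ik}^{(u)} \theta_{jl}^{(v)}}{\sum_{l_1 \neq l_2} \theta_{il_1}^{(u)} \theta_{jl_2}^{(v)}} = \frac{\theta_{ik}^{(u)} - \theta_{ik}^{(u)} \theta_{jk}^{(v)}}{1 - \sum_l \theta_{il}^{(u)} \theta_{jl}^{(v)}}.$$
The update for $\theta_{ik}^{(u)}$ is then:
$$\theta_{ik}^{(u)} \propto \sum_{G^{(uv)} \in \mathcal{B}} \sum_{W_{ij}^{(uv)} = 1} \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} \big(W_{ij}^{(uv)} + 1\big) \psi_{ijk}^{(uv)} + \Bigg[ \sum_{G^{(uv)} \in G} \sum_{W_{ij}^{(uv)} = 0} \sum_{l \neq k} \phi_{ijkl}^{(uv)} \Bigg]. \tag{6}$$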
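The closed form for the sum over l can be verified numerically against the brute-force computation. The sketch below assumes membership vectors that sum to one (drawn here from a Dirichlet distribution purely for illustration); both function names are hypothetical.

```python
import numpy as np

def phi_row_sums_bruteforce(theta_i, theta_j):
    """sum_{l != k} phi_{ijkl} for every k, via the full K x K matrix."""
    outer = np.outer(theta_i, theta_j)
    off = outer - np.diag(np.diag(outer))  # zero out the k1 == k2 terms
    return off.sum(axis=1) / off.sum()

def phi_row_sums_closed_form(theta_i, theta_j):
    """Same quantity in O(K), valid when both vectors sum to 1."""
    return (theta_i - theta_i * theta_j) / (1.0 - theta_i @ theta_j)

rng = np.random.default_rng(0)
theta_i = rng.dirichlet(np.ones(5))
theta_j = rng.dirichlet(np.ones(5))

slow = phi_row_sums_bruteforce(theta_i, theta_j)
fast = phi_row_sums_closed_form(theta_i, theta_j)
print(np.allclose(slow, fast))
```

The closed form avoids building the K x K posterior matrix, which matters because this term is evaluated for every unobserved node pair.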
Sampling Unobserved Links
For the unobserved links, the space and time complexity increases significantly if we need to go over all of them. To alleviate this burden, we sample a potential neighbourhood for each node. This also downweights the third term of the update for $\theta_{ik}^{(u)}$:
$$\theta_{ik}^{(u)} \propto \sum_{G^{(uv)} \in \mathcal{B}} \sum_{W_{ij}^{(uv)} = 1} \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} \big(W_{ij}^{(uv)} + 1\big) \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in G} \sum_{W_{ij}^{(uv)} = 0} \sum_{l \neq k} \phi_{ijkl}^{(uv)} \tag{7}$$
We keep all the non-zero links and sample $\eta M$ unobserved links, so that the number of sampled links is proportional to the total number of links $M$ (we choose $\eta = 0.1$ in the experiments).
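The sampling idea can be sketched as follows. Note that this is not the per-node neighbourhood sampling described on the slide: for simplicity it draws node pairs uniformly at random from a homogeneous network and rejects observed ones. The function name, the rejection scheme, and the use of the observed link count as M are all assumptions.

```python
import random

def sample_unobserved_links(n_nodes, observed, eta=0.1, seed=0):
    """Sample about eta * M unobserved pairs, where M = number of observed links."""
    rng = random.Random(seed)
    observed = {tuple(sorted(e)) for e in observed}  # undirected pairs
    target = int(round(eta * len(observed)))
    # Never ask for more unobserved pairs than actually exist.
    max_unobserved = n_nodes * (n_nodes - 1) // 2 - len(observed)
    target = min(target, max_unobserved)
    sampled = set()
    while len(sampled) < target:
        i, j = rng.sample(range(n_nodes), 2)  # two distinct nodes
        pair = (min(i, j), max(i, j))
        if pair not in observed:
            sampled.add(pair)
    return sampled

# Toy ring network with 50 nodes and 50 links.
observed = [(i, (i + 1) % 50) for i in range(50)]
negatives = sample_unobserved_links(50, observed, eta=0.1)
print(len(negatives))  # 0.1 * 50 = 5 sampled unobserved links
```

Rejection sampling is cheap here because the networks are sparse, so a random pair is almost always unobserved.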
Datasets
Four real-world data sets were used.
The DBLP data set is a collection of CS publications. We use a subset that belongs to four research areas.
The 4Groups data set contains co-author and author-term relationships, where researchers are selected from four data mining and machine learning research groups.
The Flickr data set is a network containing three types of objects: image, user and tag. Links exist between image-user and image-tag pairs.
The NSF data set describes NSF Research Awards Abstracts from 1990 to 2003. We use documents associated with terms and investigators that belong to the largest 10 programs.
Key statistics of the four data sets are summarized in the following table.

Data set    DBLP      4Groups   Flickr    NSF
#Nodes      70,536    1,618     4,076     30,995
#Links      332,388   5,568     14,396    1,883,682
Sparsity    6.7e-5    2.1e-3    8.7e-4    2.0e-3
#Clusters   4         4         8         10
#Objects    4         2         3         3
#Subnet.    3         2         2         2
Link Cat.   Binary    Weighted  Binary    Fused

Figure: Network schemas of all data sets, in which circles of labelled object types are in grey. Dashed (resp., solid) lines refer to binary (resp., weighted) links. Object types shown include Term, Venue, Paper, Author, Image, User, Tag, Doc. and Inv.
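The slides do not define the Sparsity row. As a hedged observation, the reported values are consistent with #Links divided by #Nodes squared; the check below treats that definition as an assumption rather than the authors' stated formula.

```python
# (nodes, links) per data set, copied from the table above.
stats = {
    "DBLP": (70_536, 332_388),
    "4Groups": (1_618, 5_568),
    "Flickr": (4_076, 14_396),
    "NSF": (30_995, 1_883_682),
}

# Assumed definition: sparsity = #links / #nodes^2.
sparsity = {name: links / nodes ** 2 for name, (nodes, links) in stats.items()}

for name, value in sparsity.items():
    print(f"{name}: {value:.1e}")
```

Under this assumed definition, all four values round to the figures reported in the table.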