Chapter 8-2: Community Detection
Jilles Vreeken, IRDM ‘15/16, 3 Dec 2015
IRDM Chapter 8, overview
1. The basics
2. Properties of Graphs
3. Frequent Subgraphs
4. Graph Clustering
You’ll find this covered in: Aggarwal, Ch. 17, 19; Zaki & Meira, Ch. 4, 11, 16
IRDM Chapter 8, today
1. The basics
2. Properties of Graphs
3. Frequent Subgraphs
4. Community Detection
You’ll find this covered in: Aggarwal, Ch. 17, 19; Zaki & Meira, Ch. 4, 11, 16
Chapter 8.4: Community Detection
Aggarwal Ch. 19.3, 17.5; Zaki & Meira Ch. 16
Chapter 8.4.1: Detecting Small Communities
Trawling: searching for small communities in the Web graph
What is the signature of a community in the Web graph?
Intuition: many people all talking about the same things.
Use this to define “topics”: whatever the same people on the “left” all talk about on the “right” shows up as a dense 2-layer (bipartite) graph.
Searching for small communities
A more well-defined problem: enumerate complete bipartite subgraphs L_{β,γ}, where L_{β,γ} has β nodes on the “left” and every such node links to the same set of γ nodes on the “right”.
In the figure: the left set Y with |Y| = β = 3 and the right set Z with |Z| = γ = 4 form the fully connected bipartite subgraph L_{3,4}.
Frequent itemset mining
Recall market basket analysis:
market: a universe V of n items
baskets: m transactions u_1, u_2, …, u_m ⊆ V, where each u_j is the set of items one person bought
support: a frequency threshold τ
Goal: find all subsets Y ⊆ V such that Y ⊆ u_j for at least τ of the transactions u_j.
What’s the connection between itemsets and complete bipartite graphs?
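To make the support notion concrete, a tiny Python sketch of the support count (the function name and the list-of-sets representation are illustrative, not from the slides):

```python
# How many of the transactions contain the candidate itemset Y?
def support(Y, transactions):
    return sum(1 for t in transactions if set(Y) <= set(t))

# e.g. support({'milk', 'bread'},
#              [{'milk', 'bread', 'eggs'}, {'milk'}, {'bread', 'milk'}]) == 2
```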
From itemsets to bipartite L_{β,γ}
Frequent itemsets = complete bipartite graphs! How?
View each node j as the set u_j of the nodes j points to, e.g. u_j = {b, c, d, e}.
Then L_{β,γ} corresponds to a set Z of size γ that occurs in (is contained in) β of the sets u_j.
So, to look for L_{β,γ}: set the frequency threshold to β and find all frequent itemsets of size γ.
β … minimum support (|Y| = β); γ … itemset size (|Z| = γ)
From itemsets to bipartite L_{β,γ} (Kumar et al. ’99)
1) View each node j as the set u_j of the nodes j points to, e.g. u_j = {b, c, d, e}.
2) Find frequent itemsets, with β as the minimum support and γ as the itemset size.
3) Say we find a frequent itemset Z = {b, c, d} with support β. This means there are β nodes that each link to all of {b, c, d}.
4) These β nodes form the left set Y, and together with Z we have found L_{β,γ}!
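As a concrete illustration of steps 1–4, here is a brute-force Python sketch that enumerates L_{β,γ} subgraphs directly from the out-sets u_j; a real implementation would use a frequent-itemset miner such as Apriori instead of trying all γ-subsets. The function name and the dict-of-sets representation are assumptions for the example.

```python
# Brute-force sketch of trawling via the itemset view (Kumar et al. '99).
# 'graph' maps each node j to the set u_j of nodes it points to.
from itertools import combinations

def find_bipartite_cores(graph, beta, gamma):
    """Return all (Y, Z) with |Z| = gamma, |Y| >= beta, and every node in Y
    linking to every node in Z, i.e. complete bipartite subgraphs L_{beta,gamma}."""
    cores = []
    targets = sorted({v for u_j in graph.values() for v in u_j})
    for Z in combinations(targets, gamma):                        # candidate right-hand sides
        Y = {j for j, u_j in graph.items() if set(Z) <= u_j}      # who links to all of Z?
        if len(Y) >= beta:                                        # support threshold beta
            cores.append((Y, set(Z)))
    return cores
```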
Example
Support threshold τ = β = 2, itemset size γ = 2.
Out-sets: b = {c, d, e}, c = {e}, d = {c, e, f, g}, e = {f, g}, f = {c, e}, g = {}.
Frequent itemsets: {c, e} with support 3 and {f, g} with support 2, i.e. we found two complete bipartite subgraphs: {b, d, f} × {c, e} and {d, e} × {f, g}.
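Running the sketch from the previous slide on this example (node names as listed above) reproduces the two subgraphs:

```python
graph = {
    'b': {'c', 'd', 'e'},
    'c': {'e'},
    'd': {'c', 'e', 'f', 'g'},
    'e': {'f', 'g'},
    'f': {'c', 'e'},
    'g': set(),
}
print(find_bipartite_cores(graph, beta=2, gamma=2))
# [({'b', 'd', 'f'}, {'c', 'e'}), ({'d', 'e'}, {'f', 'g'})]
```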
Chapter 8.4.2: Community Detection by Graph Clustering
Aggarwal Ch. 17.5, 19.3; Zaki & Meira Ch. 16
Where do graphs come from?
We can have data that is already in graph form, e.g. our social networks and their communities.
Or, we map existing data to a graph: data points become vertices, we add an edge if two data points are similar, and edge weights can express the degree of similarity.
Similarity and adjacency matrices
A similarity matrix is an n-by-n non-negative, symmetric matrix; conceptually, it is the opposite of a distance matrix.
Recall that a weighted adjacency matrix is also an n-by-n non-negative, symmetric matrix, for weighted, undirected graphs.
So, we can think of every similarity matrix as the adjacency matrix of some weighted, undirected graph; this graph will be complete (a clique).
Further, we can use any similarity measure between two points as an edge weight.
Getting non-complete graphs
Using complete graphs can be a waste of resources: for clustering, we don’t really care about very dissimilar pairs.
We can remove edges between dissimilar vertices (give them zero weight).
Or, we adjust the weights to diminish the influence of dissimilar points; the Gaussian kernel is popular for this:
w_ij = exp( −‖x_i − x_j‖² / (2σ²) )
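A minimal numpy sketch of the Gaussian-kernel weighting (the function name and the choice to zero the diagonal are my own):

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """W[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows x_i of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)   # no self-loops
    return W
```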
Getting non-complete graphs (2)
How to decide when vertices are too dissimilar?
In ε-neighbourhood graphs we add an edge between two vertices that are within distance ε of each other; usually the resulting graph is considered unweighted, as all remaining weights would be roughly similar.
In k-nearest-neighbour graphs we connect two vertices if one is among the k nearest neighbours of the other; in the mutual k-nearest-neighbour graph we only connect two vertices if they are both among each other’s k nearest neighbours.
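Sketches of the constructions, assuming a symmetric distance matrix D as input (the function and parameter names are illustrative):

```python
import numpy as np

def epsilon_graph(D, eps):
    """Unweighted epsilon-neighbourhood graph: edge iff distance <= eps."""
    A = (D <= eps).astype(float)
    np.fill_diagonal(A, 0.0)
    return A

def knn_graph(D, k, mutual=False):
    """(Mutual) k-nearest-neighbour graph as a 0/1 adjacency matrix."""
    n = D.shape[0]
    nn = np.zeros((n, n))
    for i in range(n):
        neighbours = [j for j in np.argsort(D[i]) if j != i][:k]
        nn[i, neighbours] = 1.0
    # OR of the directed relations for the k-NN graph, AND for the mutual variant
    return np.minimum(nn, nn.T) if mutual else np.maximum(nn, nn.T)
```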
Which similarity graph?
With ε-graphs, choosing the parameter is hard: there is no single correct answer if different clusters have different internal similarities.
k-nearest-neighbour graphs can connect points with different similarities, but far-away high-density regions become unconnected.
The mutual k-nearest-neighbour graph is somewhat in between, and good for detecting clusters with different densities.
General recommendation: start with the k-NN graph; use the others if the data supports that.
Example graph (Zaki & Meira, Fig 16.1)
Graph partitioning
Given an undirected graph, the bi-partitioning task is to divide the vertices into two disjoint groups A and B.
Questions: how can we define a “good” partition? How can we efficiently identify such a partition?
Graph partitioning
What makes a good partition?
Maximize the number of within-group connections; minimize the number of between-group connections.
Clustering as graph cuts
A cut of a connected graph G = (V, E) divides the set of vertices into two partitions, S and V ∖ S, and removes the edges between them.
A cut can be expressed by giving the set S, or by giving the cut set, i.e. the edges with exactly one end in S:
F = { (u, v) ∈ E : |{u, v} ∩ S| = 1 }
A graph cut groups the vertices of a graph into two clusters; subsequent cuts in the components give us a hierarchical clustering.
A k-way cut cuts the graph into k disjoint sets of vertices C_1, C_2, …, C_k and removes the edges between them.
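For a given vertex set S and weighted adjacency matrix W, the weight of the cut set is easy to evaluate. A sketch, with names and the indexing convention assumed:

```python
import numpy as np

def cut_weight(W, S):
    """Sum of the weights of edges with exactly one endpoint in S."""
    S = np.asarray(sorted(S))
    rest = np.setdiff1d(np.arange(W.shape[0]), S)
    return W[np.ix_(S, rest)].sum()   # each cut edge counted once (S x rest only)
```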
What is a good cut?
Not every cut will cut it.
In minimum cut the goal is to find a set of vertices such that cutting it from the rest of the graph requires removing the least number of edges (the least sum of weights, for weighted graphs); the extension to multiway cuts is straightforward.
The minimum cut can be found in polynomial time (the max-flow min-cut theorem), but minimum cut isn’t very good for clustering purposes.
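For intuition, a toy example (the graph is made up) using networkx’s Stoer-Wagner global minimum cut; note how the cheapest cut merely chops off a weakly attached vertex rather than separating the two dense groups:

```python
import networkx as nx

# Two dense groups joined by a bridge, plus one vertex hanging on a light edge.
G = nx.Graph()
G.add_weighted_edges_from([
    (1, 2, 3.0), (1, 3, 3.0), (2, 3, 3.0),   # group {1, 2, 3}
    (4, 5, 3.0), (4, 6, 3.0), (5, 6, 3.0),   # group {4, 5, 6}
    (3, 4, 2.0),                             # bridge between the groups
    (6, 7, 1.0),                             # pendant vertex 7
])
cut_value, (side_a, side_b) = nx.stoer_wagner(G)
print(cut_value, side_a, side_b)   # 1.0 -- the minimum cut just isolates vertex 7
```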
Cuts that cut it
The minimum cut usually removes only one vertex, which is not a very appealing clustering; we want to penalize the cut for imbalanced cluster sizes.
In ratio cut, the goal is to minimize the ratio between the weight of the edges in the cut set and the size of the clusters C_i.
Let W(A, B) = Σ_{i∈A, j∈B} w_ij, where w_ij is the weight of edge (i, j). Then
RatioCut = Σ_{i=1}^{k} W(C_i, V ∖ C_i) / |C_i|
Cuts that cut it
The volume of a set of vertices A is the total weight of all edges connected to A:
vol(A) = W(A, V) = Σ_{i∈A, j∈V} w_ij
In normalized cut we measure the size of C_i by vol(C_i) instead of |C_i|:
NormalisedCut = Σ_{i=1}^{k} W(C_i, V ∖ C_i) / vol(C_i)
Finding the optimal RatioCut or NormalisedCut is NP-hard.
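Evaluating RatioCut and NormalisedCut for a fixed partition is straightforward; it is finding the optimal partition that is NP-hard. A numpy sketch (function and argument names are my own):

```python
import numpy as np

def ratio_cut(W, clusters):
    """clusters: list of index arrays C_1..C_k; W: weighted adjacency matrix."""
    nodes = np.arange(W.shape[0])
    total = 0.0
    for C in clusters:
        C = np.asarray(C)
        rest = np.setdiff1d(nodes, C)
        total += W[np.ix_(C, rest)].sum() / len(C)            # W(C_i, V \ C_i) / |C_i|
    return total

def normalised_cut(W, clusters):
    nodes = np.arange(W.shape[0])
    total = 0.0
    for C in clusters:
        C = np.asarray(C)
        rest = np.setdiff1d(nodes, C)
        total += W[np.ix_(C, rest)].sum() / W[C, :].sum()     # vol(C_i) = W(C_i, V)
    return total
```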