As Strong as the Weakest Link: Mining Diverse Cliques in Weighted Graphs Petko Bogdanov (UC Santa Barbara), with Ben Baumer (Smith College), Prithwish Basu (Raytheon BBN), Amotz Bar-Noy (CUNY) and Ambuj K. Singh (UC Santa Barbara) ECML/PKDD, Prague, 2013
Example: Collaboration in sports Significance of a pair’s success when on a team 2
Influential groups 3
Multiple teams 4
Cliques in gene networks ● Gene interaction networks ● Complexes: interacting functional units* 5 * Leemor Joshua-Tor, Structure and Function of Nucleic Acid Regulatory Complexes, http://www.hhmi.org/research/structure-and-function-nucleic-acid-regulatory-complexes
Cliques in other domains ● Sets of duplicates and near-duplicates in similarity networks ○ images ○ video ○ other complex objects with a similarity function ● Co-evolving time series ○ stocks of companies related in a supply chain ○ brain regions co-associated in performing a task* 6 * Hagmann P, Cammoun L, Gigandet X, Meuli R, Honey CJ, Wedeen VJ, Sporns O (2008) Mapping the structural core of human cerebral cortex. PLoS Biology Vol. 6, No. 7, e159
Challenges ● Enumeration of cliques ○ MAX CLIQUE is NP-hard ● Ensuring diversity in the result set ○ Managing overlap “adds” complexity ● Size and density of real-world networks ● How can we find the best diverse cliques efficiently while maintaining good solution quality? 7
Outline ● Motivation and examples ● Problem statement and properties ● Proposed solutions ● Experiments ● Conclusion 8
Basic notions ● A graph G(V,E,w) represents a network of entities V with edges E among them ● w defines weights on edges ○ higher weight means stronger association ● A clique is a complete subgraph, i.e. all edges among the selected entities exist 9
Clique strength ● Strong ties of all pairwise edges ● A clique is as strong as its weakest link ● “Flat” teams in which all connections are important ● Bigger cliques featuring all strong edges are better 10
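The weakest-link notion above can be made concrete with a small sketch. The adjacency-dictionary representation, the toy weights, and the helper names `is_clique` and `weakest_link` are assumptions for illustration; the slide does not say how clique size enters the score, so only the minimum-weight part is shown.

```python
from itertools import combinations

# Hypothetical toy graph: adj[u][v] is the weight of edge (u, v).
adj = {
    "a": {"b": 0.9, "c": 0.8, "d": 0.2},
    "b": {"a": 0.9, "c": 0.7, "d": 0.3},
    "c": {"a": 0.8, "b": 0.7},
    "d": {"a": 0.2, "b": 0.3},
}

def is_clique(adj, nodes):
    """True if every pair of the given nodes is connected by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def weakest_link(adj, nodes):
    """Strength of a clique: the minimum weight over all its pairwise edges."""
    if not is_clique(adj, nodes):
        raise ValueError("nodes do not form a clique")
    return min(adj[u][v] for u, v in combinations(nodes, 2))

print(weakest_link(adj, ["a", "b", "c"]))  # 0.7
print(weakest_link(adj, ["a", "b", "d"]))  # 0.2
```

Both triples are cliques, but {a, b, d} is held back by its weak a-d edge, which is the sense in which a clique is only as strong as its weakest link.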
Diversity ● Objective is a linear combination of score and diversity via α [Equation labels: Score; Diversity] ● A higher number of distinct nodes in the solution means higher diversity 11
Example: Top-2 cliques Too much overlap 12
Example: Top-2 cliques Slightly lower score but less overlap 13
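The trade-off in the two top-2 examples can be sketched numerically. The exact objective is not recoverable from the slide text, so the form used here, (1 - α) · (sum of clique scores) + α · (number of distinct nodes), is an assumption, as are the clique scores and the names `diversity` and `objective`.

```python
def diversity(cliques):
    """Number of distinct nodes covered by a set of cliques."""
    return len(set().union(*cliques))

def objective(cliques, scores, alpha):
    """Assumed objective: linear combination of total score and diversity."""
    return (1 - alpha) * sum(scores) + alpha * diversity(cliques)

# Hypothetical top-2 results: the first pair overlaps in two nodes,
# the second pair scores slightly lower but shares no nodes.
high_overlap = [{"a", "b", "c"}, {"a", "b", "d"}]
less_overlap = [{"a", "b", "c"}, {"d", "e", "f"}]

print(objective(high_overlap, [0.9, 0.8], alpha=0.5))
print(objective(less_overlap, [0.9, 0.7], alpha=0.5))
```

Under this assumed objective the less-overlapping pair wins at α = 0.5 despite its lower total score, and setting α = 0 reduces the objective to the plain score sum.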
Complexity ● m-Diverse k-Structures (mDkS) is NP-hard ○ reduction from SET COVER ● Even if we are interested in sets of arbitrary structure, maximizing diversity is NP-hard [Figure label: Included in solution] 14
Approximation ● Good news ○ monotonic ○ submodular (diminishing returns) ○ Allows a (1-1/e)-APX ● Challenges ○ Requires greedily finding the next best clique ○ MAX CLIQUE is NP-hard to approximate within a constant factor ● Questions ○ Can we develop a solution with APX guarantees that is fast? Limitations? ○ Can we develop a very fast solution of good quality? [Figure labels: Diminishing return; Candidate to add to solution] 15
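The greedy scheme behind the (1-1/e) guarantee for monotone submodular objectives can be sketched over an explicit candidate list. This is a generic illustration, not the talk's algorithm: `greedy_select` and `coverage_gain` are hypothetical names, the gain counts only newly covered nodes, and the candidates are assumed to be already enumerated (which is exactly the expensive part for cliques).

```python
def coverage_gain(solution, candidate):
    """Marginal gain of a candidate: nodes not yet covered by the solution."""
    covered = set().union(*solution)
    return len(set(candidate) - covered)

def greedy_select(candidates, m, gain):
    """Pick up to m candidates, each time taking the largest marginal gain."""
    solution, remaining = [], list(candidates)
    for _ in range(m):
        if not remaining:
            break
        best = max(remaining, key=lambda c: gain(solution, c))
        if gain(solution, best) <= 0:
            break  # nothing left that improves the objective
        solution.append(best)
        remaining.remove(best)
    return solution

candidates = [frozenset("abc"), frozenset("abd"), frozenset("def")]
print(greedy_select(candidates, 2, coverage_gain))
```

The gain of {a, b, d} drops from 3 to 1 once {a, b, c} is selected, which is the diminishing-returns property the guarantee relies on.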
Outline ● Motivation and examples ● Problem statement and properties ● Proposed solutions ● Experiments ● Conclusion 16
Intuition ● How to obtain good cliques while reducing the cost of enumeration? ● Exploit the distribution of edge weights in a real network. ● Consider good edges first. ● Include good cliques in the solution before all edges have been considered, by bounding the contribution of partial cliques. 17
Upper bound for an incomplete clique contribution ● Optimistic completion of an incomplete clique C ○ Current lowest weight will be the lowest in the whole clique ○ The rest of the nodes will not overlap 18
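The optimistic completion can be written as a small function. The contribution formula, (1 - α) · (weakest link) + α · (newly covered nodes), is an assumed stand-in for the talk's objective, and `upper_bound` and its parameters are hypothetical names; only the two optimistic assumptions come from the slide.

```python
def upper_bound(partial, min_weight, covered, k, alpha):
    """Optimistic contribution of a partial clique grown to k nodes.

    Assumes the current lowest weight stays the lowest in the whole clique
    and that none of the missing nodes overlap the existing solution.
    """
    new_nodes = len(set(partial) - set(covered))  # nodes not yet in the solution
    missing = k - len(partial)                    # nodes still to be added
    return (1 - alpha) * min_weight + alpha * (new_nodes + missing)

# Partial clique {a, b} with weakest edge 0.8; node a is already in the solution.
print(upper_bound({"a", "b"}, 0.8, covered={"a"}, k=4, alpha=0.5))
```

Any real completion can only lower the weakest link or add overlapping nodes, so under the assumed formula its contribution never exceeds this value.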
DiCliQ - threshold and prune 1. Enumerate cliques in a thresholded graph 2. Upper-bound the contribution of incomplete cliques 3. If there is a candidate with a better score contribution than the best UB, add it to the solution 4. Lower threshold and repeat 21
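The thresholding and enumeration steps can be sketched as follows. This is not the DiCliQ implementation: it omits the upper-bound pruning and naively re-enumerates at each threshold, and the graph, the threshold schedule, and the names `threshold_graph`, `k_cliques`, and `cliques_by_threshold` are assumptions for illustration.

```python
# Hypothetical toy graph: adj[u][v] is the weight of edge (u, v).
adj = {
    "a": {"b": 0.9, "c": 0.8, "d": 0.2},
    "b": {"a": 0.9, "c": 0.7, "d": 0.3},
    "c": {"a": 0.8, "b": 0.7},
    "d": {"a": 0.2, "b": 0.3},
}

def threshold_graph(adj, t):
    """Keep only edges with weight >= t; returns unweighted neighbor sets."""
    return {u: {v for v, w in nbrs.items() if w >= t} for u, nbrs in adj.items()}

def k_cliques(nbrs, k):
    """Enumerate each k-node clique once, as a sorted tuple."""
    def extend(clique, cands):
        if len(clique) == k:
            yield tuple(clique)
            return
        for v in sorted(cands):
            # Keep only later candidates adjacent to v, so no clique repeats.
            yield from extend(clique + [v], {u for u in cands if u > v and u in nbrs[v]})
    yield from extend([], set(nbrs))

def cliques_by_threshold(adj, k, thresholds):
    """Yield (threshold, clique) the first time each clique appears."""
    seen = set()
    for t in sorted(thresholds, reverse=True):
        for clique in k_cliques(threshold_graph(adj, t), k):
            if clique not in seen:
                seen.add(clique)
                yield t, clique

print(list(cliques_by_threshold(adj, 3, [0.7, 0.2])))
```

Strong cliques surface at high thresholds, where the thresholded graph is sparse and enumeration is cheap; weaker ones appear only after the threshold is lowered.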
DiCliQ - threshold and prune ● Implements the GREEDY strategy and hence has a (1-1/e)-approximation factor ● Exhaustive enumeration of all cliques might incur high cost in very large/dense instances ● How to scale up the discovery of diverse cliques without compromising the quality much? 22
BUDiC - Bottom-up greedy heuristic ● Greedy expansion around a node based on the UB contribution ● Incorporates diversity ● Repeat for all nodes ● Scales much better: O(m*k*|E|) ● No APX guarantee ● Good quality on real datasets [Figure: clique C grows away from the nodes already in the solution (A), based on UB] 25
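A bottom-up expansion in the spirit of this slide might look like the sketch below. It is not the authors' BUDiC code: the bound, (1 - α) · (weakest link) + α · (new nodes plus nodes still missing), is an assumed form, and `grow_clique`, the toy graph, and the alphabetical tie-breaking are illustrative choices.

```python
# Hypothetical toy graph: adj[u][v] is the weight of edge (u, v).
adj = {
    "a": {"b": 0.9, "c": 0.8, "d": 0.2},
    "b": {"a": 0.9, "c": 0.7, "d": 0.3},
    "c": {"a": 0.8, "b": 0.7},
    "d": {"a": 0.2, "b": 0.3},
}

def grow_clique(adj, seed, k, covered, alpha):
    """Greedily grow a clique of up to k nodes around seed.

    At each step, add the common neighbor with the highest assumed
    upper-bound contribution; covered holds nodes already in the solution.
    """
    clique, min_w = [seed], float("inf")
    cands = set(adj[seed])
    while len(clique) < k and cands:
        def bound(v):
            w = min([min_w] + [adj[v][u] for u in clique])  # weakest link if v joins
            new = len(set(clique + [v]) - covered)          # nodes outside the solution
            missing = k - len(clique) - 1                   # optimistic: all new
            return (1 - alpha) * w + alpha * (new + missing)
        best = max(sorted(cands), key=bound)
        min_w = min([min_w] + [adj[best][u] for u in clique])
        clique.append(best)
        cands = {u for u in cands if u in adj[best]}  # keep common neighbors only
    return clique, min_w

print(grow_clique(adj, "a", 3, covered=set(), alpha=0.5))
print(grow_clique(adj, "a", 3, covered={"c"}, alpha=0.5))
```

With nothing covered, the expansion follows the strong edges to {a, b, c}; once c is already in the solution, the diversity term steers it to {a, b, d} instead, which illustrates growing away from included nodes.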
Outline ● Motivation and examples ● Problem statement and properties ● Proposed solutions ● Experiments ● Conclusion 26
Data 27
Scalability ● Compare running time to a Baseline (No thresholding) and relative quality to iMDV* ● α = 0.5, m = 10, k = 5 [Figure labels: Apx. guarantee; Scalable, High Quality] 28 * S. Bandyopadhyay and M. Bhattacharyya. Mining the largest dense vertexlet in a weighted scale-free graph. Fundam. Inform., 96(1-2):1–25, 2009
Scalability on YeastNet [Figure panels: α=0.5, k=5; α=0.5, m=5] 27
Quality 28
Discovering gene complexes 29
Conclusion ● General results for diverse clique mining ○ application to discovery of effective groups in collaboration ○ complexes in gene networks ○ similarity/correlation graphs ● Two scalable algorithms, one with a constant-factor approximation ● More than 3 orders of magnitude running time improvement while preserving good quality 30
Thank You Q&A The research was supported by the Army Research Laboratory under cooperative agreement W911NF-09-2-0053 (NS-CTA).
Effect of diversity parameter α 32
Groups in the other datasets ● The Harry Potter cast in the movies data set ● NBA: Nowitzki-Chandler-Stevenson of the defending champion Dallas Mavericks (addition of Chandler positive) ● MLB: Ramirez-Blake-Kuo of the LA Dodgers (13/14 with an otherwise unremarkable lineup reached the playoffs in 2008) 33
Related work ● Quasi-cliques ○ frequency of clique occurrence (not score) ○ non-unique labels ● Weighted cliques ○ Bandyopadhyay et al. 2009: no APX guarantees, single clique, extended version does not have as good quality ● Other subgraph types ○ Steiner trees ○ Clique percolation (CFinder) ○ Edge weights are constraints and not part of score ● Diversity of node labels within a clique 34