
As Strong as the Weakest Link: Mining Diverse Cliques in Weighted Graphs


  1. As Strong as the Weakest Link: Mining Diverse Cliques in Weighted Graphs. Petko Bogdanov (UC Santa Barbara), with Ben Baumer (Smith College), Prithwish Basu (Raytheon BBN), Amotz Bar-Noy (CUNY) and Ambuj K. Singh (UC Santa Barbara). ECML/PKDD, Prague, 2013

  2. Example: Collaboration in sports. Significance of a pair’s success when on a team

  3. Influential groups

  4. Multiple teams

  5. Cliques in gene networks ● Gene interaction networks ● Complexes: interacting functional units* * Leemor Joshua-Tor, Structure and Function of Nucleic Acid Regulatory Complexes, http://www.hhmi.org/research/structure-and-function-nucleic-acid-regulatory-complexes

  6. Cliques in other domains ● Sets of duplicates and near-duplicates in similarity networks ○ images ○ video ○ other complex objects with similarity function ● Co-evolving time series ○ stocks of companies related in a supply chain ○ brain regions co-associated in performing a task* * Hagmann P, Cammoun L, Gigandet X, Meuli R, Honey CJ, Wedeen VJ, Sporns O (2008) Mapping the structural core of human cerebral cortex. PLoS Biology Vol. 6, No. 7, e159

  7. Challenges ● Enumeration of cliques ○ MAX CLIQUE is NP-hard ● Ensuring diversity in the result set ○ Managing overlap “adds” complexity ● Size and density of real-world networks ● How to find the best diverse cliques efficiently while maintaining good quality of the solution?

  8. Outline ● Motivation and examples ● Problem statement and properties ● Proposed solutions ● Experiments ● Conclusion

  9. Basic notions ● A graph G(V,E,w) represents a network of entities V with edges E among them ● w defines weights on edges ○ higher weight means stronger association ● A clique is a complete subgraph, i.e. all edges among the selected entities exist

  10. Clique strength ● Strong ties of all pairwise edges ● A clique is as strong as its weakest link ● “Flat” teams in which all connections are important ● Bigger cliques featuring all strong edges are better
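
A minimal Python sketch of the notions on slides 9 and 10. The weakest-link rule is from the slides; the exact score formula is not shown in this transcript, so `clique_score` (size times weakest link) is only an assumed way to make "bigger cliques with all strong edges are better" concrete. The edge-map layout and all function names are illustrative.

```python
from itertools import combinations

def is_clique(nodes, w):
    """True if every pair of the given nodes is joined by an edge in w."""
    return all(frozenset(p) in w for p in combinations(nodes, 2))

def weakest_link(nodes, w):
    """Weight of the weakest edge among the clique's nodes."""
    return min(w[frozenset(p)] for p in combinations(nodes, 2))

def clique_score(nodes, w):
    """ASSUMPTION: score = clique size * weakest link (not stated on the slides)."""
    return len(nodes) * weakest_link(nodes, w)

# Toy weighted graph: edge weights keyed by unordered node pairs.
w = {frozenset(e): x for e, x in {
    ("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.4, ("c", "d"): 0.7,
}.items()}

print(is_clique(["a", "b", "c"], w))     # all three edges exist
print(weakest_link(["a", "b", "c"], w))  # limited by the b-c edge
```

Here the triangle a-b-c is held back by its single weak edge b-c, even though the other two edges are strong.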

  11. Diversity ● Linear combination of score and diversity via α ● Higher number of distinct nodes in solution means higher diversity

  12. Example: Top-2 cliques. Too much overlap

  13. Example: Top-2 cliques. Slightly lower score but less overlap
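
The trade-off in the two Top-2 examples can be reproduced with a small Python sketch. The slide formula itself did not survive in this transcript, so the form used here, (1 - α) times the total score plus α times the number of distinct nodes, is an assumption consistent with "linear combination of score and diversity via α". The cliques and scores are made-up toy values.

```python
def objective(cliques, scores, alpha):
    """ASSUMED form: (1 - alpha) * sum of clique scores + alpha * distinct nodes."""
    covered = set().union(*cliques) if cliques else set()
    return (1 - alpha) * sum(scores) + alpha * len(covered)

# Two candidate result sets of size 2 (hypothetical scores).
overlapping = objective([{"a", "b", "c"}, {"a", "b", "d"}], [3.0, 2.9], alpha=0.5)
diverse = objective([{"a", "b", "c"}, {"e", "f", "g"}], [3.0, 2.5], alpha=0.5)

print(overlapping, diverse)
```

With α = 0.5 the second pair wins despite its lower total score, because it covers six distinct nodes instead of four. With α = 0 the objective reduces to the total score and the overlapping pair wins.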

  14. Complexity ● m-Diverse k-Structures (mDkS) is NP-hard ○ reduction from SET COVER ● Even if we are interested in sets of arbitrary structure, maximizing diversity is NP-hard [Figure label: Included in solution]

  15. Approximation ● Good news ○ monotonic ○ submodular (diminishing returns) ○ Allows a (1-1/e)-APX ● Challenges ○ Requires greedily finding the next best clique ○ MAX CLIQUE NP-hard to approximate to a constant ● Questions ○ Can we develop a solution with APX guarantees that is fast? Limitations? ○ Can we develop a very fast solution of good quality? [Figure labels: Diminishing return; Candidate to add to solution]
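
The greedy scheme behind the (1-1/e) guarantee can be sketched in Python as follows. This sketch assumes the candidate cliques are already enumerated and scored, which is exactly the expensive part the slide flags as a challenge. The marginal-gain form (score term plus newly covered nodes, weighted by α) and all names are assumptions for illustration.

```python
def marginal_gain(clique, score, covered, alpha):
    """ASSUMED gain: (1 - alpha) * score + alpha * number of newly covered nodes."""
    return (1 - alpha) * score + alpha * len(set(clique) - covered)

def greedy_select(candidates, m, alpha):
    """Pick up to m (clique, score) pairs, each time taking the largest marginal gain."""
    remaining = list(candidates)
    covered, solution = set(), []
    while remaining and len(solution) < m:
        best = max(remaining, key=lambda cs: marginal_gain(cs[0], cs[1], covered, alpha))
        remaining.remove(best)
        solution.append(best)
        covered |= set(best[0])
    return solution

# Hypothetical pre-scored candidates.
candidates = [(("a", "b", "c"), 3.0), (("a", "b", "d"), 2.9), (("e", "f", "g"), 2.5)]
picked = greedy_select(candidates, m=2, alpha=0.5)
print([c for c, _ in picked])
```

Diminishing returns show up directly: the gain of `("a", "b", "d")` drops once `("a", "b", "c")` is in the solution, because only one of its nodes is still new.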

  16. Outline ● Motivation and examples ● Problem statement and properties ● Proposed solutions ● Experiments ● Conclusion

  17. Intuition ● How to obtain good cliques while reducing the cost of enumeration? ● Exploit the distribution of edge weights in a real network. ● Consider good edges first. ● Include good cliques in the solution before all edges have been considered, by bounding the contribution of partial cliques

  18. Upper bound for an incomplete clique contribution ● Optimistic completion of an incomplete clique C ● Current lowest weight will be the lowest in the whole clique ● The rest of the nodes will not overlap
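
A hedged Python sketch of the optimistic completion: the partial clique is assumed to reach size k without lowering its current weakest link, and every node still to be added is assumed to be new to the solution. The contribution formula (size times weakest link for the score, newly covered nodes for diversity, mixed by α) is an assumption, as are all names.

```python
def upper_bound(partial, min_weight, k, covered, alpha):
    """Optimistic contribution of a partial clique once completed to size k.

    ASSUMPTIONS: score = k * weakest link; diversity = newly covered nodes.
    """
    new_in_partial = len(set(partial) - covered)
    still_missing = k - len(partial)  # assumed not to overlap the solution
    return (1 - alpha) * k * min_weight + alpha * (new_in_partial + still_missing)

def contribution(clique, min_weight, covered, alpha):
    """Actual contribution of a complete clique under the same assumed formula."""
    return (1 - alpha) * len(clique) * min_weight + alpha * len(set(clique) - covered)

covered = {"a"}  # node already in the solution
ub = upper_bound(("a", "b"), 0.8, k=3, covered=covered, alpha=0.5)
# One real completion: adding c lowers the weakest link to 0.5.
actual = contribution(("a", "b", "c"), 0.5, covered, alpha=0.5)
print(ub, actual)
```

Any real completion can only lower the weakest link or overlap more, so its contribution never exceeds the bound.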

  19. DiCliQ - threshold and prune

  20. DiCliQ - threshold and prune 1. Enumerate cliques in a thresholded graph 2. Upper bound 3. If there is a candidate with a better score contribution than the best UB, add it to the solution

  21. DiCliQ - threshold and prune 1. Enumerate cliques in a thresholded graph 2. Upper bound 3. If there is a candidate with a better score contribution than the best UB, add it to the solution 4. Lower threshold and repeat
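
The four steps can be illustrated with a deliberately simplified Python sketch. It is not the authors' implementation: cliques are brute-forced, only cliques of exactly k nodes are considered, and the bound used is a coarse one (any clique outside the thresholded graph has a weakest link below the threshold t). The gain formula and all names are assumptions.

```python
from itertools import combinations

def k_cliques(nodes, w, k, t):
    """Brute-force k-cliques whose edges all weigh at least t, with weakest link."""
    for c in combinations(sorted(nodes), k):
        ws = [w.get(frozenset(p)) for p in combinations(c, 2)]
        if all(x is not None and x >= t for x in ws):
            yield c, min(ws)

def threshold_and_prune(w, m, k, alpha):
    nodes = set().union(*w)
    thresholds = sorted(set(w.values()), reverse=True)
    solution, covered = [], set()

    def gain(cs):  # ASSUMED contribution of a complete clique
        return (1 - alpha) * k * cs[1] + alpha * len(set(cs[0]) - covered)

    for i, t in enumerate(thresholds):
        cands = list(k_cliques(nodes, w, k, t))       # step 1
        last = i == len(thresholds) - 1
        # Step 2: cliques not yet enumerated have weakest link < t.
        unseen_ub = float("-inf") if last else (1 - alpha) * k * t + alpha * k
        while len(solution) < m:
            open_cands = [cs for cs in cands if cs[0] not in solution]
            if not open_cands:
                break
            best = max(open_cands, key=gain)
            if gain(best) < unseen_ub:                # step 3 fails
                break
            solution.append(best[0])
            covered |= set(best[0])
        if len(solution) == m:
            break                                     # otherwise step 4: lower t
    return solution

edges = {("a", "b"): 0.9, ("a", "c"): 0.9, ("b", "c"): 0.9, ("a", "d"): 0.8,
         ("b", "d"): 0.8, ("e", "f"): 0.6, ("e", "g"): 0.6, ("f", "g"): 0.6}
w = {frozenset(e): x for e, x in edges.items()}
print(threshold_and_prune(w, m=2, k=3, alpha=0.5))
```

On this toy graph the strongest triangle is accepted at the first threshold without looking at any weaker edge, which is the pruning effect the slides describe.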

  22. DiCliQ - threshold and prune ● Implements GREEDY and hence has a (1-1/e)-approximation factor ● Exhaustive enumeration of all cliques might incur high cost in very large/dense instances ● How to scale up the discovery of diverse cliques without compromising the quality much?

  23. BUDiC - Bottom-up greedy heuristic ● Greedy expansion around a node based on the UB contribution ● Incorporates diversity [Figure labels: A; C; Already in the solution; UB?]

  24. BUDiC - Bottom-up greedy heuristic ● Greedy expansion around a node based on the UB contribution ● Incorporates diversity [Figure labels: A; C; Already in the solution; Grow away from included nodes based on UB]

  25. BUDiC - Bottom-up greedy heuristic ● Greedy expansion around a node based on the UB contribution ● Incorporates diversity ● Repeat for all nodes ● Scales much better: O(m*k*|E|) ● No APX guarantee ● Good quality on real datasets [Figure labels: A; C; Already in the solution; Grow away from included nodes based on UB]
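
A rough Python sketch of the bottom-up idea: grow a clique from each seed node by repeatedly adding the common neighbour with the best optimistic (UB) contribution, keep the best grown clique, and repeat m times. The details are assumptions (fixed clique size k, the UB and gain formulas, tie-breaking by node name), and this naive version makes no attempt to match the O(m*k*|E|) running time stated on the slide.

```python
def grow_from(seed, adj, k, covered, alpha):
    """Greedily grow a k-clique around seed; return (clique, gain) or None."""
    clique, low = [seed], float("inf")
    while len(clique) < k:
        common = set(adj[clique[0]])
        for v in clique[1:]:
            common &= set(adj[v])
        common -= set(clique)
        if not common:
            return None

        def new_low(u):
            return min([low] + [adj[v][u] for v in clique])

        def ub(u):  # ASSUMED optimistic contribution after adding u
            fresh = len((set(clique) | {u}) - covered) + (k - len(clique) - 1)
            return (1 - alpha) * k * new_low(u) + alpha * fresh

        u = max(sorted(common), key=ub)
        low = new_low(u)
        clique.append(u)
    gain = (1 - alpha) * k * low + alpha * len(set(clique) - covered)
    return tuple(sorted(clique)), gain

def bottom_up(adj, m, k, alpha):
    solution, covered = [], set()
    for _ in range(m):
        grown = [grow_from(s, adj, k, covered, alpha) for s in sorted(adj)]
        grown = [g for g in grown if g and g[0] not in solution]
        if not grown:
            break
        best = max(grown, key=lambda g: g[1])
        solution.append(best[0])
        covered |= set(best[0])
    return solution

edges = {("a", "b"): 0.9, ("a", "c"): 0.9, ("b", "c"): 0.9, ("a", "d"): 0.8,
         ("b", "d"): 0.8, ("e", "f"): 0.6, ("e", "g"): 0.6, ("f", "g"): 0.6}
adj = {}
for (u, v), x in edges.items():
    adj.setdefault(u, {})[v] = x
    adj.setdefault(v, {})[u] = x
print(bottom_up(adj, m=2, k=3, alpha=0.5))
```

Once a-b-c is in the solution, growth from seed a prefers d over b or c, which mirrors the "grow away from included nodes" label in the figure.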

  26. Outline ● Motivation and examples ● Problem statement and properties ● Proposed solutions ● Experiments ● Conclusion

  27. Data

  28. Scalability ● Compare running time to a Baseline (No thresholding) and relative quality to iMDV* ● α = 0.5, m = 10, k = 5 [Figure labels: Apx. guarantee; Scalable, High Quality] * S. Bandyopadhyay and M. Bhattacharyya. Mining the largest dense vertexlet in a weighted scale-free graph. Fundam. Inform., 96(1-2):1–25, 2009

  29. Scalability on YeastNet [Figure panels: α=0.5, k=5; α=0.5, m=5]

  30. Quality

  31. Discovering gene complexes

  32. Conclusion ● General results for diverse clique mining ○ application to discovery of effective groups in collaboration ○ complexes in gene networks ○ similarity/correlation graphs ● Two scalable algorithms, one with constant factor approximation ● More than 3 orders of magnitude running time improvement while preserving good quality

  33. Thank You. Q&A. The research was supported by the Army Research Laboratory under cooperative agreement W911NF-09-2-0053 (NS-CTA).

  34. Effect of diversity parameter α

  35. Groups in the other datasets ● The Harry Potter cast in the movies data set ● NBA: Nowitzki-Chandler-Stevenson of the defending champion Dallas Mavericks (addition of Chandler positive) ● MLB: Ramirez-Blake-Kuo of the LA Dodgers (13/14 with an otherwise unremarkable lineup reached the playoffs in 2008)

  36. Related work ● Quasi-cliques ○ frequency of clique occurrence (not score) ○ non-unique labels ● Weighted cliques ○ Bandyopadhyay et al. 2009: no APX guarantees, single clique, extended version does not have as good quality ● Other subgraph types ○ Steiner trees ○ Clique percolation (CFinder) ○ Edge weights are constraints and not part of score ● Diversity of nodes labels within a clique
