  1. Algorithms for Large, Sparse Network Alignment. Mohsen Bayati, David Gleich, Margot Gerritsen, Amin Saberi, Ying Wang @ Stanford University and Jeong Han Kim @ Yonsei University

  2. Our motivation

  3. Library of Congress subject headings

  4. Wikipedia categories

  5. Wikipedia categories

  6. Wikipedia categories

  7. Wikipedia vs. Library of Congress. Wikipedia: created by many non-experts, in a distributed way, in a few years. Library of Congress: developed by a few experts, in a centralized way, over a century. Are they similar?

  8. Wikipedia vs. Library of Congress • How similar are these two data-sets? • Can we use one data-set to enrich the other? • How can taxpayers' money be spent more wisely to maintain the Library of Congress? Project funded by the Library of Congress.

  9. Are these graphs similar?

  10. Network alignment for comparing data-sets • Match cross-species vertices (proteins) and edges (protein interactions) → detect functionally similar proteins (Berger et al.'08, PNAS). [Fly vs. yeast figure from the Drexel University School of Biomedical Engineering website.]

  11. Network alignment for comparing data-sets • Find the largest common sub-graph on similar vertices (Singh-Xu-Berger'07,'08); more recently, (Klau'09). [Same fly vs. yeast figure; Berger et al.'08, PNAS.]

  12. Network alignment for comparing data-sets • Database schema matching (Melnik-Garcia-Molina-Rahm'02). • Computer vision: match a query image to an existing image (Conte-Foggia'04). • Ontology matching: match the concepts of one ontology to those of another (Svab'07). • Websites: match similar parts of the web-graph, e.g., Toyota's USA website vs. Toyota France's. • Social networks: teenagers have both fake and real identities. • This talk: comparing Wikipedia vs. the Library of Congress.

  13. This talk • Defining the problem mathematically • Quick survey of existing approaches • A message-passing algorithm • Experiments: real data and synthetic data • Rigorous results

  14. Approach: align the two databases. The two graphs have 297,266 and 205,948 nodes, 248,232 and 422,503 links, and 5,233,829 potential matches between them. Goal: find an alignment that matches similar titles and maximizes the total number of overlaps.

  15. Quadratic program formulation. Formulate the problem as a quadratic program (QP): maximize total similarity + total overlap, subject to linear matching constraints. Maximizing the similarity alone is easy, but the overlap is NP-hard to maximize: there is a reduction from the MAX-CUT problem, so it is NP-hard to obtain better than 87.8% of the optimum overlap unless the Unique Games Conjecture is false (Goemans-Williamson'95).
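
The QP itself did not survive the transcript. The following is a reconstruction of the standard network-alignment QP, with L (set of potential matches), w (similarity weights), S (overlap indicator), and x (match indicator) as assumed notation:

```latex
\begin{aligned}
\max_{x}\quad & \underbrace{\alpha \sum_{ii' \in L} w_{ii'}\, x_{ii'}}_{\text{total similarity}}
 \;+\; \underbrace{\tfrac{\beta}{2} \sum_{ii' \in L} \sum_{jj' \in L} S_{ii',jj'}\, x_{ii'} x_{jj'}}_{\text{total overlap}} \\
\text{s.t.}\quad & \sum_{i':\, ii' \in L} x_{ii'} \le 1 \;\;\forall i, \qquad
 \sum_{i:\, ii' \in L} x_{ii'} \le 1 \;\;\forall i', \qquad
 x_{ii'} \in \{0,1\},
\end{aligned}
```

where S_{ii',jj'} = 1 exactly when (i,j) is an edge of one graph and (i',j') an edge of the other, so each product x_{ii'} x_{jj'} counts one overlapped edge.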

  16. Quadratic program formulation (the same QP as above). Related NP-hard problems: 1) maximum common sub-graph; 2) graph isomorphism; 3) maximum clique.

  17. Quadratic program formulation. Relaxing the integer constraint → still hard (a non-concave maximization). Heuristic 1: find a local maximum using SNOPT, then round to an integer solution.

  18. Naïve linear program (LP) formulation: relax the integer constraint and linearize the quadratic overlap term (a sketch follows below). For sparse graphs this LP can be solved relatively efficiently.
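
The LP is again missing from the transcript; a standard linearization, which we believe is what the naïve LP refers to here (the y variables are our notation), replaces each product x_{ii'} x_{jj'} by a fresh variable:

```latex
\begin{aligned}
\max_{x,\, y}\quad & \alpha \sum_{ii' \in L} w_{ii'}\, x_{ii'}
 \;+\; \beta \sum_{(ii',\, jj'):\, S_{ii',jj'} = 1} y_{ii',jj'} \\
\text{s.t.}\quad & y_{ii',jj'} \le x_{ii'}, \quad y_{ii',jj'} \le x_{jj'}, \\
 & \sum_{i'} x_{ii'} \le 1, \quad \sum_{i} x_{ii'} \le 1, \quad 0 \le x,\, y \le 1.
\end{aligned}
```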

  19. Improved LP by Klau'09: tighten the naïve LP and move some of its constraints into the objective with Lagrange multipliers, plus some other combinatorial constraints (rough shape below). Both LPs and the QP also produce an upper bound on the optimum.
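
The exact formulation is lost from the transcript; roughly, Klau's relaxation adds symmetry constraints y_{ii',jj'} = y_{jj',ii'} to the LP above and dualizes them with multipliers λ (our rendering of the idea, not necessarily Klau's exact notation):

```latex
\max_{x,\, y}\quad \alpha \sum_{ii'} w_{ii'}\, x_{ii'}
 \;+\; \beta \sum y_{ii',jj'}
 \;+\; \sum \lambda_{ii',jj'} \bigl( y_{ii',jj'} - y_{jj',ii'} \bigr),
```

subject to the same matching and linking constraints as before; the multipliers are then optimized, e.g., by subgradient steps.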

  20. IsoRank (Berger et al.'07,'08): maximize the total similarity alone, but with new weights r_{ii'} that can be found using an eigenvalue calculation (similar to PageRank).
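
The slide's equations are lost; as a rough illustration of the eigenvalue view, here is a minimal Python sketch of an IsoRank-style iteration, a PageRank-like power iteration on the product graph of the two networks. The function name, the prior matrix `H`, and all parameters are our own illustrations, not the authors' code:

```python
import numpy as np

def isorank(A, B, H, alpha=0.85, iters=200, tol=1e-9):
    """Minimal sketch of an IsoRank-style iteration: a PageRank-like
    power iteration on the product graph of A and B, teleporting to a
    prior (nonnegative) similarity matrix H."""
    def col_norm(M):
        d = M.sum(axis=0, keepdims=True).astype(float)
        d[d == 0] = 1.0              # avoid division by zero for isolated nodes
        return M / d
    An, Bn = col_norm(A.astype(float)), col_norm(B.astype(float))
    Hn = H / H.sum()                 # normalized teleport distribution
    R = Hn.copy()
    for _ in range(iters):
        # vec(An R Bn^T) = (Bn kron An) vec(R): one step of the random
        # walk on the product graph, plus teleportation toward Hn.
        R_next = alpha * (An @ R @ Bn.T) + (1 - alpha) * Hn
        if np.abs(R_next - R).sum() < tol:
            break
        R = R_next
    return R                         # R[i, j] ~ similarity of node i to node j
```

One would then extract an alignment from R, e.g., with a bipartite matching on the scores.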

  21. Our approach: Belief Propagation (BP). Roots: decoding of LDPC codes (R. Gallager'63); artificial intelligence (J. Pearl'88); the cavity method in statistical physics (M. Mezard and G. Parisi'86). Successful applications in: Bayesian inference, computer vision, coding theory, optimization, constraint satisfaction, systems biology, etc.

  22. Our approach: Belief Propagation (BP). [Factor graph with variable nodes and function nodes.] Independently, BP was used by Bradde-Braunstein-Mahmoudi-Tira-Weigt-Zecchina'09 for similar problems.

  23. Our approach: Belief Propagation (BP). [Factor graph with variable nodes and function nodes, continued.]

  24. Belief Propagation for β = 0. 1) Iterate the following: for each link of the network, update the message encoding how much i likes to match to i'. 2) At the end of the iterations, choose a matching from the estimated solution, i.e., each node picks the link with the maximum incoming message. (A code sketch follows below.)
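
To make steps 1 and 2 concrete, here is a minimal Python sketch of max-product BP for maximum-weight bipartite matching (the β = 0 case). The function and variable names are ours, and this is a didactic sketch rather than the authors' implementation:

```python
import numpy as np

def bp_matching(W, iters=100):
    """Minimal sketch of max-product BP for maximum-weight bipartite
    matching (beta = 0). W[i, j] is the similarity weight of matching
    left node i to right node j."""
    n, m = W.shape
    msg_lr = np.zeros((n, m))   # message from left i  to right j
    msg_rl = np.zeros((n, m))   # message from right j to left i, stored at [i, j]
    for _ in range(iters):
        new_lr = np.empty_like(msg_lr)
        new_rl = np.empty_like(msg_rl)
        for j in range(m):
            # Best competing offer left node i has from right nodes other
            # than j; initial=0.0 lets a node stay unmatched (<= 1 constraints).
            best = np.delete(msg_rl, j, axis=1).max(axis=1, initial=0.0)
            new_lr[:, j] = W[:, j] - best
        for i in range(n):
            best = np.delete(msg_lr, i, axis=0).max(axis=0, initial=0.0)
            new_rl[i, :] = W[i, :] - best
        msg_lr, msg_rl = new_lr, new_rl
    # Step 2: each left node picks the link with the maximum incoming message.
    return msg_rl.argmax(axis=1)
```

If the iteration has not converged, the per-node decisions can conflict; the slide's step 2 resolves this by reading off a matching from the final messages.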

  25. Belief Propagation for β > 0. [Factor graph with variable nodes and function nodes.] The messages now encode both how much i likes to match to i' and how much ii' likes to have overlap with jj'.

  26. The algorithm works for β = 0. (B-Shah-Sharma'05): each node's decision is correct for bipartite graphs (when the maximum-weight matching is unique). (B-Borgs-Chayes-Zecchina'07): the same algorithm works for any graph when the LP relaxation of the problem is integral; it generalizes to b-matchings (independently by Sanghavi-Malioutov-Willsky'07) and works for asynchronous updates as well. (B-Borgs-Chayes-Zecchina'08): belief propagation solves the LP relaxation, and BP messages can be used to find the LP solution in all cases.

  27. How about β > 0?

  28. Experiment on synthetic data. Most real-world networks, including Wikipedia and LCSH, have a power-law degree distribution (the node degree distribution satisfies P(k) ∝ k^{-γ}). Setup: add all correct edges; noise 1) add edges with probability p; noise 2) add edges with probability q. (A generator sketch follows below.)
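
The slide does not fully specify the noise model, so the following Python sketch is one plausible reading: duplicate a power-law graph, add a spurious edge to the copy per original edge with probability p, and add a wrong candidate match per node with probability q. All names and defaults are illustrative:

```python
import random
import networkx as nx

def synthetic_instance(n=1000, exponent=2.5, p=0.1, q=0.2, seed=0):
    """Hedged sketch of the synthetic alignment instance: a power-law
    graph G, a noisy copy H, and a candidate-match set L containing all
    correct pairs plus some random wrong pairs."""
    rng = random.Random(seed)
    degrees = nx.utils.powerlaw_sequence(n, exponent, seed=seed)
    G = nx.expected_degree_graph([min(d, n - 1) for d in degrees],
                                 selfloops=False, seed=seed)
    H = G.copy()                      # start from all correct edges
    for _ in range(G.number_of_edges()):
        if rng.random() < p:          # noise 1: spurious edges in the copy
            u, v = rng.randrange(n), rng.randrange(n)
            if u != v:
                H.add_edge(u, v)
    L = {(i, i) for i in range(n)}    # candidate matches: all correct pairs
    for i in range(n):
        if rng.random() < q:          # noise 2: spurious candidate matches
            L.add((i, rng.randrange(n)))
    return G, H, L
```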

  29. Experiment on synthetic data. [Two plots of the fraction of correct matches vs. edge-noise probability p: one for q = 0, one for q = 0.2, each comparing BP, IsoRank, and SNOPT.] Running times: BP and IsoRank take a few seconds; SNOPT takes a few hours.

  30. Power-law graph experiments. [Plot comparing the LP upper bound, LP, BP, and IsoRank.]

  31. Grid graph experiments. [Plot comparing the LP upper bound, LP, BP, and IsoRank.]

  32. Bioinformatics data: fly vs. yeast. [Results for BP, IsoRank, and LP.]

  33. Bioinformatics data: human vs. mouse. [Results for BP, IsoRank, and LP.]

  34. Ontology data: Wikipedia vs. LCSH. [Results for BP, IsoRank, and LP.]

  35. Statistical significance. Create many uniform random samples of LCSH and Wikipedia with the same node degrees and re-solve the alignment; the objective value drops by 99%. This is statistical evidence that the two data-sets are very comparable. (A test sketch follows below.)
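
A minimal Python sketch of this test, assuming degree-preserving rewiring as the sampling mechanism; `align_and_score` is a placeholder for whatever alignment routine and objective one uses, which the slide does not specify:

```python
import networkx as nx

def significance_test(G, H, L, align_and_score, samples=20, seed=0):
    """Compare the alignment objective on the real graphs against
    degree-preserving random rewirings of both graphs."""
    real = align_and_score(G, H, L)
    null_scores = []
    for s in range(samples):
        Gr, Hr = G.copy(), H.copy()
        # double_edge_swap rewires edges while keeping every degree fixed.
        for graph, offset in ((Gr, 0), (Hr, 1)):
            e = graph.number_of_edges()
            nx.double_edge_swap(graph, nswap=10 * e, max_tries=100 * e,
                                seed=seed + 2 * s + offset)
        null_scores.append(align_and_score(Gr, Hr, L))
    return real, null_scores
```

A large gap between `real` and the null scores (the slide reports a 99% drop) indicates the two data-sets share far more structure than degree-matched random graphs.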

  36. Some matched titles

  37. Enriching the data-sets • The approach suggests a few thousand potential links to be tested with human experts at the Library of Congress. Example BP matches: Ancient People ↔ Civilization, Ancient; Cultural history ↔ History, Ancient.

  38. Conclusions • Only BP, IsoRank, and LP can handle large graphs. • BP and LP find near-optimum solutions on sparse data. • LP produces an upper bound and slightly better results, but is slightly slower. • For denser graphs, BP outperforms LP.

  39. Thank You!
