Algorithms for Large, Sparse Network Alignment
Mohsen Bayati, David Gleich, Margot Gerritsen, Amin Saberi, Ying Wang @ Stanford University and Jeong Han Kim @ Yonsei University
Our motivation: Library of Congress subject headings vs. Wikipedia.
Wikipedia: created by many, non-experts, in a distributed way, in a few years.
Library of Congress subject headings: developed by few, experts, in a centralized way.
Are they similar?
Project funded by the Library of Congress.
Protein interaction networks: detect functionally similar proteins.
Figures from the Drexel University School of Biomedical Engineering website; Berger et al. '08, PNAS. Networks: fly and yeast.
(Singh–Xu–Berger '07, '08).
– Schema matching (Melnik–Garcia-Molina–Rahm '02).
– Graph matching in pattern recognition (Conte–Foggia '04).
– (Svab '07).
– Matching websites, e.g., Toyota's USA website vs. Toyota France.
Experiments on:
– Real data
– Synthetic data
5,233,829 potential matches. Goal: find an alignment that matches similar titles and maximizes the total number of overlaps.
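As a toy illustration of the objective (not the authors' code): a pair of matches i → i' and j → j' contributes one overlap when (i, j) is an edge of the first graph and (i', j') is an edge of the second. The function name and edge-list representation below are assumptions.

```python
from itertools import combinations

def total_overlap(edges_a, edges_b, matching):
    """Count overlaps of a (partial) matching, given as a dict mapping
    nodes of graph A to nodes of graph B."""
    ea = {frozenset(e) for e in edges_a}
    eb = {frozenset(e) for e in edges_b}
    count = 0
    for i, j in combinations(matching, 2):
        # an overlap: both endpoints are matched, and the edge exists on both sides
        if frozenset((i, j)) in ea and frozenset((matching[i], matching[j])) in eb:
            count += 1
    return count
```

For example, matching a triangle onto a two-edge path node-by-node yields two overlaps: the third triangle edge has no counterpart in the path.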
The two networks: 297,266 nodes with 248,232 links, and 205,948 nodes with 422,503 links.
Formulate the problem as a quadratic program (QP).
$$\max_{x}\;\; \beta \underbrace{\sum_{\substack{(i,j)\in E_A \\ (i',j')\in E_B}} x_{ii'}\, x_{jj'}}_{\text{total overlap}} \;+\; \alpha \underbrace{\sum_{ii'} w_{ii'}\, x_{ii'}}_{\text{total similarity}}$$

subject to $\sum_{i'} x_{ii'} \le 1$ for all $i$, $\sum_{i} x_{ii'} \le 1$ for all $i'$, and $x_{ii'} \in \{0,1\}$.
Maximizing the similarity alone is easy (it is a maximum-weight bipartite matching problem), but the overlap is NP-hard to maximize: there is a reduction from the MAX-CUT problem.
The matching constraints are linear.
It is NP-hard to obtain better than 87.8% of the optimum overlap unless the unique games conjecture is false (Goemans–Williamson '95).
Related hard problems: 1) maximum common subgraph; 2) graph isomorphism; 3) maximum clique.
Relaxing the integer constraint $x_{ii'} \in \{0,1\}$ to $0 \le x_{ii'} \le 1$ leaves the problem still hard (a non-concave maximization).
Heuristic: 1) find a local maximum of the relaxed QP using SNOPT; 2) round to an integer solution.
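One simple way to carry out the rounding step (a sketch, not necessarily the rounding used in the experiments): sort the fractional values and greedily keep the largest ones that respect the matching constraints.

```python
def greedy_round(x_frac):
    """x_frac: dict (i, i') -> fractional value in [0, 1].
    Greedily keep the largest values subject to the matching constraints."""
    used_a, used_b, matching = set(), set(), {}
    for (i, ip), val in sorted(x_frac.items(), key=lambda kv: -kv[1]):
        # accept a match only if neither endpoint is already matched
        if i not in used_a and ip not in used_b and val > 0:
            matching[i] = ip
            used_a.add(i)
            used_b.add(ip)
    return matching
```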
Linearizing each product $x_{ii'} x_{jj'}$ with a variable $y_{ii',jj'}$ gives an LP:

$$\max_{x,y}\;\; \beta \sum_{(ii',jj')} y_{ii',jj'} \;+\; \alpha \sum_{ii'} w_{ii'}\, x_{ii'}$$

subject to the matching constraints, $y_{ii',jj'} \le x_{ii'}$, $y_{ii',jj'} \le x_{jj'}$, and $x, y \ge 0$. For sparse graphs this LP can be solved relatively efficiently (using a Lagrange multiplier for the coupling constraints).
Both the LPs and the QP relaxations also produce an upper bound on the optimum.
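Why the LP bounds the QP from above: on 0/1 points, setting each linearization variable to $y = \min(x_e, x_f)$ satisfies $y \le x_e$ and $y \le x_f$ while reproducing the product $x_e x_f$ exactly, so every integer QP solution is LP-feasible with the same objective value. A quick exhaustive check on a hypothetical set of overlap pairs:

```python
import itertools

# hypothetical overlap pairs (e, f): each contributes x_e * x_f to the QP objective
pairs = [((0, 0), (1, 1)), ((0, 1), (1, 0)), ((1, 1), (2, 2))]
variables = sorted({e for e, f in pairs} | {f for e, f in pairs})

for bits in itertools.product([0, 1], repeat=len(variables)):
    x = dict(zip(variables, bits))
    quad = sum(x[e] * x[f] for e, f in pairs)     # quadratic overlap term
    lin = sum(min(x[e], x[f]) for e, f in pairs)  # y_ef = min(x_e, x_f) is LP-feasible
    assert quad == lin
print("linearization is exact on all 0/1 points")
```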
A tighter relaxation adds some other combinatorial constraints to the program above.
The new weights $r_{ii'}$ can be found using an eigenvalue calculation (similar to PageRank).
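The eigenvalue/PageRank analogy can be sketched as a power iteration on the product graph, where each product node $ii'$ spreads its score to neighboring pairs and a fraction of the weight comes from the prior similarity. The construction below (neighbor lists, damping factor `alpha`, function name) is an illustrative assumption, not the authors' exact formulation.

```python
def similarity_scores(neighbors, w, alpha=0.85, iters=100):
    """Power iteration for PageRank-like scores.
    neighbors: dict node -> list of neighboring product-graph nodes.
    w: dict node -> prior similarity (assumed normalized to sum to 1)."""
    r = dict(w)
    for _ in range(iters):
        new_r = {}
        for v in neighbors:
            # mass flowing in: each neighbor spreads its score uniformly over its links
            incoming = sum(r[u] / len(neighbors[u]) for u in neighbors[v])
            new_r[v] = alpha * incoming + (1 - alpha) * w[v]
        r = new_r
    return r
```

On a small path graph with a uniform prior, the center node ends up with the highest score, since more mass flows through it.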
Belief propagation has roots in artificial intelligence, the decoding of LDPC codes (R. Gallager '63), and the cavity method in statistical physics.
Successful applications in Bayesian inference, computer vision, coding theory, optimization, constraint satisfaction, systems biology, etc.
[Factor graph: variable nodes and function nodes.]
Independently, BP was used by Bradde–Braunstein–Mahmoudi–Tira–Weigt–Zecchina '09 for similar problems.
1) Iterate: for $t = 1, 2, \ldots$, update the messages on each link of the network.
2) At the end of iteration $t$, estimate a solution by choosing a matching: each node picks the link with the maximum incoming message.
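The two steps above can be sketched for the simplest case, maximum-weight bipartite matching: min-sum messages travel on the links, and decoding picks, for each node, the link with the maximum incoming message. The synchronous update and unnormalized messages are simplifying assumptions.

```python
def bp_matching(w, iters=20):
    """w: n x n weight matrix; returns a left -> right matching via min-sum BP."""
    n = len(w)
    m_ab = [[0.0] * n for _ in range(n)]  # m_ab[i][j]: message from left i to right j
    m_ba = [[0.0] * n for _ in range(n)]  # m_ba[i][j]: message from right j to left i
    for _ in range(iters):
        # each message: edge weight minus the best competing incoming message
        new_ab = [[w[i][j] - max(m_ba[i][k] for k in range(n) if k != j)
                   for j in range(n)] for i in range(n)]
        new_ba = [[w[i][j] - max(m_ab[k][j] for k in range(n) if k != i)
                   for j in range(n)] for i in range(n)]
        m_ab, m_ba = new_ab, new_ba
    # decode: each left node picks the link with the maximum incoming message
    return {i: max(range(n), key=lambda j: m_ba[i][j]) for i in range(n)}
```

On a weight matrix with a dominant diagonal, the decoded matching is the identity; the messages themselves grow without bound here, but the decoding stabilizes.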
The message encodes how much $i$ "likes" to match to $i'$.
Messages encode how much $i$ likes to match to $i'$, and how much the pair $ii'$ likes to have overlap with $jj'$. [Factor graph: variable nodes and function nodes.]
(B–Shah–Sharma '05): each node's decision is correct on bipartite graphs, i.e., BP finds the maximum-weight matching.
(B–Borgs–Chayes–Zecchina '07): the same algorithm works for any graph when the LP relaxation of the problem is integral (Willsky '07).
(B–Borgs–Chayes–Zecchina '08): "Belief Propagation" solves the LP relaxation.
Most real-world networks, including Wikipedia and LCSH, have a power-law degree distribution (the fraction of nodes with degree $d$ satisfies $P(d) \propto d^{-\gamma}$).
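A power-law exponent can be estimated from degree data with a log-log fit. The sketch below generates synthetic power-law degrees by inverse-transform sampling and fits the slope by least squares; the cutoff and the exponent $\gamma = 2.5$ are arbitrary illustration choices, not values from the talk.

```python
import math
import random

random.seed(1)
gamma, d_min, n = 2.5, 1, 200_000

# inverse-transform samples from a continuous power law, floored to integer degrees
def u():
    return 1.0 - random.random()  # uniform in (0, 1], avoids 0 ** negative

degrees = [int(d_min * u() ** (-1.0 / (gamma - 1))) for _ in range(n)]

# empirical degree frequencies
freq = {}
for d in degrees:
    freq[d] = freq.get(d, 0) + 1

# least-squares slope of log P(d) against log d, keeping well-sampled degrees
pts = [(math.log(d), math.log(c / n)) for d, c in freq.items() if c >= 50]
xbar = sum(x for x, _ in pts) / len(pts)
ybar = sum(y for _, y in pts) / len(pts)
slope = sum((x - xbar) * (y - ybar) for x, y in pts) / sum((x - xbar) ** 2 for x, _ in pts)
print(f"fitted log-log slope: {slope:.2f} (sampling exponent: -{gamma})")
```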
Synthetic data generation:
– Add all correct edges.
– Noise 1: add with probability p.
– Noise 2: add with probability q.
[Figure: percentage of correct matches for BP, IsoRank, and SNOPT as noise increases; left panel q = 0 (varying p), right panel q = 0.2.]
Running times: BP and IsoRank take a few seconds; SNOPT takes a few hours.
[Figures: objective values achieved by LP, BP, and IsoRank on the real data, compared against the LP upper bound.]
Create many uniform random samples of LCSH and Wikipedia with the same node degrees; the objective value drops by 99%.
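Degree-preserving random samples can be generated, for example, with double-edge swaps (a standard randomization scheme; whether the authors used exactly this one is an assumption). Each accepted swap rewires two edges while leaving every node's degree unchanged:

```python
import random

def rewire(edges, swaps, seed=0):
    """Degree-preserving randomization by repeated double-edge swaps:
    pick two edges (a, b) and (c, d), replace them with (a, d) and (c, b)
    when this creates no self-loop or parallel edge."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edges}
    for _ in range(swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # shared endpoint: swap would create a self-loop
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue  # swap would create a parallel edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def degrees(edges):
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg
```

Scoring the alignment objective on many such rewired copies gives the null distribution that the observed objective is compared against.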
This is statistical evidence that the two datasets are very comparable.
Example headings: History, Ancient; Cultural history; Civilization, Ancient; Ancient People.
Evaluated with human experts at the Library of Congress.
[Table: example BP matches.]