Algorithms for Large, Sparse Network Alignment

SLIDE 1

Algorithms for Large, Sparse Network Alignment

Mohsen Bayati, David Gleich, Margot Gerritsen, Amin Saberi, Ying Wang @ Stanford University and Jeong Han Kim @ Yonsei University

SLIDE 2

Our motivation

SLIDE 3

Library of Congress subject headings

SLIDE 4

Wikipedia categories

SLIDE 5

Wikipedia categories

SLIDE 6

Wikipedia categories

SLIDE 7

Wikipedia vs Library of Congress

Wikipedia: created by many non-experts, in a distributed way, in a few years.

Library of Congress: developed by few experts, in a centralized way, over a century.

Are they similar?

SLIDE 8

Wikipedia vs Library of Congress

  • How similar are these two data-sets?
  • Can we use one data-set to enrich the other?
  • How can tax-payers' money be spent more wisely to maintain the Library of Congress?

Project funded by the Library of Congress.

SLIDE 9

Are these graphs similar?

SLIDE 10

Network alignment for comparing data-sets

  • Match cross-species vertices (proteins) and edges (protein interactions) → detect functionally similar proteins.

[Figure: Fly and Yeast protein-interaction networks, from the Drexel University School of Biomedical Engineering website; Berger et al.'08, PNAS.]

SLIDE 11

Network alignment for comparing data-sets

  • Find the largest common sub-graph on similar vertices (Singh-Xu-Berger'07,'08).
  • More recently: (Klau'09).

[Figure: Fly and Yeast protein-interaction networks, from the Drexel University School of Biomedical Engineering website; Berger et al.'08, PNAS.]

SLIDE 12

Network alignment for comparing data-sets

  • Database schema matching (Melnik-Garcia Molina-Rahm'02).
  • Computer vision: match a query image to an existing image (Conte-Foggia'04).
  • Ontology matching: match concepts of one ontology to those of another (Svab'07).
  • Websites: match similar parts of the web-graph, e.g. Toyota's USA websites vs. Toyota France.
  • Social networks: teenagers have both fake and real identities.
  • This talk: comparing Wikipedia vs. the Library of Congress.

SLIDE 13

This talk

  • Defining the problem mathematically
  • Quick survey of existing approaches
  • A message-passing algorithm
  • Experiments

– Real data
– Synthetic data

  • Rigorous results

SLIDE 14

Approach: Align the two databases

5,233,829 potential matches.

Goal: find an alignment that matches similar titles and maximizes the total number of overlaps.

One graph: 297,266 nodes, 248,232 links. The other: 205,948 nodes, 422,503 links.

SLIDE 15

Quadratic program formulation

Formulate the problem as a quadratic program (QP).

maximize (total overlap) + (total similarity), subject to linear matching constraints.

Maximizing the similarity alone is easy, but the overlap is NP-hard to maximize: there is a reduction from the MAX-CUT problem. It is NP-hard to obtain better than 87.8% of the optimum overlap unless the unique games conjecture is false (Goemans-Williamson'95).
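The objective and constraints on this slide were an image and did not survive extraction. For reference, the sparse network alignment QP is commonly written as below (a reconstruction in standard notation, not copied from the slide; α and β weight similarity against overlap):

```latex
\begin{aligned}
\text{maximize}\quad & \alpha \sum_{ii'} w_{ii'}\,x_{ii'}
  \;+\; \frac{\beta}{2}\sum_{ii'}\sum_{jj'} S_{ii',jj'}\,x_{ii'}\,x_{jj'}\\
\text{subject to}\quad & \textstyle\sum_{i'} x_{ii'} \le 1 \ \ \forall i,\qquad
  \textstyle\sum_{i} x_{ii'} \le 1 \ \ \forall i',\qquad
  x_{ii'} \in \{0,1\},
\end{aligned}
```

where x_{ii'} = 1 means node i is matched to node i', w_{ii'} is the similarity of the pair, and S_{ii',jj'} = 1 exactly when (i, j) and (i', j') are edges of the two graphs (an overlap).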

SLIDE 16

Quadratic program formulation

maximize (total overlap) + (total similarity), subject to the matching constraints, as on the previous slide.

Related hard problems: 1) maximum common sub-graph; 2) graph isomorphism; 3) maximum clique.

SLIDE 17

Quadratic program formulation

Relaxing the integer constraint still leaves a hard problem: a non-concave maximization.

Heuristic: 1) find a local maximum using SNOPT; 2) round to an integer solution.

SLIDE 18

Naïve linear program (LP) formulation

For sparse graphs, this LP can be solved relatively efficiently.
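The LP itself was an image and is missing here. The standard linearization of the alignment QP (a reconstruction under the same assumed notation, with auxiliary variables y replacing the quadratic overlap terms) is:

```latex
\begin{aligned}
\text{maximize}\quad & \alpha \sum_{ii'} w_{ii'}\,x_{ii'}
  \;+\; \beta \sum_{S_{ii',jj'}=1} y_{ii',jj'}\\
\text{subject to}\quad & y_{ii',jj'} \le x_{ii'},\qquad y_{ii',jj'} \le x_{jj'},\\
& \textstyle\sum_{i'} x_{ii'} \le 1,\qquad \textstyle\sum_{i} x_{ii'} \le 1,\qquad
  0 \le x_{ii'} \le 1 .
\end{aligned}
```

Each y_{ii',jj'} can reach 1 (counting an overlap) only when both of its matches are selected.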

SLIDE 19

Improved LP by Klau’09

Same objective, with some constraints moved into the objective via Lagrange multipliers, plus some other combinatorial constraints.

Both LPs and the QP also produce an upper bound on the optimum.

SLIDE 20

IsoRank (Berger et al’07, 08)

Maximize the total similarity under the matching constraints, but with new weights: the new weights r_{ii'} can be found using an eigenvalue calculation (similar to PageRank).
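A minimal sketch of the PageRank-style score computation hinted at here (an illustration, not the authors' code; the damping factor `alpha`, the uniform start, and the per-step normalization are assumptions):

```python
import numpy as np

def isorank_scores(A, B, E, alpha=0.85, iters=100):
    """Power iteration for topology-aware match scores R[i, i'].

    A, B : 0/1 symmetric adjacency matrices of the two graphs.
    E    : prior similarity of node pairs, entries summing to 1.
    """
    # Column-normalize so each step spreads a node's score over its neighbors.
    PA = A / np.maximum(A.sum(axis=0), 1)
    PB = B / np.maximum(B.sum(axis=0), 1)
    R = np.full((A.shape[0], B.shape[0]), 1.0 / (A.shape[0] * B.shape[0]))
    for _ in range(iters):
        # R[i,i'] <- alpha * (score flowing in from neighbor pairs (j, j'))
        #            + (1 - alpha) * prior similarity
        R = alpha * PA @ R @ PB.T + (1 - alpha) * E
        R /= R.sum()  # keep the scores on a fixed scale
    return R
```

The matches themselves would then come from a maximum-weight matching on the scores R.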

SLIDE 21

Our approach: Belief Propagation (BP)

  • Artificial Intelligence: J. Pearl'88.
  • Decoding of LDPC codes: R. Gallager'63.
  • Cavity method in statistical physics: M. Mezard and G. Parisi'86.

Successful applications in: Bayesian inference, computer vision, coding theory, optimization, constraint satisfaction, systems biology, etc.

SLIDE 22

Our approach: Belief Propagation (BP)

[Figure: factor graph with variable nodes and function nodes.]

Independently, BP was used by Bradde-Braunstein-Mahmoudi-Tria-Weigt-Zecchina'09 for similar problems.

SLIDE 23

Our approach: Belief Propagation (BP)

[Figure: factor graph with variable nodes and function nodes.]

SLIDE 24

Belief Propagation for =0

1) Iterate the following:

For update the following messages on each link of the network.

2) The estimated solution at the end of iteration choose a matching

i.e. pick the link with maximum incoming message.

24 How much i likes to mach to i’
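For β = 0 the problem reduces to maximum-weight bipartite matching, and the message updates above can be written out explicitly. A minimal sketch (an illustration, not the authors' code; dense weights, synchronous updates, and a fixed iteration count are assumptions):

```python
import numpy as np

def bp_matching(w, iters=50):
    """Max-product BP for maximum-weight bipartite matching.

    w[i, j] is the weight of matching left node i to right node j.
    Reaches the optimum matching when it is unique (Bayati-Shah-Sharma'05).
    """
    n = w.shape[0]
    m_lr = np.zeros((n, n))  # m_lr[i, j]: message from left i to right j
    m_rl = np.zeros((n, n))  # m_rl[j, i]: message from right j to left i
    for _ in range(iters):
        new_lr = np.empty_like(m_lr)
        new_rl = np.empty_like(m_rl)
        for i in range(n):
            for j in range(n):
                # "how much i likes to match to j": the pair's weight minus
                # the best competing offer arriving from the other side
                new_lr[i, j] = w[i, j] - np.delete(m_rl[:, i], j).max()
                new_rl[j, i] = w[i, j] - np.delete(m_lr[:, j], i).max()
        m_lr, m_rl = new_lr, new_rl
    # each left node picks the link with the maximum incoming message
    return [int(np.argmax(m_rl[:, i])) for i in range(n)]
```

On a weight matrix whose unique optimum is the diagonal, the decisions stabilize after a few iterations.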

SLIDE 25

Belief Propagation for β > 0

Messages: "how much i likes to match to i'" and "how much ii' likes to have overlap with jj'".

[Figure: factor graph with variable nodes and function nodes.]

SLIDE 26

Algorithm works for =0

(B-Shah-Sharma’05) Each node’s decision is correct for (B-Borgs-Chayes-Zecchina’07): Same algorithm works for any graph when LP relaxation of the problem is integral.

  • Generalizes to b-matchings. (independently by Sanghavi-Malioutov-

Wilskey’07).

  • Works for asynchronous updates as well.

(B-Borgs-Chayes-Zecchina’08): “Belief Propagation” solves the LP relaxation.

  • Can use Belief Propagation messages to find the LP solutions in all cases.

26

SLIDE 27

How about the >0 ?

27

SLIDE 28

Experiment on Synthetic data

Most real-world networks, including Wikipedia and LCSH, have a power-law degree distribution (the fraction of nodes of degree d satisfies P(d) ∝ d^(-γ)).

Setup: add all correct edges. Noise 1) add with probability p. Noise 2) add with probability q.
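The slide's noise model is terse; here is one plausible reading, sketched as code. The exact roles of p and q are an assumption (labeled in the comments), not taken from the slide:

```python
import random

def synthetic_instance(a_edges, n, p, q, seed=0):
    """Build a synthetic alignment instance from a graph A on n nodes.

    Assumed reading of the slide: graph B starts with all of A's
    ("correct") edges; noise 1 adds each absent pair to B with
    probability p; noise 2 adds each wrong candidate match (i, j),
    i != j, with probability q alongside the correct matches (i, i).
    """
    rng = random.Random(seed)
    b_edges = set(a_edges)
    for i in range(n):
        for j in range(i + 1, n):
            if (i, j) not in b_edges and rng.random() < p:
                b_edges.add((i, j))          # noise 1: spurious edge in B
    matches = {(i, i) for i in range(n)}     # all correct matches
    for i in range(n):
        for j in range(n):
            if i != j and rng.random() < q:
                matches.add((i, j))          # noise 2: wrong candidate match
    return b_edges, matches
```

With p = q = 0 the instance is noise-free: B equals A and only the correct matches appear.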

SLIDE 29

Experiment on synthetic data

[Plots: fraction of correct matches vs. noise level p, for q = 0 and q = 0.2; curves for BP, IsoRank, and SNOPT.]

Running times: BP and IsoRank → a few seconds; SNOPT → a few hours.

SLIDE 30

Power-law graph experiments

[Plot: LP upper bound vs. LP, BP, and IsoRank.]

SLIDE 31

Grid graph experiments

[Plot: LP upper bound vs. LP, BP, and IsoRank.]

SLIDE 32

Bioinformatics data: Fly vs Yeast

[Plot: LP, BP, and IsoRank.]

SLIDE 33

Bioinformatics data: Human vs Mouse

[Plot: LP, BP, and IsoRank.]

SLIDE 34

Ontology data: Wiki vs LCSH

[Plot: LP, BP, and IsoRank.]

SLIDE 35

Statistical significance

Create many uniform random samples of LCSH and Wiki with the same node degrees; on these, the objective value drops by 99%.

This is statistical evidence that the two data-sets are very comparable.
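The slide does not say how the degree-preserving random samples were drawn; one standard construction is repeated double edge swaps, sketched here (an illustration under that assumption, not the authors' method):

```python
import random
from collections import Counter

def degree_preserving_sample(edges, swaps=1000, seed=0):
    """Randomize an undirected graph while keeping every node's degree.

    Each step picks two edges (a, b), (c, d) and rewires them to
    (a, d), (c, b) when that creates no self-loop or duplicate edge.
    """
    rng = random.Random(seed)
    e = [tuple(sorted(pair)) for pair in edges]
    present = set(e)
    for _ in range(swaps):
        i, j = rng.sample(range(len(e)), 2)
        (a, b), (c, d) = e[i], e[j]
        x, y = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if len({a, b, c, d}) == 4 and x not in present and y not in present:
            present -= {e[i], e[j]}   # swap keeps every endpoint's degree
            present |= {x, y}
            e[i], e[j] = x, y
    return e

def degrees(edges):
    """Degree of each node, counting both endpoints of every edge."""
    return Counter(v for pair in edges for v in pair)
```

Aligning many such samples gives the null distribution of the objective value.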

SLIDE 36

Some matched titles

SLIDE 37

Enriching the data-sets

  • The approach suggests a few thousand potential links to be tested with human experts at the Library of Congress.

BP matches (examples): History, Ancient; Cultural history; Civilization, Ancient; Ancient People.

SLIDE 38

Conclusions

  • Only BP, IsoRank, and LP can handle large graphs.
  • BP and LP find near-optimum solutions on sparse data.
  • LP produces an upper bound and slightly better results, but is slightly slower.
  • For denser graphs, BP outperforms LP.

SLIDE 39

Thank You!
