Algorithms for Large, Sparse Network Alignment

SLIDE 1

Algorithms for Large, Sparse Network Alignment

Mohsen Bayati, David Gleich, Margot Gerritsen, Amin Saberi, Ying Wang @ Stanford University and Jeong Han Kim @ Yonsei University

SLIDE 2

Our motivation

SLIDE 3

Library of Congress subject headings

SLIDE 4

Wikipedia categories

SLIDE 5

Wikipedia categories

SLIDE 6

Wikipedia categories

SLIDE 7

Wikipedia vs Library of Congress

Wikipedia: created by many non-experts, in a distributed way, in a few years.

Library of Congress: developed by few experts, in a centralized way, over a century.

Are they similar?

SLIDE 8

Wikipedia vs Library of Congress

  • How similar are these two data-sets?
  • Can we use one data-set to enrich the other?
  • How can tax-payers' money be spent more wisely to maintain the Library of Congress?

Project funded by the Library of Congress.

SLIDE 9

Are these graphs similar?

SLIDE 10

Network alignment for comparing data-sets

  • Match cross-species vertices (proteins) and edges (protein interactions) → detect functionally similar proteins.

[Figure: Fly and Yeast protein-interaction networks, from the Drexel University School of Biomedical Engineering website; Berger et al.'08, PNAS.]

SLIDE 11

Network alignment for comparing data-sets

  • Find the largest common sub-graph on similar vertices (Singh-Xu-Berger'07,'08).
  • More recently: (Klau'09).

[Figure: Fly and Yeast protein-interaction networks, from the Drexel University School of Biomedical Engineering website; Berger et al.'08, PNAS.]

SLIDE 12

Network alignment for comparing data-sets

  • Database schema matching (Melnik-Garcia Molina-Rahm'02).
  • Computer vision: match a query image to an existing image (Conte-Foggia'04).
  • Ontology matching: match concepts of one ontology to those of another (Svab'07).
  • Websites: match similar parts of the web-graph, e.g. Toyota's USA websites vs. Toyota France.
  • Social networks: teenagers have both fake and real identities.
  • This talk: comparing Wikipedia vs. the Library of Congress.

SLIDE 13

This talk

  • Defining the problem mathematically
  • Quick survey of existing approaches
  • A message-passing algorithm
  • Experiments

– Real data
– Synthetic data

  • Rigorous results

SLIDE 14

Approach: Align the two databases

5,233,829 potential matches.

Goal: find an alignment that matches similar titles and maximizes the total number of overlaps.

One graph: 297,266 nodes, 248,232 links. The other: 205,948 nodes, 422,503 links.

SLIDE 15

Quadratic program formulation

Formulate the problem as a quadratic program (QP).

maximize (total overlap) + (total similarity), subject to linear matching constraints.

Maximizing the similarity alone is easy, but the overlap is NP-hard to maximize: there is a reduction from the MAX-CUT problem. It is NP-hard to obtain better than 87.8% of the optimum overlap unless the unique games conjecture is false (Goemans-Williamson'95).
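The objective and constraints on this slide were an image and did not survive extraction. For reference, the sparse network alignment QP is commonly written as below (a reconstruction in standard notation, not copied from the slide; α and β weight similarity against overlap):

```latex
\begin{aligned}
\text{maximize}\quad & \alpha \sum_{ii'} w_{ii'}\,x_{ii'}
  \;+\; \frac{\beta}{2}\sum_{ii'}\sum_{jj'} S_{ii',jj'}\,x_{ii'}\,x_{jj'}\\
\text{subject to}\quad & \textstyle\sum_{i'} x_{ii'} \le 1 \ \ \forall i,\qquad
  \textstyle\sum_{i} x_{ii'} \le 1 \ \ \forall i',\qquad
  x_{ii'} \in \{0,1\},
\end{aligned}
```

where x_{ii'} = 1 means node i is matched to node i', w_{ii'} is the similarity of the pair, and S_{ii',jj'} = 1 exactly when (i, j) and (i', j') are edges of the two graphs (an overlap).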

SLIDE 16

Quadratic program formulation

maximize (total overlap) + (total similarity), subject to the matching constraints, as on the previous slide.

Related hard problems: 1) maximum common sub-graph; 2) graph isomorphism; 3) maximum clique.

SLIDE 17

Quadratic program formulation

Relaxing the integer constraint still leaves a hard problem: a non-concave maximization.

Heuristic: 1) find a local maximum using SNOPT; 2) round to an integer solution.

SLIDE 18

Naïve linear program (LP) formulation

For sparse graphs, this LP can be solved relatively efficiently.
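The LP itself was an image and is missing here. The standard linearization of the alignment QP (a reconstruction under the same assumed notation, with auxiliary variables y replacing the quadratic overlap terms) is:

```latex
\begin{aligned}
\text{maximize}\quad & \alpha \sum_{ii'} w_{ii'}\,x_{ii'}
  \;+\; \beta \sum_{S_{ii',jj'}=1} y_{ii',jj'}\\
\text{subject to}\quad & y_{ii',jj'} \le x_{ii'},\qquad y_{ii',jj'} \le x_{jj'},\\
& \textstyle\sum_{i'} x_{ii'} \le 1,\qquad \textstyle\sum_{i} x_{ii'} \le 1,\qquad
  0 \le x_{ii'} \le 1 .
\end{aligned}
```

Each y_{ii',jj'} can reach 1 (counting an overlap) only when both of its matches are selected.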

SLIDE 19

Improved LP by Klau’09

Same objective, with some constraints moved into the objective via Lagrange multipliers, plus some other combinatorial constraints.

Both LPs and the QP also produce an upper bound on the optimum.

SLIDE 20

IsoRank (Berger et al’07, 08)

Maximize the total similarity under the matching constraints, but with new weights: the new weights r_{ii'} can be found using an eigenvalue calculation (similar to PageRank).
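A minimal sketch of the PageRank-style score computation hinted at here (an illustration, not the authors' code; the damping factor `alpha`, the uniform start, and the per-step normalization are assumptions):

```python
import numpy as np

def isorank_scores(A, B, E, alpha=0.85, iters=100):
    """Power iteration for topology-aware match scores R[i, i'].

    A, B : 0/1 symmetric adjacency matrices of the two graphs.
    E    : prior similarity of node pairs, entries summing to 1.
    """
    # Column-normalize so each step spreads a node's score over its neighbors.
    PA = A / np.maximum(A.sum(axis=0), 1)
    PB = B / np.maximum(B.sum(axis=0), 1)
    R = np.full((A.shape[0], B.shape[0]), 1.0 / (A.shape[0] * B.shape[0]))
    for _ in range(iters):
        # R[i,i'] <- alpha * (score flowing in from neighbor pairs (j, j'))
        #            + (1 - alpha) * prior similarity
        R = alpha * PA @ R @ PB.T + (1 - alpha) * E
        R /= R.sum()  # keep the scores on a fixed scale
    return R
```

The matches themselves would then come from a maximum-weight matching on the scores R.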

SLIDE 21

Our approach: Belief Propagation (BP)

  • Artificial Intelligence: J. Pearl'88.
  • Decoding of LDPC codes: R. Gallager'63.
  • Cavity method in statistical physics: M. Mezard and G. Parisi'86.

Successful applications in: Bayesian inference, computer vision, coding theory, optimization, constraint satisfaction, systems biology, etc.

SLIDE 22

Our approach: Belief Propagation (BP)

[Figure: factor graph with variable nodes and function nodes.]

Independently, BP was used by Bradde-Braunstein-Mahmoudi-Tria-Weigt-Zecchina'09 for similar problems.

SLIDE 23

Our approach: Belief Propagation (BP)

[Figure: factor graph with variable nodes and function nodes.]

SLIDE 24

Belief Propagation for =0

1) Iterate the following:

For update the following messages on each link of the network.

2) The estimated solution at the end of iteration choose a matching

i.e. pick the link with maximum incoming message.

24 How much i likes to mach to i’
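For β = 0 the problem reduces to maximum-weight bipartite matching, and the message updates above can be written out explicitly. A minimal sketch (an illustration, not the authors' code; dense weights, synchronous updates, and a fixed iteration count are assumptions):

```python
import numpy as np

def bp_matching(w, iters=50):
    """Max-product BP for maximum-weight bipartite matching.

    w[i, j] is the weight of matching left node i to right node j.
    Reaches the optimum matching when it is unique (Bayati-Shah-Sharma'05).
    """
    n = w.shape[0]
    m_lr = np.zeros((n, n))  # m_lr[i, j]: message from left i to right j
    m_rl = np.zeros((n, n))  # m_rl[j, i]: message from right j to left i
    for _ in range(iters):
        new_lr = np.empty_like(m_lr)
        new_rl = np.empty_like(m_rl)
        for i in range(n):
            for j in range(n):
                # "how much i likes to match to j": the pair's weight minus
                # the best competing offer arriving from the other side
                new_lr[i, j] = w[i, j] - np.delete(m_rl[:, i], j).max()
                new_rl[j, i] = w[i, j] - np.delete(m_lr[:, j], i).max()
        m_lr, m_rl = new_lr, new_rl
    # each left node picks the link with the maximum incoming message
    return [int(np.argmax(m_rl[:, i])) for i in range(n)]
```

On a weight matrix whose unique optimum is the diagonal, the decisions stabilize after a few iterations.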

SLIDE 25

Belief Propagation for β > 0

Messages: "how much i likes to match to i'" and "how much ii' likes to have overlap with jj'".

[Figure: factor graph with variable nodes and function nodes.]

SLIDE 26

Algorithm works for =0

(B-Shah-Sharma’05) Each node’s decision is correct for (B-Borgs-Chayes-Zecchina’07): Same algorithm works for any graph when LP relaxation of the problem is integral.

  • Generalizes to b-matchings. (independently by Sanghavi-Malioutov-

Wilskey’07).

  • Works for asynchronous updates as well.

(B-Borgs-Chayes-Zecchina’08): “Belief Propagation” solves the LP relaxation.

  • Can use Belief Propagation messages to find the LP solutions in all cases.

26

SLIDE 27

How about the >0 ?

27

SLIDE 28

Experiment on Synthetic data

Most real-world networks, including Wikipedia and LCSH, have a power-law degree distribution (the fraction of nodes of degree d satisfies P(d) ∝ d^(-γ)).

Setup: add all correct edges. Noise 1) add with probability p. Noise 2) add with probability q.
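The slide's noise model is terse; here is one plausible reading, sketched as code. The exact roles of p and q are an assumption (labeled in the comments), not taken from the slide:

```python
import random

def synthetic_instance(a_edges, n, p, q, seed=0):
    """Build a synthetic alignment instance from a graph A on n nodes.

    Assumed reading of the slide: graph B starts with all of A's
    ("correct") edges; noise 1 adds each absent pair to B with
    probability p; noise 2 adds each wrong candidate match (i, j),
    i != j, with probability q alongside the correct matches (i, i).
    """
    rng = random.Random(seed)
    b_edges = set(a_edges)
    for i in range(n):
        for j in range(i + 1, n):
            if (i, j) not in b_edges and rng.random() < p:
                b_edges.add((i, j))          # noise 1: spurious edge in B
    matches = {(i, i) for i in range(n)}     # all correct matches
    for i in range(n):
        for j in range(n):
            if i != j and rng.random() < q:
                matches.add((i, j))          # noise 2: wrong candidate match
    return b_edges, matches
```

With p = q = 0 the instance is noise-free: B equals A and only the correct matches appear.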

SLIDE 29

Experiment on synthetic data

[Plots: fraction of correct matches vs. noise level p, for q = 0 and q = 0.2; curves for BP, IsoRank, and SNOPT.]

Running times: BP and IsoRank → a few seconds; SNOPT → a few hours.

SLIDE 30

Power-law graph experiments

[Plot: LP upper bound vs. LP, BP, and IsoRank.]

SLIDE 31

Grid graph experiments

[Plot: LP upper bound vs. LP, BP, and IsoRank.]

SLIDE 32

Bioinformatics data: Fly vs Yeast

[Plot: LP, BP, and IsoRank.]

SLIDE 33

Bioinformatics data: Human vs Mouse

[Plot: LP, BP, and IsoRank.]

SLIDE 34

Ontology data: Wiki vs LCSH

[Plot: LP, BP, and IsoRank.]

SLIDE 35

Statistical significance

Create many uniform random samples of LCSH and Wiki with the same node degrees; on these, the objective value drops by 99%.

This is statistical evidence that the two data-sets are very comparable.
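The slide does not say how the degree-preserving random samples were drawn; one standard construction is repeated double edge swaps, sketched here (an illustration under that assumption, not the authors' method):

```python
import random
from collections import Counter

def degree_preserving_sample(edges, swaps=1000, seed=0):
    """Randomize an undirected graph while keeping every node's degree.

    Each step picks two edges (a, b), (c, d) and rewires them to
    (a, d), (c, b) when that creates no self-loop or duplicate edge.
    """
    rng = random.Random(seed)
    e = [tuple(sorted(pair)) for pair in edges]
    present = set(e)
    for _ in range(swaps):
        i, j = rng.sample(range(len(e)), 2)
        (a, b), (c, d) = e[i], e[j]
        x, y = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if len({a, b, c, d}) == 4 and x not in present and y not in present:
            present -= {e[i], e[j]}   # swap keeps every endpoint's degree
            present |= {x, y}
            e[i], e[j] = x, y
    return e

def degrees(edges):
    """Degree of each node, counting both endpoints of every edge."""
    return Counter(v for pair in edges for v in pair)
```

Aligning many such samples gives the null distribution of the objective value.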

SLIDE 36

Some matched titles

SLIDE 37

Enriching the data-sets

  • The approach suggests a few thousand potential links to be tested with human experts at the Library of Congress.

BP matches (examples): History, Ancient; Cultural history; Civilization, Ancient; Ancient People.

SLIDE 38

Conclusions

  • Only BP, IsoRank, and LP can handle large graphs.
  • BP and LP find near-optimum solutions on sparse data.
  • LP produces an upper bound and slightly better results, but is slightly slower.
  • For denser graphs, BP outperforms LP.

SLIDE 39

Thank You!
