Network Alignment Using Isomorphic Graphlets
Harrison Lee
21 Nov 2019
Problem Introduction
● Aligning two networks is a useful way to evaluate how similar they are
  ○ If two networks are very similar, we can roughly overlay one on the other, because their topologies match
● To compare larger networks, we need to define a cost matrix to build our alignment on
  ○ How do we objectively and quickly define how similar two nodes are?
  ○ Degree? Neighbors’ degrees? How can we capture local network topology?
More on Network Alignment Applications
● Network alignment has strong uses in bioinformatics
  ○ A good network alignment implies strong topological similarity between the two networks
● For example, aligning protein-protein interaction (PPI) networks
  ○ Similarities between PPI networks could help highlight analogous metabolic pathways
  ○ Seeing these similarities can help us infer, or direct research into, proteins and their functions, based on knowledge of better-studied analogues
    ■ Model organisms, such as yeast or rats, are much easier to study than others, such as humans
    ■ Alternatively, alignment can highlight differences in proteins and metabolic pathways, suggesting a way to explain potential differences in function
The Networks We Initially Test On
● In this project, as in existing papers, we compare protein-protein interaction networks, though these algorithms are not limited to them
  ○ We use PPI networks for H. pylori (stomach ulcers; n = 700 nodes, m = 1425 edges), Saccharomyces (yeast; n = 5113, m = 22315) and human (n = 9141, m = 41456)
● For testing, we look at both the largest-component and the regular high-confidence PPI networks, each noised for comparison
● For automated larger-scale testing, we can also generate networks and compare each one to a noised version with a defined similarity, produced by randomly rewiring edges
  ○ The generator can be swapped for other network models to test different network types and their corresponding topologies
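The rewiring step can be sketched as below. This is only an illustration under stated assumptions: the function name `rewire_edges`, the `fraction` parameter, and the exact noise model (remove a fraction of edges, add the same number of edges that were not in the original network) are hypothetical, not taken from the slides.

```python
import random


def rewire_edges(edges, nodes, fraction, seed=0):
    """Return a noised copy of an undirected edge set.

    Assumed noise model (hypothetical): remove `fraction` of the edges and
    add the same number of random edges that were absent from the original.
    The graph must have enough non-edges for the loop below to finish.
    """
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    node_list = sorted(nodes)
    n_rewire = round(fraction * len(edge_set))

    # Sort before sampling so the result is reproducible for a given seed.
    removed = rng.sample(sorted(edge_set, key=sorted), n_rewire)
    noised = edge_set - set(removed)

    # Add random new edges until the edge count matches the original.
    while len(noised) < len(edge_set):
        u, v = rng.sample(node_list, 2)
        candidate = frozenset((u, v))
        if candidate not in edge_set:  # never re-add an original edge
            noised.add(candidate)
    return noised


# Usage: a 20-node ring with 25% of its edges rewired.
ring = [(i, (i + 1) % 20) for i in range(20)]
noised_ring = rewire_edges(ring, range(20), 0.25, seed=1)
shared = len(noised_ring & {frozenset(e) for e in ring})
print(f"{shared} of {len(ring)} original edges kept")
```

Because new edges are never drawn from the original edge set, the fraction of shared edges is exactly 1 minus `fraction`, which gives the "defined similarity" between the two networks.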
Current Alignment Methods Are Slow
● Exact network alignment contains the subgraph isomorphism problem, which is NP-complete
● For networks such as PPI networks, with hundreds to thousands of nodes, exact alignment is infeasibly slow without heuristics
  ○ This does not even consider possible applications to huge data sets such as social network data, which can reach billions of nodes
● Heuristics like graphlet orbits can cut run times without giving up much accuracy relative to exact algorithms, whose run times are untenable
● Here we look at graphlet orbits as a way to build a network alignment mechanism and, from it, a similarity score
  ○ While we initially test with PPI networks, the method works on patterns in graph topology, so it could be applied to other types of network, constrained primarily by network size
Graphlets, Orbits
● To build the cost matrix, we begin by looking at graphlets: small connected subgraphs, counted up to isomorphism
● Within a graphlet, a node orbit is a set of node positions that are equivalent under the graphlet's symmetries
  ○ We begin set-up by constructing all the possible shapes (edge configurations) for small graphlets of fewer than 5 nodes
[Figure: small graphlet examples. Panels: 2-node (1), 3-node (2), 4-node (5). Note on figure: nodes should be independent of rotation and rearrangement.]
Larger Graphlet Examples
● These larger graphlets better capture local neighborhoods around nodes
  ○ With more nodes per graphlet, we reach further from each source node, so two nodes that share the same large orbits are more strongly similar
  ○ We will use the number of times each node appears in each orbit to construct our cost matrix
● Here are all the 5-node graphlets, from an earlier work (Malod-Dognin & Przulj, 2015)
  ○ The number of orbits grows rapidly with graphlet size: graphlets of 2 to 5 nodes have 73 orbits in total

# Nodes | # Orbits
2 | 1
3 | 3
4 | 11
5 | 58

[Figure: the 5-node graphlets.]
Malod-Dognin, Noël & Przulj, Natasa. (2015). L-GRAAL: Lagrangian graphlet-based network aligner. Bioinformatics. 10.1093/bioinformatics/btv130.
Treelets
● In this project, we build on the existing work by using treelets instead of graphlets. Treelets are a subset of graphlets.
  ○ A treelet must be a tree as well as a graphlet; orbits are defined in the same way
  ○ Trees: connected and acyclic, so every pair of nodes is joined by exactly one path
● As with graphlets, we begin by constructing all the possible small treelets
  ○ 2-node graphlets and treelets are the same, but with more nodes the difference becomes more noticeable
● The number of orbits increases less rapidly, allowing us to feasibly use larger treelets

# Nodes | # Treelet Orbits
2 | 1
3 | 2
4 | 4
5 | 9
6 | 20
7 | 48

[Figure: examples of valid treelets, and invalid examples (these are graphlets, not treelets).]
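The tree condition above (connected and acyclic) can be checked directly on a candidate subgraph. The sketch below is illustrative only: `is_tree` is a hypothetical helper name, and it assumes a simple undirected graph with a non-empty node list and no duplicate edges.

```python
def is_tree(nodes, edges):
    """Return True if the undirected graph (nodes, edges) is a tree.

    Assumes a non-empty node list and no duplicate edges or self-loops.
    A graph on n nodes is a tree iff it has n - 1 edges and is connected.
    """
    nodes = list(nodes)
    if len(edges) != len(nodes) - 1:
        return False

    adjacency = {v: set() for v in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    # Depth-first search from an arbitrary node to test connectivity.
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(nodes)


# A 4-node path is a valid treelet; a triangle with a tail is only a graphlet.
print(is_tree([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)]))          # True
print(is_tree([1, 2, 3, 4], [(1, 2), (2, 3), (1, 3), (3, 4)]))  # False
```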
Edge Orbits
● We can also define orbits by edge isomorphisms
  ○ This can be done with either graphlets or treelets
[Figure: examples of treelet edge orbits and graphlet edge orbits, with edges labeled a to d.]

# Nodes | # Node Orbits (Graphlets) | # Edge Orbits (Graphlets) | # Node Orbits (Treelets) | # Edge Orbits (Treelets)
3 | 3 | 2 | 2 | 1
4 | 11 | 10 | 4 | 3
5 | 58 | 57 | 9 | 6
6 | n/a | n/a | 20 | 16
7 | n/a | n/a | 48 | 37
Collect all the orbit counts
● With any of those orbits, we can begin to construct our cost matrix from the number of times each node appears in each orbit
  ○ Each node (row) has a count for each orbit (column)
● Let’s look at one of the networks on the first slide, and focus on one node
[Figure: example network with node x highlighted, next to its column of orbit counts.]
● Larger orbits are less common and provide better accuracy, but are more difficult to calculate
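A minimal sketch of collecting orbit counts, restricted to the four node orbits of 2- and 3-node graphlets. It assumes the network is stored as a dict mapping each node to its set of neighbors; the function name and this input layout are hypothetical. Orbits follow the usual numbering: 0 = endpoint of an edge, 1 = end of a 3-node path, 2 = middle of a 3-node path, 3 = node of a triangle.

```python
from itertools import combinations


def small_orbit_counts(adjacency):
    """Count orbits 0-3 for every node of an undirected simple graph.

    `adjacency` maps node -> set of neighbors (assumed input layout).
    Returns node -> [orbit0, orbit1, orbit2, orbit3].
    """
    counts = {v: [0, 0, 0, 0] for v in adjacency}
    for v, neighbors in adjacency.items():
        counts[v][0] = len(neighbors)  # orbit 0 is just the degree
        for a, b in combinations(sorted(neighbors), 2):
            if b in adjacency[a]:
                counts[v][3] += 1  # v, a, b form a triangle
            else:
                counts[v][2] += 1  # v is the middle of the path a - v - b
                counts[a][1] += 1  # a and b are the ends of that path
                counts[b][1] += 1
    return counts


# Usage: a triangle 0-1-2 with a tail node 3 attached to node 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
for node, row in small_orbit_counts(graph).items():
    print(node, row)
# 0 [2, 1, 0, 1]
# 1 [2, 1, 0, 1]
# 2 [3, 0, 2, 1]
# 3 [1, 2, 0, 0]
```

Each row is one node's count vector. Larger graphlets or treelets add more columns, and counting them takes more work than this pairwise neighbor scan.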
Alignment: Cost Matrix Construction
● Now that we have orbit counts, how do we turn them into a cost matrix?
  ○ Layout of the orbit counts: point_A, (point_B), num_orbit1, num_orbit2, ...
● For node orbits, we use node graphlet degree vector (GDV) similarity
  ○ Let u and v refer to the two nodes we are comparing, and k to the total number of orbits we consider
  ○ D_i and w_i refer to the GDV distance and the weighting for orbit i
● The signature similarity between the two GDVs is then defined as S(u, v)
[Equation: S(u, v)]
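The slide's formulas did not survive extraction. The sketch below therefore assumes the GDV signature similarity commonly used in the GRAAL family of aligners: D_i(u, v) = w_i * |log(u_i + 1) - log(v_i + 1)| / log(max(u_i, v_i) + 2), D(u, v) = sum of D_i divided by sum of w_i, and S(u, v) = 1 - D(u, v). Whether the slides use exactly this form is an assumption. The function name and the default of uniform weights are hypothetical.

```python
import math


def gdv_similarity(gdv_u, gdv_v, weights=None):
    """Signature similarity S(u, v) between two orbit-count vectors.

    Assumed formula (not confirmed by the slides):
      D_i = w_i * |log(u_i + 1) - log(v_i + 1)| / log(max(u_i, v_i) + 2)
      S   = 1 - sum(D_i) / sum(w_i)
    `weights` defaults to 1 for every orbit, which is a simplification.
    """
    if len(gdv_u) != len(gdv_v):
        raise ValueError("GDVs must cover the same k orbits")
    if weights is None:
        weights = [1.0] * len(gdv_u)

    total_distance = 0.0
    for u_i, v_i, w_i in zip(gdv_u, gdv_v, weights):
        difference = abs(math.log(u_i + 1) - math.log(v_i + 1))
        total_distance += w_i * difference / math.log(max(u_i, v_i) + 2)
    return 1.0 - total_distance / sum(weights)


# Usage: identical vectors score 1.0; dissimilar vectors score lower.
print(gdv_similarity([2, 1, 0, 1], [2, 1, 0, 1]))
print(gdv_similarity([3, 0, 2, 1], [1, 2, 0, 0]))
```

Under this assumption S lies in (0, 1], and a cost can be taken as 1 - S so that similar node pairs are cheap to align.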
Edge-Based Costs
● For edges, we use edge graphlet degree centrality (GDC) in addition to GDV similarity
  ○ GDV signature similarity is still used: [Equation]
● Edge GDC measures the complexity of an edge's extended neighborhood, favoring denser neighborhoods
  ○ Here e is an edge and e_i is its count for orbit i: [Equation]
● The final edge cost function (ECF) is: [Equation]
  ○ alpha scales the weighting of GDV vs. GDC
● The total of these values in the cost matrix can then be minimized to create our alignment
Alignment: Aligning according to our Cost Matrix
● The new cost matrix scores every pairing of the n_A nodes of network A with the n_B nodes of network B (an n_A x n_B matrix)
  ○ The only constraint on large networks is that the two should probably be reasonably well connected
● We can run a few possible assignment algorithms on this cost matrix to map the nodes of one network to the other
  ○ Greedy
  ○ Hungarian
  ○ Either way, we just need to minimize the total cost in the matrix

Example Cost Matrix
a1 a2 a3 a4 a5
b1 b2 b3 b4 b5
c1 c2 c3 c4 c5
d1 d2 d3 d4 d5
Assignment Algorithm Examples
● There is further improvement possible in this direction, by using more effective choice algorithms

Example cost matrix:
 7  9  8  6  8
 4  3  7  6  9
 5  8  3  1  4
60 43 32 10 80

Greedy: repeatedly take the smallest remaining value (1, then 3, then 7, then 32). Total Cost: 43

Hungarian: subtract the smallest value in each row, then keep subtracting the smallest uncovered values.
After the row step:
 1  3  2  0  2
 1  0  4  3  6
 4  7  2  0  3
50 33 22  0 70
Final assignment 7 + 3 + 3 + 10. Total Cost: 23
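The two totals on this slide can be reproduced on its 4 x 5 example matrix with the sketch below. Note the substitution: the optimal assignment is found here by exhaustive search, a stand-in that returns the same minimum the Hungarian algorithm reaches in polynomial time. It is not an implementation of the Hungarian algorithm itself. All function names are hypothetical.

```python
from itertools import permutations

# Example cost matrix from the slide: 4 rows to assign, 5 candidate columns.
COST = [
    [7, 9, 8, 6, 8],
    [4, 3, 7, 6, 9],
    [5, 8, 3, 1, 4],
    [60, 43, 32, 10, 80],
]


def greedy_assignment(cost):
    """Repeatedly take the cheapest cell whose row and column are both free."""
    cells = sorted(
        (value, row, col)
        for row, values in enumerate(cost)
        for col, value in enumerate(values)
    )
    used_rows, used_cols, pairs = set(), set(), []
    for _, row, col in cells:
        if row not in used_rows and col not in used_cols:
            used_rows.add(row)
            used_cols.add(col)
            pairs.append((row, col))
    return pairs


def optimal_assignment(cost):
    """Exhaustive search over all row-to-column assignments (tiny inputs only)."""
    n_rows, n_cols = len(cost), len(cost[0])
    best = min(
        permutations(range(n_cols), n_rows),
        key=lambda cols: sum(cost[r][c] for r, c in enumerate(cols)),
    )
    return list(enumerate(best))


def total_cost(cost, pairs):
    return sum(cost[r][c] for r, c in pairs)


print(total_cost(COST, greedy_assignment(COST)))   # 43
print(total_cost(COST, optimal_assignment(COST)))  # 23
```

Greedy grabs the 1 in the third row first, which forces the last row onto 32 instead of 10. On networks of realistic size, exhaustive search is infeasible, so a polynomial-time solver would be needed (for example SciPy's `linear_sum_assignment`).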
Improvement from using Hungarian Alignment
● "Optimal Network Alignment with Graphlet Degree Vectors" (Milenković, T., 2010) shows significant improvement when using Hungarian alignment over the original, greedy GRAAL, which also uses graphlets
● Edge correctness is our alignment quality measure: the percentage of edges in one graph that are aligned to edges in the other
● Here, the network is a high-confidence yeast PPI network, aligned against the same network augmented with interactions from a lower-confidence version at different noise levels
Additional Alignment Quality Measures
● We also use node correctness and interaction correctness
  ○ Both require knowing the true alignment, that is, the correct node or edge mappings
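A minimal sketch of two of these measures. Edge correctness follows the definition in the slides (percentage of edges in one graph aligned to edges in the other). Node correctness is assumed here to be the percentage of nodes mapped to their true partner, which the slides do not spell out. Function names and the dict-based mapping format are hypothetical.

```python
def edge_correctness(edges_a, edges_b, mapping):
    """Percentage of edges of A whose mapped endpoints form an edge of B."""
    target_edges = {frozenset(e) for e in edges_b}
    aligned = sum(
        1
        for u, v in edges_a
        if u in mapping
        and v in mapping
        and frozenset((mapping[u], mapping[v])) in target_edges
    )
    return 100.0 * aligned / len(edges_a)


def node_correctness(mapping, true_mapping):
    """Percentage of nodes mapped to their true partner (assumed definition)."""
    correct = sum(
        1 for node, partner in true_mapping.items() if mapping.get(node) == partner
    )
    return 100.0 * correct / len(true_mapping)


# Usage: two 4-node paths; the alignment swaps the last two nodes.
edges_a = [(1, 2), (2, 3), (3, 4)]
edges_b = [("a", "b"), ("b", "c"), ("c", "d")]
true_mapping = {1: "a", 2: "b", 3: "c", 4: "d"}
alignment = {1: "a", 2: "b", 3: "d", 4: "c"}

print(round(edge_correctness(edges_a, edges_b, alignment), 2))  # 66.67
print(node_correctness(alignment, true_mapping))                # 50.0
```

Edge correctness needs only the two edge lists and the alignment, while node correctness needs the true mapping, which is why it is available only for test cases such as a network aligned against its own noised copy.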
Yeast-Human Alignment Analysis and Run-time Analysis
● In addition to comparing against the results a random mapping would give, the shared Gene Ontology (GO) terms between aligned protein pairs are also counted

# Common GO Terms | 1 | 2 | 3 | 4
H-GRAAL | 45.38% | 14.54% | 4.55% | 1.3%
GRAAL | 45.10% | 15.60% | 5.06% | 2.02%

● Run-time:
  ○ The Hungarian algorithm runs in O(n^3), where n is the number of nodes
  ○ Collecting orbits and counts takes at least O(nk), where k is the number of orbits
  ○ Cost matrix generation over n^2 node pairs, with k orbits each, takes O(kn^2)