
Graphlet Kernels, by Karsten Borgwardt and Nino Shervashidze



  1. Graphlet Kernels
     Karsten Borgwardt and Nino Shervashidze, joint work with S.V.N. Vishwanathan, Tobias Petri, and Kurt Mehlhorn.
     Machine Learning and Computational Biology Research Group, Max Planck Institute for Biological Cybernetics and Max Planck Institute for Developmental Biology, Tübingen.
     Appeared in AISTATS 2009.
     Slide footer: Karsten Borgwardt and Oliver Stegle: Computational Approaches for Analysing Complex Biological Systems.

  2. String kernels
     Recall the $k$-mer kernel on strings.
     Basic idea: count the number of common contiguous substrings of length $k$.
     This is equivalent to: count the number of occurrences of all $k$-mers in strings $s_1$ and $s_2$ separately, then compute the inner product between these counts.
     Example with $k = 3$: $s_1$ = ACCTTGTA, $s_2$ = TGTCCTG.
     3-mers of $s_1$: ACC, CCT, CTT, TTG, TGT, GTA
     3-mers of $s_2$: TGT, GTC, TCC, CCT, CTG
     Components shown, in order: ACC, CCT, CTG, CTT, GTA, GTC, TCC, TGT, TTG
     $f(s_1) = (\ldots, 1, \ldots, 1, \ldots, 0, \ldots, 1, \ldots, 1, \ldots, 0, \ldots, 0, \ldots, 1, \ldots, 1, \ldots)$
     $f(s_2) = (\ldots, 0, \ldots, 1, \ldots, 1, \ldots, 0, \ldots, 0, \ldots, 1, \ldots, 1, \ldots, 1, \ldots, 0, \ldots)$
     $K(s_1, s_2) = f(s_1)\, f(s_2)^\top$
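The counting view of the $k$-mer kernel can be sketched in a few lines of Python. This is a minimal illustration rather than code from the slides; the function names `kmer_counts` and `kmer_kernel` are made up for the example.

```python
from collections import Counter

def kmer_counts(s, k):
    """Count every contiguous substring of length k in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def kmer_kernel(s1, s2, k):
    """Inner product of the k-mer count vectors of s1 and s2."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    # Only k-mers present in both strings contribute to the inner product.
    return sum(c1[kmer] * c2[kmer] for kmer in c1.keys() & c2.keys())

# The two strings from the slide share the 3-mers CCT and TGT, once each.
print(kmer_kernel("ACCTTGTA", "TGTCCTG", 3))  # 2
```

Storing counts sparsely in a `Counter` avoids materialising the full vector over all possible $k$-mers.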

  3. Graph comparison

  4. Graph kernels
     Graph kernels have traditionally been based on different ideas:
     - Random walk kernel ($O(n^3)$)
     - Shortest path kernel ($O(n^4)$)
     - Subtree kernel (NP-hard)
     - Cycle kernel (NP-hard)
     - All possible subgraphs kernel (NP-hard)

  5. Graphlet kernel
     We call subgraphs of size $k \in \{3, 4, 5\}$ graphlets.
     Let $\mathcal{G} = \{\mathrm{graphlet}(1), \ldots, \mathrm{graphlet}(N_k)\}$ be the set of size-$k$ graphlets and $G$ be a graph of size $n$.
     Define a vector $f_G$ of length $N_k$ whose $i$-th component is $f_{G,i} = \#(\mathrm{graphlet}(i) \sqsubseteq G)$, the number of occurrences of graphlet $i$ as a subgraph of $G$.
     We call $f_G$ the $k$-spectrum of $G$.
     In the slide's figure (not reproduced here), $n = 5$, $k = 3$, $f_G = (1, 3, 6, 0)$.
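As a rough illustration of the $k$-spectrum, the sketch below computes it by exhaustive enumeration for $k = 3$ only, where the number of edges among three nodes already determines the graphlet type. The function name, the example graph, and the ordering of the vector by edge count are assumptions made for this example; the slide's figure and its ordering are not available.

```python
from itertools import combinations

def three_spectrum(n, edges):
    """Return (F0, F1, F2, F3): counts of induced 3-node subgraphs with 0..3 edges."""
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for triple in combinations(range(n), 3):
        # For 3 nodes, the edge count identifies the isomorphism class.
        num_edges = sum(frozenset(pair) in edge_set
                        for pair in combinations(triple, 2))
        counts[num_edges] += 1
    return tuple(counts)

# Hypothetical example graph: triangle 0-1-2, node 3 attached to node 2, node 4 isolated.
example_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(three_spectrum(5, example_edges))  # (2, 5, 2, 1)
```

The entries always sum to $\binom{n}{3}$, here 10, since every node triple falls into exactly one class.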

  6. Graphlet kernel
     Given two graphs $G$ and $G'$ of size $n \ge k$, the graphlet kernel $k_g$ is defined as
     $$k_g(G, G') := f_G^\top f_{G'}.$$
     Problem: if $G$ and $G'$ have different sizes, this will greatly skew the counts $f_G$.
     Solution: normalize the counts to frequency vectors,
     $$D_G = \frac{1}{\#\,\text{all graphlets in } G}\, f_G,$$
     and work with the normalized variant of $k_g$:
     $$k_g(G, G') = D_G^\top D_{G'}.$$
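A minimal sketch of the normalized kernel, assuming the graphlet count vectors are already available as plain Python lists; the function names are illustrative only.

```python
def normalize(counts):
    """Turn a graphlet count vector f_G into the frequency vector D_G."""
    total = sum(counts)
    if total == 0:
        raise ValueError("graph contains no graphlets of the requested size")
    return [c / total for c in counts]

def graphlet_kernel(f_g, f_h):
    """Normalized graphlet kernel: inner product of the two frequency vectors."""
    if len(f_g) != len(f_h):
        raise ValueError("count vectors must have the same length")
    d_g, d_h = normalize(f_g), normalize(f_h)
    return sum(x * y for x, y in zip(d_g, d_h))

# The count vector (1, 3, 6, 0) quoted on the previous slide, compared with itself.
print(graphlet_kernel([1, 3, 6, 0], [1, 3, 6, 0]))
```

Because of the normalization, scaling one count vector by a constant leaves the kernel value unchanged, which is the point of the fix for graphs of different sizes.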

  7. Link to graph reconstruction
     Isomorphism of graphs → equality of their $k$-spectra.
     Equality of their $k$-spectra → isomorphism? Yes, when $n = k + 1$ and $n \le 11$ ...
     Graph reconstruction conjecture: Let $G_v$ denote the subgraph of $G$ obtained by deleting node $v$ and all the edges incident to it. Let $G$ and $G'$ be graphs of size greater than 2 and $g : V \to V'$ be a bijection such that $G_v$ is isomorphic to $G'_{g(v)}$ for all $v \in V$. Then $G$ is isomorphic to $G'$.

  8. Link to graph reconstruction
     Recursive definition of the graphlet kernel
     Given two graphs $G$ and $G'$ of size $n \ge k$, let $M$ and $M'$ denote the sets of size-$(n-1)$ subgraphs of $G$ and $G'$ respectively.
     The recursive graph kernel based on these subgraphs is defined as
     $$k_n(G, G') = \begin{cases} \dfrac{1}{(n-k)^2} \sum_{S \in M,\, S' \in M'} k_{n-1}(S, S') & \text{if } n > k, \\ \delta(G \cong G') & \text{if } n = k, \end{cases}$$
     where $\delta(G \cong G')$ is 1 if $G$ and $G'$ are isomorphic, 0 otherwise.
     The graphlet kernel is defined as $k_g(G, G') := k_n(G, G')$.

  9. How to reduce runtime?
     The kernel is defined, but how do we compute graphlet distributions?
     Counting size-$k$ graphlets by exhaustive enumeration takes $O(n^k)$. This is too expensive.
     We propose 2 schemes to speed up the computation:
     - We show that sampling a fixed number of graphlets suffices to bound the $\ell_1$ deviation of the empirical estimates of the graphlet distribution from the true distribution.
     - For graphs of degree bounded by $d$, the exact number of all graphlets of size $k$ can be determined in time $O(n d^{k-1})$. Large real-world graphs are often sparse, with $d \ll n$.

  10. Sampling from graphs
     Given a multiset $X := \{X_j\}_{j=1}^m$ of independent identically distributed (iid) random variables $X_j \sim D$, the empirical estimate of $D$ is defined as
     $$\hat{D}^m(i) = \frac{1}{m} \sum_{j=1}^m \delta(X_j = i),$$
     where $i \in A$, and $\delta(\cdot)$ is an indicator function.
     Let $D$ be a probability distribution on the finite set $A = \{1, \ldots, a\}$. Let $X := \{X_j\}_{j=1}^m$, with $X_j \sim D$. For a given $\epsilon > 0$ and $\delta > 0$,
     $$m = \left\lceil \frac{2\left(\log 2 \cdot a + \log\frac{1}{\delta}\right)}{\epsilon^2} \right\rceil$$
     samples suffice to ensure that $P\left\{ \|D - \hat{D}^m\|_1 \ge \epsilon \right\} \le \delta$.
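The empirical estimate can be illustrated with a small sampling routine. This sketch is restricted to 3-node graphlets, where the edge count among the sampled nodes identifies the graphlet type; larger $k$ would need an isomorphism lookup that is not shown here. The function name and the `seed` parameter are assumptions made for the example.

```python
import random
from itertools import combinations

def sample_three_spectrum(n, edges, m, seed=0):
    """Estimate the 3-node graphlet distribution from m sampled node triples."""
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    hits = [0, 0, 0, 0]
    for _ in range(m):
        # Each draw picks 3 distinct nodes uniformly; draws are independent.
        triple = rng.sample(range(n), 3)
        num_edges = sum(frozenset(pair) in edge_set
                        for pair in combinations(triple, 2))
        hits[num_edges] += 1
    # Relative frequencies: the empirical estimate of the graphlet distribution.
    return [h / m for h in hits]

# On a complete graph every sampled triple is a triangle.
complete_5 = list(combinations(range(5), 2))
print(sample_three_spectrum(5, complete_5, m=100))  # [0.0, 0.0, 0.0, 1.0]
```

The cost depends on the number of samples $m$ and on $k$, not on the size of the graph.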

  11. Sampling from graphs
     Example: consider size-5 graphlets with $\epsilon = 0.05$, $\delta = 0.05$.
     - $a = 34$, as there are 34 pairwise non-isomorphic graphlets of size 5 (figure of the 34 graphlets not reproduced).
     - We obtain $m = 21251$, independent of the size of the graphs we want to compare.
     - $21251 \ll n^5$ for all $n > 9$.
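The sample size in this example can be checked numerically. The helper below is a sketch that evaluates $m = \lceil 2(a \log 2 + \log(1/\delta)) / \epsilon^2 \rceil$ with natural logarithms; the function name is made up for the illustration.

```python
import math

def sample_size(a, eps, delta):
    """Number of samples from the bound: ceil(2 * (a*log 2 + log(1/delta)) / eps^2)."""
    return math.ceil(2 * (a * math.log(2) + math.log(1 / delta)) / eps ** 2)

# Size-5 graphlets: a = 34 classes, eps = delta = 0.05.
print(sample_size(34, 0.05, 0.05))  # 21251
```

The result depends only on $a$, $\epsilon$ and $\delta$, so the same number of samples applies to graphs of any size.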

  12. Bounded degree graphs
     There is a large class of graphs on which complete counting of graphlets can be performed efficiently: graphs of bounded degree $d$.
     We present 2 algorithms which exploit the low degree:
     - one for enumerating all connected graphlets,
     - one for counting all graphlets.
     Both have $O(n d^{k-1})$ runtime complexity, but the first one is faster in practice.

  13. Bounded degree graphs
     Count connected graphlets of size $k$, $k \in \{3, 4, 5\}$.
     Notice that most connected graphlets contain simple paths of $k$ nodes. Given this, the idea is simple:
     - enumerate simple paths of $k$ nodes ($O(n d^{k-1})$);
     - for each path, look up adjacencies among these $k$ nodes to decide which graphlet we obtain ($O(1)$, provided that we have a data structure allowing for this);
     - each graphlet will be counted as many times as the number of $k$-node paths it contains, so divide the counts by these numbers.
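For $k = 3$ the path-based idea can be sketched as follows. The graph is assumed to be given as a dictionary mapping each node to the set of its neighbours; that representation and the function name are choices made for this example only.

```python
from itertools import combinations

def connected_three_graphlets(adj):
    """adj: dict node -> set of neighbours. Returns (open_paths, triangles)."""
    open_paths = 0
    triangle_hits = 0
    for v, nbrs in adj.items():
        # Every pair of neighbours (u, w) of v gives the simple path u - v - w.
        for u, w in combinations(sorted(nbrs), 2):
            if w in adj[u]:
                triangle_hits += 1   # the path closes into a triangle
            else:
                open_paths += 1      # induced path on 3 nodes
    # A triangle contains three 3-node paths (one per centre), so divide by 3.
    return open_paths, triangle_hits // 3

# Hypothetical example: triangle 0-1-2, node 3 attached to node 2, node 4 isolated.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
print(connected_three_graphlets(adj))  # (2, 1)
```

Each node inspects at most $\binom{d}{2}$ neighbour pairs, which gives the $O(n d^{2})$ behaviour expected for $k = 3$, with set membership serving as the constant-time adjacency lookup.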

  14. Bounded degree graphs
     Problem: while for size-3 graphlets all connected graphlets contain simple paths of $k$ nodes, this is no longer the case for size-4 and size-5 graphlets. The exceptions are the four graphlets labelled I, II, III and IV in the slide's figure (not reproduced here).
     Solution:
     - To count I, we look up the $\binom{d_i}{3}$ neighbor triplets of each $v_i$ and check if they induce the graphlet we are interested in ($O(n d^3)$).
     - II, III and IV contain I. So we first enumerate all occurrences of I, and then check the neighbors of each node in I to see if they induce the graphlets in question ($O(n d^4)$).
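The neighbor-triplet lookup can be sketched as below, under the assumption that graphlet I is the 4-node star (a centre with three mutually non-adjacent neighbours). The figure is not available, so this identification is a guess based on the star being the connected 4-node graph without a 4-node simple path. The adjacency-dictionary representation and the function name are also assumptions.

```python
from itertools import combinations

def count_induced_stars(adj):
    """Count induced 4-node stars; adj maps each node to its set of neighbours."""
    count = 0
    for v, nbrs in adj.items():
        # Inspect every triplet of neighbours of the candidate centre v.
        for triple in combinations(sorted(nbrs), 3):
            # The triplet induces a star only if no two leaves are adjacent.
            if all(b not in adj[a] for a, b in combinations(triple, 2)):
                count += 1
    return count

# Hypothetical example: centre 0 with four pairwise non-adjacent leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(count_induced_stars(star))  # 4
```

Since an induced star has exactly one node of degree 3, each occurrence is found once, at its centre, and no division of the counts is needed.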

  15. Bounded degree graphs
     Count all graphlets of size $k$, $k \in \{3, 4, 5\}$.
     The basic idea:
     - enumerate all connected graphlets;
     - obtain counts of disconnected graphlets by subtracting previously obtained quantities from precomputed quantities.

  16. Bounded degree graphs
     Count all graphlets of size $k$, $k \in \{3, 4, 5\}$ (continued).
     Example: 3-node graphlets. There are 4 types of 3-node graphlets: denote them $F_i$, $i \in \{0, 1, 2, 3\}$, where $F_i$ contains $i$ edges.
     First count graphlets containing at least one edge (red and green refer to node colours in the slide's figure, which is not reproduced here):
         |F_1| = |F_2| = |F_3| = 0
         For all edges do                                   (O(nd) in total)
             |F_3| = |F_3| + #(red nodes)                   (O(d) per edge)
             |F_2| = |F_2| + #(green nodes)
             |F_1| = |F_1| + (n - 2 - #(red and green nodes))
         |F_3| = |F_3| / 6,  |F_2| = |F_2| / 4,  |F_1| = |F_1| / 2
     Then $|F_0| = \binom{n}{3} - (|F_1| + |F_2| + |F_3|)$.
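The edge-based procedure can be written out in Python as a sketch. It assumes that "red" nodes are those adjacent to both endpoints of the current edge and "green" nodes are those adjacent to exactly one endpoint. That reading is inferred from the divisors 6, 4 and 2 when every edge is visited in both directions, since the figure itself is not available. The adjacency-dictionary input and the function name are also assumptions.

```python
from math import comb

def count_three_graphlets(adj):
    """adj: dict node -> set of neighbours (no self-loops). Returns (F0, F1, F2, F3)."""
    n = len(adj)
    f1 = f2 = f3 = 0
    for u in adj:
        for v in adj[u]:  # each edge is visited twice: as (u, v) and as (v, u)
            both = adj[u] & adj[v]                  # assumed "red": adjacent to u and v
            one = (adj[u] | adj[v]) - both - {u, v}  # assumed "green": adjacent to one
            f3 += len(both)
            f2 += len(one)
            f1 += n - 2 - len(both) - len(one)
    # A triangle is seen from 3 edges x 2 directions, a 2-edge path from
    # 2 edges x 2 directions, a single-edge graphlet from 1 edge x 2 directions.
    f3 //= 6
    f2 //= 4
    f1 //= 2
    f0 = comb(n, 3) - (f1 + f2 + f3)
    return f0, f1, f2, f3

# Hypothetical example: triangle 0-1-2, node 3 attached to node 2, node 4 isolated.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
print(count_three_graphlets(adj))  # (2, 5, 2, 1)
```

Only the edgeless graphlets are obtained by subtraction, so no node triple without an edge is ever enumerated.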

  17. Experiments
     Statistics on datasets:
         dataset   size   classes            avg. # nodes   avg. # edges   d
         MUTAG     188    2 (125 vs. 63)     17.7           38.9           4
         PTC       344    2 (192 vs. 152)    26.7           50.7           4
         Enzyme    600    6 (100 each)       32.6           124.3          9
         D & D     1178   2 (691 vs. 587)    284.4          1921.6         52
     MUTAG and PTC are chemical datasets; Enzyme and D & D are biological datasets.
     We did not consider node labels.

  18. Experiments
     Classification accuracy for $k = 4$ (plot not reproduced; vertical axis: accuracy from 0 to 100; dataset shown: D & D).
