

  1. Convolutional Kernel Networks for Graph-Structured Data
     Dexiong Chen (1), Laurent Jacob (2), Julien Mairal (1)
     (1) Inria Grenoble, (2) CNRS/LBBE Lyon. ICML 2020.

  2. Graph-structured data are ubiquitous
     [Figure: (a) molecules, (b) protein regulation, (c) social networks, (d) chemical pathways.]


  3. Learning graph representations
     State-of-the-art models for representing graphs:
     - Deep learning for graphs: graph neural networks (GNNs).
     - Graph kernels: Weisfeiler-Lehman (WL) graph kernels.
     - Hybrid models attempt to bridge both worlds: graph neural tangent kernels.
     Our model:
     - A new type of multilayer graph kernel, more expressive than WL kernels.
     - Learning easy-to-regularize and scalable unsupervised graph representations.
     - Learning supervised graph representations like GNNs.

  4. Graphs with node attributes
     A graph is defined as a triplet G = (V, E, a): V and E are the sets of vertices and edges, and a : V -> R^d assigns attributes to each node (e.g. a(u) = [0.3, 0.8, 0.5] for d = 3).
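To make the data structure concrete, here is a minimal sketch (not from the paper) of an attributed graph stored as an adjacency list plus a node-attribute matrix; the class and field names are illustrative.

```python
import numpy as np

class Graph:
    """An attributed graph G = (V, E, a): nodes 0..n-1, undirected edges,
    and a node-attribute matrix playing the role of a : V -> R^d."""
    def __init__(self, n_nodes, edges, attributes):
        self.n_nodes = n_nodes
        self.adj = [[] for _ in range(n_nodes)]   # adjacency lists for E
        for u, v in edges:
            self.adj[u].append(v)
            self.adj[v].append(u)
        self.attributes = np.asarray(attributes, dtype=float)  # shape (n, d)

# Example: a triangle whose nodes carry 3-dimensional attributes,
# e.g. a(0) = [0.3, 0.8, 0.5] as on the slide.
G = Graph(3, [(0, 1), (1, 2), (0, 2)],
          [[0.3, 0.8, 0.5], [0.1, 0.2, 0.9], [0.7, 0.4, 0.0]])
```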


  5. Graph kernel mappings
     Map each graph G in X to a vector phi(G) in H, which lends itself to learning tasks.
     A large class of graph kernel mappings can be written in the form
     $\varphi(G) := \sum_{u \in \mathcal{V}} \varphi_{\mathrm{base}}(\ell_G(u))$,
     where phi_base embeds some local pattern ell_G(u) around node u into H.
     [Lei et al., 2017, Kriege et al., 2019]
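As a reading aid, this generic form can be written in a few lines; `local_pattern` and `phi_base` are hypothetical placeholders for whichever pattern extractor and embedding a specific kernel uses.

```python
import numpy as np

def graph_embedding(G, local_pattern, phi_base):
    """phi(G) = sum_{u in V} phi_base(l_G(u)): embed the local pattern
    around each node and sum the embeddings over the whole graph."""
    return np.sum([phi_base(local_pattern(G, u)) for u in range(G.n_nodes)], axis=0)
```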


  6. Basic kernels: walk and path kernel mappings
     Let P_k(G, u) denote the set of paths of length k starting from node u in G. The k-path mapping is
     $\varphi_{\mathrm{path}}(u) := \sum_{p \in \mathcal{P}_k(G,u)} \delta_{a(p)}(\cdot)$,
     where a(p) denotes the concatenated attributes along p and delta is the Dirac function.
     phi_path(u) can be interpreted as a histogram of path occurrences.
     Path kernels are more expressive than walk kernels, but are often avoided for computational reasons.
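A small sketch of this Dirac version, assuming discrete node labels and counting simple paths with k edges starting at u (a simplifying convention; the exact path definition in the paper may differ). It reuses the Graph class sketched above.

```python
from collections import Counter

def paths_from(G, u, k):
    """All simple paths with k edges (k + 1 nodes) starting at node u."""
    paths, stack = [], [(u,)]
    while stack:
        p = stack.pop()
        if len(p) == k + 1:
            paths.append(p)
            continue
        for v in G.adj[p[-1]]:
            if v not in p:                 # no repeated node: a path, not a walk
                stack.append(p + (v,))
    return paths

def phi_path_discrete(G, u, k, labels):
    """Histogram of concatenated labels a(p) over P_k(G, u) (the Dirac mapping)."""
    return Counter(tuple(labels[v] for v in p) for p in paths_from(G, u, k))
```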


  7. A relaxed path kernel
     $\varphi_{\mathrm{path}}(u) = \sum_{p \in \mathcal{P}_k(G,u)} \delta_{a(p)}(\cdot) \;\Longrightarrow\; \sum_{p \in \mathcal{P}_k(G,u)} e^{-\alpha \|a(p) - \cdot\|^2}$
     Issues with the path kernel mapping:
     - delta only allows hard comparison between paths, so it only works with discrete attributes;
     - delta is not differentiable, so it cannot be optimized with back-propagation.
     We therefore relax it into a "soft", differentiable mapping, interpreted as a sum of Gaussians centered at the features of each path starting from u.
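Under this relaxation, comparing two nodes' bags of paths amounts to summing Gaussian similarities over all pairs of path features. A minimal numpy sketch, where the bandwidth `alpha` is an illustrative hyperparameter:

```python
import numpy as np

def relaxed_path_kernel(paths_a, paths_b, alpha=0.5):
    """<phi_path(u), phi_path(u')> = sum over path pairs of exp(-alpha ||a(p) - a(p')||^2),
    using the reproducing property <Phi(x), Phi(y)> = exp(-alpha ||x - y||^2).
    paths_a, paths_b: arrays of concatenated path attributes, shape (n_paths, d*(k+1))."""
    sq_dists = ((paths_a[:, None, :] - paths_b[None, :, :]) ** 2).sum(-1)
    return np.exp(-alpha * sq_dists).sum()
```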


  8. One-layer GCKN: a closer look at the relaxed path kernel
     We define one-layer GCKN as the relaxed path kernel mapping
     $\varphi_1(u) := \sum_{p \in \mathcal{P}_k(G,u)} e^{-\alpha_1 \|a(p) - \cdot\|^2} = \sum_{p \in \mathcal{P}_k(G,u)} \Phi_1(a(p)) \in \mathcal{H}_1$.
     This formula can be divided into three steps:
     - path extraction: enumerating all paths in P_k(G, u);
     - kernel mapping: evaluating the Gaussian embedding Phi_1 of each path feature;
     - path aggregation: summing the path embeddings.
     We obtain a new graph with the same topology but new features: $(\mathcal{V}, \mathcal{E}, a) \xrightarrow{\varphi_{\mathrm{path}}} (\mathcal{V}, \mathcal{E}, \varphi_1)$.
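The three steps can be written directly as code. This is a hedged sketch reusing `paths_from` from above; `embed` stands in for a finite-dimensional surrogate of the Gaussian mapping Phi_1 (such as the Nyström map sketched later), and every node is assumed to have at least one length-k path.

```python
import numpy as np

def gckn_layer(G, features, k, embed):
    """One relaxed path-kernel layer: returns new node features phi_1(u),
    keeping the same graph topology."""
    new_features = []
    for u in range(G.n_nodes):
        # 1. path extraction: enumerate P_k(G, u)
        paths = paths_from(G, u, k)
        # 2. kernel mapping: embed each concatenated path feature a(p)
        embedded = [embed(np.concatenate([features[v] for v in p])) for p in paths]
        # 3. path aggregation: sum pooling of the path embeddings
        new_features.append(np.sum(embedded, axis=0))
    return np.stack(new_features)
```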

  9. Construction of one-layer GCKN
     [Figure: starting from (V, E, a : V -> R^d), extract the paths p_1, p_2, p_3 from node u, map each path feature through the kernel mapping Phi_1, and aggregate: phi_1(u) := Phi_1(a(p_1)) + Phi_1(a(p_2)) + Phi_1(a(p_3)), giving the new graph (V, E, phi_1 : V -> H_1).]


  10. From one-layer to multilayer GCKN
      We can apply phi_path repeatedly to the new graph:
      $(\mathcal{V}, \mathcal{E}, a) \xrightarrow{\varphi_{\mathrm{path}}} (\mathcal{V}, \mathcal{E}, \varphi_1) \xrightarrow{\varphi_{\mathrm{path}}} (\mathcal{V}, \mathcal{E}, \varphi_2) \xrightarrow{\varphi_{\mathrm{path}}} \cdots \xrightarrow{\varphi_{\mathrm{path}}} (\mathcal{V}, \mathcal{E}, \varphi_j)$.
      phi_j(u) represents information about a neighborhood of u. The final graph representation at layer j is $\varphi_j(G) = \sum_{u \in \mathcal{V}} \varphi_j(u)$.
      Why is the multilayer model interesting?
      - Applying phi_path once captures paths: GCKN-path.
      - Applying it twice captures subtrees: GCKN-subtree.
      - Applying it more times may capture higher-order structures.
      Long paths cannot be enumerated due to computational complexity, yet the multilayer model can still capture long-range substructures.
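Stacking layers then just means re-applying the same operation to the new node features and sum-pooling at the end. A sketch reusing `gckn_layer` from above; the per-layer path lengths and embeddings are illustrative parameters.

```python
def gckn_multilayer(G, path_lengths, embeds):
    """(V, E, a) -> (V, E, phi_1) -> ... -> (V, E, phi_j), then sum-pool:
    phi_j(G) = sum_u phi_j(u)."""
    features = G.attributes
    for k, embed in zip(path_lengths, embeds):
        features = gckn_layer(G, features, k, embed)
    return features.sum(axis=0)
```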


  11. Scalable approximation of the Gaussian kernel mapping
      $\varphi_{\mathrm{path}}(u) = \sum_{p \in \mathcal{P}_k(G,u)} \Phi(a(p))$, where $\Phi(x) = e^{-\alpha \|x - \cdot\|^2} \in \mathcal{H}$ is infinite-dimensional and can be expensive to compute.
      The Nyström method provides a finite-dimensional approximation Psi(x) in R^q by orthogonally projecting Phi(x) onto a finite-dimensional subspace span(Phi(z_1), ..., Phi(z_q)) parametrized by Z = {z_1, ..., z_q}, where each z_j in R^{dk} can be interpreted as a path feature.
      The parameters Z can be learned by:
      - (unsupervised) K-means on the set of path features;
      - (supervised) end-to-end learning with back-propagation.
      [Chen et al., 2019a,b]
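A minimal sketch of the Nyström map, assuming the anchors Z are chosen by random sampling from the path features (a stand-in for the K-means option on the slide) and adding a small ridge term for numerical stability; the returned `psi` can be passed as the `embed` argument of the layer sketch above.

```python
import numpy as np

def gaussian(X, Z, alpha):
    """Gaussian kernel matrix K(x_i, z_j) = exp(-alpha ||x_i - z_j||^2)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-alpha * sq_dists)

def fit_nystrom(path_features, q, alpha, seed=0):
    """Return psi with psi(x)^T psi(y) ~ exp(-alpha ||x - y||^2), i.e. the
    orthogonal projection of Phi(x) onto span(Phi(z_1), ..., Phi(z_q))."""
    rng = np.random.default_rng(seed)
    Z = path_features[rng.choice(len(path_features), q, replace=False)]
    K = gaussian(Z, Z, alpha) + 1e-6 * np.eye(q)      # ridge for stability
    w, V = np.linalg.eigh(K)
    K_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T          # K_ZZ^{-1/2}
    def psi(x):
        return K_inv_sqrt @ gaussian(x[None, :], Z, alpha)[0]
    return psi
```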

  12. Experiments on graphs with discrete attributes
      [Radar plot: accuracy improvement over the WL subtree kernel on MUTAG, PTC, NCI1, IMDB-B, IMDB-M, PROTEINS and COLLAB, comparing WL subtree, GNTK, GCN, GIN, GCKN-path-unsup, GCKN-subtree-unsup and GCKN-subtree-sup.]
      Accuracy improvement with respect to the WL subtree kernel. GCKN-path already outperforms the baselines. Increasing the number of layers brings a larger improvement. Supervised learning does not improve performance, but leads to more compact representations.
      [Shervashidze et al., 2011, Du et al., 2019, Xu et al., 2019, Kipf and Welling, 2017]

  13. Experiments on graphs with continuous attributes
      [Radar plot: accuracy improvement over the WWL kernel on ENZYMES, PROTEINS, BZR and COX2, comparing WWL, GNTK, GCKN-path-unsup, GCKN-subtree-unsup and GCKN-subtree-sup.]
      Accuracy improvement with respect to the WWL kernel. Results are similar to the discrete case; path features alone already seem predictive enough.
      [Du et al., 2019, Togninalli et al., 2019]

  14. Model interpretation for mutagenicity prediction
      Idea: find the minimal connected component that preserves the prediction.
      [Figure: original molecules vs. the subgraphs selected by GCKN.]
      [Ying et al., 2019]
