Graph Representation Learning: Where Probability Theory, Data Mining, and Neural Networks Meet


  1. Graph Representation Learning: Where Probability Theory, Data Mining, and Neural Networks Meet. Bruno Ribeiro, Assistant Professor, Department of Computer Science, Purdue University. Joint work with R. Murphy*, B. Srinivasan*, V. Rao. GrAPL Workshop @ IPDPS, May 20th, 2019. Sponsors: Army Research Lab, Network Science CTA

  2.  What is the most powerful+ graph model / representation?  How can we make model learning$ tractable*? ◦ How can we make model learning$ scalable? (+ powerful → expressive; * tractable → works on small graphs; $ learning and inference) Bruno Ribeiro

  3. 3 Bruno Ribeiro

  4. G = (V, E). Social Graphs, Biological Graphs, Molecules, Ecological Graphs, The Web. Bruno Ribeiro 4

  5. Arbitrary node labels. Undirected graph G(V, E): vertices/nodes and edges. Rows A_1·, A_2· and columns A_·1, A_·2 of the adjacency matrix:
     A = [ 0 1 0 0 0 0 1 0 ]
         [ 1 0 1 0 0 0 0 0 ]
         [ 0 1 0 1 0 0 0 1 ]
         [ 0 0 1 0 1 1 0 1 ]
         [ 0 0 0 1 0 1 0 0 ]
         [ 0 0 0 1 1 0 1 0 ]
         [ 1 0 0 0 0 1 0 1 ]
         [ 0 0 1 1 0 0 1 0 ]
  P(A): probability of sampling A (this graph). 5 Bruno Ribeiro

  6. Bruno Ribeiro 6

  7.  Consider a sequence of n random variables X_1, …, X_n taking values in a countable set Ω, with a joint probability distribution.  Sequence example: "The quick brown fox jumped over the lazy dog": P(X_1 = the, X_2 = quick, …, X_9 = dog).  The joint probability is just a function P: Ω^n → [0,1] (with normalization) ◦ P takes an ordered sequence and outputs a value between zero and one. 7 Bruno Ribeiro

  8.  Consider a set of n random variables (representing a multiset) X_1, …, X_n with X_i ∈ Ω: how should we define their joint probability distribution? Recall: the probability function P: Ω^n → [0,1] is order-dependent. Definition: For multisets, the probability function P is such that P(X_1, …, X_n) = P(X_π(1), …, X_π(n)) is true for any permutation π of (1, …, n). Useful references: (Diaconis, Synthese 1977) Finite forms of de Finetti's theorem on exchangeability; (Murphy et al., ICLR 2019) Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs. 8 Bruno Ribeiro
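To make the permutation-invariance requirement concrete, here is a minimal Python sketch (my own illustration, not from the slides): a sum-pooled function that is invariant by construction, and a Janossy-style function made invariant by averaging an order-sensitive function over all permutations (Murphy et al., ICLR 2019). The choices of phi, rho, and f are placeholders.

    # Minimal sketch (not from the slides): two ways to build a function of a
    # multiset that is permutation-invariant, in the spirit of Janossy pooling.
    import itertools
    import numpy as np

    def sum_pool(xs, phi=np.tanh, rho=np.sum):
        # Invariant by construction: elementwise phi, then a symmetric aggregator.
        return rho([phi(x) for x in xs])

    def janossy_pool(xs, f=lambda s: np.tanh(np.dot(s, np.arange(1, len(s) + 1)))):
        # Invariant by symmetrization: average an order-sensitive f over all
        # |xs|! permutations (tractable only for small multisets).
        return np.mean([f(np.array(p)) for p in itertools.permutations(xs)])

    xs = [0.3, -1.2, 0.7]
    assert np.isclose(sum_pool(xs), sum_pool(xs[::-1]))
    assert np.isclose(janossy_pool(xs), janossy_pool(xs[::-1]))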

  9.  Point clouds (e.g., Lidar maps)  Bag of words  Our friends  Neighbors of a node. Extension: set-of-sets (Meng et al., KDD 2019). Bruno Ribeiro

  10.  Consider an array of n² random variables X_11, X_12, …, X_nn with X_ij ∈ Ω, and P: Ω^(n×n) → [0,1] such that P(X_11, X_12, X_21, …, X_nn) = P(X_π(1)π(1), X_π(1)π(2), X_π(2)π(1), …, X_π(n)π(n)) for any permutation π of (1, …, n).  Then, P is a model of a graph with n vertices, where X_ij ∈ Ω are edge attributes (e.g., weights) ◦ For each graph, P assigns a probability ◦ Trivial to add node attributes to the definition.  If Ω = {0,1} then P is a probability distribution over adjacency matrices ◦ Most statistical graph models can be represented this way. 10 Bruno Ribeiro

  11. Arbitrary node labels. Undirected graph G(V, E) with the same 8-vertex adjacency matrix A shown on slide 5. The graph model is invariant to permutations: e.g., for π = (2, 1, 3, 4, 5, 6, 7, 8), P(A) = P(A_π,π), where A_π,π permutes both the rows and the columns of A by π. 11 Bruno Ribeiro

  12. Bruno Ribeiro

  13.  Invariances have deep implications in nature ◦ Noether's (first) theorem (1918): invariances ⇒ laws of conservation, e.g. time and space translation invariance ⇒ energy conservation.  The study of probabilistic invariances (symmetries) has a long history ◦ Laplace's "rule of succession" dates to 1774 (Kallenberg, 2005) ◦ Maxwell's work in statistical mechanics (1875) (Kallenberg, 2005) ◦ Permutation invariance for infinite sets: de Finetti's theorem (de Finetti, 1930), a special case of the ergodic decomposition theorem, related to integral decompositions (see Orbanz and Roy (2015) for a good overview) ◦ Kallenberg (2005) & (2007): de facto references on probabilistic invariances. Bruno Ribeiro

  14. Aldous, D. J. Representations for partially exchangeable arrays of random variables. J. Multivar. Anal., 1981. 14 Bruno Ribeiro

  15.  Consider an infinite array of random variables X_11, X_12, … with X_ij ∈ Ω, such that P(X_11, X_12, …) = P(X_π(1)π(1), X_π(1)π(2), …) is true for any permutation π of the positive integers.  Then, P(X_11, X_12, …) ∝ ∫_{U_1 ∈ [0,1]} ⋯ ∫_{U_∞ ∈ [0,1]} ∏_{ij} P(X_ij | U_i, U_j) is a mixture model with mixing variables U_i, U_j, … ∼ Uniform(0,1). (The Aldous-Hoover representation is sufficient only for infinite graphs.) 15 Bruno Ribeiro
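A small sketch of how this mixture form generates a finite graph (the particular graphon W(u, v) = u·v is my own choice for illustration; the slide only states the general form): draw U_i ~ Uniform(0,1) per node, then sample edges independently given the U's.

    # Sketch: sample a finite exchangeable graph from the mixture above.
    # The graphon W(u, v) = u * v is an illustrative assumption.
    import numpy as np

    def sample_graph(n, W, rng=np.random.default_rng(0)):
        U = rng.uniform(size=n)               # latent U_i ~ Uniform(0,1) per node
        A = np.zeros((n, n), dtype=int)
        for i in range(n):
            for j in range(i + 1, n):
                A[i, j] = A[j, i] = rng.random() < W(U[i], U[j])  # X_ij given U_i, U_j
        return A, U

    A, U = sample_graph(8, W=lambda u, v: u * v)  # denser edges among high-U nodes
    print(A)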

  16. Bruno Ribeiro

  17. Relationship between deterministic functions and probability distributions.  Noise outsourcing: ◦ Tool from measure theory ◦ Any conditional probability P(Y|X) can be represented as Y = g(X, ε), ε ∼ Uniform(0,1), where g is a deterministic function ◦ The randomness is entirely outsourced to ε.  Representation s(X): ◦ s(X): deterministic function that makes Y independent of X given s(X) ◦ Then, ∃ g′ such that (Y, X) =* (g′(s(X), ε), X), ε ∼ Uniform(0,1). We call s(X) a representation of X. Representations are generalizations of "embeddings". (* the equality holds a.s.) Bruno Ribeiro
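A tiny hand-rolled example of noise outsourcing (my illustration, assuming a conditional Bernoulli, which the slide does not use): the conditional distribution Y | X = x ~ Bernoulli(sigmoid(x)) is rewritten as a deterministic function of x and a single Uniform(0,1) noise variable.

    # Illustration (my example): noise outsourcing for a conditional Bernoulli.
    # Y | X = x ~ Bernoulli(sigmoid(x)) rewritten as Y = g(x, eps), eps ~ Uniform(0,1).
    import numpy as np

    def g(x, eps):
        # Deterministic map; all randomness lives in eps.
        return int(eps < 1.0 / (1.0 + np.exp(-x)))

    rng = np.random.default_rng(0)
    samples = [g(0.5, rng.uniform()) for _ in range(10_000)]
    print(np.mean(samples))   # close to sigmoid(0.5) ~= 0.62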

  18. 18 Bruno Ribeiro

  19. Gaussian Linear Model (each node i represented by a random vector): node i vector U_i· ∼ Normal(0, σ_U² I); adjacency matrix entries A_ij ∼ Normal(U_i·ᵀ U_j·, σ²). Q: For a given A, what is the most likely U? Answer: U* = argmax_U P(A|U), a.k.a. maximum likelihood. Equivalent optimization, minimizing the negative log-likelihood: U* = argmin_U ‖A − UUᵀ‖₂² + (σ²/σ_U²) ‖U‖₂². 19 Bruno Ribeiro
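A hedged sketch of fitting this model by minimizing the negative log-likelihood above with plain gradient descent (the values of σ², σ_U², the learning rate, and the step count are illustrative assumptions, not from the slides).

    # Sketch: fit U by gradient descent on the negative log-likelihood above.
    # sigma2, sigma2_U, lr, and steps are illustrative assumptions.
    import numpy as np

    def fit_gaussian_linear_model(A, k, sigma2=1.0, sigma2_U=10.0,
                                  lr=1e-3, steps=2000, seed=0):
        rng = np.random.default_rng(seed)
        U = 0.1 * rng.standard_normal((A.shape[0], k))
        lam = sigma2 / sigma2_U                 # regularization strength
        for _ in range(steps):
            R = A - U @ U.T                     # residual (A is symmetric)
            U -= lr * (-4.0 * R @ U + 2.0 * lam * U)   # gradient of the objective
        return U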

  20. That will turn out to be the same Bruno Ribeiro

  21.  Embedding of adjacency matrix A: A ≈ UUᵀ, where U_·i is the i-th column vector of U. Pictorially, A ≈ U_·1 U_·1ᵀ + U_·2 U_·2ᵀ + …, a sum of rank-one terms. Bruno Ribeiro

  22.  Matrix factorization can be used to compute a low-rank representation of A.  A reconstruction problem: find A ≈ UUᵀ by optimizing min_U ‖A − UUᵀ‖₂² + λ‖U‖₂², where the first term is the sum of squared errors, the second is L2 regularization with strength λ, and U has k columns*. *Sometimes we will force orthogonal columns in U. 22 Bruno Ribeiro
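A minimal sketch of the footnoted special case, orthogonal columns and no regularization, where the best A ≈ UUᵀ comes from the leading eigenpairs of the symmetric matrix A (the eigendecomposition route and the clipping of negative eigenvalues are my assumptions; the general regularized objective would be solved by gradient methods as sketched after slide 19).

    # Sketch of the footnoted special case (orthogonal columns, no regularization):
    # take U from the top-k eigenpairs of the symmetric matrix A. Clipping
    # negative eigenvalues is my assumption, since UU^T can only be PSD.
    import numpy as np

    def low_rank_embedding(A, k):
        vals, vecs = np.linalg.eigh(A)        # eigenvalues in ascending order
        top = np.argsort(vals)[::-1][:k]      # indices of the k largest eigenvalues
        return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

    A = np.array([[0., 1.], [1., 0.]])
    U = low_rank_embedding(A, k=1)
    print(U @ U.T)                            # best rank-1 PSD approximation of A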

  23. 23 Bruno Ribeiro

  24. Bruno Ribeiro

  25. 25 Bruno Ribeiro

  26. 26 Bruno Ribeiro

  27. The Weisfeiler-Lehman (WL) algorithm: a recursive algorithm to determine if two graphs are isomorphic ◦ Valid isomorphism test for most graphs (Babai and Kucera, 1979) ◦ Cai et al., 1992 show examples that cannot be distinguished by it ◦ Belongs to the class of color refinement algorithms that iteratively update vertex "colors" (hash values) until it has converged to unique assignments of hashes to vertices ◦ Final hash values encode the structural roles of vertices inside a graph ◦ Often fails for graphs with a high degree of symmetry, e.g. chains, complete graphs, tori and stars. (Shervashidze et al., 2011)
  Initialize: h_v is the attribute vector of vertex v ∈ G (if no attribute, assign 1); k = 0
  function WL-fingerprints(G):
    while vertex attributes change do:
      k ← k + 1
      for all vertices v ∈ G do:
        h_{k,v} ← hash(h_{k−1,v}, {h_{k−1,u} : ∀u ∈ Neighbors(v)})   # over the neighbors of node v
    return {h_{k,v} : ∀v ∈ G}
  Bruno Ribeiro 27
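A runnable sketch of the WL-fingerprints routine above (the dictionary-of-sets graph format, Python's built-in hash over sorted neighbor labels, and the stop-when-the-partition-stabilizes test are implementation choices of mine, not from the slides).

    # Runnable sketch of WL-fingerprints. Graph format ({vertex: neighbor set}),
    # Python's hash(), and the partition-stabilization stopping test are my choices.
    def wl_fingerprints(adj, attrs=None):
        h = {v: (attrs[v] if attrs else 1) for v in adj}   # initial colors
        while True:
            new_h = {v: hash((h[v], tuple(sorted(h[u] for u in adj[v]))))
                     for v in adj}
            # Stop once the partition into equal-color classes stops refining
            # (the hash values themselves change every round).
            stable = len(set(new_h.values())) == len(set(h.values()))
            h = new_h
            if stable:
                return h

    triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
    path     = {0: {1},    1: {0, 2}, 2: {1}}
    # These non-isomorphic graphs get different multisets of fingerprints.
    print(sorted(wl_fingerprints(triangle).values()) ==
          sorted(wl_fingerprints(path).values()))          # False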

  28.  The hardest task for graph representation is: ◦ Give different tags to different graphs  Isomorphic graphs should have the same tag ◦ Task: Given adjacency matrix A, predict tag.  Goal: Find a representation s(A) such that tag = g(s(A), ε), i.e., noise outsourcing of P(tag | A) ◦ Then, s(A) must give:  the same representation to isomorphic graphs  different representations to non-isomorphic graphs. 28 Bruno Ribeiro

  29. Bruno Ribeiro

  30. Main idea of Graph Neural Networks: use the WL algorithm to compute representations that are related to a task.
  Initialize h_{0,v} = node v attribute
  function f(A, W_1, …, W_K, b_1, …, b_K):   # K layers
    while k < K do:
      k ← k + 1
      for all vertices v ∈ V do:
        h_{k,v} = σ(W_k [h_{k−1,v}, A_{v·} H_{k−1}] + b_k)
    return {h_{K,v} : ∀v ∈ V}
  The neighbor aggregation A_{v·} H_{k−1} (a sum over neighbors) could be another permutation-invariant function (see Murphy et al., ICLR 2019).
  Example supervised task: predict label y_i of graph G_i represented by A_i. Optimization for loss L: let θ = (W_1, …, W_K, b_1, …, b_K, W_agg, b_agg), and θ* = argmin_θ Σ_{i ∈ Data} L(y_i, W_agg Pooling(f(A_i, W_1, …, W_K, b_1, …, b_K)) + b_agg), where Pooling is a permutation-invariant function (see Murphy et al., ICLR 2019). 30 Bruno Ribeiro
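A minimal NumPy sketch of the forward pass in the pseudocode above (the tanh nonlinearity, layer sizes, and random weights are illustrative assumptions; training the loss at the bottom of the slide is omitted).

    # Minimal NumPy sketch of the GNN forward pass above. The tanh nonlinearity,
    # layer sizes, and random weights are illustrative assumptions; no training.
    import numpy as np

    def gnn_forward(A, X, Ws, bs, sigma=np.tanh):
        # A: (n, n) adjacency; X: (n, d0) node attributes;
        # Ws[k]: (2*d_k, d_{k+1}) weights; bs[k]: (d_{k+1},) biases.
        H = X
        for W, b in zip(Ws, bs):
            M = A @ H                                  # sum over each node's neighbors
            H = sigma(np.concatenate([H, M], axis=1) @ W + b)
        return H                                       # rows are h_{K,v}

    rng = np.random.default_rng(0)
    A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])   # 3-node path graph
    X = rng.standard_normal((3, 4))
    Ws = [rng.standard_normal((8, 5)), rng.standard_normal((10, 5))]
    bs = [np.zeros(5), np.zeros(5)]
    print(gnn_forward(A, X, Ws, bs).shape)             # (3, 5)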
