Graph Representation Learning: Where Probability Theory, Data Mining, and Neural Networks Meet
Bruno Ribeiro, Assistant Professor, Department of Computer Science, Purdue University
Joint work with R. Murphy*, B. Srinivasan*, V. Rao
GrAPL Workshop @ IPDPS, May 20th, 2019
Sponsors: Army Research Lab Network Science CTA
What is the most powerful+ graph model / representation?
How can we make model learning$ tractable*?
◦ How can we make model learning$ scalable?
+ powerful → expressive
* tractable → works on small graphs
$ learning → learning and inference
G = (V, E)
Social graphs, biological graphs, molecules, ecological graphs, the Web
Undirected graph G(V, E): vertices/nodes 1–8 with arbitrary node labels, connected by edges. Its adjacency matrix (column A_{·i} and row A_{i·} correspond to vertex i):

$$A = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 0
\end{pmatrix}$$

P(A): the probability of sampling A (this graph)
Consider a sequence of n random variables X₁, …, Xₙ with Xᵢ ∈ Ω (Ω countable), with joint probability distribution P(X₁, …, Xₙ).
Sequence example: "The quick brown fox jumped over the lazy dog"
P(X₁ = the, X₂ = quick, …, X₉ = dog)
The joint probability is just a function P: Ωⁿ → [0,1] (w/ normalization)
◦ P takes an ordered sequence and outputs a value between zero and one
Consider a set of n random variables (representing a multiset) X₁, …, Xₙ with Xᵢ ∈ Ω: how should we define their joint probability distribution?
Recall: a probability function P: Ωⁿ → [0,1] is order-dependent.
Definition: For multisets, the probability function P is such that

$$P(X_1, \ldots, X_n) = P(X_{\pi(1)}, \ldots, X_{\pi(n)})$$

is true for any permutation π of (1, …, n).
Useful references:
(Diaconis, Synthese 1977) Finite forms of de Finetti's theorem on exchangeability
(Murphy et al., ICLR 2019) Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs
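A minimal sketch of how permutation invariance is achieved in practice, in the spirit of the pooling functions cited above: encode each element, then pool with a symmetric operation (here a sum). The encoder φ and readout ρ below are made-up toy choices, not from the talk; any φ/ρ in this form gives exact invariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-element encoder phi and readout rho (hypothetical choices);
# any phi/rho plugged into this form yields permutation invariance.
W_phi = rng.normal(size=(3, 5))   # maps each element x_i in R^3 to R^5
w_rho = rng.normal(size=5)        # maps the pooled vector to a scalar

def f(X):
    """Permutation-invariant f(X) = rho(sum_i phi(x_i))."""
    H = np.tanh(X @ W_phi)        # phi applied row-wise: shape (n, 5)
    pooled = H.sum(axis=0)        # sum-pooling destroys the ordering
    return np.tanh(pooled @ w_rho)

X = rng.normal(size=(7, 3))            # a "multiset" of 7 elements
X_perm = X[rng.permutation(7)]         # same multiset, different order
assert np.allclose(f(X), f(X_perm))    # invariance holds
```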
Examples of multisets: point clouds, bags of words, our friends, the neighbors of a node, Lidar maps.
Extension: sets of sets (Meng et al., KDD 2019)
Consider an array of n² random variables X₁₁, …, X_{nn} with X_{ij} ∈ Ω, and P: Ω^{n×n} → [0,1] such that

$$P(X_{11}, X_{12}, X_{21}, \ldots, X_{nn}) = P(X_{\pi(1)\pi(1)}, X_{\pi(1)\pi(2)}, X_{\pi(2)\pi(1)}, \ldots, X_{\pi(n)\pi(n)})$$

for any permutation π of (1, …, n).
Then P is a model of a graph with n vertices, where X_{ij} ∈ Ω are edge attributes (e.g., weights)
◦ For each graph, P assigns a probability
◦ Trivial to add node attributes to the definition
If Ω = {0,1} then P is a probability distribution over adjacency matrices
◦ Most statistical graph models can be represented this way
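As a concrete check of this definition, here is a toy permutation-invariant graph model, an Erdős–Rényi likelihood chosen purely for illustration (the talk does not fix a particular P): relabeling the vertices of A leaves its probability unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_p(A, p=0.3):
    """Log-probability of a toy Erdos-Renyi model: each upper-triangular
    entry of A is an independent Bernoulli(p) edge. The value depends only
    on the number of edges, so it is automatically permutation-invariant."""
    iu = np.triu_indices(A.shape[0], k=1)
    edges = A[iu]
    return np.sum(edges * np.log(p) + (1 - edges) * np.log(1 - p))

n = 8
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1) + np.triu(A, 1).T       # symmetric, zero diagonal

pi = rng.permutation(n)
A_pi = A[np.ix_(pi, pi)]                  # relabel vertices: A_{pi pi}
assert np.isclose(log_p(A), log_p(A_pi))  # P(A) = P(A_{pi pi})
```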
Relabeling the vertices of the same undirected graph G(V, E) with the permutation π = (2, 1, 3, 4, 5, 6, 7, 8) permutes the rows and columns of A into A_{ππ}. A graph model is invariant to these permutations:

$$P(A) = P(A_{\pi\pi})$$
Invariances have deep implications in nature
◦ Noether's (first) theorem (1918): invariances ⇒ laws of conservation; e.g., time-translation invariance ⇒ energy conservation
The study of probabilistic invariances (symmetries) has a long history
◦ Laplace's "rule of succession" dates to 1774 (Kallenberg, 2005)
◦ Maxwell's work in statistical mechanics (1875) (Kallenberg, 2005)
◦ Permutation invariance for infinite sets: de Finetti's theorem (de Finetti, 1930), a special case of the ergodic decomposition theorem, related to integral decompositions (see Orbanz and Roy (2015) for a good overview)
◦ Kallenberg (2005, 2007): the de facto references on probabilistic invariances
Aldous, D. J. Representations for partially exchangeable arrays of random variables. J. Multivar. Anal., 1981.
Consider an infinite array of random variables X₁₁, X₁₂, … with X_{ij} ∈ Ω such that

$$P(X_{11}, X_{12}, \ldots) = P(X_{\pi(1)\pi(1)}, X_{\pi(1)\pi(2)}, \ldots)$$

is true for any permutation π of the positive integers. Then

$$P(X_{11}, X_{12}, \ldots) \propto \int_{U_1 \in [0,1]} \cdots \int_{U_\infty \in [0,1]} \prod_{ij} P(X_{ij} \mid U_i, U_j)$$

is a mixture model of uniform distributions, with U_i, U_j, … ∼ Uniform(0,1).
(The Aldous–Hoover representation is sufficient only for infinite graphs.)
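A minimal sketch of sampling a finite graph from this mixture representation, assuming a made-up graphon W(u, v) = uv (the talk does not fix a particular W): draw one uniform U_i per vertex, then draw each edge independently given the U's.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_graph(n, W):
    """Sample an n-vertex undirected graph from the Aldous-Hoover mixture:
    U_i ~ Uniform(0,1) i.i.d., then A_ij ~ Bernoulli(W(U_i, U_j))."""
    U = rng.uniform(size=n)
    P = W(U[:, None], U[None, :])              # edge-probability matrix
    A = (rng.uniform(size=(n, n)) < P).astype(int)
    A = np.triu(A, 1)                          # keep one draw per pair
    return A + A.T                             # undirected: symmetrize

W = lambda u, v: u * v                         # illustrative graphon choice
A = sample_graph(100, W)
print(A.sum() // 2, "edges")
```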
Relationship between deterministic functions and probability distributions.
Noise outsourcing:
◦ A tool from measure theory
◦ Any conditional probability P(Y|X) can be represented as Y = f(X, η), η ∼ Uniform(0,1), where f is a deterministic function
◦ The randomness is entirely outsourced to η
Representation s(X):
◦ s(X): a deterministic function that makes Y independent of X given s(X)
◦ Then ∃ f′ such that (Y, X) = (f′(s(X), η), X), η ∼ Uniform(0,1)
We call s(X) a representation of X.
Representations are generalizations of "embeddings".
(Here "=" denotes almost-sure equality.)
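One concrete instance of noise outsourcing (an assumed example for illustration; the theorem itself is generic): if Y | X ∼ Normal(μ(X), σ²), then Y = f(X, η) with f(x, η) = μ(x) + σ·Φ⁻¹(η) and η ∼ Uniform(0,1), i.e., the inverse-CDF transform.

```python
import numpy as np
from scipy.special import ndtri   # inverse CDF of the standard normal

rng = np.random.default_rng(3)

# Hypothetical conditional: Y | X ~ Normal(mu(X), sigma^2), with a made-up
# mean function mu; any conditional distribution admits such an f.
mu = lambda x: 2.0 * x - 1.0
sigma = 0.5

def f(x, eta):
    """Deterministic f with Y = f(X, eta), eta ~ Uniform(0,1): the
    randomness of P(Y|X) is entirely outsourced to eta."""
    return mu(x) + sigma * ndtri(eta)          # inverse-CDF transform

x = 1.3
eta = rng.uniform(size=100_000)
samples = f(x, eta)
print(samples.mean(), samples.std())           # ~ mu(1.3)=1.6 and sigma=0.5
```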
Gaussian linear model (each node i represented by a random vector):
◦ Node i vector: U_{i·} ∼ Normal(0, σ_U² I)
◦ Adjacency matrix: A_{ij} ∼ Normal(U_{i·}ᵀ U_{j·}, σ²)
Q: For a given A, what is the most likely U?
Answer: U⋆ = argmax_U P(A|U) P(U), i.e., maximum a posteriori (the Gaussian prior on U contributes the regularizer below).
Equivalent optimization, minimizing the negative log joint probability:

$$U^\star = \operatorname{argmin}_U \; \|A - UU^T\|_2^2 + \frac{\sigma^2}{\sigma_U^2}\|U\|_2^2$$
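A minimal sketch of solving this objective by gradient descent, assuming made-up sizes, variances, and step size (none are from the talk); for symmetric A the gradient of ‖A − UUᵀ‖² with respect to U is −4(A − UUᵀ)U.

```python
import numpy as np

rng = np.random.default_rng(4)

n, k = 30, 4
sigma2, sigma2_U = 1.0, 10.0          # noise / prior variances (assumed)
lam = sigma2 / sigma2_U               # the ratio appearing in the NLL

# Synthetic symmetric "adjacency" generated from the model itself
U_true = rng.normal(size=(n, k))
A = U_true @ U_true.T + rng.normal(size=(n, n))
A = (A + A.T) / 2

# Minimize ||A - U U^T||_F^2 + lam ||U||_F^2 by gradient descent
U = 0.01 * rng.normal(size=(n, k))
lr = 1e-3
for step in range(2000):
    R = A - U @ U.T                   # residual
    grad = -4 * R @ U + 2 * lam * U   # d/dU of the objective (A symmetric)
    U -= lr * grad

print(np.linalg.norm(A - U @ U.T) / np.linalg.norm(A))  # relative error
```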
(This maximum a posteriori objective will turn out to be the same as matrix factorization.)
Embedding of the adjacency matrix A: A ≈ UUᵀ, where U_{·i} is the i-th column vector of U. Equivalently, A is approximated by a sum of rank-one terms:

$$A \approx U_{\cdot 1} U_{\cdot 1}^T + U_{\cdot 2} U_{\cdot 2}^T + \cdots$$
Matrix factorization can be used to compute a low-rank representation of A. A reconstruction problem: find A ≈ UUᵀ by optimizing

$$\min_U \; \underbrace{\|A - UU^T\|_2^2}_{\text{sum of squared errors}} + \underbrace{\lambda \|U\|_2^2}_{\text{L2 regularization}}$$

where λ is the regularization strength and U has k columns*.
*Sometimes we will force orthogonal columns in U.
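A minimal sketch of the λ = 0, orthogonal-columns case: since A is symmetric, the best rank-k reconstruction comes from its top-k eigenpairs (Eckart–Young). The function name low_rank_embedding is ours, not from the talk; the usage example reuses the 8-vertex adjacency matrix from the earlier slide.

```python
import numpy as np

def low_rank_embedding(A, k):
    """Rank-k embedding U with A ~= U U^T, from the top-k eigenpairs of
    the symmetric matrix A (the lam = 0, orthogonal-columns case)."""
    vals, vecs = np.linalg.eigh(A)        # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]      # indices of the k largest
    top = np.clip(vals[idx], 0, None)     # U U^T needs nonnegative spectrum
    return vecs[:, idx] * np.sqrt(top)    # U = Q_k diag(sqrt(lambda))

# Usage on the 8-vertex adjacency matrix from the earlier slide
A = np.array([[0,1,0,0,0,0,1,0],[1,0,1,0,0,0,0,0],[0,1,0,1,0,0,0,1],
              [0,0,1,0,1,1,0,1],[0,0,0,1,0,1,0,0],[0,0,0,1,1,0,1,0],
              [1,0,0,0,0,1,0,1],[0,0,1,1,0,0,1,0]])
U = low_rank_embedding(A, k=3)
print(np.round(U @ U.T, 2))               # low-rank reconstruction of A
```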
The Weisfeiler–Lehman (WL) test: a recursive algorithm to determine if two graphs are isomorphic
◦ Valid isomorphism test for most graphs (Babai and Kucera, 1979); Cai et al. (1992) show examples that cannot be distinguished by it
◦ Belongs to the class of color-refinement algorithms, which iteratively update vertex "colors" (hash values) until converging to unique assignments of hashes to vertices
◦ Final hash values encode the structural roles of vertices inside a graph
◦ Often fails for graphs with a high degree of symmetry, e.g., chains, complete graphs, tori, and stars

Pseudocode (after Shervashidze et al., 2011):
Initialize: h_{0,v} is the attribute vector of vertex v ∈ G (if no attribute, assign 1); k = 0
function WL-fingerprints(G):
    while vertex attributes change do:
        k ← k + 1
        for all vertices v ∈ G do:
            h_{k,v} ← hash(h_{k−1,v}, {h_{k−1,u} : ∀u ∈ Neighbors(v)})
    return {h_{k,v} : ∀v ∈ G}
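A runnable sketch of the pseudocode above, with Python's built-in hash() standing in for the hash function and a plain dict as the adjacency list (both implementation choices are ours).

```python
def wl_fingerprints(adj, attrs=None, max_iters=None):
    """Weisfeiler-Lehman color refinement. `adj` maps each vertex to an
    iterable of its neighbors; `attrs` optionally maps vertices to initial
    attributes (otherwise every vertex starts with color 1)."""
    h = dict(attrs) if attrs else {v: 1 for v in adj}
    for _ in range(max_iters or len(adj)):
        # New color: hash of own color plus the sorted multiset of
        # neighbor colors (sorting makes the multiset order-independent).
        new_h = {v: hash((h[v], tuple(sorted(h[u] for u in adj[v]))))
                 for v in adj}
        if len(set(new_h.values())) == len(set(h.values())):
            break                       # color classes stopped refining
        h = new_h
    return h

# Usage: a 4-vertex chain. By symmetry the two end vertices get the same
# color, as do the two middle ones -- the "fails on chains" case above.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(wl_fingerprints(adj))
```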
The hardest task for graph representation:
◦ Give different tags to different graphs; isomorphic graphs should have the same tag
◦ Task: given adjacency matrix A, predict the tag
Goal: find a representation s(A) such that tag = g(s(A), η), η ∼ Uniform(0,1)
◦ Then s(A) must give: the same representation to isomorphic graphs; different representations to non-isomorphic graphs
Main idea of Graph Neural Networks (GNNs): use the WL algorithm to compute representations that are related to a task.

Initialize h_{0,v} = attributes of node v
function f(A, W₁, …, W_K, b₁, …, b_K):    # K layers
    while k < K do:
        k ← k + 1
        for all vertices v ∈ V do:
            h_{k,v} = σ(W_k [h_{k−1,v}, A_{v·} h_{k−1}] + b_k)    # could be another permutation-invariant function (see Murphy et al., ICLR 2019)
    return {h_{K,v} : ∀v ∈ V}

Example supervised task: predict the label y_i of graph G_i represented by A_i.
Optimization for a loss L: let θ = (W₁, …, W_K, b₁, …, b_K, W_agg, b_agg), then

$$\theta^\star = \operatorname{argmin}_\theta \sum_{i \in \text{Data}} L\big(y_i,\; W_{\text{agg}}\,\text{Pooling}(f(A_i, W_1, \ldots, W_K, b_1, \ldots, b_K)) + b_{\text{agg}}\big)$$

where Pooling is a permutation-invariant function (see Murphy et al., ICLR 2019).
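A minimal NumPy sketch of the forward pass above, with made-up sizes, random weights, σ = tanh, and sum-pooling (all assumptions for illustration, not the talk's exact architecture): each layer combines a vertex's own state with the sum of its neighbors' states (the A_{v·} h term), so the per-vertex update mirrors one WL refinement step.

```python
import numpy as np

rng = np.random.default_rng(5)

def gnn_forward(A, H0, Ws, bs):
    """K-layer GNN forward pass: at each layer every vertex concatenates
    its own state with the sum of its neighbors' states (A @ H), then
    applies a shared affine map and the nonlinearity sigma = tanh."""
    H = H0
    for W, b in zip(Ws, bs):
        M = np.concatenate([H, A @ H], axis=1)   # [h_v, sum of neighbor h_u]
        H = np.tanh(M @ W + b)
    return H                                     # one row per vertex

# Tiny example: 5 vertices, 3 input features, 2 layers of width 8
n, d, hdim = 5, 3, 8
A = rng.integers(0, 2, size=(n, n)); A = np.triu(A, 1); A = A + A.T
X = rng.normal(size=(n, d))
Ws = [rng.normal(size=(2 * d, hdim)), rng.normal(size=(2 * hdim, hdim))]
bs = [np.zeros(hdim), np.zeros(hdim)]

H = gnn_forward(A, X, Ws, bs)
graph_repr = H.sum(axis=0)     # permutation-invariant Pooling (sum)
print(graph_repr.shape)        # feed W_agg @ graph_repr + b_agg into loss L
```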