Machine Learning
Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547
Machine Learning on networks: node classification
Example: Classifying the function of proteins in the interactome. Image from: Ganapathiraju et al. 2016. Schizophrenia interactome with 504 novel protein–protein interactions. Nature.
• The (supervised) machine learning lifecycle requires feature engineering every single time!
  – Pipeline: Raw Data → Structured Data → Learning Algorithm → Model → Downstream task
  – Goal: automatically learn the features instead of hand-engineering them
Goal: Efficient task-independent feature learning for machine learning in networks!
  – Map each node u to a vector: f: u → ℝ^d (feature representation, embedding)
• Task: Map each node in a network to a point in a low-dimensional space
  – Distributed representation for nodes
  – Similarity of embeddings between nodes indicates their network similarity
  – Encodes network information and generates node representations
• Example: 2D embedding of the nodes of Zachary's Karate Club network. Image from: Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD.
• The modern deep learning toolbox is designed for simple sequences or grids
  – CNNs for fixed-size images/grids
  – RNNs or word2vec for text/sequences
But networks are far more complex!
• Complex topological structure (no spatial locality like grids or text)
• No fixed node ordering or reference point
• Often dynamic and with multimodal features
Setup: Assume we have a graph G:
• V is the vertex set
• A is the adjacency matrix (assume binary)
• No node features or extra information is used!
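For concreteness (a small toy example, not from the slides; the edge list and node count are arbitrary), a binary adjacency matrix A for an undirected graph can be built as:

    import numpy as np

    # Toy undirected graph on V = {0, 1, 2, 3} given by its edge list.
    edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
    num_nodes = 4

    A = np.zeros((num_nodes, num_nodes), dtype=int)
    for i, j in edges:
        A[i, j] = A[j, i] = 1   # binary and symmetric: an edge either exists or it doesn't

    print(A)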
• Goal: encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the original network
Goal: similarity(u, v) ≈ z_v^T z_u
  – Left side: similarity of u and v in the original network (need to define!)
  – Right side: similarity of the embeddings (dot product)
1. Define an encoder (i.e., a mapping from nodes to embeddings)
2. Define a node similarity function (i.e., a measure of similarity in the original network)
3. Optimize the parameters of the encoder so that: similarity(u, v) ≈ z_v^T z_u (similarity in the original network ≈ similarity of the embeddings)
• Encoder maps each node to a low-dimensional vector: ENC(v) = z_v, where v is a node in the input graph and z_v is its d-dimensional embedding
• Similarity function specifies how relationships in vector space map to relationships in the original network: similarity(u, v) ≈ z_v^T z_u, i.e., the similarity of u and v in the original network is approximated by the dot product between their node embeddings
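As an illustration of the encoder/similarity split (a minimal sketch, not from the slides; the node count, embedding dimension, and random initialization are assumptions standing in for learned parameters):

    import numpy as np

    num_nodes, d = 5, 2
    rng = np.random.default_rng(0)

    # Encoder parameters: one d-dimensional vector per node (what we learn).
    Z = rng.normal(size=(d, num_nodes))

    def encode(v):
        """ENC(v) = z_v: look up the embedding column for node v."""
        return Z[:, v]

    def embedding_similarity(u, v):
        """Similarity in embedding space: dot product z_v^T z_u."""
        return encode(v) @ encode(u)

    # After training, this value should approximate similarity(0, 3) in the network.
    print(embedding_similarity(0, 3))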
• Simplest encoding approach: the encoder is just an embedding lookup: ENC(v) = Z v
  – Z ∈ ℝ^{d×|V|}: matrix whose columns are the d-dimensional node embeddings (what we learn!)
  – v ∈ 𝕀^{|V|}: indicator vector, all zeroes except a one in the position indicating node v
• The embedding matrix Z has one column per node (the embedding vector for that specific node); the number of rows is the dimension/size of the embeddings
• Simplest encoding approach: the encoder is just an embedding lookup
  – Each node is assigned a unique embedding vector
  – Many methods take this approach: node2vec, DeepWalk, LINE
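To make the lookup concrete (a minimal sketch; the matrix sizes are arbitrary): ENC(v) = Z v with a one-hot indicator v is exactly selecting one column of Z.

    import numpy as np

    d, num_nodes = 4, 6
    Z = np.random.default_rng(1).normal(size=(d, num_nodes))  # what we learn

    def one_hot(v, n):
        """Indicator vector: all zeroes except a one in position v."""
        e = np.zeros(n)
        e[v] = 1.0
        return e

    v = 2
    z_v = Z @ one_hot(v, num_nodes)    # ENC(v) = Z v
    assert np.allclose(z_v, Z[:, v])   # ...which is the same as looking up column v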
The key design choice across methods is how they define node similarity. E.g., should two nodes have similar embeddings if they…
• are connected?
• share neighbors?
• have similar “structural roles”?
• …?
Material based on:
• Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD.
• Grover et al. 2016. node2vec: Scalable Feature Learning for Networks. KDD.
z_u^T z_v ≈ probability that u and v co-occur on a random walk over the network (z_u … embedding of node u)
1. Estimate the probability of visiting node v on a random walk starting from node u, using some random walk strategy R
2. Optimize embeddings to encode these random walk statistics: similarity of z_u and z_v (here: dot product ≈ cos(θ)) encodes random walk “similarity”
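One common random walk strategy R (assumed here for illustration: short, fixed-length, unbiased walks over an adjacency-list graph, as in DeepWalk; node2vec uses a biased variant) can be sketched as:

    import random

    def random_walk(adj, start, length, rng=random):
        """Fixed-length unbiased walk: at each step, move to a uniformly random neighbor."""
        walk = [start]
        for _ in range(length - 1):
            neighbors = adj[walk[-1]]
            if not neighbors:        # dead end: stop early
                break
            walk.append(rng.choice(neighbors))
        return walk

    # Toy undirected graph as an adjacency list (arbitrary example).
    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    print(random_walk(adj, start=0, length=5))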
1. Expressivity: Flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information
2. Efficiency: No need to consider all node pairs when training; only pairs that co-occur on random walks
• Intuition: Find an embedding of nodes into d-dimensional space so that node similarity is preserved
• Idea: Learn node embeddings such that nodes that are nearby in the network are close together in the embedding space
• Given a node u, how do we define nearby nodes?
  – N_R(u) … neighborhood of u obtained by some strategy R
• Given G = (V, E)
• Our goal is to learn a mapping z: u → ℝ^d
• Maximize the log-likelihood objective:
    max_z Σ_{u ∈ V} log P(N_R(u) | z_u)
  – where N_R(u) is the neighborhood of node u
• Given node u, we want to learn feature representations that are predictive of the nodes in its neighborhood N_R(u)
1. Run short fixed-length random walks starting from each node of the graph using some strategy R
2. For each node u collect N_R(u), the multiset* of nodes visited on random walks starting from u
3. Optimize embeddings according to: given node u, predict its neighbors N_R(u)
    max_z Σ_{u ∈ V} log P(N_R(u) | z_u)
  * N_R(u) can have repeat elements since nodes can be visited multiple times on random walks
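A sketch of steps 1–2 (assuming the unbiased fixed-length walks from the earlier sketch; the walk length and number of walks per node are arbitrary choices):

    import random
    from collections import defaultdict

    def random_walk(adj, start, length, rng=random):
        walk = [start]
        for _ in range(length - 1):
            nbrs = adj[walk[-1]]
            if not nbrs:
                break
            walk.append(rng.choice(nbrs))
        return walk

    def collect_neighborhoods(adj, walks_per_node=10, walk_length=5):
        """For each node u, build the multiset N_R(u) of nodes visited on walks starting from u."""
        N = defaultdict(list)            # list, not set: repeated visits are kept on purpose
        for u in adj:
            for _ in range(walks_per_node):
                for v in random_walk(adj, u, walk_length)[1:]:
                    N[u].append(v)
        return N

    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    print(collect_neighborhoods(adj)[0][:10])   # a sample of N_R(0)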
max_z Σ_{u ∈ V} log P(N_R(u) | z_u)
• Assumption: the conditional likelihood factorizes over the set of neighbors:
    log P(N_R(u) | z_u) = Σ_{v ∈ N_R(u)} log P(z_v | z_u)
• Softmax parametrization:
    P(v | z_u) = exp(z_u · z_v) / Σ_{n ∈ V} exp(z_u · z_n)
  – Why softmax? We want node v to be most similar to node u (out of all nodes n). Intuition: Σ_i exp(x_i) ≈ max_i exp(x_i)
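Written out numerically (a minimal sketch; the embedding matrix here is random, standing in for learned parameters):

    import numpy as np

    rng = np.random.default_rng(0)
    num_nodes, d = 4, 3
    Z = rng.normal(size=(num_nodes, d))   # row i = z_i (would be learned in practice)

    def p_v_given_u(v, u, Z):
        """Softmax parametrization: P(v | z_u) = exp(z_u . z_v) / sum_n exp(z_u . z_n)."""
        scores = Z @ Z[u]                 # z_u . z_n for every node n
        scores -= scores.max()            # numerical stability; does not change the softmax
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs[v]

    print(p_v_given_u(v=2, u=0, Z=Z))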
Putting it all together:
    L = Σ_{u ∈ V} Σ_{v ∈ N_R(u)} −log( exp(z_u^T z_v) / Σ_{n ∈ V} exp(z_u^T z_n) )
  – Outer sum: over all nodes u
  – Inner sum: over nodes v seen on random walks starting from u
  – Term inside the log: predicted probability of u and v co-occurring on a random walk
• Optimizing random walk embeddings = finding the node embeddings z_u that minimize L
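A direct (deliberately naive) evaluation of this loss, assuming neighborhoods collected as above and a random embedding matrix as a placeholder for learned parameters:

    import numpy as np

    def naive_loss(Z, neighborhoods):
        """L = sum_u sum_{v in N_R(u)} -log softmax(z_u . z_v); recomputes the full normalizer."""
        loss = 0.0
        for u, nbrs in neighborhoods.items():
            scores = Z @ Z[u]                       # z_u . z_n for all n (the expensive part)
            log_norm = np.log(np.exp(scores - scores.max()).sum()) + scores.max()
            for v in nbrs:
                loss += -(Z[u] @ Z[v] - log_norm)   # -log P(v | z_u)
        return loss

    rng = np.random.default_rng(0)
    Z = rng.normal(size=(4, 3))                     # 4 nodes, 3-dim embeddings (toy values)
    neighborhoods = {0: [1, 2, 1], 1: [0, 2], 2: [0, 3], 3: [2]}
    print(naive_loss(Z, neighborhoods))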
But doing this naively is too expensive!
    L = Σ_{u ∈ V} Σ_{v ∈ N_R(u)} −log( exp(z_u^T z_v) / Σ_{n ∈ V} exp(z_u^T z_n) )
• The nested sums over nodes give O(|V|²) complexity!
• The normalization term from the softmax is the culprit… can we approximate it?