Machine Learning [Figure: a network with unknown node labels ("?") as input to machine learning]
5/4/20 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547
Node classification [Figure: machine learning predicts the labels of the nodes marked "?"]
Classifying the function of proteins in the interactome
Image from: Ganapathiraju et al. 2016. Schizophrenia interactome with 504 novel protein–protein interactions. Nature.
• The (supervised) machine learning lifecycle requires feature engineering every single time!
  - Pipeline: Raw Data → Structured Data → Learning Algorithm → Model → Downstream task
  - Instead of feature engineering: automatically learn the features
Goal: Efficient task-independent feature learning for machine learning in networks!
Map each node $u$ to a vector: $f: u \to \mathbb{R}^d$, where the vector in $\mathbb{R}^d$ is the feature representation (embedding) of $u$.
Task: We map each node in a network to a point in a low-dimensional space
  - Distributed representation for nodes
  - Similarity of embedding between nodes indicates their network similarity
  - Encode network information and generate node representation
2D embedding of nodes of Zachary's Karate Club network.
Image from: Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014.
• Modern deep learning toolbox is designed for simple sequences or grids
  - CNNs for fixed-size images/grids…
  - RNNs or word2vec for text/sequences…
But networks are far more complex!
• Complex topological structure (no spatial locality like grids) [Figure: networks vs. images vs. text]
• No fixed node ordering or reference point
• Often dynamic and have multimodal features
Assume we have a graph $G$:
• $V$ is the vertex set
• $A$ is the adjacency matrix (assume binary)
• No node features or extra information is used!
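As a concrete illustration of this setup, here is a minimal sketch that builds a binary adjacency matrix for a small undirected graph. The four-node edge list and the helper name `adjacency_matrix` are assumptions made up for this example; they do not come from the lecture.

```python
# Hypothetical toy graph (not from the lecture): 4 nodes, undirected edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
num_nodes = 4


def adjacency_matrix(num_nodes, edges):
    """Build a binary, symmetric adjacency matrix A for an undirected graph."""
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        A[u][v] = 1  # edge u -> v
        A[v][u] = 1  # undirected, so mirror the entry
    return A


A = adjacency_matrix(num_nodes, edges)
for row in A:
    print(row)
```

Only `A` is available to the methods that follow; no node attributes are attached.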
• Goal is to encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the original network
Goal: $\mathrm{similarity}(u, v) \approx z_v^\top z_u$
  - Left-hand side: similarity in the original network (need to define!)
  - Right-hand side: similarity of the embedding
1. Define an encoder (i.e., a mapping from nodes to embeddings)
2. Define a node similarity function (i.e., a measure of similarity in the original network)
3. Optimize the parameters of the encoder so that: $\mathrm{similarity}(u, v) \approx z_v^\top z_u$ (left: similarity in the original network; right: similarity of the embedding)
• Encoder maps each node to a low-dimensional vector: $\mathrm{enc}(v) = z_v$, where $v$ is a node in the input graph and $z_v$ is its $d$-dimensional embedding
• Similarity function specifies how relationships in vector space map to relationships in the original network: $\mathrm{similarity}(u, v) \approx z_v^\top z_u$, where the left-hand side is the similarity of $u$ and $v$ in the original network and the right-hand side is the dot product between node embeddings
• Simplest encoding approach: encoder is just an embedding-lookup: $\mathrm{enc}(v) = Z v$
  - $Z \in \mathbb{R}^{d \times |V|}$: matrix, each column is a $d$-dimensional node embedding [what we learn!]
  - $v \in \mathbb{I}^{|V|}$: indicator vector, all zeroes except a one in the position indicating node $v$
• Simplest encoding approach: encoder is just an embedding-lookup
  [Figure: the embedding matrix $Z$ has one column per node; each column is the embedding vector for a specific node, and the number of rows is the dimension/size of the embeddings]
Simplest encoding approach: encoder is just an embedding-lookup
Each node is assigned a unique embedding vector
Many methods: node2vec, DeepWalk, LINE
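The embedding-lookup encoder can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the random initialisation stands in for learned values, and every function name here (`make_embedding_matrix`, `one_hot`, `encode_by_product`, `encode_by_lookup`) is hypothetical. It shows that multiplying $Z$ by an indicator vector gives the same result as reading one column of $Z$.

```python
import random


def make_embedding_matrix(d, num_nodes, seed=0):
    """Z has d rows and one column per node (random values stand in for learned ones)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(num_nodes)] for _ in range(d)]


def one_hot(v, num_nodes):
    """Indicator vector: all zeroes except a one at position v."""
    return [1 if i == v else 0 for i in range(num_nodes)]


def encode_by_product(Z, indicator):
    """enc(v) = Z v, written as an explicit matrix-vector product."""
    return [sum(z_ij * x_j for z_ij, x_j in zip(row, indicator)) for row in Z]


def encode_by_lookup(Z, v):
    """Equivalent shortcut: read column v of Z."""
    return [row[v] for row in Z]


def dot(a, b):
    """Dot product, used here as the similarity in embedding space."""
    return sum(x * y for x, y in zip(a, b))


Z = make_embedding_matrix(d=3, num_nodes=5)
z_2 = encode_by_lookup(Z, 2)
z_4 = encode_by_lookup(Z, 4)
print(dot(z_2, z_4))  # embedding-space similarity of nodes 2 and 4
```

In practice the lookup form is preferred because the matrix product touches every column only to discard all but one of them.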
Key choice of methods is how they define node similarity. E.g., should two nodes have similar embeddings if they…
• are connected?
• share neighbors?
• have similar "structural roles"?
• …?
Material based on:
• Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD.
• Grover et al. 2016. node2vec: Scalable Feature Learning for Networks. KDD.
$z_u^\top z_v \approx$ probability that $u$ and $v$ co-occur on a random walk over the network
$z_u$ … embedding of node $u$
1. Estimate probability of visiting node $v$ on a random walk starting from node $u$ using some random walk strategy $R$
2. Optimize embeddings to encode these random walk statistics: similarity (here: dot product, which equals $\cos(\theta)$ for unit-length vectors) encodes random walk "similarity"
[Figure: two embedding vectors separated by angle $\theta$]
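Step 1 can be approximated empirically by simulating walks and counting visits. The sketch below assumes the simplest possible strategy $R$ (unbiased walks that pick a uniformly random neighbor at each step) and a hypothetical four-node graph; the function names and the `num_walks` and `length` parameters are illustrative choices, not values from the lecture.

```python
import random
from collections import Counter

# Hypothetical toy graph as an adjacency list (not from the lecture).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}


def random_walk(graph, start, length, rng):
    """Unbiased walk: at each step move to a uniformly random neighbor."""
    walk = [start]
    for _ in range(length):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk


def estimate_visit_probability(graph, u, num_walks=2000, length=3, seed=0):
    """Fraction of walks from u that visit each node v at least once."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_walks):
        # Skip position 0 (the start); a set counts each node once per walk.
        visited = set(random_walk(graph, u, length, rng)[1:])
        counts.update(visited)
    return {v: counts[v] / num_walks for v in graph}


probs = estimate_visit_probability(graph, u=0)
for v, p in sorted(probs.items()):
    print(v, round(p, 3))
```

Nodes that are close to the start node are visited on a larger share of the walks, which is the statistic the embeddings are then trained to reproduce.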
1. Expressivity: Flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information
2. Efficiency: Do not need to consider all node pairs when training; only need to consider pairs that co-occur on random walks
• Intuition: Find embedding of nodes to $d$-dimensional space so that node similarity is preserved
• Idea: Learn node embedding such that nodes that are nearby in the network are close together in the embedding space
• Given a node $u$, how do we define nearby nodes?
  - $N_R(u)$ … neighbourhood of $u$ obtained by some strategy $R$
• Given $G = (V, E)$
• Our goal is to learn a mapping $z: u \to \mathbb{R}^d$
• Maximize log-likelihood objective:
$$\max_z \sum_{u \in V} \log P(N_R(u) \mid z_u)$$
  - where $N_R(u)$ is the neighborhood of node $u$
• Given node $u$, we want to learn feature representations predictive of nodes in its neighborhood $N_R(u)$
1. Run short fixed-length random walks starting from each node on the graph using some strategy $R$
2. For each node $u$ collect $N_R(u)$, the multiset* of nodes visited on random walks starting from $u$
3. Optimize embeddings according to: given node $u$, predict its neighbors $N_R(u)$
$$\max_z \sum_{u \in V} \log P(N_R(u) \mid z_u)$$
* $N_R(u)$ can have repeat elements since nodes can be visited multiple times on random walks
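Steps 1 and 2 can be sketched as follows. This assumes unbiased walks as the strategy $R$ and uses a hypothetical toy graph; `walks_per_node` and `walk_length` are illustrative parameters, not values from the lecture. A Python list plays the role of the multiset so that repeated visits are kept.

```python
import random

# Hypothetical toy graph as an adjacency list (not from the lecture).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}


def random_walk(graph, start, walk_length, rng):
    """Fixed-length unbiased walk: move to a uniformly random neighbor each step."""
    walk = [start]
    for _ in range(walk_length):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk


def collect_neighborhoods(graph, walks_per_node=10, walk_length=4, seed=0):
    """Return N_R(u) for every node u as a list, i.e. a multiset with repeats."""
    rng = random.Random(seed)
    neighborhoods = {u: [] for u in graph}
    for u in graph:
        for _ in range(walks_per_node):
            walk = random_walk(graph, u, walk_length, rng)
            # Drop the start position; later revisits (including of u) are kept.
            neighborhoods[u].extend(walk[1:])
    return neighborhoods


neighborhoods = collect_neighborhoods(graph)
print(len(neighborhoods[0]), neighborhoods[0][:8])
```

The resulting `neighborhoods` dictionary is the input to step 3, where the embeddings are optimized to predict these visited nodes.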
$$\max_z \sum_{u \in V} \log P(N_R(u) \mid z_u)$$
• Assumption: Conditional likelihood factorizes over the set of neighbors:
$$\log P(N_R(u) \mid z_u) = \sum_{v \in N_R(u)} \log P(z_v \mid z_u)$$
• Softmax parametrization:
$$P(z_v \mid z_u) = \frac{\exp(z_v \cdot z_u)}{\sum_{n \in V} \exp(z_n \cdot z_u)}$$
• Why softmax? We want node $v$ to be most similar to node $u$ (out of all nodes $n$). Intuition: $\sum_i \exp(x_i) \approx \max_i \exp(x_i)$
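A small numeric sketch of the softmax parametrization may help. The two-dimensional embeddings below are invented for illustration, and `softmax_probabilities` is a hypothetical helper name. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that leaves the probabilities unchanged.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_probabilities(z_u, embeddings):
    """P(z_v | z_u) for every node v, normalised over all nodes n in V."""
    scores = {n: dot(z_n, z_u) for n, z_n in embeddings.items()}
    max_score = max(scores.values())  # shift scores for numerical stability
    exp_scores = {n: math.exp(s - max_score) for n, s in scores.items()}
    total = sum(exp_scores.values())  # the normalisation term over all nodes
    return {n: e / total for n, e in exp_scores.items()}


# Hypothetical 2-dimensional embeddings for four nodes (not from the lecture).
embeddings = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [-1.0, 0.5],
    3: [0.0, -1.0],
}

probs = softmax_probabilities(embeddings[0], embeddings)
for v, p in sorted(probs.items()):
    print(v, round(p, 3))
```

Nodes whose embeddings have a larger dot product with $z_u$ receive a larger share of the probability mass, which is the behaviour the "most similar" intuition asks for.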
Putting it all together:
$$\mathcal{L} = \sum_{u \in V} \sum_{v \in N_R(u)} -\log\left(\frac{\exp(z_u^\top z_v)}{\sum_{n \in V} \exp(z_u^\top z_n)}\right)$$
  - $\sum_{u \in V}$: sum over all nodes $u$
  - $\sum_{v \in N_R(u)}$: sum over nodes $v$ seen on random walks starting from $u$
  - the fraction: predicted probability of $u$ and $v$ co-occurring on a random walk
Optimizing random walk embeddings = finding node embeddings $z_u$ that minimize $\mathcal{L}$
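The loss can be evaluated directly with a naive sketch like the one below. It is a literal transcription of the formula for illustration only, with an invented toy input and a hypothetical function name `random_walk_loss`; it computes the value of the loss and does not perform any optimization. The normalisation term is computed over all nodes for every node $u$, which is where the quadratic cost comes from.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_walk_loss(embeddings, neighborhoods):
    """L = sum over u, sum over v in N_R(u), of -log softmax(z_u . z_v)."""
    loss = 0.0
    for u, neighbors in neighborhoods.items():
        z_u = embeddings[u]
        # Normalisation term: one pass over ALL nodes for each u -> O(|V|^2) overall.
        scores = [dot(z_u, z_n) for z_n in embeddings.values()]
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        for v in neighbors:
            # -log(exp(s_uv) / sum_n exp(s_un)) = log_norm - s_uv
            loss += log_norm - dot(z_u, embeddings[v])
    return loss


# Hypothetical toy input (not from the lecture).
embeddings = {0: [1.0, 0.0], 1: [0.8, 0.2], 2: [-0.5, 1.0]}
neighborhoods = {0: [1, 1, 2], 1: [0, 2], 2: [1]}  # lists act as multisets

print(random_walk_loss(embeddings, neighborhoods))
```

A useful sanity check: if all embeddings are zero, every softmax probability is $1/|V|$, so each term contributes $\log |V|$ to the loss.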
But doing this naively is too expensive!!
$$\mathcal{L} = \sum_{u \in V} \sum_{v \in N_R(u)} -\log\left(\frac{\exp(z_u^\top z_v)}{\sum_{n \in V} \exp(z_u^\top z_n)}\right)$$
Nested sum over nodes gives $O(|V|^2)$ complexity!
But doing this naively is too expensive!!
$$\mathcal{L} = \sum_{u \in V} \sum_{v \in N_R(u)} -\log\left(\frac{\exp(z_u^\top z_v)}{\sum_{n \in V} \exp(z_u^\top z_n)}\right)$$
The normalization term from the softmax is the culprit… can we approximate it?