
Machine Learning (5/4/20) - Tim Althoff, UW CS547: Machine Learning for Big Data - PowerPoint PPT Presentation



  1. Machine Learning

  2. Machine Learning: Node classification

  3. Classifying the function of proteins in the interactome. Image from: Ganapathiraju et al. 2016. Schizophrenia interactome with 504 novel protein–protein interactions. Nature.

  4. The (supervised) machine learning lifecycle requires feature engineering every single time! [Pipeline diagram: Raw Data → Structured Data → Learning Algorithm → Model → Downstream task; the manual Feature Engineering step is what we want to replace by automatically learning the features.]

  5. Goal: Efficient task-independent feature learning for machine learning in networks! Map each node u to a vector, f: u → ℝ^d (a feature representation, or embedding).

  6. Task: We map each node in a network to a point in a low-dimensional space
     - Distributed representation for nodes
     - Similarity of embeddings between nodes indicates their network similarity
     - Encode network information and generate node representations

  7. 2D embedding of the nodes of Zachary's Karate Club network. Image from: Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014.

  8. The modern deep learning toolbox is designed for simple sequences or grids:
     - CNNs for fixed-size images/grids
     - RNNs or word2vec for text/sequences

  9. But networks are far more complex!
     - Complex topological structure (no spatial locality like grids), unlike text and images
     - No fixed node ordering or reference point
     - Often dynamic and with multimodal features

  10. Assume we have a graph G:
     - V is the vertex set
     - A is the adjacency matrix (assume binary)
     - No node features or extra information is used!
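
A minimal sketch of this setup in Python/numpy (not from the slides; the toy edge list and sizes are hypothetical), building the binary adjacency matrix A that the rest of the lecture assumes:

```python
import numpy as np

# Toy undirected graph, given only by its vertex set and edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]   # hypothetical example
num_nodes = 4                              # |V|

# Binary adjacency matrix A; no node features are used.
A = np.zeros((num_nodes, num_nodes), dtype=np.int8)
for u, v in edges:
    A[u, v] = 1
    A[v, u] = 1                            # undirected => symmetric

print(A)
```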

  11. Goal: encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the original network.

  12. Goal: similarity(u, v) ≈ z_v^⊤ z_u, where the left-hand side is the similarity of u and v in the original network (this is what we still need to define!) and the right-hand side is the similarity of the embeddings.

  13. 1. Define an encoder (i.e., a mapping from nodes to embeddings).
      2. Define a node similarity function (i.e., a measure of similarity in the original network).
      3. Optimize the parameters of the encoder so that similarity(u, v) ≈ z_v^⊤ z_u, i.e., similarity in the original network ≈ similarity of the embeddings.

  14. The encoder maps each node to a low-dimensional vector: enc(v) = z_v, the d-dimensional embedding of node v in the input graph. The similarity function specifies how relationships in vector space map to relationships in the original network: similarity(u, v) ≈ z_v^⊤ z_u, i.e., the similarity of u and v in the original network is approximated by the dot product between their node embeddings.

  15. Simplest encoding approach: the encoder is just an embedding lookup, enc(v) = Z v, where Z ∈ ℝ^{d×|V|} is a matrix whose columns are the d-dimensional node embeddings (this is what we learn!) and v ∈ 𝕀^{|V|} is an indicator vector that is all zeros except for a one in the column indicating node v.
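
A minimal sketch of this lookup in Python/numpy (the names `Z`, `encode`, and the toy sizes are assumptions, not from the slides), showing that multiplying Z by the one-hot indicator vector just selects a column:

```python
import numpy as np

num_nodes, d = 4, 8                     # |V| nodes, d-dimensional embeddings
rng = np.random.default_rng(0)

# Z is what we learn: one d-dimensional embedding per node, stored as columns.
Z = rng.normal(size=(d, num_nodes))

def encode(v: int) -> np.ndarray:
    """enc(v) = Z @ indicator(v): the one-hot vector selects column v of Z."""
    indicator = np.zeros(num_nodes)
    indicator[v] = 1.0
    return Z @ indicator

# The matrix-vector product is equivalent to a plain table lookup:
assert np.allclose(encode(2), Z[:, 2])
```

In practice, implementations index directly into Z (a table lookup) rather than forming the indicator vector.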

  16. Simplest encoding approach: the encoder is just an embedding lookup. [Diagram: the embedding matrix Z has one column per node; each column is the embedding vector for a specific node, and the number of rows is the dimension/size of the embeddings.]

  17. Simplest encoding approach: the encoder is just an embedding lookup. Each node is assigned a unique embedding vector. Many methods take this approach: node2vec, DeepWalk, LINE.

  18. The key design choice of these methods is how they define node similarity. E.g., should two nodes have similar embeddings if they…
     - are connected?
     - share neighbors?
     - have similar "structural roles"?
     - …?

  19. Material based on:
     - Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD.
     - Grover et al. 2016. node2vec: Scalable Feature Learning for Networks. KDD.

  20. z_u^⊤ z_v ≈ the probability that u and v co-occur on a random walk over the network, where z_u is the embedding of node u.

  21. 1. Estimate the probability of visiting node v on a random walk starting from node u, using some random walk strategy R.
      2. Optimize embeddings to encode these random walk statistics: similarity (here: dot product ≈ cos(θ)) encodes random walk "similarity".
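
As a sketch of step 1 under the simplest possible strategy R (uniformly choosing the next neighbor at each step), in Python/numpy; the function name and signature are assumptions, not the deck's code:

```python
import numpy as np

def random_walk(A: np.ndarray, start: int, length: int,
                rng: np.random.Generator) -> list[int]:
    """One short random walk: at each step, move to a uniformly chosen neighbor."""
    walk = [start]
    for _ in range(length):
        neighbors = np.flatnonzero(A[walk[-1]])   # indices with an edge to the current node
        if neighbors.size == 0:                   # isolated/dangling node: stop early
            break
        walk.append(int(rng.choice(neighbors)))
    return walk
```

Running many such walks from u and counting how often v appears gives an empirical estimate of the visiting probability the slide refers to.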

  22. 1. Expressivity: A flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information.
      2. Efficiency: We do not need to consider all node pairs when training; we only need to consider pairs that co-occur on random walks.

  23. Intuition: Find an embedding of the nodes in d-dimensional space so that node similarity is preserved. Idea: Learn node embeddings such that nodes that are nearby in the network are close together in the embedding space. Given a node u, how do we define nearby nodes? N_R(u) … the neighbourhood of u obtained by some strategy R.

  24. Given G = (V, E), our goal is to learn a mapping f: u → ℝ^d. Maximize the log-likelihood objective:
         max_f Σ_{u ∈ V} log P(N_R(u) | z_u)
      where N_R(u) is the neighborhood of node u. Given node u, we want to learn feature representations that are predictive of the nodes in its neighborhood N_R(u).

  25. 1. Run short fixed-length random walks starting from each node in the graph using some strategy R.
      2. For each node u, collect N_R(u), the multiset* of nodes visited on random walks starting from u.
      3. Optimize embeddings according to: given node u, predict its neighbors N_R(u):
            max_f Σ_{u ∈ V} log P(N_R(u) | z_u)
      * N_R(u) can have repeat elements since nodes can be visited multiple times on random walks.
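
A sketch of steps 1 and 2, assuming the same uniform random-walk strategy as in the earlier sketch; `collect_neighborhoods` and its parameters are hypothetical names, not from the slides:

```python
from collections import defaultdict
import numpy as np

def collect_neighborhoods(A: np.ndarray, walk_length: int = 5,
                          walks_per_node: int = 10, seed: int = 0) -> dict:
    """Run short fixed-length uniform random walks from every node and collect
    the multiset N_R(u) of nodes visited on walks starting from u (repeats kept)."""
    rng = np.random.default_rng(seed)
    N_R = defaultdict(list)
    for u in range(A.shape[0]):
        for _ in range(walks_per_node):
            current = u
            for _ in range(walk_length):
                neighbors = np.flatnonzero(A[current])
                if neighbors.size == 0:          # stop if the walk gets stuck
                    break
                current = int(rng.choice(neighbors))
                N_R[u].append(current)           # node co-occurring with u on this walk
    return N_R
```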

  26. max_f Σ_{u ∈ V} log P(N_R(u) | z_u)
      Assumption: The conditional likelihood factorizes over the set of neighbors:
         log P(N_R(u) | z_u) = Σ_{v ∈ N_R(u)} log P(z_v | z_u)
      Softmax parametrization:
         P(z_v | z_u) = exp(z_u ⋅ z_v) / Σ_{n ∈ V} exp(z_u ⋅ z_n)
      Why softmax? We want node v to be the most similar to node u (out of all nodes n). Intuition: Σ_i exp(x_i) ≈ max_i exp(x_i).
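
A sketch of this softmax parametrization in Python/numpy, assuming Z stores one embedding column per node as on slide 15; the max-subtraction is a numerical-stability detail, not part of the slide's formula:

```python
import numpy as np

def p_v_given_u(Z: np.ndarray, u: int, v: int) -> float:
    """Softmax parametrization: P(v | u) = exp(z_u . z_v) / sum_n exp(z_u . z_n),
    where column i of Z is the embedding z_i."""
    scores = Z[:, u] @ Z            # z_u . z_n for every node n, shape (|V|,)
    scores -= scores.max()          # stabilize the exponentials
    exp_scores = np.exp(scores)
    return float(exp_scores[v] / exp_scores.sum())
```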

  27. Putting it all together:
         L = Σ_{u ∈ V} Σ_{v ∈ N_R(u)} -log( exp(z_u^⊤ z_v) / Σ_{n ∈ V} exp(z_u^⊤ z_n) )
      The outer sum runs over all nodes u, the inner sum over the nodes v seen on random walks starting from u, and the term inside the log is the predicted probability of u and v co-occurring on a random walk.
      Optimizing random walk embeddings = finding the node embeddings z that minimize L.
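
A sketch of evaluating this loss directly, assuming the neighborhoods N_R(u) were collected as in the earlier sketch (names are hypothetical); note that the softmax normalization already touches every node, which is exactly the cost issue raised on the next slides:

```python
import numpy as np

def walk_loss(Z: np.ndarray, N_R: dict) -> float:
    """L = sum_u sum_{v in N_R(u)} -log( exp(z_u . z_v) / sum_n exp(z_u . z_n) ).
    Z holds one embedding column per node; N_R maps u to the multiset of nodes
    seen on random walks starting from u."""
    loss = 0.0
    for u, neighbors in N_R.items():
        scores = Z[:, u] @ Z                                     # z_u . z_n for all n
        log_norm = np.log(np.exp(scores - scores.max()).sum()) + scores.max()
        for v in neighbors:
            loss += log_norm - scores[v]                         # -log softmax probability
    return loss
```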

  28. But doing this naively is too expensive!!
         L = Σ_{u ∈ V} Σ_{v ∈ N_R(u)} -log( exp(z_u^⊤ z_v) / Σ_{n ∈ V} exp(z_u^⊤ z_n) )
      The nested sum over nodes gives O(|V|²) complexity!

  29. But doing this naively is too expensive!!
         L = Σ_{u ∈ V} Σ_{v ∈ N_R(u)} -log( exp(z_u^⊤ z_v) / Σ_{n ∈ V} exp(z_u^⊤ z_n) )
      The normalization term from the softmax is the culprit… can we approximate it?
