  1. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec
Jiezhong Qiu, Tsinghua University. February 21, 2018.
Joint work with Yuxiao Dong (MSR), Hao Ma (MSR), Jian Li (IIIS, Tsinghua), Kuansan Wang (MSR), and Jie Tang (DCST, Tsinghua).

  2. Motivation and Problem Formulation
Problem Formulation: Given a network $G = (V, E)$, we aim to learn a function $f: V \to \mathbb{R}^p$ that captures neighborhood similarity and community membership.
Applications:
◮ link prediction
◮ community detection
◮ label classification
Figure 1: A toy example (figure from DeepWalk).

  3. History of Network Embedding
2017: metapath2vec [Dong et al.]
2016: node2vec [Grover & Leskovec]
2015: LINE & PTE [Tang et al.]
2014: DeepWalk [Perozzi et al.]
2013: word2vec (skip-gram) [Mikolov et al.]
2009: SocDim [Tang & Liu]
2005: Spectral Clustering v.s. Kernel k-means [Dhillon et al.]
2002: Spectral Clustering [Ng et al.]
2000: Image Segmentation [Shi & Malik]
1996: a large body of literature: [Pothen et al.], [Simon], [Bolla], [Hagen & Kahng], [Hendrickson & Leland], [Van Driessche & Roose], [Barnard et al.], [Spielman & Teng], [Guattery & Miller]
1973: Fiedler Vector [Fiedler]; Spectral Partitioning [Donath, Hoffman]

  4. Contents
Preliminaries: Notations
Main Theoretic Results: DeepWalk (KDD'14), LINE (WWW'15), PTE (KDD'15), node2vec (KDD'16)
NetMF: NetMF for a Small Window Size T; NetMF for a Large Window Size T
Experiments

  5. Notations
Consider an undirected weighted graph $G = (V, E)$, where $|V| = n$ and $|E| = m$.
◮ Adjacency matrix $A \in \mathbb{R}_{+}^{n \times n}$:
$A_{i,j} = \begin{cases} a_{i,j} > 0 & (i, j) \in E \\ 0 & (i, j) \notin E. \end{cases}$
◮ Degree matrix $D = \operatorname{diag}(d_1, \cdots, d_n)$, where $d_i$ is the generalized degree of vertex $i$.
◮ Volume of the graph $G$: $\operatorname{vol}(G) = \sum_i \sum_j A_{i,j}$.
Assumption: $G = (V, E)$ is connected, undirected, and not bipartite, which makes $P(w) = \frac{d_w}{\operatorname{vol}(G)}$ the unique stationary distribution of the random walk.
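To make the notation concrete, here is a minimal NumPy sketch (mine, not the authors'; the 4-vertex graph is an assumed toy example) that builds $A$, $D$, and $\operatorname{vol}(G)$, and checks the stationary distribution:

```python
import numpy as np

# Toy undirected weighted graph on 4 vertices (assumed for illustration);
# it is connected and contains a triangle, hence not bipartite.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
], dtype=float)

d = A.sum(axis=1)          # generalized degrees d_1, ..., d_n
D = np.diag(d)             # degree matrix
vol_G = A.sum()            # vol(G) = sum_i sum_j A_{i,j}

P = np.linalg.solve(D, A)  # random-walk transition matrix P = D^{-1} A
pi = d / vol_G             # P(w) = d_w / vol(G)
assert np.allclose(pi @ P, pi)  # pi is indeed stationary
```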

  6. DeepWalk — Roadmap
Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding

  7. DeepWalk — a Two-step Algorithm
Algorithm 1: DeepWalk
1 for n = 1, 2, ..., N do
2   Pick $w_1^n$ according to a probability distribution $P(w_1)$;
3   Generate a vertex sequence $(w_1^n, \cdots, w_L^n)$ of length $L$ by a random walk on network $G$;
4   for j = 1, 2, ..., L − T do
5     for r = 1, ..., T do
6       Add vertex-context pair $(w_j^n, w_{j+r}^n)$ to multiset $\mathcal{D}$;
7       Add vertex-context pair $(w_{j+r}^n, w_j^n)$ to multiset $\mathcal{D}$;
8 Run SGNS on $\mathcal{D}$ with $b$ negative samples.
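Step one of the algorithm is easy to sketch in Python. This is an illustrative sketch, not the released DeepWalk code; the function name and the uniform choice of $w_1$ are my assumptions:

```python
import random
from collections import Counter

def deepwalk_pairs(adj, N, L, T, seed=0):
    """Step 1 of DeepWalk: build the vertex-context multiset D.

    adj: adjacency list {vertex: [neighbors]}; N walks of length L;
    T is the context window size. Returns a Counter over (w, c) pairs.
    """
    rng = random.Random(seed)
    D = Counter()
    vertices = list(adj)
    for _ in range(N):
        walk = [rng.choice(vertices)]               # w_1 ~ P(w_1), uniform here
        for _ in range(L - 1):
            walk.append(rng.choice(adj[walk[-1]]))  # one random-walk step
        for j in range(L - T):
            for r in range(1, T + 1):
                D[(walk[j], walk[j + r])] += 1      # add (w_j, w_{j+r})
                D[(walk[j + r], walk[j])] += 1      # add (w_{j+r}, w_j)
    return D

# Example: D = deepwalk_pairs({0: [1, 2], 1: [0, 2], 2: [0, 1]}, N=10, L=40, T=2)
```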

  8. DeepWalk — Roadmap
Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding
Levy & Goldberg (NIPS'14) notation:
$b$ — number of negative samples
$\#(w, c)$ — co-occurrence count of $w$ and $c$
$\#(w)$ — occurrence count of word $w$
$\#(c)$ — occurrence count of context $c$
$|\mathcal{D}|$ — total number of word-context pairs

  9. Skip-gram with Negative Sampling (SGNS)
◮ SGNS maintains a multiset $\mathcal{D}$ which counts the occurrence of each word-context pair $(w, c)$.
◮ Objective:
$\mathcal{L} = \sum_w \sum_c \left( \#(w, c) \log g(x_w^\top y_c) + \frac{b\,\#(w)\,\#(c)}{|\mathcal{D}|} \log g(-x_w^\top y_c) \right),$
where $x_w, y_c \in \mathbb{R}^d$, $g$ is the sigmoid function, and $b$ is the number of negative samples for SGNS.
◮ For sufficiently large dimensionality $d$, SGNS is equivalent to factorizing the (shifted) PMI matrix (Levy & Goldberg, NIPS'14):
$\log\left(\frac{\#(w, c)\,|\mathcal{D}|}{b\,\#(w)\,\#(c)}\right).$
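For illustration, a small sketch (shifted_pmi_matrix is a name I introduce) of the matrix that SGNS implicitly factorizes, computed from pair counts such as those produced by deepwalk_pairs above:

```python
import numpy as np

def shifted_pmi_matrix(D, n, b):
    """The matrix log( #(w,c) |D| / (b #(w) #(c)) ) from pair counts.

    D: Counter over (w, c) pairs with vertex ids 0..n-1;
    b: number of negative samples. Unobserved pairs come out as -inf.
    """
    C = np.zeros((n, n))
    for (w, c), k in D.items():
        C[w, c] = k
    total = C.sum()                        # |D|
    w_cnt = C.sum(axis=1, keepdims=True)   # #(w), as a column
    c_cnt = C.sum(axis=0, keepdims=True)   # #(c), as a row
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log(C * total / (b * w_cnt * c_cnt))
```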

  10. DeepWalk — Roadmap
Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding (Levy & Goldberg, NIPS'14; notation as on slide 8)

  11. DeepWalk
[Figure: a random walk on the path a–b–c–d–e, producing vertex-context pairs such as (c, d), (c, a), and (c, e)]
Question: Suppose the multiset $\mathcal{D}$ is constructed from random walks on a graph. Can we interpret $\log\left(\frac{\#(w, c)\,|\mathcal{D}|}{b\,\#(w)\,\#(c)}\right)$ in graph-theoretic terms?

  12. DeepWalk
Question: Suppose the multiset $\mathcal{D}$ is constructed from random walks on a graph. Can we interpret $\log\left(\frac{\#(w, c)\,|\mathcal{D}|}{b\,\#(w)\,\#(c)}\right)$ in graph-theoretic terms?
Challenge: The pairs in $\mathcal{D}$ mix several things together, namely direction and distance.

  13. DeepWalk
Question: Suppose the multiset $\mathcal{D}$ is constructed from random walks on a graph. Can we interpret $\log\left(\frac{\#(w, c)\,|\mathcal{D}|}{b\,\#(w)\,\#(c)}\right)$ in graph-theoretic terms?
Challenge: The pairs in $\mathcal{D}$ mix several things together, namely direction and distance.
Solution: Let's distinguish them!

  14. DeepWalk
Partition the multiset $\mathcal{D}$ into several sub-multisets according to the way in which a vertex and its context appear in a random walk sequence. More formally, for $r = 1, \cdots, T$, we define
$\mathcal{D}_{\rightarrow r} = \left\{ (w, c) : (w, c) \in \mathcal{D},\ w = w_j^n,\ c = w_{j+r}^n \right\},$
$\mathcal{D}_{\leftarrow r} = \left\{ (w, c) : (w, c) \in \mathcal{D},\ w = w_{j+r}^n,\ c = w_j^n \right\}.$
[Figure: on the path a–b–c–d–e, the pair (c, d) falls in $\mathcal{D}_{\rightarrow 1}$, (c, e) in $\mathcal{D}_{\rightarrow 2}$, and (c, a) in $\mathcal{D}_{\leftarrow 2}$]
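The same bookkeeping in code — a hypothetical variant of the earlier deepwalk_pairs sketch that keeps direction and distance separate:

```python
from collections import Counter

def partitioned_pairs(walks, T):
    """Split D into D_{->r} and D_{<-r} for r = 1..T.

    walks: list of vertex sequences; returns a dict mapping
    ('fwd', r) / ('bwd', r) to a Counter over (w, c) pairs.
    """
    D = {(d, r): Counter() for d in ("fwd", "bwd") for r in range(1, T + 1)}
    for walk in walks:
        for j in range(len(walk) - T):
            for r in range(1, T + 1):
                D[("fwd", r)][(walk[j], walk[j + r])] += 1  # w precedes c by r
                D[("bwd", r)][(walk[j + r], walk[j])] += 1  # w follows c by r
    return D
```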

  15. DeepWalk as Implicit Matrix Factorization
Some observations:
◮ Observation 1:
$\log\left(\frac{\#(w, c)\,|\mathcal{D}|}{b\,\#(w) \cdot \#(c)}\right) = \log\left(\frac{\#(w, c)/|\mathcal{D}|}{b \left(\#(w)/|\mathcal{D}|\right) \left(\#(c)/|\mathcal{D}|\right)}\right)$
◮ Observation 2:
$\frac{\#(w, c)}{|\mathcal{D}|} = \frac{1}{2T} \sum_{r=1}^{T} \left( \frac{\#(w, c)_{\rightarrow r}}{|\mathcal{D}_{\rightarrow r}|} + \frac{\#(w, c)_{\leftarrow r}}{|\mathcal{D}_{\leftarrow r}|} \right).$
So it is sufficient to characterize $\frac{\#(w, c)_{\rightarrow r}}{|\mathcal{D}_{\rightarrow r}|}$ and $\frac{\#(w, c)_{\leftarrow r}}{|\mathcal{D}_{\leftarrow r}|}$.

  16. DeepWalk — Theorems
Theorem: Denote $P = D^{-1}A$. When the length of the random walk $L \to \infty$,
$\frac{\#(w, c)_{\rightarrow r}}{|\mathcal{D}_{\rightarrow r}|} \xrightarrow{p} \frac{d_w}{\operatorname{vol}(G)} (P^r)_{w,c} \quad\text{and}\quad \frac{\#(w, c)_{\leftarrow r}}{|\mathcal{D}_{\leftarrow r}|} \xrightarrow{p} \frac{d_c}{\operatorname{vol}(G)} (P^r)_{c,w}.$
Theorem: When the length of the random walk $L \to \infty$, we have
$\frac{\#(w, c)}{|\mathcal{D}|} \xrightarrow{p} \frac{1}{2T} \sum_{r=1}^{T} \left( \frac{d_w}{\operatorname{vol}(G)} (P^r)_{w,c} + \frac{d_c}{\operatorname{vol}(G)} (P^r)_{c,w} \right).$
Theorem: For DeepWalk, when the length of the random walk $L \to \infty$,
$\frac{\#(w, c)\,|\mathcal{D}|}{\#(w) \cdot \#(c)} \xrightarrow{p} \frac{\operatorname{vol}(G)}{2T} \left( \frac{1}{d_c} \sum_{r=1}^{T} (P^r)_{w,c} + \frac{1}{d_w} \sum_{r=1}^{T} (P^r)_{c,w} \right).$

  17. DeepWalk — Conclusion
Theorem: DeepWalk is asymptotically and implicitly factorizing
$\log\left(\frac{\operatorname{vol}(G)}{b} \left(\frac{1}{T} \sum_{r=1}^{T} \left(D^{-1}A\right)^r\right) D^{-1}\right).$
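This closed form is directly computable. A dense NumPy transcription of the theorem (a sketch; large graphs would need sparse matrices):

```python
import numpy as np

def deepwalk_matrix(A, T, b):
    """log( vol(G)/(bT) * (sum_{r=1}^T (D^{-1}A)^r) D^{-1} ).

    Zero entries come out as -inf; NetMF later truncates them (slide 26).
    """
    vol_G = A.sum()
    d = A.sum(axis=1)
    P = A / d[:, None]                    # D^{-1} A
    S = np.zeros(A.shape)
    Pr = np.eye(A.shape[0])
    for _ in range(T):
        Pr = Pr @ P                       # P^r
        S += Pr
    M = vol_G / (b * T) * S / d[None, :]  # right-multiplication by D^{-1}
    with np.errstate(divide="ignore"):
        return np.log(M)
```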

  18. DeepWalk — Roadmap
Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding
Via Levy & Goldberg (NIPS'14), the implicitly factorized matrix is expressed in terms of the adjacency matrix $A$, the degree matrix $D$, and the number of negative samples $b$.

  19. LINE
◮ Objective of LINE:
$\mathcal{L} = \sum_{i=1}^{|V|} \sum_{j=1}^{|V|} \left( A_{i,j} \log g(x_i^\top y_j) + \frac{b\,d_i d_j}{\operatorname{vol}(G)} \log g(-x_i^\top y_j) \right).$
◮ Align it with the objective of SGNS:
$\mathcal{L} = \sum_w \sum_c \left( \#(w, c) \log g(x_w^\top y_c) + \frac{b\,\#(w)\,\#(c)}{|\mathcal{D}|} \log g(-x_w^\top y_c) \right).$
◮ LINE is actually factorizing
$\log\left(\frac{\operatorname{vol}(G)}{b} D^{-1}AD^{-1}\right).$
◮ Recall DeepWalk's matrix form:
$\log\left(\frac{\operatorname{vol}(G)}{b} \left(\frac{1}{T} \sum_{r=1}^{T} \left(D^{-1}A\right)^r\right) D^{-1}\right).$
Observation: LINE is a special case of DeepWalk with $T = 1$.
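In code the observation is immediate: LINE's matrix is the DeepWalk matrix at window size $T = 1$ (line_matrix is an illustrative helper):

```python
import numpy as np

def line_matrix(A, b):
    """LINE's implicit matrix: log( vol(G)/b * D^{-1} A D^{-1} )."""
    d = A.sum(axis=1)
    with np.errstate(divide="ignore"):
        return np.log(A.sum() / b * A / d[:, None] / d[None, :])

# Sanity check against deepwalk_matrix from the previous sketch:
# assert np.allclose(line_matrix(A, b=5), deepwalk_matrix(A, T=1, b=5))
```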

  20. PTE
Figure 2: Heterogeneous Text Network.
◮ word-word network $G_{ww}$, $A_{ww} \in \mathbb{R}^{\#\text{word} \times \#\text{word}}$.
◮ document-word network $G_{dw}$, $A_{dw} \in \mathbb{R}^{\#\text{doc} \times \#\text{word}}$.
◮ label-word network $G_{lw}$, $A_{lw} \in \mathbb{R}^{\#\text{label} \times \#\text{word}}$.

  21. PTE as Implicit Matrix Factorization
$\log\left(\begin{bmatrix} \alpha \operatorname{vol}(G_{ww}) (D_{\text{row}}^{ww})^{-1} A_{ww} (D_{\text{col}}^{ww})^{-1} \\ \beta \operatorname{vol}(G_{dw}) (D_{\text{row}}^{dw})^{-1} A_{dw} (D_{\text{col}}^{dw})^{-1} \\ \gamma \operatorname{vol}(G_{lw}) (D_{\text{row}}^{lw})^{-1} A_{lw} (D_{\text{col}}^{lw})^{-1} \end{bmatrix}\right) - \log b$
◮ The matrix is of shape $(\#\text{word} + \#\text{doc} + \#\text{label}) \times \#\text{word}$.
◮ $b$ is the number of negative samples in training.
◮ $\{\alpha, \beta, \gamma\}$ are hyper-parameters balancing the weights of the three networks. In PTE, they satisfy
$\alpha \operatorname{vol}(G_{ww}) = \beta \operatorname{vol}(G_{dw}) = \gamma \operatorname{vol}(G_{lw}).$
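A sketch of assembling this block matrix in NumPy (normalized_block and pte_matrix are names I introduce for illustration):

```python
import numpy as np

def normalized_block(A, weight):
    """weight * vol(G) * D_row^{-1} A D_col^{-1} for one network block."""
    d_row = A.sum(axis=1)
    d_col = A.sum(axis=0)
    return weight * A.sum() * A / d_row[:, None] / d_col[None, :]

def pte_matrix(A_ww, A_dw, A_lw, alpha, beta, gamma, b):
    """Stack the three normalized blocks and apply log(.) - log(b)."""
    M = np.vstack([
        normalized_block(A_ww, alpha),  # word-word rows
        normalized_block(A_dw, beta),   # document-word rows
        normalized_block(A_lw, gamma),  # label-word rows
    ])
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log(M) - np.log(b)
```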

  22. node2vec — 2nd Order Random Walk
$T_{u,v,w} = \begin{cases} \frac{1}{p} & (u, v) \in E, (v, w) \in E, u = w; \\ 1 & (u, v) \in E, (v, w) \in E, u \neq w, (w, u) \in E; \\ \frac{1}{q} & (u, v) \in E, (v, w) \in E, u \neq w, (w, u) \notin E; \\ 0 & \text{otherwise}. \end{cases}$
$P_{u,v,w} = \operatorname{Prob}\left(w_{j+1} = u \mid w_j = v, w_{j-1} = w\right) = \frac{T_{u,v,w}}{\sum_u T_{u,v,w}}.$
Stationary distribution:
$\sum_w P_{u,v,w} X_{v,w} = X_{u,v}.$
Existence is guaranteed by the Perron–Frobenius theorem, but uniqueness is not.
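A dense-tensor sketch of the second-order transition probabilities (fine for toy graphs; practical node2vec implementations sample transitions with alias tables instead):

```python
import numpy as np

def node2vec_transition(A, p, q):
    """P[u, v, w] = Prob(next = u | current = v, previous = w)."""
    n = A.shape[0]
    E = A > 0                                 # undirected edge indicator
    T = np.zeros((n, n, n))
    for u in range(n):
        for v in range(n):
            for w in range(n):
                if E[u, v] and E[v, w]:
                    if u == w:
                        T[u, v, w] = 1.0 / p  # step back to previous vertex
                    elif E[w, u]:
                        T[u, v, w] = 1.0      # stay at distance 1 from w
                    else:
                        T[u, v, w] = 1.0 / q  # move to distance 2 from w
    Z = T.sum(axis=0, keepdims=True)          # normalize over next vertex u
    return np.divide(T, Z, out=np.zeros_like(T), where=Z > 0)
```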

  23. node2vec as Implicit Matrix Factorization
Theorem: node2vec is asymptotically and implicitly factorizing a matrix whose entry at the $w$-th row and $c$-th column is
$\log\left(\frac{\frac{1}{2T} \sum_{r=1}^{T} \left( \sum_u X_{w,u} (P^r)_{c,w,u} + \sum_u X_{c,u} (P^r)_{w,c,u} \right)}{b \left(\sum_u X_{w,u}\right) \left(\sum_u X_{c,u}\right)}\right).$

  24. Contents
Preliminaries: Notations
Main Theoretic Results: DeepWalk (KDD'14), LINE (WWW'15), PTE (KDD'15), node2vec (KDD'16)
NetMF: NetMF for a Small Window Size T; NetMF for a Large Window Size T
Experiments

  25. Roadmap
Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding
Via Levy & Goldberg (NIPS'14), the random-walk and skip-gram stages are replaced by explicit Matrix Factorization.

  26. NetMF
◮ Factorize the DeepWalk matrix:
$\log\left(\frac{\operatorname{vol}(G)}{b} \left(\frac{1}{T} \sum_{r=1}^{T} \left(D^{-1}A\right)^r\right) D^{-1}\right).$
◮ For numerical reasons, we use the truncated logarithm $\tilde{\log}(x) = \log(\max(1, x))$.
Figure 3: Truncated Logarithm [plot of $\tilde{\log}(x)$ for $x \in [0, 5]$].

  27. NetMF for a Small Window Size T
Algorithm 2: NetMF for a Small Window Size T
1 Compute $P^1, \cdots, P^T$;
2 Compute $M = \frac{\operatorname{vol}(G)}{bT} \left(\sum_{r=1}^{T} P^r\right) D^{-1}$;
3 Compute $M' = \max(M, 1)$;
4 Rank-$d$ approximation by SVD: $\log M' = U_d \Sigma_d V_d^\top$;
5 return $U_d \sqrt{\Sigma_d}$ as network embedding.
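Putting Algorithm 2 together, a minimal dense NumPy sketch (the truncated logarithm handles zero entries; a production version would use sparse matrices and truncated SVD):

```python
import numpy as np

def netmf_small_t(A, T, b, dim):
    """NetMF for a small window size T (Algorithm 2), dense version."""
    vol_G = A.sum()
    d = A.sum(axis=1)
    P = A / d[:, None]                        # P = D^{-1} A
    S = np.zeros(A.shape)
    Pr = np.eye(A.shape[0])
    for _ in range(T):                        # steps 1-2: sum_{r=1}^T P^r
        Pr = Pr @ P
        S += Pr
    M = vol_G / (b * T) * S / d[None, :]      # M = vol(G)/(bT) (sum P^r) D^{-1}
    log_M = np.log(np.maximum(M, 1.0))        # step 3: truncated logarithm
    U, sigma, _ = np.linalg.svd(log_M)        # step 4: SVD
    return U[:, :dim] * np.sqrt(sigma[:dim])  # step 5: U_d sqrt(Sigma_d)

# Example (toy graph A from the notation sketch): emb = netmf_small_t(A, T=10, b=1, dim=2)
```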
