  1. CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

  2. High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
     Graph data: PageRank, SimRank; Community Detection; Spam Detection
     Infinite data: Filtering data streams; Web advertising; Queries on streams
     Machine learning: SVM; Decision Trees; Perceptron, kNN
     Apps: Recommender systems; Association Rules; Duplicate document detection
     2/5/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets

  3. Facebook social graph: 4 degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]

  4. Connections between political blogs: polarization of the network [Adamic-Glance, 2005]

  5. Citation networks and maps of science [Börner et al., 2012]

  6. Internet [Figure: routers connecting domain1, domain2, and domain3]

  7. Seven Bridges of Königsberg [Euler, 1735]: Return to the starting point by traveling each link of the graph once and only once.

  8. Web as a directed graph:
     - Nodes: Webpages
     - Edges: Hyperlinks
     [Figure: example pages connected by hyperlinks: "I teach a class on Networks.", "CS224W: Classes are in the Gates building", "Computer Science Department at Stanford", "Stanford University"]


  11. How to organize the Web?
      - First try: human-curated Web directories (Yahoo, DMOZ, LookSmart)
      - Second try: Web search
        - Information Retrieval investigates: find relevant docs in a small and trusted set (newspaper articles, patents, etc.)
        - But: the Web is huge, full of untrusted documents, random things, web spam, etc.

  12. 2 challenges of web search:
      - (1) The Web contains many sources of information. Who to “trust”?
        - Trick: Trustworthy pages may point to each other!
      - (2) What is the “best” answer to the query “newspaper”?
        - No single right answer
        - Trick: Pages that actually know about newspapers might all be pointing to many newspapers

  13. Not all web pages are equally “important”: www.joe-schmoe.com vs. www.stanford.edu
      - There is large diversity in the web-graph node connectivity. Let’s rank the pages by the link structure!

  14. We will cover the following link analysis approaches for computing the importance of nodes in a graph:
      - PageRank
      - Hubs and Authorities (HITS)
      - Topic-Specific (Personalized) PageRank
      - Web Spam Detection Algorithms

  15. Idea: Links as votes
      - A page is more important if it has more links
        - In-coming links? Out-going links?
      - Think of in-links as votes:
        - www.stanford.edu has 23,400 in-links
        - www.joe-schmoe.com has 1 in-link
      - Are all in-links equal?
        - Links from important pages count more
        - Recursive question!


  17. Each link’s vote is proportional to the importance of its source page.
      - If page j with importance $r_j$ has n out-links, each link gets $r_j / n$ votes.
      - Page j’s own importance is the sum of the votes on its in-links.
      - [Figure: pages i and k link to j, contributing $r_i/3$ and $r_k/4$, so $r_j = r_i/3 + r_k/4$; j has three out-links, each carrying $r_j/3$.]

  18. The web in 1839 (three pages: y, a, m)
      - A “vote” from an important page is worth more.
      - A page is important if it is pointed to by other important pages.
      - Define a “rank” $r_j$ for page j: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$, where $d_i$ is the out-degree of node i.
      - [Figure: y links to itself and to a (y/2 each); a links to y and to m (a/2 each); m links to a.]
      - “Flow” equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$

  19. Flow equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$
      - 3 equations, 3 unknowns, no constants
      - No unique solution: all solutions are equivalent up to a scale factor
      - An additional constraint forces uniqueness: $r_y + r_a + r_m = 1$
      - Solution: $r_y = \frac{2}{5}$, $r_a = \frac{2}{5}$, $r_m = \frac{1}{5}$
      - Gaussian elimination works for small examples, but we need a better method for large web-size graphs.
      - We need a new formulation!
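
As a worked check, the three-page flow equations can be solved by substitution once the normalization $r_y + r_a + r_m = 1$ is imposed. The derivation below uses only the equations stated on the slide:

```latex
\begin{align*}
r_y &= \tfrac{1}{2} r_y + \tfrac{1}{2} r_a
    &&\Rightarrow\quad r_y = r_a \\
r_m &= \tfrac{1}{2} r_a \\
r_y + r_a + r_m &= r_a + r_a + \tfrac{1}{2} r_a = \tfrac{5}{2} r_a = 1
    &&\Rightarrow\quad r_a = \tfrac{2}{5} \\
r_y &= \tfrac{2}{5}, \qquad r_m = \tfrac{1}{5} \\
\text{check: } r_a &= \tfrac{1}{2} r_y + r_m = \tfrac{1}{5} + \tfrac{1}{5} = \tfrac{2}{5}
\end{align*}
```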

  20. Stochastic adjacency matrix $M$
      - Let page i have $d_i$ out-links.
      - If $i \to j$, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$.
      - $M$ is a column stochastic matrix: its columns sum to 1.
      Rank vector $r$: a vector with one entry per page
      - $r_i$ is the importance score of page i.
      - $\sum_i r_i = 1$
      The flow equations $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ can be written as $r = M \cdot r$.
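
The definition of the stochastic adjacency matrix can be sketched in a few lines of Python. This is a minimal illustration, not part of the slides: the `out_links` dictionary and the `column_stochastic` helper are hypothetical names, the graph is the three-page y, a, m example, and the sketch assumes every page has at least one out-link and no repeated links.

```python
from fractions import Fraction

# Hypothetical encoding of the three-page example (y, a, m):
# each page maps to the list of pages it links to.
out_links = {
    "y": ["y", "a"],
    "a": ["y", "m"],
    "m": ["a"],
}

def column_stochastic(out_links):
    """Return (pages, M) with M[j][i] = 1/d_i if page i links to page j, else 0."""
    pages = list(out_links)
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for src, targets in out_links.items():
        d = len(targets)  # out-degree d_i of the source page (assumed > 0)
        for dst in targets:
            M[index[dst]][index[src]] = Fraction(1, d)
    return pages, M

pages, M = column_stochastic(out_links)

# Each column holds one page's outgoing votes, so it should sum to 1.
column_sums = [sum(M[row][col] for row in range(len(pages)))
               for col in range(len(pages))]
print(column_sums)
```

Exact fractions are used so that the column sums can be compared to 1 without floating-point tolerance.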

  21. Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
      - Flow equation in matrix form: $M \cdot r = r$
      - Suppose page i links to 3 pages, including j.
      - [Figure: column i of $M$ has the entry 1/3 in row j, so in the product $M \cdot r = r$ page i contributes $r_i/3$ to $r_j$.]

  22. The flow equations can be written $r = M \cdot r$.
      - So the rank vector $r$ is an eigenvector of the stochastic web matrix $M$.
      - In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1.
      - The largest eigenvalue of $M$ is 1 since $M$ is column stochastic: we know $r$ is unit length (in the $L_1$ norm) and each column of $M$ sums to one, so $|M r|_1 \le 1$.
      - NOTE: $x$ is an eigenvector with the corresponding eigenvalue $\lambda$ if $A x = \lambda x$.
      - We can now efficiently solve for $r$! The method is called power iteration.

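  23. Matrix $M$ for the y, a, m example (rows and columns ordered y, a, m):

            y    a    m
        y   ½    ½    0
        a   ½    0    1
        m   0    ½    0

      $r = M \cdot r$:
      $\begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix}$

      Equivalently: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$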
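
The eigenvector claim can be checked directly on this example. The sketch below is illustrative only (the `mat_vec` helper is a hypothetical name): it multiplies the example matrix by the normalized solution of the flow equations, $r_y = r_a = 2/5$ and $r_m = 1/5$, and confirms that the product is the same vector, i.e. an eigenvector with eigenvalue 1.

```python
from fractions import Fraction as F

# Column-stochastic matrix of the y, a, m example (rows/columns ordered y, a, m).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(1)],
    [F(0),    F(1, 2), F(0)],
]

# Normalized solution of the flow equations.
r = [F(2, 5), F(2, 5), F(1, 5)]

def mat_vec(M, v):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(m_ji * v_i for m_ji, v_i in zip(row, v)) for row in M]

Mr = mat_vec(M, r)
print(Mr == r)  # M * r equals r exactly when r is an eigenvector with eigenvalue 1
```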

  24. Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks.
      Power iteration: a simple iterative scheme
      - Suppose there are N web pages.
      - Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
      - Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, where $d_i$ is the out-degree of node i
      - Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
      - $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the $L_1$ norm
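
The scheme on this slide translates almost line for line into code. The following is a minimal sketch under stated assumptions: the function name `power_iteration`, the tolerance `eps`, and the `max_iter` safety cap are illustrative choices not given in the slides, and the matrix is the dense three-page y, a, m example rather than a web-scale graph.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change drops below eps."""
    n = len(M)
    r = [1.0 / n] * n  # r^(0) = [1/N, ..., 1/N]^T
    for _ in range(max_iter):
        # r^(t+1) = M . r^(t)
        r_next = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        # Stop when |r^(t+1) - r^(t)|_1 < eps
        if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r  # returned without meeting the tolerance if max_iter is reached

# Column-stochastic matrix of the y, a, m example (rows/columns ordered y, a, m).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]

r = power_iteration(M)
print(r)  # close to the exact solution (2/5, 2/5, 1/5)
```

A dense list-of-rows matrix keeps the sketch short; for a large graph one would store only the nonzero entries.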

  25. Power iteration on the y, a, m example, with $M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}$ (rows and columns ordered y, a, m) and flow equations $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$:
      - Set $r_j = 1/N$
      - 1: $r'_j = \sum_{i \to j} \frac{r_i}{d_i}$
      - 2: $r = r'$
      - Goto 1
      Example (iterations 0, 1, 2, 3, …, with the limit in the last column):

        r_y   1/3   1/3   5/12   9/24    …   6/15
        r_a   1/3   3/6   1/3    11/24   …   6/15
        r_m   1/3   1/6   3/12   1/6     …   3/15
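
The first iterations of this example can be reproduced exactly with rational arithmetic. This is a small sketch, assuming the same y, a, m matrix; the variable `history` is an illustrative name. Python's `Fraction` reduces to lowest terms, so values such as 3/6 or 9/24 from the slide appear as 1/2 and 3/8.

```python
from fractions import Fraction as F

# Column-stochastic matrix of the y, a, m example (rows/columns ordered y, a, m).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(1)],
    [F(0),    F(1, 2), F(0)],
]

r = [F(1, 3)] * 3  # iteration 0: uniform start, r_j = 1/N
history = [r]
for _ in range(3):
    # One step of r' = M r, then r = r'
    r = [sum(m * x for m, x in zip(row, r)) for row in M]
    history.append(r)

for t, vec in enumerate(history):
    print(t, [str(x) for x in vec])
```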


  27. Power iteration: a method for finding the dominant eigenvector (the vector corresponding to the largest eigenvalue)
      - $r^{(1)} = M \cdot r^{(0)}$
      - $r^{(2)} = M \cdot r^{(1)} = M (M r^{(0)}) = M^2 \cdot r^{(0)}$
      - $r^{(3)} = M \cdot r^{(2)} = M (M^2 r^{(0)}) = M^3 \cdot r^{(0)}$
      - Claim: the sequence $M \cdot r^{(0)}, M^2 \cdot r^{(0)}, \ldots, M^k \cdot r^{(0)}, \ldots$ approaches the dominant eigenvector of $M$.
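
The identity behind this slide, that three steps of the iteration equal one multiplication by the third power of the matrix, can be confirmed on the y, a, m example. The sketch below is illustrative; `mat_vec` and `mat_mul` are hypothetical helper names and exact fractions are used so the comparison is exact.

```python
from fractions import Fraction as F

# Column-stochastic matrix of the y, a, m example (rows/columns ordered y, a, m).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(1)],
    [F(0),    F(1, 2), F(0)],
]

def mat_vec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def mat_mul(A, B):
    """Product of two square list-of-rows matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

r0 = [F(1, 3)] * 3

# Three successive steps: r1 = M r0, r2 = M r1, r3 = M r2
stepwise = r0
for _ in range(3):
    stepwise = mat_vec(M, stepwise)

# One multiplication by the third matrix power
M3 = mat_mul(M, mat_mul(M, M))
direct = mat_vec(M3, r0)

print(stepwise == direct)
```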
