

  1. CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

  2. Web pages are not equally "important"
     • www.joe-schmoe.com vs. www.stanford.edu
     • We already know: since there is large diversity in the connectivity of the web graph, we can rank the pages by the link structure.

  3. We will cover the following link-analysis approaches to computing the importance of nodes in a graph:
     • PageRank
     • Hubs and Authorities (HITS)
     • Topic-Specific (Personalized) PageRank
     • Web Spam Detection Algorithms

  4. Idea: links as votes
     • A page is more important if it has more links. In-coming links? Out-going links?
     • Think of in-links as votes: www.stanford.edu has 23,400 in-links; www.joe-schmoe.com has 1 in-link.
     • Are all in-links equal? Links from important pages count more. Recursive question!

  5. Each link's vote is proportional to the importance of its source page.
     • If page p with importance x has n out-links, each link gets x/n votes.
     • Page p's own importance is the sum of the votes on its in-links.

  6. [Example graph, "The web in 1839": y links to itself and to a; a links to y and m; m links to a]
     • A "vote" from an important page is worth more.
     • A page is important if it is pointed to by other important pages.
     • Define a "rank" r_j for node j:  r_j = Σ_{i→j} r_i / d_out(i)
     • Flow equations:
         r_y = r_y/2 + r_a/2
         r_a = r_y/2 + r_m
         r_m = r_a/2

  7. Flow equations:
         r_y = r_y/2 + r_a/2
         r_a = r_y/2 + r_m
         r_m = r_a/2
     • 3 equations, 3 unknowns, no constants.
     • No unique solution: all solutions are equivalent up to a scale factor.
     • An additional constraint forces uniqueness: r_y + r_a + r_m = 1, giving r_y = 2/5, r_a = 2/5, r_m = 1/5.
     • Gaussian elimination works for small examples like this (see the sketch below), but we need a better method for large web-size graphs.
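As a quick check of the numbers above, here is a minimal sketch (not from the slides) that solves the small linear system together with the normalization constraint using NumPy. The node ordering y, a, m is an assumption of this example.

```python
import numpy as np

# Column-stochastic link matrix of the y/a/m example, columns ordered [y, a, m].
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

# Flow equations: (M - I) r = 0, plus the constraint r_y + r_a + r_m = 1.
A = np.vstack([M - np.eye(3), np.ones((1, 3))])
b = np.array([0.0, 0.0, 0.0, 1.0])
r, *_ = np.linalg.lstsq(A, b, rcond=None)
print(r)  # expected: [0.4, 0.4, 0.2], i.e. r_y = r_a = 2/5, r_m = 1/5
```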

  8. Stochastic adjacency matrix M
     • Let page j have d_j out-links. If j → i, then M_ij = 1/d_j, else M_ij = 0.
     • M is a column-stochastic matrix: columns sum to 1.
     • Rank vector r: a vector with one entry per page; r_i is the importance score of page i, and Σ_i r_i = 1.
     • The flow equations can be written r = M·r (a small construction sketch follows below).
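A minimal sketch of how such a column-stochastic M could be built from out-link lists. The `out_links` dictionary format and the node numbering are assumptions of this example, not part of the slides.

```python
import numpy as np

def build_column_stochastic(out_links, n):
    """Build the column-stochastic matrix M from out-link lists.
    out_links[j] lists the pages that page j links to; column j of M
    gets 1/d_j in every row i with j -> i, where d_j = len(out_links[j])."""
    M = np.zeros((n, n))
    for j, targets in out_links.items():
        for i in targets:
            M[i, j] = 1.0 / len(targets)   # dead ends simply keep an all-zero column
    return M

# The y/a/m example, numbering y=0, a=1, m=2 (a hypothetical encoding):
out_links = {0: [0, 1], 1: [0, 2], 2: [1]}
M = build_column_stochastic(out_links, 3)
print(M.sum(axis=0))  # every column sums to 1
```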

  9. Suppose page j links to 3 pages, including page i.
     • Then column j of M has 1/3 in row i, and in r = M·r page i receives a 1/3 share of r_j.

  10. The flow equations can be written r = M·r
     • So the rank vector r is an eigenvector of the stochastic web matrix M.
     • In fact, it is the first, or principal, eigenvector, with corresponding eigenvalue 1.

  11. Example (y/a/m graph), r = M·r written out:
             y    a    m
         y [ ½    ½    0 ]   r_y = r_y/2 + r_a/2
         a [ ½    0    1 ]   r_a = r_y/2 + r_m
         m [ 0    ½    0 ]   r_m = r_a/2

  12. Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks
     • Power iteration: a simple iterative scheme.
     • Initialize: r^(0) = [1/N, ..., 1/N]^T
     • Iterate: r^(t+1) = M·r^(t), i.e. r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i, where d_i is the out-degree of node i.
     • Stop when |r^(t+1) - r^(t)|_1 < ε.
     • |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; any other vector norm (e.g., Euclidean) could be used.
     • A minimal code sketch follows below.
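A minimal sketch of the power-iteration scheme above, assuming a column-stochastic M as defined on slide 8. The function name and default parameters are assumptions of this example.

```python
import numpy as np

def power_iteration(M, eps=1e-8, max_iters=100):
    """Plain power iteration r^(t+1) = M r^(t), stopping when the L1 change
    drops below eps.  M must be column stochastic.  This is the bare scheme
    from this slide, not the full PageRank with teleports introduced later."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)                 # r^(0) = [1/N, ..., 1/N]
    for _ in range(max_iters):
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:  # L1 norm of the change
            return r_next
        r = r_next
    return r
```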

  13. Power iteration on the y/a/m example
     • M (rows/columns ordered y, a, m):
           y [ ½    ½    0 ]
           a [ ½    0    1 ]
           m [ 0    ½    0 ]
     • Set r_j = 1/N and iterate r_j = Σ_{i→j} r_i / d_i.
     • Flow equations: r_y = r_y/2 + r_a/2,  r_a = r_y/2 + r_m,  r_m = r_a/2
     • Example (iterations 0, 1, 2, ...):
         r_y:  1/3   1/3   5/12   9/24   ...   6/15
         r_a:  1/3   3/6   1/3    11/24  ...   6/15
         r_m:  1/3   1/6   3/12   1/6    ...   3/15
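For completeness, a hypothetical run of the `power_iteration` sketch from the previous slide on this example matrix (the node order y, a, m is an assumption of this sketch):

```python
import numpy as np

# The y/a/m matrix from the slide, columns ordered y, a, m.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
r = power_iteration(M)   # the sketch defined after slide 12
print(r)                 # approaches [0.4, 0.4, 0.2], i.e. [6/15, 6/15, 3/15]
```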

  14. Imagine a random web surfer [figure: pages i_1, i_2, i_3 all link to page j]:
     • At any time t, the surfer is on some page u.
     • At time t+1, the surfer follows an out-link from u uniformly at random and ends up on some page v linked from u.
     • The process repeats indefinitely.
     • This matches the flow equation r_j = Σ_{i→j} r_i / d_out(i).
     • Let p(t) be the vector whose i-th coordinate is the probability that the surfer is at page i at time t; p(t) is a probability distribution over pages.

  15. Where is the surfer at time t+1?
     • The surfer follows a link uniformly at random: p(t+1) = M·p(t).
     • Suppose the random walk reaches a state where p(t+1) = M·p(t) = p(t); then p(t) is a stationary distribution of the random walk.
     • Our rank vector r satisfies r = M·r, so it is a stationary distribution for the random walk.

  16. r = M·r, or equivalently r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
     • Does this converge?
     • Does it converge to what we want?
     • Are the results reasonable?

  17. Iterate r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i on a two-node graph with pages a and b.
     • Example (iterations 0, 1, 2, ...):
         r_a:  1   0   1   0   ...
         r_b:  0   1   0   1   ...
     • The ranks oscillate between a and b; the iteration does not converge.

  18. Iterate r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i on a two-node graph with pages a and b, where b has no out-links.
     • Example (iterations 0, 1, 2, ...):
         r_a:  1   0   0   0
         r_b:  0   1   0   0
     • All the importance leaks out: the ranks go to zero.

  19. Two problems:
     • Some pages are "dead ends" (they have no out-links); such pages cause importance to "leak out".
     • Spider traps (all out-links are within the group); eventually spider traps absorb all importance.

  20. Power iteration with a spider trap: in the y/a/m example, m now links only to itself.
     • M (rows/columns ordered y, a, m):
           y [ ½    ½    0 ]
           a [ ½    0    0 ]
           m [ 0    ½    1 ]
     • Set r_j = 1/N and iterate r_j = Σ_{i→j} r_i / d_i.
     • Flow equations: r_y = r_y/2 + r_a/2,  r_a = r_y/2,  r_m = r_a/2 + r_m
     • Example (iterations 0, 1, 2, ...):
         r_y:  1/3   2/6   3/12   5/24    ...   0
         r_a:  1/3   1/6   2/12   3/24    ...   0
         r_m:  1/3   3/6   7/12   16/24   ...   1
     • The spider trap m eventually absorbs all the importance.

  21. The Google solution for spider traps: at each time step, the random surfer has two options:
     • With probability β, follow a link at random.
     • With probability 1-β, jump to some page uniformly at random.
     • Common values for β are in the range 0.8 to 0.9.
     • The surfer will teleport out of a spider trap within a few time steps (a code sketch follows below).
     • [Figure: the y/a/m spider-trap graph before and after adding teleport links]
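One standard way to write the teleporting surfer described above as an update rule is r ← β·M·r + (1-β)·[1/N, ..., 1/N]. The slide states the rule only in words, so the sketch below is an assumption of that form; it also assumes M has no dead ends (every column sums to 1), which are handled on the following slides.

```python
import numpy as np

def pagerank_with_teleports(M, beta=0.85, eps=1e-8, max_iters=100):
    """Power iteration for the teleporting surfer: with probability beta
    follow a random out-link (M), with probability 1 - beta jump to a
    uniformly random page.  Assumes M is column stochastic (no dead ends)."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    jump = np.full(N, 1.0 / N)              # uniform teleport distribution
    for _ in range(max_iters):
        r_next = beta * (M @ r) + (1.0 - beta) * jump
        if np.abs(r_next - r).sum() < eps:  # L1 convergence test
            return r_next
        r = r_next
    return r
```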

  22. Power iteration with a dead end: in the y/a/m example, m now has no out-links.
     • M (rows/columns ordered y, a, m):
           y [ ½    ½    0 ]
           a [ ½    0    0 ]
           m [ 0    ½    0 ]
     • Set r_j = 1/N and iterate r_j = Σ_{i→j} r_i / d_i.
     • Flow equations: r_y = r_y/2 + r_a/2,  r_a = r_y/2,  r_m = r_a/2
     • Example (iterations 0, 1, 2, ...):
         r_y:  1/3   2/6   3/12   5/24   ...   0
         r_a:  1/3   1/6   2/12   3/24   ...   0
         r_m:  1/3   1/6   1/12   2/24   ...   0
     • All the importance leaks out through the dead end m.

  23. Teleports: from dead ends, follow random teleport links with probability 1.0.
     • Adjust the matrix accordingly: the all-zero column of the dead end m becomes uniform, with ⅓ in every row (a code sketch follows below).
       Before:                     After:
             y    a    m                 y    a    m
         y [ ½    ½    0 ]           y [ ½    ½    ⅓ ]
         a [ ½    0    0 ]           a [ ½    0    ⅓ ]
         m [ 0    ½    0 ]           m [ 0    ½    ⅓ ]
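A minimal sketch of the matrix adjustment above: replace every all-zero (dead-end) column with a uniform column. The function name is an assumption of this example.

```python
import numpy as np

def fix_dead_ends(M):
    """Replace each all-zero (dead-end) column of M with a uniform column
    of 1/N entries, i.e. teleport with probability 1.0 from dead ends."""
    N = M.shape[0]
    M = M.copy()
    dead = (M.sum(axis=0) == 0)   # columns that sum to 0 are dead ends
    M[:, dead] = 1.0 / N
    return M

# The y/a/m example where m is a dead end:
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
print(fix_dead_ends(M))   # m's column becomes [1/3, 1/3, 1/3]
```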

  24. Markov chains (recall the iteration r^(t+1) = M·r^(t))
     • Set of states X.
     • Transition matrix P, where P_ij = P(X_t = i | X_{t-1} = j).
     • π specifies the probability of being at each state x ∈ X.
     • Goal: find π such that π = P·π.

  25. Theory of Markov chains
     • Fact: For any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible, and aperiodic.

  26. Stochastic: every column sums to 1.
     • A possible solution for dead ends: add teleport links (shown in green on the slide) from the dead end to every node.
     • S = M + (1/n)·1·aᵀ (i.e., S_ij = M_ij + a_j/n), where a_i = 1 if node i has out-degree 0 and a_i = 0 otherwise, and 1 is the vector of all 1s.
     • For the y/a/m example (m is a dead end):
             y    a    m
         y [ ½    ½    ⅓ ]       r_y = r_y/2 + r_a/2 + r_m/3
         a [ ½    0    ⅓ ]       r_a = r_y/2 + r_m/3
         m [ 0    ½    ⅓ ]       r_m = r_a/2 + r_m/3
     • A sketch of this adjustment in matrix form follows below.
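The same dead-end fix written directly with the outer-product formula above, on the y/a/m example; a minimal sketch, assuming the column-stochastic convention used throughout these slides.

```python
import numpy as np

# y/a/m example where m (column 2) is a dead end.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
n = M.shape[0]
a = (M.sum(axis=0) == 0).astype(float)   # dead-end indicator, here [0, 0, 1]
S = M + np.outer(np.ones(n), a) / n      # adds 1/n in every row of each dead-end column
print(S)                                 # m's column becomes [1/3, 1/3, 1/3]
print(S.sum(axis=0))                     # every column of S now sums to 1
```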

  27. Aperiodic: a chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k.
     • A possible solution: add green links (extra edges) so the chain becomes aperiodic.
     • [Figure: the y/a/m example graph with the added green links]

  28. Irreducible: from any state, there is a non-zero probability of reaching any other state.
     • A possible solution: add green links (extra edges) so that every state can reach every other state.
     • [Figure: the y/a/m example graph with the added green links]
