CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu
Web pages are not equally “important”: www.joe-schmoe.com vs. www.stanford.edu. Since there is large diversity in the connectivity of the web graph, we can rank the pages by the link structure. 2/7/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets
We will cover the following Link Analysis approaches to computing the importance of nodes in a graph: PageRank; Hubs and Authorities (HITS); Topic-Specific (Personalized) PageRank; Web Spam Detection Algorithms.
Idea: links as votes. A page is more important if it has more links. In-coming links? Out-going links? Think of in-links as votes: www.stanford.edu has 23,400 in-links; www.joe-schmoe.com has 1 in-link. Are all in-links equal? Links from important pages count more. Recursive question!
Each link’s vote is proportional to the importance of its source page. If page p with importance x has n out-links, each link gets x/n votes. Page p’s own importance is the sum of the votes on its in-links.
A “vote” from an important page is worth more: a page is important if it is pointed to by other important pages. Define a “rank” r_j for node j:

    r_j = Σ_{i→j} r_i / d_out(i)

Example (“the web in 1839”, three pages y, a, m: y links to y and a; a links to y and m; m links to a). Flow equations:
    r_y = r_y/2 + r_a/2
    r_a = r_y/2 + r_m
    r_m = r_a/2
Flow equations: 3 equations, 3 unknowns, no constants.
    r_y = r_y/2 + r_a/2
    r_a = r_y/2 + r_m
    r_m = r_a/2
There is no unique solution: all solutions are equivalent modulo a scale factor. An additional constraint forces uniqueness: r_y + r_a + r_m = 1, which gives r_y = 2/5, r_a = 2/5, r_m = 1/5. The Gaussian elimination method works for small examples, but we need a better method for large web-size graphs.
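As a small sanity check (a sketch, not part of the lecture), the 3-node system above can be solved directly by replacing the redundant third flow equation with the normalization constraint r_y + r_a + r_m = 1:

```python
import numpy as np

# Flow equations rearranged to homogeneous form, with the last
# (redundant) equation replaced by the constraint r_y + r_a + r_m = 1:
#   r_y - r_y/2 - r_a/2 = 0
#   r_a - r_y/2 - r_m   = 0
#   r_y + r_a + r_m     = 1
A = np.array([[ 0.5, -0.5,  0.0],
              [-0.5,  1.0, -1.0],
              [ 1.0,  1.0,  1.0]])
b = np.array([0.0, 0.0, 1.0])

r = np.linalg.solve(A, b)   # -> [2/5, 2/5, 1/5]
```

This confirms the unique normalized solution, but as the slide notes, direct solvers do not scale to web-size graphs.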
Stochastic adjacency matrix M: let page j have d_j out-links. If j → i, then M_ij = 1/d_j, else M_ij = 0. M is a column-stochastic matrix: its columns sum to 1. Rank vector r: a vector with one entry per page, where r_i is the importance score of page i and Σ_i r_i = 1. The flow equations can be written r = M·r.
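A minimal sketch of building M from out-link lists; the `links` dictionary is a hypothetical encoding of the y, a, m example (y = 0, a = 1, m = 2):

```python
import numpy as np

# links[j] lists the pages that page j links to
links = {0: [0, 1],   # y links to y and a
         1: [0, 2],   # a links to y and m
         2: [1]}      # m links to a
N = 3

M = np.zeros((N, N))
for j, outs in links.items():
    for i in outs:
        M[i, j] = 1.0 / len(outs)   # M_ij = 1/d_j if j -> i
```

Each column j splits page j's vote evenly over its d_j out-links, so every column sums to 1.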
Suppose page j links to 3 pages, including i. Then column j of M has the entry 1/3 in each of those three rows, and page i receives the contribution r_j/3 in the product M·r.
The flow equations can be written r = M·r, so the rank vector r is an eigenvector of the stochastic web matrix M. In fact, it is the first, or principal, eigenvector, with corresponding eigenvalue 1.
Example (y links to y and a; a links to y and m; m links to a):

          y    a    m
    y  [  ½    ½    0 ]
    a  [  ½    0    1 ]
    m  [  0    ½    0 ]

r = M·r is equivalent to the flow equations:
    r_y = r_y/2 + r_a/2
    r_a = r_y/2 + r_m
    r_m = r_a/2
Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks, power iteration is a simple iterative scheme:
    Initialize: r^(0) = [1/N, …, 1/N]^T
    Iterate: r^(t+1) = M·r^(t), i.e., r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i, where d_i is the out-degree of node i
    Stop when |r^(t+1) − r^(t)|_1 < ε
Here |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; any other vector norm (e.g., Euclidean) can also be used.
Power iteration on the y, a, m example: set r_j^(0) = 1/N and iterate r_j^(t+1) = Σ_{i→j} r_i^(t)/d_i:

    r_y     1/3   1/3   5/12    9/24   ...   6/15
    r_a  =  1/3   3/6   1/3    11/24   ...   6/15
    r_m     1/3   1/6   3/12    1/6    ...   3/15
    (iteration 0, 1, 2, ...)
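The iteration above can be sketched directly; this reproduces the table's limit of (6/15, 6/15, 3/15) = (2/5, 2/5, 1/5):

```python
import numpy as np

# Column-stochastic matrix for the y, a, m example
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

r = np.full(3, 1/3)                # r^(0) = [1/N, ..., 1/N]
for _ in range(1000):
    r_next = M @ r                 # r^(t+1) = M . r^(t)
    if np.abs(r_next - r).sum() < 1e-12:   # L1 stopping criterion
        r = r_next
        break
    r = r_next
# r -> [2/5, 2/5, 1/5]
```

In practice M is far too large to store densely; a real implementation would use a sparse edge list, but the update rule is the same.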
Imagine a random web surfer: at any time t, the surfer is on some page u. At time t+1, the surfer follows an out-link from u uniformly at random, ending up on some page v linked from u. The process repeats indefinitely. Let p(t) be the vector whose i-th coordinate is the probability that the surfer is at page i at time t; p(t) is a probability distribution over pages.
Where is the surfer at time t+1? The surfer follows a link uniformly at random, so p(t+1) = M·p(t). Suppose the random walk reaches a state where p(t+1) = M·p(t) = p(t); then p(t) is a stationary distribution of the random walk. Our rank vector r satisfies r = M·r, so it is a stationary distribution for the random walk.
r = M·r, or equivalently r_j^(t+1) = Σ_{i→j} r_i^(t)/d_i. Does this converge? Does it converge to what we want? Are the results reasonable?
Does this converge? Example (a → b and b → a), iterating r_j^(t+1) = Σ_{i→j} r_i^(t)/d_i:

    r_a     1   0   1   0   ...
    r_b  =  0   1   0   1   ...
    (iteration 0, 1, 2, ...)

The iteration oscillates and never converges.
Does it converge to what we want? Example (a → b, and b has no out-links), iterating r_j^(t+1) = Σ_{i→j} r_i^(t)/d_i:

    r_a     1   0   0   0   ...
    r_b  =  0   1   0   0   ...
    (iteration 0, 1, 2, ...)

All importance leaks out through the dead end b.
Two problems: (1) Some pages are “dead ends” (they have no out-links); such pages cause importance to “leak out.” (2) Spider traps (all out-links are within the group); eventually spider traps absorb all importance.
Example (spider trap: y links to y and a; a links to y and m; m links only to itself):

          y    a    m
    y  [  ½    ½    0 ]
    a  [  ½    0    0 ]
    m  [  0    ½    1 ]

    r_y = r_y/2 + r_a/2
    r_a = r_y/2
    r_m = r_a/2 + r_m

Power iteration (r_j^(0) = 1/N):

    r_y     1/3   2/6   3/12    5/24   ...   0
    r_a  =  1/3   1/6   2/12    3/24   ...   0
    r_m     1/3   3/6   7/12   16/24   ...   1
    (iteration 0, 1, 2, ...)

The spider trap m absorbs all importance.
The Google solution for spider traps: at each time step, the random surfer has two options. With probability β, follow a link at random; with probability 1−β, jump to some page uniformly at random. Common values for β are in the range 0.8 to 0.9. The surfer will teleport out of a spider trap within a few time steps.
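A minimal sketch of power iteration with random teleports on the spider-trap example above; β = 0.8 is one choice from the stated 0.8–0.9 range, and the dense (1−β)/N correction matrix is only practical for tiny examples like this:

```python
import numpy as np

beta = 0.8   # probability of following a link (assumed value)
N = 3

# Spider-trap example: m links only to itself
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])

# With prob. beta follow M; with prob. 1-beta teleport uniformly
A = beta * M + (1 - beta) / N * np.ones((N, N))

r = np.full(N, 1 / N)
for _ in range(200):
    r = A @ r
# m no longer absorbs all importance: r stays strictly positive
```

Because every entry of A is positive, the chain is irreducible and aperiodic, so the iteration converges for any starting vector.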
Example (dead end: y links to y and a; a links to y and m; m has no out-links):

          y    a    m
    y  [  ½    ½    0 ]
    a  [  ½    0    0 ]
    m  [  0    ½    0 ]

    r_y = r_y/2 + r_a/2
    r_a = r_y/2
    r_m = r_a/2

Power iteration (r_j^(0) = 1/N):

    r_y     1/3   2/6   3/12   5/24   ...   0
    r_a  =  1/3   1/6   2/12   3/24   ...   0
    r_m     1/3   1/6   1/12   2/24   ...   0
    (iteration 0, 1, 2, ...)

Column m sums to 0, so M is not column stochastic and all importance leaks out.
Teleports: follow random teleport links with probability 1.0 from dead ends, and adjust the matrix accordingly. The dead-end column m becomes uniform:

          y    a    m              y    a    m
    y  [  ½    ½    0 ]      y  [  ½    ½    ⅓ ]
    a  [  ½    0    0 ]  →   a  [  ½    0    ⅓ ]
    m  [  0    ½    0 ]      m  [  0    ½    ⅓ ]
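The adjustment above amounts to replacing every all-zero column of M with 1/N; a sketch for the dead-end example:

```python
import numpy as np

# Dead-end example: column m (index 2) is all zeros
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
N = M.shape[0]

dead_ends = (M.sum(axis=0) == 0)   # boolean mask over columns
S = M.copy()
S[:, dead_ends] = 1.0 / N          # teleport uniformly from dead ends
# S is now column stochastic
```

After this fix, no rank leaks out: every column of S sums to 1.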
Markov chains: a set of states X; a transition matrix P, where P_ij = P(X_t = i | X_{t−1} = j); and a distribution π specifying the probability of being at each state x ∈ X. The goal is to find π such that π = P·π. (The PageRank iteration r^(t+1) = M·r^(t) is exactly this setting.)
Theory of Markov chains. Fact: for any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector, as long as P is stochastic, irreducible, and aperiodic.
Stochastic: every column sums to 1. A possible solution: add green links (teleport edges out of the dead end m):

    S = M + (1/n)·(1·aᵀ)

where a_i = 1 if node i has out-degree 0 and a_i = 0 otherwise, and 1 is the vector of all 1s. For the example:

          y    a    m
    y  [  ½    ½    ⅓ ]
    a  [  ½    0    ⅓ ]
    m  [  0    ½    ⅓ ]

    r_y = r_y/2 + r_a/2 + r_m/3
    r_a = r_y/2 + r_m/3
    r_m = r_a/2 + r_m/3
Aperiodic: a chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k. A possible solution: add green links.
Irreducible: from any state, there is a non-zero probability of going to any other state. A possible solution: add green links.