Data-Intensive Distributed Computing
CS 431/631 451/651 (Fall 2020)
Part 4: Analyzing Graphs (2/2)
Ali Abedi
Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman (Stanford University)
These slides are available at https://www.student.cs.uwaterloo.ca/~cs451/

Structure of the Course
[Figure: course modules Analyzing Text, Analyzing Graphs, Analyzing Relational Data, and Data Mining, built on “Core” framework features and algorithm design]

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 3
Query: University of Waterloo
[Figure: two candidate pages, fakeuw.ca and uwaterloo.ca; the phrase “University of waterloo” is repeated eight times]
Ranked retrieval fails!

Web contains many sources of information
Who to “trust”?
▪ Trick: Trustworthy pages may point to each other!

Not all web pages are equally “important”
▪ www.joeschmoe.com vs. www.stanford.edu
There is large diversity in the web-graph node connectivity.
Let’s rank the pages by the link structure!

Idea: Links as votes
▪ A page is more important if it has more links
▪ In-coming links? Out-going links?
Think of in-links as votes:
▪ www.stanford.edu has 23,400 in-links
▪ www.joeschmoe.com has 1 in-link
Are all in-links equal?
▪ Links from important pages count more
▪ Recursive question!

[Figure: example graph with PageRank scores. A: 3.3, B: 38.4, C: 34.3, D: 3.9, E: 8.1, F: 3.9; five unlabeled nodes: 1.6 each]

Each link’s vote is proportional to the importance of its source page
▪ If page $j$ with importance $r_j$ has $n$ out-links, each link gets $r_j / n$ votes
▪ Page $j$’s own importance is the sum of the votes on its in-links
[Figure: page $j$ receives $r_i/3$ from page $i$ and $r_k/4$ from page $k$, so $r_j = r_i/3 + r_k/4$; page $j$ passes $r_j/3$ along each of its three out-links]

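As a rough illustration of the voting rule above, the sketch below sums the votes arriving at page j. The importance values chosen for pages i and k are made-up numbers for illustration, not values from the slides; only the out-degrees 3 and 4 come from the figure. The helper name `in_link_votes` is hypothetical.

```python
from fractions import Fraction


def in_link_votes(sources):
    """Sum the votes on a page's in-links.

    `sources` is a list of (importance, out_degree) pairs, one per page
    that links to the target page. Each source splits its importance
    evenly over its out-links.
    """
    return sum(Fraction(importance) / out_degree
               for importance, out_degree in sources)


# Hypothetical importances for pages i and k (assumed for illustration).
r_i = Fraction(3, 10)
r_k = Fraction(4, 10)

# Page i has 3 out-links and page k has 4, as in the figure.
r_j = in_link_votes([(r_i, 3), (r_k, 4)])
print(r_j)  # r_i/3 + r_k/4
```
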
Define a “rank” $r_j$ for page $j$:
$r_j = \sum_{i \to j} \frac{r_i}{d_i}$, where $d_i$ is the out-degree of node $i$
[Figure: graph on pages y, a, m. y links to itself and to a (y/2 each), a links to y and to m (a/2 each), m links to a (m)]
“Flow” equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$

Flow equations: 3 equations, 3 unknowns, no constants
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
▪ No unique solution
▪ All solutions equivalent modulo the scale factor
Additional constraint forces uniqueness:
▪ $r_y + r_a + r_m = 1$
▪ Solution: $r_y = \frac{2}{5}$, $r_a = \frac{2}{5}$, $r_m = \frac{1}{5}$
Gaussian elimination works for small examples, but we need a better method for large web-size graphs
We need a new formulation!

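The small system above can be checked by hand or with a few lines of code. The sketch below is one possible way to do it, not the course's implementation: it replaces the redundant third flow equation with the constraint that the ranks sum to 1 and solves the result with exact fractions. The `solve` helper is a hypothetical name and assumes the system has a unique solution.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    # Build the augmented matrix [A | b].
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)]
           for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


half = Fraction(1, 2)
# Unknowns are ordered (r_y, r_a, r_m).
A = [
    [half - 1, half, 0],  # r_y/2 + r_a/2 - r_y = 0
    [half, -1, 1],        # r_y/2 + r_m - r_a = 0
    [1, 1, 1],            # r_y + r_a + r_m = 1
]
b = [0, 0, 1]
r_y, r_a, r_m = solve(A, b)
print(r_y, r_a, r_m)
```

The dropped equation, r_m = r_a/2, still holds for the result, which is why it can be swapped for the normalization constraint.
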
Stochastic adjacency matrix $M$
▪ Let page $i$ have $d_i$ out-links
▪ If $i \to j$, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$
▪ $M$ is a column stochastic matrix
▪ Columns sum to 1
[Figure: graph on pages y, a, m with edge weights y/2, y/2, a/2, a/2, m]
       y    a    m
  y    ½    ½    0
  a    ½    0    1
  m    0    ½    0

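A minimal sketch of how such a matrix could be built from an adjacency list follows. The function name `column_stochastic` and the dictionary layout are assumptions for illustration; the graph is the y, a, m example. The sketch assumes each link is listed once and every page has at least one out-link.

```python
from fractions import Fraction


def column_stochastic(out_links, pages):
    """Build M with M[j][i] = 1/d_i if page i links to page j, else 0."""
    index = {page: n for n, page in enumerate(pages)}
    M = [[Fraction(0)] * len(pages) for _ in pages]
    for src, targets in out_links.items():
        d = len(targets)  # out-degree of the source page
        for dst in targets:
            M[index[dst]][index[src]] = Fraction(1, d)
    return M


pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = column_stochastic(out_links, pages)

# Each column sums to 1 because a page splits its vote over its out-links.
column_sums = [sum(M[row][col] for row in range(len(pages)))
               for col in range(len(pages))]
print(column_sums)
```
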
Power Iteration:
▪ Set $r_j = 1/N$
▪ 1: $r'_j = \sum_{i \to j} \frac{r_i}{d_i}$
▪ 2: $r = r'$
▪ Goto 1
       y    a    m
  y    ½    ½    0
  a    ½    0    1
  m    0    ½    0
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
Example:
  r_y:  1/3  1/3  5/12  9/24   …  6/15
  r_a:  1/3  3/6  1/3   11/24  …  6/15
  r_m:  1/3  1/6  3/12  1/6    …  3/15
  Iteration 0, 1, 2, …

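The iteration above can be reproduced with exact fractions. This is a sketch under the assumption that the matrix is stored as a list of rows in the order (y, a, m); the function name `power_iteration_step` is made up for the example.

```python
from fractions import Fraction


def power_iteration_step(M, r):
    """One update r' = M r, i.e. r'_j = sum over i -> j of r_i / d_i."""
    n = len(r)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]


half = Fraction(1, 2)
# Rows and columns are ordered (y, a, m).
M = [
    [half, half, 0],
    [half, 0, 1],
    [0, half, 0],
]

r = [Fraction(1, 3)] * 3  # start from the uniform vector 1/N
history = [r]
for _ in range(3):
    r = power_iteration_step(M, r)
    history.append(r)

# Fractions print in lowest terms, so 9/24 appears as 3/8.
for t, ranks in enumerate(history):
    print(t, [str(x) for x in ranks])
```
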
Imagine a random web surfer:
▪ At any time $t$, surfer is on some page $i$
▪ At time $t + 1$, the surfer follows an out-link from $i$ uniformly at random
▪ Ends up on some page $j$ linked from $i$
▪ Process repeats indefinitely
[Figure: pages $i_1$, $i_2$, $i_3$ link to page $j$; $r_j = \sum_{i \to j} \frac{r_i}{d_{\text{out}}(i)}$]
Let:
▪ $p(t)$ … vector whose $i$-th coordinate is the prob. that the surfer is at page $i$ at time $t$
▪ So, $p(t)$ is a probability distribution over pages

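The random surfer can be simulated directly. The sketch below is an illustration only: it reuses the y, a, m graph from the earlier slides, and the function name, step count, and seed are arbitrary choices. It reports the fraction of time spent on each page, which for a long run on this graph should land close to the ranks 2/5, 2/5, 1/5 found earlier.

```python
import random


def simulate_surfer(out_links, start, steps, seed=0):
    """Follow random out-links; return the fraction of time spent per page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        # Pick one of the current page's out-links uniformly at random.
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}


out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(out_links, start="y", steps=200_000)
print(freq)
```
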
Where is the surfer at time $t + 1$?
▪ Follows a link uniformly at random:
$p(t + 1) = M \cdot p(t)$
[Figure: pages $i_1$, $i_2$, $i_3$ link to page $j$]
Suppose the random walk reaches a state
$p(t + 1) = M \cdot p(t) = p(t)$
then $p(t)$ is a stationary distribution of the random walk

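To make the stationarity condition concrete, the sketch below applies one step of the walk on the y, a, m graph from the earlier slides and checks whether the distribution is unchanged. The candidate vector (2/5, 2/5, 1/5) is the flow-equation solution; the helper name `step` is an assumption.

```python
from fractions import Fraction


def step(M, p):
    """One step of the walk: p(t+1) = M p(t)."""
    n = len(p)
    return [sum(M[j][i] * p[i] for i in range(n)) for j in range(n)]


half = Fraction(1, 2)
M = [[half, half, 0], [half, 0, 1], [0, half, 0]]  # order (y, a, m)

# The flow-equation solution is left unchanged by a step of the walk.
p = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
is_stationary = step(M, p) == p

# The uniform distribution is not stationary for this graph.
uniform = [Fraction(1, 3)] * 3
uniform_is_stationary = step(M, uniform) == uniform
print(is_stationary, uniform_is_stationary)
```
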
A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time $t = 0$

$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
▪ Does this converge?
▪ Does it converge to what we want?
▪ Are results reasonable?

$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
[Figure: two pages a and b that link to each other]
Example:
  r_a:  1  0  1  0
  r_b:  0  1  0  1
  Iteration 0, 1, 2, …

$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
[Figure: page a links to page b; b has no out-links]
Example:
  r_a:  1  0  0  0
  r_b:  0  1  0  0
  Iteration 0, 1, 2, …

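Both two-page examples are easy to replay. The sketch below assumes the graphs implied by the tables (a and b linking to each other, then a linking to a dead-end b) and uses a hypothetical `iterate` helper working on an adjacency dictionary.

```python
def iterate(out_links, r, steps):
    """Apply r'_j = sum over i -> j of r_i / d_i and record every iterate."""
    history = [dict(r)]
    for _ in range(steps):
        nxt = {page: 0.0 for page in out_links}
        for src, targets in out_links.items():
            for dst in targets:
                nxt[dst] += r[src] / len(targets)
        r = nxt
        history.append(dict(r))
    return history


start = {"a": 1.0, "b": 0.0}

# a and b link only to each other: the scores swap forever.
cycle = iterate({"a": ["b"], "b": ["a"]}, start, steps=3)

# a links to b, and b is a dead end: the score vanishes after two steps.
dead_end = iterate({"a": ["b"], "b": []}, start, steps=3)

print([h["a"] for h in cycle], [h["b"] for h in cycle])
print([h["a"] for h in dead_end], [h["b"] for h in dead_end])
```

In the first graph the iteration never settles; in the second it settles on the all-zero vector, which is not a useful ranking.
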
2 problems:
(1) Some pages are dead ends (have no out-links)
▪ Random walk has “nowhere” to go to
▪ Such pages cause importance to “leak out”
(2) Spider traps (all out-links are within the group)
▪ Random walker gets “stuck” in a trap
▪ And eventually spider traps absorb all importance
[Figure: example graph with a dead end labeled]

Power Iteration:
▪ Set $r_j = 1/N$
▪ $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
▪ And iterate
m is a spider trap
       y    a    m
  y    ½    ½    0
  a    ½    0    0
  m    0    ½    1
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2$
$r_m = r_a/2 + r_m$
Example:
  r_y:  1/3  2/6  3/12  5/24   …  0
  r_a:  1/3  1/6  2/12  3/24   …  0
  r_m:  1/3  3/6  7/12  16/24  …  1
  Iteration 0, 1, 2, …
All the PageRank score gets “trapped” in node m.

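The spider-trap table can be reproduced with the same kind of update. This sketch assumes the matrix is stored as rows in the order (y, a, m); the first three iterations use exact fractions, and a longer floating-point run shows the score piling up in m.

```python
from fractions import Fraction


def step(M, r):
    """One update r' = M r."""
    n = len(r)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]


half = Fraction(1, 2)
# Same graph as before, except m now links only to itself.
M_trap = [
    [half, half, 0],
    [half, 0, 0],
    [0, half, 1],
]

r = [Fraction(1, 3)] * 3
history = [r]
for _ in range(3):
    r = step(M_trap, r)
    history.append(r)

# Continue in floating point to see the long-run behaviour.
r_long = [float(x) for x in r]
for _ in range(200):
    r_long = step(M_trap, r_long)
print(r_long)
```
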
The Google solution for spider traps: At each time step, the random surfer has two options
▪ With prob. $\beta$, follow a link at random
▪ With prob. $1 - \beta$, jump to some random page
▪ Common values for $\beta$ are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a few time steps
[Figure: the y, a, m graph before and after adding teleport links]

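One way to express the two options as an update rule is r'_j = β · Σ_{i→j} r_i/d_i + (1 − β)/N, assuming the random jump picks each of the N pages with equal probability. The sketch below applies it to the spider-trap graph with β = 0.8; the function name and the iteration count are arbitrary choices.

```python
def teleport_step(M, r, beta):
    """r'_j = beta * sum over i -> j of r_i / d_i + (1 - beta) / N."""
    n = len(r)
    return [beta * sum(M[j][i] * r[i] for i in range(n)) + (1 - beta) / n
            for j in range(n)]


# Spider-trap graph, rows and columns ordered (y, a, m).
M_trap = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]

beta = 0.8  # probability of following a link instead of teleporting
r = [1 / 3] * 3
for _ in range(200):
    r = teleport_step(M_trap, r, beta)

print(r)  # approximately [7/33, 5/33, 21/33]
```

Node m still ends up with the largest score, but y and a keep a non-zero share instead of draining to 0.
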
Power Iteration:
▪ Set $r_j = 1/N$
▪ $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
▪ And iterate
       y    a    m
  y    ½    ½    0
  a    ½    0    0
  m    0    ½    0
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2$
$r_m = r_a/2$
Example:
  r_y:  1/3  2/6  3/12  5/24  …  0
  r_a:  1/3  1/6  2/12  3/24  …  0
  r_m:  1/3  1/6  1/12  2/24  …  0
  Iteration 0, 1, 2, …
Here the PageRank “leaks” out since the matrix is not stochastic.

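The leak is visible in the total score, which should stay at 1 for a column stochastic matrix. The sketch below tracks that total for the dead-end matrix, again assuming rows in the order (y, a, m).

```python
from fractions import Fraction


def step(M, r):
    """One update r' = M r."""
    n = len(r)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]


half = Fraction(1, 2)
# m has no out-links, so its column is all zeros.
M_dead = [
    [half, half, 0],
    [half, 0, 0],
    [0, half, 0],
]

r = [Fraction(1, 3)] * 3
totals = [sum(r)]
for _ in range(3):
    r = step(M_dead, r)
    totals.append(sum(r))

print([str(t) for t in totals])  # the total shrinks every iteration
```
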
Teleports: Follow random teleport links with probability 1.0 from dead-ends
▪ Adjust matrix accordingly
[Figure: the y, a, m graph before and after adding teleport links out of m]
Before:
       y    a    m
  y    ½    ½    0
  a    ½    0    0
  m    0    ½    0
After:
       y    a    m
  y    ½    ½    ⅓
  a    ½    0    ⅓
  m    0    ½    ⅓

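The matrix adjustment can be sketched as a small function that replaces every all-zero column with a uniform column. The name `patch_dead_ends` is hypothetical; the input is the dead-end matrix from the slide, with rows in the order (y, a, m).

```python
from fractions import Fraction


def patch_dead_ends(M):
    """Replace every all-zero column with a uniform column of 1/N."""
    n = len(M)
    patched = [list(row) for row in M]  # copy so the input is untouched
    for col in range(n):
        if all(M[row][col] == 0 for row in range(n)):
            for row in range(n):
                patched[row][col] = Fraction(1, n)
    return patched


half = Fraction(1, 2)
M_dead = [
    [half, half, 0],
    [half, 0, 0],
    [0, half, 0],
]
M_fixed = patch_dead_ends(M_dead)

# After the fix every column sums to 1 again.
column_sums = [sum(M_fixed[row][col] for row in range(3)) for col in range(3)]
print(column_sums)
```
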
Why are dead-ends and spider traps a problem and why do teleports solve the problem?
Spider traps are not a problem for convergence (the matrix is still column stochastic), but with traps the PageRank scores are not what we want
▪ Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps
Dead-ends are a problem
▪ The matrix is not column stochastic, so our initial assumptions are not met
▪ Solution: Make the matrix column stochastic by always teleporting when there is nowhere else to go
