Slide source: Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman (Stanford University), http://www.mmds.org
Top 10 data mining algorithms (by votes):
#1: C4.5 Decision Tree - Classification (61 votes)
#2: K-Means - Clustering (60 votes)
#3: SVM - Classification (58 votes)
#4: Apriori - Frequent Itemsets (52 votes)
#5: EM - Clustering (48 votes)
#6: PageRank - Link mining (46 votes)
#7: AdaBoost - Boosting (45 votes)
#7: kNN - Classification (45 votes)
#7: Naive Bayes - Classification (45 votes)
#10: CART - Classification (34 votes)
How to organize the Web?
- First try: human-curated Web directories (Yahoo, DMOZ, LookSmart)
- Second try: Web search
  - Content based: find relevant docs
  - Top-k ranking based on TF-IDF
  - Works well for a small and trusted set of documents
Link-based ranking algorithms:
- PageRank
- HITS
All web pages are not equally “important”
- www.joe-schmoe.com vs. www.stanford.edu
- There is large diversity in web-graph node connectivity
- Let’s rank the pages by the link structure!
Idea: Links as votes
- A page is more important if it has more links
  - In-coming links? Out-going links?
- Think of in-links as votes:
  - www.stanford.edu has 23,400 in-links
  - www.joe-schmoe.com has 1 in-link
- Are all in-links equal?
  - Links from important pages count more
  - Recursive question!
Figure: example PageRank scores on a small web graph (A: 3.3, B: 38.4, C: 34.3, D: 3.9, E: 8.1, F: 3.9; remaining pages: 1.6 each)
Each link’s vote is proportional to the importance of its source page
- If page j with importance r_j has n out-links, each link gets r_j / n votes
- Page j’s own importance is the sum of the votes on its in-links
- Example: if page i (with 3 out-links) and page k (with 4 out-links) both link to page j, then r_j = r_i/3 + r_k/4, and j in turn passes r_j/3 along each of its 3 out-links
A “vote” from an important page is worth more
- A page is important if it is pointed to by other important pages
- Define a “rank” r_j for page j:
  r_j = sum over i→j of r_i / d_i, where d_i is the out-degree of node i
Example (“the web in 1839”): three pages y, a, m, where y links to y and a, a links to y and m, and m links to a
“Flow” equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2
Flow equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2
- 3 equations, 3 unknowns, no constants
- No unique solution: all solutions are equivalent up to a scale factor
- An additional constraint forces uniqueness: r_y + r_a + r_m = 1
  Solution: r_y = 2/5, r_a = 2/5, r_m = 1/5
- Gaussian elimination works for small examples, but we need a better method for large web-scale graphs
- We need a new formulation!
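As a quick check on the solution above (a minimal sketch, not part of the original slides; numpy is just one convenient way to do the elimination), the two independent flow equations plus the normalization constraint can be solved directly for this tiny example:

```python
import numpy as np

# Two independent flow equations rewritten with all unknowns on the left,
# plus the normalization constraint (the third flow equation is redundant):
#   r_y - (r_y/2 + r_a/2) = 0  ->  0.5*r_y - 0.5*r_a           = 0
#   r_a - (r_y/2 + r_m)   = 0  -> -0.5*r_y + 1.0*r_a - 1.0*r_m = 0
#   r_y + r_a + r_m = 1
A = np.array([
    [ 0.5, -0.5,  0.0],
    [-0.5,  1.0, -1.0],
    [ 1.0,  1.0,  1.0],
])
b = np.array([0.0, 0.0, 1.0])

r_y, r_a, r_m = np.linalg.solve(A, b)
print(r_y, r_a, r_m)  # 0.4 0.4 0.2, i.e. 2/5, 2/5, 1/5
```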
Stochastic adjacency matrix M
- Let page i have d_i out-links
- If i → j, then M_ji = 1/d_i, else M_ji = 0
- M is a column-stochastic matrix: columns sum to 1
Rank vector r: a vector with one entry per page
- r_j is the importance score of page j
- sum_j r_j = 1
The flow equations can be written as r = M · r
Remember the flow equation: r_j = sum over i→j of r_i / d_i
Flow equation in matrix form: M · r = r
- Suppose page i links to 3 pages, including j: then column i of M has the entry M_ji = 1/3, so r_j picks up the contribution r_i/3 in the product M · r
Example (the y, a, m graph):

      y   a   m
  y   ½   ½   0
  a   ½   0   1
  m   0   ½   0

r = M · r is exactly the flow equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2
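A small sketch (not from the slides) that builds this column-stochastic M from an out-link list and checks that the solution found earlier is a fixed point of r = M · r; the dictionary representation of the graph is just an illustrative choice:

```python
import numpy as np

# Out-links of the toy "web in 1839": page -> pages it links to.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
pages = ["y", "a", "m"]
idx = {p: i for i, p in enumerate(pages)}

# Column-stochastic matrix: M[i, j] = 1/d_j if page j links to page i, else 0.
N = len(pages)
M = np.zeros((N, N))
for j, targets in out_links.items():
    for i in targets:
        M[idx[i], idx[j]] = 1.0 / len(targets)

print(M)                      # rows y, a, m: [0.5 0.5 0.0], [0.5 0.0 1.0], [0.0 0.5 0.0]
print(M.sum(axis=0))          # every column sums to 1
r = np.array([2/5, 2/5, 1/5])
print(np.allclose(M @ r, r))  # True: r is a fixed point of r = M.r
```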
The flow equations can be written r = M · r
- So the rank vector r is an eigenvector of the stochastic web matrix M
- In fact, it is the first (principal) eigenvector, with corresponding eigenvalue 1
  (Note: x is an eigenvector of A with corresponding eigenvalue λ if A·x = λ·x)
- The largest eigenvalue of M is 1, since M is column stochastic (with non-negative entries)
- We can now efficiently solve for r! The method is called power iteration
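To illustrate the eigenvector view on the same toy example (a sketch, not from the slides), a full eigendecomposition recovers eigenvalue 1 and an eigenvector that, once rescaled to sum to 1, matches the PageRank vector. A full eigendecomposition is only feasible for tiny matrices, which is exactly why the slides turn to power iteration next.

```python
import numpy as np

# Column-stochastic matrix of the y, a, m example.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

vals, vecs = np.linalg.eig(M)
k = np.argmax(vals.real)   # index of the largest eigenvalue (1 for a column-stochastic matrix)
r = vecs[:, k].real
r = r / r.sum()            # rescale so the entries sum to 1

print(vals.real[k])        # ~1.0
print(r)                   # ~[0.4, 0.4, 0.2]
```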
Power iteration: a simple iterative scheme
- Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks
- Initialize: r^(0) = [1/N, ..., 1/N]^T
- Iterate: r^(t+1) = M · r^(t), i.e. r_j^(t+1) = sum over i→j of r_i^(t) / d_i, where d_i is the out-degree of node i
- Stop when |r^(t+1) - r^(t)|_1 < ε
  (|x|_1 = sum over i of |x_i| is the L1 norm; any other vector norm, e.g. Euclidean, can be used)
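A minimal power-iteration sketch in Python (an illustration, not the book's reference implementation; the tolerance eps and the iteration cap are arbitrary choices):

```python
import numpy as np

def power_iteration(M, eps=1e-8, max_iters=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    for _ in range(max_iters):
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:  # L1 norm of the change
            return r_next
        r = r_next
    return r

# The y, a, m example from the previous slides: converges to [0.4, 0.4, 0.2].
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iteration(M))
```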
      y   a   m
  y   ½   ½   0
  a   ½   0   1
  m   0   ½   0

Example iterations (t = 0, 1, 2, ...):
  r_y   1/3   1/3   5/12   9/24   ...   6/15
  r_a   1/3   3/6   1/3    11/24  ...   6/15
  r_m   1/3   1/6   3/12   1/6    ...   3/15
Iterate r_j^(t+1) = sum over i→j of r_i^(t) / d_i
Example (two pages a and b, where a links to b and b links to a), iterations t = 0, 1, 2, ...:
  r_a   1   0   1   0   ...
  r_b   0   1   0   1   ...
The iteration oscillates and never converges.
Iterate r_j^(t+1) = sum over i→j of r_i^(t) / d_i
Example (two pages a and b, where a links to b and b has no out-links), iterations t = 0, 1, 2, ...:
  r_a   1   0   0   0   ...
  r_b   0   1   0   0   ...
All importance leaks out: the iteration converges to the zero vector.
Imagine a random web surfer:
- At any time t, the surfer is on some page i
- At time t+1, the surfer follows an out-link from i uniformly at random
- Ends up on some page j linked from i
- The process repeats indefinitely
Let p(t) be the vector whose i-th coordinate is the probability that the surfer is at page i at time t
- So p(t) is a probability distribution over pages
Where is the surfer at time t+1?
- The surfer follows a link uniformly at random: p(t+1) = M · p(t)
- Suppose the random walk reaches a state where p(t+1) = M · p(t) = p(t); then p(t) is a stationary distribution of the random walk
- Our original rank vector r satisfies r = M · r
- So r is a stationary distribution for the random walk
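To make the random-surfer reading concrete, here is a small simulation sketch (an illustration with an arbitrary step count, not from the slides): the fraction of time a simulated surfer spends on each page of the y, a, m example approaches the stationary distribution (2/5, 2/5, 1/5).

```python
import random
from collections import Counter

# Out-links of the y, a, m toy graph.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

random.seed(0)
page = "y"
visits = Counter()
steps = 200_000
for _ in range(steps):
    page = random.choice(out_links[page])  # follow an out-link uniformly at random
    visits[page] += 1

for p in ["y", "a", "m"]:
    print(p, visits[p] / steps)  # roughly 0.4, 0.4, 0.2
```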
Power iteration: r^(t+1) = M · r^(t), or equivalently r_j^(t+1) = sum over i→j of r_i^(t) / d_i
- Does this converge?
- Does it converge to what we want?
A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions (strongly connected, no dead ends), the stationary distribution is unique and is eventually reached no matter what the initial probability distribution is at time t = 0.
Two problems:
(1) Some pages are dead ends (have no out-links)
  - The random walk has “nowhere” to go
  - Such pages cause importance to “leak out”
(2) Spider traps (all out-links are within the group)
  - The random walk gets “stuck” in the trap
  - Eventually spider traps absorb all importance
Example (m is now a dead end, so column m is all zeros):

      y   a   m
  y   ½   ½   0
  a   ½   0   0
  m   0   ½   0

Example iterations (t = 0, 1, 2, ...):
  r_y   1/3   2/6   3/12   5/24   ...   0
  r_a   1/3   1/6   2/12   3/24   ...   0
  r_m   1/3   1/6   1/12   2/24   ...   0

Here the PageRank “leaks out” since the matrix is not column stochastic.
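A short sketch (not from the slides) showing the leak numerically: with the all-zero dead-end column, the total mass of r shrinks toward 0 under repeated multiplication.

```python
import numpy as np

# y, a, m graph where m is a dead end: column m is all zeros, so M is not column stochastic.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])

r = np.full(3, 1/3)
for t in range(1, 51):
    r = M @ r
    if t in (1, 2, 3, 10, 50):
        print(t, r, "total mass:", r.sum())  # the total mass decays toward 0
```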
Teleports: Follow random teleport links from dead ends
- Adjust the matrix accordingly

Before (m is a dead end):      After (teleport links from m):
      y   a   m                      y   a   m
  y   ½   ½   0                  y   ½   ½   ⅓
  a   ½   0   0                  a   ½   0   ⅓
  m   0   ½   0                  m   0   ½   ⅓
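One way to implement the adjustment (a sketch, assuming the dead-end fix is done by replacing every all-zero column with a uniform 1/N column, as the "after" matrix above shows):

```python
import numpy as np

def fix_dead_ends(M):
    """Replace each all-zero (dead-end) column with a uniform column of 1/N,
    so the matrix becomes column stochastic again."""
    M = M.copy()
    N = M.shape[0]
    dead = (M.sum(axis=0) == 0)  # columns that sum to 0 correspond to dead ends
    M[:, dead] = 1.0 / N
    return M

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])  # m is a dead end

A = fix_dead_ends(M)
print(A)              # column m becomes [1/3, 1/3, 1/3]
print(A.sum(axis=0))  # every column now sums to 1
```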