


  1. Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University http://www.mmds.org

  2. - #1: C4.5 Decision Tree – Classification (61 votes)
     - #2: K-Means – Clustering (60 votes)
     - #3: SVM – Classification (58 votes)
     - #4: Apriori – Frequent Itemsets (52 votes)
     - #5: EM – Clustering (48 votes)
     - #6: PageRank – Link mining (46 votes)
     - #7: AdaBoost – Boosting (45 votes)
     - #7: kNN – Classification (45 votes)
     - #7: Naive Bayes – Classification (45 votes)
     - #10: CART – Classification (34 votes)

     Data Mining: Concepts and Techniques

  3. How to organize the Web?
     - First try: human-curated Web directories
       - Yahoo, DMOZ, LookSmart
     - Second try: Web search
       - Content based: find relevant docs
       - Top-k ranking based on TF-IDF
       - Works well in a small and trusted set


  5. Link-based ranking algorithms
     - PageRank
     - HITS

  6. - Not all web pages are equally “important”: www.joe-schmoe.com vs. www.stanford.edu
     - There is large diversity in the web-graph node connectivity. Let’s rank the pages by the link structure!

  7. Idea: Links as votes
     - A page is more important if it has more links
       - In-coming links? Out-going links?
     - Think of in-links as votes:
       - www.stanford.edu has 23,400 in-links
       - www.joe-schmoe.com has 1 in-link
     - Are all in-links equal?
       - Links from important pages count more
       - Recursive question!

  8. [Figure: example graph with node scores A 3.3, B 38.4, C 34.3, D 3.9, E 8.1, F 3.9, and five further nodes at 1.6 each]

  9. - Each link’s vote is proportional to the importance of its source page
     - If page $j$ with importance $r_j$ has $n$ out-links, each link gets $r_j / n$ votes
     - Page $j$’s own importance is the sum of the votes on its in-links
     - Example (figure): page $i$ has 3 out-links and page $k$ has 4, and both link to $j$, so $r_j = r_i/3 + r_k/4$; page $j$ has 3 out-links, each carrying $r_j/3$

  10. - A “vote” from an important page is worth more
      - A page is important if it is pointed to by other important pages
      - Define a “rank” $r_j$ for page $j$: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$, where $d_i$ is the out-degree of node $i$
      - Example, “The web in 1839” (figure): pages y, a, m; y links to itself and to a, a links to y and to m, m links to a
      - “Flow” equations:
        - $r_y = r_y/2 + r_a/2$
        - $r_a = r_y/2 + r_m$
        - $r_m = r_a/2$

  11. Flow equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$
      - 3 equations, 3 unknowns, no constants
        - No unique solution
        - All solutions equivalent modulo the scale factor
      - Additional constraint forces uniqueness:
        - $r_y + r_a + r_m = 1$
        - Solution: $r_y = \frac{2}{5}$, $r_a = \frac{2}{5}$, $r_m = \frac{1}{5}$
      - Gaussian elimination works for small examples, but we need a better method for large web-size graphs
      - We need a new formulation!
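The stated solution can be checked by hand. A short worked derivation, using only the three flow equations and the constraint that the ranks sum to 1:

```latex
\begin{align*}
r_m &= \tfrac{1}{2} r_a \\
r_a &= \tfrac{1}{2} r_y + r_m = \tfrac{1}{2} r_y + \tfrac{1}{2} r_a
      \;\Rightarrow\; r_a = r_y \\
r_y + r_a + r_m &= r_y + r_y + \tfrac{1}{2} r_y = \tfrac{5}{2} r_y = 1
      \;\Rightarrow\; r_y = \tfrac{2}{5},\quad r_a = \tfrac{2}{5},\quad r_m = \tfrac{1}{5}
\end{align*}
```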

  12. - Stochastic adjacency matrix $M$
        - Let page $i$ have $d_i$ out-links
        - If $i \to j$, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$
        - $M$ is a column stochastic matrix: columns sum to 1
      - Rank vector $r$: vector with an entry per page
        - $r_i$ is the importance score of page $i$
        - $\sum_i r_i = 1$
      - The flow equations $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ can be written $r = M \cdot r$
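A minimal Python sketch of building the column-stochastic matrix from out-link lists, assuming the three-page y, a, m graph implied by the flow equations. The function name `build_column_stochastic` and the dictionary layout are illustrative choices, not part of the slides.

```python
def build_column_stochastic(out_links, pages):
    """Return M as a list of rows, with M[j][i] = 1/d_i if page i links to page j, else 0."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    M = [[0.0] * n for _ in range(n)]
    for src, targets in out_links.items():
        d = len(targets)  # out-degree d_i of the source page
        for dst in targets:
            M[index[dst]][index[src]] = 1.0 / d
    return M


# Assumed example graph: y -> {y, a}, a -> {y, m}, m -> {a}
pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

M = build_column_stochastic(out_links, pages)

# Each column should sum to 1 when every page has at least one out-link.
column_sums = [sum(M[row][col] for row in range(len(pages))) for col in range(len(pages))]
```

A page with an empty out-link list would leave its column at zero, so the result is only column stochastic when every page has at least one out-link.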

  13. - Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
      - Flow equation in matrix form: $M \cdot r = r$
      - Suppose page $i$ links to 3 pages, including $j$: column $i$ of $M$ then has the entry $M_{ji} = 1/3$, so $r_i$ contributes $r_i/3$ to $r_j$ in the product $M \cdot r$

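  14. The y, a, m example in matrix form, $r = M \cdot r$ (rows and columns ordered y, a, m):

      $$\begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix}$$

      This is the same as the flow equations $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$.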

  15. - The flow equations can be written $r = M \cdot r$
      - So the rank vector $r$ is an eigenvector of the stochastic web matrix $M$
        - In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
        - The largest eigenvalue of $M$ is 1 since $M$ is column stochastic (with non-negative entries)
      - NOTE: $x$ is an eigenvector with the corresponding eigenvalue $\lambda$ if $A x = \lambda x$
      - We can now efficiently solve for $r$! The method is called power iteration

  16. - Given a web graph with $n$ nodes, where the nodes are pages and edges are hyperlinks
      - Power iteration: a simple iterative scheme
        - Suppose there are $N$ web pages
        - Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
        - Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, where $d_i$ is the out-degree of node $i$
        - Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
      - $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the $L_1$ norm; any other vector norm can be used, e.g., Euclidean
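The scheme above can be sketched in a few lines of plain Python. The function name `power_iteration`, the default tolerance `eps`, and the `max_iter` safety cap are assumptions made for this sketch; the matrix is the y, a, m example.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change drops below eps."""
    n = len(M)
    r = [1.0 / n] * n  # r^(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        # r_next[j] = sum over i of M[j][i] * r[i]
        r_next = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        # L1 norm of the difference between successive iterates
        if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r  # max_iter reached (assumed safety cap, not from the slides)


# Column-stochastic matrix of the y, a, m example (rows/columns ordered y, a, m)
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

r = power_iteration(M)
```

For this matrix the iterates approach (2/5, 2/5, 1/5), the solution of the flow equations.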

  17. Example: power iteration with (rows and columns ordered y, a, m)

      $$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}$$

      | Iteration | 0 | 1 | 2 | 3 | … | limit |
      |---|---|---|---|---|---|---|
      | $r_y$ | 1/3 | 1/3 | 5/12 | 9/24 | … | 6/15 |
      | $r_a$ | 1/3 | 3/6 | 1/3 | 11/24 | … | 6/15 |
      | $r_m$ | 1/3 | 1/6 | 3/12 | 1/6 | … | 3/15 |
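The first iterations of this example can be reproduced exactly with rational arithmetic. This is a small sketch using Python's `fractions` module; the helper name `step` is an illustrative choice.

```python
from fractions import Fraction as F

# Column-stochastic matrix of the y, a, m example (rows/columns ordered y, a, m)
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]


def step(M, r):
    """One power-iteration step: return M * r."""
    n = len(r)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]


history = [[F(1, 3)] * 3]  # iteration 0: uniform start
for _ in range(3):
    history.append(step(M, history[-1]))

for t, r in enumerate(history):
    print(t, [str(x) for x in r])
```

`Fraction` reduces to lowest terms, so 3/6 prints as 1/2, 3/12 as 1/4, and 9/24 as 3/8.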

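  18. Power iteration, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, on a two-page graph with pages a and b (figure not preserved). Example:

      | Iteration | 0 | 1 | 2 | 3 | … |
      |---|---|---|---|---|---|
      | $r_a$ | 1 | 0 | 1 | 0 | … |
      | $r_b$ | 0 | 1 | 0 | 1 | … |

      The scores swap at every step and never settle, which is what happens when a and b link only to each other.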

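  19. Power iteration, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, on a second two-page graph with pages a and b (figure not preserved). Example:

      | Iteration | 0 | 1 | 2 | 3 | … |
      |---|---|---|---|---|---|
      | $r_a$ | 1 | 0 | 0 | 0 | … |
      | $r_b$ | 0 | 1 | 0 | 0 | … |

      The score moves from a to b and then vanishes, which is what happens when a links to b and b has no out-links.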
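Both two-page examples can be replayed with a few lines of Python. The two link structures below are assumptions inferred from the iteration values (a and b linking to each other; a linking to b with b having no out-links), since the figures did not survive.

```python
def iterate(M, r, steps):
    """Apply r <- M r for the given number of steps and return all iterates."""
    history = [r]
    for _ in range(steps):
        n = len(r)
        r = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        history.append(r)
    return history


# Assumed graph 1: a -> b and b -> a (rows/columns ordered a, b)
M_cycle = [[0, 1],
           [1, 0]]

# Assumed graph 2: a -> b, b has no out-links (column b is all zeros)
M_dead = [[0, 0],
          [1, 0]]

cycle_history = iterate(M_cycle, [1, 0], 3)
dead_history = iterate(M_dead, [1, 0], 3)
```

In the first case the vector oscillates forever; in the second the total score drops to zero after two steps.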

  20. Imagine a random web surfer:
      - At any time $t$, the surfer is on some page $i$
      - At time $t + 1$, the surfer follows an out-link from $i$ uniformly at random
      - Ends up on some page $j$ linked from $i$
      - Process repeats indefinitely

      Let:
      - $p(t)$ be the vector whose $i$-th coordinate is the probability that the surfer is at page $i$ at time $t$
      - So $p(t)$ is a probability distribution over pages

      (Figure: pages $i_1$, $i_2$, $i_3$ link to page $j$; $r_j = \sum_{i \to j} \frac{r_i}{d_{\text{out}}(i)}$.)

  21. Where is the surfer at time $t + 1$?
      - Follows a link uniformly at random: $p(t + 1) = M \cdot p(t)$
      - Suppose the random walk reaches a state where $p(t + 1) = M \cdot p(t) = p(t)$; then $p(t)$ is a stationary distribution of the random walk
      - Our original rank vector $r$ satisfies $r = M \cdot r$
      - So $r$ is a stationary distribution for the random walk
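The random-surfer reading can be illustrated by simulation: a long walk's visit frequencies should approximate the stationary distribution. A hedged sketch follows, assuming the y, a, m graph from the flow equations; `simulate_surfer`, the step count, and the fixed seed are illustrative choices.

```python
import random


def simulate_surfer(out_links, start, steps, seed=0):
    """Follow uniformly random out-links and return the fraction of steps spent on each page.

    Assumes every page has at least one out-link (no dead ends).
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])  # pick an out-link uniformly at random
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}


# Assumed example graph: y -> {y, a}, a -> {y, m}, m -> {a}
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(out_links, start="y", steps=200_000)
```

For this graph the frequencies should land near 2/5 for y, 2/5 for a, and 1/5 for m, up to sampling noise.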

  22. $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, or equivalently $r = M r$
      - Does this converge?
      - Does it converge to what we want?

  23. A central result from the theory of random walks (a.k.a. Markov processes): for graphs that satisfy certain conditions (strongly connected, no dead ends), the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution at time $t = 0$ is.


  25. 2 problems:
      - (1) Some pages are dead ends (have no out-links)
        - Random walk has “nowhere” to go to
        - Such pages cause importance to “leak out”
      - (2) Spider traps (all out-links are within the group)
        - Random walker gets “stuck” in a trap
        - And eventually spider traps absorb all importance

  26. Example: page m is a dead end (rows and columns ordered y, a, m)

      $$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix}$$

      | Iteration | 0 | 1 | 2 | 3 | … | limit |
      |---|---|---|---|---|---|---|
      | $r_y$ | 1/3 | 2/6 | 3/12 | 5/24 | … | 0 |
      | $r_a$ | 1/3 | 1/6 | 2/12 | 3/24 | … | 0 |
      | $r_m$ | 1/3 | 1/6 | 1/12 | 2/24 | … | 0 |

      Here the PageRank “leaks” out since the matrix is not stochastic.
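The leak is easy to see by tracking the total score per iteration. A small sketch with exact fractions, using the dead-end matrix of this example; the variable names are illustrative.

```python
from fractions import Fraction as F

# Matrix with dead end m: column m is all zeros, so it does not sum to 1
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(0)],
     [F(0),    F(1, 2), F(0)]]

r = [F(1, 3)] * 3      # iteration 0: uniform start
totals = [sum(r)]      # total score held by all pages at each iteration
for _ in range(3):
    r = [sum(M[j][i] * r[i] for i in range(3)) for j in range(3)]
    totals.append(sum(r))

print([str(t) for t in totals])
```

The total shrinks at every step (1, 2/3, 1/2, 5/12 over the first three iterations) because whatever score sits on m is dropped in the next multiplication.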

  27. - Teleports: follow random teleport links from dead ends
      - Adjust the matrix accordingly (rows and columns ordered y, a, m): the all-zero column of dead end m becomes a column of 1/3 entries

      $$\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix} \;\longrightarrow\; \begin{pmatrix} 1/2 & 1/2 & 1/3 \\ 1/2 & 0 & 1/3 \\ 0 & 1/2 & 1/3 \end{pmatrix}$$
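A minimal sketch of this adjustment in Python: every all-zero column (a dead end) is replaced by a uniform column, so the surfer teleports to a random page from there. The function name `add_teleports_from_dead_ends` is hypothetical.

```python
def add_teleports_from_dead_ends(M):
    """Return a copy of M in which each all-zero column is replaced by the uniform column 1/n."""
    n = len(M)
    fixed = [row[:] for row in M]  # copy so the input matrix is left unchanged
    for col in range(n):
        if all(M[row][col] == 0 for row in range(n)):  # dead end: no out-links
            for row in range(n):
                fixed[row][col] = 1.0 / n
    return fixed


# Dead-end example (rows/columns ordered y, a, m); m has no out-links
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]

A = add_teleports_from_dead_ends(M)
column_sums = [sum(A[row][col] for row in range(3)) for col in range(3)]
```

After the adjustment every column sums to 1 again, so the matrix is column stochastic and the total score no longer leaks out.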
