Slide source: Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman (Stanford University), http://www.mmds.org
Top 10 data mining algorithms (by votes):
#1: C4.5 Decision Tree - Classification (61 votes)
#2: K-Means - Clustering (60 votes)
#3: SVM - Classification (58 votes)
#4: Apriori - Frequent Itemsets (52 votes)
#5: EM - Clustering (48 votes)
#6: PageRank - Link mining (46 votes)
#7: AdaBoost - Boosting (45 votes)
#7: kNN - Classification (45 votes)
#7: Naive Bayes - Classification (45 votes)
#10: CART - Classification (34 votes)
How to organize the Web?
- First try: human-curated Web directories (Yahoo, DMOZ, LookSmart)
- Second try: Web search
  - Content based: find relevant docs
  - Top-k ranking based on TF-IDF
  - Works well for a small and trusted set of documents
Link-based ranking algorithms:
- PageRank
- HITS
All web pages are not equally “important”
- www.joe-schmoe.com vs. www.stanford.edu
- There is large diversity in web-graph node connectivity
- Let’s rank the pages by the link structure!
Idea: Links as votes
- A page is more important if it has more links
  - In-coming links? Out-going links?
- Think of in-links as votes:
  - www.stanford.edu has 23,400 in-links
  - www.joe-schmoe.com has 1 in-link
- Are all in-links equal?
  - Links from important pages count more
  - Recursive question!
Figure: example PageRank scores on a small web graph (A: 3.3, B: 38.4, C: 34.3, D: 3.9, E: 8.1, F: 3.9; remaining pages: 1.6 each)
Each link’s vote is proportional to the importance of its source page
- If page j with importance r_j has n out-links, each link gets r_j / n votes
- Page j’s own importance is the sum of the votes on its in-links
- Example: if page i (with 3 out-links) and page k (with 4 out-links) both link to page j, then r_j = r_i/3 + r_k/4, and j in turn passes r_j/3 along each of its 3 out-links
A “vote” from an important page is worth more
- A page is important if it is pointed to by other important pages
- Define a “rank” r_j for page j:
  r_j = sum over i→j of r_i / d_i, where d_i is the out-degree of node i
Example (“the web in 1839”): three pages y, a, m, where y links to y and a, a links to y and m, and m links to a
“Flow” equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2
Flow equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2
- 3 equations, 3 unknowns, no constants
- No unique solution: all solutions are equivalent up to a scale factor
- An additional constraint forces uniqueness: r_y + r_a + r_m = 1
  Solution: r_y = 2/5, r_a = 2/5, r_m = 1/5
- Gaussian elimination works for small examples, but we need a better method for large web-scale graphs
- We need a new formulation!
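As a quick check on the solution above (a minimal sketch, not part of the original slides; numpy is just one convenient way to do the elimination), the two independent flow equations plus the normalization constraint can be solved directly for this tiny example:

```python
import numpy as np

# Two independent flow equations rewritten with all unknowns on the left,
# plus the normalization constraint (the third flow equation is redundant):
#   r_y - (r_y/2 + r_a/2) = 0  ->  0.5*r_y - 0.5*r_a           = 0
#   r_a - (r_y/2 + r_m)   = 0  -> -0.5*r_y + 1.0*r_a - 1.0*r_m = 0
#   r_y + r_a + r_m = 1
A = np.array([
    [ 0.5, -0.5,  0.0],
    [-0.5,  1.0, -1.0],
    [ 1.0,  1.0,  1.0],
])
b = np.array([0.0, 0.0, 1.0])

r_y, r_a, r_m = np.linalg.solve(A, b)
print(r_y, r_a, r_m)  # 0.4 0.4 0.2, i.e. 2/5, 2/5, 1/5
```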
Stochastic adjacency matrix M
- Let page i have d_i out-links
- If i → j, then M_ji = 1/d_i, else M_ji = 0
- M is a column-stochastic matrix: columns sum to 1
Rank vector r: a vector with one entry per page
- r_j is the importance score of page j
- sum_j r_j = 1
The flow equations can be written as r = M · r
Remember the flow equation: r_j = sum over i→j of r_i / d_i
Flow equation in matrix form: M · r = r
- Suppose page i links to 3 pages, including j: then column i of M has the entry M_ji = 1/3, so r_j picks up the contribution r_i/3 in the product M · r
Example (the y, a, m graph):

      y   a   m
  y   ½   ½   0
  a   ½   0   1
  m   0   ½   0

r = M · r is exactly the flow equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2
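A small sketch (not from the slides) that builds this column-stochastic M from an out-link list and checks that the solution found earlier is a fixed point of r = M · r; the dictionary representation of the graph is just an illustrative choice:

```python
import numpy as np

# Out-links of the toy "web in 1839": page -> pages it links to.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
pages = ["y", "a", "m"]
idx = {p: i for i, p in enumerate(pages)}

# Column-stochastic matrix: M[i, j] = 1/d_j if page j links to page i, else 0.
N = len(pages)
M = np.zeros((N, N))
for j, targets in out_links.items():
    for i in targets:
        M[idx[i], idx[j]] = 1.0 / len(targets)

print(M)                      # rows y, a, m: [0.5 0.5 0.0], [0.5 0.0 1.0], [0.0 0.5 0.0]
print(M.sum(axis=0))          # every column sums to 1
r = np.array([2/5, 2/5, 1/5])
print(np.allclose(M @ r, r))  # True: r is a fixed point of r = M.r
```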
The flow equations can be written r = M · r
- So the rank vector r is an eigenvector of the stochastic web matrix M
- In fact, it is the first (principal) eigenvector, with corresponding eigenvalue 1
  (Note: x is an eigenvector of A with corresponding eigenvalue λ if A·x = λ·x)
- The largest eigenvalue of M is 1, since M is column stochastic (with non-negative entries)
- We can now efficiently solve for r! The method is called power iteration
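To illustrate the eigenvector view on the same toy example (a sketch, not from the slides), a full eigendecomposition recovers eigenvalue 1 and an eigenvector that, once rescaled to sum to 1, matches the PageRank vector. A full eigendecomposition is only feasible for tiny matrices, which is exactly why the slides turn to power iteration next.

```python
import numpy as np

# Column-stochastic matrix of the y, a, m example.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

vals, vecs = np.linalg.eig(M)
k = np.argmax(vals.real)   # index of the largest eigenvalue (1 for a column-stochastic matrix)
r = vecs[:, k].real
r = r / r.sum()            # rescale so the entries sum to 1

print(vals.real[k])        # ~1.0
print(r)                   # ~[0.4, 0.4, 0.2]
```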
Power iteration: a simple iterative scheme
- Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks
- Initialize: r^(0) = [1/N, ..., 1/N]^T
- Iterate: r^(t+1) = M · r^(t), i.e. r_j^(t+1) = sum over i→j of r_i^(t) / d_i, where d_i is the out-degree of node i
- Stop when |r^(t+1) - r^(t)|_1 < ε
  (|x|_1 = sum over i of |x_i| is the L1 norm; any other vector norm, e.g. Euclidean, can be used)
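A minimal power-iteration sketch in Python (an illustration, not the book's reference implementation; the tolerance eps and the iteration cap are arbitrary choices):

```python
import numpy as np

def power_iteration(M, eps=1e-8, max_iters=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    for _ in range(max_iters):
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:  # L1 norm of the change
            return r_next
        r = r_next
    return r

# The y, a, m example from the previous slides: converges to [0.4, 0.4, 0.2].
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iteration(M))
```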
      y   a   m
  y   ½   ½   0
  a   ½   0   1
  m   0   ½   0

Example iterations (t = 0, 1, 2, ...):
  r_y   1/3   1/3   5/12   9/24   ...   6/15
  r_a   1/3   3/6   1/3    11/24  ...   6/15
  r_m   1/3   1/6   3/12   1/6    ...   3/15
Iterate r_j^(t+1) = sum over i→j of r_i^(t) / d_i
Example (two pages a and b, where a links to b and b links to a), iterations t = 0, 1, 2, ...:
  r_a   1   0   1   0   ...
  r_b   0   1   0   1   ...
The iteration oscillates and never converges.
Iterate r_j^(t+1) = sum over i→j of r_i^(t) / d_i
Example (two pages a and b, where a links to b and b has no out-links), iterations t = 0, 1, 2, ...:
  r_a   1   0   0   0   ...
  r_b   0   1   0   0   ...
All importance leaks out: the iteration converges to the zero vector.
Imagine a random web surfer:
- At any time t, the surfer is on some page i
- At time t+1, the surfer follows an out-link from i uniformly at random
- Ends up on some page j linked from i
- The process repeats indefinitely
Let p(t) be the vector whose i-th coordinate is the probability that the surfer is at page i at time t
- So p(t) is a probability distribution over pages
Where is the surfer at time t+1?
- The surfer follows a link uniformly at random: p(t+1) = M · p(t)
- Suppose the random walk reaches a state where p(t+1) = M · p(t) = p(t); then p(t) is a stationary distribution of the random walk
- Our original rank vector r satisfies r = M · r
- So r is a stationary distribution for the random walk
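To make the random-surfer reading concrete, here is a small simulation sketch (an illustration with an arbitrary step count, not from the slides): the fraction of time a simulated surfer spends on each page of the y, a, m example approaches the stationary distribution (2/5, 2/5, 1/5).

```python
import random
from collections import Counter

# Out-links of the y, a, m toy graph.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

random.seed(0)
page = "y"
visits = Counter()
steps = 200_000
for _ in range(steps):
    page = random.choice(out_links[page])  # follow an out-link uniformly at random
    visits[page] += 1

for p in ["y", "a", "m"]:
    print(p, visits[p] / steps)  # roughly 0.4, 0.4, 0.2
```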
Power iteration: r^(t+1) = M · r^(t), or equivalently r_j^(t+1) = sum over i→j of r_i^(t) / d_i
- Does this converge?
- Does it converge to what we want?
A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions (strongly connected, no dead ends), the stationary distribution is unique and is eventually reached no matter what the initial probability distribution is at time t = 0.
Two problems:
(1) Some pages are dead ends (have no out-links)
  - The random walk has “nowhere” to go
  - Such pages cause importance to “leak out”
(2) Spider traps (all out-links are within the group)
  - The random walk gets “stuck” in the trap
  - Eventually spider traps absorb all importance
Example (m is now a dead end, so column m is all zeros):

      y   a   m
  y   ½   ½   0
  a   ½   0   0
  m   0   ½   0

Example iterations (t = 0, 1, 2, ...):
  r_y   1/3   2/6   3/12   5/24   ...   0
  r_a   1/3   1/6   2/12   3/24   ...   0
  r_m   1/3   1/6   1/12   2/24   ...   0

Here the PageRank “leaks out” since the matrix is not column stochastic.
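A short sketch (not from the slides) showing the leak numerically: with the all-zero dead-end column, the total mass of r shrinks toward 0 under repeated multiplication.

```python
import numpy as np

# y, a, m graph where m is a dead end: column m is all zeros, so M is not column stochastic.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])

r = np.full(3, 1/3)
for t in range(1, 51):
    r = M @ r
    if t in (1, 2, 3, 10, 50):
        print(t, r, "total mass:", r.sum())  # the total mass decays toward 0
```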
Teleports: Follow random teleport links from dead ends
- Adjust the matrix accordingly

Before (m is a dead end):      After (teleport links from m):
      y   a   m                      y   a   m
  y   ½   ½   0                  y   ½   ½   ⅓
  a   ½   0   0                  a   ½   0   ⅓
  m   0   ½   0                  m   0   ½   ⅓
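One way to implement the adjustment (a sketch, assuming the dead-end fix is done by replacing every all-zero column with a uniform 1/N column, as the "after" matrix above shows):

```python
import numpy as np

def fix_dead_ends(M):
    """Replace each all-zero (dead-end) column with a uniform column of 1/N,
    so the matrix becomes column stochastic again."""
    M = M.copy()
    N = M.shape[0]
    dead = (M.sum(axis=0) == 0)  # columns that sum to 0 correspond to dead ends
    M[:, dead] = 1.0 / N
    return M

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])  # m is a dead end

A = fix_dead_ends(M)
print(A)              # column m becomes [1/3, 1/3, 1/3]
print(A.sum(axis=0))  # every column now sums to 1
```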