CS345a: Data Mining
Jure Leskovec and Anand Rajaraman, Stanford University

- Instead of generic popularity, can we measure popularity within a topic?
  - E.g., computer science, health
- Bias the random walk
  - When the random walker teleports, he picks a page from a set S of web pages
  - S contains only pages that are relevant to the topic
  - E.g., Open Directory (DMOZ) pages for a given topic (www.dmoz.org)
- For each teleport set S, we get a different rank vector r_S

1/28/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

- Let:

$$A_{ik} = \begin{cases} \beta M_{ik} + (1-\beta)/|S| & \text{if } i \in S \\ \beta M_{ik} & \text{otherwise} \end{cases}$$

- A is stochastic!
- We have weighted all pages in the teleport set S equally
  - Could also assign different weights to pages
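To see why A is stochastic, sum one column of A over all rows i. The short derivation below assumes that every column of M already sums to 1 (that is, there are no dead-end pages); the slide does not state this assumption explicitly.

```latex
\[
\sum_i A_{ik}
  = \beta \sum_i M_{ik} + \sum_{i \in S} \frac{1-\beta}{|S|}
  = \beta \cdot 1 + |S| \cdot \frac{1-\beta}{|S|}
  = 1
\]
```

Each column of A therefore sums to 1, so multiplying by A preserves the total rank.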

Suppose S = {1}, β = 0.8

[Figure: four-node example graph with weighted edges]

Node   Iteration 0   Iteration 1   Iteration 2   …   Stable
1      1.0           0.2           0.52              0.294
2      0             0.4           0.08              0.118
3      0             0.4           0.08              0.327
4      0             0             0.32              0.261

Note how we initialize the PageRank vector differently from the unbiased PageRank case.
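The iteration in this example can be sketched in a few lines of Python. The edge list below (1→2, 1→3, 2→1, 3→4, 4→3) is an assumption: the figure did not survive extraction, and this graph was inferred because it reproduces the values in the table. The function name `topic_specific_pagerank` and its parameters are illustrative only, and the sketch assumes every node has at least one out-link.

```python
def topic_specific_pagerank(out_links, teleport_set, beta=0.8, iterations=100):
    """Power iteration where teleports land only on pages in teleport_set."""
    nodes = sorted(out_links)
    # Start with all rank on the teleport set (unlike unbiased PageRank).
    rank = {n: (1.0 / len(teleport_set) if n in teleport_set else 0.0)
            for n in nodes}
    history = [rank]
    for _ in range(iterations):
        new_rank = {n: 0.0 for n in nodes}
        # Follow links with probability beta, splitting rank over out-links.
        for src, targets in out_links.items():
            for dst in targets:
                new_rank[dst] += beta * rank[src] / len(targets)
        # Teleport with probability 1 - beta, uniformly into the teleport set.
        for n in teleport_set:
            new_rank[n] += (1.0 - beta) / len(teleport_set)
        rank = new_rank
        history.append(rank)
    return history


# Assumed graph, inferred from the table values (hypothetical reconstruction).
graph = {1: [2, 3], 2: [1], 3: [4], 4: [3]}
history = topic_specific_pagerank(graph, teleport_set={1}, beta=0.8)

for step in (0, 1, 2, len(history) - 1):
    print(step, [round(history[step][n], 3) for n in sorted(graph)])
```

Under the assumed graph, the first iterations and the limit agree with the table to three decimal places.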

- Experimental results [Haveliwala 2000]
  - Picked 16 topics
    - Teleport sets determined using DMOZ
    - E.g., arts, business, sports, …
  - “Blind study” using volunteers
    - 35 test queries
    - Results ranked using PageRank and TSPR (topic-specific PageRank) of the most closely related topic
    - E.g., bicycling using Sports ranking
    - In most cases volunteers preferred TSPR ranking

- User can pick from a menu
- Use Naïve Bayes to classify query into a topic
- Can use the context of the query
  - E.g., query is launched from a web page talking about a known topic
  - History of queries, e.g., “basketball” followed by “Jordan”
- User context, e.g., user’s My Yahoo settings, bookmarks, …

- Goal:
  - Don’t just find newspapers but also find “experts”: people who link in a coordinated way to many good newspapers
- Idea: link voting
  - Quality as an expert (hub):
    - Total sum of votes of pages pointed to
  - Quality as content (authority):
    - Total sum of votes of experts
  - Principle of repeated improvement

[Figure: example pages with vote counts: NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9]


Interesting documents fall into two classes:
1. Authorities are pages containing useful information
   - Newspaper home pages
   - Course home pages
   - Home pages of auto manufacturers
2. Hubs are pages that link to authorities
   - List of newspapers
   - Course bulletin
   - List of US auto manufacturers

[Figure: example pages with vote counts: NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9]

- A good hub links to many good authorities
- A good authority is linked from many good hubs
- Model using two scores for each node:
  - Hub score and Authority score
  - Represented as vectors h and a

- Each page i has 2 kinds of scores:
  - Hub score: $h_i$
  - Authority score: $a_i$
- Algorithm:
  - Initialize: $a_i = h_i = 1$
  - Then keep iterating:
    - Authority: $a_i = \sum_{j \to i} h_j$
    - Hub: $h_i = \sum_{i \to j} a_j$
    - Normalize: $\sum_i a_i = 1$, $\sum_i h_i = 1$

- HITS uses the adjacency matrix: A[i, j] = 1 if page i links to page j, 0 else
- A^T, the transpose of A, is similar to the PageRank matrix M, but A^T has 1’s where M has fractions

[Figure: example graph with three pages: Yahoo (y), Amazon (a), M’soft (m)]

         y  a  m
     y   1  1  1
A =  a   1  0  1
     m   0  1  0

- Notation:
  - Vectors $a = (a_1, \ldots, a_n)$, $h = (h_1, \ldots, h_n)$
  - Adjacency matrix (n x n): $A_{ij} = 1$ if $i \to j$
- Then: $h_i = \sum_{i \to j} a_j \iff h_i = \sum_j A_{ij} a_j$
- So: $h = A a$
- Likewise: $a = A^T h$

- The hub score of page i is proportional to the sum of the authority scores of the pages it links to: $h = \lambda A a$
  - Constant λ is a scale factor, $\lambda = 1 / \sum_i h_i$
- The authority score of page i is proportional to the sum of the hub scores of the pages it is linked from: $a = \mu A^T h$
  - Constant μ is a scale factor, $\mu = 1 / \sum_i a_i$

- The HITS algorithm:
  - Initialize h, a to all 1’s
  - Repeat:
    - h = A a
    - Scale h so that its components sum to 1.0
    - a = A^T h
    - Scale a so that its components sum to 1.0
  - Until h, a converge (i.e., change very little)

[Figure: example graph with three pages: Yahoo, Amazon, M’soft]

     1 1 1          1 1 0
A =  1 0 1    A^T = 1 0 1
     0 1 0          1 1 0

Successive values of the scores (left to right), with the limit in the last column:

a(yahoo)  = 1   1     1     1     . . .   1
a(amazon) = 1   1     4/5   0.75  . . .   0.732
a(m’soft) = 1   1     1     1     . . .   1
h(yahoo)  = 1   1     1     1     . . .   1.000
h(amazon) = 1   2/3   0.71  0.73  . . .   0.732
h(m’soft) = 1   1/3   0.29  0.27  . . .   0.268

Note: in this example each vector is scaled so that its largest component is 1, rather than so that its components sum to 1.
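The worked example can be reproduced with a minimal pure-Python sketch of the HITS iteration. The helper names (`mat_vec`, `transpose`, `scale_to_max`, `hits`) are illustrative, not from the slides. The sketch updates h before a in each round and scales each vector so its largest component is 1, which is the scaling the numbers in this example appear to use; it assumes the graph has at least one link so the maximum is never zero.

```python
def mat_vec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def scale_to_max(vec):
    """Scale a vector so that its largest component equals 1."""
    largest = max(vec)
    return [v / largest for v in vec]


def hits(adjacency, iterations=50):
    """Run HITS: h = A a, then a = A^T h, rescaling after each update."""
    n = len(adjacency)
    a = [1.0] * n
    h = [1.0] * n
    adjacency_t = transpose(adjacency)
    hub_history = [h]
    for _ in range(iterations):
        h = scale_to_max(mat_vec(adjacency, a))
        a = scale_to_max(mat_vec(adjacency_t, h))
        hub_history.append(h)
    return h, a, hub_history


# Rows and columns are ordered Yahoo, Amazon, M'soft, as in the example.
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]
h, a, hub_history = hits(A)
print("hubs:       ", [round(x, 3) for x in h])
print("authorities:", [round(x, 3) for x in a])
```

Swapping `scale_to_max` for a function that divides by the sum of the components changes only the scale of the vectors, not their relative values.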

- Algorithm (M here plays the role of the adjacency matrix A above):
  - Set: $a = h = 1^n$
  - Repeat:
    - $h = M a$, $a = M^T h$
    - Normalize
- Then a is updated in 2 steps (first the new h, then the new a): $a = M^T (M a) = (M^T M) a$
- Likewise h is updated in 2 steps: $h = M (M^T h) = (M M^T) h$
- Thus, in 2k steps (repeated matrix powering):
  - $a = (M^T M)^k a$
  - $h = (M M^T)^k h$

- $h = \lambda A a$
- $a = \mu A^T h$
- $h = \lambda \mu A A^T h$
- $a = \lambda \mu A^T A a$
- Under reasonable assumptions about A, the HITS iterative algorithm converges to vectors h* and a*:
  - h* is the principal eigenvector of matrix $A A^T$
  - a* is the principal eigenvector of matrix $A^T A$
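As a sanity check, the eigenvector claim can be verified by hand on the three-page Yahoo, Amazon, M’soft adjacency matrix with rows (1 1 1), (1 0 1), (0 1 0). The closed form $\sqrt{3} - 1 \approx 0.732$ below is worked out here and is not given on the slides.

```latex
\[
A^T A = \begin{pmatrix} 2 & 1 & 2 \\ 1 & 2 & 1 \\ 2 & 1 & 2 \end{pmatrix},
\qquad
a^* = \begin{pmatrix} 1 \\ \sqrt{3}-1 \\ 1 \end{pmatrix}
\]
\[
A^T A \, a^*
  = \begin{pmatrix} 3+\sqrt{3} \\ 2\sqrt{3} \\ 3+\sqrt{3} \end{pmatrix}
  = (3+\sqrt{3}) \begin{pmatrix} 1 \\ \sqrt{3}-1 \\ 1 \end{pmatrix}
\]
```

The eigenvalues of this $A^T A$ are $0$, $3-\sqrt{3}$, and $3+\sqrt{3}$, so $a^*$ belongs to the largest one, and its middle component matches the limiting authority score of about 0.732 for Amazon.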

[Figure: hubs and authorities forming bipartite cores. The most densely connected core is the primary core; a less densely connected core is a secondary core.]

- A single topic can have many bipartite cores
  - Corresponding to different meanings or points of view:
    - abortion: pro-choice, pro-life
    - evolution: darwinian, intelligent design
    - jaguar: auto, Mac, NFL team, panthera onca
- How to find such secondary cores?

- Once we find the primary core, we can remove its links from the graph
- Repeat the HITS algorithm on the residual graph to find the next bipartite core
- Roughly, these cores correspond to non-primary eigenvectors of $A A^T$ and $A^T A$

- We need a well-connected graph of pages for HITS to work well
