Data Mining: Concepts and Techniques
Web Mining
Li Xiong
Slide credits: Jiawei Han and Micheline Kamber; Anand Rajaraman, Jeffrey D. Ullman; Olfa Nasraoui; Bing Liu
4/9/2008
Web Mining
• Web mining vs. data mining
  • Structure (or lack of it)
    • Linkage structure and lack of structure in textual information
  • Scale
    • Data generated per day is comparable to the largest conventional data warehouses
  • Speed
    • Often need to react to evolving usage patterns in real time (e.g., merchandising)
Web Mining
• Structure Mining
  • Extracting info from the topology of the Web (links among pages)
• Content Mining
  • Extracting info from page content (text, images, audio, video, etc.)
  • Natural language processing and information retrieval
• Usage Mining
  • Extracting info from users' usage data on the Web (how users visit pages or make transactions)
Web Mining
• Web structure mining
  • Web graph structure and link analysis
• Web text mining
  • Text representation and IR models
• Web usage mining
  • Collaborative filtering
Structure of Web Graph
• Web as a directed graph
  • Pages = nodes, hyperlinks = edges
• Problem: understand the macroscopic structure and evolution of the web graph
• Practical implications
  • Crawling, browsing, computation of link analysis algorithms
Power-law degree distribution (figure; source: Broder et al., 2000)
Bow-tie Structure (figure; Broder et al., 2000)
The Daisy Structure (figure; Donato et al., 2005)
Link Analysis
• Problem: exploit the link structure of a graph to order or prioritize the set of objects within the graph
• Application of social network analysis at the actor level: centrality and prestige
• Algorithms
  • PageRank
  • HITS
PageRank (Brin & Page, 1998)
• Intuition
  • Web pages are not equally "important"
    • www.joe-schmoe.com vs. www.stanford.edu
  • Links as citations: a page that is cited often is more important
    • www.stanford.edu has 23,400 inlinks
    • www.joe-schmoe.com has 1 inlink
  • Recursive model: links from heavily linked pages are weighted more
• PageRank is essentially the eigenvector prestige measure from social network analysis
Simple Recursive Flow Model
• Each link's vote is proportional to the importance of its source page
• If page P with importance x has n outlinks, each link gets x/n votes
• Page P's own importance is the sum of the votes on its inlinks
• Example (figure): Yahoo (y) links to itself and to Amazon; Amazon (a) links to Yahoo and to M'soft; M'soft (m) links to Amazon
  • $y = y/2 + a/2$
  • $a = y/2 + m$
  • $m = a/2$
• Solving the equations with the constraint $y + a + m = 1$ gives $y = 2/5$, $a = 2/5$, $m = 1/5$
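The flow equations for the three-page example can be checked by solving them exactly. The sketch below is one possible way to do that, not part of the original slides: `solve_linear` is a hypothetical helper (plain Gauss-Jordan elimination over `fractions.Fraction`), and it assumes the system has a unique solution.

```python
from fractions import Fraction

def solve_linear(coeffs, rhs):
    """Solve coeffs * x = rhs exactly by Gauss-Jordan elimination (hypothetical helper)."""
    n = len(rhs)
    # Augmented matrix [coeffs | rhs] with exact rational entries.
    rows = [[Fraction(v) for v in coeffs[i]] + [Fraction(rhs[i])] for i in range(n)]
    for col in range(n):
        # Swap a row with a nonzero entry in this column into the pivot position.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Scale the pivot row so that the pivot equals 1.
        p = rows[col][col]
        rows[col] = [v / p for v in rows[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col:
                f = rows[r][col]
                rows[r] = [v - f * w for v, w in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]

half = Fraction(1, 2)
# Unknowns in the order (y, a, m).
coeffs = [
    [half, -half, 0],  # y - (y/2 + a/2) = 0
    [-half, 1, -1],    # a - (y/2 + m) = 0
    [1, 1, 1],         # y + a + m = 1 (replaces the redundant equation m = a/2)
]
rhs = [0, 0, 1]
y, a, m = solve_linear(coeffs, rhs)
print(y, a, m)  # 2/5 2/5 1/5
```

One of the three flow equations is redundant (the three sum to an identity), which is why the normalization constraint is needed to pin down a single solution.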
Matrix formulation
• Web link matrix $M$: one row and one column per web page
  • $M_{ij} = 1/O_j$ if page $j$ links to page $i$, and $M_{ij} = 0$ otherwise, where $O_j$ is the number of outlinks of page $j$
• Rank vector $r$: one entry per web page
• Flow equation: $r = Mr$, i.e. $r_i = \sum_j M_{ij} r_j$
• $r$ is an eigenvector of $M$ with eigenvalue 1
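Matrix formulation Example
• Link matrix for the Yahoo (y), Amazon (a), M'soft (m) graph, with rows and columns in the order y, a, m:
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}$$
• $r = Mr$:
$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} y \\ a \\ m \end{pmatrix}$$
• Written out: $y = y/2 + a/2$, $a = y/2 + m$, $m = a/2$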
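As an illustration, the link matrix of the example can be built from out-link lists. This is a minimal sketch: the names `link_matrix`, `matvec`, `pages` and `outlinks` are assumptions made for the example, and each page is assumed to list every target at most once.

```python
from fractions import Fraction

def link_matrix(outlinks, pages):
    """Return M with M[i][j] = 1/O_j if page j links to page i, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    m = [[Fraction(0)] * n for _ in range(n)]
    for src, targets in outlinks.items():
        j = index[src]  # column of the source page
        for dst in targets:
            m[index[dst]][j] = Fraction(1, len(targets))
    return m

def matvec(m, v):
    """Matrix-vector product m * v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

pages = ["y", "a", "m"]  # Yahoo, Amazon, M'soft
outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = link_matrix(outlinks, pages)

# The solution of the flow equations is a fixed point of M.
r = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
print(matvec(M, r) == r)  # True
```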
Power Iteration method
• Solving the equation $r = Mr$
• Suppose there are $N$ web pages
• Initialize: $r^{0} = [1/N, \ldots, 1/N]^T$
• Iterate: $r^{k+1} = M r^{k}$
• Stop when $|r^{k+1} - r^{k}|_1 < \varepsilon$
  • $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the $L_1$ norm
  • Can use any other vector norm, e.g., Euclidean
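Power Iteration Example
• Link matrix for Yahoo (y), Amazon (a), M'soft (m), rows and columns in the order y, a, m:
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}$$
• Successive iterates of $r$:

        r^0    r^1    r^2    r^3     ...   limit
    y   1/3    1/3    5/12   3/8     ...   2/5
    a   1/3    1/2    1/3    11/24   ...   2/5
    m   1/3    1/6    1/4    1/6     ...   1/5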
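The iteration can be sketched in a few lines of Python. This is an illustrative version under stated assumptions: the function name `power_iteration` and the defaults `eps=1e-10` and `max_iter=1000` are choices made here, not values from the slides.

```python
def matvec(m, v):
    """Matrix-vector product m * v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def power_iteration(m, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    n = len(m)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = matvec(m, r)
        # L1 norm of the difference between successive iterates.
        if sum(abs(x - y) for x, y in zip(nxt, r)) < eps:
            return nxt
        r = nxt
    return r

# Link matrix of the Yahoo / Amazon / M'soft example (order y, a, m).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
first_step = matvec(M, [1 / 3, 1 / 3, 1 / 3])  # approximately [1/3, 1/2, 1/6]
rank = power_iteration(M)                      # approaches [2/5, 2/5, 1/5]
print([round(x, 4) for x in rank])
```

Because every column of this `M` sums to 1, each iterate keeps the same total mass as the starting vector, so no renormalization is needed here.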
Random Walk Interpretation
• Imagine a random web surfer
  • At any time $t$, the surfer is on some page P
  • At time $t+1$, the surfer follows an outlink from P uniformly at random
  • Ends up on some page Q linked from P
  • Process repeats indefinitely
• $p(t)$ is the probability distribution whose $i$-th component is the probability that the surfer is at page $i$ at time $t$
The stationary distribution
• Where is the surfer at time $t+1$?
  • $p(t+1) = Mp(t)$
• Suppose the random walk reaches a state such that $p(t+1) = Mp(t) = p(t)$
  • Then $p(t)$ is a stationary distribution for the random walk
• Our rank vector $r$ satisfies $r = Mr$, so it is a stationary distribution for the random surfer
Existence and Uniqueness of the Solution
• Theory of random walks (a.k.a. Markov processes): a finite Markov chain defined by a stochastic matrix has a unique stationary probability distribution if the matrix is irreducible and aperiodic.
M is not a stochastic matrix
• $M$ is the transition matrix of the Web graph: $M_{ij} = 1/O_j$ if page $j$ links to page $i$, and $M_{ij} = 0$ otherwise
• It does not satisfy $\sum_{i=1}^{n} M_{ij} = 1$ for every column $j$
  • Many web pages have no out-links
  • Such pages are called dangling pages
(CS583, Bing Liu, UIC)
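A small check makes the problem concrete. The sketch below uses a hypothetical variant of the three-page example in which M'soft has no out-links (this variant is an assumption for illustration, not a graph from the slides); the helper names are also made up here.

```python
from fractions import Fraction

def link_matrix(outlinks, pages):
    """Return M with M[i][j] = 1/O_j if page j links to page i, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    m = [[Fraction(0)] * n for _ in range(n)]
    for src, targets in outlinks.items():
        for dst in targets:
            m[index[dst]][index[src]] = Fraction(1, len(targets))
    return m

def column_sums(m):
    """Sum of each column j of m; a stochastic M needs every sum to equal 1."""
    n = len(m)
    return [sum(m[i][j] for i in range(n)) for j in range(n)]

def dangling_pages(outlinks, pages):
    """Pages with no out-links."""
    return [p for p in pages if not outlinks.get(p)]

pages = ["y", "a", "m"]
# Hypothetical variant: M'soft ("m") is a dangling page.
outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": []}
M = link_matrix(outlinks, pages)
sums = column_sums(M)
dangling = dangling_pages(outlinks, pages)
print(dangling, [str(s) for s in sums])
```

The column of a dangling page is all zeros, so importance that flows into it is lost at every iteration.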
M is not irreducible
• Irreducible means that the Web graph $G$ is strongly connected.
  • Definition: a directed graph $G = (V, E)$ is strongly connected if and only if, for each pair of nodes $u, v \in V$, there is a path from $u$ to $v$.
• A general Web graph is not irreducible because, for some pair of nodes $u$ and $v$, there is no path from $u$ to $v$.
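Strong connectivity can be tested with two breadth-first searches from a single node: one along the links and one against them. The following is a sketch of that standard check, not something taken from the slides; the function names are illustrative and the page list is assumed to be non-empty.

```python
from collections import deque

def reachable(start, adj):
    """Set of nodes reachable from start by following edges in adj (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def is_strongly_connected(outlinks, pages):
    """True if every page can reach, and be reached from, pages[0]."""
    reverse = {p: [] for p in pages}
    for u, targets in outlinks.items():
        for v in targets:
            reverse[v].append(u)
    everything = set(pages)
    start = pages[0]
    return (reachable(start, outlinks) == everything
            and reachable(start, reverse) == everything)

pages = ["y", "a", "m"]
# Three-page example: Yahoo -> {Yahoo, Amazon}, Amazon -> {Yahoo, M'soft}, M'soft -> {Amazon}.
connected = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
# Variant in which M'soft links only to itself, so nothing leads back out of it.
trap = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
print(is_strongly_connected(connected, pages), is_strongly_connected(trap, pages))
```

If every node is reachable from the start node and the start node is reachable from every node, then any pair of nodes is connected through the start node, which is why two searches suffice.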
M is not aperiodic
• A state $i$ in a Markov chain being periodic means that there exists a directed cycle that the chain has to traverse.
  • Definition: a state $i$ is periodic with period $k > 1$ if $k$ is the smallest number such that all paths leading from state $i$ back to state $i$ have a length that is a multiple of $k$.
• If a state is not periodic (i.e., $k = 1$), it is aperiodic.
• A Markov chain is aperiodic if all states are aperiodic.
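For a strongly connected graph, the period can be computed without enumerating cycles. The sketch below uses a standard BFS-level technique that is not from the slides: the period is the gcd of `level[u] + 1 - level[v]` over all edges `u -> v`. It assumes the graph is strongly connected; the function name and the two-page cycle are hypothetical.

```python
from collections import deque
from math import gcd

def period(outlinks, start):
    """Period of a strongly connected graph, via BFS levels from start."""
    level = {start: 0}
    queue = deque([start])
    g = 0
    while queue:
        u = queue.popleft()
        for v in outlinks.get(u, []):
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
            # Tree edges contribute 0; every other edge closes a cycle
            # whose length is congruent to this difference modulo the period.
            g = gcd(g, level[u] + 1 - level[v])
    return g

# Three-page example: Yahoo's self-link gives a cycle of length 1.
example = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
# Hypothetical two-page cycle: every return path has even length.
two_cycle = {"p": ["q"], "q": ["p"]}
print(period(example, "y"), period(two_cycle, "p"))  # 1 2
```

A result of 1 means the chain is aperiodic; the two-page cycle has period 2, so a surfer starting on one page can only be back there at even time steps.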
Solution: Random teleports
• Add a link from each page to every page
• At each time step, the random surfer has a small probability of teleporting along one of those links
  • With probability $\beta$, follow a link at random
  • With probability $1 - \beta$, jump to some page uniformly at random
  • Common values for $\beta$ are in the range 0.8 to 0.9
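Random teleports Example ($\beta = 0.8$)
• Graph (figure): Yahoo (y) links to itself and to Amazon; Amazon (a) links to Yahoo and to M'soft; M'soft (m) links only to itself
• Matrix with teleports, rows and columns in the order y, a, m:
$$0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}$$
• Successive iterates of $r$ (here scaled so that the entries sum to 3):

        r^0   r^1    r^2    r^3     ...   limit
    y   1     1.00   0.84   0.776   ...   7/11
    a   1     0.60   0.60   0.536   ...   5/11
    m   1     1.40   1.56   1.688   ...   21/11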
Matrix formulation
• Matrix $A$
  • $A_{ij} = \beta M_{ij} + (1 - \beta)/N$
  • $M_{ij} = 1/|O(j)|$ when $j \to i$ and $M_{ij} = 0$ otherwise
  • Verify that $A$ is a stochastic matrix
• The PageRank vector $r$ is the principal eigenvector of this matrix
  • satisfying $r = Ar$
• Equivalently, $r$ is the stationary distribution of the random walk with teleports
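The teleport matrix and its iterates for the $\beta = 0.8$ example can be reproduced exactly with rational arithmetic. This is a sketch under stated assumptions: `teleport_matrix` and `matvec` are names chosen here, and `M` is the link matrix of the teleport example (M'soft linking only to itself).

```python
from fractions import Fraction

def teleport_matrix(m, beta):
    """A[i][j] = beta * M[i][j] + (1 - beta) / N."""
    n = len(m)
    return [[beta * m[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]

def matvec(m, v):
    """Matrix-vector product m * v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

h = Fraction(1, 2)
# Link matrix of the teleport example, order y, a, m.
M = [
    [h, h, 0],
    [h, 0, 0],
    [0, h, 1],
]
A = teleport_matrix(M, Fraction(4, 5))  # beta = 0.8

# Every column of A sums to 1, so A is (column-)stochastic.
col_sums = [sum(A[i][j] for i in range(3)) for j in range(3)]

# Three iterations from r = [1, 1, 1], as in the example table.
r = [Fraction(1)] * 3
for _ in range(3):
    r = matvec(A, r)
print([float(x) for x in r])  # [0.776, 0.536, 1.688]

# The limit from the example is a fixed point of A.
limit = [Fraction(7, 11), Fraction(5, 11), Fraction(21, 11)]
print(matvec(A, limit) == limit)  # True
```

Dividing the limit by 3 gives the same ranking as a probability distribution, $(7/33, 5/33, 21/33)$; the example simply starts from a vector whose entries sum to 3.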
Advantages and Limitations of PageRank
• Fighting spam
• PageRank is a global measure and is query-independent
  • Computed offline
• Criticism: query-independence
  • It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.
HITS: Capturing Authorities & Hubs (Kleinberg, 1998)
• Intuitions
  • Pages that are widely cited are good authorities
  • Pages that cite many other pages are good hubs
• HITS (Hypertext-Induced Topic Selection)
  • When the user issues a search query, HITS expands the list of relevant pages returned by a search engine and produces two rankings
  1. Authorities are pages containing useful information and linked to by hubs
     • course home pages
     • home pages of auto manufacturers
  2. Hubs are pages that link to authorities
     • course bulletin
     • list of US auto manufacturers
(Figure: hubs linking to authorities)
Matrix Formulation
• Transition (adjacency) matrix $A$
  • $A[i, j] = 1$ if page $i$ links to page $j$, 0 if not
• The hub score vector $h$: score is proportional to the sum of the authority scores of the pages it links to
  • $h = \lambda A a$
  • Constant $\lambda$ is a scale factor
• The authority score vector $a$: score is proportional to the sum of the hub scores of the pages it is linked from
  • $a = \mu A^T h$
  • Constant $\mu$ is a scale factor
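Transition Matrix Example
• Adjacency matrix for Yahoo (y), Amazon (a), M'soft (m), with rows and columns in the order y, a, m:
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$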
Iterative algorithm
• Initialize $h$, $a$ to all 1's
• $h = Aa$
• Scale $h$ so that its max entry is 1.0
• $a = A^T h$
• Scale $a$ so that its max entry is 1.0
• Continue until $h$, $a$ converge
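Iterative Algorithm Example
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \qquad A^T = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$
• Successive values of the authority and hub scores:

    a(yahoo)   =  1   1     1      1      ...   1
    a(amazon)  =  1   1     4/5    0.75   ...   0.732
    a(m'soft)  =  1   1     1      1      ...   1
    h(yahoo)   =  1   1     1      1      ...   1.000
    h(amazon)  =  1   2/3   0.71   0.73   ...   0.732
    h(m'soft)  =  1   1/3   0.29   0.27   ...   0.268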
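The iterative algorithm can be sketched directly from its steps. In this illustrative version the function names and the fixed iteration count of 50 are assumptions (the slides say only "continue until convergence"), and the graph is assumed to contain at least one link so that the scaling never divides by zero.

```python
from math import sqrt

def scale_to_max(v):
    """Scale v so that its largest entry is 1.0."""
    top = max(v)
    return [x / top for x in v]

def hits(adj, iterations=50):
    """Alternate h = A a and a = A^T h, rescaling after each step."""
    n = len(adj)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iterations):
        # Hub score: sum of authority scores of the pages it links to.
        h = scale_to_max([sum(adj[i][j] * a[j] for j in range(n)) for i in range(n)])
        # Authority score: sum of hub scores of the pages linking to it.
        a = scale_to_max([sum(adj[i][j] * h[i] for i in range(n)) for j in range(n)])
    return h, a

# Adjacency matrix of the example, order Yahoo, Amazon, M'soft.
A = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
]
hubs, auths = hits(A)
print([round(x, 3) for x in hubs])   # [1.0, 0.732, 0.268]
print([round(x, 3) for x in auths])  # [1.0, 0.732, 1.0]
```

Depending on whether `h` or `a` is updated first, the intermediate values can be offset by one step relative to a hand-computed table, but the limits are the same.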
Existence and Uniqueness of the Solution
• $h = \lambda A a$ and $a = \mu A^T h$
• Substituting one into the other: $h = \lambda \mu A A^T h$ and $a = \lambda \mu A^T A a$
• Under reasonable assumptions about $A$, the dual iterative algorithm converges to vectors $h^*$ and $a^*$ such that:
  • $h^*$ is the principal eigenvector of the matrix $A A^T$
  • $a^*$ is the principal eigenvector of the matrix $A^T A$
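The eigenvector claim can be checked numerically on the three-page HITS example. In the sketch below, the closed forms $\sqrt{3} - 1 \approx 0.732$ and $2 - \sqrt{3} \approx 0.268$ for the limiting hub scores are our own reading of the rounded values in the example, and the helper names are illustrative.

```python
from math import sqrt, isclose

def transpose(m):
    """Transpose of a square or rectangular matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Matrix product x * y."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def matvec(m, v):
    """Matrix-vector product m * v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

# Adjacency matrix of the example, order Yahoo, Amazon, M'soft.
A = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
]
AAT = matmul(A, transpose(A))

# Limiting hub scores, written in closed form (an assumption, see lead-in).
h_star = [1.0, sqrt(3) - 1, 2 - sqrt(3)]
image = matvec(AAT, h_star)

# For an eigenvector, every component is stretched by the same factor.
ratios = [image[i] / h_star[i] for i in range(3)]
print(ratios)
```

All three ratios agree (about 4.732, i.e. $3 + \sqrt{3}$), so $A A^T h^* = (3 + \sqrt{3})\, h^*$ and the scale factors in this example satisfy $\lambda \mu = 1/(3 + \sqrt{3})$.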