CS473: Link Analysis Luo Si Department of Computer Science Purdue University Borrowed Slides from Prof. Rong Jin (MSU)
Citation Analysis
Web Structure The Web is a graph – Each web site corresponds to a node – A link from one site to another site forms a directed edge What does it look like? The Web is a small world: its diameter is 19, i.e., the average number of clicks from one web site to another is 19
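The "average number of clicks" reading of diameter can be sketched with a breadth-first search over a directed graph. This is a minimal illustration only: the function name `average_path_length` and the three-site graph `toy_web` are hypothetical, and the measurement on the real Web was of course made on a far larger crawl.

```python
from collections import deque

def average_path_length(graph):
    """Mean shortest-path length (in clicks) over all reachable ordered pairs.

    graph maps each site to the list of sites it links to.
    """
    total, pairs = 0, 0
    for source in graph:
        # Breadth-first search gives the minimum number of clicks from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0

# Hypothetical toy graph: a directed 3-cycle A -> B -> C -> A.
toy_web = {"A": ["B"], "B": ["C"], "C": ["A"]}
avg_clicks = average_path_length(toy_web)
```

Pairs with no connecting path are skipped here, which is one possible convention among several.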
Bowtie Structure Strongly Connected Component Broder et al., 2001
Bowtie Structure Sites that link towards the ‘center’ of the web Broder et al., 2001
Bowtie Structure Sites that link from the ‘center’ of the web Broder et al., 2001
Inlinks and Outlinks The distributions of both in-degree (incoming links) and out-degree (outgoing links) follow a power law Broder et al., 2001
Early Approaches Basic Assumptions • Hyperlinks contain information about human judgment of a site • The more incoming links a site has, the more important it is judged to be Bray 1996 • The visibility of a site is measured by the number of other sites pointing to it • The luminosity of a site is measured by the number of other sites to which it points Limitation: fails to capture the relative importance of different parent (child) sites
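The two counting measures above can be sketched as follows. This is an illustrative reading of the definitions, not code from the cited work; the function name `visibility_and_luminosity` and the link list `toy_links` are hypothetical.

```python
from collections import Counter

def visibility_and_luminosity(links):
    """links: iterable of (source, target) site pairs.

    visibility[s] = number of other sites pointing to s
    luminosity[s] = number of other sites that s points to
    """
    visibility, luminosity = Counter(), Counter()
    for source, target in set(links):  # count each distinct link once
        if source == target:
            continue  # only links between different sites count
        luminosity[source] += 1
        visibility[target] += 1
    return visibility, luminosity

# Hypothetical link list; the duplicate ("A", "C") is counted once.
toy_links = [("A", "C"), ("B", "C"), ("C", "D"), ("A", "C")]
vis, lum = visibility_and_luminosity(toy_links)
```

Every incoming link contributes exactly 1 regardless of who sent it, which is the limitation the slide names.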
HITS - Kleinberg’s Algorithm • HITS – Hypertext Induced Topic Selection • For each vertex v ∈ V in a subgraph of interest: a(v) - the authority of v; h(v) - the hubness of v • A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites • Hubness shows the importance of a site as a hub. A good hub is a site that links to many authoritative sites
Authority and Hubness [Figure: sites 2, 3, 4 link to site 1; site 1 links to sites 5, 6, 7] h(1) = a(5) + a(6) + a(7) a(1) = h(2) + h(3) + h(4)
Authority and Hubness: Version 1
Recursive dependency: $a(v) = \sum_{w \in pa[v]} h(w)$, $h(v) = \sum_{w \in ch[v]} a(w)$
HubsAuthorities(G)
  1 ← [1,…,1] ∈ R^|V|
  a_0 ← h_0 ← 1
  t ← 1
  repeat
    for each v in V
      do a_t(v) ← Σ_{w ∈ pa[v]} h_{t-1}(w)
         h_t(v) ← Σ_{w ∈ ch[v]} a_{t-1}(w)
    t ← t + 1
  until || a_t – a_{t-1} || + || h_t – h_{t-1} || < ε
  return (a_t, h_t)
Authority and Hubness: Version 1 Problems?
Authority and Hubness: Version 2
Recursive dependency: $a(v) = \sum_{w \in pa[v]} h(w)$, $h(v) = \sum_{w \in ch[v]} a(w)$
+ Normalization: $a \leftarrow a / \|a\|$, $h \leftarrow h / \|h\|$
HubsAuthorities(G)
  1 ← [1,…,1] ∈ R^|V|
  a_0 ← h_0 ← 1
  t ← 1
  repeat
    for each v in V
      do a_t(v) ← Σ_{w ∈ pa[v]} h_{t-1}(w)
         h_t(v) ← Σ_{w ∈ ch[v]} a_{t-1}(w)
    a_t ← a_t / || a_t ||
    h_t ← h_t / || h_t ||
    t ← t + 1
  until || a_t – a_{t-1} || + || h_t – h_{t-1} || < ε
  return (a_t, h_t)
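The normalized iteration can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not a reference implementation: the slide does not say which norm is used, so the Euclidean norm is assumed here, and the names `hits`, `_normalize`, `eps`, `max_iter` and the graph `toy_graph` are all hypothetical.

```python
import math

def _normalize(scores):
    """Rescale a score dict to unit Euclidean norm (assumed norm)."""
    norm = math.sqrt(sum(x * x for x in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {v: x / norm for v, x in scores.items()}

def hits(children, eps=1e-10, max_iter=1000):
    """children maps each site to the list of sites it links to."""
    nodes = set(children)
    for targets in children.values():
        nodes.update(targets)
    # Invert the link map so pa[v] (parents) is available next to ch[v].
    parents = {v: [] for v in nodes}
    for u, targets in children.items():
        for v in targets:
            parents[v].append(u)
    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # Both updates read the previous iterate, as in the pseudocode.
        new_a = _normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        new_h = _normalize({v: sum(a[w] for w in children.get(v, [])) for v in nodes})
        change = (math.sqrt(sum((new_a[v] - a[v]) ** 2 for v in nodes))
                  + math.sqrt(sum((new_h[v] - h[v]) ** 2 for v in nodes)))
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h

# Hypothetical toy graph: A links to B and C, B links to C.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": []}
authority, hubness = hits(toy_graph)
```

On this toy graph C ends up the strongest authority and A the strongest hub. Without the two `_normalize` calls the scores would grow at every pass and the stopping test would not be met, which is the problem with Version 1.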
HITS Example Results [Figure: authority and hubness weights; axis ticks 1 to 15]
Authority and Hubness Authority score – Depends not only on the number of incoming links – But also on the ‘quality’ (i.e., hubness) of the sites behind the incoming links Hubness score – Depends not only on the number of outgoing links – But also on the ‘quality’ (i.e., authority) of the sites behind the outgoing links
Authority and Hub Column vector a: $a_i$ is the authority score for the i-th site Column vector h: $h_i$ is the hub score for the i-th site Matrix M: $M_{i,j} = 1$ if the i-th site points to the j-th site, $0$ otherwise [Figure: example matrix M]
Authority and Hub • Recursive dependency: $a(v) = \sum_{w \in pa[v]} h(w)$, $h(v) = \sum_{w \in ch[v]} a(w)$
Authority and Hub • Recursive dependency in matrix form: $a(v) = \sum_{w \in pa[v]} h(w)$ becomes $a = M^T h$; $h(v) = \sum_{w \in ch[v]} a(w)$ becomes $h = M a$
Authority and Hub • Normalization procedure: at each iteration t, $a_t \propto M^T h_{t-1}$ and $h_t \propto M a_{t-1}$, with each vector rescaled to unit norm
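Authority and Hub Substituting one update into the other: $a_t \propto M^T h_{t-1} \propto M^T M a_{t-2}$ and $h_t \propto M a_{t-1} \propto M M^T h_{t-2}$, so a converges to the principal eigenvector of $M^T M$ and h to the principal eigenvector of $M M^T$ Apply singular value decomposition to matrix M: $M = U \Sigma V^T = \sum_i \sigma_i u_i v_i^T$, which gives $a = v_1$, $h = u_1$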
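The link between the iteration and the singular vectors can be worked out in a few lines. The derivation below is a sketch assuming the convention $M = U \Sigma V^T$ with orthonormal columns in $U$ and $V$, singular values sorted so that $\sigma_1$ is the largest, and $\sigma_1$ strictly larger than $\sigma_2$ (otherwise the limit is not unique).

```latex
\begin{align*}
M &= U \Sigma V^{T} = \sum_i \sigma_i u_i v_i^{T}, \qquad U^{T}U = I,\; V^{T}V = I \\
M^{T} M &= V \Sigma^{T} U^{T} U \Sigma V^{T} = V \Sigma^{2} V^{T}
  \;\Rightarrow\; M^{T} M v_1 = \sigma_1^{2} v_1 \\
M M^{T} &= U \Sigma V^{T} V \Sigma^{T} U^{T} = U \Sigma^{2} U^{T}
  \;\Rightarrow\; M M^{T} u_1 = \sigma_1^{2} u_1
\end{align*}
```

Repeated multiplication by $M^T M$ with renormalization is power iteration, so the authority vector tends to $v_1$; the same argument with $M M^T$ sends the hub vector to $u_1$.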
PageRank Introduced by Page et al. (1998) – The weight of a page is assigned by the rank of its parents Difference from HITS – HITS uses hubness and authority weights – A page’s rank is proportional to its parents’ ranks, but inversely proportional to its parents’ outdegrees
Matrix Notation $M_{i,j} = 1$ if the i-th site points to the j-th site, $0$ otherwise $B_{i,j} = M_{i,j} / \sum_k M_{i,k}$ if $\sum_k M_{i,k} > 0$, $0$ otherwise [Figure: example matrices M and B]
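Building B from M amounts to dividing each row of M by its out-degree. The sketch below assumes M is given as a list of 0/1 rows; the function name `row_normalize` and the 3-site matrix are hypothetical.

```python
def row_normalize(M):
    """Return B with B[i][j] = M[i][j] / sum_k M[i][k], or 0 for rows with no outlinks."""
    B = []
    for row in M:
        out_degree = sum(row)
        if out_degree > 0:
            B.append([x / out_degree for x in row])
        else:
            B.append([0.0] * len(row))  # site with no outgoing links
    return B

# Hypothetical 3-site graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1 (rows are sources).
M = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
B = row_normalize(M)
```

Each non-empty row of B sums to 1, so row i can be read as the probabilities of following each outgoing link of site i.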
Matrix Notation r: $r_i$ represents the rank score for the i-th web page $r = \alpha B^T r$, so r is an eigenvector of $B^T$ (with eigenvalue $1/\alpha$) Finding PageRank means finding the principal eigenvector of $B^T$
Random Walk Model Consider a random walk through the Web graph [Figure: example graph; entries of the transition matrix B to be filled in]
Random Walk Model Consider a random walk through the Web graph As T → ∞, what portion of time will the surfer spend on each site?
Random Walk Model Consider a random walk through the Web graph p(k): fraction of time that the surfer spends at the k-th site $p(k) = \sum_i p(i) B_{i,k}$, i.e., $p = B^T p$
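The fixed point $p = B^T p$ can be approximated by repeatedly applying the update $p(k) \leftarrow \sum_i p(i) B_{i,k}$. The sketch below is illustrative only: the function name `stationary_distribution`, the tolerance `tol`, and the 3-site matrix are hypothetical, and convergence is assumed to hold because this example graph is strongly connected and aperiodic.

```python
def stationary_distribution(B, tol=1e-12, max_iter=10000):
    """Iterate p <- B^T p from the uniform distribution until p stops changing."""
    n = len(B)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        # p(k) = sum_i p(i) * B[i][k]
        new_p = [sum(p[i] * B[i][k] for i in range(n)) for k in range(n)]
        if sum(abs(x - y) for x, y in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p

# Hypothetical transition matrix for the graph 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1.
B = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
p = stationary_distribution(B)
```

For this matrix the surfer spends about 40% of the time at sites 1 and 3 and about 20% at site 2, which can be checked by hand from $p = B^T p$.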
Adding Self Loop Allow the surfer to decide to stay in the same place: $B' = \alpha B + (1 - \alpha) I$
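A short check of what the self loop does to the stationary distribution, assuming the self-loop matrix has the form $B' = \alpha B + (1 - \alpha) I$ with $0 < \alpha < 1$ (the symbol $\alpha$ is an assumption here) and that $p$ satisfies $p = B^T p$:

```latex
\begin{align*}
B'^{T} p &= \bigl(\alpha B + (1-\alpha) I\bigr)^{T} p \\
         &= \alpha B^{T} p + (1-\alpha) p \\
         &= \alpha p + (1-\alpha) p = p
\end{align*}
```

Under these assumptions the self loop leaves the stationary distribution unchanged; what it changes is that every site gets a positive probability of staying put, which removes periodicity from the walk.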