
CS473: Link Analysis, Luo Si, Department of Computer Science, Purdue University



  1. CS473: Link Analysis Luo Si Department of Computer Science Purdue University Borrowed Slides from Prof. Rong Jin (MSU)

  2. Citation Analysis

  3. Web Structure
     The Web is a graph
     – Each web site corresponds to a node
     – A link from one site to another site forms a directed edge
     What does it look like?
     • The Web is a small world
     • The diameter of the Web is 19, i.e. the average number of clicks from one web site to another is 19

  4. Bowtie Structure Strongly Connected Component Broder et al., 2001

  5. Bowtie Structure Sites that link towards the ‘center’ of the web Broder et al., 2001

  6. Bowtie Structure Sites that link from the ‘center’ of the web Broder et al., 2001

  7. Inlinks and Outlinks: The degrees of both incoming and outgoing links follow a power law (Broder et al., 2001)

  8. Early Approaches
     Basic Assumptions
     • Hyperlinks contain information about the human judgment of a site
     • The more incoming links a site has, the more important it is judged to be
     Bray 1996
     • The visibility of a site is measured by the number of other sites pointing to it
     • The luminosity of a site is measured by the number of other sites to which it points
     Limitation: failure to capture the relative importance of different parent (child) sites
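
The two counts defined on this slide are simply in-degree and out-degree. A minimal Python sketch, assuming the link structure is given as a list of `(source, target)` pairs (the function name and the toy graph are illustrative, not from the slides):

```python
from collections import defaultdict

def visibility_and_luminosity(edges):
    """Count in-links (visibility) and out-links (luminosity) per site."""
    visibility = defaultdict(int)
    luminosity = defaultdict(int)
    for source, target in set(edges):  # set() drops duplicate links
        luminosity[source] += 1
        visibility[target] += 1
    return dict(visibility), dict(luminosity)

# Hypothetical toy graph: A and B both link to C, and C links to D.
edges = [("A", "C"), ("B", "C"), ("C", "D")]
vis, lum = visibility_and_luminosity(edges)
```

Every in-link counts equally here, regardless of which site it comes from; this is the limitation the slide points out.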

  9. HITS - Kleinberg’s Algorithm
     • HITS – Hypertext Induced Topic Selection
     • For each vertex v ∈ V in a subgraph of interest: a(v) - the authority of v, h(v) - the hubness of v
     • A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites
     • Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites

  10. Authority and Hubness
      [Figure: sites 2, 3 and 4 link to site 1; site 1 links to sites 5, 6 and 7]
      h(1) = a(5) + a(6) + a(7)
      a(1) = h(2) + h(3) + h(4)

  11. Authority and Hubness: Version 1
      Recursive dependency:
        $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
        $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
      HubsAuthorities(G)
        $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
        $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
        $t \leftarrow 0$
        repeat
          $t \leftarrow t + 1$
          for each v in V
            do $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
               $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
        until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \varepsilon$
        return $(a_t, h_t)$

  12. Authority and Hubness: Version 1 (same recursive dependency and HubsAuthorities(G) procedure as on the previous slide) Problems?

  13. Authority and Hubness: Version 2
      Recursive dependency:
        $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
        $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
      + Normalization:
        $a(v) \leftarrow a(v) / \sum_{w} a(w)$
        $h(v) \leftarrow h(v) / \sum_{w} h(w)$
      HubsAuthorities(G)
        $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
        $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
        $t \leftarrow 0$
        repeat
          $t \leftarrow t + 1$
          for each v in V
            do $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
               $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
          $a_t \leftarrow a_t / \|a_t\|$
          $h_t \leftarrow h_t / \|h_t\|$
        until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \varepsilon$
        return $(a_t, h_t)$
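
The normalized iteration can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the slides' exact procedure: it takes the graph as a list of `(source, target)` links, normalizes with the Euclidean norm, measures change with an L1 distance, and computes hub scores from the freshly updated authority scores (a common ordering that avoids oscillation between even and odd steps). The names `hits`, `eps` and `max_iter` are hypothetical.

```python
import math
from collections import defaultdict

def hits(edges, eps=1e-10, max_iter=1000):
    """Return (authority, hubness) dicts for a directed graph."""
    parents = defaultdict(set)   # pa[v]: sites that link to v
    children = defaultdict(set)  # ch[v]: sites that v links to
    nodes = set()
    for src, dst in edges:
        children[src].add(dst)
        parents[dst].add(src)
        nodes.update((src, dst))

    def normalize(scores):
        # Scale to unit Euclidean length (assumption: L2 norm).
        norm = math.sqrt(sum(x * x for x in scores.values()))
        return {v: (x / norm if norm > 0 else 0.0) for v, x in scores.items()}

    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # Authority: sum of hub scores of the parents.
        new_a = normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        # Hubness: sum of (updated) authority scores of the children.
        new_h = normalize({v: sum(new_a[w] for w in children[v]) for v in nodes})
        change = sum(abs(new_a[v] - a[v]) + abs(new_h[v] - h[v]) for v in nodes)
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h

# Hypothetical toy graph: A -> C, B -> C, B -> D.
authority, hubness = hits([("A", "C"), ("B", "C"), ("B", "D")])
```

In this toy graph C ends up with the highest authority (two parents, one of them the strongest hub) and B with the highest hubness (it links to both authorities).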

  14. HITS Example Results [Figure: authority and hubness weights for nodes 1 to 15]

  15. Authority and Hubness
      Authority score
      – Depends not only on the number of incoming links
      – But also on the ‘quality’ (e.g., hubness) of the incoming links
      Hubness score
      – Depends not only on the number of outgoing links
      – But also on the ‘quality’ (e.g., authority) of the outgoing links

  16. Authority and Hub
      Column vector a: $a_i$ is the authority score for the i-th site
      Column vector h: $h_i$ is the hub score for the i-th site
      Matrix M: $M_{i,j} = \begin{cases} 1 & \text{if the } i\text{-th site points to the } j\text{-th site} \\ 0 & \text{otherwise} \end{cases}$
      [Example matrix M shown on the slide]

  17. Authority and Hub
      Vector a: $a_i$ is the authority score for the i-th site
      Vector h: $h_i$ is the hub score for the i-th site
      Matrix M: $M_{i,j} = \begin{cases} 1 & \text{if the } i\text{-th site points to the } j\text{-th site} \\ 0 & \text{otherwise} \end{cases}$
      • Recursive dependency:
        $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
        $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$

  18. Authority and Hub
      Column vector a: $a_i$ is the authority score for the i-th site
      Column vector h: $h_i$ is the hub score for the i-th site
      Matrix M: $M_{i,j} = \begin{cases} 1 & \text{if the } i\text{-th site points to the } j\text{-th site} \\ 0 & \text{otherwise} \end{cases}$
      • Recursive dependency:
        $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$, i.e. $a = M^T h$
        $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$, i.e. $h = M a$

  19. Authority and Hub
      Column vector a: $a_i$ is the authority score for the i-th site
      Column vector h: $h_i$ is the hub score for the i-th site
      Matrix M: $M_{i,j} = \begin{cases} 1 & \text{if the } i\text{-th site points to the } j\text{-th site} \\ 0 & \text{otherwise} \end{cases}$
      • Recursive dependency, followed by the normalization procedure:
        $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$, i.e. $a_t = M^T h_{t-1}$
        $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$, i.e. $h_t = M a_{t-1}$

  20. Authority and Hub
      $a_t = M^T h_{t-1}$ and $h_t = M a_{t-1}$ give (up to normalization)
        $a_t = M^T M \, a_{t-2}$
        $h_t = M M^T \, h_{t-2}$
      Apply singular value decomposition to matrix M:
        $M = U \Sigma V^T = \sum_i \sigma_i u_i v_i^T$
        $a \propto v_1$, $h \propto u_1$
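
The link between the iteration and the singular vectors can be spelled out as a short derivation. It assumes the convention $M = U \Sigma V^T$ with orthonormal $u_i$, $v_i$ and singular values $\sigma_1 > \sigma_2 \ge \dots \ge 0$, and a starting vector $a_0$ with $v_1^T a_0 \neq 0$ (both are assumptions of this sketch):

```latex
\begin{aligned}
M^T M &= V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T = \sum_i \sigma_i^2 \, v_i v_i^T, \\
(M^T M)^k a_0 &= \sum_i \sigma_i^{2k} \, (v_i^T a_0) \, v_i
  = \sigma_1^{2k} \Big[ (v_1^T a_0) \, v_1
  + \sum_{i \ge 2} \Big( \tfrac{\sigma_i}{\sigma_1} \Big)^{2k} (v_i^T a_0) \, v_i \Big].
\end{aligned}
```

Since $(\sigma_i / \sigma_1)^{2k} \to 0$ for $i \ge 2$, the normalized authority vector tends to $v_1$. The same argument applied to $M M^T = U \Sigma^2 U^T$ shows that the normalized hub vector tends to $u_1$.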

  21. PageRank
      Introduced by Page et al. (1998)
      – The weight is assigned by the rank of parents
      Difference from HITS
      – HITS uses Hubness & Authority weights
      – The rank of a page is proportional to its parents’ rank, but inversely proportional to its parents’ outdegree

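  22. Matrix Notation
      $M_{i,j} = \begin{cases} 1 & \text{if the } i\text{-th site points to the } j\text{-th site} \\ 0 & \text{otherwise} \end{cases}$
      $B_{i,j} = \begin{cases} \dfrac{1}{\sum_k M_{i,k}} M_{i,j} & \text{if } \sum_k M_{i,k} > 0 \\ 0 & \text{otherwise} \end{cases}$
      [Example matrices M and B shown on the slide]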
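
Building B from M amounts to dividing each row by its outdegree. A minimal sketch, assuming M is a list of 0/1 rows (the function name and the example matrix are illustrative):

```python
def row_normalize(M):
    """Divide each row of the link matrix M by its outdegree.

    Rows with no outgoing links are left as all zeros.
    """
    B = []
    for row in M:
        outdegree = sum(row)
        if outdegree > 0:
            B.append([x / outdegree for x in row])
        else:
            B.append([0.0] * len(row))
    return B

# Hypothetical 3-site example: site 0 links to sites 1 and 2,
# site 1 links to site 2, and site 2 has no outgoing links.
M = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
B = row_normalize(M)
```

Each nonzero row of B sums to 1, so B can be read as the probability of moving from site i to site j in one click.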

  23. Matrix Notation
      r: $r_i$ represents the rank score for the i-th web page
      $r = \alpha B^T r$
      α: scaling constant (the reciprocal of the eigenvalue)
      r: eigenvector of $B^T$
      Finding PageRank → finding the principal eigenvector of $B^T$

  24. Matrix Notation

  25. Random Walk Model Consider a random walk through the Web graph B = ? ? ? ? ?

  26. Random Walk Model Consider a random walk through the Web graph B =

  27. Random Walk Model Consider a random walk through the Web graph B =

  28. Random Walk Model Consider a random walk through the Web graph with transition matrix B. As $T \to \infty$, what portion of time will the surfer spend on each site?

  29. Random Walk Model
      Consider a random walk through the Web graph with transition matrix B
      p(k): percentage of time that the surfer will stay at the k-th site
      $p(k) = \sum_i p(i) \, B_{i,k}$
      $p = B^T p$
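
The fixed point $p = B^T p$ can be approximated by repeatedly applying $B^T$ to a starting distribution. A minimal sketch, assuming B is row-stochastic and the walk is irreducible and aperiodic so that the iteration converges (the function name, tolerance and example matrix are illustrative):

```python
def stationary_distribution(B, tol=1e-12, max_iter=10000):
    """Iterate p <- B^T p from the uniform distribution until it settles."""
    n = len(B)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        # p_new(k) = sum_i p(i) * B[i][k]
        new_p = [sum(p[i] * B[i][k] for i in range(n)) for k in range(n)]
        change = sum(abs(x - y) for x, y in zip(new_p, p))
        p = new_p
        if change < tol:
            break
    return p

# Hypothetical 3-site walk: site 0 goes to 1 or 2 with equal probability,
# site 1 always goes to 0, site 2 goes to 0 or 1 with equal probability.
B = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0]]
p = stationary_distribution(B)
```

For this example, solving $p = B^T p$ by hand gives $p = (4/9, 3/9, 2/9)$, which the iteration approaches.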

  30. Adding Self Loop
      Allow the surfer to decide to stay in the same place
      $B' = \alpha B + (1 - \alpha) I$
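
A short worked equation shows what the self loop does to the stationary distribution. It assumes the update has the form $B' = \alpha B + (1 - \alpha) I$ with $0 < \alpha \le 1$, where $\alpha$ is read as the probability of following a link rather than staying put:

```latex
\begin{aligned}
p = B'^T p
  &\iff p = \alpha B^T p + (1 - \alpha)\, p \\
  &\iff \alpha\, p = \alpha B^T p \\
  &\iff p = B^T p .
\end{aligned}
```

Under these assumptions the self loop leaves the stationary distribution unchanged. What it changes is the walk itself: every site can now be revisited in one step, which makes the chain aperiodic and helps the iteration converge.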
