1998: Enter Link Analysis

  1. 1998: Enter Link Analysis
     • Uses hyperlink structure to focus the relevant set
     • Combines traditional IR score with popularity score
     • Page and Brin, Kleinberg (1998)

  7. Web Information Retrieval
     IR before the Web = traditional IR
     IR on the Web = web IR
     How is the Web different from other document collections?
     • It’s huge.
       – over 10 billion pages, average page size of 500KB
       – 20 times the size of the Library of Congress print collection
       – Deep Web: 400 times bigger than the Surface Web
     • It’s dynamic.
       – content changes: 40% of pages change in a week, 23% of .com pages change daily
       – size changes: billions of pages added each year
     • It’s self-organized.
       – no standards, review process, formats
       – errors, falsehoods, link rot, and spammers!
     A Herculean Task!

  8. Web Information Retrieval (continued)
     How is the Web different from other document collections?
     • Ah, but it’s hyperlinked!
       – Vannevar Bush’s 1945 memex

  9. Elements of a Web Search Engine
     [Diagram. Components shown: WWW, Crawler Module, Page Repository, Indexing Module, Indexes (Content Index, Structure Index, Special-purpose indexes), Query Module, Ranking Module. Users submit Queries and receive Results. Part of the pipeline is labeled query-independent.]

  14. The Ranking Module (generates popularity scores)
      • Measure the importance of each page
      • The measure should be
        – Independent of any query
        – Primarily determined by the link structure of the Web
        – Tempered by some content considerations
      • Compute these measures off-line long before any queries are processed
      • Google’s PageRank© technology distinguishes it from all competitors
      Google’s PageRank = Google’s $$$$$

  15. Google’s PageRank (Lawrence Page & Sergey Brin 1998)
      The Google Goals
      • Create a PageRank r(P) that is not query dependent
        ⊲ Off-line calculations
          – No query-time computation
      • Let the Web vote with in-links
        ⊲ But not by simple link counts
          – One link to P from Yahoo! is important
          – Many links to P from me is not
      • Share The Vote
        ⊲ Yahoo! casts many “votes”
          – value of vote from Yahoo! is diluted
        ⊲ If Yahoo! “votes” for n pages
          – Then P receives only r(Y)/n credit from Y
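To make the vote-sharing rule concrete, here is a small worked example. The numbers are invented purely for illustration and do not come from the slides: suppose Yahoo! currently has rank $r(Y) = 0.3$ and links to $n = 3$ pages, one of which is $P$.

```latex
\[
  \text{credit to } P \text{ from } Y \;=\; \frac{r(Y)}{n} \;=\; \frac{0.3}{3} \;=\; 0.1
\]
```

If Yahoo! instead linked to 30 pages, the same rank would be split thirty ways and $P$ would receive only $0.01$, which is the dilution the slide describes.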

  23. PageRank
      The Definition
      $B_{P_i}$ = {all pages pointing to $P_i$}, $|P|$ = number of out-links from $P$
      $$r(P_i) = \sum_{P \in B_{P_i}} \frac{r(P)}{|P|}$$
      Successive Refinement
      Start with $r_0(P_i) = 1/n$ for all pages $P_1, P_2, \ldots, P_n$
      Iteratively refine rankings for each page
      $$r_1(P_i) = \sum_{P \in B_{P_i}} \frac{r_0(P)}{|P|}$$
      $$r_2(P_i) = \sum_{P \in B_{P_i}} \frac{r_1(P)}{|P|}$$
      $$\vdots$$
      $$r_{j+1}(P_i) = \sum_{P \in B_{P_i}} \frac{r_j(P)}{|P|}$$
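The successive refinement on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `refine`, the dictionary representation, and the three-page graph `TINY` are all assumptions made for the example, and the sketch assumes every page has at least one out-link.

```python
def refine(out_links, steps):
    """Apply `steps` rounds of the refinement r_{j+1}(P_i) = sum r_j(P)/|P|.

    `out_links` maps each page to the list of pages it links to
    (a hypothetical representation chosen for this sketch).
    """
    pages = sorted(out_links)
    n = len(pages)
    # r_0(P_i) = 1/n for every page
    r = {p: 1.0 / n for p in pages}
    for _ in range(steps):
        nxt = {p: 0.0 for p in pages}
        for q, targets in out_links.items():
            # Page q shares its current rank equally among its |q| out-links,
            # which is the same as summing r(q)/|q| over the in-links of each target.
            for p in targets:
                nxt[p] += r[q] / len(targets)
        r = nxt
    return r


# Hypothetical three-page web: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
TINY = {1: [2, 3], 2: [3], 3: [1]}

if __name__ == "__main__":
    for step in (0, 1, 2, 50):
        print(step, refine(TINY, step))
```

On this particular graph the values settle down after a few dozen rounds; in general the iteration need not converge, which is why the limit is stated conditionally in the matrix formulation.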

  26. In Matrix Notation
      After Step k
      – $\pi_k^T = [\, r_k(P_1),\ r_k(P_2),\ \ldots,\ r_k(P_n) \,]$
      – $\pi_{k+1}^T = \pi_k^T H$ where $h_{ij} = \begin{cases} 1/|P_i| & \text{if } i \to j \\ 0 & \text{otherwise} \end{cases}$
      – PageRank vector = $\pi^T = \lim_{k \to \infty} \pi_k^T$ = eigenvector for $H$
      Provided that the limit exists
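The matrix form can be sketched the same way: build $H$ from the link structure, then repeatedly left-multiply the row vector by $H$. The helper names (`build_h`, `left_multiply`, `power_iterate`), the tolerance, and the three-page example graph are assumptions for illustration, not part of the slides; plain Python lists are used so the sketch has no dependencies.

```python
def build_h(out_links, n):
    """Build H with h_ij = 1/|P_i| if page i links to page j, else 0.

    Pages are assumed to be numbered 1..n (a convention of this sketch).
    """
    h = [[0.0] * n for _ in range(n)]
    for i, targets in out_links.items():
        for j in targets:
            h[i - 1][j - 1] = 1.0 / len(targets)
    return h


def left_multiply(pi, h):
    """Return the row vector pi^T H."""
    n = len(pi)
    return [sum(pi[i] * h[i][j] for i in range(n)) for j in range(n)]


def power_iterate(h, tol=1e-12, max_iter=1000):
    """Iterate pi_{k+1}^T = pi_k^T H from the uniform vector until it stops changing."""
    n = len(h)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = left_multiply(pi, h)
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    # The slide's caveat: the limit is not guaranteed to exist for every H.
    raise RuntimeError("no convergence within max_iter; the limit may not exist")


# Hypothetical three-page web: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
H = build_h({1: [2, 3], 2: [3], 3: [1]}, 3)
PI = power_iterate(H)

if __name__ == "__main__":
    print(H)
    print(PI)
```

The returned vector satisfies $\pi^T H \approx \pi^T$, which is what "eigenvector for $H$" means here: a left eigenvector with eigenvalue 1.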

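  28. Tiny Web
      [Figure: directed graph on six pages, numbered 1 to 6]
      Hyperlink matrix $H$, with rows and columns indexed by $P_1, \ldots, P_6$; only the row for $P_1$ is filled in on this slide:
      $$H = \begin{array}{c|cccccc}
          & P_1 & P_2 & P_3 & P_4 & P_5 & P_6 \\ \hline
      P_1 & 1/2 & 1/2 & 0 & 0 & 0 & 0 \\
      P_2 & & & & & & \\
      P_3 & & & & & & \\
      P_4 & & & & & & \\
      P_5 & & & & & & \\
      P_6 & & & & & &
      \end{array}$$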
