The PageRank Algorithm and Web Search Engines, by John Orr


  1. The PageRank Algorithm and Web Search Engines. John Lindsay Orr, University of Nebraska-Lincoln, April 2010, jorr@math.unl.edu. Outline: Introduction, PageRank, Computation, Further issues.

  2. What is PageRank? PageRank is an algorithm for ranking the importance of webpages. It was developed in the late 1990s by Larry Page and Sergey Brin, at that time grad students at Stanford.

  3. References
     Brin and Page, The anatomy of a large-scale hypertextual web search engine, 1998.
     Page, Brin, Motwani, and Winograd, The PageRank citation ranking, 1998.
     Bonato, A course on the web graph, AMS 2008.
     Bryan and Leise, The $25,000,000,000 eigenvector, SIAM Review 2006.

  4. The job of a search engine. The job of a search engine is to receive queries and return a usable list of relevant matches within a reasonable time.

  7. What is the web? The web is a distributed, linked collection of documents. This isn’t as obvious as it sounds:
     HTML or other content types?
     Static or dynamic?
     HTTP(S) or other protocols?
     Public or restricted?

  8. The web is big: But how big? It’s hard to tell how big, because estimates vary wildly and are constantly changing. What counts as a web page: a URL, or the content returned? The “surface web” or the “deep web”? Google (2008) claimed to have identified 1 trillion URLs, but they only index a fraction of those. The size of the “indexed web” today is probably measured in the tens of billions.

  9. The web is big: Simple evidence. A Google query on *a* finds over 25 billion results. A breadth-first search rooted at http://www.math.unl.edu found 21,000 internal pages. What percentage of UNL is the Math Dept? What percentage of the web is UNL? Surely $20{,}000 \times 50 \times 10{,}000 = 10^{10}$ is a huge underestimate.

  10. How does a search engine work? (Diagram on the original slide; not reproduced here.)

  11. The need for ranking. A Google query on “cat” found 591,000,000 results. A search for “PageRank” got 81,000,000. Simple ranking criteria:
      1. Word/term frequency
      2. Word/term context (h1, h2, strong, etc.)
      3. Back-link counts
      All very vulnerable to SEO spamming.

  12. Link analysis. PageRank, and other ranking algorithms such as HITS, use global link analysis.

  13. PageRank: The goal. Let $W$ be the web-graph. Vertices are pages, and there is a directed edge from $u$ to $v$ if a hyperlink, <a href="...">cat</a>, is found in $u$, pointing to $v$. (Ignore multiple links and loops.) Let $n = |W|$ ($n \sim 10^{10}$). Seek a single vector $r \in \mathbb{R}^n$, with
      1. $r_i \ge 0$
      2. $\|r\|_1 = 1$ (i.e., stochastic),
      where each $r_i$ represents the relative importance of page $v_i$.
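
As a rough illustration of the web-graph just described, the sketch below stores $W$ as a 0/1 matrix whose $(i, j)$ entry is 1 when page $v_i$ links to page $v_j$. The function name `adjacency_matrix` and the toy page and link lists are hypothetical, chosen only for this example; a real crawl would need a sparse representation.

```python
import numpy as np

def adjacency_matrix(pages, links):
    """Build the 0/1 adjacency matrix of a small web graph.

    pages: list of page identifiers (the vertices).
    links: iterable of (u, v) pairs, one per hyperlink found in u pointing to v.
    Repeated links collapse to a single edge and self-links (loops) are dropped.
    """
    index = {page: i for i, page in enumerate(pages)}
    A = np.zeros((len(pages), len(pages)), dtype=int)
    for u, v in links:
        if u != v:                      # ignore loops
            A[index[u], index[v]] = 1   # assignment (not +=) ignores multiple links
    return A

# Hypothetical toy crawl: four pages, one duplicated link and one self-link.
pages = ["a", "b", "c", "d"]
links = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "d"),
         ("c", "a"), ("c", "b"), ("d", "d")]
A = adjacency_matrix(pages, links)
print(A)
```

The duplicated link from a to b contributes a single edge, and the self-link on d is discarded, so d ends up with no outlinks.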

  17. What’s important? A page is important if a lot of important pages cite it. A first attempt:
      $$r_i = \sum_{v_j \to v_i} r_j$$
      Weighting each citing page by the reciprocal of its out-degree $d_j^+$ gives
      $$r_i = \sum_{v_j \to v_i} \frac{1}{d_j^+}\, r_j$$

  18. What’s important? Let $A$ be the adjacency matrix of the directed graph $W$ (i.e., $a_{i,j} = 1$ if $v_i \to v_j$, otherwise zero). Let $D = \mathrm{diag}(d_1^+, \ldots, d_n^+)$. Let $A_0 = D^{-1}A$ (allowing for non-invertibility). Then $r = rA_0$. In other words, find an eigenvector (the eigenvector?) of $A_0$ for $\lambda = 1$.

  19. Example. (Figure on the original slide: a directed graph on the vertices a, b, c, d.)
      $$A_0 = \begin{pmatrix} 0 & \tfrac13 & \tfrac13 & \tfrac13 \\ 0 & 0 & 0 & 0 \\ \tfrac12 & \tfrac12 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$
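
The example matrix can be reproduced with a few lines of NumPy. This is a minimal sketch that assumes the rows and columns are ordered a, b, c, d and that the edges are a → b, c, d and c → a, b (which is what the nonzero pattern of $A_0$ implies); the helper name `row_normalize` is hypothetical. Rows belonging to sinks are left as zero, which is one way to read "allowing for non-invertibility".

```python
import numpy as np

# Assumed adjacency matrix of the example, vertices ordered a, b, c, d.
A = np.array([[0, 1, 1, 1],
              [0, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)

def row_normalize(A):
    """Return A_0 = D^{-1} A, leaving the all-zero rows of sinks as zero."""
    out_degree = A.sum(axis=1)                          # d_i^+ for each vertex
    safe = np.where(out_degree > 0, out_degree, 1.0)    # avoid dividing by zero
    return A / safe[:, None]

A0 = row_normalize(A)
print(A0)
```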

  20. Problems: Sinks. There are sure to be sinks in $W$. If $W$ is a chain then
      $$A_0 = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & 1 \\ 0 & 0 & \cdots & 0 & 0 \end{pmatrix}$$
      which is nilpotent, and so $\mathrm{sp}(A_0) = \{0\}$. I.e., nonzero solutions to $rA_0 = r$ do not exist.
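
A quick numerical check of the chain example, as a sketch for a hypothetical chain of length $n = 5$: the $n$-th power of $A_0$ vanishes, and $A_0 - I$ has full rank, so $r(A_0 - I) = 0$ forces $r = 0$.

```python
import numpy as np

n = 5                                   # chain v_1 -> v_2 -> ... -> v_n
A0 = np.diag(np.ones(n - 1), k=1)       # ones on the superdiagonal; v_n is a sink

# A0 is nilpotent: its n-th power is the zero matrix.
An = np.linalg.matrix_power(A0, n)

# r A0 = r means r (A0 - I) = 0; if A0 - I has full rank, only r = 0 works.
rank = np.linalg.matrix_rank(A0 - np.eye(n))
print(An.any(), rank)
```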

  21. Problems: Connectedness. $W$ is not strongly connected or even connected.
      $$A_0 = \begin{pmatrix} A' & * \\ 0 & A'' \end{pmatrix}$$
      The multiplicity of $\lambda = 1$ is greater than 1. I.e., solutions to $rA_0 = r$ are not unique.
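
To see the non-uniqueness concretely, here is a small sketch using a hypothetical graph made of two disjoint 2-cycles (so the off-diagonal block is zero and there are no sinks). Two different stochastic vectors both satisfy $r = rA_0$, and the solution space has dimension 2.

```python
import numpy as np

# Two disjoint 2-cycles: v_1 <-> v_2 and v_3 <-> v_4.
A0 = np.array([[0, 1, 0, 0],
               [1, 0, 0, 0],
               [0, 0, 0, 1],
               [0, 0, 1, 0]], dtype=float)

r1 = np.array([0.5, 0.5, 0.0, 0.0])     # all importance on the first component
r2 = np.array([0.0, 0.0, 0.5, 0.5])     # all importance on the second component

# Both vectors are nonnegative, sum to 1, and satisfy r = r A0.
both_fixed = np.allclose(r1 @ A0, r1) and np.allclose(r2 @ A0, r2)

# Dimension of the solution space of r (A0 - I) = 0.
dim = 4 - np.linalg.matrix_rank(A0 - np.eye(4))
print(both_fixed, dim)
```

Any convex combination of `r1` and `r2` is also a solution, so the link structure alone gives no way to compare pages in different components.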

  22. Random surfer model. Imagine a (finite state, discrete time, time-homogeneous) Markov process on $W$. At each step the surfer clicks a link uniformly at random from the links on her current page. If the page has no outlinks, pick a page uniformly at random from $W$. The transition probabilities for this process are
      $$A_1 = A_0 + \frac{1}{n} z^T \mathbf{1}$$
      where $z$ is the indicator vector for the sinks ($z_i = 1$ if $d_i^+ = 0$ and is 0 otherwise), and $\mathbf{1} = (1, 1, \ldots, 1)$.

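  23. Example. (Figure on the original slide: the directed graph on the vertices a, b, c, d.)
      $$A_1 = \begin{pmatrix} 0 & \tfrac13 & \tfrac13 & \tfrac13 \\ 0 & 0 & 0 & 0 \\ \tfrac12 & \tfrac12 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} + \frac14 \begin{pmatrix} 0 \\ 1 \\ 0 \\ 1 \end{pmatrix} [1, 1, 1, 1] = \begin{pmatrix} 0 & \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac14 & \tfrac14 & \tfrac14 & \tfrac14 \\ \tfrac12 & \tfrac12 & 0 & 0 \\ \tfrac14 & \tfrac14 & \tfrac14 & \tfrac14 \end{pmatrix}$$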
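
The sink correction in this example can be checked numerically. The sketch below assumes the vertex order a, b, c, d, detects sinks as all-zero rows of $A_0$, and forms $A_1 = A_0 + \frac{1}{n} z^T \mathbf{1}$ with an outer product; the variable names are illustrative only.

```python
import numpy as np

A0 = np.array([[0,   1/3, 1/3, 1/3],
               [0,   0,   0,   0  ],
               [1/2, 1/2, 0,   0  ],
               [0,   0,   0,   0  ]])
n = A0.shape[0]

z = (A0.sum(axis=1) == 0).astype(float)   # indicator of sinks: rows 2 and 4 (b and d)
ones = np.ones(n)

# A1 = A0 + (1/n) z^T 1: each sink row is replaced by the uniform distribution.
A1 = A0 + np.outer(z, ones) / n
print(A1)
print(A1.sum(axis=1))                     # every row sums to 1
```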

  24. Random surfer model. The transition matrix
      $$A_1 = A_0 + \frac{1}{n} z^T \mathbf{1} = D^{-1}A + \frac{1}{n} z^T \mathbf{1}$$
      is a row-stochastic matrix.

  25. Random surfer model. The stationary distribution of the process is the long-term proportion of the time that the surfer will spend on each page. If $p = (p_i)$ is the stationary distribution then $p = pA_1$, and so we are still seeking an eigenvector for $\lambda = 1$, but now of our modified matrix, $A_1$.
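
For the four-page example, the stationary distribution can be approximated by repeatedly applying $p \leftarrow pA_1$. This is only a sketch: it relies on this particular $A_1$ being irreducible and aperiodic, which is not guaranteed for a general web graph, and the iteration count of 100 is an arbitrary generous choice.

```python
import numpy as np

A1 = np.array([[0,   1/3, 1/3, 1/3],
               [1/4, 1/4, 1/4, 1/4],
               [1/2, 1/2, 0,   0  ],
               [1/4, 1/4, 1/4, 1/4]])

p = np.full(4, 0.25)            # start from the uniform distribution
for _ in range(100):            # fixed iteration count, chosen generously
    p = p @ A1                  # one step of the surfer: p <- p A1

print(p)                        # approximately (9, 12, 8, 8) / 37
```

Solving $p = pA_1$ by hand for this matrix gives $p = (9, 12, 8, 8)/37$, so page b, which is linked from both a and c, receives the largest share.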

  26. Stochastic matrices. Lemma. If $S$ is a (row) stochastic matrix then $\lambda = 1$ is an eigenvalue. Proof. $S\mathbf{1}^T = \mathbf{1}^T$. In other words, $\mathbf{1}^T$ is a right eigenvector, and so there must exist left eigenvectors too.
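
The lemma is easy to check numerically. The sketch below builds a hypothetical random row-stochastic matrix, confirms $S\mathbf{1}^T = \mathbf{1}^T$, and then recovers a left eigenvector for $\lambda = 1$ from the eigen-decomposition of $S^T$ (a matrix and its transpose have the same eigenvalues).

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.random((5, 5))
S /= S.sum(axis=1, keepdims=True)       # normalise rows: S is row-stochastic

ones = np.ones(5)
right_ok = np.allclose(S @ ones, ones)  # S 1^T = 1^T

# Right eigenvectors of S^T are left eigenvectors of S.
eigvals, eigvecs = np.linalg.eig(S.T)
k = np.argmin(np.abs(eigvals - 1))      # index of the eigenvalue closest to 1
x = np.real(eigvecs[:, k])
left_ok = np.allclose(x @ S, x)         # x S = x
print(right_ok, left_ok)
```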

  27. Perron’s Theorem. Theorem. Let $P > 0$ and let $\rho$ be the spectral radius of $P$. Then
      1. $\rho$ is positive and is an eigenvalue of $P$,
      2. $\rho$ has left and right eigenvectors with positive entries,
      3. $\rho$ has algebraic and geometric multiplicity 1, and
      4. all the other eigenvalues are less than $\rho$ in magnitude.
      Proof. Find a fixed point of $Px / \|Px\|_1$ on $\{x : x_i \ge 0,\ \sum_i x_i = 1\}$ ...
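
The map $x \mapsto Px / \|Px\|_1$ from the proof sketch can also be iterated numerically. The following is an illustration on a hypothetical $2 \times 2$ positive matrix, not a proof: the iteration settles on a positive vector $x$ with $Px = \rho x$, and $\rho$ agrees with the spectral radius computed by `np.linalg.eigvals`.

```python
import numpy as np

P = np.array([[2.0, 1.0],
              [1.0, 3.0]])              # every entry positive

x = np.full(2, 0.5)                     # start inside the simplex
for _ in range(200):
    y = P @ x
    x = y / y.sum()                     # y > 0, so y.sum() equals ||y||_1

rho = (P @ x).sum()                     # at a fixed point, P x = rho x with sum(x) = 1
spectral_radius = np.max(np.abs(np.linalg.eigvals(P)))
print(rho, spectral_radius, x)
```

For this matrix the eigenvalues are $(5 \pm \sqrt{5})/2$, so the iteration should return $\rho = (5 + \sqrt{5})/2$.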

  28. Perron’s Theorem: Stochastic matrices. So if $P$ is a positive row-stochastic matrix, and $x$ is a positive left eigenvector for $\rho$, then
      $$\|x\|_1 = x\mathbf{1}^T = x(P\mathbf{1}^T) = (xP)\mathbf{1}^T = \rho\, x\mathbf{1}^T = \rho \|x\|_1$$
      and so $\rho = 1$.

  29. But there’s still a problem... Our transition matrix
      $$A_1 = D^{-1}A + \frac{1}{n} z^T \mathbf{1}$$
      isn’t positive. (If $A_1$ were irreducible we could use the Perron-Frobenius Theorem.) It’s the same issue as before: failure of (strong) connectedness.
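
One way to test whether a given transition matrix is irreducible is the standard criterion that a nonnegative $n \times n$ matrix $M$ is irreducible exactly when $(I + M)^{n-1}$ has no zero entries. The sketch below applies it to two small matrices; the function name `is_irreducible` and both test matrices are assumptions made for illustration.

```python
import numpy as np

def is_irreducible(M):
    """True if the directed graph of nonzero entries of M is strongly connected.

    Checks that (I + M)^(n-1) has no zero entries, working with 0/1
    patterns rather than the actual values to avoid overflow.
    """
    n = M.shape[0]
    step = np.eye(n, dtype=bool) | (M > 0)      # pattern of I + M
    reach = step.copy()                         # paths of length <= 1
    for _ in range(n - 2):                      # extend to length <= n - 1
        reach = (reach.astype(int) @ step.astype(int)) > 0
    return bool(reach.all())

# A1 for the four-page example: not positive, but irreducible.
A1_example = np.array([[0,   1/3, 1/3, 1/3],
                       [1/4, 1/4, 1/4, 1/4],
                       [1/2, 1/2, 0,   0  ],
                       [1/4, 1/4, 1/4, 1/4]])

# Two disjoint 2-cycles: no sinks, so A1 = A0, and the graph is disconnected.
A1_cycles = np.array([[0, 1, 0, 0],
                      [1, 0, 0, 0],
                      [0, 0, 0, 1],
                      [0, 0, 1, 0]], dtype=float)

print(is_irreducible(A1_example), is_irreducible(A1_cycles))
```

The second matrix shows that the sink correction alone does not repair connectedness: with no sinks present, $A_1$ inherits the disconnected structure of $A_0$.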
