PageRank CS16: Introduction to Data Structures & Algorithms Spring 2020
Outline
‣ The WWW & Search Engines
‣ Basic PageRank
‣ (Real) PageRank
‣ PageRank in practice
The World Wide Web
‣ Created by Tim Berners-Lee in 1989
‣ Collection of “pages”
‣ Pages are
  ‣ identified by a Uniform Resource Locator (URL)
  ‣ composed of text & hyperlinks (pointers to other pages)
Hypertext
‣ Hypertext and hyperlinks predate the WWW
‣ Hypertext Editing System (HES) in 1967
  ‣ Ted Nelson, Andy van Dam + Brown students
‣ File Retrieval and Editing System (FRESS) in 1968
  ‣ Andy van Dam + Brown students (including Bob Wallace)
  ‣ used in Brown’s “Introduction to Poetry” in 1975 & 1976
‣ oN-Line System (NLS) in 1968
  ‣ Douglas Engelbart
Growth of the Web
[Figure: # of websites per year, 1991 to 2020; vertical axis from 0 to 1,800,000,000]
Growth of the Web
[Figure: # of websites per year, 1991 to 2000; vertical axis from 0 to 18,000,000; annotations: Yahoo (2K), Altavista (23K), Google (2.4M)]
Search Engines
‣ The Web is great, but how do we find what we need?
‣ Search engine
  ‣ system that indexes a collection of web pages
  ‣ returns relevant pages when queried with keyword(s)
‣ Q: how do we build a search engine?
Search Engines
‣ Idea #1
  ‣ build a dictionary that maps keywords to URLs
  ‣ use hash tables or binary search trees (see Lecture 05)
‣ What’s the problem with this approach?
  ‣ some keywords will have too many URLs to check
  ‣ let’s rank the pages by relevance!
‣ Q: how do we rank pages by relevance?
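Idea #1 can be sketched in a few lines. This is only an illustration, not how any real search engine is built: the `pages` dict, its URLs, and the `build_index` name are made up for the example, and Python's built-in `dict` stands in for the hash table.

```python
from collections import defaultdict

def build_index(pages):
    """Map each keyword to the set of URLs whose text contains it.

    `pages` is a hypothetical dict from URL to page text.
    """
    index = defaultdict(set)  # dict is a hash table under the hood
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

# Hypothetical toy pages, for illustration only.
pages = {
    "a.example/turtles": "sea turtle facts",
    "b.example/pets": "turtle care for pet owners",
    "c.example/birds": "bird facts",
}
index = build_index(pages)
print(sorted(index["turtle"]))
```

A lookup returns every matching URL in no particular order, which is exactly the problem the slide points out: for a popular keyword the set is far too large to check by hand.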
Search Engines
‣ Rank by frequency
  ‣ build a dictionary that maps keywords to URLs
  ‣ use hash tables or binary search trees (see end of Lecture 05)
  ‣ store URLs ranked by the # of times keyword appears in page
‣ Q: Is this a good idea?
  ‣ Why or why not?
Search Engines
‣ Rank by frequency
  ‣ build a dictionary that maps keywords to URLs
  ‣ use hash tables or binary search trees (see end of Lecture 05)
  ‣ store URLs ranked by the # of times keyword appears in page
‣ Q: Is this a good idea?
  ‣ Why or why not?
  ‣ Imagine searching for “turtle”
[Figure: a page that consists of nothing but the word “turtle” repeated over and over]
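A small sketch of why ranking by frequency is easy to game. The function name, the two pages, and their URLs are hypothetical; the point is only that a page stuffed with the keyword beats a genuinely relevant one.

```python
def rank_by_frequency(pages, keyword):
    """Return URLs containing `keyword`, most occurrences first."""
    counts = {}
    for url, text in pages.items():
        n = text.lower().split().count(keyword)
        if n > 0:
            counts[url] = n
    return sorted(counts, key=lambda url: counts[url], reverse=True)

# Hypothetical pages: one real article, one keyword-stuffed page.
pages = {
    "zoo.example/turtles": "the sea turtle is a reptile",
    "spam.example": "turtle " * 50,
}
ranking = rank_by_frequency(pages, "turtle")
print(ranking)
```

The stuffed page comes first because the score depends only on what a page says about itself, which its author fully controls.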
‣ Sergey Brin & Larry Page
  ‣ PhD students doing research in information retrieval
  ‣ noticed that links were important too!
  ‣ intuition that links conveyed information about importance
‣ But what exactly? And how can you make use of links?
PageRank
‣ How does PageRank work?
‣ Why does it work?
‣ How do you implement it efficiently?
‣ From Lecture 01: Google
  ‣ indexes “hundreds of billions” of pages
  ‣ answers and ranks in 0.5 seconds
  ‣ processes 40,000 queries a second
  ‣ 3.5 billion per day
‣ Using clever algorithms and data structures!
The PageRank Algorithm
‣ Views WWW as a directed graph G=(V,E)
  ‣ web pages are the vertices
  ‣ hyperlinks are the edges
‣ High-level idea
  ‣ algorithm works by rounds
  ‣ think of the pagerank of a page as some amount of fluid
  ‣ at each round a page pushes its pagerank/fluid to the pages it links to
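One possible way to hold the web graph in memory is an adjacency dict from each page to the pages it links to. The sketch below assumes that representation (the slides do not prescribe one) and derives the incoming-link lists from it; the three-page graph and the `incoming_edges` name are made up for illustration.

```python
def incoming_edges(out_edges):
    """Invert an adjacency dict: for each page, list the pages linking to it.

    Assumes every link target also appears as a key of `out_edges`.
    """
    in_edges = {v: [] for v in out_edges}
    for u, targets in out_edges.items():
        for v in targets:
            in_edges[v].append(u)
    return in_edges

# Hypothetical three-page web: X links to Y and Z, Y links to Z, Z links to X.
out_edges = {"X": ["Y", "Z"], "Y": ["Z"], "Z": ["X"]}
in_edges = incoming_edges(out_edges)
print(in_edges)
```

Building `in_edges` touches every vertex and every edge once, so it costs O(|V| + |E|).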
The Basic PageRank Algorithm
‣ At every round
  ‣ each vertex splits its PR evenly among its outgoing edges
  ‣ each vertex receives PR from all its incoming edges
  ‣ this is done using an update rule which is run on every vertex
‣ The update rule for Basic PageRank is:
  $$PR(v) = \sum_{u \in in(v)} \frac{PR(u)}{|out(u)|}$$
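A single round can be written in "push" form, which mirrors the fluid picture: every vertex divides its PR by its number of outgoing edges and sends that share along each one. This is a minimal sketch on a hypothetical three-vertex graph; `push_round` and the adjacency-dict representation are assumptions, not part of the slides.

```python
def push_round(out_edges, rank):
    """One round: every page pushes rank/|out| along each outgoing edge."""
    new_rank = {v: 0.0 for v in out_edges}
    for u, targets in out_edges.items():
        for v in targets:
            new_rank[v] += rank[u] / len(targets)
    return new_rank

# Hypothetical graph: X links to Y and Z, Y links to Z, Z links to X.
out_edges = {"X": ["Y", "Z"], "Y": ["Z"], "Z": ["X"]}
rank = {v: 1 / len(out_edges) for v in out_edges}  # start uniform
rank = push_round(out_edges, rank)
print(rank)
```

Pushing along outgoing edges and summing over incoming edges compute the same values; each edge (u, v) contributes PR(u)/|out(u)| to v either way.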
Basic PageRank: Example 1
[Figure: directed graph on vertices A, B, C, D, E, redrawn for each round with the pagerank of every vertex and the amount sent along every edge]
‣ Round 1: A = .2, B = .2, C = .2, D = .2, E = .2
‣ Round 2: A = .1, B = .4, C = .1, D = .4, E = 0
‣ Round 3: A = .2, B = .5, C = .2, D = .1, E = 0
‣ Round 4: A = .25, B = .3, C = .25, D = .2, E = 0
‣ Round 5: A = .15, B = .45, C = .15, D = .25, E = 0
‣ Round 6: A = .225, B = .40, C = .225, D = .15, E = 0
‣ Round 7: A = .20, B = .375, C = .20, D = .225, E = 0
‣ Round 8: A = .1875, B = .425, C = .1875, D = .20, E = 0
Basic PageRank
‣ At Round 8
  ‣ B has pagerank .425
  ‣ D has pagerank .2
  ‣ A has pagerank .1875
  ‣ C has pagerank .1875
  ‣ E has pagerank 0
‣ What happens if we keep going?
  ‣ does the ranking stabilize or will C have pagerank higher than D?
  ‣ can E end up with non-zero pagerank?
  ‣ for certain graphs, if we keep going long enough pageranks will stabilize
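To explore "what happens if we keep going", one option is to repeat rounds until no pagerank changes by more than a small tolerance. The sketch below is an assumption-laden illustration: `run_until_stable`, `tol`, and `max_rounds` are made-up names, and the edge list for Example 1 is reconstructed from the per-round numbers on the slides (A and C may be swapped relative to the original figure, which does not affect the values since they are always equal).

```python
def run_until_stable(out_edges, tol=1e-9, max_rounds=1000):
    """Repeat Basic PageRank rounds until no rank changes by more than tol."""
    n = len(out_edges)
    rank = {v: 1 / n for v in out_edges}
    for rounds in range(1, max_rounds + 1):
        new_rank = {v: 0.0 for v in out_edges}
        for u, targets in out_edges.items():
            for v in targets:
                new_rank[v] += rank[u] / len(targets)
        change = max(abs(new_rank[v] - rank[v]) for v in out_edges)
        rank = new_rank
        if change < tol:
            return rank, rounds
    return rank, max_rounds  # gave up: ranks may still be moving

# Example 1, with edges reconstructed from the slide values (an assumption).
example1 = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["B"], "E": ["D"]}
rank, rounds = run_until_stable(example1)
print(rounds, rank)
```

The `max_rounds` cap matters because stabilization is only promised for certain graphs; on others the loop would never meet the tolerance.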
Observations
‣ The sum of all pageranks always equals 1
  ‣ pagerank is moved around but never created or destroyed
‣ Pages with many incoming edges accumulate more pagerank (e.g., A, D)
  ‣ even better if incoming edges from pages with high pagerank (e.g., B)
‣ Pages with no incoming edges lose all their pagerank (e.g., E)
‣ Intuitively
  ‣ the more a page is linked to, the higher its rank
  ‣ the more a page is linked to by high-ranked pages, the higher it ranks
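The first observation can be checked directly from the update rule. The derivation below swaps the order of summation so that each edge (u, v) is counted from u's side; the subscript "new" for the pagerank after the round is notation introduced here. The third step needs every vertex to have at least one outgoing edge, so that |out(u)| is never 0.

```latex
\begin{align*}
\sum_{v \in V} PR_{\text{new}}(v)
  &= \sum_{v \in V} \sum_{u \in in(v)} \frac{PR(u)}{|out(u)|} \\
  &= \sum_{u \in V} \sum_{v \in out(u)} \frac{PR(u)}{|out(u)|} \\
  &= \sum_{u \in V} |out(u)| \cdot \frac{PR(u)}{|out(u)|} \\
  &= \sum_{u \in V} PR(u)
\end{align*}
```

Since the pageranks start at 1/|V| each, the total is 1 before the first round and therefore after every round.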
Basic PageRank
BasicPageRank(G, k):    # k is number of “rounds”
    for v in V:
        v.rank = 1/|V|
    for i from 1 to k:
        for v in V:
            v.prevrank = v.rank
        for v in V:
            v.rank = 0
            for u in v.incoming:
                v.rank = v.rank + u.prevrank/|u.outgoing|
‣ runtime of O(|V| + k(|V| + |E|)) = O(|V| + |E|)
  ‣ assuming k is a constant
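The pseudocode translates almost line for line into Python. This is one possible rendering, not the course's reference implementation: it assumes the graph is an adjacency dict `{page: [linked pages]}`, and the Example 1 edge list is reconstructed from the per-round numbers on the slides. The slides call the initial state "Round 1", so their "Round 8" corresponds to k = 7 updates here.

```python
def basic_pagerank(out_edges, k):
    """Run k rounds of Basic PageRank on a graph {page: [linked pages]}."""
    # Build incoming lists once so each round can sum over incoming edges.
    in_edges = {v: [] for v in out_edges}
    for u, targets in out_edges.items():
        for v in targets:
            in_edges[v].append(u)

    rank = {v: 1 / len(out_edges) for v in out_edges}
    for _ in range(k):
        prev = rank  # ranks from the previous round (v.prevrank)
        rank = {
            v: sum(prev[u] / len(out_edges[u]) for u in in_edges[v])
            for v in out_edges
        }
    return rank

# Example 1, with edges reconstructed from the slide values (an assumption).
example1 = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["B"], "E": ["D"]}
ranks = basic_pagerank(example1, 7)
print(ranks)
```

Each round visits every vertex once and every edge once, which is where the k(|V| + |E|) term in the runtime comes from.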
Basic PageRank: Example 2
[Figure: directed graph on vertices A, B, C, D, E]
Basic PageRank on Example 2
‣ Activity #1 (3 min)
  ‣ run BasicPageRank(G, k) with k = 3
Basic PageRank: Example 2
[Figure: the Example 2 graph, redrawn for each round with the pagerank of every vertex and the amount sent along every edge]
‣ Round 1: A = .2, B = .2, C = .2, D = .2, E = .2
‣ Round 2: A = 0, B = .7, C = 0, D = .3, E = 0
‣ Round 3: A = 0, B = 1, C = 0, D = 0, E = 0
What’s going on?
Basic PageRank: Example 2
[Figure: the Example 2 graph with B = 1 and every other vertex at 0]
‣ All the pagerank got trapped at B
  ‣ happens if graph has sinks (i.e., nodes w/ no outgoing edges)
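Sinks are cheap to detect before running any rounds. A minimal sketch, assuming the graph is an adjacency dict; `find_sinks` and the three-vertex graph are hypothetical and are not the Example 2 graph.

```python
def find_sinks(out_edges):
    """Return the vertices that have no outgoing edges."""
    return [v for v, targets in out_edges.items() if not targets]

# Hypothetical graph: X and Y both link (directly or not) to S, which links nowhere.
graph = {"X": ["S"], "Y": ["X", "S"], "S": []}
sinks = find_sinks(graph)
print(sinks)
```

The check is a single pass over the vertices, so it adds O(|V|) work to the setup.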
Basic PageRank: Example 3
[Figure: directed graph on vertices A, B, C, D, redrawn for each round with the pagerank of every vertex and the amount sent along every edge]
‣ Start: A = 1/4, B = 1/4, C = 1/4, D = 1/4
‣ After one round: A = 1/8, B = 0; C and D hold 1/2 and 3/8
‣ After two rounds: A = 0, B = 0, C = 1/2, D = 1/2
‣ All the pagerank got trapped at C and D
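The trap here is not a single sink but a group of vertices that only link to each other. The sketch below checks that property and replays the example with exact fractions. The edge list is an assumption reconstructed from the slide values (in particular, which of C and D receives A's link cannot be read off the extracted figure), and `is_closed` and `push_round` are hypothetical helper names.

```python
from fractions import Fraction

def is_closed(out_edges, group):
    """True if the group has no edge leaving it, so rank that enters stays."""
    return all(v in group for u in group for v in out_edges[u])

def push_round(out_edges, rank):
    """One Basic PageRank round in push form."""
    new_rank = {v: Fraction(0) for v in out_edges}
    for u, targets in out_edges.items():
        for v in targets:
            new_rank[v] += rank[u] / len(targets)
    return new_rank

# Example 3, with edges reconstructed from the slide values (an assumption).
example3 = {"A": ["C"], "B": ["A", "D"], "C": ["D"], "D": ["C"]}
rank = {v: Fraction(1, 4) for v in example3}
for _ in range(2):
    rank = push_round(example3, rank)

print(is_closed(example3, {"C", "D"}), rank)
```

Unlike a sink, every vertex in the closed group has outgoing edges, so a sink check alone would not flag this graph.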