CS425: Algorithms for Web Scale Data


  1. CS425: Algorithms for Web Scale Data. Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at www.mmds.org.

  2. • Graph data overview
     • Problems with early search engines
     • PageRank Model
       ▪ Flow Formulation
       ▪ Matrix Interpretation
       ▪ Random Walk Interpretation
       ▪ Google’s Formulation
     • How to Compute PageRank

  3. Facebook social graph: 4 degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  4. Connections between political blogs: polarization of the network [Adamic-Glance, 2005]

  5. Citation networks and maps of science [Börner et al., 2012]

  6. [Figure: the Internet as a graph; labels: domain1, domain2, domain3, router]

  8. How to organize the Web?
     • First try: human-curated Web directories
       ▪ Yahoo, DMOZ, LookSmart
     • Second try: Web search
       ▪ Information Retrieval investigates: find relevant docs in a small and trusted set
       ▪ Newspaper articles, patents, etc.
       ▪ But: the Web is huge, full of untrusted documents, random things, web spam, etc.

  9. Two challenges of web search:
     • (1) The Web contains many sources of information. Who to “trust”?
       ▪ Trick: Trustworthy pages may point to each other!
     • (2) What is the “best” answer to the query “newspaper”?
       ▪ No single right answer
       ▪ Trick: Pages that actually know about newspapers might all be pointing to many newspapers

  10. Early Search Engines
      • Inverted index
        ▪ Data structure that returns pointers to all pages in which a term occurs
      • Which page to return first?
        ▪ Where do the search terms appear in the page?
        ▪ How many occurrences of the search terms are in the page?
        ▪ What if a spammer tries to fool the search engine?
      CS 425 – Lecture 1, Mustafa Ozdal, Bilkent University
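
The inverted index and the occurrence-count ranking described on this slide can be sketched as follows. This is a minimal illustration, not the structure any real search engine used: the function names `build_inverted_index` and `rank_by_term_count` and the three-page toy corpus are assumptions made up for the example.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each term to the sorted list of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}


def rank_by_term_count(pages, term):
    """Naive content-only ranking: pages with more occurrences come first."""
    counts = {pid: text.lower().split().count(term) for pid, text in pages.items()}
    matching = [pid for pid, count in counts.items() if count > 0]
    return sorted(matching, key=lambda pid: -counts[pid])


# Hypothetical toy corpus (not from the slides).
pages = {
    "p1": "new movies and reviews",
    "p2": "movies movies movies",
    "p3": "daily newspaper",
}

index = build_inverted_index(pages)
movie_pages = index["movies"]
ranking = rank_by_term_count(pages, "movies")
print(movie_pages, ranking)
```

In this toy corpus the page that merely repeats the term ranks first, which is the weakness a spammer can exploit when ranking looks only at page contents.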

  11. Fooling Early Search Engines
      • Example: A spammer wants his page to be in the top search results for the term “movies”.
      • Approach 1:
        ▪ Add thousands of copies of the term “movies” to your page.
        ▪ Make them invisible.
      • Approach 2:
        ▪ Search for the term “movies”.
        ▪ Copy the contents of the top page to your page.
        ▪ Make it invisible.
      • Problem: Ranking was based only on page contents.
      • Early search engines were almost useless because of spam.

  12. Google’s Innovations
      • Basic idea: The search engine believes what other pages say about you instead of what you say about yourself.
      • Main innovations:
        1. Define the importance of a page based on:
           ▪ How many pages point to it?
           ▪ How important are those pages?
        2. Judge the contents of a page based on:
           ▪ Which terms appear in the page?
           ▪ Which terms are used to link to the page?

  13. • Not all web pages are equally “important”: www.joe-schmoe.com vs. www.stanford.edu
      • There is large diversity in web-graph node connectivity. Let’s rank the pages by the link structure!

  14. We will cover the following link analysis approaches for computing the importance of nodes in a graph:
      ▪ PageRank
      ▪ Topic-Specific (Personalized) PageRank
      ▪ Web Spam Detection Algorithms

  15. • Think of in-links as votes:
        ▪ www.stanford.edu has 23,400 in-links
        ▪ www.joe-schmoe.com has 1 in-link
      • Are all in-links equal?
        ▪ Links from important pages count more
        ▪ Recursive question!

  16. [Figure: example graph with PageRank scores. A: 3.3, B: 38.4, C: 34.3, D: 3.9, E: 8.1, F: 3.9; five further unlabeled nodes: 1.6 each]

  17. • Each link’s vote is proportional to the importance of its source page
      • If page $j$ with importance $r_j$ has $n$ out-links, each link gets $r_j / n$ votes
      • Page $j$’s own importance is the sum of the votes on its in-links
      [Figure: page $i$ (3 out-links) and page $k$ (4 out-links) both link to page $j$, so $r_j = r_i/3 + r_k/4$; page $j$ has 3 out-links, each carrying $r_j/3$]
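
The vote arithmetic on this slide can be checked with exact fractions. A small sketch: the helper name `in_link_votes` and the importance values chosen for pages i and k are hypothetical, picked only so the numbers come out round.

```python
from fractions import Fraction


def in_link_votes(sources):
    """Sum the votes arriving at a page.

    sources: list of (importance, out_degree) pairs, one per in-linking page.
    Each source splits its importance evenly over its out-links.
    """
    return sum(Fraction(importance) / out_degree for importance, out_degree in sources)


# Hypothetical importances: r_i = 3/10 with 3 out-links, r_k = 2/5 with 4 out-links.
r_i, r_k = Fraction(3, 10), Fraction(2, 5)
r_j = in_link_votes([(r_i, 3), (r_k, 4)])  # r_i/3 + r_k/4
vote_per_out_link_of_j = r_j / 3           # j itself has 3 out-links
print(r_j, vote_per_out_link_of_j)
```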

  18. • A “vote” from an important page is worth more
      • A page is important if it is pointed to by other important pages
      • Define a “rank” $r_j$ for page $j$:
        $r_j = \sum_{i \to j} \frac{r_i}{d_i}$, where $d_i$ is the out-degree of node $i$
      [Figure: graph on nodes y, a, m with edge weights y/2, y/2, a/2, a/2, m]
      • “Flow” equations:
        $r_y = r_y/2 + r_a/2$
        $r_a = r_y/2 + r_m$
        $r_m = r_a/2$

  19. Flow equations:
        $r_y = r_y/2 + r_a/2$
        $r_a = r_y/2 + r_m$
        $r_m = r_a/2$
      • 3 equations, 3 unknowns, no constants
        ▪ No unique solution
        ▪ All solutions are equivalent up to a scale factor
      • An additional constraint forces uniqueness:
        ▪ $r_y + r_a + r_m = 1$
        ▪ Solution: $r_y = \frac{2}{5}$, $r_a = \frac{2}{5}$, $r_m = \frac{1}{5}$
      • Gaussian elimination works for small examples, but we need a better method for large web-size graphs
      • We need a new formulation!
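
For a graph this small, the Gaussian elimination route mentioned above can be sketched directly. The code below is an illustration under stated assumptions: `solve` is a hypothetical helper (plain Gauss-Jordan elimination with exact fractions, assuming the system has a unique solution), and the third flow equation, which is implied by the other two, is swapped for the normalization constraint.

```python
from fractions import Fraction as F


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(a)
    # Augmented matrix [a | b], converted to Fractions to avoid float rounding.
    m = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]


# Unknowns ordered as (r_y, r_a, r_m).
a = [
    [F(1, 2), F(-1, 2), 0],  # r_y - r_y/2 - r_a/2 = 0
    [F(-1, 2), 1, -1],       # r_a - r_y/2 - r_m   = 0
    [1, 1, 1],               # r_y + r_a + r_m     = 1
]
b = [0, 0, 1]
r_y, r_a, r_m = solve(a, b)
print(r_y, r_a, r_m)
```

Elimination costs on the order of the cube of the number of pages, which is why it does not scale to web-size graphs.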

  20. • Adjacency matrix $M$
        ▪ Let page $i$ have $d_i$ out-links
        ▪ If $i \to j$, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$
      • Rank vector $r$: vector with an entry per page
        ▪ $r_i$ is the importance score of page $i$
        ▪ $\sum_i r_i = 1$
      • The flow equations $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ can be written as
        $r = M \cdot r$
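
The definition of the matrix can be turned into a few lines of code. A minimal sketch, assuming each page's out-links are given as a list without duplicates; the function name `build_matrix` is hypothetical, and the link lists for y, a, and m are read off the flow equations of the running example.

```python
from fractions import Fraction


def build_matrix(out_links, nodes):
    """Return M with M[j][i] = 1/d_i if page i links to page j, else 0."""
    position = {page: k for k, page in enumerate(nodes)}
    n = len(nodes)
    M = [[Fraction(0)] * n for _ in range(n)]
    for source, targets in out_links.items():
        for target in targets:
            # Column = source page, row = target page.
            M[position[target]][position[source]] = Fraction(1, len(targets))
    return M


nodes = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = build_matrix(out_links, nodes)
column_sums = [sum(M[row][col] for row in range(3)) for col in range(3)]
for row in M:
    print([str(v) for v in row])
```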

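  21. Matrix $M$ for the graph on y, a, m (column = source page, row = destination page):

              y    a    m
         y    ½    ½    0
         a    ½    0    1
         m    0    ½    0

      $r = M \cdot r$:
      $\begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix}$

      which is the same as the flow equations:
        $r_y = r_y/2 + r_a/2$
        $r_a = r_y/2 + r_m$
        $r_m = r_a/2$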

  22. • Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
      • Flow equation in matrix form: $M \cdot r = r$
        ▪ Suppose page $i$ links to 3 pages, including $j$. Then the entry in row $j$, column $i$ of $M$ is $1/3$, so $r_i$ contributes $r_i/3$ to $r_j$.

  23. Exercise: Matrix Formulation
      [Figure: four-page graph on nodes A, B, C, D and the corresponding 4×4 matrix equation $r = M \cdot r$ with $r = (r_A, r_B, r_C, r_D)^T$; the arrangement of the matrix entries was lost in extraction]

  24. Linear Algebra Reminders
      • $A$ is a column stochastic matrix iff each of its columns adds up to 1 and there are no negative entries.
        ▪ Our adjacency matrix $M$ is column stochastic. Why?
      • If there exist a vector $x$ and a scalar $\lambda$ such that $Ax = \lambda x$, then:
        ▪ $x$ is an eigenvector and $\lambda$ is an eigenvalue of $A$
      • The principal eigenvector is the one that corresponds to the largest eigenvalue.
        ▪ The largest eigenvalue of a column stochastic matrix is 1, so $Ax = x$, where $x$ is the principal eigenvector.
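
Both reminders can be checked numerically on the three-page example from the earlier slides. This is only a sketch: `is_column_stochastic` and `mat_vec` are hypothetical helper names. The check also hints at the answer to the "Why?" above: column i holds d_i entries equal to 1/d_i, which sum to 1 as long as every page has at least one out-link.

```python
from fractions import Fraction as F


def is_column_stochastic(A):
    """True iff A has no negative entries and every column sums to 1."""
    n = len(A)
    no_negatives = all(v >= 0 for row in A for v in row)
    columns_sum_to_one = all(sum(A[r][c] for r in range(n)) == 1 for c in range(n))
    return no_negatives and columns_sum_to_one


def mat_vec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


# Matrix of the y, a, m example and its normalized flow solution.
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0), F(1)],
    [F(0), F(1, 2), F(0)],
]
x = [F(2, 5), F(2, 5), F(1, 5)]

stochastic = is_column_stochastic(M)
is_eigenvector_for_one = mat_vec(M, x) == x  # M x = 1 * x
print(stochastic, is_eigenvector_for_one)
```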

  25. • PageRank flow formulation: $r = M \cdot r$
      • So the rank vector $r$ is an eigenvector of the stochastic web matrix $M$
        ▪ In fact, it is its first or principal eigenvector, with corresponding eigenvalue 1
        ▪ NOTE: $x$ is an eigenvector with the corresponding eigenvalue $\lambda$ if $Ax = \lambda x$
      • We can now efficiently solve for $r$! The method is called power iteration.

  26. • Given a web graph with $N$ nodes, where the nodes are pages and the edges are hyperlinks
      • Power iteration: a simple iterative scheme
        ▪ Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
        ▪ Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, where $d_i$ is the out-degree of node $i$
        ▪ Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
      • $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the $L_1$ norm. Any other vector norm, e.g., Euclidean, can be used instead.
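
The scheme above translates almost line for line into code. A minimal sketch in plain Python with a dense matrix: the function name `power_iteration`, the default tolerance, and the iteration cap `max_iter` are assumptions added for the example, and the input is the three-page y, a, m matrix from the earlier slides.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    n = len(M)
    r = [1.0 / n] * n  # r(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        # r_next[j] = sum over i of M[j][i] * r[i]
        r_next = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        l1_change = sum(abs(new - old) for new, old in zip(r_next, r))
        r = r_next
        if l1_change < eps:
            break
    return r


# Rows and columns ordered as y, a, m.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
r = power_iteration(M)
print([round(v, 6) for v in r])
```

A dense list of lists is fine for three pages. At web scale the matrix is extremely sparse, so a real implementation would store only the nonzero entries.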

  27. Power iteration:
        ▪ Set $r_j = 1/N$
        ▪ 1: $r'_j = \sum_{i \to j} \frac{r_i}{d_i}$
        ▪ 2: $r = r'$
        ▪ Go to 1

      Matrix $M$ for the graph on y, a, m:

              y    a    m
         y    ½    ½    0
         a    ½    0    1
         m    0    ½    0

      Flow equations:
        $r_y = r_y/2 + r_a/2$
        $r_a = r_y/2 + r_m$
        $r_m = r_a/2$

      Example (iterations 0, 1, 2, 3, …; the last column is the value the iteration converges to):

         r_y    1/3    1/3    5/12    9/24     …    6/15
         r_a    1/3    3/6    1/3     11/24    …    6/15
         r_m    1/3    1/6    3/12    1/6      …    3/15
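
The first columns of the example table can be reproduced exactly with rational arithmetic. A short sketch: the helper name `step` is hypothetical, and Python prints fractions in lowest terms, so 3/6 appears as 1/2, 3/12 as 1/4, and 9/24 as 3/8.

```python
from fractions import Fraction as F

# Rows and columns ordered as y, a, m.
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0), F(1)],
    [F(0), F(1, 2), F(0)],
]


def step(M, r):
    """One power-iteration step: r' = M r."""
    return [sum(m * v for m, v in zip(row, r)) for row in M]


r = [F(1, 3)] * 3  # iteration 0
history = [r]
for _ in range(3):
    r = step(M, r)
    history.append(r)

for t, vec in enumerate(history):
    print(t, [str(v) for v in vec])
```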
