
Link Analysis: Lecture 7, November 29, 2017 (CS6220)



  1. CS6220 – Data Mining Techniques, Fall 2017, Derbinsky. Link Analysis (Lecture 7), November 29, 2017.

  2. Outline
     1. Let’s Build a Search Engine :)
        - The Model
        - Problem #1: Fast Text Search
        - Problem #2: Ranking Documents
        - Enter the ers
     2. Bringing Order to the Web
        - Voting with Your Links
          • Two Equivalent Views
        - Problem Representation
          • Spiders Everywhere!
        - Return of the Spammer (Farms)
     3. Related Approaches
        - Topic-Specific PageRank
        - SimRank
        - HITS: Hubs and Authorities

  3. What Makes a Search Engine?
     • Inputs
       - The Web (set of webpages)
         • Text, images, downloads
         • Links to other pages
       - User query
         • Simplest: set of words
     • Outputs
       - Ranked list of the “best” documents related to the query
       - Desired properties
         • Fast & scalable
         • Relevant results
         • Expressive queries
         • Up-to-date

  4. Problem #1

  5. Fast & Scalable: # of Documents
     • > 130 trillion pages
     • ~2 billion users
     • ~400 million products

  6. Fast & Scalable: Queries/sec
     • > 63K
     • ~6K

  7. Linear-Time Google
     • Assume a simple query
     • Single repo of all pages
       - 130T * 250 w/page * 8 bytes/w ~ 260 PB
       - ~10.4M Blu-ray discs ~ 7.75 miles ~ 15 x Burj Khalifa
     • Require 1s response time
       - 8M * 3GHz 64-bit CPU (assume 1 cycle/w)
       - (8M * 63K) CPUs * 35W/CPU * $0.14/kWh ~ $0.7M/sec ~ $21.6T/year ~ 1.25 x US GDP
       - ~67 CPU/person
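
The back-of-envelope figures on this slide can be re-derived with a few lines of arithmetic. The sketch below uses only the slide's own assumptions (including its 8M CPUs-per-query figure, taken as given); the 25 GB disc capacity and all variable names are assumptions added for illustration.

```python
# All constants are the slide's assumptions, not measurements.
PAGES = 130e12              # > 130 trillion pages
WORDS_PER_PAGE = 250
BYTES_PER_WORD = 8
QUERIES_PER_SEC = 63e3
CPUS_PER_QUERY = 8e6        # slide's figure for a 1 s linear scan at 3 GHz
WATTS_PER_CPU = 35
DOLLARS_PER_KWH = 0.14
BLU_RAY_BYTES = 25e9        # assumed single-layer disc capacity (not on the slide)

storage_bytes = PAGES * WORDS_PER_PAGE * BYTES_PER_WORD
storage_pb = storage_bytes / 1e15
blu_rays = storage_bytes / BLU_RAY_BYTES

total_cpus = CPUS_PER_QUERY * QUERIES_PER_SEC
# Energy used in one second, in kWh: watts -> kW, then seconds -> hours.
kwh_per_sec = total_cpus * WATTS_PER_CPU / 1000 / 3600
cost_per_sec = kwh_per_sec * DOLLARS_PER_KWH
cost_per_year = cost_per_sec * 365 * 24 * 3600

print(f"storage:   {storage_pb:.0f} PB (~{blu_rays / 1e6:.1f}M discs)")
print(f"cost/sec:  ${cost_per_sec / 1e6:.2f}M")
print(f"cost/year: ${cost_per_year / 1e12:.1f}T")
```

The point of the exercise is the order of magnitude: a linear scan per query is hopeless at this scale, which motivates an index.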

  8. Enter: The Inverted Index
     • Physical book: find all pages that contain the word “Husky”
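
A minimal sketch of the idea, assuming a tiny hypothetical three-page "book": instead of scanning every page for a word, build a map from each word to the pages that contain it once, then answer lookups directly. The function name, tokenization, and sample text are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each lowercased word to the sorted list of page numbers containing it."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for token in text.lower().split():
            word = token.strip('.,!?;:"')  # crude punctuation stripping
            if word:
                index[word].add(page_no)
    return {word: sorted(page_nos) for word, page_nos in index.items()}

# Hypothetical three-page book.
pages = {
    1: "The husky is a sled dog.",
    2: "Sled dogs pull heavy loads.",
    3: "A Husky needs daily exercise.",
}

index = build_inverted_index(pages)
print(index["husky"])  # [1, 3]
```

Building the index costs one pass over the text; each lookup afterwards is a dictionary access rather than a scan of every page.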

  9. In the Beginning…

  10. In the Beginning…

  11. Initial Approaches
      • Human-curated (e.g. Yahoo)
        - Hand-written descriptions
        - Wait time for inclusion
      • Text-search (e.g. Lycos)
        - Prone to term spam
      • Core question: how to rank pages automatically (i.e. efficiently) and with high quality, in a way that is resistant to term spam?
        - And, at least early on, in an unsupervised fashion (i.e. given only the contents of the pages + structure of the web)

  12. A Humble Academic Paper
      • L. Page, S. Brin, R. Motwani, T. Winograd

  13. Basic Assumptions
      • Pages with more inbound links are more important
      • Inbound links from important pages carry more weight

  14. The PageRank Model
      • Each out-link is an implicit conveyance of authority to the target page
        - Equal amount per link
      • So the PageRank of node i, P(i), equals the sum of each incoming link’s PageRank proportion:
        $$P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}$$
        - $O_j$: number of out-links of page j
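
One application of the update rule P(i) = Σ P(j)/O_j can be sketched directly from an edge list. The four-page edge list and the helper name `pagerank_update` below are assumptions chosen for illustration; exact fractions are used so the numbers can be checked by hand.

```python
from fractions import Fraction

# Assumed edge list (source j, target i) for a four-page web.
edges = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "A"), ("B", "D"),
    ("C", "A"),
    ("D", "B"), ("D", "C"),
]
nodes = sorted({n for edge in edges for n in edge})

# O_j: number of out-links of page j.
out_degree = {n: 0 for n in nodes}
for j, _ in edges:
    out_degree[j] += 1

def pagerank_update(P):
    """Apply P(i) = sum over edges (j, i) of P(j) / O_j once."""
    new_P = {n: Fraction(0) for n in nodes}
    for j, i in edges:
        new_P[i] += P[j] / out_degree[j]
    return new_P

P = {n: Fraction(1, len(nodes)) for n in nodes}  # start with equal rank
P_next = pagerank_update(P)
print({n: str(v) for n, v in P_next.items()})
```

Each page splits its current rank equally among its out-links, so the total rank is conserved as long as every page has at least one out-link.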

  15. An Equivalent View of PageRank
      • At time t a surfer is on some page
        - Start uniformly at random
      • At time t+1 the surfer follows a link to a new page at random
        - A Markov chain
      • Define PageRank as where the surfer is likely to be after a long time
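
The random-surfer view can be sketched as a simulation: follow random out-links for many steps and record how often each page is visited. The graph, the function name `simulate_surfer`, and the step count are assumptions for illustration; the result is a noisy estimate, not an exact PageRank.

```python
import random

# Assumed four-page web: page -> list of pages it links to.
out_links = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}

def simulate_surfer(out_links, steps, seed=0):
    """Return the fraction of steps the surfer spends on each page."""
    rng = random.Random(seed)
    page = rng.choice(sorted(out_links))      # start uniformly at random
    visits = {p: 0 for p in out_links}
    for _ in range(steps):
        page = rng.choice(out_links[page])    # follow a random out-link
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

freq = simulate_surfer(out_links, steps=200_000)
for page, f in sorted(freq.items()):
    print(page, round(f, 3))
```

For this assumed graph the visit frequencies settle near 1/3 for A and 2/9 for each of B, C, and D, which is the long-run distribution of the corresponding Markov chain.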

  16. The Model as a Directed Graph
      (Figure: directed graph over nodes A, B, C, D.)

  17. Random Surfer via Transition Matrix
      • Weight each edge equally…
      (Figure: directed graph over nodes A, B, C, D.)
      $$M = \begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$$
      (rows and columns ordered A, B, C, D)
      • Column A, P(X|A): starting at A, probability of arriving at node X
      • Row D, P(D|X): starting at X, “probability” of arriving at node D
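
A sketch of how such a column-stochastic matrix can be built from out-link lists, weighting each edge equally. The adjacency below (A→B,C,D; B→A,D; C→A; D→B,C) is an assumption inferred from the matrix entries, and `transition_matrix` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumed out-links, inferred from the matrix on the slide.
out_links = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}
nodes = sorted(out_links)  # ["A", "B", "C", "D"]

def transition_matrix(out_links, nodes):
    """M[i][j] = probability of moving to node i when currently at node j."""
    idx = {n: k for k, n in enumerate(nodes)}
    M = [[Fraction(0)] * len(nodes) for _ in nodes]
    for src, targets in out_links.items():
        for dst in targets:
            M[idx[dst]][idx[src]] = Fraction(1, len(targets))
    return M

M = transition_matrix(out_links, nodes)
for name, row in zip(nodes, M):
    print(name, [str(x) for x in row])
```

Column j holds the outgoing probabilities of node j, so every column sums to 1; row i collects the contributions arriving at node i.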

  18. Checkup
      • Assume a surfer has an equal probability of starting at any site
      • What is the probability of arriving at each of the four sites at the next time step given the transition matrix?
      $$\begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$$

  19. Answer
      $$\begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix} \begin{pmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{pmatrix} = \begin{pmatrix} 9/24 \\ 5/24 \\ 5/24 \\ 5/24 \end{pmatrix}$$
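
The single time step can be checked with an exact matrix-vector product. This is a minimal sketch using Python's `fractions` module; the helper name `step` is an assumption, and node order is taken to be A, B, C, D.

```python
from fractions import Fraction as F

# Transition matrix, rows and columns ordered A, B, C, D.
M = [
    [F(0),    F(1, 2), F(1), F(0)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
]

def step(M, P):
    """One time step: return the matrix-vector product M P."""
    return [sum(m * p for m, p in zip(row, P)) for row in M]

P0 = [F(1, 4)] * 4          # equal probability of starting anywhere
P1 = step(M, P0)
print([str(p) for p in P1])  # ['3/8', '5/24', '5/24', '5/24']
```

`Fraction` reduces automatically, so 9/24 prints as 3/8.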

  20. A Few More Waves…
      $$\begin{pmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{pmatrix} \to \begin{pmatrix} 9/24 \\ 5/24 \\ 5/24 \\ 5/24 \end{pmatrix} \to \begin{pmatrix} 15/48 \\ 11/48 \\ 11/48 \\ 11/48 \end{pmatrix} \to \begin{pmatrix} 11/32 \\ 7/32 \\ 7/32 \\ 7/32 \end{pmatrix} \to \dots \to \begin{pmatrix} 3/9 \\ 2/9 \\ 2/9 \\ 2/9 \end{pmatrix}$$
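
Repeating the same multiplication reproduces the sequence of "waves". The sketch below keeps every intermediate vector as exact fractions; the names `step` and `history` are assumptions, and node order is taken to be A, B, C, D.

```python
from fractions import Fraction as F

# Transition matrix, rows and columns ordered A, B, C, D.
M = [
    [F(0),    F(1, 2), F(1), F(0)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
]

def step(M, P):
    """One time step: return the matrix-vector product M P."""
    return [sum(m * p for m, p in zip(row, P)) for row in M]

history = [[F(1, 4)] * 4]        # wave 0: uniform start
for _ in range(50):
    history.append(step(M, history[-1]))

for k in range(4):
    print(k, [str(p) for p in history[k]])
print("after 50 steps:", [round(float(p), 6) for p in history[-1]])
```

Fractions print in lowest terms (15/48 appears as 5/16), and after a few dozen steps the vector is numerically indistinguishable from (3/9, 2/9, 2/9, 2/9).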

  21. An Aside… Was This Surprising?
      $$\begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$$
      • Same probability of arriving at B/C/D: call each X

  22. An Aside… Was This Surprising?
      $$\begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$$
      • Same probability of arriving at B/C/D: call each X
      • First row (arriving at A): (½)X + X

  23. An Aside… Was This Surprising?
      $$\begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$$
      • Same probability of arriving at B/C/D: call each X
      • First row (arriving at A): (½)X + X
      $$X + X + X + \tfrac{3}{2}X = 1 \;\Rightarrow\; \tfrac{9}{2}X = 1 \;\Rightarrow\; X = \tfrac{2}{9}, \qquad \tfrac{3}{2} \cdot X = \tfrac{3}{2} \cdot \tfrac{2}{9} = \tfrac{3}{9}$$

  24. Checkup
      $$\begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix} \begin{pmatrix} 3/9 \\ 2/9 \\ 2/9 \\ 2/9 \end{pmatrix} = \;?$$

  25. Answer
      $$\begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix} \begin{pmatrix} 3/9 \\ 2/9 \\ 2/9 \\ 2/9 \end{pmatrix} = \begin{pmatrix} 3/9 \\ 2/9 \\ 2/9 \\ 2/9 \end{pmatrix}$$
      $$MP = P \qquad \mu = \;? \qquad MP = \lambda P$$
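
That the vector (3/9, 2/9, 2/9, 2/9) is left unchanged by the matrix can be verified exactly. The sketch below also reads off the scale factor λ in MP = λP by dividing each entry of MP by the matching entry of P; the names `step` and `ratios` are assumptions.

```python
from fractions import Fraction as F

# Transition matrix, rows and columns ordered A, B, C, D.
M = [
    [F(0),    F(1, 2), F(1), F(0)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
]

def step(M, P):
    """Return the matrix-vector product M P."""
    return [sum(m * p for m, p in zip(row, P)) for row in M]

P = [F(3, 9), F(2, 9), F(2, 9), F(2, 9)]
MP = step(M, P)

# If MP = lambda * P, every entry-wise ratio equals the same lambda.
ratios = {mp / p for mp, p in zip(MP, P)}
print(MP == P, [str(r) for r in ratios])  # True ['1']
```

A single ratio of 1 means P is an eigenvector of M with eigenvalue 1.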

  26. What Just Happened!!??
      • If certain conditions hold, 1 is the largest eigenvalue and the PageRank vector is the principal eigenvector of the transition matrix
        - The conditions held for our example, but not in the general case for the web (yet!)
      • We intuitively used a method called power iteration to compute P
        - Particularly useful when the matrix in question is large & sparse, as the method doesn’t require any decomposition
        - Convergence: typically declared when the residual (norm of the difference between P and P’) falls below a threshold; this may require many iterations, but typically a few are good enough for the web (according to Google)
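
A minimal sketch of power iteration with a residual-based stopping rule. The L1 norm, the tolerance `tol`, and the cap `max_iter` are assumptions chosen for illustration; the slides do not specify which norm or threshold to use.

```python
# Transition matrix, rows and columns ordered A, B, C, D (floats).
M = [
    [0.0,   0.5, 1.0, 0.0],
    [1 / 3, 0.0, 0.0, 0.5],
    [1 / 3, 0.0, 0.0, 0.5],
    [1 / 3, 0.5, 0.0, 0.0],
]

def power_iteration(M, tol=1e-10, max_iter=1000):
    """Repeat P' = M P until the L1 residual ||P' - P|| drops below tol."""
    n = len(M)
    P = [1.0 / n] * n                      # uniform starting vector
    for iteration in range(1, max_iter + 1):
        P_next = [sum(m * p for m, p in zip(row, P)) for row in M]
        residual = sum(abs(a - b) for a, b in zip(P_next, P))
        P = P_next
        if residual < tol:
            return P, iteration
    return P, max_iter

P, iterations = power_iteration(M)
print([round(p, 6) for p in P], "after", iterations, "iterations")
```

Only matrix-vector products are needed, which is why the method suits large sparse matrices; a looser tolerance stops after far fewer iterations.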

  27. So What are these Conditions…?
      • Stochastic: columns sum to 1
        - Done… right?
      • Strongly connected: possible to get from any node to any other node
        - Might be very unlikely, but must be possible!
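
Both conditions can be checked mechanically on a small graph. The sketch below tests column sums for the stochastic property and uses breadth-first search from every node for strong connectivity; the function names and the two example graphs are assumptions for illustration.

```python
from collections import deque

def is_column_stochastic(M, tol=1e-12):
    """True if every column of M sums to 1 (within tol)."""
    n = len(M)
    return all(abs(sum(M[r][c] for r in range(n)) - 1.0) < tol for c in range(n))

def reachable(out_links, start):
    """Set of nodes reachable from start by following out-links (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in out_links.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def is_strongly_connected(out_links):
    """True if every node can reach every other node."""
    nodes = set(out_links) | {t for ts in out_links.values() for t in ts}
    return all(reachable(out_links, n) == nodes for n in nodes)

# Assumed example graph and its transition matrix (order A, B, C, D).
out_links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
M = [
    [0.0,   0.5, 1.0, 0.0],
    [1 / 3, 0.0, 0.0, 0.5],
    [1 / 3, 0.0, 0.0, 0.5],
    [1 / 3, 0.5, 0.0, 0.0],
]
# Hypothetical variant in which C has no out-links.
no_exit = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": [], "D": ["B", "C"]}

print(is_column_stochastic(M), is_strongly_connected(out_links))  # True True
print(is_strongly_connected(no_exit))                             # False
```

Running a search from every node is fine at this size; for a web-scale graph a linear-time strongly-connected-components algorithm would be the more realistic choice.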

  28. What Would Happen If…
      (Figure: directed graph over nodes A, B, C, D.)

  29. What Would Happen If…
      (Figure: directed graph over nodes A, B, C, D.)
      • C is termed a dead end
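
A sketch of what a dead end does to the iteration, under the assumption that C's only out-link is removed so that column C of the transition matrix becomes all zeros (the slide's figure is not recoverable, so the exact edges are a guess). Whatever probability reaches C is lost at the next step, so the total shrinks.

```python
from fractions import Fraction as F

# Assumed matrix with C as a dead end: column C is all zeros.
# Rows and columns ordered A, B, C, D.
M_dead = [
    [F(0),    F(1, 2), F(0), F(0)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
]

def step(M, P):
    """One time step: return the matrix-vector product M P."""
    return [sum(m * p for m, p in zip(row, P)) for row in M]

P = [F(1, 4)] * 4
totals = []
for _ in range(5):
    P = step(M_dead, P)
    totals.append(sum(P))   # total probability still "in the system"

print([str(t) for t in totals])
```

The matrix is no longer column-stochastic, so the totals fall below 1 and keep falling; the iteration drifts toward the zero vector instead of a meaningful ranking.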
