Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis

  1. Data Mining Techniques CS 6220 - Section 2 - Spring 2017
     Lecture 9: Link Analysis
     Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)

  2. Web search before PageRank
     • Human-curated (e.g. Yahoo, Looksmart)
       • Hand-written descriptions
       • Wait time for inclusion
     • Text-search (e.g. WebCrawler, Lycos)
       • Prone to term-spam
     (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  3. PageRank: Links as Votes
     Not all pages are equally important
     [Figure: few/no inbound links vs. many inbound links; links from unimportant pages vs. links from important pages]
     • Pages with more inbound links are more important
     • Inbound links from important pages carry more weight
     (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  4. Example: PageRank Scores
     [Figure: example graph with PageRank scores A = 3.3, B = 38.4, C = 34.3, D = 3.9, E = 8.1, F = 3.9, and five unlabeled pages at 1.6 each]
     (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  5. PageRank: Recursive Formulation
     [Figure: page i (3 out-links) and page k (4 out-links) link to page j, contributing $r_i/3$ and $r_k/4$; page j has 3 out-links, each carrying $r_j/3$]
     $r_j = r_i/3 + r_k/4$
     • A link’s vote is proportional to the importance of its source page
     • If page $j$ with importance $r_j$ has $n$ out-links, each link gets $r_j / n$ votes
     • Page $j$’s own importance is the sum of the votes on its in-links
     (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  6. Equivalent Formulation: Random Surfer
     [Figure: page i (3 out-links) and page k (4 out-links) link to page j, contributing $r_i/3$ and $r_k/4$; page j has 3 out-links, each carrying $r_j/3$]
     $r_j = r_i/3 + r_k/4$
     • At time $t$ a surfer is on some page $i$
     • At time $t+1$ the surfer follows a link to a new page at random
     • Define rank $r_i$ as fraction of time spent on page $i$
     (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  7. PageRank: The “Flow” Model
     $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
     [Figure: three pages y, a, m; y sends $r_y/2$ to itself and $r_y/2$ to a; a sends $r_a/2$ to y and $r_a/2$ to m; m sends $r_m$ to a]
     “Flow” equations:
     $r_y = r_y/2 + r_a/2$
     $r_a = r_y/2 + r_m$
     $r_m = r_a/2$
     • 3 equations, 3 unknowns
     • Impose constraint: $r_y + r_a + r_m = 1$
     • Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$
     (adapted from: Mining of Massive Datasets, http://www.mmds.org)
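
The flow equations for the y, a, m example can be checked numerically. The sketch below is only an illustration: it assumes the redundant equation for $r_a$ is dropped in favour of the normalization constraint, and the `solve` helper is a hypothetical name, not part of the lecture.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions.

    Assumes A is square and non-singular (no check for a missing pivot).
    """
    n = len(A)
    # Build the augmented matrix [A | b].
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Swap a row with a nonzero entry in this column into pivot position.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot equals one.
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

half = Fraction(1, 2)
# Unknowns are ordered (r_y, r_a, r_m).
A = [
    [half, -half, 0],  # r_y = r_y/2 + r_a/2  rewritten as  r_y/2 - r_a/2 = 0
    [0, -half, 1],     # r_m = r_a/2          rewritten as  -r_a/2 + r_m = 0
    [1, 1, 1],         # normalization: r_y + r_a + r_m = 1
]
b = [0, 0, 1]

flow_ranks = solve(A, b)
print(flow_ranks)  # [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
```

Exact fractions are used so the result can be compared with 2/5, 2/5, 1/5 without rounding error.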

  8. PageRank: The “Flow” Model
     $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
     [Figure: three pages y, a, m; y sends $r_y/2$ to itself and $r_y/2$ to a; a sends $r_a/2$ to y and $r_a/2$ to m; m sends $r_m$ to a]
     “Flow” equations:
     $r_y = r_y/2 + r_a/2$
     $r_a = r_y/2 + r_m$
     $r_m = r_a/2$
     Matrix form: $r = M \cdot r$
     Matrix $M$ is stochastic (i.e. columns sum to one)
     (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  9. PageRank: Eigenvector Problem
     • PageRank: Solve for eigenvector $r = M r$ with eigenvalue $\lambda = 1$
     • Eigenvector with $\lambda = 1$ is guaranteed to exist since $M$ is a stochastic matrix (i.e. if $a = M b$ then $\sum_i a_i = \sum_i b_i$)
     • Problem: There are billions of pages on the internet. How do we solve for an eigenvector with order $10^{10}$ elements?
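
The two claims about a column-stochastic $M$ can be spelled out in a short derivation. This is a standard argument added for illustration, using the convention that each column of $M$ sums to one; $\mathbf{1}$ denotes the all-ones vector, a symbol not used on the slide.

```latex
% Multiplying by M preserves the sum of the entries:
\sum_i a_i = \sum_i \sum_j M_{ij} b_j = \sum_j b_j \sum_i M_{ij} = \sum_j b_j .

% Columns summing to one means the all-ones row vector is a left eigenvector:
\mathbf{1}^\top M = \mathbf{1}^\top .

% M and M^\top have the same eigenvalues, so \lambda = 1 is an eigenvalue of M,
% and some nonzero r satisfies
M r = r .
```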

  10. PageRank: Power Iteration
      Model for random surfer:
      • At time $t = 0$ pick a page at random
      • At each subsequent time $t$ follow an outgoing link at random
      Probabilistic interpretation:

  11. PageRank: Power Iteration
      [Figure: three pages y, a, m with edge flows y/2, y/2, a/2, a/2, and m]
      $p_t$ converges to $r$. Iterate until $|p_t - p_{t-1}| < \varepsilon$
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)
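
A minimal sketch of power iteration on the y, a, m example follows. It assumes the update $p_t = M p_{t-1}$ with the column-stochastic matrix of the flow model, a uniform starting vector, and the L1 norm for the stopping test; the slide does not specify the norm, and the function name is hypothetical.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Repeat p <- M p until the L1 change between iterates drops below eps."""
    n = len(M)
    p = [1.0 / n] * n  # time t = 0: pick a page uniformly at random
    for _ in range(max_iter):
        # One step of the random surfer: p_next[i] = sum_j M[i][j] * p[j]
        p_next = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]
        if sum(abs(a - b) for a, b in zip(p_next, p)) < eps:
            return p_next
        p = p_next
    return p

# Column j holds the out-link probabilities of page j, in the order (y, a, m).
M = [
    [0.5, 0.5, 0.0],  # into y: from y (1/2) and from a (1/2)
    [0.5, 0.0, 1.0],  # into a: from y (1/2) and from m (1)
    [0.0, 0.5, 0.0],  # into m: from a (1/2)
]

ranks = power_iteration(M)
print([round(x, 6) for x in ranks])  # approximately [0.4, 0.4, 0.2]
```

The result agrees with the flow-model solution $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$.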

  12. Intermezzo: Markov Chains
      • Markov Property
      • Irreducibility
      • Ergodicity
      • Stationary distribution (for ergodic chains)
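
The formulas for these terms are not preserved in the transcript. As a hedged reminder, the block below gives common textbook definitions for a finite chain, written in the column-stochastic convention of the PageRank slides ($M_{ij}$ is the probability of moving to $i$ from $j$). The notation $X_t$, $\pi$ and $p_0$ is assumed here and may differ from the lecture's.

```latex
% Markov property: the next state depends only on the current state.
P(X_{t+1} = i \mid X_t = j, X_{t-1}, \dots, X_0) = P(X_{t+1} = i \mid X_t = j) = M_{ij}

% Irreducibility: every state can be reached from every other state.
\forall i, j \;\; \exists\, t \ge 1 : \; (M^t)_{ij} > 0

% Stationary distribution: unchanged by one step of the chain.
\pi = M \pi, \qquad \sum_i \pi_i = 1, \qquad \pi_i \ge 0

% For a finite chain that is irreducible and aperiodic, \pi is unique and
\lim_{t \to \infty} M^t p_0 = \pi \quad \text{for every initial distribution } p_0 .
```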

  13. Aside: Ergodicity
      • PageRank assumes a random walk model for individual surfers
      • Equivalent assumption: flow model in which equal fractions of surfers follow each link at every time
      • Ergodicity: The equilibrium of the flow model is the same as the asymptotic distribution for an individual random walk
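
The ergodicity claim can be illustrated by simulating a single surfer on the y, a, m example and comparing the fraction of time spent on each page with the flow-model solution (2/5, 2/5, 1/5). This is a rough sketch: the step count and seed are arbitrary choices, and the function name is hypothetical.

```python
import random

def random_walk_fractions(out_links, steps, seed=0):
    """Follow one random surfer and return the fraction of steps spent on each page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    nodes = sorted(out_links)
    page = rng.choice(nodes)   # start on a page chosen at random
    visits = {v: 0 for v in nodes}
    for _ in range(steps):
        page = rng.choice(out_links[page])  # follow an outgoing link at random
        visits[page] += 1
    return {v: visits[v] / steps for v in nodes}

# Out-links of the three-page example: y -> {y, a}, a -> {y, m}, m -> {a}.
graph = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

walk = random_walk_fractions(graph, steps=200_000)
print(walk)  # close to {'a': 0.4, 'm': 0.2, 'y': 0.4}
```

The match is only approximate for a finite walk; the fractions approach the flow equilibrium as the number of steps grows.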

  15. PageRank: Problems
      1. Dead Ends
         • Nodes with no outgoing links
         • Where do surfers go next?
      2. Spider Traps
         • Subgraph with no outgoing links to wider graph
         • Surfers are “trapped” with no way out (chain is not irreducible)
      [Figure: example graphs highlighting a dead end and a spider trap]
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  16. Power Iteration: Dead Ends
      [Figure: three pages y, a, m with edge flows y/2, y/2, a/2, a/2; m has no outgoing links]
      Probability not conserved
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  17. Power Iteration: Dead Ends
      [Figure: three pages y, a, m with edge flows y/2, y/2, a/2, a/2; m has no outgoing links (teleport at dead ends)]
      Fixes “probability sink” issue
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)
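
The leak and its fix can be sketched on a three-page graph in which m has no out-links. The repair shown here, replacing the all-zero column of a dead end with a uniform column so the surfer teleports to a random page, is one common way to read "teleport at dead ends" and is an assumption; the function names are hypothetical.

```python
def step(M, p):
    """One power-iteration step: returns M p."""
    n = len(p)
    return [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]

def patch_dead_ends(M):
    """Return a copy of M where every all-zero column becomes a uniform column."""
    n = len(M)
    patched = [row[:] for row in M]
    for j in range(n):
        if all(M[i][j] == 0 for i in range(n)):  # page j has no out-links
            for i in range(n):
                patched[i][j] = 1.0 / n          # teleport uniformly from j
    return patched

# Order (y, a, m); the m column is all zeros because m is a dead end.
M_dead = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]
M_fixed = patch_dead_ends(M_dead)

p_leaky = [1 / 3] * 3
p_fixed = [1 / 3] * 3
for _ in range(50):
    p_leaky = step(M_dead, p_leaky)
    p_fixed = step(M_fixed, p_fixed)

leaked_total = sum(p_leaky)  # shrinks toward 0: probability is not conserved
fixed_total = sum(p_fixed)   # stays at 1 once dead ends teleport
print(leaked_total, fixed_total)
```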

  18. Power Iteration: Spider Traps
      [Figure: three pages y, a, m with edge flows y/2, y/2, a/2, a/2, and m linking only to itself]
      Probability accumulates in traps (surfers get stuck)
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  19. Solution: Random Teleports
      Model for teleporting random surfer:
      • At time $t = 0$ pick a page at random
      • At each subsequent time $t$
        • With probability $\beta$ follow an outgoing link at random
        • With probability $1 - \beta$ teleport to a new initial location at random
      PageRank Equation [Page & Brin 1998]:
      $r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1 - \beta) \frac{1}{N}$
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)
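
The PageRank equation with teleports can be sketched directly from adjacency lists. The example below assumes $\beta = 0.8$ (the slide gives no value), a graph without dead ends, and a three-page graph in which m links only to itself, as an assumed encoding of a spider trap; the function name is hypothetical.

```python
def pagerank(out_links, beta=0.8, eps=1e-12, max_iter=1000):
    """Iterate r_j = sum_{i -> j} beta * r_i / d_i + (1 - beta) / N.

    Assumes every page has at least one out-link (no dead ends).
    """
    nodes = sorted(out_links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Every page receives the teleport share (1 - beta) / N.
        r_next = {v: (1.0 - beta) / n for v in nodes}
        # Each page i splits beta * r_i evenly over its d_i out-links.
        for i in nodes:
            d_i = len(out_links[i])
            for j in out_links[i]:
                r_next[j] += beta * r[i] / d_i
        if sum(abs(r_next[v] - r[v]) for v in nodes) < eps:
            return r_next
        r = r_next
    return r

# Spider trap: m links only to itself.
trap = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}

trap_ranks = pagerank(trap)
print(trap_ranks)  # approximately y = 7/33, a = 5/33, m = 21/33
```

With teleports the trap still collects most of the rank, but y and a keep a nonzero share instead of draining to zero.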

  20. Power Iteration: Teleports
      [Figure: three pages y, a, m]
      (can use power iteration as normal)
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  23. Computing PageRank
      • $M$ is sparse: only store nonzero entries
      • Space is roughly proportional to the number of links
      • Say 10N entries: 4 bytes × 10 links per page × 1 billion pages = 40 GB
      • Still won’t fit in memory, but will fit on disk

      source node | degree | destination nodes
      0           | 3      | 1, 5, 7
      1           | 5      | 17, 64, 113, 117, 245
      2           | 2      | 13, 23
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)
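
The storage estimate and the sparse row layout can be sketched as follows. The three rows are copied from the table on this slide. The page count of 246, the value $\beta = 0.8$, the reading of the "4" as bytes per entry, and the function names are all assumptions made for illustration.

```python
# (source, out-degree, destinations) rows, copied from the slide's table.
rows = [
    (0, 3, [1, 5, 7]),
    (1, 5, [17, 64, 113, 117, 245]),
    (2, 2, [13, 23]),
]

def storage_gb(num_pages, links_per_page=10, bytes_per_entry=4):
    """Rough size of the sparse matrix: one fixed-size entry per link."""
    return bytes_per_entry * links_per_page * num_pages / 1e9

def sparse_pass(rows, r_old, beta):
    """One PageRank update that touches only the stored (nonzero) links."""
    n = len(r_old)
    r_new = [(1.0 - beta) / n] * n  # teleport share for every page
    for src, degree, dests in rows:
        share = beta * r_old[src] / degree
        for dest in dests:
            r_new[dest] += share
    return r_new

estimate = storage_gb(10**9)
print(estimate)  # 40.0

n = 246  # hypothetical page count, just large enough for the ids above
r_new = sparse_pass(rows, [1.0 / n] * n, beta=0.8)
```

Only the three listed sources contribute in this pass, so it shows the data layout rather than a full PageRank computation.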

  24. Block-based Update Algorithm
      • Break $r^{new}$ into $k$ blocks that fit in memory
      • Scan $M$ and $r^{old}$ once for each block
      [Figure: $r^{new}$ (entries 0 to 5, split into blocks), $M$, and $r^{old}$ (entries 0 to 5)]

      M:
      src | degree | destination
      0   | 4      | 0, 1, 3, 5
      1   | 2      | 0, 5
      2   | 2      | 3, 4
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)
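
A minimal in-memory sketch of the block-based update follows, using the three-row matrix from this slide. It assumes the teleport form of the update with $\beta = 0.8$ and a block size of 2. A real implementation would stream $M$ and $r^{old}$ from disk; here they are plain lists, and the function name is hypothetical.

```python
def block_update(rows, r_old, beta, block_size):
    """Compute r_new one block at a time, rescanning all of M for each block."""
    n = len(r_old)
    r_new = [0.0] * n
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        # Only this slice of r_new is held "in memory" during the scan.
        block = [(1.0 - beta) / n] * (end - start)
        for src, degree, dests in rows:  # full scan of M
            for dest in dests:
                if start <= dest < end:  # skip links that leave the block
                    block[dest - start] += beta * r_old[src] / degree
        r_new[start:end] = block
    return r_new

# (source, out-degree, destinations) rows from the slide's matrix M.
rows = [(0, 4, [0, 1, 3, 5]), (1, 2, [0, 5]), (2, 2, [3, 4])]
r_old = [1 / 6] * 6

r_blocked = block_update(rows, r_old, beta=0.8, block_size=2)
r_single = block_update(rows, r_old, beta=0.8, block_size=6)
print(r_blocked)
```

Splitting into blocks changes only the memory footprint and the number of scans of $M$, not the result, which is why `r_blocked` and `r_single` agree.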

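  25. Block-Stripe Update Algorithm
      Break $M$ into stripes: each stripe contains only destination nodes in the corresponding block of $r^{new}$
      [Figure: $r^{new}$ (entries 0 to 5, in three blocks), the stripes of $M$, and $r^{old}$ (entries 0 to 5)]

      stripe ($r^{new}$ block) | src | degree | destination
      0, 1                     | 0   | 4      | 0, 1
      0, 1                     | 1   | 3      | 0
      0, 1                     | 2   | 2      | 1
      2, 3                     | 0   | 4      | 3
      2, 3                     | 2   | 2      | 3
      4, 5                     | 0   | 4      | 5
      4, 5                     | 1   | 3      | 5
      4, 5                     | 2   | 2      | 4
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)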
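
The striping step can be sketched as a one-time partition of $M$ by destination block, after which each block of $r^{new}$ reads only its own stripe. The link table below is a small hypothetical example chosen for illustration, $\beta = 0.8$ is assumed, and the function names are not from the lecture.

```python
def make_stripes(rows, n, block_size):
    """Split each row of M by destination block; the out-degree is kept unchanged."""
    num_blocks = (n + block_size - 1) // block_size
    stripes = [[] for _ in range(num_blocks)]
    for src, degree, dests in rows:
        by_block = {}
        for dest in dests:
            by_block.setdefault(dest // block_size, []).append(dest)
        for b, block_dests in by_block.items():
            stripes[b].append((src, degree, block_dests))
    return stripes

def stripe_update(stripes, r_old, beta, block_size):
    """Update each block of r_new by scanning only the matching stripe."""
    n = len(r_old)
    r_new = [(1.0 - beta) / n] * n
    for b, stripe in enumerate(stripes):
        for src, degree, dests in stripe:
            for dest in dests:
                assert dest // block_size == b  # a stripe never leaves its block
                r_new[dest] += beta * r_old[src] / degree
    return r_new

# Hypothetical (source, out-degree, destinations) table for six pages.
rows = [(0, 4, [0, 1, 3, 5]), (1, 2, [0, 5]), (2, 2, [3, 4])]
stripes = make_stripes(rows, n=6, block_size=2)
r_striped = stripe_update(stripes, [1 / 6] * 6, beta=0.8, block_size=2)
print(stripes)
```

Each stripe keeps the full out-degree of its source, because the share $r_i / d_i$ depends on all of a page's links, not only those that land in the stripe.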

  26. Problems: Term Spam
      • How do you make your page appear to be about movies?
      • (1) Add the word movie 1,000 times to your page
        • Set text color to the background color, so only search engines would see it
      • (2) Or, run the query “movie” on your target search engine
        • See what page came first in the listings
        • Copy it into your page, make it “invisible”
      • These and similar techniques are term spam
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  27. Google’s Solution to Term Spam
      • Believe what people say about you, rather than what you say about yourself
      • Use words in the anchor text (words that appear underlined to represent the link) and its surrounding text
      • PageRank as a tool to measure the “importance” of Web pages
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  28. Problems 2: Link Spam
      • Once Google became the dominant search engine, spammers began to work out ways to fool Google
      • Spam farms were developed to concentrate PageRank on a single page
      • Link spam:
        • Creating link structures that boost PageRank of a page
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  29. Link Spamming
      • Three kinds of web pages from a spammer’s point of view
        • Inaccessible pages
        • Accessible pages
          • e.g., blog comments pages (spammer can post links to his pages)
        • Owned pages
          • Completely controlled by spammer
          • May span multiple domain names
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  30. Link Farms
      • Spammer’s goal:
        • Maximize the PageRank of target page $t$
      • Technique:
        • Get as many links from accessible pages as possible to target page $t$
        • Construct “link farm” to get PageRank multiplier effect
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  31. Link Farms
      [Figure: link farm with inaccessible, accessible, and owned pages; the owned pages are the target page t and farm pages 1, 2, ..., M (millions of farm pages)]
      One of the most common and effective organizations for a link farm
      (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  32. PageRank: Extensions
      • Topic-specific PageRank:
        • Restrict teleportation to some set $S$ of pages related to a specific topic
        • Set $p_{0i} = 1/|S|$ if $i \in S$, $p_{0i} = 0$ otherwise
      • Trust Propagation
        • Use set $S$ of trusted pages as teleport set
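
Restricting teleportation to a set $S$ changes only the teleport term of the update. The sketch below assumes $\beta = 0.8$, no dead ends, the three-page y, a, m graph, and a hypothetical topic set $S = \{m\}$; it uses $p_0$ both as the starting vector and as the teleport distribution, and the function name is not from the lecture.

```python
def topic_pagerank(out_links, teleport_set, beta=0.8, eps=1e-12, max_iter=1000):
    """PageRank where teleports land only on pages in teleport_set.

    Assumes every page has at least one out-link (no dead ends).
    """
    nodes = sorted(out_links)
    # p0[i] = 1/|S| if i is in S, and 0 otherwise.
    p0 = {v: (1.0 / len(teleport_set) if v in teleport_set else 0.0) for v in nodes}
    r = dict(p0)
    for _ in range(max_iter):
        r_next = {v: (1.0 - beta) * p0[v] for v in nodes}  # teleport share
        for i in nodes:
            d_i = len(out_links[i])
            for j in out_links[i]:
                r_next[j] += beta * r[i] / d_i             # link-following share
        if sum(abs(r_next[v] - r[v]) for v in nodes) < eps:
            return r_next
        r = r_next
    return r

graph = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

topic_ranks = topic_pagerank(graph, teleport_set={"m"})
uniform_ranks = topic_pagerank(graph, teleport_set={"y", "a", "m"})
print(topic_ranks)    # approximately y = 8/31, a = 12/31, m = 11/31
print(uniform_ranks)
```

Teleporting only to m raises the rank of m relative to uniform teleportation. The same mechanism covers trust propagation when $S$ is a set of trusted pages.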

  33. Hidden Markov Models

  34. Time Series with Distinct States

  35. Can we use a Gaussian Mixture Model?
      [Figure panels: Time Series, Histogram, Posterior on states, Mixture]

  37. Hidden Markov Models
      [Figure panels: Estimate from GMM, Estimate from HMM]
      • Idea: Mixture model + Markov chain for states
      • Can model correlation between subsequent states (more likely to be in same state than different state)
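
The difference between the two estimates can be sketched with a two-state toy model. Everything here is a hypothetical choice made for illustration: Gaussian emissions with means 0 and 2 and unit variance, a "sticky" transition matrix with a 0.9 self-transition probability, and a five-point series. The HMM estimate shown is a forward filtering pass only, which uses past observations but not future ones.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def gmm_posteriors(obs, weights, means, std):
    """Mixture model: each time step is classified independently."""
    return [normalize([w * gaussian_pdf(x, m, std) for w, m in zip(weights, means)])
            for x in obs]

def hmm_filter(obs, initial, transition, means, std):
    """Forward filtering; transition[j][s] = P(next state s | current state j)."""
    k = len(means)
    belief = list(initial)
    posteriors = []
    for t, x in enumerate(obs):
        if t > 0:
            # Predict: push the previous belief through the Markov chain.
            belief = [sum(belief[j] * transition[j][s] for j in range(k)) for s in range(k)]
        # Update: reweight by the likelihood of the new observation.
        belief = normalize([belief[s] * gaussian_pdf(x, means[s], std) for s in range(k)])
        posteriors.append(belief)
    return posteriors

means, std = [0.0, 2.0], 1.0
obs = [0.0, 0.1, 1.0, -0.1, 0.0]  # the third point lies halfway between the means
sticky = [[0.9, 0.1], [0.1, 0.9]]

gmm_post = gmm_posteriors(obs, [0.5, 0.5], means, std)
hmm_post = hmm_filter(obs, [0.5, 0.5], sticky, means, std)
print(gmm_post[2], hmm_post[2])
```

For the ambiguous third observation the mixture model is undecided between the two states, while the HMM filter favours state 0 because the preceding observations make staying in state 0 more likely.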

