
Web Spam. Marc Spaniol, Saarbrücken, July 23, 2009 (PowerPoint presentation)



  1. Web Dynamics: Web Spam
     Marc Spaniol, Saarbrücken, July 23, 2009
     Databases and Information Systems, Prof. Dr. G. Weikum (MPII-Sp-0709, 49 slides)

  2. Agenda
     • Web spam
       - Why and what?
       - Spam taxonomy
         ▪ Overview
         ▪ Strategies in detail
           o Link spam
           o Link farms
         ▪ Examples
     • Countermeasures
       - Spam detection
       - Labeling and assessment
       - Combating spam
       - Web spam challenge
     • Conclusion

  3. Web Spam: Why?
     “Spam industry had a revenue potential of $4.5 billion in year 2004 if they had been able to completely fool all search engines on all commercially viable queries” [Amitay 2004]
     [Figure: time elapsed to reach hit position and time spent looking at hit position] [Granka, Joachims, Gay 2004]

  4. Web Spam: What’s the Problem?
     2004 .de crawl (courtesy: T. Suel), share of pages by category:
     • Reputable: 70.0%
     • Spam: 16.5%
     • Non-existent: 7.9%
     • Ad: 3.7%
     • Weborg: 0.8%
     • Unknown: 0.4%
     • Empty: 0.4%
     • Alias: 0.3%

  5. Web Spam
     • Target of spammers
       - Not end users (directly)
       - High revenue from customers for search engine “optimization” (especially Google)
       - Indirect revenue
         ▪ Affiliate programs, Google AdSense
         ▪ Ad display, traffic funneling
     • Spam taxonomy [Benczúr et al. 2008]
       - Content spam
         ▪ Keywords
         ▪ Popular expressions
         ▪ Misspellings
       - Link spam “farms”
         ▪ Densely connected sites
         ▪ Redirects
       - Cloaking and hiding
       - Spam in social media

  6. Overview
     Spamming
     • Boosting
       - Term
       - Links
     • Hiding
       - Content hiding
       - Cloaking
       - Redirection

  7. Spammed Ranking Elements
     • Term frequency (tf in the tf.idf, Okapi BM25 etc. ranking schemes)
     • Term frequency weighted by HTML elements
       - Title
       - Headers
       - Font size
       - Face
     • Heaviest weight in ranking
       - URL, domain name part
       - Anchor text: <a href="…">Best Saarbruecken nightlife</a>
     • Structural information
       - URL length
       - Depth from server root
       - Indegree
       - PageRank
       - Link-based centrality
     All Web information retrieval ranking elements are spammed.
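To make the term-frequency point concrete, here is a minimal sketch of how repeating keywords inflates raw tf. The two sample strings and the helper name `term_frequencies` are illustrative assumptions, not taken from the slides, and real ranking schemes such as BM25 apply further normalization.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase alphanumeric token occurs in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Hypothetical page texts: one honest description, one keyword-stuffed.
honest = "Guide to nightlife in Saarbruecken with bars and clubs"
stuffed = "cheap free cheap deals free cheap nightlife free cheap"

honest_tf = term_frequencies(honest)
stuffed_tf = term_frequencies(stuffed)

# The stuffed text scores far higher on raw tf for its target keywords.
print(honest_tf["cheap"], stuffed_tf["cheap"])
print(stuffed_tf.most_common(2))
```

A ranker that relies on raw counts alone would favor the stuffed text for the query "cheap", which is why this signal is an attractive target.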

  8. Content Spam
     • Domain name
       adjustableloanmortgagemastersonline.compay.dahannusaprima.co.uk
       buy-canon-rebel-20d-lens-case.camerasx.com
     • Anchor text (title, H1, etc.)
       <a href="target.html">free, great deals, cheap, inexpensive, cheap, free</a>
     • Meta keywords
       <meta name="keywords" content="UK Swingers, UK, swingers, swinging, genuine, adult contacts, connect4fun, sex, …">
     [Gyöngyi, Garcia-Molina, 2005]

  9. Parking Domain
     <div style="position:absolute; top:20px; width:600px; height:90px; overflow:hidden;"><font size=-1>atangledweb.co.uk currently offline<br>atangledweb.co.uk back soon<br></font><br><br><a href="http://www.atangledweb.co.uk"><font size=-1>atangledweb.co.uk</font></a><br><br><br> Soundbridge HomeMusic WiFi Media Play<a class=l href="http://www.atangledweb.co.uk/index01.html">-</a>... SanDisk Sansa e250 - 2GB MP3 Player -<a class=l href="http://www.atangledweb.co.uk/index02.html">-</a>... AIGO F820+ 1GB Beach inspired MP3 Pla<a class=l href="http://www.atangledweb.co.uk/index03.html">-</a>... Targus I-Pod Mini Sound Enhancer<a class=l href="http://www.atangledweb.co.uk/index04.html">-</a>... Sony NWA806FP.CE7 4GB video WALKMAN <a class=l href="http://www.atangledweb.co.uk/index05.html">-</a>... Ministry of Sound 512MB MP3 player<a class=l href="http://www.mp3roze.co.uk/cat7000.html">-</a>... Nokia 6125 - Fold Design - 1.3 Megapi<a class=l href="http://www.mp3roze.co.uk/cat7001.html">-</a>...

  10. Keyword Stuffing & Generated Copies

  11. Google ads

  12. Link Spam
     “Hyperlink structure contains an enormous amount of latent human annotation that can be extremely valuable for automatically inferring notions of authority.” (Chakrabarti et al. ’99)
     • Hyperlinks: Good, Bad, Ugly
       - Good: honest link, human annotation
       - Bad: no value of recommendation, e.g. “affiliate programs”, navigation, ads …
       - Ugly: deliberate manipulation, link spam

  13. PageRank
     PageRank of page $p_0$:
     $p_0 = c \sum_i \frac{p_i}{|F(i)|} + \frac{1-c}{N}$
     where $p_i$ is the PageRank of a page pointing to $p_0$, $|F(i)|$ is the number of outgoing links from $p_i$, $c$ is the damping factor, and $(1-c)$ is the random jump term.
     Generalized (vector):
     $\mathbf{p} = c\,T'\mathbf{p} + \frac{1-c}{N}\,\mathbf{1}_N$
     with score vector $\mathbf{p}$, transition matrix $T$, and the “1” vector $\mathbf{1}_N$.
     • A page is important if it is pointed to by many other pages
     • Based on the link structure
     The PageRank algorithm is vulnerable to link spamming.
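The vector equation can be solved by repeated substitution (power iteration). The following is a minimal sketch under two assumptions that the slide does not state: every page has at least one outgoing link, and the random-jump term is normalized as $(1-c)/N$. The function name `pagerank` and the three-page graph are hypothetical.

```python
def pagerank(out_links, c=0.85, iterations=100):
    """Power iteration for p = c * T'p + (1 - c)/N * 1_N.

    out_links maps each page to the list of pages it links to.
    Assumption: no dangling pages (every list is non-empty).
    """
    pages = list(out_links)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts with the random-jump share (1 - c)/N.
        new_scores = {page: (1.0 - c) / n for page in pages}
        for page, targets in out_links.items():
            # A page splits its damped score evenly over its out-links.
            share = c * scores[page] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical graph: a and b both link to t, and t links back to a.
graph = {"a": ["t"], "b": ["t"], "t": ["a"]}
ranks = pagerank(graph)
print(ranks)
```

Page `t` ends up with the highest score because it has two in-links, which is exactly the property that link farms exploit.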

  14. Link Farms
     • Entry point from the honest web
       - Honey pots: copies of quality content
       - Dead links to parking domain
       - Blog or guestbook comment spam
     [Diagram: WWW, hijacked pages, farm]

  15. Spam Farm: Pages
     [Diagram: boosting pages $p_1, \dots, p_k$ pointing to target page $p_0$, with incoming leakage $\lambda_0, \dots, \lambda_k$]
     Target page
     • Each farm has only one
     • The spammer's goal is to increase this page’s ranking
     Boosting pages
     • Controlled by the spammer
     • Point to the target page in order to increase its PageRank

  16. Spam Farm: External Links
     [Diagram: boosting pages $p_1, \dots, p_k$ pointing to target page $p_0$, with incoming leakage $\lambda_0, \dots, \lambda_k$]
     Leakage
     • Fractions of PageRank
     • Links to the farm pages are added from pages outside the farm (forum, blog, …)
     • The spammer has no or limited control over them
     • $\lambda = \lambda_0 + \dots + \lambda_k$

  17. Simple Farm Model
     [Diagram: WWW linking into a farm with boosting pages 1, 2, …, k and target $p_0$]
     • PageRank of target page: $p_0$
     • Number of all pages: $N$
     • Damping factor: $c$
     • Leakage contributed by accessible pages: $\lambda$
     • PageRank of each farm page: $(1-c)/N$
     $p_0 = \lambda + k \cdot c \cdot \frac{1-c}{N} + \frac{1-c}{N} = \lambda + \frac{(1-c)(ck+1)}{N}$
     By making $k$ large, we can make $p_0$ as large as we want.
     No multiplier effect for “acquired” PageRank.
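The simple farm formula can be evaluated directly. This is a small sketch in which the function name and the numeric values for $c$, $N$ and $\lambda$ are hypothetical. It follows this slide's convention that the leakage $\lambda$ is added to the target's score as is.

```python
def simple_farm_target_rank(leakage, k, c, n):
    """p_0 = leakage + k*c*(1-c)/N + (1-c)/N for a simple farm."""
    # Boosting pages have no in-links, so each holds only the jump share.
    boosting_rank = (1.0 - c) / n
    return leakage + k * c * boosting_rank + (1.0 - c) / n


# Hypothetical setting: one million pages, no leakage.
c, n, leakage = 0.85, 1_000_000, 0.0
for k in (10, 100, 1000):
    print(k, simple_farm_target_rank(leakage, k, c, n))

# Each additional boosting page adds the same fixed amount c*(1-c)/N.
step = (simple_farm_target_rank(leakage, 200, c, n)
        - simple_farm_target_rank(leakage, 100, c, n))
```

The growth is linear in $k$: there is no feedback from the target to the boosting pages, hence no multiplier effect.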

  18. Optimal Farm Model
     [Diagram: WWW linking into a farm with boosting pages 1, 2, …, k and target $q_0$]
     • Optimal PageRank of target page: $q_0$
     • Number of all pages: $N$
     • Damping factor: $c$
     • Leakage contributed by accessible pages: $\lambda$
     • PageRank of each farm page: $c q_0 / k + (1-c)/N$
     $q_0 = \lambda + ck\left[\frac{c q_0}{k} + \frac{1-c}{N}\right] + \frac{1-c}{N}$
     $\;\;\;\; = \lambda + c^2 q_0 + \frac{c(1-c)k}{N} + \frac{1-c}{N}$
     $\;\;\;\; = \frac{\lambda}{1-c^2} + \frac{(1-c)(ck+1)}{N(1-c^2)} = \frac{p_0}{1-c^2}$
     By making $k$ large, we can make $q_0$ as large as we want.
     For $c = 0.85$ the “performance” gain is $1/(1-c^2) \approx 3.6$.
     Multiplier effect for “acquired” PageRank.
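The multiplier can be checked numerically by iterating the recursive definition of $q_0$ until it settles. This is a minimal sketch with hypothetical parameter values and a hypothetical function name; it uses this slide's convention that $\lambda$ enters the sum directly.

```python
def optimal_farm_target_rank(leakage, k, c, n, iterations=200):
    """Fixed-point iteration for q_0 in the optimal farm."""
    q0 = 0.0
    for _ in range(iterations):
        # Each boosting page receives an equal share of the target's score.
        farm_page = c * q0 / k + (1.0 - c) / n
        # The target collects the damped score of all k boosting pages.
        q0 = leakage + c * k * farm_page + (1.0 - c) / n
    return q0


# Hypothetical setting: one million pages, 1000 boosting pages, no leakage.
c, n, k, leakage = 0.85, 1_000_000, 1000, 0.0
p0 = leakage + (1.0 - c) * (c * k + 1) / n   # simple farm, for comparison
q0 = optimal_farm_target_rank(leakage, k, c, n)
gain = q0 / p0
print(round(gain, 4))
```

The ratio matches $1/(1-c^2)$, the gain quoted for $c = 0.85$: the back-links from the target recycle score through the boosting pages.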

  19. Simple vs. Optimal Farm
     • Simple (target $p_0$): each boosting page points only to the target page; there are no links among boosting pages
       $p_0 = c\lambda + \frac{(1-c)(ck+1)}{N}$
     • Optimal (two variants, targets $q_0$ and $r_0$): short reinforcement loop(s); the target points to all boosting pages; fewer links
       $q_0 = \frac{p_0}{1-c^2}$, $r_0 = \frac{p_0}{1-c^2}$
     [Diagram: three farms with leakage $\lambda$, boosting pages $p_1 \dots p_k$, $q_1 \dots q_k$, $r_1 \dots r_k$ and targets $p_0$, $q_0$, $r_0$]

  20. Optimality without Leakage
     For mathematical simplification only.
     Idea: interpret leakage as additional boosting pages $q_{k+1}, \dots, q_{k+m}$ with $m = \lambda N / (1-c)$.
     [Diagram: target $q_0$ with boosting pages $q_1, q_2, \dots, q_k, q_{k+1}, \dots, q_{k+m}$]
     $\frac{c\lambda}{1-c^2} + \frac{(1-c)(ck+1)}{(1-c^2)N} \overset{!}{=} \frac{(1-c)[c(k+m)+1]}{(1-c^2)N}$
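The equivalence can be checked with a few lines of arithmetic. This is a sketch with hypothetical function names and parameter values; it follows the formula on this slide, where leakage enters as $c\lambda$.

```python
def optimal_rank_with_leakage(leakage, k, c, n):
    """Left-hand side: optimal farm with k boosting pages plus leakage."""
    return (c * leakage + (1.0 - c) * (c * k + 1) / n) / (1.0 - c ** 2)


def optimal_rank_without_leakage(k, c, n):
    """Right-hand side: optimal farm with k boosting pages, no leakage."""
    return (1.0 - c) * (c * k + 1) / ((1.0 - c ** 2) * n)


# Hypothetical setting: leakage worth 0.0003 of total PageRank.
c, n, k, leakage = 0.85, 1_000_000, 1000, 0.0003
m = leakage * n / (1.0 - c)   # equivalent number of extra boosting pages

lhs = optimal_rank_with_leakage(leakage, k, c, n)
rhs = optimal_rank_without_leakage(k + m, c, n)
print(m, lhs, rhs)
```

With these values the leakage is worth about 2000 additional boosting pages, so later derivations can drop $\lambda$ and work with a larger $k$ instead.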

  21. Alliance of two Farms
     Intuitive: each boosting page points to both targets ($2(k+m)$ new links).
     [Diagram: targets $p_0$, $q_0$ with boosting pages $p_1, \dots, p_k$ and $q_1, \dots, q_m$]
     $p_0 = c \sum_{i=1}^{k} \frac{p_i}{2} + c \sum_{j=1}^{m} \frac{q_j}{2} + \frac{1-c}{N}$
     $q_0 = c \sum_{i=1}^{k} \frac{p_i}{2} + c \sum_{j=1}^{m} \frac{q_j}{2} + \frac{1-c}{N}$
     $p_i = \frac{c(p_0+q_0)}{k+m} + \frac{1-c}{N}, \quad i = 1, \dots, k$
     $q_j = \frac{c(p_0+q_0)}{k+m} + \frac{1-c}{N}, \quad j = 1, \dots, m$
     Redistribution of PageRank:
     $p_0 = q_0 = \frac{c(k+m)/2 + 1}{(1+c)N}$
     Convenient for the smaller farm.
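The closed form for this alliance, in which each boosting page points to both targets, can be reproduced by iterating the four equations. This is a sketch with hypothetical names and farm sizes; the stand-alone comparison assumes each farm is an optimal farm without leakage, so its target scores $(ck+1)/((1+c)N)$.

```python
def alliance_target_rank(k, m, c, n, iterations=200):
    """Fixed-point iteration for p_0 = q_0 in the two-farm alliance."""
    p0 = q0 = 0.0
    for _ in range(iterations):
        # Both targets spread their score over all k + m boosting pages.
        boosting = c * (p0 + q0) / (k + m) + (1.0 - c) / n
        # Each boosting page splits its score between the two targets.
        target = c * (k + m) * boosting / 2.0 + (1.0 - c) / n
        p0 = q0 = target
    return p0


def standalone_target_rank(k, c, n):
    """Optimal farm without leakage: (1-c)(ck+1) / ((1-c^2) N)."""
    return (c * k + 1) / ((1.0 + c) * n)


# Hypothetical farms: a small one (k = 100) and a large one (m = 1000).
c, n, k, m = 0.85, 1_000_000, 100, 1000
alliance = alliance_target_rank(k, m, c, n)
closed_form = (c * (k + m) / 2.0 + 1) / ((1.0 + c) * n)
alone_small = standalone_target_rank(k, c, n)
alone_large = standalone_target_rank(m, c, n)
print(alone_small, alliance, alone_large)
```

In this setting the shared score lies between the two stand-alone scores, which illustrates why the arrangement is convenient for the smaller farm and costly for the larger one.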

  22. Alliance of two Farms
     Better: only the target pages are interconnected with each other (only 2 new links).
     [Diagram: targets $p_0$, $q_0$ with boosting pages $p_1, \dots, p_k$ and $q_1, \dots, q_m$]
     Redistribution of PageRank:
     $p_0 = q_0 = \frac{c(k+m)/2 + 1}{(1+c)N}$
     Convenient for the smaller farm.
